Neural Machine Translation


Neural Machine Translation Overview

Nodes in neural machine translation networks work like neurons in the brain, connecting with each other and working together to produce a final output. The number and configuration of the nodes within a network are important: they affect the type of output at which the system excels, the speed of that output, and the time it takes to train the system. Distributing training across multiple workers shortens training time, though it brings some challenges. Adding Attention (a specific communication configuration) to a network can help improve the accuracy of the output produced.

Nodes Are Like Neurons

  • Neural networks, and especially those with “multiple hidden layers” (called “deep”) “are remarkably good at learning input > output mappings.” These networks are comprised of “input nodes, hidden nodes, and output nodes,” which can be loosely compared to biological neurons in the way they work. Just like synapse connections between neurons in the brain work to connect ideas or thoughts, nodes work together to connect the processing pieces of the network.
  • A deep network is a “multi-layered processing network of neuron filters.” Each node in the network works on different aspects, fine-tuning and refining the input data “as it is passed through the translation ‘engine’ with a view to achieving a predicted output.” In deep learning, these nodes effectively learn as they work “and can adjust and remember processes” according to that learning.
  • In NMT (neural machine translation), the language engine of the network utilizes “deep neural networks to essentially predict the next word in a sequence.” NMT systems can model full sequences of words (sentences), while SMT (statistical machine translation) systems only model up to the phrase level. NMT is replacing SMT, since SMT outputs often contain syntactic and grammatical errors.

Number of Nodes Is Important

  • The number of nodes in a network is important and key to how fast it processes, as well as how fast it learns. Those adding nodes to their networks to increase throughput or refine output need to experiment with different scaling options to find the optimal number of processes for each node network.
  • Additionally, multi-threading the nodes can “dramatically improve performance.” This is done by setting specific values and MPI ranks for each thread to equal “the number of available CPU cores per node.”

Node Configuration Makes a Difference

  • Neural networks include a large number of processors operating in parallel but arranged in tiers; each tier is made up of nodes. Each tier’s nodes receive information from the tier before it, process that information, and pass it to the next tier of nodes until the last tier produces the final output. Each node within each tier “has its own sphere of knowledge, including rules that it was programmed with and rules it has learnt by itself.”
  • There are multiple types of neural networks, each having their own node configuration, with the different configurations being successful at different types of processing. The simplest network is called the Feedforward Neural Network, in which the “data passes through the different input nodes til it reaches the output node.” In these networks, data moves in one direction (forward) only in a “front propagated wave,” whereas more complex networks also include backpropagation (backward movement).
  • Another example is the Radial Basis Function Neural Network, which is composed of two layers of nodes. The inner layer includes features “combined with the radial basis function,” the output of which is used in calculating “the same output in the next time-step.” A third example is the Multilayer Perceptron Neural Network, which has three or more layers in a fully connected network, meaning that every node in one layer is connected to every node in the next layer. These networks use nonlinear activation functions like “hyperbolic tangent or logistic function.”
  • One of the most common neural network configurations is the Convolutional Neural Network, which has been shown to be very effective “in image and video recognition, natural language processing, and recommender systems.” This network type has also proven effective at “semantic parsing and paraphrase detection.” These networks use a variation on the Multilayer Perceptron and include “one or more than one convolutional layers” of nodes, which can either be fully interconnected or pooled. Due to their configurations, these networks “can be much deeper but with much fewer parameters.”
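
The tier-by-tier flow described above can be sketched in a few lines of Python. This is a minimal illustration only: the layer sizes, weights, and function names are hypothetical and do not come from any of the cited sources. It shows a feedforward pass in which every node of one tier is connected to every node of the next, using the hyperbolic tangent and logistic activations mentioned above.

```python
import math


def logistic(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_tier(inputs, weights, biases, activation):
    """One tier: each node takes a weighted sum of the previous tier's outputs."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


def forward(inputs, tiers):
    """Pass the data forward through every tier, in order, to the output."""
    activations = inputs
    for weights, biases, activation in tiers:
        activations = dense_tier(activations, weights, biases, activation)
    return activations


# Hypothetical weights: 2 inputs -> 3 hidden nodes (tanh) -> 1 output node (logistic).
tiers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1], math.tanh),
    ([[0.7, -0.5, 0.2]], [0.05], logistic),
]
output = forward([1.0, 2.0], tiers)
print(output)
```

Training such a network would additionally require computing gradients and updating the weights, which this sketch leaves out.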

Distributed Deep Learning Provides Faster Node Training

  • In large neural networks, training the system can be very time-consuming, “making it difficult to perform rapid prototyping or hyper parameter tuning.” By using distributed training and open source frameworks (of which there are multiple available), the training can be conducted by multiple workers, which will substantially reduce the time needed to train the network.
  • One experiment detailed the throughput of a “transformer (big) model when trained using up to 100 Zenith nodes.” The test included setting specific environment variables and “the optimal number of MPI processes per node,” and it found that the system learns faster (and processes more quickly) by 79% when two processes per node are used rather than four processes per node.
  • Notably, their experiments indicated that this method of training scales with the number of nodes, providing “linear performance” results no matter how many nodes were within the network.
  • Although using distributed training across multiple nodes can reduce the time needed to train the network, there are challenges associated with doing this. One issue is performance degradation leading to out-of-memory (OOM) errors. Convolutional neural networks using very dense gradient vectors are often designed using an embedded layer, which doesn’t often scale successfully in systems with multiple servers, thus causing the issues. Using gradient accumulation and converting sparse tensors to dense tensors allows for a scalable model for training on hundreds of nodes.
  • Another challenge is diverged training, “where the produced model becomes less accurate (rather than more accurate),” so experts say that “setting the learning rate to an optimal value is crucial for the model’s convergence,” and therefore for overall performance. This means that measures must be taken to prevent divergence in the training, and “finding the ideal learning rate for the model is therefore critical.” One solution is to increase the learning rate gradually at the start “(warm up),” reduce it over time “(cool down or decay),” or combine the two.
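
As an illustration of combining warm-up and decay, the sketch below raises the learning rate linearly to a peak and then lowers it with an inverse-square-root decay. The schedule shape, the peak rate, and the number of warm-up steps are assumptions chosen for the example; the sources quoted above do not specify which schedule or values were used.

```python
def learning_rate(step, peak_rate=1e-3, warmup_steps=4000):
    """Learning rate at a given training step (steps are numbered from 1).

    Assumed schedule: linear warm-up to `peak_rate`, then inverse
    square-root decay. Both defaults are illustrative values only.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    if step <= warmup_steps:
        # Warm-up: grow linearly from near zero to the peak rate.
        return peak_rate * step / warmup_steps
    # Decay (cool down): shrink in proportion to 1 / sqrt(step).
    return peak_rate * (warmup_steps / step) ** 0.5


for step in (1, 2000, 4000, 16000):
    print(step, learning_rate(step))
```

Starting with small steps limits the risk of early divergence, while the later decay lets the model settle as training progresses.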

Adding Attention Can Assist with Output Accuracy

  • In the much-used seq2seq configuration, “two recurrent neural networks (RNNs) with an encoder-decoder architecture: read the input words one by one to obtain a vector representation of a fixed dimensionality (encoder), and, conditioned on these inputs, extract the output words one by one using another RNN (decoder).” The problem with this is that the decoder receives only the last encoder hidden state from the encoder, producing a “vector representation which is like a numerical summary of an input sequence.” With long inputs (like full sentences), the system might experience “catastrophic forgetting” and thus the output will be highly inaccurate.
  • To combat this, experts at Towards Data Science state that giving the decoder vector representations “from every encoder time step” will help it produce better-informed translations. Attention is an interface “between the encoder and decoder that provides the decoder with information from every encoder hidden state,” thereby increasing the data it has to use in creating the final translated output. This node and network configuration allows the system to “selectively focus on useful parts of the input sequence, and hence, learn the alignment between them,” which helps in translating long inputs (like lengthy sentences and paragraphs).
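
The idea of giving the decoder information from every encoder hidden state can be sketched as follows. This minimal example uses dot-product scoring, which is only one of several scoring functions used in practice and is an assumption here; the vectors and function names are hypothetical. Each encoder state is scored against the current decoder state, the scores are turned into weights that sum to one, and the weighted sum of encoder states becomes the context passed to the decoder.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoder step.

    Assumes every encoder state has the same length as the decoder state.
    """
    # Score each encoder hidden state against the decoder state (dot product).
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    weights = softmax(scores)
    # Context vector: weighted sum of all encoder hidden states.
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(len(decoder_state))
    ]
    return weights, context


# Hypothetical hidden states for a three-word input sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
weights, context = attend([0.0, 2.0], encoder_states)
print(weights, context)
```

In this toy input the second encoder state receives the largest weight because it is most similar to the decoder state, which is the "selective focus" behavior described above.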

Research Strategy

To identify these insights, we performed a deep search on how nodes work within neural machine translation networks. Much of what we found was highly technical and was written for coders and computer engineers, though we were able to find a selection of articles and studies that detailed how nodes worked in these systems, and how adding or reconfiguring the nodes makes a huge difference to the throughput and output. From these, we pulled and detailed the insights featured here.


Neural Machine Translation Cost

According to our research findings, a computing node typically costs $16,460. Our research also found that the average cost of a GPU used in neural machine translation is $5,007.

Computing Node Cost

  • Based on the examples of computing node costs that we found, the average cost of a computing node is $16,460.
  • We calculated that average cost by adding all the costs of the various computing node models included below (which equals $197,515.55) and dividing that sum by the total number of node costs included in our calculation (12), which equals $16,460 (rounded to the nearest dollar).

1. General Costs

2. Brand-Specific Costs

  • Quest's General Compute Node (96GB of memory per node) costs $9,522.
  • Quest's General Compute Node (192GB of memory per node) costs $10,814.
  • Quest's High Memory Node (with RAM of 1TB) costs $21,464.

GPU Cost

  • The average cost of a GPU used in neural machine translation, according to our research findings, is $5,007.
  • We calculated that average cost by adding all the costs of the various GPU models included below (which equals $20,028.66) and dividing that sum by the total number of GPU models included in our calculation (4), which equals $5,007 (rounded to the nearest dollar).
  • The NVIDIA Titan RTX GPU costs $2,500.
  • The NVIDIA Titan X GPU costs $1,281.
  • The Tesla v100 GPU costs $5,855.
  • The Radeon Instinct GPU costs $10,392.66.
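
The averaging step described above can be reproduced with a short calculation. The prices are the four listed in this section; the variable names are only for illustration.

```python
# GPU prices in US dollars, as listed in this section.
gpu_prices = {
    "NVIDIA Titan RTX": 2500.00,
    "NVIDIA Titan X": 1281.00,
    "Tesla v100": 5855.00,
    "Radeon Instinct": 10392.66,
}

total = round(sum(gpu_prices.values()), 2)   # sum of all listed prices
average = round(total / len(gpu_prices))     # rounded to the nearest dollar

print(f"Total: ${total:,.2f}")
print(f"Average: ${average:,}")
```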

Research Strategy

To provide data on the cost of a computing node, we looked for prices of various node models. We found computing node prices in publications from Northwestern University, San Diego Supercomputer Center, and Michigan State University. We provided the node prices included in those sources above. After doing so, we calculated the average price by adding all of those prices for computing nodes and dividing that sum by the total number of node models included in our calculation.

To provide the cost of a GPU used in neural machine translation, we first looked for GPU vendors and mentions of GPU models that are used in neural machine translation. We found examples of vendors from a Statista article and articles describing such from sources that included Synced, NVIDIA, and the Vector Institute. We verified that the GPU models we found are used in neural machine translation, based on the information provided in those articles. Next, we looked up the price of those GPUs either directly from the vendor or on sites such as Amazon.

GPUs Overview

In neural machine translation (NMT), the graphics processing unit (GPU) helps by scaling NMT to bigger data sets with faster training, reducing NMT training time through hybrid data-model parallelism, and providing greater energy efficiency and better performance. GPUs are also used for NMT through Google Colab and TVM.

GPU Scales NMT to Bigger Data Sets with Faster Training

  • Facebook trained a robust NMT model in only 32 minutes, down from around 24 hours, or 45 times faster.
  • This was accomplished by using an NVIDIA DGX-1 machine, which has eight Volta GPUs, to first decrease the training time of the model from around 24 hours to less than five hours.
  • The GPU memory footprint was reduced by switching the training from 32-bit precision to 16-bit, allowing them to utilize the "heavily optimized Tensor Cores provided by NVIDIA’s latest Volta GPU architecture".
  • Next, the communication between GPUs was reduced by delaying model updates.
  • The NMT models were trained synchronously to enable each GPU to keep a complete, identical copy of the model while processing different training data portions.
  • After each mini-batch was processed, the GPUs synchronously communicated their outcomes to one another.
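
The combination of synchronous training and delayed model updates can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Facebook's implementation: the model is a single weight fitted to y = 3x, the "GPUs" are plain Python lists of mini-batches, and all names and values are hypothetical. Each worker accumulates gradients locally over `delay` mini-batches, and only then do the workers exchange and average their results, so the number of communication rounds drops by a factor of `delay`.

```python
def gradient(weight, batch):
    """Mean gradient of the squared error for the model y = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


def train(worker_batches, weight=0.0, learning_rate=0.2, delay=2):
    """Synchronous data-parallel training with delayed updates.

    worker_batches[k] holds the mini-batches processed by worker k.
    Because every worker applies the same averaged gradient, all copies
    of the model stay identical, so a single `weight` represents them.
    """
    num_steps = len(worker_batches[0])
    sync_count = 0
    for start in range(0, num_steps, delay):
        local_gradients = []
        for batches in worker_batches:
            # Accumulate locally over `delay` mini-batches before communicating.
            window = batches[start:start + delay]
            accumulated = sum(gradient(weight, b) for b in window) / len(window)
            local_gradients.append(accumulated)
        # One synchronous exchange (an all-reduce in a real multi-GPU system).
        averaged = sum(local_gradients) / len(local_gradients)
        weight -= learning_rate * averaged
        sync_count += 1
    return weight, sync_count


def make_batches(xs, size=2):
    """Split inputs into mini-batches of (x, y) pairs with y = 3 * x."""
    return [[(x, 3.0 * x) for x in xs[i:i + size]] for i in range(0, len(xs), size)]


# Two hypothetical workers, each holding a different portion of the data.
worker_batches = [make_batches([1.0, 2.0] * 20), make_batches([0.5, 1.5] * 20)]
weight, sync_count = train(worker_batches, delay=2)
print(weight, sync_count)
```

With `delay=1` the same data would require twice as many exchanges between workers; the trade-off is that each update then uses fresher gradients.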

GPU Reduces NMT Training Time Through Hybrid Data-Model Parallelism

  • Researchers in Japan used multiple GPUs to reduce the training time of neural machine translation (NMT).
  • This was done through "data parallelism and model parallelism" using GPUs on one machine.
  • A model parallel method was applied to the RNN encoder-decoder part of a Seq2Seq RNN machine translation (MT) model and a data parallel method was applied to the attention-softmax part of the model.
  • A speedup of 4.13 to 4.20 times was achieved using 4 GPUs, compared with the training speed when using 1 GPU, without affecting the accuracy of the machine translation.
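
To put the reported figures in context, the speedup can be divided by the number of GPUs to get a parallel efficiency. The function name below is hypothetical; the inputs are the numbers reported above. A value above 1.0 means the speedup was slightly better than proportional to the number of GPUs.

```python
def parallel_efficiency(speedup, num_gpus):
    """Speedup per GPU; 1.0 would mean perfectly proportional scaling."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return speedup / num_gpus


reported_speedups = (4.13, 4.20)  # range reported for 4 GPUs versus 1 GPU
efficiencies = [parallel_efficiency(s, 4) for s in reported_speedups]
for speedup, efficiency in zip(reported_speedups, efficiencies):
    print(f"speedup {speedup:.2f}x on 4 GPUs -> efficiency {efficiency:.4f}")
```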

GPU Provides Greater Energy Efficiency and Better Performance for Machine Translation

  • The challenges of machine translation through deep learning, which include "vocabulary, data sparseness, and maintaining a history of vector values", can be tackled using GPUs.
  • GPUs provide parallel processing and faster computation compared to CPUs.
  • GPUs provide greater energy efficiency and perform significantly better than CPUs in the "training of recursive auto encode and recurrent neural network".
  • GPUs handle complex computation and enable good system performance through massively parallel computation.

Google Colab Uses GPU for Building NMT Models

GPU Optimizes Neural Machine Translation Using TVM