Neural Machine Translation Overview
Nodes in neural machine translation networks work like neurons in the brain, connecting with one another and working together to produce a final output. The number and configuration of nodes in a network matter: they determine the kind of output at which the system excels, the speed of that output, and the time it takes to train the system. Distributed learning is the most successful way to train neural networks, though it has its challenges, and adding Attention (a specific communication configuration) to a network can improve the accuracy of the output produced.
Nodes Are Like Neurons
- Neural networks, especially those with “multiple hidden layers” (called “deep”), “are remarkably good at learning input > output mappings.” These networks consist of “input nodes, hidden nodes, and output nodes,” which can be loosely compared to biological neurons in the way they work. Just as synaptic connections between neurons in the brain connect ideas and thoughts, nodes work together to connect the processing pieces of the network.
- A deep network is a “multi-layered processing network of neuron filters.” Each node in the network works on a different aspect of the input data, fine-tuning and refining it “as it is passed through the translation ‘engine’ with a view to achieving a predicted output.” In deep learning, these nodes effectively learn as they work “and can adjust and remember processes” according to that learning.
- In NMT (neural machine translation), the language engine of the network uses “deep neural networks to essentially predict the next word in a sequence.” NMT systems can understand full sequences of words (sentences), while SMT (statistical machine translation) systems only understand input up to the phrase level. The former is replacing the latter, since SMT output often contains syntactical and grammatical errors.
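The next-word-prediction idea can be illustrated with a toy model; this sketch uses simple bigram counts in place of a deep network (the function names and the tiny corpus are invented for illustration):

```python
# Toy next-word predictor: counts word -> next-word transitions, the
# task an NMT engine performs with deep networks instead of raw counts.
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count next-word transitions over a tokenized corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for prev, nxt in zip(sentence, sentence[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent continuation of `word`, if any."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" most often
```

An NMT network replaces the count table with learned weights, but the prediction target is the same.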
Number of Nodes Is Important
- The number of nodes in a network is central to how fast it processes and how fast it learns. Those adding nodes to their networks to increase throughput or refine output need to experiment with different scaling options to find the optimal number of processes for each node network.
- Additionally, multi-threading the nodes can “dramatically improve performance.” This is done by setting specific values and assigning threads to MPI ranks so that their total equals “the number of available CPU cores per node.”
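A minimal sketch of that core-matching rule of thumb, assuming a hypothetical node with a known core count (the function name and the numbers are illustrative, not any framework's API):

```python
# Pick threads per MPI rank so that
# ranks_per_node * threads_per_rank == cores_per_node.
import os

def threads_per_rank(cores_per_node: int, ranks_per_node: int) -> int:
    """Divide a node's cores evenly among its MPI ranks."""
    if cores_per_node % ranks_per_node != 0:
        raise ValueError("ranks per node should divide the core count")
    return cores_per_node // ranks_per_node

# e.g. a 40-core node running 2 MPI ranks gets 20 threads per rank
os.environ["OMP_NUM_THREADS"] = str(threads_per_rank(40, 2))
```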
Node Configuration Makes a Difference
- Neural networks include a large number of processors operating in parallel, arranged in tiers; each tier is made up of nodes. Each tier’s nodes receive information from the tier before it, process that information, and pass it to the next tier of nodes until the last tier produces the final output. Each node within each tier “has its own sphere of knowledge, including rules that it was programmed with and rules it has learnt by itself.”
- There are multiple types of neural networks, each with its own node configuration, and different configurations succeed at different types of processing. The simplest is the Feedforward Neural Network, in which the “data passes through the different input nodes til it reaches the output node.” In these networks, data moves in one direction (forward) only, in a “front propagated wave,” whereas more complex networks also include backpropagation (backward movement).
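The forward-only flow of a feedforward network can be sketched in a few lines (the layer shapes and random weights are purely illustrative):

```python
# Minimal feedforward pass: data flows strictly forward, layer by
# layer, with no backward movement.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    """Pass the input through each (W, b) layer in order -- forward only."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),   # 3 inputs -> 4 hidden
          (rng.standard_normal((2, 4)), np.zeros(2))]   # 4 hidden -> 2 outputs
out = forward(np.ones(3), layers)
print(out.shape)  # (2,)
```

Backpropagation, by contrast, would send error gradients through these same layers in reverse.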
- Another example is the Radial Basis Function Neural Network, which is composed of two layers of nodes. The inner layer includes features “combined with the radial basis function,” the output of which is used in calculating “the same output in the next time-step.” A third example is the Multilayer Perceptron Neural Network, which has three or more layers in a fully-connected network, meaning that every node in one layer is connected to every node in the next (not just in a linear set-up). These networks use nonlinear activation functions like the “hyperbolic tangent or logistic function.”
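A Gaussian radial basis function layer, one common choice, might look like this minimal sketch (the centres and the width `sigma` are illustrative assumptions):

```python
# Gaussian RBF layer: each hidden node responds to how close the
# input is to that node's centre.
import numpy as np

def rbf_layer(x, centres, sigma=1.0):
    """Return one Gaussian activation per centre."""
    d2 = ((centres - x) ** 2).sum(axis=1)   # squared distance to each centre
    return np.exp(-d2 / (2 * sigma ** 2))

centres = np.array([[0.0, 0.0], [1.0, 1.0]])
acts = rbf_layer(np.array([0.0, 0.0]), centres)
print(acts[0])  # input sits exactly on the first centre -> activation 1.0
```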
- One of the most common neural network configurations is the Convolutional Neural Network, which has been shown to be very effective “in image and video recognition, natural language processing, and recommender systems,” and has proven effective at “semantic parsing and paraphrase detection.” These networks use a variation on the Multilayer Perceptron and include “one or more than one convolutional layers” of nodes, which can be either fully interconnected or pooled. Due to their configuration, these networks “can be much deeper but with much fewer parameters”.
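The weight-sharing behind that parameter saving can be shown with a toy 1-D convolution (the kernel and sequence values are invented for illustration):

```python
# One shared kernel slides across the whole sequence, so the layer
# needs only len(kernel) weights -- far fewer than a fully-connected
# layer of the same width.
import numpy as np

def conv1d(seq, kernel):
    """Apply one shared kernel at every valid position of the sequence."""
    k = len(kernel)
    return np.array([seq[i:i + k] @ kernel for i in range(len(seq) - k + 1)])

out = conv1d(np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.0, -1.0]))
print(out)  # [-1. -1. -1.]
```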
Distributed Deep Learning Provides Faster Node Training
- In large neural networks, training the system can be very time-consuming, “making it difficult to perform rapid prototyping or hyper parameter tuning.” By using distributed training and open source frameworks (several of which are available), the training can be spread across multiple workers, substantially reducing the time needed to train the network.
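The data-parallel idea behind such frameworks can be sketched as averaging per-worker gradients, as an MPI-style all-reduce would (a simplified stand-in, not any framework's actual API):

```python
# Data-parallel training sketch: each worker computes a gradient on
# its own data shard, then gradients are averaged so every worker
# takes the same update step.
import numpy as np

def all_reduce_mean(grads):
    """Average the per-worker gradients, as an all-reduce would."""
    return np.mean(grads, axis=0)

worker_grads = [np.array([1.0, 2.0]),   # gradient from worker 0's shard
                np.array([3.0, 4.0])]   # gradient from worker 1's shard
step = all_reduce_mean(worker_grads)
print(step)  # [2. 3.]
```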
- One experiment detailed the throughput of a “transformer (big) model when trained using up to 100 Zenith nodes.” The test included setting specific environment variables and “the optimal number of MPI processes per node,” and found that the system learns (and processes) 79% faster with two processes per node than with four.
- Notably, their experiments showed that the total number of nodes matters less, since this method of training delivers “linear performance” scaling no matter how many nodes are in the network.
- Although distributed training across multiple nodes can reduce the time needed to train the network, it comes with challenges. One issue is performance degradation leading to OOM (out of memory) errors. Convolutional neural networks with very dense gradient vectors are often designed with an embedding layer, which frequently fails to scale across multiple servers, causing these issues. Using gradient accumulation and converting sparse tensors to dense tensors yields a model that scales to training across hundreds of nodes.
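Gradient accumulation can be sketched as summing gradients over several micro-batches before taking a single step, trading memory for effective batch size (a simplified illustration; the learning rate and gradient values are invented):

```python
# Gradient accumulation sketch: sum gradients across micro-batches,
# then apply one averaged update instead of stepping every batch.
import numpy as np

def accumulate_step(param, micro_grads, lr=0.1):
    """Apply one update using the mean of accumulated micro-batch grads."""
    total = np.zeros_like(param)
    for g in micro_grads:
        total += g                       # accumulate, don't step yet
    return param - lr * total / len(micro_grads)

p = accumulate_step(np.array([1.0]), [np.array([2.0]), np.array([4.0])])
print(p)  # about 0.7: one step using the mean gradient of 3.0
```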
- Another challenge is training divergence, “where the produced model becomes less accurate (rather than more accurate).” Experts say that “setting the learning rate to an optimal value is crucial for the model’s convergence,” and therefore overall performance: “finding the ideal learning rate for the model is therefore critical.” Reducing the learning rate “(cool down or decay),” increasing it “(warm up),” or a combination of the two is one solution to this issue.
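A combined warm-up-then-decay schedule might look like this minimal sketch (the base rate, warm-up length, and decay factor are illustrative assumptions):

```python
# Learning-rate schedule sketch: linear warm-up for the first steps,
# then exponential decay (cool-down) afterwards.
def lr_schedule(step, base_lr=0.1, warmup_steps=10, decay=0.99):
    """Return the learning rate to use at a given training step."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # warm up
    return base_lr * decay ** (step - warmup_steps)  # cool down / decay

rates = [lr_schedule(s) for s in (0, 9, 50)]  # rises, peaks, then decays
```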
Adding Attention Can Assist with Output Accuracy
- In the much-used seq2seq configuration, “two recurrent neural networks (RNNs) with an encoder-decoder architecture: read the input words one by one to obtain a vector representation of a fixed dimensionality (encoder), and, conditioned on these inputs, extract the output words one by one using another RNN (decoder).” The problem is that the decoder receives only the last hidden state from the encoder, a “vector representation which is like a numerical summary of an input sequence.” With long inputs (like full sentences), the system may experience “catastrophic forgetting,” making the output highly inaccurate.
- To combat this, experts at Towards Data Science state that giving the decoder vector representations “from every encoder time step” will help it produce better-informed translations. Attention is an interface “between the encoder and decoder that provides the decoder with information from every encoder hidden state,” increasing the data it can use in creating the final translated output. This node and network configuration allows the system to “selectively focus on useful parts of the input sequence, and hence, learn the alignment between them,” which helps in translating long inputs (like lengthy sentences and paragraphs).
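Dot-product attention over encoder hidden states can be sketched as follows (the vectors are toy values; real systems use learned, higher-dimensional states):

```python
# Attention sketch: score every encoder hidden state against the
# current decoder state, softmax the scores, and return a weighted
# sum -- so the decoder sees every encoder time step, not just the
# last hidden state.
import numpy as np

def attention(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state   # dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over time steps
    context = weights @ encoder_states        # weighted sum of states
    return context, weights

enc = np.array([[1.0, 0.0],    # encoder hidden state at step 0
                [0.0, 1.0],    # step 1
                [1.0, 1.0]])   # step 2
ctx, w = attention(np.array([1.0, 0.5]), enc)
print(w.argmax())  # the state most aligned with the query gets most weight
```

The weights make the "selective focus" explicit: the decoder attends most to the encoder step whose state best matches its current query.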
To identify these insights, we performed a deep search on how nodes work within neural machine translation networks. Much of what we found was highly technical and written for coders and computer engineers, but we were able to find a selection of articles and studies detailing how nodes work in these systems, and how adding or reconfiguring nodes makes a substantial difference to throughput and output. From these, we pulled and detailed the insights featured here.