An Introduction to the Architecture of Neural Machine Translation

Machine translation is one of the most exciting and rapidly evolving fields of artificial intelligence. Neural machine translation (NMT) is a particularly hot topic right now, thanks to its ability to produce more accurate translations than earlier, non-neural approaches. In this post, we’ll take a look at the basics of NMT architecture and how it works. Stay tuned for future posts where we’ll dive into more detail on individual components of NMT systems!


Neural Machine Translation (NMT) uses deep learning (see https://en.wikipedia.org/wiki/Deep_learning for an overview and further references) to train an artificial neural network to convert a sequence of text in one language into a sequence of text in another language. The central idea of deep learning is that optimizing a number of successive nonlinear processing layers on a global objective, such as translation quality, enables these layers to automatically form intermediate representations, typically on different levels of abstraction, that are helpful in solving the task at hand. These processing layers are often implemented by (but in principle not limited to) artificial neural network layers. NMT networks typically consist of a number of encoder layers, which produce an abstract representation of the source text, and one or more decoder layers, which use this representation to generate the output text one token at a time.

For neural networks to be able to operate on text, the text first has to be encoded into numeric input. This is commonly done by segmenting the text into words, characters, or pieces of words, and mapping these segments to integer IDs according to a lookup table (“vocabulary”). A segmentation on subword level has the advantage of being efficient in grouping common sequences of characters, while at the same time remaining flexible enough to handle rare or unknown words. Common examples are the “Byte Pair Encoding” algorithm (https://aclanthology.org/P16-1162/) and the “SentencePiece” toolkit (https://github.com/google/sentencepiece). The encoded sequence of IDs can then be consumed by a neural network, typically via an initial embedding layer which transforms each ID into a real-valued vector. These vectors are then transformed in the various layers of the network before the final layer produces some form of probability distribution from which output IDs are selected. Mirroring the encoding process, this output ID sequence is finally decoded into the resulting output text, with the same or a different vocabulary.
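To make the encode, embed, and decode steps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the tiny vocabulary, the names `encode` and `decode`, and the greedy longest-match segmentation, which is a simple stand-in and not Byte Pair Encoding or SentencePiece. The embedding vectors are random here, whereas a real network learns them during training.

```python
import random

# Hypothetical toy vocabulary; real systems learn tens of thousands of subwords.
VOCAB = ["<unk>", "trans", "lat", "ion", "un", "known", " ", "s"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(text):
    """Greedy longest-match segmentation of text into subword IDs."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest vocabulary entry that matches at this position.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            # No entry matches: emit the unknown token and skip one character.
            ids.append(TOKEN_TO_ID["<unk>"])
            pos += 1
    return ids

def decode(ids):
    """Map IDs back to their text segments and join them."""
    return "".join(VOCAB[i] for i in ids)

# Embedding layer: one real-valued vector per ID (random here, learned in practice).
rng = random.Random(0)
EMBED_DIM = 4
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

ids = encode("translations unknown")
vectors = [EMBEDDINGS[i] for i in ids]  # this is what the network actually consumes
print(ids)
print(decode(ids))
```

Note how neither “translations” nor “unknown” is in the vocabulary as a whole word, yet both can be represented by combining smaller pieces.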

NMT networks (and, more generally, “sequence-to-sequence” networks that operate on sequences) typically differ from more traditional networks with a purely feed-forward architecture because (a) they have to be able to deal with input and output texts of varying length, (b) in order to produce a syntactically valid text, the output at a given position has to depend on the text already generated, which demands sequential decoding, and (c) quality improves if the information flow in the final trained network is not fixed, but rather depends on the input, to account for the varying contextual information present in natural language.
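Points (a) and (b) can be sketched as a decoding loop. The code below is a toy illustration under stated assumptions: `next_token_scores` is a hypothetical stand-in for a trained decoder (it simply copies the source and then signals the end), and `EOS` is an assumed end-of-sequence ID. The part that carries over to real systems is the loop itself, in which each step sees the tokens generated so far and the output length is not fixed in advance.

```python
EOS = 0  # hypothetical end-of-sequence ID

def next_token_scores(source_ids, prefix, vocab_size):
    """Stand-in for a trained decoder: scores for the next output ID.

    This toy rule copies the source and then emits EOS; a real NMT model
    would compute the scores from learned parameters.
    """
    scores = [0.0] * vocab_size
    target = source_ids[len(prefix)] if len(prefix) < len(source_ids) else EOS
    scores[target] = 1.0
    return scores

def greedy_decode(source_ids, vocab_size, max_len=20):
    """Generate output IDs one at a time, always picking the top-scoring ID."""
    output = []
    for _ in range(max_len):
        scores = next_token_scores(source_ids, output, vocab_size)
        best = max(range(vocab_size), key=scores.__getitem__)
        if best == EOS:
            break  # the model decides when the output ends
        output.append(best)  # the next step conditions on this token
    return output

print(greedy_decode([3, 1, 2], vocab_size=5))
```

Always taking the single best token is the simplest decoding strategy; it is used here only to keep the sketch short.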

The initial breakthrough in NMT was achieved with recurrent neural networks (RNNs; see https://arxiv.org/abs/1409.3215 and https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html), which, in addition to feed-forward connections, also allow cyclic connections within the network, e.g., from a neuron onto itself. Whereas feed-forward networks only compute a static function of the input, this setup introduces the concept of time and temporal dynamics into the network, since computations also depend on previous values. This makes recurrent networks more complex and in general very hard to train, unless they have a very specific structure. In NMT and many other applications this structure is given by so-called Long Short-Term Memory (LSTM) cells (https://colah.github.io/posts/2015-08-Understanding-LSTMs/) or their common variant, Gated Recurrent Units (GRUs), which process the input sequence recursively and at each step can learn to forget or maintain previous information.
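As an illustration of what “forget or maintain previous information” means in practice, here is a minimal sketch of a single GRU update in plain Python. It rests on a few assumptions: biases are omitted, the weights are random rather than trained, and the names (`gru_step`, `run_gru`, the keys of `params`) are ours. We also use the convention in which an update gate near 1 overwrites the state; some libraries define the gate the other way around.

```python
import math
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) with a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def gru_step(x, h, p):
    """One GRU update (biases omitted); p holds the six weight matrices."""
    # Update gate z and reset gate r, each a value between 0 and 1 per unit.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated_h))]
    # z near 0 keeps the old state, z near 1 overwrites it with the candidate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def run_gru(inputs, hidden_size, p):
    h = [0.0] * hidden_size
    for x in inputs:  # the same cell is applied at every time step
        h = gru_step(x, h, p)
    return h

rng = random.Random(0)
INPUT_SIZE, HIDDEN_SIZE = 3, 4
params = {
    name: random_matrix(HIDDEN_SIZE, INPUT_SIZE if name[0] == "W" else HIDDEN_SIZE, rng)
    for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")
}
sequence = [[rng.uniform(-1, 1) for _ in range(INPUT_SIZE)] for _ in range(5)]
final_state = run_gru(sequence, HIDDEN_SIZE, params)
print(final_state)
```

Because the loop in `run_gru` feeds each state into the next step, the time steps cannot be computed in parallel.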

Further major progress was made with the introduction of so-called attention mechanisms (see https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html and https://arxiv.org/abs/1409.0473). Instead of encoding the input text in a single vector from which the whole output text is generated, an attention mechanism allows the network to focus on a specific part of the input when it generates an output at a specific position. At each step this focus of attention is expressed by a probability distribution over the input positions, which is not learned globally but rather depends on the input itself.
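A single attention step can be sketched in a few lines of plain Python. This is a simplified illustration, not the formulation of any particular paper: we assume a plain dot product as the scoring function, and the names `attend`, `query`, and `encoder_states` are ours. The softmax turns the scores into the probability distribution over input positions described above, and the context vector is the correspondingly weighted average of the encoder states.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Return (weights, context) for one decoder step."""
    # One score per input position: how well does it match the query?
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)  # probability distribution over input positions
    # Context vector: weighted average of the encoder states.
    context = [sum(w * value for w, value in zip(weights, column))
               for column in zip(*encoder_states)]
    return weights, context

# Three input positions with 2-dimensional encoder states (made-up numbers).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([0.0, 5.0], encoder_states)
print(weights)
print(context)
```

With this query, the second and third input positions receive almost all of the weight, because only their states have a nonzero second component. A different query would shift the focus, which is exactly the input-dependent behavior that a fixed single-vector encoding lacks.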

Finally, Transformer networks (see https://jalammar.github.io/illustrated-transformer/ and https://arxiv.org/abs/1706.03762) leverage attention heavily to overcome a major limitation of RNNs, namely that the input has to be processed sequentially. A single layer of a Transformer network performs two types of computation: (1) a generalized attention mechanism (“self-attention”) that can arbitrarily route information across positions, and (2) a position-wise feed-forward network that is the same for every sequence position and which performs a nonlinear transformation. Additionally, a Transformer decoder has a “cross attention” part that can pay attention to specific parts of the encoded input. This general computational paradigm has made Transformers a powerful tool for natural language processing (e.g., many famous pre-trained models such as GPT, https://openai.com/blog/better-language-models/, and BERT, https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html, are based on the Transformer architecture), and more recently also for speech recognition (https://towardsdatascience.com/breakthroughs-in-speech-recognition-achieved-with-the-use-of-transformers-6aa7c5f8cb02), computer vision (https://ai.googleblog.com/2020/12/transformers-for-image-recognition-at.html), and even protein structure prediction (https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology).
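The two computations of a Transformer layer can be sketched as follows. This is a heavily simplified illustration under our own assumptions: a single attention head, random instead of trained weights, residual connections but no layer normalization, and no positional information. The function and parameter names are ours. What it does show is the division of labor: self-attention mixes information across positions, while the feed-forward network is applied to each position separately with the same weights.

```python
import math
import random

def matmul(a, b):
    """Multiply an n-by-k matrix with a k-by-m matrix (lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def add(a, b):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

def self_attention(x, p):
    """Every position attends to every position of the same sequence."""
    q, k, v = matmul(x, p["wq"]), matmul(x, p["wk"]), matmul(x, p["wv"])
    scale = math.sqrt(len(k[0]))
    out = []
    for q_i in q:
        weights = softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k])
        out.append([sum(w * val for w, val in zip(weights, col)) for col in zip(*v)])
    return out

def feed_forward(x, p):
    """Position-wise network: the same weights are applied to each row."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, p["w1"])]  # ReLU
    return matmul(hidden, p["w2"])

def transformer_layer(x, p):
    x = add(x, self_attention(x, p))   # (1) route information across positions
    return add(x, feed_forward(x, p))  # (2) transform each position on its own

rng = random.Random(0)

def rand(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

D_MODEL, D_FF = 4, 8
params = {"wq": rand(D_MODEL, D_MODEL), "wk": rand(D_MODEL, D_MODEL),
          "wv": rand(D_MODEL, D_MODEL), "w1": rand(D_MODEL, D_FF),
          "w2": rand(D_FF, D_MODEL)}
x = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(3)]  # 3 positions
y = transformer_layer(x, params)
print(y)
```

Nothing in `transformer_layer` depends on the result of a previous position, so all positions can be processed in parallel, unlike in an RNN. A side effect in this sketch is that reordering the input rows merely reorders the output rows; real Transformers add positional information to the embeddings so that word order is not lost.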

Transformer networks quickly outperformed recurrent neural networks in neural machine translation and nowadays they are the most common architecture used for NMT. iTranslate also uses variants of Transformer networks to translate text into 56 different languages.