An Introduction to the Architecture of Neural Machine Translation

Neural Machine Translation (NMT) uses deep learning1 to train an artificial neural network to convert a sequence of text in one language into a corresponding sequence of text in another language. The central idea of deep learning is that optimizing a number of subsequent nonlinear processing layers on a global objective, such as translation quality, enables these layers to automatically form intermediate representations, typically on different levels of abstraction, that are helpful in solving the task at hand. These processing layers are often implemented by (but in principle not limited to) artificial neural network layers. NMT networks typically consist of a number of encoder layers, which produce an abstract representation of the source text, and one or more decoder layers, which, using this representation, generate the output text, one token at a time.

For neural networks to be able to operate on text, the text has to be encoded into a numeric input first. This is commonly done by segmenting the text into words, characters, or pieces of words, and mapping these segments to integer IDs according to a lookup table (“vocabulary”). A segmentation on the subword level has the advantage of being efficient in grouping common sequences of characters, while at the same time remaining flexible enough to handle rare or unknown words2. Common examples of subword segmentation algorithms are Byte Pair Encoding and SentencePiece. The encoded sequence of IDs can then be consumed by a neural network, typically via an initial embedding layer which transforms each ID into a real-valued vector. These vectors then get transformed in the various layers of the network before the final layer generates some form of probability distribution from which output IDs are generated. Similar to the encoding process, this output ID sequence is finally decoded into the resulting output text, with the same or a different vocabulary.
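The encoding step above can be sketched in a few lines of numpy. The vocabulary, subword pieces, and embedding dimension here are all toy values for illustration, not the segmentation any particular system uses:

```python
import numpy as np

# Toy vocabulary mapping subword pieces to integer IDs (hypothetical example).
vocab = {"<unk>": 0, "trans": 1, "lat": 2, "ion": 3, "un": 4}

def encode(pieces):
    # Map each subword piece to its ID, falling back to <unk> for unseen pieces.
    return [vocab.get(p, vocab["<unk>"]) for p in pieces]

# An embedding layer is just a lookup into a (vocab_size x embedding_dim) matrix,
# whose rows are learned during training (here: random placeholders).
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((len(vocab), 8))

ids = encode(["trans", "lat", "ion"])   # -> [1, 2, 3]
vectors = embedding_matrix[ids]         # shape (3, 8): one vector per token
```

The decoder side runs this in reverse: IDs sampled from the output distribution are looked up in a vocabulary and concatenated back into text.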

NMT networks (and more generally networks operating on sequences; “sequence-to-sequence” networks), typically differ from more traditional networks with a purely feed-forward architecture because (a) they have to be able to deal with input and output texts of varying length, (b) in order to produce a syntactically valid text the output at a given position has to depend on the already generated text, demanding sequential decoding, and (c) quality improves if the information flow in the final trained network is not fixed, but rather depends on the input, to account for the varying contextual information present in natural language.

The initial breakthrough in NMT was achieved with recurrent neural networks (RNNs)3, which, in addition to feed-forward connections, also allow cyclic connections within the network, e.g., from a neuron onto itself. Whereas feed-forward networks only compute a static function of the input, this setup introduces the concept of time and temporal dynamics into the network, since computations also depend on previous values. This makes recurrent networks more complex and in general very hard to train, unless they have a very specific structure. In NMT and many other applications this structure is given by so-called Long Short-Term Memory (LSTM) cells4; a common variant of the LSTM is the Gated Recurrent Unit (GRU). Both process the input sequence recursively and at each step can learn to forget or retain previous information.
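To make the “forget or retain” idea concrete, here is a minimal numpy sketch of a single GRU step with randomly initialized weights (the dimensions and parameters are illustrative, not from any trained model):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU step: gates decide how much past state to keep or overwrite."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h)             # update gate: how much to overwrite
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate: how much past state to use
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))  # candidate new state
    return (1 - z) * h + z * h_cand          # interpolate old and candidate state

d_in, d_h = 4, 3
rng = np.random.default_rng(1)
params = [rng.standard_normal(s) for s in [(d_h, d_in), (d_h, d_h)] * 3]

# Process a sequence recursively, one input vector at a time; this sequential
# dependence is exactly what makes RNNs hard to parallelize.
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):
    h = gru_step(x, h, params)
```

Because each step takes the previous hidden state as input, the final `h` summarizes the whole sequence.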

Another major advance came with the introduction of so-called attention mechanisms5. Instead of encoding the input text in a single vector from which the whole output text is generated, an attention mechanism allows the network to focus on a specific part of the input when it generates an output at a specific position. At each step this focus of attention is expressed by a probability distribution over the input positions which is not learned globally, but rather depends on the input itself.
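A simple dot-product variant of this idea can be sketched as follows; the encoder states and decoder query are random placeholders standing in for learned representations:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, encoder_states):
    """Dot-product attention: scores -> distribution over positions -> weighted sum."""
    scores = encoder_states @ query    # one score per input position
    weights = softmax(scores)          # probability distribution; depends on the input
    context = weights @ encoder_states # "focus": weighted average of encoder states
    return weights, context

rng = np.random.default_rng(2)
enc = rng.standard_normal((6, 4))  # 6 input positions, each a 4-dim encoder state
q = rng.standard_normal(4)         # decoder query at the current output step
w, ctx = attend(q, enc)
```

The weights `w` sum to one and shift from step to step as the query changes, which is what lets the decoder look at different input words for different output words.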

Finally, Transformer6 networks leverage attention heavily to overcome a major limitation of RNNs: the input has to be processed sequentially. A single layer of a Transformer network performs two types of computation: (1) a generalized attention mechanism (“self-attention”) that can arbitrarily route information across positions, and (2) a position-wise feed-forward network that is the same for every sequence position and which performs a nonlinear transformation. Additionally, a Transformer decoder has a “cross attention” part that can pay attention to specific parts of the encoded input. This general computational paradigm has made Transformers a powerful tool for natural language processing (e.g., many famous pre-trained models like GPT7 or BERT8 are based on the Transformer architecture), but they have recently also been applied in speech recognition9, computer vision10, and even protein structure prediction11.
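The two computations of a single Transformer layer can be sketched in numpy. This is a stripped-down illustration with random weights; it omits multi-head attention, residual connections, and layer normalization that a real Transformer layer includes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention: every position attends to every position."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # all pairwise interactions at once
    return softmax(scores, axis=-1) @ V

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward network: the same nonlinear map at every position."""
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

d, d_ff, n = 8, 16, 5
rng = np.random.default_rng(3)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
W1, b1 = rng.standard_normal((d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d)), np.zeros(d)

X = rng.standard_normal((n, d))  # n positions processed in parallel, not one by one
Y = feed_forward(self_attention(X, Wq, Wk, Wv), W1, b1, W2, b2)
```

Note that the whole sequence `X` goes through both computations as one matrix operation, which is why Transformers parallelize so much better than RNNs during training.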

Transformer networks quickly outperformed recurrent neural networks in neural machine translation and nowadays they are the most common architecture used for NMT. iTranslate also uses variants of Transformer networks to translate text into 56 different languages.
