Many of the mind-boggling applications of modern artificial intelligence solve sequence-to-sequence (seq2seq) problems. These involve processing an input, e.g., a sequence of words or sounds, to produce an output, e.g., a new sequence of words or sounds that translates the input. We already introduced these networks when talking about machine translation, a typical seq2seq problem.
This blog post gives some ideas on common difficulties encountered when tackling seq2seq problems and some strategies and architectures used to overcome them. We’ll focus on the question “How do these things work?” – leaving the question “How to train these models?” for another time.
A seq2seq Problem: Translation
In informatics, when dealing with a text in a natural language (e.g., French or English), it is common to split the input into smaller pieces, which we call tokens. These are typically characters, words, or parts of words. After choosing a vocabulary, i.e., a list of tokens to be used, it is possible to encode the text, i.e., transform the list of tokens into a list of numbers. These numbers can then be processed by our model, which will produce a new list of numbers that must be decoded to obtain the target text.
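This encode/decode round trip can be sketched in a few lines of Python; the toy vocabulary here is our own invention, purely for illustration:

```python
# Toy vocabulary: in practice this would contain thousands of tokens.
vocab = ["<END>", "this", "model", "is", "already", "outdated"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(tokens):
    """Map a list of tokens to a list of integer ids."""
    return [token_to_id[tok] for tok in tokens]

def decode(ids):
    """Map integer ids back to tokens."""
    return [vocab[i] for i in ids]

ids = encode(["this", "model", "is", "already", "outdated"])
print(ids)          # [1, 2, 3, 4, 5]
print(decode(ids))  # ['this', 'model', 'is', 'already', 'outdated']
```

The model only ever sees the integer ids; the mapping back to text happens after it has produced its output.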
Please remember that sequence is a keyword here: the model needs to produce a result that consists of the correct tokens in the proper order.
A typical example of a seq2seq problem is machine translation (MT): given a sequence of words in a language, for instance, the following English sentence:
This model, which was released last year, is already outdated.
The model has to produce its translation in another language, for example, Italian:
Questo modello, rilasciato l’anno scorso, è già obsoleto.
This task is challenging for several reasons; consider, for example, the meaning. Even for a non-ambiguous sentence, it is often necessary to understand the context to interpret the meaning of a word correctly. In the example above, “model” refers to some product, such as a phone or a laptop, and not to someone working in the fashion industry; this becomes clear only once we put the word in context.
Another issue is related to the length of the two sequences: when translating, there is no guarantee that the input and output sentences contain the same number of tokens. For instance, the Italian “Vado!” can be translated with three (or more) English words: “I’m going (out)!”. Deep learning models consist of a finite number of operations, typically matrix multiplications and some non-linear functions with fancy names such as ReLU or softmax, yielding an output. This approach is perfect for situations where input and output sizes are fixed, e.g., when building a network recognizing which Pokémon is depicted in a 256×256 image, but it falls short when the lengths vary (see http://karpathy.github.io/2015/05/21/rnn-effectiveness/).
RNNs To The Rescue
With increased computing power and data availability, recurrent neural networks (RNNs) became a feasible solution to the variable-length problem (see, e.g., Goodfellow, Bengio, and Courville, Deep Learning, MIT Press, 2016). An RNN typically splits a seq2seq problem into two stages: encoding and decoding. These stages can be performed by different parts of a network, called the encoder and the decoder. This can be beneficial, for example, by letting different parts of the network specialize in different parts of the task, but it comes at the cost of network size and complexity. You can find more information in a previous article.
A more straightforward approach involves using the same RNN in both stages: the same network begins by analyzing the input tokens one by one and then produces the output tokens, again one by one. We illustrate this with an example, translating the following sentence into Italian:
I am going
In the pre-processing step, we tokenize it, for example into words, and add a special token at the end, <END>, signaling to the RNN that the input has ended.
I am going <END>
The tokens are then translated into numbers according to their position in the vocabulary.
The RNN then iterates over the tokens, taking two values as input at each step: a token and the vector generated by the RNN at the previous step (at the first step, we can use a zero vector for this). It then produces an output vector encoding the meaning of the sentence read so far, which is used in the following iteration.
Once the model encodes the special token <END>, the decoding phase starts: the model uses the vector produced during encoding to generate a translation. Decoding is performed by the same model, where the two inputs are the last token produced by the model and the previous output vector, and the output is the next token in the target language. This is repeated until the model outputs a special token such as <END>.
A visual explanation might make this clearer, together with the following remarks:
- The same RNN is used at every step.
- It produces an output at every step, but we ignore these outputs during the encoding phase.
- In the first step, the RNN receives an initial vector h0 that carries no information, such as a zero vector.
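The encode-then-decode loop described above can be sketched as follows. The cell weights below are random placeholders rather than a trained model, so the “translation” it emits is meaningless, but the control flow, one shared RNN cell, ignored outputs during encoding, and outputs fed back in during decoding, is the point:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8
END = 0  # id of the special <END> token

# Stand-in for a trained RNN cell: random placeholder weights.
W_h = rng.normal(size=(HIDDEN, HIDDEN))
W_x = rng.normal(size=(HIDDEN,))

def rnn_step(token_id, h):
    """One step: (token id, previous state) -> (output token id, new state)."""
    h_new = np.tanh(W_h @ h + W_x * token_id)
    out_id = int(np.argmax(h_new))  # toy readout of an output token
    return out_id, h_new

def translate(input_ids, max_len=10):
    h = np.zeros(HIDDEN)  # h0: a zero vector carrying no information
    # Encoding phase: consume input tokens one by one; outputs are ignored.
    for tok in input_ids:
        _, h = rnn_step(tok, h)
    # Decoding phase: feed the model's own last output back in each step.
    out, tok = [], END
    for _ in range(max_len):
        tok, h = rnn_step(tok, h)
        if tok == END:  # the model signals it is done
            break
        out.append(tok)
    return out

result = translate([3, 1, 2])
```

Note how `h` is the only channel through which information about the input reaches the decoding phase; this single-vector bottleneck is exactly what attention later relaxes.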
This approach was the backbone of many of the first tools offering translations of “usable” quality. But unfortunately, it still suffers from significant drawbacks:
- The inherent sequential nature of the approach prevents parallelization within training examples: to decode the ninth token, the model must first decode the previous eight tokens. This limits our ability to exploit modern hardware.
- Long-range dependencies and complex relations are beyond the capabilities of a vanilla RNN (see Wang and Tax, “Survey on the attention based RNN model and its applications in computer vision,” arXiv:1601.06823, 2016). This means that, faced with a long or complex sentence, the model will often be unable to recall or select the required information, failing at its task.
The latter problem can be tackled using a strategy called attention. Interestingly, when pushed to its limits, the idea of attention can lead to an entirely new architecture called Transformer, which solves both drawbacks mentioned above.
“Attention” and “Transformers”
As a model for this mechanism, we can think of attention in humans. The human brain has several processing bottlenecks; to overcome this limitation, it uses a process called attention to selectively concentrate on a small part of the information we receive while ignoring the rest. Just think about the last time you were in a crowded place with many people speaking (for youngsters: this was a relatively common occurrence before the corona pandemic): your ears were receiving a lot of information, but you could easily ignore most of it and listen only to your interlocutor (or eavesdrop only on the intended conversation). Similarly, when we read a long text, we cannot remember each word, but we tend to retain enough valuable information to make sense of it, e.g., the topic, the article’s main points, etc.
We can tackle seq2seq problems similarly: depending on the token currently being produced, different input tokens might be more or less relevant. For instance, imagine a model translating the following Spanish sentence:
¡Lola, la chica que te gusta, viene a la fiesta esta noche!
This translates to:
Lola, the girl you have a crush on, is coming to the party tonight!
After some decoding steps, an RNN reaches the following point:
Lola, the girl you have a crush on,
The model now needs to translate the word “viene,” a conjugated form of the verb “to come.” But which form should it choose in English? The correct form is the 3rd person singular, because the subject is “Lola.” The appropriate tense is the present continuous, because the verb refers to something that Lola plans on doing tonight. To infer this, we look at these words of the original sentence: the verb “viene,” together with “Lola” and “esta noche” (tonight). These words span the whole sentence: a word-level model without attention would need to find them among 16 input tokens (12 words and four punctuation symbols). (We simplify the discussion by focusing only on the input tokens; tokens produced by the model up to that step should be considered too.)
The Transformative Power of Self-Attention
Transformers were first proposed in the seminal paper “Attention Is All You Need” (Vaswani et al., Advances in Neural Information Processing Systems 30, 2017). They have revolutionized the field of Natural Language Processing (NLP) and then expanded even beyond it, for instance into computer vision.
This architecture is based on an attention mechanism called “self-attention” and is designed to lend itself to parallelization. Parallelization makes it possible to train bigger models on larger amounts of data, two essential factors for current state-of-the-art machine learning technologies. Two fundamental changes enable this innovation.
Positional encoding: RNNs cannot be parallelized because they inherently need to process the input tokens sequentially. Transformers get around this issue by encoding each token together with its position in the sentence; when encoding the sentence “This blog article is interesting!” (if you read it until this point and looked for this footnote, we take it you agree on this one), the input to the model could look a bit like this:
(This, 1) (blog, 2) (article, 3) (is, 4) (interesting, 5) (!, 6) (<END>, 7)
The main difference is that the encoding does not use the raw integer describing the position (values would get too big too soon) but encodes the position by applying mathematical functions to it, sinusoids of varying frequencies in the original paper, whose values stay in a bounded range.
Because each token carries its own position, the encoding step no longer depends on processing the tokens in order and can be parallelized.
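The exact scheme from “Attention Is All You Need” uses interleaved sines and cosines of different frequencies; a minimal NumPy version (assuming an even embedding size) looks like this:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from "Attention Is All You Need".

    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    Assumes d_model is even.
    """
    pos = np.arange(seq_len)[:, None]      # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # (1, d_model // 2)
    angles = pos / 10000 ** (i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=7, d_model=16)
# Every entry stays in [-1, 1], no matter how long the sentence gets.
```

Each row is simply added to the corresponding token's embedding, so no value blows up with position, unlike the raw integers above.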
Self-attention: The idea behind an attention mechanism is to let the model focus only on relevant portions of the data. To clarify how this works, think of a search engine: you have a query, something you want to look for. The search engine maintains keys describing possible websites and tries to find the ones relevant to your query. After the search, you are presented with a list of relevant websites and can then click on those of interest, retrieving the “value,” the information inside them.
To understand self-attention, let’s look at the computation of a single attention component: each token is associated with three vectors: a query Q, a key K (for the search part), and a value V (encoding the relevant information). Given a token, attention consists of the following steps:
- Find its query vector Q
- For every token in the sentence, take its key vector K and compute its similarity with the query vector Q (typically via the dot product between the two vectors). This value describes the relevance of the value V associated with K with respect to the query Q.
- Compute the weighted average of the values V using the previously computed relevance weights.
For simplicity, we call Q, K, and V “vectors” associated with a token; when implementing these operations, the vectors can be grouped into matrices, allowing much faster computations.
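The three steps above, in their batched matrix form, are known as scaled dot-product attention; a minimal sketch with random placeholder vectors (a real model learns Q, K, and V from the token embeddings):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: turns scores into weights summing to 1."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over a whole sentence at once.

    Q, K: (n_tokens, d_k); V: (n_tokens, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # query-key similarities (dot products)
    weights = softmax(scores, axis=-1)  # relevance weights, each row sums to 1
    return weights @ V                  # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)  # (5, 8): one averaged value vector per token
```

Because every row of the result is computed from the same two matrix products, all tokens are attended to in parallel, no sequential loop over positions is needed.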
The Transformer gives a final spin to this idea: instead of using a single attention mechanism, it uses multi-head attention. Each token is associated with multiple query, key, and value vectors, allowing the model to focus its attention on various parts of the input simultaneously, like your friend who can chat with you while overhearing the conversation at another table.
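A minimal sketch of this multi-head extension, each head runs its own attention on its own projections of the input, and the results are concatenated; the projection matrices below are random placeholders for what a real model learns:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Run n_heads independent attention mechanisms and concatenate.

    X: (n_tokens, d_model); W_q/W_k/W_v: per-head projections
    (n_heads, d_model, d_head); W_o: output projection (d_model, d_model).
    """
    d_head = X.shape[-1] // n_heads
    heads = []
    for h in range(n_heads):
        Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]
        scores = Q @ K.T / np.sqrt(d_head)
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = e / e.sum(axis=-1, keepdims=True)  # softmax per row
        heads.append(weights @ V)  # this head's weighted values
    return np.concatenate(heads, axis=-1) @ W_o  # merge all heads

rng = np.random.default_rng(1)
n_heads, d_model, n_tok = 2, 8, 5
d_head = d_model // n_heads
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(d_model, d_model))
X = rng.normal(size=(n_tok, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)  # (5, 8)
```

Each head can learn a different notion of relevance (e.g., one tracking subjects, another tracking time expressions), which is exactly the “viene”/“Lola”/“esta noche” situation from the translation example above.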
Attention: this simple idea, rendered through essential mathematical tools, a combination of linear and non-linear operations on matrices, is the backbone of the Transformer architecture, which has sparked a new race in the world of artificial intelligence. This architecture has become the basis for many modern pre-trained models, such as BERT (https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) and GPT (https://openai.com/blog/better-language-models/), which can be refined and quickly deployed to solve many challenging NLP problems (Qiu et al., “Pre-trained models for natural language processing: A survey,” Science China Technological Sciences 63(10), 2020) and beyond, for instance speech recognition (Dong et al., “Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition,” ICASSP, 2018) and video understanding (Arnab et al., “ViViT: A Video Vision Transformer,” ICCV, 2021).
Modern Transformer-like architectures have achieved the unthinkable. They might be behemoths of unfathomable size, but their essence lies in simple building blocks connected in a vast network (multiple networks can also be combined, and complex architectures including multiple sub-networks are becoming more common). Their capabilities result from the coordinated symphony played by each part of the network communicating and operating together. Just like single notes and sounds: taken independently, they might seem insignificant, but when played in the correct order, they can generate the most beautiful symphonies and songs.