The Illustrated Transformer Vocabulary
- The article discusses the Transformer model, which uses attention to speed up training compared to previous neural machine translation models.
- The Transformer model outperforms the Google Neural Machine Translation model in certain tasks and is recommended by Google Cloud for use with their Cloud TPU offering.
- The article aims to break down the model and explain its functionality.
- The Transformer was introduced in the paper "Attention is All You Need." A TensorFlow implementation is available in the Tensor2Tensor package.
- Harvard's NLP group has created a guide that annotates the paper with a PyTorch implementation.
- The article simplifies the concepts for those without in-depth knowledge.
Model Overview
- The model can be viewed as a black box that takes a sentence in one language and outputs its translation in another.
- The Transformer model consists of an encoding component, a decoding component, and connections between them.
- The encoding component is a stack of encoders, with the paper using six encoders stacked on top of each other. The number six is not fixed and can be experimented with.
- The decoding component consists of a stack of decoders with the same number as the encoders.
- The encoders have identical structure but do not share weights.
- Each encoder is divided into two sub-layers:
- A self-attention layer that allows the encoder to consider other words in the input sentence when encoding a specific word.
- A feed-forward neural network, which is applied independently to each position.
- The decoder includes both of the above layers, with an additional attention layer in between.
- This attention layer allows the decoder to focus on relevant parts of the input sentence, similar to attention mechanisms in seq2seq models.
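As a structural sketch of the stacking described above (not the paper's or Tensor2Tensor's code), the data flow can be written roughly as follows; `encoder_layers` and `decoder_layers` are hypothetical callables standing in for the sub-layers detailed in the later sections, with six of each in the paper's configuration.

```python
def encode(x, encoder_layers):
    # Each encoder processes the full sequence and feeds its output to the encoder above it.
    for layer in encoder_layers:
        x = layer(x)
    return x  # the top encoder's output becomes the K/V source for the decoders

def decode(y, memory, decoder_layers):
    # Every decoder attends over the output generated so far and over the encoder output ("memory").
    for layer in decoder_layers:
        y = layer(y, memory)
    return y

# translation_step = decode(outputs_so_far, encode(source_sentence, encoders), decoders)
```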
Tensors
- The process begins by converting each input word into a vector using an embedding algorithm.
- Each word is embedded into a vector of size 512.
- The embedding happens in the bottom-most encoder.
- All encoders receive a list of vectors, each of size 512.
- For the bottom encoder, these are word embeddings. For other encoders, these are the outputs of the encoder directly below.
- The length of this list (the sequence length) is a hyperparameter that is set according to the length of the longest sentence in the training dataset.
- After embedding, each word flows through the two layers of the encoder.
- A key property of the Transformer is that each word in each position flows through its own path in the encoder.
- The self-attention layer introduces dependencies between these paths.
- The feed-forward layer, however, has no such dependencies, so the different paths can be executed in parallel while flowing through it.
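A minimal sketch of the embedding step described in this section; the embedding size of 512 follows the article, while the vocabulary size, token ids, and random embedding table are made-up placeholders for learned values.

```python
import numpy as np

d_model = 512          # embedding size used in the paper
vocab_size = 10000     # hypothetical vocabulary size

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))  # learned during training in practice

token_ids = np.array([5, 42, 7])       # a three-word input sentence after tokenization
x = embedding_table[token_ids]         # shape (seq_len, d_model) = (3, 512)
print(x.shape)                         # one 512-dim vector per position, fed to the bottom encoder
```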
Encoding
- An encoder receives a list of vectors as input.
- It processes the list by passing the vectors into a self-attention layer and then into a feed-forward neural network.
- The output is then sent to the next encoder.
- The word at each position passes through a self-attention process before being fed into a feed-forward neural network.
- The same feed-forward network is applied separately to each vector.
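To make "the same feed-forward network is applied separately to each vector" concrete, here is a sketch of the position-wise feed-forward network; the inner size of 2048 and the ReLU follow the original paper, and the random weights are placeholders.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model); the same W1/W2 are applied to every row (position) independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between the two linear layers

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))            # three positions coming out of self-attention
out = position_wise_ffn(x, W1, b1, W2, b2)   # (3, 512), each position computed independently
```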
Self-Attention
- Self-attention allows the model to associate different parts of the input sentence with each other, to resolve references (e.g., figuring out that “it” refers to “the animal” in “The animal didn't cross the street because it was too tired”).
- When processing a word, self-attention enables the model to look at other positions in the input sequence to improve encoding.
- Self-attention allows the Transformer to incorporate understanding of relevant words into the word currently being processed.
- As the model encodes a word, the attention mechanism focuses on other words and integrates their representations into the encoding of the current word.
Self-Attention Calculations
- The first step is to create three vectors (Query, Key, and Value) from each input vector (embedding of each word).
- These vectors are created by multiplying the embedding by three weight matrices, which are trained during the training process.
- The dimensionality of these vectors is typically smaller than the embedding vector (e.g., 64 vs. 512).
- These are abstractions for calculating and thinking about attention.
- The second step is to calculate a score by taking the dot product of the query vector with the key vector of the word being scored.
- This score determines the amount of focus to place on other parts of the input sentence during encoding.
- The third and fourth steps involve dividing the scores by the square root of the dimension of the key vectors (e.g., \sqrt{64} = 8), and then passing the result through a softmax operation.
- Softmax normalizes the scores to be positive and sum to 1, determining the weight given to each word.
- The fifth step is to multiply each value vector by the softmax score.
- This keeps the values of focused words intact and diminishes irrelevant words.
- The sixth step is to sum the weighted value vectors, producing the output of the self-attention layer.
- The calculation is done in matrix form for faster processing.
- The Query, Key, and Value matrices are calculated by packing embeddings into a matrix X and multiplying it by trained weight matrices W^Q, W^K, W^V.
- Each row in the X matrix corresponds to a word in the input sentence.
- The self-attention calculation can be condensed into one formula:
\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V
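The six steps above collapse into a few lines of NumPy. This is a sketch of the formula, not the paper's code; the small random X and weight matrices stand in for real embeddings and trained projections, and d_k = 64 matches the example dimensionality above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 2-3: dot products, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # step 4: each row is positive and sums to 1
    return weights @ V                   # steps 5-6: weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 512, 64
X = rng.normal(size=(seq_len, d_model))                    # embedded input words
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = self_attention(X @ W_Q, X @ W_K, X @ W_V)              # (3, 64): one output row per word
```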
Multi-Headed Attention
- Multi-headed attention improves the performance of the attention layer by:
- Expanding the model’s ability to focus on different positions.
- Providing multiple “representation subspaces”.
- With multi-headed attention, there are multiple sets of Query/Key/Value weight matrices (eight in the Transformer).
- Each set is randomly initialized and, after training, projects input embeddings into a different representation subspace.
- The same self-attention calculation is performed multiple times with different weight matrices.
- There are separate Q, K, V weight matrices for each head, resulting in different Q, K, V matrices.
- The results (Z matrices) from each head are concatenated and then multiplied by a weight matrix W^O to condense them into a single matrix.
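A sketch of multi-headed attention under the configuration above (8 heads of dimension 64, concatenated and projected by W^O); the random weight matrices are placeholders for trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(X, W_Qs, W_Ks, W_Vs, W_O):
    # One attention computation per head, each head using its own Q/K/V projections.
    heads = [attention(X @ W_Q, X @ W_K, X @ W_V)
             for W_Q, W_K, W_V in zip(W_Qs, W_Ks, W_Vs)]
    Z = np.concatenate(heads, axis=-1)   # (seq_len, num_heads * d_k)
    return Z @ W_O                       # condense back down to (seq_len, d_model)

num_heads, d_model, d_k = 8, 512, 64
rng = np.random.default_rng(1)
W_Qs = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
W_Ks = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
W_Vs = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
W_O  = rng.normal(size=(num_heads * d_k, d_model))

X = rng.normal(size=(3, d_model))                     # three embedded words
Z = multi_head_attention(X, W_Qs, W_Ks, W_Vs, W_O)    # (3, 512)
```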
Positional Encoding
- The model needs a way to account for the order of words in the input sequence.
- This is achieved by adding a positional encoding vector to each input embedding.
- These vectors follow a specific pattern that the model learns.
- Adding these values to the embeddings creates meaningful distances between the embedding vectors after being projected into Q, K, V vectors.
- The positional encoding vectors provide the model with a sense of word order and distance between words.
- Each row corresponds to the positional encoding for one position: the first row is added to the embedding of the first word in the input sequence, the second row to the second word, and so on.
- Each row contains 512 values, each between -1 and 1.
- The values are generated by a sine function for the left half and a cosine function for the right half, then concatenated.
- The formula for positional encoding is in section 3.5 of the paper.
- The positional encoding method in the paper interweaves the two signals (sine and cosine) instead of directly concatenating them.
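A sketch of the sinusoidal positional encoding from section 3.5 of the paper, using the interleaved sine/cosine formulation mentioned above rather than the concatenated visualization.

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even indices: sine
    pe[:, 1::2] = np.cos(angles)    # odd indices: cosine
    return pe                       # every value lies between -1 and 1

pe = positional_encoding(max_len=50)
# x = word_embeddings + pe[:seq_len]   # added before the bottom encoder
```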
Residuals
- Each sub-layer (self-attention, feed-forward neural network) in each encoder has a residual connection around it.
- Each sub-layer is followed by a layer-normalization step.
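In code, the residual connection plus layer normalization around each sub-layer can be sketched as below; the epsilon value and the omission of learned scale/shift parameters are simplifications.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance
    # (learned gain/bias parameters are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalization.
    return layer_norm(x + sublayer(x))

# Inside an encoder, conceptually:
#   x = add_and_norm(x, self_attention_sublayer)
#   x = add_and_norm(x, feed_forward_sublayer)
```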
Decoder Side
- The decoder shares many concepts with the encoder, but let's explore how the components work together.
- The encoder starts by processing the input sequence.
- The output of the top encoder is transformed into a set of attention vectors K and V.
- These are used by each decoder in its “encoder-decoder attention” layer to focus on appropriate places in the input sequence.
- Each step in the decoding phase outputs an element from the output sequence.
- The process repeats until a special end-of-sentence symbol is reached.
- The output of each step is fed to the bottom decoder in the next time step.
- The decoders pass their decoding results upward, similar to the encoders.
- Decoder inputs are embedded and have positional encoding added.
- The self-attention layers in the decoder operate differently:
- They are only allowed to attend to earlier positions in the output sequence.
- Future positions are masked (set to -inf) before the softmax step; see the sketch after this list.
- The “Encoder-Decoder Attention” layer is similar to multi-headed self-attention but creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.
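A sketch of the decoder-side masking described above: positions later than the current one are set to -inf before the softmax, so they receive zero attention weight. The mask construction is an illustration, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Upper-triangular mask: position i may not attend to positions j > i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)          # exp(-inf) = 0 after the softmax
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(2)
Q = K = V = rng.normal(size=(4, 64))
out = masked_self_attention(Q, K, V)   # row i depends only on positions <= i
```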
Output Layers
- The decoder stack outputs a vector of floats, which needs to be converted into a word.
- This is done by the final Linear layer, followed by a Softmax Layer.
- The Linear layer projects the vector produced by the decoder stack into a much larger vector called a logits vector.
- The logits vector has a cell for each word in the model's output vocabulary, with each cell holding a score for its word.
- The softmax layer turns the scores into probabilities (positive values that add up to 1.0).
- The cell with the highest probability is chosen, and the word associated with it is produced as the output for that time step.
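A sketch of the final Linear layer and softmax, using a tiny made-up output vocabulary; the projection matrix and decoder output vector are random placeholders for learned and computed values.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy output vocabulary
d_model = 512

rng = np.random.default_rng(3)
W_linear = rng.normal(size=(d_model, len(vocab)))        # learned in practice

decoder_output = rng.normal(size=(d_model,))             # vector from the decoder stack
logits = decoder_output @ W_linear                       # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                     # softmax: positive values summing to 1.0
print(vocab[int(np.argmax(probs))])                      # word produced for this time step
```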
Training Phase
- During training, an untrained model goes through the forward pass.
- The model's output is compared with the actual correct output from the labeled training dataset.
- The output vocabulary is created in the preprocessing phase and each word in the vocabulary is one-hot encoded.
Loss Function
- The loss function is the metric optimized during the training phase.
- The goal is for the output to be a probability distribution that indicates the correct word.
- The model's weights are adjusted using backpropagation to make the output closer to the desired output.
- Two probability distributions can be compared, in the simplest view, by subtracting one from the other.
- Cross-entropy and Kullback–Leibler divergence are the standard ways of measuring the difference.
- The model successively produces probability distributions, each represented by a vector of width vocab_size.
- Each probability distribution has the highest probability at the cell associated with the correct word for that position.
- The model is trained against targeted probability distributions for each sample sentence.
- After training on a large dataset, the produced probability distributions should match the expected translations.
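As a concrete example of comparing the model's output distribution against the one-hot target, here is a minimal cross-entropy computation; the six-word vocabulary and the two distributions are made up for illustration.

```python
import numpy as np

# Toy vocabulary of 6 words; the target says the correct word is at index 2.
target = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])        # one-hot desired distribution
predicted = np.array([0.1, 0.1, 0.5, 0.1, 0.1, 0.1])     # model's softmax output

cross_entropy = -np.sum(target * np.log(predicted))      # lower is better
print(cross_entropy)   # ~0.69; backpropagation adjusts weights to push this toward 0
```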
Decoding
- The model produces outputs one at a time, selecting the word with the highest probability (greedy decoding).
- Another approach is beam search, where the model holds on to the top several words and runs the model multiple times, assuming each of those words as the first output.
- Beam search repeats this for subsequent positions, keeping whichever versions produce lower error.
- Beam search uses the hyperparameters beam_size and top_beams, which can be experimented with.
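A sketch of greedy decoding as described above; `model_step` is a hypothetical function returning a probability distribution over the output vocabulary given the words produced so far. Beam search would instead keep the top beam_size partial outputs at each step rather than only the single best one.

```python
import numpy as np

def greedy_decode(model_step, start_id, eos_id, max_len=50):
    # model_step(prefix) -> probability distribution over the vocabulary (hypothetical).
    output = [start_id]
    for _ in range(max_len):
        probs = model_step(output)
        next_id = int(np.argmax(probs))      # pick the single most likely word
        output.append(next_id)
        if next_id == eos_id:                # stop at the end-of-sentence symbol
            break
    return output[1:]
```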