The Illustrated Transformer Vocabulary
- The article discusses the Transformer model, which uses attention to speed up training compared to previous neural machine translation models.
- The Transformer model outperforms the Google Neural Machine Translation model in certain tasks and is recommended by Google Cloud for use with their Cloud TPU offering.
- The article aims to break down the model and explain its functionality.
- The Transformer was introduced in the paper "Attention is All You Need." A TensorFlow implementation is available in the Tensor2Tensor package.
- Harvard's NLP group has created a guide that annotates the paper with a PyTorch implementation.
- The article simplifies the concepts for those without in-depth knowledge.
Model Overview
- The model can be viewed as a black box that takes a sentence in one language and outputs its translation in another.
- The Transformer model consists of an encoding component, a decoding component, and connections between them.
- The encoding component is a stack of encoders, with the paper using six encoders stacked on top of each other. The number six is not fixed and can be experimented with.
- The decoding component consists of a stack of decoders with the same number as the encoders.
- The encoders have identical structure but do not share weights.
- Each encoder is divided into two sub-layers:
- A self-attention layer that allows the encoder to consider other words in the input sentence when encoding a specific word.
- A feed-forward neural network, which is applied independently to each position.
- The decoder includes both of the above layers, with an additional attention layer in between.
- This attention layer allows the decoder to focus on relevant parts of the input sentence, similar to attention mechanisms in seq2seq models.
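As a structural sketch of the stacking described above (not the paper's or Tensor2Tensor's code), the data flow can be written roughly as follows; `encoder_layers` and `decoder_layers` are hypothetical callables standing in for the sub-layers detailed in the later sections, with six of each in the paper's configuration.

```python
def encode(x, encoder_layers):
    # Each encoder processes the full sequence and feeds its output to the encoder above it.
    for layer in encoder_layers:
        x = layer(x)
    return x  # the top encoder's output becomes the K/V source for the decoders

def decode(y, memory, decoder_layers):
    # Every decoder attends over the output generated so far and over the encoder output ("memory").
    for layer in decoder_layers:
        y = layer(y, memory)
    return y

# translation_step = decode(outputs_so_far, encode(source_sentence, encoders), decoders)
```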
Tensors
- The process begins by converting each input word into a vector using an embedding algorithm.
- Each word is embedded into a vector of size 512.
- The embedding happens in the bottom-most encoder.
- All encoders receive a list of vectors, each of size 512.
- For the bottom encoder, these are word embeddings. For other encoders, these are the outputs of the encoder directly below.
- The length of this list (the sequence length) is a hyperparameter that is set according to the length of the longest sentence in the training dataset.
- After embedding, each word flows through the two layers of the encoder.
- A key property of the Transformer is that each word in each position flows through its own path in the encoder.
- The self-attention layer introduces dependencies between these paths.
- The feed-forward layer, however, has no such dependencies, so the different paths can be executed in parallel while flowing through it.
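A minimal sketch of the embedding step described in this section; the embedding size of 512 follows the article, while the vocabulary size, token ids, and random embedding table are made-up placeholders for learned values.

```python
import numpy as np

d_model = 512          # embedding size used in the paper
vocab_size = 10000     # hypothetical vocabulary size

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))  # learned during training in practice

token_ids = np.array([5, 42, 7])       # a three-word input sentence after tokenization
x = embedding_table[token_ids]         # shape (seq_len, d_model) = (3, 512)
print(x.shape)                         # one 512-dim vector per position, fed to the bottom encoder
```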
Encoding
- An encoder receives a list of vectors as input.
- It processes the list by passing the vectors into a self-attention layer and then into a feed-forward neural network.
- The output is then sent to the next encoder.
- The word at each position passes through a self-attention process before being fed into a feed-forward neural network.
- The same feed-forward network is applied separately to each vector.
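To make "the same feed-forward network is applied separately to each vector" concrete, here is a sketch of the position-wise feed-forward network; the inner size of 2048 and the ReLU follow the original paper, and the random weights are placeholders.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model); the same W1/W2 are applied to every row (position) independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between the two linear layers

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))            # three positions coming out of self-attention
out = position_wise_ffn(x, W1, b1, W2, b2)   # (3, 512), each position computed independently
```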
Self-Attention
- Self-attention allows the model to associate different parts of the input sentence with each other, to resolve references (e.g., figuring out that “it” refers to “the animal” in “The animal didn't cross the street because it was too tired”).
- When processing a word, self-attention enables the model to look at other positions in the input sequence to improve encoding.
- Self-attention allows the Transformer to incorporate understanding of relevant words into the word currently being processed.
- As the model encodes a word, the attention mechanism focuses on other words and integrates their representations into the encoding of the current word.
Self-Attention Calculations
- The first step is to create three vectors (Query, Key, and Value) from each input vector (embedding of each word).
- These vectors are created by multiplying the embedding by three weight matrices, which are trained during the training process.
- The dimensionality of these vectors is typically smaller than the embedding vector (e.g., 64 vs. 512).
- These are abstractions for calculating and thinking about attention.
- The second step is to calculate a score by taking the dot product of the query vector with the key vector of the word being scored.
- This score determines the amount of focus to place on other parts of the input sentence during encoding.
- The third and fourth steps involve dividing the scores by the square root of the dimension of the key vectors (e.g., \sqrt{64} = 8), and then passing the result through a softmax operation.
- Softmax normalizes the scores to be positive and sum to 1, determining the weight given to each word.
- The fifth step is to multiply each value vector by the softmax score.
- This keeps the values of focused words intact and diminishes irrelevant words.
- The sixth step is to sum the weighted value vectors, producing the output of the self-attention layer.
- The calculation is done in matrix form for faster processing.
- The Query, Key, and Value matrices are calculated by packing embeddings into a matrix X and multiplying it by trained weight matrices W^Q, W^K, W^V.
- Each row in the X matrix corresponds to a word in the input sentence.
- The self-attention calculation can be condensed into one formula:
\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V
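The six steps above collapse into a few lines of NumPy. This is a sketch of the formula, not the paper's code; the small random X and weight matrices stand in for real embeddings and trained projections, and d_k = 64 matches the example dimensionality above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 2-3: dot products, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # step 4: each row is positive and sums to 1
    return weights @ V                   # steps 5-6: weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 3, 512, 64
X = rng.normal(size=(seq_len, d_model))                    # embedded input words
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = self_attention(X @ W_Q, X @ W_K, X @ W_V)              # (3, 64): one output row per word
```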
Multi-Headed Attention
- Multi-headed attention improves the performance of the attention layer by:
- Expanding the model’s ability to focus on different positions.
- Providing multiple “representation subspaces”.
- With multi-headed attention, there are multiple sets of Query/Key/Value weight matrices (eight in the Transformer).
- Each set is randomly initialized and, after training, projects input embeddings into a different representation subspace.
- The same self-attention calculation is performed multiple times with different weight matrices.
- There are separate Q, K, V weight matrices for each head, resulting in different Q, K, V matrices.
- The results (Z matrices) from each head are concatenated and then multiplied by a weight matrix W^O to condense them into a single matrix.
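A sketch of multi-headed attention under the configuration above (8 heads of dimension 64, concatenated and projected by W^O); the random weight matrices are placeholders for trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(X, W_Qs, W_Ks, W_Vs, W_O):
    # One attention computation per head, each head using its own Q/K/V projections.
    heads = [attention(X @ W_Q, X @ W_K, X @ W_V)
             for W_Q, W_K, W_V in zip(W_Qs, W_Ks, W_Vs)]
    Z = np.concatenate(heads, axis=-1)   # (seq_len, num_heads * d_k)
    return Z @ W_O                       # condense back down to (seq_len, d_model)

num_heads, d_model, d_k = 8, 512, 64
rng = np.random.default_rng(1)
W_Qs = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
W_Ks = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
W_Vs = [rng.normal(size=(d_model, d_k)) for _ in range(num_heads)]
W_O  = rng.normal(size=(num_heads * d_k, d_model))

X = rng.normal(size=(3, d_model))                     # three embedded words
Z = multi_head_attention(X, W_Qs, W_Ks, W_Vs, W_O)    # (3, 512)
```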
Positional Encoding
- The model needs a way to account for the order of words in the input sequence.
- This is achieved by adding a positional encoding vector to each input embedding.
- These vectors follow a specific pattern that the model learns.
- Adding these values to the embeddings creates meaningful distances between the embedding vectors after being projected into Q, K, V vectors.
- The positional encoding vectors provide the model with a sense of word order and distance between words.
- Each row corresponds to the positional encoding for one position: the first row is added to the embedding of the first word in the input sequence, the second row to the second word, and so on.
- Each row contains 512 values, each between -1 and 1.
- The values are generated by a sine function for the left half and a cosine function for the right half, then concatenated.
- The formula for positional encoding is in section 3.5 of the paper.
- The positional encoding method in the paper interweaves the two signals (sine and cosine) instead of directly concatenating them.
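A sketch of the sinusoidal positional encoding from section 3.5 of the paper, using the interleaved sine/cosine formulation mentioned above rather than the concatenated visualization.

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even indices: sine
    pe[:, 1::2] = np.cos(angles)    # odd indices: cosine
    return pe                       # every value lies between -1 and 1

pe = positional_encoding(max_len=50)
# x = word_embeddings + pe[:seq_len]   # added before the bottom encoder
```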
Residuals
- Each sub-layer (self-attention, feed-forward neural network) in each encoder has a residual connection around it.
- Each sub-layer is followed by a layer-normalization step.
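In code, the residual connection plus layer normalization around each sub-layer can be sketched as below; the epsilon value and the omission of learned scale/shift parameters are simplifications.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance
    # (learned gain/bias parameters are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalization.
    return layer_norm(x + sublayer(x))

# Inside an encoder, conceptually:
#   x = add_and_norm(x, self_attention_sublayer)
#   x = add_and_norm(x, feed_forward_sublayer)
```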
Decoder Side
- The decoder shares many concepts with the encoder, but let's explore how the components work together.
- The encoder starts by processing the input sequence.
- The output of the top encoder is transformed into a set of attention vectors K and V.
- These are used by each decoder in its “encoder-decoder attention” layer to focus on appropriate places in the input sequence.
- Each step in the decoding phase outputs an element from the output sequence.
- The process repeats until a special end-of-sentence symbol is reached.
- The output of each step is fed to the bottom decoder in the next time step.
- The decoders pass their decoding results upward, similar to the encoders.
- Decoder inputs are embedded and have positional encoding added.
- The self-attention layers in the decoder operate differently:
- They are only allowed to attend to earlier positions in the output sequence.
- Future positions are masked (set to -inf) before the softmax step; see the sketch after this list.
- The “Encoder-Decoder Attention” layer is similar to multi-headed self-attention but creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.
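A sketch of the decoder-side masking described above: positions later than the current one are set to -inf before the softmax, so they receive zero attention weight. The mask construction is an illustration, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Upper-triangular mask: position i may not attend to positions j > i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)          # exp(-inf) = 0 after the softmax
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(2)
Q = K = V = rng.normal(size=(4, 64))
out = masked_self_attention(Q, K, V)   # row i depends only on positions <= i
```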
Output Layers
- The decoder stack outputs a vector of floats, which needs to be converted into a word.
- This is done by the final Linear layer, followed by a Softmax Layer.
- The Linear layer projects the vector produced by the decoder stack into a much larger vector called a logits vector.
- The logits vector has a cell for each word in the model's output vocabulary, with each cell holding a score for its word.
- The softmax layer turns the scores into probabilities (positive values that add up to 1.0).
- The cell with the highest probability is chosen, and the word associated with it is produced as the output for that time step.
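A sketch of the final Linear layer and softmax, using a tiny made-up output vocabulary; the projection matrix and decoder output vector are random placeholders for learned and computed values.

```python
import numpy as np

vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy output vocabulary
d_model = 512

rng = np.random.default_rng(3)
W_linear = rng.normal(size=(d_model, len(vocab)))        # learned in practice

decoder_output = rng.normal(size=(d_model,))             # vector from the decoder stack
logits = decoder_output @ W_linear                       # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                     # softmax: positive values summing to 1.0
print(vocab[int(np.argmax(probs))])                      # word produced for this time step
```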
Training Phase
- During training, an untrained model goes through the forward pass.
- The model's output is compared with the actual correct output from the labeled training dataset.
- The output vocabulary is created in the preprocessing phase and each word in the vocabulary is one-hot encoded.
Loss Function
- The loss function is the metric optimized during the training phase.
- The goal is for the output to be a probability distribution that indicates the correct word.
- The model's weights are adjusted using backpropagation to make the output closer to the desired output.
- Two probability distributions can be compared, in the simplest view, by subtracting one from the other.
- Cross-entropy and Kullback–Leibler divergence are the standard ways of measuring the difference.
- The model successively produces probability distributions, each represented by a vector of width vocab_size.
- Each probability distribution has the highest probability at the cell associated with the correct word for that position.
- The model is trained against targeted probability distributions for each sample sentence.
- After training on a large dataset, the produced probability distributions should match the expected translations.
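As a concrete example of comparing the model's output distribution against the one-hot target, here is a minimal cross-entropy computation; the six-word vocabulary and the two distributions are made up for illustration.

```python
import numpy as np

# Toy vocabulary of 6 words; the target says the correct word is at index 2.
target = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])        # one-hot desired distribution
predicted = np.array([0.1, 0.1, 0.5, 0.1, 0.1, 0.1])     # model's softmax output

cross_entropy = -np.sum(target * np.log(predicted))      # lower is better
print(cross_entropy)   # ~0.69; backpropagation adjusts weights to push this toward 0
```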
Decoding
- The model produces outputs one at a time, selecting the word with the highest probability (greedy decoding).
- Another approach is beam search, where the model holds on to the top several words and runs the model multiple times, assuming each of those words as the first output.
- Beam search repeats this for subsequent positions, keeping whichever versions produce lower error.
- Beam search uses the hyperparameters beam_size and top_beams, which can be experimented with.
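A sketch of greedy decoding as described above; `model_step` is a hypothetical function returning a probability distribution over the output vocabulary given the words produced so far. Beam search would instead keep the top beam_size partial outputs at each step rather than only the single best one.

```python
import numpy as np

def greedy_decode(model_step, start_id, eos_id, max_len=50):
    # model_step(prefix) -> probability distribution over the vocabulary (hypothetical).
    output = [start_id]
    for _ in range(max_len):
        probs = model_step(output)
        next_id = int(np.argmax(probs))      # pick the single most likely word
        output.append(next_id)
        if next_id == eos_id:                # stop at the end-of-sentence symbol
            break
    return output[1:]
```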