The Illustrated Transformer

  • The article discusses the Transformer model, which uses attention to speed up training compared to previous neural machine translation models.
  • The Transformer model outperforms the Google Neural Machine Translation model in certain tasks and is recommended by Google Cloud for use with their Cloud TPU offering.
  • The article aims to break down the model and explain its functionality.
  • The Transformer was introduced in the paper "Attention is All You Need." A TensorFlow implementation is available in the Tensor2Tensor package.
  • Harvard's NLP group has created a guide that annotates the paper with PyTorch implementation.
  • The article simplifies the concepts for those without in-depth knowledge.

Model Overview

  • The model can be viewed as a black box that takes a sentence in one language and outputs its translation in another.
  • The Transformer model consists of an encoding component, a decoding component, and connections between them.
  • The encoding component is a stack of encoders, with the paper using six encoders stacked on top of each other. The number six is not fixed and can be experimented with.
  • The decoding component consists of a stack of decoders with the same number as the encoders.
  • The encoders have identical structure but do not share weights.
  • Each encoder is divided into two sub-layers:
    • A self-attention layer that allows the encoder to consider other words in the input sentence when encoding a specific word.
    • A feed-forward neural network, which is applied independently to each position.
  • The decoder includes both of the above layers, with an additional attention layer in between.
    • This attention layer allows the decoder to focus on relevant parts of the input sentence, similar to attention mechanisms in seq2seq models.
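
  • As a structural sketch, the stacking described above can be wired up as in the Python snippet below; the identity functions are only stand-ins for the real self-attention and feed-forward sub-layers covered in later sections.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model = 512   # size of the vectors flowing through the stack

    # Stand-in sub-layers: identity functions that only show the wiring; the real
    # computations are sketched in the sections on self-attention and encoding.
    def self_attention(x, context=None):
        return x            # would mix information across positions

    def feed_forward(x):
        return x            # would transform each position independently

    def encoder_layer(x):
        return feed_forward(self_attention(x))

    def decoder_layer(y, encoder_out):
        y = self_attention(y)                   # attends to earlier output positions
        y = self_attention(y, encoder_out)      # encoder-decoder attention over the input
        return feed_forward(y)

    def transformer(src, tgt, num_layers=6):    # the paper stacks six of each
        for _ in range(num_layers):
            src = encoder_layer(src)
        for _ in range(num_layers):
            tgt = decoder_layer(tgt, src)
        return tgt

    src = rng.normal(size=(4, d_model))         # a 4-word source sentence
    tgt = rng.normal(size=(2, d_model))         # 2 output words produced so far
    print(transformer(src, tgt).shape)          # (2, 512)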

Tensors

  • The process begins by converting each input word into a vector using an embedding algorithm.
  • Each word is embedded into a vector of size 512.
  • The embedding happens in the bottom-most encoder.
  • All encoders receive a list of vectors of size 512.
    • For the bottom encoder, these are word embeddings. For other encoders, these are the outputs of the encoder directly below.
  • The length of this list (the sequence length) is a hyperparameter that is set according to the length of the longest sentence in the training dataset.
  • After embedding, each word flows through the two layers of the encoder.
  • A key property of the Transformer is that the word at each position flows through its own path in the encoder.
  • The self-attention layer introduces dependencies between these paths.
  • The feed-forward layer, however, does not have dependencies and can be executed in parallel.
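
  • A minimal sketch of the embedding step in Python/NumPy, using a toy three-word vocabulary and random numbers in place of a learned embedding table:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = {"je": 0, "suis": 1, "étudiant": 2}   # toy vocabulary for illustration
    d_model = 512                                 # embedding size used in the paper

    # Random values stand in for an embedding table learned during training.
    embedding_table = rng.normal(size=(len(vocab), d_model))

    sentence = ["je", "suis", "étudiant"]
    x = embedding_table[[vocab[w] for w in sentence]]   # one 512-dim vector per word

    print(x.shape)   # (3, 512): the list of vectors fed into the bottom encoder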

Encoding

  • An encoder receives a list of vectors as input.
  • It processes the list by passing the vectors into a self-attention layer and then into a feed-forward neural network.
  • The output is then sent to the next encoder.
  • The word at each position passes through a self-attention process before being fed into a feed-forward neural network.
  • The same feed-forward network is applied separately to each vector.
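
  • A sketch of the position-wise feed-forward network in Python/NumPy, with toy random weights standing in for trained parameters (the paper uses an inner size of 2048):

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff = 512, 2048

    # Toy weights standing in for one encoder's trained feed-forward parameters.
    W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

    def feed_forward(x):
        # The same weights are applied to every position (row) independently.
        return np.maximum(0, x @ W1 + b1) @ W2 + b2

    x = rng.normal(size=(3, d_model))   # three positions coming out of self-attention
    print(feed_forward(x).shape)        # (3, 512): one output vector per position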

Self-Attention

  • Self-attention lets the model associate each word with other relevant words in the input sentence, for example to resolve what a pronoun like “it” refers to.
  • When processing a word, self-attention enables the model to look at other positions in the input sequence to improve encoding.
  • Self-attention allows the Transformer to incorporate understanding of relevant words into the word currently being processed.
  • As the model encodes a word, the attention mechanism focuses on other words and integrates their representations into the encoding of the current word.

Self-Attention Calculations

  • The first step is to create three vectors (Query, Key, and Value) from each input vector (embedding of each word).
  • These vectors are created by multiplying the embedding by three weight matrices, which are trained during the training process.
  • The dimensionality of these vectors is typically smaller than the embedding vector (e.g., 64 vs. 512).
  • The query, key, and value vectors are abstractions that are useful for calculating and thinking about attention.
  • The second step is to calculate a score by taking the dot product of the query vector with the key vector of the word being scored.
  • This score determines the amount of focus to place on other parts of the input sentence during encoding.
  • The third and fourth steps involve dividing the scores by the square root of the dimension of the key vectors (e.g., \sqrt{64} = 8), and then passing the result through a softmax operation.
  • Softmax normalizes the scores to be positive and sum to 1, determining the weight given to each word.
  • The fifth step is to multiply each value vector by the softmax score.
    • This keeps the values of focused words intact and diminishes irrelevant words.
  • The sixth step is to sum the weighted value vectors, producing the output of the self-attention layer.
  • The calculation is done in matrix form for faster processing.
  • The Query, Key, and Value matrices are calculated by packing the embeddings into a matrix X and multiplying it by trained weight matrices W^Q, W^K, W^V.
  • Each row in the X matrix corresponds to a word in the input sentence.
  • The self-attention calculation can be condensed into one formula:

\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V
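
  • A minimal Python/NumPy sketch of this formula, with toy random weights standing in for the trained W^Q, W^K, W^V matrices:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, W_Q, W_K, W_V):
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # step 1: project each embedding
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)       # steps 2-3: dot products, scaled by sqrt(d_k)
        weights = softmax(scores, axis=-1)    # step 4: normalize each row to sum to 1
        return weights @ V                    # steps 5-6: weighted sum of the value vectors

    d_model, d_k = 512, 64
    X = rng.normal(size=(3, d_model))         # 3 words, one 512-dim embedding each
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))
    print(self_attention(X, W_Q, W_K, W_V).shape)   # (3, 64): one output per input word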

Multi-Headed Attention

  • Multi-headed attention improves the performance of the attention layer by:
    • Expanding the model’s ability to focus on different positions.
    • Providing multiple “representation subspaces”.
  • With multi-headed attention, there are multiple sets of Query/Key/Value weight matrices (eight in the Transformer).
  • Each set is randomly initialized and, after training, projects input embeddings into a different representation subspace.
  • The same self-attention calculation is performed multiple times with different weight matrices.
  • There are separate Q, K, V weight matrices for each head, resulting in different Q, K, V matrices.
  • The results (the Z matrices) from each head are concatenated and then multiplied by a weight matrix W^O to condense them into a single matrix.
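
  • A Python/NumPy sketch of multi-headed attention under the same toy-weight assumptions: eight independent heads, concatenation, then the W^O projection:

    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

    d_model, num_heads = 512, 8
    d_k = d_model // num_heads                  # 64 per head, as in the paper
    X = rng.normal(size=(3, d_model))           # 3 words

    # One independent set of Q/K/V projections per head (toy random weights).
    heads = []
    for _ in range(num_heads):
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) * 0.02 for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # each head produces a Z matrix

    Z_concat = np.concatenate(heads, axis=-1)   # (3, 512): the eight Z matrices side by side
    W_O = rng.normal(size=(num_heads * d_k, d_model)) * 0.02
    Z = Z_concat @ W_O                          # condensed into a single (3, 512) matrix
    print(Z.shape)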

Positional Encoding

  • The model needs a way to account for the order of words in the input sequence.
  • This is achieved by adding a positional encoding vector to each input embedding.
  • These vectors follow a specific pattern that the model learns.
  • Adding these values to the embeddings creates meaningful distances between the embedding vectors after being projected into Q, K, V vectors.
  • The positional encoding vectors provide the model with a sense of word order and distance between words.
  • In the article's illustration, each row corresponds to the positional encoding vector for one position: the first row is added to the embedding of the first word, the second row to the second word, and so on.
  • Each row contains 512 values, each ranging between -1 and 1.
  • The values are generated by a sine function for the left half and a cosine function for the right half, then concatenated.
  • The formula for positional encoding is in section 3.5 of the paper.
  • The positional encoding method in the paper interweaves the two signals (sine and cosine) instead of directly concatenating them.
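
  • A Python/NumPy sketch of the interleaved sine/cosine encoding from section 3.5 of the paper:

    import numpy as np

    def positional_encoding(max_len, d_model):
        # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
        positions = np.arange(max_len)[:, None]
        dims = np.arange(0, d_model, 2)[None, :]
        angles = positions / np.power(10000, dims / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)   # even indices: sine
        pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
        return pe

    pe = positional_encoding(max_len=50, d_model=512)
    print(pe.shape, float(pe.min()), float(pe.max()))   # (50, 512), values in [-1, 1]
    # In the model, pe[i] is simply added to the embedding of the word at position i.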

Residuals

  • Each sub-layer (self-attention, feed-forward neural network) in each encoder has a residual connection around it.
  • Each sub-layer is followed by a layer-normalization step.
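
  • A minimal Python/NumPy sketch of the add-and-normalize step, i.e., LayerNorm(x + Sublayer(x)); the learnable gain and bias of layer normalization are omitted for brevity:

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # Normalize each position's vector to zero mean and unit variance.
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def add_and_norm(x, sublayer):
        # Residual connection around the sub-layer, followed by layer normalization.
        return layer_norm(x + sublayer(x))

    x = np.random.default_rng(0).normal(size=(3, 512))
    out = add_and_norm(x, lambda v: 0.5 * v)   # stand-in for self-attention or the FFN
    print(out.shape)                           # (3, 512)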

Decoder Side

  • The decoder shares many concepts with the encoder, but let's explore how the components work together.
  • The encoder starts by processing the input sequence.
  • The output of the top encoder is transformed into a set of attention vectors K and V.
  • These are used by each decoder in its “encoder-decoder attention” layer to focus on appropriate places in the input sequence.
  • Each step in the decoding phase outputs an element from the output sequence.
  • The process repeats until a special end-of-sentence symbol (e.g., <end of sentence>) is produced, indicating the decoder has completed its output.
  • The output of each step is fed to the bottom decoder in the next time step.
  • The decoders pass their decoding results upward, similar to the encoders.
  • Decoder inputs are embedded and have positional encoding added.
  • The self-attention layers in the decoder operate differently:
    • They are only allowed to attend to earlier positions in the output sequence.
    • Future positions are masked (set to -inf) before the softmax step.
  • The “Encoder-Decoder Attention” layer is similar to multi-headed self-attention but creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.
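
  • A Python/NumPy sketch of the masking used in the decoder's self-attention (toy Q, K, V matrices; a single head for brevity):

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def masked_self_attention(Q, K, V):
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        # Mask future positions: position i may only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)   # -inf becomes weight 0 after softmax
        return softmax(scores, axis=-1) @ V

    rng = np.random.default_rng(0)
    Q = K = V = rng.normal(size=(4, 64))             # 4 output positions produced so far
    print(masked_self_attention(Q, K, V).shape)      # (4, 64)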

Output Layers

  • The decoder stack outputs a vector of floats, which needs to be converted into a word.
  • This is done by the final Linear layer, followed by a Softmax Layer.
  • The Linear layer projects the vector produced by the decoder stack into a much larger vector called a logits vector.
  • The logits vector has a cell for each word in the model's output vocabulary, with each cell holding a score for its word.
  • The softmax layer turns the scores into probabilities (positive values that add up to 1.0).
  • The cell with the highest probability is chosen, and the word associated with it is produced as the output for that time step.
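
  • A sketch of the final projection in Python/NumPy, using a hypothetical six-word output vocabulary and random weights in place of the trained Linear layer:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model = 512
    vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy output vocabulary

    W_vocab = rng.normal(size=(d_model, len(vocab))) * 0.02  # stand-in for the Linear layer

    decoder_output = rng.normal(size=(d_model,))             # vector from the decoder stack
    logits = decoder_output @ W_vocab                        # one score per vocabulary word
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                     # softmax: probabilities sum to 1.0

    print(vocab[int(np.argmax(probs))])                      # word with the highest probability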

Training Phase

  • During training, an untrained model goes through the forward pass.
  • The model's output is compared with the actual correct output from the labeled training dataset.
  • The output vocabulary is created in the preprocessing phase and each word in the vocabulary is one-hot encoded.

Loss Function

  • The loss function is the metric optimized during the training phase.
  • The goal is for the output to be a probability distribution that indicates the correct word.
  • The model's weights are adjusted using backpropagation to make the output closer to the desired output.
  • Training needs a way to compare two probability distributions, i.e., to measure how far the model's output distribution is from the desired one.
  • Cross-entropy and Kullback–Leibler divergence are the relevant measures for this.
  • The model successively produces probability distributions, each represented by a vector of width vocab_size.
  • Each probability distribution has the highest probability at the cell associated with the correct word for that position.
  • The model is trained against targeted probability distributions for each sample sentence.
  • After training on a large dataset, the produced probability distributions should match the expected translations.
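
  • A small worked example of the loss for one output position, assuming cross-entropy against a one-hot target (the probabilities shown are made up):

    import numpy as np

    vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy vocabulary

    # One-hot target distribution: all probability on the correct word, e.g. "thanks".
    target = np.zeros(len(vocab))
    target[vocab.index("thanks")] = 1.0

    # A made-up output distribution from an untrained model.
    predicted = np.array([0.2, 0.2, 0.15, 0.25, 0.1, 0.1])

    # Cross-entropy: small when the model puts its probability mass on the correct
    # word, large otherwise; backpropagation adjusts the weights to reduce it.
    loss = -np.sum(target * np.log(predicted))
    print(round(float(loss), 3))   # -log(0.25) ≈ 1.386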

Decoding

  • The model produces outputs one at a time, selecting the word with the highest probability (greedy decoding).
  • Another approach is beam search: instead of committing to only the single best word, the model holds on to the top several candidates (say, the two most likely words) and runs the next decoding step once for each of them.
  • Beam search repeats this for subsequent positions, keeping the partial translations that produce less error.
  • Beam search uses the hyperparameters beam_size and top_beams, which can be experimented with.
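
  • A sketch of greedy decoding in Python/NumPy; fake_decoder_step is a hypothetical stand-in for a full forward pass through a trained model:

    import numpy as np

    vocab = ["a", "am", "i", "thanks", "student", "<eos>"]   # toy output vocabulary

    def fake_decoder_step(prefix):
        # Stand-in for running the trained model on the words produced so far:
        # returns a probability distribution over the vocabulary for the next word.
        rng = np.random.default_rng(len(prefix))
        p = rng.random(len(vocab))
        return p / p.sum()

    def greedy_decode(max_len=10):
        output = []
        while len(output) < max_len:
            probs = fake_decoder_step(output)
            word = vocab[int(np.argmax(probs))]   # keep only the single best word each step
            if word == "<eos>":
                break
            output.append(word)
        return output

    print(greedy_decode())
    # Beam search would instead keep the beam_size best partial outputs at every step,
    # re-run the model for each, and finally return the top_beams best candidates.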