Transformers: Attention Is All You Need

Natural Language Processing

Learning Representations of Variable Length Data

  • Neural Machine Translation (NMT) uses a single neural network for machine translation in an end-to-end manner.
  • Sequence-to-sequence architecture is used for NMT.
  • RNNs are commonly used for learning variable-length representations and are a natural fit for sentences.
  • LSTMs, GRUs, and their variants are prevalent in recurrent models and are at the core of seq2seq models with attention.
  • However, RNNs have limitations:
    • Sequential processing prevents parallelization within a training instance.
    • Long-range dependencies can be tricky despite gating mechanisms.

Attention

  • Attention between encoder and decoder is crucial in NMT.
  • Attention mechanisms can be used for representation learning.

Motivation

  • Design a neural network to encode and process text where words "attend to" other relevant words.
  • Objectives:
    • Establish connections between words.
    • Determine the strength of connections based on the words themselves.

Self-Attention

  • Self-attention learns dependencies between words in a sentence to capture its internal structure.
  • High-Level Explanation:
    • Consider the example sentence: "The animal didn't cross the street because it was too tired."
    • What does "it" refer to in this sentence?
    • Self-attention allows the model to look at other positions in the input sequence for clues to better encode a word as it processes each word/position.

Three Ways of Attention

  • Encoder Self-Attention
  • Encoder-Decoder Attention
  • Masked Decoder Self-Attention

The Transformer

  • The Transformer model architecture includes:
    • Encoders
    • Decoders
    • Self-attention mechanism
    • Cross-attention mechanism

Sequence-to-sequence with Attention

  • Attention scores and distribution are used to take a weighted sum of the encoder hidden states.
  • The attention output contains information from hidden states that received high attention.

Self-Attention in Detail

  • The first step involves creating three vectors from each word embedding at the encoder side:
    • Query vector
    • Key vector
    • Value vector
  • These vectors are created by multiplying the embedding by three trained matrices.
  • Query, key, and value vectors are abstractions used for calculating and thinking about attention.
  • Query: What the token is seeking (information needs).
  • Key: The information the token holds (relevance to other tokens' queries).
  • Value: The actual content the token shares if its key is found relevant.
  • Analogy: Searching for a topic in a library:
    • Query: Having a topic in mind.
    • Keys: Checking titles and keywords of books.
    • Values: Retrieving the books that "match".
  • The second step involves calculating a score that determines how much focus to place on other parts of the input sentence while encoding a word at a certain position.
  • The score is the dot product of the query vector with the key vector of the respective word being scored.
  • For example, when processing self-attention for the word in position 1 (Thinking), the first score is the dot product of q1 and k1, and the second score is the dot product of q1 and k2, and so on.
  • The third and fourth steps involve:
    • Dividing the scores by the square root of the dimension of the key vectors for training stability.
    • Passing the result through a softmax operation to normalize the scores so they’re all positive and add up to 1.
  • The fifth and sixth steps involve:
    • Multiplying each value vector by the softmax score.
    • Summing up the weighted value vectors to produce the output of the self-attention layer at that position.
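The six steps above can be sketched in NumPy for a toy two-token sequence; the embeddings and the three weight matrices below are random stand-ins, not values from any trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3

# Toy embeddings for two tokens ("Thinking", "Machines").
x = rng.normal(size=(2, d_model))

# Step 1: trained projection matrices produce query/key/value vectors.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Step 2: score = dot product of q1 with every key (q1·k1, q1·k2, ...).
scores = Q[0] @ K.T                      # shape (2,)

# Steps 3-4: scale by sqrt(d_k), then softmax to positive weights summing to 1.
scores = scores / np.sqrt(d_k)
weights = np.exp(scores) / np.exp(scores).sum()

# Steps 5-6: weight each value vector and sum.
z1 = weights @ V                         # self-attention output at position 1
```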

Scaled Dot-Product Attention

  • Attention formula: Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
  • Each vector has three representations:
    • Query: Asking for information.
    • Key: Saying that it has some information.
    • Value: Giving the information.
  • These matrices allow different aspects of the x vectors to be used/emphasized in each of the three roles.
  • Attention matches queries against keys and retrieves the values at the positions whose keys best match the query (a soft lookup).
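The formula can be sketched in matrix form with NumPy; the `attention` helper and its random inputs are illustrative, not a library API:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax, numerically stabilized.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = attention(Q, K, V)    # one output vector per query, shape (5, 8)
```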

Multi-Head Attention

  • Multi-headed attention refines the self-attention layer by adding a mechanism that:
    • Expands the model’s ability to focus on different positions.
    • Gives the attention layer multiple “representation subspaces”.
  • It works by performing the same self-attention calculation multiple times with different weight matrices, resulting in different output matrices.
  • One attention head may focus on "the animal" while another focuses on "tired" for the word "it".

Multi-Head Attention Steps

  1. Input sentence.
  2. Embed each word.
  3. Split into multiple heads (e.g., 8 heads) and multiply X (or the previous layer's output R) with per-head weight matrices.
  4. Calculate attention using the resulting Q/K/V matrices.
  5. Concatenate the resulting Z matrices, then multiply with weight matrix W^O to produce the output of the layer.
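These steps can be sketched with two heads (the paper uses 8); all weight matrices are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))   # steps 1-2: embedded input

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Step 3: one set of Q/K/V projection matrices per head.
heads = []
for h in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Step 4: scaled dot-product attention within each head.
    Z = softmax(Q @ K.T / np.sqrt(d_head)) @ V
    heads.append(Z)

# Step 5: concatenate the Z matrices and project with W_O.
W_O = rng.normal(size=(n_heads * d_head, d_model))
output = np.concatenate(heads, axis=-1) @ W_O
```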

Matrix Form - Self-Attention

  • Queries: Q = \beta_q 1^T + \Omega_q X
  • Keys: K = \beta_k 1^T + \Omega_k X
  • Values: V = \beta_v 1^T + \Omega_v X
  • Self-attention weights: Softmax[K^T Q]
  • Output: Sa[X] = V \cdot Softmax[K^T Q]

Multi-Head Self-Attention

  • In parallel:
    • Input: X
    • Queries
    • Keys
    • Values
    • Scaled Dot-Product Attention
  • Concatenate and transform: MhSa[X] = \Omega_c [Sa_1[X]; Sa_2[X]]

Cross-Attention

  • In cross-attention, the queries, keys, and values are different and come from different sources (encoder and decoder).

Cross-Attention/Encoder-Decoder Attention

  • Decoder:
    • Input: X_d
    • Queries: Q = \beta_q 1^T + \Omega_q X_d
  • Encoder:
    • Input: X_e
    • Keys: K = \beta_k 1^T + \Omega_k X_e
    • Values: V = \beta_v 1^T + \Omega_v X_e
  • Cross-attention weights: Softmax[K^T Q]
  • Output: V \cdot Softmax[K^T Q]

Masked Multi-Head Attention

  • Decoder has different self-attention => Masked self-attention.
  • We generate one token at a time.
  • During generation, we don't know which tokens we'll generate in the future.
  • To enable parallel training while keeping generation autoregressive, we forbid the decoder from looking ahead.
  • Future tokens are masked out (setting them to -inf) before the softmax step in the self-attention calculation.
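A small NumPy sketch of this masking: entries for future positions are set to -inf before the softmax, so their attention weight comes out exactly 0 (the zero logits here are purely for illustration):

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))   # stand-in attention logits

# Entry (i, j) with j > i would let token i see a future token,
# so it is set to -inf; its softmax weight then becomes exactly 0.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)

# Token 0 attends only to itself; the last token attends to all four.
```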

Attention is Cheap!

  • FLOPs (floating-point operations) count the arithmetic operations a computation performs and are used to measure its cost. (FLOPS, operations per second, measures hardware speed.)
  • Per layer, self-attention costs O(n^2 \cdot d) versus O(n \cdot d^2) for a recurrent layer, so attention is cheap whenever the sequence length n is smaller than the representation dimension d.

Representing The Order of The Sequence

  • Self-attention is equivariant to permuting word order.
  • Word order is important in language.

Using Positional Encoding

  • Positional encoding gives the advantage of being able to scale to unseen lengths of sequences.
  • The Transformer adds a vector to each input embedding.
  • These vectors follow a specific pattern that the model learns, which helps it determine the position of each word or the distance between different words in the sequence.
  • Ideally, the following criteria should be satisfied:
    • It should output a unique encoding for each time-step (word’s position in a sentence).
    • Distance between any two time-steps should be consistent across sentences with different lengths.
    • Our model should generalize to longer sentences without any extra effort; its values should be bounded.
    • It must be deterministic.
  • Sine and cosine functions of different frequencies are used.
  • Positional encoding step:
    • word embedding + positional encoding = the input fed to the first layer.
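The sinusoidal scheme from the paper can be sketched as follows; `positional_encoding` is a hypothetical helper name:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
# Deterministic, bounded in [-1, 1], one unique vector per position;
# usage: x = word_embeddings + pe[:seq_len]
```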

Attention Is All You Need - The Transformer Architecture

  • Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
  • "Full package" model:
    • Attention only: all RNN components deleted
    • Positional encodings
    • Residual network (ResNet) structure
    • Interspersing of Attention and MLP
    • LayerNorms
    • Multiple heads of attention in parallel
    • Great hyperparameters (e.g., ffw_size = 4 × d_model; isotropic, i.e., the same width throughout the stack)
  • The transformer has changed remarkably little to this day.
  • The only consistent change is the "pre-norm" formulation, which reshuffles the LayerNorms.

Transformer Architecture Details

  • Encoder: Task is to read and “understand” the user’s input.
  • Decoder: Task is to generate the output (e.g., answer the user’s query).
  • Input: Tokenization and Embedding, Positional Encoding.

Input Tokenization

  • Input text is split into pieces which can be characters, words, or "tokens".
  • Example: "The detective investigated" -> [The] [detective] [invest] [igat] [ed_].
  • Tokens are indices into the "vocabulary": [The] [detective] [invest] [igat] [ed_] -> [3 721 68 1337 42].
  • Each vocabulary entry corresponds to a learned dmodel-dimensional vector [3 721 68 1337 42] -> [ [0.123, -5.234, …], […], […], […], […] ].
  • The embedding table has size vocab size (32k) × d_model.
  • Attention is permutation invariant, but language is not.
  • Need to encode the position of each word; just add something.
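A toy sketch of this lookup-then-add-position pipeline; the embedding table is random and the additive positional signal is a deliberately crude stand-in ("just add something"):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, d_model = 32_000, 8      # vocab ~32k as in the slides; d_model shrunk for the toy

# Learned embedding table: one d_model-dimensional vector per vocabulary entry.
embedding_table = rng.normal(size=(vocab_size, d_model))

token_ids = [3, 721, 68, 1337, 42]   # [The][detective][invest][igat][ed_]
x = embedding_table[token_ids]       # lookup -> (5, d_model)

# Attention is permutation invariant, so add positional information.
positions = np.arange(len(token_ids))[:, None]
x = x + np.sin(positions / 10.0)     # crude stand-in positional signal
```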

Multi-Headed Self-Attention

  • The input sequence is used to create queries, keys, and values.
  • Each token can "look around" the whole input and decide how to update its representation based on what it sees.

Point-wise MLP

  • A simple MLP applied to each token individually: z_i = W_2 \, GeLU(W_1 x_i + b_1) + b_2
  • Think of it as each token pondering for itself about what it has observed previously.
  • There's some weak evidence this is where "world knowledge" is stored, too.
  • It contains the bulk of the parameters.
  • When people make giant models and sparse/moe, this is what becomes giant.
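A sketch of the point-wise MLP with the ffw_size = 4 × d_model convention; the weights are random stand-ins, and the tanh GeLU approximation is one common choice:

```python
import numpy as np

def gelu(x):
    """tanh approximation of GeLU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(4)
d_model = 8
d_ff = 4 * d_model                    # the "ffw_size = 4 x d_model" hyperparameter

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def mlp(x):
    # Applied to each token independently: no mixing across positions.
    return gelu(x @ W1 + b1) @ W2 + b2

X = rng.normal(size=(5, d_model))     # 5 tokens
out = mlp(X)
```

Because the MLP touches each token separately, processing a token alone gives the same result as processing it inside the sequence.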

Residual/Skip Connections

  • Each module's output has the exact same shape as its input.
  • Following ResNets, the module computes a "residual" instead of a new value: z_i = Module(x_i) + x_i
  • This was shown to dramatically improve trainability.

LayerNorm

  • Normalization also dramatically improves trainability.
  • There's post-norm (original) and pre-norm (modern).
  • Post-norm: z_i = LN(Module(x_i) + x_i)
  • Pre-norm: z_i = Module(LN(x_i)) + x_i
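The two formulations side by side, with a trivial stand-in module; LayerNorm's learned scale and shift parameters are omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def module(x):
    return x * 2.0                    # stand-in for attention or the MLP

x = np.random.default_rng(5).normal(size=(3, 8))

# Post-norm (original): normalize after the residual add.
z_post = layer_norm(module(x) + x)

# Pre-norm (modern): normalize the module's input; the residual path stays clean.
z_pre = module(layer_norm(x)) + x
```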

Encoding / Encoder

  • Since input and output shapes are identical, we can stack N such blocks.
  • Typically, N=6 ("base"), N=12 ("large") or more.
  • Encoder output is a "heavily processed" (think: "high level, contextualized") version of the input tokens, i.e., a sequence.
  • This has nothing to do with the requested output yet (think: translation).
  • That comes with the decoder.

Decoding / the Decoder

  • Alternatively: Generating / the Generator
  • What we want to model: p(z|x)
  • For example, in translation: p(z | \text{"the detective investigated"}) \forall z
  • Seems impossible at first, but we can exactly decompose it into tokens:
    • p(z|x) = p(z_1|x) p(z_2|z_1,x) p(z_3|z_1,z_2,x) …
  • Meaning, we can compute the likelihood of a given output z, or generate/sample an answer z one token at a time.
  • Each p is a full pass through the model.
  • For generating p(z_3|z_2,z_1,x):
    • x comes from the encoder.
    • z_1, z_2 are what we have predicted so far; they go into the decoder.
  • Once we have p(z_i|z_{<i},x), we still need to actually sample a sentence such as "le détective a enquêté".
  • Many strategies: greedy, beam, …
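The greedy strategy can be sketched as below; `model` is a hypothetical stand-in that should return p(z_i | z_{<i}, x) via a full encoder-decoder pass (here it just emits a random distribution):

```python
import numpy as np

rng = np.random.default_rng(6)
VOCAB, EOS = 10, 0

def model(x, z_prefix):
    """Stand-in for a full encoder-decoder pass: returns p(z_i | z_<i, x)."""
    logits = rng.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def greedy_decode(x, max_len=20):
    z = []
    for _ in range(max_len):
        p = model(x, z)               # each step is one full pass through the model
        nxt = int(np.argmax(p))       # greedy: always pick the most likely token
        if nxt == EOS:
            break
        z.append(nxt)
    return z

out = greedy_decode(x="the detective investigated")
```

Beam search keeps the k best prefixes instead of a single one, at k times the cost per step.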

Masked Self-Attention

  • This is regular self-attention as in the encoder, used to process what's been decoded so far, e.g., z_2, z_1 in p(z_3|z_2,z_1,x), but with a trick…
  • At training time: masked self-attention.
  • If we had to train on one single p(z_3|z_2,z_1,x) at a time: SLOW!
  • Instead, train on all p(z_i|z_{<i},x) for all i simultaneously.
  • How? In the attention weights for z_i, set all entries for positions after i to 0 (their logits are set to -inf before the softmax).
  • This way, each token only sees the already generated ones.
  • At generation time, there is no such trick and we need to generate one z_i at a time.
  • This is why autoregressive decoding is extremely slow.

Cross Attention

  • Each decoded token can "look at" the encoder's output: Attn(q = W_q x_{dec}, k = W_k x_{enc}, v = W_v x_{enc})
  • This is the same as in the 2014 paper.
  • This is where the x in p(z_3|z_2,z_1,x) comes from: "cross" attention between x_{enc} and x_{dec}.
  • Because self-attention is so widely used, people have started just calling it "attention".
  • Hence, we now often need to explicitly call this "cross attention".

Output Layer

  • Assume we have already generated K tokens; now generate the next one.
  • The decoder gathers all the information necessary to predict a probability distribution for the next token over the whole vocabulary.
  • Simple: a linear projection of token K's representation, followed by SoftMax normalization.
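A sketch of this final step; the decoder state `h_K` and the projection `W_out` are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(7)
d_model, vocab_size = 8, 100

h_K = rng.normal(size=d_model)        # decoder output at position K
W_out = rng.normal(size=(d_model, vocab_size))

logits = h_K @ W_out                  # linear projection to vocabulary size
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()           # softmax: a distribution over the vocabulary
```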

Model Variations

  • Decoder-only (GPT)
  • Encoder-only (BERT)
  • Enc-Dec (T5)

The Transformer's Unification of Communities

  • The classic landscape: One architecture per "community".
  • The Transformer's takeover: One community at a time.

NN Hyperparameters

  • Regularization
  • Loss function
  • Dimensions
  • Activation function
  • Initialization
  • Optimizer (e.g., Adagrad)
  • Dropout
  • Mini-batch size
  • Initial learning rate
  • Learning rate schedule
  • Momentum
  • Stopping time

Ablations

  • Ablation studies analyze the impact of different components on performance.

Tokenization of Different Modalities

  • Tokenize different modalities each in their own way (some kind of "patching"), and send them all jointly into a Transformer.
  • Seems to just work.
  • Currently, an explosion of works is doing this!
  • Anything you can tokenize, you can feed to a Transformer (ca. 2021 and onwards).