Transformers: Attention Is All You Need
Natural Language Processing
Learning Representations of Variable Length Data
- Neural Machine Translation (NMT) uses a single neural network for machine translation in an end-to-end manner.
- Sequence-to-sequence architecture is used for NMT.
- RNNs are commonly used for learning variable-length representations and are a natural fit for sentences.
- LSTMs, GRUs, and their variants are prevalent in recurrent models and are at the core of seq2seq models with attention.
- However, RNNs have limitations:
- Sequential processing prohibits parallelization within instances.
- Long-range dependencies can be tricky despite gating mechanisms.
Attention
- Attention between encoder and decoder is crucial in NMT.
- Attention mechanisms can be used for representation learning.
Motivation
- Design a neural network to encode and process text where words "attend to" other relevant words.
- Objectives:
- Establish connections between words.
- Determine the strength of connections based on the words themselves.
Self-Attention
- Self-attention learns dependencies between words in a sentence to capture its internal structure.
- High-Level Explanation:
- Considers an example sentence: "The animal didn't cross the street because it was too tired."
- Determines what "it" refers to in the sentence.
- Self-attention allows the model to look at other positions in the input sequence for clues to better encode a word as it processes each word/position.
Three Ways of Attention
- Encoder Self-Attention
- Encoder-Decoder Attention
- Masked Decoder Self-Attention
- The Transformer model architecture includes:
- Encoders
- Decoders
- Self-attention mechanism
- Cross-attention mechanism
Sequence-to-sequence with Attention
- Attention scores and distribution are used to take a weighted sum of the encoder hidden states.
- The attention output contains information from hidden states that received high attention.
Self-Attention in Detail
- The first step involves creating three vectors from each word embedding at the encoder side:
- Query vector
- Key vector
- Value vector
- These vectors are created by multiplying the embedding by three trained matrices.
- Query, key, and value vectors are abstractions used for calculating and thinking about attention.
- Query: What the token is seeking (information needs).
- Key: The information the token holds (relevance to other tokens' queries).
- Value: The actual content the token shares if its key is found relevant.
- Analogy: Searching for a topic in a library:
- Query: Having a topic in mind.
- Keys: Checking titles and keywords of books.
- Values: Retrieving the books that "match".
- The second step involves calculating a score that determines how much focus to place on other parts of the input sentence while encoding a word at a certain position.
- The score is the dot product of the query vector with the key vector of the respective word being scored.
- For example, when processing self-attention for the word in position 1 (Thinking), the first score is the dot product of q1 and k1, and the second score is the dot product of q1 and k2, and so on.
- The third and fourth steps involve:
- Dividing the scores by the square root of the dimension of the key vectors for training stability.
- Passing the result through a softmax operation to normalize the scores so they’re all positive and add up to 1.
- The fifth and sixth steps involve:
- Multiplying each value vector by the softmax score.
- Summing up the weighted value vectors to produce the output of the self-attention layer at that position.
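The six steps above can be sketched in NumPy for a toy two-token input; all dimensions and weight values here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3

X = rng.normal(size=(2, d_model))        # one embedding per token
W_Q = rng.normal(size=(d_model, d_k))    # three trained projection matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # step 1: query/key/value vectors

scores = Q @ K.T                         # step 2: q_i . k_j for every pair
scores = scores / np.sqrt(d_k)           # step 3: scale for training stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # step 4: softmax per row

Z = weights @ V                          # steps 5-6: weighted sum of values
```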
Scaled Dot-Product Attention
- Attention formula: Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
- Each vector has three representations:
- Query: Asking for information.
- Key: Saying that it has some information.
- Value: Giving the information.
- These matrices allow different aspects of the x vectors to be used/emphasized in each of the three roles.
- Attention matches each query against all keys; positions whose keys best match the query receive high weight, so their values dominate the output.
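A minimal implementation of the formula, assuming Q, K, V are given as (tokens × d_k) matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V
```

With all-zero queries and keys the softmax is uniform, so the output is just the mean of the value rows.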
Multi-Head Attention
- Multi-headed attention refines the self-attention layer by adding a mechanism that:
- Expands the model’s ability to focus on different positions.
- Gives the attention layer multiple “representation subspaces”.
- It works by performing the same self-attention calculation multiple times with different weight matrices, resulting in different output matrices.
- One attention head may focus on "the animal" while another focuses on "tired" for the word "it".
Multi-Head Attention Steps
- Input sentence.
- Embed each word.
- Split into multiple heads (e.g., 8 heads) and multiply X (the embeddings, in the first layer) or R (the output of the previous layer) with per-head weight matrices.
- Calculate attention using the resulting Q/K/V matrices.
- Concatenate the resulting Z matrices, then multiply with weight matrix W^O to produce the output of the layer.
- With tokens as the columns of X:
- Queries: Q
- Keys: K
- Values: V
- Self-attention output: Sa[X] = V \cdot Softmax[K^T Q]
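The head-splitting steps above, sketched with hypothetical dimensions (2 heads over an 8-dimensional model, random placeholder weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    n, d_model = X.shape
    d_k = d_model // n_heads
    heads = []
    for h in range(n_heads):                 # independent Q/K/V projections per head
        sl = slice(h * d_k, (h + 1) * d_k)
        Q, K, V = X @ W_Q[:, sl], X @ W_K[:, sl], X @ W_V[:, sl]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    Z = np.concatenate(heads, axis=-1)       # concatenate the per-head Z matrices
    return Z @ W_O                           # final mixing projection W^O

rng = np.random.default_rng(1)
d_model, n_heads = 8, 2
X = rng.normal(size=(3, d_model))
Ws = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads)
```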
Multi-Head Self-Attention
- In parallel:
- Input: X
- Queries
- Keys
- Values
- Scaled Dot-Product Attention
- Concatenate and transform: MhSa[X] = \Omega_c [\mathrm{Sa}_1[X]; \mathrm{Sa}_2[X]]
Cross-Attention
- In cross-attention, the queries come from one source (the decoder) while the keys and values come from another (the encoder).
Cross-Attention/Encoder-Decoder Attention
- Decoder:
- Input: X_d
- Queries: Q = \beta_q 1^T + \Omega_q X_d
- Encoder:
- Input: X_e
- Keys: K = \beta_k 1^T + \Omega_k X_e
- Values: V = \beta_v 1^T + \Omega_v X_e
- Cross-attention output: V \cdot Softmax[K^T Q]
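A sketch of cross-attention with made-up dimensions: queries are projected from decoder states X_d, keys and values from encoder outputs X_e (bias terms omitted for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d = 4
X_e = rng.normal(size=(5, d))   # encoder output: 5 source tokens
X_d = rng.normal(size=(3, d))   # decoder states: 3 target tokens

Omega_q, Omega_k, Omega_v = (rng.normal(size=(d, d)) for _ in range(3))

Q = X_d @ Omega_q               # queries come from the decoder
K = X_e @ Omega_k               # keys and values come from the encoder
V = X_e @ Omega_v

out = softmax(Q @ K.T / np.sqrt(d)) @ V  # one output row per decoder token
```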
Masked Multi-Head Attention
- Decoder has different self-attention => Masked self-attention.
- We generate one token at a time.
- During generation, we don't know which tokens we'll generate in the future.
- To enable parallelization we forbid the decoder to look ahead.
- Future positions are masked out (their scores are set to -inf) before the softmax step in the self-attention calculation, so their attention weights become zero.
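The masking trick can be sketched as below: scores above the diagonal are set to -inf so the softmax assigns them zero weight (toy random Q/K/V):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
n, d_k = 4, 4
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

scores = Q @ K.T / np.sqrt(d_k)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf           # future positions get -inf before the softmax...
weights = softmax(scores)        # ...so their weight comes out exactly 0

out = weights @ V
```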
Attention is Cheap!
- FLOPs (floating-point operations) measure computational cost. Per layer, self-attention costs O(n^2 \cdot d) while a recurrent layer costs O(n \cdot d^2), so self-attention is cheaper whenever the sequence length n is smaller than the representation dimension d.
Representing The Order of The Sequence
- Self-attention is equivariant to permuting word order.
- Word order is important in language.
Using Positional Encoding
- Positional encoding gives the advantage of being able to scale to unseen lengths of sequences.
- The Transformer adds a vector to each input embedding.
- These vectors follow a specific pattern that the model learns, which helps it determine the position of each word or the distance between different words in the sequence.
- Ideally, the following criteria should be satisfied:
- It should output a unique encoding for each time-step (word’s position in a sentence).
- Distance between any two time-steps should be consistent across sentences with different lengths.
- Our model should generalize to longer sentences without any effort. Its values should be bounded.
- It must be deterministic.
- Sine and cosine functions of different frequencies are used.
- Positional encoding steps:
- Word embeddings + positional encoding = embedding value.
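The sinusoidal scheme can be sketched as follows; the resulting vectors are bounded, deterministic, and extend to any sequence length:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 16)
# The encoding is simply added to the input embeddings: X = embeddings + pe[:len(X)]
```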
- Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- "Full package" model:
- Attention only; all RNN components removed
- Positional encodings
- Residual network (ResNet) structure
- Interspersing of Attention and MLP
- LayerNorms
- Multiple heads of attention in parallel
- Well-chosen hyperparameters (e.g., feed-forward hidden size 4 × d_model, the same width throughout the network)
- The transformer has changed remarkably little to this day.
- The only consistent change is the "pre-norm" formulation, reshuffling the LayerNorms
- Encoder: Task is to read and “understand” the user’s input.
- Decoder: Task is to generate the output (e.g., answer the user’s query).
- Input: Tokenization and Embedding, Positional Encoding.
- Input text is split into pieces which can be characters, words, or "tokens".
- Example: "The detective investigated" -> [The] [detective] [invest] [igat] [ed_].
- Tokens are indices into the "vocabulary": [The] [detective] [invest] [igat] [ed_] -> [3 721 68 1337 42].
- Each vocabulary entry corresponds to a learned dmodel-dimensional vector [3 721 68 1337 42] -> [ [0.123, -5.234, …], […], […], […], […] ].
- The embedding table is a matrix of size vocab_size (~32k) × d_model.
- Attention is permutation invariant, but language is not.
- Need to encode the position of each word; just add something.
Multi-Headed Self-Attention
- The input sequence is used to create queries, keys, and values.
- Each token can "look around" the whole input and decide how to update its representation based on what it sees.
Point-wise MLP
- A simple MLP applied to each token individually: z_i = W_2\,\mathrm{GeLU}(W_1 x_i + b_1) + b_2
- Think of it as each token pondering for itself about what it has observed previously.
- There's some weak evidence this is where "world knowledge" is stored, too.
- It contains the bulk of the parameters.
- When people make giant models sparse (mixture-of-experts), this is the part that becomes giant.
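A sketch of the point-wise MLP with the conventional 4× hidden width; the tanh approximation of GeLU is used here, and all weights are random placeholders:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def pointwise_mlp(X, W1, b1, W2, b2):
    """Applied to every token independently: z_i = W2 GeLU(W1 x_i + b1) + b2."""
    return gelu(X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(4)
d_model = 8
d_ff = 4 * d_model   # the conventional hidden width: 4x d_model
X = rng.normal(size=(3, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = pointwise_mlp(X, W1, b1, W2, b2)
```

Because the MLP sees one token at a time, applying it to a single row gives the same result as that row of the batched output.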
Residual/Skip Connections
- Each module's output has the exact same shape as its input.
- Following ResNets, the module computes a "residual" instead of a new value: z_i = \mathrm{Module}(x_i) + x_i
- This was shown to dramatically improve trainability.
LayerNorm
- Normalization also dramatically improves trainability.
- There's post-norm (original) and pre-norm (modern).
- Post-norm: z_i = \mathrm{LN}(\mathrm{Module}(x_i) + x_i)
- Pre-norm: z_i = \mathrm{Module}(\mathrm{LN}(x_i)) + x_i
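Both formulations, sketched with a random linear map standing in for the attention/MLP module (learnable LayerNorm gain/bias omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, module):
    return layer_norm(module(x) + x)   # original: normalize after the residual add

def pre_norm_block(x, module):
    return module(layer_norm(x)) + x   # modern: normalize only the module input

rng = np.random.default_rng(5)
W = rng.normal(size=(8, 8)) * 0.1
module = lambda h: h @ W               # stand-in for an attention or MLP module
x = rng.normal(size=(3, 8))
post = post_norm_block(x, module)
pre = pre_norm_block(x, module)
```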
Encoding / Encoder
- Since input and output shapes are identical, we can stack N such blocks.
- Typically, N=6 ("base"), N=12 ("large") or more.
- Encoder output is a "heavily processed" (think: "high level, contextualized") version of the input tokens, i.e., a sequence.
- This has nothing to do with the requested output yet (think: translation).
- That comes with the decoder.
Decoding / the Decoder
- Alternatively: Generating / the Generator
- What we want to model: p(z|x)
- For example, in translation: p(z | \text{"the detective investigated"}) \forall z
- Seems impossible at first, but we can exactly decompose it into tokens:
- p(z|x) = p(z_1|x)\, p(z_2|z_1, x)\, p(z_3|z_1, z_2, x) \cdots
- Meaning, we can compute the likelihood of a given output z, or generate/sample an answer z one token at a time.
- Each p is a full pass through the model.
- For generating p(z_3|z_1, z_2, x):
- x comes from the encoder.
- z_1, z_2 is what we have predicted so far; it goes into the decoder.
- Once we have p(z_i|z_{<i}, x), we still need to actually sample a sentence such as "le détective a enquêté".
- Many strategies: greedy, beam, …
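The greedy strategy can be sketched as below; `next_token_probs` is a toy stand-in for a full decoder pass returning p(z_i | z_{<i}, x) over a hypothetical five-token vocabulary:

```python
import numpy as np

VOCAB = ["<eos>", "le", "détective", "a", "enquêté"]

def next_token_probs(prefix):
    # Toy rule standing in for p(z_i | z_{<i}, x): emit tokens in order, then <eos>.
    probs = np.full(len(VOCAB), 0.01)
    next_id = len(prefix) + 1 if len(prefix) < len(VOCAB) - 1 else 0
    probs[next_id] = 1.0
    return probs / probs.sum()

def greedy_decode(max_len=10):
    prefix = []
    for _ in range(max_len):              # one full model pass per generated token
        z_i = int(np.argmax(next_token_probs(prefix)))
        if VOCAB[z_i] == "<eos>":
            break
        prefix.append(z_i)
    return [VOCAB[i] for i in prefix]
```

One model pass per token is exactly why autoregressive decoding is slow; beam search keeps several candidate prefixes instead of the single greedy one.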
Masked Self-Attention
- This is regular self-attention as in the encoder, to process what's been decoded so far, e.g., z_1, z_2 in p(z_3|z_1, z_2, x), but with a trick…
- At training time: Masked self-attention.
- If we had to train on one single p(z_3|z_1, z_2, x) at a time: SLOW!
- Instead, train on all p(z_i|z_{1:i-1}, x) for all i simultaneously.
- How? In the attention weights for z_i, mask out all entries for positions > i (set their scores to -inf before the softmax, so their weights become 0).
- This way, each token only sees the already generated ones.
- At generation time, there is no such trick and we need to generate one z_i at a time.
- This is why autoregressive decoding is extremely slow.
Cross Attention
- Each decoded token can "look at" the encoder's output: \mathrm{Attn}(q = W_q x_{dec},\; k = W_k x_{enc},\; v = W_v x_{enc})
- This is the same as in the 2014 paper.
- This is where the conditioning on x in p(z_3|z_1, z_2, x) comes from; the attention is "cross" because it mixes x_{enc} and x_{dec}.
- Because self-attention is so widely used, people have started just calling it "attention".
- Hence, we now often need to explicitly call this "cross attention".
Output Layer
- Assume we have already generated K tokens, generate the next one.
- The decoder was used to gather all information necessary to predict a probability distribution for the next token (K), over the whole vocabulary.
- Simple: a linear projection of token K's representation, followed by SoftMax normalization.
Model Variations
- Decoder-only (GPT)
- Encoder-only (BERT)
- Enc-Dec (T5)
- The classic landscape: One architecture per "community".
- The Transformer's takeover: One community at a time.
NN Hyperparameters
- Regularization
- Loss function
- Dimensions
- Activation function
- Initialization
- Adagrad
- Dropout
- Mini-batch size
- Initial learning rate
- Learning rate schedule
- Momentum
- Stopping time
Ablations
- Ablation studies analyze the impact of different components on performance.
Tokenization of Different Modalities
- Tokenize different modalities each in their own way (some kind of "patching"), and send them all jointly into a Transformer.
- Seems to just work.
- Currently, an explosion of works is doing this!
- Anything you can tokenize, you can feed to a Transformer (ca. 2021 and onwards).