Transformers: Attention Is All You Need
Natural Language Processing
Learning Representations of Variable Length Data
- Neural Machine Translation (NMT) uses a single neural network for machine translation in an end-to-end manner.
- Sequence-to-sequence architecture is used for NMT.
- RNNs are commonly used for learning variable-length representations and are a natural fit for sentences.
- LSTMs, GRUs, and their variants are prevalent in recurrent models and are at the core of seq2seq models with attention.
- However, RNNs have limitations:
- Sequential processing prohibits parallelization within instances.
- Long-range dependencies can be tricky despite gating mechanisms.
Attention
- Attention between encoder and decoder is crucial in NMT.
- Attention mechanisms can be used for representation learning.
Motivation
- Design a neural network to encode and process text where words "attend to" other relevant words.
- Objectives:
- Establish connections between words.
- Determine the strength of connections based on the words themselves.
Self-Attention
- Self-attention learns dependencies between words in a sentence to capture its internal structure.
- High-Level Explanation:
- Considers an example sentence: "The animal didn't cross the street because it was too tired."
- Determines what "it" refers to in the sentence.
- Self-attention allows the model to look at other positions in the input sequence for clues to better encode a word as it processes each word/position.
Three Ways of Attention
- Encoder Self-Attention
- Encoder-Decoder Attention
- Masked Decoder Self-Attention
- The Transformer model architecture includes:
- Encoders
- Decoders
- Self-attention mechanism
- Cross-attention mechanism
Sequence-to-sequence with Attention
- Attention scores and distribution are used to take a weighted sum of the encoder hidden states.
- The attention output contains information from hidden states that received high attention.
Self-Attention in Detail
- The first step involves creating three vectors from each word embedding at the encoder side:
- Query vector
- Key vector
- Value vector
- These vectors are created by multiplying the embedding by three trained matrices.
- Query, key, and value vectors are abstractions used for calculating and thinking about attention.
- Query: What the token is seeking (information needs).
- Key: The information the token holds (relevance to other tokens' queries).
- Value: The actual content the token shares if its key is found relevant.
- Analogy: Searching for a topic in a library:
- Query: Having a topic in mind.
- Keys: Checking titles and keywords of books.
- Values: Retrieving the books that "match".
- The second step involves calculating a score that determines how much focus to place on other parts of the input sentence while encoding a word at a certain position.
- The score is the dot product of the query vector with the key vector of the respective word being scored.
- For example, when processing self-attention for the word in position 1 (Thinking), the first score is the dot product of q1 and k1, and the second score is the dot product of q1 and k2, and so on.
- The third and fourth steps involve:
- Dividing the scores by the square root of the dimension of the key vectors for training stability.
- Passing the result through a softmax operation to normalize the scores so they’re all positive and add up to 1.
- The fifth and sixth steps involve:
- Multiplying each value vector by the softmax score.
- Summing up the weighted value vectors to produce the output of the self-attention layer at that position.
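The six steps above can be sketched in NumPy for a toy two-token input; all dimensions and weight values here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3

X = rng.normal(size=(2, d_model))        # one embedding per token
W_Q = rng.normal(size=(d_model, d_k))    # three trained projection matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # step 1: query/key/value vectors

scores = Q @ K.T                         # step 2: q_i . k_j for every pair
scores = scores / np.sqrt(d_k)           # step 3: scale for training stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # step 4: softmax per row

Z = weights @ V                          # steps 5-6: weighted sum of values
```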
Scaled Dot-Product Attention
- Attention formula: Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
- Each vector has three representations:
- Query: Asking for information.
- Key: Saying that it has some information.
- Value: Giving the information.
- These matrices allow different aspects of the x vectors to be used/emphasized in each of the three roles.
- Attention matches each query against all keys; positions whose keys best match the query receive high weight, so their values dominate the output.
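A minimal implementation of the formula, assuming Q, K, V are given as (tokens × d_k) matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V
```

With all-zero queries and keys the softmax is uniform, so the output is just the mean of the value rows.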
Multi-Head Attention
- Multi-headed attention refines the self-attention layer by adding a mechanism that:
- Expands the model’s ability to focus on different positions.
- Gives the attention layer multiple “representation subspaces”.
- It works by performing the same self-attention calculation multiple times with different weight matrices, resulting in different output matrices.
- One attention head may focus on "the animal" while another focuses on "tired" for the word "it".
Multi-Head Attention Steps
- Input sentence.
- Embed each word.
- Split into multiple heads (e.g., 8 heads) and multiply X (the embeddings, in the first layer) or R (the output of the previous layer) with per-head weight matrices.
- Calculate attention using the resulting Q/K/V matrices.
- Concatenate the resulting Z matrices, then multiply with weight matrix W^O to produce the output of the layer.
- With tokens as the columns of X:
- Queries: Q
- Keys: K
- Values: V
- Self-attention output: Sa[X] = V \cdot Softmax[K^T Q]
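The head-splitting steps above, sketched with hypothetical dimensions (2 heads over an 8-dimensional model, random placeholder weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    n, d_model = X.shape
    d_k = d_model // n_heads
    heads = []
    for h in range(n_heads):                 # independent Q/K/V projections per head
        sl = slice(h * d_k, (h + 1) * d_k)
        Q, K, V = X @ W_Q[:, sl], X @ W_K[:, sl], X @ W_V[:, sl]
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    Z = np.concatenate(heads, axis=-1)       # concatenate the per-head Z matrices
    return Z @ W_O                           # final mixing projection W^O

rng = np.random.default_rng(1)
d_model, n_heads = 8, 2
X = rng.normal(size=(3, d_model))
Ws = [rng.normal(size=(d_model, d_model)) for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads)
```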
Multi-Head Self-Attention
- In parallel:
- Input: X
- Queries
- Keys
- Values
- Scaled Dot-Product Attention
- Concatenate and transform: MhSa[X] = \Omega_c [\mathrm{Sa}_1[X]; \mathrm{Sa}_2[X]]
Cross-Attention
- In cross-attention, the queries come from one source (the decoder) while the keys and values come from another (the encoder).
Cross-Attention/Encoder-Decoder Attention
- Decoder:
- Input: X_d
- Queries: Q = \beta_q 1^T + \Omega_q X_d
- Encoder:
- Input: X_e
- Keys: K = \beta_k 1^T + \Omega_k X_e
- Values: V = \beta_v 1^T + \Omega_v X_e
- Cross-attention output: V \cdot Softmax[K^T Q]
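A sketch of cross-attention with made-up dimensions: queries are projected from decoder states X_d, keys and values from encoder outputs X_e (bias terms omitted for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
d = 4
X_e = rng.normal(size=(5, d))   # encoder output: 5 source tokens
X_d = rng.normal(size=(3, d))   # decoder states: 3 target tokens

Omega_q, Omega_k, Omega_v = (rng.normal(size=(d, d)) for _ in range(3))

Q = X_d @ Omega_q               # queries come from the decoder
K = X_e @ Omega_k               # keys and values come from the encoder
V = X_e @ Omega_v

out = softmax(Q @ K.T / np.sqrt(d)) @ V  # one output row per decoder token
```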
Masked Multi-Head Attention
- Decoder has different self-attention => Masked self-attention.
- We generate one token at a time.
- During generation, we don't know which tokens we'll generate in the future.
- To enable parallelization we forbid the decoder to look ahead.
- Future positions are masked out (their scores are set to -inf) before the softmax step in the self-attention calculation, so their attention weights become zero.
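The masking trick can be sketched as below: scores above the diagonal are set to -inf so the softmax assigns them zero weight (toy random Q/K/V):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
n, d_k = 4, 4
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

scores = Q @ K.T / np.sqrt(d_k)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf           # future positions get -inf before the softmax...
weights = softmax(scores)        # ...so their weight comes out exactly 0

out = weights @ V
```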
Attention is Cheap!
- FLOPs (floating-point operations) measure computational cost. Per layer, self-attention costs O(n^2 \cdot d) while a recurrent layer costs O(n \cdot d^2), so self-attention is cheaper whenever the sequence length n is smaller than the representation dimension d.
Representing The Order of The Sequence
- Self-attention is equivariant to permuting word order.
- Word order is important in language.
Using Positional Encoding
- Positional encoding gives the advantage of being able to scale to unseen lengths of sequences.
- The Transformer adds a vector to each input embedding.
- These vectors follow a specific pattern that the model learns, which helps it determine the position of each word or the distance between different words in the sequence.
- Ideally, the following criteria should be satisfied:
- It should output a unique encoding for each time-step (word’s position in a sentence).
- Distance between any two time-steps should be consistent across sentences with different lengths.
- Our model should generalize to longer sentences without any effort. Its values should be bounded.
- It must be deterministic.
- Sine and cosine functions of different frequencies are used.
- Positional encoding steps:
- Word embeddings + positional encoding = embedding value.
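The sinusoidal scheme can be sketched as follows; the resulting vectors are bounded, deterministic, and extend to any sequence length:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 16)
# The encoding is simply added to the input embeddings: X = embeddings + pe[:len(X)]
```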
- Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- "Full package" model:
- Attention only; all RNN components removed
- Positional encodings
- Residual network (ResNet) structure
- Interspersing of Attention and MLP
- LayerNorms
- Multiple heads of attention in parallel
- Well-chosen hyperparameters (e.g., feed-forward hidden size 4 × d_model, the same width throughout the network)
- The transformer has changed remarkably little to this day.
- The only consistent change is the "pre-norm" formulation, reshuffling the LayerNorms
- Encoder: Task is to read and “understand” the user’s input.
- Decoder: Task is to generate the output (e.g., answer the user’s query).
- Input: Tokenization and Embedding, Positional Encoding.
- Input text is split into pieces which can be characters, words, or "tokens".
- Example: "The detective investigated" -> [The] [detective] [invest] [igat] [ed_].
- Tokens are indices into the "vocabulary": [The] [detective] [invest] [igat] [ed_] -> [3 721 68 1337 42].
- Each vocabulary entry corresponds to a learned dmodel-dimensional vector [3 721 68 1337 42] -> [ [0.123, -5.234, …], […], […], […], […] ].
- The embedding table is a matrix of size vocab_size (~32k) × d_model.
- Attention is permutation invariant, but language is not.
- Need to encode the position of each word; just add something.
Multi-Headed Self-Attention
- The input sequence is used to create queries, keys, and values.
- Each token can "look around" the whole input and decide how to update its representation based on what it sees.
Point-wise MLP
- A simple MLP applied to each token individually: z_i = W_2\,\mathrm{GeLU}(W_1 x_i + b_1) + b_2
- Think of it as each token pondering for itself about what it has observed previously.
- There's some weak evidence this is where "world knowledge" is stored, too.
- It contains the bulk of the parameters.
- When people make giant models sparse (mixture-of-experts), this is the part that becomes giant.
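A sketch of the point-wise MLP with the conventional 4× hidden width; the tanh approximation of GeLU is used here, and all weights are random placeholders:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def pointwise_mlp(X, W1, b1, W2, b2):
    """Applied to every token independently: z_i = W2 GeLU(W1 x_i + b1) + b2."""
    return gelu(X @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(4)
d_model = 8
d_ff = 4 * d_model   # the conventional hidden width: 4x d_model
X = rng.normal(size=(3, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = pointwise_mlp(X, W1, b1, W2, b2)
```

Because the MLP sees one token at a time, applying it to a single row gives the same result as that row of the batched output.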
Residual/Skip Connections
- Each module's output has the exact same shape as its input.
- Following ResNets, the module computes a "residual" instead of a new value: z_i = \mathrm{Module}(x_i) + x_i
- This was shown to dramatically improve trainability.
LayerNorm
- Normalization also dramatically improves trainability.
- There's post-norm (original) and pre-norm (modern).
- Post-norm: z_i = \mathrm{LN}(\mathrm{Module}(x_i) + x_i)
- Pre-norm: z_i = \mathrm{Module}(\mathrm{LN}(x_i)) + x_i
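Both formulations, sketched with a random linear map standing in for the attention/MLP module (learnable LayerNorm gain/bias omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_norm_block(x, module):
    return layer_norm(module(x) + x)   # original: normalize after the residual add

def pre_norm_block(x, module):
    return module(layer_norm(x)) + x   # modern: normalize only the module input

rng = np.random.default_rng(5)
W = rng.normal(size=(8, 8)) * 0.1
module = lambda h: h @ W               # stand-in for an attention or MLP module
x = rng.normal(size=(3, 8))
post = post_norm_block(x, module)
pre = pre_norm_block(x, module)
```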
Encoding / Encoder
- Since input and output shapes are identical, we can stack N such blocks.
- Typically, N=6 ("base"), N=12 ("large") or more.
- Encoder output is a "heavily processed" (think: "high level, contextualized") version of the input tokens, i.e., a sequence.
- This has nothing to do with the requested output yet (think: translation).
- That comes with the decoder.
Decoding / the Decoder
- Alternatively: Generating / the Generator
- What we want to model: p(z|x)
- For example, in translation: p(z | \text{"the detective investigated"}) \forall z
- Seems impossible at first, but we can exactly decompose it into tokens:
- p(z|x) = p(z_1|x)\, p(z_2|z_1, x)\, p(z_3|z_1, z_2, x) \cdots
- Meaning, we can compute the likelihood of a given output z, or generate/sample an answer z one token at a time.
- Each p is a full pass through the model.
- For generating p(z_3|z_1, z_2, x):
- x comes from the encoder.
- z_1, z_2 is what we have predicted so far; it goes into the decoder.
- Once we have p(z_i|z_{<i}, x), we still need to actually sample a sentence such as "le détective a enquêté".
- Many strategies: greedy, beam, …
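The greedy strategy can be sketched as below; `next_token_probs` is a toy stand-in for a full decoder pass returning p(z_i | z_{<i}, x) over a hypothetical five-token vocabulary:

```python
import numpy as np

VOCAB = ["<eos>", "le", "détective", "a", "enquêté"]

def next_token_probs(prefix):
    # Toy rule standing in for p(z_i | z_{<i}, x): emit tokens in order, then <eos>.
    probs = np.full(len(VOCAB), 0.01)
    next_id = len(prefix) + 1 if len(prefix) < len(VOCAB) - 1 else 0
    probs[next_id] = 1.0
    return probs / probs.sum()

def greedy_decode(max_len=10):
    prefix = []
    for _ in range(max_len):              # one full model pass per generated token
        z_i = int(np.argmax(next_token_probs(prefix)))
        if VOCAB[z_i] == "<eos>":
            break
        prefix.append(z_i)
    return [VOCAB[i] for i in prefix]
```

One model pass per token is exactly why autoregressive decoding is slow; beam search keeps several candidate prefixes instead of the single greedy one.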
Masked Self-Attention
- This is regular self-attention as in the encoder, to process what's been decoded so far, e.g., z_1, z_2 in p(z_3|z_1, z_2, x), but with a trick…
- At training time: Masked self-attention.
- If we had to train on one single p(z_3|z_1, z_2, x) at a time: SLOW!
- Instead, train on all p(z_i|z_{1:i-1}, x) for all i simultaneously.
- How? In the attention weights for z_i, mask out all entries for positions > i (set their scores to -inf before the softmax, so their weights become 0).
- This way, each token only sees the already generated ones.
- At generation time, there is no such trick and we need to generate one z_i at a time.
- This is why autoregressive decoding is extremely slow.
Cross Attention
- Each decoded token can "look at" the encoder's output: \mathrm{Attn}(q = W_q x_{dec},\; k = W_k x_{enc},\; v = W_v x_{enc})
- This is the same as in the 2014 paper.
- This is where the conditioning on x in p(z_3|z_1, z_2, x) comes from; the attention is "cross" because it mixes x_{enc} and x_{dec}.
- Because self-attention is so widely used, people have started just calling it "attention".
- Hence, we now often need to explicitly call this "cross attention".
Output Layer
- Assume we have already generated K tokens, generate the next one.
- The decoder was used to gather all information necessary to predict a probability distribution for the next token (K), over the whole vocabulary.
- Simple: a linear projection of token K's representation, followed by SoftMax normalization.
Model Variations
- Decoder-only (GPT)
- Encoder-only (BERT)
- Enc-Dec (T5)
- The classic landscape: One architecture per "community".
- The Transformer's takeover: One community at a time.
NN Hyperparameters
- Regularization
- Loss function
- Dimensions
- Activation function
- Initialization
- Adagrad
- Dropout
- Mini-batch size
- Initial learning rate
- Learning rate schedule
- Momentum
- Stopping time
Ablations
- Ablation studies analyze the impact of different components on performance.
Tokenization of Different Modalities
- Tokenize different modalities each in their own way (some kind of "patching"), and send them all jointly into a Transformer.
- Seems to just work.
- Currently, an explosion of works is doing this!
- Anything you can tokenize, you can feed to a Transformer (ca. 2021 and onwards).