Transformer Architecture and Components
Transformer Architecture
- The transformer architecture includes:
- Input embeddings
- Output embeddings
- Positional encoding
- Nx blocks, each containing:
- Add & Norm layers
- Feed Forward networks
- Multi-Head Attention mechanisms (potentially masked)
- Linear layer
- Softmax layer for output probabilities
Transformer Components
Tokenization:
- Splitting input text into individual words or sub-word units.
- Example: "I ate an apple" becomes I, ate, an, apple.
- An end-of-sequence token <eos> is often appended.
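A minimal sketch of the step above, using a naive whitespace tokenizer (real models use sub-word schemes such as BPE; the function name and `<eos>` handling here are illustrative):

```python
def tokenize(text, eos="<eos>"):
    # naive whitespace split; sub-word tokenizers would break rare words further
    return text.split() + [eos]

print(tokenize("I ate an apple"))  # ['I', 'ate', 'an', 'apple', '<eos>']
```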
Input Embeddings:
- Generating vector representations for each token.
- Maps each token to a d-dimensional vector space (d_model).
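A toy version of this lookup, assuming a tiny hypothetical vocabulary and randomly initialized vectors (in a real model the table is learned during training):

```python
import random

d_model = 8
vocab = {"I": 0, "ate": 1, "an": 2, "apple": 3, "<eos>": 4}

random.seed(0)
# embedding table: one d_model-dimensional vector per vocabulary entry
embedding = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]

def embed(tokens):
    # map each token id to its d_model-dimensional vector
    return [embedding[vocab[t]] for t in tokens]

vectors = embed(["I", "ate", "an", "apple"])
print(len(vectors), len(vectors[0]))  # 4 tokens, each a d_model vector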
Position Encodings:
Adding information about the position of tokens in the sequence.
Requirements:
- Represent the sequential order (like seq2seq models).
- Unique for each position (not cyclic).
- Bounded values.
Possible Candidates:
Learnable Matrix: P(t + t’) = M_{t’} \times P(t)
- M should be a unitary matrix.
- Eigenvalues should have a magnitude of 1 (norm-preserving).
Rotary Position Embedding (RoFormer):
- f_{q,k}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W^{(11)}_{q,k} & W^{(12)}_{q,k} \\ W^{(21)}_{q,k} & W^{(22)}_{q,k} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix}
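The core of the formula above is the position-dependent 2-D rotation; a minimal sketch (ignoring the learned W matrices, with an arbitrary theta) showing that the rotation preserves vector norms, which is why attention scores end up depending only on relative positions:

```python
import math

def rope_2d(x, m, theta=1.0):
    # rotate a 2-d (sub)vector x by the position-dependent angle m * theta
    c, s = math.cos(m * theta), math.sin(m * theta)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

x = (1.0, 0.0)
rotated = rope_2d(x, m=2)
# rotation is norm-preserving (eigenvalues of magnitude 1, as required above)
```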
Sinusoidal Position Encodings:
- Must have the same dimensions as input embeddings.
- Must produce overall unique encodings.
- Formulas:
- PE(pos, 2i) = sin(pos / 10000^{2i / d_{model}})
- PE(pos, 2i+1) = cos(pos / 10000^{2i / d_{model}})
- Where:
- pos is the position/index of the token in the input sentence.
- i is the i^{th} dimension out of d_{model}.
- d_{model} is the embedding dimension of each token.
- Different calculations for odd and even embedding indices.
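A direct transcription of the two formulas above (sin for even indices 2i, cos for odd indices 2i+1), assuming d_model is even:

```python
import math

def positional_encoding(pos, d_model):
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # even embedding index 2i
        pe.append(math.cos(angle))  # odd embedding index 2i+1
    return pe

pe0 = positional_encoding(0, 8)
print(pe0)  # position 0: sin terms are 0, cos terms are 1
```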
Attention Mechanism
Formula:
- Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
- Where:
- Q = Query
- K = Key
- V = Value
- d_k = dimension of the Key.
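The formula can be sketched directly in pure Python, with Q, K, V as lists of row vectors (a teaching sketch; real implementations use batched matrix libraries):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# with sharply matching query/key pairs, each query retrieves "its" value
out = attention(Q=[[10.0, 0.0], [0.0, 10.0]],
                K=[[10.0, 0.0], [0.0, 10.0]],
                V=[[1.0, 0.0], [0.0, 1.0]])
```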
Query, Key, and Value Explained:
- Analogy to a Database:
- Database = {Key, Value store}
- Query = Request for information (e.g., "Order details of order_104")
- Key = Interacts directly with Queries, distinguishes objects, identifies relevance
- Value = Actual details of the object; more fine-grained.
Attention Process:
- Queries interact with Keys to find the most relevant values.
- All done in parallel.
Detailed Attention Calculation
- Given input sequence "I ate an apple" with token embeddings I1, I2, I3, I4,
- WQ, WK, WV are weight matrices for Query, Key, and Value projections.
- Q1 = WQ * I1, K1 = WK * I1, V1 = WV * I1 (and similarly for other tokens)
- Attention scores: e1,1, e1,2, e1,3 …
- Applying Softmax: α1,1, α1,2, α1,3 …
- Contextually rich embedding for token i: Zi = ∑_j (αi,j * Vj).
Self-Attention
Using the attention mechanism to relate different positions of a single input sequence.
Example:
- "The animal didn’t cross the street because it was too wide."
- Self-attention helps resolve coreferences (What does "it" refer to?).
Process:
- Query Inputs = Key Inputs = Value Inputs = Input Embeddings.
- WQ, WK, WV matrices are used to generate Query, Key, and Value projections.
- Complexity: O(T^2 * d_{model}). T = sequence length.
Multi-Head Attention
Running the attention mechanism multiple times in parallel (“multiple heads”).
Allows the model to capture different types of relationships.
- Coreference resolution
- Semantic relationships
- Part of speech
- Sentence boundaries
- Comparisons
- Context
Process:
- H parallel attention heads.
- Each head has its own WQi, WKi, WVi matrices.
- Output:
- Z = concat(Z1, Z2, …, Z_h)W^O
- Where d_h = d_{model} / h
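A simplified sketch of the process above: each head attends in a d_h = d_model / h subspace and the head outputs are concatenated back to d_model. For brevity the heads here just slice the input instead of applying learned W_Q^i, W_K^i, W_V^i projections, and the final W^O projection is omitted:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def attend(Q, K, V):
    # scaled dot-product attention over lists of row vectors
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

def multi_head(X, h):
    d_model = len(X[0])
    d_h = d_model // h
    heads = []
    for i in range(h):
        # head i works on its own d_h-dimensional slice of each position
        sl = [x[i * d_h:(i + 1) * d_h] for x in X]
        heads.append(attend(sl, sl, sl))
    # Z = concat(Z_1, ..., Z_h): back to d_model dimensions per position
    return [sum((head[t] for head in heads), []) for t in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
Z = multi_head(X, h=2)
```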
Feed Forward Network
- Applied to each position separately and identically.
- Introduces non-linearity and learns complex relationships in the data.
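A position-wise sketch, assuming the common two-layer form FFN(x) = max(0, x W1 + b1) W2 + b2 with a ReLU non-linearity (toy weights; the same parameters are applied identically at every position):

```python
def ffn(x, W1, b1, W2, b2):
    # first projection + ReLU non-linearity
    h = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
         for j in range(len(b1))]
    # second projection back to the model dimension
    return [sum(hi * W2[i][j] for i, hi in enumerate(h)) + b2[j]
            for j in range(len(b2))]

# toy 1-dimensional example: positive inputs pass, negatives are clipped by ReLU
y_pos = ffn([2.0], W1=[[1.0]], b1=[0.0], W2=[[1.0]], b2=[0.0])
y_neg = ffn([-2.0], W1=[[1.0]], b1=[0.0], W2=[[1.0]], b2=[0.0])
```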
Add & Norm
Add (Residuals):
- Adding the input of a sub-layer to its output (residual connection).
- Helps avoid vanishing gradients.
- Enables training of deeper networks.
Norm (Normalization):
- Normalization of the layer outputs.
- Stabilizes training.
- Provides a regularization effect.
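The two steps above can be sketched together as LayerNorm(x + Sublayer(x)); this minimal version omits the learned scale/shift parameters that a full layer-norm implementation carries:

```python
import math

def add_and_norm(x, sublayer_out, eps=1e-5):
    # Add: residual connection
    added = [a + b for a, b in zip(x, sublayer_out)]
    # Norm: normalize to zero mean and unit variance across the feature dim
    mean = sum(added) / len(added)
    var = sum((v - mean) ** 2 for v in added) / len(added)
    return [(v - mean) / math.sqrt(var + eps) for v in added]

y = add_and_norm([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```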
Encoders
- Composed of multiple layers.
- Each layer consists of Multi-Head Attention and Feed Forward sub-layers, each surrounded by Add & Norm.
- The output of one encoder layer becomes the input to the next.
Decoders
- Used in sequence-to-sequence tasks like machine translation.
- Includes masked multi-head attention to prevent attending to future tokens.
- Also includes encoder-decoder attention to attend to the input sequence.
Masked Multi-Head Attention
- Used in the decoder to prevent attending to future tokens during training.
- Masking ensures that the prediction for a token only depends on the known tokens before it.
- Process:
- Attention mask M: 0 for allowed positions, -∞ for future positions.
- M is added to the scaled scores before the softmax: softmax(\frac{QK^T}{\sqrt{d_k}} + M)V
- Softmax maps the -∞ entries to attention weight 0.
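A sketch of the masking step, using a large negative constant in place of -∞ (a common practical choice) so masked positions get effectively zero weight after the softmax:

```python
import math

def softmax(xs):
    mx = max(xs)
    e = [math.exp(x - mx) for x in xs]
    z = sum(e)
    return [v / z for v in e]

# position 0 may only attend to itself: mask out future positions 1 and 2
scores = [2.0, 3.0, 1.0]          # raw QK^T / sqrt(d_k) scores (toy values)
mask = [0.0, -1e9, -1e9]          # 0 = allowed, -1e9 ≈ -inf = masked
w = softmax([s + m for s, m in zip(scores, mask)])
# all weight lands on the only unmasked position
```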
Encoder-Decoder Attention
- Allows the decoder to attend to the input sequence encoded by the encoder.
- Queries come from the previous decoder layer, while keys and values come from the encoder output.
- NOTE: Every decoder block receives the same FINAL encoder output
Linear and Softmax Layers
Linear Layer:
- A fully connected layer that projects the output of the decoder to the vocabulary size.
Softmax Layer:
- Converts the output of the linear layer into probabilities, indicating the likelihood of each token being the next token in the sequence.
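The two layers above combined in one small sketch: project a d_model-dimensional decoder output to vocabulary-size logits, then softmax to a probability distribution (all weights here are hypothetical toy values):

```python
import math

def logits_to_probs(h, W, b):
    # Linear: project decoder output h (d_model) to vocab-size logits
    logits = [sum(hi * W[i][j] for i, hi in enumerate(h)) + b[j]
              for j in range(len(b))]
    # Softmax: logits -> probabilities over the vocabulary
    mx = max(logits)
    e = [math.exp(l - mx) for l in logits]
    z = sum(e)
    return [v / z for v in e]

# d_model = 2, vocabulary of 3 tokens
probs = logits_to_probs([1.0, 0.0],
                        W=[[2.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]],
                        b=[0.0, 0.0, 0.0])
```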
Generative AI
- Overview: Generative AI uses machine learning to create new content across various modalities (text, images, audio, code, etc).
- Why now: innovations in ML and the cloud tech stack, coupled with the popularity of apps like ChatGPT and DALL-E 2.
- Business Impact: Can reduce the marginal cost of producing knowledge-intensive content.
- Generative AI can produce a wide range of outputs depending on the specific application and type of data that is needed.