Transformer Architecture and Components
Transformer Architecture
- The transformer architecture includes:
- Input embeddings
- Output embeddings
- Positional encoding
- Nx blocks, each containing:
- Add & Norm layers
- Feed Forward networks
- Multi-Head Attention mechanisms (potentially masked)
- Linear layer
- Softmax layer for output probabilities
Transformer Components
Tokenization:
- Splitting input text into individual words or sub-word units.
- Example: "I ate an apple" becomes I, ate, an, apple.
- An end-of-sequence token <eos> is often appended.
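A minimal sketch of the step above, using a naive whitespace tokenizer (real models use sub-word schemes such as BPE; the function name and `<eos>` handling here are illustrative):

```python
def tokenize(text, eos="<eos>"):
    # naive whitespace split; sub-word tokenizers would break rare words further
    return text.split() + [eos]

print(tokenize("I ate an apple"))  # ['I', 'ate', 'an', 'apple', '<eos>']
```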
Input Embeddings:
- Generating vector representations for each token.
- Maps each token to a d-dimensional vector space (d_model).
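A toy version of this lookup, assuming a tiny hypothetical vocabulary and randomly initialized vectors (in a real model the table is learned during training):

```python
import random

d_model = 8
vocab = {"I": 0, "ate": 1, "an": 2, "apple": 3, "<eos>": 4}

random.seed(0)
# embedding table: one d_model-dimensional vector per vocabulary entry
embedding = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]

def embed(tokens):
    # map each token id to its d_model-dimensional vector
    return [embedding[vocab[t]] for t in tokens]

vectors = embed(["I", "ate", "an", "apple"])
print(len(vectors), len(vectors[0]))  # 4 tokens, each a d_model vector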
Position Encodings:
Adding information about the position of tokens in the sequence.
Requirements:
- Represent the sequential order (like seq2seq models).
- Unique for each position (not cyclic).
- Bounded values.
Possible Candidates:
Learnable Matrix: P(t + t’) = M_{t’} \times P(t)
- M should be a unitary matrix.
- Eigenvalues should have a magnitude of 1 (norm-preserving).
Rotary Position Embedding (RoFormer):
- f_{q,k}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W^{(11)}_{q,k} & W^{(12)}_{q,k} \\ W^{(21)}_{q,k} & W^{(22)}_{q,k} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix}
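The core of the formula above is the position-dependent 2-D rotation; a minimal sketch (ignoring the learned W matrices, with an arbitrary theta) showing that the rotation preserves vector norms, which is why attention scores end up depending only on relative positions:

```python
import math

def rope_2d(x, m, theta=1.0):
    # rotate a 2-d (sub)vector x by the position-dependent angle m * theta
    c, s = math.cos(m * theta), math.sin(m * theta)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

x = (1.0, 0.0)
rotated = rope_2d(x, m=2)
# rotation is norm-preserving (eigenvalues of magnitude 1, as required above)
```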
Sinusoidal Position Encodings:
- Must have the same dimensions as input embeddings.
- Must produce overall unique encodings.
- Formulas:
- PE(pos, 2i) = sin(pos / 10000^{2i / d_{model}})
- PE(pos, 2i+1) = cos(pos / 10000^{2i / d_{model}})
- Where:
- pos is the position/index of the token in the input sentence.
- i is the i^{th} dimension out of d_{model}.
- d_{model} is the embedding dimension of each token.
- Different calculations for odd and even embedding indices.
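A direct transcription of the two formulas above (sin for even indices 2i, cos for odd indices 2i+1), assuming d_model is even:

```python
import math

def positional_encoding(pos, d_model):
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # even embedding index 2i
        pe.append(math.cos(angle))  # odd embedding index 2i+1
    return pe

pe0 = positional_encoding(0, 8)
print(pe0)  # position 0: sin terms are 0, cos terms are 1
```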
Attention Mechanism
Formula:
- Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
- Where:
- Q = Query
- K = Key
- V = Value
- d_k = dimension of the Key.
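The formula can be sketched directly in pure Python, with Q, K, V as lists of row vectors (a teaching sketch; real implementations use batched matrix libraries):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# with sharply matching query/key pairs, each query retrieves "its" value
out = attention(Q=[[10.0, 0.0], [0.0, 10.0]],
                K=[[10.0, 0.0], [0.0, 10.0]],
                V=[[1.0, 0.0], [0.0, 1.0]])
```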
Query, Key, and Value Explained:
- Analogy to a Database:
- Database = {Key, Value store}
- Query = Request for information (e.g., "Order details of order_104")
- Key = Interacts directly with Queries, distinguishes objects, identifies relevance
- Value = Actual details of the object; more fine-grained.
Attention Process:
- Queries interact with Keys to find the most relevant values.
- All done in parallel.
Detailed Attention Calculation
- Given input sequence "I ate an apple" with token embeddings I1, I2, I3, I4,
- WQ, WK, WV are weight matrices for Query, Key, and Value projections.
- Q1 = WQ * I1, K1 = WK * I1, V1 = WV * I1 (and similarly for other tokens)
- Attention scores: e1,1, e1,2, e1,3 …
- Applying Softmax: α1,1, α1,2, α1,3 …
- Contextually rich embedding for token i: Zi = ∑_j (αi,j * Vj).
Self-Attention
Using the attention mechanism to relate different positions of a single input sequence.
Example:
- "The animal didn’t cross the street because it was too wide."
- Self-attention helps resolve coreferences (What does "it" refer to?).
Process:
- Query Inputs = Key Inputs = Value Inputs = Input Embeddings.
- WQ, WK, WV matrices are used to generate Query, Key, and Value projections.
- Complexity: O(T^2 * d_{model}). T = sequence length.
Multi-Head Attention
Running the attention mechanism multiple times in parallel (“multiple heads”).
Allows the model to capture different types of relationships.
- Coreference resolution
- Semantic relationships
- Part of speech
- Sentence boundaries
- Comparisons
- Context
Process:
- H parallel attention heads.
- Each head has its own WQi, WKi, WVi matrices.
- Output:
- Z = concat(Z1, Z2, …, Z_h)W^O
- Where d_h = d_{model} / h
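A simplified sketch of the process above: each head attends in a d_h = d_model / h subspace and the head outputs are concatenated back to d_model. For brevity the heads here just slice the input instead of applying learned W_Q^i, W_K^i, W_V^i projections, and the final W^O projection is omitted:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def attend(Q, K, V):
    # scaled dot-product attention over lists of row vectors
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

def multi_head(X, h):
    d_model = len(X[0])
    d_h = d_model // h
    heads = []
    for i in range(h):
        # head i works on its own d_h-dimensional slice of each position
        sl = [x[i * d_h:(i + 1) * d_h] for x in X]
        heads.append(attend(sl, sl, sl))
    # Z = concat(Z_1, ..., Z_h): back to d_model dimensions per position
    return [sum((head[t] for head in heads), []) for t in range(len(X))]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
Z = multi_head(X, h=2)
```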
Feed Forward Network
- Applied to each position separately and identically.
- Introduces non-linearity and learns complex relationships in the data.
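A position-wise sketch, assuming the common two-layer form FFN(x) = max(0, x W1 + b1) W2 + b2 with a ReLU non-linearity (toy weights; the same parameters are applied identically at every position):

```python
def ffn(x, W1, b1, W2, b2):
    # first projection + ReLU non-linearity
    h = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
         for j in range(len(b1))]
    # second projection back to the model dimension
    return [sum(hi * W2[i][j] for i, hi in enumerate(h)) + b2[j]
            for j in range(len(b2))]

# toy 1-dimensional example: positive inputs pass, negatives are clipped by ReLU
y_pos = ffn([2.0], W1=[[1.0]], b1=[0.0], W2=[[1.0]], b2=[0.0])
y_neg = ffn([-2.0], W1=[[1.0]], b1=[0.0], W2=[[1.0]], b2=[0.0])
```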
Add & Norm
Add (Residuals):
- Adding the input of a sub-layer to its output (residual connection).
- Helps avoid vanishing gradients.
- Enables training of deeper networks.
Norm (Normalization):
- Normalization of the layer outputs.
- Stabilizes training.
- Provides a regularization effect.
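The two steps above can be sketched together as LayerNorm(x + Sublayer(x)); this minimal version omits the learned scale/shift parameters that a full layer-norm implementation carries:

```python
import math

def add_and_norm(x, sublayer_out, eps=1e-5):
    # Add: residual connection
    added = [a + b for a, b in zip(x, sublayer_out)]
    # Norm: normalize to zero mean and unit variance across the feature dim
    mean = sum(added) / len(added)
    var = sum((v - mean) ** 2 for v in added) / len(added)
    return [(v - mean) / math.sqrt(var + eps) for v in added]

y = add_and_norm([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```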
Encoders
- Composed of multiple layers.
- Each layer consists of Multi-Head Attention and Feed Forward sub-layers, each surrounded by Add & Norm.
- The output of one encoder layer becomes the input to the next.
Decoders
- Used in sequence-to-sequence tasks like machine translation.
- Includes masked multi-head attention to prevent attending to future tokens.
- Also includes encoder-decoder attention to attend to the input sequence.
Masked Multi-Head Attention
- Used in the decoder to prevent attending to future tokens during training.
- Masking ensures that the prediction for a token only depends on the known tokens before it.
- Process:
- Attention mask M: 0 for allowed positions, -∞ for future positions.
- M is added to the scaled scores before the softmax: softmax(\frac{QK^T}{\sqrt{d_k}} + M)V
- Softmax maps the -∞ entries to attention weight 0.
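A sketch of the masking step, using a large negative constant in place of -∞ (a common practical choice) so masked positions get effectively zero weight after the softmax:

```python
import math

def softmax(xs):
    mx = max(xs)
    e = [math.exp(x - mx) for x in xs]
    z = sum(e)
    return [v / z for v in e]

# position 0 may only attend to itself: mask out future positions 1 and 2
scores = [2.0, 3.0, 1.0]          # raw QK^T / sqrt(d_k) scores (toy values)
mask = [0.0, -1e9, -1e9]          # 0 = allowed, -1e9 ≈ -inf = masked
w = softmax([s + m for s, m in zip(scores, mask)])
# all weight lands on the only unmasked position
```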
Encoder-Decoder Attention
- Allows the decoder to attend to the input sequence encoded by the encoder.
- Queries come from the previous decoder layer, while keys and values come from the encoder output.
- NOTE: Every decoder block receives the same FINAL encoder output
Linear and Softmax Layers
Linear Layer:
- A fully connected layer that projects the output of the decoder to the vocabulary size.
Softmax Layer:
- Converts the output of the linear layer into probabilities, indicating the likelihood of each token being the next token in the sequence.
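The two layers above combined in one small sketch: project a d_model-dimensional decoder output to vocabulary-size logits, then softmax to a probability distribution (all weights here are hypothetical toy values):

```python
import math

def logits_to_probs(h, W, b):
    # Linear: project decoder output h (d_model) to vocab-size logits
    logits = [sum(hi * W[i][j] for i, hi in enumerate(h)) + b[j]
              for j in range(len(b))]
    # Softmax: logits -> probabilities over the vocabulary
    mx = max(logits)
    e = [math.exp(l - mx) for l in logits]
    z = sum(e)
    return [v / z for v in e]

# d_model = 2, vocabulary of 3 tokens
probs = logits_to_probs([1.0, 0.0],
                        W=[[2.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]],
                        b=[0.0, 0.0, 0.0])
```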
Generative AI
- Overview: Generative AI uses machine learning to create new content across various modalities (text, images, audio, code, etc).
- Why now: innovations in ML and the cloud tech stack, coupled with the popularity of apps like ChatGPT and DALL-E 2.
- Business Impact: Can reduce the marginal cost of producing knowledge-intensive content.
- Generative AI can produce a wide range of outputs depending on the specific application and type of data that is needed.