Transformer Architecture and Components

Transformer Architecture

  • The transformer architecture includes:
    • Input embeddings
    • Output embeddings
    • Positional encoding
    • Nx blocks, each containing:
      • Add & Norm layers
      • Feed Forward networks
      • Multi-Head Attention mechanisms (potentially masked)
    • Linear layer
    • Softmax layer for output probabilities

Transformer Components

  • Tokenization:

    • Splitting input text into individual words or sub-word units.
    • Example: "I ate an apple" becomes I, ate, an, apple.
    • An end-of-sequence token <eos> is often appended.
  • Input Embeddings:

    • Generating vector representations for each token.
    • Maps each token to a d_{model}-dimensional vector space.
  • Position Encodings:

    • Adding information about the position of tokens in the sequence.

    • Requirements:

      • Represent the sequential order (like seq2seq models).
      • Unique for each position (not cyclic).
      • Bounded values.
    • Possible Candidates:

      • Learnable Matrix: P(t + t') = M_{t'} \times P(t)

        • M should be a unitary matrix.
        • Eigenvalues should have a magnitude of 1 (norm-preserving).
      • Rotary Position Embedding (RoFormer):

        • f_{q,k}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W^{(11)}_{q,k} & W^{(12)}_{q,k} \\ W^{(21)}_{q,k} & W^{(22)}_{q,k} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix}
      • Sinusoidal Position Encodings:

        • Must have the same dimensions as input embeddings.
        • Must produce overall unique encodings.
        • Formulas:
          • PE(pos, 2i) = sin(pos / 10000^{2i / d_{model}})
          • PE(pos, 2i+1) = cos(pos / 10000^{2i / d_{model}})
        • Where:
          • pos is the position/index of the token in the input sentence.
          • i is the i^{th} dimension out of d_{model}.
          • d_{model} is the embedding dimension of each token.
        • Different calculations for odd and even embedding indices.
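The sinusoidal formulas above can be sketched directly in NumPy; the sequence length and dimension below are illustrative:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal position encodings: sin on even dims (2i), cos on odd dims (2i+1)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1): token positions
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2): dimension pairs
    angles = pos / (10000 ** (2 * i / d_model))  # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # PE(pos, 2i)   = sin(...)
    pe[:, 1::2] = np.cos(angles)                 # PE(pos, 2i+1) = cos(...)
    return pe

pe = positional_encoding(50, 16)  # same dimensionality as the input embeddings
```

Note that the values are bounded in [-1, 1] and each position gets a unique encoding, satisfying the requirements listed above.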

Attention Mechanism

  • Formula:

    • Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V
    • Where:
      • Q = Query
      • K = Key
      • V = Value
      • d_k = dimension of the Key.
  • Query, Key, and Value Explained:

    • Analogy to a Database:
      • Database = {Key, Value store}
      • Query = Request for information (e.g., "Order details of order_104")
      • Key = Interacts directly with Queries, distinguishes objects, identifies relevance
      • Value = Actual details of the object; more fine-grained.
  • Attention Process:

    • Queries interact with Keys to find the most relevant values.
    • All done in parallel.
  • Detailed Attention Calculation

    • Given input sequence "I ate an apple" with token embeddings I1, I2, I3, I4.
    • WQ, WK, WV are weight matrices for the Query, Key, and Value projections.
    • Q1 = WQ * I1, K1 = WK * I1, V1 = WV * I1 (and similarly for the other tokens).
    • Attention scores: e1,1, e1,2, e1,3, … (scaled dot products of Q1 with each Kj).
    • Applying softmax yields weights α1,1, α1,2, α1,3, …, normalized by Z1 = ∑j exp(e1,j).
    • Contextually rich embedding: Z1' = ∑j (α1,j * Vj).
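The calculation above can be written as a minimal NumPy sketch; the toy sizes and random weights are illustrative, not from the source:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # e_{i,j}: raw attention scores
    weights = softmax(scores, axis=-1)   # alpha_{i,j}: each row sums to 1
    return weights @ V, weights          # contextual embeddings: sum_j alpha_{i,j} V_j

# Toy example: 4 tokens ("I", "ate", "an", "apple"), d_model = 8, d_k = d_v = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # token embeddings I1..I4
WQ, WK, WV = (rng.normal(size=(8, 4)) for _ in range(3))
Z, alpha = attention(X @ WQ, X @ WK, X @ WV)      # all tokens in parallel
```

All queries interact with all keys in one matrix multiply, which is what makes the parallelism possible.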

Self-Attention

  • Using the attention mechanism to relate different positions of a single input sequence.

  • Example:

    • "The animal didn’t cross the street because it was too wide."
    • Self-attention helps resolve coreferences (What does "it" refer to?).
  • Process:

    • Query Inputs = Key Inputs = Value Inputs = Input Embeddings.
    • WQ, WK, WV matrices are used to generate Query, Key, and Value projections.
    • Complexity: O(T^2 * d_{model}). T = sequence length.

Multi-Head Attention

  • Running the attention mechanism multiple times in parallel (“multiple heads”).

  • Allows the model to capture different types of relationships.

    • Coreference resolution
    • Semantic relationships
    • Part of speech
    • Sentence boundaries
    • Comparisons
    • Context
  • Process:

    • H parallel attention heads.
    • Each head has its own WQi, WKi, WVi matrices.
    • Output:
      • Z = concat(Z1, Z2, …, Z_h) W^O
      • Where d_h = d_{model} / h
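A minimal sketch of the multi-head process, assuming per-head projections of size d_h = d_model / h and random illustrative weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, WQ, WK, WV, WO):
    """h parallel heads, each with its own WQ_i, WK_i, WV_i; concat then project by W^O."""
    heads = []
    for wq, wk, wv in zip(WQ, WK, WV):
        Q, K, V = X @ wq, X @ wk, X @ wv
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        heads.append(weights @ V)                 # Z_i for head i
    return np.concatenate(heads, axis=-1) @ WO    # concat(Z_1, ..., Z_h) W^O

T, d_model, h = 5, 8, 2
d_h = d_model // h  # each head works in a smaller d_h-dimensional subspace
rng = np.random.default_rng(1)
X = rng.normal(size=(T, d_model))
WQ = [rng.normal(size=(d_model, d_h)) for _ in range(h)]
WK = [rng.normal(size=(d_model, d_h)) for _ in range(h)]
WV = [rng.normal(size=(d_model, d_h)) for _ in range(h)]
WO = rng.normal(size=(h * d_h, d_model))
Z = multi_head_attention(X, WQ, WK, WV, WO)       # back to (T, d_model)
```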

Feed Forward Network

  • Applied to each position separately and identically.
  • Introduces non-linearity and learns complex relationships in the data.
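A sketch of the position-wise FFN, using the ReLU form FFN(x) = max(0, xW1 + b1)W2 + b2 from the original Transformer paper; the sizes below are illustrative:

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise FFN: the same weights are applied to each position independently."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2  # ReLU introduces the non-linearity

d_model, d_ff = 8, 32  # the inner dimension d_ff is typically larger than d_model
rng = np.random.default_rng(2)
X = rng.normal(size=(5, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
Y = feed_forward(X, W1, b1, W2, b2)
```

"Separately and identically" means running the network on a single position in isolation gives the same result as running it on the whole sequence.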

Add & Norm

  • Add (Residuals):

    • Adding the input of a sub-layer to its output (residual connection).
    • Helps avoid vanishing gradients.
    • Enables training of deeper networks.
  • Norm (Normalization):

    • Normalization of the layer outputs.
    • Stabilizes training.
    • Provides a regularization effect.
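The two steps can be sketched as LayerNorm(x + Sublayer(x)); this minimal version omits the learnable scale and shift parameters for clarity:

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual add, then layer normalization over the feature dimension."""
    y = x + sublayer_out                 # Add: residual connection around the sub-layer
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)      # Norm: zero mean, unit variance per position

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 8))              # sub-layer input
out = add_and_norm(x, rng.normal(size=(4, 8)))  # stand-in for a sub-layer's output
```

Because the residual path passes the input through unchanged, gradients can flow directly past each sub-layer, which is what helps avoid vanishing gradients in deep stacks.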

Encoders

  • Composed of multiple layers.
  • Each layer consists of Multi-Head Attention and Feed Forward sub-layers, each surrounded by Add & Norm.
  • The output of one encoder layer becomes the input to the next.

Decoders

  • Used in sequence-to-sequence tasks like machine translation.
  • Includes masked multi-head attention to prevent attending to future tokens.
  • Also includes encoder-decoder attention to attend to the input sequence.

Masked Multi-Head Attention

  • Used in the decoder to prevent attending to future tokens during training.
  • Masking ensures that the prediction for a token only depends on the known tokens before it.
  • Process:
    • Attention mask M: 0 for visible positions, −∞ for future positions.
    • Z' = softmax(QK^T / √d_k + M) V (the mask is added to the scores before softmax).
    • Softmax maps the −∞ entries to 0, so masked tokens receive zero weight.
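The masking step can be sketched by adding −∞ above the diagonal of the score matrix, so each position attends only to itself and earlier tokens; the toy inputs are illustrative:

```python
import numpy as np

def causal_masked_attention(Q, K, V):
    """Decoder self-attention: -inf above the diagonal -> softmax gives 0 for future tokens."""
    T, d_k = Q.shape[0], K.shape[-1]
    M = np.triu(np.full((T, T), -np.inf), k=1)   # mask: -inf for j > i, 0 elsewhere
    scores = Q @ K.T / np.sqrt(d_k) + M          # add mask to scores before softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                           # exp(-inf) = 0 for masked positions
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(6)
Q, K, V = (rng.normal(size=(4, 3)) for _ in range(3))
Z, weights = causal_masked_attention(Q, K, V)  # weights are lower-triangular
```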

Encoder-Decoder Attention

  • Allows the decoder to attend to the input sequence encoded by the encoder.
  • Queries come from the previous decoder layer, while keys and values come from the encoder output.
  • NOTE: Every decoder block receives the same FINAL encoder output
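The asymmetry in where Q, K, and V come from can be sketched as follows; all sizes and weights are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_X, enc_out, WQ, WK, WV):
    """Queries from the decoder; keys and values from the final encoder output."""
    Q = dec_X @ WQ          # (T_dec, d_k): what the decoder is looking for
    K = enc_out @ WK        # (T_enc, d_k): how each source token identifies itself
    V = enc_out @ WV        # (T_enc, d_k): the source content to be mixed in
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (T_dec, T_enc)
    return weights @ V

T_enc, T_dec, d_model, d_k = 6, 4, 8, 4
rng = np.random.default_rng(4)
enc_out = rng.normal(size=(T_enc, d_model))  # same final encoder output for every decoder block
dec_X = rng.normal(size=(T_dec, d_model))    # output of the previous decoder sub-layer
WQ, WK, WV = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = cross_attention(dec_X, enc_out, WQ, WK, WV)
```

Note the attention matrix is rectangular (T_dec × T_enc): source and target sequences need not be the same length.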

Linear and Softmax Layers

  • Linear Layer:

    • A fully connected layer that projects the output of the decoder to the vocabulary size.
  • Softmax Layer:

    • Converts the output of the linear layer into probabilities, indicating the likelihood of each token being the next token in the sequence.
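These last two steps can be sketched together; the vocabulary size and weights below are illustrative:

```python
import numpy as np

vocab_size, d_model = 10, 8
rng = np.random.default_rng(5)
dec_out = rng.normal(size=(3, d_model))     # decoder output for 3 positions
W = rng.normal(size=(d_model, vocab_size))  # linear layer: project to vocabulary size

logits = dec_out @ W                                        # (3, vocab_size)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True)) # softmax over the vocabulary
probs /= probs.sum(axis=-1, keepdims=True)                  # each row is a distribution
next_token = probs[-1].argmax()             # greedy choice of the next token
```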

Generative AI

  • Overview: Generative AI uses machine learning to create new content across various modalities (text, images, audio, code, etc).
  • Why now: innovations in the ML and cloud tech stack, coupled with the popularity of apps like ChatGPT and DALL-E 2, have made this practical at scale.
  • Business Impact: Can reduce the marginal cost of producing knowledge-intensive content.
  • Generative AI can produce a wide range of outputs depending on the specific application and type of data that is needed.