Natural Language Processing – Transformers & Attention

Instructor & Course Context

Andrew Castillo (PhD Mathematics, Ohio State; BS & MS Mathematics, Texas A&M) is a practicing data scientist at CoverMyMeds/McKesson. His daily work centers on NLP, unsupervised techniques, statistical modeling of question-set behaviour, and anomaly detection.

Optimizing Your Learning Experience

The program mixes Sunday live classes with Thursday review/feedback sessions. Students are encouraged to: actively interact, ask clarifying questions, complete assignments/MCQs, and leverage all provided resources—"what you put in is what you get out."

NLP Road-Map in the IK Series

NLP 1: Word Embeddings
NLP 2: Seq2Seq Modelling
NLP 3: Transformers → BERT

Session-Specific Learning Objectives

  1. Build intuition for Attention.

  2. Understand Transformer architecture and its importance across ML domains.

  3. Dissect key components: Encoder, Multi-head Self-Attention (QKV, Positional Encoding), Decoder, and Masked Multi-head Attention.

Motivation: Why Another Model?

Seq2Seq RNNs/LSTMs compress an entire input sequence into a single fixed-length vector, leading to information bottlenecks—especially with long sentences. Attention was proposed to alleviate this by allowing the decoder to "peek" at every encoder hidden state.

Transformers, introduced in the 2017 paper "Attention Is All You Need," discard the RNN entirely, relying solely on self-attention. They have since become the backbone of Large Language Models (LLMs) and have impacted vision, audio, and multi-modal tasks as well.

Quick Recap: Traditional Seq2Seq Translation

Example pipeline: English "nice to meet you" → French "ravi de vous rencontrer". Encoder LSTMs map the whole source sentence to a context vector h_T and pass it to the decoder LSTMs. Limitations: sequential computation, long-range dependency degradation, and difficult parallelization.

Conceptual Definition of Attention

Borrowed from cognitive psychology: "concentration of awareness on some phenomenon to the exclusion of other stimuli." In transformers, attention is a learned mechanism that, for every output token, allocates continuous weights over all input tokens, letting the model focus adaptively on relevant context.

Illustrative sentence: "The cat sat on the mat because it was soft." When interpreting "it," the model’s relevance score leans heavily toward "the mat," producing higher weights for that segment.

Mathematical Formulation of Attention

Given a query matrix Q and a key matrix K (each of shape [l, d_k]) and a value matrix V (of shape [l, d_v], often with d_v = d_k) for a sequence length l: Attention(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V
Softmax is row-wise, so each row’s weights sum to 1.

For a length-9 sentence with d_k = 64, the dot-product matrix is 9 \times 9; softmax ensures probabilistic weighting.
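The formula above can be sketched in pure Python, with nested lists standing in for matrices (a real implementation would use a tensor library; the toy Q, K, V values here are illustrative only):

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, d_k):
    # Q, K: [l, d_k]; V: [l, d_v]. Returns a [l, d_v] matrix.
    scores = [[sum(q * k for q, k in zip(qrow, krow)) / math.sqrt(d_k)
               for krow in K] for qrow in Q]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return [[sum(w * v[j] for w, v in zip(wrow, V))
             for j in range(len(V[0]))] for wrow in weights]

# Toy example: l = 2 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V, d_k=2)
```

Because softmax rows sum to 1, each output row is a convex combination of the value vectors, i.e. a weighted average over the sequence.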

Self-Attention in Action

Every token produces its own Query, Key, and Value vectors through learned linear projections W_Q, W_K, W_V applied to the original embedding. Example problem: resolving pronoun reference in "She poured water into the glass because she was thirsty": the second "she" (query) must attend strongly to the earlier "She" and to "thirsty."

Multi-Head Self-Attention (MHSA)

Transformers use several independent attention heads (e.g., h=8). Each head learns distinct representation sub-spaces, enabling the model to capture multiple types of relationships simultaneously (e.g., syntactic vs. semantic, positional vs. coreferential). Concatenated head outputs are linearly transformed to the original embedding dimension.
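A shape-only sketch of the split/concatenate bookkeeping, using the common d_model = 512, h = 8 configuration (real implementations apply learned per-head projection matrices rather than slicing; the dummy token values are illustrative):

```python
# Hypothetical dimensions: d_model = 512 split across h = 8 heads.
d_model, h = 512, 8
d_k = d_model // h                      # 64 dimensions per head

token = [0.1] * d_model                 # one token's embedding (dummy values)
# Split into h head-sized chunks (stand-in for per-head projections).
heads = [token[i * d_k:(i + 1) * d_k] for i in range(h)]
# After each head attends independently, outputs are concatenated back
# to the original embedding dimension (then linearly transformed).
concat = [x for head in heads for x in head]
```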

Positional Encoding

Because transformers lack recurrence, they inject order information via deterministic sine/cosine positional vectors:
PE(pos, 2i) = \sin\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right), \quad PE(pos, 2i+1) = \cos\left( \frac{pos}{10000^{2i/d_{\text{model}}}} \right)
Properties: smooth variation encodes relative distance, multi-frequency captures patterns at different scales, periodicity generalizes to longer sequences, and bounded values avoid numerical instability.
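A small sketch of the sinusoidal encoding for a single position (the loop index steps over even dimensions, so it plays the role of 2i in the formulas above):

```python
import math

def positional_encoding(pos, d_model):
    # Sinusoidal PE for one position: even indices use sin, odd use cos.
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        if i + 1 < d_model:
            pe[i + 1] = math.cos(angle)
    return pe

pe0 = positional_encoding(0, 8)  # position 0: alternating sin(0)=0, cos(0)=1
```

Every entry stays in [-1, 1], which is the bounded-values property noted above.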

Input Tensor Shape

Transformer consumes a 3-D tensor [b, l, e] where b = batch size, l = sequence length, e = embedding dimension.

Encoder Architecture (per layer)

  1. Multi-Head Self-Attention + residual connection.

  2. Layer Normalization.

  3. Position-wise Feed-Forward Network (two linear layers with ReLU/GeLU) + residual, followed by a second Layer Normalization.
    LayerNorm normalizes across the feature dimension per token; unlike BatchNorm, it is independent of batch statistics, making it well suited to sequence processing.
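A minimal sketch of the per-token normalization step (the learned gain and bias parameters of full LayerNorm are omitted for brevity):

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one token's feature vector to zero mean, unit variance.
    # eps guards against division by zero for constant vectors.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
```

Note that only the token's own features enter the statistics, which is why no batch information is needed.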

Decoder Architecture (per layer)

  1. Masked Multi-Head Self-Attention (prevents future token leakage).

  2. Encoder-Decoder (cross) Attention—Queries from decoder, Keys & Values from encoder’s final output.

  3. Feed-Forward + residual + LayerNorm.

Masking Details

During training, teacher forcing feeds the entire ground-truth target sentence but applies a triangular causal mask so that position t cannot attend to positions > t. During inference, tokens are generated autoregressively: the previously predicted token is appended at each step. Padding masks in both the encoder and decoder zero out attention to padding tokens.
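The triangular causal mask can be built as a lower-triangular boolean matrix:

```python
def causal_mask(l):
    # mask[t][s] is True when position t may attend to position s (s <= t).
    # In practice, False entries are set to -inf before the softmax.
    return [[s <= t for s in range(l)] for t in range(l)]

mask = causal_mask(4)
```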

Training vs. Inference Workflows

Training: complete source goes through stacked encoders (≈6). Teacher-forced full target passes through stacked decoders (≈6) in one forward pass, enabling parallelism.
Inference: encoder still processes full source in parallel, but decoder runs step-by-step, re-feeding its own outputs because future targets are unknown.
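The step-by-step inference loop can be sketched as greedy decoding; `decoder_step` is a hypothetical stand-in for one decoder forward pass returning next-token logits:

```python
def greedy_decode(decoder_step, bos_id, eos_id, max_len=20):
    # decoder_step(tokens) -> logits over the vocabulary for the next token.
    tokens = [bos_id]
    for _ in range(max_len):
        logits = decoder_step(tokens)
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # argmax
        tokens.append(next_id)           # re-feed own output at the next step
        if next_id == eos_id:
            break
    return tokens

# Dummy model: always predicts token 2 (here the EOS id), so decoding stops
# after one step.
out = greedy_decode(lambda toks: [0.0, 0.1, 0.9], bos_id=1, eos_id=2)
```

Real systems often replace the argmax with beam search or sampling, but the sequential re-feeding structure is the same.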

Output Projection & Loss

Decoder’s final hidden states are linearly mapped to vocabulary logits, then softmaxed to probabilities. Predicted token = \arg\max probability. Training minimizes categorical cross-entropy:
\mathcal{L}_{CE} = -\sum_{i=1}^{n} t_i \log(p_i), where t_i is the true (one-hot) probability of token i and p_i its predicted probability.
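For a one-hot target, the sum collapses to the negative log-probability of the correct token:

```python
import math

def cross_entropy(true_dist, pred_probs):
    # L_CE = -sum_i t_i * log(p_i); t is typically one-hot over the vocab.
    return -sum(t * math.log(p) for t, p in zip(true_dist, pred_probs) if t > 0)

# Correct token is index 1; model assigns it probability 0.5 -> loss = -log(0.5).
loss = cross_entropy([0, 1, 0], [0.2, 0.5, 0.3])
```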

Advantages of Transformer

• Parallelizable sequence processing → faster training on GPUs (and parallel encoding at inference).
• Explicit modelling of long-range dependencies without vanishing gradients.
• Scalable model capacity; depth and width can be increased.
• Handles variable-length sequences via positional encoding and masking.

Limitations & Practical Challenges

• Quadratic O(l^2) memory/computation in sequence length due to full attention matrix.
• Data-hungry; pre-training often requires billions of tokens.
• Risk of over-fitting smaller tasks; requires careful regularization.
• Many hyper-parameters (heads, layers, d_{model}, dropout) require tuning.

Evolution Beyond the Original Transformer

The architecture underpins the modern LLM tree—from GPT-1 → GPT-4, BERT, T5, PaLM, LLaMA, OPT, BLOOM, Chinchilla, UL2, Claude, Bard, etc. Variants apply encoder-only, decoder-only, or encoder-decoder patterns; introduce sparse/expert routing, retrieval, or modality fusion.

Key Take-Away Summary

Self-attention computes token-to-token correlations. Transformers replace RNNs by stacking self-attention + feed-forward blocks with residual and LayerNorm. Encoders read the entire source; decoders generate targets with causal masking and cross-attention. Architectural parallelism plus attention flexibility make transformers the default backbone for state-of-the-art language, vision, and multi-modal systems.
