7. Advanced Sequence Learning with Transformers


Last updated 4:20 PM on 5/29/26


48 Terms

New cards

What is the primary limitation of static/shallow word embeddings (e.g., Word2Vec, GloVe)? Give an example.

They are entirely context-free, assigning the exact same static vector representation to a word regardless of its surrounding context. For example, the word "bank" receives the same mathematical vector in both "bank account" (finance) and "bank of a river" (geography).

New cards

Explain the core architectural differences between Recurrence, Shallow Embeddings, and 1D CNNs for sequence processing.

Recurrence: Tracks temporal dependencies step-by-step over time using a dynamic, hidden context layer.

Shallow Embedding: Applies a static vector transformation matrix ( $W_e$ ) to project discrete vocabulary tokens into dense continuous vectors.
1D CNN: Slides a fixed-width window across time slices, applying 1D convolution operations and temporal pooling (e.g., Max-pooling over time) to extract local context features.
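The 1D CNN description can be made concrete with a minimal sketch. It assumes a single kernel, no nonlinearity, and toy 2-dimensional embeddings; the function name `conv1d_max_over_time` and all values are illustrative, not taken from the cards.

```python
def conv1d_max_over_time(embeddings, kernel, bias=0.0):
    """Slide one kernel over time and max-pool the resulting feature map.

    embeddings: T rows of D floats; kernel: `width` rows of D floats.
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        # Elementwise product of the kernel with the current window, summed.
        activation = bias + sum(
            w * x
            for k_row, e_row in zip(kernel, window)
            for w, x in zip(k_row, e_row)
        )
        feature_map.append(activation)
    # Max-pooling over time keeps the strongest local match.
    return max(feature_map), feature_map

# Toy sequence of 4 tokens with 2-dimensional embeddings (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
bigram_kernel = [[1.0, 0.0], [0.0, 1.0]]  # width-2 kernel
pooled, feature_map = conv1d_max_over_time(tokens, bigram_kernel)
```

The pooled value is a single feature regardless of sequence length, which is why the output can feed a fixed-size classifier.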

New cards

Describe the multi-channel architectural design of TextCNN (Kim, 2014).

TextCNN represents words using D > 1 channels (e.g., 6D embeddings). It processes text by applying multiple parallel kernel widths simultaneously (e.g., a width-2 kernel with 4 channels alongside a width-4 kernel with 5 channels) to capture diverse n-gram features before routing them into fully connected classification layers.

New cards

What convolution variant does WaveNet employ to capture long-range dependencies, and how does it function?

WaveNet employs Atrous (dilated) convolutions. It steps through hidden network layers while exponentially increasing a dilation factor ( $r$ ). This exponentially expands the model's receptive field over time without losing sequential resolution or requiring massive parameter increases.
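To illustrate the receptive-field growth, here is a small sketch that assumes stride-1 convolutions stacked layer by layer; the helper name `receptive_field` and the layer counts are illustrative assumptions.

```python
def receptive_field(kernel_width, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions.

    Each layer with dilation r extends the field by (kernel_width - 1) * r.
    """
    return 1 + sum((kernel_width - 1) * r for r in dilations)

# Four layers of width-2 kernels: doubling dilation vs. no dilation.
dilated = receptive_field(2, [1, 2, 4, 8])
undilated = receptive_field(2, [1, 1, 1, 1])
```

With the same number of layers and parameters, the doubling schedule covers 16 time steps while the undilated stack covers only 5.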

New cards

What is the "information bottleneck" problem in standard unaligned Seq2Seq models?

Unaligned Seq2Seq architectures compress an entire source sequence into a single, fixed-length context vector ( $c$ ). Because information about early tokens fades as the encoder processes the sequence step by step, this vector becomes an information bottleneck, and performance drops steeply on longer sequences.

New cards

How did Bahdanau et al. (2015) solve the sequence bottleneck, and what was its performance impact?

They introduced Joint Alignment and Translation, which replaces the static context vector with a dynamic soft search mechanism. The model dynamically locates relevant parts of the source sentence while predicting each target word.

Unlike traditional architectures (RNNenc) whose translation quality (BLEU score) collapses on sentences longer than 30 words, this joint alignment method (RNNsearch) maintains stable, high BLEU scores even as sentence lengths approach 60 words.

New cards

Define the roles of the three main vectors used in an Attention Mechanism: Keys ( $k$ ), Query ( $q$ ), and Values ( $v$ ).

Keys ( $k$ ): The input content features that describe what information is available.
Query ( $q$ ): The current target state or tracking token that dictates what specific features to search for next.
Values ( $v$ ): The actual content vectors aggregated to produce the final weighted output representation.

New cards

Write out the general mathematical equation for an Attention mechanism step, including the calculation of the attention weight $\alpha_i$ .

$\text{Attn}(q, (k_{1:m}, v_{1:m})) = \sum_{i=1}^{m} \alpha_i(q, k_{1:m})v_i \in \mathbb{R}^{v}$

Where weights are calculated via a softmax function over an alignment score function $a(q, k_i)$ :

$\alpha_i(q, k_{1:m}) = \frac{\exp(a(q, k_i))}{\sum_{j=1}^{m} \exp(a(q, k_j))}$
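The two equations can be sketched in plain Python. This is a minimal illustration that assumes a plain dot product for the score function $a(q, k_i)$; the helper names and the toy vectors are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(q, k):
    """Dot-product alignment score a(q, k), chosen here for simplicity."""
    return sum(qi * ki for qi, ki in zip(q, k))

def attend(q, keys, values, score):
    """Return (output, weights) for one query over m key/value pairs."""
    alphas = softmax([score(q, k) for k in keys])
    dim = len(values[0])
    out = [sum(a * v[j] for a, v in zip(alphas, values)) for j in range(dim)]
    return out, alphas

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
out, alphas = attend(q, keys, values, dot)
```

The first and third keys match the query equally, so they receive equal weight, and the output is a weighted average of the values.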

New cards

Match the following Attention variations to their exact Alignment Score Functions: Content-based, Additive, Location-Based, General Dot-Product, Dot-Product, Scaled Dot-Product.

Content-based (Graves 2014): $\text{score}(s_t, h_i) = \text{cosine}[s_t, h_i]$

Additive (Bahdanau 2015): $\text{score}(s_t, h_i) = v_a^{\top} \tanh(W_a [s_t; h_i])$
Location-Based (Luong 2015): $\alpha_{t,i} = \text{softmax}(W_a s_t)$
General Dot-Product (Luong 2015): $\text{score}(s_t, h_i) = s_t^{\top} W_a h_i$
Dot-Product (Luong 2015): $\text{score}(s_t, h_i) = s_t^{\top} h_i$
Scaled Dot-Product (Vaswani 2017): $\text{score}(s_t, h_i) = \frac{s_t^{\top} h_i}{\sqrt{n}}$ , where $n$ is the dimension of the hidden states.
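Several of these score functions are short enough to write out directly. The sketch below covers the content-based, general, dot-product, and scaled dot-product variants; the function names and test vectors are illustrative, and $W_a$ is represented as a list of rows.

```python
import math

def dot_score(s, h):
    """Dot-product: s^T h."""
    return sum(a * b for a, b in zip(s, h))

def scaled_dot_score(s, h):
    """Scaled dot-product: s^T h / sqrt(n), with n the vector dimension."""
    return dot_score(s, h) / math.sqrt(len(h))

def cosine_score(s, h):
    """Content-based: cosine similarity between s and h."""
    norm = math.sqrt(dot_score(s, s)) * math.sqrt(dot_score(h, h))
    return dot_score(s, h) / norm

def general_score(s, W_a, h):
    """General dot-product: s^T W_a h."""
    W_h = [dot_score(row, h) for row in W_a]
    return dot_score(s, W_h)

s_t = [3.0, 4.0]
h_i = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
```

With an identity $W_a$, the general score reduces to the plain dot product, which is a quick sanity check on the implementation.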

New cards

Write out the loss function minimized during the unsupervised/self-supervised training of ELMo (Peters et al., 2018).

$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \left[ \log p(x_t \vert x_{1:t-1}; \theta_e, \theta^{\rightarrow}, \theta_s) + \log p(x_t \vert x_{t+1:T}; \theta_e, \theta^{\leftarrow}, \theta_s) \right]$

It jointly optimizes a forward and backward language model to build deep contextualized word representations from an entire sentence.

New cards

Compare Recurrence, Convolution, and Self-Attention on Context Span and Computational Efficiency.

Recurrence: Context from distant tokens decays as it passes through successive recurrent steps; computationally expensive because processing is sequential, step by step.
Convolution: Context window restricted strictly to a fixed kernel width; computationally efficient due to parallelization.
Self-Attention: Provides a direct connection between every output and all inputs simultaneously; computational cost grows quadratically with sequence length $N$ ( $O(N^2)$ ).

New cards

Detail the precise sub-layer layout of a standard Transformer Encoder block versus a Decoder block.

Encoder Block: A Self-Attention layer followed by an Add & Norm (residual connection + layer normalization) stage $\rightarrow$ Feed Forward network with an independent Add & Norm stage.
Decoder Block: A Masked Self-Attention layer (blocking future tokens) + Add & Norm $\rightarrow$ Encoder-Decoder Attention layer (linking decoder queries to encoder keys/values) + Add & Norm $\rightarrow$ Feed Forward network + final Add & Norm stage.

New cards

What is Positional Encoding, and why is it mandatory for Transformer architectures?

Because self-attention completely dispenses with recurrence and convolutions, it processes all input tokens simultaneously and discards any inherent concept of sequence structure or word order. A positional encoding adds a unique mathematical spatial fingerprint directly to the input embeddings (e.g., using sine/cosine wave patterns across dimensions) to restore token order.

New cards

Write the exact equation for Scaled Dot-Product Attention. What is the explicit mathematical purpose of the scale factor ( $\sqrt{d_k}$ )?

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

The scale factor $\sqrt{d_k}$ (the square root of the key dimension) stabilizes training. As $d_k$ grows, dot products grow large in magnitude, pushing the softmax function into regions where its gradients are extremely small. Dividing by $\sqrt{d_k}$ counteracts this vanishing-gradient effect.
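A minimal pure-Python sketch of the formula follows. Matrices are lists of rows, and the helper names (`matmul`, `softmax_rows`) and the toy inputs are illustrative assumptions.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(S):
    """Apply a numerically stable softmax to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning (output, weights)."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_t)]
    weights = softmax_rows(scores)
    return matmul(weights, V), weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to one, and each output row is a blend of the rows of `V` weighted toward the best-matching key.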

New cards

What is Multi-Head Attention, and what structural processing benefit does it provide over single attention?

It splits queries, keys, and values into multiple low-dimensional projections processed completely in parallel. This allows the model to simultaneously focus on different types of structural, grammatical, and semantic relationships across the text (e.g., mapping a verb directly to its subject in one head while tracking next-word alignment in another).

New cards

Explain the concept of the "Residual Stream" and residual connections within a deep Transformer architecture.

Residual connections pass un-transformed baseline signals directly forward across blocks ( $\mathcal{F}_l(x) + x$ ), helping to prevent vanishing gradients during backpropagation. From an information processing perspective, the network functions as a long, continuous residual stream where subsequent transformer layers act as modules that read from and write to the shared representation space.

New cards

Contrast the core structural differences and pre-training objectives of BERT versus GPT.

BERT: Utilizes Transformer Encoder blocks. It is a deeply bidirectional model pre-trained via autoencoding objectives like Masked Language Modeling (Mask LM) and Next Sentence Prediction (NSP).

GPT: Utilizes Transformer Decoder blocks. It is a strictly autoregressive, unidirectional model that processes tokens left-to-right and is pre-trained with a next-token prediction (causal language modeling) objective; at generation time it feeds its own outputs back in as inputs.

New cards

How does a Vision Transformer (ViT) adapt standard Transformer architecture to process two-dimensional images?

Instead of using traditional CNN filtering, ViT chops an image into fixed $16 \times 16$ patches. These patches are flattened, passed through a linear projection layer, augmented with positional encodings, and combined with an extra learnable [class] embedding before being fed into a standard Transformer Encoder.

New cards

State the three power-law relationships that describe LLM Scaling Laws (Kaplan et al., 2020).

Model performance scales predictably via power laws when performance is not bottlenecked by any of the other factors:

$\text{Parameters: } L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}; \quad \text{Compute: } L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}; \quad \text{Data: } L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}$

New cards

Write out the structural parameter estimation equation relating parameters ( $N$ ) to transformer depth ( $n_{\text{layer}}$ ) and width ( $d$ ).

Assuming the standard design configuration ( $d_{\text{attn}} = d_{\text{ff}}/4 = d$ ):

$N \approx 2 \cdot d \cdot n_{\text{layer}}(2d_{\text{attn}} + d_{\text{ff}}) \approx 12 \cdot n_{\text{layer}} \cdot d^2$

This identity helps engineers balance depth-to-width scaling boundaries (e.g., setting a baseline target of $\sim 80$ layers for a 178B parameter model).
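The estimate is easy to check numerically. The sketch below assumes the standard configuration stated above; the depth and width plugged in are a hypothetical example, not values from the cards.

```python
def approx_params(n_layer, d_model):
    """Parameter estimate N ~ 2 * d * n_layer * (2 * d_attn + d_ff).

    Assumes d_attn = d_model and d_ff = 4 * d_model, which collapses
    the expression to 12 * n_layer * d_model ** 2.
    """
    d_attn = d_model
    d_ff = 4 * d_model
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)

# Hypothetical configuration: 96 layers, width 12288.
n_params = approx_params(96, 12288)
```

For this configuration the estimate comes to roughly 174 billion parameters, which shows how quickly the $d^2$ term dominates as width grows.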

New cards

Name the three core strategic categories of Efficient Transformer variants designed to mitigate the $O(N^2)$ computational complexity bottleneck.

1. Fixed/Factorized & Learnable Sparse Patterns: Attending only to sparse subsets of data (e.g., Longformer, Big Bird, Reformer).

2. Low Rank & Kernel Methods: Approximating the full attention matrix via low-rank mappings or random Gaussian projections (e.g., Linformer, Performer).

3. Memory & Recurrence Methods: Accessing inputs locally via localized recurrence or global context slots (e.g., Transformer-XL).

New cards

Explain the core limitation of dominant sequence models (RNNs, LSTMs, GRUs) before Transformers.

They suffer from a sequential bottleneck. Because recurrent states must be computed step-by-step along the temporal dimension ($t$), the operations cannot be parallelized within a training example. This significantly limits batching efficiency across longer sequences.

New cards

What are the components, layer count, sub-layers, and normalization mechanics of the Transformer Encoder Stack?

Structure: Consists of a stack of $N = 6$ identical layers.
Sub-layers: Each layer contains two sub-layers: a Multi-Head Self-Attention mechanism and a Position-wise Feed-Forward Network.
Normalization: Every sub-layer uses a residual connection followed by Layer Normalization:
$\text{LayerNorm}(x + \text{Sublayer}(x))$
Dimensions: All sub-layers and embedding layers yield a fixed vector dimensionality of $d_{\text{model}} = 512$ .

New cards

What extra sub-layers exist in the Transformer Decoder Stack compared to the Encoder, and what are their functions?

The Decoder also uses $N = 6$ identical layers but introduces two specialized configurations:

Masked Multi-Head Self-Attention: Modifies standard self-attention so that the prediction at position $i$ can only depend on known outputs at positions less than $i$ . This prevents the model from looking ahead at future tokens during training.
Encoder-Decoder Attention: Performs multi-head attention over the final output stack of the encoder, linking the decoder's queries to the encoder's keys and values.

New cards

Write the exact equation for Scaled Dot-Product Attention and explain the mathematical risk of removing the scale factor $\sqrt{d_k}$ .

Formula:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Scale Factor Function: For large values of $d_k$ , dot products grow excessively large in magnitude. This pushes the softmax function into regions with extremely flat, small gradients. Dividing by $\sqrt{d_k}$ counteracts this, mitigating vanishing gradients.
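The growth of dot products with $d_k$ can be observed empirically. This sketch assumes query and key components are independent with zero mean and unit variance, in which case the dot product has variance $d_k$; the function name, trial count, and seed are arbitrary choices.

```python
import math
import random

def dot_variance(d_k, trials=2000, scale=False, seed=0):
    """Sample variance of q . k for random unit-variance Gaussian vectors."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        samples.append(s / math.sqrt(d_k) if scale else s)
    mean = sum(samples) / trials
    return sum((x - mean) ** 2 for x in samples) / trials

raw_var = dot_variance(64)                  # close to d_k = 64
scaled_var = dot_variance(64, scale=True)   # close to 1
```

Dividing by $\sqrt{d_k}$ brings the variance of the softmax inputs back to about one, whatever the key dimension.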

New cards

Why is Dot-Product attention preferred over Additive attention in the Transformer architecture?

While the two are similar in theoretical complexity, dot-product attention can be implemented using highly optimized matrix multiplication code. This makes it much faster and more space-efficient in practice.

New cards

Write out the complete Multi-Head Attention composition equations, including the learned projection matrix dimensions.

$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$

$\text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

Projection Matrix Dimensions:
- $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$
- $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$
- $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$
- $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$
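The composition and the matrix shapes can be traced in a small self-attention sketch. It assumes $Q = K = V = X$, tiny dimensions, and random weights; every helper name and size here is illustrative.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(S):
    """Apply a numerically stable softmax to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_t)]
    return matmul(softmax_rows(scores), V)

def rand_matrix(rng, rows, cols):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples; X is [N x d_model]."""
    outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
               for W_q, W_k, W_v in heads]
    # Concatenate head outputs along the feature axis: [N x h * d_v].
    concat = [sum((head[i] for head in outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
N, d_model, h = 3, 8, 2
d_k = d_v = d_model // h
X = rand_matrix(rng, N, d_model)
heads = [(rand_matrix(rng, d_model, d_k), rand_matrix(rng, d_model, d_k),
          rand_matrix(rng, d_model, d_v)) for _ in range(h)]
W_o = rand_matrix(rng, h * d_v, d_model)
Y = multi_head(X, heads, W_o)
```

The output `Y` has the same shape as `X`, so the layer can be stacked or wrapped in a residual connection without reshaping.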

New cards

State the exact Multi-Head Attention hyperparameters utilized in the baseline paper.

Parallel attention heads: $h = 8$
Model dimension: $d_{\text{model}} = 512$
Internal head dimensions: $d_k = d_v = d_{\text{model}} / h = 512 / 8 = 64$ .
This allows the network to jointly process features from different representation subspaces concurrently.

New cards

Map the precise origins of the Queries ($Q$), Keys ($K$), and Values ($V$) vectors across the three distinct attention configurations in the Transformer.

1. Encoder Self-Attention: $Q, K, V$ all originate from the output of the previous layer in the encoder. Every position can attend to all past/future positions in that layer.

2. Encoder-Decoder Attention: Queries ( $Q$ ) come from the previous decoder layer; Keys ( $K$ ) and Values ( $V$ ) come from the final output stack of the encoder.

3. Decoder Self-Attention: $Q, K, V$ come from the previous decoder layer, but keys/values for future token positions are explicitly masked out by setting their softmax inputs to $-\infty$ .
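The masking in decoder self-attention can be sketched as follows. The example assumes that input position $i$ may attend to positions $j \le i$, which is the usual convention when decoder inputs are shifted right by one; the function name and scores are made up.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only see positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions get -inf so that exp(...) becomes exactly 0.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(x - m) for x in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.5, 1.0, 2.0],
          [1.0, 1.0, 3.0],
          [0.0, 0.0, 0.0]]
W = causal_softmax(scores)
```

Even though the raw scores for future positions are the largest in the first two rows, they receive zero weight after masking.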

New cards

Write the calculation equation for the Position-Wise Feed-Forward Network (FFN) and state its layer dimensions.

Formula:

$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$

It consists of two linear transformations with a ReLU activation in between.
Dimensions: Input/output dimensions match $d_{\text{model}} = 512$ , while the inner hidden layer scales out to $d_{ff} = 2048$ .
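A minimal sketch of the FFN for a single position follows, using toy dimensions (2 in, 3 hidden, 2 out) instead of 512 and 2048. The weights are made-up values, and matrices are stored as lists of rows.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one token vector x."""
    # First linear layer followed by ReLU.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    # Second linear layer projects back to the model dimension.
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

x = [2.0, 1.0]
W1 = [[1.0, -1.0, 0.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, -5.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
b2 = [0.5, 0.5]
y = ffn(x, W1, b1, W2, b2)
```

The same weights are applied to every position independently, which is what "position-wise" refers to.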

New cards

How are token embedding weights handled during initialization and processing in the Transformer paper?

Weight matrices are fully shared between both the input and output embedding layers, as well as the final pre-softmax linear transformation. Additionally, before combining them with positional encodings, the embedding weights are explicitly multiplied by $\sqrt{d_{\text{model}}}$ .

New cards

Write the sine and cosine formulas for Positional Encodings. Why did the authors choose this specific geometric function?

Formulas:

$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$

$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$

Rationale: The authors hypothesized that this geometric formula allows the model to easily learn to track relative positions, since for any fixed offset $k$ , $PE_{pos+k}$ can be computed as a direct linear function of $PE_{pos}$ .
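The two formulas can be sketched directly. In the code below the loop variable `i` walks over even dimension indices, so it plays the role of $2i$ in the formulas; the function name and the small sizes are illustrative.

```python
import math

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings as a [max_len x d_model] list."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):  # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(6, 4)
# Angle-addition identity: the encoding at pos + k is a linear function
# (a rotation) of the encoding at pos within each sine/cosine pair.
shifted = pe[3][0] * pe[2][1] + pe[3][1] * pe[2][0]  # equals sin(3 + 2)
```

The `shifted` value reproduces `pe[5][0]` from the encodings at positions 3 and 2, which is the relative-position property described in the rationale.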

New cards

Compare Self-Attention and Recurrent layers regarding Complexity per Layer and Sequential Operations.

Complexity per Layer:
Self-Attention: $O(N^2 \cdot d)$ (where $N$ is sequence length and $d$ is dimension).
Recurrent: $O(N \cdot d^2)$
Sequential Operations:
- Self-Attention: $O(1)$ (Allows massive parallelization).
- Recurrent: $O(N)$ (Forces step-by-step dependency processing).

New cards

Compare Self-Attention, Recurrent, and Convolutional layers regarding Maximum Path Length for long-range dependencies.

Self-Attention: $O(1)$ (Direct constant step-size between any two tokens in the sequence).
Recurrent: $O(N)$ (Signals must traverse the entire length of the recurrent chain).
Convolutional: $O(\log_K(N))$ with dilated convolutions (a stack of contiguous kernels of width $K$ needs $O(N/K)$ layers); the signal is traversed hierarchically through layers.

New cards

Write out the exact mathematical Learning Rate Schedule implemented with the Adam optimizer.

Formula:

$\text{lrate} = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step\_num}^{-0.5}, \text{step\_num} \cdot \text{warmup\_steps}^{-1.5}\right)$

The schedule increases the learning rate linearly for the first $\text{warmup\_steps} = 4000$ , then drops it proportionally to the inverse square root of the training step number.
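The schedule translates into a one-line function. This sketch assumes step numbers start at 1; the function name `lrate` mirrors the formula.

```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    """Warmup-then-decay learning rate; step_num must be >= 1."""
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

peak = lrate(4000)
halfway = lrate(2000)
later = lrate(16000)
```

The rate peaks at the end of warmup (about 7e-4 for these defaults), is half that value midway through warmup, and has decayed to half the peak again by step 16000.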

New cards

Detail the two core Regularization techniques used to stabilize Transformer training.

1. Residual Dropout ( $P_{drop} = 0.1$ ): Applied directly to the output of each sub-layer before addition and normalization, as well as to the combined sums of embeddings and positional encodings.

2. Label Smoothing ( $\epsilon_{ls} = 0.1$ ): Intentionally pushes the model to be less certain of its categorical predictions. While this hurts training perplexity, it consistently improves overall evaluation accuracy and final BLEU scores.
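Label smoothing can be illustrated with a short sketch. It assumes one common formulation, in which a fraction $\epsilon_{ls}$ of the probability mass is spread uniformly over all classes; the function name and the 4-class example are made up.

```python
def smooth_labels(true_index, num_classes, epsilon=0.1):
    """Target distribution: (1 - epsilon) on the true class plus
    epsilon spread uniformly over all classes (assumed formulation)."""
    uniform = epsilon / num_classes
    return [(1.0 - epsilon) + uniform if k == true_index else uniform
            for k in range(num_classes)]

target = smooth_labels(true_index=2, num_classes=4)
```

The true class keeps most of the mass while every other class gets a small nonzero target, so the model is never pushed toward fully confident predictions.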

New cards

What architectural adjustments to Heads and Attention Dimensions were empirically discovered in the ablation studies?

Altering Heads: Varying the head count while keeping total computation constant shows that quality degrades with too few heads (single-head attention is 0.9 BLEU worse than the best setting) and also with too many; the base model uses $h=8$ .
Altering Attention Dimensions: Reducing $d_k$ hurts model quality, suggesting that determining compatibility between queries and keys is not easy and that a more sophisticated compatibility function than the dot product may be beneficial.

New cards

What is the key structural characteristic of token processing in an autoregressive Transformer Decoder layer?

Every token is processed through its own distinct structural column. For a sequence of $N$ tokens, the architecture maps an entire window of input vectors $(x_1, \dots, x_N)$ simultaneously to a window of output vectors $(h_1, \dots, h_N)$ of exactly the same length.

New cards

Explain how the Language Modeling Head (Unembedding Matrix $U$ ) converts the final block output into a next-token prediction.

The output embedding from the final transformer block is passed through a linear unembedding matrix $U$ (which maps the model dimension to the vocabulary size) and a softmax function:

$p(y_{t+1} \mid y_{1:t}) = \text{softmax}(h_t U)$

This produces a valid probability distribution over all possible next tokens.
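The unembedding step can be sketched with a toy example. It assumes $U$ has shape $[d \times |V|]$ and is stored as a list of rows; the 2-dimensional hidden state and 3-token vocabulary are hypothetical.

```python
import math

def next_token_distribution(h_t, U):
    """softmax(h_t U): probabilities over the vocabulary."""
    logits = [sum(h * u for h, u in zip(h_t, col)) for col in zip(*U)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

h_t = [1.0, 2.0]              # final hidden state, d = 2
U = [[1.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]         # d x |V| with |V| = 3
probs = next_token_distribution(h_t, U)
predicted = probs.index(max(probs))
```

Greedy decoding would pick `predicted`; sampling strategies draw from `probs` instead.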

New cards

Describe the Residual Stream Viewpoint of information processing within stacked transformer blocks.

Instead of viewing blocks as sequential transformations, the residual stream treats the network as a continuous communication highway. Processing components (Attention, FFN) read their inputs from the stream, perform their operations, and write their outputs back into it via skip connections.

At early blocks, the stream represents the current token; at the highest blocks, it shifts to represent the following token.

New cards

State the exact mathematical shapes of the learned weight matrices ( $W^{Q_c}, W^{K_c}, W^{V_c}, W^O$ ) and outputs for a single attention head $c$ given model dimension $d$ , head query/key dimension $d_k$ , and head value dimension $d_v$ .

$W^{Q_c}$ shape: $[d \times d_k]$
$W^{K_c}$ shape: $[d \times d_k]$
$W^{V_c}$ shape: $[d \times d_v]$
$\text{head}_i^c$ shape: $[1 \times d_v]$
Concatenated Heads shape: $[1 \times A \cdot d_v]$ (where $A$ is head count)
$W^O$ shape: $[A \cdot d_v \times d]$

New cards

Why is the final projection matrix $W^O$ in Multi-Head Attention typically a square matrix of shape $[d \times d]$ ?

To maintain modular consistency across blocks, the value dimension $d_v$ is typically set to $d/A$ . This causes the concatenated heads matrix shape $[1 \times A \cdot d_v]$ to simplify to $[1 \times d]$ . $W^O$ is then structured as a square matrix of shape $[d \times d]$ to project the unified heads back into the model dimension.

New cards

What is the critical distinction regarding how Layer Normalization is applied in a Transformer compared to normalization in sequence processing?

Despite its name, Layer Normalization in a transformer is applied individually to the embedding vector of a single token column at a specific time step. It does not compute statistics across the temporal multi-token sequence layer.

New cards

Write out the complete mathematical operations for Layer Normalization on a single token embedding vector $x$ of dimensionality $d$ .

1. Mean ( $\mu$ ): $\mu = \frac{1}{d} \sum_{i=1}^d x_i$

2. Standard Deviation ( $\sigma$ ): $\sigma = \sqrt{\frac{1}{d} \sum_{i=1}^d (x_i - \mu)^2}$

3. Z-Score Normalization: $\hat{x} = \frac{x - \mu}{\sigma}$

4. Gain ( $\gamma$ ) & Offset ($\beta$) scaling:

$\text{LayerNorm}(x) = \gamma \left( \frac{x - \mu}{\sigma} \right) + \beta$
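The four steps can be sketched for one token vector. The `eps` term is an assumption added for numerical safety and does not appear in the equations above; all names and values are illustrative.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector, then apply gain and offset.

    eps is an added safeguard against division by zero (an assumption,
    not part of the equations above).
    """
    d = len(x)
    mu = sum(x) / d
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d + eps)
    return [g * (xi - mu) / sigma + b for xi, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With unit gain and zero offset, the output has mean zero and variance very close to one, computed over the feature dimension of a single token.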

New cards

Detail the step-by-step mathematical flow of a modern Pre-Norm Architecture block for token $i$ .

1. $t_i^1 = \text{LayerNorm}(x_i)$

2. $t_i^2 = \text{MultiHeadAttention}(t_i^1, t_1^1, \dots, t_N^1)$

3. $t_i^3 = t_i^2 + x_i$ (First Residual Connection)

4. $t_i^4 = \text{LayerNorm}(t_i^3)$

5. $t_i^5 = \text{FFN}(t_i^4)$

6. $h_i = t_i^5 + t_i^3$ (Second Residual Connection)
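The data flow of the six steps can be sketched with the sub-layers passed in as callables. The stand-ins used in the example (identity normalization, mean-pooling "attention", doubling "FFN") are hypothetical placeholders chosen to keep the arithmetic easy to follow; they are not real sub-layers.

```python
def pre_norm_block(xs, layer_norm, attention, ffn):
    """Apply one Pre-Norm block to a list of token vectors xs."""
    t1 = [layer_norm(x) for x in xs]                               # step 1
    t2 = [attention(t, t1) for t in t1]                            # step 2
    t3 = [[a + b for a, b in zip(t, x)] for t, x in zip(t2, xs)]   # step 3
    t4 = [layer_norm(t) for t in t3]                               # step 4
    t5 = [ffn(t) for t in t4]                                      # step 5
    return [[a + b for a, b in zip(u, v)] for u, v in zip(t5, t3)] # step 6

# Hypothetical stand-in sub-layers, for tracing the flow only.
identity_norm = lambda x: list(x)
mean_attention = lambda t, all_t: [sum(col) / len(all_t) for col in zip(*all_t)]
double_ffn = lambda t: [2.0 * v for v in t]

xs = [[1.0, 2.0], [3.0, 4.0]]
hs = pre_norm_block(xs, identity_norm, mean_attention, double_ffn)
```

Note that both residual additions use the unnormalized stream (`xs`, then `t3`), while the sub-layers only ever see normalized inputs.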

New cards

What final architectural layer requirement is mandatory when using a Pre-Norm block stack layout?

Because Pre-Norm leaves the final block output unnormalized, one extra standalone Layer Norm must be applied to the final $h_i$ vector after the last block, right before passing the signal to the Language Modeling Head.

New cards

Explain how the inputs and operations of an entire token sequence of length $N$ are packed to compute attention projections simultaneously.

Embeddings for $N$ input tokens are packed into a single matrix $X$ of size $[N \times d]$ , where each row represents a token embedding. Multiplying $X$ directly by the projection matrices generates the combined sequence matrices $Q$ , $K$ , and $V$ simultaneously:

$Q = XW^Q \quad [\text{shape: } N \times d_k]$

$K = XW^K \quad [\text{shape: } N \times d_k]$

$V = XW^V \quad [\text{shape: } N \times d_v]$

New cards

What matrix multiplication computes all pairwise token similarity scores across a context window simultaneously, and what is its resulting tensor shape?

Multiplying the query matrix $Q$ by the transpose of the key matrix, $K^T$ :

$\text{Scores} = QK^T$

This operation evaluates all pairwise token comparisons in parallel, resulting in a square matrix of shape $[N \times N]$ (where $N$ is the sequence context length).
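The shapes can be traced on a toy example with $N = 3$ tokens, $d = 4$, and $d_k = 2$. The matrices hold made-up values and are stored as lists of rows; the helper names are illustrative.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def shape(M):
    """Return (rows, cols) of a list-of-rows matrix."""
    return (len(M), len(M[0]))

X = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 2.0]]                                 # [N x d]
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]     # [d x d_k]
W_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]     # [d x d_k]

Q = matmul(X, W_q)                    # [N x d_k]
K = matmul(X, W_k)                    # [N x d_k]
K_t = [list(col) for col in zip(*K)]  # [d_k x N]
scores = matmul(Q, K_t)               # [N x N]
```

Entry `scores[i][j]` is the similarity between the query of token `i` and the key of token `j`, so the matrix grows quadratically with the context length.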