What is the primary limitation of static/shallow word embeddings (e.g., Word2Vec, GloVe)? Give an example.
They are entirely context-free, assigning the exact same static vector representation to a word regardless of its surrounding context. For example, the word "bank" receives the same mathematical vector in both "bank account" (finance) and "bank of a river" (geography).
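A minimal sketch of this limitation, assuming a toy lookup table (the vectors and the `embed` helper are made up for illustration and are not taken from Word2Vec or GloVe):

```python
# Toy static embedding table: one fixed vector per word type (values are made up).
EMBEDDINGS = {
    "bank": [0.2, -0.7, 0.5],
    "account": [0.9, 0.1, -0.3],
    "river": [-0.4, 0.8, 0.6],
}

def embed(tokens):
    """Look up each token independently; the surrounding context is never consulted."""
    return [EMBEDDINGS[tok] for tok in tokens]

finance = embed(["bank", "account"])
geography = embed(["river", "bank"])

# "bank" receives the identical vector in both phrases.
same_vector = finance[0] == geography[1]
print(same_vector)
```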
Explain the core architectural differences between Recurrence, Shallow Embeddings, and 1D CNNs for sequence processing.
Recurrence: Tracks temporal dependencies step-by-step over time using a dynamic, hidden context layer.
Shallow Embedding: Applies a static vector transformation matrix ($W_e$) to project discrete vocabulary tokens into dense continuous vectors.
1D CNN: Slides a fixed-width window across time slices, applying 1D convolution operations and temporal pooling (e.g., Max-pooling over time) to extract local context features.
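The 1D CNN case can be sketched as follows. This is an illustrative NumPy version under assumed toy sizes (7 tokens, 6-dimensional embeddings, one width-2 filter); `conv1d_max_over_time` is a hypothetical helper name.

```python
import numpy as np

def conv1d_max_over_time(x, kernel):
    """x: [T, D] token embeddings; kernel: [width, D] filter.
    Returns the max-pooled feature and the per-window activations."""
    T, _ = x.shape
    width = kernel.shape[0]
    # One activation per window position (stride 1, no padding).
    activations = np.array([
        np.sum(x[t:t + width] * kernel) for t in range(T - width + 1)
    ])
    # Max-pooling over time keeps only the strongest response.
    return activations.max(), activations

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 6))        # 7 tokens, 6-dimensional embeddings
kernel = rng.normal(size=(2, 6))   # width-2 filter (a bigram detector)
feature, activations = conv1d_max_over_time(x, kernel)
```

A width-2 window over 7 tokens yields 6 activations, which pooling reduces to one feature per filter regardless of sentence length.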
Describe the multi-channel architectural design of TextCNN (Kim, 2014).
TextCNN represents words using D > 1 channels (e.g., 6D embeddings). It processes text by applying multiple parallel kernel widths simultaneously (e.g., a width-2 kernel with 4 channels alongside a width-4 kernel with 5 channels) to capture diverse n-gram features before routing them into fully connected classification layers.
What convolution variant does WaveNet employ to capture long-range dependencies, and how does it function?
WaveNet employs Atrous (dilated) convolutions. It steps through hidden network layers while exponentially increasing a dilation factor (r). This exponentially expands the model's receptive field over time without losing sequential resolution or requiring massive parameter increases.
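To see the exponential growth concretely, here is a small sketch that computes the receptive field of stacked stride-1 convolutions. The kernel width of 2 and the dilation schedule 1, 2, 4, 8 are assumptions chosen for illustration.

```python
def receptive_field(kernel_width, dilations):
    """Receptive field (in time steps) of stacked stride-1 1D convolutions."""
    return 1 + sum((kernel_width - 1) * d for d in dilations)

dilated = receptive_field(2, [1, 2, 4, 8])   # dilation doubles at each layer
plain = receptive_field(2, [1, 1, 1, 1])     # same depth, no dilation
print(dilated, plain)  # 16 5
```

Both stacks have the same number of layers and weights, but the dilated stack covers 16 time steps instead of 5.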
What is the "information bottleneck" problem in standard unaligned Seq2Seq models?
Unaligned Seq2Seq architectures force an entire source sequence into a single, fixed-length context vector (c). Because history decays logarithmically during sequential processing, the network suffers from a heavy information bottleneck that causes a steep drop in performance on longer sequences.
How did Bahdanau et al. (2015) solve the sequence bottleneck, and what was its performance impact?
They introduced Joint Alignment and Translation, which replaces the static context vector with a dynamic soft search mechanism. The model dynamically locates relevant parts of the source sentence while predicting each target word.
Unlike the traditional encoder-decoder (RNNencdec), whose translation quality (BLEU score) collapses on sentences longer than 30 words, this joint alignment method (RNNsearch) maintains stable, high BLEU scores even as sentence lengths approach 60 words.
Define the roles of the three main vectors used in an Attention Mechanism: Keys (k), Query (q), and Values (v).
Keys (k): The input content features that describe what information is available.
Query (q): The current target state or tracking token that dictates what specific features to search for next.
Values (v): The actual content vectors aggregated to produce the final weighted output representation.
Write out the general mathematical equation for an Attention mechanism step, including the calculation of the attention weight αi.
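$\mathrm{Attn}(q, (k_{1:m}, v_{1:m})) = \sum_{i=1}^{m} \alpha_i(q, k_{1:m})\, v_i \in \mathbb{R}^{v}$
Where weights are calculated via a softmax function over an alignment score function $a(q, k_i)$:
$\alpha_i(q, k_{1:m}) = \dfrac{\exp(a(q, k_i))}{\sum_{j=1}^{m} \exp(a(q, k_j))}$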
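The two equations translate almost directly into code. The sketch below uses a plain dot product as the score function, which is only one possible choice; the toy query, keys, and values are made up.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values, score):
    """Weighted sum of values; weights are a softmax over score(q, k_i)."""
    alphas = softmax([score(q, k) for k in keys])
    dim = len(values[0])
    out = [sum(a * v[j] for a, v in zip(alphas, values)) for j in range(dim)]
    return out, alphas

def dot(q, k):
    return sum(qi * ki for qi, ki in zip(q, k))

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, alphas = attention(q, keys, values, dot)
```

The key most similar to the query receives the largest weight, and the weights always sum to 1.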
Match the following Attention variations to their exact Alignment Score Functions: Content-based, Additive, Location-Based, General Dot-Product, Dot-Product, Scaled Dot-Product.
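Content-based (Graves 2014): $\mathrm{score}(s_t, h_i) = \mathrm{cosine}[s_t, h_i]$
Additive (Bahdanau 2015): $\mathrm{score}(s_t, h_i) = v_a^\top \tanh(W_a [s_t; h_i])$
Location-Based (Luong 2015): $\alpha_{t,i} = \mathrm{softmax}(W_a s_t)$
General Dot-Product (Luong 2015): $\mathrm{score}(s_t, h_i) = s_t^\top W_a h_i$
Dot-Product (Luong 2015): $\mathrm{score}(s_t, h_i) = s_t^\top h_i$
Scaled Dot-Product (Vaswani 2017): $\mathrm{score}(s_t, h_i) = \dfrac{s_t^\top h_i}{\sqrt{n}}$, where $n$ is the dimension of the source hidden state.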
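For comparison, the score functions that take a pair $(s_t, h_i)$ can be evaluated side by side. This is a sketch with random toy vectors and randomly initialized parameters (`W_general`, `W_additive`, and `v_a` are illustrative names); the location-based variant is left out because it produces weights from $s_t$ alone rather than a pairwise score.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
s_t = rng.normal(size=n)                  # decoder (target) state
h_i = rng.normal(size=n)                  # encoder (source) state
W_general = rng.normal(size=(n, n))       # W_a for the general form
W_additive = rng.normal(size=(n, 2 * n))  # W_a for the additive form
v_a = rng.normal(size=n)

scores = {
    "content": s_t @ h_i / (np.linalg.norm(s_t) * np.linalg.norm(h_i)),
    "additive": v_a @ np.tanh(W_additive @ np.concatenate([s_t, h_i])),
    "general": s_t @ W_general @ h_i,
    "dot": s_t @ h_i,
    "scaled_dot": s_t @ h_i / np.sqrt(n),
}
```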
Write out the loss function minimized during the unsupervised/self-supervised training of ELMo (Peters et al., 2018).
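$L(\theta) = -\sum_{t=1}^{T} \left[ \log p(x_t \mid x_{1:t-1}; \theta_e, \overrightarrow{\theta}, \theta_s) + \log p(x_t \mid x_{t+1:T}; \theta_e, \overleftarrow{\theta}, \theta_s) \right]$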
It jointly optimizes a forward and backward language model to build deep contextualized word representations from an entire sentence.
Compare Recurrence, Convolution, and Self-Attention on Context Span and Computational Efficiency.
Recurrence: Logarithmic decay of context window; computationally expensive due to its sequential step-by-step nature.
Convolution: Context window restricted strictly to a fixed kernel width; computationally efficient due to parallelization.
Self-Attention: Provides a direct connection between every output and all inputs simultaneously; computational cost grows quadratically with sequence length $N$ ($O(N^2)$).
Detail the precise sub-layer layout of a standard Transformer Encoder block versus a Decoder block.
Encoder Block: A Self-Attention layer followed by an Add & Norm (residual connection + layer normalization) stage → Feed Forward network with an independent Add & Norm stage.
Decoder Block: A Masked Self-Attention layer (blocking future tokens) + Add & Norm → Encoder-Decoder Attention layer (linking decoder queries to encoder keys/values) + Add & Norm → Feed Forward network + final Add & Norm stage.
What is Positional Encoding, and why is it mandatory for Transformer architectures?
Because self-attention completely dispenses with recurrence and convolutions, it processes all input tokens simultaneously and discards any inherent concept of sequence structure or word order. A positional encoding adds a unique mathematical spatial fingerprint directly to the input embeddings (e.g., using sine/cosine wave patterns across dimensions) to restore token order.
Write the exact equation for Scaled Dot-Product Attention. What is the explicit mathematical purpose of the scale factor ($\sqrt{d_k}$)?
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right) V$
The scale factor $\sqrt{d_k}$ (the square root of the key dimension) stabilizes training. As $d_k$ grows, dot products grow large in magnitude, pushing the softmax function into regions where its gradients are extremely small. Dividing by $\sqrt{d_k}$ counteracts this and prevents these vanishing gradients.
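A quick numerical check of why the square root is the right scale, assuming the components of the queries and keys are independent with mean 0 and variance 1 (under that assumption the dot product has variance $d_k$). The sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10000, d_k))   # components with mean 0, variance 1
k = rng.normal(size=(10000, d_k))

dots = np.sum(q * k, axis=1)                # 10000 independent dot products
var_raw = dots.var()                        # close to d_k
var_scaled = (dots / np.sqrt(d_k)).var()    # close to 1
```

Dividing by $\sqrt{d_k}$ brings the variance of the softmax inputs back to roughly 1, whatever the key dimension is.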
What is Multi-Head Attention, and what structural processing benefit does it provide over single attention?
It splits queries, keys, and values into multiple low-dimensional projections processed completely in parallel. This allows the model to simultaneously focus on different types of structural, grammatical, and semantic relationships across the text (e.g., mapping a verb directly to its subject in one head while tracking next-word alignment in another).
Explain the concept of the "Residual Stream" and residual connections within a deep Transformer architecture.
Residual connections pass un-transformed baseline signals directly forward across blocks ($F_l(x) + x$), helping to prevent vanishing gradients during backpropagation. From an information processing perspective, the network functions as a long, continuous residual stream where subsequent transformer layers act as modules that read from and write to the shared representation space.
Contrast the core structural differences and pre-training objectives of BERT versus GPT.
BERT: Utilizes Transformer Encoder blocks. It is a deeply bidirectional model pre-trained via autoencoding objectives like Masked Language Modeling (Mask LM) and Next Sentence Prediction (NSP).
GPT: Utilizes Transformer Decoder blocks. It is a strictly autoregressive, unidirectional model that processes tokens left-to-right, looping its own generation outputs back into inputs.
How does a Vision Transformer (ViT) adapt standard Transformer architecture to process two-dimensional images?
Instead of using traditional CNN filtering, ViT chops an image into fixed 16×16 patches. These patches are flattened, passed through a linear projection layer, augmented with positional encodings, and combined with an extra learnable [class] embedding before being fed into a standard Transformer Encoder.
State the three power-law relationships that describe LLM Scaling Laws (Kaplan et al., 2020).
Model performance scales predictably via power laws when performance is not bottlenecked by any of the other factors:
Parameters: $L(N) \approx (N_c / N)^{\alpha_N}$; Compute: $L(C) \approx (C_c / C)^{\alpha_C}$; Data: $L(D) \approx (D_c / D)^{\alpha_D}$
Write out the structural parameter estimation equation relating parameters (N) to transformer depth (nlayer) and width (d).
Assuming the standard design configuration ($d_{attn} = d_{ff}/4 = d$):
$N \approx 2 \cdot d \cdot n_{layer} (2 d_{attn} + d_{ff}) \approx 12 \cdot n_{layer} \cdot d^2$
This identity helps engineers balance depth-to-width scaling boundaries (e.g., setting a baseline target of $\sim 80$ layers for a 178B parameter model).
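The estimate is easy to evaluate directly. The sketch below applies the formula under the same assumption ($d_{attn} = d$, $d_{ff} = 4d$); the 12-layer, width-768 configuration is a hypothetical example chosen only for the arithmetic.

```python
def approx_params(n_layer, d_model):
    """Parameter estimate, assuming d_attn = d_model and d_ff = 4 * d_model."""
    d_attn = d_model
    d_ff = 4 * d_model
    return 2 * d_model * n_layer * (2 * d_attn + d_ff)

# Hypothetical configuration, for illustration only.
n = approx_params(n_layer=12, d_model=768)
print(n)  # 84934656
```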
Name the three core strategic categories of Efficient Transformer variants designed to mitigate the O(N2) computational complexity bottleneck.
1. Fixed/Factorized & Learnable Sparse Patterns: Attending only to sparse subsets of data (e.g., Longformer, Big Bird, Reformer).
2. Low Rank & Kernel Methods: Approximating the full attention matrix via low-rank mappings or random Gaussian projections (e.g., Linformer, Performer).
3. Memory & Recurrence Methods: Accessing inputs locally via localized recurrence or global context slots (e.g., Transformer-XL).
Explain the core limitation of dominant sequence models (RNNs, LSTMs, GRUs) before Transformers.
They suffer from a sequential bottleneck. Because recurrent states must be computed step-by-step along the temporal dimension ($t$), the operations cannot be parallelized within a training example. This significantly limits batching efficiency across longer sequences.
What are the components, layer count, sub-layers, and normalization mechanics of the Transformer Encoder Stack?
Structure: Consists of a stack of N=6 identical layers.
Sub-layers: Each layer contains two sub-layers: a Multi-Head Self-Attention mechanism and a Position-wise Feed-Forward Network.
Normalization: Every sub-layer uses a residual connection followed by Layer Normalization:
LayerNorm(x+Sublayer(x))
Dimensions: All sub-layers and embedding layers yield a fixed vector dimensionality of $d_{model} = 512$.
What extra sub-layers exist in the Transformer Decoder Stack compared to the Encoder, and what are their functions?
The Decoder also uses $N = 6$ identical layers but introduces two specialized configurations:
Masked Multi-Head Self-Attention: Modifies standard self-attention so that the prediction at position i can only depend on known outputs at positions less than i. This prevents the model from looking ahead at future tokens during training.
Encoder-Decoder Attention: Performs multi-head attention over the final output stack of the encoder, linking the decoder's queries to the encoder's keys and values.
Write the exact equation for Scaled Dot-Product Attention and explain the mathematical risk of removing the scale factor $\sqrt{d_k}$.
Formula:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right) V$
Scale Factor Function: For large values of $d_k$, dot products grow excessively large in magnitude. This pushes the softmax function into regions with extremely flat, small gradients. Dividing by $\sqrt{d_k}$ counteracts this, mitigating vanishing gradients.
Why is Dot-Product attention preferred over Additive attention in the Transformer architecture?
While similar to additive attention in theoretical complexity, dot-product attention can be implemented using highly optimized matrix multiplication code. This makes it significantly faster and more space-efficient in practice.
Write out the complete Multi-Head Attention composition equations, including the learned projection matrix dimensions.
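$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$
where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
Projection Matrix Dimensions:
$W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$
$W_i^K \in \mathbb{R}^{d_{model} \times d_k}$
$W_i^V \in \mathbb{R}^{d_{model} \times d_v}$
$W^O \in \mathbb{R}^{h d_v \times d_{model}}$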
State the exact Multi-Head Attention hyperparameters utilized in the baseline paper.
Parallel attention heads: h=8
Model dimension: $d_{model} = 512$
Internal head dimensions: $d_k = d_v = d_{model}/h = 512/8 = 64$.
This allows the network to jointly process features from different representation subspaces concurrently.
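With these hyperparameters, the composition equations can be sketched in NumPy as below. This is an unmasked self-attention illustration with randomly initialized weights (the 0.02 scale and the sequence length of 5 are arbitrary assumptions), not a trained or optimized implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """X: [N, d_model]; W_q/W_k/W_v: [h, d_model, d_k]; W_o: [h * d_v, d_model]."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        heads.append(weights @ V)                 # [N, d_v]
    return np.concatenate(heads, axis=-1) @ W_o   # [N, d_model]

h, d_model, N = 8, 512, 5
d_k = d_v = d_model // h   # 64
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) * 0.02 for _ in range(3))
W_o = rng.normal(size=(h * d_v, d_model)) * 0.02
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
```

Because $h \cdot d_v = d_{model}$, the concatenated heads have the same width as the input, and the output keeps the shape `[N, d_model]`.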
Map the precise origins of the Queries ($Q$), Keys ($K$), and Values ($V$) vectors across the three distinct attention configurations in the Transformer.
1. Encoder Self-Attention: $Q$, $K$, $V$ all originate from the output of the previous layer in the encoder. Every position can attend to all positions (both earlier and later) in that previous layer.
2. Encoder-Decoder Attention: Queries (Q) come from the previous decoder layer; Keys (K) and Values (V) come from the final output stack of the encoder.
3. Decoder Self-Attention: Q,K,V come from the previous decoder layer, but keys/values for future token positions are explicitly masked out by setting their softmax inputs to −∞.
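The decoder mask from item 3 can be illustrated in a few lines. The all-zero `scores` matrix is a stand-in for real attention scores, used here so the effect of the mask is easy to read.

```python
import numpy as np

N = 4
scores = np.zeros((N, N))   # stand-in for Q @ K.T / sqrt(d_k)
# True where the key position lies in the future of the query position.
future = np.triu(np.ones((N, N), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

# Row-wise softmax; exp(-inf) = 0, so future positions get zero weight.
e = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
```

The first position can only attend to itself, while the last position spreads its weight over all four positions.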
Write the calculation equation for the Position-Wise Feed-Forward Network (FFN) and state its layer dimensions.
Formula:
$\mathrm{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$
It consists of two linear transformations with a ReLU activation in between.
Dimensions: Input/output dimensions match $d_{model} = 512$, while the inner hidden layer scales out to $d_{ff} = 2048$.
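A direct NumPy transcription of the formula, with randomly initialized weights as an assumption (the 0.02 scale and zero biases are arbitrary):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: ReLU between two linear maps."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
x = rng.normal(size=(3, d_model))   # 3 positions, each transformed independently
y = ffn(x, W1, b1, W2, b2)
```

"Position-wise" means the same weights are applied to every row of `x` separately, so processing one position alone gives the same result as processing it within the batch.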
How are token embedding weights handled during initialization and processing in the Transformer paper?
The same weight matrix is shared between the input and output embedding layers and the final pre-softmax linear transformation. Additionally, before they are combined with positional encodings, the embedding weights are multiplied by $\sqrt{d_{model}}$.
Write the sine and cosine formulas for Positional Encodings. Why did the authors choose this specific geometric function?
Formulas:
$PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$
$PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{model}}\right)$
Rationale: The authors hypothesized that this geometric formula allows the model to easily learn to track relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be computed as a direct linear function of $PE_{pos}$.
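The formulas and the linearity claim can both be checked numerically. In this sketch, `positional_encoding` implements the two formulas for an even `d_model`, and `shift` is a hypothetical helper showing that moving by an offset `k` is a rotation of each (sin, cos) pair that does not depend on the position itself.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; even dims use sin, odd dims use cos."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift(pe, k, d_model):
    """Compute PE(pos + k) from PE(pos) using only k (angle-addition identities)."""
    out = []
    for i in range(d_model // 2):
        w = 1 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.extend([s * math.cos(k * w) + c * math.sin(k * w),
                    c * math.cos(k * w) - s * math.sin(k * w)])
    return out

pe0 = positional_encoding(0, 8)
pe5 = positional_encoding(5, 8)
pe8_from_pe5 = shift(pe5, 3, 8)
```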
Compare Self-Attention and Recurrent layers regarding Complexity per Layer and Sequential Operations.
Complexity per Layer:
Self-Attention: $O(N^2 \cdot d)$ (where $N$ is sequence length and $d$ is dimension).
Recurrent: $O(N \cdot d^2)$
Sequential Operations:
Self-Attention: O(1) (Allows massive parallelization).
Recurrent: O(N) (Forces step-by-step dependency processing).
Compare Self-Attention, Recurrent, and Convolutional layers regarding Maximum Path Length for long-range dependencies.
Self-Attention: O(1) (Direct constant step-size between any two tokens in the sequence).
Recurrent: O(N) (Signals must traverse the entire length of the recurrent chain).
Convolutional: $O(\log_K(N))$ (Traversed hierarchically through layers based on kernel width $K$).
Write out the exact mathematical Learning Rate Schedule implemented with the Adam optimizer.
Formula:
$lrate = d_{model}^{-0.5} \cdot \min\!\left(\mathit{step\_num}^{-0.5},\ \mathit{step\_num} \cdot \mathit{warmup\_steps}^{-1.5}\right)$
The schedule increases the learning rate linearly for the first warmup_steps=4000, then drops it proportionally to the inverse square root of the training step number.
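The schedule is a one-liner in code. The sketch below uses the values stated above ($d_{model} = 512$, 4000 warmup steps) as defaults and assumes steps are counted from 1.

```python
def lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step_num >= 1)."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

peak = lrate(4000)        # the two arms of min() meet at step == warmup_steps
early = lrate(1000)       # still in the linear warmup phase
late = lrate(16000)       # inverse-square-root decay phase
```

During warmup the rate is proportional to the step number, so `lrate(2000)` is twice `lrate(1000)`; after the peak, quadrupling the step number halves the rate.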
Detail the two core Regularization techniques used to stabilize Transformer training.
1. Residual Dropout ($P_{drop} = 0.1$): Applied directly to the output of each sub-layer before addition and normalization, as well as to the combined sums of embeddings and positional encodings.
2. Label Smoothing ($\epsilon_{ls} = 0.1$): Intentionally pushes the model to be less certain of its categorical predictions. While this hurts training perplexity, it consistently improves overall evaluation accuracy and final BLEU scores.
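As an illustration of what label smoothing does to the training target, here is a sketch of one common formulation, which mixes the one-hot target with a uniform distribution over all classes. That particular mixing rule is an assumption; the text above only states the value of $\epsilon_{ls}$.

```python
def smooth_labels(true_index, num_classes, eps=0.1):
    """(1 - eps) * one_hot + eps / num_classes (uniform smoothing)."""
    uniform = eps / num_classes
    return [(1 - eps) + uniform if i == true_index else uniform
            for i in range(num_classes)]

target = smooth_labels(true_index=2, num_classes=5)
# The correct class gets about 0.92 and every other class about 0.02.
```

Because the target never reaches probability 1, the model is never rewarded for pushing its logits toward extreme confidence.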
What architectural adjustments to Heads and Attention Dimensions were empirically discovered in the ablation studies?
Altering Heads: Setting $h = 8$ was found to be optimal. When the head count is varied while the total amount of computation is held constant, quality degrades both with too few heads and with too many.
Altering Attention Dimensions: Reducing $d_k$ hurts translation quality, which suggests that determining query-key compatibility is not easy and benefits from a sufficiently high-dimensional projection space.
What is the key structural characteristic of token processing in an autoregressive Transformer Decoder layer?
Every token is processed through its own distinct structural column. For a sequence length of $N$ tokens, the architecture maps an entire window of input vectors $(x_1, \dots, x_N)$ simultaneously to an equivalent window of output vectors $(h_1, \dots, h_N)$ of the exact same length.
Explain how the Language Modeling Head (Unembedding Matrix U) converts the final block output into a next-token prediction.
The output embedding from the final transformer block's column is passed through a linear unembedding matrix $U$ (which projects the $d$-dimensional vector onto the vocabulary) and a softmax function:
$p(y_{t+1} \mid y_{1:t}) = \mathrm{softmax}(h_t U)$
This produces a valid probability distribution over all possible next tokens.
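A minimal sketch of the language modeling head, assuming toy sizes (model dimension 16, vocabulary of 50) and a random unembedding matrix. Picking the argmax is just one decoding choice.

```python
import numpy as np

d, vocab_size = 16, 50                  # toy sizes, chosen for illustration
rng = np.random.default_rng(0)
h_t = rng.normal(size=d)                # final-block output for the last position
U = rng.normal(size=(d, vocab_size))    # unembedding matrix

logits = h_t @ U                        # one score per vocabulary entry
probs = np.exp(logits - logits.max())   # numerically stable softmax
probs /= probs.sum()
next_token = int(np.argmax(probs))      # greedy choice; sampling is also common
```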
Describe the Residual Stream Viewpoint of information processing within stacked transformer blocks.
Instead of viewing blocks as sequential transformations, the residual stream treats the network as a continuous communication highway. Processing components (Attention, FFN) read their inputs from the stream, perform their operations, and write their outputs back into it via skip connections.
At early blocks, the stream represents the current token; at the highest blocks, it shifts to represent the following token.
State the exact mathematical shapes of the learned weight matrices ($W^{Qc}$, $W^{Kc}$, $W^{Vc}$, $W^O$) and outputs for a single attention head $c$ given model dimension $d$, head query/key dimension $d_k$, and head value dimension $d_v$.
$W^{Qc}$ shape: $[d \times d_k]$
$W^{Kc}$ shape: $[d \times d_k]$
$W^{Vc}$ shape: $[d \times d_v]$
$\mathrm{head}_i^c$ shape: $[1 \times d_v]$
Concatenated Heads shape: $[1 \times A \cdot d_v]$ (where $A$ is head count)
$W^O$ shape: $[A \cdot d_v \times d]$
Why is the final projection matrix WO in Multi-Head Attention typically a square matrix of shape [d×d]?
To maintain modular consistency across blocks, the value dimension $d_v$ is typically set to $d/A$. This causes the concatenated heads matrix shape $[1 \times A \cdot d_v]$ to simplify to $[1 \times d]$. $W^O$ is then structured as a square matrix of shape $[d \times d]$ to project the unified heads back into the model dimension.
What is the critical distinction regarding how Layer Normalization is applied in a Transformer compared to normalization in sequence processing?
Despite its name, Layer Normalization in a transformer is applied individually to the embedding vector of a single token column at a specific time step. It does not compute statistics across the temporal multi-token sequence layer.
Write out the complete mathematical operations for Layer Normalization on a single token embedding vector x of dimensionality d.
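1. Mean ($\mu$): $\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$
2. Standard Deviation ($\sigma$): $\sigma = \sqrt{\frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2}$
3. Z-Score Normalization: $\hat{x} = \frac{x - \mu}{\sigma}$
4. Gain ($\gamma$) & Offset ($\beta$) scaling:
$\mathrm{LayerNorm}(x) = \gamma \frac{x - \mu}{\sigma} + \beta$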
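The four steps map onto a short function. The `eps` argument is an addition not present in the formulas above: practical implementations usually add a small constant under the square root to avoid division by zero, and it defaults to 0 here to match the equations exactly.

```python
import math

def layer_norm(x, gamma, beta, eps=0.0):
    """Layer normalization of a single token vector x (a list of d floats)."""
    d = len(x)
    mu = sum(x) / d                                               # step 1
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / d + eps)  # step 2
    # Steps 3 and 4: z-score, then per-dimension gain and offset.
    return [g * (xi - mu) / sigma + b for xi, g, b in zip(x, gamma, beta)]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

With unit gain and zero offset, the output has mean 0 and standard deviation 1 across the vector's own dimensions; no other token is involved.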
Detail the step-by-step mathematical flow of a modern Pre-Norm Architecture block for token i.
1. $t_i^1 = \text{LayerNorm}(x_i)$
2. $t_i^2 = \text{MultiHeadAttention}(t_i^1, [t_1^1, \dots, t_N^1])$
3. $t_i^3 = t_i^2 + x_i$ (First Residual Connection)
4. $t_i^4 = \text{LayerNorm}(t_i^3)$
5. $t_i^5 = \text{FFN}(t_i^4)$
6. $h_i = t_i^5 + t_i^3$ (Second Residual Connection)
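The six steps can be sketched for a whole sequence at once. This is a simplified illustration under several assumptions: a single causal attention head stands in for MultiHeadAttention, LayerNorm uses gain 1 and offset 0, biases are omitted, and all weights are random.

```python
import numpy as np

def layer_norm(x):
    # Per-token normalization over the feature axis (gain 1, offset 0 for brevity).
    return (x - x.mean(axis=-1, keepdims=True)) / x.std(axis=-1, keepdims=True)

def self_attention(t, Wq, Wk, Wv, Wo):
    # Single causal head standing in for MultiHeadAttention.
    Q, K, V = t @ Wq, t @ Wk, t @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V @ Wo

def ffn(t, W1, W2):
    return np.maximum(0, t @ W1) @ W2

def pre_norm_block(X, p):
    t1 = layer_norm(X)                                            # step 1
    t2 = self_attention(t1, p["Wq"], p["Wk"], p["Wv"], p["Wo"])   # step 2
    t3 = t2 + X                                                   # step 3
    t4 = layer_norm(t3)                                           # step 4
    t5 = ffn(t4, p["W1"], p["W2"])                                # step 5
    return t5 + t3                                                # step 6

N, d = 4, 8
rng = np.random.default_rng(0)
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
          "W1": (d, 4 * d), "W2": (4 * d, d)}
p = {name: rng.normal(size=s) * 0.1 for name, s in shapes.items()}
X = rng.normal(size=(N, d))
H = pre_norm_block(X, p)
```

Note that both residual additions use un-normalized vectors (`X` and `t3`), which is what distinguishes Pre-Norm from the original Post-Norm layout.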
What final architectural layer requirement is mandatory when using a Pre-Norm block stack layout?
Because Pre-Norm leaves the final block output unnormalized, an extra single standalone Layer Norm block must be executed on the final $h_i$ vector at the very top of the last block, right before passing the signal to the Language Modeling Head.
Explain how the inputs and operations of an entire token sequence of length $N$ are packed to compute attention projections simultaneously.
Embeddings for N input tokens are packed into a single matrix X of size [N×d], where each row represents a token embedding. Multiplying X directly by the projection matrices generates the combined sequence matrices Q, K, and V simultaneously:
$Q = X W^Q$ [Shape: $N \times d_k$]
$K = X W^K$ [Shape: $N \times d_k$]
$V = X W^V$ [Shape: $N \times d_v$]
What matrix multiplication computes all pairwise token similarity scores across a context window simultaneously, and what is its resulting tensor shape?
Multiplying the query matrix $Q$ by the transpose of the key matrix, $K^\top$:
$\mathrm{Scores} = Q K^\top$
This operation evaluates all pairwise token comparisons in parallel, resulting in a square matrix of shape $[N \times N]$ (where $N$ is the sequence context length).
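The packed computation and the resulting shapes can be verified with toy sizes. The dimensions below ($N = 5$, $d = 16$, $d_k = d_v = 4$) and the random matrices are assumptions for illustration.

```python
import numpy as np

N, d, d_k, d_v = 5, 16, 4, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))          # one row per token embedding
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V  # [N, d_k], [N, d_k], [N, d_v]
scores = Q @ K.T                     # [N, N]; entry (i, j) compares token i with token j
```

Each entry `scores[i, j]` equals the dot product of row `i` of `Q` with row `j` of `K`, so a single matrix product replaces $N^2$ separate comparisons.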