Vocabulary-style flashcards covering key terms and concepts from the transformer and GPT lecture notes.
GPT
Generative Pretrained Transformer; a family of models trained on massive amounts of data to predict the next token in a sequence, generating text one token at a time.
Transformer
A neural network architecture built around attention and feed-forward (MLP) blocks that processes input tokens to produce context-rich representations; core to the modern AI boom.
Token
A piece of input (word, subword, punctuation, or patch) that is mapped to a vector in the model’s embedding space.
Embedding
The process of converting a token into a high-dimensional vector; the embedding space encodes semantic relationships between tokens.
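A toy sketch of what "semantic relationships" means in practice, using made-up 3-D vectors (real embeddings have thousands of dimensions) and the classic king/queen analogy:

```python
import numpy as np

# Made-up 3-D "embeddings" purely for illustration; real models use thousands of dims.
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Directions can encode relationships: king - man + woman lands near queen.
target = emb["king"] - emb["man"] + emb["woman"]
print(max(emb, key=lambda w: cosine(emb[w], target)))  # -> queen
```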
Embedding matrix (W_E)
The matrix that maps each token in the vocabulary to its initial embedding vector; GPT-3 example: 50,257 tokens × 12,288 dimensions.
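A minimal sketch of the row lookup W_E performs; the sizes and token IDs below are placeholders, scaled down from GPT-3's 50,257 × 12,288:

```python
import numpy as np

# Tiny stand-in for W_E; GPT-3's is 50,257 x 12,288 (~617 million parameters in this matrix alone).
vocab_size, d_model = 1_000, 64
W_E = np.random.randn(vocab_size, d_model) * 0.02

token_ids = [7, 42, 911]   # placeholder IDs, not real tokenizer output
X = W_E[token_ids]         # row lookup: each token ID selects its initial embedding vector
print(X.shape)             # (3, 64)
```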
Embedding dimension
The dimensionality of token vectors in the embedding space; GPT-3 uses 12,288 dimensions.
Unembedding matrix (W_U)
The final projection matrix that maps a context vector to logits over the vocabulary; roughly a mirror of the embedding matrix, with one row per token in the vocabulary.
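A sketch of how W_U turns one context vector into logits, reusing the same toy sizes as the embedding sketch above:

```python
import numpy as np

vocab_size, d_model = 1_000, 64               # toy sizes; GPT-3 uses 50,257 x 12,288
W_U = np.random.randn(vocab_size, d_model) * 0.02

context_vector = np.random.randn(d_model)     # the final, context-rich vector at one position
logits = W_U @ context_vector                 # one raw score per token in the vocabulary
print(logits.shape)                           # (1000,)
```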
Vocabulary / Tokens
The set of all possible tokens the model can recognize; GPT-3's vocabulary contains 50,257 tokens.
Context size
The number of tokens the model can process in one forward pass; GPT-3 has a context size of 2048.
Attention block
A mechanism that lets token representations communicate and update each other’s meanings based on context.
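A minimal single-head self-attention sketch in NumPy with toy sizes; it leaves out the causal mask, multiple heads, and the output projection that real GPT attention blocks include:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Every token's vector is updated with a weighted mix of value vectors,
    where the weights come from how well its query matches other tokens' keys."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # relevance of every token to every other token
    weights = softmax(scores, axis=-1)        # each row is a probability distribution
    return weights @ V                        # context-dependent update per token

seq_len, d_model, d_head = 5, 16, 8           # toy sizes
X = np.random.randn(seq_len, d_model)
W_Q, W_K, W_V = (np.random.randn(d_model, d_head) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)  # (5, 8)
```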
Multi-layer perceptron (MLP) / feed-forward layer
A per-token, parallel processing block that applies the same transformation to all tokens, without inter-token communication.
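A sketch of the per-token feed-forward idea: the same weights are applied to every position independently. ReLU is used here for brevity (GPT models typically use GELU), and all sizes are toy values:

```python
import numpy as np

def mlp_block(X, W1, b1, W2, b2):
    """Apply the same two-layer transformation to every token vector; no token
    looks at any other token inside this block."""
    hidden = np.maximum(0, X @ W1 + b1)    # ReLU for brevity; GPT models use GELU
    return hidden @ W2 + b2

seq_len, d_model, d_hidden = 5, 16, 64     # the hidden layer is typically ~4x d_model
X = np.random.randn(seq_len, d_model)
W1, b1 = np.random.randn(d_model, d_hidden), np.zeros(d_hidden)
W2, b2 = np.random.randn(d_hidden, d_model), np.zeros(d_model)
print(mlp_block(X, W1, b1, W2, b2).shape)  # (5, 16)
```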
Weights / Parameters
Learned matrices and vectors that parameterize the model; GPT-3 has about 175 billion parameters, organized into a handful of weight-matrix categories (embedding, attention, MLP, and unembedding weights).
Matrix-vector multiplication
The fundamental computation in neural networks where a weight matrix multiplies an input vector to produce an output vector.
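A small worked example with made-up numbers:

```python
import numpy as np

W = np.array([[1.0, 2.0],      # 3x2 weight matrix (made-up values)
              [0.0, 1.0],
              [4.0, 3.0]])
v = np.array([2.0, 1.0])       # 2-D input vector

out = W @ v                    # each output entry is the dot product of one row of W with v
print(out)                     # [ 4.  1. 11.]
```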
Dot product
A measure of alignment between two vectors; positive when directions align, zero when orthogonal, negative when opposite.
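Three tiny examples of the sign of a dot product, with made-up vectors:

```python
import numpy as np

a = np.array([1.0, 2.0])

print(a @ np.array([2.0, 1.0]))    #  4.0 -> positive: directions broadly agree
print(a @ np.array([2.0, -1.0]))   #  0.0 -> zero: the vectors are orthogonal
print(a @ np.array([-1.0, -2.0]))  # -5.0 -> negative: directions point oppositely
```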
Softmax
A function that converts a vector of scores (logits) into a probability distribution by exponentiating and normalizing so all values sum to 1.
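A minimal NumPy implementation of softmax (the max subtraction is a standard trick for numerical stability, not part of the definition):

```python
import numpy as np

def softmax(logits):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())               # approx. [0.659 0.242 0.099] 1.0
```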
Temperature
A knob in softmax that controls distribution sharpness; higher temperature spreads probability more evenly across tokens, lower temperature concentrates it on the highest-scoring ones, and at T = 0 all of the weight goes to the maximum.
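The same softmax with a temperature knob added; the example values show the flattening and sharpening effect (T = 0 is handled separately in practice to avoid dividing by zero):

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Divide the logits by T before softmax: higher T flattens the
    distribution, lower T sharpens it toward the largest logit."""
    scaled = np.asarray(logits) / T
    scaled = scaled - scaled.max()
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, T=1.0))   # approx. [0.66 0.24 0.10]
print(softmax_with_temperature(logits, T=2.0))   # flatter: approx. [0.50 0.30 0.19]
print(softmax_with_temperature(logits, T=0.5))   # peakier: approx. [0.86 0.12 0.02]
```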
Logits
Raw, unnormalized scores produced before applying softmax for the next-token prediction.
System prompt
Initial context that defines the role of a chatbot or AI assistant and guides its responses.
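An illustrative sketch only; the exact format varies by provider, but this mirrors the common convention of role-tagged messages, where the system prompt is simply extra context placed before the user's input:

```python
# Illustrative only: chat formats differ by provider; this mirrors the common
# "list of role-tagged messages" convention.
messages = [
    {"role": "system", "content": "You are a helpful assistant that answers concisely."},
    {"role": "user",   "content": "What is a transformer?"},
]
# Under the hood, the system prompt is just extra tokens at the start of the
# context, so it shapes every next-token prediction the model makes.
```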
DALL-E / Midjourney
Tools that use transformer-based models to generate images from text descriptions.
Backpropagation
The training algorithm that propagates the error backward through the network to compute the gradients used to update the weights.
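A one-parameter illustration of the underlying idea: compute a loss, use the chain rule to get the gradient of that loss with respect to a weight, and nudge the weight in the direction that reduces the loss (real backpropagation repeats this through every layer of the network):

```python
# One weight, one input, one target; all values are made up for illustration.
w, x, y, lr = 0.5, 2.0, 3.0, 0.1

pred = w * x                   # forward pass: 1.0
loss = (pred - y) ** 2         # squared error: (1.0 - 3.0)^2 = 4.0
grad_w = 2 * (pred - y) * x    # chain rule: dL/dw = 2 * (pred - y) * x = -8.0
w -= lr * grad_w               # gradient-descent update
print(w)                       # 1.3
```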
Next-token prediction
The primary objective of many language models: predict the most likely next token given the preceding context, then sample or select accordingly.
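A sketch of the generation loop this objective enables. The `model` argument is a hypothetical stand-in for a trained network that returns logits over the vocabulary; the toy model below just returns random scores so the snippet runs:

```python
import numpy as np

def generate(model, token_ids, n_new_tokens, temperature=1.0):
    """Repeatedly turn the model's logits into a distribution, sample the
    next token, append it, and feed the longer sequence back in."""
    for _ in range(n_new_tokens):
        logits = model(token_ids)                        # one score per vocabulary entry
        scaled = logits / temperature
        probs = np.exp(scaled - np.max(scaled))
        probs /= probs.sum()
        next_id = int(np.random.choice(len(probs), p=probs))
        token_ids = token_ids + [next_id]
    return token_ids

vocab_size = 10
toy_model = lambda ids: np.random.randn(vocab_size)      # hypothetical stand-in model
print(generate(toy_model, [1, 2, 3], n_new_tokens=5))
```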