Flashcards covering key concepts from Lecture 21, including decoding strategies, training techniques, fine-tuning methods, and model architectures.
Greedy Decoding
Decoding method that selects the most probable word at each step.
Alternatives to Greedy Decoding
Beam search and sampling-based methods, used because greedy decoding tends to produce repetitive, generic text.
Autoregressive Generation
Generating text one word at a time, conditioned on the previous words.
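A minimal sketch of greedy autoregressive generation, assuming a Hugging Face-style causal LM (`model(input_ids).logits`) and tokenizer; all identifiers here are illustrative.

```python
import torch

def greedy_generate(model, tokenizer, prompt, max_new_tokens=50):
    # Encode the prompt, then generate one token at a time, conditioning on
    # everything generated so far.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits               # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)      # greedy: most probable token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0])
```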
Random Sampling
Sampling words in proportion to their probability under the model's distribution; more diverse than greedy decoding, but can pick odd, low-probability words.
Top-k Sampling
Truncating the distribution to the top k words, renormalizing, and sampling.
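A minimal top-k sampling sketch in PyTorch, assuming `logits` is a 1-D tensor over the vocabulary for the current step:

```python
import torch

def top_k_sample(logits, k=50):
    # Keep only the k highest-scoring tokens, renormalize, and sample.
    topk_logits, topk_ids = torch.topk(logits, k)
    probs = torch.softmax(topk_logits, dim=-1)         # renormalize over the top k
    choice = torch.multinomial(probs, num_samples=1)   # sample an index into the top k
    return topk_ids[choice]
```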
Nucleus (Top-p) Sampling
Sampling from the smallest set of words whose cumulative probability mass exceeds p, after renormalizing over that set.
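A minimal nucleus-sampling sketch, again assuming a 1-D `logits` tensor for the current step:

```python
import torch

def top_p_sample(logits, p=0.9):
    # Sample from the smallest set of tokens whose cumulative probability exceeds p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Mask tokens outside the nucleus; the top token is always kept.
    outside_nucleus = cumulative - sorted_probs > p
    sorted_probs[outside_nucleus] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_ids[choice]
```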
Temperature Sampling
Adjusting the softmax distribution to control randomness; lower T is more deterministic, higher T is more random.
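A minimal temperature-sampling sketch (1-D `logits` assumed):

```python
import torch

def temperature_sample(logits, temperature=0.7):
    # Divide logits by T before the softmax: T < 1 sharpens the distribution
    # (more deterministic), T > 1 flattens it (more random).
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```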
Hybrid Decoding
Combining greedy, top-k, top-p, and temperature sampling dynamically.
Entropy-Aware Sampling
Adjusting temperature dynamically based on token entropy (uncertainty).
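One plausible rule, shown as a sketch (an assumption, not necessarily the lecture's exact scheme): scale the temperature with the normalized entropy of the current step's distribution.

```python
import torch

def entropy_aware_sample(logits, base_temperature=0.7, scale=0.3):
    # Measure uncertainty at this step via the entropy of the predictive
    # distribution, then adjust the temperature accordingly (hypothetical rule).
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum()
    max_entropy = torch.log(torch.tensor(float(logits.numel())))
    temperature = base_temperature + scale * (entropy / max_entropy)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```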
Speculative Decoding
Using a smaller model to generate tokens quickly, then verifying with a larger model.
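A simplified greedy-verification sketch (the full algorithm accepts or rejects draft tokens probabilistically); `draft_model` and `target_model` are assumed to expose a Hugging Face-style `.logits`:

```python
import torch

def speculative_step(draft_model, target_model, input_ids, num_draft=4):
    # The small draft model proposes several tokens greedily.
    draft_ids = input_ids
    for _ in range(num_draft):
        next_id = draft_model(draft_ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # One forward pass of the large target model scores all proposals at once;
    # keep draft tokens only up to the first disagreement.
    target_logits = target_model(draft_ids).logits
    start = input_ids.shape[1]
    accepted = input_ids
    for i in range(start, draft_ids.shape[1]):
        target_choice = target_logits[:, i - 1, :].argmax(dim=-1, keepdim=True)
        if target_choice.item() != draft_ids[:, i].item():
            accepted = torch.cat([accepted, target_choice], dim=-1)  # correct and stop
            break
        accepted = torch.cat([accepted, draft_ids[:, i:i + 1]], dim=-1)
    return accepted
```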
Contrastive Decoding / Guidance
Combining outputs from a main model and a guidance model, accepting tokens only if both agree.
Self-Supervised Training
Training a model to predict the next word in a text corpus.
Teacher Forcing
Providing the model with the correct history sequence to predict the next word.
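A minimal sketch of the teacher-forced training loss: the gold token sequence, shifted by one position, serves as both input and target.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(model, token_ids):
    # Inputs are gold tokens [0 .. n-2]; targets are gold tokens [1 .. n-1],
    # so every position predicts the next word given the correct history.
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs).logits                      # (batch, seq_len-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```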
Parallel Training
Processing every position of a training sequence in parallel (rather than one step at a time), using the full context window.
Sequence Length Limitation
Input length is limited because positional embeddings are trained for a fixed maximum length; workarounds include truncation and a sliding window.
Training Data Sources
Using web data, Wikipedia, books, and curated corpora like The Pile for training.
Training Data Considerations
Addressing privacy, toxicity, copyright, and consent issues in training data.
Fine-tuning
Adapting a pre-trained model to a specific domain, dataset, or task.
Continued Pretraining
Retraining all model parameters on new data using the same pretraining method.
PEFT - Parameter-Efficient Fine-Tuning
Freezing some of the model's parameters (e.g., the deeper layers) and training only the rest, improving the parameter efficiency of fine-tuning.
Fine-tuning with extra head
Adding an extra head on top of the model for specific tasks.
SFT - Supervised Fine-Tuning
Fine-tuning a pretrained model on labeled input-output pairs, such as instruction-response examples.
Low-Rank Adaptation (LoRA)
Freezing the pretrained weights and training a pair of low-rank matrices whose product is added to them.
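A minimal LoRA-style wrapper around a frozen linear layer; `r` and `alpha` stand for the usual rank and scaling hyperparameters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wrap a frozen pretrained linear layer and add a trainable low-rank update:
    # output = W x + (alpha / r) * B A x, with A of shape (r, in) and B of shape (out, r).
    def __init__(self, pretrained: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, pretrained.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(pretrained.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```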
Scaling Laws
Relationships between loss, model size, dataset size, and compute budget.
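For reference, the Chinchilla-style parametric form often used to express such laws, with N the number of parameters, D the number of training tokens, and E, A, B, α, β empirically fitted constants (whether the lecture uses this exact form is an assumption):

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```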
KV Cache
Storing key-value pairs from the attention computation to avoid recomputation during inference.
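A minimal sketch of greedy generation with a KV cache, assuming a Hugging Face-style causal LM that accepts `past_key_values` and `use_cache`:

```python
import torch

def generate_with_kv_cache(model, input_ids, max_new_tokens=50):
    # After the first forward pass, keep the attention keys/values and feed
    # only the newly generated token on each subsequent step.
    past_key_values = None
    generated = input_ids
    next_input = input_ids
    for _ in range(max_new_tokens):
        out = model(next_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values           # cached K/V for all layers
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_id], dim=-1)
        next_input = next_id                            # only the new token is recomputed
    return generated
```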
RLHF on specific tasks
Aligning the model with human preferences via reinforcement learning from human feedback, applied to tasks such as question answering and instruction following.
Masked Language Model (MLM) vs. Autoregressive (Causal) Language Model
An MLM (BERT-style) predicts masked tokens using context on both sides; an autoregressive (causal) LM (GPT-style) predicts the next token from the left context only.
Causal vs Bidirectional self-attention layer
A causal self-attention layer attends only to earlier positions; a bidirectional self-attention layer attends to all positions in the input.
Masking Words
Randomly selected input tokens are masked during training, and the model learns to predict the original tokens.
Next Sentence Prediction
Predicting whether a pair of sentences is actually adjacent in the corpus or randomly paired.
Contextual Embeddings
The output of a BERT-style model is a contextual embedding vector for each input token, representing the token's meaning in its context.
Sequence Classification
The output vector for the [CLS] token serves as input to a simple classifier.
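A minimal sketch of a [CLS]-based classifier, assuming a Hugging Face-style BERT encoder that returns `last_hidden_state`:

```python
import torch.nn as nn

class BertSequenceClassifier(nn.Module):
    # Use the encoder's output vector at the [CLS] position (index 0)
    # as the input to a small classification head.
    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        cls_vector = hidden[:, 0, :]        # contextual embedding of the [CLS] token
        return self.head(cls_vector)        # logits over the label set
```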
Sequence Labelling
The output vector for each input token is passed to a simple k-way classifier to assign a label (e.g., a POS or NER tag) to that token.
Different encoder/decoder stacks
There are encoder-decoder (e.g., T5), encoder-only (e.g., BERT), and decoder-only (e.g., GPT) models.