Transformers to ChatGPT

Description and Tags

Flashcards covering key concepts from Lecture 21, including decoding strategies, training techniques, fine-tuning methods, and model architectures.

34 Terms

1

Greedy Decoding

Decoding method that selects the most probable word at each step.
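
A minimal sketch of greedy decoding in NumPy, assuming a hypothetical `model(tokens)` callable that returns a vector of next-token logits:

```python
import numpy as np

def greedy_decode(model, prompt_tokens, max_new_tokens, eos_id=None):
    """Pick the single most probable token at every step (no sampling)."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)            # assumed: (vocab_size,) logits for the next token
        next_id = int(np.argmax(logits))  # greedy choice: highest-scoring token
        tokens.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return tokens
```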

2

Alternatives to Greedy Decoding

Decoding methods other than always picking the single most probable word, including beam search (keeping several candidate sequences) and sampling-based decoding.

3

Autoregressive Generation

Generating text one word at a time, conditioned on the previous words.

4

Random Sampling

Sampling the next word at random from the model's output distribution, balancing quality against diversity.

5

Top-k Sampling

Truncating the distribution to the top k words, renormalizing, and sampling.
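
A sketch of one top-k decoding step (NumPy only; `logits` is assumed to be the model's raw scores for the next token):

```python
import numpy as np

def top_k_sample(logits, k=50, rng=None):
    """Keep only the k highest-scoring tokens, renormalize, and sample one of them."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    top_idx = np.argpartition(logits, -k)[-k:]     # indices of the k largest logits
    top_logits = logits[top_idx]
    probs = np.exp(top_logits - top_logits.max())  # softmax over the truncated set only
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))
```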

6

Nucleus (Top-p) Sampling

Sampling from the smallest set of words whose cumulative probability exceeds p, after renormalizing over that set.
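
A sketch of one nucleus-sampling step (NumPy; `logits` assumed to be the model's next-token scores):

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Sample from the smallest set of tokens whose cumulative probability reaches p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # tokens sorted by descending probability
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    nucleus = order[:cutoff]                           # just enough tokens to cover mass p
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```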

7

Temperature Sampling

Adjusting the softmax distribution to control randomness; lower T is more deterministic, higher T is more random.
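
A sketch of temperature scaling; dividing the logits by T before the softmax is the only change relative to plain random sampling:

```python
import numpy as np

def temperature_sample(logits, temperature=1.0, rng=None):
    """T < 1 sharpens the distribution (more deterministic); T > 1 flattens it (more random)."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```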

8

Hybrid Decoding

Combining greedy, top-k, top-p, and temperature sampling dynamically.

9

Entropy-Aware Sampling

Adjusting temperature dynamically based on token entropy (uncertainty).
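
One illustrative way to realize this idea (the exact schedule below is an assumption for the sketch, not necessarily the lecture's recipe): map the normalized entropy of the next-token distribution onto a temperature range.

```python
import numpy as np

def entropy_scaled_temperature(logits, t_min=0.3, t_max=1.2):
    """Low entropy (model is confident) -> low T; high entropy (uncertain) -> high T."""
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    frac = entropy / np.log(len(probs))   # normalize by the entropy of a uniform distribution
    return t_min + frac * (t_max - t_min)
```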

10

Speculative Decoding

Using a smaller model to generate tokens quickly, then verifying with a larger model.
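
A heavily simplified sketch using hypothetical `draft_model` and `target_model` callables that return next-token probability vectors. The real algorithm verifies all drafted positions in a single forward pass of the large model and resamples from an adjusted distribution on rejection; both refinements are omitted here for brevity.

```python
import numpy as np

def speculative_step(draft_model, target_model, tokens, n_draft=4, rng=None):
    """Draft a few tokens with the small model, then accept or reject them with the large one."""
    rng = rng or np.random.default_rng()
    ctx, drafted, draft_p = list(tokens), [], []
    for _ in range(n_draft):                     # cheap autoregressive drafting
        q = draft_model(ctx)                     # (vocab,) probabilities from the small model
        t = int(rng.choice(len(q), p=q))
        drafted.append(t)
        draft_p.append(q[t])
        ctx.append(t)
    accepted = []
    for t, q_t in zip(drafted, draft_p):         # verification by the large model
        p = target_model(tokens + accepted)      # (vocab,) probabilities from the large model
        if rng.random() < min(1.0, p[t] / q_t):
            accepted.append(t)                   # large model agrees often enough: keep it
        else:
            accepted.append(int(rng.choice(len(p), p=p)))  # otherwise resample and stop
            break
    return tokens + accepted
```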

11

Contrastive Decoding / Guidance

Combining outputs from a main model and a guidance model, accepting tokens only if both agree.

12

Self-Supervised Training

Training a model to predict the next word in a text corpus.

13

Teacher Forcing

Providing the model with the correct history sequence to predict the next word.
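
A minimal sketch of how teacher forcing shapes the training pairs: the model always conditions on the gold history, and the target is the same sequence shifted one position to the left.

```python
import numpy as np

def teacher_forcing_pair(token_ids):
    """Build (input, target) arrays from one training sequence."""
    token_ids = np.asarray(token_ids)
    inputs  = token_ids[:-1]   # w_1 ... w_{n-1}: the correct history at every position
    targets = token_ids[1:]    # w_2 ... w_n:     the next word to predict at every position
    return inputs, targets

# e.g. the sequence [5, 17, 9, 2] yields inputs [5, 17, 9] and targets [17, 9, 2]
```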

14

Parallel Training

Computing the loss for every position in the context window in parallel within a single forward pass, since teacher forcing supplies the correct history at each position.

15

Sequence Length Limitation

Limiting input length due to positional embeddings trained for a specific length; solutions include truncation and sliding window.

16

Training Data Sources

Using web data, Wikipedia, books, and curated corpora like The Pile for training.

17

Training Data Considerations

Addressing privacy, toxicity, copyright, and consent issues in training data.

18

Fine-tuning

Adapting a pre-trained model to a specific domain, dataset, or task.

19

Continued Pretraining

Continuing to train all model parameters on new data with the same objective used for pretraining.

20

PEFT - Parameter-Efficient Fine-Tuning

Freezing some of the pretrained parameters (typically the deeper layers) and updating only the rest, so fine-tuning requires far fewer trainable parameters.

21

Fine-tuning with extra head

Adding a task-specific head on top of the pretrained model (e.g., a classification layer) and training it for the target task.

22

SFT - Supervised Fine-Tuning

Fine-tuning a pretrained model on labeled input-output pairs (for example, instruction-response examples) using a supervised objective.

23

Low-Rank Adaptation (LoRA)

Freezing the pretrained weights and instead training a pair of small low-rank matrices whose product is added to them.
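
A forward-pass sketch of the idea in NumPy (in practice the low-rank pair sits inside the attention and MLP projections and is trained with backpropagation; only the shapes and the frozen/trainable split are shown here):

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, W, rank=8, alpha=16, rng=None):
        rng = rng or np.random.default_rng()
        d_out, d_in = W.shape
        self.W = W                                    # frozen: never updated during fine-tuning
        self.A = rng.normal(0.0, 0.01, (rank, d_in))  # trainable, rank x d_in
        self.B = np.zeros((d_out, rank))              # trainable, d_out x rank (starts at zero)
        self.scale = alpha / rank

    def forward(self, x):
        # Effective weight is W + scale * (B @ A); the low-rank pair has far fewer
        # parameters than a full d_out x d_in update.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```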

24

Scaling Laws

Relationships between loss, model size, dataset size, and compute budget.
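
One widely cited parametric form (the Chinchilla-style fit; the constants are empirical and differ between studies) writes the pretraining loss as

L(N, D) ≈ E + A / N^α + B / D^β

where N is the number of parameters, D the number of training tokens, E the irreducible loss, and A, B, α, β fitted constants.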

25

KV Cache

Storing key-value pairs from the attention computation to avoid recomputation during inference.
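
A toy single-head sketch of the bookkeeping in NumPy (a real cache stores tensors per layer and per attention head):

```python
import numpy as np

class KVCache:
    """Append the key/value of each new token so past positions are never recomputed."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)          # k, v: (d_head,) projections of the newest token
        self.values.append(v)

    def attend(self, q):
        """Attention output for the newest query over all cached positions."""
        K, V = np.stack(self.keys), np.stack(self.values)   # (seq_len, d_head) each
        scores = K @ q / np.sqrt(q.shape[-1])               # (seq_len,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                                  # (d_head,)
```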

26

RLHF on specific tasks

Applying RLHF to improve performance on specific tasks such as question answering and instruction following.

27

Masked Language Model (MLM) vs. Autoregressive (Causal) Language Model

A masked language model (BERT-style) predicts masked-out tokens using context from both directions, while an autoregressive (causal) language model (GPT-style) predicts the next token from the preceding context only.

28

Causal vs Bidirectional self-attention layer

A causal self-attention layer lets each token attend only to earlier positions; a bidirectional self-attention layer lets every token attend to all positions in the input.

29

Masking Words

Selected input tokens are replaced with a [MASK] token during training, and the model is trained to recover the original tokens.
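
A sketch of the masking step (the full BERT recipe also sometimes keeps the original token or substitutes a random one; this simplified version only masks):

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob=0.15, rng=None):
    """Replace a random subset of tokens with [MASK]; the originals become the training targets."""
    rng = rng or np.random.default_rng()
    token_ids = np.asarray(token_ids).copy()
    is_masked = rng.random(len(token_ids)) < mask_prob
    targets = np.where(is_masked, token_ids, -1)   # -1 marks positions ignored by the loss
    token_ids[is_masked] = mask_id
    return token_ids, targets
```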

30

Next Sentence Prediction

Predicting whether a pair of sentences actually appeared adjacent to each other in the corpus or was paired at random (a BERT pretraining objective).

31

Contextual Embeddings

The output of a BERT-style model is a contextual embedding vector for each input token, representing the token's meaning in its surrounding context.

32

Sequence Classification

The output vector for the [CLS] token serves as input to a simple classifier.
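
A sketch of the classification head: a single linear layer plus softmax applied to the [CLS] contextual embedding (`W` and `b` are the head's trainable parameters, assumed given):

```python
import numpy as np

def classify_sequence(cls_vector, W, b):
    """W: (num_classes, hidden), b: (num_classes,), cls_vector: (hidden,)."""
    logits = W @ cls_vector + b
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()        # predicted class distribution for the whole sequence
```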

33

Sequence Labelling

The output vector for each input token is passed to a simple k-way classifier, producing one label per token (e.g., for part-of-speech tagging or named-entity recognition).

34

Different encoder/decoder stacks

There are encoder-decoder (e.g., T5), encoder-only (e.g., BERT), and decoder-only (e.g., GPT) models.