L17: Large Language Models

Description and Tags

Flashcards covering key concepts from the lecture on large language models, including training methods, text generation, sampling strategies, fine-tuning, and optimization techniques.


21 Terms

1

What is the function of decoder-type language models?

Next word prediction. Analogy: Like predicting the next word in a sentence based on what you've already read. Technically, it involves processing a sequence of tokens and predicting the most probable next token, autoregressively generating text one token at a time.

2

What training procedure is commonly used on decoder models?

Masked (causal) attention, where the model can only look backward, trained on large amounts of unlabeled data in a self-supervised approach. Analogy: Learning to continue a story when you can only see what came before. Technically, this is causal language modeling: the model is trained to predict the next token given the preceding tokens, with an attention mask that blocks every position from attending to future positions, as sketched below.
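
A minimal numpy sketch of what "can only look backward" means in matrix form (the sequence length is an illustrative placeholder):

```python
import numpy as np

# Causal attention mask for a sequence of length 5: position i may
# attend only to positions j <= i (the lower triangle), never ahead.
seq_len = 5
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```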

3

What does the last layer of a decoder model look like for text generation?

Dense(units=vocabulary_size, activation="softmax"). Analogy: A final decision-making layer that chooses from a list of possible words. Technically, this layer outputs a probability distribution over the entire vocabulary, indicating the likelihood of each word being the next token, using a softmax activation function to normalize probabilities.
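
A hedged Keras sketch of where that layer sits; the vocabulary size and model width are placeholder values, and the decoder blocks themselves are elided:

```python
import tensorflow as tf

vocabulary_size = 10_000   # placeholder vocabulary size
embedding_dim = 256        # placeholder model width

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocabulary_size, embedding_dim),
    # ... transformer decoder blocks would sit here ...
    tf.keras.layers.Dense(units=vocabulary_size, activation="softmax"),
])
# For each position, the output is a probability distribution over the
# whole vocabulary; the entries along the last axis sum to 1.
```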

4

How does a decoder model generate text in an autoregressive loop?

Process sequence and predict next token, append predicted token to the sequence, process extended sequence, and repeat until end-of-sequence token. Analogy: Writing a story one word at a time, using the previous words to decide what comes next. Technically, the model generates text by iteratively predicting the next token, adding it to the sequence, and using the extended sequence as input for the next prediction until an end-of-sequence token is generated.
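
A minimal sketch of that loop in Python; `model` is a stand-in assumed to return a next-token probability distribution for a given token sequence:

```python
import numpy as np

def generate(model, prompt_tokens, eos_token, max_new_tokens=50):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = model(tokens)               # distribution over the vocabulary
        next_token = int(np.argmax(probs))  # greedy choice (see next card)
        tokens.append(next_token)           # extend the sequence
        if next_token == eos_token:         # stop at end-of-sequence
            break
    return tokens
```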

5

What is a drawback of using a greedy search for token sampling?

Can get stuck in sequence loops because it always selects the token with the highest score, making the model deterministic. Analogy: Always choosing the shiniest object, even if it leads you in circles. Technically, the model becomes overly deterministic because it lacks exploration of alternative tokens, potentially leading to repetitive or nonsensical sequences.

6

What is a drawback of using Beam Search for token sampling?

It is computationally expensive and can still get stuck in loops, although it keeps track of several possible branches of output sequences. Analogy: Exploring multiple paths in a maze, but still potentially getting stuck. Technically, while it explores multiple possible sequences, it does not guarantee avoiding loops and requires significant computational resources due to maintaining multiple candidate sequences.
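
An illustrative beam search using the same stand-in `model` as in the greedy sketch above; the candidate bookkeeping is where the extra computational cost comes from:

```python
import numpy as np

def beam_search(model, prompt_tokens, eos_token,
                beam_width=3, max_new_tokens=20):
    beams = [(list(prompt_tokens), 0.0)]   # (sequence, summed log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_token:       # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            log_probs = np.log(model(seq) + 1e-12)
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((seq + [int(tok)], score + log_probs[tok]))
        # keep only the best branches -- maintaining these is the extra cost
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                     # highest-scoring sequence
```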

7

What problems can arise from using simple random sampling when generating text?

Results can be nonsensical, especially with large vocabularies. Analogy: Like randomly picking words from a dictionary to create a sentence. Technically, sampling freely over the full vocabulary gives the long tail of low-probability tokens a real chance of being chosen; with tens of thousands of tokens, those unlikely picks occur often enough to derail grammar and coherence.

8

What is Top-K sampling?

Sampling among the K tokens with the highest score. Analogy: Choosing from the top K most likely words. Technically, this method restricts the selection to the K tokens with the highest probabilities, reducing the chances of selecting irrelevant tokens and improving coherence.
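
A small numpy sketch of Top-K sampling over a probability vector:

```python
import numpy as np

def top_k_sample(probs, k=10, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    top_indices = np.argsort(probs)[-k:]      # the k highest-scoring tokens
    top_probs = probs[top_indices]
    top_probs = top_probs / top_probs.sum()   # renormalize over the top k
    return int(rng.choice(top_indices, p=top_probs))
```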

9

How does adjusted softmax sampling work?

It adds a temperature parameter (T) to the softmax function, affecting how tokens are sampled. Lower T leads to more determinism, higher T to more randomness/creativity. Analogy: Adjusting the creativity level; a lower setting makes it stick to safer options, while a higher setting encourages more experimental ones. Technically, the temperature parameter modifies the probability distribution, influencing the likelihood of different tokens being selected and controlling the trade-off between exploitation and exploration.
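
Concretely, the temperature divides the logits before the softmax, p_i = exp(z_i / T) / Σ_j exp(z_j / T). A short numpy sketch:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature      # T < 1 sharpens, T > 1 flattens
    scaled = scaled - scaled.max()     # subtract max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, 0.5))  # peaked: near-deterministic
print(softmax_with_temperature(logits, 2.0))  # flat: more random/creative
```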

10

What is an LLM?

A language model with 10^9 (1 billion) or more parameters. Analogy: A massive brain with billions of connections. Technically, the model's capacity to learn complex relationships increases with the number of parameters, allowing it to capture intricate patterns in the data.

11

What are the typical stages in training LLMs?

1. Self-supervised pre-training.
2. Supervised fine-tuning.
3. Continuous fine-tuning.

Analogy: Learning general knowledge, then specializing, then keeping up with new information. Technically, these stages involve training on unlabeled data, adapting to specific tasks with labeled data, and continuously updating the model to maintain its relevance and performance.

12

What kind of data is used for pre-training LLMs?

Data scraped from Wikipedia, GitHub, ArXiv, Stack Overflow, Reddit, Project Gutenberg, etc. Analogy: Gathering knowledge from a vast library of books and articles. Technically, this data provides a diverse range of topics and writing styles for the model to learn from, enabling it to generate coherent and contextually relevant text.

13

What kind of data is used for fine-tuning LLMs?

Annotated data, specific datasets for question answering, and reinforcement learning from human feedback (RLHF). Analogy: Receiving targeted instruction and feedback. Technically, this data is used to align the model's behavior with specific tasks and human preferences, improving its performance on downstream applications.

14

What are some methods for parallelizing the training of LLMs?

Splitting data, weights, layers, and sequences across multiple GPUs. Analogy: Dividing tasks among multiple workers to speed up completion. Technically, this leverages the parallel processing capabilities of multiple GPUs to train the model more efficiently, reducing training time and enabling the use of larger datasets.
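
A toy sketch of the simplest of these, data parallelism, with plain Python standing in for the GPUs; `grad_fn` is an assumed placeholder for a per-shard gradient computation:

```python
import numpy as np

def data_parallel_step(weights, batch, grad_fn, num_devices=4, lr=1e-3):
    shards = np.array_split(batch, num_devices)    # one data shard per "GPU"
    grads = [grad_fn(weights, s) for s in shards]  # computed in parallel in practice
    avg_grad = sum(grads) / num_devices            # all-reduce: average gradients
    return weights - lr * avg_grad                 # every device applies the same update

# Weight, layer, and sequence parallelism instead split the model itself
# (its matrices, its blocks, or the token dimension) across devices.
```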

15

What is the purpose of supervised fine-tuning?

To adapt a pre-trained model to more specific uses, such as chatbots or question answering. Analogy: Tailoring a suit to fit perfectly. Technically, this involves training the model on specific datasets to optimize its performance on particular tasks, improving its accuracy and relevance.

16

Why is full model fine-tuning sometimes problematic?

Hardware requirements (VRAM) and risk of catastrophic forgetting. Analogy: Overloading the brain leading to loss of old memories. Technically, fine-tuning the entire model requires substantial computational resources and can cause the model to forget previously learned information, reducing its generalization ability.
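
Back-of-the-envelope arithmetic for why VRAM is the bottleneck, assuming an illustrative 7B-parameter model trained in float32 with the Adam optimizer:

```python
params = 7e9                 # illustrative 7B-parameter model
bytes_per_value = 4          # float32
copies = 1 + 1 + 2           # weights + gradients + Adam's two moment buffers
print(params * bytes_per_value * copies / 2**30)  # ~104 GiB, before activations
```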

17

What are some techniques for LLM fine-tuning with limited resources?

Prompt tuning and Low-Rank Adaptation (LoRA). Analogy: Learning tricks to improve performance without changing the core. Technically, these techniques allow the model to adapt to new tasks without requiring extensive retraining, reducing computational costs and preserving pre-trained knowledge.

18

How does Low-Rank Adaptation (LoRA) work?

It trains new, small matrices in parallel with the existing attention layers while keeping the original weight matrix frozen. Analogy: Adding a small module to learn new skills without altering the main system. Technically, this reduces the number of trainable parameters and mitigates the risk of catastrophic forgetting while still enabling the model to adapt to new tasks effectively.
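
A minimal numpy sketch of the idea: the frozen weight matrix W is augmented by a trainable low-rank product A·B (the rank and scaling values are illustrative):

```python
import numpy as np

class LoRALinear:
    def __init__(self, frozen_weight, rank=8, alpha=16):
        d_in, d_out = frozen_weight.shape
        self.W = frozen_weight                       # frozen pre-trained weights
        self.A = np.random.randn(d_in, rank) * 0.01  # trainable low-rank factor
        self.B = np.zeros((rank, d_out))             # trainable; zero init makes
                                                     # the adapter a no-op at start
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x(W + scale * AB): only A and B (2 * d * rank values) are
        # updated, instead of the d_in * d_out values in W.
        return x @ self.W + (x @ self.A @ self.B) * self.scale
```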

19

What is quantization in the context of LLMs?

Reducing the memory cost of running inference by reducing numerical precision in weights and activations. Analogy: Compressing a file to save space. Technically, this involves representing numerical values with fewer bits, reducing the memory footprint of the model and enabling deployment on resource-constrained devices.

20

How is quantization used to reduce memory usage?

By storing float32 values as float16, int8, or even lower-bit representations. Analogy: Using smaller containers to store values. Technically, this reduces the memory required to store the weights and activations, allowing the model to run on devices with limited memory and improving inference speed.
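
A simple symmetric int8 sketch of the idea (per-tensor scale; real schemes are usually per-channel or block-wise):

```python
import numpy as np

def quantize_int8(weights):
    scale = np.abs(weights).max() / 127.0        # map the range onto [-127, 127]
    return np.round(weights / scale).astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale          # approximate reconstruction

w = np.random.randn(4, 4).astype(np.float32)     # float32: 4 bytes per value
q, s = quantize_int8(w)                          # int8: 1 byte per value (4x smaller)
print(np.abs(w - dequantize(q, s)).max())        # small rounding error remains
```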

21

What is knowledge distillation?

Training a small model to mimic the output of a bigger one, using the bigger model's output as labels. Analogy: A student learning from a teacher by imitating their answers. Technically, the smaller model learns to approximate the behavior of the larger model, reducing computational costs and enabling deployment in resource-constrained environments.
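
A compact numpy sketch of a distillation loss: the teacher's softened outputs act as soft labels for the student (the temperature value is illustrative):

```python
import numpy as np

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    def soft(z):  # temperature-softened softmax
        e = np.exp((z - z.max()) / temperature)
        return e / e.sum()
    p_teacher = soft(teacher_logits)   # soft labels from the big model
    p_student = soft(student_logits)
    # cross-entropy: pushes the student toward the teacher's distribution
    return -np.sum(p_teacher * np.log(p_student + 1e-12))
```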