L17: Large Language Models

Training Decoder Models

Decoder-type language models are designed for next-word prediction. The training process involves:

  1. Masking the end of the sequence so that the model sees only the preceding context.

  2. Using the subsequent token as the prediction target. This is crucial for the model to learn the probabilities of the next word given the preceding context.

  3. Revealing the token and advancing to the next one. The model then adjusts its weights to better predict this token in similar contexts.

  4. Continuing this process until the end of the text. This ensures the model understands the full scope of the training data.

This procedure is commonly referred to as masked attention or causal attention, where the model is restricted to only look "backwards". This approach enables training on extensive, unlabeled data through self-supervision. Self-supervision allows the model to learn from vast amounts of text without manual labels, making it scalable and efficient.
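
To make the "backwards only" restriction concrete, here is a minimal NumPy sketch of a causal attention mask, assuming a toy sequence of 5 tokens and random attention scores (real models apply this mask inside every attention layer):

import numpy as np

# Causal mask: position i may only attend to positions <= i,
# so everything above the diagonal is blocked.
seq_len = 5
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Illustrative attention scores (one row per query position).
scores = np.random.randn(seq_len, seq_len)

# Masked positions get -inf so they receive zero weight after softmax.
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))  # each row sums to 1, with zeros above the diagonal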

Generating Text from a Decoder Model

The final layer of these models typically resembles Dense(units=vocabulary_size, activation="softmax"). The output is a vector of softmax scores representing all possible tokens. For example, with a vocabulary size of 50,000, the model outputs a vector of 50,000 probabilities, one for each possible word.

The model operates in an autoregressive loop:

  1. Processing a sequence and predicting the next token.

  2. Appending the predicted token to the sequence. This creates a new, extended sequence.

  3. Processing the extended sequence and appending the next predicted token. The model uses the updated sequence to predict the subsequent token.

  4. Repeating this loop until an end-of-sequence token is predicted. The end-of-sequence token signals the model to stop generating text.

Example:

Given the sequence "The cat sat on the _", the model might predict:

  • floor: 7.72%

  • bed: 6.82%

  • couch: 5.70%

  • ground: 4.71%

  • edge: 4.66%

Continuing the sequence, given "The cat sat on the floor _", the model might predict:

  • ,: 25.08%

  • and: 13.56%

  • .: 7.38%

  • of: 7.07%

  • with: 6.58%

Further extending to "The cat sat on the floor, _", the model's predictions could be:

  • and: 9.41%

  • looking: 3.23%

  • the: 1.78%

  • he: 1.63%

  • his: 1.48%

Finally, with "The cat sat on the floor, and _", the model's predictions might be:

  • the: 8.40%

  • he: 5.72%

  • I: 4.55%

  • she: 3.65%

  • his: 3.56%
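
The autoregressive loop above can be written down directly with the Hugging Face transformers library. This is a minimal sketch assuming GPT-2 as an illustrative decoder model and greedy selection of the next token; in practice the built-in generate() method is used instead of an explicit loop:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any decoder-only (causal) LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids

for _ in range(10):                                # generate up to 10 tokens
    logits = model(input_ids).logits               # shape (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()               # greedy: highest-scoring token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:   # stop at end-of-sequence
        break

print(tokenizer.decode(input_ids[0]))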

Sampling Strategies for Next Token

  • Greedy Search: Always selects the token with the highest score. This method ensures that the model always picks what it deems the most likely next word.

    • Makes the model deterministic, producing the same output for a given input. For example, if you start with "The sky is," the model will consistently predict the same continuation.

    • Often leads to getting stuck in repetitive sequence loops. The model might repeat phrases or ideas because it always chooses the safest, most probable option.

  • Beam Search: Keeps track of multiple possible branches of output sequences, selecting the sequence with the highest probability. Instead of just one possible sequence, it explores several at once.

    • Computationally expensive because it evaluates multiple sequences in parallel.

    • Can also get stuck in loops, particularly if the sequences kept in the beam are not sufficiently diverse.

  • Sampling: Uses scores as probabilities and samples randomly. This introduces more diversity in the output.

    • With large vocabularies, results can be nonsensical because the model might select rare or meaningless words.

  • Top-K Sampling: Samples among the K tokens with the highest scores. This limits the randomness to only the most probable words (see the sketch after this list).

  • Adjusted Softmax Sampling: Introduces a temperature parameter T in the softmax function to adjust how tokens are sampled. The temperature affects the probability distribution, making it sharper or flatter.
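
A minimal NumPy sketch of Top-K sampling, using the five "The cat sat on the _" scores from the example above as an illustrative toy vocabulary:

import numpy as np

def top_k_sample(probs, k, rng=np.random.default_rng()):
    """Sample a token index from the k highest-probability entries only."""
    top_ids = np.argsort(probs)[-k:]                    # indices of the k best tokens
    top_probs = probs[top_ids] / probs[top_ids].sum()   # renormalize over the top k
    return rng.choice(top_ids, p=top_probs)

# Toy next-token distribution (scores from the example, renormalized).
vocab = ["floor", "bed", "couch", "ground", "edge"]
probs = np.array([0.0772, 0.0682, 0.0570, 0.0471, 0.0466])
probs /= probs.sum()

print(vocab[top_k_sample(probs, k=3)])  # only floor, bed, or couch can be drawn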

Softmax with Temperature

The softmax function is modified by dividing the logits a by a temperature T:

y_i = \frac{\exp(a_i / T)}{\sum_j \exp(a_j / T)}

  • Low T: Sharpens the distribution so that nearly all probability goes to the most likely token, resulting in more determinism. When T is close to 0, the model behaves similarly to greedy search.

  • High T: Gives more equal scores to all tokens, leading to more randomness and creativity. A higher T can lead to more diverse and unexpected outputs.

  • T = 1: Recovers the standard softmax.
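
A short NumPy sketch of the temperature-scaled softmax above, applied to some illustrative logits to show how T moves the distribution between near-greedy and near-uniform:

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """y_i = exp(a_i / T) / sum_j exp(a_j / T), computed in a numerically stable way."""
    scaled = np.asarray(logits, dtype=float) / T
    scaled -= scaled.max()                      # subtract max for stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5, 0.1]                   # illustrative token scores
print(softmax_with_temperature(logits, T=0.1))  # almost one-hot: behaves like greedy search
print(softmax_with_temperature(logits, T=1.0))  # standard softmax
print(softmax_with_temperature(logits, T=5.0))  # nearly uniform: much more random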

Large Language Models (LLMs)

An LLM is loosely defined as a language model with on the order of 10^9 (1 billion) or more parameters. The number of parameters is a key indicator of the model's capacity to learn complex patterns.

Training LLMs

The typical training procedure involves:

  1. Self-Supervised Pre-training: Training a general model, also known as a foundation model, which can be subsequently fine-tuned. This stage teaches the model general language understanding.

    • Cost: $1 million to $100 million. The high cost is due to the large-scale computational resources and data required.

  2. Supervised Fine-tuning: Adapting the model to more specific uses, such as chat, question answering, chain-of-thought reasoning, or domain-specific tasks. This stage tailors the model for specific applications.

    • Cost: Varies, depending on the complexity and amount of annotated data.

  3. Continuous Fine-tuning: Fine-tuning an already fine-tuned model. This allows for iterative improvements and adaptations.

    • Cost: $1 to $1000 depending on the resources and the data required.

Training Data

Data is critical for a good model. For pre-training, data is scraped from various sources, including:

  • Wikipedia (multi-language)

  • GitHub (code)

  • ArXiv (academic text)

  • Stack Overflow (Q&A)

  • Reddit etc. (forums)

  • Project Gutenberg (books)

  • Common Crawl

For fine-tuning, annotated data is required. This includes:

  • Specific datasets for question answering, which are labeled with questions and their corresponding answers.

  • Reinforcement learning from human feedback (RLHF), which uses human preferences to guide the model.

  • The exact training data is often a business secret, even for open-source models: companies rarely disclose precisely which data their models were trained on.

Distributed Training

When a model exceeds the capacity of a single GPU (e.g., Llama 3.1 requires 3.3 TB of VRAM for training), the training workload must be distributed across many devices.

What can be split and parallelized?

  • Data: Running different batches in parallel. Each GPU processes a different subset of the data.

  • Weights: Distributing weight matrices over separate GPUs. Large weight matrices are partitioned across multiple GPUs.

  • Layers: Distributing different layers over separate GPUs. Each GPU is responsible for computing the activations for a subset of layers.

  • Sequences: Partitioning the input data sequences. Long sequences are split and processed in parallel.

Whichever dimension is split, the entire model is updated after each training step. TensorFlow and Hugging Face offer advanced methods for managing and optimizing distributed training.
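
As a minimal data-parallelism sketch, TensorFlow's MirroredStrategy splits each batch across the available GPUs and averages the gradients before updating the shared weights; the tiny model and the synthetic token data below are placeholders:

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()          # data parallelism over local GPUs
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                               # variables are mirrored on every device
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(50_000, 64),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(50_000, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Placeholder next-token data: (context, target) pairs of token ids.
x = np.random.randint(0, 50_000, size=(1024, 32))
y = np.random.randint(0, 50_000, size=(1024,))
model.fit(x, y, batch_size=128, epochs=1)            # each batch is split across the replicas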

Supervised Fine-Tuning

A pre-trained model can only append text to an input. Creating a chatbot requires instruction tuning on examples such as:

{
  "instruction": "Translate 'Good night' into Spanish.",
  "solution": "Buenas noches"
}
{
  "instruction": "Name primary colors.",
  "solution": "Red, blue, yellow"
}

On the Hugging Face model hub, you can often find two variants of the same model (e.g., meta-llama/Llama-3.2-3B and meta-llama/Llama-3.2-3B-Instruct). The "Instruct" variants are specifically fine-tuned for following instructions.
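
One hedged way to use such pairs is to format them into plain text and train on them with the same next-token objective as pre-training; the template below is an illustrative assumption, since every instruct model defines its own chat format:

# Illustrative formatting of instruction data for causal-LM fine-tuning.
examples = [
    {"instruction": "Translate 'Good night' into Spanish.", "solution": "Buenas noches"},
    {"instruction": "Name primary colors.", "solution": "Red, blue, yellow"},
]

def to_training_text(example):
    # Hypothetical prompt template; real instruct models each use their own format.
    return (
        f"### Instruction:\n{example['instruction']}\n"
        f"### Response:\n{example['solution']}"
    )

corpus = [to_training_text(e) for e in examples]
print(corpus[0])
# The resulting strings are tokenized and trained on exactly like pre-training text,
# only the data is now curated instruction/solution pairs.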

LLM Fine-Tuning with Limited Resources

Full model fine-tuning can be problematic due to:

  • Hardware requirements (mainly VRAM). Full fine-tuning requires substantial memory resources.

  • Risk of catastrophic forgetting. The model might forget previously learned knowledge.

Useful techniques include:

  • Prompt Tuning: Adding a small trainable model before the LLM, which outputs learned, task-specific tokens. This allows for task-specific adaptations without modifying the LLM itself.

  • Low-Rank Adaptation (LoRA): Adding small trainable layers in parallel with the existing attention layers. This reduces the number of trainable parameters.

Low-Rank Adaptation (LoRA)

LoRA keeps the original weight matrix W frozen and trains two new, small matrices A and B whose product is added to W's contribution. For a D \times D matrix W and rank R, this adds 2 \cdot R \cdot D new trainable parameters, compared to the D^2 parameters in W. LoRA significantly reduces the computational cost by only training a fraction of the parameters.
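
A minimal NumPy sketch of this idea, with an illustrative hidden size D and rank R; only A and B would receive gradient updates during fine-tuning:

import numpy as np

D, R = 512, 8                                # hidden size and LoRA rank (illustrative)
rng = np.random.default_rng(0)

W = rng.standard_normal((D, D))              # pre-trained weights: kept frozen
A = rng.standard_normal((D, R)) * 0.01       # trainable, D x R
B = np.zeros((R, D))                         # trainable, R x D; zero-initialized so the
                                             # adapted layer starts identical to the original

def lora_forward(x):
    # Original path plus the low-rank correction; only A and B are trained.
    return x @ W + x @ A @ B

x = rng.standard_normal((1, D))
print(lora_forward(x).shape)                                  # (1, D)
print("LoRA params:", A.size + B.size, "vs full:", W.size)    # 2*R*D vs D^2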

Quantization

Quantization reduces the memory cost of running inference by reducing numerical precision in weights and activations. This makes the model smaller and faster.

  • Can store float32 as float16 without significant modification. This reduces memory usage with minimal impact on accuracy.

  • Can store float32 as int8 for use on embedded systems (requires more modification). This is useful for deploying models on devices with limited resources.

Modern quantization schemes for LLMs are more extreme, going down to anywhere from 6 to 2 bits per weight. Extreme quantization can significantly reduce model size but may also hurt accuracy.
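
As a small illustration of the basic idea (a deliberate simplification of what GGUF-style quantizers actually do), here is symmetric int8 quantization of float32 weights in NumPy:

import numpy as np

weights = np.random.randn(4096).astype(np.float32)    # illustrative fp32 weights

# Symmetric int8 quantization: map the largest magnitude onto 127.
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize on the fly at inference time.
restored = q.astype(np.float32) * scale

print("memory:", weights.nbytes, "bytes ->", q.nbytes, "bytes")   # 4x smaller
print("max abs error:", np.abs(weights - restored).max())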

Example: DeepSeek-R1-Q4_K_M.gguf

  • Naming pattern: Q{bits per weight}_{type}_{variant}; here 4 bits per weight, K-quant type, medium (M) variant.

Schemes:

Scheme   | Compression ratio (relative to f32) | Performance
Q8_0     | 1:4                                 | High quality
Q4_K_M   | 1:8                                 | Medium quality
Q3_K_M   | 1:11                                | Low quality

Knowledge Distillation

Knowledge distillation involves training a small model to mimic the output of a bigger one. The output of the bigger model is used as labels to supervise the training of the smaller model.
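
A hedged sketch of the objective: the student is trained so its softmax output matches the teacher's output distribution (the "soft labels"), here expressed as a cross-entropy over toy next-token distributions:

import numpy as np

# Toy next-token distributions over a 5-word vocabulary.
teacher_probs = np.array([0.70, 0.15, 0.08, 0.05, 0.02])   # soft labels from the big model
student_logits = np.array([1.2, 0.4, 0.1, -0.3, -0.8])     # small model's raw scores

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

student_probs = softmax(student_logits)

# Cross-entropy between teacher and student; minimizing it (e.g., by gradient
# descent on the student's weights) pushes the student toward the teacher.
distill_loss = -np.sum(teacher_probs * np.log(student_probs))
print(distill_loss)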