Large Language Models & Transformers — Comprehensive Study Notes

Foundational Metaphor: Finishing a Torn Script

  • Imagine a short movie script where the user’s line to an AI assistant is intact but the assistant’s reply is missing.
    • A “magical machine” that predicts the next word can be used iteratively to fill in the missing dialogue.
    • This anecdote mirrors exactly how large language models (LLMs) generate responses.

What a Large Language Model (LLM) Is

  • Mathematically, an LLM is a function that assigns a probability $P(\text{word}_{t+1} \mid \text{context})$ to every candidate next word.
  • Rather than making one deterministic choice, it outputs an entire probability distribution (sketched in code below).
  • Size marker: modern models carry hundreds of billions of parameters (weights), i.e. on the order of $10^{11}$ or more.
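
A minimal sketch of this interface, using a random stand-in for the trained network (the `model_logits` stub and the toy vocabulary are illustrative assumptions, not the real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["bank", "river", "money", "the", "flows"]

def model_logits(context: list[str]) -> np.ndarray:
    # Hypothetical stand-in for a trained network: one score per vocab word.
    return rng.normal(size=len(vocab))

def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return z / z.sum()

def next_word_distribution(context: list[str]) -> dict[str, float]:
    # P(word_{t+1} | context): a full distribution, not one deterministic pick.
    return dict(zip(vocab, softmax(model_logits(context))))

print(next_word_distribution(["the", "river"]))
```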

Building a Chatbot via Prompt-Completion

  • Engineer a conversation prompt: System description + user’s latest input becomes the context.
  • The model repeatedly samples the “assistant” portion one word at a time until a stopping condition is met (see the loop sketched after this list).
  • Sampling strategy:
    • Allowing occasional selection of less-likely words makes text feel natural and human-like.
    • Although the underlying network is deterministic, stochastic sampling means the same prompt can yield a different answer on each run.
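
A sketch of that completion loop under the same toy assumptions, with a `temperature` parameter standing in for the sampling strategy described above (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng()
vocab = ["Hello", "!", "How", "can", "I", "help", "?", "<stop>"]

def model_logits(context: list[str]) -> np.ndarray:
    # Hypothetical stand-in for the trained network.
    return rng.normal(size=len(vocab))

def sample_next(context: list[str], temperature: float = 0.8) -> str:
    logits = model_logits(context) / temperature  # temperature reshapes the distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Stochastic pick: occasionally a less-likely word wins, so the same
    # prompt can yield a different answer on each run.
    return rng.choice(vocab, p=probs)

def complete(prompt: list[str], max_words: int = 50) -> list[str]:
    context = list(prompt)
    for _ in range(max_words):
        word = sample_next(context)
        if word == "<stop>":  # stopping condition
            break
        context.append(word)
    return context

print(" ".join(complete(["System:", "You", "are", "helpful.", "User:", "Hi!"])))
```

Lower temperatures concentrate probability on the likeliest words; higher ones let more unlikely words through, trading consistency for variety.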

Data Requirements and Scale

  • Training data usually comes from the open internet.
  • Reading volume comparison:
    • A human reading non-stop 24/7 would need roughly 2600 years merely to read the dataset used for GPT-3.
    • Newer models exceed that corpus by a large factor.
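
A rough sanity check of the 2600-year figure, assuming a reading speed of about 250 words per minute (the reading speed is my assumption, not from the source):

```python
words_per_year = 250 * 60 * 24 * 365  # words/min × min/hr × hr/day × days/yr ≈ 1.3e8
total_words = words_per_year * 2600   # ≈ 3.4e11 words over 2600 years
print(f"{total_words:.1e}")           # 3.4e+11 — hundreds of billions of words
```

That total is consistent with the roughly 300 billion tokens reported for GPT-3's training run.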

Pre-Training Mechanics

  • Weight initialisation: all parameters start random → model outputs gibberish.
  • Training example = an entire text snippet (anywhere from a few words to thousands).
    • Feed all but the last word into the network.
    • Compute prediction error on the true final word.
    • Back-propagation adjusts every parameter, nudging the correct word’s probability up and others down.
  • Trillions of such updates → the model generalises to previously unseen text (one update step is sketched below).
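
One such update, sketched with a deliberately tiny PyTorch stand-in for the real network (the toy model, sizes, and learning rate are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64

class ToyNextWordModel(nn.Module):
    # Tiny stand-in: embed the context, average the vectors, and produce one
    # score (logit) per vocabulary word for the next-word prediction.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        return self.out(self.embed(context_ids).mean(dim=0))

model = ToyNextWordModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

snippet = torch.randint(0, vocab_size, (12,))  # stand-in for a real text snippet
context, target = snippet[:-1], snippet[-1]    # feed all but the last word

logits = model(context)
# Cross-entropy on the true final word: minimising it nudges the correct
# word's probability up and every other word's probability down.
loss = nn.functional.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

opt.zero_grad()
loss.backward()  # back-propagation computes a gradient for every parameter
opt.step()       # every weight gets adjusted, however slightly
```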

Computational Cost

  • Hypothetical: performing 1,000,000,000 (one billion) multiply-adds each second.
    • Training the largest models would still require well over 100,000,000 (100 million) years.
  • Feasible only via highly parallel hardware, chiefly GPUs.
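
A back-of-envelope check of those two numbers together, assuming a total training budget on the order of $3 \times 10^{24}$ multiply-adds (that budget is an assumption for illustration, not a figure from the source):

```python
total_ops = 3e24                       # assumed total multiply-adds for training
ops_per_second = 1e9                   # one billion operations per second
seconds_per_year = 60 * 60 * 24 * 365
years = total_ops / (ops_per_second * seconds_per_year)
print(f"{years:.1e} years")            # ≈ 9.5e+07 — roughly 100 million years
```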

From Pre-Training to Helpful Assistant: RLHF

  • Objective of next-word prediction ≠ objective of being helpful.
  • Reinforcement Learning from Human Feedback (RLHF) stage:
    • Human labelers flag unhelpful or harmful outputs.
    • The system is fine-tuned so future predictions align with human preferences (one common training signal is sketched below).
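
One common concrete form of this signal (a sketch of the usual reward-model step, not necessarily the exact recipe meant here): labelers pick the better of two replies, a reward model learns to score preferred replies higher, and the LLM is then fine-tuned to increase that reward.

```python
import torch

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise loss -log(sigmoid(r_chosen - r_rejected)): pushes the reward
    # model to rank human-preferred replies above flagged ones.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

r_good = torch.tensor([1.3, 0.2])   # reward scores for the preferred replies
r_bad = torch.tensor([-0.5, 0.1])   # reward scores for the rejected replies
print(preference_loss(r_good, r_bad))
```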

Transformer Architecture (Introduced 2017)

  • Key departure: processes the entire sequence simultaneously rather than left-to-right.
  • Embedding layer: each token gets mapped to a high-dimensional real-valued vector $\mathbf{e}_i$ so gradients can flow.
  • Two core operations repeated in layers:
    1. Self-Attention
    • Every token vector queries every other, learning contextual relationships.
    • Contextual disambiguation example: the vector for “bank” shifts toward “riverbank,” not “financial bank,” when watery context words appear.
    2. Feed-Forward Neural Network (FFN)
    • Independent, position-wise MLP that expands representational capacity.
  • Stacking many attention + FFN blocks deepens the model’s expressiveness.
  • Final linear + softmax layer converts the last-token vector into a probability distribution over the vocabulary (a sketch of these operations follows below).
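
A minimal NumPy sketch of one attention + FFN block plus the final readout (random weights stand in for learned ones; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, vocab_size = 4, 8, 1000   # 4 token vectors of size 8

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

x = rng.normal(size=(seq_len, dim))     # embedded token vectors e_i
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))

# Self-attention: every token's query is scored against every token's key,
# and the resulting weights mix the value vectors across all positions.
Q, K, V = x @ Wq, x @ Wk, x @ Wv
attended = softmax(Q @ K.T / np.sqrt(dim)) @ V

# Position-wise feed-forward network: the same ReLU MLP applied to each
# token independently, expanding the dimension and projecting back.
W1, W2 = rng.normal(size=(dim, 4 * dim)), rng.normal(size=(4 * dim, dim))
ffn_out = np.maximum(attended @ W1, 0) @ W2

# Final readout: project the last token's vector to vocabulary logits and
# softmax them into a next-word distribution.
W_vocab = rng.normal(size=(dim, vocab_size))
next_word_probs = softmax(ffn_out[-1] @ W_vocab)
```

A real transformer stacks many such blocks and adds residual connections, layer normalisation, and multiple attention heads, all omitted here for brevity.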

Emergent & Opaque Behavior

  • Researchers design the framework (layers, attention maths, etc.).
  • The specific internal representations arise entirely from data-driven weight tuning—often inscrutable.
  • This explains why it is so hard to pinpoint why a model produced a specific sentence.

Hardware & Parallelisation

  • GPUs thrive on wide-parallel arithmetic → critical for training.
  • Earlier (pre-transformer) models processed words sequentially, limiting parallelism; transformers removed that bottleneck.

Practical Takeaways & Real-World Relevance

  • Generated text is “uncannily fluent” and often useful across writing, coding, tutoring, etc.
  • Variability in sampling provides creativity but can introduce inconsistency.
  • Ethical oversight via RLHF remains essential to curb misinformation, bias, and unsafe content.

Further Learning Resources

  • Author’s deep-learning video series visualises attention, embeddings, and transformer internals.
  • A recorded casual talk at the company TNG in Munich (available on the author’s second channel) offers an alternative, less-produced explanation.
  • Viewers can choose between the structured lessons and the conversational talk according to preference.