Large Language Models & Transformers — Comprehensive Study Notes
- Imagine a short movie script where the user’s line to an AI assistant is intact but the assistant’s reply is missing.
- A “magical machine” that predicts the next word can be used iteratively to fill in the missing dialogue.
- This anecdote mirrors exactly how large language models (LLMs) generate responses.
What a Large Language Model (LLM) Is
- Mathematically, an LLM is a function that assigns a probability P(next word | context) to every candidate next word.
- Rather than one deterministic choice, it outputs an entire probability distribution.
- Size marker: modern models carry hundreds of billions of parameters (weights), i.e. on the order of 10^11 or more.
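A toy numerical sketch of the "probability distribution over next words" idea; the vocabulary and scores below are invented purely for illustration, and a real model's distribution covers tens of thousands of tokens:

```python
import numpy as np

# Toy illustration: an LLM maps a context to a probability distribution
# over every candidate next word. The vocabulary and raw scores (logits)
# here are made up for illustration.
vocab = ["mat", "moon", "banana", "roof"]
logits = np.array([3.1, 1.2, -0.5, 2.4])   # raw scores for "The cat sat on the ..."

# Softmax turns the scores into a probability distribution that sums to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"P({word!r} | context) = {p:.3f}")
```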
Building a Chatbot via Prompt-Completion
- Engineer a conversation prompt: System description + user’s latest input becomes the context.
- The model repeatedly samples the “assistant” portion one word at a time until a stopping condition is met.
- Sampling strategy:
- Allowing occasional selection of less-likely words makes text feel natural and human-like.
- Although the underlying network is deterministic, stochastic sampling means the same prompt can yield a different answer on each run.
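A minimal sketch of that sampling loop. The `predict_next_distribution` function is a hypothetical stand-in for the trained model, and the temperature mechanism shown is one common way of letting less-likely words through:

```python
import numpy as np

def sample_reply(prompt_tokens, predict_next_distribution, max_tokens=50,
                 temperature=0.8, stop_token="<end>", rng=np.random.default_rng()):
    """Generate an assistant reply one token at a time.

    `predict_next_distribution(tokens)` is a hypothetical stand-in for the
    trained model: it returns (vocab, probs) for the next token given the
    tokens so far. Temperature < 1 sharpens the distribution; > 1 flattens
    it, letting less-likely words through more often.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        vocab, probs = predict_next_distribution(tokens)
        # Re-weight the probabilities by temperature before sampling.
        adjusted = np.power(probs, 1.0 / temperature)
        adjusted /= adjusted.sum()
        next_token = rng.choice(vocab, p=adjusted)
        if next_token == stop_token:           # stopping condition reached
            break
        tokens.append(next_token)
    return tokens[len(prompt_tokens):]          # only the newly generated part
```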
Data Requirements and Scale
- Training data usually comes from the open internet.
- Reading volume comparison:
- A human reading non-stop 24/7 would need 2600 years merely to read the dataset used for GPT-3.
- Newer models exceed that corpus by a large factor.
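A back-of-the-envelope check of the reading-time figure. The corpus size and reading speed below are assumptions for illustration, not figures from the source; they land in the same ballpark as the quoted 2600 years:

```python
# Rough check of the "millennia of non-stop reading" claim.
# Assumptions (illustrative): a corpus of roughly 3e11 words and a
# reading speed of about 250 words per minute, around the clock.
corpus_words = 3e11
words_per_minute = 250

minutes = corpus_words / words_per_minute
years = minutes / (60 * 24 * 365)
print(f"{years:,.0f} years of round-the-clock reading")  # roughly 2,300 years
```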
Pre-Training Mechanics
- Weight initialisation: all parameters start random → model outputs gibberish.
- Training example = a whole text snippet, anywhere from a few words to thousands.
- Feed all but the last word into the network.
- Compute prediction error on the true final word.
- Back-propagation adjusts every parameter, nudging the correct word’s probability up and others down.
- Trillions of such updates → generalises to previously unseen text.
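A toy sketch of that update cycle, shrunk to a single weight matrix and a tiny vocabulary. All sizes and data are invented for illustration; real models have hundreds of billions of weights, but each update has the same shape:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for pre-training: one weight matrix maps a fixed-size context
# vector to scores over a tiny vocabulary. Feed the context, score every word,
# then nudge the weights so the true last word becomes more likely.
vocab_size, context_dim, lr = 5, 8, 0.1
W = rng.normal(size=(vocab_size, context_dim))   # starts random -> gibberish

context_vec = rng.normal(size=context_dim)        # encodes "all but the last word"
true_word = 2                                      # index of the actual final word

for step in range(100):
    logits = W @ context_vec
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Cross-entropy gradient: raise the true word's probability, lower the rest.
    grad_logits = probs.copy()
    grad_logits[true_word] -= 1.0
    W -= lr * np.outer(grad_logits, context_vec)

logits = W @ context_vec
probs = np.exp(logits - logits.max()); probs /= probs.sum()
print(f"P(true word | context) after training: {probs[true_word]:.3f}")
```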
Computational Cost
- Hypothetical: performing 1,000,000,000 (one billion) multiply-adds each second.
- Training the largest models would still require well over 100 million years.
- Feasible only via highly parallel hardware, chiefly GPUs.
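The arithmetic behind that claim, under an assumed (illustrative) total training compute:

```python
# Rough arithmetic behind the "well over 100 million years" claim.
# Assumption (illustrative, not from the source): total training compute
# on the order of 1e25 multiply-adds for one of the largest models.
total_operations = 1e25
operations_per_second = 1e9              # one billion multiply-adds per second
seconds_per_year = 60 * 60 * 24 * 365

years = total_operations / operations_per_second / seconds_per_year
print(f"about {years:,.0f} years at one billion operations per second")  # ~317 million
```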
From Pre-Training to Helpful Assistant: RLHF
- Objective of next-word prediction ≠ objective of being helpful.
- Reinforcement Learning from Human Feedback (RLHF) stage:
- Human labelers flag unhelpful or harmful outputs.
- The system is fine-tuned so future predictions align with human preferences.
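One common ingredient of this stage is a reward model trained on human comparisons between candidate replies. The pairwise preference loss below is a standard, simplified form of that idea; the scores are hypothetical:

```python
import numpy as np

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss used when training a reward model from human
    comparisons (one common ingredient of RLHF; simplified sketch).
    It pushes the score of the reply labelers preferred above the score of
    the rejected one: loss = -log sigmoid(r_chosen - r_rejected).
    """
    return -np.log(1.0 / (1.0 + np.exp(-(reward_chosen - reward_rejected))))

# Hypothetical reward-model scores for two candidate replies.
print(preference_loss(reward_chosen=2.0, reward_rejected=-1.0))  # small loss: ranking is right
print(preference_loss(reward_chosen=-1.0, reward_rejected=2.0))  # large loss: ranking is wrong
```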
Inside the Transformer Architecture
- Key departure from earlier sequential models: the transformer processes the entire input sequence in parallel rather than strictly left-to-right.
- Embedding layer: each token gets mapped to a high-dimensional real-valued vector e_i so gradients can flow.
- Two core operations repeated in layers:
- Self-Attention
- Every token vector queries every other, learning contextual relationships.
- Contextual disambiguation example: the vector for “bank” shifts toward “riverbank,” not “financial bank,” when watery context words appear.
- Feed-Forward Neural Network (FFN)
- Independent, position-wise MLP that expands representational capacity.
- Stacking many attention + FFN blocks deepens the model’s expressiveness.
- Final linear + softmax layer converts the last-token vector into a probability distribution over the vocabulary.
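A minimal single-head version of one attention + FFN block plus the final softmax layer; the dimensions and random weights are made up purely to show the data flow (real models stack many such blocks with many attention heads and far larger sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head self-attention + position-wise feed-forward block.
seq_len, d_model, d_ff = 4, 16, 64
x = rng.normal(size=(seq_len, d_model))              # one embedding vector per token

# Self-attention: every token builds a query, key, and value; the attention
# weights say how much each token should "look at" every other token.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
attn_weights = softmax(Q @ K.T / np.sqrt(d_model))   # (seq_len, seq_len)
attended = attn_weights @ V                          # context-mixed token vectors

# Position-wise feed-forward network applied independently to each token.
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
ffn_out = np.maximum(0, attended @ W1) @ W2          # ReLU MLP

# Final step (last token only): linear layer + softmax over the vocabulary.
vocab_size = 10
W_out = rng.normal(size=(d_model, vocab_size))
next_word_probs = softmax(ffn_out[-1] @ W_out)
print(next_word_probs.round(3), next_word_probs.sum())
```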
Emergent & Opaque Behavior
- Researchers design the framework (layers, attention maths, etc.).
- The specific internal representations arise entirely from data-driven weight tuning—often inscrutable.
- This explains why it is so hard to pinpoint why a model produced a specific sentence.
Hardware & Parallelisation
- GPUs thrive on wide-parallel arithmetic → critical for training.
- Earlier (pre-transformer) models processed words sequentially, limiting parallelism; transformers removed that bottleneck.
Practical Takeaways & Real-World Relevance
- Generated text is “uncannily fluent” and often useful across writing, coding, tutoring, etc.
- Variability in sampling provides creativity but can introduce inconsistency.
- Ethical oversight via RLHF remains essential to curb misinformation, bias, and unsafe content.
Further Learning Resources
- Author’s deep-learning video series visualises attention, embeddings, and transformer internals.
- A recorded casual talk at the company TNG in Munich (available on the author's second channel) offers an alternative, less-produced explanation.
- Viewers can choose between structured lessons and conversational lecture according to preference.