Large Language Models & Transformers — Comprehensive Study Notes

Foundational Metaphor: Finishing a Torn Script

  • Imagine a short movie script where the user’s line to an AI assistant is intact but the assistant’s reply is missing.
    • A “magical machine” that predicts the next word can be used iteratively to fill in the missing dialogue.
    • This anecdote mirrors exactly how large language models (LLMs) generate responses.

What a Large Language Model (LLM) Is

  • Mathematically, an LLM is a function that assigns a probability $P(\text{word}_{t+1} \mid \text{context})$ to every candidate next word.
  • Rather than making one deterministic choice, it outputs an entire probability distribution (sketched in code below).
  • Size marker: modern models carry hundreds of billions of parameters (weights), i.e. on the order of $10^{11}$ or more.
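
A minimal sketch of this interface, using a random stand-in for the trained network (the `model_logits` stub and the toy vocabulary are illustrative assumptions, not the real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["bank", "river", "money", "the", "flows"]

def model_logits(context: list[str]) -> np.ndarray:
    # Hypothetical stand-in for a trained network: one score per vocab word.
    return rng.normal(size=len(vocab))

def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return z / z.sum()

def next_word_distribution(context: list[str]) -> dict[str, float]:
    # P(word_{t+1} | context): a full distribution, not one deterministic pick.
    return dict(zip(vocab, softmax(model_logits(context))))

print(next_word_distribution(["the", "river"]))
```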

Building a Chatbot via Prompt-Completion

  • Engineer a conversation prompt: System description + user’s latest input becomes the context.
  • The model repeatedly samples the “assistant” portion one word at a time until a stopping condition is met (see the loop sketched after this list).
  • Sampling strategy:
    • Allowing occasional selection of less-likely words makes text feel natural and human-like.
    • Although the underlying network is deterministic, stochastic sampling means the same prompt can yield a different answer on each run.
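
A sketch of that completion loop under the same toy assumptions, with a `temperature` parameter standing in for the sampling strategy described above (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng()
vocab = ["Hello", "!", "How", "can", "I", "help", "?", "<stop>"]

def model_logits(context: list[str]) -> np.ndarray:
    # Hypothetical stand-in for the trained network.
    return rng.normal(size=len(vocab))

def sample_next(context: list[str], temperature: float = 0.8) -> str:
    logits = model_logits(context) / temperature  # temperature reshapes the distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Stochastic pick: occasionally a less-likely word wins, so the same
    # prompt can yield a different answer on each run.
    return rng.choice(vocab, p=probs)

def complete(prompt: list[str], max_words: int = 50) -> list[str]:
    context = list(prompt)
    for _ in range(max_words):
        word = sample_next(context)
        if word == "<stop>":  # stopping condition
            break
        context.append(word)
    return context

print(" ".join(complete(["System:", "You", "are", "helpful.", "User:", "Hi!"])))
```

Lower temperatures concentrate probability on the likeliest words; higher ones let more unlikely words through, trading consistency for variety.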

Data Requirements and Scale

  • Training data usually comes from the open internet.
  • Reading volume comparison:
    • A human reading non-stop 24/7 would need roughly 2600 years merely to read the dataset used for GPT-3.
    • Newer models exceed that corpus by a large factor.
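
A rough sanity check of the 2600-year figure, assuming a reading speed of about 250 words per minute (the reading speed is my assumption, not from the source):

```python
words_per_year = 250 * 60 * 24 * 365  # words/min × min/hr × hr/day × days/yr ≈ 1.3e8
total_words = words_per_year * 2600   # ≈ 3.4e11 words over 2600 years
print(f"{total_words:.1e}")           # 3.4e+11 — hundreds of billions of words
```

That total is consistent with the roughly 300 billion tokens reported for GPT-3's training run.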

Pre-Training Mechanics

  • Weight initialisation: all parameters start random → model outputs gibberish.
  • Training example = an entire text snippet (anywhere from a few words to thousands).
    • Feed all but the last word into the network.
    • Compute prediction error on the true final word.
    • Back-propagation adjusts every parameter, nudging the correct word’s probability up and others down.
  • Trillions of such updates → the model generalises to previously unseen text (one update step is sketched below).
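
One such update, sketched with a deliberately tiny PyTorch stand-in for the real network (the toy model, sizes, and learning rate are illustrative assumptions):

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64

class ToyNextWordModel(nn.Module):
    # Tiny stand-in: embed the context, average the vectors, and produce one
    # score (logit) per vocabulary word for the next-word prediction.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        return self.out(self.embed(context_ids).mean(dim=0))

model = ToyNextWordModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

snippet = torch.randint(0, vocab_size, (12,))  # stand-in for a real text snippet
context, target = snippet[:-1], snippet[-1]    # feed all but the last word

logits = model(context)
# Cross-entropy on the true final word: minimising it nudges the correct
# word's probability up and every other word's probability down.
loss = nn.functional.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

opt.zero_grad()
loss.backward()  # back-propagation computes a gradient for every parameter
opt.step()       # every weight gets adjusted, however slightly
```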

Computational Cost

  • Hypothetical: performing 1,000,000,000 (one billion) multiply-adds each second.
    • Training the largest models would still require well over 100,000,000 (100 million) years.
  • Feasible only via highly parallel hardware, chiefly GPUs.
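
A back-of-envelope check of those two numbers together, assuming a total training budget on the order of $3 \times 10^{24}$ multiply-adds (that budget is an assumption for illustration, not a figure from the source):

```python
total_ops = 3e24                       # assumed total multiply-adds for training
ops_per_second = 1e9                   # one billion operations per second
seconds_per_year = 60 * 60 * 24 * 365
years = total_ops / (ops_per_second * seconds_per_year)
print(f"{years:.1e} years")            # ≈ 9.5e+07 — roughly 100 million years
```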

From Pre-Training to Helpful Assistant: RLHF

  • Objective of next-word prediction ≠ objective of being helpful.
  • Reinforcement Learning from Human Feedback (RLHF) stage:
    • Human labelers flag unhelpful or harmful outputs.
    • The system is fine-tuned so future predictions align with human preferences (one common training signal is sketched below).
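
One common concrete form of this signal (a sketch of the usual reward-model step, not necessarily the exact recipe meant here): labelers pick the better of two replies, a reward model learns to score preferred replies higher, and the LLM is then fine-tuned to increase that reward.

```python
import torch

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise loss -log(sigmoid(r_chosen - r_rejected)): pushes the reward
    # model to rank human-preferred replies above flagged ones.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

r_good = torch.tensor([1.3, 0.2])   # reward scores for the preferred replies
r_bad = torch.tensor([-0.5, 0.1])   # reward scores for the rejected replies
print(preference_loss(r_good, r_bad))
```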

Transformer Architecture (Introduced 2017)

  • Key departure: processes the entire sequence simultaneously rather than left-to-right.
  • Embedding layer: each token gets mapped to a high-dimensional real-valued vector $\mathbf{e}_i$ so gradients can flow.
  • Two core operations repeated in layers:
    1. Self-Attention
    • Every token vector queries every other, learning contextual relationships.
    • Contextual disambiguation example: the vector for “bank” shifts toward “riverbank,” not “financial bank,” when watery context words appear.
    2. Feed-Forward Neural Network (FFN)
    • Independent, position-wise MLP that expands representational capacity.
  • Stacking many attention + FFN blocks deepens the model’s expressiveness.
  • Final linear + softmax layer converts the last-token vector into a probability distribution over the vocabulary (a sketch of these operations follows below).
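
A minimal NumPy sketch of one attention + FFN block plus the final readout (random weights stand in for learned ones; the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, vocab_size = 4, 8, 1000   # 4 token vectors of size 8

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

x = rng.normal(size=(seq_len, dim))     # embedded token vectors e_i
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))

# Self-attention: every token's query is scored against every token's key,
# and the resulting weights mix the value vectors across all positions.
Q, K, V = x @ Wq, x @ Wk, x @ Wv
attended = softmax(Q @ K.T / np.sqrt(dim)) @ V

# Position-wise feed-forward network: the same ReLU MLP applied to each
# token independently, expanding the dimension and projecting back.
W1, W2 = rng.normal(size=(dim, 4 * dim)), rng.normal(size=(4 * dim, dim))
ffn_out = np.maximum(attended @ W1, 0) @ W2

# Final readout: project the last token's vector to vocabulary logits and
# softmax them into a next-word distribution.
W_vocab = rng.normal(size=(dim, vocab_size))
next_word_probs = softmax(ffn_out[-1] @ W_vocab)
```

A real transformer stacks many such blocks and adds residual connections, layer normalisation, and multiple attention heads, all omitted here for brevity.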

Emergent & Opaque Behavior

  • Researchers design the framework (layers, attention maths, etc.).
  • The specific internal representations arise entirely from data-driven weight tuning—often inscrutable.
  • This explains why it is so hard to pinpoint why a model produced a specific sentence.

Hardware & Parallelisation

  • GPUs thrive on wide-parallel arithmetic → critical for training.
  • Earlier (pre-transformer) models processed words sequentially, limiting parallelism; transformers removed that bottleneck.

Practical Takeaways & Real-World Relevance

  • Generated text is “uncannily fluent” and often useful across writing, coding, tutoring, etc.
  • Variability in sampling provides creativity but can introduce inconsistency.
  • Ethical oversight via RLHF remains essential to curb misinformation, bias, and unsafe content.

Further Learning Resources

  • Author’s deep-learning video series visualises attention, embeddings, and transformer internals.
  • A recorded casual talk at the company TNG in Munich (available on the author’s second channel) offers an alternative, less-produced explanation.
  • Viewers can choose between the structured lessons and the conversational talk according to preference.