Notes on Large Language Models and Transformers

Introduction to Large Language Models (LLMs)

  • Definition: LLMs are large artificial intelligence models capable of understanding and generating human-like text.
  • Caveat:
    • Minimal introduction provided in the course.
    • One semester is insufficient to cover all aspects of LLMs.

Transformer Architecture

  • Fundamentals:
    • Introduced by Vaswani et al. (2017) in "Attention Is All You Need".
    • Replaces recurrence and convolutions with attention mechanisms, allowing for better parallelization and efficiency in training (a minimal attention sketch follows this list).
  • Advantages of Transformer architecture:
    • Higher quality in machine translation tasks.
    • More parallelizable compared to previous models.
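
To make the attention mechanism concrete, here is a minimal sketch of scaled dot-product attention, the core operation of the Transformer, in plain NumPy. The shapes and random values are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted average of the values

# Toy example: 3 tokens, dimension 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```

Because every output position is computed from the same matrix products over all positions at once, there is no sequential dependency along the sequence, which is the source of the parallelization advantage noted above.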

Building Large Language Models

  1. Training Data: Massive datasets needed for training.
  2. Tokenizers and Embeddings:
    • Tokenization: Processes text into tokens.
    • Embeddings: Maps tokens into a numerical vector space where semantic similarity can be represented.
  3. Loss Function: Critical for training; for LLMs this is typically the cross-entropy between the predicted next-token distribution and the token that actually occurs (see the sketch after this list).
  4. Transformer Architecture: The backbone of LLMs integrating attention.
  5. Reinforcement Learning from Human Feedback (RLHF): Fine-tunes the model with a reward signal derived from human preference judgments.
  6. Reasoning Models: Extend LLMs with multi-step inference, generating intermediate reasoning steps before a final answer.
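
A minimal sketch of the loss in step 3, assuming the standard cross-entropy objective over next tokens (the logits and targets below are made-up toy values):

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """Mean cross-entropy between predicted next-token distributions and the
    tokens that actually occur.

    logits:  (seq_len, vocab_size) raw model outputs
    targets: (seq_len,) index of the true next token at each position
    """
    # Log-softmax, stabilized by subtracting the per-row maximum.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each true next token.
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3]])  # 2 positions, vocabulary of 3
targets = np.array([0, 1])             # the tokens that actually came next
print(next_token_cross_entropy(logits, targets))  # lower is better
```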

Cost and Access

  • Financial Implications: Training LLMs like GPT-4 can exceed $100 million, making access largely restricted to well-funded private companies.

Foundation Models

  • Definition: Models trained on broad, extensive datasets that can be adapted to a wide range of downstream tasks; the term was coined by Stanford's Center for Research on Foundation Models (CRFM).

Examples of LLMs

  • Prominent Models:
    • ChatGPT by OpenAI: widely credited as the first LLM with a user-friendly interface for a general audience.
    • Gemini by Google: Previously known as Bard.
    • Llama by Meta.
    • Claude by Anthropic, a company founded by former OpenAI employees.

Training Data Composition

  • Data Collection Methods:
    • Web-crawled data comprising almost 3 billion pages.
    • Curated datasets like WebText, English Wikipedia, and various books.
  • Challenges:
    • Cleaning and filtering raw web data, as with Google’s C4 (Colossal Clean Crawled Corpus); a toy filtering sketch follows.
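
To illustrate the kind of cleaning involved, here is a toy filter with simplified heuristics inspired by C4's published rules (line-level punctuation checks, "lorem ipsum" removal); it is a sketch, not Google's actual pipeline.

```python
import re

def keep_page(text):
    """Decide whether a crawled page passes some toy quality heuristics."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if len(lines) < 3:
        return False  # too little content to be useful
    if any("lorem ipsum" in l.lower() for l in lines):
        return False  # placeholder text
    # Require that most lines end like sentences.
    sentence_like = sum(1 for l in lines if re.search(r"[.!?]$", l))
    return sentence_like / len(lines) >= 0.5

print(keep_page("Hello.\nThis is a page.\nIt has sentences."))  # True
print(keep_page("menu\nlogin\nlorem ipsum dolor"))              # False
```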

Tokenization Process

  • One-Hot Encoding: A basic scheme that gives each word its own vector dimension (a single 1, zeros elsewhere); every pair of words ends up equidistant, so semantic relationships are not captured.
  • Advanced Tokenization (Byte Pair Encoding): Iteratively merges the most frequent adjacent pair of symbols into a new token, building a subword vocabulary (see the sketch after this list).
  • Importance of Tokenization: Affects nearly every aspect of LLM behavior, including spelling, simple string manipulation, and multilingual support.
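
A minimal sketch of the BPE merge loop on a toy corpus (real tokenizers typically operate on bytes, use end-of-word markers, and learn tens of thousands of merges):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent pair of symbols (toy BPE)."""
    vocab = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():   # rewrite every word with the merge
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```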

Embeddings

  • Purpose: Create a meaningful space where token relationships and distances correlate to semantic similarity.
  • Methods: Different techniques learn embeddings from co-occurrence statistics in the data.
  • Mathematical Operations: Vector arithmetic can capture relationships between words, e.g., $\text{Man} - \text{Woman} + \text{Queen} \approx \text{King}$ (see the sketch after this list).
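
A sketch of the analogy arithmetic above, using tiny hand-picked 3-d vectors purely for illustration (real embeddings are learned and have hundreds or thousands of dimensions):

```python
import numpy as np

# Hypothetical embeddings: axis 0 is roughly "gender", axis 1 roughly "royalty".
emb = {
    "man":   np.array([ 1.0, 0.0, 0.1]),
    "woman": np.array([-1.0, 0.0, 0.1]),
    "king":  np.array([ 1.0, 1.0, 0.1]),
    "queen": np.array([-1.0, 1.0, 0.1]),
}

def nearest(vec, exclude=()):
    """Vocabulary word whose embedding has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

target = emb["man"] - emb["woman"] + emb["queen"]
print(nearest(target, exclude={"man", "woman", "queen"}))  # king
```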

Metric Space and Vector Representation

  • Questions Raised: Whether concepts genuinely fit into a metric space; if they do not, the foundations of tokenization and embeddings are called into question.
  • Key Properties:
    • Symmetry: $d(x,y) = d(y,x)$, i.e., the distance between two tokens does not depend on their order.
    • Triangle Inequality: $d(x,z) \leq d(x,y) + d(y,z)$, which constrains how distances between related concepts can combine (a small check follows this list).
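
A small check of both properties for Euclidean distance on random vectors. Worth noting: the popular "cosine distance" $1 - \cos(x, y)$ is symmetric but does not in general satisfy the triangle inequality, which is part of why the metric-space question above is not trivial.

```python
import numpy as np

def euclid(x, y):
    """Euclidean distance, a genuine metric on embedding vectors."""
    return np.linalg.norm(x - y)

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 5))  # three random 5-d "embeddings"

assert np.isclose(euclid(x, y), euclid(y, x))       # symmetry
assert euclid(x, z) <= euclid(x, y) + euclid(y, z)  # triangle inequality
print("Both properties hold for this sample.")
```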

Training Goals of Large Language Models

  • Primary Function: Next-token prediction: an autoregressive model assigns probabilities to each possible next token given the sequence so far, and generation proceeds one token at a time (a decoding sketch follows).
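
A minimal sketch of autoregressive generation with greedy decoding. The bigram table below is a made-up stand-in for a real model; an actual LLM conditions on the entire prefix rather than only the previous token, but the generation loop has the same shape.

```python
import numpy as np

# Hypothetical next-token probabilities P[current][next]; each row sums to 1.
# Vocabulary: 0 = <s>, 1 = "the", 2 = "cat", 3 = "sat"
P = np.array([
    [0.0, 0.9, 0.05, 0.05],  # after <s>
    [0.0, 0.0, 0.7,  0.3 ],  # after "the"
    [0.0, 0.1, 0.0,  0.9 ],  # after "cat"
    [0.2, 0.5, 0.2,  0.1 ],  # after "sat"
])
vocab = ["<s>", "the", "cat", "sat"]

def generate(start=0, steps=3):
    """Greedy autoregressive decoding: always pick the most likely next token."""
    seq = [start]
    for _ in range(steps):
        seq.append(int(np.argmax(P[seq[-1]])))  # condition on the last token (bigram stand-in)
    return [vocab[t] for t in seq]

print(generate())  # ['<s>', 'the', 'cat', 'sat']
```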

Summary of Key Learning Outcomes

  • Explain the workings of tokenizers and how they are built.
  • Understand embedding spaces and critique the notion of concepts existing in a metric space.
  • Apply knowledge of loss functions in LLM training.