Notes on Large Language Models and Transformers
Introduction to Large Language Models (LLMs)
- Definition: LLMs are large artificial intelligence models capable of understanding and generating human-like text.
- Caveat:
  - Minimal introduction provided in the course.
  - One semester is insufficient to cover all aspects of LLMs.
- Fundamentals:
  - The Transformer architecture was introduced by Ashish Vaswani et al. in "Attention Is All You Need" (2017).
  - It replaces recurrence and convolutions with attention mechanisms, allowing for better parallelization and efficiency in training (a minimal sketch follows this list).
- Advantages of the Transformer architecture:
  - Higher quality on machine translation tasks.
  - More parallelizable than earlier recurrent and convolutional models.
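To make the attention mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention as described in "Attention Is All You Need"; the token count, dimensions, and random projection matrices are illustrative assumptions, not values from any real model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # softmax over each query's scores
    return weights @ V                                        # weighted sum of value vectors

# Toy setup: 4 tokens, model dimension 8 (sizes chosen only for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                   # token embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))   # random projection matrices
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (4, 8): one context-aware vector per token
```

Because every token attends to every other token in a single matrix multiplication, the whole sequence can be processed in parallel, which is the source of the efficiency advantage over recurrent models.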
Building Large Language Models
- Training Data: Massive datasets needed for training.
- Tokenizers and Embeddings:
  - Tokenization: Splits raw text into a sequence of tokens (often subword units).
  - Embeddings: Map tokens into a numerical vector space where semantic similarity can be represented.
- Loss Function: Critical for training; measures the difference between predicted and actual outcomes (see the cross-entropy sketch after this list).
- Transformer Architecture: The backbone of LLMs, built around the attention mechanism.
- Reinforcement Learning from Human Feedback (RLHF): Enhances learning based on human interactions.
- Reasoning Models: Integrate inferential capabilities into LLMs.
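The loss most commonly used for the next-token objective discussed later is cross-entropy. The sketch below is a minimal NumPy illustration with a made-up 5-token vocabulary and hand-picked logits; real training computes the same quantity with library routines over large batches.

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """Average cross-entropy between predicted next-token distributions and the true next tokens.

    logits:  (T, V) unnormalized scores, one row per position
    targets: (T,)   ids of the true next tokens
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # log softmax
    return -log_probs[np.arange(len(targets)), targets].mean()                 # mean negative log-likelihood

# Toy example: vocabulary of 5 tokens, 3 positions (all numbers are made up).
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.2, 0.2, 3.0, 0.2, 0.2],
                   [0.5, 0.5, 0.5, 0.5, 4.0]])
targets = np.array([0, 2, 4])
print(next_token_cross_entropy(logits, targets))  # small value: the correct tokens get the highest scores
```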
Cost and Access
- Financial Implications: Training LLMs like GPT-4 can exceed $100 million, making access largely restricted to well-funded private companies.
Foundation Models
- Definition: Models trained on broad data at scale that can be adapted to a wide range of downstream tasks; the term was defined by Stanford's Center for Research on Foundation Models (CRFM).
Examples of LLMs
- Prominent Models:
  - ChatGPT by OpenAI: The first widely accessible, user-friendly LLM chat interface.
  - Gemini by Google: Previously known as Bard.
  - Llama by Meta.
  - Claude by Anthropic, a company founded by former OpenAI employees.
Training Data Composition
- Data Collection Methods:
  - Web-crawled data, including almost 3 billion pages.
  - Curated datasets like WebText, English Wikipedia, and various books.
- Challenges:
  - Cleaning and preparing data, as in the case of Google's C4 (Colossal Clean Crawled Corpus).
Tokenization Process
- One-Hot Encoding: A basic scheme in which each word maps directly to its own neuron (one dimension per vocabulary entry); it fails to capture semantic relationships and scales poorly with vocabulary size.
- Advanced Tokenization (Byte Pair Encoding): Iteratively merges the most frequent adjacent token pairs into new, larger tokens (a toy sketch follows this list).
- Importance of Tokenization: Affects many aspects of LLM performance, including spelling, simple character-level processing, and multilingual support.
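The counting-and-merging loop behind byte pair encoding can be shown on a toy corpus. The sketch below uses a made-up six-word corpus and character-level symbols; real tokenizers operate on bytes, learn tens of thousands of merges, and handle many corner cases this version ignores.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy byte pair encoding: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = {tuple(word): count for word, count in Counter(words).items()}  # words as character tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count                  # count adjacent pairs, weighted by word frequency
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent adjacent pair becomes a new token
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

# Frequent pairs such as ('l', 'o') and ('lo', 'w') are merged first in this tiny corpus.
print(bpe_merges(["low", "low", "lower", "lowest", "newest", "widest"], num_merges=4))
```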
Embeddings
- Purpose: Create a meaningful vector space in which distances between tokens correlate with semantic similarity.
- Methods: Embeddings are typically learned from token co-occurrence statistics in the training data.
- Mathematical Operations: Vector arithmetic can capture relationships between words, e.g., $\text{Man} - \text{Woman} + \text{Queen} \approx \text{King}$.
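A toy illustration of this arithmetic is sketched below with hand-crafted 2-dimensional embeddings (one axis loosely "royalty", one loosely "gender") chosen so that the analogy works out exactly; learned embeddings have hundreds of dimensions and satisfy such relations only approximately.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-crafted toy embeddings: first axis ~ royalty, second axis ~ gender.
emb = {
    "man":   np.array([0.1,  1.0]),
    "woman": np.array([0.1, -1.0]),
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
}

analogy = emb["man"] - emb["woman"] + emb["queen"]                 # shift 'queen' along the woman-to-man direction
best = max(emb, key=lambda w: cosine_similarity(analogy, emb[w]))
print(best)  # 'king' is the nearest vector to man - woman + queen in this toy space
```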
Metric Space and Vector Representation
- Questions Raised: Whether concepts can genuinely be placed in a metric space; if they cannot, the foundations of tokenization and embeddings are called into question.
- Key Properties:
  - Symmetry: $d(x,y) = d(y,x)$ must hold for the similarity relationship.
  - Triangle Inequality: $d(x,z) \leq d(x,y) + d(y,z)$ constrains how distances between concepts may behave in a vector space.
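Any embedding space equipped with the Euclidean distance satisfies these axioms by construction; the toy check below (random 8-dimensional vectors, an assumption made purely for illustration) verifies them numerically, while the open question in the notes is whether human judgments of semantic similarity behave the same way.

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

rng = np.random.default_rng(1)
x, y, z = (rng.normal(size=8) for _ in range(3))   # three arbitrary "concept" vectors

assert np.isclose(euclidean(x, y), euclidean(y, x))                   # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12   # triangle inequality (rounding tolerance)
print("metric axioms hold for these vectors")
```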
Training Goals of Large Language Models
- Primary Function: Next-token prediction with an autoregressive model, i.e., each subsequent token is predicted from the tokens seen so far (see the decoding sketch below).
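The following is a minimal decoding loop showing what "autoregressive" means operationally: the model's output for the sequence so far determines the next token, which is appended and fed back in. The toy_model function is a stand-in defined only for this sketch, not any real LLM.

```python
import numpy as np

def generate(model, prompt_ids, num_new_tokens):
    """Greedy autoregressive decoding: repeatedly predict the next token and append it."""
    ids = list(prompt_ids)
    for _ in range(num_new_tokens):
        logits = model(ids)               # scores over the vocabulary, given the sequence so far
        next_id = int(np.argmax(logits))  # greedy choice; real systems often sample instead
        ids.append(next_id)
    return ids

# Stand-in "model": always prefers token id (last_token + 1) mod VOCAB_SIZE.
VOCAB_SIZE = 10
def toy_model(ids):
    logits = np.zeros(VOCAB_SIZE)
    logits[(ids[-1] + 1) % VOCAB_SIZE] = 1.0
    return logits

print(generate(toy_model, prompt_ids=[3], num_new_tokens=5))  # [3, 4, 5, 6, 7, 8]
```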
Summary of Key Learning Outcomes
- Explain the workings of tokenizers and how they are built.
- Understand embedding spaces and critique the notion of concepts existing in a metric space.
- Apply knowledge of loss functions in LLM training.