Notes on Large Language Models and Transformers

Introduction to Large Language Models (LLMs)

  • Definition: LLMs are large artificial intelligence models capable of understanding and generating human-like text.
  • Caveat:
    • Minimal introduction provided in the course.
    • One semester is insufficient to cover all aspects of LLMs.

Transformer Architecture

  • Fundamentals:
    • Introduced by Vaswani et al. (2017) in "Attention Is All You Need".
    • Replaces recurrence and convolutions with attention mechanisms, allowing for better parallelization and efficiency in training (a minimal attention sketch follows this list).
  • Advantages of Transformer architecture:
    • Higher quality in machine translation tasks.
    • More parallelizable compared to previous models.
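
To make the attention mechanism concrete, here is a minimal sketch of scaled dot-product attention, the core operation of the Transformer, in plain NumPy. The shapes and random values are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted average of the values

# Toy example: 3 tokens, dimension 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```

Because every output position is computed from the same matrix products over all positions at once, there is no sequential dependency along the sequence, which is the source of the parallelization advantage noted above.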

Building Large Language Models

  1. Training Data: Massive datasets needed for training.
  2. Tokenizers and Embeddings:
    • Tokenization: Processes text into tokens.
    • Embeddings: Maps tokens into a numerical vector space where semantic similarity can be represented.
  3. Loss Function: Critical for training; for LLMs this is typically the cross-entropy between the predicted next-token distribution and the token that actually occurs (see the sketch after this list).
  4. Transformer Architecture: The backbone of LLMs integrating attention.
  5. Reinforcement Learning from Human Feedback (RLHF): Fine-tunes the model with a reward signal derived from human preference judgments.
  6. Reasoning Models: Extend LLMs with multi-step inference, generating intermediate reasoning steps before a final answer.
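
A minimal sketch of the loss in step 3, assuming the standard cross-entropy objective over next tokens (the logits and targets below are made-up toy values):

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """Mean cross-entropy between predicted next-token distributions and the
    tokens that actually occur.

    logits:  (seq_len, vocab_size) raw model outputs
    targets: (seq_len,) index of the true next token at each position
    """
    # Log-softmax, stabilized by subtracting the per-row maximum.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to each true next token.
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5,  0.3]])  # 2 positions, vocabulary of 3
targets = np.array([0, 1])             # the tokens that actually came next
print(next_token_cross_entropy(logits, targets))  # lower is better
```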

Cost and Access

  • Financial Implications: Training LLMs like GPT-4 can exceed $100 million, making access largely restricted to well-funded private companies.

Foundation Models

  • Definition: Models trained on broad, extensive datasets that can be adapted to a wide range of downstream tasks; the term was coined by Stanford's Center for Research on Foundation Models (CRFM).

Examples of LLMs

  • Prominent Models:
    • ChatGPT by OpenAI: widely credited as the first LLM with a user-friendly interface for a general audience.
    • Gemini by Google: Previously known as Bard.
    • Llama by Meta.
    • Claude by Anthropic, a company founded by former OpenAI employees.

Training Data Composition

  • Data Collection Methods:
    • Web-crawled data comprising almost 3 billion pages.
    • Curated datasets like WebText, English Wikipedia, and various books.
  • Challenges:
    • Cleaning and filtering raw web data, as with Google’s C4 (Colossal Clean Crawled Corpus); a toy filtering sketch follows.
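
To illustrate the kind of cleaning involved, here is a toy filter with simplified heuristics inspired by C4's published rules (line-level punctuation checks, "lorem ipsum" removal); it is a sketch, not Google's actual pipeline.

```python
import re

def keep_page(text):
    """Decide whether a crawled page passes some toy quality heuristics."""
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    if len(lines) < 3:
        return False  # too little content to be useful
    if any("lorem ipsum" in l.lower() for l in lines):
        return False  # placeholder text
    # Require that most lines end like sentences.
    sentence_like = sum(1 for l in lines if re.search(r"[.!?]$", l))
    return sentence_like / len(lines) >= 0.5

print(keep_page("Hello.\nThis is a page.\nIt has sentences."))  # True
print(keep_page("menu\nlogin\nlorem ipsum dolor"))              # False
```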

Tokenization Process

  • One-Hot Encoding: A basic scheme that gives each word its own vector dimension (a single 1, zeros elsewhere); every pair of words ends up equidistant, so semantic relationships are not captured.
  • Advanced Tokenization (Byte Pair Encoding): Iteratively merges the most frequent adjacent pair of symbols into a new token, building a subword vocabulary (see the sketch after this list).
  • Importance of Tokenization: Affects nearly every aspect of LLM behavior, including spelling, simple string manipulation, and multilingual support.
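
A minimal sketch of the BPE merge loop on a toy corpus (real tokenizers typically operate on bytes, use end-of-word markers, and learn tens of thousands of merges):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Repeatedly merge the most frequent adjacent pair of symbols (toy BPE)."""
    vocab = Counter(tuple(w) for w in words)  # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():   # rewrite every word with the merge
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

print(bpe_merges(["low", "lower", "lowest", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```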

Embeddings

  • Purpose: Create a meaningful space where token relationships and distances correlate to semantic similarity.
  • Methods: Different techniques learn embeddings from co-occurrence statistics in the data.
  • Mathematical Operations: Vector arithmetic can capture relationships between words, e.g., $\text{Man} - \text{Woman} + \text{Queen} \approx \text{King}$ (see the sketch after this list).
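
A sketch of the analogy arithmetic above, using tiny hand-picked 3-d vectors purely for illustration (real embeddings are learned and have hundreds or thousands of dimensions):

```python
import numpy as np

# Hypothetical embeddings: axis 0 is roughly "gender", axis 1 roughly "royalty".
emb = {
    "man":   np.array([ 1.0, 0.0, 0.1]),
    "woman": np.array([-1.0, 0.0, 0.1]),
    "king":  np.array([ 1.0, 1.0, 0.1]),
    "queen": np.array([-1.0, 1.0, 0.1]),
}

def nearest(vec, exclude=()):
    """Vocabulary word whose embedding has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

target = emb["man"] - emb["woman"] + emb["queen"]
print(nearest(target, exclude={"man", "woman", "queen"}))  # king
```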

Metric Space and Vector Representation

  • Questions Raised: Whether concepts genuinely fit into a metric space; if they do not, the foundations of tokenization and embeddings are called into question.
  • Key Properties:
    • Symmetry: $d(x,y) = d(y,x)$, i.e., the distance between two tokens does not depend on their order.
    • Triangle Inequality: $d(x,z) \leq d(x,y) + d(y,z)$, which constrains how distances between related concepts can combine (a small check follows this list).
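
A small check of both properties for Euclidean distance on random vectors. Worth noting: the popular "cosine distance" $1 - \cos(x, y)$ is symmetric but does not in general satisfy the triangle inequality, which is part of why the metric-space question above is not trivial.

```python
import numpy as np

def euclid(x, y):
    """Euclidean distance, a genuine metric on embedding vectors."""
    return np.linalg.norm(x - y)

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 5))  # three random 5-d "embeddings"

assert np.isclose(euclid(x, y), euclid(y, x))       # symmetry
assert euclid(x, z) <= euclid(x, y) + euclid(y, z)  # triangle inequality
print("Both properties hold for this sample.")
```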

Training Goals of Large Language Models

  • Primary Function: Next-token prediction: an autoregressive model assigns probabilities to each possible next token given the sequence so far, and generation proceeds one token at a time (a decoding sketch follows).
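
A minimal sketch of autoregressive generation with greedy decoding. The bigram table below is a made-up stand-in for a real model; an actual LLM conditions on the entire prefix rather than only the previous token, but the generation loop has the same shape.

```python
import numpy as np

# Hypothetical next-token probabilities P[current][next]; each row sums to 1.
# Vocabulary: 0 = <s>, 1 = "the", 2 = "cat", 3 = "sat"
P = np.array([
    [0.0, 0.9, 0.05, 0.05],  # after <s>
    [0.0, 0.0, 0.7,  0.3 ],  # after "the"
    [0.0, 0.1, 0.0,  0.9 ],  # after "cat"
    [0.2, 0.5, 0.2,  0.1 ],  # after "sat"
])
vocab = ["<s>", "the", "cat", "sat"]

def generate(start=0, steps=3):
    """Greedy autoregressive decoding: always pick the most likely next token."""
    seq = [start]
    for _ in range(steps):
        seq.append(int(np.argmax(P[seq[-1]])))  # condition on the last token (bigram stand-in)
    return [vocab[t] for t in seq]

print(generate())  # ['<s>', 'the', 'cat', 'sat']
```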

Summary of Key Learning Outcomes

  • Explain the workings of tokenizers and how they are built.
  • Understand embedding spaces and critique the notion of concepts existing in a metric space.
  • Apply knowledge of loss functions in LLM training.