Flashcards covering key terms, concepts, models, algorithms, and techniques from Chapters 1–5 of “Hands-On Large Language Models.”
What is a Large Language Model (LLM)?
A neural network—usually Transformer-based—with many parameters that is pretrained on massive text corpora to understand and generate human language.
Name the three main families of Transformer models discussed.
Encoder-only (representation, e.g., BERT), decoder-only (generative, e.g., GPT), and encoder-decoder (seq-to-seq, e.g., T5).
Which 2017 paper introduced the Transformer architecture?
“Attention Is All You Need.”
Why did GPT-2 (2019) cause a stir?
It generated remarkably human-like text and became the first generative LLM to attract broad public attention, demonstrating the payoff of scaling to 1.5B parameters.
Define ‘representation model’.
An encoder-only model that focuses on producing embeddings or intermediate representations of text rather than generating it.
Define ‘generative model’.
A decoder-only (or encoder-decoder) model that generates new text autoregressively, one token at a time.
Bag-of-Words limitation
Ignores word order and semantics, treating documents as unordered token counts.
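A minimal sketch of this limitation using scikit-learn's CountVectorizer (the two sentences are invented for illustration): sentences with opposite meanings end up with identical count vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with opposite meanings but identical word counts.
docs = ["the dog bit the man", "the man bit the dog"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # ['bit' 'dog' 'man' 'the']
print(X.toarray())                  # both rows are [1 1 1 2] -- word order is lost
```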
What are embeddings?
Dense vectors that capture semantic properties of tokens, sentences, or documents.
How does word2vec learn embeddings?
By predicting neighbouring words (skip-gram / CBOW) using neural networks with negative sampling.
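A hedged sketch of skip-gram training with negative sampling via gensim; the toy corpus and hyperparameters are illustrative only.

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real training uses millions of sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=1 selects skip-gram; negative=5 enables negative sampling.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)
print(model.wv.most_similar("king", topn=3))
```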
Static vs contextual embeddings
Static (word2vec) assigns one vector per word; contextual (BERT) changes the vector based on sentence context.
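One way to see the contrast, assuming Hugging Face transformers and bert-base-uncased (the sentences and helper function are invented for this card): the same word "bank" receives different vectors in different sentences.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_vector(sentence, word):
    """Return the contextual hidden state of `word` inside `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = token_vector("I deposited cash at the bank.", "bank")
v2 = token_vector("We sat on the bank of the river.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1.0: one word, two context-dependent vectors
```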
Purpose of attention mechanism
Allows the model to focus on relevant parts of the input when computing representations or generating output.
Self-Attention
Attention applied within a single sequence, letting every token attend to all others.
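A minimal single-head, scaled dot-product self-attention sketch in PyTorch (random weights and dimensions are invented for illustration), showing every token attending to all others.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over one sequence x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # project tokens to queries/keys/values
    scores = q @ k.T / math.sqrt(k.shape[-1])  # every token scores every other token
    weights = torch.softmax(scores, dim=-1)    # attention weights sum to 1 per token
    return weights @ v                         # each output is a weighted mix of values

x = torch.randn(4, 8)                          # 4 tokens, model dimension 8
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([4, 8])
```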
Multi-Head Attention
Parallel attention heads that capture different relational patterns between tokens.
Tokenization
Process of splitting raw text into tokens that the model’s vocabulary can handle.
Subword tokenization advantage
Balances vocabulary size and ability to represent rare or unseen words by splitting them into meaningful chunks.
Byte Pair Encoding (BPE)
Popular subword algorithm that merges frequent symbol pairs iteratively.
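For example, GPT-2's byte-level BPE tokenizer (loaded via Hugging Face transformers) splits rare words into frequent subword pieces; the exact split depends on the learned vocabulary.

```python
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE; rare words are split into frequent subword pieces.
tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("Tokenization of unfathomable words"))
# prints a list of subword pieces, with 'Ġ' marking a preceding space
```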
Context window (context length)
Maximum number of tokens an LLM can process in a single pass.
Why are GPUs important for LLMs?
They accelerate matrix operations needed for training/inference; VRAM limits model size.
Difference between open and proprietary LLMs
Open models release weights/architecture (e.g., Llama 2); proprietary models stay behind an API (e.g., GPT-4).
Two-step training paradigm for LLMs
(1) Pretraining on large unlabeled text, (2) fine-tuning/alignment on task-specific or preference data.
Masked Language Modeling (MLM)
Pretraining task where the model predicts masked tokens (used in BERT).
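A quick MLM demo using the Hugging Face fill-mask pipeline with bert-base-uncased (the example sentence is invented for this card).

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Paris is the [MASK] of France."):
    print(pred["token_str"], round(pred["score"], 3))  # top candidates for the masked token
```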
Instruction fine-tuning
Teaching a model to follow natural-language instructions by training on (prompt, desired answer) pairs.
Reinforcement Learning from Human Feedback (RLHF)
Human annotators rank candidate outputs; a reward model is trained on those rankings and then used to optimise the LLM so its behaviour matches human preferences.
Primary use cases of LLMs
Text generation, translation, summarisation, classification, code assistance, semantic search, chatbots, etc.
Retrieval-Augmented Generation (RAG)
Technique that injects external documents into the prompt to supply up-to-date or domain knowledge.
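A minimal retrieval-then-prompt sketch, assuming sentence-transformers for retrieval; the two documents and the prompt template are invented, and the final LLM call is omitted.

```python
from sentence_transformers import SentenceTransformer, util

# Toy knowledge base; in practice this would be a vector database.
docs = [
    "BERTopic combines embeddings, UMAP, HDBSCAN, and c-TF-IDF.",
    "Flash Attention reduces memory traffic on the GPU.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = embedder.encode(docs)

question = "How does BERTopic group documents?"
scores = util.cos_sim(embedder.encode(question), doc_embs)
context = docs[int(scores.argmax())]

# Inject the retrieved context into the prompt before calling any LLM.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```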
Ethical concerns around LLMs
Bias, hallucination, harmful content, intellectual-property questions, transparency, regulation.
What is UMAP used for in text clustering?
Reducing high-dimensional embeddings to lower dimensions while preserving structure for clustering/visualisation.
HDBSCAN role in clustering pipeline
Density-based algorithm that groups similar documents and labels outliers without pre-setting cluster count.
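The two cards above form the clustering half of a BERTopic-style pipeline; a minimal sketch with umap-learn and hdbscan follows, where random vectors stand in for real document embeddings and the hyperparameters are illustrative.

```python
import numpy as np
import umap
import hdbscan

# Stand-in for real document embeddings (e.g., from a SentenceTransformer).
embeddings = np.random.rand(500, 384)

# Step 1: reduce to a handful of dimensions so density estimates stay meaningful.
reduced = umap.UMAP(n_components=5, metric="cosine", min_dist=0.0).fit_transform(embeddings)

# Step 2: density-based clustering; no need to pre-set the number of clusters.
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
print(set(labels))  # cluster ids; -1 marks outliers
```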
c-TF-IDF in BERTopic
Class-based TF-IDF that weights terms by importance within each cluster (topic) across the corpus.
KeyBERTInspired representation
Reranks topic words by comparing candidate-word embeddings with average document embeddings per topic.
Maximal Marginal Relevance (MMR)
Diversifies topic keywords by balancing relevance and redundancy.
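A hedged sketch of how c-TF-IDF, KeyBERTInspired, and MMR plug into BERTopic; the 20 Newsgroups dataset and the chained-list form of representation_model are assumptions about your BERTopic version.

```python
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes")).data

# Chain representations: c-TF-IDF candidates -> KeyBERT-style rerank -> MMR diversification.
representation_model = [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)]
topic_model = BERTopic(representation_model=representation_model, verbose=True)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic(0))  # top keywords for the first topic
```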
Prompt Engineering
Crafting and iteratively refining instructions to guide generative LLMs toward desired outputs.
Zero-shot classification via embeddings
Assigning labels by comparing document embeddings with label-description embeddings using cosine similarity—no training data needed.
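A minimal sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model; the label descriptions and review text are invented.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

labels = ["a negative movie review", "a positive movie review"]
doc = "The plot was dull and the acting was even worse."

# Pick the label whose description sits closest in embedding space.
sims = util.cos_sim(model.encode(doc), model.encode(labels))
print(labels[int(sims.argmax())])  # expected: "a negative movie review"
```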
SentenceTransformer library
Python package that wraps Transformer models for easy embedding generation of sentences/documents.
Flash Attention purpose
Optimised GPU kernel that speeds attention computation by reducing memory traffic.
Grouped-Query Attention (GQA)
Efficiency improvement where sets of heads share key/value projections to lower memory during inference (used in Llama 2/3).
Rotary Positional Embeddings (RoPE)
Technique that encodes positions as rotations in embedding space, enabling longer context and packed training.
Why is dimensionality reduction helpful before clustering?
Mitigates the curse of dimensionality and reduces noise, making density or distance measures more meaningful.
Difference between PCA and UMAP
PCA is linear, optimises variance; UMAP is non-linear, preserves local and global manifold structure.
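For reference, both reducers expose the same fit_transform interface even though the underlying maths differs (random vectors stand in for real embeddings):

```python
import numpy as np
import umap
from sklearn.decomposition import PCA

embeddings = np.random.rand(200, 384)  # stand-in for document embeddings

pca_2d = PCA(n_components=2).fit_transform(embeddings)                          # linear projection
umap_2d = umap.UMAP(n_components=2, metric="cosine").fit_transform(embeddings)  # manifold-based
print(pca_2d.shape, umap_2d.shape)
```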
Major limitation of bag-of-words for topic modeling
Cannot capture synonymy, polysemy, or word order; purely frequency-based.
How does BERTopic label topics with LLMs?
Feeds representative documents and keywords into a generative model (e.g., GPT-3.5) to output a concise topic name.
GPU-poor workaround
Use smaller quantised models, external APIs, or run inference on a free Colab T4 GPU (16 GB VRAM).
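A hedged sketch of loading a small open model in 4-bit with transformers plus bitsandbytes; the model id is just an example choice and a recent transformers version is assumed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Any small open model id works here; this one is just an example choice.
model_id = "microsoft/Phi-3-mini-4k-instruct"

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # 4-bit weights cut VRAM to roughly a quarter of fp16
    device_map="auto",               # place layers on whatever GPU/CPU memory is available
)
```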
Sparse attention motivation
Scale Transformers to longer sequences by limiting each token’s attention scope, reducing quadratic cost.
Word vs character vs byte tokens—impact on context
Smaller units (characters/bytes) avoid out-of-vocabulary problems but inflate sequence length, shrinking the effective context window.
Why are open-source frameworks like Hugging Face important?
Provide model zoo, tokenizers, training/inference utilities, fostering reproducibility and experimentation.