Week 7: COM3029 & COMM061 - Scaling Up to Large Language Models Notes

ELMo: Deep Contextualised Word Representation

  • Recap of ELMo, acknowledging figures from http://jalammar.github.io/illustrated-bert/
  • Pre-training process:
    • Given input like "Let’s stick to", predict the next most likely word.
    • Trained on large datasets to pick up language patterns.
    • Example: After "hang", ELMo assigns higher probability to "out" than "camera".
  • Task-specific weighting of biLM layers:
    • $s_{task}$ are softmax-normalized weights.
    • $\gamma_{task}$ is a scalar parameter allowing the task model to scale the entire ELMo vector.
    • $\gamma$ aids the optimization process.
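
The layer weighting above can be sketched in a few lines. This is a minimal illustration, not ELMo's actual implementation: the function name `elmo_task_vector`, the toy layer vectors and the raw scores are assumptions made up for the example. It assumes the usual combination, where the task vector for a token is gamma times the sum over layers of s[j] * h[j].

```python
import math

def softmax(scores):
    """Normalise raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_task_vector(layer_vectors, raw_scores, gamma):
    """Weighted sum of biLM layer vectors for one token, scaled by gamma.

    layer_vectors: one vector per biLM layer (all the same length)
    raw_scores:    one learnable scalar per layer (softmax-normalised here)
    gamma:         task-specific scalar that scales the whole vector
    """
    s_task = softmax(raw_scores)
    dim = len(layer_vectors[0])
    mixed = [0.0] * dim
    for weight, vec in zip(s_task, layer_vectors):
        for d in range(dim):
            mixed[d] += weight * vec[d]
    return [gamma * x for x in mixed]

# Toy input: 3 layers, 2-dimensional vectors (values are made up).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vector = elmo_task_vector(layers, raw_scores=[0.0, 0.0, 0.0], gamma=2.0)
print(vector)
```

With equal raw scores every layer gets weight 1/3, so the example reduces to an average of the layers multiplied by gamma.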

BERT: Pre-training Process

  • Recap of BERT’s Masked Language Model (MLM) and Next Sentence Prediction (NSP) training objectives combined.
  • NSP objective: predict whether the second sentence follows the first.
  • The entire input sequence goes through the Transformer model.
  • [CLS] token’s output is transformed into a $2 \times 1$ vector using a classification layer (learned weights and biases).
  • The probability of IsNextSequence is then calculated with softmax.
  • Masked LM and Next Sentence Prediction are trained together to minimize the combined loss function.
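
The NSP head described above might look roughly like this. It is a sketch under stated assumptions, not BERT's code: the hidden size of 3, the weight values, the label order (index 0 = IsNext) and the placeholder MLM loss of 2.3 are all made up for illustration, and the combined loss is assumed to be a plain sum of the two losses.

```python
import math

def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsp_probabilities(cls_output, weights, bias):
    """Map the [CLS] output vector to [P(IsNext), P(NotNext)].

    weights has shape 2 x hidden_size and bias has length 2;
    both would be learned during pre-training.
    """
    logits = [
        sum(w * h for w, h in zip(row, cls_output)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

def combined_loss(mlm_loss, nsp_probs, is_next):
    """Assumed combination: MLM loss plus NSP cross-entropy."""
    target_index = 0 if is_next else 1
    nsp_loss = -math.log(nsp_probs[target_index])
    return mlm_loss + nsp_loss

# Toy values (hypothetical): hidden size 3 instead of 768.
cls_output = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
bias = [0.0, 0.0]

probs = nsp_probabilities(cls_output, weights, bias)
loss = combined_loss(mlm_loss=2.3, nsp_probs=probs, is_next=True)
print(probs, loss)
```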

RoBERTa

  • More training data (160GB of text vs. 16GB for BERT).
  • Masking: Introduces dynamic masking.
  • Experimented with the removal of NSP loss.
  • Training with large mini-batches improves perplexity on the MLM objective and end-task accuracy.
  • Byte-Pair Encoding is used over raw bytes instead of Unicode characters.

XLM-RoBERTa (XLM-R)

  • Builds on XLM, which extends MLM to Translation Language Modelling (TLM); XLM-R itself is pre-trained with multilingual MLM at a larger scale.
  • TLM objective extends MLM to pairs of parallel sentences.
  • Example: Predict a masked English word using both the English sentence and its French translation, aligning English and French representations.
  • Leverages the French context if the English one is insufficient to infer the masked English words.

SentenceBERT

  • Each of the two sentences is passed through BERT and a pooling layer, resulting in two 768-dimensional vectors, u and v.
  • Three approaches for optimising different objectives using these vectors.
  • Suitable NLP Task: Natural Language Inference (NLI)
    • Given sentences A and B (hypothesis and premise), predict if the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise.
  • Suitable NLP Task: Semantic Textual Similarity
    • Given sentences A and B, measure their semantic (meaning-wise) similarity (normalized on a scale of 0-1).
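
For the similarity task, one way to compare u and v is cosine similarity. The sketch below assumes mean pooling over token vectors and uses made-up 3-dimensional vectors in place of real 768-dimensional ones. The helper names `mean_pool` and `cosine_similarity` are chosen for this example only.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors into one fixed-size sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy token vectors (made up); real ones would come from the encoder.
u = mean_pool([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
v = mean_pool([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
w = mean_pool([[0.0, 0.0, 1.0]])

print(cosine_similarity(u, v))  # same direction, so close to 1
print(cosine_similarity(u, w))  # orthogonal, so 0
```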

Transfer Learning and NLU Tasks

  • Sequence Classification
    • Providing class labels to a sequence of words, typically a sentence; can be a conversation, paragraph, or document.
  • Token Classification
    • Providing token-level or phrase-level labels to a sequence of words.
  • Emotion Identification
    • Example: “I am excited about this tutorial” -> (Label?)
    • Example: “Data is the new oil” -> (Label?)
    • Considerations for multi-label vs. multi-class.
  • Abbreviation and Long-form Detection
    • Example: “ECG reports show reduced pressure”
    • Example: “Neural Networks are good at generalization but NN explainability is much needed” (Label for each token?)
    • Considerations for token/label ratio; hard with real-world data.
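
To make "a label for each token" concrete, here is a small sketch that turns span annotations into per-token tags. The BIO scheme and the tag names `LF` (long form) and `AC` (abbreviation) are assumptions for illustration; the actual label set used in the module may differ.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, kind) spans into one BIO label per token.

    start is inclusive and end is exclusive, both as token indices.
    """
    labels = ["O"] * len(tokens)
    for start, end, kind in spans:
        labels[start] = f"B-{kind}"
        for i in range(start + 1, end):
            labels[i] = f"I-{kind}"
    return labels

sentence = ("Neural Networks are good at generalization "
            "but NN explainability is much needed")
tokens = sentence.split()

# Hypothetical annotation: tokens 0-1 are a long form, token 7 its abbreviation.
spans = [(0, 2, "LF"), (7, 8, "AC")]
labels = spans_to_bio(tokens, spans)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")

# Share of tokens that carry a non-O label; usually small in real data.
labelled_fraction = sum(label != "O" for label in labels) / len(labels)
print(labelled_fraction)
```

Only 3 of the 12 tokens carry a non-O label here, which hints at why the token/label ratio matters for this kind of task.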

How to Model NLP Tasks

  • Define the problem clearly
    • What specific NLP task are you trying to solve? (e.g., sentiment analysis, question answering, machine translation, code generation, etc.)
  • Analyze the data for challenges
    • Multilingual? Does it involve multiple languages?
    • Dialectal? Does it contain variations within a language?
    • Domain-specific? Is the language from a particular field (e.g., medical, legal)?
    • Conversational? Is it informal, with slang and colloquialisms?
    • Chronological? Does the order of the data matter (e.g., time-series data)?
    • Core: What is your input/output? What machine learning algorithms are applicable?
  • Data Size and Quality
    • How much data is available? Is it labelled?
    • Is it clean and well-formatted? If no, what preprocessing techniques are applicable to your problem?
  • Select appropriate model architecture (e.g., sequence-to-sequence, classification/regression).
  • Select appropriate approaches given data (e.g., fine-tuning with labelled data / continual pre-training for unlabelled data).
  • Identify suitable metrics for both implicit (e.g., training/validation loss, perplexity) and explicit evaluation
    • Examples: AUROC, Precision, Recall, F1, Spearman/Pearson’s Correlation, BLEU, ROUGE, BERTScore, TER, etc.
    • To be explained in Lecture 9 on Evaluating model performance comprehensively.

NLP Research Community

  • Research papers on ACL Anthology – 93k+ papers!
    • Search for relevant area or topic to learn which methods are applied.
    • Search for method to learn which topics it is applied to.
  • Great way to learn about current topics and ongoing research and a good archive of NLP papers.
  • ACL Anthology data is itself used as a dataset in research.
  • Repository contains research papers, authors, paper metadata including links to codebase, and presentation video.

Research Papers

  • Most NLP papers are published as 9-page conference papers (long papers) with 3 reviewers (double blind).
  • Short papers (5 pages).
  • Findings papers.
  • Systems demo track.
  • Industry track.
  • Journal papers (up to 12 pages, or unlimited).
  • Survey Papers [e.g., IJCAI Survey track]
  • Venues: ACL, EMNLP, NAACL, EACL, AACL, COLING, LREC, AAAI, IJCAI, NeurIPS, ICLR, CVPR, ECCV, Interspeech, IWSLT, and various other conferences and workshops.

NLP Research Tracks

  • Machine Learning for NLP
  • Interpretability & Analysis of Models for NLP
  • Resources & Evaluation
  • Ethics & NLP
  • Phonology, Morphology & Word Segmentation
  • Syntax: Tagging, Chunking & Parsing
  • Semantics: Lexical
  • Semantics: Sentence-level Semantics, Textual Inference & Other areas
  • Linguistic Theories, Cognitive Modeling & Psycholinguistics
  • Information Extraction
  • Information Retrieval & Text Mining
  • Question Answering
  • Summarization
  • Machine Translation & Multilinguality
  • Speech & Multimodality
  • Discourse & Pragmatics
  • Sentiment Analysis, Stylistic Analysis, & Argument Mining
  • Dialogue & Interactive Systems
  • Language Grounding to Vision, Robotics & Beyond
  • Computational Social Science & Cultural Analytics
  • NLP Applications
  • Special themes are often introduced.

Following NLP Research

  • arXiv papers
  • Twitter accounts
    • NLP researchers with active accounts (grad students, profs, industry folks)
    • Official conference accounts
  • “NLP Highlights” podcast
  • “NLP News” newsletter, ‘NLP People’ (https://nlppeople.com/jobs/)
  • Huggingface, Microsoft Research, Google DeepMind, Meta AI (FAIR), Amazon, AI4Bharat, eBay Research, NVIDIA Research, academic labs (StanfordNLP, MBZUAI, CFILT@IITB, SurreyNLP) and many startups (Cohere, Anthropic)

How to Begin NLP Research

  • Find a relevant problem and identify recent research
  • Find good or popular tools:
    • Identify papers, ask around, search the web
  • Trying to identify the best tool for your job:
    • Produces appropriate, sufficiently detailed output?
    • Accurate? (on the measure you care about)
    • Robust? (accurate on your data, not just theirs)
    • Fast?
    • Easy and flexible to use? Nice file formats, command line options, visualization?
    • Trainable for new data and languages? How slow is training?
    • Open-source and easy to extend?

Shared Tasks

  • Shared tasks are competitions organized by the research community that are open to participation. A shared task typically provides:
  • Training data (optional)
  • Development data (optional)
  • Test data, for evaluating the final participating systems
  • Evaluation metric(s) (how well does the system perform)
  • Additional Training/Synthetic data (optional)
  • A prize (optional; with clear rules on what data can be used)
  • An easy route to writing a system description paper.

Autoregressive Decoders

  • Goal of NLP research: ensure machines understand human language.
    • Analyze Human Language: Textual analytics, extraction, and retrieval to analyze the information present in human language.
    • Generate Human Language: Generation of understandable human language to interface with people.

Pre-trained Transformer Flavors

  • Encoders
    • Examples: BERT, RoBERTa, SciBERT.
    • Captures bidirectional context
  • Decoders
    • Examples: GPT-2, GPT-3, LaMDA
    • Also known as: causal or auto-regressive language model
    • Natural if the goal is generation, but cannot condition on future words
  • Encoder-Decoders
    • Examples: BART, T5, Meena
    • Conditional generation based on an encoded input

Autoregressive Decoders

  • Non-auto-regressive model
    • Inputs and outputs are different
    • Use case: Assigning labels for each word (e.g., part-of-speech tagging)
  • Causal or auto-regressive model
    • Each output is the next input in the sequence
    • Use case: Generating tokens (e.g., language modeling)
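
The phrase "each output is the next input" can be illustrated with a toy generation loop. The lookup table below stands in for a real language model and is purely hypothetical, as is the `<eos>` stop token.

```python
def generate(next_token, prompt, max_new_tokens, stop="<eos>"):
    """Greedy autoregressive loop: feed each output back in as input."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == stop:
            break
        tokens.append(token)  # the new output becomes part of the next input
    return tokens

# Toy "model": a fixed table mapping the last token to the next one.
TOY_TABLE = {"to": "be", "be": "or", "or": "not", "not": "to"}

def toy_next_token(tokens):
    return TOY_TABLE.get(tokens[-1], "<eos>")

generated = generate(toy_next_token, ["to"], max_new_tokens=5)
print(" ".join(generated))
```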

The GPT Family

  • GPT (2018), 117 million parameters.
  • GPT-2 (2019), 1.5 billion parameters.
  • GPT-3 (2020), 175 billion parameters. (NeurIPS 2020 best paper)

GPT Anatomy

  • An autoregressive model that predicts the next token given tokens so far (either predicted or given as part of input).

Masked Self-attention (Decoders)

  • Masks future tokens by interfering in the self-attention calculation, blocking information from tokens to the right of the position being calculated.
  • A normal self-attention block allows a position to ‘peek’ at tokens to its right; masked self-attention prevents that from happening.
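
A minimal sketch of the masking step, assuming the scores are a square matrix where row i holds the attention scores of position i over all positions. Scores to the right of the diagonal are set to negative infinity before the softmax, so they receive zero weight. The function name `causal_attention_weights` is chosen for this example.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    """Softmax over one row; entries of -inf end up with weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask positions j > i in each row i, then normalise each row."""
    n = len(scores)
    masked = [
        [scores[i][j] if j <= i else NEG_INF for j in range(n)]
        for i in range(n)
    ]
    return [softmax(row) for row in masked]

# Toy scores: all zeros, so unmasked positions share the weight equally.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
for row in weights:
    print(row)
```

The first position can only attend to itself, the second splits its weight over two positions, and so on.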

Decoder-only Block

  • As it processes each subword, it masks the “future” words and conditions on (i.e., attends to) the previous words.

Pre-training Decoder Stack Language Models

  • Generate text by predicting the next token given a prompt.
  • Given a prompt (input) $x = (x_1, x_2, \ldots, x_n)$, where $x_i$ is a token at the i-th index,
  • A causal language model (autoregressive decoder) estimates the probability of the next token $x_{n+1}$
  • $\theta$ is the parameter of the language model.
  • Models are trained using causal language modelling loss:
    • $L = - \sum_{i=1}^{T} \log P(x_i \mid x_1, \ldots, x_{i-1}; \theta)$
    • $T$ is the total number of tokens in a sequence.
  • During inference (generation), given a prompt x, the model generates new tokens by sampling.
  • At each step, the next token is drawn from the probability distribution over the vocabulary given the preceding tokens.
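
The causal language modelling loss is the negative sum of the log-probabilities the model assigns to each actual token given the tokens before it. The sketch below takes those per-token probabilities as given; the three values are made up, and the perplexity line is an added illustration using the standard definition exp(L / T).

```python
import math

def causal_lm_loss(token_probs):
    """Negative log-likelihood summed over the T tokens of a sequence.

    token_probs[i] is the probability the model gave to the token that
    actually occurred at position i, given all earlier tokens.
    """
    return -sum(math.log(p) for p in token_probs)

# Toy probabilities for a 3-token sequence (hypothetical values).
token_probs = [0.5, 0.25, 0.125]

loss = causal_lm_loss(token_probs)
perplexity = math.exp(loss / len(token_probs))
print(loss, perplexity)
```

Higher probabilities on the correct tokens give a lower loss, which is what training pushes towards.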

How Models Decode

  • Using various sampling techniques!
  • Top-k sampling
    • Probabilities for all possible next tokens are generated and sorted in descending order.
    • Only the top-k tokens are considered for the next step.
    • A token is then randomly sampled from this reduced set, in proportion to the tokens' renormalised probabilities.
    • Lower k leads to more focused and deterministic output, as the model has fewer options to choose from.
    • Higher k introduces more randomness and creativity, as a wider range of potential tokens is considered.
    • In essence, “choosing the next word from the top-k suggestions”.
  • Top-p (nucleus) sampling
    • Selects the smallest set of most probable tokens whose cumulative probability reaches or exceeds a threshold p.
    • The cumulative probability is calculated as you go down the list.
    • Once the cumulative probability reaches or exceeds p, the remaining lower-probability tokens are discarded.
    • The next token is then randomly sampled from this nucleus of tokens.
    • A lower p results in more focused and coherent output, as only the most likely tokens (that together cross the probability threshold) remain.
    • A higher p allows for more diverse and surprising output, as a larger set of tokens (higher cumulative probability) becomes eligible for sampling.
    • In essence, “choosing the next word from a set of suggestions that together make up a certain level of confidence (prob.)”.
  • Combine top-k and top-p sampling:
    • Apply top-k to narrow down the candidate set, then apply top-p to this reduced set.

The First GPT (or GPT-1)

  • Transformer decoder with 12 layers.
  • Trained on BooksCorpus: over 7000 unique books (4.6GB text).
  • Showed that language modeling at scale can be an effective pretraining technique for downstream tasks like natural language inference.
  • Example:
    • [START] The man is in the doorway [SEP] The person is near the door [EXTRACT]

GPT-2

  • Layer norm moved to the input of each sub-block.
  • Vocabulary extended to 50,257 tokens and context size increased from 512 to 1024.
  • Trained on 8 million docs from the web (WebText), minus Wikipedia
  • Lab Assignment: Train your own GPT-2 from scratch on the “Tiny-Shakespeare” dataset, utilizing some pre-trained GPT-2 weights.

GPT-2 Variants

  • GPT-2 Small
    • Model Dimensionality: 768
  • GPT-2 Medium
    • Model Dimensionality: 1024
  • GPT-2 Large
    • Model Dimensionality: 1280
  • GPT-2 Extra Large
    • Model Dimensionality: 1600

Emergent Abilities of Large Language Models: GPT-2 (2019)

  • GPT-2 (1.5B parameters)
  • Same architecture as GPT, just bigger (117M -> 1.5B)
  • Trained on much more data: 4GB -> 40GB of internet text data (WebText)
  • Scraped from links posted on Reddit with at least 3 upvotes (a rough proxy for human-judged quality)

Zero-shot Capabilities

  • One key emergent ability in GPT-2 is zero-shot learning: the ability to do many tasks with no examples and no gradient updates
  • Specifying the right sequence prediction problem (e.g. question answering):
    • Passage: Tom Brady… Q: Where was Tom Brady born? A: …
  • Comparing probabilities of sequences (e.g., Winograd Schema Challenge [Levesque, 2011]):
    • The cat couldn’t fit into the hat because it was too big. Does it = the cat or the hat?
    • ≡ Is P(…because the cat was too big) >= P(…because the hat was too big)?
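
The probability comparison can be sketched as below. The scoring function is a stand-in for a real language model: `toy_log_prob` simply looks up made-up placeholder scores, so only the comparison logic is meaningful here.

```python
def pick_more_probable(sequence_log_prob, template, candidates):
    """Fill the template with each candidate and keep the likeliest sequence."""
    scores = {c: sequence_log_prob(template.format(c)) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Placeholder log-probabilities (made up, NOT produced by any model).
TOY_SCORES = {
    "The cat couldn't fit into the hat because the cat was too big.": -21.0,
    "The cat couldn't fit into the hat because the hat was too big.": -24.5,
}

def toy_log_prob(text):
    """Hypothetical scorer; a real one would sum token log-probabilities."""
    return TOY_SCORES[text]

template = "The cat couldn't fit into the hat because {} was too big."
best, scores = pick_more_probable(toy_log_prob, template, ["the cat", "the hat"])
print(best)
```

No task-specific training is involved: the pronoun is resolved purely by asking which completed sentence the model finds more likely.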

GPT-3

  • More layers & parameters, bigger dataset, longer training, larger embedding/hidden dimension, larger context window
  • GPT-3 (175B parameters)
  • Another increase in size (1.5B -> 175B) and data (40GB -> over 600GB)

Model Size

  • BERT-Base:
    • 12 transformer blocks, 12 attention heads, 110M parameters
  • BERT-Large:
    • 24 transformer blocks, 16 attention heads, 340M parameters
  • GPT-2:
    • Trained on 40GB of text data (8M webpages), 1.5B parameters
  • GPT-3:
    • 175B parameters
  • GPT-4:
    • Estimated 1.8 trillion parameters [estimated number, unconfirmed by OpenAI].
  • GPT-4o Mini:
    • Could be as small as 8 billion parameters.

GPT-4o Onwards

  • OpenAI's ‘o’-series of models, released after GPT-4o, can emulate the thinking process.
  • Enabled by reinforcement learning (RL), with more synthetic data added to the RL pipeline; this happens after pre-training, in the post-training phase
  • Models can simulate an inner monologue, where they articulate their reasoning process step-by-step; helps in verifying the accuracy of their conclusion

Learning – Embedded within Pre-training Objective

  • University of Surrey is located in ___, United Kingdom. [Trivia]
  • I put ___ fork down on the table. [syntax]
  • The woman walked across the street, checking for traffic over ___ shoulder. [coreference]
  • I went to the ocean to see the fish, turtles, seals, and ___. [lexical semantics/topic]
  • Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ___. [sentiment]
  • Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ___. [some reasoning – this is harder]
  • I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ___ [some basic arithmetic; they may not learn the Fibonacci sequence]

Prompting a Language Model

  • Pretraining Knowledge
    • Language models learn various linguistic patterns during pretraining, such as syntax, coreference, and semantics.
  • Zero/Few-Shot Capabilities
    • From GPT-2 onwards, LLMs can perform tasks with no examples (zero-shot) or only a few examples (few-shot), by predicting sequences or comparing probabilities.
  • Prompt Engineering
    • Crafting prompts that reformulate tasks to resemble those solved during pretraining.

Emergent Few-Shot Learning

  • Specify a task by simply prepending examples of the task before your example
  • Also called in-context learning, to stress that no gradient updates are performed when learning a new task
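
Prepending examples amounts to building one long prompt string. The sketch below uses an "Input:/Output:" layout and made-up sentiment examples; both are assumptions for illustration, and any consistent layout could be used instead.

```python
def build_few_shot_prompt(examples, query,
                          input_label="Input", output_label="Output"):
    """Prepend solved examples, then leave the final output open."""
    blocks = [f"{input_label}: {x}\n{output_label}: {y}" for x, y in examples]
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)

# Hypothetical demonstrations for a sentiment task.
examples = [
    ("I loved this film.", "positive"),
    ("The plot was dull.", "negative"),
]

few_shot = build_few_shot_prompt(examples, "The acting was superb.")
zero_shot = build_few_shot_prompt([], "The acting was superb.")
print(few_shot)
```

With an empty example list the same function produces a zero-shot prompt; the model's weights are never updated in either case.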

Few-Shot Learning

  • Observed improvement in performance as an increasing number of examples are provided.

New Methods of Prompting LMs

  • Zero/few-shot prompting
  • Traditional fine-tuning
  • Lab Assignment: Get familiar with zero-shot, one-shot and few-shot scenarios by prompt engineering for Dialogue Summarization.

Limits of Prompting for Harder Tasks

  • Some tasks seem too hard for even large LMs to learn through prompting alone, especially tasks involving richer, multi-step reasoning.
  • Solution: change the prompt!

Chain-of-Thought Prompting

  • Encourages the model to reason step-by-step, improving performance on complex tasks.

Zero-Shot Chain-of-Thought Prompting

  • Adding "Let’s think step by step." to the prompt significantly improves zero-shot performance.
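
In code, zero-shot chain-of-thought is a small change to the prompt. The "Q:/A:" layout and the example question below are assumptions for illustration; only the trigger sentence comes from the notes.

```python
COT_TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question):
    """Append the reasoning trigger where the answer would start."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

# Made-up question, used only to show the resulting prompt.
question = "A shop has 3 boxes with 4 pens each. How many pens are there in total?"
prompt = zero_shot_cot_prompt(question)
print(prompt)
```

The model then continues the text after the trigger, producing its reasoning before the final answer.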

Summary

  • Variants of BERT are available.
  • NLP tasks can be grouped together, depending on the role/need in the pipeline.
  • Autoregressive Decoders are Transformer variants with stacked decoders.
  • Impressive task-level generation performance seen with emergent properties, in both zero-/few-shot scenarios, including CoT.
  • Modelling for an NLP task may very well be a matter of creating/engineering a detailed prompt, but ensuring accurate outputs from LLMs still requires supervised training data.