Week 7: COM3029 & COMM061 - Scaling Up to Large Language Models Notes

ELMo: Deep Contextualised Word Representation

Recap of ELMo, acknowledging figures from http://jalammar.github.io/illustrated-bert/
Pre-training process:
- Given input like "Let’s stick to", predict the next most likely word.
- Trained on large datasets to pick up language patterns.
- Example: After "hang", ELMo assigns higher probability to "out" than "camera".
Task-specific weighting of biLM layers:
- $s_{task}$ are softmax-normalized weights.
- $\gamma_{task}$ is a scalar parameter allowing the task model to scale the entire ELMo vector.
- $\gamma$ aids the optimization process.
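
A minimal sketch of the weighting above, assuming the usual ELMo form in which a token's task vector is $\gamma_{task} \sum_j s_{task,j} h_j$, with $h_j$ the token's vector from biLM layer $j$. The function names, toy vectors, and raw weights are illustrative assumptions, not taken from the lecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_vector(layer_vectors, raw_weights, gamma):
    """Weighted sum of biLM layer vectors for one token, scaled by gamma."""
    s = softmax(raw_weights)  # s_task: one normalised weight per layer
    dim = len(layer_vectors[0])
    mixed = [sum(s[j] * layer_vectors[j][d] for j in range(len(layer_vectors)))
             for d in range(dim)]
    return [gamma * x for x in mixed]

# Toy example: 3 layers, 2-dimensional vectors, equal raw weights.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec = elmo_vector(layers, raw_weights=[0.0, 0.0, 0.0], gamma=2.0)
```

With equal raw weights each layer contributes one third, so the only task-specific effect left in this toy case is the scaling by gamma.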

BERT: Pre-training Process

Recap of BERT’s Masked Language Model (MLM) and Next Sentence Prediction (NSP) training objectives combined.
NSP objective: Predict whether the second sentence follows the first in the original text.
The entire input sequence goes through the Transformer model.
[CLS] token’s output is transformed into a $2 \times 1$ vector using a classification layer (learned weights and biases).
The probability of the IsNext label is then calculated with softmax.
Masked LM and Next Sentence Prediction are trained together to minimize the combined loss function.
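
The NSP head and the combined loss can be sketched as below. This assumes the combined loss is a plain sum of the two cross-entropy terms and that index 0 stands for IsNext. The 3-dimensional [CLS] vector, the weights, and the MLM probability are made-up toy values.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nsp_probabilities(cls_vector, weights, bias):
    """Linear layer mapping the [CLS] vector to 2 logits, then softmax."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

def cross_entropy(probs, target_index):
    """Negative log-probability of the correct class."""
    return -math.log(probs[target_index])

# Hypothetical 3-dimensional [CLS] output and a 2 x 3 classification layer.
cls_vector = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
bias = [0.0, 0.0]

probs = nsp_probabilities(cls_vector, weights, bias)  # [P(IsNext), P(NotNext)]
nsp_loss = cross_entropy(probs, target_index=0)

# Assume the MLM head gave the true masked token probability 0.25.
mlm_loss = -math.log(0.25)
total_loss = mlm_loss + nsp_loss  # both objectives are minimised jointly
```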

RoBERTa

More training data (160GB vs. 16GB for BERT).
Masking: Introduces dynamic masking.
Experimented with the removal of NSP loss.
Using large mini-batches improves the perplexity of the MLM objective and end-task accuracy.
Byte-Pair Encoding is used over raw bytes instead of Unicode characters.
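
The difference between static and dynamic masking can be sketched as follows. This is a simplification: every selected token is replaced by [MASK], whereas BERT-style masking also keeps some selected tokens or swaps them for random ones. The helper name and the example sentence are illustrative.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with [MASK] with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()

# Static masking: one pattern chosen during preprocessing, reused every epoch.
static = mask_tokens(tokens, 0.15, random.Random(0))
static_epochs = [static for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn every time the sequence is used.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, 0.15, rng) for _ in range(3)]
```

Under dynamic masking the model can see a different set of masked positions for the same sentence across epochs, which matters more as training runs get longer.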

XLM-RoBERTa (XLM-R)

XLM, the predecessor of XLM-R, extends MLM to Translation Language Modelling (TLM); XLM-R itself is pre-trained with multilingual MLM only.
TLM objective extends MLM to pairs of parallel sentences.
Example: Predict a masked English word using both the English sentence and its French translation, aligning English and French representations.
Leverages the French context if the English one is insufficient to infer the masked English words.

SentenceBERT

Two sentences are each passed through BERT and a pooling layer, resulting in two 768-dimensional vectors, u and v.
Three training objectives can be optimised using these vectors: classification, regression (cosine similarity), and triplet loss.
Suitable NLP Task: Natural Language Inference (NLI)
- Given sentences A and B (hypothesis and premise), predict if the hypothesis is true (entailment), false (contradiction), or undetermined (neutral) given the premise.
Suitable NLP Task: Semantic Textual Similarity
- Given sentences A and B, measure their semantic (meaning-wise) similarity (normalized on a scale of 0-1).
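
A rough sketch of how u and v might be produced and compared for semantic similarity. It assumes mean pooling over token vectors (one common pooling choice) and uses toy 3-dimensional vectors in place of 768-dimensional BERT outputs.

```python
import math

def mean_pool(token_vectors):
    """Average token vectors into one fixed-size sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

def cosine_similarity(u, v):
    """Cosine of the angle between two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy token embeddings for two sentences of different lengths.
sentence_a = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]
sentence_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]

u = mean_pool(sentence_a)
v = mean_pool(sentence_b)
similarity = cosine_similarity(u, v)
```

Pooling gives both sentences a vector of the same size regardless of length, which is what makes the direct comparison possible.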

Transfer Learning and NLU Tasks

Sequence Classification
- Providing class labels to a sequence of words, typically a sentence; can be a conversation, paragraph, or document.
Token Classification
- Providing token-level or phrase-level labels to a sequence of words.
Emotion Identification
- Example: “I am excited about this tutorial” -> (Label?)
- Example: “Data is the new oil” -> (Label?)
- Considerations for multi-label vs. multi-class.
Abbreviation and Long-form Detection
- Example: “ECG reports show reduced pressure”
- Example: “Neural Networks are good at generalization but NN explainability is much needed” (Label for each token?)
- Considerations for token/label ratio; hard with real-world data.
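
To make the per-token labelling concrete, here is one possible tagging of the second example. The tag set (B-LF and I-LF for long forms, B-AC for abbreviations, O for everything else) is an assumed BIO-style scheme, not one prescribed by these notes.

```python
from collections import Counter

tokens = ("Neural Networks are good at generalization but NN "
          "explainability is much needed").split()

# One label per token (hypothetical tag set).
labels = ["B-LF", "I-LF", "O", "O", "O", "O", "O", "B-AC", "O", "O", "O", "O"]
assert len(tokens) == len(labels)

tagged = list(zip(tokens, labels))
label_counts = Counter(labels)
o_share = label_counts["O"] / len(labels)  # fraction of tokens labelled O
```

Even in this short sentence most tokens carry the O label, which is one reading of the token/label ratio concern: the labels of interest are sparse, and real-world text is typically more imbalanced.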

How to Model NLP Tasks

Define the problem clearly
- What specific NLP task are you trying to solve? (e.g., sentiment analysis, question answering, machine translation, code generation, etc.)
Analyze the data for challenges
- Multilingual? Does it involve multiple languages?
- Dialectal? Does it contain variations within a language?
- Domain-specific? Is the language from a particular field (e.g., medical, legal)?
- Conversational? Is it informal, with slang and colloquialisms?
- Chronological? Does the order of the data matter (e.g., time-series data)?
- Core: What is your input/output? What machine learning algorithms are applicable?
Data Size and Quality
- How much data is available? Is it labelled?
- Is it clean and well-formatted? If not, what preprocessing techniques are applicable to your problem?
Select appropriate model architecture (e.g., sequence-to-sequence, classification/regression).
Select appropriate approaches given data (e.g., fine-tuning with labelled data / continual pre-training for unlabelled data).
Identify suitable metrics for both implicit (e.g., training/validation loss, perplexity) and explicit evaluation
- Examples: AUROC, Precision, Recall, F1, Spearman/Pearson’s Correlation, BLEU, ROUGE, BERTScore, TER, etc.
- To be explained in Lecture 9 on Evaluating model performance comprehensively.

NLP Research Community

Research papers on ACL Anthology – 93k+ papers!
- Search for relevant area or topic to learn which methods are applied.
- Search for method to learn which topics it is applied to.
Great way to learn about current topics and ongoing research and a good archive of NLP papers.
ACL Anthology data is itself used as a resource in research.
Repository contains research papers, authors, paper metadata including links to codebase, and presentation video.

Research Papers

Most NLP papers are published as 9-page conference papers (long papers) with 3 reviewers (double blind).
Short papers (5 pages).
Findings papers.
Systems demo track.
Industry track.
Journal papers (up to 12 pages, or unlimited).
Survey Papers [e.g., IJCAI Survey track]
Venues: ACL, EMNLP, NAACL, EACL, AACL, COLING, LREC, AAAI, IJCAI, NeurIPS, ICLR, CVPR, ECCV, Interspeech, IWSLT, and various other conferences and workshops.

NLP Research Tracks

Machine Learning for NLP
Interpretability & Analysis of Models for NLP
Resources & Evaluation
Ethics & NLP
Phonology, Morphology & Word Segmentation
Syntax: Tagging, Chunking & Parsing
Semantics: Lexical
Semantics: Sentence-level Semantics, Textual Inference & Other areas
Linguistic Theories, Cognitive Modeling & Psycholinguistics
Information Extraction
Information Retrieval & Text Mining
Question Answering
Summarization
Machine Translation & Multilinguality
Speech & Multimodality
Discourse & Pragmatics
Sentiment Analysis, Stylistic Analysis, & Argument Mining
Dialogue & Interactive Systems
Language Grounding to Vision, Robotics & Beyond
Computational Social Science & Cultural Analytics
NLP Applications
Special themes are often introduced.

Following NLP Research

arXiv papers
Twitter accounts
- NLP researchers with active accounts (grad students, profs, industry folks)
- Official conference accounts
“NLP Highlights” podcast
“NLP News” newsletter, ‘NLP People’ (https://nlppeople.com/jobs/)
Huggingface, Microsoft Research, Google DeepMind, Meta AI (FAIR), Amazon, AI4Bharat, eBay Research, NVIDIA Research, academic labs (StanfordNLP, MBZUAI, CFILT@IITB, SurreyNLP) and many startups (Cohere, Anthropic)

How to Begin NLP Research

Find a relevant problem and identify recent research
Find good or popular tools:
- Identify papers, ask around, search the web
Trying to identify the best tool for your job:
- Produces appropriate, sufficiently detailed output?
- Accurate? (on the measure you care about)
- Robust? (accurate on your data, not just theirs)
- Fast?
- Easy and flexible to use? Nice file formats, command line options, visualization?
- Trainable for new data and languages? How slow is training?
- Open-source and easy to extend?

Shared Tasks

Shared tasks are competitions organized by the research community that invite open participation. They typically provide:
Training data (optional)
Development data (optional)
Test data, for evaluating the final participating systems
Evaluation metric(s) (how well does the system perform)
Additional Training/Synthetic data (optional)
A prize (optional; with clear rules on what data can be used)
Participation makes it easy to write a system description paper.

Autoregressive Decoders

Goal of NLP research: ensure machines understand human language.
- Analyze Human Language: Textual analytics, extraction, and retrieval to analyze the information present in human language.
- Generate Human Language: Generation of understandable human language to interface with people.

Pre-trained Transformer Flavors

Encoders
- Examples: BERT, RoBERTa, SciBERT.
- Captures bidirectional context
Decoders
- Examples: GPT-2, GPT-3, LaMDA
- Also known as: causal or auto-regressive language model
- Natural if the goal is generation, but cannot condition on future words
Encoder-Decoders
- Examples: BART, T5, Meena
- Conditional generation based on an encoded input

Autoregressive Decoders

Non-auto-regressive model
- Inputs and outputs are different
- Use case: Assigning labels for each word (e.g., part-of-speech tagging)
Causal or auto-regressive model
- Each output is the next input in the sequence
- Use case: Generating tokens (e.g., language modeling)
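
The "each output is the next input" loop can be sketched with a toy stand-in for the model. The lookup table, the `<eos>` marker, and the function names are hypothetical; a real decoder would return a probability distribution over the vocabulary instead of a single word.

```python
def generate(next_token, prompt, max_new_tokens, eos="<eos>"):
    """Feed each predicted token back in as input until eos or the limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == eos:
            break
        tokens.append(token)  # the output becomes part of the next input
    return tokens

# Hypothetical lookup-table "model": picks the next word from the last one.
TABLE = {"let's": "stick", "stick": "to", "to": "improvisation"}

def toy_next_token(tokens):
    return TABLE.get(tokens[-1], "<eos>")

output = generate(toy_next_token, ["let's"], max_new_tokens=10)
```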

The GPT Family

GPT (2018), 117 million parameters.
GPT-2 (2019), 1.5 billion parameters.
GPT-3 (2020), 175 billion parameters. (NeurIPS 2020 best paper)

GPT Anatomy

An autoregressive model that predicts the next token given tokens so far (either predicted or given as part of input).

Masked Self-attention (Decoders)

Masks future tokens by interfering in the self-attention calculation, blocking information from tokens to the right of the position being calculated.
A normal self-attention block allows a position to ‘peek’ at tokens to its right; masked self-attention prevents that from happening.
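
A small sketch of the masking step, assuming raw attention scores are already available as an n x n list (row i holds the scores of position i against every position). Setting future scores to negative infinity before the softmax is one common way to implement the mask.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores with positions j > i masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions get -inf so that they receive zero weight.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform toy scores for a 3-token sequence.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
```

With uniform scores, the first position attends only to itself, the second splits its attention over the first two positions, and only the last position sees the whole sequence.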

Decoder-only Block

As it processes each subword, it masks the “future” words and conditions on (i.e., attends to) the previous words.

Pre-training Decoder Stack Language Models

Generate text by predicting the next token given a prompt.
Given a prompt (input) $x = (x_1, x_2, \ldots, x_n)$, where $x_i$ is the token at the $i$-th index,
a causal language model (autoregressive decoder) estimates the probability of the next token $x_{n+1}$: $P(x_{n+1} \mid x_1, \ldots, x_n; \theta)$
$\theta$ is the parameter of the language model.
Models are trained using causal language modelling loss:
- $L = - \sum_{i=1}^{T} \log P(x_i \mid x_1, \ldots, x_{i-1}; \theta)$
- $T$ is the total number of tokens in a sequence.
During inference (generation), given a prompt x, the model generates new tokens one at a time by sampling.
Each new token is drawn from the model's probability distribution over the vocabulary, given the preceding tokens.
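
As a worked instance of the causal language modelling loss, the sketch below sums the negative log-probabilities that a model assigned to the correct token at each position. The three probabilities are made up for illustration. Perplexity is computed as the exponential of the average per-token loss.

```python
import math

def causal_lm_loss(token_probs):
    """Negative log-likelihood: L = -sum_i log P(x_i | x_1, ..., x_{i-1})."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities assigned to the true token at each of T = 3 positions.
probs = [0.5, 0.25, 0.125]

loss = causal_lm_loss(probs)               # log 2 + log 4 + log 8 = log 64
perplexity = math.exp(loss / len(probs))   # exp of the mean per-token loss
```

Here the perplexity comes out at roughly 4, meaning the model is on average about as uncertain as if it were choosing uniformly among four tokens.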

How Models Decode

Using various sampling techniques!
Top-k sampling
- Probabilities for all possible next tokens are generated and sorted in descending order.
- Only the top-k tokens are considered for the next step.
- A token is then randomly sampled from this reduced set in proportion to the renormalised probabilities.
- Lower k leads to more focused and deterministic output, as the model has fewer options to choose from.
- Higher k introduces more randomness and creativity, as a wider range of potential tokens is considered.
- In essence, “choosing the next word from the top-k suggestions”.
Top-p (nucleus) sampling
- Selects the smallest set of most probable tokens whose cumulative probability reaches or exceeds a threshold p.
- The cumulative probability is calculated as you go down the list.
- Once the cumulative probability reaches or exceeds p, the remaining lower-probability tokens are discarded.
- The next token is then randomly sampled from this nucleus of tokens
- A lower p results in more focused and coherent output, as only the most likely tokens (that together cross the probability threshold) remain.
- A higher p allows for more diverse and surprising output, as a larger set of tokens (higher cumulative probability) becomes eligible for sampling.
- In essence, "choosing the next word from a set of suggestions that together make up a certain level of confidence (prob.)".
Combine top-k and top-p sampling:
- Apply top-k to narrow down the candidate set, then apply top-p to this reduced set.
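
The sampling strategies above can be sketched over a toy next-token distribution. The vocabulary and probabilities are invented for illustration, and the function names are not from any particular library.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative probability >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

def sample(probs, rng):
    """Draw one token in proportion to its (renormalised) probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Toy distribution for the word following "hang".
next_token_probs = {"out": 0.5, "up": 0.25, "on": 0.125,
                    "in": 0.0625, "camera": 0.0625}

top2 = top_k_filter(next_token_probs, k=2)
nucleus = top_p_filter(next_token_probs, p=0.8)
combined = top_p_filter(top_k_filter(next_token_probs, k=4), p=0.9)
token = sample(nucleus, random.Random(0))
```

Note that top-k always keeps a fixed number of candidates, while the size of the nucleus depends on how peaked the distribution is.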

The First GPT (or GPT-1)

Transformer decoder with 12 layers.
Trained on BooksCorpus: over 7000 unique books (4.6GB text).
Showed that language modeling at scale can be an effective pretraining technique for downstream tasks like natural language inference.
Example:
- [START] The man is in the doorway [SEP] The person is near the door [EXTRACT]

GPT-2

Layer norm moved to the input of each sub-block.
Vocabulary extended to 50,257 tokens and context size increased from 512 to 1024.
Trained on 8 million documents from the web (WebText, collected from outbound Reddit links rather than Common Crawl), minus Wikipedia
Lab Assignment: Train your own GPT-2 from scratch on the “Tiny-Shakespeare” dataset, utilizing some pre-trained GPT-2 weights.

GPT-2 Variants

GPT-2 Small
- Model Dimensionality: 768
GPT-2 Medium
- Model Dimensionality: 1024
GPT-2 Large
- Model Dimensionality: 1280
GPT-2 Extra Large
- Model Dimensionality: 1600

Emergent Abilities of Large Language Models: GPT-2 (2019)

GPT-2 (1.5B parameters)
Same architecture as GPT, just bigger (117M -> 1.5B)
Trained on much more data: 4GB -> 40GB of internet text data (WebText)
Scraped from links posted on Reddit with at least 3 upvotes (a rough proxy for human-judged quality)

Zero-shot Capabilities

One key emergent ability in GPT-2 is zero-shot learning: the ability to do many tasks with no examples and no gradient updates
Specifying the right sequence prediction problem (e.g. question answering):
- Passage: Tom Brady… Q: Where was Tom Brady born? A: …
Comparing probabilities of sequences (e.g., Winograd Schema Challenge [Levesque, 2011]):
- The cat couldn’t fit into the hat because it was too big. Does it = the cat or the hat?
- ≡ Is P(…because the cat was too big) >= P(…because the hat was too big)?

GPT-3

More layers & parameters, bigger dataset, longer training, larger embedding/hidden dimension, larger context window
GPT-3 (175B parameters)
Another increase in size (1.5B -> 175B) and data (40GB -> over 600GB)

Model Size

BERT-Base:
- 12 transformer blocks, 12 attention heads, 110M parameters
BERT-Large:
- 24 transformer blocks, 16 attention heads, 340M parameters
GPT-2:
- Trained on 40GB of text data (8M webpages), 1.5B parameters
GPT-3:
- 175B parameters
GPT-4:
- Reportedly about 1.8 trillion parameters [an estimate, unconfirmed by OpenAI].
GPT-4o Mini:
- Could be as small as 8 billion parameters.

GPT-4o Onwards

OpenAI's ‘o’-series of reasoning models (e.g., o1) can emulate a thinking process.
Enabled by reinforcement learning (RL) and adding more synthetic data to the RL pipeline (after pre-training; called post-training phase)
Models can simulate an inner monologue, where they articulate their reasoning process step-by-step; helps in verifying the accuracy of their conclusion

Learning – Embedded within Pre-training Objective

University of Surrey is located in ____, United Kingdom. [trivia]
I put ____ fork down on the table. [syntax]
The woman walked across the street, checking for traffic over ____ shoulder. [coreference]
I went to the ocean to see the fish, turtles, seals, and ____. [lexical semantics/topic]
Overall, the value I got from the two hours watching it was the sum total of the popcorn and the drink. The movie was ____. [sentiment]
Iroh went into the kitchen to make some tea. Standing next to Iroh, Zuko pondered his destiny. Zuko left the ____. [some reasoning – this is harder]
I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, ____ [some basic arithmetic; they may not learn the Fibonacci sequence]

Prompting a Language Model

Pretraining Knowledge
- Language models learn various linguistic patterns during pretraining, such as syntax, coreference, and semantics.
Zero/Few-Shot Capabilities
- From GPT-2 onwards, LLMs can perform tasks with no examples or only a few, by predicting sequences or comparing probabilities.
Prompt Engineering
- Crafting prompts that reformulate tasks to resemble those solved during pretraining.

Emergent Few-Shot Learning

Specify a task by simply prepending examples of the task before your example
Also called in-context learning, to stress that no gradient updates are performed when learning a new task
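
A minimal sketch of how such a prompt might be assembled. The "Input:/Output:" template, the helper name, and the translation examples are illustrative assumptions. Any consistent format that mirrors the demonstrations would serve.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Prepend (input, output) demonstrations before the new input."""
    parts = [instruction] if instruction else []
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The final block is left open for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

instruction = "Translate English to French."
examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_few_shot_prompt([], "peppermint", instruction)
few_shot = build_few_shot_prompt(examples, "peppermint", instruction)
```

The model's weights are untouched in both cases; the only difference between the zero-shot and few-shot settings is the text placed before the query.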

Few-Shot Learning

Observed improvement in performance as an increasing number of examples are provided.

New Methods of Prompting LMs

Zero/few-shot prompting
Traditional fine-tuning
Lab Assignment: Get familiar with zero-shot, one-shot and few-shot scenarios by prompt engineering for Dialogue Summarization.

Limits of Prompting for Harder Tasks

Some tasks seem too hard for even large LMs to learn through prompting alone, especially tasks involving richer, multi-step reasoning.
Solution: change the prompt!

Chain-of-Thought Prompting

Encourages the model to reason step-by-step, improving performance on complex tasks.

Zero-Shot Chain-of-Thought Prompting

Adding "Let’s think step by step." to the prompt significantly improves zero-shot performance.
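
As a sketch, the trigger phrase can simply be appended where the answer would begin. The "Q:/A:" layout and the sample question are assumptions made for illustration.

```python
COT_TRIGGER = "Let's think step by step."

def zero_shot_cot_prompt(question):
    """Build a prompt that nudges the model to reason before answering."""
    return f"Q: {question}\nA: {COT_TRIGGER}"

prompt = zero_shot_cot_prompt(
    "A juggler has 16 balls and half of them are golf balls. "
    "How many golf balls are there?"
)
```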

Summary

Several variants of BERT are available (e.g., RoBERTa, XLM-R, SentenceBERT).
NLP tasks can be grouped together, depending on their role in the pipeline.
Autoregressive Decoders are Transformer variants with stacked decoders.
Impressive task-level generation performance seen with emergent properties, in both zero-/few-shot scenarios, including CoT.
Modelling for an NLP task may very well be a matter of creating/engineering a detailed prompt, but ensuring accurate outputs from LLMs still requires supervised training data.