Revision

Module Overview

  • NLP advancements: Transformers, Foundation Models / Large Language Models, Multimodality in AI.
  • Skills to build NLP models for:
    • Classification/Regression :: Language Understanding.
    • Machine translation :: Language Generation.
    • LLM-based agents :: Building Frameworks.
  • Building NLP pipelines:
    • Preparing training data.
    • Choosing algorithms and techniques for real-world problems.
  • Emphasis on deep learning and transfer learning for model training/tuning.

Learning Outcomes

  • Describe the NLP process lifecycle and theoretical fundamentals (KC).
  • Demonstrate ability to build processes for solving specific NLP problems (CP).
  • Build appropriate NLP transformation pipelines for training computational models (KCPT).
  • Describe and deploy experiments, comparing techniques and selecting algorithms (KCPT).
  • Build experiment scripts using a programming language, and produce NLP models (CPT).
  • Deploy NLP models as Web service inference endpoints and test services (KCPT).
  • C-Cognitive/Analytical; K-Subject Knowledge; T-Transferable Skills; P-Professional/Practical skills.

Module Content

  1. Introduction to NLP
  2. Traditional/Linguistic approaches to NLP
  3. From Feature Vectors to Language Embeddings
  4. Topic Modelling & Artificial Neural Networks for NLP [RNN, LSTM/GRU, CNN]
  5. Intro to Transformers and NLP Tasks [How to model for them? Supervised/Unsupervised]
  6. Transformer-based Language Models [w/ Multilinguality, Low-resource Languages, Creoles]
  7. Autoregressive decoders [Large Language Models]
  8. NLP Pipeline [LM Fine-tuning / Instruction fine-tuning / Zero-shot / Few-shot]
  9. Evaluation in NLP [Task-based Evaluation, LLM Evaluation for Toxicity/Bias/Truthfulness, …]
  10. MLOps & Deployment [Hardware requirements in training scalable models, MiniBatching, GPUs]
  11. Advanced approaches in NLP [Retrieval-augmented Generation, …]
  12. Revision Lecture (you are here)

Assessment Strategy

  • Group formation (SurreyLearn): week 6
    • Self-enrolling groups in an editable shared Excel sheet.
    • Students are responsible for forming groups.
    • No group changes after the deadline.
    • Thorough discussion among peers is necessary.
  • Individual examination (50% marks)
    • Multiple choice, multi-select, fill in the blanks, true/false, short answers.
    • Refer to module slides, classroom discussions, reading list and supplementary material on SurreyLearn.
  • Group coursework (50% marks)
    • Build a proof-of-concept NLP pipeline.
    • Demonstrate technical understanding and good practice (voice over screen capture recording).
    • Submit code, final report, and demo video; deadline in week 12.

Assessment - Formative feedback

  • Weekly lab exercises provide feedback on understanding and practice.
  • Labs and questions support coursework progress.
  • Do not use language models to plagiarise the NLP coursework.
  • Academic misconduct will penalise the ENTIRE component mark!
  • Start coursework early!

Starting with Words

  • Words are the data.
  • Average language has ~200K unique words.
  • Zipf’s law: freq × rank ≈ constant.
  • Word frequency is inversely proportional to its rank.
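
As a quick illustration (toy corpus, made-up text), the freq × rank product can be inspected directly by counting and ranking words:

```python
from collections import Counter

# Toy corpus, for illustration only.
text = "the cat sat on the mat the cat ran the dog ran the"
counts = Counter(text.split())

# Rank words by frequency (rank 1 = most frequent) and inspect freq * rank.
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"rank={rank} word={word!r} freq={freq} freq*rank={freq * rank}")
```

On a real corpus the freq × rank products stay roughly constant across the vocabulary; a corpus this small only hints at the shape.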

NLP Leverages Linguistics

  • Linguistics: Study of language including morphology, meaning, context, social/cultural, and historical/political influences.
  • Understanding language structure is important for building NLP systems.
    • Words.
    • Morphology.
    • Parts of speech.
    • Syntax parsing.
    • Semantics.
    • Textual entailment.

Syntax Parsing

  • Example parse tree:
    S
    ├── NP
    │   ├── Det
    │   │   └── The
    │   ├── Adj
    │   │   └── Quick
    │   ├── Adj
    │   │   └── Brown
    │   └── N
    │       └── fox
    └── [1]_
        ├── V
        │   └── Jumps
        └── [2]_
            ├── [3]_
            │   └── Over
            └── [4]_
                ├── Det
                │   └── The
                ├── Adj
                │   └── Lazy
                └── N
                    └── dog

From Stems to Lemmas

  • Lemma: Canonical form of a set of words.
    • Example: {processed, processes, processing} -> lemma: process.
  • Lemmatization: NLP pipeline task - breaks text into lemmas.
  • Stemming: Breaks text into stems.
  • Irregular forms: {is, are, was, were} -> lemma: be; stems: is, are, wa(s), wer(e).
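
The contrast between stemming (crude suffix stripping) and lemmatization (mapping to a canonical form, including irregular forms) can be sketched with a toy implementation. This is illustrative only, not the Porter algorithm or a WordNet lemmatizer:

```python
def toy_stem(word):
    # Toy suffix stripping (illustrative; a real stemmer like Porter
    # applies a much richer set of ordered rewrite rules).
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny lookup table for irregular forms; real lemmatizers use a
# dictionary plus POS information.
LEMMAS = {"is": "be", "are": "be", "was": "be", "were": "be"}

def toy_lemma(word):
    return LEMMAS.get(word, toy_stem(word))

print([toy_stem(w) for w in ["processed", "processes", "processing"]])
print([toy_lemma(w) for w in ["is", "are", "was", "were"]])
```

All three "process" variants reduce to the same stem, while the irregular forms of "be" need the lookup table, mirroring the lemma-vs-stem distinction above.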

Morphology

  • Changing "car" to "cars" is an example of inflectional morphology.
  • True/False: Regular expressions are not a subset of natural languages. (Answer: False)

Subword Tokenization - WordPiece

  • Greedily creates a vocabulary of subword units based on frequency.
  • Starts with characters.
  • Merges frequent subword pairs to maximize training data likelihood.
  • Continues until vocabulary size or likelihood threshold is reached.
  • BERT, DistilBERT: "unlikely" -> ["un", "##like", "##ly"].
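
At inference time, a trained WordPiece vocabulary is applied with greedy longest-match-first segmentation (as in BERT-style tokenizers). A minimal sketch, assuming a tiny hand-picked vocabulary rather than one learned from data:

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first segmentation against a fixed vocabulary.
    # Non-initial pieces carry the "##" continuation prefix.
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no piece matches: emit unknown token
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "like", "##like", "##ly"}   # toy vocabulary (assumption)
print(wordpiece_tokenize("unlikely", vocab))  # ['un', '##like', '##ly']
```

This reproduces the "unlikely" example above; learning the vocabulary itself is the separate merge-based training procedure described in the bullets.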

Tokenization

  • Simple character-level tokenization can lose semantic information in words/phrases.

Representation of Words

  • Encapsulate meaning of words.
  • Feature extraction: One-hot vectors.
  • Bag of words approach.
  • Words as discrete symbols.
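
One-hot vectors and bag-of-words counts can be built in a few lines; the vocabulary here is a toy example:

```python
vocab = ["cat", "dog", "sat", "mat"]          # toy vocabulary (assumption)
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Word as a discrete symbol: a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(tokens):
    # Order-free count vector over the vocabulary.
    vec = [0] * len(vocab)
    for t in tokens:
        if t in index:
            vec[index[t]] += 1
    return vec

print(one_hot("dog"))                       # [0, 1, 0, 0]
print(bag_of_words(["cat", "sat", "cat"]))  # [2, 0, 1, 0]
```

Note that both representations treat words as unrelated symbols: "cat" and "dog" are equally distant from each other as from "mat", which is what embeddings later fix.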

Properties of Semantic Queries

  • Representing words as vectors allows mathematical calculations.
    • Example: marie_curie_vector = we['woman'] + we['Europe'] + we['physics'] + … we['scientist'] - we['male']
  • Analogy questions:
    • If UK → London, then France → ???
    • we[?] = we['france'] + (we['london'] - we['uk'])
  • Guessing words based on surrounding words:
    • "Better ___ than sorry" → safe.
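
The analogy arithmetic can be demonstrated with toy 2-D vectors (made-up numbers, not real embeddings) and a nearest-neighbour lookup:

```python
# Toy 2-D "embeddings" chosen by hand so the analogy works exactly.
we = {
    "uk":     [1.0, 0.0],
    "london": [1.0, 1.0],
    "france": [0.0, 0.0],
    "paris":  [0.0, 1.0],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def nearest(vec, exclude):
    # Nearest stored word by squared Euclidean distance, skipping the
    # words already used in the analogy.
    return min((w for w in we if w not in exclude),
               key=lambda w: sum((x - y) ** 2 for x, y in zip(we[w], vec)))

# we[?] = we['france'] + (we['london'] - we['uk'])
query = add(we["france"], sub(we["london"], we["uk"]))
print(nearest(query, exclude={"france", "london", "uk"}))  # paris
```

With real word2vec/GloVe vectors the same query lands near "paris" rather than exactly on it, so nearest-neighbour search is the standard final step.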

word2vec skip-gram model

  • Two-layer network.
  • Hidden layer: n neurons (vector dimensions).
  • Input/output layers: M neurons (vocabulary size).
  • Output layer activation: softmax (for classification).
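
The training data for the skip-gram model above is just (target, context) pairs drawn from a sliding window; a minimal pair-generation sketch:

```python
def skipgram_pairs(tokens, window=2):
    # For each target token, pair it with every token within `window`
    # positions on either side; these pairs train the skip-gram model.
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["better", "safe", "than", "sorry"], window=1)
print(pairs)
```

Each pair becomes one training example: one-hot target in, softmax over the vocabulary out, with the hidden layer's weights serving as the word vectors after training.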

Topic Modelling in Practice

  • Calculate K topics in a large corpus of documents.
  • Topics as features of each document.
  • Topic distribution gives the "DNA" of documents.
  • Similarity: apply a distance metric to compare each document pair's topic distributions.
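
Comparing two documents via their topic "DNA" is just a distance metric over distributions; here cosine similarity over hypothetical 3-topic distributions (made-up numbers):

```python
import math

def cosine_sim(p, q):
    # Cosine similarity between two topic-distribution vectors.
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

doc_a = [0.7, 0.2, 0.1]   # toy topic distributions (K = 3)
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

print(cosine_sim(doc_a, doc_b))   # high: similar topic mix
print(cosine_sim(doc_a, doc_c))   # low: dominated by a different topic
```

Other distances (Jensen-Shannon divergence, Hellinger) are also common for comparing topic distributions, since these vectors are probability distributions.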

Document Embeddings

  • Converting documents to numerical data.
  • Pre-trained language models: BERT, RoBERTa, ALBERT, S-BERT.
  • Multilingual models: XLM-R, mBERT, mBART, IndicBERT.
  • Ensure data language is part of the multilingual language model.

Improving RNNs

  • Techniques:
    • Gradient clipping.
    • Bidirectional RNNs & Multi-Layer RNNs (or Stacked RNNs).
    • Long Short-Term Memory (LSTM) RNNs.
      • Additional "memory" for long term information, controlled by gates.
      • LSTMs became the default for most NLP tasks.
    • Gated Recurrent Unit (GRU).
      • Similar to LSTM, but simpler (two gates).
      • More efficient, but possibly less accurate.

Attention

  • Key-value store containing input's information, looked up with a query (current context).
  • List of vectors (one for each token) associated with keys.
  • Increases "memory" decoder can refer to.
  • Mapping shared with the decoder at each time step.
  • Two common types of attention:
    • Encoder-decoder attention (cross-attention; RNN Seq2Seq and Transformer).
    • Self-attention (Transformer).

Self-Attention

  • Create Query (Q), Key (K), and Value (V) vectors from each encoder input vector (embedding).
  • Encoded representation of input as key-value pairs, (K, V), of dimension d_k.
  1. Create Q vector qi, K vector ki, and a V vector vi by multiplying the word embedding by three matrices learned during training.
  2. Calculate the self-attention score for each word.
  3. Divide scores by √d_k to stabilise gradients.
  4. Pass the result through a SoftMax operation.
  5. Multiply each value vector by the SoftMax score.
  6. Sum up the weighted value vectors.

Self-Attention Numerical Example

  • Given: Q = [1, 0], K1 = [1, 0], K2 = [0, 1], V1 = [2, 2], V2 = [1, 2], d_k = 2; use √2 ≈ 1.4 and exp(0.7) ≈ 2.
  • Step 1: Dot products of Q with each key
    • Q · K1 = (1 × 1) + (0 × 0) = 1
    • Q · K2 = (1 × 0) + (0 × 1) = 0
  • Step 2: Scale dot products by √d_k = √2 ≈ 1.4
    • Scaled scores = [1/1.4, 0/1.4] ≈ [0.7, 0]
  • Step 3: Apply softmax
    • exp(0.7) ≈ 2, exp(0) = 1
    • Softmax denominator ≈ 3; attention weights ≈ [2/3, 1/3]
  • Step 4: Attention output for Q:
    • (2/3 × V1) + (1/3 × V2) = [4/3, 4/3] + [1/3, 2/3] = [5/3, 2]
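
The same computation can be run exactly (no rounding) in a few lines of Python; with the exact √2 the weights come out ≈ [0.67, 0.33] and the output ≈ [1.67, 2.0], close to the rounded [5/3, 2]:

```python
import math

def attention(q, keys, values, d_k):
    # Scaled dot-product attention for a single query vector.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

out = attention([1, 0], [[1, 0], [0, 1]], [[2, 2], [1, 2]], d_k=2)
print(out)
```

The second output component is exactly 2.0 because both value vectors agree in that dimension, so the attention weights cancel out.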

Positional Encoding (PE)

  • Attention layers see the input as a set of vectors with no sequential order.

  • PE accounts for word order in the input sequence (Transformer model).

  • PE vector is added to the embedding vector.

  • Embeddings represent token in d-dimensional space (similar meanings closer).

  • PE encodes absolute position; it does not directly encode the relative position of tokens.

  • After adding PE, tokens are closer based on meaning and position in d-dimensional space.

  • Formula:

    • PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

    • PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

  • Example: pos=0, d_model=4:

    • PE(pos=0) = [sin(0/10000^(0/4)), cos(0/10000^(0/4)), sin(0/10000^(2/4)), cos(0/10000^(2/4))] = [0,1,0,1].
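
The sinusoidal formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) can be coded directly, and reproduces the pos = 0 example:

```python
import math

def positional_encoding(pos, d_model):
    # Sinusoidal positional encoding: interleaved sin/cos pairs, one pair
    # per frequency i = 0 .. d_model/2 - 1 (d_model assumed even).
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

At pos = 0 every angle is 0, so all sin terms are 0 and all cos terms are 1, giving [0, 1, 0, 1] for d_model = 4.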

Transformers Architecture

  • [Figure] Transformer architecture: encoder and decoder stacks, multi-head attention, and positional encoding.

Transfer Learning

  • Improve model performance using data/models trained on a different task.
  • Two steps:
    • Pretraining: ML model trained for one task.
    • Adaptation: Model adjusted and used in another task.
    • Fine-tuning: adapting the pretrained model by continuing training on the target task's data.

Scaling Laws

  • Larger models are more sample-efficient: they reach a given loss after seeing less training data.
  • BERT was trained with 3 epochs; GPT-3 with less than one epoch.

ELMo

  • Task-specific combination of intermediate layer representations in the biLM.
  • For each token t_k, an L-layer biLM computes a set of 2L + 1 representations.
  • h_{k,0}^{LM} is the token layer, and h_{k,j}^{LM} = [→h_{k,j}^{LM}; ←h_{k,j}^{LM}] (forward and backward biLSTM states concatenated) for each biLSTM layer j = 1, …, L.
  • ELMo collapses all layers in R into a single vector for downstream models.
  • Simplest case -> ELMo selects the top layer as in TagLM and CoVe.

BERT: pre-training process

  • BERT’s MLM (Masked Language Model) and NSP (Next Sentence Prediction) training objectives combined
  • Predict if the second sentence is connected to the first.
  • Input sequence through the Transformer model.
  • Output of the [CLS] token is transformed into a 2×1 vector by a simple classification layer (learned weight matrix and bias).
  • Calculating the probability of IsNextSequence with softmax.
  • Masked LM and Next Sentence Prediction are trained together, minimizing combined loss function.

XLM-R

  • Follows the same approach as XLM.
  • Scaled-up version of XLM-100.
  • Sample streams of text from each language and train the model to predict masked tokens.
  • SentencePiece with a unigram language model applies subword tokenization on the raw text.
  • Sample batches from different languages using the same sampling distribution with α = 0.3.
  • Extend MLM to Translation Language Modelling (TLM)
    • TLM extends MLM to pairs of parallel sentences.
    • Predict masked English word, model attends to both English sentence and its French translation

SentenceBERT

  • Sentences passed through pooling layers result in two 768-dimensional vectors u and v.
  • Three approaches for optimizing different objectives using u and v.
  • NLP Task examples:
    • Natural Language Inference
    • Given sentences A and B, predict whether B (hypothesis) is true (entailment), false (contradiction), or undetermined (neutral) given A (premise).
    • Semantic Textual Similarity
    • Measure semantic similarity, normalized on scale of 0 – 1.

Token Classification

  • Transformer + Conditional Random Field (CRF).
  • Fine-tune an encoder, optionally with a CRF layer to capture non-local label dependencies.
  • Use the WikiNeural dataset for the multilingual Named Entity Recognition (NER) task.

How to model NLP Tasks?

  • Define the problem clearly.
  • Analyze the data.
    • Multilingual?
    • Dialectal?
    • Domain-specific?
    • Conversational?
    • Chronological?
  • Core: Input/Output?
  • Data Size and Quality.
  • Select appropriate model architecture (e.g., sequence-to-sequence, classification/regression).
  • Select appropriate approaches given data (e.g., fine-tuning with labelled data / continual pre-training for unlabelled data).
  • Identify suitable metrics (implicit & explicit evaluation).

Example

  • Categorize customer feedback (online retail): positive, negative, or neutral.
  • Written statement and 1-5 star reviews.
  • System extracts key information (materials, methods, results) from scientific papers.
  • Papers may have a structured format: Abstract, Introduction, Methods, Results, and Conclusion.

Key Considerations for Extraction Systems

  1. Discuss data sources (e.g. web scraping).
  2. Manual annotation of a subset of papers may be necessary.
  3. Consider using domain-specific models.
  4. Discuss approaches to extract the information.
  5. Discuss evaluation for the task.

Modeling Tasks: Multiple Choice

  • Abstractive text summarisation with Transformers:
    • Advantage: Ability to generate new sentences that capture the meaning of the original text
  • Advantages of Transformers over RNNs for sequence-to-sequence tasks:
    • Better at capturing long-range dependencies
    • More suitable for parallelisation

Types of Attention

  • Encoder self-attention
  • Masked decoder self-attention
  • Encoder-decoder attention (cross-attention)

Pre-training Decoder

  • Language models generate text by predicting the next token given a prompt.

  • Given a prompt x = (x_1, x_2, …, x_n), a causal language model estimates the probability of the next token x_{n+1}; θ denotes the parameters of the language model.

  • Models are trained using the causal language modelling loss, the negative log-likelihood of each token given its prefix:

    • L(θ) = −Σ_t log p_θ(x_t | x_1, …, x_{t−1})

  • During inference (generation), the model generates new tokens by sampling from the probability distribution of the vocabulary given preceding tokens.
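
The causal LM training loss (mean negative log-likelihood of each target token given its prefix) can be sketched with plain Python; the logits here are dummy values, not from a real model:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def causal_lm_loss(logits_per_step, targets):
    # Mean negative log-likelihood: -(1/n) * sum_t log p(x_t | x_<t).
    # logits_per_step[t] are the model's vocabulary logits after prefix t.
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= math.log(softmax(logits)[target])
    return total / len(targets)

# Uniform logits over a 4-token vocabulary -> loss of ln(4) per token.
loss = causal_lm_loss([[0.0] * 4, [0.0] * 4], targets=[1, 3])
print(round(loss, 4))
```

A model that assigns probability 1 to every target would reach loss 0; uniform guessing over a vocabulary of size V gives ln(V), which is the sanity check used here.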

Emergent Abilities of Large Language Models: GPT-2 (2019)

  • GPT-2 (1.5B parameters).
  • Same architecture as GPT, just bigger.
  • Trained on much more data: 40GB of internet text data (WebText).
  • Scraped from links posted on Reddit with at least 3 karma.

Zero-shot Capabilities

  • One key emergent ability in GPT-2 is zero-shot.
  • Ability to do many tasks without any examples, and no gradient updates, by simply:
    • Specifying the right sequence prediction problem (e.g. question answering)
    • Passage: Tom Brady… Q: Where was Tom Brady born? A: …

Zero-shot Chain-of-thought prompting

  • Example:
    • “Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there? A: Let’s think step by step.”
    • There are 16 balls in total. Half of the balls are golf balls. That means there are 8 golf balls. Half of the golf balls are blue. That means there are 4 blue golf balls.

Prompt engineering strategies

  • Clear task description.
  • Describe the problem precisely, specifically, and clearly, and instruct the LLM to perform as expected.

Working with LLMs

  • Zero-shot / Few-shot Scenarios (learning and evaluation).
    • No fine-tuning needed; prompt engineering and chain-of-thought can improve performance.
    • Limited by context length (input token length).
    • Complex tasks will probably need gradient steps.
  • Instruction Fine-tuning.
    • Simple and straightforward, generalize on unseen tasks.
    • Collecting demonstrations for so many tasks is expensive.
    • Mismatch between LM objective and human preferences.
  • Reinforcement Learning with Human Feedback (RLHF).
    • Reward model with human feedback.
    • Use these to further improve the model

LLMs

  • Valid approaches for continual pre-training of LLMs:
    • Training on domain-specific text corpora to adapt the model to a specific field
    • Updating the model with recent data to incorporate new knowledge

Statistical Metrics for LLM Evaluation

  • Assembling a confusion matrix for each subgroup lets researchers evaluate biases across all subgroups in a dataset.
  • Each predicted token (word) is compared with the ground truth.
  • Mostly Accuracy and F1 score are used, but any classification-based metric could be applied.

  Metric     | Equation
  -----------|-------------------------------------------------
  Accuracy   | (TP + TN) / (TP + TN + FP + FN)
  Precision  | TP / (TP + FP)
  Recall     | TP / (TP + FN)
  F1 Score   | 2 × (precision × recall) / (precision + recall)
  AUPRC      | P / (P + N) (baseline of a random classifier)
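
The classification metrics in the table follow directly from the confusion-matrix counts; a small helper makes the formulas concrete (counts below are made-up):

```python
def classification_metrics(tp, tn, fp, fn):
    # Standard classification metrics from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = classification_metrics(tp=8, tn=5, fp=2, fn=1)
print(metrics)
```

For multi-class evaluation the same formulas are applied per class and then micro- or macro-averaged, which is how F1 is usually reported for NLP tasks.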

Deploying and Serving a Model

  • Understand application requirements for deployment:
    • Dimension hardware and software requirements.
    • Deal with speed, costs and accuracy requirements.
  • Constraints related to privacy, throughput, latency.
  • Heavily compressing large models while keeping accuracy at similar levels can be a challenge.
  • Techniques such as pruning, quantization, and knowledge distillation can be used to compress large models after they are trained.
  • Scalability can be achieved by implementing serverless architectures for inference and training, and by utilising caching where appropriate.
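
Of the compression techniques named above, quantization is the easiest to sketch. This is a toy symmetric int8 scheme on a hand-picked weight list; real toolkits additionally calibrate activations, use per-channel scales, and handle zero-points:

```python
def quantize_int8(weights):
    # Toy symmetric post-training quantization: map the largest-magnitude
    # weight to +/-127 and round everything else to the nearest integer.
    scale = max(abs(x) for x in weights) / 127 or 1.0
    return [round(x / scale) for x in weights], scale

def dequantize(quantized, scale):
    # Recover approximate float weights from the int8 values.
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.03]          # made-up weights for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Storing int8 values instead of float32 cuts memory roughly 4×, at the cost of small rounding errors in the restored weights; that accuracy/size trade-off is exactly the challenge noted above.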

MLOps Tools

  • mlflow: manage the ML lifecycle.
  • seldon: deploy machine learning models at scale, on platforms such as Kubernetes.
  • Jenkins: enables Continuous Integration and Continuous Delivery (CI/CD).
  • Kubeflow: open-source toolkit for deploying ML workflows on Kubernetes.
  • Airflow: monitor, schedule, and manage ML workflows.

RAG

  • Relevant in implementing effective RAG systems:
    • Vector databases for storing document embeddings
    • Chunking strategies for breaking down source documents
    • Retrieval depth vs. computational cost trade-offs
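
The chunk-then-retrieve core of a RAG system can be sketched end to end. Here naive fixed-size chunking and bag-of-words cosine similarity stand in for the real components (a smarter chunker, a neural embedding model, and a vector database):

```python
from collections import Counter
import math

def chunk(text, size=6):
    # Naive fixed-size word chunking; real systems chunk on sentences,
    # sections, or token budgets with overlap.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # Bag-of-words stand-in for a neural embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc = ("Transformers use self attention . Positional encodings add order "
       "information . Retrieval augmented generation grounds answers in documents .")
chunks = chunk(doc)

query = embed("what is retrieval augmented generation")
best = max(chunks, key=lambda c: cosine(embed(c), query))
print(best)   # the retrieved chunk would be prepended to the LLM prompt
```

The retrieved chunk is what gets stuffed into the LLM's prompt; chunk size and the number of chunks retrieved are exactly the retrieval-depth vs. cost trade-offs listed above.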