Revision

Module Overview

  • NLP advancements: Transformers, Foundation Models / Large Language Models, Multimodality in AI.
  • Skills to build NLP models for:
    • Classification/Regression :: Language Understanding.
    • Machine translation :: Language Generation.
    • LLM-based agents :: Building Frameworks.
  • Building NLP pipelines:
    • Preparing training data.
    • Choosing algorithms and techniques for real-world problems.
  • Emphasis on deep learning and transfer learning for model training/tuning.

Learning Outcomes

  • Describe the NLP process lifecycle and theoretical fundamentals (KC).
  • Demonstrate ability to build processes for solving specific NLP problems (CP).
  • Build appropriate NLP transformation pipelines for training computational models (KCPT).
  • Describe and deploy experiments, comparing techniques and selecting algorithms (KCPT).
  • Build experiment scripts using a programming language, and produce NLP models (CPT).
  • Deploy NLP models as Web service inference endpoints and test services (KCPT).
  • C-Cognitive/Analytical; K-Subject Knowledge; T-Transferable Skills; P-Professional/Practical skills.

Module Content

  1. Introduction to NLP
  2. Traditional/Linguistic approaches to NLP
  3. From Feature Vectors to Language Embeddings
  4. Topic Modelling & Artificial Neural Networks for NLP [RNN, LSTM/GRU, CNN]
  5. Intro to Transformers and NLP Tasks [How to model for them? Supervised/Unsupervised]
  6. Transformer-based Language Models [w/ Multilinguality, Low-resource Languages, Creoles]
  7. Autoregressive decoders [Large Language Models]
  8. NLP Pipeline [LM Fine-tuning / Instruction fine-tuning / Zero-shot / Few-shot]
  9. Evaluation in NLP [Task-based Evaluation, LLM Evaluation for Toxicity/Bias/Truthfulness, …]
  10. MLOps & Deployment [Hardware requirements in training scalable models, MiniBatching, GPUs]
  11. Advanced approaches in NLP [Retrieval-augmented Generation, …]
  12. Revision Lecture (you are here)

Assessment Strategy

  • Group formation (SurreyLearn): week 6
    • Self-enrolling groups in an editable shared Excel sheet.
    • Students are responsible for forming groups.
    • No group changes after the deadline.
    • Thorough discussion among peers is necessary.
  • Individual examination (50% marks)
    • Multiple choice, multi-select, fill in the blanks, true/false, short answers.
    • Refer to module slides, classroom discussions, reading list and supplementary material on SurreyLearn.
  • Group coursework (50% marks)
    • Build a proof-of-concept NLP pipeline.
    • Demonstrate technical understanding and good practice (voice over screen capture recording).
    • Submit code, final report, and demo video; deadline in week 12.

Assessment - Formative feedback

  • Weekly lab exercises provide feedback on understanding and practice.
  • Labs and questions support coursework progress.
  • Do not use language models to plagiarise the NLP coursework.
  • Academic misconduct will penalise the ENTIRE component mark!
  • Start coursework early!

Starting with Words

  • Words are the data.
  • Average language has ~200K unique words.
  • Zipf’s law: freq × rank ≈ constant.
  • Word frequency is inversely proportional to its rank.
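
As a quick illustration (toy corpus, made-up text), the freq × rank product can be inspected directly by counting and ranking words:

```python
from collections import Counter

# Toy corpus, for illustration only.
text = "the cat sat on the mat the cat ran the dog ran the"
counts = Counter(text.split())

# Rank words by frequency (rank 1 = most frequent) and inspect freq * rank.
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"rank={rank} word={word!r} freq={freq} freq*rank={freq * rank}")
```

On a real corpus the freq × rank products stay roughly constant across the vocabulary; a corpus this small only hints at the shape.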

NLP Leverages Linguistics

  • Linguistics: Study of language including morphology, meaning, context, social/cultural, and historical/political influences.
  • Understanding language structure is important for building NLP systems.
    • Words.
    • Morphology.
    • Parts of speech.
    • Syntax parsing.
    • Semantics.
    • Textual entailment.

Syntax Parsing

  • Example parse tree:
    S
    ├── NP
    │   ├── Det
    │   │   └── The
    │   ├── Adj
    │   │   └── Quick
    │   ├── Adj
    │   │   └── Brown
    │   └── N
    │       └── fox
    └── [1]_
        ├── V
        │   └── Jumps
        └── [2]_
            ├── [3]_
            │   └── Over
            └── [4]_
                ├── Det
                │   └── The
                ├── Adj
                │   └── Lazy
                └── N
                    └── dog

From Stems to Lemmas

  • Lemma: Canonical form of a set of words.
    • Example: {processed, processes, processing} -> lemma: process.
  • Lemmatization: NLP pipeline task - breaks text into lemmas.
  • Stemming: Breaks text into stems.
  • Irregular forms: {is, are, was, were} -> lemma: be; stems: is, are, wa(s), wer(e).
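
The contrast between stemming (crude suffix stripping) and lemmatization (mapping to a canonical form, including irregular forms) can be sketched with a toy implementation. This is illustrative only, not the Porter algorithm or a WordNet lemmatizer:

```python
def toy_stem(word):
    # Toy suffix stripping (illustrative; a real stemmer like Porter
    # applies a much richer set of ordered rewrite rules).
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny lookup table for irregular forms; real lemmatizers use a
# dictionary plus POS information.
LEMMAS = {"is": "be", "are": "be", "was": "be", "were": "be"}

def toy_lemma(word):
    return LEMMAS.get(word, toy_stem(word))

print([toy_stem(w) for w in ["processed", "processes", "processing"]])
print([toy_lemma(w) for w in ["is", "are", "was", "were"]])
```

All three "process" variants reduce to the same stem, while the irregular forms of "be" need the lookup table, mirroring the lemma-vs-stem distinction above.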

Morphology

  • Changing "car" to "cars" is an example of inflectional morphology.
  • True/False: Regular expressions are not a subset of natural languages. (Answer: False)

Subword Tokenization - WordPiece

  • Greedily creates a vocabulary of subword units based on frequency.
  • Starts with characters.
  • Merges frequent subword pairs to maximize training data likelihood.
  • Continues until vocabulary size or likelihood threshold is reached.
  • BERT, DistilBERT: "unlikely" -> ["un", "##like", "##ly"].
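
At inference time, a trained WordPiece vocabulary is applied with greedy longest-match-first segmentation (as in BERT-style tokenizers). A minimal sketch, assuming a tiny hand-picked vocabulary rather than one learned from data:

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first segmentation against a fixed vocabulary.
    # Non-initial pieces carry the "##" continuation prefix.
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no piece matches: emit unknown token
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "like", "##like", "##ly"}   # toy vocabulary (assumption)
print(wordpiece_tokenize("unlikely", vocab))  # ['un', '##like', '##ly']
```

This reproduces the "unlikely" example above; learning the vocabulary itself is the separate merge-based training procedure described in the bullets.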

Tokenization

  • Simple character-level tokenization can lose semantic information in words/phrases.

Representation of Words

  • Encapsulate meaning of words.
  • Feature extraction: One-hot vectors.
  • Bag of words approach.
  • Words as discrete symbols.
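
One-hot vectors and bag-of-words counts can be built in a few lines; the vocabulary here is a toy example:

```python
vocab = ["cat", "dog", "sat", "mat"]          # toy vocabulary (assumption)
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Word as a discrete symbol: a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(tokens):
    # Order-free count vector over the vocabulary.
    vec = [0] * len(vocab)
    for t in tokens:
        if t in index:
            vec[index[t]] += 1
    return vec

print(one_hot("dog"))                       # [0, 1, 0, 0]
print(bag_of_words(["cat", "sat", "cat"]))  # [2, 0, 1, 0]
```

Note that both representations treat words as unrelated symbols: "cat" and "dog" are equally distant from each other as from "mat", which is what embeddings later fix.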

Properties of Semantic Queries

  • Representing words as vectors allows mathematical calculations.
    • Example: marie_curie_vector = we['woman'] + we['Europe'] + we['physics'] + … we['scientist'] - we['male']
  • Analogy questions:
    • If UK → London, then France → ???
    • we[?] = we['france'] + (we['london'] - we['uk'])
  • Guessing words based on surrounding words:
    • "Better ___ than sorry" → safe.
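
The analogy arithmetic can be demonstrated with toy 2-D vectors (made-up numbers, not real embeddings) and a nearest-neighbour lookup:

```python
# Toy 2-D "embeddings" chosen by hand so the analogy works exactly.
we = {
    "uk":     [1.0, 0.0],
    "london": [1.0, 1.0],
    "france": [0.0, 0.0],
    "paris":  [0.0, 1.0],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def nearest(vec, exclude):
    # Nearest stored word by squared Euclidean distance, skipping the
    # words already used in the analogy.
    return min((w for w in we if w not in exclude),
               key=lambda w: sum((x - y) ** 2 for x, y in zip(we[w], vec)))

# we[?] = we['france'] + (we['london'] - we['uk'])
query = add(we["france"], sub(we["london"], we["uk"]))
print(nearest(query, exclude={"france", "london", "uk"}))  # paris
```

With real word2vec/GloVe vectors the same query lands near "paris" rather than exactly on it, so nearest-neighbour search is the standard final step.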

word2vec skip-gram model

  • Two-layer network.
  • Hidden layer: n neurons (vector dimensions).
  • Input/output layers: M neurons (vocabulary size).
  • Output layer activation: softmax (for classification).
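
The training data for the skip-gram model above is just (target, context) pairs drawn from a sliding window; a minimal pair-generation sketch:

```python
def skipgram_pairs(tokens, window=2):
    # For each target token, pair it with every token within `window`
    # positions on either side; these pairs train the skip-gram model.
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs(["better", "safe", "than", "sorry"], window=1)
print(pairs)
```

Each pair becomes one training example: one-hot target in, softmax over the vocabulary out, with the hidden layer's weights serving as the word vectors after training.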

Topic Modelling in Practice

  • Calculate K topics in a large corpus of documents.
  • Topics as features of each document.
  • Topic distribution gives the "DNA" of documents.
  • Similarity: apply a distance metric to compare each document pair's topic distributions.
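
Comparing two documents via their topic "DNA" is just a distance metric over distributions; here cosine similarity over hypothetical 3-topic distributions (made-up numbers):

```python
import math

def cosine_sim(p, q):
    # Cosine similarity between two topic-distribution vectors.
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

doc_a = [0.7, 0.2, 0.1]   # toy topic distributions (K = 3)
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

print(cosine_sim(doc_a, doc_b))   # high: similar topic mix
print(cosine_sim(doc_a, doc_c))   # low: dominated by a different topic
```

Other distances (Jensen-Shannon divergence, Hellinger) are also common for comparing topic distributions, since these vectors are probability distributions.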

Document Embeddings

  • Converting documents to numerical data.
  • Pre-trained language models: BERT, RoBERTa, ALBERT, S-BERT.
  • Multilingual models: XLM-R, mBERT, mBART, IndicBERT.
  • Ensure data language is part of the multilingual language model.

Improving RNNs

  • Techniques:
    • Gradient clipping.
    • Bidirectional RNNs & Multi-Layer RNNs (or Stacked RNNs).
    • Long Short-Term Memory (LSTM) RNNs.
      • Additional "memory" for long term information, controlled by gates.
      • LSTMs became the default for most NLP tasks.
    • Gated Recurrent Unit (GRU).
      • Similar to LSTM, but simpler (two gates).
      • More efficient, but possibly less accurate.

Attention

  • Key-value store containing input's information, looked up with a query (current context).
  • List of vectors (one for each token) associated with keys.
  • Increases "memory" decoder can refer to.
  • Mapping shared with the decoder at each time step.
  • Two common types of attention:
    • Encoder-decoder attention (cross-attention; RNN Seq2Seq and Transformer).
    • Self-attention (Transformer).

Self-Attention

  • Create Query (Q), Key (K), and Value (V) vectors from each encoder input vector (embedding).
  • Encoded representation of input as key-value pairs, (K, V), of dimension d_k.
  1. Create Q vector qi, K vector ki, and a V vector vi by multiplying the word embedding by three matrices learned during training.
  2. Calculate the self-attention score for each word.
  3. Divide scores by √d_k to stabilise gradients.
  4. Pass the result through a SoftMax operation.
  5. Multiply each value vector by the SoftMax score.
  6. Sum up the weighted value vectors.

Self-Attention Numerical Example

  • Given: Q = [1, 0], K1 = [1, 0], K2 = [0, 1], V1 = [2, 2], V2 = [1, 2], d_k = 2; use √2 ≈ 1.4 and exp(0.7) ≈ 2.
  • Step 1: Dot products of Q with each key
    • Q · K1 = (1 × 1) + (0 × 0) = 1
    • Q · K2 = (1 × 0) + (0 × 1) = 0
  • Step 2: Scale dot products by √d_k = √2 ≈ 1.4
    • Scaled scores = [1/1.4, 0/1.4] ≈ [0.7, 0]
  • Step 3: Apply softmax
    • exp(0.7) ≈ 2, exp(0) = 1
    • Softmax denominator ≈ 3; attention weights ≈ [2/3, 1/3]
  • Step 4: Attention output for Q:
    • (2/3 × V1) + (1/3 × V2) = [4/3, 4/3] + [1/3, 2/3] = [5/3, 2]
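
The same computation can be run exactly (no rounding) in a few lines of Python; with the exact √2 the weights come out ≈ [0.67, 0.33] and the output ≈ [1.67, 2.0], close to the rounded [5/3, 2]:

```python
import math

def attention(q, keys, values, d_k):
    # Scaled dot-product attention for a single query vector.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

out = attention([1, 0], [[1, 0], [0, 1]], [[2, 2], [1, 2]], d_k=2)
print(out)
```

The second output component is exactly 2.0 because both value vectors agree in that dimension, so the attention weights cancel out.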

Positional Encoding (PE)

  • Attention layers see the input as a set of vectors with no sequential order.

  • PE accounts for word order in the input sequence (Transformer model).

  • PE vector is added to the embedding vector.

  • Embeddings represent token in d-dimensional space (similar meanings closer).

  • PE encodes absolute position; it does not directly encode the relative position of tokens.

  • After adding PE, tokens are closer based on meaning and position in d-dimensional space.

  • Formula:

    • PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

    • PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

  • Example: pos=0, d_model=4:

    • PE(pos=0) = [sin(0/10000^(0/4)), cos(0/10000^(0/4)), sin(0/10000^(2/4)), cos(0/10000^(2/4))] = [0,1,0,1].
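
The sinusoidal formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) can be coded directly, and reproduces the pos = 0 example:

```python
import math

def positional_encoding(pos, d_model):
    # Sinusoidal positional encoding: interleaved sin/cos pairs, one pair
    # per frequency i = 0 .. d_model/2 - 1 (d_model assumed even).
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

At pos = 0 every angle is 0, so all sin terms are 0 and all cos terms are 1, giving [0, 1, 0, 1] for d_model = 4.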

Transformers Architecture

  • [Figure] Transformer architecture: encoder and decoder stacks, multi-head attention, and positional encoding.

Transfer Learning

  • Improve model performance using data/models trained on a different task.
  • Two steps:
    • Pretraining: ML model trained for one task.
    • Adaptation: Model adjusted and used in another task.
    • Fine-tuning: adapting the pretrained model by continuing training on the target task's data.

Scaling Laws

  • Larger models are more sample-efficient: they reach a given loss after seeing less training data.
  • BERT was trained with 3 epochs; GPT-3 with less than one epoch.

ELMo

  • Task-specific combination of intermediate layer representations in the biLM.
  • For each token t_k, an L-layer biLM computes a set of 2L + 1 representations.
  • h_{k,0}^{LM} is the token layer, and h_{k,j}^{LM} = [→h_{k,j}^{LM}; ←h_{k,j}^{LM}] (forward and backward biLSTM states concatenated) for each biLSTM layer j = 1, …, L.
  • ELMo collapses all layers in R into a single vector for downstream models.
  • Simplest case -> ELMo selects the top layer as in TagLM and CoVe.

BERT: pre-training process

  • BERT’s MLM (Masked Language Model) and NSP (Next Sentence Prediction) training objectives combined
  • Predict if the second sentence is connected to the first.
  • Input sequence through the Transformer model.
  • Output of the [CLS] token is transformed into a 2×1 vector by a simple classification layer (learned weight matrix and bias).
  • Calculating the probability of IsNextSequence with softmax.
  • Masked LM and Next Sentence Prediction are trained together, minimizing combined loss function.

XLM-R

  • Follows the same approach as XLM.
  • Scaled-up version of XLM-100.
  • Sample streams of text from each language and train the model to predict masked tokens.
  • SentencePiece with a unigram language model applies subword tokenization on the raw text.
  • Sample batches from different languages using the same sampling distribution with α = 0.3.
  • Extend MLM to Translation Language Modelling (TLM)
    • TLM extends MLM to pairs of parallel sentences.
    • Predict masked English word, model attends to both English sentence and its French translation

SentenceBERT

  • Sentences passed through pooling layers result in two 768-dimensional vectors u and v.
  • Three approaches for optimizing different objectives using u and v.
  • NLP Task examples:
    • Natural Language Inference
    • Given sentences A and B, predict whether B (hypothesis) is true (entailment), false (contradiction), or undetermined (neutral) given A (premise).
    • Semantic Textual Similarity
    • Measure semantic similarity, normalized on scale of 0 – 1.

Token Classification

  • Transformer + Conditional Random Field (CRF).
  • Fine-tune an encoder, optionally with a CRF layer to capture non-local label dependencies.
  • Use the WikiNeural dataset for the multilingual Named Entity Recognition (NER) task.

How to model NLP Tasks?

  • Define the problem clearly.
  • Analyze the data.
    • Multilingual?
    • Dialectal?
    • Domain-specific?
    • Conversational?
    • Chronological?
  • Core: Input/Output?
  • Data Size and Quality.
  • Select appropriate model architecture (e.g., sequence-to-sequence, classification/regression).
  • Select appropriate approaches given data (e.g., fine-tuning with labelled data / continual pre-training for unlabelled data).
  • Identify suitable metrics (implicit & explicit evaluation).

Example

  • Categorize customer feedback (online retail): positive, negative, or neutral.
  • Written statement and 1-5 star reviews.
  • System extracts key information (materials, methods, results) from scientific papers.
  • Papers may have a structured format: Abstract, Introduction, Methods, Results, and Conclusion.

Key Considerations for Extraction Systems

  1. Discuss data sources (e.g. web scraping).
  2. Manual annotation of a subset of papers may be necessary.
  3. Consider using domain-specific models.
  4. Discuss approaches to extract the information.
  5. Discuss evaluation for the task.

Modeling Tasks: Multiple Choice

  • Abstractive text summarisation with Transformers:
    • Advantage: Ability to generate new sentences that capture the meaning of the original text
  • Advantages of Transformers over RNNs for sequence-to-sequence tasks:
    • Better at capturing long-range dependencies
    • More suitable for parallelisation

Types of Attention

  • Encoder self-attention
  • Masked decoder self-attention
  • Encoder-decoder attention (cross-attention)

Pre-training Decoder

  • Language models generate text by predicting the next token given a prompt.

  • Given a prompt x = (x_1, x_2, …, x_n), a causal language model estimates the probability of the next token x_{n+1}; θ denotes the parameters of the language model.

  • Models are trained using the causal language modelling loss, the negative log-likelihood of each token given its prefix:

    • L(θ) = −Σ_t log p_θ(x_t | x_1, …, x_{t−1})

  • During inference (generation), the model generates new tokens by sampling from the probability distribution of the vocabulary given preceding tokens.
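
The causal LM training loss (mean negative log-likelihood of each target token given its prefix) can be sketched with plain Python; the logits here are dummy values, not from a real model:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def causal_lm_loss(logits_per_step, targets):
    # Mean negative log-likelihood: -(1/n) * sum_t log p(x_t | x_<t).
    # logits_per_step[t] are the model's vocabulary logits after prefix t.
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= math.log(softmax(logits)[target])
    return total / len(targets)

# Uniform logits over a 4-token vocabulary -> loss of ln(4) per token.
loss = causal_lm_loss([[0.0] * 4, [0.0] * 4], targets=[1, 3])
print(round(loss, 4))
```

A model that assigns probability 1 to every target would reach loss 0; uniform guessing over a vocabulary of size V gives ln(V), which is the sanity check used here.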

Emergent Abilities of Large Language Models: GPT-2 (2019)

  • GPT-2 (1.5B parameters).
  • Same architecture as GPT, just bigger.
  • Trained on much more data: 40GB of internet text data (WebText).
  • Scraped from links posted on Reddit with at least 3 karma.

Zero-shot Capabilities

  • One key emergent ability in GPT-2 is zero-shot.
  • Ability to do many tasks without any examples, and no gradient updates, by simply:
    • Specifying the right sequence prediction problem (e.g. question answering)
    • Passage: Tom Brady… Q: Where was Tom Brady born? A: …

Zero-shot Chain-of-thought prompting

  • Example:
    • “Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there? A: Let’s think step by step.”
    • There are 16 balls in total. Half of the balls are golf balls. That means there are 8 golf balls. Half of the golf balls are blue. That means there are 4 blue golf balls.

Prompt engineering strategies

  • Clear task description.
  • Describe the problem precisely, specifically, and clearly, and instruct the LLM to perform as expected.

Working with LLMs

  • Zero-shot / Few-shot Scenarios (learning and evaluation).
    • No fine-tuning needed; prompt engineering and chain-of-thought can improve performance.
    • Limited by context length (input token length).
    • Complex tasks will probably need gradient steps.
  • Instruction Fine-tuning.
    • Simple and straightforward, generalize on unseen tasks.
    • Collecting demonstrations for so many tasks is expensive.
    • Mismatch between LM objective and human preferences.
  • Reinforcement Learning with Human Feedback (RLHF).
    • Reward model with human feedback.
    • Use these to further improve the model

LLMs

  • Valid approaches for continual pre-training of LLMs:
    • Training on domain-specific text corpora to adapt the model to a specific field
    • Updating the model with recent data to incorporate new knowledge

Statistical Metrics for LLM Evaluation

  • Assembling a confusion matrix for each subgroup lets researchers evaluate biases across all subgroups in a dataset.
  • Each predicted token (word) is compared with the ground truth.
  • Mostly Accuracy and F1 score are used, but any classification-based metric could be applied.

  Metric     | Equation
  -----------|-------------------------------------------------
  Accuracy   | (TP + TN) / (TP + TN + FP + FN)
  Precision  | TP / (TP + FP)
  Recall     | TP / (TP + FN)
  F1 Score   | 2 × (precision × recall) / (precision + recall)
  AUPRC      | P / (P + N) (baseline of a random classifier)
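
The classification metrics in the table follow directly from the confusion-matrix counts; a small helper makes the formulas concrete (counts below are made-up):

```python
def classification_metrics(tp, tn, fp, fn):
    # Standard classification metrics from confusion-matrix counts.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

metrics = classification_metrics(tp=8, tn=5, fp=2, fn=1)
print(metrics)
```

For multi-class evaluation the same formulas are applied per class and then micro- or macro-averaged, which is how F1 is usually reported for NLP tasks.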

Deploying and Serving a Model

  • Understand application requirements for deployment:
    • Dimension hardware and software requirements.
    • Deal with speed, costs and accuracy requirements.
  • Constraints related to privacy, throughput, latency.
  • Heavily compressing large models while keeping accuracy at similar levels can be a challenge.
  • Techniques such as pruning, quantization, and knowledge distillation can be used to compress large models after they are trained.
  • Scalability can be achieved by implementing serverless architectures for inference and training, and by utilising caching where appropriate.
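
Of the compression techniques named above, quantization is the easiest to sketch. This is a toy symmetric int8 scheme on a hand-picked weight list; real toolkits additionally calibrate activations, use per-channel scales, and handle zero-points:

```python
def quantize_int8(weights):
    # Toy symmetric post-training quantization: map the largest-magnitude
    # weight to +/-127 and round everything else to the nearest integer.
    scale = max(abs(x) for x in weights) / 127 or 1.0
    return [round(x / scale) for x in weights], scale

def dequantize(quantized, scale):
    # Recover approximate float weights from the int8 values.
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.03]          # made-up weights for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, restored)
```

Storing int8 values instead of float32 cuts memory roughly 4×, at the cost of small rounding errors in the restored weights; that accuracy/size trade-off is exactly the challenge noted above.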

MLOps Tools

  • mlflow: manage the ML lifecycle.
  • seldon: deploy machine learning models at scale, on platforms such as Kubernetes.
  • Jenkins: enables Continuous Integration and Continuous Delivery (CI/CD).
  • Kubeflow: open-source toolkit for deploying ML workflows on Kubernetes.
  • Airflow: monitor, schedule, and manage ML workflows.

RAG

  • Relevant in implementing effective RAG systems:
    • Vector databases for storing document embeddings
    • Chunking strategies for breaking down source documents
    • Retrieval depth vs. computational cost trade-offs
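
The chunk-then-retrieve core of a RAG system can be sketched end to end. Here naive fixed-size chunking and bag-of-words cosine similarity stand in for the real components (a smarter chunker, a neural embedding model, and a vector database):

```python
from collections import Counter
import math

def chunk(text, size=6):
    # Naive fixed-size word chunking; real systems chunk on sentences,
    # sections, or token budgets with overlap.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # Bag-of-words stand-in for a neural embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc = ("Transformers use self attention . Positional encodings add order "
       "information . Retrieval augmented generation grounds answers in documents .")
chunks = chunk(doc)

query = embed("what is retrieval augmented generation")
best = max(chunks, key=lambda c: cosine(embed(c), query))
print(best)   # the retrieved chunk would be prepended to the LLM prompt
```

The retrieved chunk is what gets stuffed into the LLM's prompt; chunk size and the number of chunks retrieved are exactly the retrieval-depth vs. cost trade-offs listed above.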