Revision
Module Overview
- NLP advancements: Transformers, Foundation Models / Large Language Models, Multimodality in AI.
- Skills to build NLP models for:
- Classification/Regression :: Language Understanding.
- Machine translation :: Language Generation.
- LLM-based agents :: Building Frameworks.
- Building NLP pipelines:
- Preparing training data.
- Choosing algorithms and techniques for real-world problems.
- Emphasis on deep learning and transfer learning for model training/tuning.
Learning Outcomes
- Describe the NLP process lifecycle and theoretical fundamentals (KC).
- Demonstrate ability to build processes for solving specific NLP problems (CP).
- Build appropriate NLP transformation pipelines for training computational models (KCPT).
- Describe and deploy experiments, comparing techniques and selecting algorithms (KCPT).
- Build experiment scripts using a programming language, and produce NLP models (CPT).
- Deploy NLP models as Web service inference endpoints and test services (KCPT).
- C-Cognitive/Analytical; K-Subject Knowledge; T-Transferable Skills; P-Professional/Practical skills.
Module Content
- Introduction to NLP
- Traditional/Linguistic approaches to NLP
- From Feature Vectors to Language Embeddings
- Topic Modelling & Artificial Neural Networks for NLP [RNN, LSTM/GRU, CNN]
- Intro to Transformers and NLP Tasks [How to model for them? Supervised/Unsupervised]
- Transformer-based Language Models [w/ Multilinguality, Low-resource Languages, Creoles]
- Autoregressive decoders [Large Language Models]
- NLP Pipeline [LM Fine-tuning / Instruction fine-tuning / Zero-shot / Few-shot]
- Evaluation in NLP [Task-based Evaluation, LLM Evaluation for Toxicity/Bias/Truthfulness, …]
- MLOps & Deployment [Hardware requirements in training scalable models, MiniBatching, GPUs]
- Advanced approaches in NLP [Retrieval-augmented Generation, …]
- Revision Lecture (you are here)
Assessment Strategy
- Group formation (SurreyLearn): week 6
- Self-enrolling groups in an editable shared Excel sheet.
- Students are responsible for forming groups.
- No group changes after the deadline.
- Thorough discussion among peers is necessary.
- Individual examination (50% marks)
- Multiple choice, multi-select, fill in the blanks, true/false, short answers.
- Refer to module slides, classroom discussions, reading list and supplementary material on SurreyLearn.
- Group coursework (50% marks)
- Build a proof-of-concept NLP pipeline.
- Demonstrate technical understanding and good practice (voice over screen capture recording).
- Submit code, final report, and demo video; deadline in week 12.
Assessment - Formative feedback
- Weekly lab exercises provide feedback on understanding and practice.
- Labs and questions support coursework progress.
- Do not use language models to plagiarise NLP coursework.
- Academic misconduct will penalise the ENTIRE component mark!
- Start coursework early!
Starting with Words
- Words are the data.
- Average language has ~200K unique words.
- Zipf’s law:
- Word frequency is inversely proportional to its rank.
Zipf's Law
- According to Zipf's Law, the product of a word's frequency and its rank is approximately a constant.
- The frequency of a word is inversely proportional to its rank in the frequency table.
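Zipf's law can be sanity-checked directly: if frequency is inversely proportional to rank, then frequency × rank should be roughly constant. A minimal sketch on a hypothetical toy corpus constructed to follow the pattern:

```python
from collections import Counter

def zipf_products(text, top_n=5):
    """Rank words by frequency and return (word, frequency * rank) for the
    top ranks. Under Zipf's law these products should be roughly constant."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common(top_n)
    return [(word, freq * rank) for rank, (word, freq) in enumerate(ranked, start=1)]

# Hypothetical toy corpus with Zipfian frequencies (12, 6, 4, 3):
corpus = " ".join(["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3)
products = zipf_products(corpus, top_n=4)
```

Here every frequency × rank product comes out to 12; real corpora only approximate this.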
NLP Leverages Linguistics
- Linguistics: Study of language including morphology, meaning, context, social/cultural, and historical/political influences.
- Understanding language structure is important for building NLP systems.
- Words.
- Morphology.
- Parts of speech.
- Syntax parsing.
- Semantics.
- Textual entailment.
Syntax Parsing
- Example parse tree:
- S
  ├── NP
  │   ├── Det → The
  │   ├── Adj → Quick
  │   ├── Adj → Brown
  │   └── N → fox
  └── VP
      ├── V → Jumps
      └── PP
          ├── P → Over
          └── NP
              ├── Det → The
              ├── Adj → Lazy
              └── N → dog
From Stems to Lemmas
- Lemma: Canonical form of a set of words.
- Example: {processed, processes, processing} -> lemma: process.
- Lemmatization: NLP pipeline task - breaks text into lemmas.
- Stemming: Breaks text into stems.
- Irregular forms: {is, are, was, were} -> lemma: be; stems: is, are, wa(s), wer(e).
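The stem/lemma distinction can be sketched with a crude suffix-stripping stemmer and a dictionary lookup for irregular lemmas (a toy illustration, not Porter's algorithm or a real lexicon):

```python
def stem(word):
    """Crude suffix-stripping stemmer (toy sketch, not Porter's algorithm)."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization needs a dictionary for irregular forms; a tiny lookup table:
IRREGULAR = {"is": "be", "are": "be", "was": "be", "were": "be"}

def lemma(word):
    """Map irregular forms via the lookup table, else fall back to the stem."""
    return IRREGULAR.get(word, stem(word))
```

Stemming is purely mechanical (it would also mangle words like "was" into "wa"), while lemmatization knows that {is, are, was, were} share the canonical form "be".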
Morphology
- Changing "car" to "cars" is an example of inflectional morphology.
- Regular expressions are not a subset of natural languages (False).
Subword Tokenization - WordPiece
- Greedily creates a vocabulary of subword units based on frequency.
- Starts with characters.
- Merges frequent subword pairs to maximize training data likelihood.
- Continues until vocabulary size or likelihood threshold is reached.
- BERT, DistilBERT: "unlikely" -> ["un", "##like", "##ly"].
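At inference time, WordPiece segmentation is a greedy longest-match-first walk over the learned vocabulary (the merge-based learning step above produces that vocabulary). A minimal sketch with a hypothetical mini-vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece segmentation of a single word.
    Non-initial pieces carry the '##' continuation prefix, as in BERT."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation piece
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: unknown token
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical mini-vocabulary containing the pieces of "unlikely":
vocab = {"un", "##like", "##ly", "##l", "##i"}
```

With this vocabulary, "unlikely" segments into ["un", "##like", "##ly"], matching the BERT example above.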
Tokenization
- Simple character-level tokenization can lose semantic information in words/phrases.
Representation of Words
- Encapsulate meaning of words.
- Feature extraction: One-hot vectors.
- Bag of words approach.
- Words as discrete symbols.
Properties of Semantic Queries
- Representing words as vectors allows mathematical calculations.
- Example: Marie Curie vector ≈ we['woman'] + we['Europe'] + we['physics'] + … + we['scientist'] - we['male']
- Analogy questions:
- If UK → London, then France → ???
- Guessing words based on surrounding words:
- "Better ___ than sorry" → safe.
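Analogy queries reduce to vector arithmetic plus a nearest-neighbour search. A sketch with hypothetical 2-d embeddings chosen so that the capital-of relation is a fixed offset (real embeddings are high-dimensional and only approximately linear):

```python
import math

# Hypothetical 2-d embeddings; capital-of is the offset [0, 1]:
we = {
    "UK": [1.0, 0.0], "London": [1.0, 1.0],
    "France": [2.0, 0.0], "Paris": [2.0, 1.0],
    "Germany": [3.0, 0.0], "Berlin": [3.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via b - a + c, excluding query words."""
    target = [wb - wa + wc for wa, wb, wc in zip(we[a], we[b], we[c])]
    candidates = [w for w in we if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(we[w], target))
```

Here `analogy("UK", "London", "France")` lands on "Paris", mirroring the UK → London, France → ??? question above.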
word2vec skip-gram model
- Two-layer network.
- Hidden layer: n neurons (vector dimensions).
- Input/output layers: M neurons (vocabulary size).
- Output layer activation: softmax (for classification).
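The output layer described above can be sketched directly: dot products of the hidden (center-word) vector with every output word vector, then a softmax over the M vocabulary words. Weights here are made up purely for illustration:

```python
import math

def skipgram_forward(center_vec, output_matrix):
    """Skip-gram output layer: one score per vocabulary word via dot
    products, normalized to probabilities with softmax."""
    scores = [sum(c * w for c, w in zip(center_vec, row)) for row in output_matrix]
    exp = [math.exp(s) for s in scores]
    total = sum(exp)
    return [e / total for e in exp]

# n = 2 hidden dimensions, M = 3 vocabulary words (hypothetical weights):
probs = skipgram_forward([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
```

The result is a valid probability distribution over the vocabulary, with the highest mass on the output word whose vector best aligns with the center word's hidden vector.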
Topic Modelling in Practice
- Calculate K topics in a large corpus of documents.
- Topics as features of each document.
- Topic distribution gives the "DNA" of documents.
- Similarity: apply a distance metric to compare each document pair's topic distributions.
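One reasonable distance metric for comparing topic distributions is Jensen-Shannon divergence (a symmetric, bounded variant of KL divergence; the choice of metric is an assumption here, not prescribed by the slides):

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two topic distributions:
    0 for identical documents, at most 1 for disjoint ones."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return (kl(p, m) + kl(q, m)) / 2

# Two hypothetical documents' "DNA" over K = 3 topics:
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]
```

Applying this pairwise over a corpus yields a document-similarity matrix, as described above.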
Document Embeddings
- Converting documents to numerical data.
- Pre-trained language models: BERT, RoBERTa, ALBERT, S-BERT.
- Multilingual models: XLM-R, mBERT, mBART, IndicBERT.
- Ensure data language is part of the multilingual language model.
Improving RNNs
- Techniques:
- Gradient clipping.
- Bidirectional RNNs & Multi-Layer RNNs (or Stacked RNNs).
- Long Short-Term Memory (LSTM) RNNs.
- Additional "memory" for long term information, controlled by gates.
- LSTMs became the default for most NLP tasks.
- Gated Recurrent Unit (GRU).
- Similar to LSTM, but simpler (two gates).
- More efficient, but possibly less accurate.
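Gradient clipping, the first technique listed above, can be sketched in a few lines (the clip-by-norm variant, which rescales the whole gradient vector when its L2 norm is too large):

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector if its L2 norm exceeds max_norm: the
    standard remedy for exploding gradients in RNN training."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads
```

Clipping preserves the gradient's direction and only shrinks its magnitude, so updates stay informative while remaining bounded.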
Attention
- Key-value store containing input's information, looked up with a query (current context).
- List of vectors (one for each token) associated with keys.
- Increases "memory" decoder can refer to.
- Mapping shared with the decoder at each time step.
- Two common types of attention:
- Encoder-decoder attention (cross-attention; RNN Seq2Seq and Transformer).
- Self-attention (Transformer).
Self-Attention
- Create Query (Q), Key (K), and Value (V) vectors from each encoder input vector (embedding).
- Encoded representation of input as key-value pairs, (K, V), of dimension d_k.
- Create Q vector qi, K vector ki, and a V vector vi by multiplying the word embedding by three matrices learned during training.
- Calculate the self-attention score for each word.
- Divide scores by √d_k to stabilise gradients.
- Pass the result through a SoftMax operation.
- Multiply each value vector by the SoftMax score.
- Sum up the weighted value vectors.
Self-Attention Numerical Example
- Given: Q = [1, 0], K1 = [1, 0], K2 = [0, 1], V1 = [2, 2], V2 = [1, 2], d_k = 2, √d_k ≈ 1.4.
- Step 1: Dot products of Q with each key:
  - Q · K1 = 1·1 + 0·0 = 1
  - Q · K2 = 1·0 + 0·1 = 0
- Step 2: Scale dot products by √d_k:
  - Scaled scores = [1/1.4, 0/1.4] = [0.7, 0]
- Step 3: Apply Softmax:
  - e^0.7 ≈ 2, e^0 = 1
  - Softmax denominator ≈ 3, attention weights ≈ [2/3, 1/3]
- Step 4: Attention output for Q:
  - (2/3)·V1 + (1/3)·V2 = [(2/3)·2 + (1/3)·1, (2/3)·2 + (1/3)·2] = [5/3, 2] ≈ [1.67, 2]
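The worked example can be verified in code; a pure-Python sketch of single-query scaled dot-product attention (note the slide-style rounding of √2 to 1.4 and e^0.7 to 2, so the code's output matches only approximately):

```python
import math

def scaled_dot_product_attention(q, keys, values, d_k):
    """Single-query attention: score, scale by sqrt(d_k), softmax, then
    weighted sum of value vectors."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    exp = [math.exp(s) for s in scores]
    total = sum(exp)
    weights = [e / total for e in exp]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

out = scaled_dot_product_attention([1, 0], [[1, 0], [0, 1]], [[2, 2], [1, 2]], d_k=2)
```

The exact output is ≈ [1.67, 2.0], agreeing with the hand computation of [5/3, 2] up to rounding.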
Positional Encoding (PE)
- Attention layers see the input as a set of vectors with no sequential order.
- PE accounts for word order in the input sequence (Transformer model).
- The PE vector is added to the embedding vector.
- Embeddings represent tokens in d-dimensional space (similar meanings closer).
- PE does not encode the relative position of tokens.
- After adding PE, tokens are closer based on meaning and position in d-dimensional space.
- Formula: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
- Example: pos=0, d_model=4:
- PE(pos=0) = [sin(0/10000^(0/4)), cos(0/10000^(0/4)), sin(0/10000^(2/4)), cos(0/10000^(2/4))] = [0,1,0,1].
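The sinusoidal encoding can be computed directly; a minimal sketch assuming the standard formulation PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding: sine on even dimensions, cosine on
    odd, with geometrically increasing wavelengths."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe
```

For pos=0 and d_model=4 this reproduces the example above: [0, 1, 0, 1].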
Transformers Architecture
- [Figure] Architecture of the Transformer model: encoder and decoder components, multi-head attention, and positional encoding.
Transfer Learning
- Improve model performance using data/models trained on a different task.
- Two steps:
- Pretraining: ML model trained for one task.
- Adaptation: Model adjusted and used in another task.
- Fine-tuning: Using same model for both tasks.
Scaling Laws
- Larger models are more sample-efficient: they reach a given loss after seeing fewer training tokens.
- BERT was trained with 3 epochs; GPT-3 with less than one epoch.
ELMo
- Task-specific combination of intermediate layer representations in the biLM.
- For each token t_k, an L-layer biLM computes a set of 2L + 1 representations: R_k = { x_k^LM, h_{k,j}^{LM,→}, h_{k,j}^{LM,←} | j = 1, …, L }.
- x_k^LM is the token-layer representation, and h_{k,j}^LM = [h_{k,j}^{LM,→}; h_{k,j}^{LM,←}] for each biLSTM layer.
- ELMo collapses all layers in R_k into a single vector for downstream models.
- Simplest case -> ELMo selects the top layer, E(R_k) = h_{k,L}^LM, as in TagLM and CoVe.
BERT: pre-training process
- BERT’s MLM (Masked Language Model) and NSP (Next Sentence Prediction) training objectives combined
- Predict if the second sentence is connected to the first.
- Input sequence through the Transformer model.
- Output of the [CLS] token is transformed into a 2×1 vector using a simple classification layer (learned matrices of weights and biases).
- Calculating the probability of IsNextSequence with softmax.
- Masked LM and Next Sentence Prediction are trained together, minimizing combined loss function.
XLM-R
- Follows the same approach as XLM.
- Scaled-up version of XLM-100.
- Sample streams of text from each language and train the model to predict masked tokens.
- SentencePiece with a unigram language model applies subword tokenization on the raw text.
- Sample batches from different languages using the same sampling distribution as XLM, but with α = 0.3.
- Extend MLM to Translation Language Modelling (TLM)
- TLM extends MLM to pairs of parallel sentences.
- Predict masked English word, model attends to both English sentence and its French translation
SentenceBERT
- Sentences passed through pooling layers result in two 768-dimensional vectors u and v.
- Three approaches for optimizing different objectives using u and v.
- NLP Task examples:
- Natural Language Inference
- Given sentences A and B, predict if B(hypothesis) is true (entailment), false (contradiction), or undetermined (neutral) given A (premise).
- Semantic Textual Similarity
- Measure semantic similarity, normalized on scale of 0 – 1.
Token Classification
- Transformer + Conditional Random Field (CRF).
- Fine-tune an encoder, optionally with a Conditional Random Field (CRF) layer to capture non-local label information.
- Use the WikiNeural dataset for the multilingual Named Entity Recognition (NER) task.
How to model NLP Tasks?
- Define the problem clearly.
- Analyze the data.
- Multilingual?
- Dialectal?
- Domain-specific?
- Conversational?
- Chronological?
- Core: Input/Output?
- Data Size and Quality.
- Select appropriate model architecture (e.g., sequence-to-sequence, classification/regression).
- Select appropriate approaches given data (e.g., fine-tuning with labelled data / continual pre-training for unlabelled data).
- Identify suitable metrics (implicit & explicit evaluation).
Example
- Categorize customer feedback (online retail): positive, negative, or neutral.
- Written statement and 1-5 star reviews.
- System extracts key information (materials, methods, results) from scientific papers.
- Papers may have structured format Abstract, Introduction, Methods, Results, and Conclusion.
Key Considerations for Extraction Systems
- Discuss data sources, e.g. web scraping.
- Manual annotation of a subset of papers may be necessary.
- Consider using domain-specific models.
- Discuss approaches to extract information
- Discuss evaluation for the task
Modeling Tasks: Multiple Choice
- Abstractive text summarisation with Transformers:
- Advantage: Ability to generate new sentences that capture the meaning of the original text
- Advantages of Transformers over RNNs for sequence-to-sequence tasks:
- Better at capturing long-range dependencies
- More suitable for parallelisation
Types of Attention
- Encoder self-attention
- Masked decoder self-attention
- Encoder-decoder attention (cross-attention)
Pre-training Decoder
- Language models generate text by predicting the next token given a prompt.
- Given a prompt x_{1:t}, a causal language model estimates the probability of the next token, p_θ(x_{t+1} | x_{1:t}), where θ are the parameters of the language model.
- Models are trained using the causal language modelling loss: L(θ) = -Σ_t log p_θ(x_{t+1} | x_{1:t}).
- During inference (generation), the model generates new tokens by sampling from the probability distribution over the vocabulary given the preceding tokens.
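The causal language modelling loss can be computed for a toy model whose next-token distributions are given explicitly (all numbers here are hypothetical):

```python
import math

def causal_lm_loss(token_ids, probs):
    """Average negative log-likelihood of each next token given its prefix.

    probs[t] is a (hypothetical) model's distribution over the vocabulary
    after seeing tokens 0..t; the target is token_ids[t + 1]."""
    nll = 0.0
    for t in range(len(token_ids) - 1):
        nll += -math.log(probs[t][token_ids[t + 1]])
    return nll / (len(token_ids) - 1)

# Toy run: vocabulary of 3 token ids, observed sequence [0, 2, 1]:
probs = [
    {0: 0.2, 1: 0.3, 2: 0.5},  # distribution after seeing token 0
    {0: 0.1, 1: 0.8, 2: 0.1},  # distribution after seeing tokens 0, 2
]
loss = causal_lm_loss([0, 2, 1], probs)
```

Training drives this quantity down by raising the probability the model assigns to each observed next token.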
Emergent Abilities of Large Language Models: GPT-2 (2019)
- GPT-2 (1.5B parameters).
- Same architecture as GPT, just bigger.
- Trained on much more data: 40GB of internet text data (WebText).
- Scrape links posted on Reddit w/ at least 3 upvotes.
Zero-shot Capabilities
- One key emergent ability in GPT-2 is zero-shot.
- Ability to do many tasks without any examples, and no gradient updates, by simply:
- Specifying the right sequence prediction problem (e.g. question answering)
- Passage: Tom Brady… Q: Where was Tom Brady born? A: …
Zero-shot Chain-of-thought prompting
- Example:
- “Q: A juggler can juggle 16 balls. Half of the balls are golf balls, and half of the golf balls are blue. How many blue golf balls are there?\n\nA: Let’s think step by step.”
- There are 16 balls in total. Half of the balls are golf balls. That means there are 8 golf balls. Half of the golf balls are blue. That means there are 4 blue golf balls.
Prompt engineering strategies
- Clear task description.
- Precise, specific, and clear description of the problem and instruct the LLM to perform as we expect.
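A prompt following these strategies can be assembled mechanically: task description first, then demonstrations, then the query with a chain-of-thought trigger. A minimal sketch (the template and helper name are illustrative, not a standard API):

```python
def build_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: clear instruction, worked demonstrations,
    then the query with a zero-shot CoT trigger appended."""
    parts = [task_description]
    for q, a in examples:
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {query}\nA: Let's think step by step.")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Answer the arithmetic question.",
    [("What is 2 + 2?", "4")],
    "What is 3 + 3?",
)
```

Keeping the template in code makes the instruction, demonstrations, and trigger easy to vary independently when experimenting.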
Working with LLMs
- Zero-shot / Few-shot Scenarios (learning and evaluation).
- No finetuning, prompt engineering/CoT can improve performance.
- Limited by context length (input token length).
- Complex tasks will probably need gradient steps.
- Instruction Fine-tuning.
- Simple and straightforward; generalizes to unseen tasks.
- Collecting demonstrations for so many tasks is expensive.
- Mismatch between LM objective and human preferences.
- Reinforcement Learning with Human Feedback (RLHF).
- Reward model with human feedback.
- Use these to further improve the model
LLMs
- Valid approaches for continual pre-training of LLMs:
- Training on domain-specific text corpora to adapt the model to a specific field
- Updating the model with recent data to incorporate new knowledge
Statistical Metrics for LLM Evaluation
- Assembling a confusion matrix per subgroup lets researchers evaluate biases across all subgroups in a dataset.
- Each predicted token (word) is compared with the ground truth.
- Mostly Accuracy and F1 Score are used, but any classification-based metric could be applied.
\begin{tabular}{|l|l|}
\hline
Metric & Equation \\
\hline
Accuracy & (TP + TN) / (TP + TN + FP + FN) \\
\hline
Precision & TP / (TP + FP) \\
\hline
Recall & TP / (TP + FN) \\
\hline
F1 Score & 2 * (Precision * Recall) / (Precision + Recall) \\
\hline
AUPRC (random baseline) & P / (P + N) \\
\hline
\end{tabular}
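The table's formulas can be sketched as one helper over confusion-matrix counts (counts below are made up for illustration):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts,
    following the equations in the table above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts:
m = classification_metrics(tp=8, tn=5, fp=2, fn=1)
```

F1 is the harmonic mean of precision and recall, so it always falls between the two and is dragged toward the smaller value.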
Deploying and Serving a Model
- Understand application requirements for deployment:
- Dimension hardware and software requirements.
- Deal with speed, costs and accuracy requirements.
- Constraints related to privacy, throughput, latency.
- Heavily compressing large models while keeping accuracy at similar levels can be a challenge.
- Some techniques can heavily compress large models after they are trained, such as pruning, quantization, and knowledge distillation.
- Deployment requirements can also be addressed by implementing serverless architectures for inference and training, and by utilising caching where appropriate.
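The quantization idea can be illustrated with symmetric int8 post-training quantization of a weight vector (a toy sketch of the compression principle; real toolchains also calibrate activations and handle per-channel scales):

```python
def quantize_int8(weights):
    """Symmetric post-training quantization: map floats into [-127, 127]
    integers via one scale factor, and return the dequantized approximation."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    dequant = [qi * scale for qi in q]
    return q, dequant

# Hypothetical weights:
q, approx = quantize_int8([0.5, -1.27, 0.02])
```

Storage drops from 32 bits to 8 bits per weight; the error introduced is bounded by half a quantization step, which is why accuracy often survives compression.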
MLOps Tools
- mlflow: manage the ML lifecycle.
- seldon: deploy machine learning models at scale, on platforms such as Kubernetes.
- Jenkins: enables Continuous Integration and Continuous Delivery (CI/CD).
- Kubeflow: open-source toolkit for deploying and running ML workflows on Kubernetes.
- Airflow: monitor, schedule, and manage your ML workflows
RAG
- Relevant in implementing effective RAG systems:
- Vector databases for storing document embeddings
- Chunking strategies for breaking down source documents
- Retrieval depth vs. computational cost trade-offs
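The retrieval step of a RAG system reduces to a similarity search over chunk embeddings; a toy stand-in for a vector-database lookup (chunk texts and embeddings are hypothetical, and the retrieval depth k is the trade-off parameter mentioned above):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, chunks, k=2):
    """Return the texts of the top-k chunks by cosine similarity to the
    query embedding; a vector database does this at scale with indexing."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]),
                    reverse=True)
    return [c["text"] for c in ranked[:k]]

# Hypothetical 2-d chunk embeddings:
chunks = [
    {"text": "Transformers use attention.", "embedding": [0.9, 0.1]},
    {"text": "GRUs have two gates.", "embedding": [0.1, 0.9]},
    {"text": "Attention scales with sqrt(d_k).", "embedding": [0.8, 0.2]},
]
top = retrieve([1.0, 0.0], chunks, k=2)
```

The retrieved texts are then prepended to the LLM prompt; increasing k retrieves more context at higher computational cost.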