
Contextual Word Embeddings: BERT and Beyond

Coursework and Generative AI Policy

  • Coursework #1 is due on March 13th at 4:30 PM and is worth 20% of the final grade.
  • Coursework #2 (for MSci students) involves a paper review and is worth 10% of the final grade.
  • Tuesday lab times can be used for catching up on labs or working on the coursework.
  • All labs and courseworks are examinable.
  • Code from labs (including scikit-learn/transformers) can be used.
  • Generative AI can be used during assessed exercises, but its usage must be noted in the final question.
  • Material from the coursework may appear on the final exam.
  • Plagiarism is strictly prohibited; it's better to skip a question than to copy.

Contextual Word Embeddings: BERT and Beyond

  • Overview of what will be covered:
    • Representing individual words as vectors.
    • Importance of considering the context in which words appear.
    • Neural Language Models (e.g., BERT, GPT) and their underlying techniques, including:
      • Self-Attention
      • Sub-word Tokenization
      • Transformers
      • Pre-training and Fine-tuning

Document Similarity and Word Similarity

  • Review of document similarity using Term Frequency (TF) vectors.
  • Example:
    • A: “I checked the time on my watch.” → {check: 1, time: 1, watch: 1}
    • B: “I checked the time on my clock.” → {check: 1, time: 1, clock: 1}
    • C: “I saw an elephant at the zoo.” → {saw: 1, elephant: 1, zoo: 1}
    • sim(A, B) > sim(A, C)
  • Limitation: TF vectors do not capture the similarity between individual words.
  • It would be helpful if we could also tell if individual words were similar…
  • sim(watch, clock) > sim(watch, elephant)

Word Vectors (Embeddings)

  • Representing words as vectors.
  • Exercise: Deducing the meaning of "tarant" from context.
  • Example text: "Alice stepped into the furniture store… searching for the perfect tarant… a small, cozy tarant… Its deep blue fabric…"

Distributional Word Vectors

  • John Rupert Firth: "You shall know a word by the company it keeps."
  • Strategy: Represent each word by the context it appears in.
  • Example:
    • watch = {video: 3168, apple: 1702, time: 868, …}
    • clock = {time: 2614, hour: 806, alarm: 438, …}
    • elephant = {baby: 108, species: 100, ears: 97, …}
  • Sliding context window to capture surrounding words.
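
A minimal sketch of this sliding-window counting (the toy corpus, the window size of 2 and the resulting counts are illustrative only, not the source of the numbers above):

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words appearing within +/- `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

corpus = [
    "i checked the time on my watch".split(),
    "i checked the time on my clock".split(),
    "the alarm clock showed the wrong time".split(),
]
print(cooccurrence_vectors(corpus)["clock"])
# Counter({'the': 2, 'on': 1, 'my': 1, 'alarm': 1, 'showed': 1})
```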

Sparse Vectors and IBM Model

  • Representing each word as a sparse vector of counts of the words appearing in its context windows.
  • This approach is referred to as the IBM Model.
  • Link to the original paper: http://aclweb.org/anthology/J/J92/J92-4003.pdf
  • We can compare these vectors using cosine similarity:
    • sim(watch, clock) > sim(watch, elephant)
  • Improvements can be made using techniques like TF-IDF.
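
As a sketch of the comparison step, cosine similarity can be computed directly on the sparse {word: count} dictionaries; the counts below are just the few entries shown on the slide, so the exact similarity values are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {word: count} dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

watch    = {"video": 3168, "apple": 1702, "time": 868}
clock    = {"time": 2614, "hour": 806, "alarm": 438}
elephant = {"baby": 108, "species": 100, "ears": 97}

assert cosine(watch, clock) > cosine(watch, elephant)
```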

Problems with Sparse Vectors

  • Vectors are generally very large, making them memory- and compute-intensive.
  • Example vector sizes:
    • |watch| = 21,916
    • |clock| = 9,093
    • |elephant| = 5,477
    • (where |word| represents the number of non-zero values of word’s vector)

Dense Vectors and Dimensionality Reduction

  • "Compressing" sparse vectors into dense vectors.
  • Using dimensionality reduction techniques like Truncated Singular Value Decomposition (SVD).
  • Matrix factorization approaches: turning a matrix into a product of multiple matrices.
  • Example:
    • [ |V| × |V| ] ≈ [ |V| × n ] [ n × n ] [ n × |V| ]
    • our sparse vector matrix ≈ dense word vectors (left singular vectors) × diagonal matrix of singular values × right singular vectors

Truncated Singular Value Decomposition (SVD)

  • Formula: truncated SVD is a matrix factorization that keeps only the top n singular values and their singular vectors.
  • Example:
    • {video: 3168, apple: 1702, time: 868, …} becomes
    • [0.6, 0.3, 0.1, 0.9, 0.2]
  • In practice, n is usually in the hundreds or low thousands.
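
A hedged sketch of this step with scikit-learn's TruncatedSVD (one of the libraries used in the labs); the count matrix here is a random stand-in for a real word-context matrix, and n = 100 is just an example size:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy word-by-context count matrix (rows = words, columns = context words).
# A real one would be |V| x |V| and very sparse.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(1000, 1000)).astype(float)

svd = TruncatedSVD(n_components=100)     # n is usually in the hundreds / low thousands
dense = svd.fit_transform(counts)        # each row is now a dense word vector

print(counts.shape, "->", dense.shape)   # (1000, 1000) -> (1000, 100)
```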

Static Word Embeddings

  • These are often called static word embeddings / vectors.
  • Example:
    • watch [0.6, 0.3, 0.1, 0.9, 0.2]
    • clock [0.5, 0.3, 0.2, 0.9, 0.1]
    • elephant [0.1, 0.9, 0.9, 0.2, 0.3]
  • Maintain useful properties of the sparse vectors, such as:
    • sim(watch, clock) > sim(watch, elephant)
  • Various other techniques to construct static word embeddings:
    • Word2Vec
    • GloVe
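
Using the toy 5-dimensional vectors above, we can check that the similarity ordering carries over to the dense representation (a sketch; real static embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

watch    = np.array([0.6, 0.3, 0.1, 0.9, 0.2])
clock    = np.array([0.5, 0.3, 0.2, 0.9, 0.1])
elephant = np.array([0.1, 0.9, 0.9, 0.2, 0.3])

assert cosine(watch, clock) > cosine(watch, elephant)   # ~0.99 vs ~0.43
```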

Advantages and Disadvantages of Static Word Embeddings

  • Advantage:
    • Handles synonymy – when two words have similar/identical meanings.
      • Similar words will have similar embeddings; dissimilar words will have dissimilar embeddings.
      • Based on the assumption that synonymous words will appear in similar contexts across a corpus.
  • Disadvantage:
    • Doesn’t handle polysemy – when one word has multiple meanings.
      • Static word embeddings always map the same word to the same embedding.
      • Embedding is the weighted average across all contexts (which can be multiple meanings).
      • Less frequent meanings are under-represented.
    • Example: vector(“match”) is effectively a weighted average of vectors for its different senses (sporting event + fire-starting device + pairing + …), weighted by how often each sense occurs in the corpus
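
A tiny numerical illustration of this averaging effect (the sense vectors and the 90/10 frequency split are invented purely for illustration):

```python
import numpy as np

# Hypothetical sense vectors for "match" (invented for illustration only).
sport_sense = np.array([0.9, 0.1, 0.0])   # "they won the match"
fire_sense  = np.array([0.0, 0.1, 0.9])   # "I lit the match"

# If 90% of corpus occurrences use the sport sense, the static vector
# ends up close to it and the fire sense is under-represented.
static_match = 0.9 * sport_sense + 0.1 * fire_sense
print(static_match)   # ≈ [0.81, 0.10, 0.09]
```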

Integrating Context into Vectors: Contextual Word Vectors

  • Shift from vector(word) to vector(word|context).
  • Using deep learning to accomplish this.
  • Example:
    • vector(“match” | “I lit the _”) ≈ vector(“match” | “The _ burned”)
    • vector(“match” | “They won the _”) should not be similar to either of the above
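
A hedged sketch of extracting such context-dependent vectors with the Hugging Face transformers library used in the labs; the choice of bert-base-uncased and the simple token-index lookup are assumptions made for illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_of(word, sentence):
    """Contextual vector of `word` in `sentence` (assumes `word` is a single WordPiece)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

cos = torch.nn.functional.cosine_similarity
fire1 = vector_of("match", "I lit the match.")
fire2 = vector_of("match", "The match burned.")
sport = vector_of("match", "They won the match.")
print(cos(fire1, fire2, dim=0), cos(fire1, sport, dim=0))        # expect the first to be larger
```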

Early Work on Context Vectors: ELMo

  • ELMo: Embeddings from Language Models
  • Used (now older) deep learning methods for a neural language model.
  • Found that the internal representations in the neural model were good context vectors.
  • Context vectors were useful for other language tasks like classification and document similarity.
  • https://aclanthology.org/N18-1202.pdf

Big Innovations in Contextual Word Vectors

  • Self-Attention – a neural network structure that builds a new word representation based on its context.
  • Subword-tokenization – limits the size of the vocabulary, allowing the networks to learn more robust representations.
  • Transformers – a neural network structure that combines multiple self-attention blocks and allows text encoding/generation.
  • Language Model Pre-Training – a technique for training neural networks that can be applied to a variety of other tasks.

Self-Attention: Determining Word Importance

  • Example: "The match burns brightly."
  • Definitions of match:
    1. (noun) A competitive sporting event.
    2. (noun) A device made of wood or paper that ignites with friction.
    3. (verb) To agree with; to be equal to.
    4. (noun) A pair of items or entities with mutually suitable characteristics.
  • Question: Which words help determine that the meaning of "match" is (2)?

Relevance of Words in Context

  • Some words are more important for determining the meaning of a word.
    • Words just before can indicate the part-of-speech.
    • Words across the sentence can identify the topic.
    • Many words are filler and not useful for distinguishing meaning.

Creating a Contextualized Vector for "match"

  • Goal: Create a function that adds context to word vectors.
  • Inputs: Word vectors without context.
  • The “context vector maker” takes the context-free vectors of the words in “The match burns brightly” and outputs a contextualized vector for “match”.

Adding Context to Word Vectors

  • The inputs will be the word vectors without context - no other outside knowledge
  • Idea:
    • Context vectors will be some combination of the ‘match’ vector with the other word vectors in the sentence.
    • But some words are more important than others so we need relevance weights

Weighting the Importance of Words

  • Need a function that tells how much attention to give based on relevance
    relevance(‘the’ | ‘match’) = G(vector(‘the’), vector(‘match’)) = 12.1
    relevance(‘burns’ | ‘match’) = G(vector(‘burns’), vector(‘match’)) = 89.3

  • Calculate the relevance of one word (e.g. ‘burns’) for the context of another word (e.g. ‘match’)

  • Inputs are the word vectors without context

  • How to calculate the relevance from the word vectors?

    • Could you use similarity? No: ‘match’ and ‘burns’ are not very similar
    • Will need to do some transformations to the word vectors

Relevance for Attention

  • How important is ‘burns’ to understand the word ‘match’?
  • relevance(‘burns’ | ‘match’) = query(‘match’) · key(‘burns’) = 89.3
    • Inputs are the vectors for ‘match’ and ‘burns’
    • Use two matrices WQ and WK
      • These matrices are learnt during training
    • Matrix multiply the input word vectors to get a query vector and a key vector.
    • Then dot-product these to get the relevance
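
A small numpy sketch of this query/key relevance computation; the word vectors and the W_Q, W_K matrices below are random stand-ins for parameters that would be learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                               # toy embedding size

v_match = rng.normal(size=d)        # context-free word vectors (stand-ins)
v_burns = rng.normal(size=d)

W_Q = rng.normal(size=(d, d))       # learned during training in a real model
W_K = rng.normal(size=(d, d))

query = W_Q @ v_match               # query for the word being contextualized
key   = W_K @ v_burns               # key for the word being attended to

relevance = query @ key             # raw (un-softmaxed) relevance score
print(relevance)
```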

Softmax Function for Relevance Scores

  • Relevance scores need to add up to 1.
  • Raw relevance scores may not be nicely between 0 and 1.
  • Use the softmax function to ensure scores add up to 1.
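
For reference, a standard numerically stable softmax; the raw scores below are illustrative relevance values like those used earlier:

```python
import numpy as np

def softmax(scores):
    """Map raw relevance scores to positive weights that sum to 1."""
    shifted = scores - np.max(scores)      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

weights = softmax(np.array([12.1, 89.3, 3.0, 0.5]))
print(weights, weights.sum())              # the weights sum to 1.0
```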

Relevance Scores Used to Weigh Transformed Word Vectors

  • Multiply each input vector by matrix W_V to get their value vectors.
  • Matrix W_V is also learned during the training process.
  • Add them up using the softmaxed relevance scores as weights

The Self-Attention Equation

  • Using matrices:
    • Q = Input vectors multiplied by W_Q = queries
    • K = Input vectors multiplied by W_K = keys
    • V = Input vectors multiplied by W_V = values
  • Work with matrices instead of individual vectors:
    • Attention(Q, K, V) = softmax(QKᵀ / sqrt(d_k)) V
  • This equation encapsulates all the previous steps in one go, e.g. QKᵀ calculates the relevance scores
  • Self-attention adds one extra step:
    • Divide by sqrt(d_k), where d_k is the size of the embeddings
    • This keeps the dot products in a sensible range, which helps with back-propagation through the network
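
Putting the pieces together, a minimal single-head numpy implementation of the equation above (no masking and no multiple heads; the inputs and weight matrices are random stand-ins for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # relevance of every token to every other token
    return softmax(scores, axis=-1) @ V    # weighted combination of the value vectors

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))                # stand-ins for the 4 vectors of "The match burns brightly"
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)   # (4, 8): one contextual vector per token
```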

Learning Embeddings and Weight Matrices

  • Attention relies on multiple matrices: W_Q, W_K and W_V
  • Each token also has a word vector to use as input
  • Where do these come from?
    • They are all learned during the extensive training process

Self-Attention Terminology

  • Self-attention is so named because the model is paying attention to the same text as the one it is processing.
  • The queries, keys and values are all from the same text.
  • Attention can also be applied between texts (e.g. an English text and a Spanish text).

Self-Attention Summary

  • Self-attention allows a language model to weight which other tokens are important when interpreting a token
  • It uses an equation that transforms the input vectors into queries, keys and values
  • The queries and keys are used to calculate the relevance scores of other tokens to the token of interest
  • The softmaxed relevance scores are used as weights to combine the value vectors
  • All of the weight matrices and word embedding vectors are learnt during training

Subword Tokenization: Addressing New Words

  • The problem with new words:
    • Actually new words.
    • Misspellings.
    • Words that weren’t in the training set.
  • Language models have a hard time with new words
    • They know nothing about them
    • Have to treat them as OOV - out of vocabulary
    • No learned embeddings, and a probability of zero of them ever occurring
    • We learnt about smoothing earlier as one workaround
    • Example of an unseen word: “I think I’m going to take a staycation next month”
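
The subword splitting described next handles exactly this case. As a hedged illustration using the WordPiece tokenizer from the transformers library used in the labs (the exact split depends on the tokenizer's learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("I think I'm going to take a staycation next month"))
# The rare word is split into pieces the model has seen before,
# e.g. something like ['stay', '##cation'], so nothing is out of vocabulary.
```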

Subwords to Handle New Words

  • Core idea: Split uncommon words into 2 or more parts (potentially syllables).
  • Will depend on the language and type of text (e.g. tweets versus science).
  • Why does this help?
    • Much more likely to have seen sub