Contextual Word Embeddings: BERT and Beyond
Coursework and Generative AI Policy
- Coursework #1 is due on March 13th at 4:30 PM and is worth 20% of the final grade.
- Coursework #2 (for MSci students) involves a paper review and is worth 10% of the final grade.
- Tuesday lab times can be used for catching up on labs or working on the coursework.
- All labs and courseworks are examinable.
- Code from labs (including scikit-learn/transformers) can be used.
- Generative AI can be used during assessed exercises, but its usage must be noted in the final question.
- Material from the coursework may appear on the final exam.
- Plagiarism is strictly prohibited; it's better to skip a question than to copy.
Contextual Word Embeddings: BERT and Beyond
- Overview of what will be covered:
- Representing individual words as vectors.
- Importance of considering the context in which words appear.
- Neural Language Models (e.g., BERT, GPT) and their underlying techniques, including:
- Self-Attention
- Sub-word Tokenization
- Transformers
- Pre-training and Fine-tuning
Document Similarity and Word Similarity
- Review of document similarity using Term Frequency (TF) vectors.
- Example:
- A: “I checked the time on my watch.” → {check: 1, time: 1, watch: 1}
- B: “I checked the time on my clock.” → {check: 1, time: 1, clock: 1}
- C: “I saw an elephant at the zoo.” → {saw: 1, elephant: 1, zoo: 1}
- sim(A, B) > sim(A, C)
- Limitation: TF vectors do not capture the similarity between individual words.
- It would be helpful if we could also tell if individual words were similar…
- sim(watch, clock) > sim(watch, elephant)
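A minimal sketch (not from the lecture) of the comparison above, computing cosine similarity between the TF dictionaries; the bag-of-words and stop-word handling are simplified assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

A = {"check": 1, "time": 1, "watch": 1}
B = {"check": 1, "time": 1, "clock": 1}
C = {"saw": 1, "elephant": 1, "zoo": 1}

print(cosine(A, B))  # 2/3 ≈ 0.67 -- two shared terms
print(cosine(A, C))  # 0.0        -- no shared terms
# sim(A, B) > sim(A, C), but sim(watch, clock) is invisible to TF vectors.
```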
Word Vectors (Embeddings)
- Representing words as vectors.
- Exercise: Deducing the meaning of "tarant" from context.
- Example text: "Alice stepped into the furniture store… searching for the perfect tarant… a small, cozy tarant… Its deep blue fabric…"
Distributional Word Vectors
- John Rupert Firth: "You shall know a word by the company it keeps."
- Strategy: Represent each word by the context it appears in.
- Example:
- watch = {video: 3168, apple: 1702, time: 868, …}
- clock = {time: 2614, hour: 806, alarm: 438, …}
- elephant = {baby: 108, species: 100, ears: 97, …}
- Sliding context window to capture surrounding words.
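A hedged sketch of building such context-count representations with a sliding window; the toy corpus and the window size of 2 are my own choices for illustration.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count, for every word, the words appearing within +/- `window` positions."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

corpus = [
    "i checked the time on my watch".split(),
    "the alarm clock showed the time".split(),
]
counts = cooccurrence_counts(corpus, window=2)
print(dict(counts["watch"]))  # {'on': 1, 'my': 1}
print(dict(counts["clock"]))  # {'the': 2, 'alarm': 1, 'showed': 1}
```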
Sparse Vectors and IBM Model
- Representing each word as a sparse vector of a context window.
- This approach is referred to as the IBM Model.
- Link to the original paper: http://aclweb.org/anthology/J/J92/J92-4003.pdf
- We can compare these vectors using cosine similarity:
- sim(watch, clock) > sim(watch, elephant)
- Improvements can be made using techniques like TF-IDF.
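To connect this to the labs' tooling, here is a sketch using scikit-learn to build the sparse context vectors, optionally re-weight them with TF-IDF, and compare them by cosine similarity; the count dictionaries are truncated toy versions of the slide's examples.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Toy context-count dictionaries in the spirit of the slide's examples.
contexts = {
    "watch":    {"video": 3168, "apple": 1702, "time": 868},
    "clock":    {"time": 2614, "hour": 806, "alarm": 438},
    "elephant": {"baby": 108, "species": 100, "ears": 97},
}

words = list(contexts)
X = DictVectorizer().fit_transform([contexts[w] for w in words])  # sparse matrix
X_tfidf = TfidfTransformer().fit_transform(X)                     # optional re-weighting

sims = cosine_similarity(X_tfidf)
print(dict(zip(words, sims[words.index("watch")])))
# Expect sim(watch, clock) > sim(watch, elephant): 'watch' and 'clock' share the 'time' context.
```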
Problems with Sparse Vectors
- Vectors are generally very large, making them memory- and compute-intensive.
- Example vector sizes:
- |watch| = 21,916
- |clock| = 9,093
- |elephant| = 5,477
- (where |word| represents the number of non-zero values of word’s vector)
Dense Vectors and Dimensionality Reduction
- "Compressing" sparse vectors into dense vectors.
- Using dimensionality reduction techniques like Truncated Singular Value Decomposition (SVD).
- Matrix factorization approaches: turning a matrix into a product of multiple matrices.
- Example:
- (|V| \times |V|) \approx (|V| \times n) (n \times n) (n \times |V|)
- our sparse vector matrix ≈ dense vectors (left singular vectors) × (diagonal matrix of singular values) × (right singular vectors)
Truncated Singular Value Decomposition (SVD)
- Formula (matrix factorization): M \approx U_n \Sigma_n V_n^T, keeping only the n largest singular values; the rows of U_n are the dense word vectors.
- Example:
- {video: 3168, apple: 1702, time: 868, …} becomes
- [0.6, 0.3, 0.1, 0.9, 0.2]
- In practice, n is usually in the hundreds or low thousands.
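A sketch of this compression step using scikit-learn's TruncatedSVD; the toy co-occurrence matrix and the choice of 2 components are illustrative assumptions (a real vocabulary would use the hundreds to low thousands mentioned above).

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy |V| x |V| co-occurrence matrix (rows/columns share the same tiny vocabulary).
vocab = ["watch", "clock", "elephant", "time", "video", "alarm", "baby"]
counts = np.array([
    [0,       0,   0,  868, 3168,   0,   0],  # watch
    [0,       0,   0, 2614,    0, 438,   0],  # clock
    [0,       0,   0,    0,    0,   0, 108],  # elephant
    [868,  2614,   0,    0,    0,   0,   0],  # time
    [3168,    0,   0,    0,    0,   0,   0],  # video
    [0,     438,   0,    0,    0,   0,   0],  # alarm
    [0,       0, 108,    0,    0,   0,   0],  # baby
])

# n_components=2 only makes sense for this tiny matrix.
svd = TruncatedSVD(n_components=2, random_state=0)
dense = svd.fit_transform(counts)       # shape: |V| x 2, one dense vector per word

print(dense[vocab.index("watch")])
print(dense[vocab.index("clock")])
```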
Static Word Embeddings
- These are often called static word embeddings / vectors.
- Example:
- watch [0.6, 0.3, 0.1, 0.9, 0.2]
- clock [0.5, 0.3, 0.2, 0.9, 0.1]
- elephant [0.1, 0.9, 0.9, 0.2, 0.3]
- Maintain useful properties of the sparse vectors, such as:
- sim(watch, clock) > sim(watch, elephant)
- Various other techniques to construct static word embeddings:
- Word2Vec
- GloVe
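As a quick numeric check (my own arithmetic, not from the slides) that the example embeddings above preserve the similarity ordering:

```python
import numpy as np

watch    = np.array([0.6, 0.3, 0.1, 0.9, 0.2])
clock    = np.array([0.5, 0.3, 0.2, 0.9, 0.1])
elephant = np.array([0.1, 0.9, 0.9, 0.2, 0.3])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(watch, clock))     # ~0.99
print(cosine(watch, elephant))  # ~0.43  -> sim(watch, clock) > sim(watch, elephant)
```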
Advantages and Disadvantages of Static Word Embeddings
- Advantage:
- Handles synonymy – when two words have similar/identical meanings.
- Similar words will have similar embeddings; dissimilar words will have dissimilar embeddings.
- Based on the assumption that synonymous words will appear in similar contexts across a corpus.
- Disadvantage:
- Doesn’t handle polysemy – when one word has multiple meanings.
- Static word embeddings always map the same word to the same embedding.
- The embedding is a weighted average across all of a word’s contexts, which may span multiple meanings.
- Less frequent meanings are under-represented.
- Example: vector(“match”) ends up as a blend of its different senses (sporting event + fire-starting device + pairing + …), regardless of the context it appears in.
Integrating Context into Vectors: Contextual Word Vectors
- Shift from vector(word) to vector(word|context).
- Using deep learning to accomplish this.
- Example:
- vector(“match”|“I lit the _ ”) and vector(“match”|“The _ burned”) should be similar to each other.
- vector(“match”|“They won the _ ”) should not be similar to the other two.
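One way to obtain such context-dependent vectors in practice is with the Hugging Face transformers library used in the labs; the sketch below is illustrative only (the choice of bert-base-uncased and of the last hidden layer as the context vector are my assumptions, not what the slides prescribe).

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_of(word, sentence):
    """Contextual vector for `word` inside `sentence` (last hidden layer)."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]                 # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(word)]                              # vector at the word's position

a = vector_of("match", "i lit the match")
b = vector_of("match", "the match burned")
c = vector_of("match", "they won the match")

cos = torch.nn.functional.cosine_similarity
print(cos(a, b, dim=0))  # fire senses: typically relatively high
print(cos(a, c, dim=0))  # fire vs. sport sense: typically lower
```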
Early Work on Context Vectors: ELMo
- ELMo: Embeddings from Language Models
- Used (now older) deep learning methods for a neural language model.
- Found that the internal representations in the neural model were good context vectors.
- Context vectors were useful for other language tasks like classification and document similarity.
- https://aclanthology.org/N18-1202.pdf
Big Innovations in Contextual Word Vectors
- Self-Attention – a neural network structure that builds a new word representation based on its context.
- Subword-tokenization – limits the size of the vocabulary, allowing the networks to learn more robust representations.
- Transformers – a neural network structure that combines multiple self-attention blocks and allows text encoding/generation.
- Language Model Pre-Training – a technique for training neural networks that can be applied to a variety of other tasks.
Self-Attention: Determining Word Importance
- Example: "The match burns brightly."
- Definitions of match:
- (noun) A competitive sporting event.
- (noun) A device made of wood or paper that ignites with friction.
- (verb) To agree with; to be equal to.
- (noun) A pair of items or entities with mutually suitable characteristics.
- Question: Which words help determine that the meaning of "match" is (2)?
Relevance of Words in Context
- Some words are more important for determining the meaning of a word.
- Words just before can indicate the part-of-speech.
- Words across the sentence can identify the topic.
- Many words are filler and not useful for distinguishing meaning.
Creating a Contextualized Vector for "match"
- Goal: Create a function that adds context to word vectors.
- Inputs: Word vectors without context.
- A “context vector maker” function takes these and produces a contextualised vector for each word.
- Example input sentence: “The match burns brightly”
Adding Context to Word Vectors
- The inputs will be the word vectors without context - no other outside knowledge
- Idea:
- Context vectors will be some combination of the ‘match’ vector with the other word vectors in the sentence.
- But some words are more important than others, so we need relevance weights
Weighting the Importance of Words
- Need a function that tells how much attention to give, based on relevance
- relevance(‘the’ | ‘match’) = G(0.1 0.5 0.2) = 12.1
- relevance(‘burns’ | ‘match’) = G(0.4 0.1 0.8) = 89.3
- Goal: calculate the relevance of one word (e.g. ‘burns’) for the context of another word (e.g. ‘match’)
- Inputs are the word vectors without context
- How do we calculate the relevance from the word vectors?
- Could we use similarity? No: ‘match’ and ‘burns’ are not very similar
- We will need to apply some transformations to the word vectors
Relevance for Attention
- How important is ‘burns’ to understand the word ‘match’?
- relevance(‘burns’ | ‘match’) = query(‘match’) · key(‘burns’) = 89.3
- Inputs are the word vectors for ‘match’ and ‘burns’
- Use two matrices, W_Q and W_K
- These matrices are learnt during training
- Matrix multiply the input word vectors to get a query vector and a key vector.
- Then dot-product these to get the relevance
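A minimal numpy sketch of this query/key construction; the embedding size, the word vectors, and the W_Q/W_K values are all made up here (in a real model they are learned).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # toy embedding size

# Context-free word vectors (made-up stand-ins)
match = rng.normal(size=d)
burns = rng.normal(size=d)

# Projection matrices (learned in a real model; random stand-ins here)
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

query = W_Q @ match                     # query vector for the word of interest
key   = W_K @ burns                     # key vector for the context word

relevance = query @ key                 # raw (unnormalised) relevance score
print(relevance)
```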
Softmax Function for Relevance Scores
- Relevance scores need to add up to 1.
- Raw relevance scores may not be nicely between 0 and 1.
- Use the softmax function to ensure scores add up to 1.
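For reference, a small numpy version of this normalisation step (the raw scores are made up):

```python
import numpy as np

def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    exps = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([12.1, 89.3, 3.5])))  # weights sum to 1; the largest score dominates
```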
Relevance Scores Used to Weigh Transformed Word Vectors
- Multiply each input vector by matrix W_V to get their value vectors.
- Matrix W_V is also learned during the training process.
- Add them up using the softmaxed relevance scores as weights
The Self-Attention Equation
- Using matrices:
- Q = Input vectors multiplied by W_Q = queries
- K = Input vectors multiplied by W_K = keys
- V = Input vectors multiplied by W_V = values
- Work with matrices instead of individual vectors
- The full equation: \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
- This equation encapsulates all the previous steps in one go, e.g. QK^T calculates the relevance scores
- Self-attention adds one extra step:
- Divide by \sqrt{d_k}, where d_k is the size of the embeddings
- This helps with back-propagation through the network
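Putting the pieces together, here is a compact numpy sketch of scaled dot-product self-attention for a single attention head; the shapes and weight matrices are toy stand-ins for the learned parameters.

```python
import numpy as np

def softmax(scores, axis=-1):
    exps = np.exp(scores - scores.max(axis=axis, keepdims=True))
    return exps / exps.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over one sentence.

    X: (num_tokens, d) matrix of context-free input vectors.
    Returns a (num_tokens, d) matrix of contextualised vectors.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (num_tokens, num_tokens) relevance weights
    return weights @ V                          # weighted sums of the value vectors

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))                     # e.g. "The match burns brightly" -> 4 token vectors
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)   # (4, 8)
```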
Learning Embeddings and Weight Matrices
- Attention relies on multiple matrices: W_Q, W_K and W_V
- Each token also has a word vector to use as input
- Where do these come from?
- They are all learned during the extensive training process
Self-Attention Terminology
- Self-attention is so called because the model pays attention to the same text that it is currently processing.
- The queries, keys and values are all from the same text.
- Attention can also be applied between texts (e.g. an English text and a Spanish text).
Self-Attention Summary
- Self-attention allows a language model to weight which other tokens are important when interpreting a token
- It uses an equation that transforms the input vectors into queries, keys and values
- The queries and keys are used to calculate the relevance scores of other tokens to the token of interest
- The softmaxed relevance scores are used as weights to combine the value vectors
- All of the weight matrices and word embedding vectors are learnt during training
Subword Tokenization: Addressing New Words
- The problem with new words:
- Actually new words.
- Misspellings.
- Words that weren’t in the training set.
- Language models have a hard time with new words
- They know nothing about them
- Have to treat them as OOV - out of vocabulary
- No learned embeddings, and a probability of zero of them ever occurring
- We learnt about smoothing before (one way to deal with zero probabilities)
- Example: “I think I’m going to take a staycation next month”
Subwords to Handle New Words
- Core idea: Split uncommon words into 2 or more parts (potentially syllables).
- Will depend on the language and type of text (e.g. tweets versus science).
- Why does this help?
- Much more likely to have seen the subword parts in training, even if the whole word is new
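A quick illustration using the transformers tokenizer from the labs; bert-base-uncased is an assumed choice, and the exact pieces shown in the comments are indicative rather than guaranteed.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A common word stays whole; an uncommon word is split into known subword pieces.
print(tokenizer.tokenize("watch"))       # ['watch']
print(tokenizer.tokenize("staycation"))  # split into pieces, e.g. ['stay', '##cation'] or similar
```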