
Contextual Word Embeddings: BERT and Beyond

Coursework and Generative AI Policy

  • Coursework #1 is due on March 13th at 4:30 PM and is worth 20% of the final grade.
  • Coursework #2 (for MSci students) involves a paper review and is worth 10% of the final grade.
  • Tuesday lab times can be used for catching up on labs or working on the coursework.
  • All labs and courseworks are examinable.
  • Code from labs (including scikit-learn/transformers) can be used.
  • Generative AI can be used during assessed exercises, but its usage must be noted in the final question.
  • Material from the coursework may appear on the final exam.
  • Plagiarism is strictly prohibited; it's better to skip a question than to copy.

Contextual Word Embeddings: BERT and Beyond

  • Overview of what will be covered:
    • Representing individual words as vectors.
    • Importance of considering the context in which words appear.
    • Neural Language Models (e.g., BERT, GPT) and their underlying techniques, including:
      • Self-Attention
      • Sub-word Tokenization
      • Transformers
      • Pre-training and Fine-tuning

Document Similarity and Word Similarity

  • Review of document similarity using Term Frequency (TF) vectors.
  • Example:
    • A: “I checked the time on my watch.” → {check: 1, time: 1, watch: 1}
    • B: “I checked the time on my clock.” → {check: 1, time: 1, clock: 1}
    • C: “I saw an elephant at the zoo.” → {saw: 1, elephant: 1, zoo: 1}
    • sim(A, B) > sim(A, C)
  • Limitation: TF vectors do not capture the similarity between individual words.
  • It would be helpful if we could also tell if individual words were similar…
  • sim(watch, clock) > sim(watch, elephant)

Word Vectors (Embeddings)

  • Representing words as vectors.
  • Exercise: Deducing the meaning of "tarant" from context.
  • Example text: "Alice stepped into the furniture store… searching for the perfect tarant… a small, cozy tarant… Its deep blue fabric…"

Distributional Word Vectors

  • John Rupert Firth: "You shall know a word by the company it keeps."
  • Strategy: Represent each word by the context it appears in.
  • Example:
    • watch = {video: 3168, apple: 1702, time: 868, …}
    • clock = {time: 2614, hour: 806, alarm: 438, …}
    • elephant = {baby: 108, species: 100, ears: 97, …}
  • Sliding context window to capture surrounding words.
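
A minimal sketch of this sliding-window counting (the toy corpus, the window size of 2 and the resulting counts are illustrative only, not the source of the numbers above):

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words appearing within +/- `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

corpus = [
    "i checked the time on my watch".split(),
    "i checked the time on my clock".split(),
    "the alarm clock showed the wrong time".split(),
]
print(cooccurrence_vectors(corpus)["clock"])
# Counter({'the': 2, 'on': 1, 'my': 1, 'alarm': 1, 'showed': 1})
```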

Sparse Vectors and IBM Model

  • Representing each word as a sparse vector of counts of the words appearing in its context windows.
  • This approach is referred to as the IBM Model.
  • Link to the original paper: http://aclweb.org/anthology/J/J92/J92-4003.pdf
  • We can compare these vectors using cosine similarity:
    • sim(watch, clock) > sim(watch, elephant)
  • Improvements can be made using techniques like TF-IDF.
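
As a sketch of the comparison step, cosine similarity can be computed directly on the sparse {word: count} dictionaries; the counts below are just the few entries shown on the slide, so the exact similarity values are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {word: count} dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)

watch    = {"video": 3168, "apple": 1702, "time": 868}
clock    = {"time": 2614, "hour": 806, "alarm": 438}
elephant = {"baby": 108, "species": 100, "ears": 97}

assert cosine(watch, clock) > cosine(watch, elephant)
```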

Problems with Sparse Vectors

  • Vectors are generally very large, making them memory- and compute-intensive.
  • Example vector sizes:
    • |watch| = 21,916
    • |clock| = 9,093
    • |elephant| = 5,477
    • (where |word| represents the number of non-zero values of word’s vector)

Dense Vectors and Dimensionality Reduction

  • "Compressing" sparse vectors into dense vectors.
  • Using dimensionality reduction techniques like Truncated Singular Value Decomposition (SVD).
  • Matrix factorization approaches: turning a matrix into a product of multiple matrices.
  • Example:
    • [ |V| × |V| ] ≈ [ |V| × n ] [ n × n ] [ n × |V| ]
    • our sparse vector matrix ≈ dense word vectors (left singular vectors) × diagonal matrix of singular values × right singular vectors

Truncated Singular Value Decomposition (SVD)

  • Formula: truncated SVD is a matrix factorization that keeps only the top n singular values and their singular vectors.
  • Example:
    • {video: 3168, apple: 1702, time: 868, …} becomes
    • [0.6, 0.3, 0.1, 0.9, 0.2]
  • In practice, n is usually in the hundreds or low thousands.
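
A hedged sketch of this step with scikit-learn's TruncatedSVD (one of the libraries used in the labs); the count matrix here is a random stand-in for a real word-context matrix, and n = 100 is just an example size:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy word-by-context count matrix (rows = words, columns = context words).
# A real one would be |V| x |V| and very sparse.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(1000, 1000)).astype(float)

svd = TruncatedSVD(n_components=100)     # n is usually in the hundreds / low thousands
dense = svd.fit_transform(counts)        # each row is now a dense word vector

print(counts.shape, "->", dense.shape)   # (1000, 1000) -> (1000, 100)
```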

Static Word Embeddings

  • These are often called static word embeddings / vectors.
  • Example:
    • watch [0.6, 0.3, 0.1, 0.9, 0.2]
    • clock [0.5, 0.3, 0.2, 0.9, 0.1]
    • elephant [0.1, 0.9, 0.9, 0.2, 0.3]
  • Maintain useful properties of the sparse vectors, such as:
    • sim(watch, clock) > sim(watch, elephant)
  • Various other techniques to construct static word embeddings:
    • Word2Vec
    • GloVe
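
Using the toy 5-dimensional vectors above, we can check that the similarity ordering carries over to the dense representation (a sketch; real static embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

watch    = np.array([0.6, 0.3, 0.1, 0.9, 0.2])
clock    = np.array([0.5, 0.3, 0.2, 0.9, 0.1])
elephant = np.array([0.1, 0.9, 0.9, 0.2, 0.3])

assert cosine(watch, clock) > cosine(watch, elephant)   # ~0.99 vs ~0.43
```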

Advantages and Disadvantages of Static Word Embeddings

  • Advantage:
    • Handles synonymy – when two words have similar/identical meanings.
      • Similar words will have similar embeddings; dissimilar words will have dissimilar embeddings.
      • Based on the assumption that synonymous words will appear in similar contexts across a corpus.
  • Disadvantage:
    • Doesn’t handle polysemy – when one word has multiple meanings.
      • Static word embeddings always map the same word to the same embedding.
      • Embedding is the weighted average across all contexts (which can be multiple meanings).
      • Less frequent meanings are under-represented.
    • Example: vector(“match”) is effectively a weighted average of vectors for its different senses (sporting event + fire-starting device + pairing + …), weighted by how often each sense occurs in the corpus
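
A tiny numerical illustration of this averaging effect (the sense vectors and the 90/10 frequency split are invented purely for illustration):

```python
import numpy as np

# Hypothetical sense vectors for "match" (invented for illustration only).
sport_sense = np.array([0.9, 0.1, 0.0])   # "they won the match"
fire_sense  = np.array([0.0, 0.1, 0.9])   # "I lit the match"

# If 90% of corpus occurrences use the sport sense, the static vector
# ends up close to it and the fire sense is under-represented.
static_match = 0.9 * sport_sense + 0.1 * fire_sense
print(static_match)   # ≈ [0.81, 0.10, 0.09]
```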

Integrating Context into Vectors: Contextual Word Vectors

  • Shift from vector(word) to vector(word|context).
  • Using deep learning to accomplish this.
  • Example:
    • vector(“match” | “I lit the _”) ≈ vector(“match” | “The _ burned”)
    • vector(“match” | “They won the _”) should not be similar to either of the above
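
A hedged sketch of extracting such context-dependent vectors with the Hugging Face transformers library used in the labs; the choice of bert-base-uncased and the simple token-index lookup are assumptions made for illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_of(word, sentence):
    """Contextual vector of `word` in `sentence` (assumes `word` is a single WordPiece)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]            # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

cos = torch.nn.functional.cosine_similarity
fire1 = vector_of("match", "I lit the match.")
fire2 = vector_of("match", "The match burned.")
sport = vector_of("match", "They won the match.")
print(cos(fire1, fire2, dim=0), cos(fire1, sport, dim=0))        # expect the first to be larger
```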

Early Work on Context Vectors: ELMo

  • ELMo: Embeddings from Language Models
  • Used (now older) deep learning methods for a neural language model.
  • Found that the internal representations in the neural model were good context vectors.
  • Context vectors were useful for other language tasks like classification and document similarity.
  • https://aclanthology.org/N18-1202.pdf

Big Innovations in Contextual Word Vectors

  • Self-Attention – a neural network structure that builds a new word representation based on its context.
  • Subword-tokenization – limits the size of the vocabulary, allowing the networks to learn more robust representations.
  • Transformers – a neural network structure that combines multiple self-attention blocks and allows text encoding/generation.
  • Language Model Pre-Training – a technique for training neural networks that can be applied to a variety of other tasks.

Self-Attention: Determining Word Importance

  • Example: "The match burns brightly."
  • Definitions of match:
    1. (noun) A competitive sporting event.
    2. (noun) A device made of wood or paper that ignites with friction.
    3. (verb) To agree with; to be equal to.
    4. (noun) A pair of items or entities with mutually suitable characteristics.
  • Question: Which words help determine that the meaning of "match" is (2)?

Relevance of Words in Context

  • Some words are more important for determining the meaning of a word.
    • Words just before can indicate the part-of-speech.
    • Words across the sentence can identify the topic.
    • Many words are filler and not useful for distinguishing meaning.

Creating a Contextualized Vector for "match"

  • Goal: Create a function that adds context to word vectors.
  • Inputs: Word vectors without context.
  • The “context vector maker” takes the context-free vectors of the words in “The match burns brightly” and outputs a contextualized vector for “match”.

Adding Context to Word Vectors

  • The inputs will be the word vectors without context - no other outside knowledge
  • Idea:
    • Context vectors will be some combination of the ‘match’ vector with the other word vectors in the sentence.
    • But some words are more important than others so we need relevance weights

Weighting the Importance of Words

  • Need a function that tells how much attention to give based on relevance
    relevance(‘the’ | ‘match’) = G(vector(‘the’), vector(‘match’)) = 12.1
    relevance(‘burns’ | ‘match’) = G(vector(‘burns’), vector(‘match’)) = 89.3

  • Calculate the relevance of one word (e.g. ‘burns’) for the context of another word (e.g. ‘match’)

  • Inputs are the word vectors without context

  • How to calculate the relevance from the word vectors?

    • Could you use similarity? No: ‘match’ and ‘burns’ are not very similar
    • Will need to do some transformations to the word vectors

Relevance for Attention

  • How important is ‘burns’ to understand the word ‘match’?
  • relevance(‘burns’ | ‘match’) = query(‘match’) · key(‘burns’) = 89.3
    • Inputs are the vectors for ‘match’ and ‘burns’
    • Use two matrices WQ and WK
      • These matrices are learnt during training
    • Matrix multiply the input word vectors to get a query vector and a key vector.
    • Then dot-product these to get the relevance
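
A small numpy sketch of this query/key relevance computation; the word vectors and the W_Q, W_K matrices below are random stand-ins for parameters that would be learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                               # toy embedding size

v_match = rng.normal(size=d)        # context-free word vectors (stand-ins)
v_burns = rng.normal(size=d)

W_Q = rng.normal(size=(d, d))       # learned during training in a real model
W_K = rng.normal(size=(d, d))

query = W_Q @ v_match               # query for the word being contextualized
key   = W_K @ v_burns               # key for the word being attended to

relevance = query @ key             # raw (un-softmaxed) relevance score
print(relevance)
```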

Softmax Function for Relevance Scores

  • Relevance scores need to add up to 1.
  • Raw relevance scores may not be nicely between 0 and 1.
  • Use the softmax function to ensure scores add up to 1.
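
For reference, a standard numerically stable softmax; the raw scores below are illustrative relevance values like those used earlier:

```python
import numpy as np

def softmax(scores):
    """Map raw relevance scores to positive weights that sum to 1."""
    shifted = scores - np.max(scores)      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

weights = softmax(np.array([12.1, 89.3, 3.0, 0.5]))
print(weights, weights.sum())              # the weights sum to 1.0
```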

Relevance Scores Used to Weigh Transformed Word Vectors

  • Multiply each input vector by matrix W_V to get their value vectors.
  • Matrix W_V is also learned during the training process.
  • Add them up using the softmaxed relevance scores as weights

The Self-Attention Equation

  • Using matrices:
    • Q = Input vectors multiplied by W_Q = queries
    • K = Input vectors multiplied by W_K = keys
    • V = Input vectors multiplied by W_V = values
  • Work with matrices instead of individual vectors:
    • Attention(Q, K, V) = softmax(QKᵀ / sqrt(d_k)) V
  • This equation encapsulates all the previous steps in one go, e.g. QKᵀ calculates the relevance scores
  • Self-attention adds one extra step:
    • Divide by sqrt(d_k), where d_k is the size of the embeddings
    • This keeps the dot products in a sensible range, which helps with back-propagation through the network
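
Putting the pieces together, a minimal single-head numpy implementation of the equation above (no masking and no multiple heads; the inputs and weight matrices are random stand-ins for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # relevance of every token to every other token
    return softmax(scores, axis=-1) @ V    # weighted combination of the value vectors

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))                # stand-ins for the 4 vectors of "The match burns brightly"
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)   # (4, 8): one contextual vector per token
```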

Learning Embeddings and Weight Matrices

  • Attention relies on multiple matrices: W_Q, W_K and W_V
  • Each token also has a word vector to use as input
  • Where do these come from?
    • They are all learned during the extensive training process

Self-Attention Terminology

  • Self-attention is so named because the model is paying attention to the same text as the one it is processing.
  • The queries, keys and values are all from the same text.
  • Attention can also be applied between texts (e.g. an English text and a Spanish text).

Self-Attention Summary

  • Self-attention allows a language model to weight which other tokens are important when interpreting a token
  • It uses an equation that transforms the input vectors into queries, keys and values
  • The queries and keys are used to calculate the relevance scores of other tokens to the token of interest
  • The softmaxed relevance scores are used as weights to combine the value vectors
  • All of the weight matrices and word embedding vectors are learnt during training

Subword Tokenization: Addressing New Words

  • The problem with new words:
    • Actually new words.
    • Misspellings.
    • Words that weren’t in the training set.
  • Language models have a hard time with new words
    • They know nothing about them
    • Have to treat them as OOV - out of vocabulary
    • No learned embeddings, and a probability of zero of them ever occurring
    • We learnt about smoothing earlier as one workaround
    • Example of an unseen word: “I think I’m going to take a staycation next month”
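
The subword splitting described next handles exactly this case. As a hedged illustration using the WordPiece tokenizer from the transformers library used in the labs (the exact split depends on the tokenizer's learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("I think I'm going to take a staycation next month"))
# The rare word is split into pieces the model has seen before,
# e.g. something like ['stay', '##cation'], so nothing is out of vocabulary.
```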

Subwords to Handle New Words

  • Core idea: Split uncommon words into 2 or more parts (potentially syllables).
  • Will depend on the language and type of text (e.g. tweets versus science).
  • Why does this help?
    • Much more likely to have seen sub