Word Embeddings - Comprehensive Notes

Recap

Raw count vectors
PMI - PPMI
Weighting / Laplace smoothing PMI
Syntax-based co-occurrences
tf-idf
Cosine similarity

From Sparse to Dense Vectors

Co-occurrence matrices in reality have a large number of words.
For each word, tf-idf and PPMI vectors are:
- Long (length $|V|$ = 20,000 to 50,000)
- Sparse (most elements are equal to zero)
Techniques exist to learn lower-dimensional vectors for words:
- Short (length = 50 to 1000, usually around 300)
- Dense (most elements are non-zero)
- These dense vectors in a latent space are called embeddings.

Learning Embeddings (Dense Vectors)

Two main types of models:
- Count-based models
  - Distributional semantics models
- Predictive models
  - Neural network models

Count-Based Models

Compute statistics of how often a word co-occurs with its neighbor words in a large text corpus.
Then, map these count-statistics down to a small, dense vector for each word.
Count-based models learn vectors by doing dimensionality reduction on a term-context matrix.
- The term-context matrix contains information on how frequently each “word” (rows) is seen in some “context” (columns).
They factorize this matrix to yield a lower-dimensional matrix of words and features, where each row yields a dense vector representation for each word.

Specific Count-Based Models

Singular Value Decomposition (SVD): a linear algebra technique
Latent Semantic Analysis (LSA)
GloVe (Pennington, Socher, Manning, 2014)
- General idea 🡪

Predictive Models

Directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model).
Neural-network-inspired models:
- word2vec (Mikolov et al., 2013)
- FastText (Bojanowski et al., 2016)

Analogy Using Cosine Distance

Espresso and cappuccino are close in the vector space according to cosine distance.
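As a minimal sketch of how such closeness can be measured, the snippet below computes cosine similarity in plain Python (cosine distance is one minus this value). The three-dimensional vectors are invented for illustration only; real embeddings come from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from any trained model.
vectors = {
    "espresso":   [0.9, 0.8, 0.1],
    "cappuccino": [0.8, 0.9, 0.2],
    "bicycle":    [0.1, 0.2, 0.9],
}

sim_coffee = cosine_similarity(vectors["espresso"], vectors["cappuccino"])
sim_other = cosine_similarity(vectors["espresso"], vectors["bicycle"])
print(sim_coffee > sim_other)  # the two coffee words are closer to each other
```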

Singular Value Decomposition (SVD)

Any rectangular $w \times c$ matrix $X$ can be expressed as the product of 3 matrices:
- $X = USV^T$
- $U$ : a $w \times m$ matrix where the $w$ rows correspond to rows of the original matrix $X$ , but the $m$ columns represent a dimension (feature) in a new latent space.
- $S$ : diagonal $m \times m$ matrix of singular values expressing the importance of each dimension (feature).
- $V^T$ : transposed $m \times c$ matrix where the $c$ columns correspond to the columns of the original matrix $X$ , but the $m$ rows correspond to the latent dimensions (one per singular value).
Classic linear algebra result.
Reference: Golub, G. H., & Reinsch, C. (1971). Singular value decomposition and least squares solutions. In Linear Algebra (pp. 134-151). Springer, Berlin, Heidelberg.

Term-Context Matrix Example

A term-context matrix $X$ example showing word co-occurrences.

SVD Applied to Term-Context Matrix

Formula: $X = U \Sigma V^T$ (here $\Sigma$ denotes the diagonal matrix of singular values, written $S$ above)

SVD and Low-Rank Approximation

If we keep the top-k singular values, we obtain a low-rank approximation of the original matrix $X$ .

SVD for Word Embeddings

We use the matrix $U$ , keeping only its first $k$ columns (those with the largest singular values).
Each row of the truncated $U$ is a k-dimensional vector representing a word in the vocabulary.
- $k = 300$ is commonly used.
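The SVD steps can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than a reference implementation: the 4 x 4 count matrix and the word list are invented, and $k = 2$ is used only because the toy matrix is tiny.

```python
import numpy as np

# Toy term-context count matrix (rows: words, columns: contexts).
# The counts are invented for illustration only.
words = ["apricot", "pineapple", "digital", "information"]
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

# Thin SVD: X = U @ diag(S) @ Vt, singular values in decreasing order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions kept
embeddings = U[:, :k]                         # one k-dimensional row per word
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # rank-k approximation of X

for word, vec in zip(words, embeddings):
    print(word, vec)
```

Keeping only the first $k$ columns of $U$ is what turns the long, sparse rows of $X$ into short, dense word vectors.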

Word Embeddings

Each word in the vocabulary is represented by a low-dimensional vector (usually 300 dimensions).
All words are embedded into the same space.
Similar words have similar vectors, meaning their vectors are close to each other in the vector space.

Uses of Word Embeddings

Word embeddings are successfully used for various Natural Language Processing applications (usually simply for initialization):
- Semantic similarity
- Word Sense Disambiguation
- Semantic Role Labeling
- Named Entity Recognition
- Summarization
- Question Answering
- Textual Entailment
- Coreference Resolution
- Sentiment analysis
- etc.

Word2Vec

Models for efficiently creating word embeddings.
Popular embedding method.
Code available on the web.
Assumption: similar words appear with similar contexts.
Intuition: two words that share similar contexts are associated with vectors that are close to each other in the vector space.
Key idea: predict rather than count!

Word2Vec Approach

Instead of counting how often each word $w$ occurs near a context word (e.g., “apricot”), train a classifier on a binary prediction task:
- Is $w$ likely to show up near "apricot"?
- The learned classifier weights are taken as the word embeddings.

Implicitly Supervised Training Data in Word2Vec

A word $s$ near apricot acts as gold ‘correct answer’ to the question: “Is word $w$ likely to show up near apricot?”
No need for hand-labeled supervision!
The idea comes from neural language modeling
- Bengio et al. (2003)
- Collobert et al. (2011)

Word2Vec Input

The input: one-hot vectors
- bananas: (1,0,0,0)
- monkey: (0,1,0,0)
- likes: (0,0,1,0)
- every: (0,0,0,1)
Vocabulary size $|V| = 4$
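A small sketch of this one-hot encoding, using the four-word vocabulary above. The projection matrix `P` and its values are hypothetical; the point is only that multiplying a one-hot vector by a matrix selects one row of it.

```python
vocab = ["bananas", "monkey", "likes", "every"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

# Hypothetical |V| x 2 projection matrix (values invented for illustration).
P = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6],
     [0.7, 0.8]]

def embed(word):
    """Multiply the one-hot vector by P, which picks out the word's row."""
    x = one_hot(word)
    return [sum(x[i] * P[i][d] for i in range(len(vocab)))
            for d in range(len(P[0]))]

print(one_hot("monkey"))
print(embed("monkey"))
```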

Word2Vec Flavors

CBOW (Continuous bag-of-words)
- Goal: Predict the middle word given the words of the context
Skip-gram
- Goal: Predict the context words given the middle word

Word2Vec: CBOW – High Level

Goal: Predict the middle word given the words of the context
The resulting projection matrix P is the embedding matrix!

Word2Vec: Skip-gram – High Level

Goal: Predict the context words given the middle word
The resulting projection matrix P is the embedding matrix!

Word2Vec: The Model - Details CBOW

The Continuous Bag-of-Words (CBOW) is a model for learning word vectors.
It predicts the target word from source context words.
Both the input vector x and the output y are one-hot encoded word representations.
The hidden layer weights (matrix W) hold the word embeddings, each of size N.

Word2Vec Architectures and Training Methods

Architectures:
- Continuous bag-of-words (CBOW)
- Skip-gram
Training methods:
- Softmax
- Negative sampling

Softmax and Negative Sampling

Softmax:
- A function that normalizes the model's scores over the entire vocabulary into a probability distribution; in word2vec it is used to predict the context words (or the target word) for a given input word.
Negative sampling:
- A technique introduced to address the computational inefficiency of softmax in training word embeddings.
- Instead of computing probabilities over the entire vocabulary, use only the true context words plus a small number of negative samples (Mikolov et al. suggest roughly 5 to 20 for small datasets and 2 to 5 for large ones).
- The negative examples are randomly sampled words that are assumed not to appear in the context of the target word.
- The model is trained to assign higher probabilities to the true context words and lower probabilities to the negative samples.
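To make the contrast concrete, here is a minimal sketch of both training signals in plain Python. The function names are illustrative and the scores stand for dot products between a target vector and context vectors; this is not the original word2vec code.

```python
import math

def softmax(scores):
    """Turn one score per vocabulary word into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(pos_score, neg_scores):
    """Logistic loss for one true context word and a few sampled negatives."""
    loss = -math.log(sigmoid(pos_score))
    loss -= sum(math.log(sigmoid(-s)) for s in neg_scores)
    return loss

probs = softmax([1.0, 2.0, 3.0])
good = negative_sampling_loss(5.0, [-5.0, -5.0])  # positive high, negatives low
bad = negative_sampling_loss(-5.0, [5.0, 5.0])    # the opposite situation
```

The softmax needs one score per vocabulary word for every training example, whereas the negative-sampling loss only touches the true context word and the sampled negatives.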

Word2Vec: Skip-gram – Example

Training sentence: … lemon , a tablespoon of apricot jam a pinch …
- Target word: apricot
- Context window: ±2 words (tablespoon, of, jam, a)
For each positive example, we'll create k negative examples.
The skip-gram model is trained to predict the probabilities of a word being a context word for the given target.
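The example can be sketched in code as follows. This is a simplified illustration: punctuation is dropped from the sentence, and negatives are drawn uniformly from words outside the context window, whereas word2vec itself samples them from a smoothed unigram distribution.

```python
import random

def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its +/- window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def negative_examples(target, positives, vocab, k, rng):
    """Draw k negative (target, word) pairs; uniform sampling is a simplification."""
    candidates = [w for w in vocab if w != target and w not in positives]
    return [(target, rng.choice(candidates)) for _ in range(k)]

tokens = "lemon a tablespoon of apricot jam a pinch".split()
apricot_pairs = [p for p in skipgram_pairs(tokens, window=2) if p[0] == "apricot"]

vocab = sorted(set(tokens))
context_words = {c for _, c in apricot_pairs}
negatives = negative_examples("apricot", context_words, vocab, k=2,
                              rng=random.Random(0))
print(apricot_pairs)
print(negatives)
```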

Word2Vec: Skip-gram – Training Objective

An initial set of embeddings P for target words, and M for context words.
Motivation: Over the entire training set, we’d like to adjust those word vectors such that we:
- Maximize the similarity of the positive target word, context word pairs (t,c)
- Minimize the similarity of the negative (t,c) pairs
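One common way to write this objective, following the standard skip-gram with negative sampling formulation, is as a loss for a single target vector $t$ (a row of P), its positive context vector $c$ (a row of M), and $k$ sampled negative context vectors $n_1, \dots, n_k$. The symbols $t$, $c$, $n_i$ and $\sigma$ are notation introduced here for illustration:

$$
L = -\left[ \log \sigma(c \cdot t) + \sum_{i=1}^{k} \log \sigma(-n_i \cdot t) \right],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
$$

Minimizing $L$ pushes the dot product $c \cdot t$ up for positive pairs and the dot products $n_i \cdot t$ down for negative pairs.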

Word2Vec – Summary: How to Learn Word2Vec Embeddings

Choose the embedding dimension, e.g., $d=300$
Start with $|V|$ random 300-dimensional vectors as initial embeddings (one per vocabulary word)
Take a corpus and take pairs of words that co-occur as positive examples
Construct negative examples
Train a logistic regression classifier to distinguish positive from negative examples
Throw away the classifier and keep the embeddings!
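The recipe above can be sketched as a tiny pure-Python training loop. It is a toy under stated assumptions, not the word2vec implementation: the vocabulary, the single training pair, the hyperparameters, and the uniform negative sampling are all chosen for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(P, M, target, context):
    """Dot product between a target embedding and a context embedding."""
    return sum(a * b for a, b in zip(P[target], M[context]))

def train_sgns(pairs, vocab, dim=10, k=2, epochs=50, lr=0.05, seed=0):
    rng = random.Random(seed)
    # Target embeddings P and context embeddings M, randomly initialised.
    P = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    M = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    for _ in range(epochs):
        for target, context in pairs:
            # One positive example (label 1) plus k sampled negatives (label 0).
            candidates = [w for w in vocab if w != context]
            examples = [(context, 1.0)]
            examples += [(rng.choice(candidates), 0.0) for _ in range(k)]
            for word, label in examples:
                t, c = P[target], M[word]
                grad = sigmoid(score(P, M, target, word)) - label
                for d in range(dim):
                    t_d = t[d]  # keep the old value for the context update
                    t[d] -= lr * grad * c[d]
                    c[d] -= lr * grad * t_d
    return P, M

vocab = ["apricot", "jam", "lemon", "pinch", "of"]
P, M = train_sgns([("apricot", "jam")], vocab, dim=10, k=2,
                  epochs=200, lr=0.1, seed=0)
s_jam = score(P, M, "apricot", "jam")
print(s_jam)
```

After training, only `P` (and optionally `M`) is kept; the logistic classifier itself has no other parameters to throw away in this sketch.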

Usefulness of Word Embeddings

Can be used as features in classifiers
Capture generalizations across word types
Can be used to analyze language usage patterns in large corpora
- e.g., to study change in word meaning

Tracking Changes in Meaning

In the early 20th century broadcast referred to “casting out seeds”; with the rise of television and radio its meaning shifted to “transmitting signals”.

Embedding Learning Algorithms

Word2Vec [1]
- [1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
GloVe [2] - Global Vectors for Word Representation
- exploit global statistical information
- [2] Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
fastText [3]
- exploit character level information, useful for Out Of Vocabulary (OOV) words
- [3] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.

GloVe - Global Vectors for Word Representation

Goal:
- Takes advantage of global count statistics instead of only local information.
- Learns embeddings from a co-occurrence matrix, training word vectors so that their differences predict co-occurrence ratios.

GloVe Advantage

The model leverages statistical information by training only on the non-zero elements in a word-word co-occurrence matrix, rather than:
- on the entire sparse matrix (e.g., SVD)
- on individual context windows in a large corpus (e.g. word2vec)
Global corpus statistics are captured directly by the model
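For reference, the weighted least-squares objective given in Pennington et al. (2014) makes the role of the non-zero counts explicit. Here $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are word and context vectors, and $b_i$, $\tilde{b}_j$ are bias terms:

$$
J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^{T} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

Because $f(0) = 0$, entries with $X_{ij} = 0$ contribute nothing to the sum, so only the non-zero cells of the matrix need to be visited during training.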

GloVe - Example

Can certain aspects of meaning be extracted directly from co-occurrence probabilities?
Consider two words $i$ and $j$ that exhibit a particular aspect of interest; for concreteness, suppose we are interested in the concept of thermodynamic phase, for which we might take $i = ice$ and $j = steam$ .
The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various “probe” words (i.e., context words), $k$ .

GloVe - Example

For words $k$ related to $i = ice$ but not $j = steam$ , say $k = solid$ , the ratio should be large.
Similarly, for words $k$ related to $j = steam$ but not $i = ice$ , say $k = gas$ , the ratio should be small.
For words $k$ like water or fashion, that are either related to both $i = ice$ and $j = steam$ , or to neither, the ratio should be close to “1”.
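Written out, with $X_{ik}$ the number of times word $k$ occurs in the context of word $i$ and $X_i = \sum_k X_{ik}$ (notation following Pennington et al., 2014), the quantity being compared is:

$$
P_{ik} = P(k \mid i) = \frac{X_{ik}}{X_i},
\qquad
\text{ratio} = \frac{P_{ik}}{P_{jk}}
$$

With $i = ice$ and $j = steam$ , the three cases above read: $P_{ik}/P_{jk} \gg 1$ for $k = solid$ , $P_{ik}/P_{jk} \ll 1$ for $k = gas$ , and $P_{ik}/P_{jk} \approx 1$ for $k = water$ or $k = fashion$ .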

fastText: Motivation

Limitation of Word2Vec:
- rare words
- Out Of Vocabulary (OOV) words
- Blends: Obamacare, mockumentary
- Noise due to spelling errors: signficant
Solution: exploit character level information

fastText

FastText is an extension of word2vec
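Each word is represented as itself plus a bag of constituent character n-grams, with special boundary symbols “<” and “>” added at the beginning and end of the word.
For example: with n-gram = 3 the word “where” would be represented by the sequence <where> plus the character n-grams: <wh, whe, her, ere, re>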
Parameters:
- minimum ngram length: 3, maximum ngram length: 4
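A minimal sketch of the n-gram extraction, assuming the “<” and “>” boundary symbols used in the fastText paper. The function name `char_ngrams` is illustrative and not part of the fastText library.

```python
def char_ngrams(word, min_n=3, max_n=4):
    """Return all character n-grams of the word wrapped in boundary symbols."""
    marked = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            ngrams.append(marked[i:i + n])
    return ngrams

trigrams = char_ngrams("where", min_n=3, max_n=3)
three_and_four = char_ngrams("where", min_n=3, max_n=4)
print(trigrams)
print(three_and_four)
```

In fastText the vector of a word is then built from the vectors of these n-grams (together with the whole-word sequence), which is why a vector can still be produced for a word never seen in training.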

fastText: Main Characteristics

Subword Embeddings
It breaks words down into smaller character n-grams (subwords) and learns embeddings for these subwords.
This allows FastText to capture morphological and syntactic information, making it effective for handling out-of-vocabulary words and languages with rich morphology.

Properties of Embeddings

Similarity depends on how we defined the context
Small context window size, ±2
- nearest words are syntactically similar words in same taxonomy:
- Hogwarts nearest neighbors are other fictional schools:
  - Sunnydale
  - Evernight
Large context window size, ±5
- nearest words are related words in same semantic field:
- Hogwarts nearest neighbors are Harry Potter world:
  - Dumbledore
  - Malfoy

Analogy: Embeddings Capture Relational Meaning

Sometimes referred to as the classic parallelogram model of analogical reasoning (Rumelhart and Abrahamson 1973)
- to solve: “apple is to tree as grape is to _ ”
Given the analogy a : a* , b : b* where word b* is to be found
We actually search for a word that is similar to b, and a*, but different from a
- man : woman , king : queen
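The parallelogram method can be sketched as follows: compute a* - a + b and return the remaining word whose vector is closest by cosine similarity. The three-dimensional vectors are invented so that the analogy works exactly; real embeddings only approximate this behaviour.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def solve_analogy(a, a_star, b, vectors):
    """Return the word closest to a* - a + b, excluding the three input words."""
    target = [y - x + z
              for x, y, z in zip(vectors[a], vectors[a_star], vectors[b])]
    candidates = [w for w in vectors if w not in (a, a_star, b)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy vectors, not taken from any trained model.
vectors = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [3.0, 0.0, 1.0],
    "queen": [3.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 3.0],
}

answer = solve_analogy("man", "woman", "king", vectors)
print(answer)
```

Excluding the input words from the candidates matters in practice, because the nearest vector to a* - a + b is often one of the inputs themselves.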

Analogy Evaluation and Hyperparameters

More data helps
- Wikipedia is better than news text!
Dimensionality
- Good dimension is ~300

Evaluation of Word Embeddings

Intrinsic evaluation: Evaluate the representation directly without training another model
- Typically simple tasks where success or failure is (almost) entirely a function of the representation
- Easy to compute, but doesn’t say much about the embeddings as features
Extrinsic evaluation: Evaluate the impact of the representation on another task
- Typically, a neural network
- Can be more practically useful, but depends on the quality of the model for the task being tested

Embeddings Reflect Cultural Bias

Ask “Paris : France :: Tokyo : x”
- x = Japan
Ask “father : doctor :: mother : x”
- x = nurse
Ask “man : computer programmer :: woman : x”
- x = homemaker

Biases in Word Embeddings

Implicit Association Test (Greenwald et al., 1998): How associated are
- concepts (flowers, insects) & attributes (pleasantness, unpleasantness)?
- Studied by measuring timing latencies for categorization.
Psychological findings on US participants:
- African-American names are associated with unpleasant words (more than European-American names)
- Male names associated more with math, female names with arts
- Old people's names with unpleasant words, young people with pleasant words.
Caliskan et al. replication with embeddings:
- African-American names (Leroy, Shaniqua) had a higher GloVe cosine with unpleasant words (abuse, stink, ugly)
- European American names (Brad, Greg, Courtney) had a higher cosine with pleasant words (love, peace, miracle)
Embeddings reflect and replicate all sorts of pernicious biases!
Example: “How to make a racist AI without really trying”
- Social media, customer feedback
- Computing sentiment scores for sentences of text

Impact of Contexts

Context window size: Should we use large or small context windows?
- Large context windows make topically similar words closer (eg: sport, baseball, referee, etc are grouped)
- Smaller context windows focus on syntactic or functional similarities (eg: batting, running, jumping, etc are grouped)
Positional contexts: Should the context features be different for words in different positions?
- Eg: For word 1, if the previous word is cat, and for word 2, cat appears two words after it, should both instances of cat be treated similarly?
  - Or should they be treated differently by encoding the position in the context?
- Positional contexts seem to help if we care about grouping syntactic function or words with similar parts-of-speech
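The difference between plain and positional contexts can be sketched as below. The feature format `word_offset` is an assumption made for illustration; other encodings of position are possible.

```python
def context_features(tokens, i, window=2, positional=False):
    """Context features for tokens[i], optionally tagged with relative position."""
    feats = []
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or j < 0 or j >= len(tokens):
            continue
        if positional:
            feats.append(f"{tokens[j]}_{offset:+d}")  # e.g. cat_-1
        else:
            feats.append(tokens[j])
    return feats

tokens = "the cat sat on the mat".split()
plain = context_features(tokens, 2, window=2)
positioned = context_features(tokens, 2, window=2, positional=True)
print(plain)
print(positioned)
```

With positional features, "cat" one word to the left and "cat" two words to the right become two different context columns, at the cost of a larger and sparser context vocabulary.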

Syntactic Windows

Idea: Instead of using proximal words in the sentence, use the dependency tree to decide on which words are proximal

Preprocessing Text for Word Embeddings

Several choices available
- Should the words be lemmatized?
  - good, better, best map to good
  - give, gives, giving, gave, etc map to give
- Should words retain their capitalization?
  - Eg: Should Apple and apple be treated as the same word?
- Should very rare or frequent words be filtered out?
  - Eg: of vs. octothorpe
- Should some sentences be filtered out?
  - Eg: Long sentences, short sentences
And many more. Can be treated as hyperparameters

Text Pre-processing

Common pre-processing:
- Tokenization
- Normalization (Lowercasing, handling numerals, special characters, punctuation, etc.)
- Stop word removal
- Lemmatization or Stemming
For word embeddings, not all of these steps are always necessary.

Pre-processing for GloVe and Word2Vec

Tokenization
- Necessary
- Both GloVe and Word2Vec work with individual words as tokens, so you must tokenize your text.
Normalization (Lowercasing, handling numerals, special characters, punctuation, etc.)
- Optional
- Depending on the task
- To be verified w.r.t. pre-trained versions of word2vec and GloVe
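A minimal sketch of tokenization with optional lowercasing is shown below. The regular expression is a deliberately simple assumption; pretrained word2vec and GloVe vectors were built with their own tokenizers, so their conventions should be checked before reusing this.

```python
import re

def preprocess(text, lowercase=True):
    """Split text into word tokens, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    # Letters/digits, optionally followed by an apostrophe suffix (doesn't).
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

lowered = preprocess("Apple likes apples, doesn't it?")
cased = preprocess("Apple likes apples, doesn't it?", lowercase=False)
print(lowered)
print(cased)
```

Whether "Apple" and "apple" should collapse into one token depends on the task and on how the embedding vocabulary was built.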

Pre-processing for GloVe and Word2Vec: Stop word removal

The specific vocabulary of a model depends on the training data and the preprocessing choices made during model training.
Whether stop words are in the vocabulary varies between pretrained models, so check the vocabulary of the model you use.
If you train your own model, you have control over whether to include or exclude stop words from the vocabulary.
Optional
Removing stop words can reduce noise, but it is not always necessary, especially if the model includes stop words in its vocabulary.

Pre-processing for GloVe and Word2Vec: Lemmatization and Stemming

Pretrained GloVe and Word2Vec models typically do not stem or lemmatize words as part of their training process.
These models are trained on large text corpora and generally use the original word forms from the text data.
Stemming or lemmatizing words would change word forms and potentially disrupt the context in which they appear.
GloVe and Word2Vec models rely on word co-occurrences, and altering the words could hinder their ability to capture meaningful relationships between words.
Optional
Stemming and lemmatization are preprocessing steps that are typically applied to text data before training models like GloVe and Word2Vec when creating custom embeddings.

Word Embedding Pre-processing

Pay attention to the type of tokenization the model is based on.
Understand the characteristics of the task for which the model is used.
Check which pre-processing steps the pre-trained model was built with before applying the other pre-processing phases to your own text.

Problems/Open Research Questions

Antonyms tend to be embedded together
Unclear how similarity is defined
- cat closer to dog or tiger?
Embeddings may exhibit gender, racial, ethnic and other social biases
- Eg: female names are embedded closer to stereotypically female social roles
Obvious things are not talked about in text
- Eg: most sheep are white, but “black sheep” may be more frequent than “white sheep”