Word Embeddings - Comprehensive Notes
Recap
- Raw count vectors
- PMI - PPMI
- Weighting / Laplace smoothing PMI
- Syntax-based co-occurrences
- tf-idf
- Cosine similarity
From Sparse to Dense Vectors
- In practice, co-occurrence matrices cover a very large vocabulary.
- For each word, tf-idf and PPMI vectors are:
- Long (length ∣V∣ = 20,000 to 50,000)
- Sparse (most elements are equal to zero)
- Techniques exist to learn lower-dimensional vectors for words:
- Short (length = 50 to 1000, usually around 300)
- Dense (most elements are non-zero)
- These dense vectors in a latent space are called embeddings.
Learning Embeddings (Dense Vectors)
- Two main types of models:
- Count-based models (distributional semantics models)
- Predictive models
Count-Based Models
- Compute statistics of how often a word co-occurs with its neighbor words in a large text corpus.
- Then, map these count-statistics down to a small, dense vector for each word.
- Count-based models learn vectors by doing dimensionality reduction on a term-context matrix.
- The term-context matrix contains information on how frequently each “word” (rows) is seen in some “context” (columns).
- They factorize this matrix to yield a lower-dimensional matrix of words and features, in which each row is the dense vector representation of one word.
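As a minimal sketch of the first step, the snippet below builds a small term-context matrix from a made-up two-sentence corpus. The function name `cooccurrence_counts`, the window size, and the corpus are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each context word appears within `window` tokens of each word."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Toy corpus (made up for illustration).
corpus = [
    "i like apricot jam".split(),
    "i like cherry jam".split(),
]
counts = cooccurrence_counts(corpus, window=1)
vocab = sorted({w for sent in corpus for w in sent})

# Term-context matrix: rows are words, columns are contexts.
matrix = [[counts[(w, c)] for c in vocab] for w in vocab]
for w, row in zip(vocab, matrix):
    print(f"{w:8s}", row)
```

Even on this tiny corpus most cells are zero, which is the sparsity that the factorization step is meant to remove.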
Specific Count-Based Models
- Singular Value Decomposition (SVD) → Linear algebra
- Latent Semantic Analysis (LSA)
- GloVe (Pennington, Socher, Manning, 2014)
Predictive Models
- Directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model).
- Neural-network-inspired models:
- word2vec (Mikolov et al., 2013)
- FastText (Bojanowski et al., 2016)
Analogy Using Cosine Distance
- Espresso and cappuccino are close in the vector space according to cosine distance.
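Cosine similarity can be sketched in a few lines; cosine distance is then one minus this value. The three vectors below are hypothetical 3-dimensional values chosen only to illustrate the idea, not real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (not taken from a trained model).
espresso = [0.9, 0.8, 0.1]
cappuccino = [0.8, 0.9, 0.2]
bicycle = [0.1, 0.0, 0.9]

print(cosine_similarity(espresso, cappuccino))  # close to 1
print(cosine_similarity(espresso, bicycle))     # much smaller
```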
Singular Value Decomposition (SVD)
- Any rectangular w×c matrix X can be expressed as the product of 3 matrices:
- X = USVᵀ
- U: a w×m matrix where the w rows correspond to rows of the original matrix X, and each of the m columns represents a dimension (feature) in a new latent space.
- S: diagonal m×m matrix of singular values expressing the importance of each dimension (feature).
- Vᵀ: transposed m×c matrix where the c columns correspond to the columns of the original matrix X, and the m rows correspond to singular values.
- Classic linear algebra result.
- Reference: Golub, G. H., & Reinsch, C. (1971). Singular value decomposition and least squares solutions. In Linear Algebra (pp. 134-151). Springer, Berlin, Heidelberg.
Term-Context Matrix Example
- An example term-context matrix X showing word co-occurrences.
SVD Applied to Term-Context Matrix
- Formula: X = UΣVᵀ (Σ is the diagonal matrix of singular values, written S above)
SVD and Low-Rank Approximation
- If we keep only the top-k singular values (with the corresponding columns of U and rows of Vᵀ), we obtain a rank-k approximation of the original matrix X.
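A small sketch of the truncation, assuming NumPy is available. The 4×5 matrix is made up for illustration; `np.linalg.svd` returns the singular values as a vector `s` rather than as a diagonal matrix.

```python
import numpy as np

# Toy 4x5 term-context count matrix (values made up for illustration).
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 1.0],
    [1.0, 2.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 2.0, 2.0, 0.0],
])

# Thin SVD: U is w x m, s holds the m singular values, Vt is m x c.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of X

# The Frobenius-norm error equals the norm of the discarded singular values.
error = np.linalg.norm(X - X_k)
print(s.round(3), round(float(error), 3))
```

The error check reflects the standard result that the truncated SVD is the best rank-k approximation in the least-squares sense.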
SVD for Word Embeddings
- We use the matrix U, truncated to its first k columns (those belonging to the k largest singular values).
- Each row of the truncated U is a k-dimensional vector representing a word in the vocabulary.
- k=300 is commonly used.
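The sketch below reads word vectors off the rows of the truncated U, assuming NumPy. The words, the counts, and the choice k = 2 are toy assumptions; a real setting would use a large (often PPMI-weighted) matrix and k around 300.

```python
import numpy as np

words = ["apricot", "cherry", "digital", "computer"]
# Toy term-context counts (made up); columns stand for four hypothetical contexts.
X = np.array([
    [4.0, 3.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 3.0],
    [0.0, 1.0, 3.0, 5.0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
embeddings = U[:, :k]  # one k-dimensional row per word
vec = dict(zip(words, embeddings))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vec["apricot"], vec["cherry"]))   # words with similar contexts
print(cosine(vec["apricot"], vec["digital"]))  # words with different contexts
```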
Word Embeddings
- Each word in the vocabulary is represented by a low-dimensional vector (usually 300 dimensions).
- All words are embedded into the same space.
- Similar words have similar vectors, meaning their vectors are close to each other in the vector space.
Uses of Word Embeddings
- Word embeddings are successfully used for various Natural Language Processing applications (usually simply for initialization):
- Semantic similarity
- Word Sense Disambiguation
- Semantic Role Labeling
- Named Entity Recognition
- Summarization
- Question Answering
- Textual Entailment
- Coreference Resolution
- Sentiment analysis
- etc.
Word2Vec
- Models for efficiently creating word embeddings.
- Popular embedding method.
- Code available on the web.
- Assumption: similar words appear with similar contexts.
- Intuition: two words that share similar contexts are associated with vectors that are close to each other in the vector space.
- Key idea: predict rather than count!
Word2Vec Approach
- Instead of counting how often each word w occurs near a context word (e.g., “apricot”), train a classifier on a binary prediction task:
- Is w likely to show up near "apricot"?
- The learned classifier weights are taken as the word embeddings.
Implicitly Supervised Training Data in Word2Vec
- A word w that occurs near apricot in the corpus acts as the gold ‘correct answer’ to the question: “Is word w likely to show up near apricot?”
- No need for hand-labeled supervision!
- The idea comes from neural language modeling
- Bengio et al. (2003)
- Collobert et al. (2011)
- The input: one-hot vectors
- bananas: (1,0,0,0)
- monkey: (0,1,0,0)
- likes: (0,0,1,0)
- every: (0,0,0,1)
- Vocabulary size ∣V∣=4
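A minimal sketch of the one-hot input for this four-word vocabulary. The 4×2 matrix `P` and the helper `project` are hypothetical; they only illustrate that multiplying a one-hot vector by a matrix selects one row, which is why the rows of such a matrix can serve as word vectors.

```python
vocab = ["bananas", "monkey", "likes", "every"]

def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's position in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# Hypothetical 4 x 2 embedding matrix (one row per vocabulary word).
P = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

def project(x, P):
    """Multiply a one-hot row vector x by the matrix P."""
    return [sum(x[i] * P[i][j] for i in range(len(x))) for j in range(len(P[0]))]

print(one_hot("monkey", vocab))
print(project(one_hot("likes", vocab), P))  # same values as the row P[2]
```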
Word2Vec Flavors
- CBOW (Continuous bag-of-words)
- Goal: Predict the middle word given the words of the context
- Skip-gram
- Goal: Predict the context words given the middle word
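The difference between the two flavors shows up in how training examples are formed. The helper names and the window size below are illustrative assumptions; the sentence reuses the toy vocabulary from above.

```python
def context_of(tokens, i, window):
    """Words within `window` positions to the left and right of position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the middle word is the output."""
    return [(context_of(tokens, i, window), tokens[i]) for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the middle word is the input, each context word is an output."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_of(tokens, i, window)]

tokens = "every monkey likes bananas".split()
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```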
Word2Vec: CBOW – High Level
- Goal: Predict the middle word given the words of the context
The resulting projection matrix P is the embedding matrix!
Word2Vec: Skip-gram – High Level
- Goal: Predict the context words given the middle word
The resulting projection matrix P is the embedding matrix!
Word2Vec: The Model - Details CBOW
- The Continuous Bag-of-Words (CBOW) is a model for learning word vectors.
- It predicts the target word from source context words.
- Both the input vector x and the output y are one-hot encoded word representations.
- The hidden-layer weight matrix W holds the word embeddings, each of size N.
Word2Vec Architectures and Training Methods
- Architectures:
- Continuous bag-of-words (CBOW)
- Skip-gram
- Training methods:
- Softmax
- Negative sampling
Softmax and Negative Sampling
- Softmax:
- In word2vec, the softmax function turns the model's scores into a probability distribution over the whole vocabulary, which is used to predict the context words (or the target word) for a given input.
- Negative sampling:
- A technique introduced to address the computational inefficiency of softmax in training word embeddings.
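- Instead of computing probabilities over the entire vocabulary, score only the true context words and a small number of sampled negative words (Mikolov et al. suggest 5 to 20 per positive example for small datasets and 2 to 5 for large ones).
- The negative examples are words sampled at random from the vocabulary, which are therefore unlikely to appear in the context of the target word.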
- The model is trained to assign higher probabilities to the true context words and lower probabilities to the negative samples.
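To make the contrast concrete, here is a minimal sketch of both computations. The scores stand for dot products between a target vector and context vectors; the function names are illustrative assumptions. Softmax has to normalize over one score per vocabulary word, whereas the negative-sampling loss only touches one positive score and k negative scores.

```python
import math

def softmax(scores):
    """Normalize one score per vocabulary word into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(pos_score, neg_scores):
    """Binary loss: push the positive score up and the k negative scores down."""
    return -math.log(sigmoid(pos_score)) - sum(math.log(sigmoid(-s)) for s in neg_scores)

print(softmax([2.0, 1.0, 0.1]))
print(negative_sampling_loss(0.0, [0.0, 0.0]))    # uninformative scores
print(negative_sampling_loss(5.0, [-5.0, -5.0]))  # well-separated scores
```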
Word2Vec: Skip-gram – Example
- Training sentence: … lemon , a tablespoon of apricot jam a pinch …
- Target word: apricot
- Context window: ±2 words (tablespoon, of, jam, a)
- For each positive example, we'll create k negative examples.
- The skip-gram model is trained to predict the probabilities of a word being a context word for the given target.
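The example can be reproduced with a short sketch. The list `noise_words`, the value k = 2, and the uniform sampling are assumptions made for illustration; word2vec itself draws negatives from a frequency-based distribution over the whole vocabulary.

```python
import random

sentence = "lemon , a tablespoon of apricot jam a pinch".split()
target_index = sentence.index("apricot")
window = 2

# Positive examples: the target paired with each word in the +/-2 window.
context = (sentence[target_index - window:target_index]
           + sentence[target_index + 1:target_index + 1 + window])
positives = [("apricot", c) for c in context]

# Hypothetical noise vocabulary used to draw negative examples.
noise_words = ["aardvark", "puddle", "where", "coaxial", "twelve", "hello", "dear", "forever"]
k = 2
rng = random.Random(0)
negatives = [("apricot", rng.choice(noise_words)) for _ in positives for _ in range(k)]

print(positives)
print(negatives)
```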
Word2Vec: Skip-gram – Training Objective
- Start from an initial set of embeddings: P for target words and M for context words.
- Motivation: Over the entire training set, we’d like to adjust those word vectors such that we:
- Maximize the similarity of the positive target word, context word pairs (t,c)
- Minimize the similarity of the negative (t,c) pairs
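One common way to write this objective down, taking the dot product as the similarity and assuming k negative context words per positive pair, is the following per-example loss. Here t is the target word's row of P, and c_pos and c_neg_i are rows of M; the symbols are notation introduced for this sketch.

```latex
L_{(t,\,c_{\mathrm{pos}})}
  = -\Big[ \log \sigma\big(c_{\mathrm{pos}} \cdot t\big)
         + \sum_{i=1}^{k} \log \sigma\big(-\,c_{\mathrm{neg}_i} \cdot t\big) \Big],
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```

Minimizing this loss raises the dot product for the positive pair and lowers it for each negative pair.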
Word2Vec – Summary: How to Learn Word2Vec Embeddings
- Choose the embedding dimension, e.g., d=300
- Start with ∣V∣ random 300-dimensional vectors as initial embeddings
- Take a corpus and take pairs of words that co-occur as positive examples
- Construct negative examples
- Train a logistic regression classifier to distinguish positive from negative examples
- Throw away the classifier and keep the embeddings!
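The steps above can be sketched end to end on a toy corpus, assuming NumPy. This is a simplified illustration rather than the word2vec implementation: negatives are drawn uniformly, the dimension is 8 instead of 300, and the function name `train_sgns` and all hyperparameters are assumptions.

```python
import numpy as np

def train_sgns(corpus, dim=8, window=2, k=2, epochs=200, lr=0.1, seed=0):
    """Toy skip-gram with negative sampling; returns (vocab, embeddings, losses)."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for sent in corpus for w in sent})
    index = {w: i for i, w in enumerate(vocab)}

    # Positive examples: (target, context) pairs that co-occur within the window.
    pairs = [(index[sent[i]], index[sent[j]])
             for sent in corpus
             for i in range(len(sent))
             for j in range(max(0, i - window), min(len(sent), i + window + 1))
             if j != i]

    # Start with random vectors: T for target words, C for context words.
    T = rng.normal(scale=0.1, size=(len(vocab), dim))
    C = rng.normal(scale=0.1, size=(len(vocab), dim))
    labels = np.zeros(k + 1)
    labels[0] = 1.0  # the first slot is the positive example

    losses = []
    for _ in range(epochs):
        total = 0.0
        for p in rng.permutation(len(pairs)):
            t, c = pairs[p]
            # Negative examples: k words drawn uniformly (a simplification).
            ctx = np.concatenate(([c], rng.integers(0, len(vocab), size=k)))
            z = C[ctx] @ T[t]
            total += np.logaddexp(0, -z[0]) + np.logaddexp(0, z[1:]).sum()
            err = 1.0 / (1.0 + np.exp(-z)) - labels  # sigmoid(z) - label
            grad_t = err @ C[ctx]
            np.subtract.at(C, ctx, lr * np.outer(err, T[t]))  # handles repeated indices
            T[t] -= lr * grad_t
        losses.append(total / len(pairs))
    return vocab, T, losses

corpus = [s.split() for s in [
    "apricot jam is sweet",
    "cherry jam is sweet",
    "digital computer stores data",
    "digital computer processes data",
]]
vocab, embeddings, losses = train_sgns(corpus)
print(f"average loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Only the target matrix is returned, mirroring the last step: the logistic-regression classifier is discarded and the embeddings are kept.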
Usefulness of Word Embeddings
- Can be used as features in classifiers
- Capture generalizations across word types
- Can be used to analyze language usage patterns in large corpora
- e.g., to study change in word meaning
Tracking Changes in Meaning
- In the early 20th century broadcast referred to “casting out seeds”; with the rise of television and radio its meaning shifted to “transmitting signals”.
Embedding Learning Algorithms
- Word2Vec [1]
- [1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- GloVe [2] - Global Vectors for Word Representation
- exploit global statistical information
- [2] Pennington, J., Socher, R., & Manning, C. D. (2014, October). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
- fastText [3]
- exploit character level information, useful for Out Of Vocabulary (OOV) words
- [3] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.
GloVe - Global Vectors for Word Representation
- Goal:
- Takes advantage of global count statistics instead of only local information.
- Learns embeddings from a co-occurrence matrix, training word vectors so that their differences predict co-occurrence ratios.
GloVe Advantage
- The model leverages statistical information by training only on the non-zero elements in a word-word co-occurrence matrix, rather than:
- on the entire sparse matrix (e.g., SVD)
- on individual context windows in a large corpus (e.g. word2vec)
- Global corpus statistics are captured directly by the model
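As a sketch of what training only on the non-zero elements looks like, the function below evaluates the weighted least-squares objective from Pennington et al. (2014), in which each non-zero count X_ij contributes f(X_ij) · (w_i · w̃_j + b_i + b̃_j − log X_ij)². The values x_max = 100 and α = 0.75 are the ones reported in the paper; the function names and the dictionary layout are assumptions for illustration.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function: down-weights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooc, W, W_ctx, b, b_ctx):
    """Sum the weighted squared errors over the non-zero co-occurrence counts.

    cooc maps (i, j) index pairs to counts and contains only non-zero entries.
    """
    loss = 0.0
    for (i, j), x in cooc.items():
        dot = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j]))
        diff = dot + b[i] + b_ctx[j] - math.log(x)
        loss += weight(x) * diff ** 2
    return loss

# Tiny example: one non-zero count, fitted exactly by the bias term.
cooc = {(0, 1): 10.0}
print(glove_loss(cooc, [[0.0]], [[0.0], [0.0]], [math.log(10.0)], [0.0, 0.0]))
```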
GloVe - Example
- Can certain aspects of meaning be extracted directly from co-occurrence probabilities?
- Consider two words i and j that exhibit a particular aspect of interest; for concreteness, suppose we are interested in the concept of thermodynamic phase, for which we might take i=ice and j=steam.
- The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various “probe” words (i.e., context words) k, that is, P(k∣i) / P(k∣j).
- For words k related to i=ice but not j=steam, say k=solid, the ratio should be large.
- Similarly, for words k related to j=steam but not i=ice, say k=gas, the ratio should be small.
- For words k like water or fashion, that are either related to both i=ice and j=steam, or to neither, the ratio should be close to “1”.
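The three cases can be checked numerically with a small sketch. The counts and totals below are hypothetical, chosen only to reproduce the pattern; they are not the corpus statistics reported in the GloVe paper.

```python
# Hypothetical co-occurrence counts of each probe word k with "ice" and "steam".
counts = {
    "ice":   {"solid": 80, "gas": 4,  "water": 300, "fashion": 2},
    "steam": {"solid": 5,  "gas": 90, "water": 280, "fashion": 2},
}
totals = {"ice": 10000, "steam": 10000}  # hypothetical total context counts

def prob(k, word):
    """Co-occurrence probability P(k | word)."""
    return counts[word][k] / totals[word]

def ratio(k):
    """P(k | ice) / P(k | steam)."""
    return prob(k, "ice") / prob(k, "steam")

for k in ["solid", "gas", "water", "fashion"]:
    print(k, round(ratio(k), 3))
```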
fastText: Motivation
- Limitation of Word2Vec:
- rare words
- Out Of Vocabulary (OOV) words
- Blends: Obamacare, mockumentary
- Noise due to spelling errors: e.g., “signficant” for “significant”
- Solution: exploit character level information
fastText
- FastText is an extension of word2vec
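- Each word is represented as itself plus a bag of constituent n-grams, with special boundary symbols “<” and “>” added to each word.
- For example: with n = 3 the word “where” would be represented by the character n-grams <wh, whe, her, ere, re>, plus the special sequence <where> for the whole word.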
- Parameters:
- minimum ngram length: 3, maximum ngram length: 4
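A minimal sketch of the n-gram extraction, assuming the boundary symbols “<” and “>” described in the fastText paper. The function name `char_ngrams` is illustrative, and the defaults follow the lengths listed above.

```python
def char_ngrams(word, min_n=3, max_n=4):
    """Character n-grams of the word wrapped in the boundary symbols < and >."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        grams += [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams

print(char_ngrams("where", 3, 3))

# A misspelling still shares many n-grams with the correct word.
shared = set(char_ngrams("signficant")) & set(char_ngrams("significant"))
print(sorted(shared))
```

Because a misspelled or unseen word shares most of its n-grams with known words, its vector can be composed from n-gram vectors that were learned during training.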
fastText: Main Characteristics
- Subword Embeddings
- It breaks words down into smaller character n-grams (subwords) and learns embeddings for these subwords.
- This allows FastText to capture morphological and syntactic information, making it effective for handling out-of-vocabulary words and languages with rich morphology.
Properties of Embeddings
- Similarity depends on how we define the context
- Small context window size, ±2
- nearest words are syntactically similar words in the same taxonomy
- e.g., the nearest neighbors of Hogwarts are other fictional schools
- Large context window size, ±5
- nearest words are related words in the same semantic field
- e.g., the nearest neighbors of Hogwarts come from the Harry Potter world
Analogy: Embeddings Capture Relational Meaning
- Sometimes referred to as the classic parallelogram model of analogical reasoning (Rumelhart and Abrahamson 1973)
- to solve: “apple is to tree as grape is to _ ”
- Given the analogy a : a* , b : b* where word b* is to be found
- We actually search for a word that is similar to b, and a*, but different from a
- man : woman , king : queen
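The search can be sketched with vector arithmetic: form a* − a + b and return the nearest remaining word by cosine similarity. The 3-dimensional vectors below are hypothetical and hand-built so that the offsets line up exactly; real embeddings only approximate this.

```python
import math

# Hypothetical toy embeddings (not from a trained model).
vectors = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(a, a_star, b, vectors):
    """Return the word closest to a* - a + b, excluding the three input words."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[a_star], vectors[b])]
    candidates = [w for w in vectors if w not in (a, a_star, b)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(solve_analogy("man", "woman", "king", vectors))
```

Excluding the input words matters in practice, because the nearest vector to the target is often b or a* itself.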
Analogy Evaluation and Hyperparameters
- More data helps
- Wikipedia is better than news text!
- Dimensionality
Evaluation of Word Embeddings
- Intrinsic evaluation: Evaluate the representation directly without training another model
- Typically simple tasks where success or failure is (almost) entirely a function of the representation
- Easy to compute, but doesn’t say much about the embeddings as features
- Extrinsic evaluation: Evaluate the impact of the representation on another task
- Typically, a neural network
- Can be more practically useful, but depends on the quality of the model for the task being tested
Embeddings Reflect Cultural Bias
- Ask “Paris : France :: Tokyo : x”
- Ask “father : doctor :: mother : x”
- Ask “man : computer programmer :: woman : x”
Biases in Word Embeddings
- Implicit Association Test (Greenwald et al., 1998): how associated are concepts (flowers, insects) and attributes (pleasantness, unpleasantness)?
- Studied by measuring timing latencies for categorization.
- Psychological findings on US participants:
- African-American names are associated with unpleasant words (more than European-American names)
- Male names are associated more with math, female names with arts
- Old people's names are associated with unpleasant words, young people's names with pleasant words.
- Caliskan et al. replication with embeddings:
- African-American names (Leroy, Shaniqua) had a higher GloVe cosine with unpleasant words (abuse, stink, ugly)
- European American names (Brad, Greg, Courtney) had a higher cosine with pleasant words (love, peace, miracle)
- Embeddings reflect and replicate all sorts of pernicious biases!
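The kind of measurement behind such a replication can be sketched as a difference of mean cosine similarities between a word and two attribute sets. The 2-dimensional vectors below are hypothetical and use the flowers/insects example from the Implicit Association Test; the function name `association` is an assumption.

```python
import math

# Hypothetical toy vectors (not real GloVe embeddings).
vectors = {
    "rose":  [0.9, 0.1],
    "wasp":  [0.1, 0.9],
    "love":  [1.0, 0.0],
    "peace": [0.9, 0.2],
    "ugly":  [0.0, 1.0],
    "stink": [0.2, 0.9],
}
pleasant = ["love", "peace"]
unpleasant = ["ugly", "stink"]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(word, attrs_a, attrs_b):
    """Mean cosine with attribute set A minus mean cosine with attribute set B."""
    mean_a = sum(cosine(vectors[word], vectors[a]) for a in attrs_a) / len(attrs_a)
    mean_b = sum(cosine(vectors[word], vectors[b]) for b in attrs_b) / len(attrs_b)
    return mean_a - mean_b

print(association("rose", pleasant, unpleasant))  # positive: closer to pleasant words
print(association("wasp", pleasant, unpleasant))  # negative: closer to unpleasant words
```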
How to make a racist AI without really trying
- Social Media, Customer Feedback
- Computing Sentiment Scores for sentences of text
Impact of Contexts
- Context window size: Should we use large or small context windows?
- Large context windows make topically similar words closer (e.g., sport, baseball, referee, etc. are grouped)
- Smaller context windows focus on syntactic or functional similarities (e.g., batting, running, jumping, etc. are grouped)
- Positional contexts: Should the context features be different for words in different positions?
- E.g., if cat is the word just before word 1, and cat appears two words after word 2, should both instances of cat be treated similarly?
- Or should they be treated differently by encoding the position in the context?
- Positional contexts seem to help if we care about grouping syntactic function or words with similar parts-of-speech
Syntactic Windows
- Idea: Instead of using proximal words in the sentence, use the dependency tree to decide on which words are proximal
Preprocessing Text for Word Embeddings
- Several choices available
- Should the words be lemmatized?
- good, better, best map to good
- give, gives, giving, gave, etc map to give
- Should words retain their capitalization?
- E.g., should Apple and apple be treated as the same word?
- Should very rare or frequent words be filtered out?
- Should some sentences be filtered out?
- E.g., long sentences, short sentences
- And many more. Can be treated as hyperparameters
Text Pre-processing
- Common pre-processing:
- Tokenization
- Normalization (Lowercasing, handling numerals, special characters, punctuation, etc.)
- Stop word removal
- Lemmatization or Stemming
- In word embedding representation not all of these steps are always necessary.
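A minimal sketch of such a pipeline with each step switchable, so that it can be matched to how a given embedding model was trained. The regular expression, the tiny stop-word list, and the function name `preprocess` are illustrative assumptions, not a recommended configuration.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to"}  # tiny illustrative list

def preprocess(text, lowercase=True, remove_stop_words=False):
    """Tokenize text, optionally lowercasing and dropping stop words."""
    if lowercase:
        text = text.lower()
    # Keep runs of letters and digits, allowing an apostrophe suffix such as 's.
    tokens = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens

print(preprocess("The monkey likes Apple's bananas!"))
print(preprocess("The monkey likes Apple's bananas!", remove_stop_words=True))
print(preprocess("Apple and apple", lowercase=False))
```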
Pre-processing for GloVe and Word2Vec
- Tokenization
- Necessary
- Both GloVe and Word2Vec work with individual words as tokens, so you must tokenize your text.
- Normalization (Lowercasing, handling numerals, special characters, punctuation, etc.)
- Optional
- Depending on the task
- To be verified w.r.t. pre-trained versions of word2vec and GloVe
Pre-processing for GloVe and Word2Vec: Stop word removal
- The specific vocabulary of a model depends on the training data and the preprocessing choices made during model training.
- Whether stop words appear in the vocabulary of a pretrained GloVe or word2vec model varies from release to release, so check the vocabulary of the specific model you use.
- If you train your own model, you have control over whether to include or exclude stop words from the vocabulary.
- Optional
- Removing stop words can reduce noise, but it is not always necessary, especially if the model includes stop words in its vocabulary.
Pre-processing for GloVe and Word2Vec: Lemmatization and Stemming
- Pretrained GloVe and Word2Vec models typically do not stem or lemmatize words as part of their training process.
- These models are trained on large text corpora and generally use the original word forms from the text data.
- Stemming or lemmatizing words would change word forms and potentially disrupt the context in which they appear.
- GloVe and Word2Vec models rely on word co-occurrences, and altering the words could hinder their ability to capture meaningful relationships between words.
- Optional
- Stemming and lemmatization are preprocessing steps that are typically applied to text data before training models like GloVe and Word2Vec when creating custom embeddings.
Word Embedding Pre-processing
- Pay attention to the type of tokenization the model is based on.
- Understand the characteristics of the task for which the model will be used.
- Check how the corpus of the pre-trained model was preprocessed, and apply the other preprocessing steps to your own text only where they are consistent with it.
Problems/Open Research Questions
- Antonyms tend to be embedded together
- Unclear how similarity is defined
- cat closer to dog or tiger?
- Embeddings may exhibit gender, racial, ethnic and other social biases
- E.g., female names are embedded closer to stereotypically female social roles
- Obvious things are not talked about in text
- E.g., most sheep are white, but “black sheep” may be more frequent than “white sheep”