Word Embeddings - Comprehensive Notes

Recap

  • Raw count vectors
  • PMI - PPMI
  • Weighting / Laplace smoothing PMI
  • Syntax-based co-occurrences
  • tf-idf
  • Cosine similarity

From Sparse to Dense Vectors

  • In practice, co-occurrence matrices cover a very large number of words.
  • For each word, tf-idf and PPMI vectors are:
    • Long (length |V| = 20,000 to 50,000)
    • Sparse (most elements are equal to zero)
  • Techniques exist to learn lower-dimensional vectors for words:
    • Short (length = 50 to 1000, usually around 300)
    • Dense (most elements are non-zero)
    • These dense vectors in a latent space are called embeddings.

Learning Embeddings (Dense Vectors)

  • Two main types of models:
    • Count-based models
      • Distributional semantics models
    • Predictive models
      • Neural network models

Count-Based Models

  • Compute statistics of how often a word co-occurs with its neighbor words in a large text corpus.
  • Then, map these count-statistics down to a small, dense vector for each word.
  • Count-based models learn vectors by doing dimensionality reduction on a term-context matrix.
    • The term-context matrix contains information on how frequently each “word” (rows) is seen in some “context” (columns).
  • They factorize this matrix to yield a lower-dimensional matrix of words and features, where each row yields a dense vector representation for each word.
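
As a rough illustration of the counting step, the sketch below builds a tiny term-context count matrix from a toy corpus. The corpus, the window size of 2, and the function name `cooccurrence_counts` are assumptions made for this example; the notes do not prescribe them.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each word (row) occurs within `window` tokens of each context word (column)."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Toy corpus (hypothetical), already tokenized and lowercased.
corpus = [
    "a tablespoon of apricot jam".split(),
    "a tablespoon of cherry jam".split(),
]
counts = cooccurrence_counts(corpus, window=2)
vocab = sorted({w for sent in corpus for w in sent})
# |V| x |V| list-of-lists view of the counts: rows are words, columns are contexts.
matrix = [[counts[w][c] for c in vocab] for w in vocab]
```

A factorization method is then applied to a matrix of this kind (usually after reweighting, e.g. with PPMI) to obtain the short dense vectors.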

Specific Count-Based Models

  • Singular Value Decomposition (SVD) → Linear algebra
  • Latent Semantic Analysis (LSA)
  • GloVe (Pennington, Socher, Manning, 2014)

Predictive Models

  • Directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model).
  • Neural-network-inspired models:
    • word2vec (Mikolov et al., 2013)
    • FastText (Bojanowski et al., 2016)

Analogy Using Cosine Distance

  • Espresso and cappuccino are close in the vector space according to cosine distance.
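
A minimal sketch of the cosine computation behind this kind of comparison. The three-dimensional vectors are invented for illustration only; real embeddings would come from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, chosen only to illustrate the computation.
espresso = [0.9, 0.8, 0.1]
cappuccino = [0.8, 0.9, 0.2]
tractor = [0.1, 0.0, 0.9]

sim_coffee = cosine_similarity(espresso, cappuccino)
sim_other = cosine_similarity(espresso, tractor)
# Cosine distance is commonly defined as 1 - cosine similarity.
dist_coffee = 1.0 - sim_coffee
```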

Singular Value Decomposition (SVD)

  • Any rectangular w × c matrix X can be expressed as the product of 3 matrices:
    • X = U S V^T
    • U: a w × m matrix where the w rows correspond to the rows of the original matrix X, and each of the m columns represents a dimension (feature) in a new latent space.
    • S: a diagonal m × m matrix of singular values expressing the importance of each dimension (feature).
    • V^T: an m × c matrix (V transposed) where the c columns correspond to the columns of the original matrix X, and the m rows correspond to the latent dimensions, one per singular value.
  • Classic linear algebra result.
  • Reference: Golub, G. H., & Reinsch, C. (1971). Singular value decomposition and least squares solutions. In Linear Algebra (pp. 134-151). Springer, Berlin, Heidelberg.

Term-Context Matrix Example

  • An example term-context matrix X showing word co-occurrences.

SVD Applied to Term-Context Matrix

  • Formula: X = U Σ V^T, where Σ is the diagonal matrix of singular values (written S in the SVD definition)

SVD and Low-Rank Approximation

  • If we keep the top-k singular values, we obtain a low-rank approximation of the original matrix X.

SVD for Word Embeddings

  • We use the matrix U.
  • Each row of U is a k-dimensional vector representing a word in the vocabulary.
    • k = 300 is commonly used.
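
The truncation can be sketched with NumPy's `numpy.linalg.svd`. The 4 × 5 count matrix and k = 2 below are toy assumptions; on a real vocabulary the matrix would be far larger and k would be around 300.

```python
import numpy as np

# Hypothetical 4-word x 5-context count matrix X (rows = words, columns = contexts).
X = np.array([
    [2.0, 1.0, 0.0, 0.0, 1.0],
    [1.0, 2.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 3.0, 1.0, 0.0],
    [0.0, 1.0, 2.0, 2.0, 0.0],
])

# Thin SVD: X = U @ diag(s) @ Vt, with the singular values in s sorted in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions to keep (toy value)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation of X built from the top-k singular values.
X_k = U_k @ np.diag(s_k) @ Vt_k

# One k-dimensional embedding per word: the rows of the truncated U.
embeddings = U_k
```

Some implementations scale the rows by the singular values (`U_k * s_k`) instead; the sketch follows the notes and uses U directly.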

Word Embeddings

  • Each word in the vocabulary is represented by a low-dimensional vector (usually 300 dimensions).
  • All words are embedded into the same space.
  • Similar words have similar vectors, meaning their vectors are close to each other in the vector space.

Uses of Word Embeddings

  • Word embeddings are successfully used for various Natural Language Processing applications (usually simply for initialization):
    • Semantic similarity
    • Word Sense Disambiguation
    • Semantic Role Labeling
    • Named Entity Recognition
    • Summarization
    • Question Answering
    • Textual Entailment
    • Coreference Resolution
    • Sentiment analysis
    • etc.

Word2Vec

  • Models for efficiently creating word embeddings.
  • Popular embedding method.
  • Code available on the web.
  • Assumption: similar words appear with similar contexts.
  • Intuition: two words that share similar contexts are associated with vectors that are close to each other in the vector space.
  • Key idea: predict rather than count!

Word2Vec Approach

  • Instead of counting how often each word w occurs near a context word (e.g., “apricot”), train a classifier on a binary prediction task:
    • Is w likely to show up near "apricot"?
    • The learned classifier weights are taken as the word embeddings.

Implicitly Supervised Training Data in Word2Vec

  • A word s near apricot acts as the gold ‘correct answer’ to the question: “Is word w likely to show up near apricot?”
  • No need for hand-labeled supervision!
  • The idea comes from neural language modeling
    • Bengio et al. (2003)
    • Collobert et al. (2011)

Word2Vec Input

  • The input: one-hot vectors
    • bananas: (1,0,0,0)
    • monkey: (0,1,0,0)
    • likes: (0,0,1,0)
    • every: (0,0,0,1)
  • Vocabulary size |V| = 4
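
A small sketch of how such one-hot vectors could be produced. The helper name `one_hot` is an assumption for this example; the word order matches the list above.

```python
def one_hot(word, vocab):
    """Return a list of len(vocab) zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

# Vocabulary in the same order as the example above, so |V| = 4.
vocab = ["bananas", "monkey", "likes", "every"]
encoded = {word: one_hot(word, vocab) for word in vocab}
```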

Word2Vec Flavors

  1. CBOW (Continuous bag-of-words)
    • Goal: Predict the middle word given the words of the context
  2. Skip-gram
    • Goal: Predict the context words given the middle word

Word2Vec: CBOW – High Level

  • Goal: Predict the middle word given the words of the context
    The resulting projection matrix P is the embedding matrix!

Word2Vec: Skip-gram – High Level

  • Goal: Predict the context words given the middle word
    The resulting projection matrix P is the embedding matrix!

Word2Vec: The Model - Details CBOW

  • The Continuous Bag-of-Words (CBOW) is a model for learning word vectors.
  • It predicts the target word from source context words.
  • Both the input vector x and the output y are one-hot encoded word representations.
  • The hidden layer (Matrix W) is the word embedding of size N.
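
The forward pass of CBOW can be sketched as follows. This is a simplified illustration, not the reference implementation: the output matrix name `W_out`, the toy embedding size N = 3, the random initial weights, and the example context are all assumptions.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, W, W_out):
    """Predict a distribution over the vocabulary from the context word indices."""
    n = len(W[0])
    # Multiplying a one-hot vector by W just selects a row, so we index rows directly
    # and average them to get the hidden layer.
    hidden = [sum(W[i][d] for i in context_ids) / len(context_ids) for d in range(n)]
    # One score per vocabulary word: dot product of hidden with that word's output column.
    scores = [sum(hidden[d] * W_out[d][j] for d in range(n)) for j in range(len(W_out[0]))]
    return softmax(scores)

random.seed(0)
vocab = ["bananas", "monkey", "likes", "every"]
V, N = len(vocab), 3  # N = 3 is a toy embedding size
W = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]      # input embeddings, |V| x N
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]  # output weights, N x |V|

# Context "every monkey ___ bananas": predict the middle word.
context = [vocab.index(w) for w in ["every", "monkey", "bananas"]]
probs = cbow_forward(context, W, W_out)
```

With untrained weights the probabilities are close to uniform; training adjusts W and W_out so that the true middle word receives high probability, and the rows of W are then kept as the embeddings.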

Word2Vec Architectures and Training Methods

  • Architectures:
    • Continuous bag-of-words (CBOW)
    • Skip-gram
  • Training methods:
    • Softmax
    • Negative sampling

Softmax and Negative Sampling

  • Softmax:
    • A function that turns the model's scores over the whole vocabulary into a probability distribution; in word2vec it is used to predict the context words (or the target word) for a given input.
  • Negative sampling:
    • A technique introduced to address the computational inefficiency of softmax in training word embeddings.
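    • Instead of scoring the entire vocabulary, train on the true context words plus a small number of negative samples (the word2vec authors suggest roughly 5 to 20 per positive example for small datasets and 2 to 5 for large ones).
    • The negative examples are randomly sampled words that are assumed not to appear in the context of the target word.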
    • The model is trained to assign higher probabilities to the true context words and lower probabilities to the negative samples.

Word2Vec: Skip-gram – Example

  • Training sentence: … lemon , a tablespoon of apricot jam a pinch …
    • Target word: apricot
    • Context window: ±2 words (tablespoon, of, jam, a)
  • For each positive example, we'll create k negative examples.
  • The skip-gram model is trained to predict the probabilities of a word being a context word for the given target.
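
The construction of training pairs for this sentence might look like the sketch below. It makes two simplifying assumptions: noise words are drawn uniformly at random (word2vec itself samples according to a weighted word-frequency distribution), and the extra noise vocabulary is invented for the example.

```python
import random

def skipgram_pairs(tokens, target_index, window, k, vocab, rng):
    """Return (positives, negatives) lists of (target, context) pairs for one target position."""
    target = tokens[target_index]
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    positives = [(target, tokens[j]) for j in range(lo, hi) if j != target_index]
    context_words = {c for _, c in positives}
    # Candidate noise words: anything that is neither the target nor a true context word.
    candidates = [w for w in vocab if w != target and w not in context_words]
    # k negative pairs for every positive pair.
    negatives = [(target, rng.choice(candidates)) for _ in positives for _ in range(k)]
    return positives, negatives

tokens = "lemon , a tablespoon of apricot jam a pinch".split()
# Hypothetical extra vocabulary to draw noise words from.
vocab = sorted(set(tokens) | {"aardvark", "seven", "hello", "dear", "forever"})
rng = random.Random(0)
positives, negatives = skipgram_pairs(
    tokens, tokens.index("apricot"), window=2, k=2, vocab=vocab, rng=rng
)
```

With a window of ±2 and k = 2 this gives 4 positive pairs and 8 negative pairs for the target apricot.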

Word2Vec: Skip-gram – Training Objective

  • Start with an initial set of embeddings: P for target words and M for context words.
  • Motivation: Over the entire training set, we’d like to adjust those word vectors such that we:
    • Maximize the similarity of the positive target word, context word pairs (t,c)
    • Minimize the similarity of the negative (t,c) pairs
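
One common way to write this objective for a single positive pair (t, c) with k sampled negative context words n_1, ..., n_k is shown below. It assumes that similarity is measured by the dot product, that t is the target word's row of P, and that c and n_i are rows of M; σ is the logistic sigmoid. The loss L is minimized during training.

```latex
L = -\left[ \log \sigma(\mathbf{c} \cdot \mathbf{t})
      + \sum_{i=1}^{k} \log \sigma(-\mathbf{n}_i \cdot \mathbf{t}) \right],
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```

The first term is small when the positive pair has a high dot product; each term in the sum is small when a negative pair has a low one.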

Word2Vec – Summary: How to Learn Word2Vec Embeddings

  1. Choose the embedding dimension, e.g., d = 300
  2. Start with |V| random 300-dimensional vectors as initial embeddings
  3. Take a corpus and take pairs of words that co-occur as positive examples
  4. Construct negative examples
  5. Train a logistic regression classifier to distinguish positive from negative examples
  6. Throw away the classifier and keep the embeddings!
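
The six steps above can be sketched in plain Python as follows. This is a toy illustration under several assumptions: a tiny dimension (8 rather than 300), hand-written positive pairs instead of a corpus, noise words drawn uniformly from the vocabulary, and made-up hyperparameters. The function name `train_sgns` is hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_sgns(pairs, vocab, dim=8, k=2, epochs=300, lr=0.1, seed=0):
    """Skip-gram with negative sampling on (target, context) pairs; returns (P, M)."""
    rng = random.Random(seed)
    # Steps 1-2: random initial vectors for targets (P) and contexts (M).
    P = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    M = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
    for _ in range(epochs):
        for target, context in pairs:  # Step 3: positive examples
            # Step 4: k noise words per positive pair, drawn uniformly (a simplification).
            noise = [rng.choice(vocab) for _ in range(k)]
            for word, label in [(context, 1.0)] + [(n, 0.0) for n in noise]:
                t, c = P[target], M[word]
                # Step 5: logistic regression gradient, (prediction - label).
                g = sigmoid(dot(t, c)) - label
                for d in range(dim):
                    t_d = t[d]
                    t[d] -= lr * g * c[d]
                    c[d] -= lr * g * t_d
    return P, M

vocab = ["apricot", "cherry", "laptop", "phone", "jam", "tablespoon", "battery", "screen"]
pairs = [
    ("apricot", "jam"), ("apricot", "tablespoon"),
    ("cherry", "jam"), ("cherry", "tablespoon"),
    ("laptop", "battery"), ("laptop", "screen"),
    ("phone", "battery"), ("phone", "screen"),
]
P, M = train_sgns(pairs, vocab)
# Step 6: P holds the embeddings we keep; the classifier itself is discarded.
embeddings = P
```

After training, one would expect words that share contexts (apricot and cherry here) to end up with more similar vectors than words that do not, although on data this small the result depends on the random seed.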

Usefulness of Word Embeddings

  • Can be used as features in classifiers
  • Capture generalizations across word types
  • Can be used to analyze language usage patterns in large corpora
    • e.g., to study change in word meaning

Tracking Changes in Meaning

  • In the early 20th century broadcast referred to “casting out seeds”; with the rise of television and radio its meaning shifted to “transmitting signals”.

Embedding Learning Algorithms

  • Word2Vec [1]
    • [1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • GloVe [2] - Global Vectors for Word Representation
    • exploit global statistical information
    • [2] Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
  • fastText [3]
    • exploit character level information, useful for Out Of Vocabulary (OOV) words
    • [3] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135-146.

GloVe - Global Vectors for Word Representation

  • Goal:
    • Takes advantage of global count statistics instead of only local information.
    • Learns embeddings from a co-occurrence matrix, training word vectors so that their differences predict co-occurrence ratios.

GloVe Advantage

  • The model leverages statistical information by training only on the non-zero elements in a word-word co-occurrence matrix, rather than:
    • on the entire sparse matrix (e.g., SVD)
    • on individual context windows in a large corpus (e.g. word2vec)
  • Global corpus statistics are captured directly by the model

GloVe - Example

  • Can certain aspects of meaning be extracted directly from co-occurrence probabilities?
  • Consider two words i and j that exhibit a particular aspect of interest; for concreteness, suppose we are interested in the concept of thermodynamic phase, for which we might take i = ice and j = steam.
  • The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities, P(k | i) / P(k | j), with various “probe” words (i.e., context words) k.

GloVe - Example (continued)

  • For words k related to i = ice but not j = steam, say k = solid, the ratio should be large.
  • Similarly, for words k related to j = steam but not i = ice, say k = gas, the ratio should be small.
  • For words k like water or fashion, that are either related to both i = ice and j = steam, or to neither, the ratio should be close to 1.
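
The three cases can be checked numerically with a short sketch. The counts below are invented purely for illustration (they are not the figures from the GloVe paper), and the helper name `cooccurrence_prob` is an assumption.

```python
def cooccurrence_prob(counts, word, probe):
    """P(probe | word): share of word's co-occurrences that involve probe."""
    return counts[word][probe] / sum(counts[word].values())

# Hypothetical co-occurrence counts, invented to illustrate the three cases.
counts = {
    "ice":   {"solid": 80, "gas": 4,  "water": 300, "fashion": 2, "other": 9614},
    "steam": {"solid": 5,  "gas": 90, "water": 280, "fashion": 2, "other": 9623},
}

# Ratio P(k | ice) / P(k | steam) for each probe word k.
ratios = {
    k: cooccurrence_prob(counts, "ice", k) / cooccurrence_prob(counts, "steam", k)
    for k in ["solid", "gas", "water", "fashion"]
}
```

With these toy counts the ratio is large for solid, small for gas, and close to 1 for water and fashion, matching the pattern described above.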

fastText: Motivation

  • Limitation of Word2Vec:
    • rare words
    • Out Of Vocabulary (OOV) words
    • Blends: Obamacare, mockumentary
    • Noise due to spelling errors: signficant
  • Solution: exploit character level information

fastText

  • FastText is an extension of word2vec
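  • Each word is represented as itself plus a bag of constituent character n-grams, with special boundary symbols “<” and “>” added at the start and end of the word
  • For example, with n = 3 the word “where” would be represented by the sequence <where> plus the character n-grams: <wh, whe, her, ere, re>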
  • Parameters:
    • minimum ngram length: 3, maximum ngram length: 4
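
A minimal sketch of the n-gram extraction, using the minimum and maximum lengths listed above as defaults. The function name `char_ngrams` is an assumption; this is not the fastText library's own code.

```python
def char_ngrams(word, min_n=3, max_n=4):
    """Character n-grams of the word wrapped in '<' and '>' boundary symbols."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(wrapped[i:i + n] for i in range(len(wrapped) - n + 1))
    return grams

trigrams = char_ngrams("where", min_n=3, max_n=3)
subwords = char_ngrams("where")  # lengths 3 and 4, as in the parameters above
# The whole word is kept as one extra unit alongside its n-grams.
units = ["<where>"] + subwords
```

Because a misspelled or unseen word still shares most of its n-grams with known words, a vector can be assembled for it from the subword vectors.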

fastText: Main Characteristics

  • Subword Embeddings
  • It breaks words down into smaller character n-grams (subwords) and learns embeddings for these subwords.
  • This allows FastText to capture morphological and syntactic information, making it effective for handling out-of-vocabulary words and languages with rich morphology.

Properties of Embeddings

  • Similarity depends on how we defined the context
  • Small context window size, ±2
    • nearest words are syntactically similar words in same taxonomy:
    • Hogwarts nearest neighbors are other fictional schools:
      • Sunnydale
      • Evernight
  • Large context window size, ±5
    • nearest words are related words in same semantic field:
    • Hogwarts nearest neighbors are Harry Potter world:
      • Dumbledore
      • Malfoy

Analogy: Embeddings Capture Relational Meaning

  • Sometimes referred to as the classic parallelogram model of analogical reasoning (Rumelhart and Abrahamson 1973)
    • to solve: “apple is to tree as grape is to ___”
  • Given the analogy a : a* , b : b* where word b* is to be found
  • We actually search for a word that is similar to b and a*, but different from a
    • man : woman , king : queen
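
The parallelogram method can be sketched as vector arithmetic followed by a nearest-neighbour search. The two-dimensional embeddings are invented so that the arithmetic is easy to follow, and excluding the three input words from the candidates is a common convention assumed here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(a, a_star, b, embeddings):
    """Return the word closest to (b - a + a*) by cosine, excluding the three input words."""
    query = [vb - va + vs for va, vs, vb in zip(embeddings[a], embeddings[a_star], embeddings[b])]
    candidates = [w for w in embeddings if w not in (a, a_star, b)]
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))

# Hypothetical 2-d embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
embeddings = {
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "apple": [-0.8, 0.1],
}
answer = solve_analogy("man", "woman", "king", embeddings)
```

Here king - man + woman lands exactly on the toy vector for queen; with real embeddings the match is only approximate.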

Analogy Evaluation and Hyperparameters

  • More data helps
    • Wikipedia is better than news text!
  • Dimensionality
    • Good dimension is ~300

Evaluation of Word Embeddings

  1. Intrinsic evaluation: Evaluate the representation directly without training another model
    • Typically simple tasks where success or failure is (almost) entirely a function of the representation
    • Easy to compute, but doesn’t say much about the embeddings as features
  2. Extrinsic evaluation: Evaluate the impact of the representation on another task
    • The model for the other task is typically a neural network
    • Can be more practically useful, but depends on the quality of the model for the task being tested

Embeddings Reflect Cultural Bias

  • Ask “Paris : France :: Tokyo : x”
    • x = Japan
  • Ask “father : doctor :: mother : x”
    • x = nurse
  • Ask “man : computer programmer :: woman : x”
    • x = homemaker

Biases in Word Embeddings

  • Implicit Association test (Greenwald et al 1998): How associated are
    • concepts (flowers, insects) & attributes (pleasantness, unpleasantness)?
    • Studied by measuring timing latencies for categorization.
  • Psychological findings on US participants:
    • African-American names are associated with unpleasant words (more than European-American names)
    • Male names associated more with math, female names with arts
    • Old people's names with unpleasant words, young people with pleasant words.
  • Caliskan et al. replication with embeddings:
    • African-American names (Leroy, Shaniqua) had a higher GloVe cosine with unpleasant words (abuse, stink, ugly)
    • European American names (Brad, Greg, Courtney) had a higher cosine with pleasant words (love, peace, miracle)
  • Embeddings reflect and replicate all sorts of pernicious biases!
  • Example: “How to make a racist AI without really trying”
    • Social media, customer feedback
    • Computing sentiment scores for sentences of text

Impact of Contexts

  • Context window size: Should we use large or small context windows?
    • Large context windows make topically similar words closer (e.g., sport, baseball, referee are grouped)
    • Smaller context windows focus on syntactic or functional similarities (e.g., batting, running, jumping are grouped)
  • Positional contexts: Should the context features be different for words in different positions?
    • Eg: If cat is the word immediately before word 1 but appears two words after word 2, should both instances of cat be treated as the same context?
      • Or should they be treated differently by encoding the position in the context?
    • Positional contexts seem to help if we care about grouping syntactic function or words with similar parts-of-speech
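
The difference between plain and positional contexts can be sketched as follows. The feature format `word_offset` and the function name `context_features` are assumptions chosen for this example.

```python
def context_features(tokens, i, window=2, positional=False):
    """Context features for the word at index i, optionally tagged with relative position."""
    features = []
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or j < 0 or j >= len(tokens):
            continue
        features.append(f"{tokens[j]}_{offset:+d}" if positional else tokens[j])
    return features

tokens = "the cat sat on the mat".split()
plain = context_features(tokens, 2)                         # context of "sat"
positional = context_features(tokens, 2, positional=True)   # same context, position-tagged
```

In the plain version both occurrences of "the" are the same feature; in the positional version "the_-2" and "the_+2" are distinct columns of the term-context matrix.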

Syntactic Windows

  • Idea: Instead of using proximal words in the sentence, use the dependency tree to decide on which words are proximal

Preprocessing Text for Word Embeddings

  • Several choices available
    • Should the words be lemmatized?
      • good, better, best map to good
      • give, gives, giving, gave, etc map to give
    • Should words retain their capitalization?
      • Eg: Should Apple and apple be treated as the same word?
    • Should very rare or frequent words be filtered out?
      • Eg: of vs. octothorpe
    • Should some sentences be filtered out?
      • Eg: Long sentences, short sentences
  • And many more. Can be treated as hyperparameters
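
Treating these choices as hyperparameters could look like the sketch below, where each choice is a function argument. The regular-expression tokenizer, the parameter names, and the example sentences are assumptions; lemmatization is left out because it needs an external resource.

```python
import re
from collections import Counter

def preprocess(sentences, lowercase=True, min_count=1, max_len=None):
    """Tokenize, optionally lowercase, drop long sentences, and filter rare words."""
    tokenized = []
    for sentence in sentences:
        text = sentence.lower() if lowercase else sentence
        tokens = re.findall(r"\w+", text)
        if max_len is not None and len(tokens) > max_len:
            continue  # skip sentences longer than max_len tokens
        tokenized.append(tokens)
    freq = Counter(t for tokens in tokenized for t in tokens)
    # Keep only words seen at least min_count times in the whole corpus.
    return [[t for t in tokens if freq[t] >= min_count] for tokens in tokenized]

sentences = ["Apple released a phone.", "An apple a day.", "Octothorpe!"]
cased = preprocess(sentences, lowercase=False)
lowered = preprocess(sentences, lowercase=True, min_count=2)
```

With `lowercase=False`, Apple and apple stay separate words; with lowercasing and `min_count=2`, they merge and rare words such as octothorpe are dropped.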

Text Pre-processing

  • Common pre-processing:
    • Tokenization
    • Normalization (Lowercasing, handling numerals, special characters, punctuation, etc.)
    • Stop word removal
    • Lemmatization or Stemming
  • In word embedding representation not all of these steps are always necessary.

Pre-processing for GloVe and Word2Vec

  • Tokenization
    • Necessary
    • Both GloVe and Word2Vec work with individual words as tokens, so you must tokenize your text.
  • Normalization (Lowercasing, handling numerals, special characters, punctuation, etc.)
    • Optional
    • Depending on the task
    • To be verified w.r.t. pre-trained versions of word2vec and GloVe

Pre-processing for GloVe and Word2Vec: Stop word removal

  • The specific vocabulary of a model depends on the training data and the preprocessing choices made during model training.
  • Whether stop words appear in the vocabulary of a pre-trained GloVe or word2vec model varies from model to model, so check the model you are using.
  • If you train your own model, you have control over whether to include or exclude stop words from the vocabulary.
  • Optional
  • Removing stop words can reduce noise, but it is not always necessary, especially if the model includes stop words in its vocabulary.

Pre-processing for GloVe and Word2Vec: Lemmatization and Stemming

  • Pretrained GloVe and Word2Vec models typically do not stem or lemmatize words as part of their training process.
  • These models are trained on large text corpora and generally use the original word forms from the text data.
  • Stemming or lemmatizing words would change word forms and potentially disrupt the context in which they appear.
  • GloVe and Word2Vec models rely on word co-occurrences, and altering the words could hinder their ability to capture meaningful relationships between words.
  • Optional
  • Stemming and lemmatization are preprocessing steps that can be applied to text data before training models like GloVe and Word2Vec when creating custom embeddings.

Word Embedding Pre-processing

  • Pay attention to the type of tokenization the model is based on.
  • Understand the characteristics of the task for which the model is used.
  • Check how the pre-trained model was built, so that your other preprocessing steps are consistent with it.

Problems/Open Research Questions

  • Antonyms tend to be embedded together
  • Unclear how similarity is defined
    • cat closer to dog or tiger?
  • Embeddings may exhibit gender, racial, ethnic and other social biases
    • Eg: female names are embedded closer to stereotypically female social roles
  • Obvious things are not talked about in text
    • Eg: most sheep are white, but “black sheep” may be more frequent than “white sheep”