
Lecture Notes on Topic Modeling and Word Embeddings

Transforming Text into Numbers

  • The challenge is transforming text data into numerical representations (vectors or matrices) for analysis.
  • Initial approaches involve breaking text into words and grouping different forms of the same word using stemming or lemmatization.
  • Limitation: These methods often disregard the real meaning of words and cannot handle words with multiple meanings or phrases where meaning arises from the combination of words.
  • Applying TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how frequent it is within a document and how rare it is across the corpus, which helps surface the most informative words in long texts (see the sketch below).
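
The snippet below is a minimal sketch of TF-IDF weighting using scikit-learn's TfidfVectorizer; the three-document corpus is invented purely for illustration.

```python
# Minimal TF-IDF sketch using scikit-learn (the tiny corpus is made up).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "air pollution from power plants harms the environment",
    "the power grid needs cleaner energy sources",
    "genetic data reveals how genes influence the brain",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)          # sparse document-term matrix

# Words that are frequent in one document but rare in the corpus get high weights.
terms = vectorizer.get_feature_names_out()
for doc_id, row in enumerate(X.toarray()):
    top = sorted(zip(terms, row), key=lambda t: -t[1])[:3]
    print(doc_id, [(word, round(score, 2)) for word, score in top])
```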

Addressing Semantic Understanding

  • The goal is to overcome the limitations of bag-of-words approaches by incorporating semantic understanding.
  • Two techniques to achieve this: topic modeling and word embeddings.

Sparsity Problem and Dimensionality Reduction

  • Vectors created from text are often very sparse, meaning they contain many zeros.
  • Sparsity: Presence of many zeros in the vector representation.
  • The document-term matrix has one row per document in the corpus and one column per word in the vocabulary.
  • Sparsity leads to:
    • Wasted space and processing time.
    • Difficulty in calculating similarity between vectors.
  • Challenge: dimensionality reduction is needed to compress the matrix and remove the zeros (a sketch of how sparse these matrices get follows this list).
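
A quick way to see the sparsity problem is to build a small document-term matrix and count its non-zero entries. The snippet below is only a sketch with a made-up three-document corpus; real corpora have thousands of columns and far higher sparsity.

```python
# Sketch of how sparse a document-term matrix is (hypothetical mini-corpus).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "air pollution and power plants",
    "genes and dna in the brain",
    "data about air quality over time",
]

X = CountVectorizer().fit_transform(docs)     # stored as a SciPy sparse matrix
n_docs, n_terms = X.shape
sparsity = 1.0 - X.nnz / (n_docs * n_terms)   # fraction of zero entries

print(f"{n_docs} documents x {n_terms} terms, {sparsity:.0%} of entries are zero")
```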

Topic Modeling

  • Topic Modeling: An unsupervised technique for finding topics in a matrix.
  • It can be thought of as reordering the rows (documents) and columns (words) of the matrix to reveal patterns where groups of words are prevalent in groups of documents.
  • These groups represent topics.
  • Example: A topic could be related to "air pollution" with words like "air," "pollution," "power," and "environmental."
  • This is not hard clustering like K-means; a word can belong to multiple clusters or topics.
  • Topic modeling is a dimensionality reduction technique.

Latent Dirichlet Allocation (LDA)

  • LDA is a popular probabilistic topic model.
  • The core idea involves decomposing the document-word matrix into two matrices:
    • Document-topic representation.
    • Term-topic representation.
  • Documents are classified based on topics, and words are associated with topics.
  • This results in two much denser, lower-dimensional matrices (see the sketch below).
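
The sketch below illustrates this decomposition with scikit-learn's LatentDirichletAllocation on a tiny made-up corpus: the original document-term matrix is replaced by a small document-topic matrix and a topic-term matrix.

```python
# Sketch of LDA as a decomposition of the document-term matrix into a
# document-topic matrix and a topic-term matrix (corpus is made up).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "air pollution from power plants hurts the environment",
    "environmental rules limit air pollution and power emissions",
    "genes and dna shape how the brain develops",
    "genetic data links dna variants to brain function",
]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)                       # documents x terms

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)                     # documents x topics
topic_term = lda.components_                         # topics x terms

print("document-term:", X.shape)
print("document-topic:", doc_topic.shape)
print("topic-term:", topic_term.shape)
```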

Practical Application of Topic Modeling

  • Example: Applying topic modeling to abstracts from scientific papers.
  • Identified topics may include "genetics" (gene, DNA, genetic), "life," "brain," and "data."
  • Topic modeling helps summarize texts and understand the main themes.
  • A profile or signature can be created for each document based on the prevalence of different topics.
  • Topic modeling can track interest in topics over time.
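
As a rough illustration of such a topic "signature", the sketch below fits LDA on a handful of invented abstracts, prints the top words of each topic, and shows each document's topic proportions.

```python
# Sketch: summarizing topics by their top words and building a per-document
# topic profile ("signature") on a made-up set of abstracts.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "genes and dna determine genetic traits",
    "dna sequencing reveals genetic variation",
    "brain imaging tracks how neurons respond",
    "large data sets require careful analysis",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in np.argsort(weights)[::-1][:3]]
    print(f"topic {k}: {top_words}")

# Each row is a document's topic profile (topic proportions sum to 1).
print(np.round(lda.transform(X), 2))
```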

Advantages of Topic Modeling

  • Language agnostic: No need to understand the language.
  • Unsupervised: no labeled data is required, although the number of topics must be specified in advance.
  • Quick and common way to summarize large amounts of text.

Drawbacks of Topic Modeling

  • Not a well-defined generative model; it struggles with generalizing to new documents.
  • Every time a new document is added, the model needs to be rerun.
  • Computationally intensive for large datasets.
  • Prone to overfitting (depending on the number of topics).
  • Doesn't work well for short texts.

Word Embeddings

  • Word embeddings aim to transform words into dense vectors that capture semantic relationships.
  • Dense vectors are preferred over sparse vectors to save space and resources.
  • Vectors can represent word morphology, context, global corpus statistics, and relationships between words.
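
As a toy contrast between the two representations, the following NumPy sketch compares a one-hot vector with a small dense embedding; all the numbers (vocabulary size, vector values) are made up.

```python
import numpy as np

# One-hot representation: one dimension per vocabulary word, almost all zeros.
vocab_size = 50_000                       # hypothetical vocabulary size
one_hot = np.zeros(vocab_size)
one_hot[17] = 1.0                         # a single 1 marks one specific word

# Dense embedding: a few hundred (here just 4) real-valued dimensions.
emb_cat = np.array([0.21, -0.33, 0.05, 0.71])   # made-up values
emb_dog = np.array([0.19, -0.30, 0.11, 0.65])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similar words end up with similar dense vectors -> high cosine similarity.
print(cosine(emb_cat, emb_dog))
```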

Creating Word Embeddings

  • Start with a corpus and transform each word into a one-hot vector.
  • A one-hot vector is a vector with a single one and all other values as zeros.
  • Use a neural network to predict the next word in a sequence.
  • The hidden layer of the neural network represents words in a lower-dimensional space.
  • The neural network is trained to adjust weights to accurately predict the next word.
  • Once training finishes, the weights learned in the hidden layer are taken as the word embeddings (a from-scratch sketch follows below).
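
The following is a from-scratch NumPy sketch of this idea: one-hot inputs feed a small hidden layer, the network is trained to predict the next word, and the hidden-layer weights become the embeddings. The toy corpus, embedding size, and learning rate are arbitrary choices for illustration.

```python
# Learning word embeddings by predicting the next word (toy example).
import numpy as np

corpus = "the king rules the kingdom and the queen rules the kingdom".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                        # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # hidden-layer weights -> embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # output-layer weights

# Training pairs: (current word, next word).
pairs = [(word_to_id[corpus[i]], word_to_id[corpus[i + 1]])
         for i in range(len(corpus) - 1)]

lr = 0.1
for _ in range(200):
    for current, target in pairs:
        h = W_in[current]                   # one-hot input just selects this row
        scores = h @ W_out
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                # softmax over the vocabulary

        grad_scores = probs.copy()
        grad_scores[target] -= 1.0          # gradient of the cross-entropy loss
        grad_h = W_out @ grad_scores
        W_out -= lr * np.outer(h, grad_scores)
        W_in[current] -= lr * grad_h

# Each row of W_in is now a dense embedding for one word.
print({w: np.round(W_in[word_to_id[w]], 2) for w in ("king", "queen")})
```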

Applications of Word Embeddings

  • Predicting the next word (similar to predictive text on cell phones).
  • Capturing relationships between words.
  • Performing algebraic operations on word vectors (e.g., king - man + woman ≈ queen).
  • Representing verb tenses (e.g., walking - walked).
  • Finding words with similar meanings close to each other in a high-dimensional space.
  • Facilitating machine translation by aligning word embeddings across languages.

Example: King, Queen, Man Relationship

  • Word embeddings can capture gender relationships.
  • Subtracting the vector for “queen” from the vector for “king” yields roughly the same offset as subtracting “woman” from “man”; this shared offset encodes the gender relationship.
  • This demonstrates that word embeddings can capture semantic meanings based on relationships.
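
The sketch below reproduces this kind of vector arithmetic with gensim; it assumes the pretrained "glove-wiki-gigaword-50" vectors are available through gensim's downloader, but any trained set of word vectors exposed as KeyedVectors would behave the same way.

```python
import numpy as np
import gensim.downloader as api

# Load pretrained GloVe vectors (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Words with similar meanings sit close together in the embedding space.
print(vectors.most_similar("pollution", topn=5))

# The king-queen offset is close to the man-woman offset (cosine similarity).
royal = vectors["king"] - vectors["queen"]
gender = vectors["man"] - vectors["woman"]
print(round(float(royal @ gender / (np.linalg.norm(royal) * np.linalg.norm(gender))), 2))
```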

Handling Words with Multiple Meanings

  • Represent different senses of a word with different one-hot vectors.
  • For example, “queen” (the person), “queen” (the song), and “queen” (chess piece) would have distinct vectors.

Tracking Word Usage

  • Word embeddings can track how the usage of a word evolves over time.