Lecture Notes on Topic Modeling and Word Embeddings
Transforming Text into Numbers
- The challenge is transforming text data into numerical representations (vectors or matrices) for analysis.
- Initial approaches involve breaking text into words (tokenization) and collapsing different forms of the same word using stemming or lemmatization.
- Limitation: These methods often disregard the real meaning of words and cannot handle words with multiple meanings or phrases where meaning arises from the combination of words.
- Applying TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how frequent it is in a document and how rare it is across the corpus, helping to identify the words that characterize each document in long texts.
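As a rough illustration, the sketch below computes TF-IDF weights with scikit-learn; the toy documents and the library choice are assumptions for illustration, not from the lecture.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative only).
docs = [
    "air pollution from power plants",
    "power plants and environmental policy",
    "genetic data from dna sequencing",
]

# TF-IDF gives higher weight to words that are frequent in a document
# but rare across the corpus, and lower weight to ubiquitous words.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)         # sparse matrix: documents x terms

print(vectorizer.get_feature_names_out())  # vocabulary (the columns)
print(X.toarray().round(2))                # TF-IDF weight of each word per document
```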
Addressing Semantic Understanding
- The goal is to overcome the limitations of bag-of-words approaches by incorporating semantic understanding.
- Two techniques to achieve this: topic modeling and word embeddings.
Sparsity Problem and Dimensionality Reduction
- Vectors created from text are often very sparse, meaning they contain many zeros.
- Sparsity: Presence of many zeros in the vector representation.
- The document-term matrix has one row per document and one column per word in the corpus vocabulary.
- Sparsity leads to:
  - Wasted space and processing time.
  - Difficulty in calculating similarity between vectors.
- Challenge: Dimensionality reduction is needed to compress the matrix and remove zeros.
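To make the sparsity problem concrete, here is a small sketch (toy corpus assumed) that builds a document-term matrix and measures the fraction of zero entries:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus; real corpora have thousands of documents and terms.
docs = [
    "air pollution and power plants",
    "environmental impact of air pollution",
    "gene expression and dna data",
    "brain imaging data analysis",
]

X = CountVectorizer().fit_transform(docs)  # rows: documents, columns: words
dense = X.toarray()

sparsity = (dense == 0).sum() / dense.size
print(f"shape: {dense.shape}, fraction of zeros: {sparsity:.0%}")
# Most entries are zero because each document uses only a small part of the
# corpus vocabulary; dimensionality reduction compresses these columns.
```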
Topic Modeling
- Topic Modeling: An unsupervised technique for discovering topics in a collection of documents represented as a document-term matrix.
- Conceptually, reordering the rows and columns of the matrix reveals patterns in which groups of words are prevalent within groups of documents.
- These groups represent topics.
- Example: A topic could be related to "air pollution" with words like "air," "pollution," "power," and "environmental."
- This is not hard clustering like K-means; a word can belong to multiple clusters or topics.
- Topic modeling is a dimensionality reduction technique.
Latent Dirichlet Allocation (LDA)
- LDA is a popular probabilistic topic model.
- The core idea involves decomposing the document-word matrix into two matrices:
  - A document-topic representation.
  - A term-topic representation.
- Each document is represented as a mixture of topics, and each topic as a distribution over words.
- This results in matrices with fewer zeros and lower dimensions.
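A minimal sketch of this decomposition using scikit-learn's LatentDirichletAllocation; the corpus, the choice of two topics, and the parameters are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "air pollution from power plants harms the environment",
    "environmental rules limit power plant pollution",
    "gene and dna studies reveal genetic variation",
    "dna sequencing produces large genetic data sets",
]

# Document-word matrix: rows are documents, columns are words.
X = CountVectorizer().fit_transform(docs)

# Decompose into two lower-dimensional matrices with 2 topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # document-topic representation, shape (n_docs, 2)
topic_term = lda.components_       # topic-term representation, shape (2, n_terms)

print(doc_topic.shape, topic_term.shape)
```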
Practical Application of Topic Modeling
- Example: Applying topic modeling to abstracts from scientific papers.
- Identified topics may include "genetics" (gene, DNA, genetic), "life," "brain," and "data."
- Topic modeling helps summarize texts and understand the main themes.
- A profile or signature can be created for each document based on the prevalence of different topics.
- Topic modeling can track interest in topics over time.
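Continuing in the same spirit, a hedged sketch of how one might list the top words per topic and read off each document's topic profile (the abstracts and the topic count are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "gene expression and dna variation in genetic studies",
    "dna sequencing methods for genetic analysis",
    "air pollution from power plants and environmental policy",
    "environmental monitoring of air quality and pollution",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)        # each row is a document's topic profile
terms = vectorizer.get_feature_names_out()

# Summarize each topic by its highest-weighted words.
for k, weights in enumerate(lda.components_):
    top = terms[np.argsort(weights)[::-1][:4]]
    print(f"topic {k}: {', '.join(top)}")

# Each document's signature: its mixture over topics (rows sum to 1).
print(doc_topic.round(2))
```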
Advantages of Topic Modeling
- Language agnostic: No need to understand the language.
- Unsupervised: no labeled training data is required, although the number of topics must be specified in advance.
- Quick and common way to summarize large amounts of text.
Drawbacks of Topic Modeling
- Not a well-defined generative model; it struggles with generalizing to new documents.
- Every time a new document is added, the model needs to be rerun.
- Computationally intensive for large datasets.
- Prone to overfitting (depending on the number of topics).
- Doesn't work well for short texts.
Word Embeddings
- Word embeddings aim to transform words into dense vectors that capture semantic relationships.
- Dense vectors are preferred over sparse vectors to save space and resources.
- Vectors can represent word morphology, context, global corpus statistics, and relationships between words.
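Closeness between dense vectors is usually measured with cosine similarity; the sketch below uses made-up 4-dimensional vectors purely for illustration (real embeddings typically have 100-300 dimensions).

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: 1.0 means same direction, near 0.0 means unrelated.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up dense vectors, for illustration only.
vec = {
    "air":       np.array([0.9, 0.1, 0.0, 0.2]),
    "pollution": np.array([0.8, 0.2, 0.1, 0.3]),
    "dna":       np.array([0.0, 0.9, 0.8, 0.1]),
}

print(cosine(vec["air"], vec["pollution"]))  # high: related words point the same way
print(cosine(vec["air"], vec["dna"]))        # low: unrelated words
```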
Creating Word Embeddings
- Start with a corpus and transform each word into a one-hot vector.
- A one-hot vector is a vector with a single one and all other values as zeros.
- Use a neural network to predict the next word in a sequence.
- The hidden layer of the neural network represents words in a lower-dimensional space.
- The neural network is trained to adjust weights to accurately predict the next word.
- Once the training phase finishes, the weights in the hidden layer represent the word embeddings.
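A minimal numpy sketch of this process under toy assumptions (the tiny corpus, embedding size, learning rate, and epoch count are all made up): each one-hot input selects a row of the input weight matrix, the network is trained to predict the next word, and after training those rows serve as the word embeddings.

```python
import numpy as np

corpus = "the king rules the land the queen rules the land".split()  # toy corpus
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                 # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # input -> hidden weights (become the embeddings)
W_out = rng.normal(scale=0.1, size=(D, V))   # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(200):                          # train to predict the next word
    for cur, nxt in zip(corpus[:-1], corpus[1:]):
        i, j = idx[cur], idx[nxt]
        h = W_in[i]                           # hidden layer: the row picked by the one-hot input
        p = softmax(h @ W_out)                # predicted distribution over the next word
        grad = p.copy()
        grad[j] -= 1.0                        # cross-entropy gradient at the output
        grad_h = W_out @ grad                 # backpropagate into the hidden layer
        W_out -= lr * np.outer(h, grad)
        W_in[i] -= lr * grad_h

embeddings = W_in                             # rows of W_in are the learned word vectors
print(embeddings[idx["king"]].round(2))
```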
Applications of Word Embeddings
- Predicting the next word (similar to predictive text on cell phones).
- Capturing relationships between words.
- Performing algebraic operations on word vectors (e.g., king - man + woman ≈ queen).
- Representing verb tenses (e.g., walking - walked).
- Finding words with similar meanings close to each other in the embedding space.
- Facilitating machine translation by aligning word embeddings across languages.
Example: King, Queen, Man Relationship
- Word embeddings can capture gender relationships.
- Subtracting the vector for “queen” from the vector for “king” yields a vector similar to the difference between the vectors for “man” and “woman”: the gender relationship corresponds to a consistent direction in the embedding space.
- This demonstrates that word embeddings can capture semantic meanings based on relationships.
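A hedged sketch of this arithmetic, assuming pretrained GloVe vectors can be fetched through gensim's downloader (the model name "glove-wiki-gigaword-50" and the exact nearest neighbors depend on the vectors used):

```python
import numpy as np
import gensim.downloader as api

# Assumes network access to download the small pretrained GloVe vectors.
wv = api.load("glove-wiki-gigaword-50")

# king - man + woman should land near "queen".
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Equivalently, the difference king - queen points in roughly the same
# direction as man - woman: the gender relation is a direction in the space.
d1 = wv["king"] - wv["queen"]
d2 = wv["man"] - wv["woman"]
cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(float(cos), 2))
```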
Handling Words with Multiple Meanings
- Represent different senses of a word with different one-hot vectors.
- For example, “queen” (the person), “queen” (the song), and “queen” (chess piece) would have distinct vectors.
Tracking Word Usage
- Word embeddings can track how the usage of a word evolves over time.