Lecture Notes on Topic Modeling and Word Embeddings
Transforming Text into Numbers
- The challenge is transforming text data into numerical representations (vectors or matrices) for analysis.
- Initial approaches involve breaking text into words (tokenization) and collapsing different forms of the same word using stemming or lemmatization.
- Limitation: These methods often disregard the real meaning of words and cannot handle words with multiple meanings or phrases where meaning arises from the combination of words.
- Applying TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how frequent it is in a document and how rare it is across the corpus, helping to identify the words that characterize each document in long texts.
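As a rough illustration, the sketch below computes TF-IDF weights with scikit-learn; the toy documents and the library choice are assumptions for illustration, not from the lecture.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative only).
docs = [
    "air pollution from power plants",
    "power plants and environmental policy",
    "genetic data from dna sequencing",
]

# TF-IDF gives higher weight to words that are frequent in a document
# but rare across the corpus, and lower weight to ubiquitous words.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)         # sparse matrix: documents x terms

print(vectorizer.get_feature_names_out())  # vocabulary (the columns)
print(X.toarray().round(2))                # TF-IDF weight of each word per document
```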
Addressing Semantic Understanding
- The goal is to overcome the limitations of bag-of-words approaches by incorporating semantic understanding.
- Two techniques to achieve this: topic modeling and word embeddings.
Sparsity Problem and Dimensionality Reduction
- Vectors created from text are often very sparse, meaning they contain many zeros.
- Sparsity: Presence of many zeros in the vector representation.
- The document-term matrix has one row per document and one column per word in the corpus vocabulary.
- Sparsity leads to:
  - Wasted space and processing time.
  - Difficulty in calculating similarity between vectors.
- Challenge: Dimensionality reduction is needed to compress the matrix and remove zeros.
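To make the sparsity problem concrete, here is a small sketch (toy corpus assumed) that builds a document-term matrix and measures the fraction of zero entries:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative corpus; real corpora have thousands of documents and terms.
docs = [
    "air pollution and power plants",
    "environmental impact of air pollution",
    "gene expression and dna data",
    "brain imaging data analysis",
]

X = CountVectorizer().fit_transform(docs)  # rows: documents, columns: words
dense = X.toarray()

sparsity = (dense == 0).sum() / dense.size
print(f"shape: {dense.shape}, fraction of zeros: {sparsity:.0%}")
# Most entries are zero because each document uses only a small part of the
# corpus vocabulary; dimensionality reduction compresses these columns.
```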
Topic Modeling
- Topic Modeling: An unsupervised technique for discovering topics in a collection of documents represented as a document-term matrix.
- Conceptually, reordering the rows and columns of the matrix reveals patterns in which groups of words are prevalent within groups of documents.
- These groups represent topics.
- Example: A topic could be related to "air pollution" with words like "air," "pollution," "power," and "environmental."
- This is not hard clustering like K-means; a word can belong to multiple clusters or topics.
- Topic modeling is a dimensionality reduction technique.
Latent Dirichlet Allocation (LDA)
- LDA is a popular probabilistic topic model.
- The core idea involves decomposing the document-word matrix into two matrices:
  - A document-topic representation.
  - A term-topic representation.
- Each document is represented as a mixture of topics, and each topic as a distribution over words.
- This results in matrices with fewer zeros and lower dimensions.
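A minimal sketch of this decomposition using scikit-learn's LatentDirichletAllocation; the corpus, the choice of two topics, and the parameters are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "air pollution from power plants harms the environment",
    "environmental rules limit power plant pollution",
    "gene and dna studies reveal genetic variation",
    "dna sequencing produces large genetic data sets",
]

# Document-word matrix: rows are documents, columns are words.
X = CountVectorizer().fit_transform(docs)

# Decompose into two lower-dimensional matrices with 2 topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # document-topic representation, shape (n_docs, 2)
topic_term = lda.components_       # topic-term representation, shape (2, n_terms)

print(doc_topic.shape, topic_term.shape)
```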
Practical Application of Topic Modeling
- Example: Applying topic modeling to abstracts from scientific papers.
- Identified topics may include "genetics" (gene, DNA, genetic), "life," "brain," and "data."
- Topic modeling helps summarize texts and understand the main themes.
- A profile or signature can be created for each document based on the prevalence of different topics.
- Topic modeling can track interest in topics over time.
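Continuing in the same spirit, a hedged sketch of how one might list the top words per topic and read off each document's topic profile (the abstracts and the topic count are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    "gene expression and dna variation in genetic studies",
    "dna sequencing methods for genetic analysis",
    "air pollution from power plants and environmental policy",
    "environmental monitoring of air quality and pollution",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(abstracts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)        # each row is a document's topic profile
terms = vectorizer.get_feature_names_out()

# Summarize each topic by its highest-weighted words.
for k, weights in enumerate(lda.components_):
    top = terms[np.argsort(weights)[::-1][:4]]
    print(f"topic {k}: {', '.join(top)}")

# Each document's signature: its mixture over topics (rows sum to 1).
print(doc_topic.round(2))
```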
Advantages of Topic Modeling
- Language agnostic: No need to understand the language.
- Unsupervised: no labeled training data is required, although the number of topics must be specified in advance.
- Quick and common way to summarize large amounts of text.
Drawbacks of Topic Modeling
- Not a well-defined generative model; it struggles with generalizing to new documents.
- Every time a new document is added, the model needs to be rerun.
- Computationally intensive for large datasets.
- Prone to overfitting (depending on the number of topics).
- Doesn't work well for short texts.
Word Embeddings
- Word embeddings aim to transform words into dense vectors that capture semantic relationships.
- Dense vectors are preferred over sparse vectors to save space and resources.
- Vectors can represent word morphology, context, global corpus statistics, and relationships between words.
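Closeness between dense vectors is usually measured with cosine similarity; the sketch below uses made-up 4-dimensional vectors purely for illustration (real embeddings typically have 100-300 dimensions).

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: 1.0 means same direction, near 0.0 means unrelated.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up dense vectors, for illustration only.
vec = {
    "air":       np.array([0.9, 0.1, 0.0, 0.2]),
    "pollution": np.array([0.8, 0.2, 0.1, 0.3]),
    "dna":       np.array([0.0, 0.9, 0.8, 0.1]),
}

print(cosine(vec["air"], vec["pollution"]))  # high: related words point the same way
print(cosine(vec["air"], vec["dna"]))        # low: unrelated words
```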
Creating Word Embeddings
- Start with a corpus and transform each word into a one-hot vector.
- A one-hot vector is a vector with a single one and all other values as zeros.
- Use a neural network to predict the next word in a sequence.
- The hidden layer of the neural network represents words in a lower-dimensional space.
- The neural network is trained to adjust weights to accurately predict the next word.
- Once the training phase finishes, the weights in the hidden layer represent the word embeddings.
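A minimal numpy sketch of this process under toy assumptions (the tiny corpus, embedding size, learning rate, and epoch count are all made up): each one-hot input selects a row of the input weight matrix, the network is trained to predict the next word, and after training those rows serve as the word embeddings.

```python
import numpy as np

corpus = "the king rules the land the queen rules the land".split()  # toy corpus
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8                 # vocabulary size, embedding dimension

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))    # input -> hidden weights (become the embeddings)
W_out = rng.normal(scale=0.1, size=(D, V))   # hidden -> output weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.1
for _ in range(200):                          # train to predict the next word
    for cur, nxt in zip(corpus[:-1], corpus[1:]):
        i, j = idx[cur], idx[nxt]
        h = W_in[i]                           # hidden layer: the row picked by the one-hot input
        p = softmax(h @ W_out)                # predicted distribution over the next word
        grad = p.copy()
        grad[j] -= 1.0                        # cross-entropy gradient at the output
        grad_h = W_out @ grad                 # backpropagate into the hidden layer
        W_out -= lr * np.outer(h, grad)
        W_in[i] -= lr * grad_h

embeddings = W_in                             # rows of W_in are the learned word vectors
print(embeddings[idx["king"]].round(2))
```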
Applications of Word Embeddings
- Predicting the next word (similar to predictive text on cell phones).
- Capturing relationships between words.
- Performing algebraic operations on word vectors (e.g., king - man + woman ≈ queen).
- Representing verb tenses (e.g., walking - walked).
- Finding words with similar meanings close to each other in the embedding space.
- Facilitating machine translation by aligning word embeddings across languages.
Example: King, Queen, Man Relationship
- Word embeddings can capture gender relationships.
- Subtracting the vector for “queen” from the vector for “king” yields a vector similar to the difference between the vectors for “man” and “woman”: the gender relationship corresponds to a consistent direction in the embedding space.
- This demonstrates that word embeddings can capture semantic meanings based on relationships.
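A hedged sketch of this arithmetic, assuming pretrained GloVe vectors can be fetched through gensim's downloader (the model name "glove-wiki-gigaword-50" and the exact nearest neighbors depend on the vectors used):

```python
import numpy as np
import gensim.downloader as api

# Assumes network access to download the small pretrained GloVe vectors.
wv = api.load("glove-wiki-gigaword-50")

# king - man + woman should land near "queen".
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Equivalently, the difference king - queen points in roughly the same
# direction as man - woman: the gender relation is a direction in the space.
d1 = wv["king"] - wv["queen"]
d2 = wv["man"] - wv["woman"]
cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(float(cos), 2))
```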
Handling Words with Multiple Meanings
- Represent different senses of a word with different one-hot vectors.
- For example, “queen” (the person), “queen” (the song), and “queen” (chess piece) would have distinct vectors.
Tracking Word Usage
- Word embeddings can track how the usage of a word evolves over time.