DATA MINING PRELIMS (TOPIC MODELING)


24 Terms

1

TOPIC MODELING

is a natural language processing (NLP) technique for determining the topics in a document; it discovers patterns of words across a collection of documents.

a type of statistical language model used for uncovering hidden structure in a collection of texts.

Some applications of topic modeling include text summarization, recommender systems, and spam filters.

2

Latent Semantic Analysis (LSA/LSI)

Probabilistic Latent Semantic Analysis (pLSA)

Latent Dirichlet Allocation (LDA)

Enumerate TOPIC MODELING ALGORITHMS

3

Latent Dirichlet Allocation (LDA)

an unsupervised clustering technique that is commonly used for text analysis.

a type of topic modeling in which topics are represented as distributions over words, and documents are represented as mixtures of these topics
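As a concrete illustration of "documents as mixtures of topics", here is a minimal sketch assuming scikit-learn is available; the toy corpus and the choice of two topics are illustrative only:

```python
# Sketch: fit LDA on a tiny corpus and inspect the inferred
# document-topic mixtures (assumes scikit-learn is installed).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets rise and fall",
    "investors trade stocks daily",
]

# Convert raw text to a document-term matrix of word counts.
dtm = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit an LDA model with 2 topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # rows: documents, cols: topic proportions

print(doc_topics.shape)  # one topic mixture per document
```

Each row of `doc_topics` is one document's topic mixture and sums to 1.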

4

Dirichlet distribution

LDA places documents inside a triangle (a simplex) whose corners are the topics; the distribution over this triangle is called the ________ ________

5

alpha = 1

samples are more evenly distributed over the space
What is the value of parameter alpha?

6

alpha > 1

samples gather in the middle
What is the value of parameter alpha?

7

alpha < 1

samples tend towards the corners

What is the value of parameter alpha?
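The three alpha regimes above can be checked empirically by sampling from a Dirichlet distribution. A sketch assuming NumPy is available; the alpha values and sample size are arbitrary examples:

```python
# Sketch: sample from a symmetric 3-dimensional Dirichlet and watch
# how alpha moves samples around the topic triangle (simplex).
import numpy as np

rng = np.random.default_rng(0)

mean_max = {}
for alpha in (0.1, 1.0, 10.0):
    samples = rng.dirichlet([alpha] * 3, size=2000)
    # The mean of the largest component shows where samples sit:
    # near 1.0 -> corners (alpha < 1); near 1/3 -> middle (alpha > 1).
    mean_max[alpha] = samples.max(axis=1).mean()
    print(alpha, round(mean_max[alpha], 2))
```

As alpha grows, the largest component shrinks toward 1/3, i.e. samples move from the corners toward the middle.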

8

Parameter alpha

a k-dimensional vector in which each component corresponds to one corner (topic) of the simplex

9

Psi (ψ)

the distribution of words for each topic k

10

Phi (φ)

the distribution of topics for each document

11

alpha (α)

LDA Parameters
Dirichlet prior concentration parameter that represents document-topic density: with a higher _____, documents are assumed to be made up of more topics, resulting in a more specific topic distribution per document.

12

Beta (β)

LDA Parameters
Dirichlet prior concentration parameter that represents topic-word density: with a higher ______, topics are assumed to be made up of most of the words, resulting in a more specific word distribution per topic.
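In scikit-learn's LDA implementation these two priors appear as constructor parameters, alpha as `doc_topic_prior` and beta as `topic_word_prior`. A small sketch; the count matrix and prior values are arbitrary examples:

```python
# Sketch: setting the alpha and beta Dirichlet priors explicitly
# in scikit-learn's LatentDirichletAllocation.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Tiny count matrix: 4 documents x 6 vocabulary terms (word counts).
X = np.array([
    [3, 1, 0, 0, 0, 0],
    [2, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 1, 1, 2, 2],
])

lda = LatentDirichletAllocation(
    n_components=2,
    doc_topic_prior=0.5,    # alpha: document-topic density (example value)
    topic_word_prior=0.01,  # beta: topic-word density (example value)
    random_state=0,
).fit(X)

# components_ holds the (unnormalized) topic-word distributions.
print(lda.components_.shape)
```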

13

Theta (θ)

LDA Parameters
distribution of topics over a document as multinomial distribution

14

CORPUS

(collection of documents) can be represented as a document-term matrix.

15

Preprocess
Train
Score
Evaluate

LDA Process

16

pyLDAvis

a popular visualization package for topic models.
Better understanding and interpreting individual topics: manually select each topic to view its top most frequent and/or "relevant" terms, using different values of the λ parameter.

Better understanding the relationships between topics: exploring the Intertopic Distance Plot can help you learn how topics relate to each other, including potential higher-level structure between groups of topics.

17

Intertopic Distance Plot

Better understanding the relationships between topics: exploring the ____________________ can help you learn how topics relate to each other, including potential higher-level structure between groups of topics.

18

Eyeballing Models
Intrinsic Evaluation Metrics
Human Judgments

Approaches for the evaluation of topic models

19

Eyeballing Models

Top N words

Topics / Documents

20

Intrinsic Evaluation Metrics

Capturing model semantics

Topics interpretability

21

Human Judgments

What is a topic?

22

Topic Coherence
Perplexity

Extrinsic Evaluation Metrics

23

Topic Coherence

scores a single topic by measuring the degree of semantic similarity between high-scoring words in the topic
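To make the idea concrete, here is a simplified, standard-library-only sketch in the spirit of UMass coherence; the toy documents and the `umass_coherence` helper are illustrative, not a library implementation:

```python
# Sketch of the intuition behind UMass-style topic coherence:
# score a topic's top words by how often they co-occur in the
# same documents (higher = more coherent).
import math

docs = [
    {"cat", "dog", "pet"},
    {"cat", "dog", "food"},
    {"stock", "market", "trade"},
]

def umass_coherence(top_words, docs):
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            d_wj = sum(wj in d for d in docs)          # docs containing wj
            d_both = sum(wi in d and wj in d for d in docs)  # co-occurrences
            score += math.log((d_both + 1) / d_wj)
    return score

# A topic whose words co-occur scores higher than one whose words never do.
print(umass_coherence(["cat", "dog"], docs))
print(umass_coherence(["cat", "market"], docs))
```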

24

measures how well the model predicts unseen or held-out documents; lower perplexity indicates that the model better predicts the words in unseen documents, suggesting a better grasp of the underlying topics.

Perplexity
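Scoring held-out documents can be sketched with scikit-learn's built-in perplexity method; the train/held-out split below is a toy example:

```python
# Sketch: train LDA on one set of documents, then score held-out
# documents with model perplexity (lower = better prediction).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

train = [
    "cats and dogs",
    "dogs chase cats",
    "stocks and markets",
    "markets move stocks",
]
held_out = ["cats chase dogs", "stocks and markets move"]

# Fit the vocabulary on training data only, then reuse it everywhere.
vec = CountVectorizer().fit(train)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vec.transform(train))

# Perplexity of the held-out set under the trained model.
ppl = lda.perplexity(vec.transform(held_out))
print(ppl)
```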