DATA MINING PRELIMS (TOPIC MODELING)


24 Terms

1

TOPIC MODELING

is a natural language processing (NLP) technique for determining the topics in a document; it discovers patterns of words across a collection of documents.

a type of statistical language model used for uncovering hidden structure in a collection of texts.

Some applications of topic modeling include text summarization, recommender systems, and spam filters.

2

Latent Semantic Analysis (LSA/LSI)

Probabilistic Latent Semantic Analysis (pLSA)

Latent Dirichlet Allocation (LDA)

Enumerate TOPIC MODELING ALGORITHMS

3

Latent Dirichlet Allocation (LDA)

an unsupervised clustering technique that is commonly used for text analysis.

a type of topic modeling in which topics are represented as distributions over words, and documents are represented as mixtures of these topics
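As a concrete illustration of "documents as mixtures of topics", here is a minimal sketch assuming scikit-learn is available; the toy corpus and the choice of two topics are illustrative only:

```python
# Sketch: fit LDA on a tiny corpus and inspect the inferred
# document-topic mixtures (assumes scikit-learn is installed).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets rise and fall",
    "investors trade stocks daily",
]

# Convert raw text to a document-term matrix of word counts.
dtm = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit an LDA model with 2 topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # rows: documents, cols: topic proportions

print(doc_topics.shape)  # one topic mixture per document
```

Each row of `doc_topics` is one document's topic mixture and sums to 1.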

4

Dirichlet distribution

LDA places documents inside a triangle (a simplex) whose corners are the topics; the distribution over this triangle is called the ________ ________

5

alpha = 1

samples are more evenly distributed over the space
What is the value of parameter alpha?

6

alpha > 1

samples gather in the middle
What is the value of parameter alpha?

7

alpha < 1

samples tend towards the corners

What is the value of parameter alpha?
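The three alpha regimes above can be checked empirically by sampling from a Dirichlet distribution. A sketch assuming NumPy is available; the alpha values and sample size are arbitrary examples:

```python
# Sketch: sample from a symmetric 3-dimensional Dirichlet and watch
# how alpha moves samples around the topic triangle (simplex).
import numpy as np

rng = np.random.default_rng(0)

mean_max = {}
for alpha in (0.1, 1.0, 10.0):
    samples = rng.dirichlet([alpha] * 3, size=2000)
    # The mean of the largest component shows where samples sit:
    # near 1.0 -> corners (alpha < 1); near 1/3 -> middle (alpha > 1).
    mean_max[alpha] = samples.max(axis=1).mean()
    print(alpha, round(mean_max[alpha], 2))
```

As alpha grows, the largest component shrinks toward 1/3, i.e. samples move from the corners toward the middle.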

8

Parameter alpha

a k-dimensional vector in which each component corresponds to one corner (topic) of the simplex

9

Psi (ψ)

the distribution of words for each topic k

10

Phi (φ)

the distribution of topics for each document

11

alpha (α)

LDA Parameters
Dirichlet prior concentration parameter that represents document-topic density: with a higher _____, documents are assumed to be made up of more topics, resulting in a more specific topic distribution per document.

12

Beta (β)

LDA Parameters
Dirichlet prior concentration parameter that represents topic-word density: with a higher ______, topics are assumed to be made up of most of the words, resulting in a more specific word distribution per topic.
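In scikit-learn's LDA implementation these two priors appear as constructor parameters, alpha as `doc_topic_prior` and beta as `topic_word_prior`. A small sketch; the count matrix and prior values are arbitrary examples:

```python
# Sketch: setting the alpha and beta Dirichlet priors explicitly
# in scikit-learn's LatentDirichletAllocation.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Tiny count matrix: 4 documents x 6 vocabulary terms (word counts).
X = np.array([
    [3, 1, 0, 0, 0, 0],
    [2, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 1, 1, 2, 2],
])

lda = LatentDirichletAllocation(
    n_components=2,
    doc_topic_prior=0.5,    # alpha: document-topic density (example value)
    topic_word_prior=0.01,  # beta: topic-word density (example value)
    random_state=0,
).fit(X)

# components_ holds the (unnormalized) topic-word distributions.
print(lda.components_.shape)
```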

13

Theta (θ)

LDA Parameters
distribution of topics over a document as multinomial distribution

14

CORPUS

(collection of documents) can be represented as a document-term matrix.

15

Preprocess
Train
Score
Evaluate

LDA Process

16

pyLDAvis

a popular visualization package for topic models.
Better understanding and interpreting individual topics: manually select each topic to view its top most frequent and/or "relevant" terms, using different values of the λ parameter.

Better understanding the relationships between topics: exploring the Intertopic Distance Plot can help you learn how topics relate to each other, including potential higher-level structure between groups of topics.

17

Intertopic Distance Plot

Better understanding the relationships between topics: exploring the ____________________ can help you learn how topics relate to each other, including potential higher-level structure between groups of topics.

18

Eyeballing Models
Intrinsic Evaluation Metrics
Human Judgments

Approaches for the evaluation of topic models

19

Eyeballing Models

Top N words

Topics / Documents

20

Intrinsic Evaluation Metrics

Capturing model semantics

Topics interpretability

21

Human Judgments

What is a topic?

22

Topic Coherence
Perplexity

Extrinsic Evaluation Metrics

23

Topic Coherence

scores a single topic by measuring the degree of semantic similarity between high-scoring words in the topic
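To make the idea concrete, here is a simplified, standard-library-only sketch in the spirit of UMass coherence; the toy documents and the `umass_coherence` helper are illustrative, not a library implementation:

```python
# Sketch of the intuition behind UMass-style topic coherence:
# score a topic's top words by how often they co-occur in the
# same documents (higher = more coherent).
import math

docs = [
    {"cat", "dog", "pet"},
    {"cat", "dog", "food"},
    {"stock", "market", "trade"},
]

def umass_coherence(top_words, docs):
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            d_wj = sum(wj in d for d in docs)          # docs containing wj
            d_both = sum(wi in d and wj in d for d in docs)  # co-occurrences
            score += math.log((d_both + 1) / d_wj)
    return score

# A topic whose words co-occur scores higher than one whose words never do.
print(umass_coherence(["cat", "dog"], docs))
print(umass_coherence(["cat", "market"], docs))
```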

24

measures how well the model predicts unseen or held-out documents; lower perplexity indicates that the model better predicts the words in unseen documents, suggesting a better grasp of the underlying topics.

Perplexity
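Scoring held-out documents can be sketched with scikit-learn's built-in perplexity method; the train/held-out split below is a toy example:

```python
# Sketch: train LDA on one set of documents, then score held-out
# documents with model perplexity (lower = better prediction).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

train = [
    "cats and dogs",
    "dogs chase cats",
    "stocks and markets",
    "markets move stocks",
]
held_out = ["cats chase dogs", "stocks and markets move"]

# Fit the vocabulary on training data only, then reuse it everywhere.
vec = CountVectorizer().fit(train)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(vec.transform(train))

# Perplexity of the held-out set under the trained model.
ppl = lda.perplexity(vec.transform(held_out))
print(ppl)
```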