TOPIC MODELING
is a natural language processing (NLP) technique for determining the topics in a document.
Discover patterns of words in a collection of documents
is a type of statistical language model used for uncovering hidden structure in a collection of texts
Some applications of topic modeling include text summarization, recommender systems, and spam filters.
Latent Semantic Analysis (LSA/LSI)
Probabilistic Latent Semantic Analysis (pLSA)
Latent Dirichlet Allocation (LDA)
Enumerate TOPIC MODELING ALGORITHMS
Latent Dirichlet Allocation (LDA)
an unsupervised clustering technique that is commonly used for text analysis.
a type of topic modeling in which documents are represented as mixtures of topics, and each topic is represented as a distribution over words
Dirichlet distribution
LDA places documents inside a triangle whose corners are topics; this distribution is called the ________ _______
alpha = 1
samples are more evenly distributed over the space
Parameter alpha is equal to?
alpha > 1
samples gather in the middle
Parameter alpha is equal to?
alpha < 1
samples tend towards corners
Parameter alpha is equal to?
Parameter alpha
a k-dimensional vector, where each component corresponds to one corner (topic)
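The effect of α on where samples land in the simplex can be checked numerically. A quick sketch, assuming NumPy as the tooling, drawing symmetric Dirichlet samples over three topics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each Dirichlet sample is a point in the topic "triangle" (simplex):
# a vector of k = 3 non-negative components that sum to 1.
for alpha in (0.1, 1.0, 10.0):
    samples = rng.dirichlet([alpha] * 3, size=1000)
    # Mean of the largest component: near 1 means samples sit in the
    # corners (alpha < 1); near 1/3 means they crowd the middle (alpha > 1).
    print(alpha, round(float(samples.max(axis=1).mean()), 2))
```

The printed statistic shrinks as α grows, matching the three cards above: corners for α < 1, even spread at α = 1, middle for α > 1.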
psi
the distribution of words for each topic k
phi
the distribution of topics for each document
alpha (α)
LDA Parameters
Dirichlet prior concentration parameter that represents document-topic density: with a higher _____, documents are assumed to be made up of more topics, giving a more even topic distribution per document.
Beta (β)
LDA Parameters
Dirichlet prior concentration parameter that represents topic-word density: with a higher ______, topics are assumed to be made up of more of the words in the corpus, giving a more even word distribution per topic.
Theta (θ)
LDA Parameters
the distribution of topics over a document, modeled as a multinomial distribution
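Once a model is fitted, these two distributions are concrete matrices. A sketch assuming scikit-learn, whose `components_` attribute holds unnormalized topic-word weights and whose `transform()` yields the document-topic distribution:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["apples and oranges", "oranges and bananas", "cars and trucks"]

vec = CountVectorizer()
dtm = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# Topic-word distribution: one row per topic, one column per vocabulary term.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Document-topic distribution: one row per document, one column per topic.
doc_topic = lda.transform(dtm)

print(topic_word.shape, doc_topic.shape)
```

Each row of both matrices sums to 1, i.e. each is a proper probability distribution.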
CORPUS
(collection of documents) can be represented as a document-term matrix.
Preprocess
Train
Score
Evaluate
LDA Process
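The four steps above can be sketched end to end; this assumes scikit-learn (gensim is another common choice) and a toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market fell today",
    "investors sold stocks and bonds",
]

# 1. Preprocess: tokenize, drop stop words, build the document-term matrix.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

# 2. Train: fit an LDA model with two topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# 3. Score: infer each document's topic mixture.
doc_topics = lda.transform(dtm)   # shape (n_documents, n_topics)

# 4. Evaluate: lower perplexity indicates a better fit.
print(lda.perplexity(dtm))
```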
pyLDAvis
Popular visualization package
Better understanding and interpreting individual topics - manually select each topic to view its top most frequent and/or “relevant” terms, using different values of the λ parameter
Better understanding the relationships between the topics - exploring the Intertopic Distance Plot can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.
Intertopic Distance Plot
Better understanding the relationships between the topics - exploring the ____________________ can help you learn about how topics relate to each other, including potential higher-level structure between groups of topics.
Eye Balling Models
Intrinsic Evaluation Metrics
Human Judgments
Approaches for the evaluation of topic models
Eye Balling Models
Top N words
Topics / Documents
Intrinsic Evaluation Metrics
Capturing model semantics
Topics interpretability
Human Judgments
What is a topic
Topic Coherence
Perplexity
Extrinsic Evaluation Metrics
Topic Coherence
measures that score a single topic by the degree of semantic similarity between the high-scoring words in that topic
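The idea can be illustrated by hand with a simplified UMass-style score (a toy sketch over set-of-words documents, not a production metric; gensim's `CoherenceModel` is the usual tool):

```python
import math

# Toy corpus: each document reduced to its set of words.
docs = [
    {"cat", "dog", "pet"},
    {"cat", "dog"},
    {"stock", "bond", "market"},
    {"stock", "market"},
]

def umass_coherence(topic_words, docs):
    """Score a topic by how often its top words co-occur in documents.

    A higher (less negative) score means the words appear together more
    often, i.e. the topic is more semantically coherent.
    """
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            d_j = sum(w_j in d for d in docs)                # docs with w_j
            d_ij = sum(w_i in d and w_j in d for d in docs)  # co-occurrences
            score += math.log((d_ij + 1) / d_j)
    return score

# Words that co-occur score higher than words that never do.
print(umass_coherence(["cat", "dog"], docs))
print(umass_coherence(["cat", "stock"], docs))
```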
measures how well the model predicts unseen or held-out documents; lower perplexity means the model better predicts the words in unseen documents, suggesting a firmer grasp of the underlying topics.
Perplexity
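A sketch of held-out perplexity, assuming scikit-learn's `LatentDirichletAllocation` (its `perplexity` method scores any document-term matrix built with the training vocabulary):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_docs = [
    "cats chase mice",
    "dogs chase cats",
    "stocks rose sharply",
    "bonds and stocks fell",
]
held_out_docs = ["mice fear cats", "stocks and bonds rose"]

vec = CountVectorizer()
train_dtm = vec.fit_transform(train_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(train_dtm)

# Score unseen documents mapped onto the training vocabulary;
# words never seen in training (e.g. "fear") are simply dropped.
held_out_dtm = vec.transform(held_out_docs)
print(lda.perplexity(held_out_dtm))   # lower = better predictive fit
```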