Clustering and Topic Modelling

Last updated 3:23 PM on 7/4/26

17 Terms

1

clustering

The task of partitioning a dataset into several groups so that observations (data instances) that are similar are assigned to the same group (cluster), and instances that are dissimilar are placed in different groups.

2

Euclidean distance

This is the ordinary distance between two points.

Imagine two houses on a map.

The straight-line distance between them is the Euclidean distance.

For text:

  • Each document becomes a point.

  • Documents that are close together are considered similar.

Good for

Data where the actual magnitudes of the values matter, not only their relative proportions.
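
As a small illustration of the straight-line idea, the sketch below computes the Euclidean distance between two points. The function name `euclidean_distance` and the house coordinates are made up for this example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Square the difference in every dimension, sum, then take the root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two "houses" on a map, given as hypothetical (x, y) coordinates.
house_a = (0.0, 0.0)
house_b = (3.0, 4.0)
print(euclidean_distance(house_a, house_b))  # 5.0
```

For documents, the same function would be applied to their numeric vectors (one number per word) instead of map coordinates.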

3

Cosine distance

Instead of measuring how far apart two points are, cosine distance compares the direction of their vectors (the angle between them).

Imagine two arrows.

  • Same direction → very similar

  • Different directions → less similar

For text documents, cosine distance is often better because it ignores document length.
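
The arrow picture can be sketched in code, assuming the common definition of cosine distance as 1 minus the cosine of the angle between two vectors. The function name `cosine_distance` and the example vectors are illustrative only.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: near 0 for the same direction, 1 for a right angle."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Same direction, different length: distance is (up to rounding) zero.
same_direction = cosine_distance([1, 2], [2, 4])
# Perpendicular arrows: distance is 1.
right_angle = cosine_distance([1, 0], [0, 1])
print(same_direction, right_angle)
```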

4

Why choose Cosine instead of Euclidean?

Because documents have different lengths.

Euclidean distance treats a long document and a short document as far apart, even when they use the same words in the same proportions.

Cosine distance focuses on which words are important, not how long the document is.
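
To make the length effect concrete, here is a rough sketch with made-up word-count vectors: `long_doc` is `short_doc` repeated five times, and `other_doc` is a short document that uses mostly different words. All names and counts are hypothetical.

```python
import math

def cosine_distance(a, b):
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical counts for three words, e.g. ("goal", "player", "election").
short_doc = [2, 1, 0]
long_doc = [10, 5, 0]   # same content as short_doc, five times as long
other_doc = [0, 1, 2]   # different content, similar length

# Euclidean: the long copy looks farther away than the unrelated document.
print(math.dist(short_doc, long_doc))   # about 8.94
print(math.dist(short_doc, other_doc))  # about 2.83

# Cosine: the long copy is (almost exactly) identical, the unrelated one is not.
print(cosine_distance(short_doc, long_doc))   # about 0.0
print(cosine_distance(short_doc, other_doc))  # about 0.8
```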

5

K-means

K-means partitions the data into K clusters, where you choose K in advance: how many groups (clusters) do you want?

K = 3 → 3 clusters

Limitation: K-means only works with numerical features.

So text must first be converted into numbers (e.g., using TF-IDF).
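
A minimal sketch of the K-means loop, assuming Euclidean distance and random starting centroids picked from the data. The function name `kmeans` and the parameters `max_iter` and `seed` are choices made for this example, not part of the card.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Return (assignments, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random points
    assignments = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids

points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 8), (8, 9)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The means in step 2 are why the features must be numeric: there is no average of raw words, only of numbers derived from them.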

6

Hierarchical Clustering

Instead of immediately creating K groups, hierarchical clustering builds a tree of clusters called a dendrogram.

Later you can decide how many groups you want.

7

topic modelling

Topic modelling is another unsupervised method.

But instead of grouping documents, it discovers the hidden topics inside documents.

Example: 5000 unlabeled news articles → topic modelling discovers:

  • Topic 1 (election, government, president, parliament)

  • Topic 2 (football, coach, player, goal)

  • Topic 3 (health, hospital, doctor, patient)

It discovers these topics automatically.

8

Difference between clustering and topic modelling

Clustering groups documents → Cluster A = sports articles; Cluster B = politics articles.

Topic modelling groups words that often occur together. It then determines which topics appear inside each document:

  • A document can contain 70% politics and 30% economics.

Unlike in clustering, where each document is assigned to one cluster, a document can belong to multiple topics.

9

Two assumptions of topic modelling

  1. A document can discuss multiple topics.

  2. A word can appear in multiple topics.

10

Latent Dirichlet Allocation (LDA)

The best-known topic modelling algorithm.

  • You specify in advance how many topics you want.

LDA then discovers that many topics automatically.

11

Two outputs of LDA

  1. Words per topic (which words best describe each topic?)

Example: football, player, goal, coach → topic: sports

  2. Topics per document (which topics occur in this document?)

Example: document 80% sports, 20% politics; another document 100% politics.

LDA gives probabilities rather than one fixed label.
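
The two outputs fit together as a mixture: under LDA's model, the chance of seeing a word in a document is the sum, over topics, of (share of that topic in the document) × (probability of the word in that topic). The sketch below only does this arithmetic on made-up numbers; it does not train a model, and `topic_word`, `doc_topic`, and both function names are hypothetical.

```python
# Hypothetical "words per topic" output: each topic is a distribution over words.
topic_word = {
    "sports":   {"football": 0.40, "player": 0.30, "goal": 0.20, "election": 0.10},
    "politics": {"football": 0.05, "player": 0.05, "goal": 0.10, "election": 0.80},
}
# Hypothetical "topics per document" output: 80% sports, 20% politics.
doc_topic = {"sports": 0.8, "politics": 0.2}

def top_words(topic, n=3):
    """The n most probable words of a topic, most probable first."""
    words = topic_word[topic]
    return sorted(words, key=words.get, reverse=True)[:n]

def word_probability(word):
    """Mixture: sum over topics of topic share times word probability."""
    return sum(share * topic_word[topic].get(word, 0.0)
               for topic, share in doc_topic.items())

print(top_words("sports"))           # ['football', 'player', 'goal']
print(word_probability("football"))  # about 0.33
print(word_probability("election"))  # about 0.24
```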

12

iteration

Repeating the same steps (in K-means: assign each point to its nearest centroid, then recompute the centroids) until the solution no longer changes.

13

convergence

The point where cluster assignments stop changing.

14

agglomerative clustering

Starts with each data point as its own cluster and repeatedly merges clusters.

15

single-linkage

Merges clusters based on the closest pair of points between them.

16

complete-linkage

Merges clusters based on the farthest pair of points between them: the two clusters whose farthest pair is closest are merged. This tends to produce more compact, balanced clusters.
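
The effect of the linkage rule can be seen in a small agglomerative sketch, assuming Euclidean distance and a simple brute-force search for the closest pair of clusters. The names `cluster_distance` and `agglomerative` and the one-dimensional sample points are invented for illustration.

```python
import math

def cluster_distance(a, b, linkage):
    """Distance between two clusters under the given linkage rule."""
    pair_distances = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(pair_distances)   # closest pair of points
    if linkage == "complete":
        return max(pair_distances)   # farthest pair of points
    raise ValueError("linkage must be 'single' or 'complete'")

def agglomerative(points, n_clusters, linkage="single"):
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest two clusters
        del clusters[j]                          # j > i, so index i stays valid
    return clusters

# Points on a line with slowly growing gaps (10, 11, 12, 13, 14).
points = [(0,), (10,), (21,), (33,), (46,), (60,)]
print(agglomerative(points, 2, "single"))    # one chain of 5 points plus 1 leftover
print(agglomerative(points, 2, "complete"))  # a 4-point and a 2-point cluster
```

On this data single-linkage keeps extending one chain, while complete-linkage ends up with a more even split.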

17

text clustering

Applying clustering methods to text documents after converting them into numerical features.
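
One way to do the conversion is TF-IDF. The sketch below assumes a basic variant (term frequency = count divided by document length, inverse document frequency = log(N / document frequency)) and whitespace tokenization; libraries often use smoothed variants, so their numbers will differ. The function name `tfidf_vectors` is made up for this example.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # In how many documents does each word occur at least once?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for an empty document
        vectors.append([
            (counts[word] / total) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors

docs = ["goal goal player", "election vote", "player vote"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)  # ['election', 'goal', 'player', 'vote']
for vec in vectors:
    print([round(v, 3) for v in vec])
```

The resulting vectors can then be passed to any clustering method that expects numerical features, using cosine or Euclidean distance.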