clustering
The task of partitioning a dataset into several groups so that observations (data instances) that are similar are assigned to the same group (cluster), and instances that are dissimilar are placed in different groups.
Euclidean distance
This is the ordinary distance between two points.
Imagine two houses on a map.
The straight-line distance between them is the Euclidean distance.
For text:
Each document becomes a point.
Documents that are close together are considered similar.
Good for
Data where the actual values matter.
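As a minimal sketch of the house example above, the straight-line distance can be computed from coordinates. The function name and the coordinates are made up for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical houses on a map, given as (x, y) coordinates.
house_a = (0.0, 0.0)
house_b = (3.0, 4.0)

print(euclidean_distance(house_a, house_b))  # 5.0
```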
Cosine distance
Instead of comparing actual distances, cosine distance compares direction.
Imagine two arrows.
Same direction → very similar
Different directions → less similar
For text documents, cosine distance is often better because it ignores document length.
Why choose Cosine instead of Euclidean?
Because documents have different lengths.
Euclidean distance treats a long document and a short one as far apart, even when they cover the same subject.
Cosine distance focuses on which words are important, not how long the document is.
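The effect of document length can be sketched with word-count vectors. The three-word vocabulary and the counts below are invented for illustration, and cosine distance is taken here as 1 minus cosine similarity.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; a value near 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical word counts for the vocabulary (football, goal, election).
short_doc = [2, 1, 0]
long_doc = [20, 10, 0]   # same word proportions, ten times longer
other_doc = [0, 1, 3]    # mostly about a different subject

print(euclidean_distance(short_doc, long_doc))  # large, only because of length
print(cosine_distance(short_doc, long_doc))     # approximately 0
print(cosine_distance(short_doc, other_doc))    # much closer to 1
```

The short and long documents use the same words in the same proportions, so cosine distance sees them as nearly identical while Euclidean distance does not.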
K-means
How many groups (clusters) do you want?
K = 3 → 3 clusters
Limitation: K-means only works with numerical features.
So text must first be converted into numbers (e.g., using TF-IDF).
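A minimal K-means sketch on numerical points, assuming the usual assign-then-update loop. The choice of the first K points as starting centroids is a simplification made here for reproducibility, the sample points are invented, and `math.dist` needs Python 3.8 or later.

```python
import math

def kmeans(points, k, max_iterations=100):
    """Minimal k-means sketch: returns (centroids, assignments)."""
    # Assumption: the first k points serve as starting centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once no assignment changed since the previous round.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)  # [0, 0, 0, 1, 1, 1]
```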
Hierarchical Clustering
Instead of immediately creating K groups, hierarchical clustering builds a tree of clusters called a dendrogram.
Later you can decide how many groups you want.
topic modelling
Topic modelling is another unsupervised method.
But instead of grouping documents, it discovers the hidden topics inside documents.
Example: 5000 unlabeled news articles → topic modelling discovers:
Topic 1 (election, government, president, parliament)
Topic 2 (football, coach, player, goal)
Topic 3 (health, hospital, doctor, patient)
It discovers these topics automatically.
Difference between clustering and topic modelling
Clustering groups documents → Cluster A = sports articles; Cluster B = politics articles.
Topic modelling groups words that often occur together. It then determines which topics appear inside each document:
A document can contain 70% politics and 30% economics.
Unlike in clustering, documents can belong to multiple topics.
Two assumptions of topic modelling
A document can discuss multiple topics.
A word can appear in multiple topics.
Latent Dirichlet Allocation (LDA)
Best-known topic modelling algorithm.
How many topics do you want?
Then LDA discovers those topics automatically.
Two outputs of LDA
Words per topic (which words best describe each topic?)
Example: football, player, goal, coach → topic: sports
Topics per document (which topics occur in this document?)
Example: document 80% sports, 20% politics; another document 100% politics.
LDA gives probabilities rather than one fixed label.
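The two outputs can be pictured as two tables of probabilities. The sketch below does not run LDA: every number and name in it is invented to show the shape of the result and how the two tables combine into the probability of a word in a document.

```python
# Hypothetical LDA-style output (all probabilities invented for illustration).
words_per_topic = {
    "sports": {"football": 0.4, "goal": 0.3, "coach": 0.2, "election": 0.1},
    "politics": {"election": 0.5, "government": 0.4, "goal": 0.1},
}
topics_per_document = {
    "doc_1": {"sports": 0.8, "politics": 0.2},
    "doc_2": {"sports": 0.0, "politics": 1.0},
}

def top_words(topic, n=3):
    """The n most probable words of a topic."""
    dist = words_per_topic[topic]
    return sorted(dist, key=dist.get, reverse=True)[:n]

def word_probability(word, document):
    """P(word | document) as a mixture weighted by the document's topics."""
    return sum(
        weight * words_per_topic[topic].get(word, 0.0)
        for topic, weight in topics_per_document[document].items()
    )

print(top_words("sports"))                # ['football', 'goal', 'coach']
print(word_probability("goal", "doc_1"))  # about 0.26 (0.8 * 0.3 + 0.2 * 0.1)
```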
iteration
Repeating the same steps until the solution no longer changes.
convergence
The point where cluster assignments stop changing.
agglomerative clustering
Starts with each data point as its own cluster and repeatedly merges clusters.
single-linkage
Merges clusters based on the closest pair of points between them.
complete-linkage
Merges clusters based on the farthest pair of points between them, which tends to produce compact clusters of more similar size.
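A minimal sketch of agglomerative clustering with both linkage rules. The function and the one-dimensional sample points are assumptions chosen so that the two rules give different results, and `math.dist` needs Python 3.8 or later.

```python
import math

def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    # Single linkage looks at the closest pair between two clusters,
    # complete linkage at the farthest pair.
    pick = min if linkage == "single" else max
    clusters = [[p] for p in points]  # every point starts as its own cluster

    def cluster_distance(a, b):
        return pick(math.dist(p, q) for p in a for q in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical points on a line with slowly growing gaps.
points = [(0.0,), (1.0,), (2.2,), (3.6,), (5.2,)]
single = agglomerative(points, 2, linkage="single")
complete = agglomerative(points, 2, linkage="complete")
print([len(c) for c in single])    # [4, 1]
print([len(c) for c in complete])  # [2, 3]
```

On this data single-linkage chains the points into one large cluster and leaves one point alone, while complete-linkage splits them more evenly.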
text clustering
Applying clustering methods to text documents after converting them into numerical features.
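The conversion step can be sketched with a small TF-IDF function. This uses one common variant (term frequency as count divided by document length, inverse document frequency as log(N / document frequency)); libraries often apply different smoothing, and the sample sentences are invented.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Turn raw text documents into TF-IDF vectors over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors

documents = [
    "the coach praised the player",
    "the player scored a goal",
    "the president addressed the parliament",
]
vocabulary, vectors = tfidf_vectors(documents)

# "the" occurs in every document, so its weight is 0 everywhere.
the_index = vocabulary.index("the")
print([vector[the_index] for vector in vectors])  # [0.0, 0.0, 0.0]
```

Each document is now a list of numbers, so a distance measure and a clustering method can be applied to it.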