What is clustering?
Cluster analysis is a technique used to group similar data points based on their characteristics.
It helps with:
Visualising high-dimensional data
Finding structure in data
Classification
How does clustering relate to EDA?
Cluster analysis is a key EDA tool because it:
Shows hidden patterns within data
Reveals structure in high-dimensional datasets that are otherwise hard to visualize
Generates hypotheses by identifying meaningful segments (e.g., customer types)
Reduces data complexity by summarizing observations into clusters
Supports visualization, such as dendrograms or scatter plots with clusters colored distinctly
Main limitations of clustering
Different distance measures change which points get grouped together
Different clustering methods give different groups
Clusters don’t always match real or meaningful groups in the data
There’s no single correct way to form clusters
Types of clustering
Hierarchical clustering
K-means clustering
What is hierarchical clustering, and what are its two main approaches?
Hierarchical clustering builds a tree (hierarchy) of clusters.
Two main approaches:
Agglomerative (bottom-up):
Start with each point alone, then keep merging the closest clusters until there’s only one big cluster.
Divisive (top-down):
Start with all points in one cluster, then keep splitting into smaller clusters step by step.
Distance measures used in clustering
Euclidean distance:
Straight-line distance between two points, like drawing a line directly between them. Good for plain numeric data.
Manhattan distance:
Sum of the steps you’d take along a grid (the absolute differences added up across each variable)
Cosine similarity:
Looks at the angle between two data points (their direction rather than their size)
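A minimal base-R sketch of these three measures on two made-up points (p and q are purely illustrative):

```r
p <- c(1, 2, 3)
q <- c(4, 0, 3)

# Euclidean: straight-line distance
sqrt(sum((p - q)^2))          # same as dist(rbind(p, q), method = "euclidean")

# Manhattan: sum of the absolute differences (grid steps)
sum(abs(p - q))               # same as dist(rbind(p, q), method = "manhattan")

# Cosine similarity: angle between the two vectors (1 = same direction)
sum(p * q) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))
```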
Linkage methods in Hierarchical clustering
These decide how the distance between two clusters is measured when choosing which clusters to merge.
Methods:
Single linkage
Complete linkage
Single Linkage
Uses the closest pair of points between the two clusters → can create long, chain-like clusters
Complete Linkage
Uses the furthest pair of points between the two clusters → makes tighter, rounder clusters
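A small sketch comparing the two linkages with base R's hclust() on made-up, well-separated data (the data and seed are illustrative):

```r
set.seed(42)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))
d <- dist(x)                                   # Euclidean distances between observations

hc_single   <- hclust(d, method = "single")    # closest points between clusters
hc_complete <- hclust(d, method = "complete")  # furthest points between clusters

# Complete linkage merges at larger heights, which tends to give
# tighter, more compact clusters than the chain-prone single linkage.
tail(hc_single$height)
tail(hc_complete$height)
```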
Agglomerative (bottom-up) Clustering Method Steps:
Treat each observation as its own cluster
Measure the distances between observations (each one a cluster) using a distance metric (e.g. Euclidean)
Find 2 closest clusters and merge them
Recalculate distances between clusters using single or complete linkage
Repeat steps 2-4 until only 1 cluster remains
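A from-scratch sketch of these steps using single linkage on made-up data (in practice you would just call hclust(); the 5-point dataset and all names here are illustrative):

```r
set.seed(1)
x <- matrix(rnorm(10), ncol = 2)           # 5 made-up observations, 2 variables
clusters <- as.list(seq_len(nrow(x)))      # step 1: each observation is its own cluster
dmat <- as.matrix(dist(x))                 # step 2: pairwise Euclidean distances

# Single-linkage distance between two clusters: the closest pair of points across them
cluster_dist <- function(a, b) min(dmat[a, b])

while (length(clusters) > 1) {
  # step 3: find the two closest clusters
  best <- c(1, 2); best_d <- Inf
  for (i in 1:(length(clusters) - 1)) {
    for (j in (i + 1):length(clusters)) {
      d <- cluster_dist(clusters[[i]], clusters[[j]])
      if (d < best_d) { best_d <- d; best <- c(i, j) }
    }
  }
  cat("merge at distance", round(best_d, 3), ":", clusters[[best[1]]], "+", clusters[[best[2]]], "\n")
  # step 4: merge them; distances are re-evaluated on the next pass
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL                # step 5: repeat until one cluster remains
}
```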
Advantages vs Disadvantages of hierarchical clustering
Advantages:
No need to choose the number of clusters before starting
Produces a dendrogram, which shows the order in which clusters were merged
Works well for small-to-medium datasets
Disadvantages:
Slow for large datasets
Sensitive to outliers and messy data
Once clusters are merged they can’t be split
Dendrogram
Tree diagram that shows how data points were grouped in hierarchical clustering
How to read:
X-axis:
Shows data points (observations before clustering)
Y-axis:
Shows how far apart clusters were when combined
Height of each horizontal line (called a “merge”) shows distance between clusters
Low line = Very similar
High line = Less similar
What does a long vertical branch (a big gap before the next merge) mean:
Shows a big jump in distance (two very different clusters being merged)
Helps spot clear breaks in data where you can cut the tree
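A short sketch of drawing and reading a dendrogram in base R (the data is made up):

```r
set.seed(42)
x  <- matrix(rnorm(40), ncol = 2)          # 20 made-up observations
hc <- hclust(dist(x), method = "complete")

plot(hc)       # x-axis: observations; y-axis: height (distance) at each merge
hc$height      # merge heights in increasing order; a big jump between successive
               # values means two very different clusters were merged
```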
How to choose number of clusters
Cut the tree horizontally at a height where the number of clusters is stable (a small change in cut height won’t change the number of clusters) → look for long vertical branches (big gaps between merge heights)
The number of vertical branches the cut crosses tells you how many clusters you’ll get
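Continuing the same idea, a sketch of cutting the tree with cutree() (the cut height of 3 and k = 2 are purely illustrative):

```r
set.seed(42)
x  <- matrix(rnorm(40), ncol = 2)
hc <- hclust(dist(x), method = "complete")

plot(hc)
abline(h = 3, col = "red")      # horizontal cut; each branch it crosses becomes one cluster
groups <- cutree(hc, h = 3)     # cluster label for every observation at that height
table(groups)                   # size of each resulting cluster
cutree(hc, k = 2)               # alternatively, request a fixed number of clusters
```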
Choosing number of clusters
No one correct answer
Too few clusters makes clustering pointless
Too many clusters may be hard to interpret or use
Non-hierarchical clustering
Used when exact number of clusters is known
Example: k-means clustering
K-means Clustering: Optimising Clusters
Within the k clusters (groups), each point is closest to its own cluster’s mean (centroid)
A point can’t be closer to the mean of a different cluster
Minimises the within-cluster sum of squares (tries to keep the points in a cluster as close to their centroid as possible)
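A sketch checking that definition against what base R's kmeans() reports (the data and k = 2 are made up):

```r
set.seed(1)
x   <- matrix(rnorm(100), ncol = 2)        # 50 made-up observations
fit <- kmeans(x, centers = 2, nstart = 10)

# Recompute the within-cluster sum of squares for cluster 1 by hand:
# squared distances from each of its points to its centroid
pts <- x[fit$cluster == 1, , drop = FALSE]
sum(sweep(pts, 2, fit$centers[1, ])^2)     # should match fit$withinss[1]

fit$withinss                               # per-cluster sum of squares
fit$tot.withinss                           # total: the quantity k-means minimises
```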
Steps of k-means clustering
Pick how many clusters (K) you want.
Choose starting points (centroids) for each cluster (often randomly).
Put each data point into the closest group (based on distance).
Move each centroid to the average location of the points in the cluster.
Repeat steps 3–4 until there’s minimal change
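Roughly these steps are what base R's kmeans() runs; a minimal sketch on made-up data (the data and K = 2 are illustrative):

```r
set.seed(2)
x   <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 5), ncol = 2))
fit <- kmeans(x, centers = 2)   # steps 1-2: choose K and (random) starting centroids

fit$iter                        # how many assign/update passes were needed (steps 3-5)
fit$centers                     # final centroid positions
table(fit$cluster)              # how many points ended up in each cluster
```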
Algorithms for solving k-means
These algorithms are used because k-means is an NP-hard problem
There’s no quick way to guarantee finding the best clustering
Instead of trying all combinations we use the following heuristics:
Lloyd’s Algorithm
Hartigan-Wong algorithm
Lloyd’s Algorithm
Pick starting centroids (often random)
Put each point in a cluster with the closest centroid
Move each centroid to the mean of all observations in the cluster
Repeat steps 2-3 until there’s no more change
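A from-scratch sketch of Lloyd's algorithm on made-up 2-D data (in practice: kmeans(x, centers = k, algorithm = "Lloyd"); all names and values below are illustrative):

```r
set.seed(3)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))
k <- 2
centroids <- x[sample(nrow(x), k), ]        # step 1: random starting centroids

repeat {
  # step 2: assign each point to its nearest centroid
  d      <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  labels <- apply(d, 1, which.min)
  # step 3: move each centroid to the mean of the points assigned to it
  new_centroids <- t(sapply(1:k, function(j) colMeans(x[labels == j, , drop = FALSE])))
  # step 4: stop once the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-8)) break
  centroids <- new_centroids
}
table(labels)    # cluster sizes under the converged assignment
```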
Label Switching in k-means clustering
Cluster labels (e.g. 1, 2, 3) are arbitrary and can change between runs. The labels don’t matter; the grouping does.
Outputs from the kmeans() function in R?
cluster: A vector showing the cluster assignment of each data point
centers: Coordinates of the centroids (mean values of variables) of each cluster
size: Number of observations in each cluster
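A quick sketch of pulling those three components out of a fitted object (made-up data, k = 3):

```r
set.seed(4)
x   <- matrix(rnorm(60), ncol = 2)          # 30 made-up observations
fit <- kmeans(x, centers = 3, nstart = 25)

fit$cluster   # cluster assignment (1, 2 or 3) for each observation
fit$centers   # one row per cluster: the centroid's coordinates
fit$size      # number of observations in each cluster
```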
Why is comparing cluster solutions difficult?
There’s no correct answer
Different methods of clustering give different results
Robustness
A cluster solution is robust if it stays broadly the same when the method or settings change
Check robustness by:
Comparing results from different clustering methods
Using the Rand Index (RI) or Adjusted Rand Index (ARI)
Rand Index
Measures how similar two cluster groupings (e.g. from two different methods) are
0 = no agreement
1 = perfect agreement
Adjusted Rand Index
Improves the Rand Index by adjusting for agreement that could happen by chance
0 = agreement no better than chance (values can even be slightly negative)
1 = perfect agreement
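A sketch of comparing a hierarchical and a k-means solution on the same made-up data; the plain Rand Index is computed by hand, and adjustedRandIndex() is assumed to come from the mclust package (an assumption — install it separately):

```r
set.seed(5)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))

hc_labels <- cutree(hclust(dist(x), method = "complete"), k = 2)  # hierarchical solution
km_labels <- kmeans(x, centers = 2, nstart = 10)$cluster          # k-means solution

table(hc_labels, km_labels)       # cross-tabulation of the two groupings

# Plain Rand Index by hand: the fraction of point pairs the two solutions agree on
same_pair <- function(lab) outer(lab, lab, "==")[lower.tri(diag(length(lab)))]
mean(same_pair(hc_labels) == same_pair(km_labels))

mclust::adjustedRandIndex(hc_labels, km_labels)   # ARI: near 1 if the solutions agree
```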
Advantages and disadvantages of k-means clustering
Advantages:
Fast when there is a large number of observations
Works well when the number of clusters is known
Suitable when there is no obvious hierarchy in the data
Disadvantages:
Must choose the number of clusters at the start, and it’s not always obvious
Random starting centroids can give different results each run
Doesn’t always find the best solution (depends on the starting centroids)