Clustering


21 Terms

1
New cards

Unsupervised learning

is about ML techniques we can use when working with unlabelled data

2
New cards

types of unsupervised learning

clustering, anomaly detection, density estimation

3
New cards

clustering

identify groups within a dataset

4
New cards

anomaly detection

learn what ‘normal’ data looks like and then use that to detect abnormal instances

5
New cards

density estimation

estimate the probability density function (PDF) of the process that generated the dataset

6
New cards

clustering info

trying to extract some information about your unlabelled dataset. Group samples based on how similar their feature values are. No class labels. Need to learn groups based on the feature matrix

7
New cards

hard clustering

each instance is assigned to exactly one cluster, e.g. training k-means to output a single cluster label per instance

8
New cards

soft-clustering

train k-means to get a score of how related each instance is to every cluster. The score can be a form of distance metric between the instance and the centroids etc.
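The cards don't name a library, but assuming scikit-learn, a minimal sketch of hard vs soft assignments: `KMeans.predict` gives one label per instance, while `KMeans.transform` gives the distance from each instance to every centroid (one form of the soft score described above).

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs around (0, 0) and (10, 10)
X = np.array([[0, 0], [0.5, 0], [0, 0.5],
              [10, 10], [10.5, 10], [10, 10.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Hard clustering: one cluster label per instance
labels = km.predict(X)

# Soft clustering: distance from each instance to each centroid
distances = km.transform(X)   # shape (6, 2)
```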

9
New cards

how does kMeans work?

initialise k centroids (e.g. at random). Assign each instance to its closest centroid. Calculate new centroids for each cluster. Re-label all instances with the new centroids. Repeat the process until the algorithm converges
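The loop above can be sketched in plain NumPy (a minimal illustration of the idea, not scikit-learn's implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialise centroids by picking k random instances
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each instance to its closest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Calculate new centroids as the mean of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 4. Stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [9., 9.], [9., 10.], [10., 9.]])
labels, centroids = kmeans(X, k=2)
```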

10
New cards

kmeans performance

depends on starting points. Performance metric: inertia, the mean squared distance between each instance and its closest centroid. Smaller inertia → better model (for the same k)

11
New cards

speed optimisation

mini-batch k-means: in each round we only assign a small batch of instances, then re-estimate the centroids using the mini-batches. Faster processing but lower-quality clustering
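Assuming scikit-learn, this is `MiniBatchKMeans`; `batch_size` (a parameter choice here, not from the cards) controls how many instances each update round uses:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Two synthetic, well-separated blobs
X = np.vstack([rng.normal(0, 0.5, (500, 2)),
               rng.normal(8, 0.5, (500, 2))])

# Each round re-estimates centroids from a batch of 64 instances
mbk = MiniBatchKMeans(n_clusters=2, batch_size=64, n_init=3,
                      random_state=42).fit(X)
labels = mbk.predict(X)
```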

12
New cards

Choosing the right K

you must specify the target no. of clusters (k). Inertia is the metric k-means uses to estimate clustering quality, but as k increases inertia always drops, so you can't just minimise it. Instead use the elbow method: pick the k where the inertia curve stops dropping sharply

13
New cards

Silhouette Score

mean silhouette coefficient over all instances, where each instance's coefficient is (b − a) / max(a, b), with a its mean distance to instances in its own cluster and b its mean distance to instances in the nearest other cluster. We want the clustering with the highest silhouette score
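A sketch of choosing k by silhouette score, assuming scikit-learn's `silhouette_score`; unlike inertia, it doesn't automatically improve as k grows, so the maximum is meaningful:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Three well-separated blobs
X = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in (0, 5, 10)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The clustering with the highest silhouette score wins
best_k = max(scores, key=scores.get)
```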

14
New cards

k-means limits

no guarantees that k-means produces good clustering

15
New cards

DBSCAN

density-based clustering algorithm. Clusters are continuous regions of high density. Works well if clusters are dense enough

16
New cards

DBSCAN method

1. for each instance, count how many instances are within distance ε of it (the ε-neighbourhood)
2. if an instance has more than min_samples instances in its ε-neighbourhood, it is a core instance (i.e. it is in a dense region)
3. all instances in the neighbourhood of a core instance are in the same cluster (i.e. chaining of core instances forms larger clusters)
4. instances that don’t belong to any core instance’s neighbourhood are considered an anomaly

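The steps above map directly onto scikit-learn's `DBSCAN` (assuming that library), where `eps` is ε and anomalies get the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
# Two dense blobs plus one isolated point far from both
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(5, 0.2, (50, 2)),
               [[20.0, 20.0]]])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Instances outside every core instance's neighbourhood are labelled -1
labels = db.labels_
```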
17
New cards

Gaussian Mixture Model

clustering algorithm. A probabilistic model that assumes that the instances are generated from a mixture of several Gaussian distributions
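Because each instance gets a probability of coming from each Gaussian, a GMM gives soft clustering for free. A minimal sketch, assuming scikit-learn's `GaussianMixture`:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Instances drawn from a mixture of two Gaussians
X = np.vstack([rng.normal(0, 0.5, (200, 2)),
               rng.normal(6, 0.5, (200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard = gmm.predict(X)         # most likely Gaussian per instance
soft = gmm.predict_proba(X)   # probability of each Gaussian per instance
```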

18
New cards

Hierarchical Clustering

family of algorithms that build nested clusters by merging or splitting them successively
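One common member of this family is agglomerative (bottom-up) clustering, sketched here with scikit-learn's `AgglomerativeClustering` (an assumed library choice): start with one cluster per instance and repeatedly merge the closest pair until the target number remains.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [9., 9.], [9., 10.], [10., 9.]])

# Bottom-up: merge the closest clusters until 2 remain
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
labels = agg.labels_
```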

19
New cards

Clustering performance (no labelled datasets)

compare models to find the best performance. Use inertia. Use silhouette coefficient

20
New cards

Rand index

compares the similarity between 2 assignments of cluster labels (e.g. predicted vs ground truth), ignoring permutations of the label values
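Because only the grouping matters, a relabelled but otherwise identical clustering scores a perfect 1.0. A sketch assuming scikit-learn's `rand_score` (and its chance-corrected variant `adjusted_rand_score`):

```python
from sklearn.metrics import rand_score, adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 1, 0, 0, 0]   # same grouping, swapped index values

# Only whether pairs of instances are grouped together matters,
# not the actual label values
ri = rand_score(true_labels, pred_labels)
ari = adjusted_rand_score(true_labels, pred_labels)  # chance-corrected
```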

21
New cards

Fowlkes-Mallows scores

geometric mean of pairwise precision and recall
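A sketch assuming scikit-learn's `fowlkes_mallows_score`: a perfect clustering scores 1.0, and misplacing an instance pulls the score below 1.

```python
from sklearn.metrics import fowlkes_mallows_score

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]   # one instance put in the wrong group

# Geometric mean of pairwise precision and recall
fmi = fowlkes_mallows_score(true_labels, pred_labels)
perfect = fowlkes_mallows_score(true_labels, true_labels)
```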