Clustering


21 Terms

1
New cards

Unsupervised learning

is about ML techniques we can use when working with unlabelled data

2
New cards

types of unsupervised learning

clustering, anomaly detection, density estimation

3
New cards

clustering

identify groups within a dataset

4
New cards

anomaly detection

learn what ‘normal’ data looks like and then use that to detect abnormal instances

5
New cards

density estimation

estimate the probability density function (PDF) of the process that generated the dataset

6
New cards

clustering info

trying to extract some information about your unlabelled dataset. Group samples based on how similar their feature values are. No class labels. Need to learn groups based on the feature matrix

7
New cards

hard clustering

each instance is assigned to exactly one cluster, e.g. training k-means to output a single cluster label per instance

8
New cards

soft-clustering

train k-means to get a score of how related each instance is to every cluster. The score can be a form of distance metric between the instance and the centroids etc.
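The cards don't name a library, but assuming scikit-learn, a minimal sketch of hard vs soft assignments: `KMeans.predict` gives one label per instance, while `KMeans.transform` gives the distance from each instance to every centroid (one form of the soft score described above).

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious blobs around (0, 0) and (10, 10)
X = np.array([[0, 0], [0.5, 0], [0, 0.5],
              [10, 10], [10.5, 10], [10, 10.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Hard clustering: one cluster label per instance
labels = km.predict(X)

# Soft clustering: distance from each instance to each centroid
distances = km.transform(X)   # shape (6, 2)
```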

9
New cards

how does kMeans work?

initialise k centroids (e.g. at random). Assign each instance to its closest centroid. Calculate new centroids for each cluster. Re-label all instances with the new centroids. Repeat the process until the algorithm converges
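The loop above can be sketched in plain NumPy (a minimal illustration of the idea, not scikit-learn's implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialise centroids by picking k random instances
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each instance to its closest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Calculate new centroids as the mean of each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(k)])
        # 4. Stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [9., 9.], [9., 10.], [10., 9.]])
labels, centroids = kmeans(X, k=2)
```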

10
New cards

kmeans performance

depends on starting points. Performance metric: inertia, the mean squared distance between each instance and its closest centroid. Smaller inertia → better model (for the same k)

11
New cards

speed optimisation

mini-batch k-means: in each round we only assign a small batch of instances, then re-estimate the centroids using the mini-batches. Faster processing but lower-quality clustering
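Assuming scikit-learn, this is `MiniBatchKMeans`; `batch_size` (a parameter choice here, not from the cards) controls how many instances each update round uses:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Two synthetic, well-separated blobs
X = np.vstack([rng.normal(0, 0.5, (500, 2)),
               rng.normal(8, 0.5, (500, 2))])

# Each round re-estimates centroids from a batch of 64 instances
mbk = MiniBatchKMeans(n_clusters=2, batch_size=64, n_init=3,
                      random_state=42).fit(X)
labels = mbk.predict(X)
```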

12
New cards

Choosing the right K

you must specify the target no. of clusters (k). Inertia is the metric k-means uses to estimate clustering quality, but as k increases inertia always drops, so you can't just minimise it. Instead use the elbow method: pick the k where the inertia curve stops dropping sharply

13
New cards

Silhouette Score

mean silhouette coefficient over all instances, where each instance's coefficient is (b − a) / max(a, b), with a its mean distance to instances in its own cluster and b its mean distance to instances in the nearest other cluster. We want the clustering with the highest silhouette score
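A sketch of choosing k by silhouette score, assuming scikit-learn's `silhouette_score`; unlike inertia, it doesn't automatically improve as k grows, so the maximum is meaningful:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Three well-separated blobs
X = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in (0, 5, 10)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The clustering with the highest silhouette score wins
best_k = max(scores, key=scores.get)
```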

14
New cards

k-means limits

no guarantees that k-means produces good clustering

15
New cards

DBSCAN

density-based clustering algorithm. Clusters are continuous regions of high density. Works well if clusters are dense enough

16
New cards

DBSCAN method

1. for each instance, count how many instances are within distance ε of it (the ε-neighbourhood)
2. if an instance has more than min_samples instances in its ε-neighbourhood, it is a core instance (i.e. it is in a dense region)
3. all instances in the neighbourhood of a core instance are in the same cluster (i.e. chaining of core instances forms larger clusters)
4. instances that don’t belong to any core instance’s neighbourhood are considered an anomaly

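The steps above map directly onto scikit-learn's `DBSCAN` (assuming that library), where `eps` is ε and anomalies get the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
# Two dense blobs plus one isolated point far from both
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(5, 0.2, (50, 2)),
               [[20.0, 20.0]]])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Instances outside every core instance's neighbourhood are labelled -1
labels = db.labels_
```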
17
New cards

Gaussian Mixture Model

clustering algorithm. A probabilistic model that assumes that the instances are generated from a mixture of several Gaussian distributions
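Because each instance gets a probability of coming from each Gaussian, a GMM gives soft clustering for free. A minimal sketch, assuming scikit-learn's `GaussianMixture`:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Instances drawn from a mixture of two Gaussians
X = np.vstack([rng.normal(0, 0.5, (200, 2)),
               rng.normal(6, 0.5, (200, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard = gmm.predict(X)         # most likely Gaussian per instance
soft = gmm.predict_proba(X)   # probability of each Gaussian per instance
```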

18
New cards

Hierarchical Clustering

family of algorithms that build nested clusters by merging or splitting them successively
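One common member of this family is agglomerative (bottom-up) clustering, sketched here with scikit-learn's `AgglomerativeClustering` (an assumed library choice): start with one cluster per instance and repeatedly merge the closest pair until the target number remains.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [9., 9.], [9., 10.], [10., 9.]])

# Bottom-up: merge the closest clusters until 2 remain
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
labels = agg.labels_
```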

19
New cards

Clustering performance (no labelled datasets)

compare models to find the best performance. Use inertia. Use silhouette coefficient

20
New cards

Rand index

compares the similarity between 2 assignments of cluster labels (e.g. predicted vs ground truth), ignoring permutations of the label values
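Because only the grouping matters, a relabelled but otherwise identical clustering scores a perfect 1.0. A sketch assuming scikit-learn's `rand_score` (and its chance-corrected variant `adjusted_rand_score`):

```python
from sklearn.metrics import rand_score, adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [1, 1, 1, 0, 0, 0]   # same grouping, swapped index values

# Only whether pairs of instances are grouped together matters,
# not the actual label values
ri = rand_score(true_labels, pred_labels)
ari = adjusted_rand_score(true_labels, pred_labels)  # chance-corrected
```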

21
New cards

Fowlkes-Mallows scores

geometric mean of pairwise precision and recall
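A sketch assuming scikit-learn's `fowlkes_mallows_score`: a perfect clustering scores 1.0, and misplacing an instance pulls the score below 1.

```python
from sklearn.metrics import fowlkes_mallows_score

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]   # one instance put in the wrong group

# Geometric mean of pairwise precision and recall
fmi = fowlkes_mallows_score(true_labels, pred_labels)
perfect = fowlkes_mallows_score(true_labels, true_labels)
```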