Unsupervised learning
the family of ML techniques we can use when working with unlabelled data
types of unsupervised learning
clustering, anomaly detection, and density estimation
clustering
identify groups within a dataset
anomaly detection
learn what 'normal' data looks like, then use that to detect abnormal instances
density estimation
estimate the probability density function (PDF) of the process that generated the dataset
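The two cards above fit together: a density estimator learns the PDF of the data, and instances that fall in very low-density regions can be flagged as anomalies. A minimal sketch using scikit-learn's `KernelDensity` on hypothetical 1-D data (the bandwidth and the 2% threshold are arbitrary illustration choices, not recommendations):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(7)
X = rng.normal(0, 1, (500, 1))             # hypothetical "normal" data

# Density estimation: fit a model of the PDF that generated the data.
kde = KernelDensity(bandwidth=0.5).fit(X)
log_dens = kde.score_samples(X)            # log-density at each instance

# Anomaly detection: flag instances whose estimated density is in the lowest 2%.
threshold = np.percentile(log_dens, 2)
anomalies = X[log_dens < threshold]
print(len(anomalies))
```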
clustering info
trying to extract information about an unlabelled dataset by grouping samples based on how similar their feature values are. There are no class labels, so the groups must be learned from the feature matrix alone
hard clustering
if you train k-means to assign each instance to exactly one cluster
soft clustering
train k-means to score how related each instance is to every cluster; the score can be a distance metric between the instance and each centroid
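In scikit-learn terms, a fitted `KMeans` gives both views: `predict` returns the hard assignment (one cluster index per instance) and `transform` returns each instance's distance to every centroid, a soft score. A small sketch on hypothetical toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two obvious blobs in 2-D.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
              [5.0, 5.0], [5.2, 5.1], [5.1, 4.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

hard = km.predict(X)    # hard clustering: one cluster index per instance
soft = km.transform(X)  # soft clustering: distance to every centroid

print(hard.shape)  # (6,)
print(soft.shape)  # (6, 2)
```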
how does kMeans work?
initialise the centroids (e.g. pick k instances at random), then label each instance with its nearest centroid, calculate new centroids for each cluster, re-label all instances with the new centroids, and repeat until the algorithm converges
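The loop on this card can be sketched directly in NumPy (hypothetical data and k, with random initialisation from the dataset itself; a guard keeps an empty cluster's centroid in place):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated blobs.
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(4, 0.3, (20, 2))])
k = 2

# 1. Initialise centroids by picking k random instances.
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # 2. Label each instance with its nearest centroid.
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    # 3. Calculate new centroids as the mean of each cluster.
    new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    # 4. Repeat until the centroids stop moving (convergence).
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```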
kmeans performance
depends on the starting centroids. Performance metric: inertia, the squared distance between each instance and its closest centroid, summed over all instances. Smaller inertia → better model (for a fixed k)
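As a sanity check, inertia can be recomputed by hand; note that scikit-learn's `inertia_` attribute is the sum (not the mean) of squared distances to the closest centroid. A sketch on hypothetical random data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
X = rng.normal(0, 1, (100, 2))   # hypothetical data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance from each instance to its assigned (closest) centroid.
d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
print(np.isclose((d ** 2).sum(), km.inertia_))   # inertia_ is the sum of squares
```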
speed optimisation
mini-batch k-means: in each round, assign only a small batch of instances, then re-estimate the centroids using that mini-batch. Faster training, but usually slightly lower-quality clustering
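A minimal comparison with scikit-learn's `MiniBatchKMeans` (hypothetical blob data; the batch size and seeds are arbitrary). Mini-batch training is faster on large datasets but typically ends with slightly higher inertia than full k-means:

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

rng = np.random.default_rng(1)
# Hypothetical data: three blobs of 200 instances each.
X = np.vstack([rng.normal(c, 0.4, (200, 2)) for c in (0.0, 5.0, 10.0)])

mbk = MiniBatchKMeans(n_clusters=3, batch_size=64, n_init=3, random_state=42).fit(X)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Compare clustering quality via inertia (lower is better for the same k).
print(mbk.inertia_, km.inertia_)
```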
Choosing the right K
you must specify the target number of clusters (k) in advance. Inertia is the metric k-means uses to estimate clustering quality, but as k increases inertia always drops, so pick the k at the 'elbow' of the inertia curve, where the drop flattens
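The elbow heuristic can be sketched by fitting k-means for a range of k values and watching inertia fall (hypothetical data with 3 true blobs, so the elbow should appear around k = 3):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Hypothetical data: three well-separated blobs.
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0, 4, 8)])

# Inertia for k = 1..6: it always decreases, but flattens after the true k.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
print(inertias)
```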
Silhouette Score
the mean silhouette coefficient over all instances. We want the clustering (e.g. the value of k) with the highest silhouette score
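Choosing k by silhouette score can be sketched with scikit-learn's `silhouette_score` (hypothetical data with 2 true blobs, so k = 2 should win):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
# Hypothetical data: two well-separated blobs.
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in (0, 5)])

# Mean silhouette coefficient for each candidate k.
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=0).fit_predict(X))
          for k in range(2, 6)}

best_k = max(scores, key=scores.get)  # pick the k with the highest score
print(best_k)
```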
k-means limits
no guarantees that k-means produces good clustering
DBSCAN
density-based clustering algorithm: clusters are continuous regions of high density, separated by regions of low density. Works well if the clusters are dense enough
DBSCAN method
for each instance, count the instances within distance ε of it (its ε-neighbourhood). An instance with at least min_samples neighbours is a core instance; all instances in a core instance's neighbourhood belong to the same cluster. Instances that are neither core instances nor in any core instance's neighbourhood are treated as anomalies
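A minimal DBSCAN sketch with scikit-learn (hypothetical data: two dense blobs plus one isolated point; `eps` and `min_samples` are arbitrary illustration values):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
# Hypothetical data: two dense blobs plus one isolated point between them.
X = np.vstack([rng.normal(0, 0.2, (50, 2)),
               rng.normal(5, 0.2, (50, 2)),
               [[2.5, 2.5]]])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Label -1 marks instances that sit in no dense region (noise/anomalies).
print(sorted(set(db.labels_.tolist())))
```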
Gaussian Mixture Model
clustering algorithm. A probabilistic model that assumes that the instances are generated from a mixture of several Gaussian distributions
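Because the model is probabilistic, a fitted `GaussianMixture` gives both a hard assignment (most likely component) and a soft one (per-component probabilities). A sketch on hypothetical data drawn from two Gaussians:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
# Hypothetical data drawn from two Gaussians.
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

hard = gm.predict(X)        # most likely Gaussian component per instance
soft = gm.predict_proba(X)  # probability of each component (soft clustering)
print(soft.shape)
```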
Hierarchical Clustering
family of algorithms that build nested clusters by merging or splitting them successively
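The merging (agglomerative, bottom-up) variant is available in scikit-learn as `AgglomerativeClustering`. A sketch on hypothetical data (the `ward` linkage choice is one of several options):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(6)
# Hypothetical data: two separated blobs.
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

# Agglomerative = bottom-up: start with one cluster per instance,
# then repeatedly merge the two closest clusters.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
print(set(agg.labels_.tolist()))
```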
Clustering performance (no labelled datasets)
compare models to find the best performance, using inertia or the silhouette coefficient
Rand index
compares the similarity between two assignments of cluster indices (e.g. predicted clusters vs known labels), ignoring how the clusters are named
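A small sketch with scikit-learn's `rand_score` and `adjusted_rand_score` on hypothetical label lists; note the score ignores the cluster index names, only the grouping matters:

```python
from sklearn.metrics import adjusted_rand_score, rand_score

true_labels = [0, 0, 0, 1, 1, 1]
pred_a = [1, 1, 1, 0, 0, 0]  # same grouping, different index names
pred_b = [0, 1, 0, 1, 0, 1]  # unrelated grouping

print(rand_score(true_labels, pred_a))           # identical partitions score 1.0
print(adjusted_rand_score(true_labels, pred_b))  # near-random grouping scores near 0
```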
Fowlkes-Mallows scores
the geometric mean of pairwise precision and recall
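A sketch with scikit-learn's `fowlkes_mallows_score` on hypothetical label lists; like the Rand index, it depends only on the grouping, not on the index names:

```python
from sklearn.metrics import fowlkes_mallows_score

true_labels = [0, 0, 0, 1, 1, 1]
same = [1, 1, 1, 0, 0, 0]  # identical grouping under relabelling

# Geometric mean of pairwise precision and recall;
# identical partitions score 1.0.
print(fowlkes_mallows_score(true_labels, same))
```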