Flashcards covering key concepts, algorithms (K-means, Hierarchical, DBSCAN), applications, and limitations related to cluster analysis from the lecture notes.
Cluster Analysis
Finding groups of objects such that objects in a group are similar to one another and different from objects in other groups, maximizing inter-cluster distances and minimizing intra-cluster distances.
Inter-cluster distances
Distances between different clusters, which are maximized in cluster analysis.
Intra-cluster distances
Distances between objects within the same cluster, which are minimized in cluster analysis.
Partitional Clustering
A type of clustering that divides data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical Clustering
A type of clustering that produces a set of nested clusters organized as a hierarchical tree.
Center-Based Cluster
A set of objects where an object in the cluster is closer (more similar) to the 'center' of its cluster than to the center of any other cluster.
Centroid
The mean of all the points in a cluster, used as the cluster center when the attributes are continuous. It need not coincide with an actual data point.
Medoid
The most 'representative' actual data point of a cluster, used as the cluster center when a mean is not meaningful, for example with categorical attributes.
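Worked example (a minimal sketch, not from the lecture notes): the helper names `centroid` and `medoid` are hypothetical, and the medoid is assumed here to be the point with the smallest total distance to all other points, which is one common way to make 'most representative' precise.

```python
import math

def centroid(points):
    """Component-wise mean of the points; need not be an actual data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points, dist=math.dist):
    """Actual data point with the smallest total distance to all the others."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Five points on a line, one of them far from the rest.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (14.0, 0.0)]
c = centroid(points)  # pulled toward the distant point; not in the data
m = medoid(points)    # always one of the original points
```

Passing a different `dist` function (for example a mismatch count over categorical attributes) lets the same medoid definition work where a mean cannot be computed.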
K-means Clustering
A partitional clustering approach where each cluster is associated with a centroid, and each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified as an input parameter.
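Worked example (a minimal sketch of the basic assign-then-update loop, under stated assumptions): Euclidean distance is assumed, the initial centroids are passed in by the caller, and an empty cluster simply keeps its old centroid, which is one simple policy among several. The function name `kmeans` and the `max_iter` parameter are illustrative.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for each point.
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # stable centroids, so assignments will not change either
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
final_centroids, labels = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

Here K is implied by the number of initial centroids supplied, which matches the requirement that K be fixed before the algorithm runs.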
Initial Centroids (K-means)
Often randomly chosen data points that serve as the starting centers for K-means clusters, influencing the final clustering result.
Sum of Squared Error (SSE)
A measure used in K-means clustering, calculated by squaring the distance of each point to its nearest cluster centroid and summing these errors; the algorithm aims to minimize SSE.
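Worked example (a minimal sketch, assuming Euclidean distance): the hypothetical helper `sse` squares each point's distance to its nearest centroid and sums the results, so two candidate sets of centroids can be compared on the same data.

```python
import math

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

points = [(1.0, 1.0), (1.0, 3.0), (7.0, 8.0), (9.0, 8.0)]
good = sse(points, [(1.0, 2.0), (8.0, 8.0)])  # one centroid per natural group
poor = sse(points, [(1.0, 1.0), (1.0, 3.0)])  # both centroids in the same group
```

The lower value identifies the better of the two clusterings for the same K; SSE generally drops as K grows, so it is not a fair comparison across different values of K.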
K-means convergence
The point at which the K-means algorithm settles on a stable set of cluster centroids, so that point assignments no longer change. Most of the movement typically happens in the first few iterations.
Vector quantization
An application of K-means clustering used for lossy data compression, such as clustering colors in an image.
K-means limitations
K-means performs poorly when clusters differ in size or density or have non-globular shapes, and it is sensitive to outliers in the data.
Pre-processing (Clustering)
Steps taken before clustering, such as normalizing data and eliminating outliers, to improve clustering results.
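Worked example (a minimal sketch of one normalization choice): z-score scaling is assumed here as the normalization method, since the card does not name one. The helper name `zscore_columns` is hypothetical. Without rescaling, the attribute with the largest numeric range dominates every distance computation.

```python
import statistics

def zscore_columns(rows):
    """Rescale each attribute (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has standard deviation 0; divide by 1.0 instead.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stdevs))
            for row in rows]

# Second attribute is on a scale 1000 times larger than the first.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
scaled = zscore_columns(rows)
```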
Post-processing (Clustering)
Refining steps taken after clustering, such as eliminating small clusters, splitting 'loose' clusters, or merging 'close' clusters.
Dendrogram
A tree-like diagram used to visualize hierarchical clustering, recording the sequences of merges or splits.
Agglomerative Clustering
A hierarchical clustering technique that starts with individual points as clusters and iteratively merges the two closest clusters until a single cluster remains.
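Worked example (a minimal sketch, under stated assumptions): the card does not say how the distance between two clusters is measured, so single link (the smallest distance between any pair of members) is assumed here. The names `single_link` and `agglomerate` are hypothetical. The returned merge history is the information a dendrogram draws.

```python
import math
from itertools import combinations

def single_link(a, b):
    """Cluster distance = smallest distance between any pair of members."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, linkage=single_link):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        height = linkage(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 3.0)]
history = agglomerate(points)
heights = [h for _, _, h in history]
```

This brute-force version recomputes every pairwise cluster distance at each step, so it is only suitable for small inputs.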
Divisive Clustering
A hierarchical clustering technique that starts with one all-inclusive cluster and iteratively splits clusters until each cluster contains a single point.
Density-Based Cluster
A cluster defined as a dense region of points separated from other high-density regions by low-density regions; useful when clusters are irregularly shaped and when the data contains noise and outliers.
DBSCAN
Density-Based Spatial Clustering of Applications with Noise: an algorithm that defines a cluster as a maximal set of density-connected points, using the parameters Eps (radius) and MinPts (minimum number of points).
Eps (DBSCAN parameter)
The specified radius used in DBSCAN to determine the neighborhood of a point, within which other points are counted for density.
MinPts (DBSCAN parameter)
The specified minimum number of points required within Eps for a point to be considered a core point in DBSCAN.
Core point (DBSCAN)
A data point that has at least MinPts points within its Eps neighborhood (the count conventionally includes the point itself), indicating it is in the interior of a cluster.
Border point (DBSCAN)
A data point that has fewer than MinPts within its Eps neighborhood but is within the Eps neighborhood of a core point.
Noise point (DBSCAN)
Any data point that is neither a core point nor a border point in DBSCAN.
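Worked example (a minimal sketch of the three point types, under stated assumptions): a point is treated as core when its Eps neighborhood, counting the point itself, holds at least MinPts points. Some texts use a strict 'more than' threshold or exclude the point itself, which shifts the count by one. The function name `classify_points` is hypothetical, and this sketch only labels points; it does not assemble the clusters.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    # Each neighborhood includes the point itself (distance 0 <= eps).
    neighborhoods = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                     for p in points]
    core = {i for i, nbrs in enumerate(neighborhoods) if len(nbrs) >= min_pts}
    labels = []
    for i, nbrs in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            labels.append("border")  # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.1, min_pts=3)
```

In this data the four corners of the unit square are core points, (2, 1) is a border point reached from the core point (1, 1), and (8, 8) is noise.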
Data structures for clustering efficiency
Spatial index structures, such as k-d trees and R-trees, proposed to speed up the distance and neighborhood computations that clustering algorithms perform.
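Worked example (a minimal sketch of the k-d tree idea, not a production index): the tree is stored as nested dictionaries and the names `build_kdtree` and `nearest` are hypothetical. The search skips a whole subtree whenever the splitting plane is farther away than the best point found so far, which is where the savings over a full scan come from.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    """Return the stored point closest to target, pruning where possible."""
    if node is None:
        return best
    if best is None or math.dist(target, node["point"]) < math.dist(target, best):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, target, best)
    # Only search the far side if the splitting plane is closer than the best.
    if abs(diff) < math.dist(target, best):
        best = nearest(far, target, best)
    return best

points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(points)
closest = nearest(tree, (9.0, 2.0))
```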