CAP 4770 - Lecture 5: Clustering

Description and Tags

Flashcards covering key concepts, algorithms (K-means, Hierarchical, DBSCAN), applications, and limitations related to cluster analysis from the lecture notes.

Computer Science

Software Engineering

27 Terms

Cluster Analysis

Finding groups of objects such that the objects in a group are similar to one another and different from the objects in other groups. A good clustering minimizes intra-cluster distances and maximizes inter-cluster distances.

Inter-cluster distances

Distances between different clusters, which are maximized in cluster analysis.

Intra-cluster distances

Distances between objects within the same cluster, which are minimized in cluster analysis.
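
The two distance notions can be compared on a toy example. This is a minimal sketch, assuming Euclidean distance and average pairwise distance as the summary; the helper names and data are hypothetical and not from the lecture.

```python
from itertools import combinations, product
from math import dist

def mean_intra_distance(cluster):
    """Average pairwise distance between points of one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points drawn from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

# Hypothetical toy data: two tight, well-separated groups.
a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra_a = mean_intra_distance(a)
inter_ab = mean_inter_distance(a, b)
print(intra_a, inter_ab)
```

For a good clustering the intra-cluster value should be small relative to the inter-cluster value, as it is here.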

Partitional Clustering

A type of clustering that divides data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

Hierarchical Clustering

A type of clustering that produces a set of nested clusters organized as a hierarchical tree.

Center-Based Cluster

A cluster in which every object is closer (more similar) to the 'center' of its own cluster than to the center of any other cluster.

Centroid

The average (mean) of all the points in a cluster. It is used when the attributes are continuous, and it need not coincide with an actual data point.

Medoid

The most 'representative' actual data point of a cluster. It is used when a mean is not meaningful, for example when the attributes are categorical.
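
The difference between the two kinds of center can be shown in a few lines. This is a minimal sketch, assuming the medoid is the point with the smallest total distance to the other points (one common reading of 'most representative'); the function names, the Hamming distance helper, and the data are hypothetical.

```python
from math import dist

def centroid(points):
    """Component-wise mean; may not coincide with any data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def medoid(points, distance=dist):
    """The data point with the smallest total distance to all points."""
    return min(points, key=lambda p: sum(distance(p, q) for q in points))

def hamming(u, v):
    """Number of attributes on which two categorical records differ."""
    return sum(a != b for a, b in zip(u, v))

pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
c = centroid(pts)   # (3.25, 2.75), pulled toward the far point
m = medoid(pts)     # (1.0, 1.0), always one of the data points

# Categorical records have no mean, but a medoid still exists.
cats = [("red", "small"), ("red", "large"), ("blue", "large")]
cat_medoid = medoid(cats, distance=hamming)   # ("red", "large")
```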

K-means Clustering

A partitional clustering approach where each cluster is associated with a centroid, and each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified as an input parameter.
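
The assign-then-update loop described on this card can be sketched as follows. This is an illustrative version only, assuming Euclidean distance, caller-supplied initial centroids, and "no centroid moved" as the stopping rule; the function name `kmeans` and the data are hypothetical.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Basic K-means: alternate assignment and centroid update until stable."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

Here K is fixed by the number of initial centroids passed in, which mirrors the requirement that K be specified up front.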

Initial Centroids (K-means)

Often randomly chosen data points that serve as the starting centers for K-means clusters, influencing the final clustering result.
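
The influence of the starting centers can be seen on four points. This is a hedged sketch, assuming a plain K-means loop with Euclidean distance and fixed (not random) seeds so the outcome is reproducible; the helper names and the rectangle data are hypothetical.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Plain K-means started from the given initial centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

def sse(centroids, clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(dist(p, c) ** 2 for c, cl in zip(centroids, clusters) for p in cl)

# Hypothetical data: corners of a wide, flat rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

good = kmeans(points, [(0.0, 0.0), (4.0, 0.0)])   # one seed on each side
poor = kmeans(points, [(0.0, 0.0), (0.0, 1.0)])   # both seeds on the left side

sse_good = sse(*good)   # left/right split, SSE 1.0
sse_poor = sse(*poor)   # top/bottom split, SSE 16.0
```

Both runs converge, but the second one settles on a stable clustering with a much larger error, which is why K-means is commonly run from several initializations.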

Sum of Squared Error (SSE)

A measure used in K-means clustering, calculated by squaring the distance of each point to its nearest cluster centroid and summing these errors; the algorithm aims to minimize SSE.
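
Written out as a formula, the measure looks as follows. The notation is an assumption introduced here for illustration: $K$ clusters $C_i$, each with centroid $m_i$, and $\operatorname{dist}$ taken to be Euclidean distance.

```latex
\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(m_i, x)^2,
\qquad
m_i = \frac{1}{|C_i|} \sum_{x \in C_i} x
```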

K-means convergence

The process by which the K-means algorithm settles on a stable set of cluster centroids. Most of the change typically happens in the first few iterations.

Vector quantization

An application of K-means clustering used for lossy data compression, such as clustering colors in an image.
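
The compression idea can be sketched with a color codebook. This is a minimal illustration, assuming the codebook has already been produced (for example as K-means centroids) and that pixels are RGB tuples; the names `quantize` and `reconstruct` and all values are hypothetical.

```python
from math import dist

def quantize(pixels, codebook):
    """Replace each pixel by the index of its nearest codebook color."""
    return [
        min(range(len(codebook)), key=lambda i: dist(px, codebook[i]))
        for px in pixels
    ]

def reconstruct(indices, codebook):
    """Lossy decode: look each index up in the codebook."""
    return [codebook[i] for i in indices]

# Hypothetical 2-color codebook, e.g. centroids returned by K-means with K=2.
codebook = [(250, 10, 10), (10, 10, 240)]
pixels = [(255, 0, 0), (240, 20, 15), (0, 0, 255), (30, 5, 230)]

indices = quantize(pixels, codebook)       # [0, 0, 1, 1]
decoded = reconstruct(indices, codebook)   # every pixel becomes a codebook color
```

Storing one small index per pixel plus the codebook takes less space than storing every full color, at the cost of losing the exact original values.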

K-means limitations

K-means has problems when clusters differ in size or density or have non-globular shapes, and when the data contains outliers.

Pre-processing (Clustering)

Steps taken before clustering, such as normalizing data and eliminating outliers, to improve clustering results.
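
Normalization matters because distance-based clustering is dominated by whichever attribute has the largest numeric range. The sketch below assumes min-max scaling, which is only one of several possible normalizations; the function name and the records are hypothetical.

```python
def min_max_normalize(rows):
    """Rescale every attribute (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0   # constant column maps to 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]

# Hypothetical records: (age in years, income in dollars).
rows = [(20, 20000), (40, 60000), (60, 100000)]
scaled = min_max_normalize(rows)   # [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```

Without scaling, the income column would account for almost all of the Euclidean distance between two records.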

Post-processing (Clustering)

Refining steps taken after clustering, such as eliminating small clusters, splitting 'loose' clusters, or merging 'close' clusters.

Dendrogram

A tree-like diagram used to visualize hierarchical clustering, recording the sequences of merges or splits.

Agglomerative Clustering

A hierarchical clustering technique that starts with individual points as clusters and iteratively merges the two closest clusters until a single cluster remains.
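
The merge loop can be sketched in a few lines. This is a naive illustration, assuming single-link (MIN) distance between clusters, which is one of several possible linkage choices; the function names and the one-dimensional data are hypothetical. The returned merge history is the information a dendrogram draws.

```python
from math import dist

def single_link(a, b):
    """MIN linkage: distance between the two closest members."""
    return min(dist(p, q) for p in a for q in b)

def agglomerative(points, linkage=single_link):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

points = [(0.0,), (1.0,), (5.0,), (6.5,)]
history = agglomerative(points)
for left, right, d in history:
    print(left, "+", right, "at distance", d)
```

On this data the merges happen at distances 1.0, 1.5, and 4.0, and cutting the hierarchy before the last merge yields two clusters.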

Divisive Clustering

A hierarchical clustering technique that starts with one all-inclusive cluster and iteratively splits clusters until each cluster contains a single point.

Density-Based Cluster

A cluster defined as a dense region of points that is separated from other high-density regions by low-density regions. This definition is useful when clusters are irregular or intertwined and when noise and outliers are present.

DBSCAN

Density-Based Spatial Clustering of Applications with Noise, an algorithm that defines a cluster as a maximal set of density-connected points, using two parameters: Eps (the neighborhood radius) and MinPts (the minimum number of points within that radius).
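
The cluster-growing idea can be sketched as follows. This is a simplified illustration, not a reference implementation: it assumes Euclidean distance, a brute-force neighborhood search, a neighborhood count that includes the point itself, and "at least MinPts" as the core test. The function name `dbscan` and the data are hypothetical.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = -1            # provisional noise; may become a border point later
            continue
        cluster_id += 1               # i is a core point: start a new cluster
        labels[i] = cluster_id
        frontier = list(neighbors)
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id    # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                frontier.extend(j_neighbors)
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (9.0, 0.0)]
labels = dbscan(points, eps=1.0, min_pts=3)   # [0, 0, 0, 1, 1, 1, -1]
```

Note that the number of clusters is not an input; it falls out of Eps and MinPts.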

Eps (DBSCAN parameter)

The specified radius used in DBSCAN to determine the neighborhood of a point, within which other points are counted for density.

MinPts (DBSCAN parameter)

The specified minimum number of points required within Eps for a point to be considered a core point in DBSCAN.

Core point (DBSCAN)

A data point that has at least MinPts points within its Eps neighborhood (counting the point itself), indicating it is in the interior of a cluster.

Border point (DBSCAN)

A data point that has fewer than MinPts points within its Eps neighborhood, so it is not a core point, but that lies within the Eps neighborhood of a core point.

Noise point (DBSCAN)

Any data point that is neither a core point nor a border point in DBSCAN.
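
The three point types can be computed directly from their definitions. This is a minimal sketch, assuming Euclidean distance, distinct points, and a core test of "at least MinPts points in the Eps neighborhood, counting the point itself" (the reading suggested by the MinPts card); the function name and the data are hypothetical.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'."""
    # Eps neighborhood of every point; each neighborhood includes the point itself.
    neighbors = {p: [q for q in points if dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "border"   # not dense itself, but next to a core point
        else:
            labels[p] = "noise"
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.0, min_pts=3)
```

With these settings the four corners of the unit square are core points, (2.0, 1.0) is a border point because it lies within Eps of the core point (1.0, 1.0), and (8.0, 8.0) is noise.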

Data structures for clustering efficiency

Spatial index structures such as k-d trees and R-trees, proposed to improve the efficiency of the distance computations (for example, nearest-neighbor and neighborhood queries) in clustering algorithms.
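
How such a structure saves distance computations can be illustrated with a small k-d tree. This is a teaching sketch under stated assumptions (median splits, axes cycled by depth, Euclidean distance, dictionary nodes); the function names and the data are hypothetical, and a production index would be built differently.

```python
from math import dist

def build_kdtree(points, depth=0):
    """Recursively split on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Nearest-neighbor search that skips subtrees that cannot hold a closer point."""
    if node is None:
        return best
    if best is None or dist(target, node["point"]) < dist(target, best):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    if abs(diff) < dist(target, best):   # splitting plane is closer than the best match so far
        best = nearest(far, target, best)
    return best

points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(points)
match = nearest(tree, (9.0, 2.0))   # (8.0, 1.0)
```

The pruning test is what avoids comparing the query against every point, which is the saving that K-means assignment steps and DBSCAN neighborhood queries can benefit from.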