Clustering Notes
Clustering Concepts
Clustering
Clustering is the task of dividing a set of objects into groups (clusters) such that objects in the same group are more similar to each other than to those in other groups. Cluster analysis is a fundamental unsupervised learning technique aimed at grouping a set of objects such that those within the same group are more similar to each other than to those in other groups. Clustering aims to organize data based on intrinsic similarity without predefined labels.
Properties of clusters:
Clusters may have different sizes, shapes, densities
Clusters may form a hierarchy
Clusters may be overlapping or disjoint
Objectives of Cluster Analysis
Discovering Structure in Data: Cluster analysis uncovers hidden patterns or groupings in data without prior labelling.
Reducing Complexity: In datasets with high dimensionality or large volumes, clustering simplifies analysis by summarizing data into representative groups.
Facilitating Exploratory Data Analysis (EDA): Clustering serves as a powerful tool in EDA by revealing underlying data distributions and relationships. It aids in hypothesis generation and guides subsequent analytical or modeling efforts.
Enabling Segmentation for Actionable Insights: By segmenting data into meaningful clusters, organizations can tailor strategies to specific groups. For instance, businesses can customize marketing campaigns for different customer segments identified through clustering.
Connection to Unsupervised Learning & Knowledge Discovery
Cluster analysis is a cornerstone of unsupervised learning, where the goal is to infer patterns from unlabelled data. It plays a critical role in knowledge discovery by:
Identifying Novel Patterns: Uncovering previously unknown groupings or associations within data.
Informing Feature Selection: Highlighting relevant features that contribute to cluster formation.
Supporting Anomaly Detection: Detecting outliers as data points that do not fit into any cluster.
Clustering vs Classification
Aspect | Clustering | Classification |
|---|---|---|
Learning Type | Unsupervised | Supervised |
Data Labels | No predefined labels | Requires labeled data |
Objective | Group similar data points | Assign data points to predefined categories |
Output | Natural groupings (clusters) | Predicted labels for new data points |
Use Cases | Market segmentation, image compression | Spam detection, disease diagnosis |
Clustering is Subjective
Examples include dividing items into clusters of children vs. adults, trousers vs. shorts, or by clothing color such as red vs. blue vs. green.
Evaluating Clusters
Internal Evaluation Criteria: Based on intra-cluster and inter-cluster similarities.
External Evaluation Criteria: Compares clustering results to ground truth labels, if available.
Ground truth labels are the true, correct class or category assignments for each data point, as established by: human annotation, verified datasets, domain expert knowledge, or experimental design.
These labels are treated as the "gold standard" or "correct answers" against which clustering results are compared.
Datasets of animals with known species labels can be used to evaluate agreement. High agreement occurs if the algorithm groups all cats together and all dogs together; low agreement occurs if the algorithm mixes them.
Agreement can be quantified using metrics like: Adjusted Rand Index, Normalized Mutual Information, or Fowlkes-Mallows Index.
Relative Evaluation Criteria: Used to compare clustering outcomes across different models or parameter values, often when neither internal quality nor ground truth labels alone are conclusive.
An example is the Silhouette score (highest = optimal), but there are other metrics such as the Calinski-Harabasz Index.
We often don’t know beforehand how many clusters (k) we should use.
Clustering Approaches Overview
Clustering approaches can be categorized based on how they define clusters and their underlying assumptions. Each method suits different types of data, application contexts, and scalability needs:
Partitioning: Construct various partitions and then evaluate them by some criterion
Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion
Density-based: Guided by connectivity and density functions
Model-based: Hypothesize a model for each cluster and find best fit of models to data
Partitioning
Partitioning methods aim to divide a dataset into a predefined number of clusters, optimizing a specific objective function, e.g., minimizing within-cluster variance:
Example:
Imagine you have 10 data points and want to split them into 2 clusters. Each time you assign points to clusters, you can calculate:
How far each point is from the center (centroid) of its assigned cluster
Square those distances
Sum them up for all points in the cluster → this is the "within-cluster variance"
Example Use Cases:
Customer segmentation (grouping customers by purchasing behaviour)
Image colour quantisation (reducing number of colours in an image)
Market basket analysis (grouping similar buying patterns)
Works Well When:
Clusters are spherical, equally sized, similar densities
Dataset is moderate to large, low to medium dimensions
Need fast, efficient clustering
Avoid If:
Clusters are non-spherical, varying sizes
Many outliers → sensitive to outliers
Clusters are not linearly separable
Representative Algorithms:
k-means
k-medoids
CLARA (Clustering Large Applications)
How it works:
Start with k initial cluster centers
Assign data points to the closest center
Iteratively update centers to minimize intra -cluster distance
Strengths:
Efficient for large datasets
Simple implementation
Works best with spherical, equally sized clusters
Limitations:
Requires predefining k
Sensitive to outliers
Struggles with non-convex or varying density clusters
Distance Metrics
A metric is a function that measures the distance between two points x and y, satisfying these 4 key properties:
Property | Formula | Meaning |
|---|---|---|
Non-negativity | Distance is never negative, e.g. Distance between two cities = 0 km or more | |
Identity (of indiscernibles) | d(x,y) = 0 if and only if x = y | Zero distance only if they’re the same point, e.g. Object compared to itself → distance = 0. |
Symmetry | Distance is the same both ways (A → B same as B → A), e.g. Paris → London same as London → Paris | |
Triangle inequality | . | Shortest distance is direct; detour is longer, e.g. Walking direct vs. via another city |
These properties ensure distance behaves consistently and intuitively.
Why is distance fundamental in clustering?
Clustering aims to group similar things together → we need a way to measure “similarity.”
A distance metric tells the algorithm: "Are these two points close enough to belong together?"
Examples:
Customers with similar spending → short distance
Patients with similar gene expression → short distance
Documents with similar topics → short distance
Points closer together → more similar → same cluster. Points far apart → dissimilar → different clusters.
Key distance metrics for partitioning:
Metric | Best Suited For | Works With | How to Interpret Distance |
|---|---|---|---|
Euclidean | Continuous numeric data; compact, spherical clusters | k-means, k-medoids | Smaller = more similar |
Manhattan (L1) | Grid-like data; sparse vectors | k-medoids | Smaller = more similar |
Minkowski | Generalization of Euclidean and Manhattan | k-medoids | Smaller = more similar |
Mahalanobis | Correlated features; elliptical clusters | k-medoids | Smaller = more similar |
Cosine distance | High-dimensional sparse data; text analysis | k-medoids | Closer to 0 = more similar; closer to 1 = more dissimilar |
Jaccard distance | Binary/categorical data (market basket) | k-medoids | Closer to 0 = more similar; closer to 1 = more dissimilar |
Hamming distance | Binary strings; genetic sequence comparisons | k-medoids | Smaller = more similar |
Gower distance | Mixed numeric + categorical data | k-medoids | Smaller = more similar |
Chebyshev distance | Emphasis on largest attribute difference | k-medoids | Smaller = more similar |
Canberra distance | Small value sensitivity; ecological/environmental data | k-medoids | Smaller = more similar |
Bray-Curtis dissimilarity | Ecological species abundance data | k-medoids | Closer to 0 = more similar; closer to 1 = more dissimilar |
Curse of Dimensionality
When you have many features (columns) (e.g., 1000 genes per sample):
Distances between all points start to look similar.
Everything looks equally far apart → clustering becomes random or unstable.
Imagine being in a huge hall: No one feels close to anyone—everyone is “far” in every direction.
Solution: Use dimensionality reduction (e.g., PCA) to simplify before clustering.
Scaling Issues
Big-scale features dominate distance → bias results
Suppose:
Feature 1: Age (values like 20, 30, 40) Feature 2: Income (values like 10,000, 50,000, 100,000)
Income values are way bigger than age. When measuring distance:→ Income dominates → age differences get ignored.
Result: Clusters based mostly on income → even if age was more important.
Euclidean Distance
Euclidean distance measures the straight-line distance between points. In clustering algorithms like K-Means, it's used to assign points to the nearest cluster centroid.
Interpretation:
Clusters are shaped like spheres or circles because Euclidean distance is sensitive to radius.
Points closer to a centroid (shorter line) belong to that cluster.
Manhattan Distance
Manhattan distance, also known as "city block" distance, calculates the distance between two points by only moving along axes at right angles. It's useful in grid-like path scenarios
Interpretation:
Clusters will have diamond or square boundaries rather than circles.
A point is assigned to a cluster whose centroid is closest in total block distance.
Minkowski Distance
Minkowski distance provides flexibility in measuring distance by adjusting the parameter p, allowing for a balance between Manhattan and Euclidean distances based on the dataset's characteristics.
→ diamond shape (Manhattan) → circle (Euclidean) p > 2 → shapes get more “rounded square”
Interpretation:
Shows how distance metric controls the cluster shape:
Lower p → elongated along axes
Higher p → tighter along diagonals
The same dataset could be clustered differently by adjusting p.
Cosine Similarity
Cosine similarity is particularly useful in high-dimensional spaces, such as text mining, where the magnitude of vectors may not be as informative as their direction. Algorithms like spherical K- Means utilize cosine similarity.
Interpretation:
Points with similar directions (angles) are grouped together into clusters.
Magnitude doesn’t affect similarity → even if two vectors have different lengths but same angle → high similarity.
Jaccard Distance
In clustering, Jaccard distance measures the dissimilarity between sets of data points, typically represented as binary (0 or 1) or categorical data. It's calculated by finding the difference between the number of unique elements (union) and shared elements (intersection) of two sets, divided by the number of unique elements. This distance is commonly used in scenarios where data is best represented as sets, such as in document clustering or analysing categorical variables.
Interpretation:
More overlap → smaller Jaccard distance (more similar).
Less overlap → higher Jaccard distance (more dissimilar).
Clusters are formed by grouping items with high overlap (low distance).
k-means Algorithm
The k-means algorithm is one of the most widely used unsupervised machine learning algorithms for clustering tasks. It partitions a dataset into k distinct, non-overlapping clusters based on feature similarity.
K-means operates under the assumption that clusters are spherical and equally sized, making it particularly suitable for well-separated and convex clusters in Euclidean space.
The primary goal of the k-means algorithm is to divide a set of n observations into k clusters such that the within-cluster sum of squares (WCSS) — also referred to as inertia — is minimized. In simpler terms, each cluster should be formed such that data points within a cluster are as similar as possible to each other (intra-cluster similarity) and as dissimilar as possible from points in other clusters (inter-cluster dissimilarity).
Step | Description |
|---|---|
1. Initialization | Pick k random points as starting centroids |
2. Assignment | Use Euclidean distance to assign each point to the nearest centroid |
3. Update | For each cluster, update the centroid by averaging the assigned points |
4. Repeat | Loop steps 2 & 3 until centroids stop moving |
Convergence Criteria
K-means converges when:
No change in cluster assignments: The mapping of each point to a cluster remains constant between iterations.
Centroids are stable: The location of cluster centroids no longer changes (or changes are below a predefined tolerance).
Predefined iterations reached: A practical limit on iterations prevents infinite loops in edge cases.
Challenges of K-Means
Despite its simplicity and efficiency, k-means suffers from several well-known challenges:
Local Minima
The algorithm may converge to a local optimum rather than the global one due to random initialization. This makes the result non-deterministic.
To mitigate this, it's common practice to run the algorithm multiple times with different seeds and select the best result (lowest WCSS).
Need to Specify k in Advance
The number of clusters k must be defined beforehand, which may not always be obvious. Incorrect choices can lead to over- clustering or under-clustering.
Sensitivity to Initialization
Random selection of initial centroids can significantly affect the outcome.
k-means++ (recommended) helps by initializing centroids in a way that is distant from each other, improving both convergence speed and result quality.
Assumes Spherical Clusters
K-means assumes clusters of similar size and spherical shape, which is often not the case in real-world data (e.g., varying densities or non-linear boundaries).
Choosing the Optimal Number of Clusters (k)
Several methods have been developed to select a suitable value of k, including:
Elbow Method
Plot the WCSS against different values of k.
The point where the decrease in WCSS becomes marginal (the "elbow") is chosen as the optimal k.
Though intuitive, this method can be subjective and ambiguous when the curve is smooth.
Silhouette Score
Measures how similar a data point is to its own cluster vs. other clusters
Defined as:
Silhouette score ranges from -1 to +1, where values close to 1 indicate appropriate clustering.
k-medoids Algorithm
Partitioning using k-medoids algorithms is a partitional clustering algorithm which is slightly modified from the k-means algorithm.
K-means algorithm: Choose the means as centroids
Nearest Centroid Classification, also known as Centroid-based Classification, is a simple and intuitive classification algorithm that assigns new data points to the class whose centroid (mean) is closest to the data point.
K-medoids algorithm: Data points are chosen to be medoids
A medoid is a centroid that best represents the objects in a defined cluster ; It is similar to the K-Means clustering method but requires the use of medians for the formation of subgroups.
The k-medoids algorithm is a partitional clustering method that, unlike k-means, selects actual data points as the centers of clusters, called medoids.
A medoid is defined as the most centrally located object in a cluster, i.e., the point minimizing the sum of distances to all other points in that cluster. This is fundamentally different from the centroid used in k-means, which is the average of all points and can lie outside the dataset.
PAM algorithm
The most common implementation of k-medoids is the Partitioning Around Medoids (PAM) algorithm, first introduced by Kaufman and Rousseeuw (1987). Despite its age, PAM has been extensively improved in recent years and remains widely used.
Here are the detailed steps of the PAM algorithm:
Initialize medoids
Select k initial medoids randomly from the dataset. Alternatively, use heuristics (such as the most centrally located points) to improve initial placement.
Assign points to closest medoid
For each data point, compute the distance to all medoids and assign it to the nearest one, forming k clusters.
Swap medoid/non-medoid and check cost
For each medoid, test swapping it with a non-medoid point in the dataset. After the swap, calculate the new total cost:
Repeat until no improvement
If a swap improves the cost (reduces total pairwise distance), perform the swap and reassign points. Repeat until no swaps improve the clustering.
Advantages of k-medoids
Robustness to outliers
Compatibility with arbitrary distance metrics. K-means inherently assumes Euclidean distance, as the mean minimization only makes sense in that space. K-medoids, however, can work with non-Euclidean metrics such as Manhattan distance, cosine similarity, or even domain-specific distances (e.g., edit distance for strings).
This flexibility makes it valuable for clustering diverse data types, including graphs, sequences, and categorical data.
Limitations of k-medoids
Computational complexity: PAM has higher computational cost than k-means because it evaluates all possible swaps between medoids and non-medoids, making it
This limits its scalability for large datasets.
Best-fit applications: K-medoids is particularly useful in non-Euclidean spaces (like bioinformatics sequences, string matching, or network data) and in small to medium datasets where its robustness and flexibility outweigh its computational cost.
Comparison Of k-medoids Algorithms
Method | Key Feature | Best for |
|---|---|---|
PAM | Exact medoid swaps, robust but slow | Small datasets, high accuracy needs |
CLARA | Sample-based PAM | Large datasets, approximate clusters |
CLARANS | Randomized local search over medoids | Medium-large datasets, scalable search |
FastPAM | Optimized PAM with reuse of distances | Larger datasets, faster exact results |
Metaheuristics | Optimization using genetic/PSO/ACO frameworks | High-dimensional, complex clustering |
Cluster Interpretability
Cluster interpretability is fundamentally important because clusters in unsupervised learning gain practical value only when they can be understood, explained, and acted upon by domain experts, decision-makers, or stakeholders.
Interpretation connects these outputs to real-world understanding. Specifically, stakeholders need to know what each cluster represents: are these customer segments, behavioural groups, geographic patterns, or product types? Without a clear explanation, clusters risk being dismissed as “black-box artifacts” — mathematically valid but practically useless.
Interpretability bridges the gap between technical results and business actions by providing meaningful narratives and summaries that support decision-making, resource allocation, or policy adjustments. For example, if a marketing team uses clustering for customer segmentation, they need to know which demographics or behaviours characterize each segment to tailor targeted campaigns.
Techniques for Interpretation
Centroid/Medoid Analysis
In centroid-based methods (e.g., K-Means), the centroid is the average point representing the cluster center in feature space, while in medoid-based methods (e.g., K-Medoids), the medoid is the actual data point that minimizes within-cluster dissimilarity.
Analysing the centroid or medoid helps explain “what typical members of the cluster look like”.
Feature Importance per Cluster (Mean/Median Comparisons)
Another approach involves comparing the mean or median of each feature across clusters.
By identifying which features have notably different averages, you can summarize which variables are most characteristic of each cluster.
This is often visualized using boxplots, parallel coordinates plots, or summary tables.
For example, in a demographic clustering task, you might find that cluster 1 has a high average age but low income, while cluster 2 has low average age but high spending patterns.
Visual Inspection
Visual tools are essential, especially for communicating results to non-technical audiences:
Scatterplots (2D/3D): By plotting two or three features, you can observe spatial separation between clusters and intuitively understand the data structure.
Pair plots: These show scatterplots for every pair of features, often color-coded by cluster, to reveal multi-feature interactions.
PCA / t-SNE Projections: In high-dimensional data, dimensionality reduction techniques like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbour Embedding (t-SNE) are used to project data into 2D or 3D for visualization.
Cluster Labelling (Using Supervised Ground Truth if Available)
If labelled data exists, clusters can sometimes be labelled by assigning the majority class from the known labels within each cluster.
Cluster Profiling
A formal cluster profile summarizes each cluster’s defining characteristics using descriptive statistics or domain-specific summaries.
For example, after clustering customers, you might create a profile like:
Cluster A: young, urban, low income, high spending.
Cluster B: older, suburban, high income, conservative spending.
Challenges in interpretation
High-Dimensional Interpretability Challenges
High-dimensional datasets (e.g., genomics, text embeddings) are notoriously hard to interpret because humans cannot visualize or comprehend patterns across 100+ dimensions.
Non-Linear Cluster Boundaries
Many clustering algorithms assume linear separability (e.g., K-Means uses convex boundaries), but real-world data often contains non-linear patterns.
Dependence on Domain Knowledge
Interpretation is not just a technical process but also a domain-driven task.
Danger of Overinterpreting Spurious Clusters
Not all clusters reflect real-world phenomena.
Researchers caution against the human tendency to “see patterns where none exist” — known as pareidolia in psychology or overfitting in machine learning.
Hierarchical Methods
Hierarchical methods produce a nested structure of clusters, typically visualized as a dendrogram. Two strategies:
Agglomerative (bottom-up): start with each data point as its own cluster and iteratively merge
Divisive (top-down): start with all data points in one cluster and iteratively split
Example use cases:
Gene expression analysis (grouping genes/proteins hierarchically)
Customer segmentation with nested segments (e.g., VIP → premium → regular)
Document clustering → organizing large corpora into topic trees
Works well when:
Need hierarchical relationships (clusters within clusters)
Don’t know number of clusters beforehand
Want visual representation of cluster structure (dendrogram)
Avoid if:
Dataset is very large → high cost
Need strict hard assignments
Clusters are noisy / poorly separated
Two Strategies:
Agglomerative
Divisive
How it works:
No need to predefine k
Captures multilevel relationships between data points
Strengths:
Can produce interpretable hierarchy
Flexible cluster shapes
Limitations:
Computationally expensive
Sensitive to linkage criteria
Linkage criteria
Method | How it measures distance between clusters | Purpose / Best Use | Example Use Case |
|---|---|---|---|
Single Link | Distance between the closest pair of points (one from each cluster) | Best for detecting elongated, irregular- shaped clusters; sensitive to noise/outliers (“chaining” effect) | Identifying geographical clusters following rivers or roads (e.g., habitats along a coastline --> can help protect wildlife conservation zones, or pollution across rivers) |
Complete Link | Distance between the farthest pair of points (one from each cluster) | Produces compact, spherical clusters; can split large clusters; sensitive to outliers | Customer segmentation needing tight, well-separated groups for targeted marketing |
Average Link | Average distance between all pairs of points (across the two clusters) | Balanced approach; works well for clusters of moderate variance and shape; less sensitive to outliers | Clustering biological samples (e.g., gene expression profiles) with moderate noise. This can help define molecular subtypes of a disease for personalized treatment |
Centroid Link | Distance between the centroids (means) of the clusters | Sensitive to shape and size; can produce inversions in dendrogram; works for convex, similarly sized clusters | Clustering product categories by average sales metrics across multiple regions |
Note: These linkage criteria are NOT applicable in algorithms like k-means, DBSCAN, or GMM.
Agglomerative
Agglomerative (bottom up) clustering:
It builds the dendrogram (tree) from the bottom level, and
merges the most similar (or nearest) pair of clusters
stops when all the data points are merged into a single cluster (i.e., the root cluster).
It is more popular than divisive methods.
Imagine a company wants to segment its customers to tailor marketing strategies. They collect data like:
Age
Annual spending
Purchase frequency
Favourite product category
Region
Bottom-Up Process Step-by-Step:
Initial step: every customer is its own cluster (singleton).
First merges: similar customers are paired together into tiny groups.
Divisive
Divisive hierarchical clustering is a top-down clustering approach that starts with all data points in a single cluster and recursively splits clusters into smaller sub-clusters until each data point is isolated or a stopping condition is met. ◦
Analogy:
Imagine starting with a big classroom of students. First, you split them by major (e.g., Computer Science vs. Business). Then split Computer Science into AI vs. Cybersecurity. Then split AI into Reinforcement Learning vs. NLP… You keep dividing until each student stands alone."
Document Classification Example
Start by treating all documents as one large cluster.
First split: Separate science papers vs. non-science.
Next split: Within science, separate biology vs. physics.
Next split: Within biology, separate molecular vs. ecology.
Keep recursively splitting → build a topic hierarchy tree.
Comparison Agglomerative vs Divisive
Feature | Agglomerative | Divisive |
|---|---|---|
Starts with | Singletons (each point = cluster) | One big cluster |
Action | Merges clusters | Splits clusters |
Computational cost | Less expensive | More expensive (needs global splitting) |
Popularity | More commonly used | Less commonly used |
DIANA Algorithm
DIANA (DIvisive ANAlysis) is an implementation of the divisive hierarchical clustering approach. DIANA (DIvisive ANAlysis Clustering) was developed by Kaufman and Rousseeuw (1990) in their seminal book Finding Groups in Data.
Key steps of DIANA:
Start with all data in one cluster.
Find the most dissimilar object (the “seed”).
Create a new cluster around this object.
Move other points to the new cluster if they are more dissimilar from the current cluster than the new one.
Repeat splitting the remaining clusters until each point is its own cluster (or a stopping criterion is met).
DIANA uses a dissimilarity matrix, making it flexible for non-Euclidean distances.
Density-Based Methods
Density-based clustering methods define clusters as contiguous regions of high point density separated by regions of low density. A cluster is formed where data points are densely packed together, and points in sparse regions are considered noise or outliers.
The most well-known algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
Example use case:
Geospatial clustering (e.g. to detect hotspots of earthquakes,
Works well when:
Clusters are irregularly shaped (e.g., non-spherical)
You expect outliers / noise and want to detect them
Need a parameter-free cluster count (k is not required)
Avoid if:
Data is uniformly distributed
Data has varying densities → DBSCAN struggles unless you tune ε carefully
Dataset is very high-dimensional
Representative Algorithms:
DBSCAN
OPTICS
How it works:
Points within high-density neighbourhoods are grouped into clusters
Noise/outliers are identified as low-density points
Strengths:
Robust to outliers
Can detect arbitrary-shaped clusters
Does not require predefining k
Limitations:
Struggles with varying densities
Sensitive to parameter selection (epsilon, minPts)
Model-Based Methods
Model-based clustering assumes the data is generated from a mixture of underlying probability distributions and aims to estimate the parameters of these distributions to assign points to clusters."
Each cluster is modelled as a component in a mixture model (often Gaussian).
The most common algorithm: Gaussian Mixture Models (GMMs).
Example use cases:
Market segmentation where customers belong partially to multiple segments
Image processing for colour clustering under Gaussian assumption
Genomic data modeling gene expression clusters with uncertainty
works well when:
Data is generated from normally distributed populations
Clusters may overlap
Need probabilistic assignment (uncertainty modeling)
Avoid If:
Data has highly irregular/non-Gaussian shapes
Very high-dimensional → slow convergence
You need clear, hard cluster labels
Representative Algorithms:
Gaussian Mixture Models (GMMs)
How it works:
Fits a probabilistic model to the data
Each cluster corresponds to a component of the mixture
Assigns membership probabilities rather than hard assignments
Strengths:
Can capture overlapping clusters
Probabilistic interpretation
Limitations:
Sensitive to initialisation
Computationally intensive for large/high- dimensional data
Grid-Based Methods
Grid-based clustering divides the data space into a finite number of cells forming a grid structure, then performs clustering by operating on these cells instead of individual data points. The main idea: quantise data space → summarise data density inside cells → cluster based on cell density patterns.
A famous example: STING (Statistical Information Grid).
Example use cases:
Astronomical data mining (detecting dense star regions), Geographic Information Systems (GIS) (clustering spatial regions), Sensor networks (summarizing coverage zones)
Works well when:
Dataset is massive → scalability is priority
Interested in speed over precision
Data can be easily binned into fixed intervals
Avoid If:
High-dimensional data → grid becomes sparse
Irregular cluster shapes
Need fine-grained clustering boundaries
Representative Algorithms:
STING (Statistical Information Grid)
CLIQUE (Clustering in Quest)
How it works:
Cluster formation occurs within dense cells
Reduces computational complexity to number of cells, not data points
Strengths:
Scalable to large datasets
Efficient for multidimensional data
Limitations:
Resolution depends on grid granularity