Clustering Notes

Clustering Concepts

Clustering

Clustering is the task of dividing a set of objects into groups (clusters) such that objects in the same group are more similar to each other than to those in other groups. Cluster analysis is a fundamental unsupervised learning technique aimed at grouping a set of objects such that those within the same group are more similar to each other than to those in other groups. Clustering aims to organize data based on intrinsic similarity without predefined labels.

Properties of clusters:
  • Clusters may have different sizes, shapes, densities

  • Clusters may form a hierarchy

  • Clusters may be overlapping or disjoint

Objectives of Cluster Analysis

  1. Discovering Structure in Data: Cluster analysis uncovers hidden patterns or groupings in data without prior labelling.

  2. Reducing Complexity: In datasets with high dimensionality or large volumes, clustering simplifies analysis by summarizing data into representative groups.

  3. Facilitating Exploratory Data Analysis (EDA): Clustering serves as a powerful tool in EDA by revealing underlying data distributions and relationships. It aids in hypothesis generation and guides subsequent analytical or modeling efforts.

  4. Enabling Segmentation for Actionable Insights: By segmenting data into meaningful clusters, organizations can tailor strategies to specific groups. For instance, businesses can customize marketing campaigns for different customer segments identified through clustering.

Connection to Unsupervised Learning & Knowledge Discovery

Cluster analysis is a cornerstone of unsupervised learning, where the goal is to infer patterns from unlabelled data. It plays a critical role in knowledge discovery by:

  • Identifying Novel Patterns: Uncovering previously unknown groupings or associations within data.

  • Informing Feature Selection: Highlighting relevant features that contribute to cluster formation.

  • Supporting Anomaly Detection: Detecting outliers as data points that do not fit into any cluster.

Clustering vs Classification

Aspect

Clustering

Classification

Learning Type

Unsupervised

Supervised

Data Labels

No predefined labels

Requires labeled data

Objective

Group similar data points

Assign data points to predefined categories

Output

Natural groupings (clusters)

Predicted labels for new data points

Use Cases

Market segmentation, image compression

Spam detection, disease diagnosis

Clustering is Subjective

Examples include dividing items into clusters of children vs. adults, trousers vs. shorts, or by clothing color such as red vs. blue vs. green.

Evaluating Clusters

  1. Internal Evaluation Criteria: Based on intra-cluster and inter-cluster similarities.

  2. External Evaluation Criteria: Compares clustering results to ground truth labels, if available.

    • Ground truth labels are the true, correct class or category assignments for each data point, as established by: human annotation, verified datasets, domain expert knowledge, or experimental design.

    • These labels are treated as the "gold standard" or "correct answers" against which clustering results are compared.

    • Datasets of animals with known species labels can be used to evaluate agreement. High agreement occurs if the algorithm groups all cats together and all dogs together; low agreement occurs if the algorithm mixes them.

    • Agreement can be quantified using metrics like: Adjusted Rand Index, Normalized Mutual Information, or Fowlkes-Mallows Index.

  3. Relative Evaluation Criteria: Used to compare clustering outcomes across different models or parameter values, often when neither internal quality nor ground truth labels alone are conclusive.

    • An example is the Silhouette score (highest = optimal), but there are other metrics such as the Calinski-Harabasz Index.

    • We often don’t know beforehand how many clusters (k) we should use.

Clustering Approaches Overview

Clustering approaches can be categorized based on how they define clusters and their underlying assumptions. Each method suits different types of data, application contexts, and scalability needs:

  • Partitioning: Construct various partitions and then evaluate them by some criterion

  • Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion

  • Density-based: Guided by connectivity and density functions

  • Model-based: Hypothesize a model for each cluster and find best fit of models to data

Partitioning

Partitioning methods aim to divide a dataset into a predefined number of clusters, optimizing a specific objective function, e.g., minimizing within-cluster variance:

Example:

Imagine you have 10 data points and want to split them into 2 clusters. Each time you assign points to clusters, you can calculate:

  • How far each point is from the center (centroid) of its assigned cluster

  • Square those distances

  • Sum them up for all points in the cluster → this is the "within-cluster variance"

Example Use Cases:
  • Customer segmentation (grouping customers by purchasing behaviour)

  • Image colour quantisation (reducing number of colours in an image)

  • Market basket analysis (grouping similar buying patterns)

Works Well When:
  • Clusters are spherical, equally sized, similar densities

  • Dataset is moderate to large, low to medium dimensions

  • Need fast, efficient clustering

Avoid If:
  • Clusters are non-spherical, varying sizes

  • Many outliers → sensitive to outliers

  • Clusters are not linearly separable

Representative Algorithms:
  • k-means

  • k-medoids

  • CLARA (Clustering Large Applications)

How it works:
  • Start with k initial cluster centers

  • Assign data points to the closest center

  • Iteratively update centers to minimize intra -cluster distance

Strengths:
  • Efficient for large datasets

  • Simple implementation

  • Works best with spherical, equally sized clusters

Limitations:
  • Requires predefining k

  • Sensitive to outliers

  • Struggles with non-convex or varying density clusters

Distance Metrics

A metric is a function d(x,y)d(x,y) that measures the distance between two points x and y, satisfying these 4 key properties:

Property

Formula

Meaning

Non-negativity

d(x,y)0d(x,y) \geq 0

Distance is never negative, e.g. Distance between two cities = 0 km or more

Identity (of indiscernibles)

d(x,y) = 0 if and only if x = y

Zero distance only if they’re the same point, e.g. Object compared to itself → distance = 0.

Symmetry

d(x,y)=d(y,x)d(x,y) = d(y,x)

Distance is the same both ways (A → B same as B → A), e.g. Paris → London same as London → Paris

Triangle inequality

d(x,z)d(x,y)+d(y,z)d(x,z) \leq d(x,y) + d(y,z).

Shortest distance is direct; detour is longer, e.g. Walking direct vs. via another city

These properties ensure distance behaves consistently and intuitively.

Why is distance fundamental in clustering?

Clustering aims to group similar things together → we need a way to measure “similarity.”
A distance metric tells the algorithm: "Are these two points close enough to belong together?"

Examples:
  • Customers with similar spending → short distance

  • Patients with similar gene expression → short distance

  • Documents with similar topics → short distance

Points closer together → more similar → same cluster. Points far apart → dissimilar → different clusters.

Key distance metrics for partitioning:

Metric

Best Suited For

Works With

How to Interpret Distance

Euclidean

Continuous numeric data; compact, spherical clusters

k-means, k-medoids

Smaller = more similar

Manhattan (L1)

Grid-like data; sparse vectors

k-medoids

Smaller = more similar

Minkowski

Generalization of Euclidean and Manhattan

k-medoids

Smaller = more similar

Mahalanobis

Correlated features; elliptical clusters

k-medoids

Smaller = more similar

Cosine distance

High-dimensional sparse data; text analysis

k-medoids

Closer to 0 = more similar; closer to 1 = more dissimilar

Jaccard distance

Binary/categorical data (market basket)

k-medoids

Closer to 0 = more similar; closer to 1 = more dissimilar

Hamming distance

Binary strings; genetic sequence comparisons

k-medoids

Smaller = more similar

Gower distance

Mixed numeric + categorical data

k-medoids

Smaller = more similar

Chebyshev distance

Emphasis on largest attribute difference

k-medoids

Smaller = more similar

Canberra distance

Small value sensitivity; ecological/environmental data

k-medoids

Smaller = more similar

Bray-Curtis dissimilarity

Ecological species abundance data

k-medoids

Closer to 0 = more similar; closer to 1 = more dissimilar

Curse of Dimensionality

When you have many features (columns) (e.g., 1000 genes per sample):

  • Distances between all points start to look similar.

  • Everything looks equally far apart → clustering becomes random or unstable.

  • Imagine being in a huge hall: No one feels close to anyone—everyone is “far” in every direction.

  • Solution: Use dimensionality reduction (e.g., PCA) to simplify before clustering.

Scaling Issues

Big-scale features dominate distance → bias results
Suppose:
Feature 1: Age (values like 20, 30, 40) Feature 2: Income (values like 10,000, 50,000, 100,000)
Income values are way bigger than age. When measuring distance:→ Income dominates → age differences get ignored.
Result: Clusters based mostly on income → even if age was more important.

Euclidean Distance

Euclidean distance measures the straight-line distance between points. In clustering algorithms like K-Means, it's used to assign points to the nearest cluster centroid.

Interpretation:
  • Clusters are shaped like spheres or circles because Euclidean distance is sensitive to radius.

  • Points closer to a centroid (shorter line) belong to that cluster.

Manhattan Distance

Manhattan distance, also known as "city block" distance, calculates the distance between two points by only moving along axes at right angles. It's useful in grid-like path scenarios

Interpretation:
  • Clusters will have diamond or square boundaries rather than circles.

  • A point is assigned to a cluster whose centroid is closest in total block distance.

Minkowski Distance

Minkowski distance provides flexibility in measuring distance by adjusting the parameter p, allowing for a balance between Manhattan and Euclidean distances based on the dataset's characteristics.
p=1p=1 → diamond shape (Manhattan) p=2p=2 → circle (Euclidean) p > 2 → shapes get more “rounded square”

Interpretation:
  • Shows how distance metric controls the cluster shape:

    • Lower p → elongated along axes

    • Higher p → tighter along diagonals

  • The same dataset could be clustered differently by adjusting p.

Cosine Similarity

Cosine similarity is particularly useful in high-dimensional spaces, such as text mining, where the magnitude of vectors may not be as informative as their direction. Algorithms like spherical K- Means utilize cosine similarity.

Interpretation:
  • Points with similar directions (angles) are grouped together into clusters.

  • Magnitude doesn’t affect similarity → even if two vectors have different lengths but same angle → high similarity.

Jaccard Distance

In clustering, Jaccard distance measures the dissimilarity between sets of data points, typically represented as binary (0 or 1) or categorical data. It's calculated by finding the difference between the number of unique elements (union) and shared elements (intersection) of two sets, divided by the number of unique elements. This distance is commonly used in scenarios where data is best represented as sets, such as in document clustering or analysing categorical variables.

Interpretation:
  • More overlap → smaller Jaccard distance (more similar).

  • Less overlap → higher Jaccard distance (more dissimilar).

  • Clusters are formed by grouping items with high overlap (low distance).

k-means Algorithm

The k-means algorithm is one of the most widely used unsupervised machine learning algorithms for clustering tasks. It partitions a dataset into k distinct, non-overlapping clusters based on feature similarity.

K-means operates under the assumption that clusters are spherical and equally sized, making it particularly suitable for well-separated and convex clusters in Euclidean space.

The primary goal of the k-means algorithm is to divide a set of n observations into k clusters such that the within-cluster sum of squares (WCSS) — also referred to as inertia — is minimized. In simpler terms, each cluster should be formed such that data points within a cluster are as similar as possible to each other (intra-cluster similarity) and as dissimilar as possible from points in other clusters (inter-cluster dissimilarity).

Step

Description

1. Initialization

Pick k random points as starting centroids

2. Assignment

Use Euclidean distance to assign each point to the nearest centroid

3. Update

For each cluster, update the centroid by averaging the assigned points

4. Repeat

Loop steps 2 & 3 until centroids stop moving

Convergence Criteria

K-means converges when:

  1. No change in cluster assignments: The mapping of each point to a cluster remains constant between iterations.

  2. Centroids are stable: The location of cluster centroids no longer changes (or changes are below a predefined tolerance).

  3. Predefined iterations reached: A practical limit on iterations prevents infinite loops in edge cases.

Challenges of K-Means

Despite its simplicity and efficiency, k-means suffers from several well-known challenges:

  1. Local Minima

    • The algorithm may converge to a local optimum rather than the global one due to random initialization. This makes the result non-deterministic.

    • To mitigate this, it's common practice to run the algorithm multiple times with different seeds and select the best result (lowest WCSS).

  2. Need to Specify k in Advance

    • The number of clusters k must be defined beforehand, which may not always be obvious. Incorrect choices can lead to over- clustering or under-clustering.

  3. Sensitivity to Initialization

    • Random selection of initial centroids can significantly affect the outcome.

    • k-means++ (recommended) helps by initializing centroids in a way that is distant from each other, improving both convergence speed and result quality.

  4. Assumes Spherical Clusters

    • K-means assumes clusters of similar size and spherical shape, which is often not the case in real-world data (e.g., varying densities or non-linear boundaries).

Choosing the Optimal Number of Clusters (k)

Several methods have been developed to select a suitable value of k, including:

  1. Elbow Method

    • Plot the WCSS against different values of k.

    • The point where the decrease in WCSS becomes marginal (the "elbow") is chosen as the optimal k.

    • Though intuitive, this method can be subjective and ambiguous when the curve is smooth.

  2. Silhouette Score

    • Measures how similar a data point is to its own cluster vs. other clusters

    • Defined as:
      Silhouettescore=(ba)/max(a,b)Silhouette score = (b - a) / max(a, b)

    • Silhouette score ranges from -1 to +1, where values close to 1 indicate appropriate clustering.

k-medoids Algorithm

Partitioning using k-medoids algorithms is a partitional clustering algorithm which is slightly modified from the k-means algorithm.

  • K-means algorithm: Choose the means as centroids

  • Nearest Centroid Classification, also known as Centroid-based Classification, is a simple and intuitive classification algorithm that assigns new data points to the class whose centroid (mean) is closest to the data point.

  • K-medoids algorithm: Data points are chosen to be medoids

  • A medoid is a centroid that best represents the objects in a defined cluster ; It is similar to the K-Means clustering method but requires the use of medians for the formation of subgroups.

The k-medoids algorithm is a partitional clustering method that, unlike k-means, selects actual data points as the centers of clusters, called medoids.

A medoid is defined as the most centrally located object in a cluster, i.e., the point minimizing the sum of distances to all other points in that cluster. This is fundamentally different from the centroid used in k-means, which is the average of all points and can lie outside the dataset.

PAM algorithm

The most common implementation of k-medoids is the Partitioning Around Medoids (PAM) algorithm, first introduced by Kaufman and Rousseeuw (1987). Despite its age, PAM has been extensively improved in recent years and remains widely used.

Here are the detailed steps of the PAM algorithm:

  1. Initialize medoids

    • Select k initial medoids randomly from the dataset. Alternatively, use heuristics (such as the most centrally located points) to improve initial placement.

  2. Assign points to closest medoid

    • For each data point, compute the distance to all medoids and assign it to the nearest one, forming k clusters.

  3. Swap medoid/non-medoid and check cost

    • For each medoid, test swapping it with a non-medoid point in the dataset. After the swap, calculate the new total cost:

  4. Repeat until no improvement

    • If a swap improves the cost (reduces total pairwise distance), perform the swap and reassign points. Repeat until no swaps improve the clustering.

Advantages of k-medoids
  1. Robustness to outliers

  2. Compatibility with arbitrary distance metrics. K-means inherently assumes Euclidean distance, as the mean minimization only makes sense in that space. K-medoids, however, can work with non-Euclidean metrics such as Manhattan distance, cosine similarity, or even domain-specific distances (e.g., edit distance for strings).
    This flexibility makes it valuable for clustering diverse data types, including graphs, sequences, and categorical data.

Limitations of k-medoids
  • Computational complexity: PAM has higher computational cost than k-means because it evaluates all possible swaps between medoids and non-medoids, making it O(N2)O(N^2)

  • This limits its scalability for large datasets.

  • Best-fit applications: K-medoids is particularly useful in non-Euclidean spaces (like bioinformatics sequences, string matching, or network data) and in small to medium datasets where its robustness and flexibility outweigh its computational cost.

Comparison Of k-medoids Algorithms

Method

Key Feature

Best for

PAM

Exact medoid swaps, robust but slow

Small datasets, high accuracy needs

CLARA

Sample-based PAM

Large datasets, approximate clusters

CLARANS

Randomized local search over medoids

Medium-large datasets, scalable search

FastPAM

Optimized PAM with reuse of distances

Larger datasets, faster exact results

Metaheuristics

Optimization using genetic/PSO/ACO frameworks

High-dimensional, complex clustering

Cluster Interpretability

Cluster interpretability is fundamentally important because clusters in unsupervised learning gain practical value only when they can be understood, explained, and acted upon by domain experts, decision-makers, or stakeholders.

Interpretation connects these outputs to real-world understanding. Specifically, stakeholders need to know what each cluster represents: are these customer segments, behavioural groups, geographic patterns, or product types? Without a clear explanation, clusters risk being dismissed as “black-box artifacts” — mathematically valid but practically useless.

Interpretability bridges the gap between technical results and business actions by providing meaningful narratives and summaries that support decision-making, resource allocation, or policy adjustments. For example, if a marketing team uses clustering for customer segmentation, they need to know which demographics or behaviours characterize each segment to tailor targeted campaigns.

Techniques for Interpretation
  1. Centroid/Medoid Analysis

    • In centroid-based methods (e.g., K-Means), the centroid is the average point representing the cluster center in feature space, while in medoid-based methods (e.g., K-Medoids), the medoid is the actual data point that minimizes within-cluster dissimilarity.

    • Analysing the centroid or medoid helps explain “what typical members of the cluster look like”.

  2. Feature Importance per Cluster (Mean/Median Comparisons)

    • Another approach involves comparing the mean or median of each feature across clusters.

    • By identifying which features have notably different averages, you can summarize which variables are most characteristic of each cluster.

    • This is often visualized using boxplots, parallel coordinates plots, or summary tables.

    • For example, in a demographic clustering task, you might find that cluster 1 has a high average age but low income, while cluster 2 has low average age but high spending patterns.

  3. Visual Inspection

    • Visual tools are essential, especially for communicating results to non-technical audiences:

      • Scatterplots (2D/3D): By plotting two or three features, you can observe spatial separation between clusters and intuitively understand the data structure.

      • Pair plots: These show scatterplots for every pair of features, often color-coded by cluster, to reveal multi-feature interactions.

      • PCA / t-SNE Projections: In high-dimensional data, dimensionality reduction techniques like Principal Component Analysis (PCA) or t-distributed Stochastic Neighbour Embedding (t-SNE) are used to project data into 2D or 3D for visualization.

  4. Cluster Labelling (Using Supervised Ground Truth if Available)

    • If labelled data exists, clusters can sometimes be labelled by assigning the majority class from the known labels within each cluster.

  5. Cluster Profiling

    • A formal cluster profile summarizes each cluster’s defining characteristics using descriptive statistics or domain-specific summaries.

    • For example, after clustering customers, you might create a profile like:

      • Cluster A: young, urban, low income, high spending.

      • Cluster B: older, suburban, high income, conservative spending.

Challenges in interpretation
  1. High-Dimensional Interpretability Challenges

    • High-dimensional datasets (e.g., genomics, text embeddings) are notoriously hard to interpret because humans cannot visualize or comprehend patterns across 100+ dimensions.

  2. Non-Linear Cluster Boundaries

    • Many clustering algorithms assume linear separability (e.g., K-Means uses convex boundaries), but real-world data often contains non-linear patterns.

  3. Dependence on Domain Knowledge

    • Interpretation is not just a technical process but also a domain-driven task.

  4. Danger of Overinterpreting Spurious Clusters

    • Not all clusters reflect real-world phenomena.

    • Researchers caution against the human tendency to “see patterns where none exist” — known as pareidolia in psychology or overfitting in machine learning.

Hierarchical Methods

Hierarchical methods produce a nested structure of clusters, typically visualized as a dendrogram. Two strategies:

  • Agglomerative (bottom-up): start with each data point as its own cluster and iteratively merge

  • Divisive (top-down): start with all data points in one cluster and iteratively split

Example use cases:

  • Gene expression analysis (grouping genes/proteins hierarchically)

  • Customer segmentation with nested segments (e.g., VIP → premium → regular)

  • Document clustering → organizing large corpora into topic trees

Works well when:

  • Need hierarchical relationships (clusters within clusters)

  • Don’t know number of clusters beforehand

  • Want visual representation of cluster structure (dendrogram)

Avoid if:

  • Dataset is very large → high cost

  • Need strict hard assignments

  • Clusters are noisy / poorly separated

Two Strategies:
  • Agglomerative

  • Divisive

How it works:
  • No need to predefine k

  • Captures multilevel relationships between data points

Strengths:
  • Can produce interpretable hierarchy

  • Flexible cluster shapes

Limitations:
  • Computationally expensive

  • Sensitive to linkage criteria

Linkage criteria

Method

How it measures distance between clusters

Purpose / Best Use

Example Use Case

Single Link

Distance between the closest pair of points (one from each cluster)

Best for detecting elongated, irregular- shaped clusters; sensitive to noise/outliers (“chaining” effect)

Identifying geographical clusters following rivers or roads (e.g., habitats along a coastline --> can help protect wildlife conservation zones, or pollution across rivers)

Complete Link

Distance between the farthest pair of points (one from each cluster)

Produces compact, spherical clusters; can split large clusters; sensitive to outliers

Customer segmentation needing tight, well-separated groups for targeted marketing

Average Link

Average distance between all pairs of points (across the two clusters)

Balanced approach; works well for clusters of moderate variance and shape; less sensitive to outliers

Clustering biological samples (e.g., gene expression profiles) with moderate noise. This can help define molecular subtypes of a disease for personalized treatment

Centroid Link

Distance between the centroids (means) of the clusters

Sensitive to shape and size; can produce inversions in dendrogram; works for convex, similarly sized clusters

Clustering product categories by average sales metrics across multiple regions

Note: These linkage criteria are NOT applicable in algorithms like k-means, DBSCAN, or GMM.

Agglomerative

Agglomerative (bottom up) clustering:

  • It builds the dendrogram (tree) from the bottom level, and

  • merges the most similar (or nearest) pair of clusters

  • stops when all the data points are merged into a single cluster (i.e., the root cluster).

  • It is more popular than divisive methods.

Imagine a company wants to segment its customers to tailor marketing strategies. They collect data like:

  • Age

  • Annual spending

  • Purchase frequency

  • Favourite product category

  • Region

Bottom-Up Process Step-by-Step:
  1. Initial step: every customer is its own cluster (singleton).

  2. First merges: similar customers are paired together into tiny groups.

Divisive

Divisive hierarchical clustering is a top-down clustering approach that starts with all data points in a single cluster and recursively splits clusters into smaller sub-clusters until each data point is isolated or a stopping condition is met. ◦

Analogy:

Imagine starting with a big classroom of students. First, you split them by major (e.g., Computer Science vs. Business). Then split Computer Science into AI vs. Cybersecurity. Then split AI into Reinforcement Learning vs. NLP… You keep dividing until each student stands alone."
Document Classification Example
Start by treating all documents as one large cluster.

  • First split: Separate science papers vs. non-science.

  • Next split: Within science, separate biology vs. physics.

  • Next split: Within biology, separate molecular vs. ecology.

  • Keep recursively splitting → build a topic hierarchy tree.

Comparison Agglomerative vs Divisive

Feature

Agglomerative

Divisive

Starts with

Singletons (each point = cluster)

One big cluster

Action

Merges clusters

Splits clusters

Computational cost

Less expensive

More expensive (needs global splitting)

Popularity

More commonly used

Less commonly used

DIANA Algorithm

DIANA (DIvisive ANAlysis) is an implementation of the divisive hierarchical clustering approach. DIANA (DIvisive ANAlysis Clustering) was developed by Kaufman and Rousseeuw (1990) in their seminal book Finding Groups in Data.

Key steps of DIANA:
  • Start with all data in one cluster.

  • Find the most dissimilar object (the “seed”).

  • Create a new cluster around this object.

  • Move other points to the new cluster if they are more dissimilar from the current cluster than the new one.

  • Repeat splitting the remaining clusters until each point is its own cluster (or a stopping criterion is met).

  • DIANA uses a dissimilarity matrix, making it flexible for non-Euclidean distances.

Density-Based Methods

Density-based clustering methods define clusters as contiguous regions of high point density separated by regions of low density. A cluster is formed where data points are densely packed together, and points in sparse regions are considered noise or outliers.

The most well-known algorithm is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

Example use case:

Geospatial clustering (e.g. to detect hotspots of earthquakes,

Works well when:
  • Clusters are irregularly shaped (e.g., non-spherical)

  • You expect outliers / noise and want to detect them

  • Need a parameter-free cluster count (k is not required)

Avoid if:
  • Data is uniformly distributed

  • Data has varying densities → DBSCAN struggles unless you tune ε carefully

  • Dataset is very high-dimensional

Representative Algorithms:
  • DBSCAN

  • OPTICS

How it works:
  • Points within high-density neighbourhoods are grouped into clusters

  • Noise/outliers are identified as low-density points

Strengths:
  • Robust to outliers

  • Can detect arbitrary-shaped clusters

  • Does not require predefining k

Limitations:
  • Struggles with varying densities

  • Sensitive to parameter selection (epsilon, minPts)

Model-Based Methods

Model-based clustering assumes the data is generated from a mixture of underlying probability distributions and aims to estimate the parameters of these distributions to assign points to clusters."
Each cluster is modelled as a component in a mixture model (often Gaussian).

The most common algorithm: Gaussian Mixture Models (GMMs).

Example use cases:
  • Market segmentation where customers belong partially to multiple segments

  • Image processing for colour clustering under Gaussian assumption

  • Genomic data modeling gene expression clusters with uncertainty

works well when:
  • Data is generated from normally distributed populations

  • Clusters may overlap

  • Need probabilistic assignment (uncertainty modeling)

Avoid If:
  • Data has highly irregular/non-Gaussian shapes

  • Very high-dimensional → slow convergence

  • You need clear, hard cluster labels

Representative Algorithms:
  • Gaussian Mixture Models (GMMs)

How it works:
  • Fits a probabilistic model to the data

  • Each cluster corresponds to a component of the mixture

  • Assigns membership probabilities rather than hard assignments

Strengths:
  • Can capture overlapping clusters

  • Probabilistic interpretation

Limitations:
  • Sensitive to initialisation

  • Computationally intensive for large/high- dimensional data

Grid-Based Methods

Grid-based clustering divides the data space into a finite number of cells forming a grid structure, then performs clustering by operating on these cells instead of individual data points. The main idea: quantise data space → summarise data density inside cells → cluster based on cell density patterns.

A famous example: STING (Statistical Information Grid).

Example use cases:

Astronomical data mining (detecting dense star regions), Geographic Information Systems (GIS) (clustering spatial regions), Sensor networks (summarizing coverage zones)

Works well when:
  • Dataset is massive → scalability is priority

  • Interested in speed over precision

  • Data can be easily binned into fixed intervals

Avoid If:
  • High-dimensional data → grid becomes sparse

  • Irregular cluster shapes

  • Need fine-grained clustering boundaries

Representative Algorithms:
  • STING (Statistical Information Grid)

  • CLIQUE (Clustering in Quest)

How it works:
  • Cluster formation occurs within dense cells

  • Reduces computational complexity to number of cells, not data points

Strengths:
  • Scalable to large datasets

  • Efficient for multidimensional data

Limitations:
  • Resolution depends on grid granularity