Data Mining for eHealth - Clustering Notes
Introduction
COMP8160 eHealth module, Week 12 - Data Mining
Instructor: Daniel Soria (he/him)
Overview of Clustering in Data Mining
Clustering is an essential step in data mining that involves grouping similar data points to discover patterns.
It is particularly relevant in eHealth, where clustering medical data can support clinical decision-making.
Key Points on Clustering
Definition: Clustering is the process of grouping similar items (data points) in a dataset without predefined labels.
It is a form of unsupervised learning, unlike classification (supervised learning).
No target value is used in clustering - we're looking for natural groupings in the data.
Key Characteristics of Clusters:
Data points within the same cluster exhibit high similarity.
Data points in different clusters are as dissimilar as possible.
Types of Clustering Approaches
By Assignment Method
Hard Clustering: Each data point belongs to exactly one cluster (e.g., K-means).
Soft/Fuzzy Clustering: Data points may belong to multiple clusters with varying degrees of membership (e.g., red cluster: 0.7, green cluster: 0.3).
By Structure
Partitional Clustering: Divides data into non-overlapping subsets (clusters) where each data point belongs to exactly one cluster. Uses a fixed number of clusters.
Hierarchical Clustering:
Agglomerative (bottom-up): Starts with individual data points as clusters and merges them.
Divisive (top-down): Starts with all data points in one cluster and recursively splits.
Creates a dendrogram showing the hierarchical relationship between clusters.
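The agglomerative (bottom-up) procedure above can be sketched in a few lines of Python. This is a minimal illustration on 1-D data using single-linkage distances (the function name and the choice of single linkage are illustrative, not from the notes):

```python
def agglomerative(points, n_clusters):
    """Bottom-up clustering: start with singletons, repeatedly merge the closest pair."""
    clusters = [[p] for p in points]  # each data point starts as its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-linkage distance
        # (the distance between their two closest members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

Recording the order of merges (instead of stopping at a fixed number of clusters) is what yields the dendrogram.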
Main Clustering Methods
1. K-Means Clustering
Definition: A partitioning method that divides data into k distinct clusters based on distance to the centroid of each cluster.
Objective Function: Minimize J(V) = sum of squared distances between data points and their cluster centers.
Mathematical Formula: J(V) = Σⱼ Σ_{xi ∈ cj} ||xi − μj||², where:
μj is the center of cluster j
||xi − μj|| is the Euclidean distance between point xi and center μj
cj is the set of data points assigned to cluster j
Process:
Select initial k cluster centers (centroids).
Allocate each data point to the nearest cluster center based on distance.
Recompute cluster center as the average (mean) of assigned data points.
Repeat until no data points change clusters (convergence).
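The four steps above can be sketched directly in Python. This is a minimal NumPy sketch (function and variable names are illustrative), with a guard so an empty cluster keeps its previous center:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: returns (centers, labels) for data matrix X."""
    rng = np.random.default_rng(seed)
    # Step 1: select initial k cluster centers by sampling data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: allocate each point to the nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4: stop when the centers no longer move (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```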
2. Partitioning Around Medoids (PAM)
Similar to K-means, but cluster centers are actual data points from the dataset.
More robust to outliers than K-means.
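The key difference from K-means is the center-update step: a medoid is the cluster member with the smallest total distance to the other members, so an extreme outlier cannot drag the center the way it drags a mean. A sketch of that step on 1-D data (the helper name is illustrative):

```python
def medoid(cluster_points):
    """Return the member of the cluster minimizing total distance to the rest."""
    return min(cluster_points,
               key=lambda c: sum(abs(c - p) for p in cluster_points))
```

For example, for the cluster [1, 2, 9] the mean is 4 (pulled toward the outlier 9), while the medoid is the actual data point 2.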
3. Fuzzy C-Means
Allows data points to belong to multiple clusters with degrees of membership.
Each data point has a set of membership values indicating the degree to which it belongs to each cluster.
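The membership-update step can be sketched with the standard fuzzy c-means membership formula, using a fuzzifier m (m = 2 is a common default; names are illustrative and this shows 1-D data only):

```python
def memberships(x, centers, m=2.0):
    """Degrees of membership of point x in each cluster (they sum to 1)."""
    d = [abs(x - c) for c in centers]
    if 0.0 in d:  # point coincides with a center: full membership there
        return [1.0 if di == 0.0 else 0.0 for di in d]
    # u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1))
    return [1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                      for k in range(len(centers)))
            for j in range(len(centers))]
```

A point closer to one center than another gets a larger (but not exclusive) membership in the closer cluster, matching the 0.7 / 0.3 example given earlier.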
Determining the Optimal Number of Clusters
When k (the number of clusters) is not known in advance, use validity indices:
Internal criteria that assess how good the clusters are
Defined using data dispersion within and between clusters
Help select the best number of clusters according to a decision rule
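One simple way to turn within- and between-cluster dispersion into a score is their ratio (larger is better). This is an illustrative simplification on 1-D data, not a specific named index from the notes:

```python
def dispersion_ratio(clusters):
    """Between-cluster dispersion divided by within-cluster dispersion (1-D data)."""
    all_points = [p for c in clusters for p in c]
    grand_mean = sum(all_points) / len(all_points)
    means = [sum(c) / len(c) for c in clusters]
    # Within: squared distances of points to their own cluster mean.
    within = sum((p - means[i]) ** 2 for i, c in enumerate(clusters) for p in c)
    # Between: squared distances of cluster means to the grand mean, weighted by size.
    between = sum(len(c) * (means[i] - grand_mean) ** 2
                  for i, c in enumerate(clusters))
    return between / within if within > 0 else float("inf")
```

Computing such an index for several candidate values of k and picking the best score is one decision rule for choosing the number of clusters.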
Practical Example of K-Means
Start with a dataset and initialize cluster centers (e.g., 6, 12, 18)
Calculate distances to each cluster center, then assign data points accordingly
Update cluster centers by taking averages, then repeat until the assignments no longer change
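Carrying this example through in code, with the initial centers 6, 12, 18 from the notes and an illustrative 1-D dataset (the notes do not specify the data points):

```python
data = [2, 4, 7, 10, 13, 16, 20, 22]  # illustrative data, not from the notes
centers = [6, 12, 18]                 # initial centers from the example

for _ in range(10):
    # Assign each point to its nearest center.
    groups = {c: [] for c in centers}
    for x in data:
        nearest = min(centers, key=lambda c: abs(x - c))
        groups[nearest].append(x)
    # Update each center as the average of its assigned points.
    new_centers = [sum(g) / len(g) for g in groups.values() if g]
    if new_centers == centers:
        break  # assignments stabilized
    centers = new_centers
```

With this data the assignments settle after one update, giving clusters {2, 4, 7}, {10, 13}, {16, 20, 22} with centers roughly 4.33, 11.5, 19.33.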