
Lecture 9

Data Mining for eHealth - Clustering Notes

Introduction

  • COMP8160 eHealth module, Week 12 - Data Mining

  • Instructor: Daniel Soria (he/him)

Overview of Clustering in Data Mining

  • Clustering is an essential step in data mining that involves grouping similar data points to discover patterns.

  • It is particularly relevant in eHealth, where grouping similar medical data can support clinical decision-making.

Key Points on Clustering

  • Definition: Clustering is the process of grouping similar items (data points) in a dataset without predefined labels.

  • It is a form of unsupervised learning, unlike classification (supervised learning).

  • No target value is used in clustering - we're looking for natural groupings in the data.

  • Key Characteristics of Clusters:

    • Data points within the same cluster exhibit high similarity.

    • Data points in different clusters are as dissimilar as possible.

Types of Clustering Approaches

By Assignment Method

  • Hard Clustering: Each data point belongs to exactly one cluster (e.g., K-means).

  • Soft/Fuzzy Clustering: Data points may belong to multiple clusters with varying degrees of membership (e.g., red cluster: 0.7, green cluster: 0.3).

By Structure

  • Partitional Clustering: Divides data into non-overlapping subsets (clusters) where each data point belongs to exactly one cluster. The number of clusters is typically fixed in advance.

  • Hierarchical Clustering:

    • Agglomerative (bottom-up): Starts with individual data points as clusters and merges them.

    • Divisive (top-down): Starts with all data points in one cluster and recursively splits.

    • Creates a dendrogram showing the hierarchical relationship between clusters.
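The agglomerative (bottom-up) process above can be sketched in pure Python. This is a minimal single-linkage illustration on 1-D values; the data points, the single-linkage choice, and the stopping rule (merge until a target number of clusters) are illustrative assumptions, not from the notes.

```python
# Minimal agglomerative clustering sketch: start with each point as its
# own cluster, repeatedly merge the two closest clusters.

def single_linkage(a, b):
    """Distance between two clusters = distance of their closest pair."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, target_k):
    # Start with every point in its own cluster (bottom-up).
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the two closest clusters under single linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge them; recording each merge distance would give the dendrogram.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerative([1, 2, 8, 9, 20], 2))  # [[1, 2, 8, 9], [20]]
```

Stopping instead when one cluster remains, and recording the order and distance of the merges, yields the dendrogram mentioned above.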

Main Clustering Methods

1. K-Means Clustering

  • Definition: A partitioning method that divides data into k distinct clusters based on distance to the centroid of each cluster.

  • Objective Function: Minimize J(V) = sum of squared distances between data points and their cluster centers.

  • Mathematical Formula: J(V) = Σ_{j=1..k} Σ_{x_i ∈ c_j} ||x_i − μ_j||², where:

    • μ_j is the center of cluster j

    • ||x_i − μ_j|| is the Euclidean distance between point x_i and center μ_j

    • c_j is the set of data points assigned to cluster j

  • Process:

    • Select initial k cluster centers (centroids).

    • Allocate each data point to the nearest cluster center based on distance.

    • Recompute cluster center as the average (mean) of assigned data points.

    • Repeat until no data points change clusters (convergence).
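The four steps above can be sketched as a short pure-Python loop. This is a 1-D illustration only; the data values and initial centers in the example call are made up, and the rule for an empty cluster (keep its old center) is one common convention among several.

```python
# Minimal K-means sketch on 1-D data, following the steps in the notes.

def kmeans_1d(points, centres, max_iter=100):
    for _ in range(max_iter):
        # 1. Allocate each point to the nearest cluster center.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda j: abs(p - centres[j]))
            clusters[nearest].append(p)
        # 2. Recompute each center as the mean of its assigned points
        #    (keep the old center if a cluster ends up empty).
        new = [sum(c) / len(c) if c else centres[j]
               for j, c in enumerate(clusters)]
        # 3. Stop when no center moves, i.e. no point changes cluster.
        if new == centres:
            return centres, clusters
        centres = new
    return centres, clusters

centres, clusters = kmeans_1d([2, 3, 4, 10, 11, 12], [0, 20])
print(centres)   # [3.0, 11.0]
print(clusters)  # [[2, 3, 4], [10, 11, 12]]
```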

2. Partitioning Around Medoids (PAM)

  • Similar to K-means, but cluster centers are actual data points from the dataset.

  • More robust to outliers than K-means.
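The key difference from K-means is the center update, and it explains the robustness claim. Below is a sketch of just that update (not the full PAM build-and-swap procedure); the data values are illustrative, with 100 playing the role of an outlier.

```python
# Medoid update: the center must be an actual data point, namely the
# member with the smallest total distance to the rest of the cluster.

def medoid(cluster):
    return min(cluster, key=lambda m: sum(abs(m - p) for p in cluster))

data = [1, 2, 3, 100]
print(sum(data) / len(data))  # mean (K-means center): 26.5, pulled by the outlier
print(medoid(data))           # medoid: 2, an actual point, barely affected
```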

3. Fuzzy C-Means

  • Allows data points to belong to multiple clusters with degrees of membership.

  • Each data point has a set of membership values indicating the degree to which it belongs to each cluster.
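A sketch of how such membership values can be computed, using the standard fuzzy c-means membership formula with fuzzifier m = 2 (the fuzzifier value, the centers, and the query point are illustrative assumptions):

```python
# Fuzzy membership sketch: each point gets a degree of membership in
# every cluster, higher for nearer centers; memberships sum to 1.

def memberships(p, centres, m=2):
    d = [abs(p - c) for c in centres]
    if 0 in d:  # the point coincides with a center
        return [1.0 if di == 0 else 0.0 for di in d]
    exp = 2 / (m - 1)
    return [1 / sum((d[j] / d[k]) ** exp for k in range(len(centres)))
            for j in range(len(centres))]

u = memberships(5, [4, 8])
print([round(x, 3) for x in u])  # [0.9, 0.1]: mostly the cluster centred at 4
```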

Determining the Optimal Number of Clusters

  • When k (the number of clusters) is unknown, use validity indices:

    • Internal criteria that assess how good the resulting clusters are from the data itself

    • Defined considering data dispersion within and between clusters

    • Help select the best number of clusters according to decision rules
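A sketch of the idea behind such indices, comparing within-cluster to between-cluster dispersion (this simple ratio and the toy data are illustrative, not a specific index from the lecture):

```python
# Dispersion-based validity sketch: compact, well-separated clusters
# have small within-cluster spread relative to between-cluster spread,
# so a lower ratio indicates a better clustering.

def dispersion_ratio(clusters):
    means = [sum(c) / len(c) for c in clusters]
    n = sum(len(c) for c in clusters)
    grand = sum(p for c in clusters for p in c) / n
    within = sum((p - m) ** 2 for c, m in zip(clusters, means) for p in c)
    between = sum(len(c) * (m - grand) ** 2 for c, m in zip(clusters, means))
    return within / between

tight = [[1, 2], [10, 11]]  # compact, well separated
loose = [[1, 10], [2, 11]]  # mixed-up grouping of the same points
print(dispersion_ratio(tight) < dispersion_ratio(loose))  # True
```

Computing such an index for each candidate k and picking the best value is one common decision rule.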

Practical Example of K-Means

  • Start with a dataset and initialize cluster centers (e.g., 6, 12, 18)

  • Calculate distances to each cluster center, then assign data points accordingly

  • Update cluster centers by taking the mean of each cluster's points, then repeat until the centers stabilize
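One iteration of this example can be worked through in code. Only the initial centers 6, 12, 18 come from the notes; the data points here are made up for illustration.

```python
# One K-means iteration worked by hand: assign, then update the centers.

points = [4, 7, 10, 13, 16, 19]   # illustrative data
centres = [6, 12, 18]             # initial centers from the example

# Assign each point to its nearest center.
assignment = {c: [] for c in centres}
for p in points:
    nearest = min(centres, key=lambda c: abs(p - c))
    assignment[nearest].append(p)
print(assignment)   # {6: [4, 7], 12: [10, 13], 18: [16, 19]}

# Update each center to the mean of its assigned points.
new_centres = [sum(v) / len(v) for v in assignment.values()]
print(new_centres)  # [5.5, 11.5, 17.5]
```

The next iteration would repeat the assignment with the new centers, continuing until they stop moving.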