
Lecture 9

Data Mining for eHealth - Clustering Notes

Introduction

  • COMP8160 eHealth module, Week 12 - Data Mining

  • Instructor: Daniel Soria (he/him)

Overview of Clustering in Data Mining

  • Clustering is an essential step in data mining that involves grouping similar data points to discover patterns.

  • It is particularly relevant in eHealth, where grouping similar medical data can support clinical decision-making.

Key Points on Clustering

  • Definition: Clustering is the process of grouping similar items (data points) in a dataset without predefined labels.

  • It is a form of unsupervised learning, unlike classification (supervised learning).

  • No target value is used in clustering - we're looking for natural groupings in the data.

  • Key Characteristics of Clusters:

    • Data points within the same cluster exhibit high similarity.

    • Data points in different clusters are as dissimilar as possible.

Types of Clustering Approaches

By Assignment Method

  • Hard Clustering: Each data point belongs to exactly one cluster (e.g., K-means).

  • Soft/Fuzzy Clustering: Data points may belong to multiple clusters with varying degrees of membership (e.g., red cluster: 0.7, green cluster: 0.3).

By Structure

  • Partitional Clustering: Divides data into non-overlapping subsets (clusters) where each data point belongs to exactly one cluster. The number of clusters is typically fixed in advance.

  • Hierarchical Clustering:

    • Agglomerative (bottom-up): Starts with individual data points as clusters and merges them.

    • Divisive (top-down): Starts with all data points in one cluster and recursively splits.

    • Creates a dendrogram showing the hierarchical relationship between clusters.
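The agglomerative (bottom-up) process above can be sketched in pure Python. This is a minimal single-linkage illustration on 1-D values; the data points, the single-linkage choice, and the stopping rule (merge until a target number of clusters) are illustrative assumptions, not from the notes.

```python
# Minimal agglomerative clustering sketch: start with each point as its
# own cluster, repeatedly merge the two closest clusters.

def single_linkage(a, b):
    """Distance between two clusters = distance of their closest pair."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, target_k):
    # Start with every point in its own cluster (bottom-up).
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the two closest clusters under single linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge them; recording each merge distance would give the dendrogram.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerative([1, 2, 8, 9, 20], 2))  # [[1, 2, 8, 9], [20]]
```

Stopping instead when one cluster remains, and recording the order and distance of the merges, yields the dendrogram mentioned above.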

Main Clustering Methods

1. K-Means Clustering

  • Definition: A partitioning method that divides data into k distinct clusters based on distance to the centroid of each cluster.

  • Objective Function: Minimize J(V) = sum of squared distances between data points and their cluster centers.

  • Mathematical Formula: J(V) = Σ_{j=1..k} Σ_{x_i ∈ c_j} ||x_i − μ_j||², where:

    • μ_j is the center of cluster j

    • ||x_i − μ_j|| is the Euclidean distance between point x_i and center μ_j

    • c_j is the set of data points assigned to cluster j

  • Process:

    • Select initial k cluster centers (centroids).

    • Allocate each data point to the nearest cluster center based on distance.

    • Recompute cluster center as the average (mean) of assigned data points.

    • Repeat until no data points change clusters (convergence).
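The four steps above can be sketched as a short pure-Python loop. This is a 1-D illustration only; the data values and initial centers in the example call are made up, and the rule for an empty cluster (keep its old center) is one common convention among several.

```python
# Minimal K-means sketch on 1-D data, following the steps in the notes.

def kmeans_1d(points, centres, max_iter=100):
    for _ in range(max_iter):
        # 1. Allocate each point to the nearest cluster center.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda j: abs(p - centres[j]))
            clusters[nearest].append(p)
        # 2. Recompute each center as the mean of its assigned points
        #    (keep the old center if a cluster ends up empty).
        new = [sum(c) / len(c) if c else centres[j]
               for j, c in enumerate(clusters)]
        # 3. Stop when no center moves, i.e. no point changes cluster.
        if new == centres:
            return centres, clusters
        centres = new
    return centres, clusters

centres, clusters = kmeans_1d([2, 3, 4, 10, 11, 12], [0, 20])
print(centres)   # [3.0, 11.0]
print(clusters)  # [[2, 3, 4], [10, 11, 12]]
```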

2. Partitioning Around Medoids (PAM)

  • Similar to K-means, but cluster centers are actual data points from the dataset.

  • More robust to outliers than K-means.
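The key difference from K-means is the center update, and it explains the robustness claim. Below is a sketch of just that update (not the full PAM build-and-swap procedure); the data values are illustrative, with 100 playing the role of an outlier.

```python
# Medoid update: the center must be an actual data point, namely the
# member with the smallest total distance to the rest of the cluster.

def medoid(cluster):
    return min(cluster, key=lambda m: sum(abs(m - p) for p in cluster))

data = [1, 2, 3, 100]
print(sum(data) / len(data))  # mean (K-means center): 26.5, pulled by the outlier
print(medoid(data))           # medoid: 2, an actual point, barely affected
```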

3. Fuzzy C-Means

  • Allows data points to belong to multiple clusters with degrees of membership.

  • Each data point has a set of membership values indicating the degree to which it belongs to each cluster.
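A sketch of how such membership values can be computed, using the standard fuzzy c-means membership formula with fuzzifier m = 2 (the fuzzifier value, the centers, and the query point are illustrative assumptions):

```python
# Fuzzy membership sketch: each point gets a degree of membership in
# every cluster, higher for nearer centers; memberships sum to 1.

def memberships(p, centres, m=2):
    d = [abs(p - c) for c in centres]
    if 0 in d:  # the point coincides with a center
        return [1.0 if di == 0 else 0.0 for di in d]
    exp = 2 / (m - 1)
    return [1 / sum((d[j] / d[k]) ** exp for k in range(len(centres)))
            for j in range(len(centres))]

u = memberships(5, [4, 8])
print([round(x, 3) for x in u])  # [0.9, 0.1]: mostly the cluster centred at 4
```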

Determining the Optimal Number of Clusters

  • When k (the number of clusters) is unknown, use validity indices:

    • Internal criteria that assess how good the resulting clusters are from the data itself

    • Defined considering data dispersion within and between clusters

    • Help select the best number of clusters according to decision rules
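A sketch of the idea behind such indices, comparing within-cluster to between-cluster dispersion (this simple ratio and the toy data are illustrative, not a specific index from the lecture):

```python
# Dispersion-based validity sketch: compact, well-separated clusters
# have small within-cluster spread relative to between-cluster spread,
# so a lower ratio indicates a better clustering.

def dispersion_ratio(clusters):
    means = [sum(c) / len(c) for c in clusters]
    n = sum(len(c) for c in clusters)
    grand = sum(p for c in clusters for p in c) / n
    within = sum((p - m) ** 2 for c, m in zip(clusters, means) for p in c)
    between = sum(len(c) * (m - grand) ** 2 for c, m in zip(clusters, means))
    return within / between

tight = [[1, 2], [10, 11]]  # compact, well separated
loose = [[1, 10], [2, 11]]  # mixed-up grouping of the same points
print(dispersion_ratio(tight) < dispersion_ratio(loose))  # True
```

Computing such an index for each candidate k and picking the best value is one common decision rule.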

Practical Example of K-Means

  • Start with a dataset and initialize cluster centers (e.g., 6, 12, 18)

  • Calculate distances to each cluster center, then assign data points accordingly

  • Update cluster centers by taking the mean of each cluster's points, then repeat until the centers stabilize
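One iteration of this example can be worked through in code. Only the initial centers 6, 12, 18 come from the notes; the data points here are made up for illustration.

```python
# One K-means iteration worked by hand: assign, then update the centers.

points = [4, 7, 10, 13, 16, 19]   # illustrative data
centres = [6, 12, 18]             # initial centers from the example

# Assign each point to its nearest center.
assignment = {c: [] for c in centres}
for p in points:
    nearest = min(centres, key=lambda c: abs(p - c))
    assignment[nearest].append(p)
print(assignment)   # {6: [4, 7], 12: [10, 13], 18: [16, 19]}

# Update each center to the mean of its assigned points.
new_centres = [sum(v) / len(v) for v in assignment.values()]
print(new_centres)  # [5.5, 11.5, 17.5]
```

The next iteration would repeat the assignment with the new centers, continuing until they stop moving.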