K-Means Clustering

K-Means Clustering Overview

Definition: K-Means Clustering is an algorithm that partitions n data points into K clusters, where each data point belongs to the cluster with the nearest mean (the cluster's centroid).

Steps in K-Means Clustering

Step 1: Choose the Number of Clusters (K)

Select the desired number of clusters to identify in the data. Example: K=3.

Step 2: Initialize Clusters

Randomly select K distinct data points to serve as the initial cluster centroids.
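The initialization step can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name init_centroids, the seed parameter, and the sample points are assumptions made for the example, not part of the original notes.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k of the data points at random to serve as initial centroids."""
    if k > len(points):
        raise ValueError("k cannot exceed the number of data points")
    rng = random.Random(seed)  # seed is a hypothetical convenience for repeatable runs
    # rng.sample draws without replacement, so no list position is chosen twice
    return [tuple(p) for p in rng.sample(points, k)]

# Illustrative data (made up for this sketch)
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids = init_centroids(points, k=3, seed=0)
print(centroids)
```

Because the starting centroids are random, different runs can end in different clusterings; fixing the seed only makes a single run repeatable.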

Step 3: Assign Data Points to Nearest Cluster

Measure the distance from each data point to each of the K cluster centroids and assign the point to the cluster whose centroid is nearest.
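A small sketch of the assignment step, assuming points and centroids are stored as tuples of coordinates. The helper name assign_points and the data are illustrative; math.dist (Python 3.8+) computes the Euclidean distance.

```python
import math

def assign_points(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        # Distance from this point to every centroid
        distances = [math.dist(p, c) for c in centroids]
        # Index of the smallest distance; ties go to the lower index
        labels.append(distances.index(min(distances)))
    return labels

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
labels = assign_points(points, centroids)
print(labels)  # [0, 0, 1, 1]
```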

Step 4: Calculate New Centroids

After assigning clusters, calculate the mean of each cluster's data points to determine new centroids.
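The centroid update can be sketched as a per-cluster average of coordinates. The function name update_centroids is hypothetical, and the handling of an empty cluster (returning None) is an assumption of this sketch, since the notes do not cover that case.

```python
def update_centroids(points, labels, k):
    """Return the mean of the points assigned to each of the k clusters."""
    dims = len(points[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p, label in zip(points, labels):
        counts[label] += 1
        for d in range(dims):
            sums[label][d] += p[d]
    centroids = []
    for j in range(k):
        if counts[j] == 0:
            # Assumption: a cluster with no points has no mean, so report None
            centroids.append(None)
        else:
            centroids.append(tuple(s / counts[j] for s in sums[j]))
    return centroids

points = [(1.0, 1.0), (2.0, 3.0), (8.0, 8.0), (10.0, 10.0)]
labels = [0, 0, 1, 1]
new_centroids = update_centroids(points, labels, k=2)
print(new_centroids)  # [(1.5, 2.0), (9.0, 9.0)]
```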

Step 5: Repeat Assignment

Reassign each data point to the nearest new centroid.

Step 6: Check for Convergence

Repeat Steps 4 and 5 until no data point changes cluster between iterations. At that point the centroids stop moving and the algorithm has converged.
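Putting the steps together, the whole procedure might look like the sketch below. It is one possible reading of the steps, not a reference implementation: the max_iter safety cap, the seed parameter, and the choice to leave a centroid in place when its cluster is empty are all assumptions added for the example.

```python
import math
import random

def kmeans(points, k, seed=None, max_iter=100):
    """Run K-Means and return (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: pick k data points as the initial centroids
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):  # max_iter is an assumed safety cap
        # Steps 3 and 5: assign every point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Illustrative data: two visibly separate groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2, seed=0)
print(centroids)
print(labels)
```

The loop stops as soon as an assignment pass reproduces the previous labels, which is the "no changes" condition described above.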

Assessing Clustering Quality

The quality of a clustering can be assessed by adding up the variation within each cluster. For a fixed K, a lower total variation indicates a better clustering.
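One way to put a number on this is sketched below. It assumes "variation" means the sum of squared Euclidean distances from each point to its cluster's centroid, which is a common choice but is not spelled out in the notes; the function name and data are illustrative.

```python
import math

def total_within_cluster_variation(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        math.dist(p, centroids[label]) ** 2
        for p, label in zip(points, labels)
    )

points = [(1.0, 1.0), (2.0, 3.0), (8.0, 8.0), (10.0, 10.0)]
labels = [0, 0, 1, 1]
centroids = [(1.5, 2.0), (9.0, 9.0)]  # the means of the two clusters
variation = total_within_cluster_variation(points, labels, centroids)
print(round(variation, 6))  # 6.5
```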

Choosing Optimal K

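Elbow Method: Plot the total variation against different values of K. Total variation tends to keep shrinking as K grows, reaching zero when every point is its own cluster, so the lowest value alone is not a useful target. Instead, look for the 'elbow', the point after which adding another cluster reduces the variation much less. That K is a reasonable choice for the number of clusters (e.g., K=3).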
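The elbow method can be tried out with a sketch like the one below, which prints the total variation for K = 1 to 6 instead of drawing a graph. It bundles a compact K-Means so the block runs on its own. The data, the seed, and the use of the sum of squared distances as "variation" are assumptions of the example, and because initialization is random the numbers for a given K can vary with the seed.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """Compact K-Means: returns (centroids, labels)."""
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

def total_variation(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

# Illustrative data laid out as three loose groups
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
    (1.0, 9.0), (2.0, 8.5), (1.5, 9.5),
]

rng = random.Random(0)
variation_by_k = {}
for k in range(1, 7):
    centroids, labels = kmeans(points, k, rng)
    variation_by_k[k] = total_variation(points, labels, centroids)

for k, variation in variation_by_k.items():
    print(f"K={k}: total variation = {variation:.2f}")
```

Reading down the printed values, the elbow is the K after which the drop from one line to the next becomes small. In practice these values would usually be plotted, and each K is often run several times with different starting centroids.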

Distance Calculation

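Use Euclidean distance to measure how far a data point is from a centroid. It works in any number of dimensions and generalizes the Pythagorean theorem from two dimensions: square the difference along each dimension, add the squares, and take the square root.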
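The distance calculation can be written out explicitly as below. The function name euclidean_distance is chosen for illustration; Python's built-in math.dist computes the same quantity.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two points with the same number of dimensions."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Square the difference along each dimension, sum, then take the square root
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d2 = euclidean_distance((0, 0), (3, 4))        # 2 dimensions: the 3-4-5 right triangle
d3 = euclidean_distance((0, 0, 0), (1, 2, 2))  # 3 dimensions
print(d2)  # 5.0
print(d3)  # 3.0
```

The two-dimensional case is exactly the Pythagorean theorem; higher dimensions just add more squared differences under the square root.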