K-Means Clustering

K-Means Clustering Overview

  • Definition: K-Means Clustering is an algorithm that partitions n data points into K clusters, where each data point belongs to the cluster with the nearest mean (the cluster's centroid).

Steps in K-Means Clustering

Step 1: Choose the Number of Clusters (K)

  • Select the desired number of clusters to identify in the data. Example: K=3.

Step 2: Initialize Clusters

  • Randomly select K distinct points as initial cluster centroids.
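
A minimal sketch of this initialization step, using only the Python standard library. The function name init_centroids, the seed parameter, and the sample points are illustrative assumptions, not part of the original notes.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k distinct data points to serve as the initial centroids."""
    rng = random.Random(seed)
    # Drop duplicate points (keeping order) so the chosen centroids are distinct.
    unique_points = list(dict.fromkeys(points))
    if k > len(unique_points):
        raise ValueError("k cannot exceed the number of distinct points")
    return rng.sample(unique_points, k)

# Hypothetical 2-D data, represented as tuples of floats.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0)]
centroids = init_centroids(points, k=3, seed=0)
print(centroids)
```

Because the starting points are random, different seeds can lead to different final clusters.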

Step 3: Assign Data Points to Nearest Cluster

  • Measure the distance from each data point to each of the K centroids and assign the point to the cluster whose centroid is nearest (e.g., with K=3, compare the distances to the green, red, and blue centroids).
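
The assignment step might look like the sketch below. The function name assign_points and the sample data are assumptions made for illustration; math.dist (Python 3.8+) computes the Euclidean distance.

```python
import math

def assign_points(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical data: two points near each of two centroids.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]
labels = assign_points(points, centroids)
print(labels)  # [0, 0, 1, 1]
```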

Step 4: Calculate New Centroids

  • After assigning clusters, calculate the mean of each cluster's data points to determine new centroids.
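
As a rough illustration of the update step, the sketch below averages each coordinate over a cluster's members. The name update_centroids is hypothetical, and raising an error on an empty cluster is an assumption; the notes do not say how that case should be handled.

```python
def update_centroids(points, labels, k):
    """Return the mean of the points assigned to each of the k clusters."""
    centroids = []
    for cluster in range(k):
        members = [p for p, label in zip(points, labels) if label == cluster]
        if not members:
            # Assumption: fail loudly rather than guess a centroid.
            raise ValueError(f"cluster {cluster} is empty")
        # zip(*members) groups the coordinates by dimension; average each one.
        centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return centroids

points = [(1.0, 1.0), (2.0, 3.0), (8.0, 8.0), (10.0, 10.0)]
labels = [0, 0, 1, 1]
new_centroids = update_centroids(points, labels, k=2)
print(new_centroids)  # [(1.5, 2.0), (9.0, 9.0)]
```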

Step 5: Repeat Assignment

  • Reassign each data point to the nearest new centroid.

Step 6: Check for Convergence

  • Continue reassigning points and updating centroids until the cluster assignments no longer change, which indicates convergence.
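
Putting Steps 2 through 6 together, one possible sketch of the whole loop follows. The function name k_means, the max_iter safety cap, and the choice to leave an empty cluster's centroid where it is are all assumptions, not details given in the notes.

```python
import math
import random

def k_means(points, k, seed=None, max_iter=100):
    """Run K-Means until assignments stop changing (or max_iter is reached)."""
    rng = random.Random(seed)
    # Step 2: K distinct data points as the starting centroids.
    centroids = rng.sample(list(dict.fromkeys(points)), k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Step 6: no point changed cluster, so the algorithm has converged.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2, seed=0)
print(centroids, labels)
```

Which group ends up labelled 0 and which 1 depends on the random start, so only the grouping itself is meaningful.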

Assessing Clustering Quality

  • Assess the quality of a clustering by summing the variation within each cluster. For a given K, a lower total variation indicates a better clustering.
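
The notes do not define "variation" precisely. The sketch below assumes one common choice, the sum of squared Euclidean distances from each point to its cluster's centroid; the function name is hypothetical.

```python
def total_within_cluster_variation(points, labels, centroids):
    """Sum of squared distances from each point to its own cluster's centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[label]))
        for p, label in zip(points, labels)
    )

points = [(1.0, 1.0), (2.0, 3.0), (8.0, 8.0), (10.0, 10.0)]
labels = [0, 0, 1, 1]
centroids = [(1.5, 2.0), (9.0, 9.0)]
variation = total_within_cluster_variation(points, labels, centroids)
print(variation)  # 6.5
```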

Choosing Optimal K

  • Elbow Method: Plot the total variation against different values of K. Variation keeps decreasing as K grows, so look for the 'elbow' point where the rate of decrease slows sharply; that K is a reasonable choice for the number of clusters (e.g., K=3).
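
A hedged sketch of how the numbers behind an elbow plot could be produced. It assumes variation means the sum of squared distances to the assigned centroid, and it reruns a small K-Means from several random starts per K, keeping the lowest result. The restarts idea, the function names, and the sample data are illustrative assumptions.

```python
import math
import random

def k_means_variation(points, k, seed, max_iter=100):
    """Run one K-Means fit and return its total within-cluster squared distance."""
    rng = random.Random(seed)
    centroids = rng.sample(list(dict.fromkeys(points)), k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

def elbow_curve(points, k_values, restarts=10):
    """Lowest variation found for each K across several random starts."""
    return {
        k: min(k_means_variation(points, k, seed) for seed in range(restarts))
        for k in k_values
    }

# Hypothetical data: three small groups of three points each.
points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0),
    (1.0, 8.0), (2.0, 9.0), (1.0, 9.0),
]
curve = elbow_curve(points, range(1, 7))
for k, variation in curve.items():
    print(k, round(variation, 2))
```

Plotting these pairs (for example with a charting library) gives the curve on which to look for the elbow.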

Distance Calculation

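  • Use Euclidean distance, which extends the Pythagorean theorem from two dimensions to any number of dimensions. The number of dimensions is the number of features per data point and is unrelated to K, the number of clusters.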
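
A small sketch of the distance calculation: the square root of the sum of squared coordinate differences. The function name euclidean_distance is illustrative; the standard library's math.dist (Python 3.8+) performs the same computation.

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))        # 5.0 (the 3-4-5 right triangle)
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0 (same formula in three dimensions)
```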