K-Means Clustering
K-Means Clustering Overview
Definition: K-Means clustering is an algorithm that partitions n data points into K clusters, with each data point belonging to the cluster whose mean (centroid) is nearest.
Steps in K-Means Clustering
Step 1: Choose the Number of Clusters (K)
Select the desired number of clusters to identify in the data. Example: K=3.
Step 2: Initialize Clusters
Randomly select K distinct points as initial cluster centroids.
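A minimal sketch of this initialization step, assuming each data point is stored as a tuple of coordinates. The function name init_centroids, the seed parameter, and the sample data are illustrative and not part of the original notes.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k distinct data points to serve as the initial centroids."""
    rng = random.Random(seed)
    # Drop duplicate points first so the chosen centroids are distinct in value.
    unique_points = sorted(set(points))
    if k > len(unique_points):
        raise ValueError("k cannot exceed the number of distinct points")
    return rng.sample(unique_points, k)

# Made-up 2-D data; note the duplicated point (1.0, 1.0).
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 1.0)]
centroids = init_centroids(points, k=2, seed=0)
print(centroids)
```

Passing a seed only makes the random choice repeatable; different seeds can give different starting centroids.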
Step 3: Assign Data Points to Nearest Cluster
Measure the distance from each data point to each of the K centroids, and assign the point to the cluster whose centroid is nearest. For example, with K=3, each point is compared against three centroids.
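The assignment step can be sketched as follows, assuming Euclidean distance and points stored as tuples. The name assign_to_nearest and the sample values are hypothetical.

```python
import math

def assign_to_nearest(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        # Distance from this point to every centroid.
        distances = [math.dist(p, c) for c in centroids]
        # Index of the smallest distance (ties go to the lower index).
        labels.append(distances.index(min(distances)))
    return labels

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
labels = assign_to_nearest(points, centroids)
print(labels)  # [0, 0, 1, 1]
```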
Step 4: Calculate New Centroids
After assigning clusters, calculate the mean of each cluster's data points to determine new centroids.
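As a rough illustration of the centroid update, the sketch below averages each coordinate over the points assigned to a cluster. The function name update_centroids is illustrative, and raising an error for an empty cluster is an assumption; the notes do not say how that case should be handled.

```python
def update_centroids(points, labels, k):
    """Return the mean of the points assigned to each of the k clusters."""
    centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: treat an empty cluster as an error in this sketch.
            raise ValueError(f"cluster {j} has no points")
        # zip(*members) groups the values of each coordinate together.
        centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return centroids

points = [(1.0, 1.0), (2.0, 3.0), (8.0, 8.0), (10.0, 10.0)]
labels = [0, 0, 1, 1]
new_centroids = update_centroids(points, labels, k=2)
print(new_centroids)  # [(1.5, 2.0), (9.0, 9.0)]
```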
Step 5: Repeat Assignment
Reassign each data point to the nearest new centroid.
Step 6: Check for Convergence
Repeat Steps 4 and 5 until no data point changes cluster between iterations. At that point the algorithm has converged.
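Steps 2 through 6 can be combined into one loop. The sketch below is one possible arrangement, not a reference implementation. The max_iter safety cap, the seed parameter, and keeping the previous centroid when a cluster becomes empty are all assumptions added for the example.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=None):
    """Cluster points (tuples of coordinates) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the starting centroids.
    centroids = rng.sample(sorted(set(points)), k)
    labels = None
    for _ in range(max_iter):  # assumption: cap the number of iterations
        # Steps 3 and 5: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Made-up data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2, seed=0)
print(labels)
print(centroids)
```

The cluster numbers themselves are arbitrary: which group ends up labeled 0 or 1 depends on the random starting centroids.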
Assessing Clustering Quality
The quality of a clustering can be assessed by calculating the variation within each cluster and adding these values into a total. For a fixed K, a lower total variation indicates a better clustering.
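The notes do not give a formula for variation. One common choice, assumed here, is the sum of squared Euclidean distances from each point to its cluster's centroid. The function name total_variation is illustrative.

```python
def total_variation(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[label]))
        for p, label in zip(points, labels)
    )

points = [(1.0, 1.0), (2.0, 3.0), (8.0, 8.0), (10.0, 10.0)]
labels = [0, 0, 1, 1]
centroids = [(1.5, 2.0), (9.0, 9.0)]
variation = total_variation(points, labels, centroids)
print(variation)  # 6.5
```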
Choosing Optimal K
Elbow Method: Plot the total variation against different values of K. Total variation generally keeps decreasing as K increases, so look for the 'elbow' point where the rate of decrease slows sharply. That value of K is a reasonable choice for the number of clusters (e.g., K=3).
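A sketch of how the values for an elbow plot might be computed, printing them instead of drawing a graph. It assumes total variation is the sum of squared distances to the assigned centroid, and it keeps the best of several random restarts for each K. The restarts, the function names, and the sample data are assumptions for illustration.

```python
import math
import random

def k_means(points, k, rng, max_iter=100):
    """Compact K-Means loop; returns (centroids, labels)."""
    centroids = rng.sample(sorted(set(points)), k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def total_variation(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def elbow_curve(points, k_values, restarts=10, seed=0):
    """Map each K to the lowest total variation found over several restarts."""
    rng = random.Random(seed)
    curve = {}
    for k in k_values:
        runs = (k_means(points, k, rng) for _ in range(restarts))
        curve[k] = min(total_variation(points, labels, cents) for cents, labels in runs)
    return curve

# Made-up data with three visible groups.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (2.0, 1.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0), (9.0, 8.0),
    (1.0, 9.0), (2.0, 8.5), (1.5, 9.5), (2.0, 9.0),
]
curve = elbow_curve(points, k_values=range(1, 6))
for k, variation in curve.items():
    print(k, round(variation, 2))
```

Reading down the printed values, the elbow is the K after which the drop in variation becomes small compared with the drops before it.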
Distance Calculation
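Use Euclidean distance to measure how far apart two points are. It works in any number of dimensions (the number of features, which is unrelated to the number of clusters K) and generalizes the Pythagorean theorem from two dimensions.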
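To make the Pythagorean analogy concrete, the sketch below computes Euclidean distance as the square root of the sum of squared coordinate differences. The function name euclidean_distance is illustrative. Python's standard library offers the same calculation as math.dist.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two dimensions: the familiar 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))        # 5.0
# Three dimensions: differences of 2, 3 and 6 give sqrt(4 + 9 + 36).
print(euclidean_distance((1, 2, 3), (3, 5, 9)))  # 7.0
```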