Clustering
Machine Learning Overview
Introduction to Machine Learning
Machine Learning Components:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Deep Learning
Focus Areas of This Course:
Supervised Learning
Unsupervised Learning
Key Differences Between Supervised and Unsupervised Learning
Supervised Learning:
Definition: Learning occurs with the presence of a label (target column).
Types:
Classification: Predicts categorical targets.
Examples:
Decision Trees
k-Nearest Neighbors (KNN)
Logistic Regression
Regression: Predicts continuous targets.
Examples:
Simple Linear Regression
Multiple Linear Regression
Ridge Regression
Lasso Regression
Unsupervised Learning:
Definition: Learning occurs without the presence of labels (no target column).
Types:
Clustering (Segmentation):
Examples:
k-means Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Hierarchical Clustering (visualized as a dendrogram)
Association (Market Basket Analysis):
Example: Apriori Algorithm
Clustering
Definition of Clustering
Type: Unsupervised machine learning technique.
Objective: Group data points based on shared characteristics.
Categories of Clustering Algorithms
Main Clustering Algorithms:
K-Means Clustering
Hierarchical Clustering
Density-Based Clustering
Use Cases of Clustering Algorithms
Examples of Applications:
Document Clustering
Recommendation Engine
Image Segmentation
Market Segmentation
Search Result Grouping
Anomaly Detection
K-Means Clustering
Introduction to K-Means Clustering
Recognition: The most widely recognized clustering algorithm.
Objective: Partition the data space so as to maximize similarity within clusters (intra-cluster) and minimize similarity between clusters (inter-cluster).
K-Means Clustering Process
Initialization:
Randomly choose k initial items as centroids for k clusters.
Assignment Step:
For each item, assign it to the nearest cluster based on distance to centroids.
Centroid Update Step:
Compute new centroids for each cluster based on the mean of all points assigned to that cluster.
Iteration:
Repeat the assignment and update steps until the assignments no longer change or the maximum number of iterations is reached (a sketch follows below).
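A minimal NumPy sketch of these four steps (illustrative only; the function name `kmeans`, the random seed, and the convergence check are assumptions, not from the source):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: randomly choose k items as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Centroid update step: each centroid becomes the mean of its points
        # (a production version would also handle clusters that become empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Iteration: stop early once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```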
Business Problem Example 1
Objective: Cluster customers into high- and low-risk categories.
Attributes:
Income: Annual income of each customer.
Loan: Total loan amount granted.
Tasks:
Create a scatter plot of the customers.
Build a customer segmentation model.
Visualization of Clusters
Process:
Plot scatter plot colored by cluster.
Identify the high- and low-risk clusters after clustering (a scikit-learn sketch follows below).
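A hedged sketch of this workflow with scikit-learn and matplotlib; the column names `Income` and `Loan` and the sample values are assumptions for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customer data (values assumed for illustration)
df = pd.DataFrame({
    "Income": [50000, 62000, 48000, 95000, 30000, 105000],
    "Loan":   [60000, 56000, 70000, 12000, 80000, 9000],
})

# Segmentation model with k=2 (high-risk vs. low-risk)
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(df)

# Scatter plot colored by cluster assignment
plt.scatter(df["Income"], df["Loan"], c=model.labels_)
plt.xlabel("Income")
plt.ylabel("Loan")
plt.title("Customer segments")
plt.show()
```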
Distance Measures in K-Means
Euclidean Distance Calculation
Distance Formula:
For points $c_1 = (x_1, y_1)$ and $c_2 = (x_2, y_2)$:
$d(c_1, c_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
Example Calculation:
Given points (50000, 60000) and (62000, 56000):
$d = \sqrt{(50000 - 62000)^2 + (60000 - 56000)^2} = \sqrt{160{,}000{,}000} \approx 12649.11$
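The same calculation as a quick NumPy check (not from the source):

```python
import numpy as np

c1 = np.array([50000, 60000])
c2 = np.array([62000, 56000])
# Euclidean distance: sqrt((50000-62000)^2 + (60000-56000)^2)
print(np.linalg.norm(c1 - c2))  # ~12649.11
```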
Importance of Normalization
Context: Normalization is critical for distance-based algorithms: scaling the features appropriately gives all dimensions equal weight in the distance calculation (a sketch follows below).
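A minimal sketch with scikit-learn's StandardScaler; the toy features on very different scales are assumed for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (e.g., income vs. number of loans);
# without scaling, Euclidean distance would be dominated by the first column
X = np.array([[50000, 2], [62000, 1], [48000, 4], [95000, 1]])

# After scaling, each column has zero mean and unit variance,
# so both dimensions carry equal weight in the distance calculation
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```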
K-Means Centroids
Initial Centroid Selection and Iteration
Challenge: Initial centroids chosen close together may lead to poor clustering results.
Algorithm Improvement: K-means++ helps select better initial centroids.
Iterative Reassignment: At each iteration, assign every point to the cluster with the nearest centroid and update the centroids, repeating until the assignments stabilize (see the sketch below).
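In scikit-learn, k-means++ is the default initialization; a minimal sketch making it explicit (the toy data is assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((100, 2))  # toy data

# init="k-means++" spreads the initial centroids apart, reducing the
# chance that centroids starting close together produce poor clusters
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)
print(model.cluster_centers_)
```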
Optimal Number of Clusters
Determining Cluster Count
Implicit Objective Function: Measures the sum of squared distances from observations to their respective centroids, called the Within-Cluster Sum of Squares (WCSS).
Formula for WCSS:
$\mathrm{WCSS} = \sum_{i=1}^{m} \lVert x_i - Y_i \rVert^2$
where $x_i$ is an observation and $Y_i$ is the centroid of the cluster to which $x_i$ is assigned.
Elbow Method for Optimal K
Objective: Plot WCSS for a range of k values to find the point where the decrease in WCSS slows sharply.
Process: Identify the point where increasing k results in little improvement in WCSS, typically noted as the elbow.
Characteristics:
Determining the "elbow" is subjective.
Not universally applicable, especially for high-dimensional or irregularly structured datasets (a sketch of the method follows below).
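A common sketch of the elbow method using scikit-learn, whose `inertia_` attribute stores the WCSS of a fitted model (the toy data is assumed):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 2))  # toy data

wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this k

# Plot WCSS against k and look for the "elbow" where the curve flattens
plt.plot(list(ks), wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.show()
```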
Hierarchical Clustering
Agglomerative Hierarchical Clustering
Basic Steps:
Initialization: Treat each data point as an individual cluster.
Distance Calculation: Compute proximity matrix.
Merge Closest Clusters: Iteratively combine clusters until one remains.
Dendrogram Visualization:
A tree-like representation showing the order in which clusters are merged (a SciPy sketch follows below).
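A minimal SciPy sketch of agglomerative clustering and its dendrogram (the toy data and the choice of Ward linkage are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.default_rng(0).random((10, 2))  # toy data

# linkage() starts with each point as its own cluster and iteratively
# merges the closest pair; Ward linkage minimizes within-cluster variance
Z = linkage(X, method="ward")

# The dendrogram shows the merge order; cutting it at a chosen height
# yields a flat clustering
dendrogram(Z)
plt.show()
```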
DBSCAN Clustering
Introduction to DBSCAN
Definition: Density-based clustering algorithm that groups closely packed points and identifies outliers.
Advantages over K-Means: Capable of clustering irregularly shaped data and inherently robust to noise.
Parameters of DBSCAN
Epsilon (ε): Defines radius around a point for neighborhood density check.
minPoints: Minimum points required in ε neighborhood to classify as a core point.
Higher Dimensions: In higher dimensions, the ε neighborhood generalizes from a circle to a hypersphere.
Application Need for DBSCAN
Capability: Effectively clusters spatial datasets while robustly detecting noise (a sketch follows below).
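A hedged sketch with scikit-learn's DBSCAN, where `eps` and `min_samples` correspond to ε and minPoints; the two-moons dataset is an assumed example of an irregular shape that k-means handles poorly:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles: a shape k-means cannot separate well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the ε neighborhood radius; min_samples is minPoints
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Points labeled -1 are treated as noise (outliers)
print("clusters found:", set(db.labels_) - {-1})
print("noise points:", (db.labels_ == -1).sum())
```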