Clustering

Machine Learning Overview

Introduction to Machine Learning

  • Machine Learning Components:

    • Supervised Learning

    • Unsupervised Learning

    • Reinforcement Learning

    • Deep Learning

  • Focus Areas of This Course:

    • Supervised Learning

    • Unsupervised Learning

Key Differences Between Supervised and Unsupervised Learning

  • Supervised Learning:

    • Definition: Learning occurs with the presence of a label (target column).

    • Types:

      • Classification: Predicts categorical fields.

        • Examples:

          • Decision Trees

          • k-Nearest Neighbors (KNN)

          • Logistic Regression

      • Regression: Predicts continuous fields.

        • Examples:

          • Simple Linear Regression

          • Multiple Linear Regression

          • Ridge Regression

          • Lasso Regression

  • Unsupervised Learning:

    • Definition: Learning occurs without the presence of labels (no target column).

    • Types:

      • Clustering (Segmentation):

        • Examples:

          • k-means Clustering

          • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

          • Hierarchical Clustering (visualized as a dendrogram)

      • Association (Market Basket Analysis):

        • Example: Apriori Algorithm

Clustering

Definition of Clustering

  • Type: Unsupervised machine learning technique.

  • Objective: Group data points based on specific characteristics.

Categories of Clustering Algorithms

  • Main Clustering Algorithms:

    • K-Means Clustering

    • Hierarchical Clustering

    • Density-Based Clustering

Use Cases of Clustering Algorithms

  • Examples of Applications:

    • Document Clustering

    • Recommendation Engine

    • Image Segmentation

    • Market Segmentation

    • Search Result Grouping

    • Anomaly Detection

K-Means Clustering

Introduction to K-Means Clustering

  • Recognition: The most widely recognized clustering algorithm.

  • Objective: Partition the data space so that similarity within each cluster (intra-cluster) is maximized and similarity between clusters (inter-cluster) is minimized.

K-Means Clustering Process

  1. Initialization:

    • Randomly choose k initial items as centroids for k clusters.

  2. Assignment Step:

    • For each item, assign it to the nearest cluster based on distance to centroids.

  3. Centroid Update Step:

    • Compute new centroids for each cluster based on the mean of all points assigned to that cluster.

  4. Iteration:

    • Repeat the assignment and update steps until the assignments no longer change or the maximum number of iterations is reached (a sketch of the full loop follows).
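A minimal NumPy sketch of the four steps above (illustrative only; `kmeans`, `k`, and `max_iters` are our own names, and in practice `sklearn.cluster.KMeans` would be used instead):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: attach each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        #    (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Iterate: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```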

Business Problem Example 1

  • Objective: Cluster customers into high-risk and low-risk categories.

    • Attributes:

      • Income: Annual income of each customer.

      • Loan: Total loan amount granted.

  • Tasks:

    1. Create a scatter plot of the customers.

    2. Build a customer segmentation model (a sketch follows the visualization steps below).

Visualization of Clusters

  • Process:

    1. Plot a scatter plot colored by cluster assignment.

    2. Identify the high-risk and low-risk clusters after clustering.
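A hedged sketch of both tasks with scikit-learn and matplotlib; the `customers.csv` file and the `Income`/`Loan` column names are assumptions made for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data with the two attributes from the example.
df = pd.read_csv("customers.csv")              # assumed columns: Income, Loan
X_scaled = StandardScaler().fit_transform(df[["Income", "Loan"]])

# Task 2: segmentation model with k = 2 (high risk vs. low risk).
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)

# Task 1 / visualization: scatter plot colored by cluster assignment.
plt.scatter(df["Income"], df["Loan"], c=model.labels_)
plt.xlabel("Income")
plt.ylabel("Loan")
plt.title("Customer segments")
plt.show()
```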

Distance Measures in K-Means

Euclidean Distance Calculation

  • Distance Formula:

    • For points $c_1 = (x_1, y_1)$ and $c_2 = (x_2, y_2)$:

      $d(c_1, c_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

  • Example Calculation:

    • Given points $(50000, 60000)$ and $(62000, 56000)$:

      $d = \sqrt{(62000 - 50000)^2 + (56000 - 60000)^2} = \sqrt{160{,}000{,}000} \approx 12649$
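The same arithmetic in a couple of lines of Python, as a quick check:

```python
import math

# Distance between the two example customers (income, loan).
d = math.sqrt((62000 - 50000) ** 2 + (56000 - 60000) ** 2)
print(round(d))  # 12649
```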

Importance of Normalization

  • Context: Normalization is critical for distance-based algorithms such as k-means: features measured on larger scales would otherwise dominate the distance, so scaling gives all dimensions equal weight (see the sketch below).
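A minimal scaling sketch with scikit-learn; either scaler gives the features comparable ranges, and the choice between them is a judgment call:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[50000, 60000],
              [62000, 56000],
              [30000, 12000]], dtype=float)   # (income, loan) rows

X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance per column
X_minmax = MinMaxScaler().fit_transform(X)    # each column rescaled to [0, 1]
```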

K-Means Centroids

Initial Centroid Selection and Iteration

  • Challenge: Randomly chosen initial centroids that fall close together may lead to poor clustering results.

  • Algorithm Improvement: k-means++ spreads the initial centroids apart, which tends to produce better clusterings (see the note below).

  • Iteration of Distance Calculation: Each point is reassigned to the cluster of its nearest centroid and the centroids are recomputed, repeating until the assignments stabilize.
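In scikit-learn, k-means++ is already the default initializer; it is written out explicitly in this sketch only for emphasis:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy 2-D data

# init="k-means++" is the scikit-learn default; n_init reruns the whole
# algorithm several times and keeps the best result.
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)
```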

Optimal Number of Clusters

Determining Cluster Count

  • Implicit Objective Function: Measures the sum of squared distances from observations to their respective centroids, called the Within-Cluster Sum of Squares (WCSS).

  • Formula for WCSS:

    • $\text{WCSS} = \sum_{i} \lVert x_i - Y_i \rVert^2$, where $x_i$ is an observation and $Y_i$ is the centroid of the cluster containing $x_i$.
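A fitted scikit-learn model exposes this quantity as `inertia_`; the manual computation below is only a cross-check of the formula:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(100, 2))  # toy data
model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# WCSS: squared distance from each point to its own centroid, summed.
wcss = ((X - model.cluster_centers_[model.labels_]) ** 2).sum()
print(wcss, model.inertia_)  # equal up to floating-point rounding
```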

Elbow Method for Optimal K

  • Objective: Graph WCSS for different values of k (see the sketch after this list).

  • Process: Identify the point where increasing k results in little improvement in WCSS, typically noted as the elbow.

  • Characteristics:

    • Subjective nature of determining “elbow”.

    • Not universally applicable, especially for high-dimensional or irregular datasets.
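A minimal elbow-method sketch on toy data, using `inertia_` as the WCSS of each fitted model:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(300, 2))  # toy data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```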

Hierarchical Clustering

Agglomerative Hierarchical Clustering

  • Basic Steps:

    1. Initialization: Treat each data point as an individual cluster.

    2. Distance Calculation: Compute proximity matrix.

    3. Merge Closest Clusters: Iteratively combine clusters until one remains.

  • Dendrogram Visualization:

    • Tree-like representation showing the sequence of merges and the distances at which they occur (see the sketch below).
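A brief sketch of the agglomerative steps and the resulting dendrogram using SciPy (toy data; `ward` linkage is one of several possible merge criteria):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.default_rng(3).normal(size=(20, 2))  # small toy dataset

# linkage records the full merge history, from singletons up to one cluster;
# "ward" merges the pair that least increases within-cluster variance.
Z = linkage(X, method="ward")

dendrogram(Z)                    # tree-like view of the successive merges
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()
```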

DBSCAN Clustering

Introduction to DBSCAN

  • Definition: Density-based clustering algorithm that groups closely packed points and identifies outliers.

  • Advantages over K-Means: Capable of clustering irregularly shaped data and inherently robust to noise.

Parameters of DBSCAN

  • Epsilon (ε): The radius around a point that defines the neighborhood used for the density check.

  • minPoints: The minimum number of points required within a point's ε-neighborhood for it to count as a core point.

  • Higher Dimensions: In higher-dimensional data, the ε-neighborhood generalizes from a circle to a hypersphere (see the sketch at the end of this section).

Application Need for DBSCAN

  • Capability: Effectively clusters spatial datasets while robustly detecting noise (see the sketch below).
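A minimal scikit-learn sketch; `eps` and `min_samples` correspond to ε and minPoints, and the values here are illustrative rather than recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: irregular shapes that k-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks noise points; all other labels are cluster ids.
print(np.unique(db.labels_, return_counts=True))
```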