Clustering

Machine Learning Overview

Introduction to Machine Learning

  • Machine Learning Components:

    • Supervised Learning

    • Unsupervised Learning

    • Reinforcement Learning

    • Deep Learning

  • Focus Areas of This Course:

    • Supervised Learning

    • Unsupervised Learning

Key Differences Between Supervised and Unsupervised Learning

  • Supervised Learning:

    • Definition: Learning occurs with the presence of a label (target column).

    • Types:

      • Classification: Predicts categorical fields.

        • Examples:

          • Decision Trees

          • k-Nearest Neighbors (KNN)

          • Logistic Regression

      • Regression: Predicts continuous fields.

        • Examples:

          • Simple Linear Regression

          • Multiple Linear Regression

          • Ridge Regression

          • Lasso Regression

  • Unsupervised Learning:

    • Definition: Learning occurs without the presence of labels (no target column).

    • Types:

      • Clustering (Segmentation):

        • Examples:

          • k-means Clustering

          • DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

          • Hierarchical Clustering (visualized as a dendrogram)

      • Association (Market Basket Analysis):

        • Example: Apriori Algorithm

Clustering

Definition of Clustering

  • Type: Unsupervised machine learning technique.

  • Objective: Group data points based on specific characteristics.

Categories of Clustering Algorithms

  • Main Clustering Algorithms:

    • K-Means Clustering

    • Hierarchical Clustering

    • Density-Based Clustering

Use Cases of Clustering Algorithms

  • Examples of Applications:

    • Document Clustering

    • Recommendation Engine

    • Image Segmentation

    • Market Segmentation

    • Search Result Grouping

    • Anomaly Detection

K-Means Clustering

Introduction to K-Means Clustering

  • Recognition: The most widely recognized clustering algorithm.

  • Objective: Partition the data space so that similarity within each cluster (intra-cluster) is maximized and similarity between clusters (inter-cluster) is minimized.

K-Means Clustering Process

  1. Initialization:

    • Randomly choose k initial items as centroids for k clusters.

  2. Assignment Step:

    • For each item, assign it to the nearest cluster based on distance to centroids.

  3. Centroid Update Step:

    • Compute new centroids for each cluster based on the mean of all points assigned to that cluster.

  4. Iteration:

    • Repeat the assignment and update steps until the assignments no longer change or the maximum number of iterations is reached (a sketch of the full loop follows).
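A minimal NumPy sketch of the four steps above (illustrative only; `kmeans`, `k`, and `max_iters` are our own names, and in practice `sklearn.cluster.KMeans` would be used instead):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: attach each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        #    (keeping the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Iterate: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```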

Business Problem Example 1

  • Objective: Cluster customers into high-risk and low-risk categories.

    • Attributes:

      • Income: Annual income of each customer.

      • Loan: Total loan amount granted.

  • Tasks:

    1. Create a scatter plot of the customers.

    2. Build a customer segmentation model (a sketch follows the visualization steps below).

Visualization of Clusters

  • Process:

    1. Plot a scatter plot colored by cluster assignment.

    2. Identify the high-risk and low-risk clusters after clustering.
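A hedged sketch of both tasks with scikit-learn and matplotlib; the `customers.csv` file and the `Income`/`Loan` column names are assumptions made for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data with the two attributes from the example.
df = pd.read_csv("customers.csv")              # assumed columns: Income, Loan
X_scaled = StandardScaler().fit_transform(df[["Income", "Loan"]])

# Task 2: segmentation model with k = 2 (high risk vs. low risk).
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)

# Task 1 / visualization: scatter plot colored by cluster assignment.
plt.scatter(df["Income"], df["Loan"], c=model.labels_)
plt.xlabel("Income")
plt.ylabel("Loan")
plt.title("Customer segments")
plt.show()
```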

Distance Measures in K-Means

Euclidean Distance Calculation

  • Distance Formula:

    • For points $c_1 = (x_1, y_1)$ and $c_2 = (x_2, y_2)$:

      $d(c_1, c_2) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

  • Example Calculation:

    • Given points $(50000, 60000)$ and $(62000, 56000)$:

      $d = \sqrt{(62000 - 50000)^2 + (56000 - 60000)^2} = \sqrt{160{,}000{,}000} \approx 12649$
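The same arithmetic in a couple of lines of Python, as a quick check:

```python
import math

# Distance between the two example customers (income, loan).
d = math.sqrt((62000 - 50000) ** 2 + (56000 - 60000) ** 2)
print(round(d))  # 12649
```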

Importance of Normalization

  • Context: Normalization is critical for distance-based algorithms such as k-means: features measured on larger scales would otherwise dominate the distance, so scaling gives all dimensions equal weight (see the sketch below).
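A minimal scaling sketch with scikit-learn; either scaler gives the features comparable ranges, and the choice between them is a judgment call:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[50000, 60000],
              [62000, 56000],
              [30000, 12000]], dtype=float)   # (income, loan) rows

X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance per column
X_minmax = MinMaxScaler().fit_transform(X)    # each column rescaled to [0, 1]
```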

K-Means Centroids

Initial Centroid Selection and Iteration

  • Challenge: Randomly chosen initial centroids that fall close together may lead to poor clustering results.

  • Algorithm Improvement: k-means++ spreads the initial centroids apart, which tends to produce better clusterings (see the note below).

  • Iteration of Distance Calculation: Each point is reassigned to the cluster of its nearest centroid and the centroids are recomputed, repeating until the assignments stabilize.
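In scikit-learn, k-means++ is already the default initializer; it is written out explicitly in this sketch only for emphasis:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # toy 2-D data

# init="k-means++" is the scikit-learn default; n_init reruns the whole
# algorithm several times and keeps the best result.
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)
```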

Optimal Number of Clusters

Determining Cluster Count

  • Implicit Objective Function: Measures the sum of squared distances from observations to their respective centroids, called the Within-Cluster Sum of Squares (WCSS).

  • Formula for WCSS:

    • $\text{WCSS} = \sum_{i} \lVert x_i - Y_i \rVert^2$, where $x_i$ is an observation and $Y_i$ is the centroid of the cluster containing $x_i$.
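A fitted scikit-learn model exposes this quantity as `inertia_`; the manual computation below is only a cross-check of the formula:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(100, 2))  # toy data
model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# WCSS: squared distance from each point to its own centroid, summed.
wcss = ((X - model.cluster_centers_[model.labels_]) ** 2).sum()
print(wcss, model.inertia_)  # equal up to floating-point rounding
```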

Elbow Method for Optimal K

  • Objective: Graph WCSS for different values of k (see the sketch after this list).

  • Process: Identify the point where increasing k results in little improvement in WCSS, typically noted as the elbow.

  • Characteristics:

    • Subjective nature of determining “elbow”.

    • Not universally applicable, especially for high-dimensional or irregular datasets.
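A minimal elbow-method sketch on toy data, using `inertia_` as the WCSS of each fitted model:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(300, 2))  # toy data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```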

Hierarchical Clustering

Agglomerative Hierarchical Clustering

  • Basic Steps:

    1. Initialization: Treat each data point as an individual cluster.

    2. Distance Calculation: Compute proximity matrix.

    3. Merge Closest Clusters: Iteratively combine clusters until one remains.

  • Dendrogram Visualization:

    • Tree-like representation showing the sequence of merges and the distances at which they occur (see the sketch below).
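A brief sketch of the agglomerative steps and the resulting dendrogram using SciPy (toy data; `ward` linkage is one of several possible merge criteria):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.default_rng(3).normal(size=(20, 2))  # small toy dataset

# linkage records the full merge history, from singletons up to one cluster;
# "ward" merges the pair that least increases within-cluster variance.
Z = linkage(X, method="ward")

dendrogram(Z)                    # tree-like view of the successive merges
plt.xlabel("Data point index")
plt.ylabel("Merge distance")
plt.show()
```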

DBSCAN Clustering

Introduction to DBSCAN

  • Definition: Density-based clustering algorithm that groups closely packed points and identifies outliers.

  • Advantages over K-Means: Capable of clustering irregularly shaped data and inherently robust to noise.

Parameters of DBSCAN

  • Epsilon (ε): The radius around a point that defines the neighborhood used for the density check.

  • minPoints: The minimum number of points required within a point's ε-neighborhood for it to count as a core point.

  • Higher Dimensions: In higher-dimensional data, the ε-neighborhood generalizes from a circle to a hypersphere (see the sketch at the end of this section).

Application Need for DBSCAN

  • Capability: Effectively clusters spatial datasets while robustly detecting noise (see the sketch below).
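A minimal scikit-learn sketch; `eps` and `min_samples` correspond to ε and minPoints, and the values here are illustrative rather than recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: irregular shapes that k-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Label -1 marks noise points; all other labels are cluster ids.
print(np.unique(db.labels_, return_counts=True))
```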