Lecture Notes on Machine Learning Concepts

Introduction

This session continues from the previous lecture and builds on its foundational concepts.
It includes an important recording and necessary corrections to the previously distributed notes.

Averaging Methods in Classification

This section clarifies the averaging methods used to aggregate per-class classification metrics, which matters most in multi-class problems.
Macro Averaging:
- Calculates the metric independently for each class and then takes the unweighted average of these per-class metrics.
- Gives equal weight to all classes, regardless of their support (number of true instances).
- This method is particularly suitable for imbalanced datasets as it treats each class equally, preventing the performance of minority classes from being overshadowed by majority classes.
Weighted Averaging:
- Calculates the metric for each class independently and then computes a weighted average, where the weight for each class is proportional to the number of samples in that class (their support).
- Gives more importance to majority classes due to their larger number of samples.
- This method is useful when you want to prioritize the overall performance across the entire dataset, or when you want to give more emphasis to the classes that occur more frequently in the data, based on specific use cases or business objectives.
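The difference between the two averaging methods can be illustrated with a minimal sketch. The labels below are made-up toy data, and the helper names (per_class_recall, macro_and_weighted) are hypothetical; recall is used as the per-class metric, but the same averaging applies to precision or F1.

```python
from collections import Counter

def per_class_recall(y_true, y_pred, label):
    """Recall for one class: correctly predicted instances / true instances."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    support = sum(1 for t in y_true if t == label)
    return hits / support if support else 0.0

def macro_and_weighted(y_true, y_pred):
    """Return (macro average, support-weighted average) of per-class recall."""
    support = Counter(y_true)
    scores = {label: per_class_recall(y_true, y_pred, label) for label in support}
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[label] * support[label] for label in support) / len(y_true)
    return macro, weighted

# Toy data: 8 samples of class 0 (majority), 2 samples of class 1 (minority).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two minority samples is missed

macro, weighted = macro_and_weighted(y_true, y_pred)
print(f"macro={macro:.2f} weighted={weighted:.2f}")  # macro=0.75 weighted=0.90
```

The missed minority sample pulls the macro average down much more than the weighted average, because the minority class counts as one half of the macro score but only one fifth of the weighted score.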

Imbalanced Datasets

Definition and Implications: Imbalanced datasets are characterized by a significant disparity in the number of samples across different classes, where one or more classes (majority classes) have many more instances than others (minority classes).
Example: Fraud detection in credit card transactions provides a classic example.
- The majority class would be "Non-fraud" (representing the vast majority of daily transactions).
- The minority class would be "Fraud" (representing a very small percentage of transactions).
- A dataset is generally considered balanced for classification when the majority-to-minority ratio is close to even, for instance a 50-50 or 60-40 split, or at most about 70-30 for certain types of classification; the exact threshold varies.
Accuracy of Prediction vs. Class Imbalance (The Accuracy Paradox):
- Consider an example case: In a dataset with 1000 instances, Class 1 has 900 instances (majority), and Class 2 has 100 instances (minority).
- A naive prediction method that simply predicts every instance as Class 1 (the majority class) would show a high accuracy of 90% (900 correct predictions out of 1000).
- This 90% accuracy is highly misleading: the model fails to identify any instance of the minority class (Class 2), so its recall on the critical minority class is 0%. This is why accuracy alone is insufficient for evaluating models on imbalanced datasets.
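The 900/100 example above can be reproduced in a few lines. This is only an illustrative sketch; the variable names are hypothetical.

```python
# 900 instances of Class 1 (majority) and 100 instances of Class 2 (minority).
y_true = [1] * 900 + [2] * 100

# Naive model: predict the majority class for every instance.
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: how many true Class 2 instances were found.
minority_found = sum(1 for t, p in zip(y_true, y_pred) if t == 2 and p == 2)
minority_recall = minority_found / y_true.count(2)

print(f"accuracy={accuracy:.2f} minority_recall={minority_recall:.2f}")
# accuracy=0.90 minority_recall=0.00
```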

Strategies for Addressing Class Imbalance

1. Collect More Data

This is often the most effective solution as it directly addresses the root cause of imbalance by increasing the representation of the under-represented class.
However, it may not always be feasible due to financial, time, ethical, or availability constraints (e.g., rare diseases, historical events).

2. Random Upsampling of Minority Class

Involves increasing the number of instances in the minority class by randomly replicating existing samples with replacement until the class distribution is more balanced (e.g., increasing minority class from 100 to 900 instances).
Pros: Simple to implement.
Cons: Can lead to overfitting, as the model sees the same minority samples multiple times, and does not add any new, unique information to the dataset, potentially leading to a less robust model.
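A minimal sketch of random upsampling, using only the standard library. The function name random_upsample and its parameters are hypothetical, and the integer "samples" stand in for real feature rows.

```python
import random

def random_upsample(samples, labels, minority_label, target_count, seed=0):
    """Replicate minority samples (with replacement) until there are target_count of them."""
    rng = random.Random(seed)
    minority = [s for s, l in zip(samples, labels) if l == minority_label]
    n_extra = max(0, target_count - len(minority))
    extra = rng.choices(minority, k=n_extra)  # sampling with replacement
    return samples + extra, labels + [minority_label] * n_extra

# Toy data: sample ids 0..899 belong to class 1, ids 900..999 to class 2.
samples = list(range(1000))
labels = [1] * 900 + [2] * 100

up_samples, up_labels = random_upsample(samples, labels, minority_label=2, target_count=900)
print(up_labels.count(1), up_labels.count(2))  # 900 900
```

Every added sample is an exact copy of an existing minority sample, which is why this approach adds no new information. Upsampling would normally be applied to the training split only, so that duplicated samples do not leak into validation or test data.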

3. Downsampling of Majority Class

Involves reducing the number of instances in the majority class by randomly removing samples until the class distribution is more balanced.
Pros: Can help address imbalance and reduce computational cost.
Cons: Often undesirable as it may discard valuable information contained in the majority class samples, potentially leading to a loss of important patterns and decreased model performance.
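Random downsampling can be sketched in the same spirit. The function name random_downsample and its parameters are hypothetical, and the integer "samples" stand in for real feature rows.

```python
import random

def random_downsample(samples, labels, majority_label, target_count, seed=0):
    """Keep a random subset (without replacement) of target_count majority samples."""
    rng = random.Random(seed)
    majority_idx = [i for i, l in enumerate(labels) if l == majority_label]
    keep = set(rng.sample(majority_idx, k=min(target_count, len(majority_idx))))
    kept = [
        (s, l)
        for i, (s, l) in enumerate(zip(samples, labels))
        if l != majority_label or i in keep
    ]
    return [s for s, _ in kept], [l for _, l in kept]

# Toy data: 900 majority samples (class 1) and 100 minority samples (class 2).
samples = list(range(1000))
labels = [1] * 900 + [2] * 100

down_samples, down_labels = random_downsample(samples, labels, majority_label=1, target_count=100)
print(down_labels.count(1), down_labels.count(2))  # 100 100
```

In this toy case 800 of the 900 majority samples are thrown away, which makes the information loss described above concrete.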

4. Synthetic Data Augmentation

These techniques artificially increase the size of the dataset, particularly the minority class, by generating new, synthetic samples rather than simply duplicating existing ones.
Examples from image processing: Image modifications like cropping, rotating, flipping, adjusting brightness, or adding noise are common forms of data augmentation to increase dataset size and variability.
SMOTE (Synthetic Minority Oversampling Technique): Generates synthetic samples for the minority class. It works by taking an instance from the minority class, finding its k-nearest neighbors, randomly selecting one of these neighbors, and creating a new synthetic instance at a random point along the line segment between the original instance and its chosen neighbor in the feature space. This helps create meaningful new examples that are similar but not identical to existing minority class instances.
ADASYN (Adaptive Synthetic Sampling): An extension of SMOTE that generates more synthetic data for minority class instances that are harder to learn, measured by how many of their nearest neighbors belong to the majority class (typically instances near the decision boundary). This adaptively shifts the decision boundary toward the difficult instances, improving the model's robustness on hard examples.
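The SMOTE interpolation step described above can be sketched with the standard library alone. This is a simplified illustration, not a reference implementation: the function name smote_sketch, its parameters, and the toy points are all hypothetical, and a real project would more likely use a maintained library.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority points
    and one of their k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Random point on the line segment between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

minority_points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote_sketch(minority_points, n_new=10)
print(len(new_points), new_points[0])
```

Because each synthetic point is a convex combination of two existing minority points, it always lies between them in feature space rather than duplicating either one.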

Metrics for Class Performance

Precision, Recall, and F1 Score: These are crucial metrics for evaluating the performance of classifiers, especially when dealing with imbalanced datasets, as accuracy can be misleading.
Precision: \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} Measures the proportion of positive identifications that were actually correct. It's about how many of the predicted positives are truly positive.
Recall (Sensitivity): \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} Measures the proportion of actual positives that were identified correctly. It's about how many of the actual positives the model caught.
F1 Score: \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} The harmonic mean of precision and recall. It tries to strike a balance between precision and recall, making it a good single metric for imbalanced datasets.
When aggregating these per-class metrics, it is important to choose an appropriate averaging strategy (e.g., macro or weighted averaging) so that the minority class's performance is not underrepresented in the overall evaluation.
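As a worked example of the three formulas, the sketch below computes them from confusion-matrix counts. The counts (80 true positives, 20 false positives, 40 false negatives) are invented for illustration, and the function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for the positive class.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```

The F1 score (0.727) sits below the arithmetic mean of precision and recall (0.733) because the harmonic mean is pulled toward the lower of the two values.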

Logistic Regression Overview

Logistic Regression: Despite its "regression" in the name, it is fundamentally a classification algorithm widely used for binary classification problems, though it can be extended to multi-class problems.
It is best suited for data that is (at least approximately) linearly separable, meaning there exists a linear decision boundary (a hyperplane in higher dimensions) that can effectively separate the different classes.
Output: The model outputs a continuous value between 0 and 1, which is interpreted as the probability of an instance belonging to the positive class.
Sigmoid Function Example: The core of logistic regression is the sigmoid (or logistic) activation function, which squashes any real-valued number into a value between 0 and 1:
\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
Here, x is the linear combination of the input features X and the weights w, plus the bias b: x = w \cdot X + b.
Decision Boundary: A decision boundary is determined by setting a threshold (commonly 0.5) on the predicted probabilities. If the probability is above the threshold, the instance is classified as the positive class; otherwise, it's the negative class. Adjusting this threshold directly impacts the trade-off between precision and recall.
Applications: Examples include email spam classification (spam vs. non-spam), disease diagnosis (presence vs. absence of disease), customer churn prediction, and sentiment analysis.
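The sigmoid function and the thresholding step described above can be sketched as follows. The weights, bias, and feature values are arbitrary illustrative numbers, not fitted parameters, and the helper names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(weights, bias, features):
    """Probability of the positive class: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def classify(probability, threshold=0.5):
    """Positive class (1) if the probability is above the threshold, else 0."""
    return 1 if probability > threshold else 0

# Arbitrary example: w . x + b = 2.0*1.0 + (-1.0)*0.5 - 0.5 = 1.0
p = predict_probability(weights=[2.0, -1.0], bias=-0.5, features=[1.0, 0.5])
print(f"{p:.3f}", classify(p), classify(p, threshold=0.8))  # 0.731 1 0
```

The same instance flips from positive to negative when the threshold is raised from 0.5 to 0.8. Raising the threshold generally trades recall for precision.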

K-Nearest Neighbors (KNN)

Mechanism: KNN is a non-parametric, lazy learning algorithm that works by classifying a new data point based on the majority class of its K nearest neighbors in the feature space.
The "distance" to neighbors is typically calculated using metrics like Euclidean distance, Manhattan distance, or others, determining how similar data points are.
Impact of K:
- If K=1, the model is highly sensitive to noise and outliers, as a single noisy neighbor can determine the classification.
- Increasing K generally stabilizes the decision boundaries, making the model more robust to noise but can also oversimplify complex decision boundaries, potentially leading to underfitting (high bias, low variance).
Choosing Optimal K: The optimal value of K is crucial for good performance and is typically found through systematic methods like cross-validation or grid search strategies to identify the hyperparameters that yield the best model performance on unseen data.
Curse of Dimensionality: KNN's performance can degrade significantly in high-dimensional spaces, a phenomenon known as the "curse of dimensionality," where distances become less meaningful, and the data becomes sparse.
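A from-scratch sketch of the KNN mechanism shows how the choice of K changes the prediction in the presence of a noisy point. The function name knn_predict and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Cluster "A" near the origin, cluster "B" near (5, 5),
# plus one noisy "B" point sitting inside the "A" cluster.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
labels = ["A", "A", "A", "B", "B", "B", "B"]
query = (0.6, 0.6)

print(knn_predict(points, labels, query, k=1))  # B (decided by the noisy neighbor)
print(knn_predict(points, labels, query, k=3))  # A (the noisy neighbor is outvoted)
```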

Support Vector Machines (SVM)

Objective: SVM aims to find the optimal hyperplane (a decision boundary) that best separates data points of different classes by maximizing the margin between them. The margin is the distance between the hyperplane and the closest data points from each class.
Support Vectors: The hyperplane is crucially derived from the data points that are closest to it. These critical data points, which directly influence the position and orientation of the hyperplane, are called "support vectors."
Optimization with Slack Variables: SVM's optimization seeks to maximize the classification margin while also minimizing misclassification errors. For non-perfectly separable data, "slack variables" are introduced to allow some misclassifications or points within the margin, penalizing them in the objective function to find a balance between margin maximization and error minimization.
Linear vs. Non-linear Data:
- Linear SVM: Works well when classes can be separated by a straight line or plane.
- Non-linear SVM (Kernel Trick): Handles non-linear data through the kernel trick. Kernels are functions (e.g., RBF - Radial Basis Function, Polynomial, Sigmoid) that implicitly map the original input features into a higher-dimensional space where the data may become linearly separable. The kernel computes inner products in that space directly, which avoids the computationally intensive explicit mapping of every data point.
Kernel Types: Various kernel types allow SVM to manage complex data transformations effectively for decision boundary determination, depending on the intrinsic distribution and complexity of the data.
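The idea of an implicit mapping can be made concrete with a small sketch. For 2-dimensional inputs, the degree-2 polynomial kernel (x . z)^2 gives the same value as an ordinary dot product after explicitly mapping each input to 3 dimensions. The function names below are hypothetical, and the RBF gamma value is an arbitrary choice.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (no constant term) on 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def explicit_map(x):
    """Explicit feature map whose dot product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = (1.0, 2.0), (3.0, 4.0)
implicit = poly2_kernel(x, z)
explicit = sum(a * b for a, b in zip(explicit_map(x), explicit_map(z)))
print(implicit, round(explicit, 6))  # 121.0 121.0
```

The kernel reaches the same number without ever building the higher-dimensional vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.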
Example Formulations for SVM:
- The primary goal is to maximize the margin, which is inversely proportional to the norm of the weight vector \|w\|:
  \text{Maximize:} \quad \frac{2}{\|w\|}
- Subject to the constraint that all data points are correctly classified with a margin of at least 1 (for hard-margin SVM) or considering slack variables (for soft-margin SVM):
  y_i (w \cdot x_i + b) \ge 1 \quad \text{for all } i
  where y_i is the label (-1 or 1), x_i are the features, w is the weight vector, and b is the bias.
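The margin, the constraint, and the slack variables can be checked numerically for a given hyperplane. In the sketch below, w and b are hand-picked for illustration (they are not the output of an SVM solver), the data points are invented, and the function names are hypothetical. The slack of each point is taken as max(0, 1 - y_i (w . x_i + b)), the smallest value that satisfies the relaxed soft-margin constraint.

```python
import math

def margin_width(w):
    """Width of the margin: 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

def constraint_values(w, b, X, y):
    """y_i * (w . x_i + b) for every point; >= 1 means the hard-margin constraint holds."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi, yi in zip(X, y)]

def slack_values(w, b, X, y):
    """Slack per point: 0 if the constraint holds, otherwise the shortfall."""
    return [max(0.0, 1 - v) for v in constraint_values(w, b, X, y)]

w, b = (1.0, 1.0), -3.0  # hand-picked hyperplane: x1 + x2 - 3 = 0
X = [(1.0, 1.0), (0.0, 0.0), (2.0, 2.0), (3.0, 3.0), (2.5, 0.0)]
y = [-1, -1, 1, 1, -1]

print(round(margin_width(w), 4))    # 1.4142
print(constraint_values(w, b, X, y))  # [1.0, 3.0, 1.0, 3.0, 0.5]
print(slack_values(w, b, X, y))       # [0.0, 0.0, 0.0, 0.0, 0.5]
```

Points with a constraint value of exactly 1 lie on the margin boundary. The last point is on the correct side of the hyperplane but inside the margin, so it needs a slack of 0.5.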

Implementation Details

Python Libraries: Powerful Python libraries like scikit-learn (sklearn) provide efficient and user-friendly implementations of SVM, KNN, and Logistic Regression, among other machine learning algorithms.
These libraries offer various hyperparameters that can be tuned to optimize model performance, such as C (regularization parameter) and kernel for SVM, n_neighbors for KNN, and solver for Logistic Regression.
Importance of Tuning: There is a strong emphasis on regularization (preventing overfitting) and comprehensive hyperparameter tuning for sensitive models like KNN and SVM, as their performance is highly dependent on these choices.
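A hedged sketch of hyperparameter tuning with scikit-learn follows. The synthetic dataset, the parameter grids, the step names ("scale", "model"), and the choice of macro F1 as the scoring metric are all assumptions made for illustration. Scaling is placed inside the pipeline so that it is refit on each cross-validation training fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in data (roughly 80% / 20%).
X, y = make_classification(n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0)

# Illustrative grids over the hyperparameters named in the notes.
candidates = {
    "knn": (KNeighborsClassifier(), {"model__n_neighbors": [1, 3, 5, 11]}),
    "svm": (SVC(), {"model__C": [0.1, 1, 10], "model__kernel": ["linear", "rbf"]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    pipeline = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    search = GridSearchCV(pipeline, grid, scoring="f1_macro", cv=5)
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

The best parameters depend on the data, so no particular outcome should be expected from this sketch.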

Project Requirements and Assignment Guidelines

Data and Target Variable Identification: Clearly define the dataset being used and precisely identify the target variable that the model will predict.
Preprocessing: Includes essential steps such as stratifying the dataset during splitting (e.g., train-test split) to maintain class proportions, and applying correct scaling methods (e.g., StandardScaler, MinMaxScaler) to features, which is critical for distance-based algorithms like KNN and SVM.
Experimentation Component: Requires thorough experimentation with different hyperparameters for the chosen models, evaluating their performance on the validation data with relevant metrics, particularly precision and recall (and precision-recall curves).
Visual Presentations: Creation of clear visual presentations (e.g., plots, graphs) to illustrate findings, model performances, and decision boundaries where applicable.
Detailed Comparison in Reporting Outcomes: A comprehensive report detailing the methodologies, experimental results, and a thoughtful comparison of different models' performances and their implications.
Logical Reasoning: The reasoning behind metric selection (e.g., why F1 score over accuracy for imbalanced data) and the chosen thresholds (e.g., the decision threshold for logistic regression) should underpin the final project analysis and conclusions.
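The preprocessing and evaluation steps listed above can be combined into one short scikit-learn sketch. The data here is synthetic stand-in data with a 900/100 class split, and the choice of logistic regression, StandardScaler, and a 20% validation split are assumptions for illustration rather than project requirements.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: 900 majority (0) and 100 minority (1) samples.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(900, 2)),
    rng.normal(2.0, 1.0, size=(100, 2)),
])
y = np.array([0] * 900 + [1] * 100)

# Stratified split keeps the 90/10 class proportions in both subsets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)

model = LogisticRegression().fit(X_train_s, y_train)
val_scores = model.predict_proba(X_val_s)[:, 1]  # probability of the minority class

# Precision and recall at every candidate threshold on the validation data.
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
ap = average_precision_score(y_val, val_scores)
print(int(y_train.sum()), int(y_val.sum()), round(ap, 3))
```

The precision, recall, and thresholds arrays can be passed to a plotting library to draw the precision-recall curve and to justify the chosen decision threshold in the report.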