Comprehensive Machine Learning Notes

Overview

This lecture provides an in-depth exploration of methods for assessing machine learning algorithms, particularly for regression and classification tasks. It explains the essential steps in model evaluation and preparation, including detailed methods for handling data, measuring performance, and selecting algorithms. Key topics covered are dividing datasets into training, validation, and testing sets, evaluating performance using metrics such as Mean Squared Error (MSE) for regression and confusion matrices for classification, applying logistic regression for both binary and multiclass classification, employing data scaling techniques to standardize feature ranges, and utilizing the k-nearest neighbors (KNN) algorithm for classification.

Regression Problems and Performance Measures

In regression analysis, the main objective is to model the relationship between input variables (features) and a continuous output variable (target). The quality of this relationship, meaning how accurately the model predicts the target from the features, is assessed with performance measures such as Mean Squared Error (MSE).

Mean Squared Error (MSE)

MSE is the average of the squared differences between actual and predicted values. Because the errors are squared, large errors are penalized heavily, which makes the metric especially sensitive to outliers. The formula is:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

where y_i represents the actual target values, \hat{y}_i represents the predicted values, and n is the number of data points in the dataset.
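
As a minimal sketch of the formula, the function below computes MSE with the Python standard library only. The function name and the sample values are illustrative assumptions, not part of the lecture.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values: three predictions are close, the last one is off by 4.
y_true = [3.0, 5.0, 2.0, 8.0]
y_pred = [2.5, 5.0, 2.0, 4.0]
mse = mean_squared_error(y_true, y_pred)
print(mse)  # the single large error contributes 16 of the 16.25 total
```

The example illustrates the outlier sensitivity described above: one large error accounts for almost the entire sum of squares.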

Splitting Data Sets

To properly train and evaluate machine learning models, data is divided into three distinct sets. This division ensures the model is trained on one part of the data, validated on another to fine-tune hyperparameters, and tested on a held-out set to evaluate its ability to generalize.

Training Set: Comprises 60-70% of the data and is used to train the model. It is essential for the model to learn patterns and relationships within the data.
Validation Set: Contains 10-20% of the data and is used for hyperparameter tuning and model selection. It helps prevent overfitting by providing an unbiased evaluation during training.
Testing Set: Includes 20-30% of the data and is used to assess the model's final performance. It provides an independent evaluation of how well the model generalizes to unseen data.
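
The split described above can be sketched as follows, assuming a 60/20/20 division and a shuffle with a fixed seed. The helper name, the fractions, and the seed are illustrative choices; libraries such as scikit-learn offer their own splitting utilities.

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]  # everything that is left
    return train, val, test


rows = list(range(100))  # stand-in for 100 data points
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))
```

Shuffling before cutting matters when the data are stored in some order (for example sorted by class), because otherwise the three sets would not be comparable.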

Model Selection via MSE

Model selection involves choosing the best model by minimizing the validation MSE. For example, in polynomial regression, different polynomial degrees are evaluated to find the best fit. The choice of degree significantly affects the model's ability to capture underlying patterns without overfitting.

\hat{y}(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d

The polynomial degree that results in the lowest validation MSE is chosen as the optimal model complexity, balancing model fit and generalization.
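
A small sketch of this selection loop, using only the standard library. The least-squares fit via normal equations, the synthetic quadratic target, and the fixed perturbations are assumptions made for illustration; real code would typically rely on a numerical library.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares weights w_0..w_d via the normal equations (small d only)."""
    m = degree + 1
    # Augmented matrix [A | b] with A[j][k] = sum x^(j+k), b[j] = sum y * x^j.
    aug = [[sum(x ** (j + k) for x in xs) for k in range(m)]
           + [sum(y * x ** j for x, y in zip(xs, ys))] for j in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[j][m] / aug[j][j] for j in range(m)]


def predict(weights, x):
    return sum(w * x ** k for k, w in enumerate(weights))


def mse(xs, ys, weights):
    return sum((y - predict(weights, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def target(x):
    # Hypothetical "true" relationship: a quadratic.
    return 1.0 + 2.0 * x - 0.5 * x ** 2


train_x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1]  # fixed perturbations
train_y = [target(x) + e for x, e in zip(train_x, noise)]
val_x = [-1.75, -0.75, 0.25, 1.25]
val_y = [target(x) for x in val_x]

# Fit each candidate degree on the training set, score it on the validation set.
val_mse = {d: mse(val_x, val_y, fit_polynomial(train_x, train_y, d)) for d in range(5)}
best_degree = min(val_mse, key=val_mse.get)
print(val_mse, best_degree)
```

Degrees 0 and 1 cannot represent the curvature and score clearly worse on the validation set, while degrees above 2 mainly fit the perturbations.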

Logistic Regression for Classification

Logistic regression estimates the probabilities of binary outcomes using the logistic (sigmoid) function. It is commonly used for binary classification because of its simplicity and interpretability.

\sigma(z) = \frac{1}{1 + e^{-z}}

where the linear combination z is defined as:

z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n

Classification decisions are made by comparing the sigmoid function’s output to a threshold, typically 0.5. If the output is greater than 0.5, the instance is classified as positive; otherwise, it is negative.
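
The decision rule can be sketched as below. The weights and intercept are hypothetical values chosen for illustration; they are not fitted to any data.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(weights, bias, features, threshold=0.5):
    """Return (probability, label); the label is 1 if probability > threshold."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p > threshold)


# Hypothetical weights w_1, w_2 and intercept w_0.
weights, bias = [0.8, -0.4], -0.2
p_pos, label_pos = predict_label(weights, bias, [2.0, 1.0])  # z is about 1
p_neg, label_neg = predict_label(weights, bias, [0.0, 2.0])  # z is about -1
print(p_pos, label_pos, p_neg, label_neg)
```

Because the sigmoid equals 0.5 exactly at z = 0, a threshold of 0.5 is equivalent to checking the sign of the linear combination z.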

Data Scaling Techniques

Data scaling is crucial to ensure all features contribute equally to model training and to prevent features with larger ranges from dominating the learning process. Two common methods are:

Min-Max Normalization: Scales data to fit within the range [0,1] using the formula:
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
Standardization (Z-score): Scales data to have a mean of 0 and a standard deviation of 1, calculated as:
x' = \frac{x - \mu}{\sigma}
where \mu is the mean and \sigma is the standard deviation.
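
Both formulas can be sketched in a few lines of standard-library Python. The function names and the sample feature are assumptions, and the population standard deviation is used here because the notes do not say which variant is intended.

```python
import statistics


def min_max_scale(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max scaling is undefined for a constant feature")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (population version)."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standardization is undefined for a constant feature")
    return [(v - mu) / sigma for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical feature
scaled = min_max_scale(heights_cm)
z_scores = standardize(heights_cm)
print(scaled)
print(z_scores)
```

In a train/validation/test setting, the minimum, maximum, mean, and standard deviation are usually computed on the training set only and then reused for the other sets, so that no information leaks from held-out data.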

K-Nearest Neighbors (KNN) Algorithm

KNN classifies a data point by the majority class among its K nearest neighbors, with closeness typically measured by Euclidean distance. It is a non-parametric algorithm that makes predictions based on the similarity of data points.

distance(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
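
A minimal sketch of the classifier, assuming a brute-force search over all training points. The function names and the toy data are illustrative only.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Majority label among the k training points closest to the query."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data with two well-separated classes.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["a", "a", "a", "b", "b", "b"]
pred_low = knn_predict(points, labels, (1.2, 1.4))
pred_high = knn_predict(points, labels, (6.4, 6.2))
print(pred_low, pred_high)
```

Every prediction scans the whole training set, which is why the method becomes expensive as the dataset grows.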

Choosing Optimal K

Selecting the optimal K value is critical for KNN's performance:

Too small K: Results in overfitting, making the model sensitive to noise.
Too large K: Results in underfitting, causing the model to ignore local data structure.

Advantages and Disadvantages

Advantages:
- Simple and intuitive to understand.
- Requires no assumptions about data distribution.
Disadvantages:
- Computationally expensive for large datasets.
- Sensitive to irrelevant features and scaling differences.

Cross-Validation

K-fold cross-validation assesses model performance and generalization, providing a more robust estimate than a single train-test split.

Split the dataset into K equally sized subsets.
Train the model K times, each time using a different subset as the validation set and the remaining subsets as the training set.
Average performance metrics across all K folds for a robust estimate.
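
The three steps above can be sketched as follows. The helper names are assumptions, and the scorer passed in at the end is a placeholder; a real scorer would train a model on the training indices and evaluate it on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each index is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(n_samples, k, score_fold):
    """Average the score returned by score_fold(train_idx, val_idx) over k folds."""
    scores = [score_fold(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 5))
# Placeholder scorer that only reports the validation fold size.
avg = cross_validate(10, 5, lambda tr, va: len(va))
print(folds[0], avg)
```

This sketch splits the indices in their original order; shuffling them first is a common refinement when the data are sorted.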

Next Steps

Review logistic regression principles and applications.
Investigate the KNN algorithm, focusing on K selection strategies.
Prepare the Iris dataset for K-fold cross-validation exercises.
Practice implementing min-max normalization and standardization in Python.

What is Machine Learning About?

The main challenge in machine learning is accurately learning the target function from limited and noisy data. Data scarcity and noise can significantly affect a model's ability to generalize.

Limited amount of data: Insufficient data can lead to poor generalization.
The presence of noise \epsilon: Noise can obscure true data relationships.

Noise can be inherent in data collection or the underlying phenomenon. The goal is to approximate the relationship:

Y = f(X_1, X_2, X_3, \dots, X_p) + \epsilon

where:

Y is the output variable.
(X_1, X_2, X_3, \dots, X_p) are input variables.
f is the function mapping inputs to the output.
\epsilon represents noise or random error.

What is the Function Needed for f?

Tasks typically include:

Prediction/Forecasting: Estimating the value of Y for new values of (X_1, X_2, X_3, \dots, X_p).
Inference: Understanding how Y depends on the input variables, focusing on interpreting f.

Examples

Prediction (Forecasting):
- Predicting stroke or heart attack probability based on blood parameters, smoking status, weight, and blood pressure.
- Assessing loan applicant creditworthiness based on banking and credit history.
- Predicting alloy quality based on production parameters.

Other Classifications - Prediction vs. Classification

Prediction (Forecasting): Output variable is continuous (e.g., regression).
- Example: Predicting house prices, temperature, or sales figures.
Classification Problems: Assigning output to discrete categories.
- Examples:
  - Determining if an email is spam.
  - Analyzing CT scans for tumors (yes/no).

Parametric and Non-Parametric Methods

Parametric: Assume a specific functional form for the relationship.
Non-Parametric: Do not assume a specific functional form for the relationship.

Supervised vs. Unsupervised Learning

Supervised: Each input observation has a corresponding output value (label).
Unsupervised: No predefined output values.

Theoretical Framework for Machine Learning

The process involves:

Project goal definition
Task definition
Data collection
Data exploration, cleaning, and preprocessing
Dimension reduction and feature engineering
Data splitting (in supervised learning)
Model selection
Implementation of ML techniques
Interpretation of results

Basic Algorithms

Common algorithms:

Decision Trees
Random Forests
Naive Bayes
Support Vector Machines (SVM)
Deep Learning: Neural Networks
Unsupervised Learning: Clustering Algorithms
Principal Component Analysis (PCA)

Basic Python Libraries

Key libraries:

NumPy
pandas
scikit-learn
Visualization: matplotlib, bokeh, seaborn, plotly
Neural networks: PyTorch, Keras, TensorFlow

Useful Definitions

A priori knowledge: Knowledge available before the data are observed.
Deterministic: No randomness assumed.
Stochastic: Randomness assumed.

When is Learning Possible?

Consider mapping binary inputs to a binary output.

Assumptions for Data

Independent drawings from a population
Future data from the same stochastic process

Symptoms: Underfitting, Overfitting, Correct Fit

How to Choose the Right Model?

Partition data into training, validation, and testing sets.

Undertraining (Underfitting)

Causes:
- Model simplicity
- Data scarcity
- Data noise
Symptoms: Poor performance on training and validation data.
Reducing undertraining:
- Increase model complexity
- Add more variables
- Reduce noise
- Train longer

Overtraining (Overfitting)

Causes:
- Model complexity
- Prolonged training
- Noisy data
Symptoms: Good performance on training data, poor on validation and test data.
Ways to reduce overtraining:
- Use cross-validation
- Choose simpler models
- Train with more data
- Use regularization
- Implement early stopping

Good/Sustainable Fit

Optimal balance, generalizing well to unseen data.

Cross-Validation/K-Fold Validation

Addresses issues with limited data.

Advantages:
- Efficient data utilization
- Reduced randomness impact
- Model stability assessment
- Support for model selection
Disadvantages:
- Increased computation
- Implementation complexity
- Unsuitability for unbalanced data

Logistic Regression

Algorithm for binary and multiclass classification:

f(x) = \frac{1}{1 + e^{-x}}

An instance is assigned to the negative class if f(x) < \alpha and to the positive class if f(x) \geq \alpha, where the threshold \alpha = 0.5.

K-Nearest Neighbors (K-NN)

Classification algorithm:

Calculate distances
Rank by distance
Assign category based on the most frequent category among the k nearest neighbors

Distance Metrics

Metrics include Euclidean, Manhattan, and Chebyshev distances.
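
The three metrics can be compared on a single pair of points. The function names and the sample points are illustrative assumptions.

```python
def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


p, q = (1.0, 2.0), (4.0, 6.0)  # coordinate differences are 3 and 4
d_euclid = euclidean(p, q)
d_manhattan = manhattan(p, q)
d_chebyshev = chebyshev(p, q)
print(d_euclid, d_manhattan, d_chebyshev)
```

For any pair of points the Chebyshev distance is at most the Euclidean distance, which in turn is at most the Manhattan distance.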

Scaling and Standardization

Techniques that bring variables with different ranges onto a comparable scale, including Min-Max scaling and Z-score normalization.

Conversion of Categorical Variables

Methods for converting categorical variables into numerical representations, such as one-hot encoding.
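
One-hot encoding can be sketched as below, assuming one 0/1 column per distinct category with columns in alphabetical order. The function name and the sample feature are hypothetical.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))  # fixed column order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature
categories, encoded = one_hot_encode(colors)
print(categories)
print(encoded)
```

Each row contains exactly one 1, so no artificial ordering is imposed on the categories, unlike coding them as 1, 2, 3.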

Selection of K

Balancing bias and variance.

The Curse of Dimensionality

Issues arising from high dimensionality.

Consequences of High Dimensionality

Reduced usefulness of distance metrics, necessitating dimension reduction.

Statistics and Probability

Statistics infers a model from observed data, while probability describes the data that a given model is expected to produce.

Independent and Dependent Events

Examples of independent and dependent events.

Conditional Probability

P(A|B) = \frac{P(A \cap B)}{P(B)}

Bayes' Theorem

P(A|B) = \frac{P(B|A)P(A)}{P(B)}
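
A short numeric sketch of the theorem, assuming P(B) is expanded with the law of total probability as P(B|A)P(A) + P(B|not A)P(not A). The screening-test numbers are hypothetical.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A|B) from P(A), P(B|A), and P(B|not A)."""
    # Total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A).
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical screening test: 1% prevalence, 90% sensitivity, 5% false positives.
posterior = bayes_posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(posterior)  # roughly 0.154, i.e. 2/13
```

Even with a fairly accurate test, the posterior stays low here because the condition is rare, so most positive results come from the much larger healthy group.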

Metrics for Assessing Prediction Quality - Regression

Various metrics, including mean absolute error, mean squared error, and root mean squared error.
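
A minimal sketch of mean absolute error (MAE) and root mean squared error (RMSE), assuming the usual definitions: the mean of the absolute errors, and the square root of the MSE. The sample values are hypothetical.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error; same unit as the target."""
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 20.0]  # absolute errors: 1, 0, 1, 4
mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae, rmse)
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, as with the last prediction here.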

Metrics for Assessing Prediction Quality - Categorization Mistakes (Confusion Matrix)

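For binary classification, the confusion matrix counts four outcomes:
TP (true positive): actual positive, predicted positive.
FN (false negative): actual positive, predicted negative.
FP (false positive): actual negative, predicted positive.
TN (true negative): actual negative, predicted negative.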

Sensitivity (recall): TPR = \frac{TP}{TP + FN}
Specificity (SPC): TNR = \frac{TN}{FP + TN}
Precision: PPV = \frac{TP}{TP + FP}
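
The three ratios can be computed from raw labels as sketched below. The function name, the choice of 1 as the positive class, and the sample labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a binary problem."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)

sensitivity = tp / (tp + fn)  # TPR
specificity = tn / (fp + tn)  # TNR
precision = tp / (tp + fp)    # PPV
print(tp, fn, fp, tn, sensitivity, specificity, precision)
```

The ratios are undefined when a denominator is zero, for example precision when the model never predicts the positive class; this sketch does not handle that case.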