Comprehensive Machine Learning Notes
Overview
This lecture provides an in-depth exploration of methods for assessing machine learning algorithms, particularly for regression and classification tasks. It explains the essential steps in model evaluation and preparation, including methods for handling data, measuring performance, and selecting algorithms. Key topics include: dividing datasets into training, validation, and testing sets; evaluating performance with metrics such as Mean Squared Error (MSE) for regression and confusion matrices for classification; applying logistic regression to binary and multiclass classification; scaling features to comparable ranges; and classifying with the k-nearest neighbors (KNN) algorithm.
Regression Problems and Performance Measures
In regression analysis, the main objective is to define a mathematical relationship between input variables (features) and continuous output variables (targets). The quality of this relationship is assessed with performance measures such as Mean Squared Error (MSE), which quantify how accurately the model predicts the target variable from the input features.
Mean Squared Error (MSE)
MSE calculates the average of the squared differences between actual and predicted values. This metric is especially sensitive to outliers due to the squared term, making it valuable for assessing model accuracy. The formula is:
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
where y_i represents the actual target values, \hat{y}_i represents the predicted values, and n is the number of data points in the dataset.
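As a minimal illustration, MSE can be computed directly with NumPy or with scikit-learn's mean_squared_error; the arrays below are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted target values
y_true = np.array([3.0, 2.5, 4.0, 5.1])
y_pred = np.array([2.8, 2.7, 3.6, 5.0])

# MSE as the average of squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)

# Equivalent computation via scikit-learn
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # both give the same value
```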
Splitting Data Sets
To properly train and evaluate machine learning models, data is divided into three distinct sets. This division ensures the model is trained on one part of the data, validated on another to fine-tune hyperparameters, and tested on a held-out set to evaluate its ability to generalize.
Training Set: Comprises 60-70% of the data and is used to train the model. It is essential for the model to learn patterns and relationships within the data.
Validation Set: Contains 10-20% of the data and is used for hyperparameter tuning and model selection. It helps prevent overfitting by providing an unbiased evaluation during training.
Testing Set: Includes 20-30% of the data and is used to assess the model's final performance. It provides an independent evaluation of how well the model generalizes to unseen data.
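A minimal sketch of a 60/20/20 split, obtained by applying scikit-learn's train_test_split twice; X and y here stand for a hypothetical feature matrix and target vector.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features
X = np.random.rand(100, 3)
y = np.random.rand(100)

# First split off the training set (60%), then split the remainder 50/50
# into validation (20%) and testing (20%).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The 60/20/20 proportions are just one common choice within the ranges listed above.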
Model Selection via MSE
Model selection involves choosing the best model by minimizing the validation MSE. For example, in polynomial regression, different polynomial degrees are evaluated to find the best fit. The choice of degree significantly affects the model's ability to capture underlying patterns without overfitting.
\hat{y}(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d
The polynomial degree that results in the lowest validation MSE is chosen as the optimal model complexity, balancing model fit and generalization.
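A sketch of how validation MSE can guide the choice of polynomial degree, using scikit-learn's PolynomialFeatures and LinearRegression; the data, degree range, and split are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical one-dimensional regression data with noise
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, (80, 1))
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(0, 1, 80)

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.25, random_state=0)

# Validation MSE for increasing polynomial degree
val_mse = {}
for degree in range(1, 8):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    val_mse[degree] = mean_squared_error(y_val, model.predict(x_val))

# The degree with the lowest validation MSE is selected
best_degree = min(val_mse, key=val_mse.get)
print(best_degree, val_mse[best_degree])
```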
Logistic Regression for Classification
Logistic regression estimates the probabilities of binary outcomes using the logistic (sigmoid) function. It is commonly used for binary classification because of its simplicity and interpretability.
\sigma(z) = \frac{1}{1 + e^{-z}}
where the linear combination z is defined as:
z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n
Classification decisions are made by comparing the sigmoid function’s output to a threshold, typically 0.5. If the output is greater than 0.5, the instance is classified as positive; otherwise, it is negative.
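A minimal sketch of binary classification with scikit-learn's LogisticRegression; predict applies the 0.5 threshold internally, while predict_proba exposes the sigmoid outputs. The data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary classification data: 2 features, labels 0/1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)

probs = clf.predict_proba(X[:5])[:, 1]    # sigmoid outputs P(y = 1 | x)
labels = (probs >= 0.5).astype(int)       # explicit 0.5 threshold
print(probs, labels, clf.predict(X[:5]))  # predict applies the same threshold
```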
Data Scaling Techniques
Data scaling is crucial to ensure all features contribute equally to model training and to prevent features with larger ranges from dominating the learning process. Two common methods are:
Min-Max Normalization: Scales data to fit within the range [0,1] using the formula:
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
Standardization (Z-score): Scales data to have a mean of 0 and a standard deviation of 1, calculated as:
x' = \frac{x - \mu}{\sigma}
where \mu is the mean and \sigma is the standard deviation.
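A sketch of both methods with scikit-learn: MinMaxScaler implements min-max normalization and StandardScaler implements z-score standardization. The feature matrix is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features with very different ranges
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max normalization: each column rescaled to [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each column shifted to mean 0 and scaled to std 1
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```

In practice the scaler is fit on the training set only and then applied to the validation and test sets, which avoids leaking information from held-out data.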
K-Nearest Neighbors (KNN) Algorithm
KNN classifies data points based on the majority class among its K nearest neighbors, measured by Euclidean distance. It is a non-parametric algorithm that makes predictions based on the similarity of data points.
distance(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
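A minimal sketch of KNN classification with scikit-learn's KNeighborsClassifier, which uses Euclidean distance by default; the points and labels are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2D points belonging to two classes
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y = np.array([0, 0, 0, 1, 1, 1])

# K = 3 nearest neighbors, Euclidean (Minkowski p=2) distance by default
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[0.15, 0.2], [1.0, 0.9]]))  # majority vote among the 3 neighbors
```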
Choosing Optimal K
Selecting the optimal K value is critical for KNN's performance. A small K can lead to overfitting, while a large K can result in underfitting.
Too small K: Results in overfitting, making the model sensitive to noise.
Too large K: Results in underfitting, causing the model to ignore local data structure.
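A sketch of one way to choose K: compare validation accuracy across candidate values and keep the best-performing one. The dataset, split, and range of K are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical classification data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Validation accuracy for a range of odd K values
scores = {}
for k in range(1, 26, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_val, y_val)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```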
Advantages and Disadvantages
Advantages:
Simple and intuitive to understand.
Requires no assumptions about data distribution.
Disadvantages:
Computationally expensive for large datasets.
Sensitive to irrelevant features and scaling differences.
Cross-Validation
K-fold cross-validation assesses model performance and generalization, providing a more robust estimate than a single train-test split.
Split the dataset into K equally sized subsets.
Train the model K times, each time using a different subset as the validation set and the remaining subsets as the training set.
Average performance metrics across all K folds for a robust estimate.
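A minimal sketch of 5-fold cross-validation with scikit-learn's cross_val_score, using the Iris dataset mentioned below as a convenient built-in example; the model choice is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, validate on the remaining fold,
# repeated 5 times; the per-fold scores are then averaged.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores, scores.mean())
```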
Next Steps
Review logistic regression principles and applications.
Investigate the KNN algorithm, focusing on K selection strategies.
Prepare the Iris dataset for K-fold cross-validation exercises.
Practice implementing min-max normalization and standardization in Python.
What is Machine Learning About?
The main challenge in machine learning is accurately learning the target function from limited and noisy data. Data scarcity and noise can significantly affect a model's ability to generalize.
Limited amount of data: Insufficient data can lead to poor generalization.
The presence of noise \epsilon: Noise can obscure true data relationships.
Noise can be inherent in data collection or the underlying phenomenon. The goal is to approximate the relationship:
Y = f(X_1, X_2, X_3, \dots, X_p) + \epsilon
where:
Y is the output variable.
(X_1, X_2, X_3, \dots, X_p) are input variables.
f is the function mapping inputs to the output.
\epsilon represents noise or random error.
What is the Function Needed for f?
Tasks typically include:
Prediction/Forecasting: Estimating the value of Y for new values of (X_1, X_2, X_3, \dots, X_p).
Inference: Understanding how Y depends on the input variables, focusing on interpreting f.
Examples
Prediction (Forecasting):
Predicting stroke or heart attack probability based on blood parameters, smoking status, weight, and blood pressure.
Assessing loan applicant creditworthiness based on banking and credit history.
Predicting alloy quality based on production parameters.
Other Classifications - Prediction vs. Classification
Prediction (Forecasting): Output variable is continuous (e.g., regression).
Example: Predicting house prices, temperature, or sales figures.
Classification Problems: Assigning output to discrete categories.
Examples:
Determining if an email is spam.
Analyzing CT scans for tumors (yes/no).
Parametric and Non-Parametric Methods
Parametric: Assume a specific functional form for the relationship.
Non-Parametric: Do not make assumptions about the relationship.
Supervised vs. Unsupervised Learning
Supervised: Each input dataset has a corresponding output value.
Unsupervised: No predefined output values.
Theoretical Framework for Machine Learning
The process involves:
Project goal definition
Task definition
Data collection
Data exploration, cleaning, and preprocessing
Dimension reduction and feature engineering
Data splitting (in supervised learning)
Model selection
Implementation of ML techniques
Interpretation of results
Basic Algorithms
Common algorithms:
Decision Trees
Random Forests
Naive Bayes
Support Vector Machines (SVM)
Deep Learning: Neural Networks
Unsupervised Learning: Clustering Algorithms
Principal Component Analysis (PCA)
Basic Python Libraries
Key libraries:
Numpy
Pandas
Scikit-learn
Visualization: matplotlib, bokeh, seaborn, plotly
Neural networks: pytorch, keras, tensorflow
Useful Definitions
A priori knowledge: Prior knowledge.
Deterministic: No randomness assumed.
Stochastic: Randomness assumed.
When is Learning Possible?
Consider mapping binary inputs to a binary output.
Assumptions for Data
Independent drawings from a population
Future data from the same stochastic process
Symptoms: Underfitting, Overfitting, Correct Fit
Underfitting: poor performance on both the training and validation data.
Overfitting: good performance on the training data but poor performance on validation and test data.
Correct fit: comparable, good performance on training data and unseen data.
How to Choose the Right Model?
Partition data into training, validation, and testing sets.
Underfitting (Undertraining)
Causes:
Model simplicity
Data scarcity
Data noise
Symptoms: Poor performance on training and validation data.
Ways to reduce underfitting:
Increase model complexity
Add more variables
Reduce noise
Train longer
Overfitting (Overtraining)
Causes:
Model complexity
Prolonged training
Noisy data
Symptoms: Good performance on training data, poor on validation and test data.
Ways to reduce overfitting:
Use cross-validation
Choose simpler models
Train with more data
Use regularization
Implement early stopping
Good/Sustainable Fit
Optimal balance, generalizing well to unseen data.
Cross-Validation/K-Fold Validation
Addresses issues with limited data.
Advantages:
Efficient data utilization
Reduced randomness impact
Model stability assessment
Support for model selection
Disadvantages:
Increased computation
Implementation complexity
Possible issues with imbalanced data unless stratified folds are used
Logistic Regression
Algorithm for binary and multiclass classification:
f(x) = \frac{1}{1 + e^{-x}}
Classification is based on whether f(x) \geq \alpha (positive class) or f(x) < \alpha (negative class), typically with \alpha = 0.5
K-Nearest Neighbors (K-NN)
Classification algorithm:
Calculate distances
Rank by distance
Assign category based on the most frequent category among the k nearest neighbors
Distance Metrics
Metrics include Euclidean, Manhattan, and Chebyshev distances.
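A sketch of the three distance metrics computed with NumPy for two hypothetical points.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(x - y))          # sum of absolute differences
chebyshev = np.max(np.abs(x - y))          # largest coordinate difference

print(euclidean, manhattan, chebyshev)
```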
Scaling and Standardization
Techniques for bringing variables onto comparable ranges, including Min-Max scaling and Z-score standardization.
Conversion of Categorical Variables
Methods for converting categorical variables into numerical representations, such as one-hot encoding.
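A minimal sketch of one-hot encoding with pandas.get_dummies; the column name and categories are hypothetical.

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)
```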
Selection of K
Balancing bias and variance: a small K gives low bias but high variance (overfitting), while a large K gives higher bias but lower variance (underfitting).
The Curse of Dimensionality
Issues arising from high dimensionality.
Consequences of High Dimensionality
Reduced usefulness of distance metrics, necessitating dimension reduction.
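A small numerical illustration of why distances lose contrast in high dimensions: for uniformly random points, the gap between the nearest and farthest neighbor shrinks relative to the distances themselves. The dimensions and sample size below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(500, dim))
    # Distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(dim, round(contrast, 3))  # relative contrast shrinks as dim grows
```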
Statistics and Probability
Probability describes how data arise from a given model, while statistics works in the reverse direction, inferring the model from observed data.
Independent and Dependent Events
Examples of independent and dependent events.
Conditional Probability
P(A|B) = \frac{P(A \cap B)}{P(B)}
Bayesian Theorem
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
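A short worked example with hypothetical numbers: suppose a condition has prior probability P(A) = 0.01, the probability of a positive test given the condition is P(B|A) = 0.9, and the overall probability of a positive test is P(B) = 0.05. Then:
P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} = \frac{0.9 \times 0.01}{0.05} = 0.18
so even a fairly accurate test yields a modest posterior probability when the condition is rare.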
Metrics for Assessing Prediction Quality - Regression
Various metrics, including mean absolute error, mean squared error, and root mean squared error.
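A sketch computing the three regression metrics with scikit-learn and NumPy; the arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 2.5, 4.0, 5.1])
y_pred = np.array([2.8, 2.7, 3.6, 5.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # root mean squared error

print(mae, mse, rmse)
```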
Metrics for Assessing Prediction Quality - Classification Errors (Confusion Matrix)
The confusion matrix tabulates predicted classes against actual classes:
Actual Positive, Predicted Positive: TP (true positive)
Actual Positive, Predicted Negative: FN (false negative)
Actual Negative, Predicted Positive: FP (false positive)
Actual Negative, Predicted Negative: TN (true negative)
Sensitivity (recall): TPR = \frac{TP}{TP + FN}
Specificity (SPC): TNR = \frac{TN}{FP + TN}
Precision: PPV = \frac{TP}{TP + FP}
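A minimal sketch computing the confusion matrix and the metrics above with scikit-learn; the label vectors are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical actual and predicted binary labels
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # recall / TPR
specificity = tn / (fp + tn)  # TNR
precision = tp / (tp + fp)    # PPV

print(sensitivity, specificity, precision)
print(recall_score(y_true, y_pred), precision_score(y_true, y_pred))
```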