This lecture provides an in-depth exploration of methods for assessing machine learning algorithms, particularly for regression and classification tasks. It explains the essential steps in model evaluation and preparation, including detailed methods for handling data, measuring performance, and selecting algorithms. Key topics covered are dividing datasets into training, validation, and testing sets, evaluating performance using metrics such as Mean Squared Error (MSE) for regression and confusion matrices for classification, applying logistic regression for both binary and multiclass classification, employing data scaling techniques to standardize feature ranges, and utilizing the k-nearest neighbors (KNN) algorithm for classification.
In regression analysis, the main objective is to define a mathematical relationship between input variables (features) and continuous output variables (targets). The quality of this relationship, meaning how accurately the model predicts the target from the input features, is assessed using performance metrics such as Mean Squared Error (MSE).
MSE calculates the average of the squared differences between actual and predicted values. Because each error is squared, large errors are penalized disproportionately, so the metric is especially sensitive to outliers. The formula is:
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
where y_i represents the actual target values, \hat{y}_i represents the predicted values, and n is the number of data points in the dataset.
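The MSE formula can be sketched in a few lines of NumPy. The helper name `mean_squared_error` and the sample values are illustrative assumptions, not taken from the lecture; the second call shows how one badly mispredicted point dominates the average.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Illustrative values only.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]
mse_clean = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4 + 0) / 4 = 1.25

# Same predictions, except the last one is off by 10.
y_pred_outlier = [2.0, 5.0, 4.0, 17.0]
mse_outlier = mean_squared_error(y_true, y_pred_outlier)  # (1 + 0 + 4 + 100) / 4 = 26.25

print(mse_clean, mse_outlier)
```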
To properly train and evaluate machine learning models, data is divided into three distinct sets. This division ensures the model is trained on one part of the data, validated on another to fine-tune hyperparameters, and tested on a held-out set to evaluate its ability to generalize.
Training Set: Comprises 60-70% of the data and is used to train the model. It is essential for the model to learn patterns and relationships within the data.
Validation Set: Contains 10-20% of the data and is used for hyperparameter tuning and model selection. It helps detect overfitting by evaluating the model on data it was not trained on.
Testing Set: Includes 20-30% of the data and is used to assess the model's final performance. It provides an independent evaluation of how well the model generalizes to unseen data.
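A three-way split can be sketched by shuffling the sample indices and slicing them. The function name `train_val_test_split`, the 60/20/20 proportions (one choice within the ranges above), and the fixed seed are assumptions made for illustration.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Return three disjoint index arrays that together cover range(n_samples)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # shuffle so the split is random
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]  # everything left over is training data
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index arrays can then be used to slice the feature matrix and the target vector consistently, for example `X[train_idx]` and `y[train_idx]`.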
Model selection involves choosing the best model by minimizing the validation MSE. For example, in polynomial regression, different polynomial degrees are evaluated to find the best fit. The choice of degree significantly affects the model's ability to capture underlying patterns without overfitting.
\hat{y}(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d
The polynomial degree that results in the lowest validation MSE is chosen as the optimal model complexity, balancing model fit and generalization.
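The degree selection described above can be sketched as a loop over candidate degrees. The synthetic quadratic data, the noise level, the 40/20 train/validation split, and the candidate range 1 to 6 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a quadratic relationship plus Gaussian noise (assumed for the demo).
x = np.linspace(-1.0, 1.0, 60)
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)

# Shuffle, then hold out 20 of the 60 points for validation.
idx = rng.permutation(x.size)
train, val = idx[:40], idx[40:]

val_mse = {}
for degree in range(1, 7):
    coeffs = np.polyfit(x[train], y[train], deg=degree)  # least-squares fit on training data
    pred = np.polyval(coeffs, x[val])                    # predictions on validation data
    val_mse[degree] = float(np.mean((y[val] - pred) ** 2))

# The degree with the lowest validation MSE is selected.
best_degree = min(val_mse, key=val_mse.get)
print(val_mse, best_degree)
```

A straight line cannot follow the curvature of this data, so degree 1 scores far worse than degree 2; which of the higher degrees wins depends on the noise in the particular sample.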
Logistic regression estimates the probabilities of binary outcomes using the logistic (sigmoid) function. It is commonly used for binary classification because of its simplicity and interpretability.
\sigma(z) = \frac{1}{1 + e^{-z}}
where the linear combination z is defined as:
z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n
Classification decisions are made by comparing the sigmoid function’s output to a threshold, typically 0.5. If the output is greater than 0.5, the instance is classified as positive; otherwise, it is negative.
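The prediction step of logistic regression can be sketched as follows. The weights and bias are hypothetical values picked for illustration (in practice they are learned from training data), and the function names are assumptions.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    """Probability of the positive class: sigmoid of the linear combination z."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def classify(weights, bias, x, threshold=0.5):
    """Return 1 (positive) if the probability exceeds the threshold, else 0."""
    return 1 if predict_proba(weights, bias, x) > threshold else 0

# Hypothetical, hand-picked parameters.
weights, bias = [2.0, -1.0], -0.5

print(classify(weights, bias, [1.0, 0.5]))  # z = 1.0, probability above 0.5
print(classify(weights, bias, [0.0, 1.0]))  # z = -1.5, probability below 0.5
```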
Data scaling is crucial to ensure all features contribute equally to model training and to prevent features with larger ranges from dominating the learning process. Two common methods are:
Min-Max Normalization: Scales data to fit within the range [0,1] using the formula:
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
Standardization (Z-score): Scales data to have a mean of 0 and a standard deviation of 1, calculated as:
x' = \frac{x - \mu}{\sigma}
where \mu is the mean and \sigma is the standard deviation.
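Both scaling methods can be sketched column by column with NumPy. The function names and the small feature matrix are assumptions for illustration; the standard deviation used here is the population version (`ddof=0`), and constant columns are left unscaled to avoid division by zero.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant column: avoid dividing by zero
    return (X - col_min) / col_range

def standardize(X):
    """Scale each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population standard deviation (ddof=0)
    sigma[sigma == 0] = 1.0
    return (X - mu) / sigma

# Two features with very different ranges (illustrative values).
X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_minmax = min_max_scale(X)
X_std = standardize(X)
print(X_minmax)
print(X_std)
```

When the data is split into training, validation, and testing sets, the minimum, maximum, mean, and standard deviation should be computed on the training set only and then reused for the other sets, so that no information leaks from the held-out data.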
KNN classifies data points based on the majority class among its K nearest neighbors, measured by Euclidean distance. It is a non-parametric algorithm that makes predictions based on the similarity of data points.
distance(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
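A minimal KNN classifier can be sketched in plain Python. The training points, the labels "a" and "b", and the function names are hypothetical; ties in the vote are resolved arbitrarily here (by the order in which labels are encountered), which a real implementation may want to handle explicitly.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair every training point's distance with its label, nearest first.
    distances = sorted(
        (euclidean(x, query), label) for x, label in zip(X_train, y_train)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two well-separated clusters (illustrative data).
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 5.5)]
y_train = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X_train, y_train, (1.8, 1.5), k=3))
print(knn_predict(X_train, y_train, (6.2, 6.1), k=3))
```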
Selecting K is critical for KNN's performance, because it controls the trade-off between overfitting and underfitting:
Too small K: Results in overfitting, making the model sensitive to noise.
Too large K: Results in underfitting, causing the model to ignore local data structure.
Advantages:
Simple and intuitive to understand.
Requires no assumptions about data distribution.
Disadvantages:
Computationally expensive for large datasets.
Sensitive to irrelevant features and scaling differences.
K-fold cross-validation assesses model performance and generalization, providing a more robust estimate than a single train-test split.
Split the dataset into K equally sized subsets.
Train the model K times, each time using a different subset as the validation set and the remaining subsets as the training set.
Average performance metrics across all K folds for a robust estimate.
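The three steps above can be sketched as follows. The generator `k_fold_indices`, the helper `cross_val_mse`, the use of a polynomial fit as the model, and the noise-free linear data are assumptions made to keep the example short.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is in exactly one validation fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)  # k nearly equal subsets
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cross_val_mse(x, y, degree, k=5):
    """Fit a polynomial on each training split and average the validation MSE."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(x), k):
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        pred = np.polyval(coeffs, x[val_idx])
        scores.append(float(np.mean((y[val_idx] - pred) ** 2)))
    return float(np.mean(scores)), scores

# Noise-free linear data, so a degree-1 fit should be essentially exact.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 1.0
mean_mse, fold_mse = cross_val_mse(x, y, degree=1, k=5)
print(mean_mse, fold_mse)
```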
Review logistic regression principles and applications.
Investigate the KNN algorithm, focusing on K selection strategies.
Prepare the Iris dataset for K-fold cross-validation exercises.
Practice implementing min-max normalization and standardization in Python.
The main challenge in machine learning is accurately learning the target function from limited and noisy data. Data scarcity and noise can significantly affect a model's ability to generalize.
Limited amount of data: Insufficient data can lead to poor generalization.
The presence of noise \epsilon: Noise can obscure true data relationships.
Noise can be inherent in data collection or the underlying phenomenon. The goal is to approximate the relationship:
Y = f(X_1, X_2, X_3, \dots, X_p) + \epsilon
where:
Y is the output variable.
(X_1, X_2, X_3, \dots, X_p) are input variables.
f is the function mapping inputs to the output.
\epsilon represents noise or random error.
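The role of the noise term can be illustrated with a small simulation. The function f, the two input variables, and the noise standard deviation of 0.5 are hypothetical choices; the point of the sketch is that even predictions made with the true f have an MSE close to the noise variance, so noise sets a floor on achievable error.

```python
import numpy as np

def f(x1, x2):
    # Hypothetical "true" relationship, chosen only for illustration.
    return 2.0 * x1 - 1.0 * x2

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.uniform(0.0, 1.0, size=n)
x2 = rng.uniform(0.0, 1.0, size=n)

noise_std = 0.5
epsilon = rng.normal(0.0, noise_std, size=n)  # random error term
y = f(x1, x2) + epsilon                       # observed output

# MSE of the true function against the noisy observations:
# approximately noise_std ** 2 = 0.25.
irreducible_mse = float(np.mean((y - f(x1, x2)) ** 2))
print(irreducible_mse)
```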
Tasks typically include:
Prediction/Forecasting: Estimating the value of Y for new values of (X_1, X_2, X_3, \dots, X_p).
Inference: Understanding how Y depends on the input variables, focusing on interpreting f.
Prediction (Forecasting):
Predicting stroke or heart attack probability based on blood parameters, smoking status, weight, and blood pressure.
Assessing loan applicant creditworthiness based on banking and credit history.
Predicting alloy quality based on production parameters.
Regression Problems: The output variable is continuous.
Example: Predicting house prices, temperature, or sales figures.
Classification Problems: Assigning output to discrete categories.
Examples:
Determining if an email is spam.
Analyzing CT scans for tumors (yes/no).
Parametric: Assume a specific functional form for the relationship.
Non-Parametric: Do not make assumptions about the relationship.
Supervised: Each input dataset has a corresponding output value.
Unsupervised: No predefined output values.
The process involves:
Project goal definition
Task definition
Data collection
Data exploration, cleaning, and preprocessing
Dimension reduction and feature engineering
Data splitting (in supervised learning)
Model selection
Implementation of ML techniques
Interpretation of results
Common algorithms:
Decision Trees
Random Forests
Naive Bayes
Support Vector Machines (SVM)
Deep Learning: Neural Networks
Unsupervised Learning: Clustering Algorithms
Principal Component Analysis (PCA)
Key libraries:
NumPy
pandas
scikit-learn
Visualization: Matplotlib, Bokeh, seaborn, Plotly
Neural networks: PyTorch, Keras, TensorFlow
A priori knowledge: Knowledge available before any data is observed.
Deterministic: No randomness assumed.
Stochastic: Randomness assumed.
Consider mapping binary inputs to a binary output.
Independent drawings from a population
Future data from the same stochastic process
Partition data into training, validation, and testing sets.
Causes:
Model simplicity
Data scarcity
Data noise
Symptoms: Poor performance on training and validation data.
Reducing undertraining:
Increase model complexity
Add more variables
Reduce noise
Train longer
Causes:
Model complexity
Prolonged training
Noisy data
Symptoms: Good performance on training data, poor on validation and test data.
Ways to reduce overtraining:
Use cross-validation
Choose simpler models
Train with more data
Use regularization
Implement early stopping
Optimal balance, generalizing well to unseen data.
Addresses issues with limited data.
Advantages:
Efficient data utilization
Reduced randomness impact
Model stability assessment
Support for model selection
Disadvantages:
Increased computation
Implementation complexity
Unsuitability for unbalanced data (unless the folds are stratified by class)
Algorithm for binary and multiclass classification:
f(x) = \frac{1}{1 + e^{-x}}
Classification based on f(x) < \alpha or f(x) \geq \alpha, where \alpha = 0.5
Classification algorithm:
Calculate distances
Rank by distance
Assign category based on the most frequent category among the k nearest neighbors
Metrics include Euclidean, Manhattan, and Chebyshev distances.
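The three distance metrics can be compared side by side on the same pair of points. The function names and the example points are assumptions for illustration.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))

# Coordinate differences are 3, 4, and 0.
a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
print(chebyshev(a, b))  # 4.0
```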
Techniques for bringing variable ranges onto a comparable scale, including Min-Max scaling and Z-score normalization.
Methods for converting categorical variables into numerical representations, such as one-hot encoding.
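One-hot encoding can be sketched in plain Python. The function name `one_hot_encode` and the color values are assumptions; categories are sorted alphabetically here so that the column order is deterministic.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row has a single 1 marking its category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows

categories, encoded = one_hot_encode(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```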
Balancing bias and variance.
Issues arising from high dimensionality.
Reduced usefulness of distance metrics, necessitating dimension reduction.
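The loss of contrast between distances in high dimensions can be illustrated with random points. The function `relative_contrast`, the uniform distribution on the unit hypercube, the number of points, and the chosen dimensions are assumptions for this sketch.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))  # uniform in the unit hypercube
    query = rng.random(dim)
    distances = np.linalg.norm(points - query, axis=1)
    return float((distances.max() - distances.min()) / distances.min())

contrast_by_dim = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, contrast in contrast_by_dim.items():
    print(dim, round(contrast, 3))
```

As the dimension grows, the nearest and farthest points end up at nearly the same distance from the query, so "nearest neighbor" carries less information. This is one reason dimension reduction is often applied before distance-based methods such as KNN.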
Statistics infers models from data, while probability describes the data that a given model is expected to produce.
Examples of independent and dependent events.
P(A|B) = \frac{P(A \cap B)}{P(B)}
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
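Bayes' theorem can be worked through numerically. The scenario (a diagnostic test) and all three input numbers are hypothetical; the denominator P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A|B) given P(A), P(B|A), and P(B|not A)."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: A = "has the condition", B = "test is positive".
prior = 0.01                # P(A)
likelihood = 0.90           # P(B|A)
false_positive_rate = 0.05  # P(B|not A)

posterior = bayes_posterior(prior, likelihood, false_positive_rate)
print(round(posterior, 4))  # 0.009 / 0.0585, roughly 0.1538
```

With these assumed numbers, a positive result raises the probability of the condition from 1% to about 15%, because the low prior keeps the posterior well below the test's 90% sensitivity.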
Various metrics, including mean absolute error, mean squared error, and root mean squared error.
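The three regression metrics can be computed side by side. The function name `regression_metrics` and the sample values are assumptions for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return mean absolute error, mean squared error, and root mean squared error."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)  # same units as the target variable
    return {"mae": mae, "mse": mse, "rmse": rmse}

# Errors are -1, 0, and 4 (illustrative values).
metrics = regression_metrics([10.0, 12.0, 15.0], [11.0, 12.0, 11.0])
print(metrics)
```

Here the single large error of 4 pulls RMSE (about 2.38) well above MAE (about 1.67), which reflects the heavier penalty that squaring places on large errors.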
Sensitivity (recall): TPR = \frac{TP}{TP + FN}
Specificity (SPC): TNR = \frac{TN}{FP + TN}
Precision: PPV = \frac{TP}{TP + FP}
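The three rates can be computed from the counts of a binary confusion matrix. The function names and the label vectors are assumptions for illustration; the sketch does not guard against empty denominators (for example, when no instance is predicted positive).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, true negatives, and false negatives."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def classification_rates(tp, fp, tn, fn):
    """Sensitivity (TPR), specificity (TNR), and precision (PPV)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "precision": tp / (tp + fp),
    }

# Illustrative labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
rates = classification_rates(tp, fp, tn, fn)
print((tp, fp, tn, fn), rates)
```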