Lecture Date: April 23, 2025
The lecture covered methods for evaluating machine learning algorithms, focusing on regression and classification.
Key topics:
Splitting data sets
Performance evaluation with Mean Squared Error (MSE)
Logistic Regression
Data scaling techniques
K-Nearest Neighbors (KNN) Algorithm
In regression analysis, the goal is to find a mathematical relationship between input variables (features) and continuous output variables (targets).
One common metric is Mean Squared Error (MSE).
Definition: MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$y_i$: Actual target values
$\hat{y}_i$: Predicted values
$n$: Number of data points
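As a minimal numerical check of the definition above (the arrays y_true and y_pred are illustrative assumptions), MSE can be computed directly with NumPy:

```python
import numpy as np

# Illustrative actual and predicted target values (hypothetical)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE = (1/n) * sum((y_i - y_hat_i)^2)
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```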
Data is typically split into three sets:
Training Set: 60-70%
Validation Set: 10-20%
Testing Set: 20-30%
The optimal model is chosen based on minimizing the validation MSE.
Example: Polynomial Regression
Evaluate different polynomial degrees:
\hat{y}(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d
Select the polynomial degree with the lowest validation MSE.
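A minimal sketch of this selection procedure, assuming synthetic 1-D data and a simple train/validation split (the data-generating function and all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data (assumed for illustration)
x = np.linspace(-3, 3, 200)
y = 0.5 * x**3 - x + rng.normal(scale=2.0, size=x.size)

# Random train/validation split (70% / 30%)
idx = rng.permutation(x.size)
train, val = idx[:140], idx[140:]

best_degree, best_mse = None, np.inf
for d in range(1, 10):
    coeffs = np.polyfit(x[train], y[train], deg=d)     # fit degree-d polynomial
    val_mse = np.mean((y[val] - np.polyval(coeffs, x[val])) ** 2)
    if val_mse < best_mse:
        best_degree, best_mse = d, val_mse

print(best_degree, best_mse)   # degree with the lowest validation MSE
```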
Logistic regression estimates probabilities using the logistic (sigmoid) function:
\sigma(z) = \frac{1}{1 + e^{-z}}
Where the linear combination z is:
z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n
Classification is made by applying a threshold, typically 0.5.
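A minimal sketch of the sigmoid and the 0.5 threshold; the weights w0 and w below are hypothetical, not learned from data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights: intercept w0 plus one weight per feature
w0, w = -1.0, np.array([0.8, -0.3])

x = np.array([2.0, 1.0])       # one sample with two features
z = w0 + w @ x                 # linear combination z
p = sigmoid(z)                 # estimated probability
label = int(p >= 0.5)          # apply the 0.5 threshold
print(p, label)
```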
Two primary methods for scaling features:
Min-Max Normalization, which scales data into the range [0, 1]:
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
Standardization, which scales data to mean 0 and variance 1:
x' = \frac{x - \mu}{\sigma}
$\mu$: Mean
$\sigma$: Standard Deviation
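Both scalings can be sketched with scikit-learn's preprocessing module (the small matrix X is an illustrative assumption):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features with very different ranges (hypothetical)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # each column scaled to [0, 1]
X_std = StandardScaler().fit_transform(X)    # each column: mean 0, variance 1
print(X_minmax)
print(X_std)
```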
KNN classifies data based on the majority class of its K closest points, measured by Euclidean distance:
distance(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
Too small K: Sensitive to noise (overfitting).
Too large K: Underfitting and ignores local data structure.
Advantages:
Simple and intuitive
No assumptions about data distribution
Disadvantages:
Computationally expensive for large datasets
Sensitive to irrelevant features and scaling
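A minimal KNN sketch on the Iris dataset mentioned for the next session, with scaling applied first because of the sensitivity noted above (K=5 and the 30% test split are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale first, then classify by majority vote of the 5 nearest neighbors
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on the held-out test set
```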
Split data into K subsets.
Train model K times, each time using a different subset as validation.
Average the performance across all folds.
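A minimal sketch of these three steps using scikit-learn's cross_val_score (logistic regression on Iris is an assumed example, not prescribed by the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# K=5: the model is trained 5 times, each fold serving once as validation
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # per-fold scores and their average
```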
Review Logistic Regression
Explore K-Nearest Neighbors Algorithm, optimal K selection
Prepare for the Iris dataset and K-fold cross-validation
Practice Min-Max Normalization and Standardization
Main goal: Understand the relationship between input variables (X1, X2, X3, ..., Xp) and the output variable (Y).
Equation: Y = f(X_1, X_2, X_3, \ldots, X_p) + \epsilon
Challenges in learning the target function:
Limited amount of data
Presence of noise (\epsilon), which is a random phenomenon. The function f itself may have an inherently random nature.
Two types of tasks:
Prediction/Forecasting:
Estimating the value of the output variable Y when the input variables take on new values without detailed knowledge of the function f.
Treating f as a "black box."
Inference:
Understanding how the output variable depends on the input variables.
Prediction (Forecasting):
Estimating the probability of stroke/heart attack based on blood parameters, smoking habits, weight, and blood pressure.
Assessing the creditworthiness of a loan applicant based on banking and credit history.
Predicting the quality of an alloy based on production parameters.
Inference:
Understanding how sales growth from TV, print, or radio advertising campaigns can be attributed to different types of campaigns to optimize the marketing mix.
Predicting housing prices:
Understanding how different factors contribute to the final price (e.g., construction material, scenic view).
Prediction vs. Classification:
Prediction (Forecasting):
Problems where the output variable is continuous (e.g., regression).
Example: Predicting house prices, temperature, or sales figures based on input variables.
Classification Problems:
Problems where the goal is to assign the output to discrete categories.
Examples:
Determining whether an email is spam based on its content.
Analyzing a CT scan to detect whether a tumor is present (yes/no classification).
Parametric Methods:
Assume a specific form for the function f (e.g., linear relationship).
Non-Parametric Methods:
The algorithm can learn any shape of the function f, but requires a large amount of input data to achieve reliable results.
Supervised Learning:
Each input dataset has a corresponding output value.
Examples:
X-ray images labeled with whether a tumor is present or not.
Production data with specified alloy quality for each dataset.
Historical banking data of customers who defaulted on their loans.
Unsupervised Learning:
There is no output variable.
Example:
Determining whether customers form distinct groups based on behavioral patterns using customer data (e.g., from loyalty cards) to enable targeted advertising.
Project goal definition
Definition of the task at hand
Data collection for analysis
Exploration, data cleaning, and preprocessing
Dimension reduction and input data prep. (feature engineering)
Data splitting (in supervised learning)
Choice of machine learning techniques
Implementation of the selected ML techniques
Implementation of the most effective technique (if required)
Interpretation of results
Decision Trees
Random Forests
Naive Bayes
Support Vector Machine (SVM)
Deep Learning: Neural Networks
Unsupervised Learning: Clustering Algorithms
Principal Component Analysis (PCA)
Numpy
Pandas
Scikit-learn (preprocessing)
Visualizations: matplotlib, bokeh, seaborn, plotly
For neural networks: pytorch, keras, tensorflow, scikit-learn (preprocessing)
A priori knowledge: what you know about the system before any data has been collected and analyzed
Deterministic: assumes that there is no randomness in the system, so the same input will always return exactly the same output
Stochastic: assumes that there is some randomness in the system that we cannot account for (known as noise), which means that the same input will give a slightly different result each time
Example: For 3 binary inputs and 1 binary output, there are 2^(2^3) = 256 possible functions mapping the 3 inputs to the output.
Given the first 5 observations, there are 2^3 = 8 candidate functions consistent with them (one choice of output for each of the 3 unseen input combinations).
Question: Which of these eight candidate functions will give the correct prediction for the three samples that have not yet been seen?
Without additional knowledge, there is no basis for choosing one hypothesis over another.
In our initial example, with eight different candidate functions, we made one basic mistake by starting with a deterministic approach.
To achieve anything useful in machine learning, we have to take a probabilistic approach and assume that:
Training data come from independent drawings of a certain general population.
The data that we will see in the future is drawn according to the same stochastic process as the data that we have seen in the past.
Assumption of Independent and Identically Distributed Samples:
A team creates an ML algorithm to identify lung cancer based on computed tomography scans.
Training set: patients aged 20-40.
Applying the algorithm to patients of all ages violates the assumption.
Assumption of Stationarity:
A team creates an ML algorithm to detect product defects using data from a sensor on the production line.
Changing the sensor to one from another manufacturer after the algorithm has been created violates the assumption.
Premise of A Priori Knowledge:
A marketing director wants to use machine learning to learn how the first letter of a customer's name affects which day of the week they are most likely to buy a product; this violates the assumption.
Premise of Independence:
A student wants to write an algorithm that predicts how likely a person is to speak Spanish. He asks 60 families to complete a questionnaire, treating each person as one sample. The input variables are zip code, age, and gender, and the output variable is whether the person speaks Spanish or not; this violates the assumption, because members of the same family are not independent samples.
Generalizable Model Example:
A company develops an algorithm based on photos to identify the brand of cars driving on a given section of the road.
The same camera photographs the cars as they drive past for both the training and test datasets, and photos taken at any time of day appear in both sets.
In machine learning, we depart from assuming a model describing the relationship between input and output data and embrace more flexibility.
The most common approach is to divide the data into a learning set and a validation set (e.g., an 80%+20% split) or for learning, validation, and testing (e.g., 40% + 30% + 30%).
Underfitting (high bias):
Symptoms: High training error; training error close to test error
Remedies: Complexify model, add more features, train longer
Good fit:
Symptoms: Training error slightly lower than test error
Overfitting (high variance):
Symptoms: Very low training error; training error much lower than test error
Remedies: Perform regularization, get more data
Causes:
Model is too simple
Data is noisy or invalid
Symptoms: Poor model performance on both the training and validation data
Reducing undertraining (underfitting): increase model complexity, add more input variables, train longer
Causes:
Model is too complex, or the data is very noisy
Symptoms: Good model performance on the training data, but not on the validation data; the model doesn't generalize well
Reducing overtraining (overfitting): cross-validation (K-fold validation), using less complex models, using enough data, and use of regularization
A well-fitted model is neither underfitted nor overfitted and generalizes well on unseen data
To model and design a power plant, engineers used the well-known Gutenberg-Richter law to predict the probability of a very strong earthquake from the frequency of very weak earthquakes.
The relationship between the magnitude of an earthquake and the logarithm of its probability of occurrence is linear.
The actual data show a break at a magnitude of about 7.3, which is definitely not linear.
The overfitted model predicted one earthquake with a magnitude of at least 9 approximately every 13,000 years, while the correct model predicted one earthquake with a magnitude of at least 9 approximately every 300 years.
The Fukushima nuclear power plant was built to withstand only an 8.6-magnitude quake, so it was destroyed by the approximately 20.5-times-more-powerful quake of magnitude about 9.
Unbalanced data can lead to poor model performance on the less frequent classes
Approaches used to deal with this problem:
Oversampling
Undersampling
Combination of oversampling and undersampling
Modification of the assessment metric
Cost-sensitive learning
Use of transfer learning techniques
Use of models that are insensitive to class imbalance
In practice, data are often not available in sufficient quantities
Procedure:
we divide randomly arranged data into k subsets
We use one of the subsets to determine the metric, and the training is carried out using the remaining k-1 subsets
Repeat the operation for all k subsets
The final metric is the average of all k metrics
Better use of data: In contrast to the training/test set split method, cross-validation allows all available data to be used for training and model testing, which is particularly beneficial for a limited amount of data.
Reducing the impact of randomness: By repeatedly dividing the data into different combinations of training and testing sets, cross-validation reduces the effect of randomness on the result, leading to more stable and reliable assessments of model performance.
Assessment of model stability: Thanks to k-times cross-validation, it is possible to assess how stable and resistant the model is by observing differences in results on individual subsets.
Supports the selection of the model: Cross-validation allows the selection of model hyperparameters, which allows you to achieve better performance.
Increased demand for computing resources: Since cross-validation involves training and testing the model k times, this can lead to a significant increase in calculation time and resource utilization, especially for big data and complex models.
Complexity of implementation: Although most machine learning libraries, such as scikit-learn, offer built-in cross-validation features, the implementation of this method may be more complex than simple data-sharing methods, such as splitting into training/test set.
May be unsuitable for unbalanced data: For unbalanced datasets where class proportions are different, standard cross-validation can lead to poor model evaluation. In such cases, it is better to use stratified k-fold cross-validation, which maintains the proportions of classes in each subset.
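A sketch of stratified k-fold cross-validation with scikit-learn's StratifiedKFold (the Iris data and logistic regression model are assumed for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# StratifiedKFold keeps the class proportions in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())
```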
Used to solve binary and multiclass classification problems
Based on the logistic function \sigma(x) = \frac{1}{1 + e^{-x}}
Classification is based on the rule: \sigma(x) < \alpha gives Class I, \sigma(x) \geq \alpha gives Class II, where e.g. \alpha = 0.5
The algorithm is used for classification
rank/sort with increasing distance
the category of the new product is the category occurring most frequently among the first k products in the training set
q=2: Euclidean distance
dist((x_{i1}, x_{i2}, \ldots, x_{ip}), (x_1, x_2, \ldots, x_p)) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_j)^2}
q=1: Manhattan distance
dist((x_{i1}, x_{i2}, \ldots, x_{ip}), (x_1, x_2, \ldots, x_p)) = \sum_{j=1}^{p} |x_{ij} - x_j|
Chebyshev distance:
\lim_{q \rightarrow \infty} dist((x_{i1}, \ldots, x_{ip}), (x_1, \ldots, x_p)) = \lim_{q \rightarrow \infty} \left( \sum_{j=1}^{p} |x_{ij} - x_j|^q \right)^{1/q} = \max_j |x_{ij} - x_j|
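The three metrics above can be checked with a small NumPy sketch (the two sample vectors are hypothetical):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # q = 2
manhattan = np.sum(np.abs(a - b))           # q = 1
chebyshev = np.max(np.abs(a - b))           # q -> infinity
print(euclidean, manhattan, chebyshev)
```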
If one variable has a much larger range than the others, these metrics will be dominated by that variable.
In order to approximate the range of variables, we can use scaling.
The simplest scaling is subtracting the minimum value of a given variable and dividing by its range.
X_{new} = \frac{X_i - X_{min}}{X_{max} - X_{min}}
Standardization scales to zero mean and unit variance:
X_{new} = \frac{X_i - \bar{X}}{\sigma_X}
Categorical variables are handled either by assigning a numeric value to each category or by creating a separate feature for each distinct value.
Case of 3 categories (hot, medium-hot, cold), encoded with 3 variables: Variable 1: hot 1, other cases 0
Variable 2: medium-hot 1, other cases 0
Variable 3: cold 1, other cases 0
so the possible encodings are 100, 010, 001. This is called one-hot encoding
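A minimal one-hot encoding sketch with pandas (the column name "temperature" and its values are assumptions matching the example above):

```python
import pandas as pd

# Hypothetical categorical column with the three values from the example
df = pd.DataFrame({"temperature": ["hot", "medium-hot", "cold", "hot"]})

# One binary column per category: hot / medium-hot / cold
encoded = pd.get_dummies(df, columns=["temperature"])
print(encoded)
```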
compromise between bias and variance (the bias-variance trade-off)
The bias refers to errors resulting from simplifications in the model that lead to incorrect generalization of the training data.
In high-dimensional space, even points that are nearest neighbors are far apart.
Samples tend to accumulate in the corners of the space, even if they are evenly distributed
We have 5,000 samples with p features, each in the range [0, 1].
First, we have to find the smallest hypercube with a volume equal to 1/1000 of the volume of the entire hypercube.
The volume of a hypercube with side c in p dimensions is c^p, i.e., c^p = 0.001. Hence, c = 0.001^{1/p}.
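A quick numerical check of c = 0.001^{1/p} for a few dimensions p (values chosen only for illustration) shows how the required side length approaches 1:

```python
# Side of the smallest hypercube holding 0.1% of the unit hypercube's volume
for p in [1, 2, 10, 100]:
    c = 0.001 ** (1 / p)
    print(p, round(c, 3))   # 0.001, 0.032, 0.501, 0.933
```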
The volume of the sphere in n dimensions is
V = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} R^n
Let's see what part of the volume of the whole sphere fits in a shell of thickness \epsilon, when n goes to infinity.
\lim_{n \rightarrow \infty} \frac{R^n - (R-\epsilon)^n}{R^n} = \lim_{n \rightarrow \infty} \left( 1 - \left(1 - \frac{\epsilon}{R}\right)^n \right) = 1
(the common factor \frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)} cancels in the ratio)
Result: if the training data are multidimensional and each input variable is evenly distributed, samples accumulate in a thin shell away from the center
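This can be verified numerically: the fraction of a ball's volume lying in an outer shell of thickness \epsilon tends to 1 as the dimension grows (R and \epsilon below are arbitrary illustrative values):

```python
# Fraction of an n-ball's volume in an outer shell of thickness eps
R, eps = 1.0, 0.05
for n in [2, 10, 50, 200]:
    fraction = 1 - ((R - eps) / R) ** n
    print(n, round(fraction, 4))   # grows toward 1 with the dimension n
```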
When the number of variables is large, the distance metric becomes less useful unless we have an exponential increase in the number of samples.
Practical proposal: it is worth reducing the number of input variables and pre-processing of data is very important.
In addressing issues related to finding clusters, we want to verify whether our dataset comprises classes that can be distinguished in some way.
Decision trees are intuitive and flexible models used for classification and regression.
They represent decisions, chance outcomes, costs, and utilities in a structured form.
Decision trees are interpretable models, as they make it easy to understand the reasoning behind a decision.
There are three popular choices used to split categorical input variables:
You can create a child node for every category of that input
You can introduce only two child nodes
You can create binary splits for any subset of categories.
You typically use binary splits for numerical input variables. The splitting possibilities that are considered are the midpoints between consecutive values of the numeric input variable that can be found in the training data.
Entropy is a measure of impurity or uncertainty in a dataset. In the context of classification trees, it quantifies how mixed the classes are in a node.
Definition of Entropy
For a dataset N with m classes, the entropy is defined as:
H(N) = -\sum_{i=1}^{m} p_i \log_2(p_i) = \sum_{i=1}^{m} p_i \log_2(1/p_i)
$p_i$: proportion of instances in class $C_i$
Suppose you split a dataset N into child nodes N1, N2, ..., Nn. Each child node has its own entropy H(Ni) and a weight w_i equal to its share of the samples of N.
Information Gain is the reduction in entropy achieved by splitting a node: IG = H(N) - \sum_{i=1}^{n} w_i H(N_i)
It tells us how much uncertainty is removed by the split. A higher IG means the split leads to purer child nodes, which is desirable.
The Gini index (or Gini impurity) is a measure of how often a randomly chosen element from a dataset would be incorrectly classified if it were randomly labelled according to the distribution of labels in the subset.
For a dataset N with m classes, the Gini index is calculated as:
Gini(N) = 1 - \sum_{i=1}^{m} p_i^2
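A short sketch computing entropy, information gain, and the Gini index for a hypothetical split (the class proportions and node weights are invented for illustration):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Hypothetical parent node: 5 samples of class A, 5 of class B
parent = [0.5, 0.5]

# Hypothetical split: left child (5 A, 1 B), right child (0 A, 4 B)
left, right = [5/6, 1/6], [0.0, 1.0]
weights = [0.6, 0.4]                   # children's share of the samples

info_gain = entropy(parent) - (weights[0] * entropy(left) + weights[1] * entropy(right))
print(entropy(parent), info_gain, gini(parent))   # 1.0, ~0.61, 0.5
```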
The Challenge: Real-world data often has hundreds or thousands of features (dimensions). Images: 1024 × 1024 = 1,048,576 pixel values.
Why Do We Need It? Visualization, Computational efficiency, Remove Noise and Avoid Curse of Dimensionality.
Unsupervised learning technique for exploratory data analysis.
Dimensionality reduction by finding directions (components) of highest variance.
Helps to simplify data interpretation in high-dimensional spaces.
Components = directions of maximum variance. Principal components form a new subspace close to the original data cloud.
Mean centering: shift the mean of each feature to zero. Variable scaling: standardize each feature to unit variance.
Z_i = \alpha_1 X_i + \beta_1 Y_i
W_i = \alpha_2 X_i + \beta_2 Y_i
Use a Scree Plot to observe the variance explained by each component.
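A minimal PCA sketch with scikit-learn, standardizing first and then reading the explained variance ratios for a scree plot (the Iris data is an assumed example):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Mean centering and unit-variance scaling, then PCA on all components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Scree plot: variance explained by each principal component
components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.bar(components, pca.explained_variance_ratio_)
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.show()
```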
t-SNE: Converts high-dimensional distances between points into probabilities.
Similar points → high probability of being neighbors.
Embeds points in low dimensions to preserve local structure.
UMAP: A nonlinear dimensionality reduction method
Developed by McInnes, Healy, and Melville.
Statistics: We set out from the data and adjust the model explaining the data.
Theory of Probability: Understanding the data that a model can generate, or its structure. Example: a coin toss. We can analyze a symmetrical coin model, where the probability of getting heads or tails is equal (1/2 each), or an asymmetrical coin.
Conditional probability: P(A|B) = \frac{P(A \cap B)}{P(B)}
Bayes' theorem: P(A|B) = \frac{P(B|A)P(A)}{P(B)}
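A tiny worked example of Bayes' theorem on a hypothetical diagnostic test (all probabilities below are invented for illustration):

```python
# P(A): prior probability of the condition
# P(B|A): test sensitivity, P(B|not A): false-positive rate
p_a, p_b_given_a, p_b_given_not_a = 0.01, 0.95, 0.05

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # total probability of a positive test
p_a_given_b = p_b_given_a * p_a / p_b                    # posterior P(A|B)
print(round(p_a_given_b, 3))   # ~0.161
```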
Confusion matrix:
Predicted positive class (+), actual positive: True positive (TP)
Predicted positive class (+), actual negative: False positive, error of the first kind (FP)
Predicted negative class (-), actual positive: False negative, error of the second kind (FN)
Predicted negative class (-), actual negative: True negative (TN)
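The four cells of the confusion matrix can be obtained with scikit-learn (the label vectors are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# sklearn orders the cells as TN, FP, FN, TP when raveled
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)   # 3 1 1 3
```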