Lecture Date: April 23, 2025
The lecture covered methods for evaluating machine learning algorithms, focusing on regression and classification.
Key topics:
Splitting data sets
Performance evaluation with Mean Squared Error (MSE)
Logistic Regression
Data scaling techniques
K-Nearest Neighbors (KNN) Algorithm
In regression analysis, the goal is to find a mathematical relationship between input variables (features) and continuous output variables (targets).
One common metric is Mean Squared Error (MSE).
Definition: MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$y_i$: Actual target values
$\hat{y}_i$: Predicted values
$n$: Number of data points
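As a minimal numerical check of the definition above (the arrays y_true and y_pred are illustrative assumptions), MSE can be computed directly with NumPy:

```python
import numpy as np

# Illustrative actual and predicted target values (hypothetical)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE = (1/n) * sum((y_i - y_hat_i)^2)
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```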
Data is typically split into three sets:
Training Set: 60-70%
Validation Set: 10-20%
Testing Set: 20-30%
The optimal model is chosen based on minimizing the validation MSE.
Example: Polynomial Regression
Evaluate different polynomial degrees:
\hat{y}(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d
Select the polynomial degree with the lowest validation MSE.
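A minimal sketch of this selection procedure, assuming synthetic 1-D data and a simple train/validation split (the data-generating function and all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D regression data (assumed for illustration)
x = np.linspace(-3, 3, 200)
y = 0.5 * x**3 - x + rng.normal(scale=2.0, size=x.size)

# Random train/validation split (70% / 30%)
idx = rng.permutation(x.size)
train, val = idx[:140], idx[140:]

best_degree, best_mse = None, np.inf
for d in range(1, 10):
    coeffs = np.polyfit(x[train], y[train], deg=d)     # fit degree-d polynomial
    val_mse = np.mean((y[val] - np.polyval(coeffs, x[val])) ** 2)
    if val_mse < best_mse:
        best_degree, best_mse = d, val_mse

print(best_degree, best_mse)   # degree with the lowest validation MSE
```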
Logistic regression estimates probabilities using the logistic (sigmoid) function:
\sigma(z) = \frac{1}{1 + e^{-z}}
Where the linear combination z is:
z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n
Classification is made by applying a threshold, typically 0.5.
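A minimal sketch of the sigmoid and the 0.5 threshold; the weights w0 and w below are hypothetical, not learned from data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights: intercept w0 plus one weight per feature
w0, w = -1.0, np.array([0.8, -0.3])

x = np.array([2.0, 1.0])       # one sample with two features
z = w0 + w @ x                 # linear combination z
p = sigmoid(z)                 # estimated probability
label = int(p >= 0.5)          # apply the 0.5 threshold
print(p, label)
```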
Two primary methods for scaling features:
Min-Max Normalization, which scales data into the range [0, 1]:
x' = \frac{x - \min(x)}{\max(x) - \min(x)}
Standardization, which scales data to mean 0 and variance 1:
x' = \frac{x - \mu}{\sigma}
$\mu$: Mean
$\sigma$: Standard Deviation
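Both scalings can be sketched with scikit-learn's preprocessing module (the small matrix X is an illustrative assumption):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features with very different ranges (hypothetical)
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # each column scaled to [0, 1]
X_std = StandardScaler().fit_transform(X)    # each column: mean 0, variance 1
print(X_minmax)
print(X_std)
```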
KNN classifies data based on the majority class of its K closest points, measured by Euclidean distance:
distance(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
Too small K: Sensitive to noise (overfitting).
Too large K: Underfitting and ignores local data structure.
Advantages:
Simple and intuitive
No assumptions about data distribution
Disadvantages:
Computationally expensive for large datasets
Sensitive to irrelevant features and scaling
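A minimal KNN sketch on the Iris dataset mentioned for the next session, with scaling applied first because of the sensitivity noted above (K=5 and the 30% test split are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale first, then classify by majority vote of the 5 nearest neighbors
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on the held-out test set
```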
Split data into K subsets.
Train model K times, each time using a different subset as validation.
Average the performance across all folds.
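A minimal sketch of these three steps using scikit-learn's cross_val_score (logistic regression on Iris is an assumed example, not prescribed by the lecture):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# K=5: the model is trained 5 times, each fold serving once as validation
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())   # per-fold scores and their average
```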
Review Logistic Regression
Explore K-Nearest Neighbors Algorithm, optimal K selection
Prepare for the Iris dataset and K-fold cross-validation
Practice Min-Max Normalization and Standardization
Main goal: Understand the relationship between input variables (X1, X2, X3, ..., Xp) and the output variable (Y).
Equation: Y = f(X_1, X_2, X_3, \ldots, X_p) + \epsilon
Challenges in learning the target function:
Limited amount of data
Presence of noise (\epsilon), which is a random phenomenon. The function f itself may have an inherently random nature.
Two types of tasks:
Prediction/Forecasting:
Estimating the value of the output variable Y when the input variables take on new values without detailed knowledge of the function f.
Treating f as a "black box."
Inference:
Understanding how the output variable depends on the input variables.
Prediction (Forecasting):
Estimating the probability of stroke/heart attack based on blood parameters, smoking habits, weight, and blood pressure.
Assessing the creditworthiness of a loan applicant based on banking and credit history.
Predicting the quality of an alloy based on production parameters.
Inference:
Understanding how sales growth from TV, print, or radio advertising campaigns can be attributed to different types of campaigns to optimize the marketing mix.
Predicting housing prices:
Understanding how different factors contribute to the final price (e.g., construction material, scenic view).
Prediction vs. Classification:
Prediction (Forecasting):
Problems where the output variable is continuous (e.g., regression).
Example: Predicting house prices, temperature, or sales figures based on input variables.
Classification Problems:
Problems where the goal is to assign the output to discrete categories.
Examples:
Determining whether an email is spam based on its content.
Analyzing a CT scan to detect whether a tumor is present (yes/no classification).
Parametric Methods:
Assume a specific form for the function f (e.g., linear relationship).
Non-Parametric Methods:
The algorithm can learn any shape of the function f, but requires a large amount of input data to achieve reliable results.
Supervised Learning:
Each input dataset has a corresponding output value.
Examples:
X-ray images labeled with whether a tumor is present or not.
Production data with specified alloy quality for each dataset.
Historical banking data of customers who defaulted on their loans.
Unsupervised Learning:
There is no output variable.
Example:
Determining whether customers form distinct groups based on behavioral patterns using customer data (e.g., from loyalty cards) to enable targeted advertising.
Project goal definition
Definition of the task at hand
Data collection for analysis
Exploration, data cleaning, and preprocessing
Dimension reduction and input data prep. (feature engineering)
Data splitting (in supervised learning)
Choice of machine learning techniques
Implementation of the selected ML techniques
Implementation of the most effective technique (if required)
Interpretation of results
Decision Trees
Random Forests
Naive Bayes
Support Vector Machine (SVM)
Deep Learning: Neural Networks
Unsupervised Learning: Clustering Algorithms
Principal Component Analysis (PCA)
Numpy
Pandas
Scikit-learn (preprocessing)
Visualizations: matplotlib, bokeh, seaborn, plotly
For neural networks: pytorch, keras, tensorflow, scikit-learn (preprocessing)
A priori knowledge: what you know about the system before any data has been collected and analyzed
Deterministic: assumes that there is no randomness in the system, so the same input will always return exactly the same output
Stochastic: assumes that there is some randomness in the system that we cannot account for (known as noise), which means that the same input will give a slightly different result each time
Example: For 3 binary inputs and 1 binary output, there are 2^(2^3) = 256 possible functions mapping the 3 inputs to the output.
Given the first 5 observations, there are 2^3 = 8 candidate functions consistent with them (one choice of output for each of the 3 unseen input combinations).
Question: Which of these eight candidate functions will give the correct prediction for the three samples that have not yet been seen?
Without additional knowledge, there is no basis for choosing one hypothesis over another.
In our initial example, with eight different candidate functions, we made one basic mistake by starting with a deterministic approach.
To achieve anything useful in machine learning, we have to take a probabilistic approach and assume that:
Training data come from independent drawings of a certain general population.
The data that we will see in the future is drawn according to the same stochastic process as the data that we have seen in the past.
Assumption of Independent and Identically Distributed Samples:
A team creates an ML algorithm to identify lung cancer based on computed tomography scans.
Training set: patients aged 20-40.
Applying the algorithm to patients of all ages violates the assumption.
Assumption of Stationarity:
A team creates an ML algorithm to detect product defects using data from a sensor on the production line.
Changing the sensor to one from another manufacturer after the algorithm has been created violates the assumption.
Premise of A Priori Knowledge:
A marketing director wants to use machine learning to learn how the first letter of a customer's name affects which day of the week they are most likely to buy a product; this violates the assumption.
Premise of Independence:
A student wants to write an algorithm that predicts how likely a person is to speak Spanish. He asks 60 families to complete a questionnaire, treating each person as one sample. The input variables are zip code, age, and gender, and the output variable is whether the person speaks Spanish or not; this violates the assumption, because members of the same family are not independent samples.
Generalizable Model Example:
A company develops an algorithm based on photos to identify the brand of cars driving on a given section of the road.
The same camera photographs the cars as they drive past for both the training and test datasets, and photos taken at any time of day appear in both sets.
In machine learning, we depart from assuming a model describing the relationship between input and output data and embrace more flexibility.
The most common approach is to divide the data into a learning set and a validation set (e.g., an 80%+20% split) or for learning, validation, and testing (e.g., 40% + 30% + 30%).
Underfitting (high bias):
Symptoms: High training error; training error close to test error
Remedies: Complexify model, add more features, train longer
Good fit:
Symptoms: Training error slightly lower than test error
Overfitting (high variance):
Symptoms: Very low training error; training error much lower than test error
Remedies: Perform regularization, get more data
Causes:
Model is too simple
Data is noisy or invalid
Symptoms: Poor model performance on both the training and validation data
Reducing undertraining (underfitting): increase model complexity, add more input variables, train longer
Causes:
Model is too complex, or the data is very noisy
Symptoms: Good model performance on the training data, but not on the validation data; the model doesn't generalize well
Reducing overtraining (overfitting): cross-validation (K-fold validation), using less complex models, using enough data, and use of regularization
A well-fitted model is neither underfitted nor overfitted and generalizes well on unseen data
To model and design a power plant, engineers used the well-known Gutenberg-Richter law to predict the probability of a very strong earthquake from the frequency of very weak earthquakes.
The relationship between the magnitude of an earthquake and the logarithm of its probability of occurrence is linear.
The actual data show a break at a magnitude of about 7.3, which is definitely not linear.
The overfitted model predicted one earthquake with a magnitude of at least 9 approximately every 13,000 years, while the correct model predicted one earthquake with a magnitude of at least 9 approximately every 300 years.
The Fukushima nuclear power plant was built to withstand only an 8.6-magnitude quake, so it was destroyed by the approximately 20.5-times-more-powerful quake of magnitude about 9.
Unbalanced data can lead to poor model performance on the less frequent classes
Approaches used to deal with this problem:
Oversampling
Undersampling
Combination of oversampling and undersampling
Modification of the assessment metric
Cost-sensitive learning
Use of transfer learning techniques
Use of models that are insensitive to class imbalance
In practice, data are often not available in sufficient quantities
Procedure:
we divide randomly arranged data into k subsets
We use one of the subsets to determine the metric, and the training is carried out using the remaining k-1 subsets
Repeat the operation for all k subsets
The final metric is the average of all k metrics
Better use of data: In contrast to the training/test set split method, cross-validation allows all available data to be used for training and model testing, which is particularly beneficial for a limited amount of data.
Reducing the impact of randomness: By repeatedly dividing the data into different combinations of training and testing sets, cross-validation reduces the effect of randomness on the result, leading to more stable and reliable assessments of model performance.
Assessment of model stability: Thanks to k-times cross-validation, it is possible to assess how stable and resistant the model is by observing differences in results on individual subsets.
Supports the selection of the model: Cross-validation allows the selection of model hyperparameters, which allows you to achieve better performance.
Increased demand for computing resources: Since cross-validation involves training and testing the model k times, this can lead to a significant increase in calculation time and resource utilization, especially for big data and complex models.
Complexity of implementation: Although most machine learning libraries, such as scikit-learn, offer built-in cross-validation features, the implementation of this method may be more complex than simple data-sharing methods, such as splitting into training/test set.
May be unsuitable for unbalanced data: For unbalanced datasets where class proportions are different, standard cross-validation can lead to poor model evaluation. In such cases, it is better to use stratified k-fold cross-validation, which maintains the proportions of classes in each subset.
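A sketch of stratified k-fold cross-validation with scikit-learn's StratifiedKFold (the Iris data and logistic regression model are assumed for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# StratifiedKFold keeps the class proportions in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean(), scores.std())
```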
Used to solve binary and multiclass classification problems
Based on the logistic function \sigma(x) = \frac{1}{1 + e^{-x}}
Classification is based on the rule: \sigma(x) < \alpha gives Class I, \sigma(x) \geq \alpha gives Class II, where e.g. \alpha = 0.5
The algorithm is used for classification
rank/sort with increasing distance
the category of the new product is the category occurring most frequently among the first k products in the training set
q=2: Euclidean distance
dist((x_{i1}, x_{i2}, \ldots, x_{ip}), (x_1, x_2, \ldots, x_p)) = \sqrt{\sum_{j=1}^{p} (x_{ij} - x_j)^2}
q=1: Manhattan distance
dist((x_{i1}, x_{i2}, \ldots, x_{ip}), (x_1, x_2, \ldots, x_p)) = \sum_{j=1}^{p} |x_{ij} - x_j|
Chebyshev distance:
\lim_{q \rightarrow \infty} dist((x_{i1}, \ldots, x_{ip}), (x_1, \ldots, x_p)) = \lim_{q \rightarrow \infty} \left( \sum_{j=1}^{p} |x_{ij} - x_j|^q \right)^{1/q} = \max_j |x_{ij} - x_j|
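The three metrics above can be checked with a small NumPy sketch (the two sample vectors are hypothetical):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # q = 2
manhattan = np.sum(np.abs(a - b))           # q = 1
chebyshev = np.max(np.abs(a - b))           # q -> infinity
print(euclidean, manhattan, chebyshev)
```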
If one variable has a much larger range than the others, these metrics will be dominated by that variable.
In order to approximate the range of variables, we can use scaling.
The simplest scaling is subtracting the minimum value of a given variable and dividing by its range.
X_{new} = \frac{X_i - X_{min}}{X_{max} - X_{min}}
Standardization scales to zero mean and unit variance:
X_{new} = \frac{X_i - \bar{X}}{\sigma_X}
Categorical variables are handled either by assigning a numeric value to each category or by creating a separate feature for each distinct value.
Case of 3 categories (hot, medium-hot, cold), encoded with 3 variables: Variable 1: hot 1, other cases 0
Variable 2: medium-hot 1, other cases 0
Variable 3: cold 1, other cases 0
so the possible encodings are 100, 010, 001. This is called one-hot encoding
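A minimal one-hot encoding sketch with pandas (the column name "temperature" and its values are assumptions matching the example above):

```python
import pandas as pd

# Hypothetical categorical column with the three values from the example
df = pd.DataFrame({"temperature": ["hot", "medium-hot", "cold", "hot"]})

# One binary column per category: hot / medium-hot / cold
encoded = pd.get_dummies(df, columns=["temperature"])
print(encoded)
```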
compromise between bias and variance (the bias-variance trade-off)
The bias refers to errors resulting from simplifications in the model that lead to incorrect generalization of the training data.
In high-dimensional space, even points that are nearest neighbors are far apart.
Samples tend to accumulate in the corners of the space, even if they are evenly distributed
We have 5,000 samples with p features, each in the range [0, 1].
First, we have to find the smallest hypercube with a volume equal to 1/1000 of the volume of the entire hypercube.
The volume of a hypercube with side c in p dimensions is c^p, i.e., c^p = 0.001. Hence, c = 0.001^{1/p}.
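A quick numerical check of c = 0.001^{1/p} for a few dimensions p (values chosen only for illustration) shows how the required side length approaches 1:

```python
# Side of the smallest hypercube holding 0.1% of the unit hypercube's volume
for p in [1, 2, 10, 100]:
    c = 0.001 ** (1 / p)
    print(p, round(c, 3))   # 0.001, 0.032, 0.501, 0.933
```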
The volume of the sphere in n dimensions is
V = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} R^n
Let's see what part of the volume of the whole sphere fits in a shell of thickness \epsilon, when n goes to infinity.
\lim_{n \rightarrow \infty} \frac{R^n - (R-\epsilon)^n}{R^n} = \lim_{n \rightarrow \infty} \left( 1 - \left(1 - \frac{\epsilon}{R}\right)^n \right) = 1
(the common factor \frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)} cancels in the ratio)
Result: if the training data are multidimensional and each input variable is evenly distributed, samples accumulate in a thin shell away from the center
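This can be verified numerically: the fraction of a ball's volume lying in an outer shell of thickness \epsilon tends to 1 as the dimension grows (R and \epsilon below are arbitrary illustrative values):

```python
# Fraction of an n-ball's volume in an outer shell of thickness eps
R, eps = 1.0, 0.05
for n in [2, 10, 50, 200]:
    fraction = 1 - ((R - eps) / R) ** n
    print(n, round(fraction, 4))   # grows toward 1 with the dimension n
```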
When the number of variables is large, the distance metric becomes less useful unless we have an exponential increase in the number of samples.
Practical proposal: it is worth reducing the number of input variables and pre-processing of data is very important.
In addressing issues related to finding clusters, we want to verify whether our dataset comprises classes that can be distinguished in some way.
Decision trees are intuitive and flexible models used for classification and regression.
They represent decisions, chance outcomes, costs, and utilities in a structured form.
Decision trees are interpretable models, as they make it easy to understand the reasoning behind a decision.
There are three popular choices used to split categorical input variables:
You can create a child node for every category of that input
You can introduce only two child nodes
You can create binary splits for any subset of categories.
You typically use binary splits for numerical input variables. The splitting possibilities that are considered are the midpoints between consecutive values of the numeric input variable that can be found in the training data.
Entropy is a measure of impurity or uncertainty in a dataset. In the context of classification trees, it quantifies how mixed the classes are in a node.
Definition of Entropy
For a dataset N with m classes, the entropy is defined as:
H(N) = -\sum_{i=1}^{m} p_i \log_2(p_i) = \sum_{i=1}^{m} p_i \log_2(1/p_i)
$p_i$: proportion of instances in class $C_i$
Suppose you split a dataset N into child nodes N1, N2, ..., Nn. Each child node has its own entropy H(Ni) and a weight w_i equal to its share of the samples of N.
Information Gain is the reduction in entropy achieved by splitting a node: IG = H(N) - \sum_{i=1}^{n} w_i H(N_i)
It tells us how much uncertainty is removed by the split. A higher IG means the split leads to purer child nodes, which is desirable.
The Gini index (or Gini impurity) is a measure of how often a randomly chosen element from a dataset would be incorrectly classified if it were randomly labelled according to the distribution of labels in the subset.
For a dataset N with m classes, the Gini index is calculated as:
Gini(N) = 1 - \sum_{i=1}^{m} p_i^2
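A short sketch computing entropy, information gain, and the Gini index for a hypothetical split (the class proportions and node weights are invented for illustration):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Hypothetical parent node: 5 samples of class A, 5 of class B
parent = [0.5, 0.5]

# Hypothetical split: left child (5 A, 1 B), right child (0 A, 4 B)
left, right = [5/6, 1/6], [0.0, 1.0]
weights = [0.6, 0.4]                   # children's share of the samples

info_gain = entropy(parent) - (weights[0] * entropy(left) + weights[1] * entropy(right))
print(entropy(parent), info_gain, gini(parent))   # 1.0, ~0.61, 0.5
```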
The Challenge: Real-world data often has hundreds or thousands of features (dimensions). Images: 1024 × 1024 = 1,048,576 pixel values.
Why Do We Need It? Visualization, Computational efficiency, Remove Noise and Avoid Curse of Dimensionality.
Unsupervised learning technique for exploratory data analysis.
Dimensionality reduction by finding directions (components) of highest variance.
Helps to simplify data interpretation in high-dimensional spaces.
Components = directions of maximum variance. Principal components form a new subspace close to the original data cloud.
Mean centering: shift the mean of each feature to zero. Variable scaling: standardize each feature to unit variance.
Z_i = \alpha_1 X_i + \beta_1 Y_i
W_i = \alpha_2 X_i + \beta_2 Y_i
Use a Scree Plot to observe the variance explained by each component.
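A minimal PCA sketch with scikit-learn, standardizing first and then reading the explained variance ratios for a scree plot (the Iris data is an assumed example):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Mean centering and unit-variance scaling, then PCA on all components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Scree plot: variance explained by each principal component
components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.bar(components, pca.explained_variance_ratio_)
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.show()
```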
t-SNE: Converts high-dimensional distances between points into probabilities.
Similar points → high probability of being neighbors.
Embeds points in low dimensions to preserve local structure.
UMAP: A nonlinear dimensionality reduction method
Developed by McInnes, Healy, and Melville.
Statistics: We set out from the data and adjust the model explaining the data.
Theory of Probability: Understanding the data that a model can generate, or its structure. Example: a coin toss. We can analyze a symmetrical coin model, where the probability of getting heads or tails is equal (1/2 each), or an asymmetrical coin.
Conditional probability: P(A|B) = \frac{P(A \cap B)}{P(B)}
Bayes' theorem: P(A|B) = \frac{P(B|A)P(A)}{P(B)}
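A tiny worked example of Bayes' theorem on a hypothetical diagnostic test (all probabilities below are invented for illustration):

```python
# P(A): prior probability of the condition
# P(B|A): test sensitivity, P(B|not A): false-positive rate
p_a, p_b_given_a, p_b_given_not_a = 0.01, 0.95, 0.05

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)   # total probability of a positive test
p_a_given_b = p_b_given_a * p_a / p_b                    # posterior P(A|B)
print(round(p_a_given_b, 3))   # ~0.161
```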
Confusion matrix:
Predicted positive class (+), actual positive: True positive (TP)
Predicted positive class (+), actual negative: False positive, error of the first kind (FP)
Predicted negative class (-), actual positive: False negative, error of the second kind (FN)
Predicted negative class (-), actual negative: True negative (TN)
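The four cells of the confusion matrix can be obtained with scikit-learn (the label vectors are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# sklearn orders the cells as TN, FP, FN, TP when raveled
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)   # 3 1 1 3
```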