Machine Learning - Classifiers

Neural Networks

  • Will be revisited later in the course.
  • TensorFlow Playground: A tool by TensorFlow developers to experiment with neural networks.
    • Explore networks with varying layers and neurons for classification tasks.

Practical Coursework

  • This week's practical focuses on classifiers.
  • Includes deliberately broken code, reflecting the kinds of errors caused by Python and library updates.
  • Task: Identify and correct the code to achieve the desired functionality.

General Classifier Concepts

  • Applies to various classifiers (Perceptron, Logistic Regression) and models.
  • Covers comparing and fitting different classifiers.

Fitting Models to Data

  • Goal: Find the optimal parameter set for accurate predictions.
  • Parameter Space: The range of all possible parameter values.

Methods for Finding Optimal Parameters

  1. Random Sampling
  • Trying random sets of parameters until a good fit is achieved.
  • Not efficient, but technically viable.
  2. Grid Search
  • Systematic approach: testing predefined values for each parameter.
  • Commonly used for hyperparameter tuning (parameters set in advance, not learned by the model).
  • More methodical than random search (a sketch of both follows this list).
  3. Iterative Improvement
  • Keeping track of tried parameters and their performance.
  • Making incremental changes based on previous results.
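
As a sketch of the first two methods, here is a minimal pure-Python version; the two-parameter error function and the search ranges are invented for illustration.

```python
import itertools
import random

def error(a, b):
    # Hypothetical error surface: the best fit sits at a = 2, b = -1.
    return (a - 2) ** 2 + (b + 1) ** 2

# Random sampling: try random parameter sets, keep the best one found.
candidates = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(1000)]
best_random = min(candidates, key=lambda p: error(*p))

# Grid search: systematically test predefined values for each parameter.
grid = [x * 0.5 for x in range(-10, 11)]   # -5.0, -4.5, ..., 5.0
best_grid = min(itertools.product(grid, grid), key=lambda p: error(*p))

print(best_random, best_grid)
```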

Analogy: Perfecting a Chocolate Cake Recipe

  • Start with a base recipe or random proportions.
  • Adjust ingredients iteratively, monitoring the outcome.
  • Reduce adjustments as the results improve.

Definition of Learning

  • Based on the definition given by Herbert Simon, who won both a Nobel Prize and the Turing Award.
  • Learning: Adaptive changes in a system that improve its performance on a task.
  • After learning, the system should perform the same task, or tasks drawn from the same population, more effectively.

Designing a Parameter Fitting Algorithm

  • Objective: Maximize goodness or minimize error iteratively.
  • Error: Mismatch between model predictions and data.
  • The algorithm should be iterative, with each step building upon the previous one.

Steps:

  1. Start at a random point p_1 in parameter space.
  2. Calculate the error at that point p_1.
  3. Move to a nearby point p_2.
  4. Calculate the error at point p_2.

Stopping Criteria:

  1. Minimal Improvement: Stop when changes yield insignificant improvements.
  2. Iteration Limit: Set a maximum number of iterations (e.g., 10,000).
  3. Error Threshold: Stop when the error falls below a predefined threshold.
  • These choices are arbitrary and require creative input.
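
Putting the four steps and the stopping criteria together, a minimal sketch; the one-parameter error function, step size, and thresholds are all illustrative assumptions.

```python
import random

def error(p):
    # Hypothetical error surface with its minimum at p = 3.
    return (p - 3) ** 2

p = random.uniform(-10, 10)            # step 1: random point p_1 in parameter space
err = error(p)                         # step 2: error at p_1
for i in range(10_000):                # criterion 2: iteration limit
    q = p + random.uniform(-0.1, 0.1)  # step 3: move to a nearby point p_2
    new_err = error(q)                 # step 4: error at p_2
    if new_err < err:
        improvement = err - new_err
        p, err = q, new_err
        if improvement < 1e-9:         # criterion 1: minimal improvement
            break
    if err < 1e-6:                     # criterion 3: error below threshold
        break
print(p, err)
```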

Deterministic vs. Stochastic Methods

  • Deterministic: Calculating each move, always choosing the best option.
  • Stochastic (with randomness): Introducing random steps to avoid local minima.

Analogy: Mountain Climber

  • Deterministic: A rational climber calculating the optimal path down the mountain.
  • Stochastic: A slightly inebriated climber taking random deviations.

Stochastic Methods and Temperature

  • Temperature: A parameter controlling the amount of randomness.
  • Low temperature: Minimal randomness, favoring calculated steps.
  • High temperature: High randomness, leading to potentially nonsensical outcomes.
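
One simple way to realize this, sketched below: make the random step size proportional to a temperature that is cooled over time. All names and numbers are illustrative, and classic simulated annealing would also accept some worse moves, which is omitted here.

```python
import random

def error(p):
    return (p - 3) ** 2  # hypothetical error surface, minimum at p = 3

p = random.uniform(-10, 10)
temperature = 1.0
for i in range(1000):
    # The random component of the step is scaled by the temperature.
    candidate = p + random.gauss(0.0, temperature)
    if error(candidate) < error(p):
        p = candidate
    temperature *= 0.995  # cool down: favor small, calculated steps as we converge
print(p)
```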

Calculating Error

  • Depends on the machine learning task (regression, classification).

Regression Error Measures

  1. L1 Norm
  • Sum of absolute differences between predicted and actual values.
  2. L2 Norm
  • Square root of the sum of squared differences (closely related to Root Mean Squared Error).
  • More sensitive to large errors.

L1 = \sum_i |y_i - \hat{y}_i|

L2 = \sqrt{\sum_i (y_i - \hat{y}_i)^2}

  • The L2 norm penalizes big mistakes more heavily; which norm to use can often be specified as a parameter in fitting algorithms.
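
For concreteness, here is how the two norms might be computed with NumPy; the toy arrays are made up for illustration.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])        # actual values
y_hat = np.array([1.1, 1.9, 3.5, 3.0])    # model predictions

l1 = np.sum(np.abs(y - y_hat))            # L1: sum of absolute differences
l2 = np.sqrt(np.sum((y - y_hat) ** 2))    # L2: root of summed squared differences
print(l1, l2)  # L2 weights the large miss at the last point more heavily
```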

  • Cost Function, Loss Function, and Error are interchangeable terms.

  • Many cost functions can be interpreted as a distance between data and predictions.

  • For model fitting, we aim to minimize the L2 norm using training data.

  • Different parameter values (slope, intercept) correspond to different linear regressions.

Improving the Algorithm

  • Add a fifth step that stops the algorithm once a stopping criterion is met.
  • Improve step three: rather than picking a random nearby point, calculate which nearby point reduces the error and move there.
  • The ideal algorithm mixes a little of both the stochastic and deterministic approaches.

Calculus and Gradient Descent

  • With only one parameter, find the point where the derivative of the loss is zero.
  • In more dimensions, use partial derivatives: the gradient is the vector (∂Loss/∂a, ∂Loss/∂b), and we go down this gradient until we find a point where every component is zero.
  • Assess the function's gradient in the neighborhood to determine the direction to descend.
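
As a worked example, for the linear regression from earlier (prediction \hat{y}_i = a x_i + b) with the squared-error loss, the partial derivatives are standard calculus, not taken from the lecture:

L(a, b) = \sum_i (y_i - (a x_i + b))^2

\frac{\partial L}{\partial a} = -2 \sum_i x_i (y_i - (a x_i + b))

\frac{\partial L}{\partial b} = -2 \sum_i (y_i - (a x_i + b))

Setting both components to zero gives the usual closed-form least-squares solution; gradient descent instead walks downhill toward that point.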

Stochastic Gradient Descent

  • Mixes deterministic calculation with randomness.
  • Take a step in the direction of the steepest gradient, with a little bit of noise.
  • Useful for generic machine learning models.
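
A minimal sketch of this for the slope/intercept model, using the gradient worked out above; the data, learning rate, and noise scale are invented. Note this follows the lecture's "gradient plus a little noise" description; textbook SGD gets its randomness from sampling mini-batches of data instead.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 50)   # toy data: true slope 2, intercept 1

a, b = rng.uniform(-1, 1, 2)       # start at a random point in parameter space
lr, noise = 0.01, 0.005
for step in range(2000):
    resid = y - (a * x + b)
    grad_a = -2 * np.mean(x * resid)   # dL/da (mean keeps the step size tame)
    grad_b = -2 * np.mean(resid)       # dL/db
    # Deterministic step down the steepest gradient, plus a little randomness.
    a -= lr * grad_a + rng.normal(0, noise)
    b -= lr * grad_b + rng.normal(0, noise)
print(a, b)   # should end up near the true slope and intercept
```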

Comparing Classifiers

  • Reminder: for the classifiers practical (the workshop), the provided code is deliberately wrong.
  • Common language: Confusion Matrix
  • Matrix showing the counts of true positives, true negatives, false positives, and false negatives.
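
For example, scikit-learn can compute one directly; the labels below are made up.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual classes (1 = positive)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model's predictions

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```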

Accuracy

  • Simplest metric: percentage of correct answers.

Accuracy = \frac{True Positives + True Negatives}{Total Answers}

  • Classification error: 1 - Accuracy

Precision and Recall

  • Precision: Given that the model predicts positive, what is the probability it is correct?
  • Recall: Given that the example is actually positive, what is the probability the model captures it?

Precision = \frac{True Positives}{True Positives + False Positives}

Recall = \frac{True Positives}{True Positives + False Negatives}

  • F1 Score: A way of combining precision and recall (their harmonic mean).

F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}

  • Accuracy, precision, recall, and F1 are often reported together; F1 is just the combination of precision and recall.
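
Continuing the toy confusion-matrix example above, scikit-learn provides each of these directly:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy: ", accuracy_score(y_true, y_pred))    # (TP + TN) / total = 6/8
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP) = 3/4
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN) = 3/4
print("f1:       ", f1_score(y_true, y_pred))          # harmonic mean = 3/4
```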

Multi-Class Classification

  • Confusion matrix with more classes.
  • Precision and recall can be measured for each class.
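
For the multi-class case, scikit-learn's classification_report prints precision, recall, and F1 per class; the three-class animal labels are invented for illustration.

```python
from sklearn.metrics import classification_report

y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat", "dog"]
y_pred = ["cat", "dog", "cat",  "cat", "bird", "bird", "cat", "dog"]

# Per-class precision, recall, and F1, plus overall accuracy.
print(classification_report(y_true, y_pred))
```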

Improving Multi-Class Classification

  1. Try a different classifier.
  2. Feed the model different data or features.
  3. Bias the model by changing the cost function (e.g., penalizing some misclassifications more heavily).
  4. Get more data, targeted strategically at the classes the model confuses.

Other Measures

  • In the medical context, sensitivity and specificity are used instead.
  • Sensitivity is the same as recall; specificity is its counterpart for the negative (healthy) class.

Sensitivity

  • The proportion of actually sick people that are classified as sick.

Specificity

  • The proportion of actually healthy people that are classified as healthy.
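
In formula form (standard definitions, written in the same style as the precision and recall formulas above):

Sensitivity = \frac{True Positives}{True Positives + False Negatives}

Specificity = \frac{True Negatives}{True Negatives + False Positives}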

ROC Curve

  • A model has a default middle decision threshold (e.g., 0.5), but in real scenarios it is up to the user to decide which kind of error is worse.
  • The ROC curve shows how the classification changes as that threshold moves, in terms of true positives, false positives, etc.
  • Vary the threshold for classification and plot true positive rate vs. false positive rate.
  • Ideal curve: High true positive rate and low false positive rate.
  • Area Under the Curve (AUC): Measures the quality of the classifier across all possible thresholds.
    • An AUC of 0.78 is better than an AUC of 0.77654.
  • Example: healthy and sick patients may have very different white blood cell counts; with such clear separation, the ROC curve is close to ideal.
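
A sketch of the whole idea with scikit-learn, using synthetic scores that mimic the clearly separated white-blood-cell example (all numbers invented):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Synthetic "white blood cell count" scores: sick patients tend to score higher.
healthy = rng.normal(5.0, 1.0, 100)
sick = rng.normal(8.0, 1.0, 100)
y_true = np.concatenate([np.zeros(100), np.ones(100)])
scores = np.concatenate([healthy, sick])

fpr, tpr, thresholds = roc_curve(y_true, scores)   # one point per threshold
print("AUC:", roc_auc_score(y_true, scores))       # near 1.0: clear separation

plt.plot(fpr, tpr)                                 # the ROC curve itself
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```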