Lecture 1 ML

Types of Machine Learning Algorithms

Understanding the types of machine learning algorithms is crucial for solving problems effectively.
Different algorithms are designed for specific types of problems.
Knowing the available algorithms helps in identifying potential solutions and framing problems.

Data and Terminology

Data: The data used is a critical factor in determining the type of problem that can be solved.
Model:
- The algorithm or system that makes decisions.
- Can refer to the algorithm itself (e.g., decision trees, support vector machines).
- Also refers to the trained model, ready for deployment.
Training: The process of fitting the model to the desired solutions by feeding it data and adjusting its parameters.
Prediction/Inference: The output of the model when given new, unseen data.

High-Level Classification: Supervised vs. Unsupervised Learning

Machine learning algorithms can be categorized based on whether they are trained with human supervision.
This depends on whether the dataset includes information about the desired output (labels).
The key aspect is the label, which is an annotation about what each data point represents (e.g., animal pictures with labels indicating the animal type).
Labeled data typically leads to supervised learning, while unlabeled data leads to unsupervised learning.

Types of Machine Learning

Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning

Supervised Learning

The training set contains information about the desired outcomes.
For each input, the desired output is known.

Iris Dataset Example

A popular dataset for classifying different types of iris flowers.
Experts collected measurements (sepal length, sepal width, petal length, petal width) for different iris species.
The dataset includes the species type as a label for each sample.
Data is typically presented in a tabular format (e.g., CSV).
After training, the model can predict the species of a new flower based on its measurements.
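
The idea of predicting a species from measurements can be sketched with a 1-nearest-neighbor classifier. This is only a minimal illustration: the six rows are a small hand-typed sample in the style of the Iris dataset, and the names `training_set` and `predict_species` are hypothetical.

```python
import math

# Illustrative rows in the style of the Iris dataset:
# (sepal length, sepal width, petal length, petal width) in cm, plus the species label.
training_set = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]

def predict_species(measurements):
    """Return the label of the closest training sample (1-nearest neighbor)."""
    nearest = min(training_set, key=lambda row: math.dist(row[0], measurements))
    return nearest[1]

# A new flower the model has not seen before.
new_flower = (5.0, 3.4, 1.5, 0.2)
print(predict_species(new_flower))
```

The labels in `training_set` are what make this supervised: without them, the function would have nothing to return.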

OECD Data Example

A dataset with GDP per capita and life satisfaction scores for different countries.
The goal is to find a link between GDP and life satisfaction.
The model should learn the function that maps GDP to life satisfaction.
The output is a numerical value that may not exist in the original dataset but fits the relationship between input and output.

Subcategories of Supervised Learning

Classification: The labels represent categories or classes (not numerical).
- Examples: Iris dataset, spam filter, image classification.
- Binary Classification: Two possible labels (e.g., spam/not spam).
- Multi-Class Classification: Multiple possible labels (e.g., different types of flowers, hundreds of activities classified based on sound analysis).
Regression: The model generates a numerical output.

Regression

The model identifies a mathematical function: given an input, it calculates an output value.
The output is a number in a continuous numerical space.
- Example: Estimating house prices based on location, number of bedrooms, etc.
- Several input values can combine to produce a single numerical output (e.g., a house price).
Common in finance for estimating future values.
In linear regression, the model learns the line that minimizes the error between the predicted and actual values.
Formula: $f(x) = wx + b$
- Where $w$ is the weight (slope) and $b$ is the bias (intercept).
Example: GDP per capita vs. life satisfaction.
- The model learns the line that best fits the data.
- The red line marks a test example that was not used to train the model.
If the test example lies close to the learned line, that indicates how well the model performs on new data.
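
A minimal sketch of fitting $f(x) = wx + b$ by ordinary least squares, assuming a single input feature. The GDP and life satisfaction numbers below are made up for illustration and are not the OECD data. The held-out `test_example` plays the role of the test example described above.

```python
# Made-up (GDP per capita in thousands of USD, life satisfaction) pairs; illustration only.
train = [(10.0, 5.0), (20.0, 5.5), (30.0, 6.0), (40.0, 6.5), (50.0, 7.0)]
test_example = (35.0, 6.3)  # held out: not used to fit the line

def fit_line(points):
    """Ordinary least squares for f(x) = w*x + b with one input feature."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    w = covariance / variance   # weight (slope)
    b = mean_y - w * mean_x     # bias (intercept)
    return w, b

w, b = fit_line(train)

def f(x):
    """Predict life satisfaction from GDP per capita using the fitted line."""
    return w * x + b

# Error on the unseen example: a small value suggests the line generalizes.
test_error = abs(f(test_example[0]) - test_example[1])
print(f"w={w:.3f}, b={b:.3f}, test error={test_error:.3f}")
```

The prediction `f(35.0)` is a value that never appears in `train`, which is the sense in which a regression output may not exist in the original dataset.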

Supervised Learning Algorithms

Classification: KNN, decision trees, random forests, support vector machines, Naive Bayes.
Regression: Linear regression, polynomial regression, decision tree regression.

Unsupervised Learning

Used when data lacks labels.
The goal is to understand the internal structure of the data.
- Identify similar data points.
- Identify separations of groups.
- If the groups are separated well enough, identify class boundaries despite the lack of labels.
Without any label information, the model can group data points but cannot name the groups.
Examples: clustering, retail customer segmentation, recommendation engines, anomaly detection.

Clustering Example

Input variables X1 and X2 (e.g., blood pressure and cholesterol levels).
The model identifies clusters based on the similarity of values.
The output may be cluster centers or the identification of which cluster new data points belong to.
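
The clustering idea can be sketched with a small K-means loop. This is a simplified illustration under stated assumptions: the readings are made up, the number of clusters is fixed at two, and the starting centers are picked by hand rather than at random. The helper names `closest_center` and `k_means` are hypothetical.

```python
import math

# Made-up (blood pressure, cholesterol) readings; illustration only.
points = [
    (118.0, 170.0), (122.0, 180.0), (120.0, 175.0),
    (150.0, 240.0), (155.0, 250.0), (148.0, 245.0),
]

def closest_center(point, centers):
    """Index of the center nearest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def k_means(data, initial_centers, iterations=10):
    centers = list(initial_centers)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest center.
        groups = [[] for _ in centers]
        for p in data:
            groups[closest_center(p, centers)].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

centers = k_means(points, initial_centers=[points[0], points[3]])

# Both kinds of output: the cluster centers, and the cluster of a new data point.
new_patient = (152.0, 248.0)
print(centers, closest_center(new_patient, centers))
```

With this data the centers settle at (120.0, 175.0) and (151.0, 245.0), and the new patient falls in the second cluster. Note that the result is a cluster index, not a named class.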

Retail Customer Segmentation

Understanding customer behavior and identifying different customer personas.
Companies collect data about customer patterns to make recommendations.

Anomaly Detection

Identifying unusual patterns in data.
Common in cybersecurity and stock market analysis.
The model defines the boundaries of what is normal and flags anything outside those boundaries.
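
One simple way to picture "boundaries of what is normal" is a range around the mean of past observations. The sketch below uses that statistical rule as a stand-in. It is not a one-class SVM, and both the login counts and the cutoff of three standard deviations are assumptions chosen for illustration.

```python
import statistics

# Made-up daily login counts for one account; illustration only.
normal_history = [20, 22, 19, 21, 23, 20, 18, 22, 21, 20]

mean = statistics.mean(normal_history)
spread = statistics.stdev(normal_history)
THRESHOLD = 3.0  # assumed cutoff: flag values more than 3 standard deviations away

# The learned boundaries of "normal".
lower = mean - THRESHOLD * spread
upper = mean + THRESHOLD * spread

def is_anomaly(value):
    """True when the value falls outside the learned normal range."""
    return not (lower <= value <= upper)

print(is_anomaly(21), is_anomaly(95))
```

No labels are involved: the boundaries come only from the observed values themselves.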

Unsupervised Learning Algorithms

Clustering: K-means, DBSCAN, hierarchical clustering.
Anomaly Detection: One-class SVM.
Dimensionality Reduction: Reducing the number of variables in the data.

Semi-Supervised Learning

A combination of labeled and unlabeled data.
Common scenario: A company has a large amount of unlabeled data and invests in labeling a smaller portion.
The goal is to use both labeled and unlabeled data to build a better model.

Example: Two Classes with Labeled and Unlabeled Data

Labeled data: Used to train a traditional supervised learning model.
Unlabeled data: Provides a bigger picture of the whole space that the dataset represents.
Combining both datasets can lead to a better separation line.

Techniques for Semi-Supervised Learning

Transductive Learning: Use the existing labels to predict labels for the unlabeled data in the dataset.
Inductive Learning: Train a model on the labeled data, then feed the unlabeled data through the model and add its high-confidence predictions to the training data.
Cluster-then-label: Apply unsupervised learning first (e.g., clustering), then use the available labels to assign a label to each cluster.
S3VM (Semi-Supervised Support Vector Machine).
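
The step of feeding unlabeled data through a trained model and keeping only its high-confidence predictions can be sketched as a self-training loop. This is an illustration under assumptions, not a standard library routine: the data is made up and one-dimensional, the model is a nearest-centroid classifier, and the confidence score and `CONFIDENCE_CUTOFF` value are invented for the example.

```python
# Made-up 1-D feature values; illustration only.
labeled = [(1.0, "A"), (9.0, "B")]
unlabeled = [2.0, 3.0, 4.0, 6.5, 7.0, 8.0]

def centroids(samples):
    """Mean feature value per class: the 'model' in this sketch."""
    by_class = {}
    for x, label in samples:
        by_class.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in by_class.items()}

def predict_with_confidence(x, centers):
    """Nearest-centroid label plus a confidence in [0, 1] based on the distance gap."""
    ranked = sorted(centers, key=lambda label: abs(x - centers[label]))
    d_best = abs(x - centers[ranked[0]])
    d_next = abs(x - centers[ranked[1]])
    total = d_best + d_next
    confidence = 1.0 - d_best / total if total else 1.0
    return ranked[0], confidence

CONFIDENCE_CUTOFF = 0.7  # assumed value
training, pool = list(labeled), list(unlabeled)
while pool:
    centers = centroids(training)                 # train on what is labeled so far
    scored = [(x, *predict_with_confidence(x, centers)) for x in pool]
    confident = [(x, label) for x, label, conf in scored if conf >= CONFIDENCE_CUTOFF]
    if not confident:
        break                                     # nothing left that the model is sure about
    training.extend(confident)                    # adopt the high-confidence predictions
    taken = {x for x, _ in confident}
    pool = [x for x in pool if x not in taken]

final_centers = centroids(training)
print(final_centers, pool)
```

In this run the point 4.0 never reaches the cutoff, so it stays unlabeled. That is the intended behavior: low-confidence predictions are left out rather than risk training on wrong labels.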

Reinforcement Learning

The system learns dynamically during runtime.
The algorithm (agent) tries specific outputs and observes whether the decision was correct or not.
The algorithm improves through trial and error.
Popular in robotics, self-driving cars, and gaming.

Reinforcement Learning System

Agent: Computer program that can observe its environment.
Environment: The physical environment or a game.
Action: The agent decides to perform an action and observes how that action changes the environment.
Reward: The agent receives a reward based on the outcome of the action (positive or negative).
The agent learns to match the right action to a particular state to maximize the reward.

Reinforcement Learning Algorithms

Model-based Approach: Uses a pre-trained model of how the world reacts to actions.
Model-free Approach: Starts by selecting actions at random and learns the relationship between state, action, and outcome.
Q Function: Estimates the reward expected from taking a given action in a given state, linking state and action to reward.
Deep Q-Networks (DQN): Use deep neural networks to approximate the Q function.
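
As a rough illustration of the model-free approach, the sketch below runs tabular Q-learning, one algorithm built around the Q function, on a tiny corridor. Everything about the environment is an assumption made for the example: five cells, a reward of 1.0 for reaching the last cell, and hand-picked values for `ALPHA` and `GAMMA`. The agent explores purely at random and keeps its Q estimates in a table rather than a neural network.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_STATES = 5                 # corridor cells 0..4; reaching cell 4 ends the episode
GOAL = N_STATES - 1
ACTIONS = ["left", "right"]
ALPHA, GAMMA = 0.5, 0.9      # assumed learning rate and discount factor

# Q[state][action]: running estimate of the long-term reward of that choice.
Q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]

def step(state, action):
    """Environment: move one cell; reward 1.0 only when the goal is reached."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward

for episode in range(500):
    state = 0
    for _ in range(100):                      # cap on steps per episode
        action = random.choice(ACTIONS)       # explore by acting at random
        nxt, reward = step(state, action)
        # Best value obtainable from the next state (nothing follows the goal).
        future = 0.0 if nxt == GOAL else max(Q[nxt].values())
        target = reward + GAMMA * future
        # Nudge the estimate toward the observed outcome.
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = nxt
        if state == GOAL:
            break

# Greedy policy: in each non-goal state, pick the action with the highest Q value.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(policy)
```

After training, the greedy policy chooses "right" in every cell, which matches the state-to-action mapping that maximizes reward in this environment.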

Data Structure Concepts

Sample: A single entry in the dataset, consisting of the input and, where available, the output.
Features: The input information fed into the model.
Feature Vector: The set of features for a single sample.
Target: The label for a feature vector.
Feature Matrix: The whole collection of feature vectors used for training, validation, or testing.
Target Vector: The collection of targets, one for each feature vector in the feature matrix.
$X$: Commonly used to represent the feature matrix.
$Y$: Commonly used to represent the target vector.
Data Point (Sample): An individual entry to be passed through the model.
Generalization: The model's ability to keep prediction error low on unseen data.
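
These terms map onto plain Python lists, as in the minimal sketch below. The three rows are a small hand-typed sample in the style of the Iris dataset, used only for illustration.

```python
# Each sample pairs its features with a target (label).
# Features: (sepal length, sepal width, petal length, petal width) in cm.
samples = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
]

X = [list(features) for features, _ in samples]  # feature matrix: one row per sample
Y = [target for _, target in samples]            # target vector: one label per sample

feature_vector = X[0]  # the features of a single sample
target = Y[0]          # the label that feature vector should map to

n_samples, n_features = len(X), len(X[0])
print(n_samples, n_features, feature_vector, target)
```

Row `i` of `X` and entry `i` of `Y` always describe the same sample, so the two structures must stay in the same order.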