Lecture 1 ML

Types of Machine Learning Algorithms

  • Understanding the types of machine learning algorithms is crucial for solving problems effectively.
  • Different algorithms are designed for specific types of problems.
  • Knowing the available algorithms helps in identifying potential solutions and framing problems.

Data and Terminology

  • Data: The data used is a critical factor in determining the type of problem that can be solved.
  • Model:
    • The algorithm or system that makes decisions.
    • Can refer to the algorithm itself (e.g., decision trees, support vector machines).
    • Also refers to the trained model, ready for deployment.
  • Training: The process of making the model understand the desired solutions by feeding it data and adjusting its parameters.
  • Prediction/Inference: The output of the model when given new, unseen data.

High-Level Classification: Supervised vs. Unsupervised Learning

  • Machine learning algorithms can be categorized based on whether they are trained with human supervision.
  • This depends on whether the dataset includes information about the desired output (labels).
  • The key aspect is the label, which is an annotation about what each data point represents (e.g., animal pictures with labels indicating the animal type).
  • Labeled data typically leads to supervised learning, while unlabeled data leads to unsupervised learning.

Types of Machine Learning

  • Supervised Learning
  • Unsupervised Learning
  • Semi-Supervised Learning
  • Reinforcement Learning

Supervised Learning

  • The training set contains information about the desired outcomes.
  • For each input, the desired output is known.

Iris Dataset Example

  • A popular dataset for classifying different types of iris flowers.
  • Experts collected measurements (sepal length, sepal width, petal length, petal width) for different iris species.
  • The dataset includes the species type as a label for each sample.
  • Data is typically presented in a tabular format (e.g., CSV).
  • After training, the model can predict the species of a new flower based on its measurements.
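
The train-then-predict flow above can be sketched with a nearest-neighbor rule, one of the simplest supervised classifiers. This is only an illustrative sketch: the measurements are made-up values in the style of the Iris dataset (not copied from it), and `predict_species` is a hypothetical helper name.

```python
import math

# Hypothetical training samples: (sepal length, sepal width, petal length, petal width) in cm.
# These values are illustrative only and are not copied from the Iris dataset.
train_features = [
    (5.0, 3.4, 1.5, 0.2),
    (4.8, 3.1, 1.4, 0.2),
    (6.0, 2.8, 4.5, 1.4),
    (5.7, 2.9, 4.2, 1.3),
    (6.7, 3.1, 5.6, 2.2),
    (6.4, 2.9, 5.5, 2.0),
]
# One label per training sample, in the same order.
train_labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

def predict_species(measurements):
    """Return the label of the training sample closest to `measurements`."""
    distances = [math.dist(measurements, sample) for sample in train_features]
    nearest_index = distances.index(min(distances))
    return train_labels[nearest_index]

# A new, unseen flower: the model predicts its species from the measurements alone.
new_flower = (5.1, 3.3, 1.6, 0.3)
print(predict_species(new_flower))
```

Here "training" amounts to storing the labeled samples; most models instead adjust internal parameters, but the input/label/prediction roles are the same.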

OECD Data Example

  • A dataset with GDP per capita and life satisfaction scores for different countries.
  • The goal is to find a link between GDP and life satisfaction.
  • The model should learn the function that maps GDP to life satisfaction.
  • The output is a numerical value that may not exist in the original dataset but fits the relationship between input and output.

Subcategories of Supervised Learning

  • Classification: The labels represent categories or classes (not numerical).
    • Examples: Iris dataset, spam filter, image classification.
    • Binary Classification: Two possible labels (e.g., spam/not spam).
    • Multi-Class Classification: Multiple possible labels (e.g., different types of flowers, hundreds of activities classified based on sound analysis).
  • Regression: The model generates a numerical output.

Regression

  • Regression identifies an actual mathematical function: given an input, it calculates an output value.
  • The model generates a numerical output in a continuous numerical space.
    • Example: Estimating house prices from multiple input values such as location and number of bedrooms.
  • Common in finance for estimating future values.
  • In the linear case, the model learns a line that minimizes the error between the predicted and actual values.
  • Formula: f(x) = wx + b
    • Where w is the weight/slope and b is the bias/intercept.
  • Example: GDP per capita vs. life satisfaction.
    • The model learns the line that best fits the data.
    • The red line in the plot marks a test example that was not used to train the model.
    • If that test example lies close to the learned line, the model is performing well on new data.
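
As a rough sketch of how the line f(x) = wx + b can be learned, the following uses the closed-form least-squares solution for a single input variable and then checks one held-out test example. The GDP and life satisfaction numbers are invented for illustration and are not OECD figures; `fit_line` and `predict` are hypothetical names.

```python
# Hypothetical (GDP per capita in thousands of USD, life satisfaction score) pairs.
# The numbers are made up for illustration and are not taken from OECD data.
train_gdp = [10.0, 20.0, 30.0, 40.0, 50.0]
train_satisfaction = [5.0, 5.6, 6.1, 6.4, 7.0]

def fit_line(xs, ys):
    """Return (w, b) minimizing the sum of squared errors of f(x) = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w = covariance / variance      # weight / slope
    b = mean_y - w * mean_x        # bias / intercept
    return w, b

def predict(x, w, b):
    """Evaluate the learned line at input x."""
    return w * x + b

w, b = fit_line(train_gdp, train_satisfaction)

# A held-out test example that was not used for fitting.
test_gdp, test_satisfaction = 35.0, 6.3
test_error = abs(predict(test_gdp, w, b) - test_satisfaction)
print(f"w={w:.3f}, b={b:.3f}, test error={test_error:.3f}")
```

A small test error suggests the learned line also fits data it has not seen; a single test point is only a weak check, so in practice several held-out examples would be used.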

Supervised Learning Algorithms

  • Classification: KNN, decision trees, random forests, support vector machines, Naive Bayes.
  • Regression: Linear regression, polynomial regression, decision tree regression.

Unsupervised Learning

  • Used when data lacks labels.
  • The goal is to understand the internal structure of the data.
    • Identify similar data points.
    • Identify separations of groups.
    • If the groups are separated well enough, identify class boundaries despite the lack of labels.
  • Without any label information, the model cannot name the groups it finds.
  • Examples: clustering, retail customer segmentation, recommendation engines, anomaly detection.

Clustering Example

  • Input variables X1 and X2 (e.g., blood pressure and cholesterol levels).
  • The model identifies clusters based on the similarity of values.
  • The output may be cluster centers or the identification of which cluster new data points belong to.
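
A minimal sketch of this idea is k-means, shown here in plain Python with two clusters. The (X1, X2) values and the starting centers are assumptions chosen for illustration; `k_means` and `assign_cluster` are hypothetical helper names.

```python
import math

def assign_cluster(point, centers):
    """Return the index of the center nearest to `point`."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))

def k_means(points, centers, iterations=10):
    """Run a fixed number of k-means iterations from the given starting centers."""
    for _ in range(iterations):
        # Assignment step: group points by nearest center.
        groups = [[] for _ in centers]
        for point in points:
            groups[assign_cluster(point, centers)].append(point)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                new_centers.append(tuple(sum(values) / len(group) for values in zip(*group)))
            else:
                new_centers.append(center)  # keep the center of an empty cluster unchanged
        centers = new_centers
    return centers

# Hypothetical (X1, X2) measurements, e.g. blood pressure and cholesterol level.
points = [(110, 170), (115, 180), (120, 175), (150, 240), (155, 250), (160, 245)]
centers = k_means(points, centers=[(100, 150), (170, 260)])

print(centers)                             # learned cluster centers
print(assign_cluster((118, 178), centers))  # cluster index for a new data point
```

The two outputs correspond to the two kinds of result mentioned above: the cluster centers themselves, and the cluster membership of a new point. Note that the cluster indices carry no meaning beyond "group 0" and "group 1", since no labels were provided.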

Retail Customer Segmentation

  • Understanding customer behavior and identifying different customer personas.
  • Companies collect data about customer patterns to make recommendations.

Anomaly Detection

  • Identifying unusual patterns in data.
  • Common in cybersecurity and stock market analysis.
  • The model defines the boundaries of what is normal and flags anything outside those boundaries.
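
One very simple way to "define the boundaries of what is normal" is a range around the mean of previously observed values. The sketch below assumes a single numeric feature and a threshold of three standard deviations; both choices, the data, and the function names are illustrative assumptions rather than the method used in the lecture.

```python
import statistics

def fit_normal_range(values, num_std=3.0):
    """Learn a (low, high) range that covers typical values."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return mean - num_std * std, mean + num_std * std

def is_anomaly(value, normal_range):
    """Flag any value that falls outside the learned range."""
    low, high = normal_range
    return value < low or value > high

# Hypothetical request counts per minute observed during normal operation.
normal_traffic = [98, 102, 101, 99, 100, 97, 103, 100]
normal_range = fit_normal_range(normal_traffic)

print(is_anomaly(101, normal_range))  # a typical value
print(is_anomaly(250, normal_range))  # a sudden spike
```

No labels of the form "normal" or "attack" are needed: the boundary is derived from the data alone, which is what makes this an unsupervised technique.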

Unsupervised Learning Algorithms

  • Clustering: K-means, DBSCAN, hierarchical clustering.
  • Anomaly Detection: One-class SVM.
  • Dimensionality Reduction: Reducing the number of variables in the data.

Semi-Supervised Learning

  • A combination of labeled and unlabeled data.
  • Common scenario: A company has a large amount of unlabeled data and invests in labeling a smaller portion.
  • The goal is to use both labeled and unlabeled data to build a better model.

Example: Two Classes with Labeled and Unlabeled Data

  • Labeled data: Used to train a traditional supervised learning model.
  • Unlabeled data: Provides a bigger picture of the whole space that the dataset represents.
  • Combining both datasets can lead to a better separation line.

Techniques for Semi-Supervised Learning

  • Transductive Learning: Predict labels only for the specific unlabeled data at hand, using the existing labels, without building a general model for new data.
  • Inductive Learning: Learn a general model that also applies to new data, e.g., train a model on the labeled data, then feed the unlabeled data through the model and incorporate high-confidence predictions into the learning (self-training).
  • Apply unsupervised learning first (e.g., clustering), then use the available labels to assign a label to each cluster.
  • S3VM (Semi-Supervised Support Vector Machine).
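
The idea of feeding unlabeled data through a model and keeping only its high-confidence predictions can be sketched as a small self-training loop. The example assumes one numeric feature, a nearest-centroid classifier as a stand-in model, and a distance margin as the confidence measure; all names, data, and the `min_margin` threshold are hypothetical.

```python
def centroids(samples):
    """Return the mean feature value per class for (value, label) samples."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: sum(values) / len(values) for label, values in by_label.items()}

def predict_with_confidence(value, class_centroids):
    """Return (label, margin); the margin is how much closer the best class is than the runner-up."""
    ranked = sorted(class_centroids, key=lambda label: abs(value - class_centroids[label]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(value - class_centroids[runner_up]) - abs(value - class_centroids[best])
    return best, margin

def self_train(labeled, unlabeled, min_margin=2.0, rounds=5):
    """Repeatedly retrain, adding only confidently predicted unlabeled points."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(rounds):
        class_centroids = centroids(labeled)
        confident, uncertain = [], []
        for value in remaining:
            label, margin = predict_with_confidence(value, class_centroids)
            (confident if margin >= min_margin else uncertain).append((value, label))
        if not confident:
            break  # nothing new to learn from
        labeled.extend(confident)
        remaining = [value for value, _ in uncertain]
    return centroids(labeled), remaining

# Hypothetical data: two labeled points and a larger pool of unlabeled points.
labeled = [(1.0, "A"), (9.0, "B")]
unlabeled = [2.0, 3.0, 4.0, 5.2, 7.0, 8.0, 10.0]

final_centroids, still_unlabeled = self_train(labeled, unlabeled)
print(final_centroids, still_unlabeled)
```

In this toy data the unlabeled points shift the class centers, so the separation between the two classes ends up in a different place than the two labeled points alone would suggest. The point nearest the boundary is never pseudo-labeled, because the model is not confident about it.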

Reinforcement Learning

  • The system learns dynamically during runtime.
  • The algorithm (agent) tries specific actions and observes whether each decision turned out to be correct or not.
  • The algorithm improves through trial and error.
  • Popular in robotics, self-driving cars, and gaming.

Reinforcement Learning System

  • Agent: Computer program that can observe its environment.
  • Environment: The physical environment or a game.
  • Action: The agent decides to perform an action and observes how that action changes the environment.
  • Reward: The agent receives a reward based on the outcome of the action (positive or negative).
  • The agent learns to match the right action to a particular state to maximize the reward.

Reinforcement Learning Algorithms

  • Model-based Approach: Uses a pre-trained model of how the world reacts to actions.
  • Model-free Approach: Starts by randomly selecting actions and learning the relationship between state, action, and outcome.
  • Q Function: Maps a state-action pair to the reward expected from taking that action in that state.
  • Deep Q-Networks: Use deep neural networks to approximate the Q function.
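
The model-free approach and the Q function can be illustrated with tabular Q-learning on a toy environment. The sketch below assumes a five-cell corridor in which the agent starts at the left end and receives a reward of 1 on reaching the right end; the environment, the hyperparameters (`alpha`, `gamma`), and all names are assumptions made for illustration.

```python
import random

NUM_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)      # index 0 = move left, index 1 = move right
GOAL = NUM_STATES - 1

def step(state, action_index):
    """Environment: apply the action and return (next_state, reward)."""
    next_state = min(max(state + ACTIONS[action_index], 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def train_q_table(episodes=500, alpha=0.5, gamma=0.9, max_steps=100, seed=0):
    """Learn Q[state][action] from randomly chosen actions (trial and error)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(NUM_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = rng.randrange(len(ACTIONS))  # model-free: explore with random actions
            next_state, reward = step(state, action)
            # Q-learning update; the goal row is never updated, so its value stays 0.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if state == GOAL:
                break
    return q

q_table = train_q_table()
# Greedy policy: in each non-goal state, pick the action with the highest Q value.
policy = [row.index(max(row)) for row in q_table[:GOAL]]
print(policy)
```

After training, the greedy policy should select "move right" (index 1) in every non-goal cell, since that is the action that leads to the reward soonest. A Deep Q-Network replaces the table `q` with a neural network, which matters when there are too many states to enumerate.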

Data Structure Concepts

  • Sample: A single entry in the dataset, consisting of its input and, where available, its output.
  • Features: The input information fed into the model.
  • Feature Vector: The set of features for a single sample that is fed into the model.
  • Target: The label for that feature vector.
  • Feature Matrix: The whole collection of feature vectors used for training, validation, or testing.
  • Target Vector: The collection of targets, one per feature vector; it holds what each feature vector is supposed to map to when run through the machine learning model.
  • X: Commonly used to represent the feature matrix.
  • Y: Commonly used to represent the target vector.
  • Data Point (Sample): An individual entry to be passed through the model.
  • Generalization: The model's ability to keep its error low on unseen data.
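
These terms map directly onto how a dataset is usually laid out in code. The following sketch uses plain Python lists with made-up values; the variable names other than X and Y, and the choice to hold back one sample for testing, are illustrative assumptions.

```python
# Hypothetical dataset: each row is one sample with two features; values are made up.
X = [               # feature matrix: one feature vector per row
    [5.1, 1.4],
    [4.9, 1.5],
    [6.3, 4.7],
    [5.8, 4.1],
    [6.9, 5.7],
]
Y = ["a", "a", "b", "b", "c"]   # target vector: one target per feature vector

num_samples = len(X)        # number of rows
num_features = len(X[0])    # number of columns

# A single sample pairs one feature vector with its target.
feature_vector, target = X[2], Y[2]

# Hold back the last sample so the model can be checked on unseen data.
split = 4
X_train, Y_train = X[:split], Y[:split]
X_test, Y_test = X[split:], Y[split:]

print(num_samples, num_features, feature_vector, target)
```

Keeping X and Y aligned row by row is what lets a model pair each feature vector with its target; the held-out rows are what generalization is measured on.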