Supervised Learning with Scikit-Learn Notes

Introduction to Supervised Learning

  • Instructor: George Boorman

  • Focus: Course on supervised learning using scikit-learn

Machine Learning Definition

  • Machine Learning: Process whereby computers learn to make decisions from data without being explicitly programmed.

    • Examples:

      • Predicting whether an email is spam based on its content and sender.

      • Clustering books into categories based on the words they contain, then assigning new books to the existing clusters.

Types of Learning

  • Supervised Learning: Type of machine learning where the values to be predicted are already known for the training data (labeled data).

    • Aim: Build a model that can accurately predict the target values of previously unseen data.

    • Uses features to predict a target variable.

    • Example: Predicting a basketball player's position based on points per game.

  • Unsupervised Learning: Process of discovering hidden patterns in unlabeled data.

    • Example: Grouping customers based on purchasing behavior without predefined categories.

    • Clustering, as in the example above, is one branch of unsupervised learning.

Categories of Supervised Learning

  1. Classification: Predicting the label or category of an observation.

    • Example: Predicting whether a bank transaction is fraudulent or non-fraudulent.

    • This scenario describes binary classification (two possible outcomes).

  2. Regression: Predicting continuous values.

    • Example: Using features like the number of bedrooms and property size to predict the property's price.
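
The difference between the two categories shows up in the type of value a model predicts. The following is a minimal sketch, not taken from the course: the data is made up, and DecisionTreeClassifier and LinearRegression are simply two convenient scikit-learn estimators chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Classification: the target is a category (1 = fraudulent, 0 = non-fraudulent).
# Made-up feature: transaction amount.
amounts = np.array([[10], [25], [40], [900], [1200], [1500]])
is_fraud = np.array([0, 0, 0, 1, 1, 1])
classifier = DecisionTreeClassifier(random_state=0).fit(amounts, is_fraud)
predicted_labels = classifier.predict(np.array([[30], [1000]]))

# Regression: the target is a continuous value (a price).
# Made-up features: [number of bedrooms, size in square metres].
homes = np.array([[1, 40], [2, 60], [3, 90], [4, 120]])
prices = np.array([100_000.0, 150_000.0, 220_000.0, 290_000.0])
regressor = LinearRegression().fit(homes, prices)
predicted_price = regressor.predict(np.array([[3, 100]]))

print(predicted_labels)  # class labels drawn from {0, 1}
print(predicted_price)   # a floating-point estimate
```

The classifier can only return one of the labels it saw during training, while the regressor can return any real number.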

Terminology

  • Feature: Also called an independent variable or predictor variable. "Feature" is the term used throughout the course.

  • Target Variable: Also called a dependent variable or response variable. "Target variable" is the term used throughout the course.
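
As a rough illustration of the terminology, the sketch below splits a small table into features and a target variable. The DataFrame and its column names (bedrooms, size_sqm, price) are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical housing data; the values are made up for illustration.
df = pd.DataFrame({
    "bedrooms": [2, 3, 4],
    "size_sqm": [70.0, 95.5, 150.0],
    "price": [180_000, 240_000, 390_000],
})

# Features: every column except the one we want to predict (2-D array).
X = df.drop(columns="price").values

# Target variable: the column we want to predict (1-D array).
y = df["price"].values

print(X.shape, y.shape)  # (3, 2) (3,)
```

Each row of X is one observation and each column is one feature; y holds one target value per observation.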

Data Requirements for Supervised Learning

  • Data Criteria:

    • Must not have missing values.

    • Must be in numeric format.

    • Must be stored in Pandas DataFrames or NumPy arrays.

  • Exploratory Data Analysis: Necessary to ensure data is in the correct format before performing supervised learning.

    • Tools: Various Pandas methods for descriptive statistics and appropriate data visualizations.
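
The checks above can be sketched with a few pandas calls. This is one possible approach rather than the course's own code, and the DataFrame below is made up to include one missing value and one non-numeric column.

```python
import numpy as np
import pandas as pd

# Made-up data with two deliberate problems:
# a missing value in "bedrooms" and a text column "city".
df = pd.DataFrame({
    "bedrooms": [2, 3, np.nan, 4],
    "size_sqm": [70.0, 95.5, 120.0, 150.0],
    "city": ["Leeds", "York", "Leeds", "Bath"],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# List the columns that are not in a numeric format.
non_numeric_columns = df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_columns)

# Descriptive statistics for the numeric columns.
print(df.describe())
```

Any column reported here would need handling (for example, filling or dropping missing values, or encoding text as numbers) before the data is passed to a model.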

Scikit-Learn Workflow Syntax

  • General Workflow Steps:

    1. Import a model (an algorithm for the supervised learning problem) from a scikit-learn module.

    2. Instantiate the model and assign it to a variable (e.g., model).

    3. Fit the model to the data to learn patterns about the features and target variable.

    • Fit the model to:

      • X: An array of features.

      • y: An array of target variable values.

    4. Use the model's predict method, passing new observations (e.g., X_new).

    • Example: Passing the features of six emails to a spam classification model returns an array of six predictions:

      • 1 indicates spam.

      • 0 indicates not spam.
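
The four workflow steps can be sketched as follows. This is a minimal illustration under stated assumptions: the notes do not name a specific model here, so KNeighborsClassifier stands in as one possible choice, and the email features (link count, exclamation mark count) and all values are made up. The features array is written as uppercase X, following the usual scikit-learn convention.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # step 1: import a model

# Made-up training data: each row is one email,
# columns are [number of links, number of exclamation marks].
X = np.array([[8, 6], [7, 9], [9, 7], [6, 8], [0, 1], [1, 0], [0, 0], [1, 1]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = spam, 0 = not spam

# Step 2: instantiate the model.
model = KNeighborsClassifier(n_neighbors=3)

# Step 3: fit the model to the features and target variable.
model.fit(X, y)

# Step 4: predict labels for six new, unseen emails.
X_new = np.array([[7, 7], [0, 2], [9, 9], [1, 1], [8, 5], [0, 0]])
predictions = model.predict(X_new)

print(predictions)  # [1 0 1 0 1 0]
```

The model returns one prediction per row of X_new, so six emails produce an array of six labels.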

Understanding and Practical Applications

  • Throughout the course, exercises check understanding of supervised learning principles and apply them to real data.