Supervised Learning with Scikit-Learn Notes
Introduction to Supervised Learning
Instructor: George Boorman
Focus: Course on supervised learning using scikit-learn
Machine Learning Definition
Machine Learning: Process whereby computers learn to make decisions from data without being explicitly programmed.
Examples:
Predicting whether an email is spam or not based on content and sender.
Clustering books into different categories based on contained words and assigning new books to existing clusters.
Types of Learning
Supervised Learning: Type of machine learning where the values to be predicted are already known for the training data (labeled data).
Aim: Build a model that can accurately predict values of previously unseen data.
Uses features to predict a target variable.
Example: Predicting a basketball player's position based on points per game.
Unsupervised Learning: Process of discovering hidden patterns in unlabeled data.
Example: Grouping customers based on purchasing behavior without predefined categories.
Grouping of this kind is called clustering, which is one branch of unsupervised learning.
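A minimal sketch of clustering, assuming scikit-learn's KMeans as the algorithm (the notes do not name one). The customer data, the two feature columns, and the choice of two clusters are all made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up purchasing data: [orders per month, average order value]
customers = np.array([
    [1, 20.0], [2, 25.0], [1, 22.0],        # occasional, low spend
    [10, 200.0], [12, 220.0], [11, 210.0],  # frequent, high spend
])

# No labels are supplied; KMeans looks for groups on its own.
# n_clusters=2 is an assumption made for this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(customers)

# A new observation can be assigned to one of the existing clusters
new_customer = np.array([[9, 190.0]])
new_cluster = kmeans.predict(new_customer)

print(cluster_ids)  # which group gets id 0 and which gets id 1 is arbitrary
print(new_cluster)
```

The cluster ids carry no meaning by themselves; interpreting the groups is left to the analyst.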
Categories of Supervised Learning
Classification: Predicting the label or category of an observation.
Example: Predicting whether a bank transaction is fraudulent or non-fraudulent.
This scenario describes binary classification (two possible outcomes).
Regression: Predicting continuous values.
Example: Using features like the number of bedrooms and property size to predict the property's price.
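The difference between the two categories shows up in what the model returns. The sketch below uses made-up data, and the two estimators (DecisionTreeClassifier and LinearRegression) are arbitrary choices for illustration rather than the models prescribed by the course.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Classification: made-up transactions, feature = amount, target = 0/1 label
amounts = np.array([[5.0], [12.0], [20.0], [900.0], [1200.0], [1500.0]])
is_fraud = np.array([0, 0, 0, 1, 1, 1])  # 1 = fraudulent, 0 = non-fraudulent
clf = DecisionTreeClassifier(random_state=0).fit(amounts, is_fraud)
label_pred = clf.predict(np.array([[10.0], [1300.0]]))

# Regression: made-up properties, features = [bedrooms, size], target = price
homes = np.array([[1, 40.0], [2, 60.0], [3, 85.0], [4, 120.0]])
prices = np.array([150_000.0, 210_000.0, 290_000.0, 400_000.0])
reg = LinearRegression().fit(homes, prices)
price_pred = reg.predict(np.array([[3, 90.0]]))

print(label_pred)  # discrete class labels
print(price_pred)  # a continuous value
```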
Terminology
Feature: Also called an independent variable or predictor variable; "feature" is the term used throughout the course.
Target Variable: Also called a dependent variable or response variable; "target variable" is the term used throughout the course.
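In code, features and target are usually separated into two objects. A minimal sketch, assuming a small made-up pandas DataFrame whose column names are illustrative only:

```python
import pandas as pd

# Made-up data; the column names are hypothetical
df = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4],
    "size_sqm": [40.0, 60.0, 85.0, 120.0],
    "price": [150_000, 210_000, 290_000, 400_000],
})

X = df.drop(columns="price").values  # features: 2D array, one row per observation
y = df["price"].values               # target: 1D array, one value per observation

print(X.shape, y.shape)  # (4, 2) (4,)
```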
Data Requirements for Supervised Learning
Data Criteria:
Must not have missing values.
Must be in numeric format.
Must be stored in Pandas DataFrames or NumPy arrays.
Exploratory Data Analysis: Necessary to ensure data is in the correct format before performing supervised learning.
Tools: Various Pandas methods for descriptive statistics and appropriate data visualizations.
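The criteria above can be checked with a few pandas calls. This is a sketch on a made-up DataFrame (the columns are hypothetical) and only detects problems; how to handle them is a separate decision.

```python
import numpy as np
import pandas as pd

# Made-up data containing one missing value and one non-numeric column
df = pd.DataFrame({
    "points_per_game": [25.1, 8.3, np.nan, 14.7],
    "assists_per_game": [7.2, 1.1, 3.4, 5.0],
    "position": ["guard", "center", "forward", "guard"],
})

# Count missing values in each column
missing_per_column = df.isna().sum()

# List the columns that are not in a numeric format
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

# Descriptive statistics for the numeric columns
summary = df.describe()

print(missing_per_column)
print(non_numeric)
print(summary)
```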
Scikit-Learn Workflow Syntax
General Workflow Steps:
Import a model (algorithm for the supervised learning problem) from a scikit-learn module.
Instantiate the model (create a variable named 'model').
Fit the model to the data so that it learns patterns relating the features to the target variable. The fit method takes two arguments:
X: An array of features.
y: An array of target variable values.
Use the model's predict method, passing new observations (e.g., X_new).
Example: Feeding features from six emails to a spam classification model results in an array of six values returned by the model:
1 indicates spam.
0 indicates not spam.
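The workflow steps can be sketched end to end as follows. The classifier (KNeighborsClassifier), the two features, and all data values are assumptions made for illustration; the feature array is named X here, following scikit-learn convention.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier  # 1. import a model

# Made-up training data: [number of links, number of exclamation marks] per email
X = np.array([[0, 0], [1, 0], [0, 1], [8, 5], [9, 7], [7, 6]])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = spam, 0 = not spam

model = KNeighborsClassifier(n_neighbors=3)  # 2. instantiate the model
model.fit(X, y)                              # 3. fit to features and target

# 4. predict labels for six new, unseen emails
X_new = np.array([[0, 0], [9, 6], [1, 1], [8, 8], [0, 2], [10, 5]])
predictions = model.predict(X_new)

print(predictions)  # [0 1 0 1 0 1]
```

Six observations go in, and an array of six labels comes out, one per email.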
Understanding and Practical Applications
Aim: Check understanding of the principles of supervised learning and practice implementing them on real data throughout the course.