Introduction
Data Analytics & Machine Learning Week 4: AI isn’t magic, but it’s okay if it feels like it
Educators: Aimée Backiel, Kenric Borgelioen, Daan Nijs
Course: Data Analytics and Machine Learning
Program: BCS 2024 - 2025 (Toegepaste Informatica)
Semester Course Schedule
Introduction
Bringing the right equipment for the data adventure
Variable types and summary statistics
Data Visualization
Probability and Statistics
AI isn’t magic, but it’s okay if it feels like it
Data-driven decision making
Supervised Learning: linear and logistic regression
Evaluating model quality: Good vs Bad models
Model evaluation and interpretation
Cognitive processes in AI:
Decision trees
Neural networks
AI pattern recognition:
Unsupervised learning
Reinforcement learning
Review and Exam Preparation
Key Concepts for Today's Lesson
AI isn’t magic, but it’s okay if it feels like it
Machine Learning Paradigms
Supervised Learning
Classification
Regression
Machine Learning Types
Supervised Learning
A process in which a model is trained on input data paired with labeled outputs
Types:
Classification: Predicting discrete classes (e.g., yes/no decisions)
Regression: Predicting continuous output values
Classification Techniques
Given predictor variables, determine a discrete class y (e.g., predicting whether a loan will be repaid)
Regression Techniques
Given predictor variables, determine a numeric output (e.g., the number of months until a loan is repaid); a sketch contrasting classification and regression follows below
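To make the distinction concrete, here is a minimal sketch in Python with scikit-learn (the library choice and the toy loan data are assumptions for illustration, not taken from the slides):

    # Same predictors, framed first as classification, then as regression.
    import numpy as np
    from sklearn.linear_model import LogisticRegression, LinearRegression

    # Predictors: [income, loan amount] for six hypothetical applicants
    X = np.array([[30, 5], [55, 10], [20, 8], [80, 15], [40, 4], [25, 12]])

    # Classification target: did the applicant repay? (1 = yes, 0 = no)
    y_class = np.array([1, 1, 0, 1, 1, 0])
    clf = LogisticRegression(max_iter=1000).fit(X, y_class)
    print(clf.predict([[50, 9]]))    # predicted class: repay or not

    # Regression target: months until the loan was fully repaid
    y_reg = np.array([12, 18, 30, 20, 10, 36])
    reg = LinearRegression().fit(X, y_reg)
    print(reg.predict([[50, 9]]))    # predicted number of months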
Supervised Learning Techniques
Classification:
K-Nearest Neighbors (KNN)
Naïve Bayes
Decision Tree
Random Forest
Logistic Regression
Support Vector Machines
Artificial Neural Network
Regression:
Linear Regression
Non-Linear Regression
K-Nearest Neighbors Regression
Decision Trees Regression
Support Vector Regression
Artificial Neural Network Regression
K-Nearest Neighbors (KNN)
Lazy Learner: no real training phase; the model stores the training data and predicts based on the distance between a new instance and the stored instances
Predicting Class:
K=1: Assigns the class of the single closest neighbor
K>1: Assigns the majority class among the K closest neighbors
Distance Metrics:
Hamming Distance: Counts the positions at which values differ; suitable for binary variables
Euclidean Distance: Straight-line distance; suitable for continuous variables
Manhattan Distance: Sum of absolute differences; suitable for grid-like structures
Chebyshev Distance: Maximum difference along any single dimension; suitable when diagonal moves cost the same as straight ones, as for a chess king (a KNN sketch follows below)
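A minimal KNN sketch with scikit-learn (the toy data and library choice are assumptions for illustration): the model simply stores the training points and predicts by majority vote among the K nearest ones.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X_train = np.array([[1, 1], [2, 1], [1, 2], [6, 5], [7, 6], [6, 7]])
    y_train = np.array([0, 0, 0, 1, 1, 1])

    # K=1 uses only the closest neighbor; K=3 takes a majority vote.
    # The metric parameter switches the distance function ('euclidean',
    # 'manhattan', 'chebyshev'; 'hamming' for binary features).
    for k in (1, 3):
        knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
        knn.fit(X_train, y_train)    # "lazy": fit only stores the data
        print(k, knn.predict([[2, 2], [6, 6]]))    # -> [0 1] for both k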
Decision Trees
Structure: Nodes and edges without loops
Nodes: Root, Internal, Leaf
Edges: Connections from parent to child nodes
Key Metrics for Splitting:
Entropy: Measures how mixed a node’s labels are (0 means perfectly pure); Entropy = -Σ p_i·log2(p_i)
Information Gain: Reduction in entropy after a split
Gini Index: The probability of misclassifying a randomly chosen instance if it were labeled according to the node’s class distribution; Gini = 1 - Σ p_i² (a sketch computing these metrics follows below)
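A minimal sketch of the splitting metrics, computed directly from class-label counts (the toy split is invented for illustration):

    import numpy as np

    def entropy(labels):
        # Entropy = -sum(p_i * log2(p_i)); 0 when all labels agree (pure)
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def gini(labels):
        # Gini = 1 - sum(p_i^2): chance of misclassifying a random instance
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1 - np.sum(p ** 2)

    parent = np.array([1, 1, 1, 0, 0, 0])    # 50/50 node: entropy = 1.0
    left, right = np.array([1, 1, 1]), np.array([0, 0, 0])  # perfect split

    # Information gain = parent entropy - weighted average child entropy
    gain = entropy(parent) - (len(left) * entropy(left)
                              + len(right) * entropy(right)) / len(parent)
    print(entropy(parent), gini(parent), gain)    # 1.0, 0.5, 1.0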
Random Forest
Combines multiple decision trees to improve predictions
Each tree is trained on a bootstrap sample of the data, and each split considers only a random subset of features, which reduces overfitting (a short sketch follows this list)
Majority vote: For classification problems
Averaging: For regression problems
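A minimal random forest sketch with scikit-learn (synthetic data; the library choice is an assumption):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    # n_estimators trees vote; max_features caps the features tried per split
    forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                    random_state=0)
    forest.fit(X, y)
    print(forest.predict(X[:3]))    # majority vote across the 100 trees
    # For regression, RandomForestRegressor averages the trees' outputs.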
Linear Regression
Simple Linear Regression: Predicts Y based on a single variable X
Equation: Y = β0 + β1·X + ε
Multiple Linear Regression: Predicts Y based on multiple variables X1, X2, ..., Xp
Equation: Y = β0 + β1·X1 + β2·X2 + ... + βp·Xp + ε (a fitting sketch follows below)
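A minimal fitting sketch (toy data invented; scikit-learn assumed as the library):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1], [2], [3], [4], [5]])    # single predictor (simple LR)
    Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])   # roughly Y = 2X + noise

    model = LinearRegression().fit(X, Y)
    print(model.intercept_, model.coef_)    # estimates of β0 and β1
    print(model.predict([[6]]))             # prediction for a new X
    # Multiple linear regression: the same call with more columns in X.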
Logistic Regression
Models the probability of a binary outcome using a logistic function
Uses the sigmoid function σ(z) = 1 / (1 + e^(-z)) to map any score to the 0-1 range
Threshold (commonly 0.5): Probabilities at or above it are assigned the positive class (see the sketch below)
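A minimal sketch of the sigmoid and the classification threshold (the example scores are invented):

    import numpy as np

    def sigmoid(z):
        # Maps any real-valued score into the (0, 1) range
        return 1 / (1 + np.exp(-z))

    scores = np.array([-3.0, -0.5, 0.0, 0.8, 2.5])   # linear scores β0 + β1·x
    probs = sigmoid(scores)
    labels = (probs >= 0.5).astype(int)    # 0.5 is the usual threshold

    print(probs)     # approx. [0.047 0.378 0.5 0.69 0.924]
    print(labels)    # [0 0 1 1 1]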
Regularization in Regression
Purpose: Simplify models and prevent overfitting by penalizing complexity
Methods:
Lasso Regression (L1 penalty): Shrinks the coefficients of less significant predictors exactly to zero, effectively removing them
Ridge Regression (L2 penalty): Shrinks all coefficients toward zero without eliminating any (a comparison sketch follows below)
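A minimal sketch contrasting the two penalties on the same data (synthetic data; scikit-learn assumed): Lasso zeroes out the noise coefficients, Ridge only shrinks them.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    # Only the first two features matter; the other three are noise.
    y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

    lasso = Lasso(alpha=0.5).fit(X, y)    # alpha sets penalty strength
    ridge = Ridge(alpha=0.5).fit(X, y)

    print(lasso.coef_)    # noise coefficients typically end up exactly 0.0
    print(ridge.coef_)    # noise coefficients are small but non-zero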
Upcoming Topics
Next Week: What Makes Good Models Good and Bad Models Bad?
Focus on Model Evaluation and Interpretation