AI Isn't Magic: Supervised Learning (2024)

Introduction

  • Data Analytics & Machine Learning Week 4: AI isn’t magic but it’s okay if it feels like it

  • Educators: Aimée Backiel, Kenric Borgelioen, Daan Nijs

  • Course: Data Analytics and Machine Learning

  • Program: BCS 2024 - 2025 (Toegepaste Informatica)

Semester Course Schedule

  1. Introduction

  2. Bringing the right equipment for the data adventure

    • Variable types and summary statistics

    • Data Visualization

    • Probability and Statistics

  3. AI isn’t magic but it’s okay if it feels like it

    • Data-driven decision making

    • Supervised Learning: linear and logistic regression

  4. Evaluating model quality: Good vs Bad models

    • Model evaluation and interpretation

  5. Cognitive processes in AI:

    • Decision trees

    • Neural networks

  6. AI pattern recognition:

    • Unsupervised learning

    • Reinforcement learning

  7. Review and Exam Preparation

Key Concepts for Today's Lesson

  • AI isn’t magic but it’s okay if it feels like it

  • Machine Learning Paradigms

    • Supervised Learning

    • Classification

    • Regression

Machine Learning Types

Supervised Learning

  • A process where a model is trained on inputs paired with labeled outputs

  • Types:

    • Classification: Predicting discrete classes (e.g., yes/no decisions)

    • Regression: Predicting continuous output values

Classification Techniques

  • Given predictor variables, predict a discrete class y (e.g., will a loan be repaid: yes/no)

Regression Techniques

  • Given predictor variables, predict a numeric output (e.g., months until a loan is repaid)
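The distinction can be sketched on a toy version of the loan example. The dataset below (income, debt, and the two targets) is made up for illustration; the model choices are just one option for each task:

```python
# Minimal sketch contrasting the two supervised tasks on tiny,
# hypothetical loan data (features: income, debt).
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[50, 10], [80, 5], [20, 30], [90, 2], [30, 25], [70, 8]]
repaid = [1, 1, 0, 1, 0, 1]       # classification target: repaid yes/no
months = [12, 6, 36, 4, 30, 8]    # regression target: months until repayment

clf = LogisticRegression().fit(X, repaid)   # predicts a discrete class
reg = LinearRegression().fit(X, months)     # predicts a continuous value

new_applicant = [[60, 7]]
print("class:", clf.predict(new_applicant)[0])    # a label: 0 or 1
print("months:", reg.predict(new_applicant)[0])   # a continuous number
```

Same predictors, different target type: the classifier returns a label, the regressor a number.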

Supervised Learning Techniques

  • Classification:

    • K-Nearest Neighbors (KNN)

    • Naïve Bayes

    • Decision Tree

    • Random Forest

    • Logistic Regression

    • Support Vector Machines

    • Artificial Neural Network

  • Regression:

    • Linear Regression

    • Non-Linear Regression

    • K-Nearest Neighbors Regression

    • Decision Trees Regression

    • Support Vector Regression

    • Artificial Neural Network Regression

K-Nearest Neighbors (KNN)

  • Lazy Learner: No explicit training phase; predictions are computed at query time from the distances between the new instance and the stored training instances

  • Predicting Class:

    • K=1: Uses the closest neighbor

    • K>1: Majority class among neighbors

  • Distance Metrics:

    • Hamming Distance: Suitable for binary variables

    • Euclidean Distance: Suitable for continuous variables

    • Manhattan Distance: Suitable for grid-like structures

    • Chebyshev Distance: Maximum absolute difference along any single dimension (a diagonal move costs the same as a straight move)
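The "lazy" prediction step above can be sketched from scratch. The 2-D points, labels, and choice of k below are illustrative assumptions, with Euclidean distance as the metric:

```python
# From-scratch KNN classification with Euclidean distance.
from collections import Counter
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, labels, query, k=3):
    # "Lazy": nothing is fit in advance; just sort stored points by distance
    ranked = sorted(zip(train, labels), key=lambda p: euclidean(p[0], query))
    top_k = [label for _, label in ranked[:k]]
    return Counter(top_k).most_common(1)[0][0]  # majority class among k neighbors

train = [(1, 1), (2, 1), (8, 8), (9, 9), (1, 2)]
labels = ["A", "A", "B", "B", "A"]
print(knn_predict(train, labels, query=(2, 2), k=3))  # "A": 3 nearest are all A
```

With k=1 the prediction is simply the label of the single closest point; swapping `euclidean` for another metric changes only the ranking function.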

Decision Trees

  • Structure: A tree of nodes connected by edges, with no cycles

    • Nodes: Root, Internal, Leaf

    • Edges: Connections from parent to child nodes

  • Key Metrics for Splitting:

    • Entropy: Measures the impurity of a node (0 means pure, i.e., all samples share one class)

    • Information Gain: Reduction in entropy after a split

    • Gini Index: Measures likelihood of misclassification

Random Forest

  • Combines multiple decision trees to improve predictions

  • Each tree is trained on a bootstrap sample of the data, and each split considers only a random subset of features, which reduces overfitting

  • Majority vote: For classification problems

  • Averaging: For regression problems
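A minimal sketch of these ideas with scikit-learn, using the built-in Iris dataset; the hyperparameters (100 trees, square-root feature subset) are illustrative defaults, not prescribed by the slides:

```python
# Random forest classification: bagged trees + majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample; each split considers sqrt(n_features) features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_tr, y_tr)

acc = forest.score(X_te, y_te)  # predictions are the majority vote of 100 trees
print(f"test accuracy: {acc:.2f}")
```

For regression, `RandomForestRegressor` replaces the majority vote with the average of the trees' predictions.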

Linear Regression

  • Simple Linear Regression: Predicts Y based on a single variable X

    • Equation: Y = β0 + β1·X + ε

  • Multiple Linear Regression: Predicts Y based on multiple variables X1, X2,..., Xp

    • Equation: Y = β0 + β1·X1 + ... + βp·Xp + ε
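The simple-regression coefficients can be estimated by least squares. The data below are synthetic so the true line is known (β0 = 2, β1 = 3 plus Gaussian noise, an assumption for the demo):

```python
# Simple linear regression: least-squares estimates of beta0 and beta1.
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
Y = 2 + 3 * X + rng.normal(0, 1, size=X.size)  # Y = beta0 + beta1*X + eps

# Closed-form estimates: beta1 = cov(X, Y) / var(X), beta0 = mean residual
beta1 = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
beta0 = Y.mean() - beta1 * X.mean()
print(round(beta0, 2), round(beta1, 2))  # close to the true 2 and 3
```

Multiple regression generalizes the same least-squares idea to several predictors at once.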

Logistic Regression

  • Models the probability of a binary outcome using a logistic function

  • Uses the sigmoid function to map predictions to a 0-1 range

  • Threshold: A cutoff probability (commonly 0.5) above which the output is classified as the positive class
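The sigmoid-plus-threshold step can be written out directly. The coefficients below are hypothetical, as if a logistic regression had already been fit:

```python
# Sigmoid function and threshold-based classification.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))  # maps any real z into the (0, 1) range

def classify(x, beta0, beta1, threshold=0.5):
    p = sigmoid(beta0 + beta1 * x)        # predicted probability of class 1
    return (1 if p >= threshold else 0), p

label, p = classify(x=4.0, beta0=-5.0, beta1=2.0)
print(label, round(p, 3))  # z = 3.0 -> p ~ 0.953 -> class 1
```

Raising the threshold above 0.5 makes the classifier more conservative about predicting the positive class.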

Regularization in Regression

  • Purpose: Simplify models and prevent overfitting by penalizing complexity

  • Methods:

    • Lasso Regression (L1 penalty): Can shrink less important coefficients exactly to zero, effectively performing feature selection

    • Ridge Regression (L2 penalty): Shrinks all coefficients toward zero without eliminating any of them
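A sketch of the contrast using scikit-learn, on synthetic data where only the first 2 of 10 features actually matter; the data-generating process and the penalty strength `alpha` are illustrative assumptions:

```python
# Lasso (L1) vs Ridge (L2) on data with 8 irrelevant features.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.5, size=200)  # features 2..9 irrelevant

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Lasso drives irrelevant coefficients exactly to zero; Ridge only shrinks them
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
```

This is why Lasso doubles as a feature-selection tool, while Ridge keeps every predictor but dampens its influence.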

Upcoming Topics

  • Next Week: What Makes Good Models Good and Bad Models Bad?

    • Focus on Model Evaluation and Interpretation