
Supervised Learning – Decision Trees, Random Forests, Support Vector Machines & Assignment 2

Quick Recap of Supervised Learning

  • Machine Learning (ML) core idea: enable a computer to learn from past data and use that learning to make future predictions.

    • Fundamental goal: reduce prediction error (via techniques such as gradient descent).

    • Interview-ready definition: “Keep updating the model parameters until the error is minimized, ideally reaching zero.”

  • Examples used to illustrate error minimization

    • Robot repeatedly moves toward a cake until distance becomes 0 (gradient descent analogy).

    • Neural network draws a separating line between red & blue classes until classification error approaches 0.

  • Training vs. testing split: typical splits and their implications (a code sketch follows this list)

    • 70% train / 30% test ⇒ higher accuracy (model sees more data during learning).

    • 50% train / 50% test ⇒ usually lower accuracy than the 70/30 split.

    • 90% train / 10% test ⇒ can further boost accuracy but risks over-fitting if data volume is limited.

    • Golden rule: train on large, test on small (but still representative) portions.

  • Feature-based learning: ML models ingest feature vectors, not raw signals.

    • Example: instead of raw EEG/ECG waveforms, feed extracted means, standard deviations, etc.
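
  • A minimal sketch of this split in code, assuming scikit-learn and its built-in Iris data (an illustrative choice, not part of the lecture):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)      # feature vectors + class labels

    # Hold out 30% of the rows for testing; the remaining 70% are used for training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    print(X_train.shape, X_test.shape)     # expected: (105, 4) (45, 4)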

Supervised vs. Unsupervised Learning (Context Setting)

  • Supervised learning: features ⇒ known target / class labels.

    • Example: Iris flower → features (sepal length, sepal width, …) + provided class.

    • Upcoming Assignment 2 is supervised – targets supplied.

  • Unsupervised learning (preview for next lecture): no class labels; algorithm clusters data automatically (e.g. color clusters red/yellow/brown without human tags).

Decision Tree Classifier

  • Concept & Structure- Root node (main question) ⇒ branches (internal nodes) ⇒ leaf nodes (final decisions).

    • Pure if–then–else logic; no explicit math inside the tree.

    • Tree pictured upside-down: root on top, leaves at bottom.

  • Simple vegetable example

    • Q1 (root): Is color red? → branch true vs. false.

    • Q2 (internal): Is diameter > 2 cm? → separates chilies from the other red vegetables.

    • Leaves now hold distinct vegetable classes.

  • Pokémon Go / WhatsApp / Snapchat example (a code sketch follows this list)

    1. Root: Age < 20?

    • True ⇒ Pokémon Go.

    • False ⇒ branch to gender.

    2. Gender node (Age >= 20)

    • Female ⇒ WhatsApp.

    • Male ⇒ Snapchat.

  • Strengths & Weaknesses- + Transparent, easy to interpret.

    • - High variance → performs well on training-like data but doesn’t generalize (cannot cope with new patterns such as >20-year-olds playing Pokémon Go).
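
  • A minimal sketch of the age/app example as a learned tree, assuming scikit-learn; the tiny dataset and the 0 = female / 1 = male encoding are invented for illustration:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Features: [age, gender], gender encoded as 0 = female, 1 = male.
    X = [[15, 0], [18, 1], [19, 0],   # under 20            -> Pokemon Go
         [25, 0], [30, 0],            # 20 or older, female -> WhatsApp
         [22, 1], [35, 1]]            # 20 or older, male   -> Snapchat
    y = ["PokemonGo", "PokemonGo", "PokemonGo",
         "WhatsApp", "WhatsApp",
         "Snapchat", "Snapchat"]

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Print the learned if-then-else rules (root -> internal nodes -> leaves).
    print(export_text(tree, feature_names=["age", "gender"]))
    print(tree.predict([[17, 1], [28, 0]]))   # expected: ['PokemonGo' 'WhatsApp']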

Random Forest Classifier

  • Intuition = "forest of many trees" (ensemble learning)

  • Construction steps (a code sketch follows this list)

    1. Start with the full dataset (e.g. 1000 feature vectors: 500 green, 500 red).

    2. Bootstrap/bagging: randomly sample (with replacement) smaller "bags" (e.g. 100 samples each).

    3. Train an independent decision tree on each bag.

    4. At inference, collect each tree’s vote; apply majority voting (classification) or averaging (regression).

  • Practical tips

    • Choose an odd number of trees (e.g. n = 9, 11, 15) to avoid vote ties.

    • More trees ⇒ usually better generalization (lower variance) but higher computation.

  • Real-life analogy: the lecturer phoned friends in India, Singapore, and Japan before a purchase and picked the brand with the majority recommendation ⇒ Random Forest reasoning.

  • Advantages

    • Handles large datasets, is robust to noise, and keeps bias low while reducing variance compared with a single tree.

    • Easy to implement (essentially just set the number of trees); still interpretable via feature importances or leaf inspection.
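
  • A minimal sketch of the bagging-and-voting idea, assuming scikit-learn and a synthetic dataset; 11 trees puts the "odd number" tip into practice:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each of the 11 trees is trained on its own bootstrap sample ("bag").
    forest = RandomForestClassifier(n_estimators=11, bootstrap=True, random_state=0)
    forest.fit(X_train, y_train)

    # Inspect the individual trees' votes for the first test sample; the forest's
    # prediction aggregates them (scikit-learn averages class probabilities,
    # which plays the role of the majority vote described above).
    votes = [int(t.predict(X_test[:1])[0]) for t in forest.estimators_]
    print("individual tree votes:", votes)
    print("forest prediction:    ", forest.predict(X_test[:1]))
    print("test accuracy:        ", forest.score(X_test, y_test))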

Support Vector Machine (SVM)

  • Goal: draw the optimal separating hyper-plane between classes.

    • Support vectors = data points closest to that plane.

    • Maximum-margin principle: place the hyper-plane so margin between support vectors of opposite classes is maximized.

  • Linear SVM: works when data are linearly separable (straight line or flat plane suffices).

  • Non-linear Reality & Kernel Trick1. Real-world data often not linearly separable (classes overlap in 2-D).

    2. Kernel trick: implicitly map data into a higher-dimensional space where separation is linear.

    • Example mapping phi(x, y) = (x, y, x*y) converts 2-D to 3-D; red & green points become separable by a plane on the new third coordinate, x*y = c.

    • 1-D example: original axis x → map to x^2 to unfold clusters, then find linear boundary in [x, x^2] space.

    3. Common kernels

    • Polynomial: K(u, v) = (u*v + c)^d, where u*v is the dot product and d the degree.

    • Radial Basis Function (RBF / Gaussian): K(u,v) = exp(-( ||u-v||^2 ) / (2 * sigma^2) )

      • sigma^2 controls "width"; smaller sigma^2 ⇒ narrower peaks, larger ⇒ broader influence.

  • Strengths

    • Often highly accurate; works well on both small and large datasets.

    • Memory efficient (depends only on support vectors).

    • Over-fitting resistant via margin maximization and kernel options.

  • Combining with Random Forest: use Random Forest to produce informative leaf-based features, then feed those features into an SVM for final classification – leverages RF’s feature discovery + SVM’s powerful separation.
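
  • A minimal sketch of the kernel-trick mapping described above (phi(x, y) = (x, y, x*y)), assuming scikit-learn; the XOR-style data, where the class depends on the sign of x*y, is invented for illustration:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    XY = rng.uniform(-1, 1, size=(400, 2))
    labels = (XY[:, 0] * XY[:, 1] > 0).astype(int)   # class = sign of the product x*y

    # Linear SVM on the raw 2-D points: roughly chance-level, no straight line works.
    print("linear, 2-D:", SVC(kernel="linear").fit(XY, labels).score(XY, labels))

    # Same linear SVM after the explicit map (x, y) -> (x, y, x*y): near-perfect,
    # because a plane on the new third coordinate now separates the two classes.
    XY3 = np.column_stack([XY, XY[:, 0] * XY[:, 1]])
    print("linear, 3-D:", SVC(kernel="linear").fit(XY3, labels).score(XY3, labels))

    # The RBF kernel performs an implicit mapping of this kind automatically;
    # scikit-learn's gamma corresponds to 1 / (2 * sigma^2) in the formula above.
    print("RBF, 2-D:   ", SVC(kernel="rbf", gamma=1.0).fit(XY, labels).score(XY, labels))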

Comparison of the Three Classifiers

  • Decision Tree: simple, interpretable, high variance.

  • Random Forest: ensemble of trees, lowers variance, generally strong performance.

  • SVM: often best-in-class thanks to the kernel trick; especially good for complex, overlapping data (a side-by-side sketch follows this list).
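
  • A quick side-by-side sketch, assuming scikit-learn; the breast-cancer dataset and 5-fold cross-validation are illustrative choices:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(n_estimators=101, random_state=0),
        # SVMs are scale-sensitive, so standardize the features first.
        "SVM (RBF)":     make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name:13s} mean CV accuracy = {scores.mean():.3f}")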

Upcoming & Related Topics

  • Principal Component Analysis (PCA) and K-Nearest Neighbors (KNN) will be covered next Tuesday (unsupervised focus).

  • Later weeks: bias-variance trade-off, over-/under-fitting, advanced kernel engineering.

Assignment 2 Overview

  • Dataset columns

    • Input features: Age, Sex, BMI, Children, Smoker, Region.

    • Target: Charges (continuous monetary value) ⇒ regression task.

  • Pre-processing pipeline

    1. Missing-value check

    • If dataset is large and row has missing cell ⇒ drop row.

    • If dataset is small ⇒ impute (e.g. fill with column mean or replicate plausible value).

    2. Exploratory Data Analysis (EDA) – choose at least one of:

    • Correlation heat-map (identify linear relations among features & target).

    • Feature computation / extraction (e.g. Random-Forest feature importance, statistical summaries).

    • Principal Component Analysis (PCA) for dimensionality reduction + variance explanation.

    3. Visualization – scatter plots, box plots, pair-plots, feature-importance bars.

  • Modelling requirements: implement and compare at least these regression models (an end-to-end code sketch follows this list):

    1. Linear Regression.

    2. Decision Tree Regressor.

    3. Random Forest Regressor.

    4. Support Vector Regressor (SVR).

    • Evaluate with suitable metrics (e.g. R^2, Mean Absolute Error, RMSE).

    • Report and discuss which model achieves highest predictive performance and why.

  • Afternoon lab session will walk through: data cleaning, code demonstrations, visualization, metric interpretation.
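
  • A minimal end-to-end sketch of the assignment pipeline, assuming scikit-learn/pandas, a CSV named insurance.csv, and lower-case column names (the filename and exact column names are assumptions; adjust them to the file actually provided):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor

    df = pd.read_csv("insurance.csv")                # assumed filename
    df = df.dropna()                                 # large dataset -> drop rows with missing cells
    print(df.corr(numeric_only=True)["charges"])     # quick EDA: correlation with the target

    # Column names ("charges", "sex", "smoker", "region", ...) are assumed lower-case.
    X = pd.get_dummies(df.drop(columns="charges"), drop_first=True)  # encode categorical columns
    y = df["charges"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    models = {
        "Linear Regression": LinearRegression(),
        "Decision Tree":     DecisionTreeRegressor(random_state=0),
        "Random Forest":     RandomForestRegressor(n_estimators=101, random_state=0),
        # SVR is scale-sensitive; it may also benefit from scaling the target.
        "SVR (RBF)":         make_pipeline(StandardScaler(), SVR()),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        print(f"{name:18s} R2 = {r2_score(y_test, pred):.3f}  "
              f"MAE = {mean_absolute_error(y_test, pred):.1f}  RMSE = {rmse:.1f}")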

Practical Tips & Reminders

  • Always work with features (not raw images/signals) for both training and inference.

  • For Random Forest / ensemble approaches, odd number of estimators simplifies majority voting.

  • When engineering SVM kernels, small tweaks to sigma^2 (RBF) or the degree d (polynomial) can drastically change decision boundaries (see the sketch after this list).

  • Decision trees & forests can also supply feature-importance scores useful for EDA or as inputs to other models.

  • Ethical note (implicit): ensure models generalize fairly – a classifier over-fitted to one demographic (e.g. Pokémon Go usage) may misclassify others; Random Forest & SVM help mitigate but validation on diverse data is crucial.
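
  • A minimal sketch of that sensitivity, assuming scikit-learn and the make_moons toy data; GridSearchCV simply tries the kernel-parameter combinations and reports the best (gamma corresponds to 1 / (2 * sigma^2) for the RBF kernel):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

    # Try a range of RBF widths and polynomial degrees; cross-validated accuracy
    # changes noticeably as these kernel parameters change.
    param_grid = [
        {"kernel": ["rbf"],  "gamma": [0.01, 0.1, 1, 10, 100]},
        {"kernel": ["poly"], "degree": [2, 3, 4, 5]},
    ]
    search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

    print("best parameters: ", search.best_params_)
    print("best CV accuracy:", round(search.best_score_, 3))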

Q&A Highlights

  • Can we design custom kernels? – Yes, any valid positive-definite function works, but standard kernels + parameter tuning usually suffice.

  • Combine RF & SVM? – Yes; RF leaves ⇒ features ⇒ SVM often yields superior separation.

  • Handling many features? – Feed feature subsets to individual RF trees (ensembling), or apply PCA to reduce dimensionality before the SVM.

  • Slides already uploaded under Week 8; next lecture will deepen PCA + bias-variance concepts.
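
  • A minimal sketch of the RF-leaves-as-features idea from the Q&A above, assuming scikit-learn; the dataset and tree settings are illustrative only:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=0)
    forest.fit(X_train, y_train)

    # forest.apply() returns the leaf index each tree assigns to every sample
    # (one column per tree); one-hot encoding those indices gives the new features.
    encoder = OneHotEncoder(handle_unknown="ignore")
    train_leaves = encoder.fit_transform(forest.apply(X_train))
    test_leaves = encoder.transform(forest.apply(X_test))

    svm = SVC(kernel="linear").fit(train_leaves, y_train)
    print("RF-leaf features + linear SVM accuracy:", svm.score(test_leaves, y_test))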