Supervised Learning – Decision Trees, Random Forests, Support Vector Machines & Assignment 2
Quick Recap of Supervised Learning
Machine Learning (ML) core idea: enable a computer to learn from past data and use that learning to make predictions on future data.
Fundamental goal: reduce prediction error (via techniques such as gradient descent).
Interview-ready definition: “Keep updating the model parameters until the error becomes minimal, or ideally 0.”
Examples used to illustrate error minimization:
A robot repeatedly moves toward a cake until the distance becomes 0 (gradient-descent analogy).
Neural network draws a separating line between red & blue classes until classification error approaches 0.
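To make the “keep reducing the error” idea concrete, here is a minimal gradient-descent sketch; the quadratic loss, learning rate, and toy data are illustrative assumptions, not from the lecture:

```python
# Minimal gradient-descent sketch: fit y = w * x to toy data by nudging w
# in the direction that reduces the mean squared error at each step.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                        # underlying relationship: w = 2

w = 0.0                            # start from a deliberately wrong parameter
lr = 0.01                          # learning rate (step size)

for step in range(200):
    error = w * x - y              # prediction error
    grad = 2 * np.mean(error * x)  # derivative of the mean squared error w.r.t. w
    w -= lr * grad                 # move against the gradient

print(round(w, 3))                 # approaches 2.0 as the error shrinks toward 0
```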
Training vs. testing split: typical splits and their implications
70% train / 30% test ⇒ typically higher accuracy (the model sees more data during learning).
50% train / 50% test ⇒ usually lower accuracy, since less data is available for training.
90% train / 10% test ⇒ can boost accuracy further, but risks over-fitting and an unreliable test estimate if the data volume is limited.
Golden rule: train on the larger portion, test on a smaller (but still representative) portion.
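A minimal sketch of such a split with scikit-learn’s train_test_split; the Iris data here is just a convenient stand-in:

```python
# 70% train / 30% test split; change test_size to 0.5 or 0.1 to try
# the other splits discussed above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)
```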
Feature-based learning: ML models ingest feature vectors, not raw signals (e.g. instead of raw EEG/ECG waveforms, feed extracted means, standard deviations, etc.).
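As a rough illustration, a synthetic 1-D signal (a hypothetical stand-in for an EEG/ECG trace) can be reduced to a handful of summary features:

```python
# Turn a raw signal into a small feature vector (mean, std, min, max)
# instead of feeding the raw samples to the model.
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 10, 500)) + 0.1 * rng.standard_normal(500)

features = np.array([signal.mean(), signal.std(), signal.min(), signal.max()])
print(features)   # 4 numbers summarizing 500 raw samples
```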
Supervised vs. Unsupervised Learning (Context Setting)
Supervised learning: features ⇒ known target / class labels.
Example: Iris flower → features (sepal length, sepal width, …) + provided class.
Upcoming Assignment 2 is supervised – targets supplied.
Unsupervised learning (preview for next lecture): no class labels; the algorithm clusters data automatically (e.g. color clusters red/yellow/brown without human tags).
Decision Tree Classifier
Concept & structure: root node (main question) ⇒ branches (internal nodes) ⇒ leaf nodes (final decisions).
Pure if–then–else logic; no explicit math inside the tree.
Tree pictured upside-down: root on top, leaves at bottom.
Simple vegetable example:
Q1 (root): Is the color red? → branch into true vs. false.
Q2 (internal): Is the diameter > 2 cm? → separates out the chilies.
Leaves now hold distinct vegetable classes.
Pokémon Go / WhatsApp / Snapchat example
1. Root: Age < 20?
True ⇒ Pokémon Go.
False ⇒ branch to gender.
2. Gender node (Age >= 20)
Female ⇒ WhatsApp.
Male ⇒ Snapchat.
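A minimal sketch of this example as a scikit-learn decision tree; the six-row table is invented purely to mirror the rule above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [age, is_female] (1 = female, 0 = male) -- made-up rows
X = [[15, 0], [18, 1], [25, 1], [30, 1], [27, 0], [40, 0]]
y = ["Pokemon Go", "Pokemon Go", "WhatsApp", "WhatsApp", "Snapchat", "Snapchat"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "is_female"]))  # the learned if-then-else rules
print(tree.predict([[19, 0], [35, 1]]))   # -> ['Pokemon Go', 'WhatsApp']
```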
Strengths & weaknesses
+ Transparent, easy to interpret.
- High variance → performs well on training-like data but doesn’t generalize (cannot cope with new patterns such as >20-year-olds playing Pokémon Go).
Random Forest Classifier
Intuition = "forest of many trees" (ensemble learning)
Construction steps
1. Start with the full dataset (e.g. 1000 feature vectors: 500 green, 500 red).
2. Bootstrap / bagging: randomly sample (with replacement) smaller "bags" (e.g. 100 samples each).
3. Train an independent decision tree on each bag.
4. At inference, collect each tree’s vote; apply majority voting (classification) or averaging (regression).
Practical tips
Choose an odd number of trees (e.g. n = 9, 11, 15) to avoid vote ties in binary classification.
More trees ⇒ usually better generalization (lower variance) but higher computation.
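A minimal sketch of bagging with scikit-learn’s RandomForestClassifier; the synthetic dataset stands in for the 1000 feature vectors mentioned above, and n_estimators = 11 follows the odd-number tip:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 1000-vector, two-class dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=11,    # 11 trees, each trained on its own bootstrap "bag"
    bootstrap=True,     # sample with replacement (bagging)
    random_state=0,
).fit(X_train, y_train)

# Each tree votes; predict() / score() use the majority class.
print("test accuracy:", forest.score(X_test, y_test))
```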
Real-life analogy: the lecturer phoned friends in India, Singapore, and Japan before a purchase and picked the brand with the majority recommendation ⇒ Random Forest reasoning.
Advantages
Handles large datasets, robust to noise, and shows lower variance than a single tree while keeping bias low.
Easy to implement (just set the number of trees); still interpretable via feature importance or leaf inspection.
Support Vector Machine (SVM)
Goal: draw the optimal separating hyper-plane between classes.
Support vectors = the data points closest to that plane.
Maximum-margin principle: place the hyper-plane so margin between support vectors of opposite classes is maximized.
Linear SVM: works when the data are linearly separable (a straight line or flat plane suffices).
Non-linear reality & the kernel trick
1. Real-world data are often not linearly separable (classes overlap in 2-D).
2. Kernel trick: implicitly map the data into a higher-dimensional space where the separation becomes linear.
3. Example mapping phi(x, y) = (x, y, x*y) converts 2-D points to 3-D; the red & green points become separable by a plane x*y = c.
4. 1-D example: map the original axis x to x^2 to unfold the clusters, then find a linear boundary in (x, x^2) space.
Common kernels
Polynomial: K(u, v) = (u·v + c)^d
Radial Basis Function (RBF / Gaussian): K(u, v) = exp(-||u - v||^2 / (2 * sigma^2))
sigma^2 controls the "width": smaller sigma^2 ⇒ narrower peaks, larger ⇒ broader influence.
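A minimal sketch of the kernel trick with scikit-learn’s SVC on data a straight line cannot separate (two concentric circles); note that scikit-learn exposes the RBF width through gamma, which plays the role of 1 / (2 * sigma^2):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in 2-D.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.08, random_state=0)

print("linear kernel:", SVC(kernel="linear").fit(X, y).score(X, y))

# Larger gamma (smaller sigma) -> narrower peaks, more local decision boundary.
for gamma in (0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    print(f"rbf, gamma={gamma}: accuracy={clf.score(X, y):.2f}, "
          f"support vectors={len(clf.support_vectors_)}")
```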
Strengths
Often highly accurate; works well on both small and large datasets.
Memory efficient: the decision function depends only on the support vectors.
Resistant to over-fitting thanks to margin maximization, with kernels adding flexibility.
Combining with Random Forest: use the Random Forest to produce informative leaf-based features, then feed those features into an SVM for final classification – this leverages the RF’s feature discovery plus the SVM’s powerful separation (see the sketch below).
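One way this combination can be sketched (an assumed recipe, not the lecturer’s code): take the leaf index each sample lands in for every tree, one-hot encode those indices, and train an SVM on the resulting features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=12, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

forest = RandomForestClassifier(n_estimators=25, random_state=1).fit(X_train, y_train)

# apply() gives, for every sample, the index of the leaf it reaches in each tree.
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_train = encoder.fit_transform(forest.apply(X_train))
leaf_test = encoder.transform(forest.apply(X_test))

svm = SVC(kernel="rbf").fit(leaf_train, y_train)
print("SVM on RF leaf features, test accuracy:", svm.score(leaf_test, y_test))
```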
Comparison of the Three Classifiers
Decision Tree: simple, interpretable, high variance.
Random Forest: ensemble of trees, lowers variance, generally strong performance.
SVM: often best-in-class thanks to kernel trick; especially good for complex, overlapping data.
Upcoming & Related Topics
Principal Component Analysis (PCA) and K-Nearest Neighbors (KNN) will be covered next Tuesday (unsupervised focus).
Later weeks: bias-variance trade-off, over-/under-fitting, advanced kernel engineering.
Assignment 2 Overview
Dataset columns
Input features: Age, Sex, BMI, Children, Smoker, Region.
Target: Charges (continuous monetary value) ⇒ regression task.
Pre-processing pipeline
1. Missing-value check
If the dataset is large and a row has a missing cell ⇒ drop the row.
If the dataset is small ⇒ impute instead (e.g. fill with the column mean or another plausible value).
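A minimal pandas sketch of both strategies; the file name insurance.csv and the exact column spellings are assumptions to be adapted to the actual assignment file:

```python
import pandas as pd

df = pd.read_csv("insurance.csv")     # assumed file name for the assignment data

print(df.isna().sum())                # step 1: count missing cells per column

# Large dataset: simply drop incomplete rows.
df_dropped = df.dropna()

# Small dataset: impute instead, e.g. fill numeric columns with their mean.
df_imputed = df.copy()
for col in ["Age", "BMI", "Children", "Charges"]:   # assumed column names
    df_imputed[col] = df_imputed[col].fillna(df_imputed[col].mean())
```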
Exploratory Data Analysis (EDA) – choose at least one of:
Correlation heat-map (identify linear relations among features & target).
Feature computation / extraction (e.g. Random-Forest feature importance, statistical summaries).
Principal Component Analysis (PCA) for dimensionality reduction + variance explanation.
Visualization – scatter plots, box plots, pair-plots, feature-importance bars.
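A minimal sketch of the correlation heat-map option, which doubles as a visualization (again, file name and column names are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("insurance.csv")

# Correlation among the numeric features and the target.
numeric_cols = ["Age", "BMI", "Children", "Charges"]
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of numeric features with Charges")
plt.tight_layout()
plt.show()
```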
Modelling requirements
Implement and compare at least these regression models:
Linear Regression.
Decision Tree Regressor.
Random Forest Regressor.
Support Vector Regressor (SVR).
Evaluate with suitable metrics (e.g. R^2, Mean Absolute Error, RMSE).
Report and discuss which model achieves highest predictive performance and why.
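A minimal sketch of the required comparison; the file name and column names are assumptions, categorical columns are one-hot encoded so every model receives numeric inputs, and SVR in particular usually benefits from feature scaling (omitted here for brevity):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("insurance.csv")
X = pd.get_dummies(df.drop(columns=["Charges"]), drop_first=True)  # encode Sex/Smoker/Region
y = df["Charges"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=101, random_state=42),
    "SVR (RBF)": SVR(kernel="rbf"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: R^2={r2_score(y_test, pred):.3f}, "
          f"MAE={mean_absolute_error(y_test, pred):.0f}, RMSE={rmse:.0f}")
```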
Afternoon lab session will walk through: data cleaning, code demonstrations, visualization, metric interpretation.
Practical Tips & Reminders
Always work with features (not raw images/signals) for both training and inference.
For Random Forest / ensemble approaches, an odd number of estimators simplifies majority voting.
When kernel engineering for SVM, small tweaks to sigma^2 (RBF) or degree d (polynomial) can drastically change decision boundaries.
Decision trees & forests can also supply feature-importance scores useful for EDA or as inputs to other models (a short sketch follows this list).
Ethical note (implicit): ensure models generalize fairly – a classifier over-fitted to one demographic (e.g. Pokémon Go usage) may misclassify others; Random Forest & SVM help mitigate but validation on diverse data is crucial.
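A minimal sketch of reading feature-importance scores from a fitted forest; the data are synthetic, but the same attribute can be inspected on the assignment’s insurance features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=51, random_state=0).fit(X, y)

# One importance score per feature; higher = more useful for splitting.
for i, score in enumerate(forest.feature_importances_):
    print(f"feature {i}: importance = {score:.3f}")
```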
Q&A Highlights
Can we design custom kernels? – Yes, any valid positive-definite function works, but standard kernels + parameter tuning usually suffice.
Combine RF & SVM? – Yes; RF leaves ⇒ features ⇒ SVM often yields superior separation.
Handling many features? – Feed feature subsets to individual RF trees and ensemble them, or use PCA to reduce dimensionality before the SVM.
Slides already uploaded under Week 8; next lecture will deepen PCA + bias-variance concepts.