
Supervised Learning – Decision Trees, Random Forests, Support Vector Machines & Assignment 2

Quick Recap of Supervised Learning

  • Machine Learning (ML) core idea: enable a computer to learn from past data and use that learning to make future predictions.

    • Fundamental goal: reduce prediction error (via techniques such as gradient descent).

    • Interview-ready definition: “Keep updating the model parameters until the error is minimized, ideally reaching zero.”

  • Examples used to illustrate error minimization

    • Robot repeatedly moves toward a cake until distance becomes 0 (gradient descent analogy).

    • Neural network draws a separating line between red & blue classes until classification error approaches 0.

  • Training vs. testing split: typical splits and their implications (a code sketch follows this list)

    • 70% train / 30% test ⇒ higher accuracy (model sees more data during learning).

    • 50% train / 50% test ⇒ usually lower accuracy than the 70/30 split.

    • 90% train / 10% test ⇒ can further boost accuracy but risks over-fitting if data volume is limited.

    • Golden rule: train on large, test on small (but still representative) portions.

  • Feature-based learning: ML models ingest feature vectors, not raw signals.

    • Example: instead of raw EEG/ECG waveforms, feed extracted means, standard deviations, etc.
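
  • A minimal sketch of this split in code, assuming scikit-learn and its built-in Iris data (an illustrative choice, not part of the lecture):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)      # feature vectors + class labels

    # Hold out 30% of the rows for testing; the remaining 70% are used for training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    print(X_train.shape, X_test.shape)     # expected: (105, 4) (45, 4)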

Supervised vs. Unsupervised Learning (Context Setting)

  • Supervised learning: features ⇒ known target / class labels.

    • Example: Iris flower → features (sepal length, sepal width, …) + provided class.

    • Upcoming Assignment 2 is supervised – targets supplied.

  • Unsupervised learning (preview for next lecture): no class labels; algorithm clusters data automatically (e.g. color clusters red/yellow/brown without human tags).

Decision Tree Classifier

  • Concept & Structure- Root node (main question) ⇒ branches (internal nodes) ⇒ leaf nodes (final decisions).

    • Pure if–then–else logic; no explicit math inside the tree.

    • Tree pictured upside-down: root on top, leaves at bottom.

  • Simple vegetable example

    • Q1 (root): Is color red? → branch true vs. false.

    • Q2 (internal): Is diameter > 2 cm? → separates chilies from the other red vegetables.

    • Leaves now hold distinct vegetable classes.

  • Pokémon Go / WhatsApp / Snapchat example (a code sketch follows this list)

    1. Root: Age < 20?

    • True ⇒ Pokémon Go.

    • False ⇒ branch to gender.

    2. Gender node (Age >= 20)

    • Female ⇒ WhatsApp.

    • Male ⇒ Snapchat.

  • Strengths & Weaknesses- + Transparent, easy to interpret.

    • - High variance → performs well on training-like data but doesn’t generalize (cannot cope with new patterns such as >20-year-olds playing Pokémon Go).
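
  • A minimal sketch of the age/app example as a learned tree, assuming scikit-learn; the tiny dataset and the 0 = female / 1 = male encoding are invented for illustration:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Features: [age, gender], gender encoded as 0 = female, 1 = male.
    X = [[15, 0], [18, 1], [19, 0],   # under 20            -> Pokemon Go
         [25, 0], [30, 0],            # 20 or older, female -> WhatsApp
         [22, 1], [35, 1]]            # 20 or older, male   -> Snapchat
    y = ["PokemonGo", "PokemonGo", "PokemonGo",
         "WhatsApp", "WhatsApp",
         "Snapchat", "Snapchat"]

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Print the learned if-then-else rules (root -> internal nodes -> leaves).
    print(export_text(tree, feature_names=["age", "gender"]))
    print(tree.predict([[17, 1], [28, 0]]))   # expected: ['PokemonGo' 'WhatsApp']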

Random Forest Classifier

  • Intuition = "forest of many trees" (ensemble learning)

  • Construction steps (a code sketch follows this list)

    1. Start with the full dataset (e.g. 1000 feature vectors: 500 green, 500 red).

    2. Bootstrap/bagging: randomly sample (with replacement) smaller "bags" (e.g. 100 samples each).

    3. Train an independent decision tree on each bag.

    4. At inference, collect each tree’s vote; apply majority voting (classification) or averaging (regression).

  • Practical tips

    • Choose an odd number of trees (e.g. n = 9, 11, 15) to avoid vote ties.

    • More trees ⇒ usually better generalization (lower variance) but higher computation.

  • Real-life analogy: the lecturer phoned friends in India, Singapore, and Japan before a purchase and picked the brand with the majority recommendation ⇒ Random Forest reasoning.

  • Advantages

    • Handles large datasets, is robust to noise, and keeps bias low while reducing variance compared with a single tree.

    • Easy to implement (essentially just set the number of trees); still interpretable via feature importances or leaf inspection.
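
  • A minimal sketch of the bagging-and-voting idea, assuming scikit-learn and a synthetic dataset; 11 trees puts the "odd number" tip into practice:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Each of the 11 trees is trained on its own bootstrap sample ("bag").
    forest = RandomForestClassifier(n_estimators=11, bootstrap=True, random_state=0)
    forest.fit(X_train, y_train)

    # Inspect the individual trees' votes for the first test sample; the forest's
    # prediction aggregates them (scikit-learn averages class probabilities,
    # which plays the role of the majority vote described above).
    votes = [int(t.predict(X_test[:1])[0]) for t in forest.estimators_]
    print("individual tree votes:", votes)
    print("forest prediction:    ", forest.predict(X_test[:1]))
    print("test accuracy:        ", forest.score(X_test, y_test))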

Support Vector Machine (SVM)

  • Goal: draw the optimal separating hyper-plane between classes.

    • Support vectors = data points closest to that plane.

    • Maximum-margin principle: place the hyper-plane so margin between support vectors of opposite classes is maximized.

  • Linear SVM: works when data are linearly separable (straight line or flat plane suffices).

  • Non-linear Reality & Kernel Trick1. Real-world data often not linearly separable (classes overlap in 2-D).

    2. Kernel trick: implicitly map data into a higher-dimensional space where separation is linear.

    • Example mapping phi(x, y) = (x, y, x*y) converts 2-D to 3-D; red & green points become separable by a plane on the new third coordinate, x*y = c.

    • 1-D example: original axis x → map to x^2 to unfold clusters, then find linear boundary in [x, x^2] space.

    3. Common kernels

    • Polynomial: K(u, v) = (u*v + c)^d, where u*v is the dot product and d the degree.

    • Radial Basis Function (RBF / Gaussian): K(u,v) = exp(-( ||u-v||^2 ) / (2 * sigma^2) )

      • sigma^2 controls "width"; smaller sigma^2 ⇒ narrower peaks, larger ⇒ broader influence.

  • Strengths

    • Often highly accurate; works well on both small and large datasets.

    • Memory efficient (depends only on support vectors).

    • Over-fitting resistant via margin maximization and kernel options.

  • Combining with Random Forest: use Random Forest to produce informative leaf-based features, then feed those features into an SVM for final classification – leverages RF’s feature discovery + SVM’s powerful separation.
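
  • A minimal sketch of the kernel-trick mapping described above (phi(x, y) = (x, y, x*y)), assuming scikit-learn; the XOR-style data, where the class depends on the sign of x*y, is invented for illustration:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    XY = rng.uniform(-1, 1, size=(400, 2))
    labels = (XY[:, 0] * XY[:, 1] > 0).astype(int)   # class = sign of the product x*y

    # Linear SVM on the raw 2-D points: roughly chance-level, no straight line works.
    print("linear, 2-D:", SVC(kernel="linear").fit(XY, labels).score(XY, labels))

    # Same linear SVM after the explicit map (x, y) -> (x, y, x*y): near-perfect,
    # because a plane on the new third coordinate now separates the two classes.
    XY3 = np.column_stack([XY, XY[:, 0] * XY[:, 1]])
    print("linear, 3-D:", SVC(kernel="linear").fit(XY3, labels).score(XY3, labels))

    # The RBF kernel performs an implicit mapping of this kind automatically;
    # scikit-learn's gamma corresponds to 1 / (2 * sigma^2) in the formula above.
    print("RBF, 2-D:   ", SVC(kernel="rbf", gamma=1.0).fit(XY, labels).score(XY, labels))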

Comparison of the Three Classifiers

  • Decision Tree: simple, interpretable, high variance.

  • Random Forest: ensemble of trees, lowers variance, generally strong performance.

  • SVM: often best-in-class thanks to the kernel trick; especially good for complex, overlapping data (a side-by-side sketch follows this list).
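
  • A quick side-by-side sketch, assuming scikit-learn; the breast-cancer dataset and 5-fold cross-validation are illustrative choices:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(n_estimators=101, random_state=0),
        # SVMs are scale-sensitive, so standardize the features first.
        "SVM (RBF)":     make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name:13s} mean CV accuracy = {scores.mean():.3f}")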

Upcoming & Related Topics

  • Principal Component Analysis (PCA) and K-Nearest Neighbors (KNN) will be covered next Tuesday (unsupervised focus).

  • Later weeks: bias-variance trade-off, over-/under-fitting, advanced kernel engineering.

Assignment 2 Overview

  • Dataset columns

    • Input features: Age, Sex, BMI, Children, Smoker, Region.

    • Target: Charges (continuous monetary value) ⇒ regression task.

  • Pre-processing pipeline

    1. Missing-value check

    • If dataset is large and row has missing cell ⇒ drop row.

    • If dataset is small ⇒ impute (e.g. fill with column mean or replicate plausible value).

    2. Exploratory Data Analysis (EDA) – choose at least one of:

    • Correlation heat-map (identify linear relations among features & target).

    • Feature computation / extraction (e.g. Random-Forest feature importance, statistical summaries).

    • Principal Component Analysis (PCA) for dimensionality reduction + variance explanation.

    3. Visualization – scatter plots, box plots, pair-plots, feature-importance bars.

  • Modelling requirements: implement and compare at least these regression models (an end-to-end code sketch follows this list):

    1. Linear Regression.

    2. Decision Tree Regressor.

    3. Random Forest Regressor.

    4. Support Vector Regressor (SVR).

    • Evaluate with suitable metrics (e.g. R^2, Mean Absolute Error, RMSE).

    • Report and discuss which model achieves highest predictive performance and why.

  • Afternoon lab session will walk through: data cleaning, code demonstrations, visualization, metric interpretation.
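
  • A minimal end-to-end sketch of the assignment pipeline, assuming scikit-learn/pandas, a CSV named insurance.csv, and lower-case column names (the filename and exact column names are assumptions; adjust them to the file actually provided):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor

    df = pd.read_csv("insurance.csv")                # assumed filename
    df = df.dropna()                                 # large dataset -> drop rows with missing cells
    print(df.corr(numeric_only=True)["charges"])     # quick EDA: correlation with the target

    # Column names ("charges", "sex", "smoker", "region", ...) are assumed lower-case.
    X = pd.get_dummies(df.drop(columns="charges"), drop_first=True)  # encode categorical columns
    y = df["charges"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    models = {
        "Linear Regression": LinearRegression(),
        "Decision Tree":     DecisionTreeRegressor(random_state=0),
        "Random Forest":     RandomForestRegressor(n_estimators=101, random_state=0),
        # SVR is scale-sensitive; it may also benefit from scaling the target.
        "SVR (RBF)":         make_pipeline(StandardScaler(), SVR()),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        print(f"{name:18s} R2 = {r2_score(y_test, pred):.3f}  "
              f"MAE = {mean_absolute_error(y_test, pred):.1f}  RMSE = {rmse:.1f}")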

Practical Tips & Reminders

  • Always work with features (not raw images/signals) for both training and inference.

  • For Random Forest / ensemble approaches, odd number of estimators simplifies majority voting.

  • When engineering SVM kernels, small tweaks to sigma^2 (RBF) or the degree d (polynomial) can drastically change decision boundaries (see the sketch after this list).

  • Decision trees & forests can also supply feature-importance scores useful for EDA or as inputs to other models.

  • Ethical note (implicit): ensure models generalize fairly – a classifier over-fitted to one demographic (e.g. Pokémon Go usage) may misclassify others; Random Forest & SVM help mitigate but validation on diverse data is crucial.
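
  • A minimal sketch of that sensitivity, assuming scikit-learn and the make_moons toy data; GridSearchCV simply tries the kernel-parameter combinations and reports the best (gamma corresponds to 1 / (2 * sigma^2) for the RBF kernel):

    from sklearn.datasets import make_moons
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

    # Try a range of RBF widths and polynomial degrees; cross-validated accuracy
    # changes noticeably as these kernel parameters change.
    param_grid = [
        {"kernel": ["rbf"],  "gamma": [0.01, 0.1, 1, 10, 100]},
        {"kernel": ["poly"], "degree": [2, 3, 4, 5]},
    ]
    search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

    print("best parameters: ", search.best_params_)
    print("best CV accuracy:", round(search.best_score_, 3))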

Q&A Highlights

  • Can we design custom kernels? – Yes, any valid positive-definite function works, but standard kernels + parameter tuning usually suffice.

  • Combine RF & SVM? – Yes; RF leaves ⇒ features ⇒ SVM often yields superior separation.

  • Handling many features? – Feed feature subsets to individual RF trees (ensembling), or apply PCA to reduce dimensionality before the SVM.

  • Slides already uploaded under Week 8; next lecture will deepen PCA + bias-variance concepts.
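
  • A minimal sketch of the RF-leaves-as-features idea from the Q&A above, assuming scikit-learn; the dataset and tree settings are illustrative only:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=0)
    forest.fit(X_train, y_train)

    # forest.apply() returns the leaf index each tree assigns to every sample
    # (one column per tree); one-hot encoding those indices gives the new features.
    encoder = OneHotEncoder(handle_unknown="ignore")
    train_leaves = encoder.fit_transform(forest.apply(X_train))
    test_leaves = encoder.transform(forest.apply(X_test))

    svm = SVC(kernel="linear").fit(train_leaves, y_train)
    print("RF-leaf features + linear SVM accuracy:", svm.score(test_leaves, y_test))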