Decision Trees, Random Forests, SVMs & PCA – Lab Walk-Through and Assignment Tips
Quick Recap of Morning Theory
The morning lecture focused on Decision-Tree-based learning; the practice session now applies those ideas in code.
Topics flagged for later weeks (≈ Week 9):
Entropy
Information Gain
Decision Tree (DT)
Core idea: recursively split the feature space so that terminal leaves hold (ideally) single-class/constant targets.
Parts of the tree:
Root node – first split.
Internal nodes – intermediate splits.
Leaves – output predictions (class labels or continuous values).
Capable of both classification and regression.
Code demo uses DecisionTreeRegressor(max_depth=3, random_state=42).
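A minimal sketch of that kind of regressor demo, assuming the California-housing data and an 80/20 split (the variable names are illustrative):

# Minimal sketch: fit a shallow decision tree on the California housing data.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# max_depth=3 keeps the tree small; random_state=42 makes runs repeatable.
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Train R^2:", tree.score(X_train, y_train))
print("Test  R^2:", tree.score(X_test, y_test))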
Key weakness
High variance / poor generalisation: a deep, unpruned tree tends to over-fit, so performance varies widely across datasets.
random_state = 42
Setting the seed fixes pseudo-randomness so that repeated runs yield identical splits, bootstraps, etc.
Seen earlier in Week 6.
Random Forest (RF)
Concept = Ensemble of many DTs (a forest).
Two randomness sources → “Random”
Bagging – each tree sees a bootstrap sample.
Feature Sub-sampling – at each split choose k features at random (commonly k = sqrt(p) for classification, k = p/3 for regression).
Prediction aggregation
Classification: majority vote across the trees.
Regression: mean (sometimes median).
Generally yields strong baseline performance; often best of classical methods.
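A hedged sketch of the two randomness sources in code; the parameter values are illustrative choices rather than the lecture's, and X_train/y_train come from the decision-tree sketch above:

# Bagging (bootstrap=True) plus feature sub-sampling (max_features) at each split.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,       # number of trees in the forest
    bootstrap=True,         # each tree sees a bootstrap sample of the rows
    max_features=1.0 / 3,   # ~p/3 features considered per split (regression rule of thumb)
    random_state=42,
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)   # regression: the forest averages the trees' outputs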
Support Vector Machine (SVM / SVR)
For two-class linearly separable data, SVM finds the maximum-margin hyperplane.
If data are not linearly separable in the original space:
Map to higher-dimensional space through a kernel K(xi, xj).
Examples
Polynomial kernel: K(u,v) = (u.v + c)^d with degree d = 2,3,4,….
Radial Basis Function (RBF): K(u,v) = exp(-gamma * ||u-v||^2).
Lecturer’s 1-D toy example: x |-> (x, x^2) lifts two intermingled classes onto a parabola in 2-D, where they become linearly separable.
Regression variant = SVR; same API in scikit-learn.
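A sketch of kernelised SVR, continuing from the earlier split; C, gamma and degree are untuned placeholder values, and in practice the inputs should be standardised first (see the workflow below):

# Kernelised support-vector regression; the polynomial kernel of degree 2
# mirrors the x -> (x, x^2) intuition, the RBF kernel is the common default.
from sklearn.svm import SVR

svr_rbf = SVR(kernel="rbf", C=1.0, gamma="scale")
svr_poly = SVR(kernel="poly", degree=2, C=1.0)

svr_rbf.fit(X_train, y_train)   # slow on large, unscaled data - scale first in practice
print("RBF SVR test R^2:", svr_rbf.score(X_test, y_test))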
Principal Component Analysis (PCA)
Use-case: Dimensionality Reduction / Feature Compression when raw feature count is large.
Idea: Orthogonal linear transform producing principal components (PCs) ordered by explained variance ratio (EVR).
Example discussion:
Original features = 1000.
After PCA you may keep top 10 PCs containing most variance, reducing training cost while retaining information.
In code demo: PCA(n_components=8) because the dataset has 8 features. EVR sample output:
PC1 approx 25.66%
PC2 approx 22.81%
PC3 approx 16.11%
PCs 6–8 explain < 7% each → often dropped.
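A sketch of the PCA step, assuming X_train_scaled / X_test_scaled are standard-scaled matrices (scaling is covered in the workflow below):

# Fit PCA on the scaled training features and inspect the explained-variance ratio.
from sklearn.decomposition import PCA

pca = PCA(n_components=8)                         # dataset has 8 features
X_train_pca = pca.fit_transform(X_train_scaled)   # fit only on the training set
X_test_pca = pca.transform(X_test_scaled)         # reuse the fitted PCA on test

for i, evr in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {evr:.2%}")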
Full Data-Science Workflow Shown
Load packages: numpy, pandas, matplotlib, seaborn, scikit-learn modules.
Fetch dataset – fetch_california_housing(). Create a pandas DataFrame for readability.
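A sketch of the loading step (the target column name is an assumption):

# Fetch the data and wrap it in a DataFrame for readability.
import pandas as pd
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
data = pd.DataFrame(housing.data, columns=housing.feature_names)
data["target"] = housing.target    # median house value
print(data.shape)                  # (20640, 9)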
Pre-processing
a. Missing values – data.isna().sum() → all zeros.
b. Duplicate rows – data.duplicated().sum() → 0.
c. Outlier detection via boxplots.
Outlier removal (IQR rule)
Compute quartiles: Q1, Q3.
Interquartile range: IQR = Q3 - Q1.
Keep rows satisfying: Q1 - 1.5 * IQR <= x <= Q3 + 1.5 * IQR.
Rows reduced from 20640 to 16312 (approx 4300 removed).
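A sketch of the IQR filter; the demo's exact column selection is not shown, so this version filters on every numeric column:

# IQR rule: flag rows with any value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

outlier_mask = ((data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)).any(axis=1)
data = data[~outlier_mask]    # '~' negates the mask, keeping the non-outlier rows
print(len(data))              # roughly 16k rows remain in the lecture example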
Exploratory Data Analysis (EDA)
seaborn.pairplot() for pairwise correlations.
seaborn.heatmap(corr) – visualises correlation matrix rho in [-1, 1].
Histograms per feature to view distributions & ranges.
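A sketch of those plots; sub-sampling for the pairplot is an assumption to keep it fast:

# Pairplot, correlation heatmap and per-feature histograms.
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(data.sample(1000, random_state=42))   # subsample: a full pairplot is slow
plt.show()

sns.heatmap(data.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()

data.hist(bins=30, figsize=(12, 8))
plt.show()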
Train–Test Split
train_test_split(test_size=0.2, random_state=42) => 80% train / 20% test.
Feature Scaling – StandardScaler() fitted on train, then used to transform test.
Feature-Selection Strategies
i. Correlation ranking: sort absolute correlations with the target. Top three in example: median_income, avg_rooms, house_age.
ii. Random-Forest feature importance: bar plot generated; choose variables with importance > 0.10 (see the sketch after this list).
iii. PCA: retain first 5 PCs based on cumulative EVR.
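A sketch combining the split, the scaling and the importance-based selection; the 0.10 threshold follows the note above, the rest is assumed:

# Split, scale (fit on train only), then rank features by RF importance.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = data.drop(columns="target")
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training split only
X_test_scaled = scaler.transform(X_test)         # reuse the same fitted scaler

rf = RandomForestRegressor(random_state=42).fit(X_train_scaled, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
selected = importances[importances > 0.10].index.tolist()   # keep importance > 0.10
print(importances)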
Modelling
Linear Regression (LinearRegression).
Decision-Tree Regressor (DecisionTreeRegressor).
Random-Forest Regressor (RandomForestRegressor).
Support Vector Regressor (SVR).
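A sketch fitting the four models on the same scaled features; hyper-parameters are defaults except where the notes state otherwise:

# Fit and score each regressor on the scaled train/test split.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=3, random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "SVR": SVR(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    print(f"{name}: test R^2 = {model.score(X_test_scaled, y_test):.3f}")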
Evaluation Metrics
R^2 = 1 - sum((y - y_hat)^2) / sum((y - y_bar)^2).
MSE = (1/n) * sum((y - y_hat)^2).
RMSE = sqrt(MSE).
MAE = (1/n) * sum(|y - y_hat|).
Note: Train metrics usually optimistic; Test metrics reflect generalisation.
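The same metrics via scikit-learn helpers, using one of the fitted models from the sketch above:

# R^2, MSE, RMSE and MAE on the held-out test split.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = models["Random Forest"].predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2={r2:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")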
Results Table (example)
DT (train) R^2 approx 1 -> over-fitting evident.
RF best balance on test data.
Insurance-Charges Assignment Hints
Dataset columns: age, bmi, children (numeric) plus sex, smoker, region (categorical) and target charges.
Categorical columns require encoding, not normalisation (e.g., One-Hot, LabelEncoder).
Numeric scaling optional (bmi, age, children).
Evaluate test split ratios: test_size = 0.1, 0.2, 0.3; a lower test share (e.g., 0.1) may slightly boost reported accuracy.
Many EDA plots are still possible (histograms, count plots, heatmap of the numeric subset).
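One possible encoding/scaling setup for the assignment; the CSV file name is hypothetical and the column lists follow the description above:

# One-hot encode the categoricals, optionally scale the numerics.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("insurance.csv")            # hypothetical file name
X = df.drop(columns="charges")
y = df["charges"]

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
    ("scale", StandardScaler(), ["age", "bmi", "children"]),   # scaling is optional
])
X_encoded = preprocess.fit_transform(X)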
Practical Coding Remarks & Troubleshooting
Copy–pasting code may break indentation or variable scope; watch for mismatched variable names.
The bitwise NOT (~) operator is used for boolean masking when dropping outliers.
Always transform test data with the same fitted scaler/PCA object used on train.
Validate shapes: after PCA with 5 components -> X_train_PCA.shape == (n_train, 5).
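A small guard against the pitfalls above, assuming X_train_scaled / X_test_scaled from the earlier workflow sketch:

# Fit transformers on train only, reuse them on test, and check shapes.
from sklearn.decomposition import PCA

pca = PCA(n_components=5)
X_train_PCA = pca.fit_transform(X_train_scaled)   # fit only on the training data
X_test_PCA = pca.transform(X_test_scaled)         # same fitted PCA object on test

assert X_train_PCA.shape == (X_train_scaled.shape[0], 5)
assert X_test_PCA.shape == (X_test_scaled.shape[0], 5)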
Ethical & Practical Considerations Briefly Touched
Large models (e.g., SVM with 1000 features) are computationally expensive -> motivates PCA.
Bias/Variance trade-off: DT over-fits (low bias, high variance); RF averages out variance.
Reproducibility: using fixed random seeds ensures results can be audited and replicated.
Take-Home Checklist
[ ] Clean data: handle NA, duplicates, outliers.
[ ] Explore distributions & correlations visually.
[ ] Decide feature set via correlation, feature-importance or PCA.
[ ] Scale numeric features.
[ ] Split data (train/test) with fixed random_state.
[ ] Train multiple models (LR, DT, RF, SVR).
[ ] Compute R^2, MSE, RMSE, MAE for both splits.
[ ] Compare and justify best model choice.
[ ] Document parameters (e.g., max_depth, n_estimators, kernel, C, gamma) for reproducibility.
[ ] Reflect on generalisation, over-fitting signs, and computational efficiency.
End of consolidated notes – can serve as standalone study guide for the video’s content.