
Decision Trees, Random Forests, SVMs & PCA – Lab Walk-Through and Assignment Tips

Quick Recap of Morning Theory

  • Morning lecture focused on Decision-Tree based learning; practice session now applies those ideas in code.

  • Topics flagged for later weeks (≈ Week 9):

    • Entropy

    • Information Gain


Decision Tree (DT)

  • Core idea: recursively split the feature space so that terminal leaves hold (ideally) single-class or constant targets.

  • Parts of the tree:

    • Root node – first split.

    • Internal nodes – intermediate splits.

    • Leaves – output predictions (class labels or continuous values).

  • Handles both classification and regression:

    • Code demo uses DecisionTreeRegressor(max_depth=3, random_state=42); a minimal sketch follows this list.

  • Key weakness

    • Low bias but high variance: a single tree tends to over-fit and generalise poorly, and performance varies widely across datasets.
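
  A minimal sketch of the demo regressor, assuming the California-housing data used later in these notes (the split settings below are illustrative, not necessarily the lecturer's exact code):

    # Fit the depth-3 decision tree from the demo and compare train vs. test R^2.
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X_train, y_train)
    print("train R^2:", tree.score(X_train, y_train))
    print("test  R^2:", tree.score(X_test, y_test))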


random_state = 42

  • Setting the seed fixes pseudo-randomness so that repeated runs yield identical splits, bootstraps, etc.

  • Seen earlier in Week 6.
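
  A tiny reproducibility check, assuming scikit-learn's train_test_split (the toy arrays below are made up purely for illustration):

    # With a fixed random_state, the split is identical on every run.
    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)
    y = np.arange(10)

    a, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
    b, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
    print((a == b).all())   # True: the same rows land in the training set both times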


Random Forest (RF)

  • Concept = Ensemble of many DTs (a forest).

  • Two randomness sources → “Random”

  1. Bagging – each tree sees a bootstrap sample.

  2. Feature Sub-sampling – at each split choose k features at random (commonly k = sqrt(p) for classification, k = p/3 for regression).

  • Prediction aggregation

    • Classification: majority vote across the trees.

    • Regression: mean (sometimes median).

  • Generally yields strong baseline performance; often the best of the classical methods compared here (see the sketch below).
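
  A sketch wiring up the two randomness sources described above; n_estimators=100 and max_features=1/3 are illustrative choices, not confirmed demo settings:

    # Random Forest = bagged trees + per-split feature sub-sampling; prediction = mean over trees.
    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    rf = RandomForestRegressor(
        n_estimators=100,    # number of trees in the forest
        bootstrap=True,      # bagging: each tree is trained on a bootstrap sample
        max_features=1/3,    # fraction of features considered at each split (~p/3 rule of thumb)
        random_state=42,
    )
    rf.fit(X_train, y_train)
    print("test R^2:", rf.score(X_test, y_test))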


Support Vector Machine (SVM / SVR)

  • For two-class linearly separable data, SVM finds the maximum-margin hyperplane.

  • If data are not linearly separable in the original space:

    • Map to higher-dimensional space through a kernel K(xi, xj).

    • Examples

    • Polynomial kernel: K(u,v) = (u.v + c)^d with degree d = 2,3,4,….

    • Radial Basis Function (RBF): K(u,v) = exp(-gamma * ||u-v||^2).

    • Lecturer’s 1-D toy example: x |-> (x, x^2) lifts the points onto a parabola in 2-D, where the two intermingled classes become linearly separable.

  • Regression variant = SVR; same API in scikit-learn (see the sketch below).
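
  A sketch of SVR with the RBF kernel on standardised features; C=1.0 and gamma="scale" are library defaults, not values confirmed in the lecture (swap in kernel="poly", degree=3 for the polynomial kernel):

    # SVR pipeline: scale features, then fit an RBF-kernel support vector regressor.
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR

    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, gamma="scale"))
    svr.fit(X_train, y_train)
    print("test R^2:", svr.score(X_test, y_test))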


Principal Component Analysis (PCA)

  • Use-case: Dimensionality Reduction / Feature Compression when raw feature count is large.

  • Idea: Orthogonal linear transform producing principal components (PCs) ordered by explained variance ratio (EVR).

  • Example discussion:

    • Original features = 1000.

    • After PCA you might keep only the top 10 PCs that carry most of the variance, reducing training cost while retaining most of the information.

  • In the code demo: PCA(n_components=8) because the dataset has 8 features. Sample EVR output (a sketch reproducing this follows the list):

    • PC1 approx 25.66%

    • PC2 approx 22.81%

    • PC3 approx 16.11%

    • PCs 6–8 explain < 7% each → often dropped.
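
  A sketch reproducing the EVR inspection; the exact percentages depend on preprocessing (e.g., whether outliers were removed first), so they may not match the figures above exactly:

    # PCA on the 8 standardised California-housing features; inspect explained variance ratios.
    import numpy as np
    from sklearn.datasets import fetch_california_housing
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = fetch_california_housing(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)

    pca = PCA(n_components=8)
    pca.fit(X_scaled)

    evr = pca.explained_variance_ratio_
    print(np.round(evr * 100, 2))             # per-component % of variance
    print(np.round(np.cumsum(evr) * 100, 2))  # cumulative %, used to decide how many PCs to keep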


Full Data-Science Workflow Shown

  1. Load packages

    • numpy, pandas, matplotlib, seaborn, scikit-learn modules.

  2. Fetch the dataset: fetch_california_housing().

  3. Create pandas DataFrame for readability.

  4. Pre-processing

    • a. Missing values: data.isna().sum() → all zeros.

    • b. Duplicate rows: data.duplicated().sum() → 0.

    • c. Outlier detection via boxplots.

  5. Outlier removal (IQR rule)

    • Compute quartiles: Q1, Q3.

    • Interquartile range: IQR = Q3 - Q1.

    • Keep rows satisfying: Q1 - 1.5 * IQR <= x <= Q3 + 1.5 * IQR.

    • Rows reduced from 20640 to 16312 (approx 4300 removed).

  6. Exploratory Data Analysis (EDA)

    • seaborn.pairplot() for pairwise correlations.

    • seaborn.heatmap(corr) – visualises correlation matrix rho in [-1,1].

    • Histograms per feature to view distributions & ranges.

  7. Train–Test Split

    • train_test_split(test_size=0.2, random_state=42) => 80% train / 20% test.

  8. Feature Scaling: fit StandardScaler() on the training set, then transform both train and test.

  9. Feature-Selection Strategies

    • i. Correlation ranking: sort absolute correlations with target.

    • Top three in example: median_income, avg_rooms, house_age.

    • ii. Random-Forest feature importance: bar-plot generated; choose variables with importance > 0.10.

    • iii. PCA: retain first 5 PCs based on cumulative EVR.

  10. Modelling

    • Linear Regression (LinearRegression).

    • Decision-Tree Regressor (DecisionTreeRegressor).

    • Random-Forest Regressor (RandomForestRegressor).

    • Support Vector Regressor (SVR).

  11. Evaluation Metrics

    • R^2 = 1 - sum((y - y_hat)^2) / sum((y - y_bar)^2).

    • MSE = (1/n) * sum((y - y_hat)^2).

    • RMSE = sqrt(MSE).

    • MAE = (1/n) * sum(|y - y_hat|).

    • Note: Train metrics usually optimistic; Test metrics reflect generalisation.

  12. Results Table (example)

    • DT (train) R^2 approx 1 -> over-fitting evident.

    • RF gives the best balance on the test data (an end-to-end sketch of the full workflow follows this list).
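
  An end-to-end sketch of the workflow in one script; column names follow scikit-learn's California-housing frame, and the hyper-parameters are illustrative defaults rather than the exact demo settings (scores and row counts will vary with your preprocessing choices):

    import numpy as np
    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor

    # Steps 2-3: load the data as a DataFrame (8 features + MedHouseVal target).
    data = fetch_california_housing(as_frame=True).frame

    # Step 4: missing values and duplicates (both expected to be 0 here).
    print(data.isna().sum().sum(), data.duplicated().sum())

    # Step 5: IQR rule applied to every column; '~' inverts the outlier mask.
    Q1, Q3 = data.quantile(0.25), data.quantile(0.75)
    IQR = Q3 - Q1
    outlier_mask = ((data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)).any(axis=1)
    data = data[~outlier_mask]

    # Step 7: train/test split.
    X = data.drop(columns="MedHouseVal")
    y = data["MedHouseVal"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Step 8: fit the scaler on the training set only, transform both splits.
    scaler = StandardScaler().fit(X_train)
    X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

    # Steps 10-11: train the four models and report test metrics.
    models = {
        "LR":  LinearRegression(),
        "DT":  DecisionTreeRegressor(random_state=42),
        "RF":  RandomForestRegressor(n_estimators=100, random_state=42),
        "SVR": SVR(kernel="rbf"),
    }
    for name, model in models.items():
        model.fit(X_train_s, y_train)
        pred = model.predict(X_test_s)
        mse = mean_squared_error(y_test, pred)
        print(name,
              "R^2 =", round(r2_score(y_test, pred), 3),
              "RMSE =", round(float(np.sqrt(mse)), 3),
              "MAE =", round(mean_absolute_error(y_test, pred), 3))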


Insurance-Charges Assignment Hints

  • Dataset columns: age, bmi, children (numeric) plus sex, smoker, region (categorical) and target charges.

  • Categorical columns require encoding, not normalisation (e.g., one-hot encoding, LabelEncoder); see the sketch after this list.

  • Numeric scaling optional (bmi, age, children).

  • Try several test-split ratios (test_size = 0.1, 0.2, 0.3); a smaller test share (e.g., 0.1) may slightly boost the reported test score.

  • Many EDA plots still possible (histograms, count-plots, heatmap of numeric subset).
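
  A sketch for the assignment's encoding and split steps; the file name insurance.csv is an assumption (point it at your own copy), and pd.get_dummies stands in for whichever encoder you prefer:

    # Encode categorical columns; numeric columns (age, bmi, children) pass through unchanged.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("insurance.csv")   # columns: age, sex, bmi, children, smoker, region, charges

    X = pd.get_dummies(df.drop(columns="charges"),
                       columns=["sex", "smoker", "region"], drop_first=True)
    y = df["charges"]

    # Compare the suggested split ratios.
    for ts in (0.1, 0.2, 0.3):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=ts, random_state=42)
        print(ts, X_train.shape, X_test.shape)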


Practical Coding Remarks & Troubleshooting

  • Copy-pasting code may break indentation or variable scope; watch for mismatched variable names.

  • The bitwise NOT operator (~) is used to invert the boolean mask when dropping outliers.

  • Always transform test data with the same fitted scaler/PCA object used on train.

  • Validate shapes: after PCA with 5 components, X_train_PCA.shape should be (n_train, 5) (see the sketch below).
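
  A sketch illustrating the last two points: reuse the scaler and PCA fitted on the training set when transforming the test set, then sanity-check the resulting shapes (n_components=5 mirrors the example above):

    from sklearn.datasets import fetch_california_housing
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    scaler = StandardScaler().fit(X_train)                   # fit on train only
    pca = PCA(n_components=5).fit(scaler.transform(X_train))

    X_train_PCA = pca.transform(scaler.transform(X_train))
    X_test_PCA = pca.transform(scaler.transform(X_test))     # same fitted objects, never refit on test

    assert X_train_PCA.shape == (X_train.shape[0], 5)
    assert X_test_PCA.shape == (X_test.shape[0], 5)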


Ethical & Practical Considerations Briefly Touched

  • Large models (e.g., SVM with 1000 features) are computationally expensive -> motivates PCA.

  • Bias/Variance trade-off: DT over-fits (low bias, high variance); RF averages out variance.

  • Reproducibility: using fixed random seeds ensures results can be audited and replicated.


Take-Home Checklist

  • [ ] Clean data: handle NA, duplicates, outliers.

  • [ ] Explore distributions & correlations visually.

  • [ ] Decide feature set via correlation, feature-importance or PCA.

  • [ ] Scale numeric features.

  • [ ] Split data (train/test) with fixed random_state.

  • [ ] Train multiple models (LR, DT, RF, SVR).

  • [ ] Compute R^2, MSE, RMSE, MAE for both splits.

  • [ ] Compare and justify best model choice.

  • [ ] Document parameters (e.g., max_depth, n_estimators, kernel, C, gamma) for reproducibility.

  • [ ] Reflect on generalisation, over-fitting signs, and computational efficiency.

End of consolidated notes – they can serve as a standalone study guide for the video’s content.