Decision Trees, Random Forests, SVMs & PCA – Lab Walk-Through and Assignment Tips
Quick Recap of Morning Theory
The morning lecture focused on Decision-Tree-based learning; the practice session now applies those ideas in code.
Topics flagged for later weeks (≈ Week 9):
Entropy
Information Gain
Decision Tree (DT)
Core idea: recursively split the feature space so that terminal leaves hold (ideally) single-class/constant targets.
Parts of the tree:
Root node – first split.
Internal nodes – intermediate splits.
Leaves – output predictions (class labels or continuous values).
Capable of both classification and regression.
Code demo uses DecisionTreeRegressor(max_depth=3, random_state=42).
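A minimal sketch of that kind of regressor demo, assuming the California-housing data and an 80/20 split (the variable names are illustrative):

# Minimal sketch: fit a shallow decision tree on the California housing data.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# max_depth=3 keeps the tree small; random_state=42 makes runs repeatable.
tree = DecisionTreeRegressor(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("Train R^2:", tree.score(X_train, y_train))
print("Test  R^2:", tree.score(X_test, y_test))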
Key weakness
High variance / poor generalisation: a deep, unpruned tree tends to over-fit, so performance varies widely across datasets.
random_state = 42
Setting the seed fixes pseudo-randomness so that repeated runs yield identical splits, bootstraps, etc.
Seen earlier in Week 6.
Random Forest (RF)
Concept = Ensemble of many DTs (a forest).
Two randomness sources → “Random”
Bagging – each tree sees a bootstrap sample.
Feature Sub-sampling – at each split choose k features at random (commonly k = sqrt(p) for classification, k = p/3 for regression).
Prediction aggregation
Classification: majority vote across the trees.
Regression: mean (sometimes median).
Generally yields strong baseline performance; often best of classical methods.
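A hedged sketch of the two randomness sources in code; the parameter values are illustrative choices rather than the lecture's, and X_train/y_train come from the decision-tree sketch above:

# Bagging (bootstrap=True) plus feature sub-sampling (max_features) at each split.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,       # number of trees in the forest
    bootstrap=True,         # each tree sees a bootstrap sample of the rows
    max_features=1.0 / 3,   # ~p/3 features considered per split (regression rule of thumb)
    random_state=42,
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)   # regression: the forest averages the trees' outputs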
Support Vector Machine (SVM / SVR)
For two-class linearly separable data, SVM finds the maximum-margin hyperplane.
If data are not linearly separable in the original space:
Map to higher-dimensional space through a kernel K(xi, xj).
Examples
Polynomial kernel: K(u,v) = (u.v + c)^d with degree d = 2,3,4,….
Radial Basis Function (RBF): K(u,v) = exp(-gamma * ||u-v||^2).
Lecturer’s 1-D toy example: x |-> (x, x^2) lifts two intermingled classes onto a parabola in 2-D, where they become linearly separable.
Regression variant = SVR; same API in scikit-learn.
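A sketch of kernelised SVR, continuing from the earlier split; C, gamma and degree are untuned placeholder values, and in practice the inputs should be standardised first (see the workflow below):

# Kernelised support-vector regression; the polynomial kernel of degree 2
# mirrors the x -> (x, x^2) intuition, the RBF kernel is the common default.
from sklearn.svm import SVR

svr_rbf = SVR(kernel="rbf", C=1.0, gamma="scale")
svr_poly = SVR(kernel="poly", degree=2, C=1.0)

svr_rbf.fit(X_train, y_train)   # slow on large, unscaled data - scale first in practice
print("RBF SVR test R^2:", svr_rbf.score(X_test, y_test))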
Principal Component Analysis (PCA)
Use-case: Dimensionality Reduction / Feature Compression when raw feature count is large.
Idea: Orthogonal linear transform producing principal components (PCs) ordered by explained variance ratio (EVR).
Example discussion:
Original features = 1000.
After PCA you may keep top 10 PCs containing most variance, reducing training cost while retaining information.
In code demo: PCA(n_components=8) because the dataset has 8 features. EVR sample output:
PC1 approx 25.66%
PC2 approx 22.81%
PC3 approx 16.11%
PCs 6–8 explain < 7% each → often dropped.
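A sketch of the PCA step, assuming X_train_scaled / X_test_scaled are standard-scaled matrices (scaling is covered in the workflow below):

# Fit PCA on the scaled training features and inspect the explained-variance ratio.
from sklearn.decomposition import PCA

pca = PCA(n_components=8)                         # dataset has 8 features
X_train_pca = pca.fit_transform(X_train_scaled)   # fit only on the training set
X_test_pca = pca.transform(X_test_scaled)         # reuse the fitted PCA on test

for i, evr in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"PC{i}: {evr:.2%}")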
Full Data-Science Workflow Shown
Load packages: numpy, pandas, matplotlib, seaborn, scikit-learn modules.
Fetch dataset – fetch_california_housing(). Create a pandas DataFrame for readability.
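A sketch of the loading step (the target column name is an assumption):

# Fetch the data and wrap it in a DataFrame for readability.
import pandas as pd
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
data = pd.DataFrame(housing.data, columns=housing.feature_names)
data["target"] = housing.target    # median house value
print(data.shape)                  # (20640, 9)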
Pre-processing
a. Missing values – data.isna().sum() → all zeros.
b. Duplicate rows – data.duplicated().sum() → 0.
c. Outlier detection via boxplots.
Outlier removal (IQR rule)
Compute quartiles: Q1, Q3.
Interquartile range: IQR = Q3 - Q1.
Keep rows satisfying: Q1 - 1.5 * IQR <= x <= Q3 + 1.5 * IQR.
Rows reduced from 20640 to 16312 (approx 4300 removed).
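A sketch of the IQR filter; the demo's exact column selection is not shown, so this version filters on every numeric column:

# IQR rule: flag rows with any value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

outlier_mask = ((data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)).any(axis=1)
data = data[~outlier_mask]    # '~' negates the mask, keeping the non-outlier rows
print(len(data))              # roughly 16k rows remain in the lecture example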
Exploratory Data Analysis (EDA)
seaborn.pairplot() for pairwise correlations.
seaborn.heatmap(corr) – visualises correlation matrix rho in [-1, 1].
Histograms per feature to view distributions & ranges.
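A sketch of those plots; sub-sampling for the pairplot is an assumption to keep it fast:

# Pairplot, correlation heatmap and per-feature histograms.
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(data.sample(1000, random_state=42))   # subsample: a full pairplot is slow
plt.show()

sns.heatmap(data.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()

data.hist(bins=30, figsize=(12, 8))
plt.show()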
Train–Test Split
train_test_split(test_size=0.2, random_state=42) => 80% train / 20% test.
Feature Scaling – StandardScaler() fitted on train, then used to transform test.
Feature-Selection Strategies
i. Correlation ranking: sort absolute correlations with the target. Top three in example: median_income, avg_rooms, house_age.
ii. Random-Forest feature importance: bar plot generated; choose variables with importance > 0.10 (see the sketch after this list).
iii. PCA: retain first 5 PCs based on cumulative EVR.
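A sketch combining the split, the scaling and the importance-based selection; the 0.10 threshold follows the note above, the rest is assumed:

# Split, scale (fit on train only), then rank features by RF importance.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = data.drop(columns="target")
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training split only
X_test_scaled = scaler.transform(X_test)         # reuse the same fitted scaler

rf = RandomForestRegressor(random_state=42).fit(X_train_scaled, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
selected = importances[importances > 0.10].index.tolist()   # keep importance > 0.10
print(importances)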
Modelling
Linear Regression (LinearRegression).
Decision-Tree Regressor (DecisionTreeRegressor).
Random-Forest Regressor (RandomForestRegressor).
Support Vector Regressor (SVR).
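A sketch fitting the four models on the same scaled features; hyper-parameters are defaults except where the notes state otherwise:

# Fit and score each regressor on the scaled train/test split.
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=3, random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "SVR": SVR(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    print(f"{name}: test R^2 = {model.score(X_test_scaled, y_test):.3f}")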
Evaluation Metrics
R^2 = 1 - sum((y - y_hat)^2) / sum((y - y_bar)^2).
MSE = (1/n) * sum((y - y_hat)^2).
RMSE = sqrt(MSE).
MAE = (1/n) * sum(|y - y_hat|).
Note: Train metrics usually optimistic; Test metrics reflect generalisation.
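The same metrics via scikit-learn helpers, using one of the fitted models from the sketch above:

# R^2, MSE, RMSE and MAE on the held-out test split.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = models["Random Forest"].predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2={r2:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")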
Results Table (example)
DT (train) R^2 approx 1 -> over-fitting evident.
RF best balance on test data.
Insurance-Charges Assignment Hints
Dataset columns: age, bmi, children (numeric) plus sex, smoker, region (categorical) and target charges.
Categorical columns require encoding, not normalisation (e.g., One-Hot, LabelEncoder).
Numeric scaling optional (bmi, age, children).
Evaluate test split ratios: test_size = 0.1, 0.2, 0.3; a lower test share (e.g., 0.1) may slightly boost reported accuracy.
Many EDA plots are still possible (histograms, count plots, heatmap of the numeric subset).
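One possible encoding/scaling setup for the assignment; the CSV file name is hypothetical and the column lists follow the description above:

# One-hot encode the categoricals, optionally scale the numerics.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("insurance.csv")            # hypothetical file name
X = df.drop(columns="charges")
y = df["charges"]

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(drop="first"), ["sex", "smoker", "region"]),
    ("scale", StandardScaler(), ["age", "bmi", "children"]),   # scaling is optional
])
X_encoded = preprocess.fit_transform(X)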
Practical Coding Remarks & Troubleshooting
Copy–pasting code may break indentation or variable scope; watch for mismatched variable names.
The bitwise NOT (~) operator is used for boolean masking when dropping outliers.
Always transform test data with the same fitted scaler/PCA object used on train.
Validate shapes: after PCA with 5 components -> X_train_PCA.shape == (n_train, 5).
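A small guard against the pitfalls above, assuming X_train_scaled / X_test_scaled from the earlier workflow sketch:

# Fit transformers on train only, reuse them on test, and check shapes.
from sklearn.decomposition import PCA

pca = PCA(n_components=5)
X_train_PCA = pca.fit_transform(X_train_scaled)   # fit only on the training data
X_test_PCA = pca.transform(X_test_scaled)         # same fitted PCA object on test

assert X_train_PCA.shape == (X_train_scaled.shape[0], 5)
assert X_test_PCA.shape == (X_test_scaled.shape[0], 5)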
Ethical & Practical Considerations Briefly Touched
Large models (e.g., SVM with 1000 features) are computationally expensive -> motivates PCA.
Bias/Variance trade-off: DT over-fits (low bias, high variance); RF averages out variance.
Reproducibility: using fixed random seeds ensures results can be audited and replicated.
Take-Home Checklist
[ ] Clean data: handle NA, duplicates, outliers.
[ ] Explore distributions & correlations visually.
[ ] Decide feature set via correlation, feature-importance or PCA.
[ ] Scale numeric features.
[ ] Split data (train/test) with fixed random_state.
[ ] Train multiple models (LR, DT, RF, SVR).
[ ] Compute R^2, MSE, RMSE, MAE for both splits.
[ ] Compare and justify best model choice.
[ ] Document parameters (e.g., max_depth, n_estimators, kernel, C, gamma) for reproducibility.
[ ] Reflect on generalisation, over-fitting signs, and computational efficiency.
End of consolidated notes – can serve as standalone study guide for the video’s content.