Applied Machine Learning – Comprehensive Bullet-Point Notes
Course Information & Context
- Institution: BITS-Pilani – Work Integrated Learning Programmes (WILP)
- Core paper: ZG568 – “Introduction / Applied Machine Learning” (multi-campus offering, flipped-classroom pedagogy)
- Authors & editors: Dr Sugata Ghosal, Dr Rama Satish K V, Prof Brahma Naidu, Team-AI, etc.
- Delivery model
• Recorded videos + live contact sessions (CS-xx)
• Jupyter/Colab demos; local Anaconda install recommended
• Attendance counted only if logged in until session end and interactive
- Principal textbooks / references
• T1 Hands-On ML (A. Géron 2e/3e)
• T2 Tan, Steinbach, Kumar – Data Mining
• R1 Interpretable ML (C. Molnar)
• R2 P. Domingos – “Few Useful Things …”
Modular Structure (10 Modules)
- M1 Introduction → definitions, types, challenges
- M2 & M3 “Big Picture” → End-to-End pipeline (problem framing → EDA → model → deploy)
- M4 Linear Prediction (LR, GD, regularisation, bias/variance)
- M5–M6 Classification I/II (LogReg, SVM, NB, DT, Ensemble)
- M7 Unsupervised (PCA, k-means, EM, apps)
- M8–M9 NN & Deep Nets (MLP, CNN, RNN, apps)
- M10 FAccT ML (Fairness, Accountability, Transparency, Robustness)
Machine Learning: What & Why
- ML = algorithms \mathcal{A} that improve performance P on task T with experience E.
- Traditional programming vs ML pipeline: classically, rules + data → answers; in ML, data + answers → rules (the program and the data swap roles)
- Typical tasks & examples
• Classification, Regression, Sequence-decision (RL)
• Spam filter, OCR, medical imaging, autonomous driving, VQA, credit-fraud, recommender, GAN image generation, speech-ASR
Framing an ML Problem (Housing-price demo)
- Steps: business objective → choose supervised multivariate regression → pick metric (RMSE, MAE; see the sketch after this list) → verify assumptions (batch vs online, instance-based vs model-based)
- Key elements of a full project:
 1. Framing
 2. Data types
 3. Pre-processing
 4. EDA / visualisation
 5. Feature engineering
 6. Model build / test
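A minimal sketch of the metric step above, using scikit-learn on made-up price arrays (all values are placeholders, not course data):

```python
# Minimal sketch: computing RMSE and MAE with scikit-learn.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([200_000, 350_000, 150_000])   # actual house prices (made up)
y_pred = np.array([210_000, 330_000, 160_000])   # model predictions (made up)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalises large errors more
mae = mean_absolute_error(y_true, y_pred)           # more robust to outliers
print(f"RMSE={rmse:.0f}, MAE={mae:.0f}")
```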
Data Types & Representation
- Attribute categories: Nominal, Ordinal, Interval, Ratio; Continuous vs Discrete
- Data containers: record table, data-matrix, document-term, transactions, graphs, sequences, spatio-temporal
- Important characteristics: dimensionality, sparsity, resolution, size
Data Pre-processing
- Quality issues: insufficient data, non-representative samples, noise/outliers, missing values, irrelevant features
- Remedies: cleaning, imputation, feature engineering (selection & extraction), regularisation
- Transformation ops: aggregation, sampling, binning, scaling, encoding, dimensionality reduction (PCA/SVD, t-SNE); note the curse of dimensionality (see the pipeline sketch below)
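A minimal sketch of these remedies as a scikit-learn pipeline; the column names (income, rooms, ocean_proximity) are hypothetical placeholders echoing the housing demo:

```python
# Minimal sketch: imputation, scaling and encoding in one ColumnTransformer.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_cols = ["income", "rooms"]      # numeric attributes (placeholder names)
cat_cols = ["ocean_proximity"]      # nominal attribute (placeholder name)

num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardise numeric features
])
pre = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # encode categories
])
# X_prepared = pre.fit_transform(df)  # df: a pandas DataFrame with these columns
```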
Summary Statistics & Proximity
- Location: mean \mu, median, percentiles (p-tile)
- Spread: range, variance \sigma^2, std, MAD
- Distances (see the sketch after this list)
• Euclidean d_2(x,y)=\sqrt{\sum_k (x_k-y_k)^2}
• Minkowski d_r: Manhattan (r=1), Chebyshev (r\to\infty)
• Mahalanobis d_M=\sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}
• Cosine similarity \cos(\theta)=\frac{x\cdot y}{\|x\|\,\|y\|}; correlation \rho
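A minimal sketch of these proximity measures via scipy.spatial.distance; the vectors and the identity inverse-covariance matrix are placeholders:

```python
# Minimal sketch: the distance measures above, via NumPy/SciPy.
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

print(distance.euclidean(x, y))        # d_2
print(distance.minkowski(x, y, p=1))   # Manhattan (r=1)
print(distance.chebyshev(x, y))        # r -> infinity
print(distance.cosine(x, y))           # 1 - cosine similarity

VI = np.eye(3)                         # placeholder inverse covariance
print(distance.mahalanobis(x, y, VI))  # d_M with Sigma^{-1} = VI
```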
Visualisation Cheatsheet
- 1-D: histogram, boxplot
- 2-D/3-D: scatter, heatmap, contour, pair-plot
- Matrix/correlation plots for high-dimensional data (see the sketch below)
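A minimal sketch of the 1-D and 2-D plots above, using matplotlib on random placeholder data:

```python
# Minimal sketch: histogram, boxplot and scatter side by side.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x, y = rng.normal(size=500), rng.normal(size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(x, bins=30)       # 1-D distribution shape
axes[1].boxplot(x)             # spread, quartiles, outliers
axes[2].scatter(x, y, s=5)     # 2-D relationship
plt.show()
```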
Supervised Learning
Regression (Linear)
- Model \hat{y}=w^T x+b; cost J=\frac{1}{2m}\sum_i (\hat{y}_i-y_i)^2
- Closed-form (normal equation) w=(X^TX)^{-1}X^Ty
- GD update w := w-\eta\,\nabla J; variants: batch, mini-batch, SGD (see the NumPy sketch below)
- Regularisation: Ridge +\lambda\|w\|_2^2, Lasso +\lambda\|w\|_1, early stopping
- Bias–variance decomposition E[(y-\hat{f})^2]=\text{Bias}^2+\text{Var}+\sigma^2
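A minimal NumPy sketch contrasting the normal equation with batch GD on toy data; the true weights [4, 3] are made up for illustration:

```python
# Minimal sketch: closed-form solution vs. batch gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(size=(100, 1))]   # bias column + one feature
y = 4 + 3 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Normal equation: w = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Batch GD: w := w - eta * grad J, with J = (1/2m) sum (Xw - y)^2
w, eta = np.zeros(2), 0.1
for _ in range(2000):
    grad = (X.T @ (X @ w - y)) / len(y)
    w -= eta * grad
print(w_closed, w)   # both should land near [4, 3]
```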
Classification
- Logistic regression \sigma(z)=1/(1+e^{-z}); predict \hat y=1 when P(y=1|x)\ge 0.5
- Naïve Bayes: P(y|x)\propto P(y)\prod_i P(x_i|y); conditional-independence assumption; Laplace smoothing
- Linear SVM and kernel SVM (soft margin, C, kernel trick)
- Decision tree: ID3 information gain Gain(S,A)=H(S)-\sum_v \frac{|S_v|}{|S|}H(S_v); overfitting controlled via pruning; CART uses Gini 1-\sum_i p_i^2
- Ensembles:
• Bagging / Random Forest (bootstrap, decorrelation)
• Boosting / AdaBoost (weight update w_i \leftarrow w_i\,e^{\alpha_t I(y_i\ne h_t(x_i))})
• Error reduction when base learners are diverse and better than chance (see the scikit-learn sketch below)
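A minimal scikit-learn sketch running a linear model, a bagging ensemble and a boosting ensemble on synthetic data:

```python
# Minimal sketch: comparing a few of the classifiers above on toy data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for clf in (LogisticRegression(max_iter=1000),
            RandomForestClassifier(n_estimators=100, random_state=0),  # bagging
            AdaBoostClassifier(random_state=0)):                       # boosting
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, clf.score(X_te, y_te))
```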
Evaluation & Model Selection
- Confusion matrix terms: TP, FP, FN, TN
- Metrics: Accuracy, Precision \frac{TP}{TP+FP}, Recall \frac{TP}{TP+FN}, F1, ROC & AUC, cost-matrix, class-imbalance
- Data split: hold-out, k-fold CV (k≈10), LOOCV, bootstrap (.632)
- Hyper-parameter tuning via grid / random / Bayesian search; nested CV (see the sketch below)
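A minimal sketch of grid search with k-fold CV, plus a nested-CV score of the whole tuning procedure; the dataset and parameter grid are illustrative choices:

```python
# Minimal sketch: grid search with 5-fold CV, then nested CV over the tuner.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10],
                           "svc__kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)

# Nested CV scores the whole tuning procedure, not one lucky split
print(cross_val_score(grid, X, y, cv=5).mean())
```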
Unsupervised Learning
Dimensionality Reduction
- PCA: maximise variance; eigen-decomposition of \Sigma; projection Z=U^T X; scree plot of variance explained
- Applications: eigenfaces, compression (see the sketch below)
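A minimal sketch of PCA and its variance-explained numbers, using the sklearn digits images as a stand-in for faces:

```python
# Minimal sketch: PCA projection and cumulative variance explained.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 64-D pixel vectors
pca = PCA(n_components=10).fit(X)
Z = pca.transform(X)                       # projection Z = U^T X
print(pca.explained_variance_ratio_.cumsum())  # data behind a scree plot
```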
Clustering
- K-means: minimise SSE; steps init→assign→update; issues (initialisation, scale, k, shapes); Elbow & silhouette
- K-medoids, Hierarchical (agglomerative, linkage), density-based (DBSCAN)
- GMM & EM; soft-clustering; likelihood \sumk \pik \mathcal{N}(x|\muk,\Sigmak)
- Cluster validity indices: SSE, Silhouette, Entropy/PI, Dunn, external ARI
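A minimal sketch comparing k-means (hard assignments, SSE, silhouette) with GMM soft clustering on synthetic blobs:

```python
# Minimal sketch: k-means vs. GMM soft clustering with validity scores.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("SSE:", km.inertia_, "silhouette:", silhouette_score(X, km.labels_))

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gmm.predict_proba(X)[:3])   # soft assignments, one row per point
```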
Neural Networks & Deep Learning
- Perceptron learning rule w:=w+\eta (t-o)x; limitations (linear separability)
- MLP: layers, activations (ReLU, LeakyReLU, \tanh, sigmoid); back-prop chain rule; vanishing/exploding gradients
- Initialisation: Xavier \sigma=\sqrt{1/n_{\text{in}}}, He \sigma=\sqrt{2/n_{\text{in}}}
- Optimisers: Momentum, Nesterov, RMSProp, Adam; LR scheduling, batch-norm
- Regularisation: dropout (p≈0.5), L1/L2, early stopping (see the Keras sketch below)
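A minimal Keras sketch combining several items above (He init, dropout, Adam, early stopping); X_train/y_train are placeholders and the layer sizes are arbitrary:

```python
# Minimal sketch: an MLP with He initialisation, dropout and early stopping.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
early = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.1, epochs=100, callbacks=[early])
```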
Convolutional NN (CNN)
- Convolution layer (kernel f\times f, stride s, padding p): output size \lfloor (n+2p-f)/s \rfloor + 1
- Feature maps, channels; parameter count f^2\,c_{\text{in}}\,c_{\text{out}} (+\,c_{\text{out}} biases); see the sketch below
- Pooling (max/avg, stride 2), flatten, FC
- Typical stacks: [conv-BN-ReLU]* → pool → dense
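A minimal sketch of the two formulas above as Python helpers; a worked check, not library code:

```python
# Minimal sketch: conv output size and parameter count.
def conv_out(n: int, f: int, s: int = 1, p: int = 0) -> int:
    """Output spatial size: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

def conv_params(f: int, c_in: int, c_out: int) -> int:
    """f*f*c_in weights per filter, c_out filters, plus c_out biases."""
    return f * f * c_in * c_out + c_out

print(conv_out(32, f=3, s=1, p=1))        # 32 ("same" padding keeps size)
print(conv_params(3, c_in=3, c_out=16))   # 448
```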
Recurrent NN (RNN)
- Sequence modelling h_t=\sigma(W_h h_{t-1}+W_x x_t+b); many-to-one, many-to-many tasks
- LSTM gates (forget f_t, input i_t, candidate g_t, output o_t) with cell state c_t
- GRU: simplified variant (reset & update gates); see the Keras sketch below
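A minimal Keras sketch of a many-to-one LSTM classifier; the sequence length (50) and feature count (8) are hypothetical:

```python
# Minimal sketch: many-to-one LSTM for binary sequence classification.
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(50, 8)),                   # 50 steps, 8 features each
    keras.layers.LSTM(32),                        # returns last hidden state h_T
    keras.layers.Dense(1, activation="sigmoid"),  # one label per sequence
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# For many-to-many, use LSTM(32, return_sequences=True) instead.
```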
End-to-End ML Pipeline Recap
- Problem & metric
- Data acquire & store
- EDA / visualise
- Pre-process & feature eng.
- Split train/val/test
- Select & train models
- Tune hyper-params
- Evaluate, cross-validate
- Deploy – monitor, A/B, retrain
Fairness, Accountability, Transparency & Ethics (FAccT)
- Fairness definitions
• Demographic parity P(\hat y=1|A=0)=P(\hat y=1|A=1)
• Equalised odds: \hat y independent of A given Y, i.e. P(\hat y=1\mid A=0,Y=y)=P(\hat y=1\mid A=1,Y=y)
• Equal opportunity (TPR parity); see the sketch after this list
- Bias sources: prejudice in data, under-estimation, sampling bias
- Mitigation: balanced datasets, re-weighting, regularisers (Prejudice Index PI=\sum_{y,s} P(y,s)\log\frac{P(y,s)}{P(y)P(s)}), fair representations, post-processing thresholds
- Interpretability
• Intrinsic (linear, DT, rule-lists) vs Post-hoc (LIME, SHAP)
• Global vs local; model-specific vs model-agnostic
- Privacy (data masking, differential privacy), security (poisoning, adversarial examples), accountability (model cards, audits)
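A minimal NumPy sketch of the fairness metrics above (demographic-parity gap and TPR / equal-opportunity gap); the arrays are random placeholders:

```python
# Minimal sketch: fairness gaps from labels, predictions and a protected attribute.
import numpy as np

def demographic_parity_gap(y_hat, a):
    """|P(y_hat=1 | A=0) - P(y_hat=1 | A=1)|."""
    return abs(y_hat[a == 0].mean() - y_hat[a == 1].mean())

def tpr_gap(y, y_hat, a):
    """Equal opportunity: TPR difference between groups."""
    tpr = lambda g: y_hat[(a == g) & (y == 1)].mean()
    return abs(tpr(0) - tpr(1))

rng = np.random.default_rng(0)
y, a = rng.integers(0, 2, 1000), rng.integers(0, 2, 1000)
y_hat = rng.integers(0, 2, 1000)   # stand-in for model predictions
print(demographic_parity_gap(y_hat, a), tpr_gap(y, y_hat, a))
```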
Key Mathematical Expressions
- Gradient descent w^{(k+1)}=w^{(k)}-\eta\,\nabla J(w^{(k)})
- AdaBoost weight \alpha_t=\frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}
- Logistic cost J=-\frac{1}{m}\sum_i \big[y_i\log\hat y_i+(1-y_i)\log(1-\hat y_i)\big]
- PCA eigen-problem \Sigma u_i = \lambda_i u_i
- LSTM equations
f_t=\sigma(W_f[x_t,h_{t-1}]+b_f)
i_t=\sigma(W_i[x_t,h_{t-1}]+b_i)
\tilde c_t=\tanh(W_c[x_t,h_{t-1}]+b_c)
c_t=f_t\odot c_{t-1}+i_t\odot\tilde c_t
o_t=\sigma(W_o[x_t,h_{t-1}]+b_o)
h_t=o_t\odot\tanh(c_t)
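A minimal NumPy sketch of one LSTM step implementing exactly these equations; the weight shapes are illustrative:

```python
# Minimal sketch: a single LSTM step from the equations above.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W[g], b[g] per gate g; W[g] has shape (n_hidden, n_in + n_hidden)."""
    z = np.concatenate([x_t, h_prev])          # [x_t, h_{t-1}]
    f = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i = sigmoid(W["i"] @ z + b["i"])           # input gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])     # candidate cell state
    c = f * c_prev + i * c_tilde               # new cell state
    o = sigmoid(W["o"] @ z + b["o"])           # output gate
    h = o * np.tanh(c)                         # new hidden state
    return h, c

n_in, n_h = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(n_h, n_in + n_h)) for g in "fico"}
b = {g: np.zeros(n_h) for g in "fico"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, b)
```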
Practical Tips & Colab Resources
- Hands-On-ML notebooks:
• 02_end_to_end_machine_learning_project.ipynb (housing)
• 11_training_deep_neural_networks.ipynb (batch norm, LR scheduling, etc.)
- Use a GPU runtime in Colab; set random seeds for reproducibility (see the sketch below); monitor training with TensorBoard.
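A minimal seeding snippet for reproducible Colab runs; TensorFlow is assumed here, since the notebooks above use it:

```python
# Minimal sketch: seed the usual RNGs before an experiment.
import random
import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)   # note: full GPU determinism needs extra settings
```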
Closing Reminders
- Select metrics aligning with business cost, esp. in imbalanced or regulated settings.
- Always validate assumptions & monitor post-deployment drift.
- Strive for interpretable, fair & robust models alongside accuracy.