Data Pre-processing, Normalisation & Linear-Regression Lecture – Comprehensive Notes
Session Logistics
Today’s Agenda - Data-pre-processing overview (consolidation → cleaning → transformation → reduction).
Intro to two learning algorithms: Linear Regression & Decision Tree (clustering briefly touched).
Q&A on Assignment 1 (due this Friday) and early guidance for Assignment 2 (due 25-July; no lecture next week because it’s break week; slides/recording will be posted over the weekend).
Reminder to on-line students to ask questions live.
Assignment Timeline - A1: due this Friday.
A2 posted; resources (slides) coming weekend/early next week.
Break week → no lecture next week.
A2 due 25-July; posted tips will compensate for the missed lecture.
Students may ZIP the notebook + PDF if the LMS allows only a single upload slot.
Sample report: include Findings section (contains visualisation & results).
Visualising non-numeric columns → redesign or map to counts/aggregates.
Data Pre-processing Pipeline
Data Consolidation - Collect, select, integrate disparate sources (e.g. multi-country Excel/CSV files → warehouse).
Data Cleaning - Detect & handle:
• Missing values
• Noise/outliers (sudden spikes, e.g. 9 999 in an otherwise 20–21 range)
• Inconsistencies (naming variations, wrong datatypes).
Techniques:
• Row/column deletion when largely empty
• Imputation: copy from equivalent cell, insert mean/median, or create a new categorical level “missing”.
• Noise smoothing: bin the values, then replace each bin by its mean/median; rarely by min/max (bin boundaries).
Data Transformation - Smoothing, aggregation, normalisation (crucial when scales differ).
Goal: bring all features to comparable range (e.g. 0 to 1, or -1 to 1) so ML models converge/stabilise.
Data Reduction - Feature Selection + Compression.
Dimensionality Reduction: Principal Component Analysis (PCA) most common.
• Example: 10 000-dimensional signal → PCA → 50 principal components representing most variance.
• Benefits: lower storage, faster training, reduced over-fitting.
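A minimal sketch of PCA-based reduction with scikit-learn (the synthetic 10 000-dimensional input and the choice of 50 components are illustrative assumptions, not the lecture's code):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 10_000)              # 200 samples of a 10 000-dimensional signal (synthetic)
pca = PCA(n_components=50)                   # keep the 50 components capturing most variance
X_reduced = pca.fit_transform(X)             # shape becomes (200, 50)
print(pca.explained_variance_ratio_.sum())   # fraction of total variance retained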
Missing-Value Imputation Examples
Copy Imputation
Small table example (an Annual Income column containing blank cells) → copy a neighbouring/equivalent row's value into the blanks.
Mean Imputation
Compute mean of column (e.g. mean x = 26) and fill missing entries.
Categorical Gap
Column ‘likes’: apple, orange, –. Create new category “Missing/Other”.
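A small pandas sketch of the three imputation ideas above (column names and values are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "annual_income": [52000, None, 48000, None],
    "likes": ["apple", "orange", None, "apple"],
})

df["annual_income"] = df["annual_income"].ffill()                        # copy imputation: reuse the previous row's value
# alternative: df["annual_income"].fillna(df["annual_income"].mean())   # mean imputation
df["likes"] = df["likes"].fillna("Missing/Other")                        # new categorical level for gaps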
Noise Handling via Binning
Sorted data: {2, 10, 18, 18, 19, 20, 22, 25, 28}.
Bin size = 3 → groups {2,10,18}, {18,19,20}, {22,25,28}.
Replace each value by its group mean or median (≈ 10, 19, 25).
Effectively smooths abrupt spikes.
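A minimal NumPy sketch of equal-size binning with mean smoothing, matching the numbers above:

import numpy as np

data = np.array([2, 10, 18, 18, 19, 20, 22, 25, 28])   # already sorted
bins = data.reshape(-1, 3)                              # bins of size 3
smoothed = np.repeat(bins.mean(axis=1), 3)              # replace every value by its bin mean
print(smoothed)                                         # [10. 10. 10. 19. 19. 19. 25. 25. 25.]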
Normalisation Techniques
Divide-by-Max (simple)
x' = x / x_max
• Very fast; may be coarse.
Min-Max Scaling
x' = (x - x_min) / (x_max - x_min) * (new_max - new_min) + new_min
• Example (range 8 to 20 → 0 to 1):
(10 - 8) / (20 - 8) ≈ 0.167.
Z-Score Standardisation (most popular)
z = (x - mean) / standard_deviation
• Compute the mean and standard deviation. Example set {8, 10, 15, 20} → mean = 13.25, standard deviation ≈ 4.66 → z_8 ≈ -1.13.
• Widely used in health-care & business analytics.
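A minimal scikit-learn sketch of min-max scaling and z-score standardisation on the small example set above:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[8.0], [10.0], [15.0], [20.0]])

print(MinMaxScaler(feature_range=(0, 1)).fit_transform(x).ravel())
# ≈ [0.    0.167 0.583 1.   ]  ->  (10 - 8) / (20 - 8) ≈ 0.167

print(StandardScaler().fit_transform(x).ravel())
# ≈ [-1.13 -0.70  0.38  1.45]  ->  z = (x - mean) / std, mean = 13.25, std ≈ 4.66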
Class-Imbalance Remedies
Upsampling/Over-sampling - Minority class (e.g. 50 images) synthetically enlarged to 4 000 via data augmentation: rotation, flipping, noise injection, pixel shifts.
Down-sampling/Under-sampling - Majority class randomly reduced to match minority (e.g. keep 50 of 4 000).
Motivation: balance improves sensitivity, specificity, overall accuracy.
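A hedged sketch of random over-/under-sampling with sklearn.utils.resample; the 50 vs 4 000 counts echo the lecture's example, and real image augmentation (rotation, flipping, noise injection) would normally use an image library rather than plain resampling:

import numpy as np
from sklearn.utils import resample

majority = np.random.rand(4000, 8)    # 4 000 majority-class samples (synthetic placeholders)
minority = np.random.rand(50, 8)      # 50 minority-class samples (synthetic placeholders)

minority_up = resample(minority, replace=True, n_samples=4000, random_state=42)    # over-sampling
majority_down = resample(majority, replace=False, n_samples=50, random_state=42)   # under-sampling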
Machine-Learning Demonstration (Iris Data, scikit-learn)
Dataset Snapshot
150 rows × 4 features: Sepal Length, Sepal Width, Petal Length, Petal Width.
Target labels: 0 = Setosa, 1 = Versicolor, 2 = Virginica.
Feature Exploration & Selection
Histogram: quick view of value ranges for each numeric feature.
PairPlot (seaborn): scatter-matrix coloured by species → visual observation that Petal features separate classes best.
Correlation Matrix (heat-map) - Scale: -1 to 1; diagonal = 1.
Highest off-diagonal correlation rho=0.96 between Petal Length & Petal Width → strong predictive power.
Third-best feature: Sepal Length (rho=0.87 with target proxy).
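A sketch of these exploration steps (histograms, pairplot, correlation heat-map), assuming the Iris data is loaded into a pandas DataFrame via scikit-learn:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                                   # 4 feature columns + numeric 'target'
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

df.drop(columns=["target", "species"]).hist()                  # histograms of value ranges
sns.pairplot(df.drop(columns="target"), hue="species")         # scatter matrix coloured by species
plt.figure()
sns.heatmap(df.drop(columns="species").corr(), annot=True)     # correlation matrix incl. target
plt.show()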
Train-Test Split
train_test_split(X, y, test_size=0.2, random_state=42)
- 80 % (120 obs) training, 20 % (30 obs) testing. random_state locks the shuffle for reproducibility; omit it to get a different split each run. Rule of thumb: more training data → generally higher accuracy (e.g. 90/10 > 80/20 > 70/30).
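Expanded sketch of the split (X, y from the Iris loader; the 80/20 ratio and random_state=42 follow the lecture):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                      # 150 observations, 4 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)              # 120 train / 30 test, reproducible shuffle
print(X_train.shape, X_test.shape)                     # (120, 4) (30, 4)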
Linear Regression Experiments
Simple Linear Regression (2 features) - Features: Petal Length (x), Petal Width (y).
Equation: y_hat = mx + c (best-fit line learnt).
Metrics:
• Mean Absolute Error approx 0.29
• Mean Squared Error approx 0.129
• RMSE approx 0.35
• R^2 approx 0.76.
Multiple Linear Regression (3 features) - Features: Sepal Width, Petal Length, Petal Width.
Improved metrics:
• Lower MAE/MSE/RMSE,
• R^2 approx 0.855 (better fit).
Demonstrates benefit of adding informative features.
Visualisations
Scatter of Actual vs Predicted with regression line.
Comparative bar-chart of error metrics for 2-feature vs 3-feature models.
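A minimal sketch of the single-feature fit, its error metrics, and the actual-vs-predicted plot (exact metric values depend on the split and will differ slightly from the figures quoted above):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, [2]]                                  # Petal Length as predictor
y = iris.data[:, 3]                                    # Petal Width as target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)       # learns y_hat = m*x + c
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("MAE :", mean_absolute_error(y_test, y_pred))
print("MSE :", mse, " RMSE:", np.sqrt(mse))
print("R^2 :", r2_score(y_test, y_pred))

plt.scatter(X_test.ravel(), y_test, label="actual")               # actual test values
plt.plot(X_test.ravel(), y_pred, color="red", label="fitted line")  # predictions lie on the learnt line
plt.xlabel("Petal Length"); plt.ylabel("Petal Width"); plt.legend(); plt.show()

The three-feature model follows the same pattern, with the three predictor columns placed in X instead of one.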
Python Libraries Referenced
sklearn.datasets, sklearn.model_selection, sklearn.preprocessing (StandardScaler, MinMaxScaler), sklearn.linear_model, sklearn.metrics.
Visualisation: matplotlib.pyplot, seaborn.
Ethical & Practical Considerations
Normalising healthcare data (varying units, huge magnitudes) avoids model bias and improves interpretability.
Class imbalance (e.g. rare diseases) must be rectified; otherwise models become overconfident on the majority class and unsafe in practice.
PCA or other dimensionality reduction can help safeguard privacy by storing abstract components instead of raw personal data.
Assignment Guidance Highlights
Report Structure: Title page → Executive Summary → Introduction → Methodology → Findings (visuals & stats) → Conclusions → References.
Include both the Jupyter Notebook (.ipynb) and the Report (PDF/Word) – zip them if the LMS needs one file.
Use data-frame visualisations (df.head(), histograms, pairplots) or aggregate counts for non-numeric categories.
Marks are not penalised for presenting a table via print instead of a rendered DataFrame if rendering fails.
Use a sensible train/test ratio (80/20 recommended) and justify the choice.
Findings section = where you narrate the insights gleaned from plots/correlations.
Key Formulae Recap
Linear regression: y_hat = mx + c.
Min-Max scaling: x' = (x - x_min) / (x_max - x_min) * (b - a) + a, where [a, b] is the target range.
Z-score: z = (x - mean) / standard_deviation.
Variance: standard_deviation^2 = (1/n) * sum( (x_i - mean)^2 ).
RMSE: sqrt( (1/n) * sum( (y_hat_i - y_i)^2 ) ).
Take-Away Checklist for Exam/Assignments
Know each pre-processing stage and be able to cite at least two techniques per stage.
Memorise and apply three normalisation methods, with quick verbal example.
Explain why class balance matters and name two augmentation tricks.
Use histogram/pairplot/correlation matrix for feature selection justification.
Demonstrate train_test_split syntax and the effect of random_state.
Interpret MAE, MSE, RMSE, R^2.
Distinguish PCA (reduction) from Feature Selection (filter/wrapper).
Closing Reminders
No lecture next week (break).
Slides (or short video) with tips for Assignment 2 will be uploaded by weekend.
Assignment 1 due this Friday; follow submission guidelines (ZIP if single slot).
Use the CDC link for cancer-dataset background; Data_Value corresponds to absolute/adjusted counts per capita – verify on the source site.
Ask questions early via the online forums – the lecturer is responsive.