Data Pre-processing, Normalisation & Linear-Regression Lecture – Comprehensive Notes
Session Logistics
Today’s Agenda - Data-pre-processing overview (consolidation → cleaning → transformation → reduction).
Intro to two learning algorithms: Linear Regression & Decision Tree (clustering briefly touched).
Q&A on Assignment 1 (due this Friday) and early guidance for Assignment 2 (due 25-July; no lecture next week because it’s break week; slides/recording will be posted over the weekend).
Reminder to on-line students to ask questions live.
Assignment Timeline - A1: due this Friday.
A2 posted; resources (slides) coming weekend/early next week.
Break week → no lecture next week.
A2 due 25-July; posted tips will compensate for the missed lecture.
Students may ZIP the notebook + PDF if the LMS allows only a single upload slot.
Sample report: include Findings section (contains visualisation & results).
Visualising non-numeric columns → redesign or map to counts/aggregates.
Data Pre-processing Pipeline
Data Consolidation - Collect, select, integrate disparate sources (e.g. multi-country Excel/CSV files → warehouse).
Data Cleaning - Detect & handle:
• Missing values
• Noise/outliers (sudden spikes, e.g. 9 999 in an otherwise 20–21 range)
• Inconsistencies (naming variations, wrong datatypes).
Techniques:
• Row/column deletion when largely empty
• Imputation: copy from equivalent cell, insert mean/median, or create a new categorical level “missing”.
• Noise smoothing: bin the values, then replace each bin by its mean/median; rarely by min/max (bin boundaries).
Data Transformation - Smoothing, aggregation, normalisation (crucial when scales differ).
Goal: bring all features to comparable range (e.g. 0 to 1, or -1 to 1) so ML models converge/stabilise.
Data Reduction - Feature Selection + Compression.
Dimensionality Reduction: Principal Component Analysis (PCA) most common.
• Example: 10 000-dimensional signal → PCA → 50 principal components representing most variance.
• Benefits: lower storage, faster training, reduced over-fitting.
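A minimal sketch of PCA-based reduction with scikit-learn (the synthetic 10 000-dimensional input and the choice of 50 components are illustrative assumptions, not the lecture's code):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 10_000)              # 200 samples of a 10 000-dimensional signal (synthetic)
pca = PCA(n_components=50)                   # keep the 50 components capturing most variance
X_reduced = pca.fit_transform(X)             # shape becomes (200, 50)
print(pca.explained_variance_ratio_.sum())   # fraction of total variance retained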
Missing-Value Imputation Examples
Copy Imputation
Small table example (an Annual Income column containing blank cells) → copy a neighbouring/equivalent row's value into the blanks.
Mean Imputation
Compute mean of column (e.g. mean x = 26) and fill missing entries.
Categorical Gap
Column ‘likes’: apple, orange, –. Create new category “Missing/Other”.
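A small pandas sketch of the three imputation ideas above (column names and values are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "annual_income": [52000, None, 48000, None],
    "likes": ["apple", "orange", None, "apple"],
})

df["annual_income"] = df["annual_income"].ffill()                        # copy imputation: reuse the previous row's value
# alternative: df["annual_income"].fillna(df["annual_income"].mean())   # mean imputation
df["likes"] = df["likes"].fillna("Missing/Other")                        # new categorical level for gaps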
Noise Handling via Binning
Sorted data: {2, 10, 18, 18, 19, 20, 22, 25, 28}.
Bin size = 3 → groups {2,10,18}, {18,19,20}, {22,25,28}.
Replace each value by its group mean or median (≈ 10, 19, 25).
Effectively smooths abrupt spikes.
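A minimal NumPy sketch of equal-size binning with mean smoothing, matching the numbers above:

import numpy as np

data = np.array([2, 10, 18, 18, 19, 20, 22, 25, 28])   # already sorted
bins = data.reshape(-1, 3)                              # bins of size 3
smoothed = np.repeat(bins.mean(axis=1), 3)              # replace every value by its bin mean
print(smoothed)                                         # [10. 10. 10. 19. 19. 19. 25. 25. 25.]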
Normalisation Techniques
Divide-by-Max (simple)
x' = x / x_max
• Very fast; may be coarse.
Min-Max Scaling
x' = (x - x_min) / (x_max - x_min) * (new_max - new_min) + new_min
• Example (range 8 to 20 → 0 to 1):
(10 - 8) / (20 - 8) ≈ 0.167.
Z-Score Standardisation (most popular)
z = (x - mean) / standard_deviation
• Compute the mean and standard deviation. Example set {8, 10, 15, 20} → mean = 13.25, standard deviation ≈ 4.66 → z_8 ≈ -1.13.
• Widely used in health-care & business analytics.
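A minimal scikit-learn sketch of min-max scaling and z-score standardisation on the small example set above:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[8.0], [10.0], [15.0], [20.0]])

print(MinMaxScaler(feature_range=(0, 1)).fit_transform(x).ravel())
# ≈ [0.    0.167 0.583 1.   ]  ->  (10 - 8) / (20 - 8) ≈ 0.167

print(StandardScaler().fit_transform(x).ravel())
# ≈ [-1.13 -0.70  0.38  1.45]  ->  z = (x - mean) / std, mean = 13.25, std ≈ 4.66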
Class-Imbalance Remedies
Upsampling/Over-sampling - Minority class (e.g. 50 images) synthetically enlarged to 4 000 via data augmentation: rotation, flipping, noise injection, pixel shifts.
Down-sampling/Under-sampling - Majority class randomly reduced to match minority (e.g. keep 50 of 4 000).
Motivation: balance improves sensitivity, specificity, overall accuracy.
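A hedged sketch of random over-/under-sampling with sklearn.utils.resample; the 50 vs 4 000 counts echo the lecture's example, and real image augmentation (rotation, flipping, noise injection) would normally use an image library rather than plain resampling:

import numpy as np
from sklearn.utils import resample

majority = np.random.rand(4000, 8)    # 4 000 majority-class samples (synthetic placeholders)
minority = np.random.rand(50, 8)      # 50 minority-class samples (synthetic placeholders)

minority_up = resample(minority, replace=True, n_samples=4000, random_state=42)    # over-sampling
majority_down = resample(majority, replace=False, n_samples=50, random_state=42)   # under-sampling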
Machine-Learning Demonstration (Iris Data, scikit-learn)
Dataset Snapshot
150 rows × 4 features: Sepal Length, Sepal Width, Petal Length, Petal Width.
Target labels: 0 = Setosa, 1 = Versicolor, 2 = Virginica.
Feature Exploration & Selection
Histogram: quick view of value ranges for each numeric feature.
PairPlot (seaborn): scatter-matrix coloured by species → visual observation that Petal features separate classes best.
Correlation Matrix (heat-map) - Scale: -1 to 1; diagonal = 1.
Highest off-diagonal correlation rho=0.96 between Petal Length & Petal Width → strong predictive power.
Third-best feature: Sepal Length (rho=0.87 with target proxy).
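A sketch of these exploration steps (histograms, pairplot, correlation heat-map), assuming the Iris data is loaded into a pandas DataFrame via scikit-learn:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                                   # 4 feature columns + numeric 'target'
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

df.drop(columns=["target", "species"]).hist()                  # histograms of value ranges
sns.pairplot(df.drop(columns="target"), hue="species")         # scatter matrix coloured by species
plt.figure()
sns.heatmap(df.drop(columns="species").corr(), annot=True)     # correlation matrix incl. target
plt.show()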
Train-Test Split
train_test_split(X, y, test_size=0.2, random_state=42)
- 80 % (120 obs) training, 20 % (30 obs) testing. random_state locks the shuffle for reproducibility; omit it to get a different split each run. Rule of thumb: more training data → generally higher accuracy (e.g. 90/10 > 80/20 > 70/30).
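Expanded sketch of the split (X, y from the Iris loader; the 80/20 ratio and random_state=42 follow the lecture):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                      # 150 observations, 4 features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)              # 120 train / 30 test, reproducible shuffle
print(X_train.shape, X_test.shape)                     # (120, 4) (30, 4)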
Linear Regression Experiments
Simple Linear Regression (2 features) - Features: Petal Length (x), Petal Width (y).
Equation: y_hat = mx + c (best-fit line learnt).
Metrics:
• Mean Absolute Error approx 0.29
• Mean Squared Error approx 0.129
• RMSE approx 0.35
• R^2 approx 0.76.
Multiple Linear Regression (3 features) - Features: Sepal Width, Petal Length, Petal Width.
Improved metrics:
• Lower MAE/MSE/RMSE,
• R^2 approx 0.855 (better fit).
Demonstrates benefit of adding informative features.
Visualisations
Scatter of Actual vs Predicted with regression line.
Comparative bar-chart of error metrics for 2-feature vs 3-feature models.
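A minimal sketch of the single-feature fit, its error metrics, and the actual-vs-predicted plot (exact metric values depend on the split and will differ slightly from the figures quoted above):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, [2]]                                  # Petal Length as predictor
y = iris.data[:, 3]                                    # Petal Width as target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)       # learns y_hat = m*x + c
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("MAE :", mean_absolute_error(y_test, y_pred))
print("MSE :", mse, " RMSE:", np.sqrt(mse))
print("R^2 :", r2_score(y_test, y_pred))

plt.scatter(X_test.ravel(), y_test, label="actual")               # actual test values
plt.plot(X_test.ravel(), y_pred, color="red", label="fitted line")  # predictions lie on the learnt line
plt.xlabel("Petal Length"); plt.ylabel("Petal Width"); plt.legend(); plt.show()

The three-feature model follows the same pattern, with the three predictor columns placed in X instead of one.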
Python Libraries Referenced
sklearn.datasets, sklearn.model_selection, sklearn.preprocessing (StandardScaler, MinMaxScaler), sklearn.linear_model, sklearn.metrics.
Visualisation: matplotlib.pyplot, seaborn.
Ethical & Practical Considerations
Normalising healthcare data (varying units, huge magnitudes) avoids model bias and improves interpretability.
Class imbalance (e.g. rare diseases) must be rectified; otherwise models become overconfident on the majority class and unsafe in practice.
PCA or other dimensionality reduction can help safeguard privacy by storing abstract components instead of raw personal data.
Assignment Guidance Highlights
Report Structure: Title page → Executive Summary → Introduction → Methodology → Findings (visuals & stats) → Conclusions → References.
Include both the Jupyter Notebook (.ipynb) and the Report (PDF/Word) – zip them if the LMS needs one file.
Use data-frame visualisations (df.head(), histograms, pairplots) or aggregate counts for non-numeric categories.
Marks are not penalised for presenting a table via print instead of a rendered DataFrame if rendering fails.
Use a sensible train/test ratio (80/20 recommended) and justify the choice.
Findings section = where you narrate the insights gleaned from plots/correlations.
Key Formulae Recap
Linear regression: y_hat = mx + c.
Min-Max scaling: x' = (x - x_min) / (x_max - x_min) * (b - a) + a, where [a, b] is the target range.
Z-score: z = (x - mean) / standard_deviation.
Variance: standard_deviation^2 = (1/n) * sum( (x_i - mean)^2 ).
RMSE: sqrt( (1/n) * sum( (y_hat_i - y_i)^2 ) ).
Take-Away Checklist for Exam/Assignments
Know each pre-processing stage and be able to cite at least two techniques per stage.
Memorise and apply three normalisation methods, with quick verbal example.
Explain why class balance matters and name two augmentation tricks.
Use histogram/pairplot/correlation matrix for feature selection justification.
Demonstrate train_test_split syntax and the effect of random_state.
Interpret MAE, MSE, RMSE, R^2.
Distinguish PCA (reduction) from Feature Selection (filter/wrapper).
Closing Reminders
No lecture next week (break).
Slides (or short video) with tips for Assignment 2 will be uploaded by weekend.
Assignment 1 due this Friday; follow submission guidelines (ZIP if single slot).
Use the CDC link for cancer-dataset background; Data_Value corresponds to absolute/adjusted counts per capita – verify on the source site.
Ask questions early via the online forums – the lecturer is responsive.