
Data Pre-processing, Normalisation & Linear-Regression Lecture – Comprehensive Notes

Session Logistics
  • Today’s Agenda - Data pre-processing overview (consolidation → cleaning → transformation → reduction).

    • Intro to two learning algorithms: Linear Regression & Decision Tree (clustering briefly touched).

    • Q&A on Assignment 1 (due this Friday) and early guidance for Assignment 2 (due 25-July; no lecture next week because it’s break week; slides/recording will be posted over the weekend).

    • Reminder to online students to ask questions live.

  • Assignment Timeline - A1: due this Friday.

    • A2 posted; resources (slides) coming weekend/early next week.

    • Break week → no lecture next week.

    • A2 due 25-July; tips will compensate for missed lecture.

    • Students may ZIP notebook + PDF if LMS allows single upload slot.

    • Sample report: include Findings section (contains visualisation & results).

    • Visualising non-numeric columns → redesign or map to counts/aggregates.


Data Pre-processing Pipeline
  1. Data Consolidation - Collect, select, and integrate disparate sources (e.g. multi-country Excel/CSV files → warehouse).

  2. Data Cleaning - Detect & handle:

    • Missing values

    • Noise/outliers (sudden spikes: e.g. 9 999 in an otherwise 20-21 range)

    • Inconsistencies (naming variations, wrong datatypes).

    • Techniques:

      • Row/column deletion when largely empty

      • Imputation: copy from equivalent cell, insert mean/median, or create a new categorical level “missing”.

      • Noise smoothing: bin the values, then replace each by its bin mean/median; rarely by bin min/max.

  3. Data Transformation - Smoothing, aggregation, normalisation (crucial when scales differ).

    • Goal: bring all features to comparable range (e.g. 0 to 1, or -1 to 1) so ML models converge/stabilise.

  4. Data Reduction - Feature Selection + Compression.

    • Dimensionality Reduction: Principal Component Analysis (PCA) most common.

      • Example: 10 000-dimensional signal → PCA → 50 principal components representing most of the variance (sketched below).

      • Benefits: lower storage, faster training, reduced over-fitting.
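
A minimal scikit-learn sketch of that PCA step, using random data purely for illustration (the 1 000-feature matrix and the 50-component count are placeholders, not the lecture’s dataset):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 1000)                # 200 samples x 1000 features (stand-in data)
pca = PCA(n_components=50)                   # keep the 50 strongest principal components
X_reduced = pca.fit_transform(X)             # shape becomes (200, 50)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())   # fraction of the total variance retained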


Missing-Value Imputation Examples
  • Copy Imputation

    • e.g. an Annual Income column with blanks → copy a neighbouring row’s value into the blanks.

  • Mean Imputation

    • Compute mean of column (e.g. mean x = 26) and fill missing entries.

  • Categorical Gap

    • Column ‘likes’: apple, orange, –. Create new category “Missing/Other”.
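
A small pandas sketch of the three ideas above, on a toy table (column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "annual_income": [52000, None, 48000, None, 61000],
    "age":           [25, 27, None, 26, 28],
    "likes":         ["apple", "orange", None, "apple", None],
})

df["annual_income"] = df["annual_income"].ffill()    # copy imputation: carry a neighbouring value forward
df["age"] = df["age"].fillna(df["age"].mean())       # mean imputation
df["likes"] = df["likes"].fillna("Missing")          # explicit "Missing" category for the gap

print(df)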


Noise Handling via Binning
  • Sorted data: {2, 10, 18, 18, 19, 20, 22, 25, 28}.

  • Bin size = 3 → groups {2,10,18}, {18,19,20}, {22,25,28}.

  • Replace each value by its group mean or median (≈ 10, 19, 25).

  • Effectively smooths abrupt spikes.
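
The same smoothing as a short Python sketch (bins of three, each value replaced by its bin mean):

import numpy as np

data = np.array([2, 10, 18, 18, 19, 20, 22, 25, 28], dtype=float)
bin_size = 3

smoothed = data.copy()
for start in range(0, len(data), bin_size):
    idx = slice(start, start + bin_size)
    smoothed[idx] = data[idx].mean()        # replace every value in the bin by the bin mean

print(smoothed)                             # [10. 10. 10. 19. 19. 19. 25. 25. 25.]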


Normalisation Techniques
  1. Divide-by-Max (simple)

x' = x / x_max

• Very fast; may be coarse.

  2. Min-Max Scaling

x' = (x - x_min) / (x_max - x_min) * (new_max - new_min) + new_min

• Example (range 8 to 20 -> 0 to 1):

(10 - 8) / (20 - 8) ≈ 0.167.

  3. Z-Score Standardisation (most popular)

z = (x - mean) / standard_deviation

• Compute the mean and standard deviation. Example set {8,10,15,20} → mean = 13.25, standard deviation ≈ 4.66 → z_8 ≈ -1.13.

• Widely used in health-care & business analytics.
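
The three scalings above on the small example set, using scikit-learn’s MinMaxScaler and StandardScaler for the last two (a sketch; StandardScaler uses the population standard deviation):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[8.0], [10.0], [15.0], [20.0]])    # one column, as the scalers expect 2-D input

divide_by_max = x / x.max()                       # simple divide-by-max
min_max = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)
z_score = StandardScaler().fit_transform(x)       # (x - mean) / standard_deviation

print(divide_by_max.ravel())                      # 0.4, 0.5, 0.75, 1.0
print(min_max.ravel())                            # 0.0 ... 1.0
print(z_score.ravel())                            # 8 maps to about -1.13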


Class-Imbalance Remedies
  • Upsampling/Over-sampling - Minority class (e.g. 50 images) synthetically enlarged to 4 000 via data augmentation: rotation, flipping, noise injection, pixel shifts.

  • Down-sampling/Under-sampling - Majority class randomly reduced to match minority (e.g. keep 50 of 4 000).

  • Motivation: balancing stops the model from simply favouring the majority class, improving sensitivity and specificity on the rare class.
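
A hedged sketch of random over- and under-sampling with sklearn.utils.resample on a toy label column; image augmentation (rotations, flips, noise injection) would use an image library and is not shown here:

import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(110), "label": [0] * 100 + [1] * 10})
majority, minority = df[df["label"] == 0], df[df["label"] == 1]

# Over-sample the minority class up to the majority size (sampling with replacement).
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced_up = pd.concat([majority, minority_up])

# Or under-sample the majority class down to the minority size.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
balanced_down = pd.concat([majority_down, minority])

print(balanced_up["label"].value_counts())
print(balanced_down["label"].value_counts())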


Machine-Learning Demonstration (Iris Data, scikit-learn)
Dataset Snapshot
  • 150 rows × 4 features: Sepal Length, Sepal Width, Petal Length, Petal Width.

  • Target labels: 0 = Setosa, 1 = Versicolor, 2 = Virginica.

Feature Exploration & Selection
  • Histogram: quick view of value ranges for each numeric feature.

  • PairPlot (seaborn): scatter-matrix coloured by species → visual observation that Petal features separate classes best.

  • Correlation Matrix (heat-map) - Scale −1 to 1; diagonal = 1.

    • Highest off-diagonal correlation rho=0.96 between Petal Length & Petal Width → strong predictive power.

    • Third-best feature: Sepal Length (rho=0.87 with target proxy).
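
A sketch of loading the iris data and reproducing the three exploration views above (assumes scikit-learn ≥ 0.23 for as_frame=True):

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame          # 150 rows: 4 measurements + 'target' (0/1/2)

df.hist(figsize=(8, 6))                      # value ranges per numeric feature
sns.pairplot(df, hue="target")               # scatter matrix coloured by species
plt.figure()
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1)   # correlation heat-map, diagonal = 1
plt.show()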

Train-Test Split
  • train_test_split(X, y, test_size=0.2, random_state=42) - 80 % (120 obs) training, 20 % (30 obs) testing.

    • random_state locks shuffling for reproducibility; omit to get different splits each run.

    • Rule of thumb: more training data generally improves the fit (e.g. 90/10 vs 80/20 vs 70/30), but a very small test set makes the accuracy estimate noisier.
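
The split call with a quick shape check, continuing from the iris frame loaded above:

from sklearn.model_selection import train_test_split

X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)           # (120, 4) (30, 4)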

Linear Regression Experiments
  1. Simple Linear Regression (2 features) - Petal Length (x) used to predict Petal Width (y, the target).

    • Equation: y_hat = mx + c (best-fit line learnt).

    • Metrics:

      • Mean Absolute Error approx 0.29

      • Mean Squared Error approx 0.129

      • RMSE approx 0.35

      • R^2 approx 0.76.

  2. Multiple Linear Regression (3 features) - Features: Sepal Width, Petal Length, Petal Width.

    • Improved metrics:

      • Lower MAE/MSE/RMSE,

      • R^2 approx 0.855 (better fit).

    • Demonstrates benefit of adding informative features.
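
A hedged sketch of the two experiments. It assumes Petal Width is the target in both models and uses the other measurements as predictors; the lecture’s exact feature/target pairing for the three-feature model may differ, and the metrics will vary with the split:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

iris = load_iris(as_frame=True).frame
target = iris["petal width (cm)"]                       # assumed target for both models

def fit_and_report(features):
    X_train, X_test, y_train, y_test = train_test_split(
        iris[features], target, test_size=0.2, random_state=42)
    pred = LinearRegression().fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, pred)
    print(features,
          "MAE", round(mean_absolute_error(y_test, pred), 3),
          "RMSE", round(np.sqrt(mse), 3),
          "R^2", round(r2_score(y_test, pred), 3))
    return y_test, pred

y_test_2f, pred_2f = fit_and_report(["petal length (cm)"])
y_test_3f, pred_3f = fit_and_report(
    ["sepal length (cm)", "sepal width (cm)", "petal length (cm)"])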

Visualisations
  • Scatter of Actual vs Predicted with regression line.

  • Comparative bar-chart of error metrics for 2-feature vs 3-feature models.
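
A minimal matplotlib sketch for the first plot, reusing y_test_3f and pred_3f from the regression sketch above; the metric bar chart can be built the same way with plt.bar:

import matplotlib.pyplot as plt

plt.scatter(y_test_3f, pred_3f, label="predicted vs actual")
lims = [float(y_test_3f.min()), float(y_test_3f.max())]
plt.plot(lims, lims, "r--", label="perfect prediction")   # reference line y = x
plt.xlabel("actual petal width (cm)")
plt.ylabel("predicted petal width (cm)")
plt.legend()
plt.show()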

Python Libraries Referenced
  • sklearn.datasets, sklearn.model_selection, sklearn.preprocessing (StandardScaler, MinMaxScaler), sklearn.linear_model, sklearn.metrics.

  • Visual: matplotlib.pyplot, seaborn.


Ethical & Practical Considerations
  • Normalising healthcare data (varying units, huge magnitudes) avoids model bias and improves interpretability.

  • Class imbalance (e.g. rare diseases) must be rectified; otherwise models become overconfident on majority and unsafe in practice.

  • PCA or other dimensionality reduction can also help safeguard privacy by storing abstract components instead of raw personal data.


Assignment Guidance Highlights
  • Report Structure: Title page → Executive Summary → Introduction → Methodology → Findings (visuals & stats) → Conclusions → References.

  • Include both Jupyter Notebook (.ipynb) and Report (PDF/Word) – zip if LMS needs one file.

  • Use data-frame visualisations (df.head(), histograms, pairplots) or aggregate counts for non-numeric categories.

  • Marks not penalised for presenting table via print instead of DataFrame if rendering fails.

  • Use sensible test/train ratio (recommend 80/20) and justify choice.

  • Findings section = where you narrate the insights gleaned from plots/correlations.


Key Formulae Recap
  • Linear regression: y_hat = mx + c.

  • Min-Max scaling: x' = (x - x_min) / (x_max - x_min) * (b - a) + a, for target range [a, b].

  • Z-score: z = (x - mean) / standard_deviation.

  • Variance: standard_deviation^2 = (1/n) * sum( (x_i - mean)^2 ).

  • RMSE: sqrt( (1/n) * sum( (y_hat_i - y_i)^2 ) ).
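
A quick numeric check of the recap formulas on the earlier example set (the prediction/actual pair for RMSE is made up purely to exercise the formula):

import numpy as np

x = np.array([8, 10, 15, 20], dtype=float)

min_max = (x - x.min()) / (x.max() - x.min())             # min-max to [0, 1]
z = (x - x.mean()) / x.std()                              # z-score with population std
rmse = np.sqrt(np.mean((np.array([0.3, 1.4, 2.0]) - np.array([0.25, 1.5, 1.8])) ** 2))

print(min_max.round(3), z.round(2), round(rmse, 3))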


Take-Away Checklist for Exam/Assignments
  • Know each pre-processing stage and be able to cite at least two techniques per stage.

  • Memorise and apply three normalisation methods, with quick verbal example.

  • Explain why class balance matters and name two augmentation tricks.

  • Use histogram/pairplot/correlation matrix for feature selection justification.

  • Demonstrate train_test_split syntax and the effect of random_state.

  • Interpret MAE, MSE, RMSE, R^2.

  • Distinguish PCA (reduction) from Feature Selection (filter/wrapper).


Closing Reminders
  • No lecture next week (break).

  • Slides (or short video) with tips for Assignment 2 will be uploaded by weekend.

  • Assignment 1 due this Friday; follow submission guidelines (ZIP if single slot).

  • Use CDC link for cancer dataset background; Data_Value corresponds to absolute/adjusted counts per capita – verify on source site.

  • Ask questions early via online forums – lecturer responsive.