Introduction to Data Quality and Exploration – AI1010 Lecture 8
Data Quality Essentials
- Data quality drives AI performance; better data often has a bigger impact than a better algorithm.
- Scale of data in modern AI: Google Translate > 10B sentence pairs; Tesla Autopilot > 1B miles; GPT pre-training on > 500B tokens (trillions recently); ImageNet-scale image datasets.
Data Types and Structure
- Structured data: rows = objects, columns = measurements; represent the dataset as an n × d matrix, where n = number of samples and d = number of features.
- VC-dimension: relates model complexity to generalization; for linear models in d dimensions, the VC-dimension is d + 1 (d weights plus the bias term).
- Data types: Structured, Unstructured, Semi-structured; examples: CSV/database, text/images, JSON/XML/HTML.
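As a minimal sketch of the structured case (the 3×2 dataset here is made up), rows map to samples and columns to features in an n × d NumPy array:

```python
import numpy as np

# Hypothetical structured dataset: 3 samples (rows), 2 features (columns),
# e.g. height_cm and weight_kg.
X = np.array([
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
])

n, d = X.shape  # n = number of samples, d = number of features
print(n, d)     # → 3 2
```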
Data Sources & Lifecycle
- Data sources: Internal, External, Generated; Crowdsourced labels.
- Public datasets: ImageNet, Common Crawl, Wikipedia dumps.
- Data lifecycle: iterative, continuous; version control for data and models is crucial for reproducibility.
Data Size, Shape & Splits
- Size categories: Small < 1 GB; Medium 1–100 GB; Big > 100 GB.
- Shape: n = samples, d = features; high n is good; high d risks the "curse of dimensionality".
- Data splitting strategies: Random Split (Train 60–80%, Val 10–20%, Test 10–20%); Stratified Split; Time-based Split; Cross-Validation.
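A stratified split can be sketched with scikit-learn's `train_test_split` (the 10-sample toy labels below are invented for illustration); `stratify=y` preserves the class proportions in both parts:

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, balanced binary labels (hypothetical).
X = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 80/20 stratified split: class proportions are kept in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # → 8 2
print(sorted(y_test))             # one sample per class → [0, 1]
```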
Exploratory Data Analysis (EDA)
- Purpose: get an overall sense of the data; compute summary statistics (distinct values, min, max, mean, median, variance, skewness).
- Visualization: histograms, scatter plots, higher-dimensional methods.
- EDA supports sanity checks on the data and helps catch skew or anomalies before modeling.
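The summary statistics above are one-liners in pandas; a sketch on a made-up numeric column:

```python
import pandas as pd

# Hypothetical numeric column with one large value.
s = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0])

print(s.nunique())        # distinct values → 4
print(s.min(), s.max())   # range
print(s.mean(), s.median())
print(s.var())            # sample variance
print(s.skew())           # positive: long right tail from the 10.0
```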
Data Quality Issues
- Missing values, Noise, Outliers, Duplicates, Inconsistency.
- Data cleaning is often the largest portion of a project (often up to ~80% of resources).
Missing Data: Detection & Patterns
- Representations: missing values can appear as NaN, NULL, "N/A", empty strings, or placeholders like 0 or -999.
- Missingness types:
- MCAR (Missing Completely At Random): no pattern with any variable.
- MAR (Missing At Random): missingness related to observed variables.
- MNAR (Missing Not At Random): missingness related to the value itself.
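Detecting missingness usually means normalizing placeholders first; a sketch with pandas (the `age` column and the -999 sentinel are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column where missingness hides behind a -999 placeholder.
df = pd.DataFrame({"age": [25, -999, 31, None, 40]})

# Map the placeholder to NaN so all missing values are counted uniformly.
df["age"] = df["age"].replace(-999, np.nan)
print(df["age"].isna().sum())  # → 2
```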
Handling Missing Data: Strategies
- Deletion: Row deletion (few affected rows) or Column deletion (if > 50% missing).
- Imputation (Filling in values): Simple (mean/median for numeric, mode for categorical); Advanced (KNN or model-based).
- Indicator feature: add a binary column indicating whether a value was imputed.
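Median imputation plus an indicator column can be sketched as follows (the `income` data is made up); the indicator is added before filling so the information that a value was missing is preserved:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30000.0, np.nan, 50000.0, 40000.0]})

# Indicator feature first: a model can learn from missingness itself.
df["income_was_missing"] = df["income"].isna().astype(int)

# Simple imputation: fill with the median of the observed values (40000).
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```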
Imputation Methods (Examples)
- Simple: mean/median (numeric), mode (categorical).
- K-NN imputation: fill based on nearest neighbours (e.g., K=2).
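K-NN imputation with K=2 can be sketched with scikit-learn's `KNNImputer` (the 4×2 array is invented); the missing entry is filled with the mean of that feature over the two nearest rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical 2-feature data with one missing value in row 1.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 4.0],
    [8.0, 9.0],
])

# The two nearest rows (by the observed feature) are rows 0 and 2,
# so the gap is filled with (2.0 + 4.0) / 2 = 3.0.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # → 3.0
```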
Outlier Detection & Handling
- Outliers can skew statistics and models.
- Detection:
- Visualization: Boxplots.
- Statistical: Z-score, IQR.
- Boxplot/IQR method: Q1 = 25th percentile, Q3 = 75th percentile; IQR = Q3 − Q1; outliers fall outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR].
- Z-score: Z = (x − μ) / σ; |Z| > 3 is often flagged as an outlier.
- IQR method is robust to non-normal data.
- Handling: remove, cap (winsorize), transform (e.g., log), or keep if genuine rare event.
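The IQR fences above translate directly to NumPy; a sketch on a made-up sample with one suspicious value:

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])  # 95 looks suspicious

# Quartiles and the 1.5×IQR fences.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lo) | (x > hi)]
print(outliers)  # → [95]
```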
Feature Scaling
- Puts all features on a similar scale; crucial for distance-based algorithms (k-NN, SVM) and helpful for gradient-based training (neural nets). Common methods: standardization (z-score) and min–max scaling.
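Both standard scalers are one-liners in scikit-learn; a sketch on two hypothetical features with very different magnitudes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on wildly different scales (hypothetical).
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])

X_std = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
X_mm = MinMaxScaler().fit_transform(X)     # each column rescaled to [0, 1]
print(X_mm[:, 1])  # → [0.  0.5 1. ]
```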
Duplicates & Consistency
- Duplicates: exact duplicates easy to detect; near duplicates require entity resolution.
- Consistency: standardize formats, units, and textual names (lowercase, remove special chars).
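Standardizing text before deduplicating catches near-duplicates that differ only in case or whitespace; a sketch with pandas (the names are invented):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "alice ", "Bob"]})

# Normalize first (strip whitespace, lowercase), then drop duplicates:
# "Alice" and "alice " collapse into one row.
df["name"] = df["name"].str.strip().str.lower()
df = df.drop_duplicates()
print(len(df))  # → 2
```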
Data Validation
- Automated checks for integrity: Range checks, Type checks, Referential Integrity, Business Rules.
- Best Practice: automate checks in the data pipeline using asserts or data quality tools.
- Example: Referential integrity – DepartmentID in Employees must exist in Departments.
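The checks above can be automated with plain asserts in a pipeline; a sketch of a range check and the DepartmentID referential-integrity rule (tables and column names are hypothetical):

```python
import pandas as pd

employees = pd.DataFrame({"id": [1, 2], "dept_id": [10, 99], "age": [34, 29]})
departments = pd.DataFrame({"dept_id": [10, 20]})

# Range check: ages must be plausible.
assert employees["age"].between(16, 100).all(), "age out of range"

# Referential integrity: every dept_id must exist in departments.
orphans = ~employees["dept_id"].isin(departments["dept_id"])
print(employees.loc[orphans, "id"].tolist())  # → [2] (violating row)
```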
Feature Engineering
- Create new features from existing data (often the biggest performance gains):
- Ratios (e.g., clicks/views → click-through rate)
- Aggregations (average purchase value for a customer)
- Polynomial features (interactions, powers)
- Date/Time features (day of week, month, is weekend)
- Text features (word count, sentiment scores)
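A couple of the feature types above sketched in pandas (the clicks/views numbers and timestamps are made up; 2024-06-01 is a Saturday, 2024-06-03 a Monday):

```python
import pandas as pd

df = pd.DataFrame({
    "clicks": [5, 20],
    "views": [100, 200],
    "ts": pd.to_datetime(["2024-06-01", "2024-06-03"]),
})

# Ratio feature: click-through rate.
df["ctr"] = df["clicks"] / df["views"]

# Date/time features derived from the timestamp (Monday = 0).
df["day_of_week"] = df["ts"].dt.dayofweek
df["is_weekend"] = df["day_of_week"] >= 5
print(df[["ctr", "is_weekend"]])
```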
Text & Image Preprocessing
- Text: Tokenization, Lowercasing, Stop-word removal, Stemming/Lemmatization; Vectorization (e.g., TF-IDF).
- Image: Resize, Normalize pixel values, Data augmentation (rotate, flip, crop), Color space conversion.
Data Ethics & Privacy
- Consent and privacy regulations (GDPR, CCPA);
- PII handling; Bias in data can be learned/amplified by models; Anonymization.
- Real-world concerns: facial recognition bias, medical data privacy, location tracking.