Introduction to Data Quality and Exploration – AI1010 Lecture 8
Data Quality Essentials
- Data quality drives AI performance; better data often has a bigger impact than a better algorithm.
- Scale of data in modern AI: Google Translate > 10B sentence pairs; Tesla Autopilot > 1B miles; GPT pre-training on > 500B tokens (trillions recently); ImageNet-scale image datasets.
Data Types and Structure
- Structured data: rows = objects, columns = measurements; represent the dataset as an n × d matrix, where n = number of samples and d = number of features.
- VC-dimension: relates model complexity to generalization; for linear models in d dimensions, the VC-dimension is d + 1 (d weights plus the bias term).
- Data types: Structured, Unstructured, Semi-structured; examples: CSV/database, text/images, JSON/XML/HTML.
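As a minimal sketch of the structured case (the 3×2 dataset here is made up), rows map to samples and columns to features in an n × d NumPy array:

```python
import numpy as np

# Hypothetical structured dataset: 3 samples (rows), 2 features (columns),
# e.g. height_cm and weight_kg.
X = np.array([
    [170.0, 65.0],
    [182.0, 80.0],
    [158.0, 52.0],
])

n, d = X.shape  # n = number of samples, d = number of features
print(n, d)     # → 3 2
```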
Data Sources & Lifecycle
- Data sources: Internal, External, Generated; Crowdsourced labels.
- Public datasets: ImageNet, Common Crawl, Wikipedia dumps.
- Data lifecycle: iterative, continuous; version control for data and models is crucial for reproducibility.
Data Size, Shape & Splits
- Size categories: Small < 1 GB; Medium 1–100 GB; Big > 100 GB.
- Shape: n = samples, d = features; high n is good; high d risks the "curse of dimensionality".
- Data splitting strategies: Random Split (Train 60–80%, Val 10–20%, Test 10–20%); Stratified Split; Time-based Split; Cross-Validation.
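A stratified split can be sketched with scikit-learn's `train_test_split` (the 10-sample toy labels below are invented for illustration); `stratify=y` preserves the class proportions in both parts:

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, balanced binary labels (hypothetical).
X = list(range(10))
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 80/20 stratified split: class proportions are kept in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # → 8 2
print(sorted(y_test))             # one sample per class → [0, 1]
```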
Exploratory Data Analysis (EDA)
- Purpose: get an overall sense of the data; compute summary statistics (distinct values, min, max, mean, median, variance, skewness).
- Visualization: histograms, scatter plots, higher-dimensional methods.
- EDA supports sanity checks on the data and helps catch skew or anomalies before modeling.
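The summary statistics above are one-liners in pandas; a sketch on a made-up numeric column:

```python
import pandas as pd

# Hypothetical numeric column with one large value.
s = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0])

print(s.nunique())        # distinct values → 4
print(s.min(), s.max())   # range
print(s.mean(), s.median())
print(s.var())            # sample variance
print(s.skew())           # positive: long right tail from the 10.0
```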
Data Quality Issues
- Missing values, Noise, Outliers, Duplicates, Inconsistency.
- Data cleaning is often the largest portion of a project (often up to ~80% of resources).
Missing Data: Detection & Patterns
- Representations: missing values can appear as NaN, NULL, "N/A", empty strings, or placeholders like 0 or -999.
- Missingness types:
- MCAR (Missing Completely At Random): no pattern with any variable.
- MAR (Missing At Random): missingness related to observed variables.
- MNAR (Missing Not At Random): missingness related to the value itself.
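Detecting missingness usually means normalizing placeholders first; a sketch with pandas (the `age` column and the -999 sentinel are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column where missingness hides behind a -999 placeholder.
df = pd.DataFrame({"age": [25, -999, 31, None, 40]})

# Map the placeholder to NaN so all missing values are counted uniformly.
df["age"] = df["age"].replace(-999, np.nan)
print(df["age"].isna().sum())  # → 2
```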
Handling Missing Data: Strategies
- Deletion: Row deletion (few affected rows) or Column deletion (if > 50% missing).
- Imputation (Filling in values): Simple (mean/median for numeric, mode for categorical); Advanced (KNN or model-based).
- Indicator feature: add a binary column indicating whether a value was imputed.
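Median imputation plus an indicator column can be sketched as follows (the `income` data is made up); the indicator is added before filling so the information that a value was missing is preserved:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30000.0, np.nan, 50000.0, 40000.0]})

# Indicator feature first: a model can learn from missingness itself.
df["income_was_missing"] = df["income"].isna().astype(int)

# Simple imputation: fill with the median of the observed values (40000).
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```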
Imputation Methods (Examples)
- Simple: mean/median (numeric), mode (categorical).
- K-NN imputation: fill based on nearest neighbours (e.g., K=2).
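K-NN imputation with K=2 can be sketched with scikit-learn's `KNNImputer` (the 4×2 array is invented); the missing entry is filled with the mean of that feature over the two nearest rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical 2-feature data with one missing value in row 1.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 4.0],
    [8.0, 9.0],
])

# The two nearest rows (by the observed feature) are rows 0 and 2,
# so the gap is filled with (2.0 + 4.0) / 2 = 3.0.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # → 3.0
```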
Outlier Detection & Handling
- Outliers can skew statistics and models.
- Detection:
- Visualization: Boxplots.
- Statistical: Z-score, IQR.
- Boxplot/IQR method: Q1 = 25th percentile, Q3 = 75th percentile; IQR = Q3 − Q1; outliers fall outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR].
- Z-score: Z = (x − μ) / σ; |Z| > 3 is often flagged as an outlier.
- IQR method is robust to non-normal data.
- Handling: remove, cap (winsorize), transform (e.g., log), or keep if genuine rare event.
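The IQR fences above translate directly to NumPy; a sketch on a made-up sample with one suspicious value:

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95])  # 95 looks suspicious

# Quartiles and the 1.5×IQR fences.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lo) | (x > hi)]
print(outliers)  # → [95]
```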
Feature Scaling
- Puts all features on a similar scale; crucial for distance-based algorithms (k-NN, SVM) and helpful for gradient-based training (neural nets). Common methods: standardization (z-score) and min–max scaling.
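Both standard scalers are one-liners in scikit-learn; a sketch on two hypothetical features with very different magnitudes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two features on wildly different scales (hypothetical).
X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])

X_std = StandardScaler().fit_transform(X)  # each column: mean 0, std 1
X_mm = MinMaxScaler().fit_transform(X)     # each column rescaled to [0, 1]
print(X_mm[:, 1])  # → [0.  0.5 1. ]
```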
Duplicates & Consistency
- Duplicates: exact duplicates easy to detect; near duplicates require entity resolution.
- Consistency: standardize formats, units, and textual names (lowercase, remove special chars).
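Standardizing text before deduplicating catches near-duplicates that differ only in case or whitespace; a sketch with pandas (the names are invented):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "alice ", "Bob"]})

# Normalize first (strip whitespace, lowercase), then drop duplicates:
# "Alice" and "alice " collapse into one row.
df["name"] = df["name"].str.strip().str.lower()
df = df.drop_duplicates()
print(len(df))  # → 2
```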
Data Validation
- Automated checks for integrity: Range checks, Type checks, Referential Integrity, Business Rules.
- Best Practice: automate checks in the data pipeline using asserts or data quality tools.
- Example: Referential integrity – DepartmentID in Employees must exist in Departments.
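The checks above can be automated with plain asserts in a pipeline; a sketch of a range check and the DepartmentID referential-integrity rule (tables and column names are hypothetical):

```python
import pandas as pd

employees = pd.DataFrame({"id": [1, 2], "dept_id": [10, 99], "age": [34, 29]})
departments = pd.DataFrame({"dept_id": [10, 20]})

# Range check: ages must be plausible.
assert employees["age"].between(16, 100).all(), "age out of range"

# Referential integrity: every dept_id must exist in departments.
orphans = ~employees["dept_id"].isin(departments["dept_id"])
print(employees.loc[orphans, "id"].tolist())  # → [2] (violating row)
```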
Feature Engineering
- Create new features from existing data (often the biggest performance gains):
- Ratios (e.g., clicks/views → click-through rate)
- Aggregations (average purchase value for a customer)
- Polynomial features (interactions, powers)
- Date/Time features (day of week, month, is weekend)
- Text features (word count, sentiment scores)
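A couple of the feature types above sketched in pandas (the clicks/views numbers and timestamps are made up; 2024-06-01 is a Saturday, 2024-06-03 a Monday):

```python
import pandas as pd

df = pd.DataFrame({
    "clicks": [5, 20],
    "views": [100, 200],
    "ts": pd.to_datetime(["2024-06-01", "2024-06-03"]),
})

# Ratio feature: click-through rate.
df["ctr"] = df["clicks"] / df["views"]

# Date/time features derived from the timestamp (Monday = 0).
df["day_of_week"] = df["ts"].dt.dayofweek
df["is_weekend"] = df["day_of_week"] >= 5
print(df[["ctr", "is_weekend"]])
```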
Text & Image Preprocessing
- Text: Tokenization, Lowercasing, Stop-word removal, Stemming/Lemmatization; Vectorization (e.g., TF-IDF).
- Image: Resize, Normalize pixel values, Data augmentation (rotate, flip, crop), Color space conversion.
Data Ethics & Privacy
- Consent and privacy regulations (GDPR, CCPA);
- PII handling; Bias in data can be learned/amplified by models; Anonymization.
- Real-world concerns: facial recognition bias, medical data privacy, location tracking.