
Introduction to Data Quality and Exploration – AI1010 Lecture 8

Data Quality Essentials

  • Data quality drives AI performance; better data often has a bigger impact than a better algorithm.
  • Scale of data in modern AI: Google Translate > 10B sentence pairs; Tesla Autopilot > 1B miles; GPT pre-training on > 500B tokens (trillions recently); ImageNet-scale image datasets.

Data Types and Structure

  • Structured data: rows = objects, columns = measurements; the dataset as a whole forms an n \times d matrix, where n = number of samples and d = number of features.
  • VC-dimension: relates model complexity to generalization; for linear classifiers in d dimensions, VC\text{-dimension} = d + 1, i.e., roughly the number of features.
  • Data types: Structured, Unstructured, Semi-structured; examples: CSV/database, text/images, JSON/XML/HTML.
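
A minimal sketch of this row/column view (pandas assumed; the toy columns are hypothetical):

```python
import pandas as pd

# Tiny structured dataset: rows = objects (samples), columns = measurements.
df = pd.DataFrame({
    "age":    [25, 32, 47],
    "income": [40_000, 55_000, 72_000],
    "score":  [0.61, 0.74, 0.58],
})

n, d = df.shape                 # the table is an n x d matrix
print(f"n = {n} samples, d = {d} features")

X = df.to_numpy()               # plain numeric matrix for modeling
```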

Data Sources & Lifecycle

  • Data sources: Internal, External, Generated; Crowdsourced labels.
  • Public datasets: ImageNet, Common Crawl, Wikipedia dumps.
  • Data lifecycle: iterative, continuous; version control for data and models is crucial for reproducibility.

Data Size, Shape & Splits

  • Size categories: Small < 1 GB; Medium 1–100 GB; Big > 100 GB.
  • Shape: n = samples, d = features; high n is good; high d risks the "curse of dimensionality".
  • Data splitting strategies: Random Split (Train 60–80%, Val 10–20%, Test 10–20%); Stratified Split; Time-based Split; Cross-Validation.
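
A minimal sketch of a stratified 60/20/20 split (scikit-learn assumed; the data is synthetic):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # 1000 samples, 5 features
y = rng.integers(0, 2, size=1000)         # binary labels

# Carve out the test set first, then split the remainder into train/val.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42)

# 0.25 of the remaining 80% = 20% overall: a 60/20/20 split
# with class proportions preserved in every part.
```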

Exploratory Data Analysis (EDA)

  • Purpose: get an overall sense of the data; compute summary statistics (distinct values, min, max, mean, median, variance, skewness).
  • Visualization: histograms, scatter plots, higher-dimensional methods.
  • EDA supports sanity checks on the data and helps catch skew or anomalies before modeling.
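
A minimal EDA sketch along these lines (pandas assumed; the data is synthetic, with a deliberately skewed column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=500),  # right-skewed
    "age":    rng.integers(18, 80, size=500),
})

print(df.describe())    # count, mean, std, min, quartiles, max
print(df.nunique())     # distinct values per column
print(df.skew())        # skewness; large values flag asymmetric columns
df.hist(bins=30)        # quick histograms (needs matplotlib installed)
```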

Data Quality Issues

  • Missing values, Noise, Outliers, Duplicates, Inconsistency.
  • Data cleaning is often the largest part of a project, commonly estimated at up to ~80% of total effort.

Missing Data: Detection & Patterns

  • Representations: missing values can appear as NaN, NULL, "N/A", empty strings, or placeholders like 0 or -999.
  • Missingness types:
    • MCAR (Missing Completely At Random): no pattern with any variable.
    • MAR (Missing At Random): missingness related to observed variables.
    • MNAR (Missing Not At Random): missingness related to the value itself.
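
A minimal detection sketch (pandas assumed; the placeholder encodings mirror the list above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 40, -999, 31],            # -999 as a placeholder
    "city": ["Oslo", "N/A", "", "Paris", None],    # mixed encodings
})

# Normalize placeholder encodings to real NaN before counting.
df = df.replace({-999: np.nan, "N/A": np.nan, "": np.nan})

print(df.isna().sum())             # missing count per column
print(df.isna().mean().round(2))   # missing fraction per column
```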

Handling Missing Data: Strategies

  • Deletion: Row deletion (when only a few rows are affected) or Column deletion (e.g., when > 50% of values are missing).
  • Imputation (Filling in values): Simple (mean/median for numeric, mode for categorical); Advanced (KNN or model-based).
  • Indicator feature: add a binary column indicating whether a value was imputed (see the sketch below).
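
A minimal sketch combining simple imputation with an indicator feature (pandas assumed; columns hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":  [50_000, np.nan, 62_000, np.nan, 58_000],
    "segment": ["a", "b", None, "a", "a"],
})

# Indicator first, so the model can still see where values were missing.
df["income_was_missing"] = df["income"].isna().astype(int)

# Simple imputation: median for numeric, mode for categorical.
df["income"] = df["income"].fillna(df["income"].median())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])
```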

Imputation Methods (Examples)

  • Simple: mean/median (numeric), mode (categorical).
  • K-NN imputation: fill based on nearest neighbours (e.g., K=2).
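
A minimal K-NN imputation sketch with K=2 (scikit-learn's KNNImputer assumed; the matrix is a toy example):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # entry to fill
              [3.0, 4.0],
              [8.0, 8.0]])

# Each missing entry is filled with the mean of that feature over the
# K=2 rows nearest in the observed features.
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)
```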

Outlier Detection & Handling

  • Outliers can skew statistics and models.
  • Detection:
    • Visualization: Boxplots.
    • Statistical: Z-score, IQR.
  • Boxplot method: let Q1 and Q3 be the 25th and 75th percentiles; IQR = Q3 − Q1; outliers lie beyond [Q1 − 1.5×IQR, Q3 + 1.5×IQR].
  • Z-score: Z = \frac{x - \mu}{\sigma}; |Z| > 3 often an outlier.
  • IQR method is robust to non-normal data.
  • Handling: remove, cap (winsorize), transform (e.g., log), or keep if genuine rare event.
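
A minimal sketch of both detection rules (NumPy assumed; two outliers are planted in synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, size=200), [120.0, -40.0]])

# Z-score rule: |Z| > 3 (assumes roughly normal data).
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])

# IQR rule: robust to non-normal data.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])
```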

Feature Scaling

  • Puts all features on a similar scale; crucial for distance-based algorithms (k-NN, SVM) and helpful for gradient-based training of neural nets.
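
A minimal scaling sketch (scikit-learn assumed; toy matrix):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance
X_mm  = MinMaxScaler().fit_transform(X)    # each feature into [0, 1]
```

In practice the scaler should be fit on the training split only and then applied to validation/test data, so no information leaks across the split.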

Duplicates & Consistency

  • Duplicates: exact duplicates easy to detect; near duplicates require entity resolution.
  • Consistency: standardize formats, units, and textual names (lowercase, remove special chars).
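
A minimal sketch of standardizing text and dropping exact duplicates (pandas assumed; toy records):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice Smith", "alice  smith", "Bob Jones"],
    "city": ["New York", "new york", "Boston"],
})

# Standardize: lowercase, collapse whitespace, strip.
for col in ["name", "city"]:
    df[col] = (df[col].str.lower()
                      .str.replace(r"\s+", " ", regex=True)
                      .str.strip())

# Exact duplicates now match; true near-duplicates (typos, aliases)
# would still need entity resolution, e.g. fuzzy matching.
df = df.drop_duplicates()
```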

Data Validation

  • Automated checks for integrity: Range checks, Type checks, Referential Integrity, Business Rules.
  • Best Practice: automate checks in the data pipeline using asserts or data quality tools.
  • Example: Referential integrity – DepartmentID in Employees must exist in Departments.
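
A minimal sketch of assert-based checks in a pipeline (pandas assumed; toy tables named after the example above):

```python
import pandas as pd

employees = pd.DataFrame({"emp_id": [1, 2, 3],
                          "age": [29, 41, 35],
                          "DepartmentID": [10, 20, 30]})
departments = pd.DataFrame({"DepartmentID": [10, 20, 30]})

# Range check: ages must be plausible.
assert employees["age"].between(16, 100).all(), "age out of range"

# Type check: IDs must be integers.
assert pd.api.types.is_integer_dtype(employees["DepartmentID"])

# Referential integrity: every DepartmentID must exist in Departments;
# an unknown ID here would raise an AssertionError and halt the pipeline.
orphans = ~employees["DepartmentID"].isin(departments["DepartmentID"])
assert not orphans.any(), f"unknown: {employees.loc[orphans, 'DepartmentID'].tolist()}"
```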

Feature Engineering

  • Create new features from existing data (often the biggest performance gains):
    • Ratios (e.g., clicks/views → click-through rate)
    • Aggregations (average purchase value for a customer)
    • Polynomial features (interactions, powers)
    • Date/Time features (day of week, month, is weekend)
    • Text features (word count, sentiment scores)
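
A minimal sketch of several of these constructions (pandas assumed; toy columns):

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["a", "a", "b"],
    "clicks":   [5, 3, 8],
    "views":    [100, 60, 200],
    "amount":   [20.0, 35.0, 15.0],
    "ts": pd.to_datetime(["2024-01-06", "2024-01-08", "2024-01-06"]),
})

df["ctr"] = df["clicks"] / df["views"]                                  # ratio
df["avg_amount"] = df.groupby("customer")["amount"].transform("mean")  # aggregation
df["day_of_week"] = df["ts"].dt.dayofweek                              # date/time
df["is_weekend"] = df["ts"].dt.dayofweek >= 5
```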

Text & Image Preprocessing

  • Text: Tokenization, Lowercasing, Stop-word removal, Stemming/Lemmatization; Vectorization (e.g., TF-IDF).
  • Image: Resize, Normalize pixel values, Data augmentation (rotate, flip, crop), Color space conversion.
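
A minimal text-side sketch (scikit-learn's TfidfVectorizer assumed; tokenization and lowercasing are built in, stop words removed explicitly; stemming/lemmatization would need an extra library such as NLTK or spaCy):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The cat sat on the mat.",
        "A dog chased the cat!",
        "Dogs and cats can be friends."]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)        # sparse n_docs x vocab matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```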

Data Ethics & Privacy

  • Consent and privacy regulations (GDPR, CCPA); PII handling; anonymization.
  • Bias in data can be learned and amplified by models.
  • Real-world concerns: facial recognition bias, medical data privacy, location tracking.