Introduction to Data quality and exploration – AI1010 Lecture 8

Vocabulary flashcards covering key concepts from the lecture notes on data quality and exploration.

37 Terms

1

Data quality

How accurate, complete, and consistent a dataset is; data quality constrains a model's performance, and better data often yields bigger gains than algorithm improvements.

2

Scale of data

The size and volume of data used for training; examples include 10B sentence pairs (Translate), 1B miles (Autopilot), 500B+ words (GPT), and ImageNet-scale image datasets.

3

VC-dimension

A theoretical measure of model capacity (the largest number of points a hypothesis class can shatter); it relates the amount of training data to how well a model generalizes.

4

Structured data

Rows = objects; columns = measurements; each row is a p-dimensional vector, i.e., a point in p-dimensional space; both n (the number of rows) and p (the number of columns) can be large.

5

Sparse matrix

A data representation where most entries are zero, common in text data (many word IDs not present in a document).
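
A minimal sketch of the idea, assuming a toy 10-word vocabulary and two made-up documents (all names and numbers here are hypothetical). Each document stores only its nonzero word counts; the dense form is rebuilt on demand.

```python
# Hypothetical toy data: two tiny "documents" over a 10-word vocabulary.
vocab_size = 10
doc_word_counts = [
    {2: 1, 7: 3},  # document 0 contains word IDs 2 and 7
    {0: 2},        # document 1 contains word ID 0 only
]

def to_dense(row, size):
    """Expand one sparse row (word ID -> count) into a full-length list."""
    return [row.get(j, 0) for j in range(size)]

dense = [to_dense(row, vocab_size) for row in doc_word_counts]

# Fraction of entries that are zero and therefore never stored.
stored = sum(len(row) for row in doc_word_counts)
total = len(doc_word_counts) * vocab_size
sparsity = 1 - stored / total
```

Here only 3 of 20 entries are stored; with a realistic vocabulary of tens of thousands of words the saving is far larger.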

6

Data types

Structured data (tables), Unstructured data (free-form), Semi-structured data (partially organized, e.g., JSON, XML, HTML).

7

Semi-structured examples

JSON, XML, and HTML illustrate data that has some organizational properties but is not fully tabular.

8

Data sources

Internal sources (company databases, logs), External sources (APIs, open datasets), Generated data (simulations), Crowdsourced data (labeling).

9

Data lifecycle

An iterative, continuous process; version control for both data and models is crucial for reproducibility.

10

Data size terminology

Small data (

11

n and d

In a data table, n = number of samples (rows) and d = number of features (columns).

12

Curse of dimensionality

As dimensionality (d) increases, learning becomes harder and more data may be needed; distance metrics become less meaningful.
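
One way to see why distances become less meaningful is to compare the nearest and farthest neighbor of a query point as d grows. The sketch below is an illustration only, assuming uniformly random points in the unit cube; the function name, sample size, and seed are hypothetical choices.

```python
import math
import random

def relative_contrast(d, n=200, seed=0):
    """(farthest - nearest) / nearest distance from one query to n random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = [
        math.dist(query, [rng.random() for _ in range(d)])
        for _ in range(n)
    ]
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)     # nearest neighbor is much closer than the farthest
high_dim = relative_contrast(500)  # all points are at nearly the same distance
```

A small contrast in high dimensions means "nearest" and "farthest" are almost indistinguishable, which hurts distance-based methods.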

13

Exploratory Data Analysis (EDA)

Getting an overall sense of the dataset via summary statistics and visualizations before modeling.

14

Summary statistics

Metrics such as the number of distinct values, max, min, mean, median, variance, and skewness.

15

Skewness

A measure of asymmetry in the distribution of a variable.
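
A short sketch of these summary statistics using Python's standard library, on a made-up sample with one large value. Skewness is computed here as the third standardized moment (population form), which is one common definition and is assumed for illustration.

```python
import statistics

values = [1, 2, 2, 3, 10]  # hypothetical sample with a long right tail

n_distinct = len(set(values))
mean = statistics.fmean(values)
median = statistics.median(values)
variance = statistics.pvariance(values)
std = statistics.pstdev(values)

# Third standardized moment: positive when the right tail is longer.
skewness = sum((x - mean) ** 3 for x in values) / (len(values) * std ** 3)
```

The mean is pulled above the median by the value 10, and the skewness comes out positive, matching the right-skewed shape.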

16

Outliers

Extreme values that may be errors or rare events; detectable via boxplots, Z-scores, and IQR.

17

Missing values

Absent data due to sensor failure or missing user input; represented as NaN/NULL/empty.

18

Missingness MCAR

Missing Completely At Random: missingness is unrelated to any variable, observed or unobserved.

19

Missingness MAR

Missing At Random: missingness depends only on observed variables, not on the missing value itself.

20

Missingness MNAR

Missing Not At Random: missingness depends on the unobserved value itself (e.g., people with high incomes declining to report their income).

21

Handling missing data: Deletion

Row deletion (drop the whole sample) or column deletion (drop the feature) when appropriate.
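
A sketch of both kinds of deletion on a made-up table. The column names and the 50% missing-value threshold for dropping a column are assumptions for illustration, not a rule from the lecture.

```python
# Hypothetical table: each dict is one sample (row).
rows = [
    {"age": 25, "income": 50000, "fax": None},
    {"age": None, "income": 62000, "fax": None},
    {"age": 31, "income": 58000, "fax": "555-0100"},
    {"age": 47, "income": 71000, "fax": None},
]

# Row deletion alone: keep only rows with no missing value at all.
rows_only = [r for r in rows if None not in r.values()]

# Column deletion first: drop features that are mostly missing.
missing_frac = {c: sum(r[c] is None for r in rows) / len(rows) for c in rows[0]}
kept_cols = [c for c in rows[0] if missing_frac[c] <= 0.5]
reduced = [{c: r[c] for c in kept_cols} for r in rows]

# Then row deletion on the remaining columns.
complete = [r for r in reduced if None not in r.values()]
```

Dropping the mostly empty column first keeps three samples; row deletion alone would keep only one.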

22

Imputation

Filling in missing values; simple (mean/median/mode) or advanced (KNN or model-based).

23

Indicator feature

Add a binary column indicating whether a value was imputed.
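
A minimal sketch combining simple mean imputation with an indicator feature, on a made-up column where None marks a missing value.

```python
import statistics

ages = [25.0, None, 31.0, None, 40.0]  # hypothetical column with gaps

# Fit the fill value on observed entries only.
observed = [a for a in ages if a is not None]
fill_value = statistics.fmean(observed)

# Indicator feature: 1 where the original value was missing.
age_was_missing = [int(a is None) for a in ages]

# Imputed column: missing entries replaced by the mean.
ages_imputed = [fill_value if a is None else a for a in ages]
```

Keeping the indicator lets a model learn from the fact that a value was missing, which matters when missingness is not completely random.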

24

KNN imputation

Filling missing values using the values from the k nearest neighbors with known data.
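
A small sketch of the idea with one observed feature and one missing feature. The data, the choice k = 2, and averaging the neighbors' values are assumptions for illustration; library implementations typically handle many features and weighting options.

```python
# Hypothetical complete rows: (height_cm, weight_kg).
known = [(150.0, 50.0), (160.0, 60.0), (170.0, 70.0), (180.0, 80.0)]

# A row whose weight is missing; only its height is observed.
query_height = 172.0
k = 2

# Find the k rows closest to the query on the observed feature.
neighbors = sorted(known, key=lambda row: abs(row[0] - query_height))[:k]

# Impute the missing weight as the mean of the neighbors' weights.
imputed_weight = sum(weight for _, weight in neighbors) / k
```

The two nearest rows are those with heights 170 and 180, so the imputed weight is the mean of 70 and 80.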

25

Outlier detection: Boxplots

Boxplots help visualize distribution and identify potential outliers.

26

Z-score

Standardized value (z = (x - μ)/σ); |z| > 3 often indicates an outlier.
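
A sketch of the |z| > 3 rule on made-up sensor readings, using the population standard deviation (an assumption; the sample standard deviation is also common).

```python
import statistics

# Hypothetical readings: 19 values near 10 and one extreme value.
readings = [9, 10, 11, 10] * 4 + [9, 10, 11, 100]

mu = statistics.fmean(readings)
sigma = statistics.pstdev(readings)

# z = (x - mu) / sigma for every reading.
z_scores = [(x - mu) / sigma for x in readings]

threshold = 3
outliers = [x for x, z in zip(readings, z_scores) if abs(z) > threshold]
```

Note that with the population standard deviation no |z| can exceed sqrt(n - 1), so on samples of 10 or fewer points the threshold of 3 can never fire; the rule is more useful on larger samples.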

27

IQR

Interquartile Range (IQR = Q3 - Q1); outliers are values outside [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR].
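
A sketch of the IQR rule on made-up data, using `statistics.quantiles` from the standard library. Quartile conventions differ slightly between tools, so the exact fences below depend on that function's default method.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical sample

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
```

Because the fences are built from quartiles rather than the mean and standard deviation, the extreme value itself barely shifts them.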

28

Feature scaling

Putting all features on similar scales; important for distance-based algorithms like k-NN and SVM, and for gradient-based training of neural networks.
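
Two common ways to scale a feature, sketched on a made-up income column: min-max scaling to [0, 1] and standardization to zero mean and unit variance. The function names are hypothetical helpers.

```python
import statistics

def min_max_scale(xs):
    """Map values linearly so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

incomes = [30000.0, 50000.0, 70000.0]  # hypothetical feature

incomes_minmax = min_max_scale(incomes)
incomes_std = standardize(incomes)
```

In practice the scaling parameters (min, max, mean, standard deviation) are computed on the training split only and then reused on validation and test data.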

29

Duplicates and consistency

Detect exact duplicates and near duplicates; ensure consistent formats across data.
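
A sketch of duplicate removal on made-up name records. Near duplicates are handled here by normalizing case and whitespace before comparing, which is only one simple assumption; real data often needs fuzzier matching.

```python
def normalize(name):
    """Lowercase and collapse repeated whitespace to get a consistent format."""
    return " ".join(name.lower().split())

records = ["Alice Smith", "alice  smith", "Bob Jones", "Alice Smith"]

seen = set()
unique = []
for record in records:
    key = normalize(record)
    if key not in seen:  # first time this normalized form appears
        seen.add(key)
        unique.append(record)
```

The exact duplicate and the differently formatted copy both collapse onto the first "Alice Smith" entry.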

30

Data validation

Automated checks for data integrity (range, type, referential integrity, business rules).

31

Referential integrity

Foreign keys must exist in the parent table (e.g., DepartmentID in Employees must exist in Departments).
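
A sketch of automated type, range, and referential-integrity checks using the Employees/Departments example. The records, the allowed age range of 16 to 100, and the error messages are hypothetical.

```python
# Hypothetical parent table: the set of valid DepartmentID values.
departments = {10, 20}

# Hypothetical child table.
employees = [
    {"id": 1, "age": 34, "DepartmentID": 10},
    {"id": 2, "age": -5, "DepartmentID": 20},
    {"id": 3, "age": 41, "DepartmentID": 99},
]

def validate(employee):
    """Return a list of rule violations for one employee record."""
    errors = []
    if not isinstance(employee["age"], int):           # type check
        errors.append("age is not an integer")
    elif not 16 <= employee["age"] <= 100:             # range check
        errors.append("age out of range")
    if employee["DepartmentID"] not in departments:    # referential integrity
        errors.append("DepartmentID not in Departments")
    return errors

report = {e["id"]: validate(e) for e in employees}
```

Running checks like these on every data load catches bad records before they reach training.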

32

Feature engineering

Creating new features from existing data (ratios, aggregations, polynomial features, date/time features, text features) to boost performance.
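
A sketch of two of these feature types, a ratio and date/time features, derived from a made-up order record. The field names are hypothetical.

```python
from datetime import date

order = {"date": date(2024, 3, 16), "total": 120.0, "items": 4}

features = {
    # Ratio feature: average price per item.
    "price_per_item": order["total"] / order["items"],
    # Date/time features extracted from the raw date.
    "month": order["date"].month,
    "weekday": order["date"].weekday(),        # Monday = 0 ... Sunday = 6
    "is_weekend": order["date"].weekday() >= 5,
}
```

Whether such features help depends on the task; they are worth keeping only if validation performance improves.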

33

Text preprocessing

Tokenization, lowercasing, stop-word removal, stemming/lemmatization; vectorization (e.g., TF‑IDF).
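
A sketch of the pipeline on two made-up sentences: tokenization, lowercasing, stop-word removal, and a basic TF-IDF weighting. The stop-word list is a tiny hypothetical one, stemming/lemmatization is omitted, and the unsmoothed formula tf × log(N / df) is an assumption; libraries usually apply smoothed variants.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "on", "of", "and"}  # hypothetical short list

def tokenize(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

docs = ["The cat sat on the mat.", "The dog sat."]
tokens = [tokenize(d) for d in docs]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for toks in tokens for term in set(toks))

def tf_idf(toks):
    """Term frequency times log(N / df) for every term in one document."""
    counts = Counter(toks)
    return {
        term: (count / len(toks)) * math.log(len(docs) / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tf_idf(toks) for toks in tokens]
```

The word "sat" appears in both documents, so its weight is zero, while "cat" and "dog" get positive weights because they distinguish the documents.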

34

Image preprocessing

Resizing, normalization, data augmentation (rotate, flip, crop), color space conversion.

35

Data splitting strategies

Random split (train/validation/test), stratified split for class balance, time-based split for time series, cross-validation.
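
A sketch of a stratified train/test split that preserves class proportions, written with the standard library only. The function name, the 25% test fraction, and the seed are hypothetical choices.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.25, seed=0):
    """Return (train_indices, test_indices) with each class split separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_frac)  # same fraction for every class
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

labels = ["a"] * 8 + ["b"] * 4  # hypothetical imbalanced labels (2:1)
train_idx, test_idx = stratified_split(labels)
```

The test set keeps the 2:1 class ratio of the full data, which a plain random split does not guarantee on small or imbalanced datasets.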

36

Data ethics & privacy

Consent, GDPR/CCPA, handling PII, bias, anonymization; concerns like facial recognition bias and privacy.

37

Shuffling before split

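Shuffle data before creating random train/validation/test partitions so that any ordering in the file does not leak into the split; the exception is time series, where a time-based split must keep chronological order.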
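
A sketch contrasting a shuffled random split with the ordered time-based split listed under data splitting strategies. The data, the 80/20 ratio, and the seed are made up for illustration.

```python
import random

samples = list(range(10))  # hypothetical observations, stored in time order
cut = int(len(samples) * 0.8)

# Random split: shuffle a copy first, then cut.
shuffled = samples[:]
random.Random(0).shuffle(shuffled)
random_train, random_test = shuffled[:cut], shuffled[cut:]

# Time-based split: no shuffling, the test set is strictly later than training.
time_train, time_test = samples[:cut], samples[cut:]
```

Shuffling a time series would let the model train on observations that come after the ones it is tested on, so the ordered cut is used there instead.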