Vocabulary flashcards covering key concepts from the lecture notes on data quality and exploration.
Data quality
How accurate, complete, and consistent the data is; data quality constrains a model's performance, and better data often yields bigger gains than algorithm improvements.
Scale of data
The size and volume of data used for training; examples include 10B sentence pairs (Translate), 1B miles (Autopilot), 500B+ words (GPT), and ImageNet-scale image datasets.
VC-dimension
A theoretical measure of model capacity, used to relate the amount of training data to generalization performance.
Structured data
Tabular data where rows are objects and columns are measurements; each row is a point in a p-dimensional space; both n (rows) and p (columns) can be large.
Sparse matrix
A data representation where most entries are zero, common in text data (many word IDs not present in a document).
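For illustration, a minimal sketch of storing a bag-of-words document-term matrix sparsely with SciPy (the counts below are invented):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-term counts: 3 documents, 6 vocabulary words.
# Most entries are zero, so a sparse format stores only the nonzeros.
dense = np.array([
    [2, 0, 0, 1, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [0, 1, 0, 0, 0, 4],
])
sparse = csr_matrix(dense)

print(sparse.nnz)        # 5 stored (nonzero) values instead of 18
print(sparse.toarray())  # round-trip back to a dense array
```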
Data types
Structured data (tables), Unstructured data (free-form), Semi-structured data (partially organized, e.g., JSON, XML, HTML).
Semi-structured examples
JSON, XML, and HTML illustrate data that has some organizational properties but is not fully tabular.
Data sources
Internal sources (company databases, logs), External sources (APIs, open datasets), Generated data (simulations), Crowdsourced data (labeling).
Data lifecycle
An iterative, continuous process; version control for both data and models is crucial for reproducibility.
Data size terminology
Small data (
n and d
In a data table, n = number of samples (rows) and d = number of features (columns).
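As a minimal sketch (column names are hypothetical), pandas reports n and d directly via the DataFrame shape:

```python
import pandas as pd

# Hypothetical table: 3 samples (rows) and 2 features (columns).
df = pd.DataFrame({"age": [25, 32, 41], "income": [48000, 61000, 73000]})

n, d = df.shape
print(n, d)  # 3 2  -> n = number of samples, d = number of features
```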
Curse of dimensionality
As dimensionality (d) increases, learning becomes harder and more data may be needed; distance metrics become less meaningful.
Exploratory Data Analysis (EDA)
Getting an overall sense of the dataset via summary statistics and visualizations before modeling.
Summary statistics
Metrics such as distinct values, max, min, mean, median, variance, and skewness.
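A small sketch of computing these statistics with pandas (the values are invented):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.5, 3.0, 100.0])  # toy feature with one extreme value

print(s.nunique())   # distinct values
print(s.min(), s.max(), s.mean(), s.median())
print(s.var())       # variance
print(s.skew())      # skewness: strongly positive here because of 100.0
print(s.describe())  # count, mean, std, min, quartiles, max in one call
```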
Skewness
A measure of asymmetry in the distribution of a variable.
Outliers
Extreme values that may be errors or rare events; detectable via boxplots, Z-scores, and IQR.
Missing values
Absent data due to sensor failure or missing user input; represented as NaN/NULL/empty.
Missingness MCAR
Missing Completely At Random — no relation to any other variable.
Missingness MAR
Missing At Random — missingness related to observed variables.
Missingness MNAR
Missing Not At Random — missingness related to the value itself.
Handling missing data: Deletion
Row deletion (drop the whole sample) or column deletion (drop the feature) when appropriate.
Imputation
Filling in missing values; simple (mean/median/mode) or advanced (KNN or model-based).
Indicator feature
Add a binary column indicating whether a value was imputed.
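A sketch of simple imputation plus an indicator column in pandas (the column name is hypothetical); whether mean or median is used depends on the feature's distribution:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [48000, np.nan, 73000, np.nan, 52000]})

# Indicator feature: remember which rows were originally missing.
df["income_was_missing"] = df["income"].isna().astype(int)

# Simple imputation: fill missing values with the column median.
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```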
KNN imputation
Filling a missing value using the corresponding values from the k nearest neighbors that have that feature observed.
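A minimal sketch using scikit-learn's KNNImputer, assuming purely numeric features; each missing entry is filled from the k nearest rows as measured on the observed columns:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],   # missing value to be imputed
    [3.0, 6.0],
    [8.0, 8.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)  # the NaN is replaced by the mean of its 2 nearest neighbors
```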
Outlier detection: Boxplots
Boxplots help visualize distribution and identify potential outliers.
Z-score
Standardized value (z = (x - μ)/σ); |z| > 3 often indicates an outlier.
IQR
Interquartile Range (Q3 - Q1); outliers are values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
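A sketch applying both rules to a single numeric column (the data is made up); note that the sample mean and standard deviation are themselves pulled by extreme values, which is one reason the IQR rule is often preferred:

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 95.0])  # toy data with one extreme value

# Z-score rule: |z| > 3 flags an outlier. With this tiny sample the extreme
# value inflates the mean and std, so nothing reaches |z| > 3 here.
z = (x - x.mean()) / x.std()
print(np.abs(z) > 3)

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged.
# The quartiles are robust to the extreme value, so 95.0 is flagged.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr))
```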
Feature scaling
Putting all features on similar scales; important for scale-sensitive algorithms such as k-NN, SVM, and neural networks.
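A sketch with scikit-learn's StandardScaler; the scaler is fit on training data only and then applied to validation/test data to avoid leakage (the arrays here are invented):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[1.5, 200.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled)
print(X_test_scaled)
```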
Duplicates and consistency
Detect exact duplicates and near duplicates; ensure consistent formats across data.
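A sketch of exact-duplicate detection and a simple consistency fix (normalizing text case) in pandas; near-duplicate detection usually needs fuzzy matching, which is not shown here:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "alice", "Bob", "Bob"],
    "dept": ["Sales", "Sales", "IT", "IT"],
})

# Consistency: normalize formatting before comparing rows.
df["name"] = df["name"].str.lower()

print(df.duplicated())     # True for rows identical to an earlier row
df = df.drop_duplicates()  # keep the first occurrence of each row
print(df)
```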
Data validation
Automated checks for data integrity (range, type, referential integrity, business rules).
Referential integrity
Foreign keys must exist in the parent table (e.g., DepartmentID in Employees must exist in Departments).
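A sketch of simple validation checks in pandas, including the referential-integrity example from the card (table and column contents are hypothetical):

```python
import pandas as pd

departments = pd.DataFrame({"DepartmentID": [1, 2, 3]})
employees = pd.DataFrame({
    "EmployeeID": [10, 11, 12],
    "Age": [34, -5, 29],          # -5 violates a range rule
    "DepartmentID": [1, 2, 7],    # 7 violates referential integrity
})

# Range check: ages must be plausible.
bad_age = ~employees["Age"].between(0, 120)

# Referential integrity: every DepartmentID must exist in the parent table.
bad_fk = ~employees["DepartmentID"].isin(departments["DepartmentID"])

print(employees[bad_age])
print(employees[bad_fk])
```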
Feature engineering
Creating new features from existing data (ratios, aggregations, polynomial features, date/time features, text features) to boost performance.
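A sketch of a few such transformations in pandas (the columns and the specific ratio are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "price": [200000, 350000],
    "area": [80, 120],
    "sale_date": pd.to_datetime(["2023-01-15", "2023-07-03"]),
})

df["price_per_m2"] = df["price"] / df["area"]    # ratio feature
df["area_squared"] = df["area"] ** 2             # simple polynomial feature
df["sale_month"] = df["sale_date"].dt.month      # date/time features
df["sale_dayofweek"] = df["sale_date"].dt.dayofweek

print(df)
```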
Text preprocessing
Tokenization, lowercasing, stop-word removal, stemming/lemmatization; vectorization (e.g., TF‑IDF).
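A sketch using scikit-learn's TfidfVectorizer, which handles tokenization, lowercasing, and optional English stop-word removal; stemming/lemmatization would need an extra library such as NLTK or spaCy and is not shown:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat.",
    "Dogs and cats are pets.",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse TF-IDF matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())
print(X.toarray())
```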
Image preprocessing
Resizing, normalization, data augmentation (rotate, flip, crop), color space conversion.
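A sketch with Pillow and NumPy: resizing, scaling pixel values to [0, 1], and a horizontal-flip augmentation (the file path is hypothetical):

```python
import numpy as np
from PIL import Image

img = Image.open("example.jpg").convert("RGB")  # hypothetical input file

img = img.resize((224, 224))                    # resize to a fixed input size
x = np.asarray(img, dtype=np.float32) / 255.0   # normalize pixel values to [0, 1]

flipped = np.flip(x, axis=1)                    # augmentation: horizontal flip

print(x.shape, x.min(), x.max())
```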
Data splitting strategies
Random split (train/validation/test), stratified split for class balance, time-based split for time series, cross-validation.
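A sketch of a stratified random split with scikit-learn; for time series, an ordered split without shuffling (or TimeSeriesSplit) would be used instead:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # toy labels, balanced classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,        # preserve class balance in both partitions
    random_state=42,   # reproducible shuffle
)

print(y_train, y_test)  # each partition keeps roughly the same class ratio
```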
Data ethics & privacy
Consent, GDPR/CCPA, handling PII, bias, anonymization; concerns like facial recognition bias and privacy.
Shuffling before split
Shuffle data before creating random train/validation/test partitions so ordering artifacts do not leak into any split; for time series, use a time-based split instead of shuffling.