Vocabulary flashcards covering key terms from Lecture 3 on Feature Engineering, ETL, data quality, outliers, scaling, encoding, dimensionality reduction, and feature selection.
Structured Dataset
Data organized in rows (cases) and columns (variables), as in spreadsheets or relational tables.
Unstructured Data
Information without a predefined data model, e.g., text, images, video, or audio files.
Quantitative Variable
A numeric variable measuring quantity; may be discrete (counts) or continuous (measurements).
Categorical Variable
A variable whose values represent categories; can be nominal (no order) or ordinal (ordered).
ETL Pipeline
Extract-Transform-Load process that collects, cleans/standardizes, and stores data in a warehouse for analysis.
Internal Data
Structured information generated within an organization (e.g., ERP, CRM systems).
External Data
Data obtained from outside sources (APIs, web scraping) to augment internal datasets.
Structured Format
Data stored in rigid tables with fixed schemas (e.g., CSV, SQL tables).
Semi-Structured Format
Data with loose structure, often key-value pairs, such as JSON or XML files.
Datasheet for Datasets
Documentation describing a dataset’s origin, context, structure, and recommended use.
Data Dictionary
A reference listing variable names, types, allowed values, and meanings for a dataset.
Data Card
Concise document summarizing dataset purpose, collection method, and ethical considerations.
Data Quality Dimension
Aspect used to assess data (availability, usability, reliability, relevance, interpretability).
Feature Engineering
Creating, transforming, or selecting variables to improve model performance.
CRISP-DM
Cross-Industry Standard Process for Data Mining; phases include business understanding, data understanding, preparation, modeling, evaluation, deployment.
Exploratory Data Analysis (EDA)
Initial investigation using statistics and visuals to understand data structure and quality.
Preprocessing
Stage that prepares data by correcting errors, handling outliers, encoding categories, scaling features, and imputing missing values.
Feature Extraction
Reducing dimensionality by deriving informative, lower-dimensional representations of data.
Feature Selection
Choosing the most useful variables while removing redundant or irrelevant ones.
Descriptive Statistics
Numerical summaries such as mean, median, standard deviation, min, and max.
Covariance
Measure of how two variables vary together: positive when they tend to move in the same direction, negative when they move in opposite directions.
Covariance Matrix
Square table showing covariances between all pairs of variables; diagonal contains variances.
Correlation Coefficient (r)
Standardized measure (-1 to 1) of direction and strength of linear relationship between variables.
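A minimal NumPy sketch of both statistics (the data values are made up for the example):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance matrix: diagonal holds variances, off-diagonal the covariance.
cov = np.cov(x, y)              # 2x2 matrix
# Correlation coefficient r in [-1, 1]: covariance standardized by both SDs.
r = np.corrcoef(x, y)[0, 1]
print(cov, r)
```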
Histogram
Bar chart of frequency counts that visualizes a variable’s distribution via bins.
Box Plot
Graphic showing median, interquartile range (IQR), whiskers, and potential outliers.
Interquartile Range (IQR)
Distance between the 25th and 75th percentiles; used in outlier detection.
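The IQR underlies the usual box-plot outlier rule. A minimal sketch, assuming NumPy and the common 1.5 × IQR whisker convention (values are illustrative):

```python
import numpy as np

values = np.array([4, 5, 5, 6, 7, 7, 8, 30])  # 30 is a likely outlier

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
# Box-plot convention: points beyond 1.5 * IQR from the quartiles are flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # [30]
```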
Outlier
Observation markedly distant from typical values, possibly due to error or rare behavior.
Z-Score
Standardized value z = (x − mean) / standard deviation; expresses how many standard deviations an observation lies from the mean.
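A minimal z-score sketch with NumPy (the cutoff of 2 is an illustrative choice for this tiny sample; 3 is a common default):

```python
import numpy as np

values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 40.0])
z = (values - values.mean()) / values.std()
# Flag observations far from the mean in standard-deviation units.
print(values[np.abs(z) > 2])  # [40.]
```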
Percentile Method
Outlier identification by flagging values outside chosen percentile thresholds (e.g., 5th & 95th).
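The same idea in NumPy, flagging anything outside an illustrative 5th-95th percentile band:

```python
import numpy as np

values = np.array([2, 14, 15, 16, 17, 18, 19, 95])
lo, hi = np.percentile(values, [5, 95])
print(values[(values < lo) | (values > hi)])  # values outside the band
```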
Mahalanobis Distance
Multivariate metric measuring how many standardized units an observation lies from the data's center (mean vector), accounting for the covariance between variables.
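A minimal NumPy sketch (the synthetic data and the cutoff of 3 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 2.0]], size=200)

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
# d^2 = (x - mu)^T Sigma^{-1} (x - mu), computed row by row.
d = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
print(X[d > 3])  # points more than ~3 standardized units from the center
```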
MCAR (Missing Completely At Random)
Missingness independent of both observed and unobserved data.
MAR (Missing At Random)
Missingness related to observed variables but not to the missing value itself.
NMAR (Not Missing At Random)
Missingness depends on the unobserved value itself or on an external decision/event.
Logistic Regression
Statistical model that predicts binary outcomes by modeling the log-odds of the positive class as a linear function of the predictors.
One-Hot Encoding
Converting a categorical column into multiple binary indicator columns (0/1).
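A minimal sketch using pandas' get_dummies (the color column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```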
Trimming
Removing observations classified as outliers from the dataset.
Discretization
Binning continuous values into categorical intervals.
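A minimal sketch using pandas' cut (the age bins and labels are illustrative):

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 43, 68, 90])
# Custom bins with labels turn a continuous variable into an ordinal one.
bins = pd.cut(ages, bins=[0, 18, 65, 120], labels=["minor", "adult", "senior"])
print(bins)
```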
Capping (Winsorizing)
Replacing extreme values with specified percentile thresholds (e.g., 1st and 99th).
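A minimal NumPy sketch, capping at the 1st and 99th percentiles as in the definition (values are made up):

```python
import numpy as np

values = np.array([1, 50, 52, 55, 57, 60, 61, 999])
low, high = np.percentile(values, [1, 99])
# Values beyond the thresholds are replaced, not removed.
capped = np.clip(values, low, high)
print(capped)
```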
Mean/Median Imputation
Filling missing numeric values with the feature’s mean or median.
Random Sampling Imputation
Replacing missing entries with randomly selected observed values from the same feature.
Extreme Value Imputation
Filling missing numeric values with the minimum or maximum so that missingness stands out at the distribution's edge.
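A minimal pandas sketch of the three imputation strategies above (the series values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 5.0, 7.0, np.nan, 9.0])

# Mean/median imputation: fill gaps with a central value.
mean_filled = s.fillna(s.mean())

# Random sampling imputation: draw replacements from the observed values.
draws = s.dropna().sample(n=s.isna().sum(), replace=True, random_state=0)
random_filled = s.copy()
random_filled[s.isna()] = draws.to_numpy()

# Extreme value imputation: park missing entries at the distribution's edge.
extreme_filled = s.fillna(s.max())

print(mean_filled.tolist(), random_filled.tolist(), extreme_filled.tolist())
```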
Click-Through Rate (CTR)
Metric defined as clicks divided by impressions, measuring engagement.
Scaling
Transforming features onto comparable ranges to prevent dominance by large-magnitude variables.
Min-Max Scaling
Rescales values to [0,1] using (x-min)/(max-min).
Standard Scaling
Centers and rescales values to mean 0, variance 1 (z-score transformation).
MaxAbs Scaling
Divides values by the feature’s maximum absolute value, producing range [-1,1]; suited for sparse data.
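All three scalers above are available in scikit-learn; a minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler

X = np.array([[-2.0], [1.0], [5.0], [10.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # rescaled to [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # mean 0, variance 1
print(MaxAbsScaler().fit_transform(X).ravel())    # divided by max |x| -> [-1, 1]
```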
Curse of Dimensionality
Problems arising when datasets contain many features, including sparsity, distance inflation, and overfitting.
Principal Component Analysis (PCA)
Technique transforming correlated variables into orthogonal components ordered by explained variance.
Principal Component
Linear combination of original variables capturing maximal variance in data.
Eigenvalue
Amount of variance explained by a principal component.
Eigenvector
Direction (weights) defining a principal component in feature space.
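A minimal scikit-learn sketch tying the last four cards together: the fitted PCA exposes eigenvalues as explained_variance_ and eigenvectors as components_ (the synthetic data is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two strongly correlated features plus one independent noisy one.
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200),
                     rng.normal(size=200)])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_)        # eigenvalues of the covariance matrix
print(pca.components_)                # eigenvectors (component weights)
print(pca.explained_variance_ratio_)  # variance explained per component
```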
Representation Learning
Automatically discovering useful features from raw data, often using neural networks.
TF-IDF Matrix
Term Frequency–Inverse Document Frequency representation quantifying word importance across documents.
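A minimal scikit-learn sketch (the three toy documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran fast"]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)       # sparse (n_docs x n_terms) matrix
print(vec.get_feature_names_out())
print(tfidf.toarray().round(2))
```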
Low-Variance Filter
Feature selection rule removing variables with little or no variability.
Mutual Information (MI)
Non-parametric measure quantifying how much knowing a feature reduces uncertainty about the target.
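A minimal scikit-learn sketch combining both feature-selection filters above (the synthetic features and target are illustrative):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),   # informative feature
    np.full(100, 3.0),      # zero-variance feature
])
y = (X[:, 0] > 0).astype(int)

# Low-variance filter: drop features whose variance does not exceed the threshold.
X_filtered = VarianceThreshold(threshold=0.0).fit_transform(X)

# Mutual information: how much each remaining feature reduces uncertainty about y.
mi = mutual_info_classif(X_filtered, y, random_state=0)
print(X_filtered.shape, mi)
```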
Deep Convolutional Neural Network (CNN)
Neural architecture specialized for extracting hierarchical features from images.
Data Warehouse
Central repository where cleaned and integrated data is stored for analysis.
Duplicate Row
Exact copy of a record; often removed during preprocessing.
Heuristic
Practical rule-of-thumb that produces quick, often approximate solutions.
Data Sparsity
Condition where many feature values are zero, common after one-hot encoding.
Multicollinearity
Situation where predictor variables are highly correlated, complicating model interpretation.
Overfitting
Model learns training data noise, reducing performance on new data.
Variance Explained
Proportion of total variability captured by a model or component.
Indicator Variable
Binary variable signaling presence/absence of a condition (e.g., missingness flag).