Lecture 3 Flashcards: Feature Engineering & Data Preparation

Description and Tags

Vocabulary flashcards covering key terms from Lecture 3 on Feature Engineering, ETL, data quality, outliers, scaling, encoding, dimensionality reduction, and feature selection.


64 Terms

1. Structured Dataset

Data organized in rows (cases) and columns (variables), as in spreadsheets or relational tables.

2. Unstructured Data

Information without a predefined data model, e.g., text, images, video, or audio files.

3. Quantitative Variable

A numeric variable measuring quantity; may be discrete (counts) or continuous (measurements).

4. Categorical Variable

A variable whose values represent categories; can be nominal (no order) or ordinal (ordered).

5. ETL Pipeline

Extract-Transform-Load process that collects, cleans/standardizes, and stores data in a warehouse for analysis.

6. Internal Data

Structured information generated within an organization (e.g., ERP, CRM systems).

7. External Data

Data obtained from outside sources (APIs, web scraping) to augment internal datasets.

8. Structured Format

Data stored in rigid tables with fixed schemas (e.g., CSV, SQL tables).

9. Semi-Structured Format

Data with loose structure, often key-value pairs, such as JSON or XML files.

10. Datasheet for Datasets

Documentation describing a dataset’s origin, context, structure, and recommended use.

11. Data Dictionary

A reference listing variable names, types, allowed values, and meanings for a dataset.

12. Data Card

Concise document summarizing dataset purpose, collection method, and ethical considerations.

13. Data Quality Dimension

Aspect used to assess data (availability, usability, reliability, relevance, interpretability).

14. Feature Engineering

Creating, transforming, or selecting variables to improve model performance.

15. CRISP-DM

Cross-Industry Standard Process for Data Mining; phases include business understanding, data understanding, preparation, modeling, evaluation, deployment.

16. Exploratory Data Analysis (EDA)

Initial investigation using statistics and visuals to understand data structure and quality.

17. Preprocessing

Stage that cleans data by correcting errors, handling outliers, encoding, scaling, and imputing.

18. Feature Extraction

Reducing dimensionality by deriving informative, lower-dimensional representations of data.

19. Feature Selection

Choosing the most useful variables while removing redundant or irrelevant ones.

20. Descriptive Statistics

Numerical summaries such as mean, median, standard deviation, min, and max.

21. Covariance

Measure indicating whether two variables change together (positive or negative).

22. Covariance Matrix

Square table showing covariances between all pairs of variables; diagonal contains variances.

23. Correlation Coefficient (r)

Standardized measure (-1 to 1) of direction and strength of linear relationship between variables.
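As a quick pure-Python sketch (the sample values are made up), Pearson's r can be computed directly from its definition, covariance divided by the product of standard deviations:

```python
import math

def pearson_r(x, y):
    # r = cov(x, y) / (sd_x * sd_y); always falls in [-1, 1]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfectly linear -> 1.0
```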

24. Histogram

Bar chart of frequency counts that visualizes a variable’s distribution via bins.

25. Box Plot

Graphic showing median, interquartile range (IQR), whiskers, and potential outliers.

26. Interquartile Range (IQR)

Distance between the 25th and 75th percentiles; used in outlier detection.
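A minimal sketch of the common 1.5 × IQR outlier rule (quartiles estimated by linear interpolation; the sample data is hypothetical):

```python
def iqr_outliers(values, k=1.5):
    # flag values outside [Q1 - k*IQR, Q3 + k*IQR]
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # linear interpolation between order statistics
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 95]))  # [95]
```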

27. Outlier

Observation markedly distant from typical values, possibly due to error or rare behavior.

28. Z-Score

Standardized value calculated as (x-mean)/standard deviation; identifies deviations from mean.
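The formula translates directly to Python; a common heuristic then flags |z| > 3 as an outlier (sample heights are made up):

```python
import statistics

def z_scores(values):
    # z = (x - mean) / sd
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(x - mean) / sd for x in values]

print(z_scores([160, 170, 180]))  # symmetric around the mean of 170
```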

29. Percentile Method

Outlier identification by flagging values outside chosen percentile thresholds (e.g., 5th & 95th).

30. Mahalanobis Distance

Multivariate metric measuring how many standard deviations an observation lies from the data's center, adjusted for the covariance between variables.

31. MCAR (Missing Completely At Random)

Missingness independent of both observed and unobserved data.

32. MAR (Missing At Random)

Missingness related to observed variables but not to the missing value itself.

33. NMAR (Not Missing At Random)

Missingness depends on the unobserved value or external decision/event.

34. Logistic Regression

Statistical model predicting binary outcomes using log-odds transformation.

35. One-Hot Encoding

Converting a categorical column into multiple binary indicator columns (0/1).
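A minimal hand-rolled version (libraries such as pandas or scikit-learn provide this; the color column is a made-up example):

```python
def one_hot(column):
    # one binary indicator column per distinct category
    categories = sorted(set(column))
    rows = [[1 if value == c else 0 for c in categories] for value in column]
    return rows, categories

rows, cats = one_hot(["red", "green", "red", "blue"])
print(cats)     # ['blue', 'green', 'red']
print(rows[0])  # 'red' -> [0, 0, 1]
```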

36. Trimming

Removing observations classified as outliers from the dataset.

37. Discretization

Binning continuous values into categorical intervals.

38. Capping (Winsorizing)

Replacing extreme values with specified percentile thresholds (e.g., 1st and 99th).
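A minimal sketch using nearest-rank percentiles (the data and the 10th/90th demo thresholds are made up; 1st/99th are common defaults):

```python
def winsorize(values, lower_pct=0.01, upper_pct=0.99):
    # clamp extremes to the chosen percentile values
    xs = sorted(values)
    n = len(xs)
    lo = xs[round(lower_pct * (n - 1))]
    hi = xs[round(upper_pct * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

data = list(range(1, 21)) + [500]  # 500 is an artificial extreme
capped = winsorize(data, lower_pct=0.1, upper_pct=0.9)
print(capped[-1])  # 500 is capped to the 90th-percentile value
```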

39. Mean/Median Imputation

Filling missing numeric values with the feature’s mean or median.
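Sketched in pure Python with `None` standing in for missing entries (the sample values are made up):

```python
import statistics

def impute(values, strategy="median"):
    # fill None entries with the mean or median of the observed values
    observed = [v for v in values if v is not None]
    fill = (statistics.median(observed) if strategy == "median"
            else statistics.mean(observed))
    return [fill if v is None else v for v in values]

print(impute([1, None, 3, 5]))  # [1, 3, 3, 5]
```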

40. Random Sampling Imputation

Replacing missing entries with randomly selected observed values from the same feature.

41. Extreme Value Imputation

Filling missing numeric values with minimum or maximum to flag missingness at distribution edge.

42. Click-Through Rate (CTR)

Metric defined as clicks divided by impressions, measuring engagement.

43. Scaling

Transforming features onto comparable ranges to prevent dominance by large-magnitude variables.

44. Min-Max Scaling

Rescales values to [0,1] using (x-min)/(max-min).
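The formula as a one-function sketch (assumes max > min; the sample values are made up):

```python
def min_max_scale(values):
    # (x - min) / (max - min) maps every value onto [0, 1]
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_scale([2, 4, 6]))  # [0.0, 0.5, 1.0]
```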

45. Standard Scaling

Centers and rescales values to mean 0, variance 1 (z-score transformation).

46. MaxAbs Scaling

Divides values by the feature’s maximum absolute value, producing range [-1,1]; suited for sparse data.

47. Curse of Dimensionality

Problems arising when datasets contain many features, including sparsity, distance inflation, and overfitting.

48. Principal Components Analysis (PCA)

Technique transforming correlated variables into orthogonal components ordered by explained variance.

49. Principal Component

Linear combination of original variables capturing maximal variance in data.

50. Eigenvalue

Amount of variance explained by a principal component.

51. Eigenvector

Direction (weights) defining a principal component in feature space.

52. Representation Learning

Automatically discovering useful features from raw data, often using neural networks.

53. TF-IDF Matrix

Term Frequency–Inverse Document Frequency representation quantifying word importance across documents.
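A minimal sketch of one common variant, tf × ln(N/df), over pre-tokenized documents (the toy documents are made up; real libraries add smoothing and normalization):

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n = len(docs)
    df = Counter()  # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights

w = tf_idf([["cat", "sat"], ["cat", "ran"]])
print(w[0]["cat"])  # 0.0: 'cat' appears in every document, so it carries no signal
```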

54. Low-Variance Filter

Feature selection rule removing variables with little or no variability.
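A minimal sketch over a made-up feature table, dropping near-constant columns:

```python
import statistics

def low_variance_filter(features, threshold=0.0):
    # features: {name: list of values}; keep columns whose variance exceeds threshold
    return {name: vals for name, vals in features.items()
            if statistics.pvariance(vals) > threshold}

kept = low_variance_filter({"constant": [1, 1, 1], "varied": [1, 2, 3]})
print(list(kept))  # ['varied']
```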

55. Mutual Information (MI)

Non-parametric measure quantifying how much knowing a feature reduces uncertainty about the target.
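For two discrete variables, MI is the sum of p(x,y) · log[p(x,y) / (p(x)p(y))]; a sketch with made-up binary data (in nats):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # empirical MI for two discrete variables, in nats
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # (c/n) * log((c/n) / ((px/n)*(py/n))) simplifies to (c/n) * log(c*n / (px*py))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # ln 2: y fully determined by x
print(mutual_information([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0: independent
```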

56. Deep Convolutional Neural Network (CNN)

Neural architecture specialized for extracting hierarchical features from images.

57. Data Warehouse

Central repository where cleaned and integrated data is stored for analysis.

58. Duplicate Row

Exact copy of a record; often removed during preprocessing.

59. Heuristic

Practical rule-of-thumb that produces quick, often approximate solutions.

60. Data Sparsity

Condition where many feature values are zero, common after one-hot encoding.

61. Multicollinearity

Situation where predictor variables are highly correlated, complicating model interpretation.

62. Overfitting

Model learns training data noise, reducing performance on new data.

63. Variance Explained

Proportion of total variability captured by a model or component.

64. Indicator Variable

Binary variable signaling presence/absence of a condition (e.g., missingness flag).