Data Cleaning, Preprocessing, Transformation, and Visualization

Last updated 10:08 PM on 6/17/26

35 Terms

1
Exploratory data analysis (EDA)

Early analysis to understand patterns, issues, and distributions before modeling.

2
Data preprocessing

Cleaning, transforming, and organizing raw data before analysis/modeling.

3
Data cleaning / cleansing

Finding and correcting errors/inconsistencies in a dataset.

4
Data cleaning software

Tools that automate some cleaning based on predefined rules; still may need human review.

5
Manual review

Human inspection to decide whether questionable data should be removed or kept.

6
Irrelevant feedback entries

Data not useful for the current analysis; should be reviewed before removal.

7
Missing data

Expected values not present; may require removal or imputation.

8
Imputation

Filling missing values with estimates such as mean or median.
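
A minimal sketch of mean and median imputation, assuming pandas is available (the card itself names no library). The `age` column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age and one large value (90).
df = pd.DataFrame({"age": [22.0, 35.0, np.nan, 41.0, 90.0]})

# Mean imputation: the estimate is pulled upward by the large value.
mean_filled = df["age"].fillna(df["age"].mean())

# Median imputation: less affected by extreme values.
median_filled = df["age"].fillna(df["age"].median())

print(mean_filled[2], median_filled[2])  # 47.0 38.0
```

The gap between the two fills (47.0 versus 38.0) shows why the median is often preferred for skewed columns.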

9
Duplicate rows

Repeated data entries; can bias analysis.
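
One way to detect and drop duplicate rows, sketched with pandas (an assumption; the card does not name a tool). The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical feedback data in which the "b" row was entered twice.
df = pd.DataFrame({
    "customer": ["a", "b", "b", "c"],
    "score": [5, 3, 3, 4],
})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))  # 1 3
```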

10
Static columns

Columns with constant values; often removed because they add no predictive information.

11
Low variance columns

Columns with little variation; may have little modeling value.
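
A rough sketch of finding static and low variance columns with pandas (assumed here, not named in the cards). The data and the `THRESHOLD` cutoff are hypothetical; a sensible cutoff depends on each column's scale.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "US", "US"],   # static: a single constant value
    "flag": [0.0, 0.0, 0.0, 0.01],         # low variance: almost constant
    "income": [30.0, 52.0, 47.0, 88.0],    # varies meaningfully
})

# Static columns have at most one distinct value.
static_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# Low variance columns: numeric columns whose variance is below a cutoff.
THRESHOLD = 1e-3  # hypothetical cutoff, chosen only for this toy data
numeric = df.select_dtypes(include="number")
variances = numeric.var()
low_var_cols = [
    c for c in numeric.columns
    if c not in static_cols and variances[c] < THRESHOLD
]

reduced = df.drop(columns=static_cols + low_var_cols)
print(static_cols, low_var_cols, list(reduced.columns))
```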

12
Standardization via Z-score

Scales data so each feature has mean 0 and standard deviation 1.
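
As a small illustration, the Z-score is z = (x - mean) / standard deviation. The NumPy sketch below uses made-up values and the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # illustrative feature values

# Subtract the mean, then divide by the standard deviation.
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())  # approximately 0 and 1
```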

13
Min-max scaling

Scales values into a fixed range, commonly 0 to 1.
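
A minimal NumPy sketch of min-max scaling to the range 0 to 1, using made-up values. It assumes the feature is not constant; a constant feature would make the denominator zero.

```python
import numpy as np

x = np.array([10.0, 15.0, 20.0, 30.0])  # illustrative feature values

# (x - min) / (max - min) maps the minimum to 0 and the maximum to 1.
scaled = (x - x.min()) / (x.max() - x.min())

print(scaled)  # [0.   0.25 0.5  1.  ]
```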

14
Feature engineering

Selecting, modifying, or creating features to improve model performance.

15
Feature transformation

Changing a feature’s values to make them more useful for modeling.

16
Skewness

Asymmetry in a distribution; one tail is longer or heavier.
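
One common way to quantify skewness is the third standardized moment. The sketch below computes it with NumPy; `sample_skewness` is a hypothetical helper name, and the data are made up.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: positive means a longer right tail."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

right_skewed = [1, 1, 2, 2, 3, 50]  # one large value stretches the right tail
symmetric = [1, 2, 3, 4, 5]         # mirror image around the mean

print(sample_skewness(right_skewed))  # positive
print(sample_skewness(symmetric))     # 0.0
```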

17
Long tail

A distribution where a small number of observations stretch far from most values.

18
Outlier

A value very different from most others; can strongly affect models such as linear regression.

19
Linear regression

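A regression model that fits a straight-line (linear) relationship between features and a target, typically by least squares; it can be sensitive to outliers because extreme points pull the fitted line/parameters toward themselves.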
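
To illustrate the sensitivity, this sketch fits a least-squares line with `np.polyfit` before and after replacing one point with an extreme value. The data are synthetic and chosen only for demonstration.

```python
import numpy as np

# Synthetic data lying exactly on the line y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Replace the last point (originally 19) with an extreme outlier.
y_out = y.copy()
y_out[-1] = 100.0

slope_out, intercept_out = np.polyfit(x, y_out, 1)

print(slope_clean)  # approximately 2
print(slope_out)    # much larger: one point pulled the whole line
```

A single outlier out of ten points more than doubles the slope here, which is the kind of biased parameter estimate the surrounding cards describe.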

20
Parameter

A learned model value, such as a regression coefficient.

21
Biased prediction

A prediction systematically pushed in a wrong/skewed direction.

22
Robust model

A model less affected by outliers/noise.

23
Power transformation

A mathematical transformation that raises values to a power to improve distribution shape.

24
Box-Cox power transformation

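A family of power transformations, controlled by a parameter lambda, used to stabilize variance and make strictly positive data more normal; lambda = 0 corresponds to the logarithm.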
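
A sketch of the Box-Cox formula for a fixed lambda: (x^lambda - 1) / lambda when lambda is not 0, and log(x) when lambda is 0. The `box_cox` helper below is hypothetical and written with NumPy. In practice lambda is usually estimated from the data (for example, `scipy.stats.boxcox` can do this), which this sketch does not attempt.

```python
import numpy as np

def box_cox(values, lam):
    """Apply the Box-Cox transform for a given lambda (positive data only)."""
    x = np.asarray(values, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(x)            # limiting case as lambda approaches 0
    return (x ** lam - 1.0) / lam

values = np.array([1.0, 4.0, 9.0, 100.0])  # illustrative data

print(box_cox(values, 0.5))  # [ 0.  2.  4. 18.]
print(box_cox(values, 0))    # same as np.log(values)
```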

25
Exponent

The power used in a transformation; exponent 1/2 means square root.

26
Square root transformation

Transformation using exponent 1/2; reduces right-skew and long tails in many positive-valued features.

27
Reciprocal transformation

Uses 1/x or exponent -1, not exponent 1/2.

28
Logarithm transformation

Uses logarithms; another common skewness treatment but not the same as square root.
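
The square root, reciprocal, and logarithm transformations can be compared side by side. This NumPy sketch uses made-up positive values with one large entry; all three transformations assume strictly positive input here.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0, 400.0])  # illustrative right-skewed data

sqrt_x = x ** 0.5     # exponent 1/2: square root
recip_x = x ** -1.0   # exponent -1: reciprocal, 1/x
log_x = np.log(x)     # natural logarithm

print(sqrt_x)   # [ 1.  2.  3.  4. 20.]
print(recip_x)  # largest input becomes the smallest output
print(log_x)
```

The square root shrinks the gap between 16 and 400 from 384 to 16, and the logarithm compresses it further. The reciprocal also reverses the ordering of the values, which matters when interpreting the transformed feature.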

29
Normal distribution

Symmetric, bell-shaped distribution; some models and statistical methods work better when features are closer to normal.

30
Histogram

Chart showing counts/frequency of values within bins.

31
Bins

Intervals used to group values in a histogram or binning strategy.
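
To make bins concrete, this sketch counts values per bin with `np.histogram`. The data, the number of bins, and the range are chosen only for illustration.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 7, 8, 9, 10])  # illustrative data

# Three equal-width bins over 0 to 12: [0, 4), [4, 8), [8, 12].
# NumPy treats the last bin as closed on the right.
counts, edges = np.histogram(values, bins=3, range=(0, 12))

print(edges)   # [ 0.  4.  8. 12.]
print(counts)  # [6 1 3]
```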

32
Matplotlib

Python library for creating graphs and charts.

33
Pyplot

Matplotlib's plotting interface (the matplotlib.pyplot module, conventionally imported as plt), used to create figures and charts such as histograms.
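
A short sketch of drawing a histogram through pyplot. The sample is randomly generated for illustration. The non-interactive "Agg" backend and the in-memory buffer are assumptions made so the script runs without a display or a file on disk; interactive use would typically call `plt.show()` instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic right-skewed sample (exponential distribution).
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(data, bins=20)  # 20 equal-width bins
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.set_title("Histogram of a right-skewed sample")

# Render to an in-memory PNG instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```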

34
NumPy

Python library for arrays, linear algebra, and numerical operations.

35
Append

Adding additional rows or columns from another dataset to an existing one; a data operation, not a plotting module.
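
One way to append data, sketched with `pd.concat` from pandas (an assumption; the card names no library). The frames and column names are hypothetical.

```python
import pandas as pd

jan = pd.DataFrame({"id": [1, 2], "amount": [10.0, 12.5]})
feb = pd.DataFrame({"id": [3], "amount": [7.0]})

# Append rows: stack frames that share the same columns.
combined_rows = pd.concat([jan, feb], ignore_index=True)

# Append columns: place frames side by side, aligned on the row index.
regions = pd.DataFrame({"region": ["east", "west", "east"]})
combined_cols = pd.concat([combined_rows, regions], axis=1)

print(combined_rows.shape, combined_cols.shape)  # (3, 2) (3, 3)
```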