Lecture 5: Data cleaning

19 Terms

1

Q: What percentage of a data scientist’s time is spent cleaning data?

About 80%, according to a widely quoted rule of thumb.

2

Q: What phrase captures the importance of clean data?

“Garbage in, garbage out.”

3

Q: What are key steps in data cleaning?

Remove duplicates, fix typos/structural errors, handle outliers carefully, and manage missing data.
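
The first two steps can be sketched in a few lines. This is a minimal illustration in Python (chosen here only for illustration; the card set itself mentions R), and the records, field names, and helper functions are hypothetical, not taken from the lecture.

```python
# Hypothetical toy records; the field names are illustrative only.
records = [
    {"city": " Berlin", "temp_c": 21.5},
    {"city": "berlin", "temp_c": 21.5},
    {"city": "Paris", "temp_c": None},
    {"city": "PARIS ", "temp_c": 19.0},
]

def normalize(record):
    """Fix structural errors: stray whitespace and inconsistent casing."""
    return {"city": record["city"].strip().title(), "temp_c": record["temp_c"]}

def deduplicate(rows):
    """Drop exact duplicates while preserving the original order."""
    seen, unique = set(), []
    for row in rows:
        key = (row["city"], row["temp_c"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Normalize first, so that " Berlin" and "berlin" are recognized as duplicates.
cleaned = deduplicate([normalize(r) for r in records])

# Flag rows with missing values so they can be handled explicitly later.
missing = [r for r in cleaned if r["temp_c"] is None]
```

Normalizing before deduplicating matters: without it, the two Berlin rows would not compare equal.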

4

Q: What is an outlier?

A data point that is significantly different from the rest.

5

Q: Name three methods for detecting outliers.

Statistical methods (e.g., Grubbs’ test), distance-based methods, and density-based methods.

6

Q: What is trimming in outlier treatment?

Removing extreme values from the dataset.
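
A minimal Python sketch of trimming, assuming a simple index-based quantile rule. The function name `trim`, the cutoffs, and the sample data are hypothetical choices for illustration.

```python
def trim(values, lower_q=0.05, upper_q=0.95):
    """Drop values outside the [lower_q, upper_q] empirical quantile range."""
    ordered = sorted(values)
    n = len(ordered)
    # Simple quantile rule: pick the order statistic at the floored index.
    lo = ordered[int(lower_q * (n - 1))]
    hi = ordered[int(upper_q * (n - 1))]
    return [v for v in values if lo <= v <= hi]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]

# Trim only the upper tail in this example.
trimmed = trim(data, lower_q=0.0, upper_q=0.9)
```

Note that trimming shortens the dataset: the extreme value 95 is removed rather than adjusted.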

7

Q: What is winsorization?

Replacing extreme values with less extreme ones, typically by capping them at a chosen percentile of the data.
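
A minimal Python sketch of winsorization, assuming a simple index-based quantile rule. The function name `winsorize`, the cutoffs, and the sample data are hypothetical choices for illustration.

```python
def winsorize(values, lower_q=0.05, upper_q=0.95):
    """Cap values at the lower_q and upper_q empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    # Simple quantile rule: pick the order statistic at the floored index.
    lo = ordered[int(lower_q * (n - 1))]
    hi = ordered[int(upper_q * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]

# Cap only the upper tail in this example.
capped = winsorize(data, lower_q=0.0, upper_q=0.9)
```

Unlike removing values, this keeps the dataset the same length: the extreme value 95 is pulled in to the cap instead of being dropped.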

8

Q: What is Mahalanobis distance used for?

Detecting outliers in multivariate data by measuring each point's distance from the mean, scaled by the covariance of the data.
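
A small Python sketch for the two-dimensional case, with the 2x2 covariance inverse written out by hand so that no external library is needed. The function name, the data, and the two query points are hypothetical.

```python
import math

def mahalanobis_2d(point, data):
    """Mahalanobis distance of `point` from the mean of 2-D `data`."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    # Sample variances and covariance.
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse expanded.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical, strongly correlated data lying close to the line y = x.
data = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1), (6, 5.9)]

# Both points are equally far from the mean (3.5, 3.5) in Euclidean terms.
d_along = mahalanobis_2d((5, 5), data)  # follows the correlation
d_off = mahalanobis_2d((5, 2), data)    # violates the correlation
```

The point that breaks the correlation pattern gets a much larger distance, even though both points are equally far from the mean in the Euclidean sense.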

9

Q: What are naive methods of imputation?

Mean imputation, median imputation, linear interpolation.
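
The three methods can be sketched in Python as follows. The helper names and the toy series are hypothetical, and missing values are represented as `None` by assumption.

```python
from statistics import mean, median

def impute_constant(series, fill):
    """Replace every missing value with the same constant."""
    return [fill if v is None else v for v in series]

def impute_linear(series):
    """Fill interior gaps by interpolating between the nearest observed neighbors."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled  # leading or trailing gaps are left as None

series = [1.0, None, 3.0, None, None, 9.0]
observed = [v for v in series if v is not None]

by_mean = impute_constant(series, mean(observed))
by_median = impute_constant(series, median(observed))
by_linear = impute_linear(series)
```

Linear interpolation only makes sense when the order of the values is meaningful, as in a time series.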

10

Q: Why can naive imputation methods be dangerous?

They can distort variances and covariances, causing biased results.
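
A short Python illustration of the variance distortion, using a hypothetical series with two missing values (represented as `None`). Mean imputation adds points that sit exactly at the mean, so the sample variance shrinks.

```python
from statistics import mean, variance

# Hypothetical series with two missing values.
observed = [2.0, 4.0, None, 8.0, None, 12.0]

seen = [v for v in observed if v is not None]
fill = mean(seen)
imputed = [fill if v is None else v for v in observed]

# Sample variance before and after mean imputation.
var_observed = variance(seen)
var_imputed = variance(imputed)
```

Here `var_imputed` is smaller than `var_observed`: the imputed points contribute nothing to the sum of squared deviations but still increase the sample size.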

11

Q: What is low-rank matrix completion?

Filling missing values based on the assumption that data lies in a low-dimensional subspace (e.g., Netflix problem).
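
As a toy illustration of the idea, the following Python sketch fits a rank-1 model to the observed entries by alternating least squares and uses it to fill the gap. This is one simple approach under strong assumptions (exactly rank-1 data, at least one observed entry per row and column); practical methods are considerably more elaborate. The function name and the ratings matrix are hypothetical.

```python
def complete_rank1(matrix, iterations=200):
    """Fill None entries assuming matrix[i][j] is approximately u[i] * v[j]."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Least-squares update of each row factor, using observed entries only.
        for i in range(n_rows):
            cols = [j for j in range(n_cols) if matrix[i][j] is not None]
            u[i] = sum(matrix[i][j] * v[j] for j in cols) / sum(v[j] ** 2 for j in cols)
        # Least-squares update of each column factor, using observed entries only.
        for j in range(n_cols):
            rows = [i for i in range(n_rows) if matrix[i][j] is not None]
            v[j] = sum(matrix[i][j] * u[i] for i in rows) / sum(u[i] ** 2 for i in rows)
    return [
        [matrix[i][j] if matrix[i][j] is not None else u[i] * v[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]

# Hypothetical user-by-item ratings; every row is a multiple of [1, 2, 3].
ratings = [
    [1.0, 2.0, None],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, 9.0],
]
completed = complete_rank1(ratings)
```

Because the observed entries are consistent with a single rank-1 pattern, the missing rating is recovered from the structure shared across rows and columns.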

12

Q: What is “tidy data”?

Data where each variable is a column, each observation is a row, and each value is a cell.
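
To make the definition concrete, here is a small Python sketch that turns a "wide" table, where year values are spread across column names, into tidy rows. The table, the function name `to_long`, and its parameter names are hypothetical.

```python
# Hypothetical wide table: the years are stored as column names, not as values.
wide = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]

def to_long(rows, id_col, names_to, values_to):
    """Turn every non-id column into its own (name, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_col:
                long_rows.append({id_col: row[id_col], names_to: column, values_to: value})
    return long_rows

tidy = to_long(wide, id_col="country", names_to="year", values_to="cases")
```

In the result, `country`, `year`, and `cases` are each a column, and every row is a single observation.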

13

Q: What R functions are used to reshape data?

pivot_longer() (formerly gather()) and pivot_wider() (formerly spread()).

14

Q: Name two R packages useful for missing data imputation.

missForest and imputeTS.

15

Q: What are depth-based approaches to outlier detection?

Methods that identify outliers based on their position on the outer layers (convex hulls) of the data space.
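
One way to sketch this in Python is convex hull peeling in two dimensions: repeatedly compute the convex hull, assign its vertices the current depth, and remove them. The hull routine below is Andrew's monotone chain algorithm; the function names and the sample points are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peeling_depth(points):
    """Depth 1 is the outermost hull, depth 2 the next layer, and so on."""
    depth, remaining, layer = {}, set(points), 1
    while remaining:
        hull = convex_hull(list(remaining))
        for p in hull:
            depth[p] = layer
        remaining -= set(hull)
        layer += 1
    return depth

points = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (3, 3), (1, 3), (2, 2)]
depth = peeling_depth(points)
```

Points with the smallest depth lie on the outermost layers and are the outlier candidates; the central point `(2, 2)` ends up with the largest depth.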

16

Q: Why are angles more reliable than distances in high-dimensional spaces?

Because pairwise distances concentrate in high dimensions and lose their contrast, while angles between difference vectors remain more informative (the basis for methods like ABOD).
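
The distance half of this claim can be checked empirically. The Python sketch below measures the relative contrast, (max - min) / min, of distances from the origin to uniformly random points in a unit cube, in 2 and in 500 dimensions. The function name, the number of points, and the seed are arbitrary choices; the sketch does not implement ABOD itself.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of distances from the origin to uniform random points."""
    rng = random.Random(seed)
    dists = [
        math.sqrt(sum(rng.random() ** 2 for _ in range(dim)))
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

contrast_2d = relative_contrast(2)
contrast_500d = relative_contrast(500)
```

In low dimensions the nearest and farthest points differ by a large factor; in 500 dimensions all distances fall into a narrow band, so "near" and "far" carry little information.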

17

Q: What is multiple imputation?

Generating multiple different versions of imputed datasets to better capture uncertainty.
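
A deliberately simplified Python sketch of the idea: each missing value is drawn from a normal distribution fitted to the observed values, this is repeated `m` times, and the per-dataset estimates of the mean are pooled with Rubin's rules. Proper multiple imputation would also account for uncertainty in the fitted parameters and would usually condition on other variables; the function names and data are hypothetical.

```python
import random
from statistics import mean, stdev, variance

def multiple_impute(series, m=5, seed=0):
    """Return m completed copies of `series`, each with its own random draws."""
    rng = random.Random(seed)
    seen = [v for v in series if v is not None]
    mu, sigma = mean(seen), stdev(seen)
    return [
        [rng.gauss(mu, sigma) if v is None else v for v in series]
        for _ in range(m)
    ]

def pool_means(datasets):
    """Pool the sample mean across imputed datasets using Rubin's rules."""
    m = len(datasets)
    estimates = [mean(d) for d in datasets]
    within = mean([variance(d) / len(d) for d in datasets])  # average sampling variance
    between = variance(estimates)                            # spread across imputations
    return mean(estimates), within + (1 + 1 / m) * between

series = [4.1, None, 5.0, 3.8, None, 4.6, 5.2]
datasets = multiple_impute(series)
pooled_mean, total_variance = pool_means(datasets)
```

The between-imputation term is what a single imputed dataset cannot provide: it reflects how much the estimate changes depending on which values were filled in.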

18

Q: When should you NOT simply remove outliers?

When they might carry important information for the model.

19

Q: What does the Local Outlier Factor (LOF) method do?

Detects outliers by comparing the density around a point to the density around its neighbors.
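
A compact Python sketch of the LOF computation for small datasets. It assumes no duplicate points and ignores ties at the k-th neighbor, both of which a full implementation would handle; the function name, `k`, and the sample points are hypothetical.

```python
import math

def lof_scores(points, k=2):
    """Local Outlier Factor for each point (brute force, small data only)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors of each point (excluding itself) and its k-distance.
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average ratio of the neighbors' density to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Four points in a tight cluster plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

Scores close to 1 mean a point is about as dense as its neighbors; the isolated point `(5, 5)` receives a score well above 1.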