Lecture 5: Data cleaning

19 Terms

New cards

Q: What percentage of a data scientist’s time is spent cleaning data?

About 80%, according to a commonly cited rule of thumb.

Q: What phrase captures the importance of clean data?

“Garbage in, garbage out.”

Q: What are key steps in data cleaning?

Remove duplicates, fix typos/structural errors, handle outliers carefully, and manage missing data.
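
Example (a minimal Python sketch of the first two steps, not a full cleaning pipeline): normalize stray whitespace and capitalization, then drop the duplicates that this normalization reveals. The function name clean_records and the sample records are made up for illustration.

```python
def clean_records(records):
    """Normalize text fields, then drop duplicates (first occurrence wins)."""
    seen = set()
    cleaned = []
    for name, city in records:
        # Fix structural errors: stray whitespace and inconsistent capitalization.
        key = (" ".join(name.split()).title(), " ".join(city.split()).title())
        if key not in seen:  # remove duplicates
            seen.add(key)
            cleaned.append(key)
    return cleaned

# Hypothetical raw data: the first two rows are the same person typed differently.
raw = [("alice  smith", "paris"), ("Alice Smith ", "Paris"), ("bob jones", "LYON")]
print(clean_records(raw))  # [('Alice Smith', 'Paris'), ('Bob Jones', 'Lyon')]
```

Normalizing before deduplicating matters here: the two "Alice Smith" rows are not exact duplicates until their formatting is fixed.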

Q: What is an outlier?

A data point that is significantly different from the rest.

Q: Name three methods for detecting outliers.

Statistical methods (e.g., Grubbs’ test), distance-based methods, and density-based methods.

Q: What is trimming in outlier treatment?

Removing extreme values from the dataset.

Q: What is winsorization?

Replacing extreme values with values closer to the bulk of the data.
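
Example (a small Python sketch contrasting trimming and winsorization): both functions treat the k smallest and k largest values as extreme, which is an assumption made for illustration; in practice the cutoff is often a percentile. The names trim and winsorize and the sample data are hypothetical.

```python
def trim(values, k):
    """Trimming: drop the k smallest and k largest values (requires 2 * k < len)."""
    s = sorted(values)
    return s[k:len(s) - k]

def winsorize(values, k):
    """Winsorization: clamp the k smallest/largest values to the nearest kept value."""
    s = sorted(values)
    low, high = s[k], s[len(s) - k - 1]
    return [min(max(v, low), high) for v in values]

data = [2, 3, 3, 4, 5, 5, 6, 100]
print(trim(data, 1))       # [3, 3, 4, 5, 5, 6]
print(winsorize(data, 1))  # [3, 3, 3, 4, 5, 5, 6, 6]
```

Trimming shrinks the dataset, while winsorization keeps its size and only pulls the extremes inward.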

Q: What is Mahalanobis distance used for?

Detecting outliers in multivariate data.
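
Example (a Python sketch for the two-variable case only, so the inverse of the covariance matrix can be written out by hand): each point's Mahalanobis distance from the sample mean is computed, and the point that breaks the correlation between the variables gets the largest distance even though neither coordinate is extreme on its own. The function name and the data are illustrative assumptions.

```python
from math import sqrt

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # must be non-zero for the inverse to exist
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse expanded.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(sqrt(d2))
    return dists

# Hypothetical data: y roughly equals x, except for the last point (2, 5).
pts = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (6, 5.8), (2, 5)]
dists = mahalanobis_2d(pts)
print(dists.index(max(dists)))  # 6, the index of the point (2, 5)
```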

Q: What are naive methods of imputation?

Mean imputation, median imputation, linear interpolation.

Q: Why can naive imputation methods be dangerous?

They can distort variances and covariances (mean imputation, for instance, shrinks the variance), which biases downstream results.
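
Example (a short Python illustration of the variance distortion, using made-up numbers): filling every missing entry with the observed mean leaves the mean unchanged but lowers the sample variance, because the filled-in values add no spread.

```python
from statistics import mean, variance

observed = [4.0, 7.0, 1.0, 9.0, 5.0, 10.0]  # values that were actually recorded
n_missing = 4                                # hypothetical number of missing entries

# Mean imputation: every missing entry gets the mean of the observed values.
imputed = observed + [mean(observed)] * n_missing

print(mean(observed), mean(imputed))          # the mean is preserved
print(variance(observed), variance(imputed))  # the variance is not: it drops
```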

Q: What is low-rank matrix completion?

Filling missing values under the assumption that the data matrix is approximately low-rank, i.e., that the data lies in a low-dimensional subspace (e.g., the Netflix problem of predicting missing user ratings).
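
Example (a toy Python sketch, assuming the simplest possible case of a rank-1 matrix): missing cells are marked None and filled with u[i] * v[j], where u and v are fitted to the observed cells by alternating least squares. This is one of several ways to do matrix completion and is not necessarily the method used in the lecture; complete_rank1 and the sample matrix are hypothetical.

```python
def complete_rank1(M, iters=100):
    """Fill None entries of M with a rank-1 fit u[i] * v[j] (alternating least squares)."""
    n_rows, n_cols = len(M), len(M[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iters):
        # Fix v and solve a least-squares problem for each u[i] over observed cells.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if M[i][j] is not None]
            u[i] = sum(M[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Fix u and solve for each v[j] in the same way.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if M[i][j] is not None]
            v[j] = sum(M[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    # Keep observed values; use the fit only where data was missing.
    return [[M[i][j] if M[i][j] is not None else u[i] * v[j]
             for j in range(n_cols)] for i in range(n_rows)]

# The full matrix is the outer product of [1, 2, 3] and [1, 2, 4]; two cells are hidden.
M = [[1, 2, None],
     [2, 4, 8],
     [None, 6, 12]]
filled = complete_rank1(M)
print(filled[0][2], filled[2][0])  # close to the hidden values 4 and 3
```

The sketch assumes every row and column has at least one observed cell; otherwise the corresponding factor cannot be estimated.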

Q: What is “tidy data”?

Data where each variable is a column, each observation is a row, and each value is a cell.

Q: What R functions are used to reshape data?

pivot_longer() (which supersedes gather()) and pivot_wider() (which supersedes spread()), both from the tidyr package.
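
Example (a plain-Python sketch of what the two reshaping operations do, not the tidyr API): the helper functions below are hypothetical and only borrow the tidyr names to show the idea on a list of dictionaries. Going from wide to long turns the year columns into one "year" variable, which makes the table tidy; going back to wide undoes it.

```python
def pivot_longer(rows, id_col, value_cols, names_to, values_to):
    """Wide -> long: one output row per (id, variable) pair."""
    return [{id_col: r[id_col], names_to: c, values_to: r[c]}
            for r in rows for c in value_cols]

def pivot_wider(rows, id_col, names_from, values_from):
    """Long -> wide: one output row per id, one column per distinct name."""
    wide = {}
    for r in rows:
        wide.setdefault(r[id_col], {id_col: r[id_col]})[r[names_from]] = r[values_from]
    return list(wide.values())

# Hypothetical wide table: the column names "1999" and "2000" are really values of a variable.
wide = [{"country": "A", "1999": 10, "2000": 12},
        {"country": "B", "1999": 7, "2000": 9}]

long = pivot_longer(wide, "country", ["1999", "2000"], "year", "cases")
print(long[0])  # {'country': 'A', 'year': '1999', 'cases': 10}
print(pivot_wider(long, "country", "year", "cases") == wide)  # True
```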

Q: Name two R packages useful for missing data imputation.

missForest and imputeTS.

Q: What are depth-based approaches to outlier detection?

Methods that identify outliers as the points lying on the outer layers (successive convex hulls) of the data cloud.
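
Example (a Python sketch of convex hull peeling in two dimensions, one common way to define depth): the outermost hull gets depth 1, it is removed, the hull of the remaining points gets depth 2, and so on. Points with small depth are the outlier candidates. The hull routine is the standard monotone-chain algorithm; the function names and data are illustrative, and points lying exactly on a hull edge are assigned to a later layer in this simplified version.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the convex hull (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peeling_depth(points):
    """Depth 1 = outermost hull, depth 2 = hull of what remains, and so on."""
    remaining = set(points)
    depth, layer = {}, 1
    while remaining:
        for p in convex_hull(list(remaining)):
            depth[p] = layer
        remaining -= set(depth)
        layer += 1
    return depth

pts = [(0, 0), (4, 0), (4, 4), (0, 4),   # outer square
       (1, 1), (3, 1), (3, 3), (1, 3),   # inner square
       (2, 2)]                           # center
depth = peeling_depth(pts)
print(depth[(0, 0)], depth[(1, 1)], depth[(2, 2)])  # 1 2 3
```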

Q: Why are angles more reliable than distances in high-dimensional spaces?

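Because distances lose their contrast in high dimensions (the nearest and farthest points end up almost equally far away), while angles between points stay more stable; this is the basis for methods like ABOD (angle-based outlier detection).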
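
Example (a small Python experiment that illustrates only the distance half of this claim): for uniformly random points, the gap between the farthest and nearest neighbor of a query point, relative to the nearest distance, collapses as the dimension grows. The function name, sample size, and seed are arbitrary choices for the sketch.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, p))
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)     # large: some points are far closer than others
high_dim = relative_contrast(500)  # small: all points are at a similar distance
print(low_dim > high_dim)          # True
```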

Q: What is multiple imputation?

Generating several differently imputed versions of the dataset, analyzing each one, and pooling the results so that the uncertainty due to the missing values is captured.
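
Example (a deliberately simplified Python sketch of the impute, analyze, pool cycle for estimating a mean): each missing value is drawn at random from the observed values, which is a stand-in assumption; proper multiple imputation draws from a predictive model and also accounts for uncertainty in that model. The pooling step uses Rubin's rules. The function name and data are hypothetical.

```python
import random
from statistics import mean, variance

def multiple_imputation_mean(values, m=20, seed=0):
    """Estimate the mean from m imputed datasets, pooled with Rubin's rules."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    estimates, within = [], []
    for _ in range(m):
        # Simplified imputation step: draw each missing value from the observed ones.
        completed = [v if v is not None else rng.choice(observed) for v in values]
        estimates.append(mean(completed))
        within.append(variance(completed) / len(completed))  # squared standard error
    pooled = mean(estimates)
    between = variance(estimates)  # disagreement between the imputed datasets
    total_var = mean(within) + (1 + 1 / m) * between
    return pooled, total_var

data = [4.0, None, 7.0, 1.0, None, 9.0, 5.0, None, 10.0]  # None marks a missing value
pooled, total_var = multiple_imputation_mean(data)
print(pooled, total_var)
```

The between-imputation term is what a single imputation cannot provide: it widens the uncertainty estimate to reflect that the missing values are not actually known.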

Q: When should you NOT simply remove outliers?

When they might carry important information for the model.

Q: What does the Local Outlier Factor (LOF) method do?

Detects outliers by comparing the local density around a point to the local density around its neighbors.
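
Example (a compact Python sketch of the LOF computation, assuming no duplicate points and ignoring ties in the k-th neighbor distance): a score near 1 means the point is about as dense as its neighbors, and a score well above 1 flags an outlier. The function name lof_scores and the sample points are illustrative.

```python
import math

def lof_scores(points, k=2):
    """Local Outlier Factor of each point (ties in the k-distance are ignored)."""
    n = len(points)
    # k nearest neighbors of each point as sorted (distance, index) pairs.
    knn = []
    for i in range(n):
        d = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        knn.append(d[:k])
    k_dist = [nb[-1][0] for nb in knn]  # distance to the k-th nearest neighbor

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist) for dist, j in knn[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average ratio of the neighbors' density to the point's own density.
    return [sum(lrd[j] for _, j in knn[i]) / (k * lrd[i]) for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]  # a tight cluster plus one far point
scores = lof_scores(pts)
print(scores)  # the four cluster points score 1.0; the point (5, 5) scores roughly 6
```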