Q: What percentage of a data scientist’s time is spent cleaning data?
About 80%, according to a widely repeated estimate.
Q: What phrase captures the importance of clean data?
“Garbage in, garbage out.”
Q: What are key steps in data cleaning?
Remove duplicates, fix typos/structural errors, handle outliers carefully, and manage missing data.
Q: What is an outlier?
A data point that is significantly different from the rest.
Q: Name three methods for detecting outliers.
Statistical methods (e.g., Grubbs’ test), distance-based methods, and density-based methods.
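As an illustration of a simple statistical method, here is a minimal Python sketch. Grubbs' test needs t-distribution critical values, so this sketch swaps in the simpler Tukey IQR fence rule instead. The function name `iqr_outliers`, the multiplier `k=1.5`, and the sample data are assumptions made for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical sample: one value sits far from the rest.
data = [10, 12, 11, 13, 12, 11, 10, 95]
flagged = iqr_outliers(data)
print(flagged)
```

The rule is univariate and assumes nothing about the distribution beyond its quartiles, which makes it a reasonable first check but not a substitute for a formal test.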
Q: What is trimming in outlier treatment?
Removing extreme values from the dataset.
Q: What is winsorization?
Replacing extreme values with values closer to the bulk of the data.
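The difference between trimming and winsorization can be sketched in a few lines of Python. This is a simplified version that treats the `k` smallest and `k` largest values as extreme. The names `trim` and `winsorize` and the sample data are assumptions for the example, and both functions assume `0 <= 2*k < len(values)`.

```python
def trim(values, k):
    """Trimming: drop the k smallest and k largest values."""
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]

def winsorize(values, k):
    """Winsorization: clamp the k smallest and k largest values
    to the nearest value that is kept."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(x, low), high) for x in values]

# Hypothetical sample with one low and one high extreme.
data = [1, 48, 50, 51, 52, 53, 55, 400]
trimmed = trim(data, 1)          # [48, 50, 51, 52, 53, 55]
winsorized = winsorize(data, 1)  # [48, 48, 50, 51, 52, 53, 55, 55]
```

Trimming shrinks the dataset, while winsorization keeps the sample size and only pulls the extremes inward.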
Q: What is Mahalanobis distance used for?
Detecting outliers in multivariate data.
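A minimal sketch of the idea for two variables, using only the Python standard library. The function name `mahalanobis_2d` and the sample points are assumptions for the example. The sketch estimates the mean and covariance from the same data it scores, which is a simplification, and it assumes the covariance matrix is invertible.

```python
import math
import statistics

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(points)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # must be nonzero
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx, dy] * inverse(covariance) * [dx, dy]^T
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Hypothetical data: y tracks x closely, except for the last point.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.2), (8, 7.8), (2, 7.0)]
distances = mahalanobis_2d(points)
most_unusual = points[distances.index(max(distances))]
```

The last point is unremarkable in x alone and in y alone. It stands out only because it breaks the correlation between the two, which is what the covariance term captures.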
Q: What are naive methods of imputation?
Mean imputation, median imputation, and linear interpolation.
Q: Why can naive imputation methods be dangerous?
They can distort variances and covariances, causing biased results.
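The variance distortion is easy to see in a small Python sketch. The sample values are made up for the example, and `None` marks a missing entry.

```python
import statistics

# Hypothetical column with two missing values.
observed = [2.0, 4.0, None, 8.0, None, 10.0]

known = [x for x in observed if x is not None]
mean = statistics.fmean(known)

# Mean imputation: every gap gets the same value.
imputed = [mean if x is None else x for x in observed]

var_known = statistics.variance(known)      # variance of the observed values
var_imputed = statistics.variance(imputed)  # variance after imputation
print(var_known, var_imputed)
```

Each imputed value sits exactly at the mean and adds nothing to the sum of squared deviations, while the sample size grows, so the variance estimate shrinks.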
Q: What is low-rank matrix completion?
Filling in missing values under the assumption that the data lie in a low-dimensional subspace, so the full matrix has low rank (e.g., the Netflix Prize recommendation problem).
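A toy sketch of the idea in Python, restricted to the rank-1 case and fitted with alternating least squares. This is one simple approach chosen for illustration, not the method used in any particular system. The function name `complete_rank1` and the sample matrix are assumptions, and the sketch assumes every row and every column has at least one observed entry.

```python
def complete_rank1(matrix, iterations=500):
    """Fill None entries assuming matrix[i][j] is close to u[i] * v[j]."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Fix v, solve a least-squares problem for each u[i] on observed cells.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            u[i] = (sum(matrix[i][j] * v[j] for j in obs)
                    / sum(v[j] ** 2 for j in obs))
        # Fix u, solve for each v[j] the same way.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            v[j] = (sum(matrix[i][j] * u[i] for i in obs)
                    / sum(u[i] ** 2 for i in obs))
    # Keep observed entries, fill gaps with the rank-1 prediction.
    return [[matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
             for j in range(n_cols)] for i in range(n_rows)]

# Hypothetical matrix built from [1, 2, 3] x [1, 2, 4], with two cells hidden.
ratings = [[1, 2, None],
           [2, 4, 8],
           [None, 6, 12]]
filled = complete_rank1(ratings)
```

On this example the hidden cells should be recovered as approximately 4 and 3, because the observed cells are consistent with exactly one rank-1 matrix.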
Q: What is “tidy data”?
Data where each variable is a column, each observation is a row, and each value is a cell.
Q: What R functions are used to reshape data?
pivot_longer() (formerly gather()) and pivot_wider() (formerly spread()), both from the tidyr package.
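To show what the two reshaping directions do, here is a plain-Python sketch of the idea. It is not the tidyr API: the function names `to_long` and `to_wide`, their parameters, and the sample table are all assumptions for the example.

```python
def to_long(rows, id_col, names_to, values_to):
    """Wide to long: one output row per (id, column) pair."""
    long_rows = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue
            long_rows.append({id_col: row[id_col],
                              names_to: col,
                              values_to: value})
    return long_rows

def to_wide(long_rows, id_col, names_from, values_from):
    """Long to wide: one output row per id, one column per name."""
    wide = {}
    for row in long_rows:
        record = wide.setdefault(row[id_col], {id_col: row[id_col]})
        record[row[names_from]] = row[values_from]
    return list(wide.values())

# Hypothetical untidy table: the year variable is spread across column names.
wide_table = [
    {"country": "A", "1999": 10, "2000": 12},
    {"country": "B", "1999": 7, "2000": 9},
]
long_table = to_long(wide_table, "country", "year", "cases")
round_trip = to_wide(long_table, "country", "year", "cases")
```

In the long form each row is one observation (a country in a year) and each variable has its own column, which matches the tidy data definition.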
Q: Name two R packages useful for missing data imputation.
missForest and imputeTS.
Q: What are depth-based approaches to outlier detection?
Methods that identify outliers based on their position on the outer layers (convex hulls) of the data space.
Q: Why are angles more reliable than distances in high-dimensional spaces?
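Because in high dimensions pairwise distances concentrate: the nearest and farthest points end up almost equally far away, so distance comparisons lose their contrast. Angles between the vectors from a point to the other points remain more informative, which is the basis for angle-based outlier detection (ABOD).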
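The loss of distance contrast can be checked with a small simulation. This sketch covers only the distance half of the claim and does not implement ABOD. The function name `relative_contrast`, the uniform random data, the point count, and the seed are assumptions for the example.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)     # nearest point is much closer than farthest
high_dim = relative_contrast(500)  # nearest and farthest are nearly the same
print(low_dim, high_dim)
```

A contrast near zero means "nearest neighbor" barely differs from "farthest neighbor", which is why purely distance-based outlier scores become unreliable as the dimension grows.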
Q: What is multiple imputation?
Generating several imputed versions of the dataset, analyzing each one, and pooling the results so that the uncertainty caused by the missing values carries through to the final estimates.
Q: When should you NOT simply remove outliers?
When they might carry important information for the model.
Q: What does the Local Outlier Factor (LOF) method do?
Detects outliers by comparing the density around a point to the density around its neighbors.
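A compact Python sketch of the LOF computation, following the usual definitions (k-distance, reachability distance, local reachability density). The function name `local_outlier_factor`, the choice `k=2`, and the sample points are assumptions for the example. The sketch breaks distance ties by index and assumes the points are distinct.

```python
import math

def local_outlier_factor(points, k=2):
    """Return one LOF score per point; scores well above 1 suggest outliers."""
    n = len(points)
    neighbors, k_distance = [], []
    for i in range(n):
        ranked = sorted((math.dist(points[i], points[j]), j)
                        for j in range(n) if j != i)
        neighbors.append([j for _, j in ranked[:k]])  # k nearest neighbors
        k_distance.append(ranked[k - 1][0])           # distance to the k-th

    def reach_dist(i, j):
        # Reachability distance of point i from neighbor j.
        return max(k_distance[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]
    # LOF: mean density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical data: a tight cluster plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = local_outlier_factor(points)
most_isolated = points[scores.index(max(scores))]
```

Points inside the cluster get scores near 1 because their density matches that of their neighbors, while the isolated point gets a much larger score.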