Midterm Project Notes: Data Cleaning and Analysis Concepts (Transcript Summary)
Steps for Part 1 Midterm Project
Write a brief statement of understanding: describe, in your own words, the issue and the deliverables.
Write a brief problem statement: clear, concise, and measurable, stating the objectives you’re trying to accomplish.
Data assessment: instead of submitting an Excel sheet, take a screenshot of the Excel document’s description/variables and insert it as an image into your Word document.
Deliverable scope: Part 1 covers only up to the problem statement and the data description screenshot; you will use the project dataset (not the lab dataset).
Repository and Dataset Details
Repository contains four files:
midterm sample report
midterm template
instructions
project dataset
Do not use the lab dataset from the last class; use the project dataset instead.
Dataset setup is almost identical to past datasets, but with a new set of variables.
In the last class, variables included house ID, burn, fireplaces, and other places (example terms from transcript).
This time, you have only two variables that are the same as before: House ID and Salesforce (dependent variable remains the same). Your observation unit is still houses.
The four differing variables are new and require work; the other two variables can be copied from prior work.
A small snapshot of the dataset is shown to illustrate what you’ll be using; this data will be used not only for Part 1 but for upcoming project parts as well.
The larger project is split into four parts; data cleaning is a major focus and will take two weeks.
Data cleaning is emphasized as taking the most time for a data scientist; start as soon as you’re done with this problem.
Project Report Setup and Part 1 Deliverables
A sample report is shown, based on a different project (GP analysis) to illustrate structure, not housing analysis.
Structure for your project report:
A small statement of understanding (your interpretation of the issue).
A problem statement referencing class principles with clear, measurable objectives.
Do not include an Excel sheet; insert a screenshot of the variable description into Word.
For Part 1, you should complete only up to the point of the problem statement and the screenshot entry.
Use the Project Dataset and the specified variables (these replace the lab dataset).
If you left missing fields from previous tasks, fill them in now; do not submit the same incomplete items.
Deadlines mentioned: Lab assignments due earlier; Project report due Sunday; module three classwork/homework also due Sunday.
Emphasis on starting data cleaning early and treating data preparation as an ongoing, iterative process.
Data Preparation and Cleaning Overview
Data preparation comprises several stages:
Data collection: how data is gathered and brought into systems.
Data management: organizing data within data systems; reference to database management systems (DBMS).
Cleaning: correcting errors and misspellings; eliminating duplicates when needed.
Transforming: merging datasets to derive useful insights.
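A minimal sketch (assuming pandas is available; the table and column names are illustrative, not from the project dataset) of the cleaning and transforming steps listed above:
```python
import pandas as pd

# Illustrative tables; names are made up for this sketch.
sales = pd.DataFrame({
    "house_id": [1, 2, 2, 3],          # note the duplicated row for house 2
    "price":    [200000, 150000, 150000, 315000],
})
agents = pd.DataFrame({
    "house_id": [1, 2, 3],
    "agent":    ["  Smith", "smith ", "Lee"],
})

# Cleaning: drop exact duplicate rows and normalize a messy text column.
sales = sales.drop_duplicates()
agents["agent"] = agents["agent"].str.strip().str.title()

# Transforming: merge the two tables on the shared key to derive new insight.
merged = sales.merge(agents, on="house_id", how="left")
print(merged)
```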
Data cleaning is framed as essential due to "garbage in, garbage out"; bad inputs yield bad analyses.
Excel can support cleaning via functions such as Sort and Filter; cleaning techniques are introduced only at a high level here.
The process is iterative and interrelated among steps; you may loop back to refine earlier steps as you proceed.
Data Preparation Engine Core Concepts
Core components (brief overview):
Cleaning the data (fixing errors, typos, duplicates)
Transforming data (merging tables, deriving new features)
Managing data quality to ensure reliable analyses
Practical note: you will encounter many steps in sequence, with feedback loops between steps.
Summary Statistics and Distribution Concepts
Types of variables (two broad categories):
Quantitative (numerical)
Categorical (qualitative)
Summary statistics for quantitative data:
Mean (average):
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
Median (middle value after sorting):
If n is odd, the middle value is the median. If n is even, the median is the average of the two middle values.
Mode (most frequent value):
The value(s) that occur most frequently; can be unimodal, bimodal, multimodal, or have no mode.
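A quick sketch (assuming pandas; the numbers are made up) computing the three statistics:
```python
import pandas as pd

values = pd.Series([2, 3, 3, 4, 5, 23, 54])  # illustrative, right-skewed values

print("mean:  ", values.mean())            # pulled upward by 23 and 54
print("median:", values.median())          # middle value after sorting -> 4.0
print("mode:  ", values.mode().tolist())   # most frequent value(s) -> [3]
```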
Why use median over mean:
When distribution is skewed, the mean is pulled toward the tail; the median better represents central tendency in skewed data.
Skewness and distributions:
Left-skewed (negatively skewed): tail extends to the left; mean < median.
Right-skewed (positively skewed): tail extends to the right; mean > median.
Histograms and distributions:
A histogram shows frequency counts across bins; helps diagnose skewness and the distribution shape.
Visual tools to understand data shape:
Box plots help identify variability and potential outliers; the central box represents the interquartile range around the median.
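A minimal plotting sketch (assuming matplotlib and numpy; the simulated data stand in for a real variable) to check shape and potential outliers:
```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.lognormal(mean=3, sigma=0.5, size=500)  # right-skewed toy data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(x, bins=30)        # histogram: frequency counts across bins
ax1.set_title("Histogram")
ax2.boxplot(x, vert=False)  # box plot: IQR box, median line, outlier points
ax2.set_title("Box plot")
plt.tight_layout()
plt.show()
```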
When to rely on which measure:
For quantitative variables, the median is generally robust, especially in skewed distributions.
Modes and distribution shape:
A dataset can have one mode (unimodal) or two modes (bimodal); more modes indicate multiple subgroups.
For categorical variables, the mode is informative when one category clearly dominates; when counts are spread fairly evenly across categories, a single measure of central tendency conveys little.
Missing Values: Strategies and Mechanisms
What is missing data?
Planned fields, typos, nonresponse, or data not collected; missingness can arise from data entry or survey nonresponse.
Impact of missing values:
Simply deleting rows with missing values can bias results; missingness can introduce distortion if not handled properly.
Deletion strategies:
Deleting rows with missing target variable is sometimes acceptable; deleting rows with missing independent variables is generally not advised unless necessary.
If a variable has excessive missingness (e.g., >30%), consider dropping that variable.
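A brief sketch (assuming pandas; the column name "target" is a placeholder for the dependent variable) of the deletion rules above:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "target": [200, np.nan, 315, 250],        # placeholder dependent variable
    "x1":     [3.0, 2.0, 5.0, 4.0],
    "x2":     [np.nan, np.nan, np.nan, 1.0],  # mostly missing
})

# Rows missing the target can be dropped.
df = df.dropna(subset=["target"])

# Drop any variable whose share of missing values exceeds 30%.
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.30].index)
print(df)
```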
Imputation (filling in missing values) using internal data (not external data):
Mean imputation: replace missing values with the mean of non-missing values; suitable for symmetric distributions.
Median imputation: replace missing values with the median; preferred for skewed distributions.
Mode imputation: for categorical variables, replace missing values with the most frequent category.
Example rationale: for the Agent Grad variable with missing values, compute the mean; however, if the distribution is skewed, use the median instead to avoid bias.
Illustrative example from transcript: If the median of a variable is 23, fill missing values with 23; a mean would be biased by a few extreme values (e.g., 34, 54).
Categorical vs quantitative imputation:
For categorical variables, use the mode; for quantitative variables, prefer the median (especially for skewed data).
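A minimal imputation sketch (assuming pandas; the variable names are placeholders, not the project's):
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "area":  [1200.0, np.nan, 1500.0, 4000.0],  # skewed quantitative variable
    "grade": ["A", "B", np.nan, "A"],           # categorical variable
})

# Quantitative: prefer the median when the distribution is skewed.
df["area"] = df["area"].fillna(df["area"].median())

# Categorical: fill with the mode (most frequent category).
df["grade"] = df["grade"].fillna(df["grade"].mode()[0])
print(df)
```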
Missingness mechanisms (high-level):
Missing at Random (MAR): missingness related to observed data but not the missing value itself.
Missing Completely at Random (MCAR): missingness unrelated to any data.
Missing Not at Random (MNAR): missingness related to the missing values themselves (not explicitly detailed in the transcript, but commonly discussed in practice).
Practical imputation guidance (in-class scope):
Use simple imputation methods (mean/median/mode) initially; more advanced methods exist (multivariate, ML-based), but the class focuses on basic approaches.
Handling missingness in different variable types:
For a categorical variable with many missing values, consider adding a new category called "Missing" to preserve information about nonresponse (see the sketch after this list).
For quantitative variables with substantial missingness, prefer deletion only if the variable is not critical, otherwise impute using a robust statistic (median preferred for skewed distributions).
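A short sketch (assuming pandas; the column name is illustrative) of preserving nonresponse as its own category:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"heating": ["Gas", np.nan, "Electric", np.nan, "Gas"]})

# Keep the information that a response was missing rather than guessing a value.
df["heating"] = df["heating"].fillna("Missing")
print(df["heating"].value_counts())
```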
Diagnostic examples:
A category such as "Psych" may have the highest count among non-missing values, making it the mode used to fill missing entries.
A case where a variable has high missingness and should be dropped if missingness is excessive.
Additional strategies (not used in this class):
Bivariate and multivariate imputation; external data imputation; machine learning-based imputation approaches.
Outliers: Detection and Handling
Outliers are extremely high or low data points that lie far from the bulk of the data; they can be data entry errors or true extreme values.
Common approaches to handle outliers:
Deletion: removing outliers is generally not recommended unless absolutely necessary and justified; it can distort the data if not done carefully.
Winsorization (Winsorizing): cap extreme values at a threshold to reduce their impact; values beyond the threshold are set to the threshold value (a brief sketch follows this list).
Box plot-based detection: use the interquartile range (IQR) to identify potential outliers visually and statistically.
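A brief winsorization sketch (assuming pandas; the 5th/95th percentile caps are an example choice, not a rule from the transcript):
```python
import pandas as pd

x = pd.Series([10, 12, 11, 13, 12, 125])  # 125 is a suspect extreme value

lower, upper = x.quantile(0.05), x.quantile(0.95)
x_winsorized = x.clip(lower=lower, upper=upper)  # cap values at the thresholds
print(x_winsorized.tolist())
```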
Interquartile range (IQR) concept:
IQR is the distance between the first and third quartiles:
IQR = Q3 - Q1.
Outlier boundaries (using the 1.5 × IQR rule):
\text{Lower bound} = Q_1 - 1.5 \times IQR, \quad \text{Upper bound} = Q_3 + 1.5 \times IQR.
Any data points outside these bounds are considered potential outliers.
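A minimal sketch (assuming pandas) of the 1.5 × IQR rule for flagging potential outliers:
```python
import pandas as pd

x = pd.Series([10, 12, 11, 13, 12, 125])

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(outliers)  # 125 falls outside the upper bound here
```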
Z-score method note:
Mentioned as a method for detecting outliers in certain distributions, but not covered in depth in this class.
Practical example from transcript:
A value like 125 could be flagged as an outlier depending on the distribution and bounds.
For an obviously erroneous value (e.g., sleeping 26 hours a day), median-based imputation can be used to replace such suspected errors rather than simple deletion.
Box plots and IQR as diagnostic tools:
Box plot highlights the central tendency, spread, and potential outliers at a glance.
Summary rule of thumb:
Outliers should not be removed by default; use transparent criteria (e.g., defined by IQR bounds or domain knowledge) and consider robust methods (e.g., Winsorization) to mitigate their influence when appropriate.
Imputation and Data Cleaning Tactics (Practical Notes)
Univariate imputation uses information from a single variable; multivariate approaches use relationships between variables (not the focus here).
For categorical variables, if missingness is substantial, create a "Missing" category to preserve information about nonresponse.
For quantitative variables, prefer the median over the mean when distributions are skewed; use mean imputation mainly for symmetric distributions.
The process is iterative and may require revisiting variable distributions, missingness patterns, and outlier handling as you clean the data.
Emphasis on internal data: imputation should be based on the data you have, not external sources, for this course.
Data Visualization and Diagnosis (Supplemental)
Histograms help diagnose distribution shape and identify skewness.
Box plots help identify spread and potential outliers visually.
Frequency tables (count tables) support understanding of categorical variable distributions and can reveal coding or data-entry errors.
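A frequency-table sketch (assuming pandas; the category labels are made up) that can surface coding or data-entry errors:
```python
import pandas as pd

styles = pd.Series(["Ranch", "Ranch", "Colonial", "colonail", "Colonial"])

# Counts per category; the misspelled "colonail" shows up as its own level.
print(styles.value_counts(dropna=False))
```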
Practical and Ethical Implications
Clean data lead to reliable insights; poor data quality propagates errors and biased conclusions.
Iterative cleansing requires transparency about methods and decisions, especially when imputing or trimming data.
Trust and provenance: when data come from external sources, extra caution is needed to validate reliability before imputation or analysis.
Key Formulas and Concepts
Mean (average):
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i
Median (central tendency):
Arrange data in order; if n is odd, median is the value in position $(n+1)/2$; if n is even, median is the average of the two middle values:
For even n: \text{median} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}
Mode (most frequent value):
The value(s) with the highest frequency in the dataset; can be unimodal, bimodal, multimodal, or have no mode.
Interquartile Range (IQR):
IQR = Q3 - Q1
Outlier bounds (1.5 IQR rule):
\text{Lower bound} = Q_1 - 1.5 \cdot IQR, \quad \text{Upper bound} = Q_3 + 1.5 \cdot IQR
Box plot interpretation: the central box spans from $Q_1$ to $Q_3$ with the median at $Q_2$; whiskers extend to the most extreme values within 1.5 × IQR of the box edges; points outside are potential outliers.