
Midterm Project Notes: Data Cleaning and Analysis Concepts (Transcript Summary)

Steps for Part 1 Midterm Project

  • Write a brief statement of understanding: give your own understanding of the issue and the deliverables.

  • Write a brief problem statement: clear, concise, and measurable, stating the objectives you’re trying to accomplish.

  • Data assessment: instead of submitting an Excel sheet, take a screenshot of the Excel document’s description/variables and insert it as an image into your Word document.

  • Deliverable scope: Part 1 covers only up to the problem statement and the data description screenshot; you will use the project dataset (not the lab dataset).

Repository and Dataset Details

  • Repository contains four files:

    • midterm sample report

    • midterm template

    • instructions

    • project dataset

  • Do not use the lab dataset from the last class; use the project dataset instead.

  • The dataset setup is almost identical to past datasets, but with a new set of variables.

  • In the last class, variables included house ID, “burn,” “fireplaces,” and others (terms as rendered in the transcript).

  • This time, only two variables are the same as before: House ID and the sales variable (rendered as “Salesforce” in the transcript); the dependent variable remains the same, and your observation unit is still houses.

  • The four differing variables are new and require fresh work; the two carried-over variables can be copied from prior work.

  • A small snapshot of the dataset is shown to illustrate what you’ll be using; this data will be used not only for Part 1 but for upcoming project parts as well.

  • The larger project is split into four parts; data cleaning is a major focus and will take two weeks.

  • Data cleaning is emphasized as the most time-consuming part of a data scientist’s work; start it as soon as you are done with this part.

Project Report Setup and Part 1 Deliverables

  • A sample report is shown; it is based on a different project (GP analysis) and illustrates the report structure, not the housing analysis itself.

  • Structure for your project report:

    • A small statement of understanding (your interpretation of the issue).

    • A problem statement referencing class principles with clear, measurable objectives.

  • Do not include an Excel sheet; insert a screenshot of the variable description into Word.

  • For Part 1, you should complete only up to the point of the problem statement and the screenshot entry.

  • Use the Project Dataset and the specified variables (these replace the lab dataset).

  • If you left fields incomplete in previous tasks, fill them in now; do not resubmit the same incomplete work.

  • Deadlines mentioned: Lab assignments due earlier; Project report due Sunday; module three classwork/homework also due Sunday.

  • Emphasis on starting data cleaning early and treating data preparation as an ongoing, iterative process.

Data Preparation and Cleaning Overview

  • Data preparation comprises several stages:

    • Data collection: how data is gathered and brought into systems.

    • Data management: organizing data within data systems; reference to database management systems (DBMS).

    • Cleaning: correcting errors, misspellings, duplicates; eliminating duplicates when needed.

    • Transforming: merging datasets to derive useful insights.

  • Data cleaning is framed as essential due to "garbage in, garbage out"; bad inputs yield bad analyses.

  • Excel can support cleaning via functions like Sort and Filter; detailed cleaning techniques are introduced here only at a high level (a short pandas sketch of the same operations follows this list).

  • The process is iterative and interrelated among steps; you may loop back to refine earlier steps as you proceed.
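
To make the cleaning step concrete, here is a minimal pandas sketch of duplicate removal, typo cleanup, and Excel-style sort and filter; the table and column names are hypothetical, not taken from the project dataset.

```python
import pandas as pd

# Hypothetical example table; column names are illustrative only,
# not taken from the project dataset.
df = pd.DataFrame({
    "house_id": [1, 2, 2, 3, 4],
    "city": ["Boston", "boston ", "boston ", "Salem", "Salem"],
    "price": [350000, 410000, 410000, 298000, 515000],
})

# Eliminate exact duplicate rows (house 2 appears twice).
df = df.drop_duplicates()

# Fix inconsistent text entries: trim whitespace, normalize case.
df["city"] = df["city"].str.strip().str.title()

# Equivalents of Excel's Sort and Filter.
df = df.sort_values("price")             # sort
affordable = df[df["price"] < 400000]    # filter
print(affordable)
```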

Data Preparation Engine Core Concepts

  • Core components (brief overview):

    • Cleaning the data (fixing errors, typos, duplicates)

    • Transforming data (merging tables, deriving new features; see the merge sketch after this list)

    • Managing data quality to ensure reliable analyses

  • Practical note: you will encounter many steps in sequence, with feedback loops between steps.
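
As a sketch of the transforming step, the snippet below merges two hypothetical tables on a shared key and derives a new feature; the tables, key, and derived column are invented for illustration.

```python
import pandas as pd

# Two hypothetical tables sharing a key column.
sales = pd.DataFrame({"house_id": [1, 2, 3],
                      "price": [350000, 410000, 298000]})
features = pd.DataFrame({"house_id": [1, 2, 3],
                         "sqft": [1400, 2100, 1150]})

# Merge the tables, then derive a new feature from the result.
merged = sales.merge(features, on="house_id", how="left")
merged["price_per_sqft"] = merged["price"] / merged["sqft"]
print(merged)
```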

Summary Statistics and Distribution Concepts

  • Types of variables (two broad categories):

    • Quantitative (numerical)

    • Categorical (qualitative)

  • Summary statistics for quantitative data (computed in the code sketch at the end of this section):

    • Mean (average):


    • \mu = \frac{1}{n} \sum_{i=1}^{n} x_i

    • Median (middle value after sorting):

    • If n is odd, the middle value is the median. If n is even, the median is the average of the two middle values.

    • Mode (most frequent value):

    • The value(s) that occur most frequently; can be unimodal, bimodal, multimodal, or have no mode.

  • Why use median over mean:

    • When distribution is skewed, the mean is pulled toward the tail; the median better represents central tendency in skewed data.

  • Skewness and distributions:

    • Left-skewed (negatively skewed): tail extends to the left; mean < median.

    • Right-skewed (positively skewed): tail extends to the right; mean > median.

  • Histograms and distributions:

    • A histogram shows frequency counts across bins; helps diagnose skewness and the distribution shape.

  • Visual tools to understand data shape:

    • Box plots help identify variability and potential outliers; the central box represents the interquartile range around the median.

  • When to rely on which measure:

    • For quantitative variables, the median is generally robust, especially in skewed distributions.

  • Modes and distribution shape:

    • A dataset can have one mode (unimodal) or two modes (bimodal); more modes indicate multiple subgroups.

    • For categorical variables, the mode is informative when one category clearly dominates; when counts are spread evenly across categories, a single measure of central tendency is less meaningful.
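
A minimal sketch of these summary statistics, reusing the transcript’s illustrative numbers (a median of 23 pulled against extremes like 34 and 54); the sample itself is made up.

```python
import pandas as pd

# Made-up, right-skewed sample: a few large values pull the mean upward.
x = pd.Series([20, 21, 22, 23, 23, 24, 34, 54])

print(x.mean())    # 27.625 -- dragged toward the right tail
print(x.median())  # 23.0   -- robust to the extremes
print(x.mode())    # 23     -- the most frequent value
print(x.skew())    # positive, consistent with mean > median
```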

Missing Values: Strategies and Mechanisms

  • What is missing data?

    • Missing data can arise from fields left blank by design, typos, survey nonresponse, or data that was never collected; missingness often enters during data entry or through nonresponse.

  • Impact of missing values:

    • Simply deleting rows with missing values can bias results; missingness can introduce distortion if not handled properly.

  • Deletion strategies:

    • Deleting rows with missing target variable is sometimes acceptable; deleting rows with missing independent variables is generally not advised unless necessary.

    • If a variable has excessive missingness (e.g., >30%), consider dropping that variable.

  • Imputation (filling in missing values) using internal data (not external data):

    • Mean imputation: replace missing values with the mean of non-missing values; suitable for symmetric distributions.

    • Median imputation: replace missing values with the median; preferred for skewed distributions.

    • Mode imputation: for categorical variables, replace missing values with the most frequent category.

    • Example rationale: for the “Agent Grad” variable with missing values, one could compute the mean; however, if the distribution is skewed, use the median instead to avoid bias.

    • Illustrative example from transcript: If the median of a variable is 23, fill missing values with 23; a mean would be biased by a few extreme values (e.g., 34, 54).

  • Categorical vs quantitative imputation:

    • For categorical variables, use the mode; for quantitative variables, prefer the median (especially for skewed data).

  • Missingness mechanisms (high-level):

    • Missing at Random (MAR): missingness related to observed data but not the missing value itself.

    • Missing Completely at Random (MCAR): missingness unrelated to any data.

    • Missing Not at Random (MNAR): missingness related to the missing values themselves (not explicitly detailed in the transcript, but commonly discussed in practice).

  • Practical imputation guidance (in-class scope):

    • Use simple imputation methods (mean/median/mode) initially; more advanced methods exist (multivariate, ML-based), but the class focuses on the basic approaches, sketched in code at the end of this section.

  • Handling missingness in different variable types:

    • For a categorical variable with many missing values, consider adding a new category called "Missing" to preserve information about nonresponse.

    • For quantitative variables with substantial missingness, prefer deletion only if the variable is not critical; otherwise, impute using a robust statistic (the median is preferred for skewed distributions).

  • Diagnostic examples:

    • A category like “Psych” may have the highest count among nonmissing values and thus serve as the mode used to fill in missing entries.

    • A variable with excessive missingness (e.g., well above the 30% threshold) is a candidate for dropping.

  • Additional strategies (not used in this class):

    • Bivariate and multivariate imputation; external data imputation; machine learning-based imputation approaches.
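
Here is a minimal sketch of the basic strategies above, assuming a hypothetical table; the column names, the “Psych” category, and the 30% threshold are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values; names are illustrative only.
df = pd.DataFrame({
    "years_exp": [20, 21, np.nan, 23, 23, np.nan, 34, 54],  # skewed
    "major": ["Psych", "Psych", None, "Econ", "Psych", None, "Econ", "Psych"],
    "mostly_missing": [1.0] + [np.nan] * 7,
})

# Drop any variable whose missingness is excessive (e.g., > 30%).
share_missing = df.isna().mean()
df = df.drop(columns=share_missing[share_missing > 0.30].index)

# Quantitative: median imputation, robust for skewed distributions.
df["years_exp"] = df["years_exp"].fillna(df["years_exp"].median())

# Categorical, option 1: mode imputation (most frequent category).
# df["major"] = df["major"].fillna(df["major"].mode()[0])
# Categorical, option 2: preserve nonresponse as its own category.
df["major"] = df["major"].fillna("Missing")
print(df)
```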

Outliers: Detection and Handling

  • Outliers are extremely high or low data points that lie far from the bulk of the data; they can be data entry errors or true extreme values.

  • Common approaches to handle outliers:

    • Deletion: removing outliers is generally not recommended unless absolutely necessary and justified; it can distort the data if not done carefully.

    • Winsorization (Winsorizing): cap extreme values at a threshold to reduce their impact; values beyond the threshold are set to the threshold value.

    • Box plot-based detection: use the interquartile range (IQR) to identify potential outliers visually and statistically.

  • Interquartile range (IQR) concept:

    • IQR is the distance between the first and third quartiles:

    • IQR = Q3 - Q1.

    • Outlier boundaries (using 1.5*IQR rule):

    • \text{Lower bound} = Q_1 - 1.5 \times IQR, \quad \text{Upper bound} = Q_3 + 1.5 \times IQR (see the code sketch at the end of this section).

    • Any data points outside these bounds are considered potential outliers.

  • Z-score method note:

    • Mentioned as a method for detecting outliers in certain distributions, but not covered in depth in this class.

  • Practical example from transcript:

    • A value like 125 could be flagged as an outlier depending on the distribution and bounds.

    • For an obviously erroneous value (e.g., sleeping 26 hours a day), median-based imputation can be used to replace such suspected errors rather than simple deletion.

  • Box plots and IQR as diagnostic tools:

    • Box plot highlights the central tendency, spread, and potential outliers at a glance.

  • Summary rule of thumb:

    • Outliers should not be removed by default; use transparent criteria (e.g., defined by IQR bounds or domain knowledge) and consider robust methods (e.g., Winsorization) to mitigate their influence when appropriate.
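
A short sketch of the 1.5×IQR rule and Winsorization, using a made-up sample that includes the transcript’s 125 example:

```python
import pandas as pd

# Made-up sample with one suspiciously large value (125).
x = pd.Series([18, 20, 21, 22, 23, 23, 24, 26, 125])

# Flag potential outliers with the 1.5 * IQR rule.
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(x[(x < lower) | (x > upper)])  # 125 falls outside the bounds

# Winsorization: cap extreme values at the bounds instead of deleting.
x_wins = x.clip(lower=lower, upper=upper)
print(x_wins.max())  # 28.5 -- the capped upper bound, not 125
```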

Imputation and Data Cleaning Tactics (Practical Notes)

  • Univariate imputation uses information from a single variable; multivariate approaches use relationships between variables (not the focus here).

  • For categorical variables, if missingness is substantial, create a "Missing" category to preserve information about nonresponse.

  • For quantitative variables, prefer the median over the mean when distributions are skewed; use mean imputation mainly for symmetric distributions.

  • The process is iterative and may require revisiting variable distributions, missingness patterns, and outlier handling as you clean the data.

  • Emphasis on internal data: imputation should be based on the data you have, not external sources, for this course.

Data Visualization and Diagnosis (Supplemental)

  • Histograms help diagnose distribution shape and identify skewness.

  • Box plots help identify spread and potential outliers visually.

  • Frequency tables (count tables) support understanding of categorical variable distributions and can reveal coding or data-entry errors.
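
A brief sketch of these three diagnostics in pandas/matplotlib; the data and the “psych”/“Psych” coding error are invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data for visual diagnosis.
x = pd.Series([20, 21, 22, 23, 23, 24, 34, 54])
cat = pd.Series(["Psych", "Psych", "Econ", "Econ", "Psych", "psych"])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
x.plot.hist(ax=axes[0], bins=6, title="Histogram: shape and skew")
x.plot.box(ax=axes[1], title="Box plot: spread and outliers")
plt.tight_layout()
plt.show()

# Frequency table: counts also expose coding errors ("psych" vs "Psych").
print(cat.value_counts(dropna=False))
```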

Practical and Ethical Implications

  • Clean data lead to reliable insights; poor data quality propagates errors and biased conclusions.

  • Iterative cleansing requires transparency about methods and decisions, especially when imputing or trimming data.

  • Trust and provenance: when data come from external sources, extra caution is needed to validate reliability before imputation or analysis.

Key Formulas and Concepts

  • Mean (average):

    • \mu = \frac{1}{n} \sum_{i=1}^{n} x_i

  • Median (central tendency):

    • Arrange data in order; if n is odd, median is the value in position $(n+1)/2$; if n is even, median is the average of the two middle values:

    • For even n: \text{median} = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}

  • Mode (most frequent value):

    • The value(s) with the highest frequency in the dataset; can be unimodal, bimodal, multimodal, or have no mode.

  • Interquartile Range (IQR):

    • IQR = Q3 - Q1

  • Outlier bounds (1.5 IQR rule):

    • \text{Lower bound} = Q_1 - 1.5 \cdot IQR, \quad \text{Upper bound} = Q_3 + 1.5 \cdot IQR

  • Box plot interpretation: the central box spans from $Q_1$ to $Q_3$, with the median ($Q_2$) marked inside; whiskers extend to the most extreme values within the $1.5 \times IQR$ fences; points beyond the whiskers are potential outliers.
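
A worked example with made-up numbers may help tie the formulas together. Take the sample $\{18, 20, 22, 23, 24, 26, 125\}$ with $n = 7$:

  • Mean: $\mu = 258/7 \approx 36.9$; median: the middle value $x_{(4)} = 23$. The mean far exceeds the median, indicating right skew.

  • Quartiles by the median-of-halves convention (software may interpolate slightly differently): $Q_1 = 20$, $Q_3 = 26$, so $IQR = 26 - 20 = 6$.

  • Bounds: $Q_1 - 1.5 \times IQR = 20 - 9 = 11$ and $Q_3 + 1.5 \times IQR = 26 + 9 = 35$; since $125 > 35$, it is flagged as a potential outlier.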

