Data Understanding
The process of evaluating data using descriptive statistics and visualization to identify invalid values, missing values, unexpected distributions, and outliers
Descriptive Statistics
Methods to organize, describe, and summarize data; includes frequency, min/max, central tendency (mean, median, mode), and dispersion (range, variance, standard deviation)
PROC FREQ
SAS procedure that produces a one-way frequency table for each variable in the TABLES statement; output includes frequency, percent, and cumulative statistics
BY Statement
Used in PROC FREQ to request separate analyses for each group; data must first be sorted or indexed by the BY variable
Crosstabulation Table
A two-way frequency table generated by placing an asterisk between two variables (e.g., Sex*Country); shows frequency, percent, row pct, and col pct for each combination
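The frequency cards above can be sketched in one PROC FREQ step; the dataset work.survey and the variables Country and Sex are hypothetical stand-ins:

```sas
/* One-way and two-way (crosstab) frequency tables;
   dataset and variable names are hypothetical */
proc freq data=work.survey;
    tables Country;        /* one-way: frequency, percent, cumulative stats */
    tables Sex*Country;    /* crosstab: frequency, percent, row pct, col pct */
run;
```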
Row Pct
The percentage of the row total represented by a given cell; denominator is the row total
Col Pct
The percentage of the column total represented by a given cell; denominator is the column total
ORDER=FREQ Option
PROC FREQ option that displays results in descending frequency order; useful for identifying duplicates or dominant values
Duplicate Check with PROC FREQ
Use ORDER=FREQ in PROC FREQ on an ID variable; any value with frequency > 1 indicates a duplicate record
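A minimal duplicate-check sketch, assuming a hypothetical dataset work.survey keyed by Employee_ID:

```sas
/* Values appear in descending frequency order;
   any ID with Frequency > 1 is a duplicate record */
proc freq data=work.survey order=freq;
    tables Employee_ID;
run;
```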
PROC PRINT for Invalid Data
Used after PROC FREQ to display specific observations with invalid, missing, or out-of-range values using a WHERE clause
IF-THEN Statement
SAS statement that executes a command for observations meeting a specific condition; syntax: IF expression THEN statement
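A data-step sketch of IF-THEN used for conditional fixes; the dataset and the specific corrections shown are hypothetical:

```sas
data work.survey_fixed;
    set work.survey;
    if Gender = 'm' then Gender = 'M';   /* standardize a miscoded value */
    if Salary < 0 then Salary = .;       /* set an impossible value to missing */
run;
```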
Data Fixing Principle
Before correcting any data problem, you must first have a reason — always consult the data expert or source system owner
Data Preparation
The most time-consuming part of analytics (60–90% of a data scientist's time); the most important step and primary source of model error
Variable Cleaning
Addressing data quality problems (incorrect values, outliers, missing values) through solutions such as removing, transforming, binning, or leaving values as-is
MCAR
Missing Completely at Random — missing values with no pattern; example: survey responses lost in transit
MAR
Missing at Random — missingness is related to another observed variable; example: younger people not disclosing health status
MNAR
Missing Not at Random — missingness is related to the unobserved value itself, so the true value can often be inferred from the fact that it is missing; example: people with felony records declining to report them
Listwise Deletion
Handling missing values by deleting the entire observation; simple but risks biasing data if missingness is not completely random
Imputation
Handling missing values by replacing them with an estimated value such as a constant, mean, median, or regression-calculated value
Feature Engineering
Creating new variables from existing data to improve modeling; methods include variable creation, dummy coding, binning, and scaling
Dummy Coding (One-Hot Encoding)
Converting a categorical variable with k categories into k-1 binary (0/1) variables to avoid implying ordinal rank; strictly, one-hot encoding uses all k binaries, while dummy coding drops one as the reference level
Binning (Bucketing)
Grouping continuous values into discrete ranges when exact values are less important than the category; can produce ordinal or dummy-coded bin variables
Normalization (Min-Max Scaling)
Rescaling values to 0–1 using (x - min) / (max - min); best for unsupervised learning; preserves original distribution shape
Standardization (Z-Score Scaling)
Rescaling values to mean = 0, SD = 1 using (x - mean) / std dev; best when data is near bell-shaped or contains extreme outliers
Normalization vs. Standardization
Normalization reduces scale gap while preserving distribution shape; standardization centers at 0 so no variable has disproportionate pull on the model
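Both scalings can be sketched with PROC STDIZE (a SAS/STAT procedure); the dataset and variable names are hypothetical:

```sas
/* Min-max scaling to 0-1: (x - min) / (max - min) */
proc stdize data=work.survey out=work.scaled method=range;
    var Salary;
run;

/* Z-score standardization: (x - mean) / std dev */
proc stdize data=work.survey out=work.zscored method=std;
    var Salary;
run;
```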
Overfitting
When a model fits training data too precisely and performs poorly on new data; caused by too many features or insufficient data
Curse of Dimensionality
When too many features are created, exponentially more data is needed to maintain model accuracy
PROC MEANS
SAS procedure producing summary statistics (N, mean, std dev, min, max) for numeric variables; VAR identifies variables, CLASS defines subgroups
VAR Statement (PROC MEANS)
Identifies which numeric variables to analyze; without it, PROC MEANS analyzes all numeric variables
CLASS Statement (PROC MEANS)
Identifies grouping variables for subgroup analysis; data does not need to be pre-sorted
N Obs vs. N (PROC MEANS)
N Obs = count of observations per class combination; N = count of non-missing values for the analysis variable
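A PROC MEANS sketch combining the statements above; dataset and variable names are hypothetical:

```sas
/* Summary statistics for Salary within each Country;
   CLASS does not require pre-sorted data */
proc means data=work.survey n mean std min max median;
    var Salary;
    class Country;
run;
```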
Median
The middle value when observations are sorted; requested in PROC MEANS with MEDIAN keyword; 50% of values fall above and below
Q1 and Q3
Q1 = value at or below which 25% of observations fall; Q3 = value at or below which 75% fall
IQR (Qrange)
Interquartile range = Q3 minus Q1; measures the spread of the middle 50% of data, reducing outlier influence
Normal Distribution
Bell-shaped, symmetric distribution; about 68% of values fall within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD
Skewness
Measures distribution asymmetry; left-skewed = longer left tail (negative); right-skewed = longer right tail (positive); normal ≈ 0
Kurtosis
Measures concentration toward tails vs. center; positive = more peaked (leptokurtic); negative = flatter (platykurtic); normal = 0
Determining Normality
Assess by comparing mean and median, and checking skewness/kurtosis (concern if > 1 or < -1); use PROC MEANS or PROC UNIVARIATE
PROC UNIVARIATE
SAS procedure providing comprehensive statistics including moments, quantiles, extreme observations, missing values, and normality tests
Extreme Observations (PROC UNIVARIATE)
Shows the five lowest and five highest values with their observation numbers; useful for identifying out-of-range values
ID Statement (PROC UNIVARIATE)
Displays a specified identifying variable (e.g., Employee_ID) alongside observation numbers in the Extreme Observations section
Test for Normality (PROC UNIVARIATE)
Goodness-of-fit tests (Kolmogorov-Smirnov, Cramer-von Mises, Anderson-Darling); p-value GREATER THAN 0.05 supports normality — opposite of most hypothesis tests
ProbPlot (PROC UNIVARIATE)
A probability plot comparing observed quantiles to theoretical normal quantiles; approximately normal if points fall near the diagonal line
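A PROC UNIVARIATE sketch tying these cards together; the dataset and analysis variable are hypothetical, with Employee_ID as the ID variable:

```sas
proc univariate data=work.survey normal;   /* NORMAL requests goodness-of-fit tests */
    var Salary;
    id Employee_ID;                        /* IDs shown in Extreme Observations */
    probplot Salary / normal(mu=est sigma=est);
run;
```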
Histogram
Groups a continuous variable into bins and displays frequency of each bin; used to visualize distribution shape
Box Plot
Shows quartile statistics; box = IQR (middle 50%), diamond = mean, whiskers extend to the most extreme values within 1.5×IQR of the box; points beyond the whiskers are outliers
PROC SGPLOT
SAS procedure for statistical graphics; key statements: VBOX/HBOX (box plots), VBAR/HBAR (bar charts), HISTOGRAM, SCATTER, REFLINE
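Two PROC SGPLOT sketches; the dataset, variables, and reference-line value are hypothetical:

```sas
/* Box plot of Salary by Country */
proc sgplot data=work.survey;
    vbox Salary / category=Country;
run;

/* Histogram of Salary with a reference line at a chosen cutoff */
proc sgplot data=work.survey;
    histogram Salary;
    refline 100000 / axis=x;
run;
```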
Correlation
A numerical relationship between two variables; does not imply causation
Pearson Correlation Coefficient
Measures the strength and direction of the linear relationship between two continuous variables; ranges from -1 to +1; the p-value indicates whether the correlation generalizes to the population
Correlation Strength Guide
0.00–0.30 = negligible; 0.30–0.50 = low; 0.50–0.70 = moderate; 0.70–0.90 = high; 0.90–1.00 = very high
Multicollinearity
High correlation between independent variables; inflates standard errors and distorts coefficients; a problem in predictive modeling
Confounding
When a variable correlated with both a predictor and the DV distorts the apparent relationship, producing misleading model results
PROC CORR
SAS procedure for correlation analysis; RANK option orders by highest correlation; PLOTS=MATRIX creates a scatter plot matrix
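A PROC CORR sketch; all dataset and variable names are hypothetical:

```sas
proc corr data=work.survey rank plots=matrix;
    var Age Salary Tenure;     /* candidate predictors */
    with Performance;          /* dependent variable */
run;
```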
Principal Component Analysis (PCA)
Creates new components from existing features to reduce dimensionality and address multicollinearity
Eigenvalue
Indicates how much variance each principal component explains; Kaiser Rule: keep components with eigenvalue > 1
Eigenvector
Contains the weights (loadings) that relate each principal component to the original features, showing how strongly each feature contributes to the component
Scree Plot
Chart of eigenvalues by component; the "elbow" indicates where additional components stop adding meaningful explanatory power
PROC PRINCOMP
SAS procedure used to run Principal Component Analysis
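A PROC PRINCOMP sketch drawing the PCA cards together; the dataset and feature names are hypothetical:

```sas
/* Scree plot helps pick components; keep eigenvalues > 1 per the Kaiser Rule */
proc princomp data=work.survey out=work.pca plots=scree;
    var Age Salary Tenure Satisfaction;
run;
```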
Simpson's Paradox
A trend that appears in aggregated data but reverses when the data are split into groups (or vice versa); often resolved by segmenting on a key variable
Anscombe's Quartet
Four datasets with nearly identical summary statistics but very different distributions; illustrates the importance of visualizing data
Data Partitioning
Splitting data into training, validation, and test sets to reduce overfitting and improve generalizability
Training Set
The largest data partition (~70%); used to build and develop the model
Validation Set
A smaller partition used to monitor and tune the model during development; large differences from training results may signal overfitting
Test Set
Held out until the end; provides a final unbiased estimate of model performance
Linear Regression
A GLM used to model relationships between continuous variables; Y = β0 + β1 * X + ε (response = intercept + slope × predictor + error)
Assumption 1 — Normality
Errors are normally distributed
Assumption 2 — Homoscedasticity
Errors have equal variances across all predicted values
Assumption 3 — Independence
Errors are independent of each other
Heteroscedasticity
Unequal variance of errors; a violation of the linear regression equal variance assumption
PROC REG
SAS procedure for linear regression; syntax: MODEL dependent = independent(s); use QUIT to stop iterative processing
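A minimal PROC REG sketch; the dataset and variable names are hypothetical:

```sas
proc reg data=work.train;
    model Salary = Age Tenure;   /* dependent = independent(s) */
run;
quit;   /* PROC REG is interactive; QUIT ends the step */
```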
F Value
Tests the overall significance of the regression model; want Pr > F < 0.05
Root MSE
The square root of the mean squared difference between predicted and actual values, expressed in the DV's units; closer to 0 is better
R-squared (R²)
Proportion of variance in the DV explained by the model; closer to 1 is better
Adjusted R²
R² penalized for number of predictors; used in multiple regression to compare models fairly
Intercept (β0)
The predicted value of Y when all predictors equal zero
Slope Coefficient (β1)
The amount Y changes for every one-unit increase in predictor X
Pr > |t|
P-value for each individual predictor in regression; want < 0.05 to consider the predictor significant
Residuals vs. Predicted Values Plot
Diagnostic plot to verify equal variance and independence assumptions; should show no pattern
Residuals vs. Quantile Plot
Diagnostic plot to verify normality of errors; residuals should follow the diagonal line
Parabolic Pattern in Residual Plot
Indicates a violation of the independence or equal variance assumption; suggests model misspecification
VIF (Variance Inflation Factor)
Measures collinearity among predictors; VIF > 10 indicates a predictor is redundant; added with /vif in PROC REG
Rule for Adjusting VIF Issues
Always remove or adjust one variable at a time, then rerun the model before making further changes
Stepwise Selection
Variable selection that starts with nothing and allows both addition and removal of variables; use /selection=stepwise in PROC REG
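VIF diagnostics and stepwise selection can be combined in one model statement (hypothetical dataset and variable names):

```sas
proc reg data=work.train;
    model Salary = Age Tenure Education Experience / selection=stepwise vif;
run;
quit;
```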
Parsimony
Preferring simpler models with fewer predictors when performance is comparable; both a business and statistical justification
Model Justification (Statistical)
Significant p-values, high R², low MSE
Model Justification (Business)
Parsimony (simplicity), explainability, and cost of implementation
Transformations — Purpose
Applied when regression assumptions are not met; transforms skewed data to better approximate normality
Transformations — Square Root
Use sqrt(x) for moderate positive skew; for moderate negative skew, reflect the variable first: sqrt(max(x) + 1 - x)
Transformations — Log
Use log10(x) for stronger positive skew; undefined for values ≤ 0, so log10(x + 1) is a common adjustment when zeros are present
Transformations — Inverse
Use 1/x for severe positive skew; the most aggressive transformation option
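The three transformations as a data-step sketch; Salary is a hypothetical positive-skewed variable:

```sas
data work.transformed;
    set work.survey;
    sqrt_salary = sqrt(Salary);        /* moderate positive skew */
    log_salary  = log10(Salary + 1);   /* stronger positive skew; +1 guards zeros */
    if Salary ne 0 then inv_salary = 1 / Salary;   /* severe positive skew */
run;
```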
Outlier Removal
Delete records with extreme values (e.g., beyond 2 SDs from the mean) to reduce skew and improve model fit
Logistic Regression
A linear classification method with a binary DV that predicts the probability of an outcome between 0 and 1 using a logit transformation
Sigmoid Function
f(x) = 1 / (1 + e^-x); the logistic function that outputs values between 0 and 1
Assumptions of Logistic Regression
Binary DV, linear relationship after transformation, independence, no multicollinearity; does NOT require homoscedasticity or normality of errors
Pearson Chi-Square Test
Tests whether observed frequencies differ from expected; significant if p < 0.05, indicating an association between variables
Odds Ratio
Odds = P(event) / P(no event), not a simple likelihood; the odds ratio compares the odds of two groups and is always positive; > 1 means the numerator group is more likely to experience the event
Odds Ratio (Continuous Predictor)
For each 1-unit increase in the predictor, the odds of the target event are multiplied by the odds ratio value
PROC LOGISTIC
SAS procedure for logistic regression; CLASS specifies categorical variables; MODEL defines response and predictors; event='1' sets the outcome of interest
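A PROC LOGISTIC sketch; the dataset, binary target, and predictors are hypothetical:

```sas
proc logistic data=work.train;
    class Country / param=ref;                       /* categorical predictor */
    model Attrition(event='1') = Age Salary Country; /* model the event of interest */
run;
```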
Forward Selection
Starts with no predictors; adds one significant variable at a time
Backward Selection
Starts with all predictors; removes one insignificant variable at a time
AIC Statistic
Model fit measure for explanatory models; lower values = better fit; used for relative comparison only