Data Understanding
The process of evaluating data using descriptive statistics and visualization to identify invalid values, missing values, unexpected distributions, and outliers
Descriptive Statistics
Methods to organize, describe, and summarize data; includes frequency, min/max, central tendency (mean, median, mode), and dispersion (range, variance, standard deviation)
PROC FREQ
SAS procedure that produces a one-way frequency table for each variable in the TABLES statement; output includes frequency, percent, and cumulative statistics
BY Statement
Used in PROC FREQ to request separate analyses for each group; data must first be sorted or indexed by the BY variable
Crosstabulation Table
A two-way frequency table generated by placing an asterisk between two variables (e.g., Sex*Country); shows frequency, percent, row pct, and col pct for each combination
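The frequency cards above can be sketched in one PROC FREQ step; the dataset work.survey and the variables Country and Sex are hypothetical stand-ins:

```sas
/* One-way and two-way (crosstab) frequency tables;
   dataset and variable names are hypothetical */
proc freq data=work.survey;
    tables Country;        /* one-way: frequency, percent, cumulative stats */
    tables Sex*Country;    /* crosstab: frequency, percent, row pct, col pct */
run;
```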
Row Pct
The percentage of the row total represented by a given cell; denominator is the row total
Col Pct
The percentage of the column total represented by a given cell; denominator is the column total
ORDER=FREQ Option
PROC FREQ option that displays results in descending frequency order; useful for identifying duplicates or dominant values
Duplicate Check with PROC FREQ
Use ORDER=FREQ in PROC FREQ on an ID variable; any value with frequency > 1 indicates a duplicate record
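A minimal duplicate-check sketch, assuming a hypothetical dataset work.survey keyed by Employee_ID:

```sas
/* Values appear in descending frequency order;
   any ID with Frequency > 1 is a duplicate record */
proc freq data=work.survey order=freq;
    tables Employee_ID;
run;
```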
PROC PRINT for Invalid Data
Used after PROC FREQ to display specific observations with invalid, missing, or out-of-range values using a WHERE clause
IF-THEN Statement
SAS statement that executes a command for observations meeting a specific condition; syntax: IF expression THEN statement
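A data-step sketch of IF-THEN used for conditional fixes; the dataset and the specific corrections shown are hypothetical:

```sas
data work.survey_fixed;
    set work.survey;
    if Gender = 'm' then Gender = 'M';   /* standardize a miscoded value */
    if Salary < 0 then Salary = .;       /* set an impossible value to missing */
run;
```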
Data Fixing Principle
Before correcting any data problem, you must first have a reason — always consult the data expert or source system owner
Data Preparation
The most time-consuming part of analytics (60–90% of a data scientist's time); the most important step and primary source of model error
Variable Cleaning
Addressing data quality problems (incorrect values, outliers, missing values) through solutions such as removing, transforming, binning, or leaving values as-is
MCAR
Missing Completely at Random — missing values with no pattern; example: survey responses lost in transit
MAR
Missing at Random — missingness is related to another observed variable; example: younger people not disclosing health status
MNAR
Missing Not at Random — missingness is related to the unobserved value itself, so the true value can often be inferred from the fact that it is missing; example: people with felony records declining to report them
Listwise Deletion
Handling missing values by deleting the entire observation; simple but risks biasing data if missingness is not completely random
Imputation
Handling missing values by replacing them with an estimated value such as a constant, mean, median, or regression-calculated value
Feature Engineering
Creating new variables from existing data to improve modeling; methods include variable creation, dummy coding, binning, and scaling
Dummy Coding (One-Hot Encoding)
Converting a categorical variable with k categories into k-1 binary (0/1) variables to avoid implying ordinal rank; strictly, one-hot encoding uses all k binaries, while dummy coding drops one as the reference level
Binning (Bucketing)
Grouping continuous values into discrete ranges when exact values are less important than the category; can produce ordinal or dummy-coded bin variables
Normalization (Min-Max Scaling)
Rescaling values to 0–1 using (x - min) / (max - min); best for unsupervised learning; preserves original distribution shape
Standardization (Z-Score Scaling)
Rescaling values to mean = 0, SD = 1 using (x - mean) / std dev; best when data is near bell-shaped or contains extreme outliers
Normalization vs. Standardization
Normalization reduces scale gap while preserving distribution shape; standardization centers at 0 so no variable has disproportionate pull on the model
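Both scalings can be sketched with PROC STDIZE (a SAS/STAT procedure); the dataset and variable names are hypothetical:

```sas
/* Min-max scaling to 0-1: (x - min) / (max - min) */
proc stdize data=work.survey out=work.scaled method=range;
    var Salary;
run;

/* Z-score standardization: (x - mean) / std dev */
proc stdize data=work.survey out=work.zscored method=std;
    var Salary;
run;
```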
Overfitting
When a model fits training data too precisely and performs poorly on new data; caused by too many features or insufficient data
Curse of Dimensionality
When too many features are created, exponentially more data is needed to maintain model accuracy
PROC MEANS
SAS procedure producing summary statistics (N, mean, std dev, min, max) for numeric variables; VAR identifies variables, CLASS defines subgroups
VAR Statement (PROC MEANS)
Identifies which numeric variables to analyze; without it, PROC MEANS analyzes all numeric variables
CLASS Statement (PROC MEANS)
Identifies grouping variables for subgroup analysis; data does not need to be pre-sorted
N Obs vs. N (PROC MEANS)
N Obs = count of observations per class combination; N = count of non-missing values for the analysis variable
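A PROC MEANS sketch combining the statements above; dataset and variable names are hypothetical:

```sas
/* Summary statistics for Salary within each Country;
   CLASS does not require pre-sorted data */
proc means data=work.survey n mean std min max median;
    var Salary;
    class Country;
run;
```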
Median
The middle value when observations are sorted; requested in PROC MEANS with MEDIAN keyword; 50% of values fall above and below
Q1 and Q3
Q1 = value at or below which 25% of observations fall; Q3 = value at or below which 75% fall
IQR (Qrange)
Interquartile range = Q3 minus Q1; measures the spread of the middle 50% of data, reducing outlier influence
Normal Distribution
Bell-shaped, symmetric distribution; about 68% of values fall within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD
Skewness
Measures distribution asymmetry; left-skewed = longer left tail (negative); right-skewed = longer right tail (positive); normal ≈ 0
Kurtosis
Measures concentration toward tails vs. center; positive = more peaked (leptokurtic); negative = flatter (platykurtic); normal = 0
Determining Normality
Assess by comparing mean and median, and checking skewness/kurtosis (concern if > 1 or < -1); use PROC MEANS or PROC UNIVARIATE
PROC UNIVARIATE
SAS procedure providing comprehensive statistics including moments, quantiles, extreme observations, missing values, and normality tests
Extreme Observations (PROC UNIVARIATE)
Shows the five lowest and five highest values with their observation numbers; useful for identifying out-of-range values
ID Statement (PROC UNIVARIATE)
Displays a specified identifying variable (e.g., Employee_ID) alongside observation numbers in the Extreme Observations section
Test for Normality (PROC UNIVARIATE)
Goodness-of-fit tests (Kolmogorov-Smirnov, Cramer-von Mises, Anderson-Darling); p-value GREATER THAN 0.05 supports normality — opposite of most hypothesis tests
ProbPlot (PROC UNIVARIATE)
A probability plot comparing observed quantiles to theoretical normal quantiles; approximately normal if points fall near the diagonal line
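A PROC UNIVARIATE sketch tying these cards together; the dataset and analysis variable are hypothetical, with Employee_ID as the ID variable:

```sas
proc univariate data=work.survey normal;   /* NORMAL requests goodness-of-fit tests */
    var Salary;
    id Employee_ID;                        /* IDs shown in Extreme Observations */
    probplot Salary / normal(mu=est sigma=est);
run;
```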
Histogram
Groups a continuous variable into bins and displays frequency of each bin; used to visualize distribution shape
Box Plot
Shows quartile statistics; box = IQR (middle 50%), diamond = mean, whiskers extend to the most extreme values within 1.5×IQR of the box; points beyond the whiskers are outliers
PROC SGPLOT
SAS procedure for statistical graphics; key statements: VBOX/HBOX (box plots), VBAR/HBAR (bar charts), HISTOGRAM, SCATTER, REFLINE
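Two PROC SGPLOT sketches; the dataset, variables, and reference-line value are hypothetical:

```sas
/* Box plot of Salary by Country */
proc sgplot data=work.survey;
    vbox Salary / category=Country;
run;

/* Histogram of Salary with a reference line at a chosen cutoff */
proc sgplot data=work.survey;
    histogram Salary;
    refline 100000 / axis=x;
run;
```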
Correlation
A numerical relationship between two variables; does not imply causation
Pearson Correlation Coefficient
Measures the strength and direction of the linear relationship between two continuous variables; ranges from -1 to +1; the p-value indicates whether the correlation generalizes to the population
Correlation Strength Guide
0.00–0.30 = negligible; 0.30–0.50 = low; 0.50–0.70 = moderate; 0.70–0.90 = high; 0.90–1.00 = very high
Multicollinearity
High correlation between independent variables; inflates standard errors and distorts coefficients; a problem in predictive modeling
Confounding
When a variable correlated with both a predictor and the DV distorts the apparent relationship, producing misleading model results
PROC CORR
SAS procedure for correlation analysis; RANK option orders by highest correlation; PLOTS=MATRIX creates a scatter plot matrix
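A PROC CORR sketch; all dataset and variable names are hypothetical:

```sas
proc corr data=work.survey rank plots=matrix;
    var Age Salary Tenure;     /* candidate predictors */
    with Performance;          /* dependent variable */
run;
```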
Principal Component Analysis (PCA)
Creates new components from existing features to reduce dimensionality and address multicollinearity
Eigenvalue
Indicates how much variance each principal component explains; Kaiser Rule: keep components with eigenvalue > 1
Eigenvector
Contains the weights (loadings) that relate each principal component to the original features, showing how strongly each feature contributes to the component
Scree Plot
Chart of eigenvalues by component; the "elbow" indicates where additional components stop adding meaningful explanatory power
PROC PRINCOMP
SAS procedure used to run Principal Component Analysis
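A PROC PRINCOMP sketch drawing the PCA cards together; the dataset and feature names are hypothetical:

```sas
/* Scree plot helps pick components; keep eigenvalues > 1 per the Kaiser Rule */
proc princomp data=work.survey out=work.pca plots=scree;
    var Age Salary Tenure Satisfaction;
run;
```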
Simpson's Paradox
A trend that appears in aggregated data but reverses when the data are split into groups (or vice versa); often resolved by segmenting on a key variable
Anscombe's Quartet
Four datasets with nearly identical summary statistics but very different distributions; illustrates the importance of visualizing data
Data Partitioning
Splitting data into training, validation, and test sets to reduce overfitting and improve generalizability
Training Set
The largest data partition (~70%); used to build and develop the model
Validation Set
A smaller partition used to monitor and tune the model during development; large differences from training results may signal overfitting
Test Set
Held out until the end; provides a final unbiased estimate of model performance
Linear Regression
A GLM used to model relationships between continuous variables; Y = β0 + β1 * X + ε (response = intercept + slope × predictor + error)
Assumption 1 — Normality
Errors are normally distributed
Assumption 2 — Homoscedasticity
Errors have equal variances across all predicted values
Assumption 3 — Independence
Errors are independent of each other
Heteroscedasticity
Unequal variance of errors; a violation of the linear regression equal variance assumption
PROC REG
SAS procedure for linear regression; syntax: MODEL dependent = independent(s); use QUIT to stop iterative processing
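A minimal PROC REG sketch; the dataset and variable names are hypothetical:

```sas
proc reg data=work.train;
    model Salary = Age Tenure;   /* dependent = independent(s) */
run;
quit;   /* PROC REG is interactive; QUIT ends the step */
```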
F Value
Tests the overall significance of the regression model; want Pr > F < 0.05
Root MSE
The square root of the mean squared difference between predicted and actual values, expressed in the DV's units; closer to 0 is better
R-squared (R²)
Proportion of variance in the DV explained by the model; closer to 1 is better
Adjusted R²
R² penalized for number of predictors; used in multiple regression to compare models fairly
Intercept (β0)
The predicted value of Y when all predictors equal zero
Slope Coefficient (β1)
The amount Y changes for every one-unit increase in predictor X
Pr > |t|
P-value for each individual predictor in regression; want < 0.05 to consider the predictor significant
Residuals vs. Predicted Values Plot
Diagnostic plot to verify equal variance and independence assumptions; should show no pattern
Residuals vs. Quantile Plot
Diagnostic plot to verify normality of errors; residuals should follow the diagonal line
Parabolic Pattern in Residual Plot
Indicates a violation of the independence or equal variance assumption; suggests model misspecification
VIF (Variance Inflation Factor)
Measures collinearity among predictors; VIF > 10 indicates a predictor is redundant; added with /vif in PROC REG
Rule for Adjusting VIF Issues
Always remove or adjust one variable at a time, then rerun the model before making further changes
Stepwise Selection
Variable selection that starts with nothing and allows both addition and removal of variables; use /selection=stepwise in PROC REG
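VIF diagnostics and stepwise selection can be combined in one model statement (hypothetical dataset and variable names):

```sas
proc reg data=work.train;
    model Salary = Age Tenure Education Experience / selection=stepwise vif;
run;
quit;
```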
Parsimony
Preferring simpler models with fewer predictors when performance is comparable; both a business and statistical justification
Model Justification (Statistical)
Significant p-values, high R², low MSE
Model Justification (Business)
Parsimony (simplicity), explainability, and cost of implementation
Transformations — Purpose
Applied when regression assumptions are not met; transforms skewed data to better approximate normality
Transformations — Square Root
Use sqrt(x) for moderate positive skew; for moderate negative skew, reflect the variable first: sqrt(max(x) + 1 - x)
Transformations — Log
Use log10(x) for stronger positive skew; undefined for values ≤ 0, so log10(x + 1) is a common adjustment when zeros are present
Transformations — Inverse
Use 1/x for severe positive skew; the most aggressive transformation option
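The three transformations as a data-step sketch; Salary is a hypothetical positive-skewed variable:

```sas
data work.transformed;
    set work.survey;
    sqrt_salary = sqrt(Salary);        /* moderate positive skew */
    log_salary  = log10(Salary + 1);   /* stronger positive skew; +1 guards zeros */
    if Salary ne 0 then inv_salary = 1 / Salary;   /* severe positive skew */
run;
```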
Outlier Removal
Delete records with extreme values (e.g., beyond 2 SDs from the mean) to reduce skew and improve model fit
Logistic Regression
A linear classification method with a binary DV that predicts the probability of an outcome between 0 and 1 using a logit transformation
Sigmoid Function
f(x) = 1 / (1 + e^-x); the logistic function that outputs values between 0 and 1
Assumptions of Logistic Regression
Binary DV, linear relationship after transformation, independence, no multicollinearity; does NOT require homoscedasticity or normality of errors
Pearson Chi-Square Test
Tests whether observed frequencies differ from expected; significant if p < 0.05, indicating an association between variables
Odds Ratio
Odds = P(event) / P(no event), not a simple likelihood; the odds ratio compares the odds of two groups and is always positive; > 1 means the numerator group is more likely to experience the event
Odds Ratio (Continuous Predictor)
For each 1-unit increase in the predictor, the odds of the target event are multiplied by the odds ratio value
PROC LOGISTIC
SAS procedure for logistic regression; CLASS specifies categorical variables; MODEL defines response and predictors; event='1' sets the outcome of interest
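A PROC LOGISTIC sketch; the dataset, binary target, and predictors are hypothetical:

```sas
proc logistic data=work.train;
    class Country / param=ref;                       /* categorical predictor */
    model Attrition(event='1') = Age Salary Country; /* model the event of interest */
run;
```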
Forward Selection
Starts with no predictors; adds one significant variable at a time
Backward Selection
Starts with all predictors; removes one insignificant variable at a time
AIC Statistic
Model fit measure for explanatory models; lower values = better fit; used for relative comparison only