BIA 484 Exam 2

117 Terms

1

Data Understanding

The process of evaluating data using descriptive statistics and visualization to identify invalid values, missing values, unexpected distributions, and outliers

2

Descriptive Statistics

Methods to organize, describe, and summarize data; includes frequency, min/max, central tendency (mean, median, mode), and dispersion (range, variance, standard deviation)

3

PROC FREQ

SAS procedure that produces a one-way frequency table for each variable in the TABLES statement; output includes frequency, percent, cumulative frequency, and cumulative percent

4

BY Statement

Used in PROC FREQ to request separate analyses for each group; data must first be sorted or indexed by the BY variable

5

Crosstabulation Table

A two-way frequency table generated by placing an asterisk between two variables (e.g., Sex*Country); shows frequency, percent, row pct, and col pct for each combination

6

Row Pct

The percentage of the row total represented by a given cell; denominator is the row total

7

Col Pct

The percentage of the column total represented by a given cell; denominator is the column total
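
The cards describe SAS output; the sketch below uses plain Python only to make the four cell statistics of a crosstabulation concrete. The records and the helper name cell_stats are hypothetical, invented for illustration.

```python
from collections import Counter

# Hypothetical (sex, country) records, invented for illustration only.
records = [("F", "AU"), ("F", "US"), ("M", "AU"), ("M", "AU"), ("M", "US")]

cell = Counter(records)                                  # frequency of each combination
row_total = Counter(sex for sex, _ in records)           # totals per row (sex)
col_total = Counter(country for _, country in records)   # totals per column (country)
n = len(records)


def cell_stats(sex, country):
    """Return (frequency, percent, row pct, col pct) for one cell."""
    freq = cell[(sex, country)]
    return (
        freq,
        100 * freq / n,                    # share of all observations
        100 * freq / row_total[sex],       # denominator is the row total
        100 * freq / col_total[country],   # denominator is the column total
    )


print(cell_stats("M", "AU"))
```

Row Pct and Col Pct differ only in the denominator, which is why the same cell can show two different percentages.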

8

ORDER=FREQ Option

PROC FREQ option that displays results in descending frequency order; useful for identifying duplicates or dominant values

9

Duplicate Check with PROC FREQ

Use ORDER=FREQ in PROC FREQ on an ID variable; any value with frequency > 1 indicates a duplicate record
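
The same idea can be sketched outside SAS: count each ID, list the counts in descending order, and flag anything that appears more than once. The ID values are hypothetical.

```python
from collections import Counter

# Hypothetical ID values; two of them are repeated on purpose.
employee_ids = [120101, 120102, 120103, 120102, 120104, 120103, 120102]

counts = Counter(employee_ids)

# most_common() lists values from highest to lowest frequency,
# similar in spirit to ordering a frequency table by descending count.
by_frequency = counts.most_common()

# Any ID with a frequency above 1 points to a duplicate record.
duplicates = [value for value, freq in by_frequency if freq > 1]
print(duplicates)
```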

10

PROC PRINT for Invalid Data

Used after PROC FREQ to display specific observations with invalid, missing, or out-of-range values using a WHERE clause

11

IF-THEN Statement

SAS statement that executes a command for observations meeting a specific condition; syntax: IF expression THEN statement

12

Data Fixing Principle

Before correcting any data problem, you must first have a reason — always consult the data expert or source system owner

13

Data Preparation

Often the most time-consuming part of analytics (commonly cited as 60–90% of a data scientist's time); also among the most important steps, because poorly prepared data is a primary source of model error

14

Variable Cleaning

Addressing data quality problems (incorrect values, outliers, missing values) through solutions such as removing, transforming, binning, or leaving values as-is

15

MCAR

Missing Completely at Random — missing values with no pattern; example: survey responses lost in transit

16

MAR

Missing at Random — missingness is related to another observed variable; example: younger people not disclosing health status

17

MNAR

Missing Not at Random — missingness is related to the unobserved value itself; the true value can often be inferred; example: felony records not collected for certain groups

18

Listwise Deletion

Handling missing values by deleting the entire observation; simple but risks biasing data if missingness is not completely random

19

Imputation

Handling missing values by replacing them with an estimated value such as a constant, mean, median, or regression-calculated value
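
A minimal sketch of mean and median imputation, assuming missing values are stored as None. The salary figures are hypothetical and include one extreme value to show why the choice of fill value matters.

```python
from statistics import mean, median

# Hypothetical salaries; None marks a missing value.
salaries = [42000, None, 51000, 47000, None, 250000]

observed = [v for v in salaries if v is not None]
mean_fill = mean(observed)      # pulled upward by the extreme value
median_fill = median(observed)  # less sensitive to the extreme value

imputed_with_mean = [mean_fill if v is None else v for v in salaries]
imputed_with_median = [median_fill if v is None else v for v in salaries]

print(mean_fill, median_fill)
```

With skewed data like this, the median is usually the safer constant; a regression-based fill would need other variables and is not shown here.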

20

Feature Engineering

Creating new variables from existing data to improve modeling; methods include variable creation, dummy coding, binning, and scaling

21

Dummy Coding (One-Hot Encoding)

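Converting a categorical variable with k categories into k-1 binary (0/1) variables, with the omitted category serving as the reference level, to avoid implying ordinal rank; full one-hot encoding keeps all k indicators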
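
A short sketch of k-1 dummy coding, assuming the analyst picks the reference category. The function name dummy_code and the country codes are hypothetical.

```python
def dummy_code(values, reference):
    """Return one dict of 0/1 indicators per value, omitting the reference level."""
    levels = sorted(set(values) - {reference})  # k-1 remaining categories
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]


countries = ["AU", "US", "FR", "AU"]  # k = 3 categories
coded = dummy_code(countries, reference="AU")
for original, row in zip(countries, coded):
    print(original, row)
```

The reference category is the row where every indicator is 0, so no information is lost by dropping it.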

22

Binning (Bucketing)

Grouping continuous values into discrete ranges when exact values are less important than the category; can produce ordinal or dummy-coded bin variables
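
A minimal binning sketch. The cut points and labels are hypothetical; in practice they would come from the business question.

```python
from bisect import bisect_right

# Hypothetical cut points: a value equal to an edge falls in the higher bin.
edges = [30, 50]
labels = ["under 30", "30-49", "50+"]


def bin_age(age):
    """Map a continuous age to an ordinal bin label."""
    return labels[bisect_right(edges, age)]


print([bin_age(a) for a in (18, 30, 49, 50, 72)])
```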

23

Normalization (Min-Max Scaling)

Rescaling values to 0–1 using (x - min) / (max - min); best for unsupervised learning; preserves original distribution shape

24

Standardization (Z-Score Scaling)

Rescaling values to mean = 0, SD = 1 using (x - mean) / std dev; best when data is near bell-shaped or contains extreme outliers

25

Normalization vs. Standardization

Normalization reduces scale gap while preserving distribution shape; standardization centers at 0 so no variable has disproportionate pull on the model
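
The two scaling formulas side by side, as a sketch. The data are hypothetical, and the sample standard deviation (n-1 denominator) is an assumption; a population version would divide by n.

```python
from statistics import mean, stdev


def min_max_scale(values):
    """Rescale to the 0-1 range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Rescale to mean 0 and SD 1: (x - mean) / std dev."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


incomes = [10, 20, 30, 40, 50]  # hypothetical values
print(min_max_scale(incomes))
print(z_score_scale(incomes))
```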

26

Overfitting

When a model fits training data too precisely and performs poorly on new data; caused by too many features or insufficient data

27

Curse of Dimensionality

When too many features are created, exponentially more data is needed to maintain model accuracy

28

PROC MEANS

SAS procedure producing summary statistics (N, mean, std dev, min, max) for numeric variables; VAR identifies variables, CLASS defines subgroups

29

VAR Statement (PROC MEANS)

Identifies which numeric variables to analyze; without it, PROC MEANS analyzes all numeric variables

30

CLASS Statement (PROC MEANS)

Identifies grouping variables for subgroup analysis; data does not need to be pre-sorted

31

N Obs vs. N (PROC MEANS)

N Obs = count of observations per class combination; N = count of non-missing values for the analysis variable
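
The difference between the two counts shows up as soon as the analysis variable has missing values. A sketch with hypothetical rows, where None stands for a missing salary:

```python
from collections import defaultdict

# Hypothetical (department, salary) rows; None marks a missing salary.
rows = [
    ("Sales", 50000), ("Sales", None), ("Sales", 61000),
    ("Admin", 42000), ("Admin", None), ("Admin", None),
]

n_obs = defaultdict(int)  # every observation in the class level
n = defaultdict(int)      # only observations with a non-missing salary

for department, salary in rows:
    n_obs[department] += 1
    if salary is not None:
        n[department] += 1

for department in n_obs:
    print(department, "N Obs =", n_obs[department], "N =", n[department])
```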

32

Median

The middle value when observations are sorted; requested in PROC MEANS with MEDIAN keyword; 50% of values fall above and below

33

Q1 and Q3

Q1 = value at or below which 25% of observations fall; Q3 = value at or below which 75% fall

34

IQR (Qrange)

Interquartile range = Q3 minus Q1; measures the spread of the middle 50% of data, reducing outlier influence
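
A quick sketch comparing the IQR with the full range on hypothetical data containing one extreme value. Several quantile definitions exist; the "inclusive" method used here is an assumption, and SAS may apply a different definition and give slightly different quartiles on other data.

```python
from statistics import quantiles

values = [3, 5, 7, 9, 100]  # hypothetical data with one extreme value

# n=4 returns the three cut points Q1, median, Q3.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

iqr = q3 - q1                            # spread of the middle 50%
full_range = max(values) - min(values)   # dominated by the extreme value

print(q1, q3, iqr, full_range)
```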

35

Normal Distribution

Bell-shaped, symmetric distribution; about 68% of values within 1 SD, 95% within 2 SD, 99.7% within 3 SD
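
The shares within 1, 2, and 3 standard deviations can be checked from the normal cumulative distribution function. A small sketch; the helper name share_within is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```

The results round to roughly 68%, 95%, and 99.7%.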

36

Skewness

Measures distribution asymmetry; left-skewed = longer left tail (negative); right-skewed = longer right tail (positive); normal ≈ 0

37

Kurtosis

Measures concentration toward tails vs. center; positive = more peaked with heavier tails (leptokurtic); negative = flatter with lighter tails (platykurtic); normal = 0 on the excess-kurtosis scale that SAS reports
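
A sketch of skewness and excess kurtosis computed from simple population moments. This is an assumption made for brevity: statistical packages such as SAS typically apply sample-size corrections, so their values will differ somewhat, especially on small samples. The data are hypothetical.

```python
def shape_statistics(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # subtract 3 so a normal curve scores 0
    return skewness, excess_kurtosis


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 10]  # long right tail

print(shape_statistics(symmetric))
print(shape_statistics(right_skewed))
```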

38

Determining Normality

Assess by comparing mean and median, and checking skewness/kurtosis (concern if > 1 or < -1); use PROC MEANS or PROC UNIVARIATE

39

PROC UNIVARIATE

SAS procedure providing comprehensive statistics including moments, quantiles, extreme observations, missing values, and normality tests

40

Extreme Observations (PROC UNIVARIATE)

Shows the five lowest and five highest values with their observation numbers; useful for identifying out-of-range values

41

ID Statement (PROC UNIVARIATE)

Displays a specified identifying variable (e.g., Employee_ID) alongside observation numbers in the Extreme Observations section

42

Test for Normality (PROC UNIVARIATE)

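Goodness-of-fit tests (Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises, Anderson-Darling); the null hypothesis is that the data are normal, so a p-value GREATER THAN 0.05 means normality is not rejected, the opposite of what is wanted from most hypothesis tests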

43

ProbPlot (PROC UNIVARIATE)

A probability plot comparing observed quantiles to theoretical normal quantiles; approximately normal if points fall near the diagonal line

44

Histogram

Groups a continuous variable into bins and displays frequency of each bin; used to visualize distribution shape

45

Box Plot

Shows quartile statistics; box = IQR (middle 50%), line inside the box = median, diamond = mean; whiskers extend to the most extreme values within 1.5×IQR of the box; points beyond the whiskers are outliers
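
The 1.5×IQR rule behind the whiskers can be sketched as two "fences". The data are hypothetical, and the "inclusive" quantile method is an assumption; other quantile definitions can move the fences slightly.

```python
from statistics import quantiles


def box_plot_outliers(values):
    """Return (lower fence, upper fence, outliers) using the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers


values = [2, 4, 6, 8, 10, 12, 14, 16, 40]  # hypothetical, one extreme value
print(box_plot_outliers(values))
```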

46

PROC SGPLOT

SAS procedure for statistical graphics; key statements: VBOX/HBOX (box plots), VBAR/HBAR (bar charts), HISTOGRAM, SCATTER, REFLINE

47

Correlation

A numerical relationship between two variables; does not imply causation

48

Pearson Correlation Coefficient

Measures the strength and direction of the linear relationship between two continuous variables; ranges from -1 to +1; the p-value tests whether the correlation in the population differs from zero
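
The coefficient itself is the covariation of the two variables divided by the product of their spreads. A sketch with hypothetical data; it computes r only, not the p-value.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


hours = [1, 2, 3, 4, 5]      # hypothetical predictor
score = [52, 60, 61, 70, 79]  # hypothetical outcome
print(round(pearson_r(hours, score), 3))
```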

49

Correlation Strength Guide

Based on the absolute value of r: 0.00–0.30 = negligible; 0.30–0.50 = low; 0.50–0.70 = moderate; 0.70–0.90 = high; 0.90–1.00 = very high

50

Multicollinearity

High correlation between independent variables; inflates standard errors and distorts coefficients; a problem in predictive modeling

51

Confounding

When a third variable is related to both the predictor and the DV and distorts the apparent relationship between them

52

PROC CORR

SAS procedure for correlation analysis; RANK option orders by highest correlation; PLOTS=MATRIX creates a scatter plot matrix

53

Principal Component Analysis (PCA)

Creates new components from existing features to reduce dimensionality and address multicollinearity

54

Eigenvalue

Indicates how much variance each principal component explains; Kaiser Rule: keep components with eigenvalue > 1
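
For the smallest possible case, two standardized features with correlation r, the eigenvalues of the correlation matrix [[1, r], [r, 1]] are 1 + r and 1 - r. The sketch below uses that closed form to illustrate the Kaiser Rule; the value r = 0.8 is hypothetical, and real PCA on more features needs a numerical eigenvalue routine.

```python
def corr_matrix_eigenvalues(r):
    """Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]], largest first."""
    return sorted([1 + r, 1 - r], reverse=True)


eigenvalues = corr_matrix_eigenvalues(0.8)  # hypothetical correlation

# Eigenvalues of a correlation matrix sum to the number of features,
# so each one divided by the total is the share of variance explained.
total = sum(eigenvalues)
explained = [ev / total for ev in eigenvalues]

# Kaiser Rule: keep components whose eigenvalue exceeds 1.
kept = [ev for ev in eigenvalues if ev > 1]

print(eigenvalues, explained, len(kept))
```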

55

Eigenvector

Gives the weights that combine the original features into each new PCA component, showing how strongly each feature relates to that component

56

Scree Plot

Chart of eigenvalues by component; the "elbow" indicates where additional components stop adding meaningful explanatory power

57

PROC PRINCOMP

SAS procedure used to run Principal Component Analysis

58

Simpson's Paradox

A trend that appears within each subgroup reverses or disappears when the groups are combined; often resolved by segmenting on the key grouping variable
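
A numeric sketch of the reversal, using illustrative success counts for two treatments split by case size. Treatment A has the higher success rate in each subgroup, yet treatment B looks better once the subgroups are pooled, because A was used mostly on the harder cases.

```python
# Illustrative (successes, patients) counts per treatment and case size.
results = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def group_rate(treatment, group):
    successes, patients = results[treatment][group]
    return successes / patients


def overall_rate(treatment):
    successes = sum(s for s, _ in results[treatment].values())
    patients = sum(p for _, p in results[treatment].values())
    return successes / patients


for group in ("small", "large"):
    print(group, round(group_rate("A", group), 3), round(group_rate("B", group), 3))
print("overall", round(overall_rate("A"), 3), round(overall_rate("B"), 3))
```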

59

Anscombe's Quartet

Four datasets with nearly identical summary statistics but very different distributions; illustrates the importance of visualizing data

60

Data Partitioning

Splitting data into training, validation, and test sets to reduce overfitting and improve generalizability

61

Training Set

The largest data partition (~70%); used to build and develop the model

62

Validation Set

A smaller partition used to monitor and tune the model during development; large differences from training results may signal overfitting

63

Test Set

Held out until the end; provides a final unbiased estimate of model performance
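
A minimal sketch of a three-way random split. The cards give roughly 70% for training; the 15%/15% split of the remainder, the seed, and the function name partition are assumptions for illustration.

```python
import random


def partition(rows, train_pct=70, valid_pct=15, seed=42):
    """Shuffle rows and split them into training, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is held out
    return train, valid, test


train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Each row lands in exactly one partition, so the test set stays unseen until the final evaluation.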

64

Linear Regression

A GLM used to model relationships between continuous variables; Y = β0 + β1 * X + ε (response = intercept + slope × predictor + error)

65

Assumption 1 — Normality

Errors are normally distributed

66

Assumption 2 — Homoscedasticity

Errors have equal variances across all predicted values

67

Assumption 3 — Independence

Errors are independent of each other

68

Heteroscedasticity

Unequal variance of errors; a violation of the linear regression equal variance assumption

69

PROC REG

SAS procedure for linear regression; syntax: MODEL dependent = independent(s); PROC REG is interactive, so use QUIT to end it

70

F Value

Tests the overall significance of the regression model; want Pr > F < 0.05

71

Root MSE

Square root of the mean squared error; roughly the typical difference between predicted and actual values, in the units of the DV; closer to 0 is better

72

R-squared (R²)

Proportion of variance in the DV explained by the model; closer to 1 is better

73

Adjusted R²

R² penalized for number of predictors; used in multiple regression to compare models fairly
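
Root MSE, R², and Adjusted R² all come from the same two sums of squares. A sketch with hypothetical actual and predicted values from a one-predictor model; the function name fit_statistics is an assumption.

```python
from math import sqrt


def fit_statistics(actual, predicted, n_predictors):
    """Return (R-squared, adjusted R-squared, Root MSE)."""
    n = len(actual)
    mean_y = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    sst = sum((a - mean_y) ** 2 for a in actual)                # total
    error_df = n - n_predictors - 1
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / error_df  # penalty grows with predictors
    root_mse = sqrt(sse / error_df)
    return r2, adj_r2, root_mse


actual = [2, 4, 5, 4, 5]                 # hypothetical observed values
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]    # hypothetical model predictions
print(fit_statistics(actual, predicted, n_predictors=1))
```

Adjusted R² comes out below R² here because the penalty for the predictor is large relative to such a small sample.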

74

Intercept (β0)

The predicted value of Y when all predictors equal zero

75

Slope Coefficient (β1)

The amount Y changes for every one-unit increase in predictor X
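
For a single predictor, the least-squares intercept and slope have simple closed forms. A sketch with hypothetical data; the function name fit_line is an assumption.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx                      # change in Y per one-unit change in X
    intercept = mean_y - slope * mean_x    # predicted Y when X equals zero
    return intercept, slope


x = [1, 2, 3, 4, 5]  # hypothetical predictor
y = [2, 4, 5, 4, 5]  # hypothetical response
b0, b1 = fit_line(x, y)
print(b0, b1)
print("prediction at x = 6:", b0 + b1 * 6)
```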

76

Pr > |t|

P-value for each individual predictor in regression; want < 0.05 to consider the predictor significant

77

Residuals vs. Predicted Values Plot

Diagnostic plot to verify equal variance and independence assumptions; should show no pattern

78

Residuals vs. Quantile Plot

Diagnostic plot to verify normality of errors; residuals should follow the diagonal line

79

Parabolic Pattern in Residual Plot

Indicates a violation of the independence or equal variance assumption; suggests model misspecification

80

VIF (Variance Inflation Factor)

Measures collinearity among predictors; as a rule of thumb, VIF > 10 indicates a predictor is largely redundant with the others; requested with the /vif option on the MODEL statement in PROC REG
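
The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A small sketch; the R² values passed in are hypothetical.

```python
def vif(r_squared):
    """Variance inflation factor, given the R-squared of one predictor on the others."""
    return 1 / (1 - r_squared)


# Hypothetical R-squared values for three predictors.
for name, r2 in [("age", 0.0), ("tenure", 0.5), ("salary", 0.95)]:
    value = vif(r2)
    flag = "check for redundancy" if value > 10 else "ok"
    print(name, round(value, 1), flag)
```

A predictor needs more than 90% of its variance explained by the others before the VIF passes 10.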

81

Rule for Adjusting VIF Issues

Always remove or adjust one variable at a time, then rerun the model before making further changes

82

Stepwise Selection

Variable selection that starts with nothing and allows both addition and removal of variables; use /selection=stepwise in PROC REG

83

Parsimony

Preferring simpler models with fewer predictors when performance is comparable; both a business and statistical justification

84

Model Justification (Statistical)

Significant p-values, high R², low MSE

85

Model Justification (Business)

Parsimony (simplicity), explainability, and cost of implementation

86

Transformations — Purpose

Applied when regression assumptions are not met; transforms skewed data to better approximate normality

87

Transformations — Square Root

Use sqrt(x) for moderate positive skew; for moderate negative skew, reflect first: sqrt(K - x), where K = max(x) + 1

88

Transformations — Log

Use log10(x) for greater positive skew; cannot be applied when values include 0 or negative numbers

89

Transformations — Inverse

Use 1/x for severe positive skew; the most aggressive transformation option; undefined when x = 0
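
The transformations can be sketched as small functions, from mildest to most aggressive. The function names and the guard clauses are assumptions added for illustration.

```python
from math import log10, sqrt


def sqrt_transform(values):
    """Moderate positive skew."""
    return [sqrt(v) for v in values]


def reflected_sqrt_transform(values):
    """Moderate negative skew: reflect around max + 1, then take the square root."""
    k = max(values) + 1
    return [sqrt(k - v) for v in values]


def log_transform(values):
    """Greater positive skew; only defined for strictly positive values."""
    if min(values) <= 0:
        raise ValueError("log10 requires strictly positive values")
    return [log10(v) for v in values]


def inverse_transform(values):
    """Severe positive skew; undefined at zero."""
    if 0 in values:
        raise ValueError("1/x is undefined when x is 0")
    return [1 / v for v in values]


print(sqrt_transform([4, 9, 16]))
print(reflected_sqrt_transform([1, 4, 9]))
print(log_transform([1, 10, 1000]))
print(inverse_transform([1, 2, 4]))
```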

90

Outlier Removal

Delete records with extreme values (e.g., beyond 2 SDs from the mean) to reduce skew and improve model fit

91

Logistic Regression

A linear classification method with a binary DV that predicts the probability of an outcome between 0 and 1 using a logit transformation

92

Sigmoid Function

f(x) = 1 / (1 + e^-x); the logistic function that outputs values between 0 and 1
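
A short sketch of the function. It branches on the sign of x so that very negative inputs do not overflow, which is an implementation choice rather than part of the definition.

```python
from math import exp


def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1 / (1 + exp(-x))
    e = exp(x)  # safe: x is negative, so exp(x) is between 0 and 1
    return e / (1 + e)


for x in (-4, -1, 0, 1, 4):
    print(x, round(sigmoid(x), 3))
```

An input of 0 maps to a probability of 0.5, and the curve flattens toward 0 and 1 at the extremes.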

93

Assumptions of Logistic Regression

Binary DV, linear relationship after transformation, independence, no multicollinearity; does NOT require homoscedasticity or normality of errors

94

Pearson Chi-Square Test

Tests whether observed frequencies differ from expected; significant if p < 0.05, indicating an association between variables
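
The statistic compares each observed count with the count expected if the two variables were unrelated. A sketch on a hypothetical 2×2 table; it computes the statistic and degrees of freedom only, not the p-value.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count under independence
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows are two groups, columns are outcome yes/no.
statistic, df = chi_square([[20, 30], [30, 20]])
print(statistic, df)
```

With 1 degree of freedom the 0.05 critical value is about 3.84, so a statistic of 4.0 corresponds to p < 0.05.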

95

Odds Ratio

Odds = P(1)/P(0) for one group; the odds ratio divides the odds for one group by the odds for another; not a simple likelihood; must be positive; > 1 means the numerator group has higher odds of experiencing the event

96

Odds Ratio (Continuous Predictor)

For each 1-unit increase in the predictor, the odds of the target event are multiplied by the odds ratio value
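
A sketch connecting probabilities, odds, and odds ratios. For a continuous predictor in logistic regression, the odds ratio is the exponentiated coefficient. The probabilities, the coefficient, and the function names are hypothetical.

```python
from math import exp, log


def odds(p):
    """Odds of the event: P(event) / P(no event)."""
    return p / (1 - p)


def odds_ratio(p_group_a, p_group_b):
    """Odds for group A divided by odds for group B."""
    return odds(p_group_a) / odds(p_group_b)


def odds_multiplier(beta, units=1):
    """Factor by which the odds change when the predictor rises by `units`."""
    return exp(beta * units)


print(odds(0.8))             # about 4: the event is four times as likely as not
print(odds_ratio(0.8, 0.5))  # about 4: group A has four times the odds of group B
beta = log(2)                # hypothetical coefficient, chosen so the odds ratio is 2
print(odds_multiplier(beta), odds_multiplier(beta, units=3))
```

The effect is multiplicative: a 3-unit increase multiplies the odds by the odds ratio three times over.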

97

PROC Logistic

SAS procedure for logistic regression; CLASS specifies categorical variables; MODEL defines response and predictors; event='1' sets the outcome of interest

98

Forward Selection

Starts with no predictors; adds one significant variable at a time

99

Backward Selection

Starts with all predictors; removes one insignificant variable at a time

100

AIC Statistic

Model fit measure for explanatory models; lower values = better fit; used for relative comparison only
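
AIC is computed as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. A sketch comparing two candidate models; the log-likelihood values are hypothetical.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical fits: the larger model has a slightly better likelihood
# but pays a penalty for its two extra parameters.
models = {
    "three_parameter": aic(log_likelihood=-120.5, n_parameters=3),
    "five_parameter": aic(log_likelihood=-119.8, n_parameters=5),
}

best = min(models, key=models.get)
print(models, best)
```

Only the difference between the two values is meaningful; a single AIC value says nothing on its own.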
