data analysis final

Last updated 8:45 PM on 4/25/26

65 Terms

1

correlation coefficient

measurement of the strength of a linear relationship between two variables

2

interpolation

predicting y values for values that are between the x values in the dataset

3

extrapolation

predicting y values for x values that are beyond the range of the values in the dataset

4

coefficient of determination / R²

the proportion of the variation in the response variable that is explained by the regression model
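
A worked formula may help here. This is a sketch using symbols that are not on the card: SSE is assumed to be the residual (error) sum of squares and SST the total sum of squares of the response.

```latex
R^2 = 1 - \frac{SSE}{SST}
    = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```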

5

MSE

what can you use as an estimate of the error variance for significance testing?
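
As a sketch of the usual definition, assuming n observations, p predictor variables, and SSE as the residual sum of squares (symbols not on the card):

```latex
MSE = \frac{SSE}{n - p - 1}
    = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}
```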

6

t test

tests whether an individual regression coefficient (slope) is significantly different from 0

7

f test

tests the overall regression model for statistical significance (whether at least one predictor is related to the response)

8

adjusted R²

proportion of explained variation in the regression model, adjusted for the number of predictor variables in the model
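
The adjustment can be written out as follows. This is a sketch assuming n observations and p predictor variables (symbols not on the card):

```latex
R^2_{adj} = 1 - (1 - R^2) \, \frac{n - 1}{n - p - 1}
```

Adding a predictor always raises or keeps R², but it raises adjusted R² only if the gain in fit outweighs the lost degree of freedom.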

9

multicollinearity

when two or more predictor variables have a high linear correlation with each other

10

VIF > 10

how do you determine serious multicollinearity?

11

VIF > 4

how do you determine potential multicollinearity?
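
To make the two VIF cutoffs concrete, here is a minimal sketch that computes VIF by hand in base R. The built-in mtcars data set and the choice of predictors are assumptions for illustration only. It uses the identity VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the other predictors.

```r
# Hypothetical example: three predictors from the built-in mtcars data.
predictors <- mtcars[, c("disp", "hp", "wt")]

# For each predictor, regress it on the others and convert R^2 to a VIF.
vif_manual <- sapply(names(predictors), function(name) {
  others <- setdiff(names(predictors), name)
  fit <- lm(reformulate(others, response = name), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

print(vif_manual)
print(vif_manual > 4)   # potential multicollinearity
print(vif_manual > 10)  # serious multicollinearity
```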

12

confidence intervals around a mean response

interval estimating the mean response of the population, constructed around the predicted value of Y at a given value of X

13

prediction intervals for an individual response

interval estimating a single individual’s response, constructed around the predicted value of Y at a given value of X

14

larger

is the standard error of a prediction interval for an individual response larger or smaller than the standard error of E[y|x]?
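
The difference between the two intervals can be seen with base R's predict(). This is a sketch under assumptions: the mtcars data set, the model mpg ~ wt, and the new value wt = 3 are all illustrative choices.

```r
# Hypothetical model: fuel economy explained by weight.
reg <- lm(mpg ~ wt, data = mtcars)
new_x <- data.frame(wt = 3)

# Interval for the mean response E[y|x] at wt = 3.
conf_int <- predict(reg, newdata = new_x, interval = "confidence")

# Interval for one individual response at wt = 3.
pred_int <- predict(reg, newdata = new_x, interval = "prediction")

print(conf_int)
print(pred_int)
```

Both intervals share the same center (the fitted value), but the prediction interval is wider because it also carries the variability of a single observation around the mean.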

15

analysis of variance

collection of statistical models, and their associated estimation procedures, used to analyze the differences among group means

16

dummy variables / indicator variables

variables representing a categorical (non-numerical) variable by encoding its levels as 0/1 values

17

K - 1 rule

the number of dummy variables needed to represent a categorical variable with K levels
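
The K - 1 rule can be checked directly in base R. This is a sketch; using the cyl column of the built-in mtcars data as the categorical variable is an assumption for illustration.

```r
# cyl has K = 3 levels (4, 6, 8) once treated as a factor.
cyl_factor <- factor(mtcars$cyl)

# model.matrix() shows the columns lm() would build for this factor.
design <- model.matrix(~ cyl_factor)

print(levels(cyl_factor))
print(colnames(design))  # intercept plus K - 1 dummy columns
```

The first level acts as the reference category, so it gets no dummy column of its own.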

18

not significantly different from the reference category

what does a high p-value mean when using dummy variables?

19

mean centering

transforming predictors by subtracting their means, so the intercept is the predicted response at the mean of the predictors instead of at predictors = 0

20

interaction terms

product of two predictors added to the model to test whether a variable has the same or a different effect (slope) across the levels of a dummy variable
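
A minimal sketch of an interaction between a numeric predictor and a dummy variable in base R. The mtcars data set and the variables wt and am are assumptions chosen for illustration.

```r
# Hypothetical example: does the slope of wt differ by transmission type (am)?
cars_df <- transform(mtcars, am = factor(am))

# wt * am expands to wt + am + wt:am (main effects plus interaction).
fit <- lm(mpg ~ wt * am, data = cars_df)
coef_table <- summary(fit)$coefficients

print(coef_table)  # the wt:am1 row tests whether the slopes differ
```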

21

curvilinear relationship

relationship between variables where the effect is not a straight line, requiring the use of a polynomial term in the regression equation

22

residual plot shows a non-linear/curved pattern

when is there a curvilinear relationship?

23

simple first order model with one predictor variable

regression modeling a linear relationship with one predictor variable

24

second order model with one predictor variable

regression modeling a curvilinear relationship with one predictor variable
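
The first order and second order models can be compared side by side in base R. This is a sketch; the mtcars data set and the predictor hp are assumptions for illustration.

```r
# First order: straight-line relationship with one predictor.
first_order <- lm(mpg ~ hp, data = mtcars)

# Second order: adds a squared term; I() keeps ^ as arithmetic in the formula.
second_order <- lm(mpg ~ hp + I(hp^2), data = mtcars)

print(coef(first_order))
print(coef(second_order))
```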

25

linearity, independence, normality, constant variance

what are the 4 core assumptions of a valid regression model?

26

independence

assumption that each residual error term is independent of the others

27

normally distributed, randomly scattered around 0, constant spread from left to right, no obvious patterns

what are the 4 factors considered when checking residuals?

28

homoscedasticity

residuals display a constant, even band of variance across all fitted values

29

heteroscedasticity

residuals fan out, funnel, contract, or otherwise show a shape, indicating non-constant error variance

30

logarithmic transformation, power transformation

what are the two transformations used on the response variable to fix heteroscedasticity?

31

time ordered, clustered data, repeated measures

what 3 data structures indicate that data points are not independent?

32

time ordered

observations of independent data should not be sequential over time

33

clustered data

observations of independent data should not be similar geographically or organizationally (e.g., from the same company)

34

repeated measures

observations of independent data should not be repeated measurements of the same subject

35

influential observation / outlier

observation that has a disproportionate effect on the slope and Y-intercept in regression

36

leverage

how much potential an observation has to influence the model, based on how far its X value is from the other X values

37

standardized residuals

residuals scaled by the overall model’s variability so you can compare residuals across observations

38

studentized residuals

residuals scaled using the error estimate excluding that particular observation, which is better for detecting outliers
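
Both kinds of scaled residuals are available in base R through rstandard() and rstudent(). This is a sketch; the mtcars data set and the model mpg ~ wt are assumptions for illustration.

```r
# Hypothetical model used only to produce residuals.
reg <- lm(mpg ~ wt, data = mtcars)

# Scaled with the overall residual standard error of the model.
standardized <- rstandard(reg)

# Scaled with an error estimate that leaves out the observation itself.
studentized <- rstudent(reg)

print(head(cbind(standardized, studentized)))
```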

39

Cook’s distance

measure of how much the regression model will change if you remove one observation

40

collect more data near that point, use transformations, remove the point

what 3 methods can you use to handle outliers?

41

D > 4/n

when is an observation flagged as influential (a candidate for removal) using Cook's distance?
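
The 4/n cutoff can be applied with base R's cooks.distance(). This is a sketch; the mtcars data set and the model mpg ~ wt are assumptions for illustration.

```r
# Hypothetical model; any fitted lm object works the same way.
reg <- lm(mpg ~ wt, data = mtcars)

d <- cooks.distance(reg)  # one distance per observation
n <- nrow(mtcars)

# Keep only the observations whose distance exceeds the 4/n rule of thumb.
flagged <- d[d > 4 / n]
print(flagged)
```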

42

reg = lm(y ~ x, data = df)

what is the R syntax for linear regression?

43

abline(reg, col = "red")

what is the R syntax for adding a regression line to a graph?

44

res = reg$residuals

what is the R syntax for finding residuals?
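
The three R cards above (fitting, drawing the line, extracting residuals) fit together as one short workflow. This is a sketch; the mtcars data set and the variables mpg and wt stand in for the cards' generic df, y, and x.

```r
# Fit the regression (the cards write this with =, which works the same here).
reg <- lm(mpg ~ wt, data = mtcars)

# Scatterplot first, then the fitted line on top of it.
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
abline(reg, col = "red")

# Residuals: observed minus fitted values.
res <- reg$residuals
print(head(res))
```

resid(reg) returns the same values as reg$residuals.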

45

str(variable)

what is the R syntax for checking variable types?

46

variable = factor(variable)

what is the R syntax for creating dummy variables?

47

qq plot

plot of residual quantiles against normal quantiles telling us if residuals are normally distributed. points should follow a relatively straight, diagonal line

48

subtract the variable's mean from each data point

how do you find the mean center of a variable?
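
Mean centering takes one line in base R. This is a sketch; the mtcars data set and the variables mpg and wt are assumptions for illustration.

```r
# Center wt by subtracting its mean from every value.
cars_df <- transform(mtcars, wt_centered = wt - mean(wt))

# Refit with the centered predictor.
reg_centered <- lm(mpg ~ wt_centered, data = cars_df)

print(mean(cars_df$wt_centered))  # numerically zero
print(coef(reg_centered))         # intercept is now the prediction at average wt
```

With a single centered predictor, the intercept equals the mean of the response, while the slope is unchanged from the uncentered model.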

49

parsimonious model

relatively simple regression model with few predictor variables and relatively high R²

50

forward selection

model selection method adding one predictor at a time to an intercept only model as long as there is a significant reduction in the residual SSE. variables cannot be removed once added

51

backward elimination

model selection method removing one predictor at a time from a full model based on P-values. variables cannot be added again once removed.

52

best subsets regression

model selection method that evaluates every combination of predictors and selects the best model based on a chosen criterion, like BIC or R². gives the lowest error for each number of predictors, since no combination is skipped

53

plot(reg)

what is the R script for generating the four diagnostic plots?
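
Calling plot() on a fitted lm object draws the diagnostic plots one after another. The sketch below puts them on a single page; the mtcars data set and the model mpg ~ wt are assumptions for illustration.

```r
# Hypothetical model to diagnose.
reg <- lm(mpg ~ wt, data = mtcars)

par(mfrow = c(2, 2))  # 2 x 2 grid so all four plots share one page
plot(reg)             # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))  # restore the single-plot layout
```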

54

vif()

what is the R script for the variance inflation factor?

55

stepwise regression

hybrid model selection method that can both add and remove variables at each step. starts with an intercept only model

56

heuristics

model selection procedures that add or remove one variable at a time, such as stepwise regression, forward selection, or backward elimination

57

Mallows' Cp

model selection criterion that compares a reduced model to the full multiple regression model to see if the reduced model has enough predictors without overfitting

58

lowest Cp (ideally close to the number of parameters in the model)

how do you select a model using Mallows' Cp?

59

akaike information criterion / AIC

model selection criterion that balances model fit and complexity by judging the amount of information lost by a given model while penalizing for number of predictors. focuses on best fit

60

Bayesian information criterion / BIC

model selection criterion that evaluates a model's fit while adjusting for the number of predictors, used to prevent overfitting. tends to pick simplest model

61

lowest BIC

how do you select a model using BIC?

62

lowest AIC

how do you select a model using AIC?
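
Both criteria can be computed with base R's AIC() and BIC(). This is a sketch; the mtcars data set and the two candidate models are assumptions for illustration.

```r
# Two hypothetical candidate models for the same response.
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Lower is better for both criteria.
print(AIC(small, large))
print(BIC(small, large))
```

BIC charges log(n) per estimated parameter where AIC charges 2, so with more than about 7 observations BIC penalizes extra predictors more heavily, which is why it tends to pick the simpler model.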

63

step()

what is the R script for stepwise regression?
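
A minimal sketch of stepwise regression with base R's step(), starting from an intercept only model. The mtcars data set and the pool of candidate predictors are assumptions for illustration. By default step() compares models by AIC.

```r
# Start from the intercept only model.
null_model <- lm(mpg ~ 1, data = mtcars)

# Hypothetical pool of candidate predictors (the upper limit of the search).
full_scope <- mpg ~ wt + hp + qsec + drat

# direction = "both" lets each step add or remove a variable; trace = 0 hides the log.
chosen <- step(null_model, scope = full_scope, direction = "both", trace = 0)

print(formula(chosen))
```

Setting direction to "forward" or "backward" gives forward selection or backward elimination; backward elimination would start from the full model instead.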

64

regsubsets()

what is the R script for best subsets regression?

65

qqnorm(reg$residuals); qqline(reg$residuals)

what is the R script for qq plot?
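
A minimal sketch of a normal Q-Q plot of residuals in base R. Base R's qqplot() compares two numeric samples, so for a fitted model this sketch uses qqnorm() and qqline() on the residuals instead. The mtcars data set and the model mpg ~ wt are assumptions for illustration.

```r
# Hypothetical model whose residuals we want to check for normality.
reg <- lm(mpg ~ wt, data = mtcars)
res <- reg$residuals

qqnorm(res)               # residual quantiles against normal quantiles
qqline(res, col = "red")  # reference line the points should follow
```

plot(reg, which = 2) draws a similar Q-Q plot using standardized residuals.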