2 Types of Studies
Randomized experiment and observational study
Randomized Experiment (2 types)
Matched pairs, Randomized Comparative
Type of conclusion that can be made from randomized experiment
Causal (If explanatory is randomly assigned)
Type of conclusion that can be made from observational study
Only association
Confidence Interval Formula
Sample statistic ± (z or t multiplier) × standard error
What uses the z multiplier vs what uses the t multiplier
Proportions use z
Means or rho (correlation) use t
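A minimal sketch of the confidence interval formula for a single proportion, which uses the z multiplier. The function name `proportion_ci` and the counts (60 successes out of 100) are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Sample statistic +/- z multiplier * standard error, for one proportion."""
    p_hat = successes / n                                     # sample statistic
    z_mult = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    std_err = sqrt(p_hat * (1 - p_hat) / n)                   # SE of a sample proportion
    margin = z_mult * std_err
    return p_hat - margin, p_hat + margin


# Hypothetical data: 60 successes out of 100 trials
low, high = proportion_ci(60, 100)
print(round(low, 3), round(high, 3))
```

For a mean, the same structure applies with a t multiplier in place of `z_mult`; the standard library has no t quantile function, so that case is not shown here.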
SLR Conditions
Linearity
Independence
Normality
Equal Variance
Randomness
What to use to determine each (Note about randomness)
Independence, Randomness: Problem statement
Randomness of sampling determines if you can generalize to the pop. or not
Linearity, Equal Variance: Residuals vs Fitted plots
Normality: QQ Plot
Residuals Formula
Observed - Predicted
Empirical Rule
68% - 1 SD
95% - 2 SD
99.7 % - 3 SD
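The empirical rule percentages can be checked against the standard normal distribution. This is a small sketch; the helper name `within_k_sd` is made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def within_k_sd(k):
    """Probability that a normal value falls within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


coverage = {k: within_k_sd(k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(k, "SD:", round(100 * prob, 1), "%")
```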
Concavity rule with transforms
If concave up, try power > 1
If concave down, try power < 1
Outlier vs influential
Outlier: Far from regression line
Influential: Has big impact on the regression fit
For an influential point, how does studentized compare to standardized residual
|Studentized| > |Standardized|
What leverage measures
A point's potential to be influential
Only looks at how far a point's x value is from the mean of the x values
Cook’s distance combines…
Residuals for y distance and leverage for x distance
R² interpretation and formula
Proportion of variability in response explained by the model
SSModel / SSTotal
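A minimal sketch tying together residuals (observed - predicted) and R² (SSModel / SSTotal) for a simple linear regression. The data and the helper name `fit_slr` are hypothetical.

```python
from statistics import mean


def fit_slr(x, y):
    """Least-squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, used only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_slr(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pred for yi, pred in zip(y, predicted)]  # observed - predicted

ss_total = sum((yi - mean(y)) ** 2 for yi in y)
ss_error = sum(r ** 2 for r in residuals)
ss_model = ss_total - ss_error
r_squared = ss_model / ss_total  # proportion of variability explained

print(b0, b1, r_squared)
```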
3 Ways to test for regression
T-test for the slope
Is B1 = 0 or not equal to 0
T-test for correlation
Is there a linear relationship between x and y
Overall F-Test
Are all slopes = 0 or is at least one of them not equal to 0
Confidence interval vs Prediction interval (Interpretations) (Which is wider?)
CI: 95% confident that the true mean y value at this x is within these bounds
PI: 95% confident that a new individual value of y at this x is within these bounds
The PI is wider, because an individual value varies more than a mean
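A sketch of why the prediction interval is wider, assuming the usual simple linear regression standard error formulas: the standard error for a new observation has an extra "1 +" under the square root. The data and function names are hypothetical, and only the standard errors are compared (the t multiplier is the same for both intervals).

```python
from math import sqrt
from statistics import mean

# Hypothetical data, used only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Residual standard error: sqrt(SSE / (n - 2)) for simple linear regression
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(sse / (n - 2))


def se_mean_response(x0):
    """Standard error used for the CI of the mean y at x0."""
    return s * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)


def se_new_observation(x0):
    """Standard error used for the PI of a new individual y at x0."""
    return s * sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)


x0 = 3
print(se_mean_response(x0), se_new_observation(x0))
```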
Three types of Anova Tables
Type 1: Sequential Sum of Squares
Type 2: Hierarchical Sum of Squares
Type 3: Marginal Anova
Sequential Sum of Squares (What is it? df for predictors and residuals? Compresses to what?)
Additional variability explained when new variable is added to the model (sequentially adding)
Predictors df is how many slopes were needed to include it in the model (always 1 for quantitative)
Residuals: n-k-1
Compresses to overall anova table (model row has all the predictors combined, residuals row is just residuals)
Hierarchical (What is it? Matches thing…)
Additional variability explained by adding this new variable to a model containing everything else
P values match the table of coefficients
Marginal Anova
Like hierarchical, but with interaction terms
VIF Above ___ is typically bad
above 5 (r² > 0.8)
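The link between the VIF cutoff and r² comes from VIF = 1 / (1 - R²), where R² is from regressing that predictor on the other predictors. A tiny sketch (the function name `vif` is made up for this example):

```python
def vif(r_squared):
    """Variance inflation factor from the R^2 of one predictor on the others."""
    return 1 / (1 - r_squared)


# R^2 = 0.8 corresponds to the VIF = 5 cutoff; R^2 = 0 means no inflation
for r2 in (0.0, 0.5, 0.8, 0.9):
    print(r2, round(vif(r2), 2))
```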
Note about predictions from a model with multicollinearity
Predictions are fine, but individual coefficient conclusions are not
“Good” for Mallows' Cp
<= m+1, where m is the number of predictors in the subset model
Cp, AIC, BIC preference for small models
Cp and AIC are moderate, BIC prefers small ones a lot
Methods of picking models (4)
Best Subsets - fits all 2^k models and picks the best based on a criterion
Backwards Elimination - starts with the full model and deletes terms until deleting a term doesn’t improve it (susceptible to multicollinearity)
Forward Selection - keeps adding terms until adding no longer improves the model
Stepwise Regression - adds terms like forward selection, but also checks like backward elimination whether something can be removed
Nested F-test
“Is anything gained by adding these terms to a smaller model”
Experimental Unit
Thing that is assigned treatment (usually a row in the dataset)
Balanced?
If each level of the explanatory factor gets the same number of experimental units
Two ways of writing the equation for an anova model and what each tests (what links?)
Y = mu_i + e
H_0: mu_1 = mu_2 = … = mu_k
H_a: at least one mu_i not equal to mu_j
Y = mu + alpha_i + e
H_0: alpha_1 = alpha_2 = … = alpha_k = 0
H_a: at least one alpha_i not equal to 0
Link: mu_i = mu + alpha_i
group to group vs unit to unit and scales of each if the treatment is important
group to group: different levels
unit to unit: per unit observations
If important, group to group >> unit to unit
In anova table, how do you find F value for row?
Divide that row's MS by MSE
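A minimal sketch of the F value for the groups row of a one-way Anova table: mean square for groups divided by MSE. The data and the function name `one_way_anova_f` are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F = MS(groups) / MSE for a one-way Anova."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Group-to-group variability
    ss_groups = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Unit-to-unit variability within groups
    ss_error = sum((v - mean(g)) ** 2 for g in groups for v in g)

    ms_groups = ss_groups / (k - 1)
    mse = ss_error / (n - k)
    return ms_groups / mse


# Hypothetical balanced data: three groups of three units
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_anova_f(groups)
print(f_value)
```

A large F here reflects group-to-group variability that is much bigger than unit-to-unit variability.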
More between-group variability ___ p-value. Larger sample size ___ p-value
decreases, decreases
Conditions for Anova (3) & how to check
Normality - qq-plot
Equal variance (sd of groups: max/min < 2)
Independence - how data was collected
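The equal variance check (largest group sd divided by smallest group sd, compared to 2) can be sketched as follows. The data and the function name `sd_ratio` are hypothetical.

```python
from statistics import stdev


def sd_ratio(groups):
    """Largest group standard deviation divided by the smallest."""
    sds = [stdev(g) for g in groups]
    return max(sds) / min(sds)


# Hypothetical data: three treatment groups
groups = [[1, 2, 3], [2, 3.5, 5], [5, 6, 7]]
ratio = sd_ratio(groups)
equal_variance_ok = ratio < 2
print(ratio, equal_variance_ok)
```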
Interpreting effect size of difference in means
> 0.5: Moderate
> 1: Large
What is FWER and ways to control
FWER: Family Wise Error Rate: Chance of making at least one type 1 error with multiple hypothesis tests
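A rough sketch of how FWER grows with the number of tests. It assumes the tests are independent, which is a simplification. The Bonferroni adjustment (test each hypothesis at alpha / m) is shown as one common way to control FWER; it is not listed on the card, and the function name `fwer` is made up for this example.

```python
def fwer(alpha_per_test, num_tests):
    """Chance of at least one type 1 error, ASSUMING independent tests."""
    return 1 - (1 - alpha_per_test) ** num_tests


alpha = 0.05
num_tests = 10

unadjusted = fwer(alpha, num_tests)
# Bonferroni: run each test at alpha / m so the family-wise rate stays near alpha
bonferroni = fwer(alpha / num_tests, num_tests)

print(round(unadjusted, 4), round(bonferroni, 4))
```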
Meaning of an additive effect
If effect of treatment A is the same for all levels of treatment B
Main effects
Effect of one factor averaged over all levels of the other factors
Two-Way Factorial design requirements (2)
at least two levels for each factor
Every combination is tested
Experimental Design Principles
Blocking: Block out a nuisance factor & assign treatments across all levels of that nuisance factor
Comparison: More than one group is needed & one is a placebo/control & include all levels that need to be studied
Crossing: include all combinations of factor levels
Replication: 2+ observations for each cell
Randomization: Randomly assign units
Two way main effects model (no interaction)
Y = mu + alpha_i + beta_j + e
Extra condition for two way main effects model
Effects are additive (If not, we would need an interaction term and that’s another model)
Two way model (w/ interaction)
Y = mu + alpha_i + beta_j + (alpha*beta)_ij + e
Explain each form of randomized block: subdivision, matching, reusing
Subdivision: Divide up by a known nuisance factor
Matching: Match people based on a known nuisance factor, and assign a treatment to each within the pair
Reusing: Randomly reuse same experimental unit under each treatment
Other method of checking condition that’s not sd_max / sd_min (and how to fix)
Checks equal variance. Plot log(sd) vs. log(mean) for the groups
Transform via y^(1 - slope), where slope comes from that plot
If 1 - slope = 0, use log(y)
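A sketch of the transformation rule: fit a line to log(sd) against log(mean) across the groups and use 1 - slope as the power. The group summaries and the function name `suggested_power` are hypothetical; here the sd is proportional to the mean, so the slope is 1 and the suggested power is 0, which means log(y).

```python
from math import log
from statistics import mean


def suggested_power(group_means, group_sds):
    """Return 1 - slope, where slope is from log(sd) regressed on log(mean)."""
    xs = [log(m) for m in group_means]
    ys = [log(s) for s in group_sds]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return 1 - slope


# Hypothetical group summaries: sd grows in proportion to the mean
power = suggested_power([10, 20, 40], [1, 2, 4])
print(round(power, 6))  # a power of 0 means: use log(y)
```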
Odds Formula
pi / (1 - pi)
Logit Form Formula (Key thing about error)
ln(pi / (1 - pi)) = B_0 + B_1X + …
No error term
Logit form converted to a pi= format
pi = e^(B_0 + B_1X) / (1 + e^(B_0 + B_1X))
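A small sketch showing that the logit form and the probability form are inverses of each other. The coefficient values and function names are hypothetical.

```python
from math import exp, log


def logit(pi):
    """Log odds: ln(pi / (1 - pi))."""
    return log(pi / (1 - pi))


def probability(b0, b1, x):
    """Probability form: e^(B_0 + B_1 x) / (1 + e^(B_0 + B_1 x))."""
    linear = b0 + b1 * x
    return exp(linear) / (1 + exp(linear))


# Hypothetical coefficients, used only for illustration
b0, b1 = -2.0, 0.5

pi_at_4 = probability(b0, b1, 4)   # linear part is 0, so pi is 0.5
pi_at_6 = probability(b0, b1, 6)   # linear part is 1
print(pi_at_4, logit(pi_at_6))
```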
Logistic regression’s probability is based on a Y= what case?
Y=1, or that the event happens
Odds ratio (2 forms) (And interpretations)
When comparing two groups or settings:
Change from a to b is odds_b / odds_a
“Changing from a to b, the odds of Y=1 increase/decrease by a factor of __”
For a one-unit increase in X:
e^B_1
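A sketch of both odds ratio forms: comparing two settings directly, and checking that a one-unit increase in X multiplies the odds by e^B_1. The probabilities, coefficients, and function names are hypothetical.

```python
from math import exp


def odds(pi):
    """Odds of Y=1: pi / (1 - pi)."""
    return pi / (1 - pi)


def odds_ratio(pi_a, pi_b):
    """Odds ratio for a change from a to b: odds_b / odds_a."""
    return odds(pi_b) / odds(pi_a)


def probability(b0, b1, x):
    """Logistic model probability of Y=1 at x."""
    linear = b0 + b1 * x
    return exp(linear) / (1 + exp(linear))


# Form 1: hypothetical probabilities for settings a and b
ratio_a_to_b = odds_ratio(0.2, 0.5)   # odds go from 0.25 to 1

# Form 2: hypothetical coefficients; compare x = 2 with x = 3
b0, b1 = -2.0, 0.5
ratio_per_unit = odds_ratio(probability(b0, b1, 2), probability(b0, b1, 3))

print(ratio_a_to_b, ratio_per_unit, exp(b1))
```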
Theoretical Model for SLR
Y = B_0 + B_1X + e
Plug in appropriate variables for Y and X
What forward selection uses to know what to add next
Takes one with highest correlation
What backward elimination uses to know what to remove next
Takes highest p value to remove
What’s included in a complete second order model?
First order, second order (squared terms), interaction