Target Population
The target population is a group of items/individuals under study - this is the population we want to make a generalization about
Sample
The sample is a subset of the target population gathered for measurement/observation - this is where we get our data from, not the observations themselves
Sampled Population
The sampled population is a subset of the target population (or should be) that could be in a sample given the sampling scheme (how the sample was decided upon and collected)
Parameter
A parameter is a constant in a probability distribution that is of interest in the study - it's a numerical description of a population, like a mean or standard deviation
Subjective Inference
Subjective inference is when the sampled population is not the target population, so conclusions from the sample must be extrapolated onto the target population
When is subjective inference valid?
If a sampled population is representative of a target population, even if they're not the exact same, then subjective inference is valid. If this is not true, it is invalid. A good example of this is the latitude north and skin cancer question - if we only sample the contiguous US and want to make a subjective inference about Alaska, is it valid? No because the data is not representative of Alaska's latitude (much further north than contiguous US).
What type of inference are you making when the sampled population is completely representative of the target population? When it's not?
You make a direct inference when the sampled and target populations are the same. You make a subjective inference when they are different.
What are the three major types of statistical inference?
Hypothesis testing, significance testing, and interval estimates
What is power? How do you calculate it?
The power of a test is the probability that the decision rule will lead to the conclusion of the alternative hypothesis when the alternative is, in fact, true (a true positive). You can calculate power as 1 - beta.
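A minimal sketch (not from the course notes) of how beta and power fall out of a decision rule, using a one-sided z-test with made-up numbers:

```python
# Sketch: power of a one-sided z-test for a mean, using hypothetical numbers.
# Power = P(reject H0 | Ha true) = 1 - beta.
from scipy import stats
import numpy as np

alpha, n = 0.05, 25
sigma = 10             # assumed known population SD (hypothetical)
mu0, mu_true = 50, 55  # null mean and an assumed true mean under Ha

se = sigma / np.sqrt(n)
crit = mu0 + stats.norm.ppf(1 - alpha) * se         # reject H0 if xbar > crit
beta = stats.norm.cdf(crit, loc=mu_true, scale=se)  # P(fail to reject | Ha true)
power = 1 - beta
print(f"beta = {beta:.3f}, power = {power:.3f}")
```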
What is the alpha?
Alpha is the probability of type 1 error - when you incorrectly reject a null hypothesis when it is actually true (you wrongly accept the alternative).
What is the beta?
Beta is the probability of a type 2 error - when you incorrectly fail to reject the null hypothesis when it is false (you accept the null and reject the TRUE alternative)
What happens to beta and power as you decrease alpha?
When alpha decreases, the power decreases and beta increases
What happens to beta and power as you increase alpha?
As alpha increases, the power increases and your beta decreases
When is it okay to throw out outliers?
It's okay to get rid of outliers when you have reason to believe that the data was misrecorded or that the observation is not from your target population
What do t statistics measure?
They measure how many standard errors an estimate is from its hypothesized value (often the mean under the null). Therefore, if your t-value is greater than +2 or less than -2, the result is likely significant, since roughly 95% of a t distribution lies within about 2 standard errors of its center.
What are studentized residuals? When do you use them and what is significant?
Studentized residuals are a data set's residuals divided by their standard errors. They are used to check whether any outliers may be influencing a regression. Studentized residuals above +2 or below -2 are suspect; values above +3 or below -3 are rare and almost certainly influencing the regression.
What is a Cook's Distance? When do you use this and when is it significant?
Cook's distance is the index of how much a single point changes the estimated regression parameters. It's used to see if any outliers are influencing the regression. Cook's distances are significant above 1 (heavily influencing estimated regression) and suspect around 0.5.
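A rough sketch of how both diagnostics from the last two cards can be pulled from a fit, assuming statsmodels and hypothetical data:

```python
# Sketch: flagging influential points with studentized residuals and Cook's
# distance via statsmodels; x and y are hypothetical sample data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2 + 0.5 * x + rng.normal(scale=1.0, size=30)
y[5] += 8  # plant an outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
infl = fit.get_influence()
stud = infl.resid_studentized_internal   # suspect beyond +/-2, rare beyond +/-3
cooks, _ = infl.cooks_distance           # suspect near 0.5, influential above 1
print(np.where(np.abs(stud) > 2)[0], np.where(cooks > 0.5)[0])
```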
What are t-procedures robust against? What does it mean to be robust?
They are robust against slight to moderate deviations from normality. They are not robust to outliers that could influence the estimates of the regression. Being robust means that a model's estimates are not too heavily influenced by certain (undesirable) characteristics of the data (in this case, non-normality)
How do outliers affect the power of a test?
They lower the power
How do outliers affect your decision in a hypothesis test or confidence interval analysis?
Outliers influence regression estimates and make you more likely to fail to reject the null hypothesis. They will likely widen confidence intervals (again making it harder to reject the null).
What does "least squares" mean in a least squares line?
The line that minimizes the sum of the squared errors (residuals) is the best-fit line - this is the least squares line, as it minimizes the sum of squared errors (SSE), making the squares the least.
What can you use Sxx and Sxy for?
Once you calculate Sxx and Sxy, you can divide Sxy by Sxx to get an estimated slope of the line
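For example, a minimal sketch with hypothetical data:

```python
# Sketch: least squares slope and intercept from Sxx and Sxy (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

Sxx = np.sum((x - x.mean()) ** 2)              # sum of (xi - xbar)^2
Sxy = np.sum((x - x.mean()) * (y - y.mean()))  # sum of (xi - xbar)(yi - ybar)
b1 = Sxy / Sxx                 # estimated slope
b0 = y.mean() - b1 * x.mean()  # estimated intercept
print(b0, b1)
```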
What is the formula for SSTo? What is SSTo?
SSTo = SSE + SSR. SSTo is the total variation around the mean y.
You can also calculate SSTo by finding the sum of the squared (y values minus y bar)
What is SSE?
It is the sum of the squared residuals and it represents the "unexplained error" or the variation in y that cannot be attributed to x.
What is SSR and how do you calculate it?
SSR is the sum of the squared (yexpected values minus ybar). SSR represents the explained error - the variation in y that can be attributed to x.
What is the relationship between R^2 and SSR?
SSR / SSTo is the percent of the total error explained by x. This is equal to R^2.
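A small sketch tying the last few cards together, computing the sums of squares and R^2 by hand on hypothetical data:

```python
# Sketch: computing SSTo, SSE, SSR, and R^2 by hand for a fitted line
# (hypothetical x and y; fit found with the Sxy/Sxx formula above).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSTo = np.sum((y - y.mean()) ** 2)      # total variation around ybar
SSE  = np.sum((y - y_hat) ** 2)         # unexplained variation
SSR  = np.sum((y_hat - y.mean()) ** 2)  # explained variation
print(SSTo, SSE + SSR)                  # these should match
print("R^2 =", SSR / SSTo)
```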
How do you confirm that your observations are independent?
There is no chart that you can look at, but you should read the sampling scheme to determine if it seems to promote random and independent sampling. Otherwise, maybe look at a scatter plot and make sure there are no obvious clusters of data that are linked in some way.
How do you confirm that your y's are normally distributed at each x?
You can plot the residuals of your data in a qqPlot with 95% confidence bands - if all of the points lie within the bands, your data is relatively normal and the normality assumption is met.
How do you confirm that your residuals have constant variability (homoskedasticity)?
You can plot your residuals against the model's fitted values and see if there are any discernible patterns in your residual plot. If not, your residuals should have relatively constant variability.
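A sketch of the two visual checks from the last two cards on hypothetical data (note: scipy's probplot draws a QQ plot without the 95% confidence bands that R's qqPlot adds):

```python
# Sketch: residual normality check (QQ plot) and constant-variance check
# (residuals vs fitted) for a hypothetical statsmodels OLS fit.
import numpy as np
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 1 + 2 * x + rng.normal(scale=1.5, size=40)
fit = sm.OLS(y, sm.add_constant(x)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
stats.probplot(fit.resid, dist="norm", plot=ax1)  # points near the line -> normality looks OK
ax1.set_title("Normal QQ plot of residuals")
ax2.scatter(fit.fittedvalues, fit.resid)          # no pattern or funnel -> constant variance looks OK
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Fitted values"); ax2.set_ylabel("Residuals")
plt.show()
```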
How do you confirm that the means of your Yi are linearly related to Xi?
You can present your data in a scatter plot and see if it looks linear. You could also attempt to fit a linear model and compare your data to a linear plot, seeing if it deviates from that. Linearity is an extremely important assumption in linear regression as without it, your data is not appropriate for linear regression.
When a question uses the word "expected", what does it mean?
When you see "expected", assume you're looking for the mean (such as, in a confidence interval).
What is the difference between prediction and confidence intervals?
A prediction interval predicts a range for a single value of y at a given x, while a confidence interval estimates the mean value of y at a given x (or the mean value of something else, like the slope).
Which will generally be wider, a prediction or confidence interval?
A prediction interval should be wider as it has to predict a more specific value.
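A sketch of getting both intervals at a single new x from a statsmodels fit on hypothetical data:

```python
# Sketch: confidence interval for the mean response and prediction interval
# for a single new observation at x = 4 (hypothetical data, alpha = 0.05).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 5 + 1.5 * x + rng.normal(scale=2.0, size=30)
fit = sm.OLS(y, sm.add_constant(x)).fit()

new = np.array([[1.0, 4.0]])  # [intercept column, x = 4]
pred = fit.get_prediction(new).summary_frame(alpha=0.05)
print(pred[["mean_ci_lower", "mean_ci_upper"]])  # confidence interval for mean y at x = 4
print(pred[["obs_ci_lower", "obs_ci_upper"]])    # (wider) prediction interval for a single new y
```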
How do you calculate the DF's in an ANOVA table?
DF total = total observations - 1
DF regression = 1 (# predictors)
DF error = n-2 (or DF total - DF regression)
How do you calculate the SS's in an ANOVA table?
SSR is the sum of the squared (y expected values minus ybar)
SSE is the sum of the squared (y values minus y expected)
SSTo is the sum of SSR and SSE
How do you calculate the MS's in an ANOVA table?
MSR is the SSR divided by 1 (DF regression)
MSE is the SSE divided by n-2 (DF error)
MSTo, if reported, is SSTo divided by DF total (n - 1); unlike the SS's, mean squares are not additive, so MSTo is not MSE + MSR
How do you calculate the F value in an ANOVA table?
F is equal to MSR / MSE
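A sketch assembling the whole ANOVA table by hand on hypothetical data and converting F to a p-value:

```python
# Sketch: simple-regression ANOVA table by hand, with F compared to the
# F(1, n-2) distribution (hypothetical data).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.2, 4.1, 6.0, 7.8, 8.1, 10.3])
n = len(y)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSR, SSE = np.sum((y_hat - y.mean()) ** 2), np.sum((y - y_hat) ** 2)
df_reg, df_err = 1, n - 2
MSR, MSE = SSR / df_reg, SSE / df_err
F = MSR / MSE
p = stats.f.sf(F, df_reg, df_err)  # P(F(1, n-2) >= observed F)
print(F, p)
```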
What is the F-value test?
It's a test of whether the linear relationship of y with x explains the variation in y. This means that the null hypothesis will always state that the slope is zero, while the alternative will state that it is not zero.
Why are t-tests generally better than f-tests?
T-tests can be done as one-sided tests (F-tests cannot) and can test against non-zero null values.
What is the general rule of transformations of data?
Large values are affected the most. If it's an up transformation, they increase the most. If it's down, they decrease the most.
What is the Mosteller-Tukey Bulging Rule used for?
This is a diagram used to decide in which cases to transform x and/or y up or down the ladder of powers.
How would you transform a graph that looks like the top right corner of the M-T Bulging circle?
In the top right corner, x is transformed up and y is transformed up. I would raise both x and y to a power greater than 1 (e.g., square them).
How would you transform a graph that looks like the top left corner of the M-T Bulging circle?
In the top left corner, x is transformed down and y is transformed up. I would raise y to a power greater than 1 and take the sqrt or ln of x (or even the reciprocal).
How would you transform a graph that looks like the bottom left corner of the M-T Bulging circle?
In the bottom left corner, x is transformed down and y is transformed down. I would take the sqrt or ln (or even the reciprocal) of both x and y.
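A minimal sketch of the ladder-of-powers moves the bulging rule chooses between, on hypothetical positive-valued data:

```python
# Sketch: ladder-of-powers transformations ("up" = power > 1; "down" = sqrt,
# log, or reciprocal), applied to hypothetical positive x and y.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([1.0, 1.5, 2.2, 3.1, 4.5])

# top-right corner: transform x up and y up
x_tr, y_tr = x ** 2, y ** 2
# top-left corner: transform x down and y up
x_tl, y_tl = np.log(x), y ** 2
# bottom-left corner: transform x down and y down
x_bl, y_bl = np.sqrt(x), np.log(y)
```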
Why is it not appropriate to extrapolate an estimated relationship beyond the range of the x values in the sample?
We simply don't know how the relationship behaves beyond the range we observed. If it changes in a way the fitted line doesn't capture, our estimates there could be badly wrong.
What does it mean to be curvilinear?
A relationship is curvilinear when it is truly curved but, over a narrow range, looks approximately linear and can be analyzed with a linear model.
Why would you want to do regression through the origin?
It might be helpful to fit a regression through the origin when a non-zero intercept doesn't make sense in the context of the data.
What will the dimensions of a matrix be if it is made from multiplying an rxc matrix by a cxk matrix?
r x k - two matrices can only be multiplied if the first matrix's number of columns matches the second matrix's number of rows
What does the ' mean in matrices?
It means transpose. A matrix of r x c dimensions will now be c x r dimensions.
What does y look like as a matrix?
It's a column vector (1 column) with n rows (n being the number of y values)
What does x look like as a matrix?
It is a 2 column matrix with n rows (n being the number of observations we have). The first column is all 1's (important in matrix multiplication with intercepts) and the second is the x values we gathered from the sample.
What does the column vector b represent?
b is a 2 x 1 column vector with the top value being Bo (intercept) and the bottom value being B1 (slope)
Describe the values in the variance covariance matrix.
This is (usually) a 2x2 matrix where the top left entry is the variance of the estimated intercept and the bottom right entry is the variance of the estimated slope (the off-diagonal entries are their covariance). You can use these values in intervals by taking their square roots to get the standard errors of the intercept and slope.
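A sketch in numpy using the standard matrix formulas b-hat = (X'X)^-1 X'y and Var(b-hat) = MSE (X'X)^-1, on hypothetical data:

```python
# Sketch: estimated coefficients and their variance-covariance matrix in
# matrix form (hypothetical four-observation data set).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 2.9, 4.1, 4.8])
X = np.column_stack([np.ones_like(x), x])  # first column of 1's, second column x

b_hat = np.linalg.inv(X.T @ X) @ X.T @ y   # [intercept, slope]
resid = y - X @ b_hat
MSE = (resid @ resid) / (len(y) - 2)
var_cov = MSE * np.linalg.inv(X.T @ X)     # 2x2; diagonal = Var(b0-hat), Var(b1-hat)
se_b0, se_b1 = np.sqrt(np.diag(var_cov))
print(b_hat, se_b0, se_b1)
```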
What are two important things to remember when working with transformed data?
1. you must recalculate your R^2 value!
2. you must back-transform (like when interpreting intervals)
How do you calculate your own R^2 when working with transformed data?
1. Gather the fitted y values from the transformed data.
2. Back-transform those fits to get the untransformed fitted y values.
3. Hand-calculate R^2 by dividing SSR by SSTo (computed on the original scale).
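A sketch of those three steps, assuming (hypothetically) that y was log-transformed:

```python
# Sketch: hand-computing R^2 on the original scale after fitting a model to
# log(y) (hypothetical data; assumes a log transform was used on y).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 25)
y = np.exp(0.3 * x) * rng.lognormal(sigma=0.1, size=25)

fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()  # model on the transformed scale
fits_back = np.exp(fit.fittedvalues)               # step 2: back-transform the fitted values
SSR = np.sum((fits_back - y.mean()) ** 2)
SSTo = np.sum((y - y.mean()) ** 2)
print("hand-calculated R^2 =", SSR / SSTo)         # step 3: SSR / SSTo on the original scale
```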
How do you calculate SSE in matrix math?
(Y-Yexpected)'(Y-Yexpected)
What does MSE represent?
The variance around the regression line
What are the useful values in a hat matrix? What do they represent?
MSE times the hat matrix gives the estimated variance-covariance matrix of the y expected (fitted) values at the given x values. The important values are on the diagonal (top left to bottom right): multiplied by MSE, these are the variances of each fitted value, and their square roots are the SEs used in confidence intervals for the mean response.
(Note: the hat matrix is n x n, so with four observations it would be 4 x 4.)
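A sketch using the standard definition H = X (X'X)^-1 X', continuing the hypothetical four-observation example (so H is 4 x 4 here), and including the matrix form of SSE from the earlier card:

```python
# Sketch: hat matrix, matrix-form SSE, and SEs of the fitted values
# (hypothetical four-observation data set).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 2.9, 4.1, 4.8])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T  # n x n hat matrix; y_hat = H y
y_hat = H @ y
SSE = (y - y_hat) @ (y - y_hat)       # (Y - Yexpected)'(Y - Yexpected)
MSE = SSE / (len(y) - 2)
se_fit = np.sqrt(MSE * np.diag(H))    # SE of each fitted value, for mean-response CIs
print(se_fit)
```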
When is regression through the origin appropriate for the data?
When the plot of residuals vs predicted values looks fine (no outliers, no unequal variances) and the normality plot looks good.
What happens to R^2 for regression through the origin?
The R^2 reported for regression through the origin ALWAYS increases because both SSE and SSTo increase: SSE increases because the through-origin line is not the least squares line, and SSTo increases because software computes it as the sum of squared y values rather than squared deviations from ybar. Since SSTo grows by more, SSE/SSTo shrinks and R^2 = 1 - SSE/SSTo rises, so it is not comparable to the ordinary R^2.
How are hypotheses tests and confidence intervals related?
There is a 1:1 relationship between the two as long as the same alpha is used and they are both two sided (or both one sided)
What is the goal of the least squares line?
To minimize SSE
Between SSE and SSR, which is unexplained and which is explained?
SSR is explained, SSE is unexplained
Write out both the linear model and the estimated linear model.
linear model: yi = Bo + B1xi + Ei
estimated linear model: yi = Bo-hat + B1-hat xi + ei (the fitted line itself is yi-hat = Bo-hat + B1-hat xi)
When do you transform x?
When you only have linearity problems in your data (no residual or normality issues)
When do you transform y?
When you have both linearity and residual/normality issues
When do you transform x and y?
When you have linear data but you have normality/residual issues
Write the estimated linear model, the distributions of the values, and the assumptions.
yi = Bo-hat + B1-hat xi + ei (fitted line: yi-hat = Bo-hat + B1-hat xi)
where yi is the [dependent variable], independently distributed ∼ N(β0 + β1xi, σ²),
xi is the [independent variable],
Bo-hat is the estimated intercept (mean value of y when x = 0),
B1-hat is the estimated slope (mean change in y when x increases by 1 unit),
ei is the residual, which estimates the error term Ei, iid ∼ N(0, σ²)
The assumptions are:
1. The observations are independent
2. Y's are normally distributed at each x
3. Residuals have constant variability (homoskedasticity)
4. Means of yi are linearly related to xi
5. Outliers are not driving our conclusions
Give an interpretation of the 95% confidence interval (5, 7) when x = 4 puppies and y = happiness score
We are 95% confident that the mean happiness score with 4 puppies is between 5 and 7.
Give an interpretation of the 95% prediction interval (5, 7) when x = 4 puppies and y = happiness score
There is a 0.95 probability that the happiness score at 4 puppies will be between 5 and 7.
If the research question is "Does happiness score increase with more puppies?" - give a hypothesis test and an interpretation of the two sided p-value of 0.002.
Ho: Happiness does not increase with puppies: B1 = 0
Ha: Happiness does increase with puppies: B1>0
Since the two-sided p-value is 0.002 and we are doing a one-sided test (and the estimated slope is in the hypothesized positive direction), our p-value is 0.001. At an alpha level of 0.05, we reject the null hypothesis that happiness does not increase with puppies because the p-value of 0.001 is less than alpha = 0.05.
Why is normality of observations important?
If your data is non-normal, it will be hard to do significance testing to see if the value being tested in the alternative hypothesis actually deviates from typical values, since you lack a valid reference distribution to compare to.