Target Population
The target population is a group of items/individuals under study - this is the population we want to make a generalization about
Sample
The sample is a subset of the target population gathered for measurement/observation - this is where we get our data from, not the observations themselves
Sampled Population
The sampled population is a subset of the target population (or should be) that could be in a sample given the sampling scheme (how the sample was decided upon and collected)
Parameter
A parameter is a constant in a probability distribution that is of interest in the study - it's a numerical description of a population, like a mean or standard deviation
Subjective Inference
Subjective inference is when the sampled population is not the target population, so conclusions from the sample must be extrapolated onto the target population
When is subjective inference valid?
If a sampled population is representative of a target population, even if they're not the exact same, then subjective inference is valid. If this is not true, it is invalid. A good example of this is the latitude north and skin cancer question - if we only sample the contiguous US and want to make a subjective inference about Alaska, is it valid? No because the data is not representative of Alaska's latitude (much further north than contiguous US).
What type of inference are you making when the sampled population is completely representative of the target population? When it's not?
You make a direct inference when the sampled and target populations are the same. You make a subjective inference when they are different.
What are the three major types of statistical inference?
Hypothesis testing, significance testing, and interval estimates
What is power? How do you calculate it?
The power of a test is the probability that the decision rule will lead to the conclusion of the alternative hypothesis when the alternative is, in fact, true (a true positive). You can calculate power as 1 - beta.
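A minimal sketch (not from the course notes) of how beta and power fall out of a decision rule, using a one-sided z-test with made-up numbers:

```python
# Sketch: power of a one-sided z-test for a mean, using hypothetical numbers.
# Power = P(reject H0 | Ha true) = 1 - beta.
from scipy import stats
import numpy as np

alpha, n = 0.05, 25
sigma = 10             # assumed known population SD (hypothetical)
mu0, mu_true = 50, 55  # null mean and an assumed true mean under Ha

se = sigma / np.sqrt(n)
crit = mu0 + stats.norm.ppf(1 - alpha) * se         # reject H0 if xbar > crit
beta = stats.norm.cdf(crit, loc=mu_true, scale=se)  # P(fail to reject | Ha true)
power = 1 - beta
print(f"beta = {beta:.3f}, power = {power:.3f}")
```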
What is the alpha?
Alpha is the probability of type 1 error - when you incorrectly reject a null hypothesis when it is actually true (you wrongly accept the alternative).
What is the beta?
Beta is the probability of a type 2 error - when you incorrectly fail to reject the null hypothesis when it is false (you accept the null and reject the TRUE alternative)
What happens to beta and power as you decrease alpha?
When alpha decreases, the power decreases and beta increases
What happens to beta and power as you increase alpha?
As alpha increases, the power increases and your beta decreases
When is it okay to throw out outliers?
It's okay to get rid of outliers when you have reason to believe that the data was misrecorded or that the observation is not from your target population
What do t statistics measure?
They measure how many standard errors an estimate is from its hypothesized value (often the mean under the null). Therefore, if your t-value is greater than +2 or less than -2, the result is likely significant, since roughly 95% of a t distribution lies within about 2 standard errors of its center.
What are studentized residuals? When do you use them and what is significant?
Studentized residuals are a data set's residuals divided by their standard errors. They are used to check whether any outliers may be influencing a regression. Studentized residuals above +2 or below -2 are suspect; values above +3 or below -3 are rare and almost certainly influencing the regression.
What is a Cook's Distance? When do you use this and when is it significant?
Cook's distance is the index of how much a single point changes the estimated regression parameters. It's used to see if any outliers are influencing the regression. Cook's distances are significant above 1 (heavily influencing estimated regression) and suspect around 0.5.
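A rough sketch of how both diagnostics from the last two cards can be pulled from a fit, assuming statsmodels and hypothetical data:

```python
# Sketch: flagging influential points with studentized residuals and Cook's
# distance via statsmodels; x and y are hypothetical sample data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2 + 0.5 * x + rng.normal(scale=1.0, size=30)
y[5] += 8  # plant an outlier

fit = sm.OLS(y, sm.add_constant(x)).fit()
infl = fit.get_influence()
stud = infl.resid_studentized_internal   # suspect beyond +/-2, rare beyond +/-3
cooks, _ = infl.cooks_distance           # suspect near 0.5, influential above 1
print(np.where(np.abs(stud) > 2)[0], np.where(cooks > 0.5)[0])
```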
What are t-procedures robust against? What does it mean to be robust?
They are robust against slight to moderate deviations from normality. They are not robust to outliers that could influence the estimates of the regression. Being robust means that a model's estimates are not too heavily influenced by certain (undesirable) characteristics of the data (in this case, non-normality)
How do outliers affect the power of a test?
They lower the power
How do outliers affect your decision in a hypothesis test or confidence interval analysis?
Outliers influence regression estimates and make you more likely to fail to reject the null hypothesis. They will likely widen confidence intervals (again making it harder to reject the null).
What does "least squares" mean in a least squares line?
The line that minimizes the sum of the squared errors (residuals) is the best-fit line - this is the least squares line, as it minimizes the sum of squared errors (SSE), making the squares the least.
What can you use Sxx and Sxy for?
Once you calculate Sxx and Sxy, you can divide Sxy by Sxx to get an estimated slope of the line
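For example, a minimal sketch with hypothetical data:

```python
# Sketch: least squares slope and intercept from Sxx and Sxy (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

Sxx = np.sum((x - x.mean()) ** 2)              # sum of (xi - xbar)^2
Sxy = np.sum((x - x.mean()) * (y - y.mean()))  # sum of (xi - xbar)(yi - ybar)
b1 = Sxy / Sxx                 # estimated slope
b0 = y.mean() - b1 * x.mean()  # estimated intercept
print(b0, b1)
```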
What is the formula for SSTo? What is SSTo?
SSTo = SSE + SSR. SSTo is the total variation around the mean y.
You can also calculate SSTo by finding the sum of the squared (y values minus y bar)
What is SSE?
It is the sum of the squared residuals and it represents the "unexplained error" or the variation in y that cannot be attributed to x.
What is SSR and how do you calculate it?
SSR is the sum of the squared (yexpected values minus ybar). SSR represents the explained error - the variation in y that can be attributed to x.
What is the relationship between R^2 and SSR?
SSR / SSTo is the percent of the total error explained by x. This is equal to R^2.
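A small sketch tying the last few cards together, computing the sums of squares and R^2 by hand on hypothetical data:

```python
# Sketch: computing SSTo, SSE, SSR, and R^2 by hand for a fitted line
# (hypothetical x and y; fit found with the Sxy/Sxx formula above).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSTo = np.sum((y - y.mean()) ** 2)      # total variation around ybar
SSE  = np.sum((y - y_hat) ** 2)         # unexplained variation
SSR  = np.sum((y_hat - y.mean()) ** 2)  # explained variation
print(SSTo, SSE + SSR)                  # these should match
print("R^2 =", SSR / SSTo)
```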
How do you confirm that your observations are independent?
There is no chart that you can look at, but you should read the sampling scheme to determine if it seems to promote random and independent sampling. Otherwise, maybe look at a scatter plot and make sure there are no obvious clusters of data that are linked in some way.
How do you confirm that your y's are normally distributed at each x?
You can plot the residuals of your data in a qqPlot with 95% confidence bands - if all of the points lie within the bands, your data is relatively normal and the normality assumption is met.
How do you confirm that your residuals have constant variability (homoskedasticity)?
You can plot your residuals against the model's fitted values and see if there are any discernible patterns in your residual plot. If not, your residuals should have relatively constant variability.
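A sketch of the two visual checks from the last two cards on hypothetical data (note: scipy's probplot draws a QQ plot without the 95% confidence bands that R's qqPlot adds):

```python
# Sketch: residual normality check (QQ plot) and constant-variance check
# (residuals vs fitted) for a hypothetical statsmodels OLS fit.
import numpy as np
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 1 + 2 * x + rng.normal(scale=1.5, size=40)
fit = sm.OLS(y, sm.add_constant(x)).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
stats.probplot(fit.resid, dist="norm", plot=ax1)  # points near the line -> normality looks OK
ax1.set_title("Normal QQ plot of residuals")
ax2.scatter(fit.fittedvalues, fit.resid)          # no pattern or funnel -> constant variance looks OK
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Fitted values"); ax2.set_ylabel("Residuals")
plt.show()
```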
How do you confirm that the means of your Yi are linearly related to Xi?
You can present your data in a scatter plot and see if it looks linear. You could also attempt to fit a linear model and compare your data to a linear plot, seeing if it deviates from that. Linearity is an extremely important assumption in linear regression as without it, your data is not appropriate for linear regression.
When a question uses the word "expected", what does it mean?
When you see "expected", assume you're looking for the mean (such as, in a confidence interval).
What is the difference between prediction and confidence intervals?
A prediction interval predicts a range for a single value of y at a given x, while a confidence interval estimates the mean value of y at a given x (or the mean value of something else, like the slope).
Which will generally be wider, a prediction or confidence interval?
A prediction interval should be wider as it has to predict a more specific value.
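A sketch of getting both intervals at a single new x from a statsmodels fit on hypothetical data:

```python
# Sketch: confidence interval for the mean response and prediction interval
# for a single new observation at x = 4 (hypothetical data, alpha = 0.05).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 5 + 1.5 * x + rng.normal(scale=2.0, size=30)
fit = sm.OLS(y, sm.add_constant(x)).fit()

new = np.array([[1.0, 4.0]])  # [intercept column, x = 4]
pred = fit.get_prediction(new).summary_frame(alpha=0.05)
print(pred[["mean_ci_lower", "mean_ci_upper"]])  # confidence interval for mean y at x = 4
print(pred[["obs_ci_lower", "obs_ci_upper"]])    # (wider) prediction interval for a single new y
```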
How do you calculate the DF's in an ANOVA table?
DF total = total observations - 1
DF regression = 1 (# predictors)
DF error = n-2 (or DF total - DF regression)
How do you calculate the SS's in an ANOVA table?
SSR is the sum of the squared (y expected values minus ybar)
SSE is the sum of the squared (y values minus y expected)
SSTo is the sum of SSR and SSE
How do you calculate the MS's in an ANOVA table?
MSR is the SSR divided by 1 (DF regression)
MSE is the SSE divided by n-2 (DF error)
MSTo, if reported, is SSTo divided by DF total (n - 1); unlike the SS's, mean squares are not additive, so MSTo is not MSE + MSR
How do you calculate the F value in an ANOVA table?
F is equal to MSR / MSE
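A sketch assembling the whole ANOVA table by hand on hypothetical data and converting F to a p-value:

```python
# Sketch: simple-regression ANOVA table by hand, with F compared to the
# F(1, n-2) distribution (hypothetical data).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.2, 4.1, 6.0, 7.8, 8.1, 10.3])
n = len(y)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

SSR, SSE = np.sum((y_hat - y.mean()) ** 2), np.sum((y - y_hat) ** 2)
df_reg, df_err = 1, n - 2
MSR, MSE = SSR / df_reg, SSE / df_err
F = MSR / MSE
p = stats.f.sf(F, df_reg, df_err)  # P(F(1, n-2) >= observed F)
print(F, p)
```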
What is the F-value test?
It's a test of whether the linear relationship of y with x explains the variation in y. This means that the null hypothesis will always state that the slope is zero, while the alternative will state that it is not zero.
Why are t-tests generally better than f-tests?
T-tests can be done as one-sided tests (F-tests cannot) and can test against non-zero null values.
What is the general rule of transformations of data?
Large values are affected the most. If it's an up transformation, they increase the most. If it's down, they decrease the most.
What is the Mosteller-Tukey Bulging Rule used for?
This is a diagram used to decide in which cases to transform x and/or y up or down the ladder of powers.
How would you transform a graph that looks like the top right corner of the M-T Bulging circle?
In the top right corner, x is transformed up and y is transformed up. I would raise both x and y to a power greater than 1 (e.g., square them).
How would you transform a graph that looks like the top left corner of the M-T Bulging circle?
In the top left corner, x is transformed down and y is transformed up. I would raise y to a power greater than 1 and take the sqrt or ln of x (or even the reciprocal).
How would you transform a graph that looks like the bottom left corner of the M-T Bulging circle?
In the bottom left corner, x is transformed down and y is transformed down. I would take the sqrt or ln (or even the reciprocal) of both x and y.
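A minimal sketch of the ladder-of-powers moves the bulging rule chooses between, on hypothetical positive-valued data:

```python
# Sketch: ladder-of-powers transformations ("up" = power > 1; "down" = sqrt,
# log, or reciprocal), applied to hypothetical positive x and y.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([1.0, 1.5, 2.2, 3.1, 4.5])

# top-right corner: transform x up and y up
x_tr, y_tr = x ** 2, y ** 2
# top-left corner: transform x down and y up
x_tl, y_tl = np.log(x), y ** 2
# bottom-left corner: transform x down and y down
x_bl, y_bl = np.sqrt(x), np.log(y)
```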
Why is it not appropriate to extrapolate an estimated relationship beyond the range of the x values in the sample?
We simply don't know how the relationship behaves beyond the range we observed. If it changes in a way the fitted line doesn't capture, our estimates there could be badly wrong.
What does it mean to be curvilinear?
A relationship is curvilinear when it is truly curved but, over a narrow range, looks approximately linear and can be analyzed with a linear model.
Why would you want to do regression through the origin?
It might be helpful to fit a regression through the origin when a non-zero intercept doesn't make sense in the context of the data.
What will the dimensions of a matrix be if it is made from multiplying an rxc matrix by a cxk matrix?
r x k - two matrices can only be multiplied if the first matrix's number of columns matches the second matrix's number of rows
What does the ' mean in matrices?
It means transpose. A matrix of r x c dimensions will now be c x r dimensions.
What does y look like as a matrix?
It's a column vector (1 column) with n rows (n being the number of y values)
What does x look like as a matrix?
It is a 2 column matrix with n rows (n being the number of observations we have). The first column is all 1's (important in matrix multiplication with intercepts) and the second is the x values we gathered from the sample.
What does the column vector b represent?
b is a 2 x 1 column vector with the top value being Bo (intercept) and the bottom value being B1 (slope)
Describe the values in the variance covariance matrix.
This is (usually) a 2x2 matrix where the top left entry is the variance of the estimated intercept and the bottom right entry is the variance of the estimated slope (the off-diagonal entries are their covariance). You can use these values in intervals by taking their square roots to get the standard errors of the intercept and slope.
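A sketch in numpy using the standard matrix formulas b-hat = (X'X)^-1 X'y and Var(b-hat) = MSE (X'X)^-1, on hypothetical data:

```python
# Sketch: estimated coefficients and their variance-covariance matrix in
# matrix form (hypothetical four-observation data set).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 2.9, 4.1, 4.8])
X = np.column_stack([np.ones_like(x), x])  # first column of 1's, second column x

b_hat = np.linalg.inv(X.T @ X) @ X.T @ y   # [intercept, slope]
resid = y - X @ b_hat
MSE = (resid @ resid) / (len(y) - 2)
var_cov = MSE * np.linalg.inv(X.T @ X)     # 2x2; diagonal = Var(b0-hat), Var(b1-hat)
se_b0, se_b1 = np.sqrt(np.diag(var_cov))
print(b_hat, se_b0, se_b1)
```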
What are two important things to remember when working with transformed data?
1. you must recalculate your R^2 value!
2. you must back-transform (like when interpreting intervals)
How do you calculate your own R^2 when working with transformed data?
1. Gather the fitted y values from the transformed data.
2. Back-transform those fits to get the untransformed fitted y values.
3. Hand-calculate R^2 by dividing SSR by SSTo (computed on the original scale).
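A sketch of those three steps, assuming (hypothetically) that y was log-transformed:

```python
# Sketch: hand-computing R^2 on the original scale after fitting a model to
# log(y) (hypothetical data; assumes a log transform was used on y).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 25)
y = np.exp(0.3 * x) * rng.lognormal(sigma=0.1, size=25)

fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()  # model on the transformed scale
fits_back = np.exp(fit.fittedvalues)               # step 2: back-transform the fitted values
SSR = np.sum((fits_back - y.mean()) ** 2)
SSTo = np.sum((y - y.mean()) ** 2)
print("hand-calculated R^2 =", SSR / SSTo)         # step 3: SSR / SSTo on the original scale
```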
How do you calculate SSE in matrix math?
(Y-Yexpected)'(Y-Yexpected)
What does MSE represent?
The variance around the regression line
What are the useful values in a hat matrix? What do they represent?
MSE times the hat matrix gives the estimated variance-covariance matrix of the y expected (fitted) values at the given x values. The important values are on the diagonal (top left to bottom right): multiplied by MSE, these are the variances of each fitted value, and their square roots are the SEs used in confidence intervals for the mean response.
(Note: the hat matrix is n x n, so with four observations it would be 4 x 4.)
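A sketch using the standard definition H = X (X'X)^-1 X', continuing the hypothetical four-observation example (so H is 4 x 4 here), and including the matrix form of SSE from the earlier card:

```python
# Sketch: hat matrix, matrix-form SSE, and SEs of the fitted values
# (hypothetical four-observation data set).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 2.9, 4.1, 4.8])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T  # n x n hat matrix; y_hat = H y
y_hat = H @ y
SSE = (y - y_hat) @ (y - y_hat)       # (Y - Yexpected)'(Y - Yexpected)
MSE = SSE / (len(y) - 2)
se_fit = np.sqrt(MSE * np.diag(H))    # SE of each fitted value, for mean-response CIs
print(se_fit)
```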
When is regression through the origin appropriate for the data?
When the plot of residuals vs predicted values looks fine (no outliers, no unequal variances) and the normality plot looks good.
What happens to R^2 for regression through the origin?
The R^2 reported for regression through the origin ALWAYS increases because both SSE and SSTo increase: SSE increases because the through-origin line is not the least squares line, and SSTo increases because software computes it as the sum of squared y values rather than squared deviations from ybar. Since SSTo grows by more, SSE/SSTo shrinks and R^2 = 1 - SSE/SSTo rises, so it is not comparable to the ordinary R^2.
How are hypotheses tests and confidence intervals related?
There is a 1:1 relationship between the two as long as the same alpha is used and they are both two sided (or both one sided)
What is the goal of the least squares line?
To minimize SSE
Between SSE and SSR, which is unexplained and which is explained?
SSR is explained, SSE is unexplained
Write out both the linear model and the estimated linear model.
linear model: yi = Bo + B1xi + Ei
estimated linear model: yi = Bo-hat + B1-hat xi + ei (the fitted line itself is yi-hat = Bo-hat + B1-hat xi)
When do you transform x?
When you only have linearity problems in your data (no residual or normality issues)
When do you transform y?
When you have both linearity and residual/normality issues
When do you transform x and y?
When you have linear data but you have normality/residual issues
Write the estimated linear model, the distributions of the values, and the assumptions.
yi = Bo-hat + B1-hat xi + ei (fitted line: yi-hat = Bo-hat + B1-hat xi)
where yi is the [dependent variable], independently distributed ∼ N(β0 + β1xi, σ²),
xi is the [independent variable],
Bo-hat is the estimated intercept (mean value of y when x = 0),
B1-hat is the estimated slope (mean change in y when x increases by 1 unit),
ei is the residual, which estimates the error term Ei, iid ∼ N(0, σ²)
The assumptions are:
1. The observations are independent
2. Y's are normally distributed at each x
3. Residuals have constant variability (homoskedasticity)
4. Means of yi are linearly related to xi
5. Outliers are not driving our conclusions
Give an interpretation of the 95% confidence interval (5, 7) when x = 4 puppies and y = happiness score
We are 95% confident that the mean happiness score with 4 puppies is between 5 and 7.
Give an interpretation of the 95% prediction interval (5, 7) when x = 4 puppies and y = happiness score
There is a 0.95 probability that the happiness score at 4 puppies will be between 5 and 7.
If the research question is "Does happiness score increase with more puppies?" - give a hypothesis test and an interpretation of the two sided p-value of 0.002.
Ho: Happiness does not increase with puppies: B1 = 0
Ha: Happiness does increase with puppies: B1>0
Since the two-sided p-value is 0.002 and we are doing a one-sided test (and the estimated slope is in the hypothesized positive direction), our p-value is 0.001. At an alpha level of 0.05, we reject the null hypothesis that happiness does not increase with puppies because the p-value of 0.001 is less than alpha = 0.05.
Why is normality of observations important?
If your data is non-normal, it will be hard to do significance testing to see if the value being tested in the alternative hypothesis actually deviates from typical values, since you lack a valid reference distribution to compare to.