Biostatistics exam 3 - Linear Regression

33 Terms

1

Regression

regression is a method that predicts values of one numerical variable from values of another numerical variable

• Fits a line through the data

- used for prediction

- measures how steeply one variable changes with another

2

Correlation versus regression

• correlation measures the strength and direction of the linear relationship between two numerical variables

- measures the association between X and Y

• regression predicts values of Y given X

3

Linear regression

• the most common type of regression (nonlinear models also exist)

• draws a straight line through the data to predict the response variable (Y) from the explanatory variable (X)

4

least squares regression

Line for which the sum of all the squared deviations in y is smallest

• deviations: vertical distance between each data point and the line

5

Formula for the linear regression

• Y = a + bX, where a is the Y-intercept and b is the slope

6

Slope of a linear regression

• the slope of a linear regression is the rate of change in Y per unit of X (rise over run)

• its sign also gives the direction of the relationship

- positive: as X increases y increases

- negative: as X increases y decreases

7

How to calculate the slope (linear regression) - equation

• numerator (the sum of products) measures how the deviations in X and Y vary together (can be positive or negative)

• denominator is the sum of squares for x
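
The equation image for this card did not survive. Going by the description above (sum of products over the sum of squares for X), it was presumably the standard least squares slope formula, written here with X̄ and Ȳ as the sample means (the index i is notation assumed for this sketch):

```latex
b = \frac{\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i} (X_i - \bar{X})^2}
```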

8

How to calculate the intercept (linear regression)

• once the slope is calculated, getting the intercept is straightforward because the least squares regression line always goes through (Xbar, Ybar)

• plug the mean values into the line formula → rearrange to solve for the intercept: a = Ybar - b(Xbar)
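
The two steps (slope first, then intercept from the means) can be sketched in plain Python. The function name `least_squares_fit` and the five data points are made up for illustration; they are not from the course.

```python
def least_squares_fit(x, y):
    """Return (a, b): intercept and slope of the least squares line Y = a + bX."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Numerator of the slope: sum of products of the X and Y deviations.
    sum_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator of the slope: sum of squares for X.
    ss_x = sum((xi - x_bar) ** 2 for xi in x)

    b = sum_products / ss_x
    # The line passes through (x_bar, y_bar), so solve y_bar = a + b * x_bar for a.
    a = y_bar - b * x_bar
    return a, b


# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b = least_squares_fit(x, y)
print(f"Y-hat = {a:.1f} + {b:.1f} X")  # Y-hat = 2.2 + 0.6 X
```

For these points Xbar = 3 and Ybar = 4, the sum of products is 6 and the sum of squares for X is 10, so b = 0.6 and a = 4 - 0.6(3) = 2.2.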

9

estimates / statistics and parameters for a linear regression

• estimates/statistics: slope (b) and intercept (a)

- estimated from a sample of measurements

• Parameters: slope (β) and intercept (α)

- from the true population

10

regression assumption

Regression assumes that there is a population for every value of X, and the mean Y for each of these populations lies on the regression line

• assumes the spread is the same in each subpopulation (you don't want a funnel)

11

Predicting values with a linear regression

• can predict values of Y for any specified value of X

- the line can't be used in reverse to predict X from Y, because it was fit to predict Y from the explanatory variable X, not the other way around

• predictions are the mean Y for all individuals with value X

• designated Ŷ ("Y-hat")

• plug a value of X into the linear regression formula and solve for Y

12

Residual

the residual of a point is the difference between its measured Y value and the value of y predicted by the regression line

13

How do you measure how well the data fits the line?

• residuals measure the scatter of points above and below the least squares regression line

- can be positive or negative

• variance in residuals (MSresidual) quantifies the spread of the scatter

- residual mean square

- analogous to the error mean square in ANOVA

- used to quantify the uncertainty of the slope

14

Residual mean square equation
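
The equation image for this card is missing. The standard definition, which is presumably what was shown, divides the sum of squared residuals by n - 2, because two quantities (slope and intercept) are estimated from the data:

```latex
MS_{residual} = \frac{\sum_{i} (Y_i - \hat{Y}_i)^2}{n - 2}
```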

15

standard error of the slope (equation)

• measures the uncertainty (precision) of the sample estimate (b) of the population slope (β)

• the denominator is the sum of squares for X, which grows as data points are added and as the X values spread out, so SEb shrinks

• the numerator is MSresidual, the spread of the residuals
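
The equation itself is not preserved on this card. Given the numerator and denominator described above, it is presumably the standard formula, with the ratio placed under a square root:

```latex
SE_b = \sqrt{\frac{MS_{residual}}{\sum_{i} (X_i - \bar{X})^2}}
```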

16

Confidence interval of the slope
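
The card's image is missing. A confidence interval for the slope is usually written as below, assuming the same n - 2 degrees of freedom used for the slope test, with α(2) denoting the two-tailed critical value (α = 0.05 for a 95% interval):

```latex
b - t_{\alpha(2),\, n-2}\, SE_b \;<\; \beta \;<\; b + t_{\alpha(2),\, n-2}\, SE_b
```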

17

The two types of predictions

1. predict mean Y for a given X

- e.g. what is the mean age of all male lions whose noses are 60% black?

2. predict single Y for a given X

- e.g. how old is that lion over there with a 60% black nose?

• both predictions give the same value of Y-hat, but they differ in precision

• can predict mean with more certainty than a single value

18

Confidence bands

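measure the precision of the predicted mean Y for each given value of X

• curved because uncertainty in the slope matters more the farther X is from Xbar, so the band widens toward the extremes of X (and is wider overall when the sample size is small)

• narrowest at the point (Xbar, Ybar)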

19

Prediction intervals

Measure the precision of the predicted single Y values for each X

• wider than confidence bands because predicting a single Y value is less precise than predicting a mean Y
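
The cards give no formulas for the two kinds of precision. As a hedged sketch, the standard textbook standard errors are shown below; the subscripts "mean" and "single" are labels chosen here, not the course's notation. The only difference is the extra 1 inside the second square root, which adds the scatter of individual points around the line and is why prediction intervals are always wider. Both widen as X moves away from X̄.

```latex
SE[\hat{Y}_{mean}] = \sqrt{MS_{residual}\left(\frac{1}{n} + \frac{(X - \bar{X})^2}{\sum_{i} (X_i - \bar{X})^2}\right)}
\qquad
SE[\hat{Y}_{single}] = \sqrt{MS_{residual}\left(1 + \frac{1}{n} + \frac{(X - \bar{X})^2}{\sum_{i} (X_i - \bar{X})^2}\right)}
```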

20

Interpolation

Predicting Y for a value of X lying between the smallest and largest values of X in the data; regression should only be used for prediction within this range

21

Extrapolation

The prediction of the value of a response variable (Y) outside the range of X values in the data

• extends the prediction beyond where you sampled

• not recommended because there's no way to ensure the relationship continues to be linear beyond the range of the data
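
One way to make the interpolation/extrapolation distinction concrete is a prediction helper that still computes Y-hat but warns when X falls outside the sampled range. This is only a sketch: `predict_y`, the coefficients, and the X values are hypothetical.

```python
import warnings


def predict_y(a, b, x_new, x_observed):
    """Return Y-hat = a + b * x_new, warning if x_new is an extrapolation."""
    x_min, x_max = min(x_observed), max(x_observed)
    if x_new < x_min or x_new > x_max:
        # Outside the sampled range there is no evidence the relationship stays linear.
        warnings.warn(
            f"X = {x_new} is outside the sampled range [{x_min}, {x_max}]; "
            "this prediction is an extrapolation",
            stacklevel=2,
        )
    return a + b * x_new


# Made-up illustrative values.
x_observed = [1.0, 2.0, 3.0, 4.0, 5.0]
a, b = 2.2, 0.6

y_hat_inside = predict_y(a, b, 3.0, x_observed)    # interpolation: no warning
y_hat_outside = predict_y(a, b, 10.0, x_observed)  # extrapolation: warns
```

The function deliberately returns a value in both cases. The arithmetic is the same; what changes is how much the result can be trusted.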

22

Hypotheses for testing a slope

H₀: β = 0

Ha: β ≠ 0

23

test statistic for regression slope

t-statistic → measures how well our data fit the expectation under the null hypothesis

24

t-statistic equation for regression slope

• SEb = standard error of the slope (measures uncertainty)

• β₀ = slope under the null hypothesis (usually 0)

• df = n-2
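
The equation image is missing from this card. From the three symbols defined above, it is presumably the usual form:

```latex
t = \frac{b - \beta_0}{SE_b}, \qquad df = n - 2
```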

25

how to get a p-value from the test statistic

compare the t-statistic to the t-distribution with n - 2 degrees of freedom, using a statistical table (critical value) or a computer (exact P-value)
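
The steps leading up to the P-value can be sketched in standard-library Python. The function name `slope_t_statistic` and the data are illustrative only. The standard library has no t-distribution, so the last step (t and df to P) is left to a table or statistics software, as the card says.

```python
import math


def slope_t_statistic(x, y, beta_0=0.0):
    """Return (t, df) for testing H0: beta = beta_0 in a simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)

    # Least squares slope and intercept.
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    a = y_bar - b * x_bar

    # Residuals: measured Y minus the Y predicted by the line.
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    ms_residual = sum(r ** 2 for r in residuals) / (n - 2)

    se_b = math.sqrt(ms_residual / ss_x)  # standard error of the slope
    t = (b - beta_0) / se_b
    return t, n - 2


# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
t, df = slope_t_statistic(x, y)
print(f"t = {t:.2f}, df = {df}")  # t = 2.12, df = 3
```

With only 3 degrees of freedom, the two-tailed 0.05 critical value is 3.182, so this made-up example would not reject H₀: β = 0.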

26

ANOVA (F) approach

In regression framework:

• deviations between the predicted values Yi-hat and Ybar

- analogous to MSgroups

• deviations between each Yi and its predicted value Yi-hat (residuals)

- analogous to MSerror

• using ANOVA approach will generate the same p-value as the t-test approach

• can be used to measure R²: the fraction of the variation in Y that is "explained" by X
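
The partition described above can be sketched as follows. The names (`regression_anova`, `ss_regression`, and so on) are chosen for this example. With one explanatory variable the regression mean square has 1 degree of freedom, and the F-ratio equals the square of the slope's t-statistic, which is why both approaches give the same P-value.

```python
def regression_anova(x, y):
    """Return (F, R squared) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    a = y_bar - b * x_bar
    y_hat = [a + b * xi for xi in x]

    # Deviations of predicted values from the mean (analogous to groups).
    ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
    # Deviations of each Y from its predicted value (residuals, analogous to error).
    ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)

    ms_regression = ss_regression / 1      # df = 1
    ms_residual = ss_residual / (n - 2)    # df = n - 2
    f_ratio = ms_regression / ms_residual
    r_squared = ss_regression / ss_total   # fraction of variation in Y "explained" by X
    return f_ratio, r_squared


# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
f_ratio, r_squared = regression_anova(x, y)
print(f"F = {f_ratio:.1f}, R^2 = {r_squared:.1f}")  # F = 4.5, R^2 = 0.6
```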

27

Regression toward the mean

Results when two variables measured on a sample of individuals have a correlation less than one. Individuals that are far from the mean for one of the measurements will, on average, lie closer to the mean for the other measurement

• in pic: solid line = linear regression, dashed line = one-to-one line with slope of 1

• question to ask: are people just regressing toward the mean, or is the drug working?

28

Assumptions of linear regression

At each value of X:

• there is a population of Y-values whose mean lies on the regression line

• the distribution of possible Y-values is normal (with the same variance)

• The variance of Y-values is the same at all values of X

• the Y measurements represent a random sample from the possible Y-values

29

3 possible issues when trying to do a linear regression

1. outliers

2. nonlinearity

3. non-normality and unequal variances

30

How to deal with outliers

If there is only one outlier (or a small number), it may be reasonable to report the regression both with and without the outlier(s)

31

How to detect nonlinearity

Can be detected by inspecting graphs (the scatter plot with the fitted line, or the residual plot)

32

How to detect non-normality and unequal variances

Residual plot

33

Residual plot

Residual of every data point (Yi - Yi-hat) is plotted against Xi

• if assumptions of normality and equal variances are met then there should be a roughly symmetric cloud above / below line at zero

- you don't want a funnel (a funnel shape means the spread of Y changes with X, violating the equal-variance assumption)
