Unit 9 : Linear Regression



22 Terms

1

The most common statistical model is

the linear regression model. This model predicts the average Y-value, or E(Y), for a specified value of X (NOT INDIVIDUAL Y-VALUES). The true intercept value (in the whole population) is denoted β0, while the true slope value (in the population) is denoted β1.

2

Y is called the

dependent or response variable, because the model predicts values of Y. In this sense, Y depends on, or responds to, X.

Since we regress the values of Y on X, Y is also referred to as the regressant. The primary purpose of the regression model is to understand and predict the behavior of Y based on X.

3

X is called the

independent or explanatory variable, because the model uses it to predict or explain Y. In data science contexts, X is often called the feature, and other times it is called the regressor because it forms the input for the regression model.

4

How is the equation for the linear regression model calculated?

For a given line, compute the error εi for each data point, then minimize these errors. Adding positive and negative errors would let them cancel out, so to avoid this we square the errors and then add them up. The least-squares line (LSL) minimizes the sum of these squared errors; no other line will have a smaller sum of squared errors.
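A minimal sketch of the least-squares property, using hypothetical toy data (the variable names and values are made up for illustration):

```r
# Toy data (hypothetical) to show the least-squares property
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

fit <- lm(y ~ x)                      # the least-squares line
sse_lsl <- sum(resid(fit)^2)          # sum of squared errors for the LSL

# Any other candidate line, e.g. intercept 0.2 and slope 2.1, has a larger SSE
sse_other <- sum((y - (0.2 + 2.1 * x))^2)

sse_lsl < sse_other                   # TRUE: no line beats the LSL
```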

5

To make a scatterplot in R

plot(data$y ~ data$x)

6

To fit a linear model in R

model = lm(data$y ~ data$x)

The name of the model is arbitrary, but we need to give it a name so we can view the summary output and other model components.

model = lm(data$whatever ~ data$whatever)
summary(model)

7

Interpretation of the Coefficients - example

E(Average salary) = 30,030 + 0.807(Annual Cost)

The slope = 0.807 means for each $1 increase in Cost per year (X), average salary (Y) is predicted to increase by $0.807. The slope gives the predicted increase in Y per 1-unit increase of X.

The intercept gives the predicted, hypothetical value of average salary (Y) when Cost (X) = 0.

The intercept should only be interpreted IF it has a practical meaning, and IF there is no extrapolation.
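The fitted equation from this card can be used directly for prediction. A sketch in R, where the input cost of $20,000 is a hypothetical value chosen for illustration:

```r
# Fitted equation from the card: E(Average salary) = 30,030 + 0.807 * (Annual Cost)
predict_salary <- function(cost) 30030 + 0.807 * cost

# Predicted average salary at an annual cost of $20,000 (hypothetical input)
predict_salary(20000)   # 30030 + 0.807 * 20000 = 46170
```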

8

Extrapolation

occurs when we try to predict a y value outside the range of the x values. This can lead to wrong predictions and be very misleading.

9

Residuals

The error εi between the actual and predicted value is called the residual.


A positive residual implies the actual value lies above its mean, that is, above the trendline.

A negative residual implies the actual value lies below its mean, that is, below the trendline.

So first calculate the predicted value, then subtract it from the actual value to get the residual: residual = actual − predicted.
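The actual-minus-predicted calculation can be checked against R's built-in residuals. A sketch on hypothetical toy data:

```r
x <- c(1, 2, 3, 4, 5)                 # toy data, hypothetical
y <- c(2, 4, 5, 4, 6)
fit <- lm(y ~ x)

predicted <- fitted(fit)              # predicted values (y-hat) on the trendline
residuals_manual <- y - predicted     # residual = actual - predicted

# Matches R's built-in residuals
all.equal(unname(residuals_manual), unname(resid(fit)))   # TRUE
```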

10

The success of our goal of predicting the changes in Y is given by

R². A high R² indicates a strong relationship. R² gives the proportion of variation in Y that can be predicted (accounted for) by the model; 1 − R² is the proportion of the variation in Y that is accounted for by other factors not in the current model.
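R² can be computed by hand as 1 − SSE/SST and compared with the value `summary()` reports. A sketch on hypothetical toy data:

```r
x <- c(1, 2, 3, 4, 5)                  # toy data, hypothetical
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)

# R^2 = 1 - SSE/SST
sse <- sum(resid(fit)^2)               # variation left unexplained by the model
sst <- sum((y - mean(y))^2)            # total variation in Y
r2_manual <- 1 - sse / sst

all.equal(r2_manual, summary(fit)$r.squared)   # TRUE
```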

11

Is R² affected by changes in units?

No, it is not, but slope is affected.
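This can be verified by rescaling X and refitting. A sketch on hypothetical toy data, with x imagined as a cost in thousands of dollars:

```r
x <- c(1, 2, 3, 4, 5)                  # hypothetical x, in thousands of dollars
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

fit_thousands <- lm(y ~ x)
fit_dollars   <- lm(y ~ I(x * 1000))   # same data, x rescaled to dollars

coef(fit_thousands)[2]                 # slope changes with the units...
coef(fit_dollars)[2]                   # ...1000 times smaller here

# ...but R^2 is unchanged
all.equal(summary(fit_thousands)$r.squared, summary(fit_dollars)$r.squared)
```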

12

IF R² is 0.2508, what does that imply?

That only about 25% of the variation in Y (across different values of X) can be predicted on the basis of X; the remaining 75% is due to other factors not in the model.

13

A good model will have

small RSE, and a high R²

14

A weak model will have

large RSE, and a low R²

15

What conditions need to be in place for the linear model to provide optimal predictions?

  1. Linearity - to fit a linear model, the trend should be linear (straight) rather than nonlinear (curved). The scatterplot should be straight, not curved, and the residual plot should show points randomly scattered around 0 (like a cloud), not curved.

  2. Constant variance - the variance of the residuals (vertical spread) should be constant, not fan-shaped.

  3. Normality - a histogram of the residuals should show a Normal shape (unimodal and bell-shaped).

16

After fitting the model, we have to

plot the residuals. Ideally, the residuals would not show any pattern, only random scattered points.
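A minimal sketch of a residual plot in R, on hypothetical toy data (the deck itself does not give the exact command):

```r
x <- c(1, 2, 3, 4, 5)                  # toy data, hypothetical
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
model <- lm(y ~ x)

plot(fitted(model), resid(model),      # residuals vs fitted values
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)                          # points should scatter randomly around this line
```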

17

If the residual plot shows a non-linear (curved) pattern, the linearity condition is

not satisfied, and some type of nonlinear curve, e.g. a quadratic, would provide better predictions

18

If the residual plot shows a fan-shaped pattern this means

the condition for constant variance is not satisfied. A better model could be obtained by transforming the data, e.g. ln(Y)
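A sketch of the ln(Y) transformation, using hypothetical data generated with exponential growth (so the raw-scale spread grows with x, while the log scale is close to linear):

```r
set.seed(1)
x <- 1:6
y <- exp(0.5 * x + rnorm(6, sd = 0.1))  # hypothetical data; spread grows with y

fit_raw <- lm(y ~ x)                    # on the raw scale the spread tends to fan out
fit_log <- lm(log(y) ~ x)               # transforming Y stabilizes the variance

summary(fit_log)$r.squared              # the log-scale fit is close to linear
```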

19

QQ Plot in R - What it’s used for & command

This plots the observed values of the residuals vs. the theoretical values from a Normal distribution. If the residuals follow a Normal distribution, they fall along the line Y = X (observed residuals = theoretical Normal values).

The commands are:

qqnorm(model$residuals)

qqline(model$residuals)

20

T/F We can never infer causation on the basis of a strong correlation alone

TRUE!!

21

An observational study can show there is a link but

there could be an alternate path or lurking variable driving (causing) the trend

22

If you’re testing to see if a linear relationship between X and Y really exists..

You test this claim through a hypothesis test, which takes into account the inherent uncertainty from random sampling.

H0: β1 = 0 (no linear relationship exists, no trend, model not useful)

Ha: β1 ≠ 0 (a linear relationship exists, a trend exists, model useful)

If H0 is true, the line will be horizontal (no trend) and there is no linear relationship between X and Y, meaning the model would not be useful to predict Y.

If Ha is true, there will be a positive or negative trend (slope), and a linear relationship exists between X and Y
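In R, `summary()` reports this test in the slope row of the coefficient table. A sketch on hypothetical toy data:

```r
x <- c(1, 2, 3, 4, 5)                  # toy data, hypothetical
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(y ~ x)

# The slope row of the coefficient table carries the t-test of H0: beta1 = 0
p_value <- summary(fit)$coefficients[2, "Pr(>|t|)"]
p_value < 0.05                         # small p-value: reject H0, a linear trend exists
```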