Linear Modelling

30 Terms

1
New cards

Correlation vs Causation

Correlation:

  • Linear association between 2 variables

  • Ranges from -1 to +1

  • Correlation does not imply causation

Causation:

  • One variable directly affects another

2
New cards

Correlation Formula

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
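The Pearson formula can be computed directly from its definition. A minimal sketch (the course uses R; this Python version just mirrors the arithmetic, and `pearson_r` is an illustrative name):

```python
import math

def pearson_r(x, y):
    # r = sum of products of deviations, divided by the product of
    # the two root-sum-of-squared deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) *
                    sum((yi - my) ** 2 for yi in y))
    return num / den

# A perfectly linear relationship gives r = +1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```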
3
New cards

What is a linear model

Expresses relationship between a response (dependent variable) “y” and one or more predictors “x1”, “x2”

Estimates average change in “y” for a one unit change in a predictor, holding other variables constant
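The definition above can be sketched with a closed-form least-squares fit for one predictor (a Python sketch of the arithmetic, not the R `lm` workflow the deck assumes; `fit_line` is an illustrative name):

```python
def fit_line(x, y):
    # Closed-form simple linear regression:
    # beta1 = sum of cross-deviations / sum of squared x-deviations
    # beta0 = y-bar - beta1 * x-bar
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) /
             sum((xi - mx) ** 2 for xi in x))
    beta0 = my - beta1 * mx
    return beta0, beta1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(x, y)
# b1 is the average change in y for a one-unit change in x
print(b0, b1)
```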

4
New cards

Why do we include an error term?

  1. To account for natural variability in data

  2. To capture effects of variables not included in model

  3. To show predictions won’t be exact

  4. Assumed to follow a normal distribution, ϵᵢ ∼ N(0, σ²), which makes inference easier

5
New cards

Model Assumptions

  • Linearity: Relationship between predictors and response is linear.

  • Independence: Residuals are independent.

  • Homoscedasticity: Residuals have constant variance.

  • Normality of residuals: Residuals are normally distributed.

  • Mean of residuals = 0.
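The last assumption can be verified numerically: whenever the model includes an intercept, least-squares residuals sum to zero by construction. A quick Python check (a sketch; `fit_line` is an illustrative helper, not part of the course's R code):

```python
def fit_line(x, y):
    # Closed-form least-squares fit with an intercept
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y)) /
          sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1

x = [1, 2, 3, 4, 5]
y = [1.2, 2.7, 2.9, 4.4, 5.1]
b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
# With an intercept in the model, the residuals average to zero
print(sum(residuals))  # ~0 (up to floating-point error)
```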

6
New cards

Goal of Modelling

  • To understand relationship between different variables

  • Once we know the relationship, we can use the model to make predictions

7
New cards

Broom Package

Takes the messy output of built-in R functions (e.g. lm) and turns it into tidy data frames

8
New cards

Functions

Represents the relationship between one or more inputs and an output.

9
New cards

How to find predicted values

Only consider x values that are within the range of the x values used to estimate the model.

Eg. If data ranges from x=50 to x=100, don’t use model to predict for x=150.

The model might not hold for values outside the observed data

10
New cards

Mean Squared Error (MSE)

Measures how close the fitted line (line of best fit) is to the observed data; fitting chooses the line that makes it as small as possible.

Found as the average of the squared differences between observed values (yᵢ) and predicted values (ŷᵢ)

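The definition above translates directly into code (a minimal Python sketch; `mse` is an illustrative helper name):

```python
def mse(y_obs, y_hat):
    # Average of squared differences between observed and predicted values
    return sum((yi - fi) ** 2 for yi, fi in zip(y_obs, y_hat)) / len(y_obs)

# (0.25 + 0.25 + 0) / 3
print(mse([3, 5, 7], [2.5, 5.5, 7.0]))
```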
11
New cards

Coefficient B0:

Intercept.

The expected value of y when all predictors are 0

12
New cards

Coefficient: B1

Slope

  • Measures expected change in y when x increases by 1 unit.

Alternate Formula:

Cov(x,y) / Var(x)

13
New cards

Visualising Residuals

Plots:

  1. Residual Plot

  2. Index Plot

  3. Q-Q Plot

  4. Histogram

14
New cards

Residual Plot

  • Shows residuals vs predictions

  • Residuals should fluctuate around 0 with no pattern

  • If there’s a pattern in the residuals → indicates non-linearity (the linear model doesn’t fit the data well)

15
New cards

Index Plot

  • Shows residuals vs observations

  • Residuals should randomly fluctuate around 0

  • If there’s a pattern, this shows autocorrelation (residuals are not independent)

  • Autocorrelation is bad and violates assumption of independent errors

16
New cards

Q-Q Plot

  • Compares residual quantiles against theoretical normal quantiles

  • Useful for checking normal distribution assumption

  • Should be a straight diagonal line if residuals are normally distributed

  • Large deviations from the line indicate non-normal residuals, which violates a model assumption

17
New cards

Histogram

  • Bar plot showing frequency of residuals

  • Should be bell shaped and symmetric (normally distributed)

  • Checks normality assumption (Similar to Q-Q Plot)

18
New cards

R² (Coefficient of determination)

R² = SSE/SST (explained sum of squares ÷ total sum of squares)

R² = 1 − (SSR/SST) (SSR = residual sum of squares)

Ratio of explained variance to total variance

Tells us the % of total variability in the dependent variable that is explained by the model

  • Between 0 and 1 (1 = perfect fit)

  • Adding more variables never decreases R² and may increase it (even if the variables are not useful)

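Both versions of the formula, plus the adjusted variant from the next card, can be sketched directly (a Python sketch; function names are illustrative, and p counts predictors excluding the intercept):

```python
def r_squared(y, y_hat):
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual (SSR)
    ss_tot = sum((yi - my) ** 2 for yi in y)                  # total (SST)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, p):
    # Penalises model complexity: p = number of predictors
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = [1, 2, 3, 4]
y_hat = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y, y_hat))              # ≈ 0.98
print(adjusted_r_squared(y, y_hat, 1))  # ≈ 0.97 (always ≤ plain R²)
```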
19
New cards

R² vs Adjusted R²

Adjusted R² takes model complexity into account

More variables increase model complexity

Adjusted R² penalises adding unnecessary predictors

20
New cards

Dummy Variable

Categorical variables are represented as dummy (0/1) variables

Intercept = Mean outcome for the baseline category

Coefficients = Difference in mean response between that category and the baseline
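Both interpretations can be verified with one binary dummy (a Python sketch; groups "A"/"B" and the numbers are made up for illustration):

```python
# Baseline category "A" coded 0, category "B" coded 1
groups = ["A", "A", "A", "B", "B", "B"]
y      = [4.0, 5.0, 6.0, 9.0, 10.0, 11.0]
x = [1 if g == "B" else 0 for g in groups]

# Closed-form least-squares fit on the dummy
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) /
      sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx

# b0 = mean of baseline A (5.0); b1 = mean(B) - mean(A) (10.0 - 5.0 = 5.0)
print(b0, b1)
```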

21
New cards

Why should we do EDA before modelling?

  • Helps understand data structure and spot patterns or problems (eg. missing values, outliers)

  • Shows which model is appropriate to use

22
New cards

What can models show?

  • Can show patterns not obvious in summaries

  • However, they can also suggest patterns that aren’t real, leading to misinterpretation

23
New cards

Why should we check assumptions and residuals

  • To make sure model is appropriate and not misleading

  • Helps detect non-linearity, heteroscedasticity and autocorrelation

24
New cards

Goodness of fit measures

  1. AIC

  2. BIC

  3. Deviance

25
New cards

AIC (Akaike Information Criterion)

  • Can be used to compare models

  • Smaller AIC = Better model

26
New cards

BIC (Bayesian Information Criterion)

  • Used to compare candidate models

  • Lower BIC = Better model

  • Penalises model complexity more heavily than AIC
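For a linear model with normal errors, AIC and BIC are often written (up to an additive constant) in terms of the residual sum of squares, which makes the heavier BIC penalty visible. A hedged sketch, assuming this Gaussian form, with k the number of estimated parameters:

```python
import math

def aic(n, ss_res, k):
    # AIC = n * ln(SSR / n) + 2k  (Gaussian linear model, up to a constant)
    return n * math.log(ss_res / n) + 2 * k

def bic(n, ss_res, k):
    # BIC = n * ln(SSR / n) + k * ln(n); the per-parameter penalty ln(n)
    # exceeds AIC's 2 once n > e^2 ≈ 7.4, so BIC penalises complexity more
    return n * math.log(ss_res / n) + k * math.log(n)

# Same fit, same parameter count: BIC's penalty is larger for n = 100
print(aic(100, 25.0, 3), bic(100, 25.0, 3))
```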

27
New cards

Deviance

  • Measures residual variation (SSR: the variation not explained by the model)

  • Closer to 0 = Better fit

  • Best used for comparing 2 models (model with lower deviance is better)

28
New cards

Why do we use “average” in interpretations

  • Regression estimates the average effect of a predictor on the dependent variable

  • Actual observations vary around this average due to random error
