Linear Modelling

Correlation vs Causation

Correlation:

Measures strength and direction of the linear association between 2 variables
Ranges from -1 to +1
Correlation does not imply causation

Causation:

One variable directly affects another

Correlation Formula
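
This card has no answer text in the notes. Assuming it refers to the usual Pearson correlation coefficient (consistent with the "-1 to +1" range on the previous card), the formula can be written as below, where x̄ and ȳ are the sample means and s_x, s_y the sample standard deviations (notation assumed here, not taken from the notes):

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
  = \frac{\mathrm{Cov}(x, y)}{s_x \, s_y}
```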

What is a linear model

Expresses the relationship between a response (dependent variable) “y” and one or more predictors “x1”, “x2”, … as a linear combination of the predictors plus an error term

Estimates the average change in “y” for a one-unit change in a predictor, holding the other predictors constant
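
Written out, a model of this kind usually looks like the equation below. This is a sketch in assumed notation: β0 and β1 play the role of the coefficients B0 and B1 on later cards, p is the number of predictors, and ε_i is the error term.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i,
\qquad \epsilon_i \sim N(0, \sigma^2)
```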

Why do we include an error term?

To account for natural variability in data
To capture effects of variables not included in model
To show predictions won’t be exact
Assumed to follow a normal distribution: ϵi ∼ N(0, σ²), which makes inference easier

Model Assumptions

Linearity: Relationship between predictors and response is linear.
Independence: Residuals are independent.
Homoscedasticity: Residuals have constant variance.
Normality of residuals: Residuals are normally distributed.
Mean of residuals = 0.

Goal of Modelling

To understand relationship between different variables
Once we know the relationship we can use the model to make predictions

Broom Package

Takes the messy output of built-in R functions (e.g. lm) and turns it into tidy data frames

Functions

Represent the relationship between one or more inputs and an output.

How to find predicted values

Only predict for x values that lie within the range of the x values used to estimate the model.

E.g. if the data ranges from x=50 to x=100, don’t use the model to predict for x=150.

The model might not hold for values outside the observed data (extrapolation)
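
A minimal sketch of that rule in Python, assuming a simple model with one predictor. The function name and the coefficient values are hypothetical, chosen only for illustration:

```python
def predict_within_range(x_new, x_observed, b0, b1):
    """Return b0 + b1 * x_new, refusing to extrapolate beyond the observed x range."""
    lo, hi = min(x_observed), max(x_observed)
    if not lo <= x_new <= hi:
        raise ValueError(f"x={x_new} is outside the observed range [{lo}, {hi}]")
    return b0 + b1 * x_new


x_observed = [50, 60, 75, 90, 100]  # x values used to fit the model
b0, b1 = 2.0, 0.5                   # hypothetical fitted coefficients

print(predict_within_range(80, x_observed, b0, b1))  # 42.0
# predict_within_range(150, x_observed, b0, b1) would raise ValueError
```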

Mean Squared Error (MSE)

Measures how close the fitted model (line of best fit) is to the observed data; the line of best fit is chosen to make it as small as possible.

Calculated as the average of the squared differences between the observed values (yi) and the predicted values (ŷi)
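
The calculation can be sketched in a few lines of Python. The data are made up, and the function divides by n as the card describes (some texts divide by the residual degrees of freedom instead):

```python
def mean_squared_error(observed, predicted):
    """Average of squared differences between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return sum(squared_errors) / len(squared_errors)


observed = [3.0, 5.0, 7.0, 10.0]   # made-up y values
predicted = [2.5, 5.5, 7.0, 9.0]   # made-up fitted values

mse = mean_squared_error(observed, predicted)
print(mse)  # 0.375
```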

Coefficient B0

Intercept.

Expected value of y when all predictors are 0

Coefficient B1

Slope

Measures expected change in y when x increases by 1 unit.

Alternate formula (simple linear regression):

B1 = Cov(x,y) / Var(x)
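
As a sketch, the covariance formula for the slope can be checked on made-up data that lie exactly on y = 1 + 2x. The intercept formula B0 = ȳ - B1·x̄ is the standard least-squares result for a single predictor and is an addition here, not something stated on the cards:

```python
from statistics import mean


def fit_simple_linear(x, y):
    """Return (b0, b1) for a one-predictor least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    n = len(x)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    b1 = cov_xy / var_x          # slope = Cov(x, y) / Var(x)
    b0 = y_bar - b1 * x_bar      # line passes through (x_bar, y_bar)
    return b0, b1


x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]  # made-up data: exactly y = 1 + 2x

b0, b1 = fit_simple_linear(x, y)
print(b0, b1)  # 1.0 2.0
```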

Visualising Residuals

Plots:

Residual Plot
Index Plot
Q-Q Plot
Histogram

Residual Plot

Shows residuals vs predicted (fitted) values
Residuals should fluctuate around 0 with no pattern
If there’s a pattern in the residuals → indicates non-linearity (the linear model doesn’t fit the data well)
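
To see what a "pattern" looks like, here is a small sketch that fits a straight line to deliberately curved, made-up data (y = x²) and lists the residuals against the fitted values. The residuals come out positive, negative, then positive again: a U shape rather than random scatter around 0.

```python
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [xi ** 2 for xi in x]  # made-up curved data: y = x^2

# Least-squares line: slope = Sxy / Sxx, intercept = y_bar - slope * x_bar
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# These pairs are what a residual plot would display
for f, r in zip(fitted, residuals):
    print(f"fitted={f:6.1f}  residual={r:5.1f}")
```

The residuals still average to 0, so the mean alone would not reveal the problem; the shape of the plot does.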

Index Plot

Shows residuals vs observation order (index)
Residuals should randomly fluctuate around 0
If there’s a pattern, this shows autocorrelation (residuals are not independent)
Autocorrelation violates the assumption of independent errors

Q-Q Plot

Compares quantiles of the residuals vs quantiles of a normal distribution
Useful for checking normal distribution assumption
Points should fall along a straight diagonal line if residuals are normally distributed
Large deviations from the line indicate non-normal residuals, which violates an assumption of the model
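
A rough sketch of how the points on a Q-Q plot can be built, using only the Python standard library. The residual values are made up, and the plotting positions (i - 0.5)/n are one common convention assumed here; plotting software may use a slightly different one.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Pair each sorted residual with the matching standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]  # made-up residuals
points = qq_points(residuals)

for theoretical_q, sample_q in points:
    print(f"normal quantile={theoretical_q:6.3f}  residual={sample_q:5.2f}")
```

Plotting the residuals against the normal quantiles and checking whether the points follow a straight line is the visual check the card describes.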

Histogram

Bar plot showing frequency of residuals
Should be bell-shaped and symmetric around 0 (normally distributed)
Checks normality assumption (similar to Q-Q plot)

R² (Coefficient of determination)

R² = SSE/SST (SSE = explained sum of squares, SST = total sum of squares)

R² = 1 - (SSR/SST) (SSR = residual sum of squares)

Ratio of explained variation to total variation

Tells us % of total variability in dependent variable that is explained by model

Between 0 and 1 (1 = perfect fit)
Adding more variables never decreases R² and usually increases it (even if the variables are not useful)
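
The two formulas give the same number for a least-squares fit with an intercept, because SST = SSE + SSR. A small sketch on made-up data, following the cards' naming (SSE = explained, SSR = residual, SST = total); the fitted values use B0 = 2.2 and B1 = 0.6, which are the least-squares estimates for these five points:

```python
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]        # made-up observations
b0, b1 = 2.2, 0.6          # least-squares estimates for this data

fitted = [b0 + b1 * xi for xi in x]
y_bar = mean(y)

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
sse = sum((fi - y_bar) ** 2 for fi in fitted)           # explained by the model
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # left unexplained

r2_explained = sse / sst
r2_residual = 1 - ssr / sst

print(round(r2_explained, 3), round(r2_residual, 3))  # 0.6 0.6
```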

R² vs Adjusted R²

Adjusted R² takes model complexity into account

More variables increase model complexity

Adjusted R² penalises adding unnecessary predictors
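
The cards don't give the formula, so the sketch below assumes the commonly used one: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), with n observations and p predictors. The function and argument names are hypothetical.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R² for a model with an intercept and n_predictors predictors."""
    residual_df = n_obs - n_predictors - 1
    if residual_df <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r_squared) * (n_obs - 1) / residual_df


# Same R² and sample size, but more predictors gives a lower adjusted R²
few = adjusted_r_squared(0.60, n_obs=30, n_predictors=2)
many = adjusted_r_squared(0.60, n_obs=30, n_predictors=10)
print(few, many)
```

Holding R² fixed, the value drops as predictors are added, which is the penalty for complexity described above.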

Dummy Variable

Categorical variables are represented as dummy (0/1) variables

Intercept = Mean outcome for baseline category

Coefficients = Difference in mean response between each category and the baseline
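
A small sketch with made-up data and a single two-level category (baseline "A", coded 0; "B", coded 1) shows both interpretations: the fitted intercept equals the mean of group A, and the coefficient equals the difference between the two group means.

```python
from statistics import mean

groups = ["A", "A", "A", "B", "B", "B"]       # made-up categories, "A" is the baseline
y = [10, 12, 14, 15, 17, 19]                  # made-up responses
dummy = [0 if g == "A" else 1 for g in groups]

# Least-squares fit of y on the dummy variable
d_bar, y_bar = mean(dummy), mean(y)
coefficient = (sum((d - d_bar) * (yi - y_bar) for d, yi in zip(dummy, y))
               / sum((d - d_bar) ** 2 for d in dummy))
intercept = y_bar - coefficient * d_bar

mean_a = mean([yi for yi, g in zip(y, groups) if g == "A"])
mean_b = mean([yi for yi, g in zip(y, groups) if g == "B"])

print(intercept, mean_a)             # intercept matches baseline mean
print(coefficient, mean_b - mean_a)  # coefficient matches difference in means
```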

Why should we do EDA before modelling?

Helps understand data structure and spot patterns or problems (eg. missing values, outliers)
Helps show which model is appropriate to use

What can models show?

Can show patterns not obvious in summaries
However, can also show patterns that aren’t fully true, leading to misinterpretation

Why should we check assumptions and residuals?

To make sure model is appropriate and not misleading
Helps detect non-linearity, heteroscedasticity and autocorrelation

Goodness of fit measures

AIC
BIC
Deviance

AIC (Akaike Information Criterion)

Can be used to compare models
Smaller AIC = Better model

BIC (Bayesian Information Criterion)

Used to compare models fitted to the same data
Lower BIC = Better model
Penalises model complexity more heavily than AIC
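
For reference, the standard definitions of the two criteria are given below. The symbols are assumptions, not taken from the cards: k is the number of estimated parameters, n the number of observations, and L̂ the maximised likelihood of the model.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\mathrm{BIC} = k\ln(n) - 2\ln\hat{L}
```

Each extra parameter costs 2 in AIC but ln(n) in BIC, so BIC's penalty is the heavier one once n is 8 or more (ln 8 ≈ 2.08).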

Deviance

Measures residual variation (SSR: how much isn’t explained by the model)
Closer to 0 = Better fit
Best used for comparing 2 models fitted to the same data (model with lower deviance fits better)

Why do we use “average” in interpretations?

Regression estimates the average effect of a predictor on the dependent variable
Actual observations vary around this average due to random error
