Lecture 2: Linear Regression

Associations and Linear Regression

Introduction to Linear Regression

Welcome to the summer semester and introduction to linear regression. The lecture will cover associations, simple linear regression, multivariable linear regression, linear regression assumptions, and connection to data science tasks (description, prediction, and causal inference).

Associations

Definition

An association is a connection between variables.

Two perspectives:

  1. Changing one variable affects another: If we change one variable and another variable also changes, they are associated. For example, a light switch (on/off) is perfectly associated with a light bulb (on/off).

  2. Information about one variable implies information about another: Knowing information about one variable gives information about another. For example, if someone is a smoker, they are probably not three years old.

Quantifying Associations

Ways to quantify an association:

  • Visualizations: Scatter plots to recognize patterns.

  • Stratified Descriptive Analysis: Stratify by a variable (e.g., age) and compare outcomes in each group.

  • Effect Measures: Risk ratios and odds ratios.

  • Correlations: Measure the strength and direction of a linear relationship.

  • Regression Models: Quantify the relationship between variables.

Risk Ratio

A risk ratio of 1 indicates no association between the exposure and outcome. Values larger or smaller than 1 indicate some association.
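As a quick numerical sketch (the 2x2 counts below are invented for illustration), a risk ratio compares the risk of the outcome between exposed and unexposed groups:

```python
# Hypothetical 2x2 table (counts invented for illustration):
#              outcome   no outcome
# exposed         30         70
# unexposed       10         90
risk_exposed = 30 / (30 + 70)      # 0.30
risk_unexposed = 10 / (10 + 90)    # 0.10
risk_ratio = risk_exposed / risk_unexposed
print(round(risk_ratio, 2))  # 3.0: the exposed have three times the risk
```

A risk ratio of exactly 1.0 here would mean the exposure carries no extra risk, i.e. no association.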

Scatter Plots
  • Non-Association: No matter how much x changes, y remains stable (horizontal line). A symmetrical cloud of points also indicates no association.

  • Positive Association: As x increases, y increases. Low values of x are associated with low values of y, and high values of x are associated with high values of y.

Reasons for Observing an Association

Possible explanations for observing an association between two variables x and y:

  1. Chance: Due to noise in the data.

  2. Causal Reasons:

    • x causes y.

    • y causes x.

  3. Confounding: A third variable causes both x and y.

  4. Collider Bias: Conditioning on a common effect of x and y.

Research Questions
  • The primary target of research is usually causal relationships (i.e., how an exposure X affects an outcome Y).

  • Statistical associations are agnostic to directionality (undirected or bidirectional).

  • Randomized trials rule out the other explanations, which lets us interpret undirected associations as causal effects.

  • Descriptive aims should consider potential collider bias and causal frameworks.


Simple Linear Regression

Dependent and Independent Variables
  • Dependent Variable: Continuous (ratio or interval scale). Examples: income (in euros), plant growth (mm per day).

  • Independent Variables: Continuous or categorical. Examples: training, education, fertilizer, soil quality.

Pima Women Data Set
  • Data from Pima Indians in Arizona, who have a high prevalence of type 2 diabetes.

  • 532 women, age 20 and older.

  • Variables:

    • glu: Plasma glucose concentration.

    • Body mass index (BMI): kg / m^2.

    • Age (years).

    • Number of pregnancies.

    • type: Yes/no coding for diabetes.

    • Diastolic blood pressure.

    • Triceps skin fold thickness.

Simple Linear Regression Model
  • Simple means only one independent variable.

  • Assumes a linear relationship between two variables.

  • Model is linear in its parameters.

  • Visualize data without a model first (scatter plot).

Scatter Plot Interpretation
  • Each point represents a woman in the data set (532 dots).

  • Body Mass Index (BMI) on the x-axis and glucose concentration on the y-axis.

  • Normal BMI: 18.5 to 25, Overweight: 25 to 30, Obese: 30+.

  • In this sample, many women are overweight or obese.

  • Normal glucose values: up to around 140; values above 150 are considered high.

Pearson's Correlation Coefficient
  • For continuous, approximately normally distributed data.

  • Correlation coefficient of 0.25 indicates a smallish positive relationship.

  • Values range from -1 to +1.

    • 0 indicates no linear relationship.

    • -1 indicates a perfect negative correlation.

    • +1 indicates a perfect positive correlation.
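A minimal numpy sketch (with invented numbers) of these boundary cases:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]     # perfect positive linear relation -> +1
r_neg = np.corrcoef(x, -0.5 * x + 3)[0, 1]  # perfect negative linear relation -> -1

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
r_none = np.corrcoef(a, b)[0, 1]  # independent variables: close to 0
```

The 0.25 reported for BMI and glucose sits between these extremes: a smallish positive linear relationship.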

Difference Between Correlation and Regression
  • Correlation coefficient (r): measures how tightly the cloud of points clusters around a straight line pointing in one direction.

  • Regression coefficient: tells how much y changes for a one-unit change in x, and lets us estimate the value of the outcome for a given value of the exposure.

Linear Regression Equation

We have a cloud of points and want to draw a line through it. To describe any straight line, we need two values:

  • Slope Coefficient: for a one-unit increase in x, how much does y change?

  • Intercept: the value of y where x is zero.

Equation formula

y = \beta_0 + \beta_1 x

Where:

  • y is the dependent variable.

  • \beta_0 is the intercept.

  • \beta_1 is the slope coefficient.

  • x is the independent variable.

Residuals and Random Errors
  • Introduce an index i for each individual in the data set.

  • Each point in the scatter plot corresponds to one row of the data frame: the body mass index x_i and the glucose value y_i for woman i.

  • The fitted regression line does not pass through all the points, which means the model commits an error for most observations.

    • For a given value of x, we observe one value of y but the model estimates another; this difference between observation and estimation is the error.

  • For every value of x we therefore have two y's: the observed y_i and the estimated \hat{y}_i.

  • The estimated error (residual) \hat{\epsilon}_i is the difference between the observed and the estimated value of y.

y_i = \beta_0 + \beta_1 x_i + \epsilon_i

  • \epsilon_i is the error term (residual) for individual i.

Ordinary Least Squares (OLS) Estimation
  • Find the line that minimizes the joint error for all points.

  • All candidate regression lines pass through the point defined by the mean of x (mean body mass index) and the mean of y (mean blood glucose).

  • Because every candidate line meets at that point, the only thing left to determine is the slope, which is what the OLS equation delivers.
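Using the fact that the line passes through the mean point, OLS has a closed form: \beta_1 = \sum(x_i - \bar{x})(y_i - \bar{y}) / \sum(x_i - \bar{x})^2 and \beta_0 = \bar{y} - \beta_1 \bar{x}. A sketch with invented toy data:

```python
import numpy as np

# invented toy data standing in for (BMI, glucose) pairs
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.9])

# OLS slope: covariance of x and y divided by the variance of x
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# the intercept then forces the line through the mean point (x̄, ȳ)
beta0 = y.mean() - beta1 * x.mean()
```

`np.polyfit(x, y, 1)` returns the same slope and intercept, since it also minimizes the squared errors.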

Interpreting Results

Estimated intercept: 84.42.
Estimated slope coefficient: 1.11.
R-squared (R^2): 0.0625.

\text{Blood Glucose Level} = 84.42 + 1.11 \times \text{Body Mass Index} + \text{Error Term}
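Plugging numbers into the fitted equation gives a prediction; the BMI value of 30 below is just an illustrative input:

```python
beta0, beta1 = 84.42, 1.11               # estimated intercept and slope from above
bmi = 30.0
predicted_glucose = beta0 + beta1 * bmi  # about 117.7
# interpretation of the slope: each additional BMI unit
# adds about 1.11 units of blood glucose
```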

R-Squared
  • Coefficient of determination.

  • Ratio between explained variability in y (blood glucose) and the total variability in y.

  • Ranges from 0 to 1.

    • 0 indicates the model explains none of the variability in y.

    • 1 indicates the model explains all of the variability in y (perfect fit).

Variability

Technically we mean the variance here, but on the slides we use the slightly broader term variability.

We can use descriptive statistics to describe a distribution: measures of central tendency and measures of dispersion.

  • Central tendency includes mean, median and mode

  • Dispersion includes variance, standard deviation and range

R-Squared Calculation

The total variability in y splits into a part explained by the regression and a part left unexplained; these values are used to compute R^2.

R^2 can be interpreted as the percentage of the variability (variance) in the outcome that the model explains.
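A sketch of this decomposition with invented numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.9])
slope, intercept = np.polyfit(x, y, 1)   # fit the simple regression
y_hat = intercept + slope * x

ss_total = np.sum((y - y.mean()) ** 2)          # total variability in y
ss_explained = np.sum((y_hat - y.mean()) ** 2)  # explained by the regression
ss_residual = np.sum((y - y_hat) ** 2)          # left unexplained
r_squared = ss_explained / ss_total             # equivalently: 1 - ss_residual / ss_total
```

The two sums of squares add up to the total, which is exactly why R^2 is a ratio between 0 and 1.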

Pearson Correlation and R-Squared Relationship

Squaring Pearson's correlation coefficient gives the R^2 of the model, but this only holds in simple linear regression, where there is exactly one independent and one dependent variable.
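This identity is easy to check numerically (the BMI/glucose-like pairs below are invented):

```python
import numpy as np

x = np.array([18.0, 22.0, 26.0, 30.0, 34.0, 38.0])
y = np.array([95.0, 104.0, 110.0, 121.0, 125.0, 138.0])

r = np.corrcoef(x, y)[0, 1]              # Pearson correlation
slope, intercept = np.polyfit(x, y, 1)   # simple linear regression
y_hat = intercept + slope * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
# in simple linear regression, r**2 and the model's R^2 coincide
```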

Confidence Intervals and Statistical Tests

Compute a confidence interval and perform a hypothesis test to quantify the uncertainty of the estimates.

Hypothesis Testing
  • Null Hypothesis (H_0): No effect of BMI on glucose (the slope coefficient equals zero).

  • Alternative Hypothesis (H_1): Slope coefficient is different from zero.

Model Result Interpretation
  • BMI is statistically significant (p < 0.001).

  • For one unit change in BMI, blood glucose changes by 1.11 units.

Multivariable Linear Regression

Extend the simple LR model by adding more variables.

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + …

Where:

  • \beta_0 is the intercept.

  • \beta_1 is the slope coefficient for x_1.

  • \beta_2 is the slope coefficient for x_2.

  • x_1 and x_2 are independent variables.

Interpretation of Slope Coefficient

The slope coefficient reflects the change in the outcome variable (blood glucose) for a one-unit change in the independent variable, when all other independent variables remain constant. This is why regression models are commonly used: we can interpret each coefficient in isolation while holding all the other variables stable.

The slope coefficient for age tells us how much the blood glucose level changes for a one-year change in age, when all the other independent variables remain constant.
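A sketch of such a fit with simulated data (the coefficients 1.0 for BMI and 0.5 for age are invented for this example, not taken from the Pima data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
bmi = rng.normal(32, 6, n)
age = rng.normal(35, 10, n)
# simulated glucose: intercept 60, BMI coefficient 1.0, age coefficient 0.5
glucose = 60 + 1.0 * bmi + 0.5 * age + rng.normal(0, 10, n)

X = np.column_stack([np.ones(n), bmi, age])     # column of ones = intercept
coef, *_ = np.linalg.lstsq(X, glucose, rcond=None)
# coef[1]: change in glucose per BMI unit, holding age constant
# coef[2]: change in glucose per year of age, holding BMI constant
```

With enough observations, the estimated coefficients recover the values used in the simulation up to sampling noise.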

Multivariate vs. Multivariable Regression
  • Multivariable regression: Multiple independent variables, one dependent variable.

  • Multivariate regression: Multiple dependent variables (multiple outcomes).

Model Evaluation

Adding additional variables to the regression models will always increase R squared, because even if two variables are completely unrelated, adding them to the model will still by chance explain some of the variability in the outcome.

This is why we modify R squared to account for how complex the model is (adjusted R squared).

Adjusted R-Squared

Takes into account the number of independent variables in the model and the number of observations. It penalizes added coefficients, so that a new variable must increase R^2 by more than chance alone would for the adjusted value to rise.

Characteristics:
  • It is always smaller than or equal to R^2.

  • It can also become negative.
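The usual formula is \bar{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}, with n observations and p independent variables. A sketch (the R^2 inputs besides the 0.0625 from above are invented):

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# modest R^2 on a large sample (532 women, 1 predictor): small penalty
adj_large = adjusted_r_squared(0.0625, 532, 1)
# tiny R^2 on a small sample with many predictors: can go negative
adj_small = adjusted_r_squared(0.01, 20, 5)
```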

Linear Regression Assumptions

Five assumptions:

  1. Linearity: The regression model is linear in its parameters.

  2. Zero Mean Error: Errors are expected to be zero on average.

  3. Homoscedasticity: The variability of the errors is constant across all values of the fitted outcome.

  4. Independence: Errors are uncorrelated (observations are independent).

  5. Normality: Errors are approximately normally distributed.

Visualization of Residuals/Errors

Scatter Plot of Residuals vs. Fitted Values
  • X-axis: Fitted values (estimated outcome).

  • Y-axis: Residuals (errors).

We look for a homogeneous cloud around the zero line: the variability of the errors should be constant.

If one observation has an error of minus 40, another should have an error of around plus 40; summed over all observations, the errors should equal zero.
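This balancing is not a coincidence: for any OLS fit that includes an intercept, the residuals sum to zero. A quick check with invented toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.5, 5.5, 8.5, 9.0])
slope, intercept = np.polyfit(x, y, 1)   # OLS fit with intercept
residuals = y - (intercept + slope * x)
# positive and negative errors cancel: residuals.sum() is (numerically) zero
```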

Histogram of Residuals
  • Count how many errors fall into each size bin and visualize the counts as a histogram.

  • Check for a normal distribution.

    • We can overlay a normal density to check that the errors are approximately normally distributed and centered around zero.

QQ (Quantile-Quantile) Plot

We sort the errors from smallest to largest, convert them to quantiles, and plot these sample quantiles against the corresponding quantiles of a normal distribution.

We want to see the dots on or very close to the diagonal line.
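The coordinates behind a QQ plot can be built by hand; a sketch with simulated, well-behaved residuals (the plotting positions (i + 0.5)/n are one common convention):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
residuals = rng.normal(0, 1, 200)   # simulated normally distributed residuals

sample_q = np.sort(residuals)       # empirical quantiles
n = len(residuals)
# matching standard-normal quantiles at plotting positions (i + 0.5) / n
theor_q = np.array([NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)])
# for normal residuals, the (theor_q, sample_q) points lie close
# to a straight diagonal line
```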

With these diagnostics, we can recognize when a data set does not fulfill the regression model's requirements.

Applications in Data Science Tasks

  • It is in principle possible to use regression models for descriptive questions as well, but a research question of the form "I want to estimate an association between two variables in a descriptive way" is probably not something you would want to be doing.

  • In prediction models, regression models estimate the best predictors for the outcome variable (e.g., blood glucose level) without needing causal assumptions.

  • In causal inference with observational studies, the total causal effect is determined with the help of causal diagrams (DAGs), which identify the potential/relevant confounders.

Regression Models, DAGs, Terminology
  • If a variable Z causes both the exposure A and the outcome, adjusting for Z in the model gives the coefficient of A as an unbiased measure of the total causal effect we are interested in.

  • If Z is a confounder, we include it for adjustment but do not interpret its own coefficient.

  • If Z is a mediator, so that part of the causal effect goes through Z, the interpretation of the coefficients changes drastically: the slope coefficient of the variable of interest (A) is only the direct causal effect. Combining the direct effect of A with the part that goes through Z gives the total causal effect.

Knowing the underlying causal structure is important to figure out what coefficient do you actually have to look at.

The big takeaway is that we can use different measures (correlation, regression models) to quantify the relationship between two variables. Because fields such as statistics and epidemiology use the same methods under different vocabulary, the papers you read may be hard to decipher; we encourage you to stick to the terminology presented on the slides of this lecture.

Model Parsimony

We have a preference for less complex models. The more complex we make a model, the better it will explain the outcome, but also the harder it becomes to interpret. We should always strive to keep models simple and only add variables that are absolutely necessary, for example because our DAG tells us to add them.

Extensions of Regression Models

Binary outcomes: logistic or log binomial regression.
Time-to-event data: Cox models.