Correlation and Regression

Presented by: David Gangitano, PhD
Qualification: Bachelor in Dentistry

Introduction to Correlation and Regression

Focuses on situations where both variables are continuous.
Discusses data presentation techniques and methods for assessing the significance and size of associations between variables.

Scatter Plots

Used to visualize the relationship between birthweight and maternal weight prior to pregnancy.
Observations from a scatter plot indicate that heavier mothers tend to have heavier babies.
Introduces the concept of a coefficient of association.

Coefficients of Association

Properties

A value of zero for the coefficient indicates no association.
A negative value suggests that as one variable increases, the other decreases.
A positive value indicates that the variables tend to increase or decrease together.
The coefficient ranges from –1 to +1; achieving either extreme is termed 'perfect association'.

Pearson Correlation Coefficient

Definition

The Pearson correlation coefficient (r) quantifies the degree of association between two variables.
Example data: Birthweight (g) plotted against Maternal weight (kg).
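
As a rough illustration of the definition, the sketch below computes r from the sums of squares and cross-products of deviations from the means. The function name `pearson_r` and the six data points are invented for illustration; they are not the lecture's data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, invented for illustration only (not the lecture's data).
maternal_weight_kg = [48, 52, 55, 60, 63, 70]
birthweight_g = [3100, 2950, 3400, 3300, 3650, 3500]
r = pearson_r(maternal_weight_kg, birthweight_g)
print(f"r = {r:.2f}")
```

Points lying exactly on a rising straight line give r = +1, and points on a falling line give r = -1.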

Illustration

The coefficient was calculated as +0.50 for the relationship.

Interpretation of Values

0–0.15: Low correlation
0.16–0.4: Modest correlation
0.41–0.7: Moderate correlation
Above 0.7: High correlation
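
The bands above can be written as a small lookup. This is only a sketch: the function name `describe_correlation` is hypothetical, and it assumes the bands apply to the magnitude of r, so that -0.8 counts as "high". Values falling between the listed bands (for example 0.155) are assigned to the next band up.

```python
def describe_correlation(r):
    """Map a correlation coefficient to the descriptive bands listed above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)  # assumption: the bands describe magnitude, not direction
    if size <= 0.15:
        return "low"
    if size <= 0.4:
        return "modest"
    if size <= 0.7:
        return "moderate"
    return "high"

print(describe_correlation(0.50))  # prints "moderate"
```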

Significance of the Correlation Coefficient

A coefficient of 0.5 is deemed moderate.
Assessing significance involves determining the probability of observing a coefficient of this size in a sample if there is no true relationship.
Degrees of freedom (df) computed as:
$df = N - 2, \text{ where } N = 16 \text{ gives } df = 14$
Resulting significance level: p = 0.048.
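
One common way to obtain such a p-value is to convert r to a t statistic, t = r·sqrt(df) / sqrt(1 − r²), and refer it to Student's t distribution with N − 2 degrees of freedom. The sketch below assumes that approach (the notes do not say how the p-value was computed) and integrates the t density numerically so that only the standard library is needed. All function names are hypothetical.

```python
import math

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    df = n - 2
    return r * math.sqrt(df) / math.sqrt(1.0 - r * r)  # undefined for r = +/-1

def t_density(x, df):
    """Probability density of Student's t distribution."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1.0 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the density from 0 to |t|."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    area = total * h / 3.0  # P(0 < T < |t|)
    return 2.0 * (0.5 - area)

t = t_statistic(0.5, 16)
p = two_sided_p(t, 14)
print(f"t = {t:.2f}, p = {p:.3f}")
```

For r = 0.5 and N = 16 this gives a t statistic of about 2.16 and a p-value just under 0.05, close to the value quoted above.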

Types of Associations and Correlation

Categories:
- Negative correlation
- No correlation
- Non-linear correlation
- Perfect positive correlation (e.g., $r = +1$)
Visual representations include scatter plots demonstrating these correlation types.

Regression Analysis

Purpose

To model the association between two continuous variables by fitting a straight line to the data.

Variables

Y: Dependent variable (outcome) which is being predicted.
X: Independent variable (predictor) used to explain or predict Y.
Regression coefficients:
- α: Intercept (value of Y when X = 0).
- β: Slope (rate of change in Y for a unit change in X).
Error term (e) denotes random errors that are normally distributed about the predicted value:
$e \text{ assumed to follow } N(0, \text{variance})$

Estimation of Regression Coefficients

An algorithm to estimate α and β:
- Method of least squares is employed to minimize squared differences between observed values and values predicted by the model.
Example Result: Birthweight increases by 45.5 g for every 1 kg increase in maternal weight.
Conclusion for weight difference: Mothers differing by 10 kg can expect birthweight differences of 455 g.
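
For a single predictor, the least-squares estimates have a closed form: β is the sum of cross-products of deviations divided by the sum of squared deviations of X, and α is chosen so that the line passes through the point of means. A minimal sketch follows; the function name `fit_line` is hypothetical, and the four toy points lie exactly on Y = 1 + 2X so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for y = alpha + beta * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("slope is undefined when x is constant")
    beta = sxy / sxx                 # slope
    alpha = mean_y - beta * mean_x   # intercept: line passes through the means
    return alpha, beta

# Toy points on the line y = 1 + 2x, used only to check the arithmetic.
alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta)  # prints 1.0 2.0
```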

Statistical Analysis of Regression

Confidence Intervals and Statistical Tests

Standard errors and Confidence Intervals (CIs) can be calculated.
A 95% CI for the regression coefficient: (4.5, 86.5), indicating statistical significance since it excludes zero.
Further statistical testing for significance employs a t-test, yielding p = 0.03.
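
The interval and the t-test are linked through the standard error of the slope. The working below is a sketch that assumes df = 14 (N = 16, as in the correlation example), for which the two-sided 5% critical value of t is about 2.145. The notes do not state the standard error, so the value here is back-calculated from the quoted interval.

```latex
\begin{align*}
\text{95\% CI} &= \beta \pm t_{0.975,\,df} \times SE(\beta) \\
\text{half-width} &= \frac{86.5 - 4.5}{2} = 41.0 \\
SE(\beta) &\approx \frac{41.0}{2.145} \approx 19.1 \qquad (\text{assuming } df = 14) \\
t &= \frac{\beta}{SE(\beta)} \approx \frac{45.5}{19.1} \approx 2.38
\end{align*}
```

A t statistic of about 2.38 on 14 degrees of freedom corresponds to a two-sided p-value of roughly 0.03, which agrees with the quoted result.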

Predictive Use of the Model

Example calculation: For a woman weighing 50 kg:
$Y = 1124 + (45.5 \times 50) = 3399 \text{ g}$
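
The same calculation as a short function, using the intercept and slope from the example. The names `ALPHA`, `BETA`, and `predict_birthweight` are hypothetical.

```python
ALPHA = 1124.0  # intercept (g), from the example above
BETA = 45.5     # slope (g per kg of maternal weight), from the example above

def predict_birthweight(maternal_weight_kg):
    """Predicted birthweight in grams from the fitted straight line."""
    return ALPHA + BETA * maternal_weight_kg

print(predict_birthweight(50))  # prints 3399.0
```

Predictions are most trustworthy within the range of maternal weights seen in the data used to fit the line.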

Assumptions of Correlation and Regression Analysis

Assumptions include:
- Linear association between variables.
- Joint normal distributions for both variables in correlation.
- In regression, the outcome variable should be normally distributed about its predicted value (equivalently, the errors are normally distributed).
- Samples should consist of independent observations.

Addressing Violations of Assumptions

Solutions

Transformations can be applied.
Utilization of non-parametric correlation coefficients:
- Kendall (τ or tau).
- Spearman correlation coefficient (rs).
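
Spearman's coefficient can be computed as the Pearson coefficient of the ranks, with tied values sharing the mean of their ranks. The sketch below takes that approach; all function names are hypothetical and the data are toy values chosen to show a monotonic but non-linear relationship.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Toy data: y = x squared, which is monotonic but not linear.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman_rs(x, y))  # prints 1.0
```

Because only the ordering matters, the rank correlation here is exactly 1 even though the relationship is not a straight line.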

Critical Insights on Correlation and Causality

Causality cannot be established solely through the fitting of a regression line.
While prediction is possible (e.g., babies whose mothers’ weights differ by 1 kg are predicted to differ in birthweight by 45.5 g on average), one cannot conclude that a 1 kg increase in maternal weight causes a 45.5 g increase in the child’s birthweight.
Chance, bias, and confounding should all be evaluated before drawing causal conclusions.

Multivariable Analysis

Definition

Involves multiple variables predicting an outcome variable.
Utilizes various models under the umbrella of the generalized linear model.

Normal Theory Regression

Purpose and Exploration

Investigates the association between birthweight and maternal weight, while also considering other potential predictors like family income, tobacco use, or alcohol consumption.
A regression model generates an effect size (regression coefficient with its 95% CI) along with its statistical significance (p-value).

Examples

Birthweight based on maternal pre-pregnancy weight: $\text{Birthweight (g)} = 2832 + 11.3 \times \text{Maternal weight (kg)}$
- Indicates that birthweight increases by 11.3 g for every 1 kg increase in maternal weight.
Birthweight based on the number of previous pregnancies: $\text{Birth weight (g)} = 3453 + 29.5 \times \text{Number of previous pregnancies}$
- Shows birth weight increases by 29.5 g for each previous pregnancy.

Considerations in Normal Theory Regression

If maternal weight increases with the number of pregnancies, determining direct effects can become complex.
Raises the question of whether birthweight increases due solely to the increased weight of the mother or if there are distinct effects at play.

Normal Theory Regression Structure

Layout

General formula: $Y = B_0 + B_1 X_1 + B_2 X_2 + \dots + e$
- Where Y is the dependent variable, $X_1$ represents maternal weight, $X_2$ denotes the number of previous pregnancies, and e is the error term.

Specific Examples of Regression Analysis

For birth weight based on maternal weight and previous pregnancies: $\text{Birth weight (g)} = 2830 + 11.0 \times \text{Maternal weight (kg)} + 16.9 \times \text{Number of previous pregnancies}$
- The effect of maternal weight remains relatively constant, while the effect of previous pregnancies has diminished and falls to non-significance due to confounding.
To assess maternal weight impact, two hypothetical cases:
- Woman A: 50 kg, 1 previous pregnancy yields:
  $\text{Predicted birthweight}_A = 2830 + 11.0 \times 50 + 16.9 \times 1$
- Woman B: 49 kg, compared at the same number of previous pregnancies (1) so that only maternal weight differs:
  $\text{Predicted birthweight}_B = 2830 + 11.0 \times 49 + 16.9 \times 1$
- The difference between the two predictions (11.0 g) is the effect of a 1 kg difference in maternal weight after adjusting for previous pregnancies.
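
The two predictions can be checked numerically. The sketch below follows the two equations above, holding the number of previous pregnancies at 1 for both women. The function name `predict_birthweight` is hypothetical.

```python
def predict_birthweight(maternal_weight_kg, previous_pregnancies):
    """Predicted birthweight (g) from the two-predictor model above."""
    return 2830 + 11.0 * maternal_weight_kg + 16.9 * previous_pregnancies

woman_a = predict_birthweight(50, 1)
woman_b = predict_birthweight(49, 1)
print(f"A: {woman_a:.1f} g, B: {woman_b:.1f} g, difference: {woman_a - woman_b:.1f} g")
# prints "A: 3396.9 g, B: 3385.9 g, difference: 11.0 g"
```

Because the pregnancy term is identical in both predictions, it cancels, and the difference equals the maternal weight coefficient.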

Final Assessment of Multivariable Impact

The coefficient for previous pregnancies falls from 29.5 to 16.9 and loses significance after adjusting for maternal weight, which illustrates the impact of confounding.
A confounding variable is associated with both the outcome and the predictor, and can distort the observed relationship between them.
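
The effect of adjustment can be reproduced on a small artificial data set. In the sketch below the outcome depends only on `x1`, while `x2` tends to rise with `x1` but has no effect of its own. Fitted alone, `x2` appears to have an effect. Fitted together with `x1`, its coefficient drops to zero. The data, the variable names, and the helper functions are all invented for illustration. The model is fitted by solving the normal equations directly, which is adequate for a toy example but not a substitute for a statistics package.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back-substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def least_squares(rows, ys):
    """Coefficients [B0, B1, ...] for y = B0 + B1*x1 + ... via normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Toy data, invented for illustration: y depends on x1 only (y = 10 + 2 * x1),
# while x2 tends to rise with x1 but has no effect of its own.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [0, 1, 0, 2, 1, 3]
y = [10 + 2 * a for a in x1]

unadjusted = least_squares([[b] for b in x2], y)
adjusted = least_squares(list(zip(x1, x2)), y)
print("x2 alone:  slope = %.2f" % unadjusted[1])
print("x1 and x2: B1 = %.2f, B2 = %.2f" % (adjusted[1], adjusted[2]))
```

This mirrors the pattern in the birthweight example, where the coefficient for previous pregnancies shrinks once maternal weight enters the model, although in real data the adjusted coefficient rarely drops exactly to zero.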