Correlation and Regression
Presented by: David Gangitano, PhD
Qualification: Bachelor in Dentistry
Introduction to Correlation and Regression
Focuses on situations where both variables are continuous.
Discusses data presentation techniques and methods for assessing the significance and size of associations between variables.
Scatter Plots
Used to visualize the relationship between birthweight and maternal weight prior to pregnancy.
Observations from a scatter plot indicate that heavier mothers tend to have heavier babies.
Introduces the coefficient of association, a single number that summarizes the direction and strength of such a relationship.
Coefficients of Association
Properties
A value of zero for the coefficient indicates no association.
A negative value suggests that as one variable increases, the other decreases.
A positive value indicates that the variables tend to increase or decrease together.
The coefficient ranges from –1 to +1; achieving either extreme is termed 'perfect association'.
Pearson Correlation Coefficient
Definition
The Pearson correlation coefficient (r) quantifies the degree of linear association between two continuous variables.
Example data: Birthweight (g) plotted against Maternal weight (kg).
Illustration
The coefficient was calculated as +0.50 for the relationship.
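As a rough illustration of how r is obtained, the sketch below computes the Pearson coefficient from deviations about the two means. The function name `pearson_r` and the six data pairs are hypothetical and are not the lecture's data, so the printed value is not expected to equal +0.50.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical illustrative values, NOT the lecture's data.
maternal_weight_kg = [48, 52, 55, 60, 63, 70]
birthweight_g = [2900, 3100, 3000, 3400, 3300, 3600]

r = pearson_r(maternal_weight_kg, birthweight_g)
print(round(r, 2))
```

Dividing by the square root of the two sums of squares is what keeps r between –1 and +1 regardless of the units of either variable.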
Interpretation of Values (by absolute size of r)
0–0.15: Low correlation
0.16–0.4: Modest correlation
0.41–0.7: Moderate correlation
Above 0.7: High correlation
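The bands above can be written as a small lookup. This is only a sketch: the function name `describe_correlation` is hypothetical, and applying the bands to the absolute value of r (so that –0.5 is also "moderate") is an assumption rather than something the notes state.

```python
def describe_correlation(r):
    """Map a correlation coefficient to the descriptive bands listed above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    size = abs(r)  # assumption: the bands describe strength, not direction
    if size <= 0.15:
        return "low"
    if size <= 0.4:
        return "modest"
    if size <= 0.7:
        return "moderate"
    return "high"


print(describe_correlation(0.5))
```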
Significance of the Correlation Coefficient
A coefficient of 0.5 is deemed moderate.
Assessing significance involves determining the probability of observing a coefficient of this size in a sample if there is no true relationship.
Degrees of freedom (df) computed as: df = n − 2, where n is the number of paired observations.
Resulting significance level: p = 0.048.
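The test can be sketched as follows, using the standard statistic t = r√(n − 2) / √(1 − r²) on n − 2 degrees of freedom. The sample size is not given in these notes, so n = 16 below is a hypothetical value chosen for illustration. The function names are also hypothetical, and the t-distribution tail area is approximated by Simpson's rule to keep the example within the standard library.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_for_r(r, n, steps=2000):
    """Return (t, df, p) for testing 'no true correlation' given r and n.

    Requires |r| < 1 and n > 2; steps must be even for Simpson's rule.
    """
    df = n - 2
    t = abs(r) * math.sqrt(df) / math.sqrt(1 - r * r)
    # Simpson's rule for the area under the t density between 0 and t.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    # The two tails together are whatever lies outside (-t, +t).
    return t, df, 1 - 2 * central_area


# r = 0.5 comes from the notes; n = 16 is an ASSUMED sample size.
t_stat, df, p = two_sided_p_for_r(0.5, 16)
print(df, round(t_stat, 2), round(p, 3))
```

With that assumed n, the p-value lands just under 0.05, in the same range as the p = 0.048 quoted above. A smaller sample with the same r would give a larger p-value.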
Types of Associations and Correlation
Categories:
Negative correlation
No correlation
Non-linear correlation
Perfect positive correlation (e.g., $r = +1$)
Visual representations include scatter plots demonstrating these correlation types.
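A short sketch of why the non-linear case matters: Pearson's r measures linear association only, so a perfectly predictable but curved relationship can give r near zero. The data are synthetic and the helper name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]
curve = [x * x for x in xs]      # y is fully determined by x, but not linearly
line = [2 * x + 1 for x in xs]   # exact straight line with positive slope

r_curve = pearson_r(xs, curve)
r_line = pearson_r(xs, line)
print(r_curve, r_line)
```

Here the symmetric parabola yields a coefficient of essentially zero while the straight line yields +1, which is why a scatter plot should be inspected before relying on r.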
Regression Analysis
Purpose
To model the association between two continuous variables by fitting a straight line to the data.
Variables
Y: Dependent variable (outcome) which is being predicted.
X: Independent variable (predictor) used to explain or predict Y.
Regression coefficients:
α: Intercept (value of Y when X = 0).
β: Slope (rate of change in Y for a unit change in X).
Error term (e) denotes random errors that are normally distributed about the predicted value: Y = α + βX + e
Estimation of Regression Coefficients
An algorithm to estimate α and β:
Method of least squares is employed to minimize squared differences between observed values and values predicted by the model.
Example Result: Birthweight increases by 45.5 g for every 1 kg increase in maternal weight.
Conclusion for weight difference: Babies of mothers whose weights differ by 10 kg are predicted to differ in birthweight by 455 g on average (10 × 45.5 g).
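For a single predictor, the least-squares estimates have a closed form: the slope is the sum of cross-products divided by the sum of squares of X, and the intercept makes the line pass through the two means. The sketch below applies this to synthetic, noise-free data. The intercept of 1000 g and the function name `fit_line` are hypothetical, and the slope of 45.5 is built into the data only so that the fit can be checked against the figure quoted above.

```python
def fit_line(xs, ys):
    """Least-squares intercept (alpha) and slope (beta) for y = alpha + beta * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx                  # slope: change in y per unit change in x
    alpha = mean_y - beta * mean_x    # intercept: line passes through the means
    return alpha, beta


# Synthetic data lying exactly on a line; the intercept 1000 is HYPOTHETICAL.
maternal_weight_kg = [45, 50, 55, 60, 65]
birthweight_g = [1000 + 45.5 * w for w in maternal_weight_kg]

alpha, beta = fit_line(maternal_weight_kg, birthweight_g)
print(round(beta, 1))        # slope in g per kg
print(round(beta * 10, 1))   # predicted difference for a 10 kg difference
```

Real data scatter about the line, so the fitted slope would then be an estimate with a standard error rather than an exact recovery.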
Statistical Analysis of Regression
Confidence Intervals and Statistical Tests
Standard errors and Confidence Intervals (CIs) can be calculated.
A 95% CI for the regression coefficient: (4.5, 86.5), indicating statistical significance since it excludes zero.
Further statistical testing for significance employs a t-test, yielding p = 0.03.
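The interval and the test are two views of the same estimate and standard error. The sketch below works backwards from the quoted 95% CI using a normal approximation, which is an assumption: the lecture's t-test would use t quantiles, and the degrees of freedom are not given here. The function name `summarize_ci` is hypothetical.

```python
from statistics import NormalDist


def summarize_ci(lower, upper, level=0.95):
    """Recover estimate, standard error, z statistic and two-sided p from a CI."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    estimate = (lower + upper) / 2                   # midpoint of a symmetric CI
    se = (upper - lower) / (2 * z_crit)              # half-width divided by z_crit
    z = estimate / se                                # test of 'coefficient = 0'
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return estimate, se, z, p


estimate, se, z, p = summarize_ci(4.5, 86.5)
print(estimate, round(se, 1), round(z, 2), round(p, 2))
```

Under this approximation the midpoint is 45.5 g per kg and the p-value rounds to 0.03, consistent with the figures in the notes.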
Predictive Use of the Model
Example calculation: For a woman weighing 50 kg, predicted birthweight = α + β × 50, using the estimated α and β (β = 45.5 g per kg).
Assumptions of Correlation and Regression Analysis
Assumptions include:
Linear association between variables.
Joint normal distributions for both variables in correlation.
In regression, the outcome variable must be normally distributed about its predicted value (that is, the errors are normal).
Samples should consist of independent observations.
Addressing Violations of Assumptions
Solutions
Transformations can be applied.
Non-parametric (rank-based) correlation coefficients can be used instead:
Kendall (τ or tau).
Spearman correlation coefficient (rs).
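Spearman's coefficient can be sketched as Pearson's r applied to ranks, with tied values sharing the average of their ranks. The helper names and the data below are hypothetical. The example uses a monotonic but curved relationship to show where the two coefficients differ.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rs(xs, ys):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]   # always increasing, but along a curve
print(round(pearson_r(xs, ys), 3), round(spearman_rs(xs, ys), 3))
```

Because only the ordering matters, the rank-based coefficient reaches +1 for any strictly increasing relationship, while Pearson's r falls short of +1 when the increase is not linear.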
Critical Insights on Correlation and Causality
Causality cannot be established solely through the fitting of a regression line.
While prediction is possible (e.g., babies whose mothers’ weights differ by 1 kg are predicted to differ by 45.5 g on average), one cannot definitively state that an increase in maternal weight causes an increase of the same amount in the child’s birthweight.
Important considerations: chance, bias, and confounding should be evaluated to understand causality better.
Multivariable Analysis
Definition
Involves multiple variables predicting an outcome variable.
Utilizes various models under the umbrella of the generalized linear model.
Normal Theory Regression
Purpose and Exploration
Investigates the association between birthweight and maternal weight, while also considering other potential predictors like family income, tobacco use, or alcohol consumption.
A regression model generates an effect size (regression coefficient with its 95% CI) along with its statistical significance (p-value).
Examples
Birthweight based on maternal pre-pregnancy weight:
Indicates that birthweight increases by 11.3 g for every 1 kg increase in maternal weight.
Birthweight based on the number of previous pregnancies:
Shows that birthweight increases by 29.5 g for each previous pregnancy.
Considerations in Normal Theory Regression
If maternal weight increases with the number of pregnancies, the separate effect of each predictor is hard to determine from single-predictor models.
This raises the question of whether birthweight rises with the number of pregnancies solely because those mothers are heavier, or whether previous pregnancies have a distinct effect of their own.
Normal Theory Regression Structure
Layout
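General formula: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + e$
Where $Y$ is the dependent variable (birthweight), $X_1$ represents maternal weight, $X_2$ denotes the number of previous pregnancies, $\beta_1$ and $\beta_2$ are their regression coefficients, and $e$ is the error term.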
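A model with an intercept and two predictors can be fitted by solving the least-squares normal equations. The sketch below does this with plain Gauss-Jordan elimination. All data and the coefficients used to generate them (2500, 10 and 20) are hypothetical, chosen only so that the fit can be checked. The function names are likewise illustrative.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_two_predictors(x1, x2, y):
    """Least-squares (alpha, beta1, beta2) for y = alpha + beta1*x1 + beta2*x2."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]       # leading 1 = intercept
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# HYPOTHETICAL data generated without noise from assumed coefficients.
maternal_weight_kg = [50, 55, 60, 65, 70, 52]
previous_pregnancies = [0, 1, 0, 2, 1, 3]
birthweight_g = [2500 + 10 * w + 20 * p
                 for w, p in zip(maternal_weight_kg, previous_pregnancies)]

alpha, beta1, beta2 = fit_two_predictors(
    maternal_weight_kg, previous_pregnancies, birthweight_g)
print(round(alpha, 1), round(beta1, 2), round(beta2, 2))
```

Each coefficient is interpreted as the change in the outcome per unit change in its predictor with the other predictor held fixed.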
Specific Examples of Regression Analysis
For birthweight based on maternal weight and previous pregnancies:
The effect of maternal weight remains relatively constant, while the effect of previous pregnancies diminishes and is no longer statistically significant, which points to confounding by maternal weight.
To assess maternal weight impact, two hypothetical cases:
Woman A: 50 kg, 1 previous pregnancy yields:
Woman B: 49 kg, 0 previous pregnancies but adjusted for comparison:
The difference between the two predictions indicates the additional impact of maternal weight after adjusting for previous pregnancies.
Final Assessment of Multivariable Impact
Comparing the regression coefficients before and after adjustment shows that the effect of previous pregnancies loses significance once maternal weight is included, which illustrates the impact of confounding.
A confounding variable is one that is associated with both the outcome and the predictor of interest, and it can therefore distort the observed relationship between them.
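Confounding can be illustrated with deliberately constructed data. In the sketch below, birthweight depends on maternal weight only, while maternal weight itself rises with the number of previous pregnancies. Every number is hypothetical and the function names are illustrative. The adjusted slopes use the standard two-predictor closed form based on centered sums.

```python
def centered_sum(u, v):
    """Sum of products of deviations of u and v about their means."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


def simple_slope(x, y):
    """Unadjusted least-squares slope of y on a single predictor x."""
    return centered_sum(x, y) / centered_sum(x, x)


def adjusted_slopes(x1, x2, y):
    """Slopes (beta1, beta2) from the model y = alpha + beta1*x1 + beta2*x2."""
    s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
    s1y, s2y = centered_sum(x1, y), centered_sum(x2, y)
    det = s11 * s22 - s12 * s12   # zero only if x1 and x2 are perfectly collinear
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# HYPOTHETICAL data: heavier mothers tend to have had more pregnancies,
# and birthweight is generated from maternal weight alone.
previous_pregnancies = [0, 0, 1, 1, 2, 2, 3, 3]
maternal_weight_kg = [50, 54, 53, 57, 56, 60, 59, 63]
birthweight_g = [2500 + 10 * w for w in maternal_weight_kg]

unadjusted = simple_slope(previous_pregnancies, birthweight_g)
beta_weight, beta_pregnancies = adjusted_slopes(
    maternal_weight_kg, previous_pregnancies, birthweight_g)
print(round(unadjusted, 1), round(beta_weight, 1), round(beta_pregnancies, 1))
```

On its own, the number of pregnancies appears to raise birthweight. Once maternal weight is in the model, that apparent effect disappears, because it was carried entirely by the association between pregnancies and maternal weight.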