Correlation and Regression

Presented by: David Gangitano, PhD
Qualification: Bachelor in Dentistry


Introduction to Correlation and Regression

  • Focuses on situations where both variables are continuous.

  • Discusses data presentation techniques and methods for assessing the significance and size of associations between variables.


Scatter Plots

  • Used to visualize the relationship between birthweight and maternal weight prior to pregnancy.

  • Observations from a scatter plot indicate that heavier mothers tend to have heavier babies.

  • Introduces the concept of the coefficient of association, a single number summarizing the direction and strength of the relationship.


Coefficients of Association

Properties

  • A value of zero for the coefficient indicates no association.

  • A negative value suggests that as one variable increases, the other decreases.

  • A positive value indicates that the variables tend to increase or decrease together.

  • The coefficient ranges from –1 to +1; achieving either extreme is termed 'perfect association'.


Pearson Correlation Coefficient

Definition

  • The Pearson correlation coefficient (r) quantifies the degree of association between two variables.

  • Example data: Birthweight (g) plotted against Maternal weight (kg).

Illustration

  • For the birthweight and maternal weight data, the coefficient was calculated as r = +0.50.

Interpretation of Values

  • 0–0.15: Low correlation

  • 0.16–0.4: Modest correlation

  • 0.41–0.7: Moderate correlation

  • Above 0.7: High correlation
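The definition and the interpretation bands above can be sketched in a few lines of Python. This is an illustration only: the function names `pearson_r` and `strength` and the six data points are invented for the example (the lecture's own observations are not reproduced here), and applying the bands to the absolute value of r is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label r using the bands listed above (assumed to apply to |r|)."""
    size = abs(r)
    if size <= 0.15:
        return "low"
    if size <= 0.4:
        return "modest"
    if size <= 0.7:
        return "moderate"
    return "high"

# Hypothetical data, made up for illustration only.
maternal_kg = [48, 52, 55, 60, 63, 70]
birthweight_g = [3100, 2900, 3400, 3300, 3200, 3700]

r = pearson_r(maternal_kg, birthweight_g)
print(f"r = {r:+.2f} ({strength(r)} correlation)")
```

Applied to the lecture's value, `strength(0.50)` returns "moderate".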


Significance of the Correlation Coefficient

  • A coefficient of 0.5 is deemed moderate.

  • Assessing significance involves determining the probability of observing a coefficient of this size in a sample if there is no true relationship.

  • Degrees of freedom (df) are computed as:
    $df = N - 2$, where $N = 16$ gives $df = 14$

  • Resulting significance level: p = 0.048.
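One common way to obtain such a p-value is to convert r into a t statistic, $t = r\sqrt{df / (1 - r^2)}$, and compare it with the t distribution on N - 2 degrees of freedom. The sketch below assumes that this is the test behind the figure quoted above (the notes do not say so explicitly). It uses only the standard library, so the tail area is obtained by integrating the t density numerically with Simpson's rule; the function names are hypothetical.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus twice the area under the density on [0, |t|]."""
    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1 - 2 * area)

def correlation_t(r, n):
    """t statistic and degrees of freedom for testing a correlation of zero."""
    df = n - 2
    return r * math.sqrt(df / (1 - r * r)), df

t_stat, df = correlation_t(0.5, 16)        # values from the lecture
p = two_sided_p(t_stat, df)
print(f"t = {t_stat:.2f} on {df} df, two-sided p = {p:.3f}")
```

With r = 0.50 and N = 16 this gives t of about 2.16 on 14 df and a two-sided p just under 0.05, close to the p = 0.048 quoted above. Because r is reported to two decimals, an exact match is not expected.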


Types of Associations and Correlation

  • Categories:

    • Negative correlation

    • No correlation

    • Non-linear correlation

    • Perfect positive correlation (e.g., $r = +1$)

  • Visual representations include scatter plots demonstrating these correlation types.


Regression Analysis

Purpose

  • To model the association between two continuous variables by fitting a straight line to the data.

Variables

  • Y: Dependent variable (outcome) which is being predicted.

  • X: Independent variable (predictor) used to explain or predict Y.

  • Regression coefficients:

    • α: Intercept (value of Y when X = 0).

    • β: Slope (rate of change in Y for a unit change in X).

  • Error term (e) denotes random errors that are normally distributed about the predicted value:
    e is assumed to follow $N(0, \text{variance})$


Estimation of Regression Coefficients

  • An algorithm to estimate α and β:

    • Method of least squares is employed to minimize squared differences between observed values and values predicted by the model.

  • Example Result: Birthweight increases by 45.5 g for every 1 kg increase in maternal weight.

  • Conclusion for weight difference: mothers whose weights differ by 10 kg are predicted to have babies whose birthweights differ by 455 g on average (10 × 45.5 g).
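For a single predictor, the least-squares estimates have a closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of X, and the intercept makes the line pass through the point of means. The sketch below is a minimal illustration, not the lecture's software. The name `fit_line` is hypothetical, and the data are synthetic points placed exactly on the lecture's fitted line (intercept 1124 g, slope 45.5 g per kg) so that the fit can be checked by eye.

```python
def fit_line(x, y):
    """Least-squares intercept (alpha) and slope (beta) for y = alpha + beta * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x        # line passes through (mean_x, mean_y)
    return alpha, beta

# Synthetic, noise-free data lying on the lecture's line (illustration only).
maternal_kg = [45, 50, 55, 60, 65, 70]
birthweight_g = [1124 + 45.5 * kg for kg in maternal_kg]

alpha, beta = fit_line(maternal_kg, birthweight_g)
print(f"alpha = {alpha:.1f} g, beta = {beta:.1f} g per kg")
print(f"Predicted difference for 10 kg: {10 * beta:.0f} g")
```

Real data scatter around the line, so the fitted values would not reproduce the observations exactly as they do here.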


Statistical Analysis of Regression

Confidence Intervals and Statistical Tests

  • Standard errors and Confidence Intervals (CIs) can be calculated.

  • A 95% CI for the regression coefficient: (4.5, 86.5), indicating statistical significance since it excludes zero.

  • Further statistical testing for significance employs a t-test, yielding p = 0.03.
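The link between the interval and the significance statement can be checked directly. The sketch below assumes the interval is symmetric about the estimate, as a t-based interval is; the variable names are illustrative.

```python
# 95% CI for the slope, as quoted above (g per kg of maternal weight).
ci_low, ci_high = 4.5, 86.5

slope = (ci_low + ci_high) / 2            # midpoint of a symmetric interval
half_width = (ci_high - ci_low) / 2       # critical t value times standard error
excludes_zero = ci_low > 0 or ci_high < 0

print(f"slope = {slope} g/kg, half-width = {half_width} g/kg")
print("significant at the 5% level" if excludes_zero else "not significant at the 5% level")
```

The midpoint, 45.5, matches the slope reported for the birthweight example.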

Predictive Use of the Model

  • Example calculation: For a woman weighing 50 kg:
    $Y = 1124 + (45.5 \times 50) = 3399 \text{ g}$
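The same calculation can be wrapped in a small function, which makes it easy to predict for other maternal weights. The function name and default arguments are hypothetical; the coefficients are those of the example above.

```python
def predict_birthweight(maternal_kg, alpha=1124.0, beta=45.5):
    """Predicted birthweight in grams from the fitted line Y = alpha + beta * X."""
    return alpha + beta * maternal_kg

print(predict_birthweight(50))  # 3399.0
```

Predictions are most trustworthy within the range of maternal weights actually observed; the intercept of 1124 g corresponds to a maternal weight of 0 kg, which has no practical meaning.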


Assumptions of Correlation and Regression Analysis

  • Assumptions include:

    • Linear association between variables.

    • Joint normal distributions for both variables in correlation.

    • In regression, the outcome variable is assumed to be normally distributed about its predicted value (that is, the errors are normally distributed).

    • Samples should consist of independent observations.


Addressing Violations of Assumptions

Solutions

  • Transformations can be applied.

  • Utilization of non-parametric correlation coefficients:

    • Kendall (τ or tau).

    • Spearman correlation coefficient (rs).
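Spearman's coefficient can be computed as the Pearson correlation of the ranks of the two variables, which is why it is less sensitive to non-normality and to monotone but non-linear relationships. The sketch below is illustrative: the function names are hypothetical, tied values receive the average of their ranks, and the data are made up.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up monotone but non-linear relationship: y = x cubed.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(f"Pearson r   = {pearson_r(x, y):.3f}")
print(f"Spearman rs = {spearman_rs(x, y):.3f}")
```

Here the ranks agree perfectly, so rs equals 1, while Pearson's r falls below 1 because the relationship is not a straight line.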


Critical Insights on Correlation and Causality

  • Causality cannot be established solely through the fitting of a regression line.

  • While prediction is possible (e.g., babies whose mothers’ weights differ by 1 kg are predicted to differ by 45.5 g on average), one cannot definitively state that an increase in maternal weight causes an increase of the same amount in the child’s birthweight.

  • Important considerations: chance, bias, and confounding should be evaluated to understand causality better.


Multivariable Analysis

Definition

  • Involves multiple variables predicting an outcome variable.

  • Utilizes various models under the umbrella of the generalized linear model.


Normal Theory Regression

Purpose and Exploration

  • Investigates the association between birthweight and maternal weight, while also considering other potential predictors like family income, tobacco use, or alcohol consumption.

  • A regression model generates an effect size (regression coefficient with its 95% CI) along with its statistical significance (p-value).

Examples

  1. Birthweight based on maternal pre-pregnancy weight: $\text{Birthweight (g)} = 2832 + 11.3 \times \text{Maternal weight (kg)}$

    • Indicates that birthweight increases by 11.3 g for every 1 kg increase in maternal weight.

  2. Birthweight based on the number of previous pregnancies: $\text{Birthweight (g)} = 3453 + 29.5 \times \text{Number of previous pregnancies}$

    • Shows that birthweight increases by 29.5 g for each previous pregnancy.


Considerations in Normal Theory Regression

  • If maternal weight increases with the number of pregnancies, determining direct effects can become complex.

  • Raises the question of whether birthweight increases due solely to the increased weight of the mother or if there are distinct effects at play.


Normal Theory Regression Structure

Layout

  • General formula: $Y = B_0 + B_1X_1 + B_2X_2 + \dots + e$

    • Where Y is the dependent variable, $X_1$ represents maternal weight, $X_2$ denotes the number of previous pregnancies, and e is the error term.


Specific Examples of Regression Analysis

  1. For birthweight based on maternal weight and previous pregnancies: $\text{Birthweight (g)} = 2830 + 11.0 \times \text{Maternal weight (kg)} + 16.9 \times \text{Number of previous pregnancies}$

    • The effect of maternal weight remains relatively constant, while the effect of previous pregnancies has diminished and falls to non-significance due to confounding.

  2. To assess maternal weight impact, two hypothetical cases:

    • Woman A: 50 kg, 1 previous pregnancy:
      $\text{Predicted birthweight}_A = 2830 + 11.0 \times 50 + 16.9 \times 1 = 3396.9 \text{ g}$

    • Woman B: 49 kg, with the number of previous pregnancies held at 1 so that only maternal weight differs:
      $\text{Predicted birthweight}_B = 2830 + 11.0 \times 49 + 16.9 \times 1 = 3385.9 \text{ g}$

    • The difference of 11.0 g is the effect of a 1 kg difference in maternal weight after adjusting for previous pregnancies.
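The comparison of the two hypothetical women can be reproduced in code, which also shows that the 1 kg difference in maternal weight is worth the same amount whatever number of previous pregnancies is held fixed. The function name is hypothetical; the coefficients are those of the two-predictor model above.

```python
def predicted_birthweight(maternal_kg, previous_pregnancies):
    """Prediction in grams from the two-predictor model quoted above."""
    return 2830 + 11.0 * maternal_kg + 16.9 * previous_pregnancies

woman_a = predicted_birthweight(50, 1)
woman_b = predicted_birthweight(49, 1)
print(f"A: {woman_a:.1f} g, B: {woman_b:.1f} g, difference: {woman_a - woman_b:.1f} g")

# The weight effect is the same at every fixed number of previous pregnancies.
for pregnancies in range(4):
    diff = predicted_birthweight(50, pregnancies) - predicted_birthweight(49, pregnancies)
    print(f"{pregnancies} previous pregnancies: difference = {diff:.1f} g")
```

This is what "adjusting for" a variable means in a model without interaction terms: the coefficient for maternal weight compares women who share the same value of the other predictor.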


Final Assessment of Multivariable Impact

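  • Comparing the regression coefficients before and after adjustment shows that the effect of previous pregnancies shrinks (from 29.5 g to 16.9 g per pregnancy) and loses significance once maternal weight is included, illustrating the impact of confounding.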

  • A confounding variable is associated with both the outcome and the predictor variable, and can distort the observed relationship between them.
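Confounding can be demonstrated with a small synthetic example. In the sketch below, birthweights are generated without noise from the lecture's two-predictor equation (2830 + 11.0 × weight + 16.9 × pregnancies), using made-up maternal weights and pregnancy counts that tend to rise together. Fitting previous pregnancies alone then gives an inflated slope, while fitting both predictors recovers 16.9. The function names are hypothetical, and the solver uses the normal equations with Gaussian elimination, which is adequate for a small illustration but not a recommended numerical method for real analyses.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def least_squares(predictors, y):
    """Return [B0, B1, ...] minimising squared error, via the normal equations."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up predictors: heavier mothers tend to have had more pregnancies.
weight_kg = [50, 55, 60, 65, 70, 75]
pregnancies = [0, 1, 0, 2, 1, 3]
birthweight = [2830 + 11.0 * w + 16.9 * p for w, p in zip(weight_kg, pregnancies)]

crude = least_squares([pregnancies], birthweight)
adjusted = least_squares([weight_kg, pregnancies], birthweight)
print(f"crude slope for pregnancies:    {crude[1]:.1f} g")
print(f"adjusted slope for pregnancies: {adjusted[2]:.1f} g")
print(f"adjusted slope for weight:      {adjusted[1]:.1f} g per kg")
```

The crude slope is larger because it absorbs part of the maternal weight effect. The size of the inflation here depends entirely on the invented data and is not meant to reproduce the lecture's 29.5 g.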