FMPH 102 Study Notes: Correlation of Two Continuous Variables

FMPH 102: Biostatistics in Public Health

Course Overview

  • Focus on the analysis of correlation between two continuous variables in health statistics.

Topics Covered

  • Continuous Variables Inference for Means:

    • One sample

    • Two paired samples

    • Two independent samples

    • Three or more independent samples

  • Correlation and Linear Regression

  • Binary Variables Inference for Proportions:

    • One sample

    • Two paired samples

    • Two independent samples

    • Three or more independent samples

Understanding Correlation

Definition

  • Correlation assesses the relationship between two continuous variables.

  • Key Principle: "Everybody who went to the moon has eaten chicken"—two variables can be perfectly associated without any causal link, so correlation alone can suggest spurious relationships and does not imply causation.

Correlation of Continuous Variables

  • Correlation examines how two continuous variables for the same sample of individuals vary together.

  • Examples of Correlation Investigations:

    • Is Body Mass Index (BMI) correlated with total exercise time per week?

    • Is LDL cholesterol associated with the total amount of cholesterol in the diet?

    • Is blood pressure related to total sodium intake?

Properties of Correlation

Form of Correlation
  • Investigates whether the relationship is linear, curved, or random.

Direction of Correlation
  • Positive Association: Upward trend indicating that as one variable increases, the other does too.

  • Negative Association: Downward trend indicating that as one variable increases, the other decreases.

  • Flat Trend: No correlation, indicating no association between the variables.

Strength of Correlation
  • Measured by how closely data points adhere to the trend line or curve in a scatterplot.

  • Example:

    • Each person contributes one pair of measurements: person A has a value of variable x and a value of variable y, so correlation is computed over these paired values (x, y) across the whole sample.

Visualization and Example

Jackson Study Example
  • Utilized a scatterplot of percentage of body fat (PBF) and BMI with n = 655 participants.

  • Observed a positive association between PBF and BMI:

    • Individuals with high PBF tend to exhibit high BMI.

    • Relationship appears approximately linear but likely has a quadratic form expressed as:
      y = a(x + b)²

Strength of Correlation

Pearson’s Correlation Coefficient (r)

  • Definition:

    • The population correlation coefficient, denoted corr(X, Y) = ρ, is estimated by the sample correlation corr(x, y) = r.

  • Assumptions:

    • Both X and Y are normally distributed.

    • There must be a linear association between X and Y.

  • Range:

    • Values lie between -1 and 1:

    • If X and Y are independent: ρ = 0 (and r ≈ 0).

    • Positively correlated: ρ > 0 (or r > 0).

    • Negatively correlated: ρ < 0 (or r < 0).

    • Strong correlation: |ρ| ≥ 0.5 or |r| ≥ 0.5.

      • Often calculate r²: strong correlation if r² ≥ 0.25.
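The sample correlation r and its square can be computed directly; a minimal sketch using hypothetical BMI-style paired data (the values below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements for one sample of individuals,
# e.g. variable x = BMI and variable y = percent body fat
x = np.array([22.1, 25.3, 28.7, 31.2, 24.0, 27.5])
y = np.array([18.5, 24.0, 30.1, 35.2, 21.3, 27.8])

# Pearson's r and its two-sided p-value for H0: rho = 0
r, p = stats.pearsonr(x, y)

# r^2 >= 0.25 corresponds to the |r| >= 0.5 rule of thumb above
print(f"r = {r:.3f}, r^2 = {r**2:.3f}, p = {p:.4f}")
```

With these strongly increasing paired values, r is close to 1 and r² well above 0.25.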

Visualization of Pearson's Correlation Coefficients

  • Various scatterplots visualize correlation coefficients where:

    • r = 0.9, r = −0.8, r = 0.7, r = −0.6, etc.

Hypothesis Testing with Pearson’s Correlation Coefficient

Process

  • As with means and proportions, hypothesis tests can be conducted using the correlation coefficient.

  • Null Hypothesis (H0): No correlation between variables X and Y (ρ = 0).

  • Alternative Hypothesis (HA): There is a correlation between variables X and Y (ρ ≠ 0).
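The test statistic has the form t = r / SE_r with df = n − 2, where SE_r = √((1 − r²)/(n − 2)). A minimal sketch, using hypothetical values r = 0.5 and n = 30:

```python
import numpy as np
from scipy import stats

def pearson_t_test(r, n):
    """Test H0: rho = 0 with t = r / SE_r, SE_r = sqrt((1 - r^2) / (n - 2))."""
    se_r = np.sqrt((1 - r**2) / (n - 2))
    t = r / se_r
    # Two-sided p-value from Student's t-distribution with df = n - 2
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p

# Hypothetical sample correlation and sample size
t, p = pearson_t_test(r=0.5, n=30)
print(f"t = {t:.3f}, p = {p:.4f}")
```

Here t ≈ 3.06 with df = 28, so H0 would be rejected at the 0.05 level.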

Covariance

  • Definition:

    • cov(X,Y)cov(X,Y) or SXYS_{XY} measures how random variables X and Y relate to one another, capturing the degree to which two variables change together.

  • Positive Covariance: Indicates direct relationship.

  • Negative Covariance: Indicates inverse relationship.

Formula
  • cov(X, Y) = S_XY = (1 / (N − 1)) Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ).

  • Covariance can range from negative to positive infinity.
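The sample covariance formula above can be checked by hand against NumPy, which uses the same N − 1 denominator by default (the data below are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Manual sample covariance: (1 / (N - 1)) * sum((x_i - x_bar)(y_i - y_bar))
n = len(x)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the covariance matrix; the off-diagonal entry is cov(X, Y)
assert np.isclose(s_xy, np.cov(x, y)[0, 1])
print(f"S_XY = {s_xy}")
```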

Covariance Characteristics

  • Scale Effect:

    • Covariance is sensitive to the scale of the variables—the unit is the product of the units of both variables.

  • Adjustment Example:

    • If the y-values are doubled—changing each pair (x_i, y_i) to (x_i, 2y_i)—the covariance also doubles, even though the underlying relationship is unchanged.

  • Important conclusion: Covariance is not standardized, meaning it retains the scale of measurement.

Correlation vs. Covariance

  • Correlation is a scaled form of covariance, making it unit-less and allowing comparison across different datasets.
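The contrast can be demonstrated directly: rescaling y doubles the covariance but leaves the correlation untouched. A small sketch with made-up data:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])

cov_xy = np.cov(x, y)[0, 1]        # covariance on the original scale
cov_x2y = np.cov(x, 2 * y)[0, 1]   # covariance after doubling the y-values

r_xy, _ = stats.pearsonr(x, y)       # correlation on the original scale
r_x2y, _ = stats.pearsonr(x, 2 * y)  # correlation after doubling

# Covariance scales with the units; correlation is unit-less and unchanged
print(f"cov: {cov_xy:.3f} -> {cov_x2y:.3f}; r: {r_xy:.3f} -> {r_x2y:.3f}")
```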

Assumptions of Pearson's Correlation Coefficient r

Assumptions

  • Normal distribution of both variables X and Y.

  • The relationship between X and Y must be linear.

Cases When Assumptions Fail

  • If assumptions do not hold, use a nonparametric (rank-based) test such as Spearman's Correlation.

Spearman’s Correlation Coefficient (ρ_S, r_S)

  • Definition:

    • Spearman's correlation coefficient considers the ranks of X and Y, rather than their raw values.

  • Use Cases:

    • For non-linear relationships.

    • When data are not normally distributed.

    • Robust in various conditions.

  • Hypothesis for rank correlation:

    • Null Hypothesis (H0): ρ_S = 0.

    • Test statistic follows the Student's t-distribution with df = n − 2.
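Because Spearman's coefficient works on ranks, it captures any monotone relationship, even a non-linear one. A minimal sketch contrasting the two coefficients on the curve y = x³ (data made up for illustration):

```python
import numpy as np
from scipy import stats

# Monotone but non-linear relationship: y = x^3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

rho_s, p_s = stats.spearmanr(x, y)  # computed on the ranks of x and y
r, p = stats.pearsonr(x, y)         # computed on the raw values

# The ranks of x and y agree exactly, so r_S = 1; Pearson's r stays below 1
print(f"r_S = {rho_s:.3f}, r = {r:.3f}")
```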

Conducting Pearson & Spearman Correlations in SPSS

Step-by-Step Process

  1. Access the dataset of interest (e.g., weight & energy intake from Dennison.sav).

  2. Generate scatter plots:

    • SPSS: Graph > Legacy Dialogs > Scatter/Dot > Simple Scatter > Define

      • Set Y-Axis to weight and X-Axis to energy.

  3. Statistical Correlation Analysis:

    • SPSS: Analyze > Correlate > Bivariate…

      • Input weight and energy, choose Pearson and Spearman correlations.
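Outside SPSS, the same bivariate analysis can be reproduced in Python; the `weight` and `energy` arrays below are hypothetical stand-ins for the Dennison.sav variables, not the actual dataset:

```python
import numpy as np
from scipy import stats

# Hypothetical stand-ins for the weight (kg) and energy intake (kcal/day)
# variables from Dennison.sav
weight = np.array([10.2, 11.5, 9.8, 12.1, 10.9, 11.8, 9.5, 12.5])
energy = np.array([1250, 1380, 1100, 1490, 1300, 1420, 1150, 1520])

# Equivalent of SPSS: Analyze > Correlate > Bivariate (Pearson + Spearman)
r, p_pearson = stats.pearsonr(weight, energy)
rho_s, p_spearman = stats.spearmanr(weight, energy)

print(f"Pearson  r   = {r:.3f} (p = {p_pearson:.4f})")
print(f"Spearman r_S = {rho_s:.3f} (p = {p_spearman:.4f})")
```

As in SPSS, requesting both coefficients makes it easy to check whether they tell a consistent story.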

Examples and Consistency of Results

Case of Inconsistent Results
  • When testing correlation between bilirubin and creatinine:

    • Pearson showed a significant correlation (r = 0.198, p = 0.004), while Spearman suggested no correlation (r_S = 0.021, p = 0.76).

  • Conclusion: Pearson's correlation is only trustworthy when the normality assumption has been verified; otherwise, rely on Spearman's coefficient.

Summary of Key Concepts

Final Insights

  • Correlation of continuous variables can be assessed using:

    • Pearson correlation coefficient (for normally distributed data)

    • Spearman rank correlation (for non-normal data or non-linear relationships).

  • Statistical tests apply to both Spearman and Pearson correlations via:

    • Pearson: t_stat = r / SE_r

    • Spearman: t_stat = r_S / SE_(r_S)

Hypothesis Testing

  • Effective statistical test design requires stating the assumptions, choosing a method appropriate for the underlying data, and validating the results—while remembering that correlation does not imply causation.