FMPH 102 Study Notes: Correlation of Two Continuous Variables
FMPH 102: Biostatistics in Public Health
Course Overview
Focus on the analysis of correlation between two continuous variables in health statistics.
Topics Covered
Inference for Means (Continuous Variables):
One sample
Two paired samples
Two independent samples
Three or more independent samples
Correlation and Linear Regression
Inference for Proportions (Binary Variables):
One sample
Two paired samples
Two independent samples
Three or more independent samples
Understanding Correlation
Definition
Correlation assesses the relationship between two continuous variables.
Key Principle: "Everybody who went to the moon has eaten chicken" illustrates that a correlation can be spurious: an association between two variables does not mean one causes the other.
Correlation of Continuous Variables
Correlation examines how two continuous variables for the same sample of individuals vary together.
Examples of Correlation Investigations:
Is Body Mass Index (BMI) correlated with total exercise time per week?
Is LDL cholesterol associated with the total amount of cholesterol in the diet?
Is blood pressure related to total sodium intake?
Properties of Correlation
Form of Correlation
Investigates whether the relationship is linear, curved, or random.
Direction of Correlation
Positive Association: Upward trend indicating that as one variable increases, the other does too.
Negative Association: Downward trend indicating that as one variable increases, the other decreases.
Flat Trend: No correlation, indicating no association between the variables.
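The direction of an association can be checked numerically. A minimal NumPy sketch (the data are made up for illustration): a perfect upward trend gives a correlation of +1, a perfect downward trend gives -1.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
up = 2 * x + 1     # increases with x: positive association
down = 10 - 3 * x  # decreases with x: negative association

r_up = np.corrcoef(x, up)[0, 1]      # close to +1: perfect upward trend
r_down = np.corrcoef(x, down)[0, 1]  # close to -1: perfect downward trend
print(r_up, r_down)
```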
Strength of Correlation
Measured by how closely data points adhere to the trend line or curve in a scatterplot.
Example: each person in the sample contributes a pair of values, e.g., person A has one value of variable x and one value of variable y; the scatterplot plots these pairs.
Visualization and Example
Jackson Study Example
Utilized a scatterplot of percentage of body fat (PBF) and BMI with n = 655 participants.
Observed a positive association between PBF and BMI:
Individuals with high PBF tend to exhibit high BMI.
Relationship appears approximately linear, though it may follow a slightly quadratic form.
Strength of Correlation
Pearson’s Correlation Coefficient (r)
Definition:
The population correlation coefficient, denoted by ρ, is estimated by the sample correlation coefficient r.
Assumptions:
Both X and Y are normally distributed.
There must be a linear association between X and Y.
Range:
Values lie between -1 and 1:
If X and Y are independent: ρ = 0 (and r ≈ 0).
Positively correlated: ρ > 0 (or r > 0).
Negatively correlated: ρ < 0 (or r < 0).
Strong correlation: |ρ| (or |r|) close to 1.
Often calculate r² (the coefficient of determination): the closer r² is to 1, the stronger the correlation.
Visualization of Pearson's Correlation Coefficients
Various scatterplots visualize correlation coefficients where:
Coefficients shown range from r = -1 (perfect negative) through r = 0 (no correlation) to r = +1 (perfect positive).
Hypothesis Testing with Pearson’s Correlation Coefficient
Process
As with means and proportions, hypothesis tests can be conducted using the correlation coefficient.
Null Hypothesis (H0): No correlation between variables X and Y (ρ = 0).
Alternative Hypothesis (HA): There is a correlation between variables X and Y (ρ ≠ 0).
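A sketch of this test in Python with SciPy on simulated data (the sample size and effect size are arbitrary assumptions): `pearsonr` returns r together with the p-value for H0: ρ = 0, and the same test statistic can be computed by hand as t = r·√(n − 2)/√(1 − r²).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 50
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)  # constructed to be positively correlated

r, p = stats.pearsonr(x, y)  # tests H0: rho = 0

# the same test statistic by hand, with n - 2 degrees of freedom
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
print(f"r = {r:.3f}, t = {t:.2f}, p = {p:.2g}")
```

A small p-value leads to rejecting H0 and concluding the variables are correlated.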
Covariance
Definition:
Covariance, denoted Cov(X, Y) or σXY, measures how random variables X and Y relate to one another, capturing the degree to which the two variables change together.
Positive Covariance: Indicates direct relationship.
Negative Covariance: Indicates inverse relationship.
Formula
Cov(X, Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1).
Covariance can range from negative to positive infinity.
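The sample covariance formula can be verified directly against NumPy's `np.cov` (toy numbers, chosen only for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 8.0])
n = len(x)

# Cov(X, Y) = sum((x_i - xbar)(y_i - ybar)) / (n - 1)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
cov_np = np.cov(x, y, ddof=1)[0, 1]
print(cov_manual, cov_np)
```

A positive result here reflects the direct relationship: larger x-values pair with larger y-values.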
Covariance Characteristics
Scale Effect:
Covariance is sensitive to the scale of the variables—the unit is the product of the units of both variables.
Adjustment Example:
If the y-values are doubled, changing (xᵢ, yᵢ) to (xᵢ, 2yᵢ), the covariance also doubles.
Important conclusion: Covariance is not standardized, meaning it retains the scale of measurement.
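The scale effect can be demonstrated directly: doubling the y-values doubles the covariance, while the unit-less correlation is unchanged. A small sketch with made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

cov_y = np.cov(x, y, ddof=1)[0, 1]
cov_2y = np.cov(x, 2 * y, ddof=1)[0, 1]  # doubles along with y

r_y = np.corrcoef(x, y)[0, 1]
r_2y = np.corrcoef(x, 2 * y)[0, 1]       # unchanged: correlation is unit-less
print(cov_y, cov_2y, r_y, r_2y)
```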
Correlation vs. Covariance
Correlation is a scaled form of covariance, making it unit-less and allowing comparison across different datasets.
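This scaling relationship, r = Cov(X, Y) / (sX · sY), can be checked numerically (simulated data, arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = x + rng.normal(size=30)

# correlation = covariance divided by both sample standard deviations
r_manual = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
r_np = np.corrcoef(x, y)[0, 1]
print(r_manual, r_np)
```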
Assumptions of Pearson's Correlation Coefficient r
Assumptions
Normal distribution of both variables X and Y.
Linear relationship between X and Y must be true.
Cases When Assumptions Fail
If assumptions do not hold, use a nonparametric (rank-based) test such as Spearman's Correlation.
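One way to check the normality assumption before choosing between Pearson and Spearman is a Shapiro-Wilk test; a sketch with simulated data (the exponential sample is deliberately non-normal, and the paired y-values are constructed for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.exponential(size=100)                 # strongly right-skewed, not normal
y = x + rng.normal(scale=0.2, size=100)       # related to x, also non-normal

w, p = stats.shapiro(x)                        # small p: reject normality
print(f"Shapiro-Wilk p = {p:.2g}")

# normality rejected, so fall back to the rank-based Spearman coefficient
rho, p_rho = stats.spearmanr(x, y)
print(f"Spearman rho = {rho:.3f}")
```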
Spearman’s Correlation Coefficient
Definition:
Spearman's correlation coefficient considers the ranks of X and Y, rather than their raw values.
Use Cases:
For non-linear relationships.
When data are not normally distributed.
Robust in various conditions.
Hypothesis for rank correlation:
Null Hypothesis (H0): ρₛ = 0 (no rank correlation).
Test statistic follows the Student's t-distribution with n − 2 degrees of freedom.
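Because Spearman works on ranks, it captures any monotone relationship even when Pearson's linearity assumption fails. A sketch (cubic data chosen only for illustration):

```python
import numpy as np
from scipy import stats

x = np.arange(1.0, 21.0)
y = x ** 3  # monotone increasing but strongly non-linear

rho, p_rho = stats.spearmanr(x, y)  # ranks of x and y agree perfectly
r, p_r = stats.pearsonr(x, y)       # linearity is violated, so r < 1
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```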
Conducting Pearson & Spearman Correlations in SPSS
Step-by-Step Process
Access the dataset of interest (e.g., weight & energy intake from Dennison.sav).
Generate scatter plots:
SPSS: Graph > Legacy Dialogs > Scatter/Dot > Simple Scatter > Define
Set Y-Axis to weight and X-Axis to energy.
Statistical Correlation Analysis:
SPSS: Analyze > Correlate > Bivariate…
Input weight and energy, choose Pearson and Spearman correlations.
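For readers without SPSS, the same bivariate analysis can be sketched in Python; the weight and energy values below are hypothetical stand-ins for the Dennison.sav variables, not the actual dataset:

```python
import numpy as np
from scipy import stats

# hypothetical data standing in for the Dennison.sav weight/energy variables
weight = np.array([60.0, 72.0, 55.0, 80.0, 68.0, 90.0, 62.0, 75.0])               # kg
energy = np.array([2100.0, 2500.0, 1900.0, 2800.0, 2300.0, 3100.0, 2000.0, 2600.0])  # kcal

r, p_r = stats.pearsonr(weight, energy)       # Pearson, as in SPSS Bivariate
rho, p_rho = stats.spearmanr(weight, energy)  # Spearman, same dialog
print(f"Pearson r = {r:.3f} (p = {p_r:.3g}); Spearman rho = {rho:.3f} (p = {p_rho:.3g})")
```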
Examples and Consistency of Results
Case of Inconsistent Results
When testing correlation between bilirubin and creatinine:
Pearson showed a significant correlation (p = 0.004), while Spearman suggested no correlation (p = 0.76).
Conclusion: verify that the normality assumption holds before trusting a Pearson result; otherwise, rely on Spearman's coefficient.
Summary of Key Concepts
Final Insights
Correlation of continuous variables can be assessed using:
Pearson correlation coefficient (for normally distributed data)
Spearman rank correlation (for non-normal data or non-linear relationships).
Statistical tests apply to both Spearman and Pearson correlations via:
Hypothesis Testing
Effective statistical test design requires stating the assumptions, choosing methods appropriate for the underlying data, validating the results, and remembering that correlation does not imply causation.