Biostatistics: Correlation and Regression Notes

Chapter 14: Correlation and Regression

14.1 Data

  • Dependent Variable (Y): Quantitative response variable (e.g., lung cancer mortality).
  • Independent Variable (X): Quantitative explanatory variable (e.g., per capita cigarette consumption).
  • Historical Dataset: Based on Doll's study (1955) with:
    • n = 11 countries.
    • CIG1930: Explanatory variable (cigarette consumption in 1930).
    • LUNGCA: Response variable (lung cancer mortality per 100,000) in 1950.

14.2 Scatterplots

  • Purpose: Visual representation of the relationship between X and Y.
  • Inspection Criteria:
    • Form: Is the relationship linear or non-linear?
    • Direction: Do points trend upwards (positive) or downwards (negative)?
    • Strength of Association: Do the points adhere closely to an imaginary trend line?
    • Outliers: Are there deviations from the overall pattern?
  • Example: United States highlighted in the scatterplot showing correlation between smoking and lung cancer rates.

14.3 Correlation

  • Correlation Coefficient (r):
    • Quantifies linear relationships between -1 and 1.
    • Interpretations:
      • r = 1: Perfect positive correlation.
      • r = -1: Perfect negative correlation.
      • 0 < r < 1: Positive correlation.
      • -1 < r < 0: Negative correlation.
    • The closer |r| is to 1, the stronger the correlation.
  • Calculation of r: Uses z-scores of X and Y. If both trends are in the same direction, their product is positive, indicating a positive correlation (and vice versa).
  • Visual Judgement of Correlation: Perceptions can be affected by scale; always check data scaling.

Example Calculation of r

  • A worked data table converts each X and Y value to a z-score, multiplies the paired z-scores, and averages the products (dividing by n - 1) to obtain r.
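The z-score method can be sketched in a few lines of code. The data below are made-up illustrative values, not Doll's actual 1930/1950 figures:

```python
from statistics import mean, stdev

# Hypothetical illustrative data (not Doll's actual figures)
x = [200, 350, 500, 800, 1300]   # per capita cigarette consumption
y = [10, 15, 20, 35, 45]         # lung cancer mortality per 100,000

n = len(x)
zx = [(xi - mean(x)) / stdev(x) for xi in x]
zy = [(yi - mean(y)) / stdev(y) for yi in y]

# r is the sum of paired z-score products divided by n - 1
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)
```

Because both variables trend upward together, the paired z-scores share the same sign and their products are positive, yielding a strongly positive r.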

Interpretation of Correlation

  • Direction: Positive (r > 0), Negative (r < 0), No association (r ≈ 0).
  • Strength: The closer r is to 1 or -1, the stronger the association. For example, with r = 0.737, approximately 54% of the variance in Y is explained by X (r² ≈ 0.54).
  • Reversible Relationship: Correlation doesn't depend on which variable is X or Y, unlike regression.
  • Outliers' Effects: Single outliers can significantly distort correlation values.
  • Correlation vs. Causation: Correlation does not imply causation. Example: Cholera mortality correlated with elevation, confounded by proximity to polluted water.
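The distorting effect of a single outlier is easy to demonstrate numerically. This sketch uses made-up data, not the chapter's dataset:

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Correlation as the average product of paired z-scores (divisor n - 1)."""
    zx = [(v - mean(x)) / stdev(x) for v in x]
    zy = [(v - mean(y)) / stdev(y) for v in y]
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

# Perfectly linear made-up data: r = 1
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
r_clean = pearson_r(x, y)

# One discordant point drags r down sharply
r_outlier = pearson_r(x + [6], y + [0])
```

A single added point reduces r from 1.0 to well under 0.2, which is why scatterplots should always be inspected before trusting the coefficient.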

Hypothesis Testing for Correlation

  • Hypotheses:
    • Null Hypothesis (H0): ρ = 0 (no correlation).
    • Alternative Hypothesis (Ha): ρ ≠ 0 (correlation exists).
  • Test Statistic and P-value: t = r \sqrt{n-2} / \sqrt{1 - r^2} with df = n - 2; convert t to a P-value using software or a t table.
  • Example of Hypothesis Testing: Applies this procedure to the smoking and lung cancer data to judge whether the observed correlation is statistically significant.
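Using the chapter's values (r = 0.737, n = 11), the t statistic can be computed directly. The critical value below is the standard table value for df = 9 at the two-sided 0.05 level:

```python
from math import sqrt

r, n = 0.737, 11          # values from the notes (Doll's 11 countries)
df = n - 2

# t statistic for H0: rho = 0
t_stat = r * sqrt(df) / sqrt(1 - r ** 2)

t_crit = 2.262            # t_{9, 0.975} from a t table (alpha = 0.05, two-sided)
reject_H0 = abs(t_stat) > t_crit
```

Here t ≈ 3.27 exceeds 2.262, so the null hypothesis of no correlation is rejected at the 0.05 level.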

Confidence Intervals for the Population Correlation

  • Calculation:
    • Uses the formula:
      LCL, UCL = r \pm t_{n-2,1-\alpha/2} \times SE_r
      where SE_r = \sqrt{(1 - r^2)/(n - 2)}.
  • Example Calculation: Established values lead to a 95% confidence interval indicating the range for the population correlation between cigarette consumption and lung cancer mortality.
  • Conditions for Inference:
    • Independent observations.
    • Bivariate Normality.
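A minimal sketch of the interval calculation, using the chapter's r = 0.737 and n = 11. The SE formula is the standard t-based form assumed above (SE_r = sqrt((1 - r²)/(n - 2))); note that this approximate method can produce a limit beyond 1, which is truncated:

```python
from math import sqrt

r, n = 0.737, 11
df = n - 2

# Assumed standard error of r: SE_r = sqrt((1 - r^2) / (n - 2))
se_r = sqrt((1 - r ** 2) / df)

t_crit = 2.262                      # t_{9, 0.975} from a t table
lcl = r - t_crit * se_r
ucl = min(r + t_crit * se_r, 1.0)   # a correlation cannot exceed 1
```

The lower limit comes out near 0.23, so even the conservative end of the interval indicates a positive population correlation.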

14.4 Regression

  • Purpose: Describes the relationship between the independent and dependent variables by fitting a line that predicts the average change in Y per unit change in X.
  • Best Fitting Line: Determined by minimizing the sum of squared residuals.
  • Regression Equation: \hat{Y} = a + bX where:
    • \hat{Y} = predicted value of Y.
    • a = Y-intercept.
    • b = slope of the line.
  • Slope and Intercept Calculation: Standard formulas give b = r \times (s_Y / s_X) and a = \bar{Y} - b\bar{X}, where s_X and s_Y are the sample standard deviations.
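The least-squares fit can be computed directly from the deviation sums. The data are hypothetical illustrative values:

```python
from statistics import mean

# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

xbar, ybar = mean(x), mean(y)

# Slope: sum of cross-products over sum of squared X deviations
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)

# Intercept: the least-squares line passes through (xbar, ybar)
a = ybar - b * xbar

y_hat = a + b * 4    # predicted Y at X = 4
```

With these numbers, b ≈ 1.97 and a ≈ 0.09, so the fitted line predicts Ŷ ≈ 7.97 at X = 4.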

Key Statistical Findings from Regression

  • Example Output:
    • Regression coefficients and their significance (e.g., p-value indicators).
    • R-squared values explaining variability in response (e.g., 97.9% of variability in measles cases explained by vaccination rates).
  • Analysis of Variance (ANOVA): An F test of the slope that is equivalent to the two-sided t test (F = t² in simple regression).
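The equivalence between the ANOVA F test and the t test for the slope can be verified numerically. This sketch fits a line to hypothetical data, builds both statistics from the sums of squares, and confirms F = t²:

```python
from math import sqrt
from statistics import mean

# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
n = len(x)

xbar, ybar = mean(x), mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar

fitted = [a + b * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual SS
ssr = sum((fi - ybar) ** 2 for fi in fitted)             # regression SS
mse = sse / (n - 2)                                      # mean square error

F = ssr / mse              # ANOVA F statistic (1 and n - 2 df)
t = b / sqrt(mse / sxx)    # t statistic for the slope

# In simple linear regression, F equals t squared
```

Both tests therefore always reach the same conclusion about the slope.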

Conditions for Inference with Regression

  • Requirements:
    • Linearity.
    • Independence of observations.
    • Normality at each level of X.
    • Equal variance (homoscedasticity) at each level of X.
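The conditions above are usually checked through the residuals. This sketch fits a line to hypothetical data and extracts the residuals for inspection:

```python
from statistics import mean

# Hypothetical illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

xbar, ybar = mean(x), mean(y)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
    / sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Least squares forces the residuals to sum to (essentially) zero.
# A plot of residuals vs. X should show no curvature (linearity),
# roughly constant spread (equal variance), and no extreme points.
```

A residual plot, rather than the raw scatterplot, is the standard tool for judging linearity and homoscedasticity.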