Biostatistics: Correlation and Regression Notes
Chapter 14: Correlation and Regression
14.1 Data
- Dependent Variable (Y): Quantitative response variable (e.g., lung cancer mortality).
- Independent Variable (X): Quantitative explanatory variable (e.g., per capita cigarette consumption).
- Historical Dataset: Based on Doll's study (1955) with:
- n = 11 countries.
- CIG1930: Explanatory variable (cigarette consumption in 1930).
- LUNGCA: Response variable (lung cancer mortality per 100,000) in 1950.
14.2 Scatterplots
- Purpose: Visual representation of the relationship between X and Y.
- Inspection Criteria:
- Form: Is the relationship linear or non-linear?
- Direction: Do points trend upwards (positive) or downwards (negative)?
- Strength of Association: How closely do the points follow an imaginary trend line?
- Outliers: Are there deviations from the overall pattern?
- Example: United States highlighted in the scatterplot showing correlation between smoking and lung cancer rates.
14.3 Correlation
- Correlation Coefficient (r):
- Quantifies the direction and strength of a linear relationship on a scale from -1 to 1.
- Interpretations:
- r = 1: Perfect positive correlation.
- r = -1: Perfect negative correlation.
- 0 < r < 1: Positive correlation.
- -1 < r < 0: Negative correlation.
- The closer |r| is to 1, the stronger the correlation.
- Calculation of r: Uses z-scores of X and Y: r = Σ(z_X · z_Y) / (n − 1). When a point's z-scores for X and Y share the same sign, their product is positive, contributing to a positive correlation (and vice versa).
- Visual Judgement of Correlation: Perceptions can be affected by scale; always check data scaling.
Example Calculation of r
- A data table walks through the calculation: convert each X and Y value to a z-score, multiply the paired z-scores, sum the products, and divide by n − 1 to estimate r.
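The z-score calculation can be sketched in a few lines of Python. The data here are hypothetical, purely for illustration (not Doll's 1955 values):

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # explanatory variable (X)
y = np.array([50.0, 70.0, 65.0, 90.0, 100.0])  # response variable (Y)
n = len(x)

zx = (x - x.mean()) / x.std(ddof=1)  # z-scores of X (sample SD)
zy = (y - y.mean()) / y.std(ddof=1)  # z-scores of Y

r = np.sum(zx * zy) / (n - 1)        # r = sum(z_X * z_Y) / (n - 1)
print(round(r, 3))
```

Each paired product zx·zy is positive when both values sit on the same side of their means, which is what drives r toward +1.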
Interpretation of Correlation
- Direction: Positive (r > 0), Negative (r < 0), No association (r ≈ 0).
- Strength: The closer r is to 1 or -1, the stronger the association. For example, with r = 0.737, r² ≈ 0.54, so approximately 54% of the variance in Y is explained by X.
- Reversible Relationship: Correlation doesn't depend on which variable is X or Y, unlike regression.
- Outliers' Effects: Single outliers can significantly distort correlation values.
- Correlation vs. Causation: Correlation does not imply causation. Example: Cholera mortality correlated with elevation, confounded by proximity to polluted water.
Hypothesis Testing for Correlation
- Hypotheses:
- Null Hypothesis (H0): ρ = 0 (no correlation).
- Alternative Hypothesis (Ha): ρ ≠ 0 (correlation exists).
- Test Statistic and P-value: t = r√(n − 2) / √(1 − r²) with df = n − 2; convert the test statistic to a P-value using software or a t table.
- Example of Hypothesis Testing: Applies the test to the chapter's data to judge whether the observed correlation is statistically significant.
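A minimal sketch of the test in Python, using the standard t statistic for testing ρ = 0 (the data below are hypothetical, not the chapter's dataset):

```python
import numpy as np
from scipy import stats

# Hypothetical data, purely for illustration
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([55.0, 60.0, 72.0, 70.0, 85.0, 92.0])
n = len(x)

r = np.corrcoef(x, y)[0, 1]

# H0: rho = 0.  Test statistic t = r * sqrt(n-2) / sqrt(1 - r^2), df = n - 2
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided P-value

print(f"r = {r:.3f}, t = {t_stat:.3f}, P = {p_value:.4f}")
```

`scipy.stats.pearsonr(x, y)` returns the same r and two-sided P-value in one call.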
Confidence Intervals for the Population Correlation
- Calculation:
- Uses the formula:
LCL, UCL = r \pm t_{n-2,1-\alpha/2} \cdot SE
- Example Calculation: Established values lead to a 95% confidence interval indicating the range for the population correlation between cigarette consumption and lung cancer mortality.
- Conditions for Inference:
- Independent observations.
- Bivariate Normality.
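The interval can be computed numerically. A minimal sketch, assuming the t-based standard error SE_r = √((1 − r²)/(n − 2)) (the specific SE formula is an assumption here; the notes only name SE) and reusing r = 0.737, n = 11 from the example:

```python
import numpy as np
from scipy import stats

r, n, alpha = 0.737, 11, 0.05  # values from the chapter's example

# Assumed t-based standard error of r: sqrt((1 - r^2) / (n - 2))
se_r = np.sqrt((1 - r**2) / (n - 2))
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # t_{n-2, 1-alpha/2}

lcl, ucl = r - t_crit * se_r, r + t_crit * se_r
# Note: this simple interval can spill outside [-1, 1]; Fisher's
# z transformation is the usual remedy in that case.
print(f"95% CI for rho: ({lcl:.3f}, {ucl:.3f})")
```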
14.4 Regression
- Purpose: Describes the relationship between the independent and dependent variables by fitting a line that predicts the average change in Y per unit change in X.
- Best Fitting Line: Determined by minimizing the sum of squared residuals.
- Regression Equation:
\hat{Y} = a + bX where:
- \hat{Y} = predicted value of Y.
- a = Y-intercept.
- b = slope of the line.
- Slope and Intercept Calculation: b = r(s_Y / s_X) and a = ȳ − b·x̄, where s_X, s_Y are the sample standard deviations and x̄, ȳ the sample means.
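These two formulas are all that is needed to fit the line by hand; a short Python sketch with hypothetical data (`np.polyfit` would give the same line):

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)  # slope: b = r * (s_Y / s_X)
a = y.mean() - b * x.mean()            # intercept: a = ybar - b * xbar

y_hat = a + b * x                      # predicted values
print(f"Y-hat = {a:.3f} + {b:.3f} X")
```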
Key Statistical Findings from Regression
- Example Output:
- Regression coefficients and their significance (e.g., p-value indicators).
- R-squared values explaining variability in response (e.g., 97.9% of variability in measles cases explained by vaccination rates).
- Analysis of Variance (ANOVA): Provides an equivalent test of the slope; for simple regression, the ANOVA F statistic equals the square of the t statistic (F = t²).
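The F = t² equivalence can be verified directly by partitioning the sums of squares (hypothetical data, for illustration):

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.0, 5.1, 6.9, 9.2, 10.8, 13.1])
n = len(x)

b, a = np.polyfit(x, y, 1)              # least-squares slope and intercept
y_hat = a + b * x
ss_reg = np.sum((y_hat - y.mean())**2)  # regression sum of squares (df = 1)
ss_res = np.sum((y - y_hat)**2)         # residual sum of squares (df = n - 2)

F = ss_reg / (ss_res / (n - 2))         # ANOVA F statistic for the slope
r = np.corrcoef(x, y)[0, 1]
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
print(round(F, 3), round(t**2, 3))      # F equals t^2 in simple regression
```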
Conditions for Inference with Regression
- Requirements:
- Linearity.
- Independence of observations.
- Normality at each level of X.
- Equal variance (homoscedasticity) at each level of X.
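The linearity and equal-variance conditions are usually assessed by plotting residuals against X (or against the fitted values); a minimal sketch of computing those residuals, with hypothetical data:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 4.1, 5.8, 8.3, 9.9, 12.2, 13.8, 16.1])

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

# Least-squares residuals sum to zero by construction; a residual plot
# should show no curvature (linearity) and a roughly constant spread
# across x (homoscedasticity).
print(round(residuals.sum(), 10))
```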