Summarizing Numerical Associations

Association Between Two Variables

  • Association exists between two variables if the conditional distribution of one variable changes across different values of the other.
  • Associations can be detected in scatter plots by observing changes in the conditional distribution of the y-variable as you move along the x-axis.
  • Example: A scatter plot of high school graduation rate vs. poverty rate shows states with higher poverty rates tend to have lower graduation rates.
  • Among states with low poverty rates (Poverty < 9), the conditional distribution of the graduation rate is concentrated at high values (roughly 85% to 90%).
  • Among states with high poverty rates, the conditional distribution of the graduation rate shifts downward (to the low 80s).
  • If the conditional distribution of the graduation rate were the same regardless of poverty rate, there would be no association.
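The idea of comparing conditional distributions can be sketched in R. This is a minimal illustration, not the analysis behind the example above: the data frame `toy`, its columns `x` and `y`, and the cutoff of 10 are all hypothetical and invented for the sketch.

```r
# Hypothetical toy data, invented for illustration (not the state data).
toy <- data.frame(
  x = c(6, 7, 8, 9, 12, 14, 16, 18),
  y = c(90, 88, 89, 86, 84, 82, 81, 79)
)

# Split the observations into a low-x and a high-x group at an arbitrary cutoff.
toy$x_group <- ifelse(toy$x < 10, "low x", "high x")

# Summarize the conditional distribution of y within each group.
cond_summary <- tapply(toy$y, toy$x_group, summary)
print(cond_summary)

# Compare the group medians; a clear shift suggests an association.
cond_medians <- tapply(toy$y, toy$x_group, median)
print(cond_medians)
```

If the two summaries looked essentially the same, that would be consistent with no association between `x` and `y`.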

Numerical Summaries

  • Graphical summaries can be supplemented with numerical summaries that quantify an association.
  • Two approaches: the correlation coefficient and the simple linear model.

Correlation Coefficient

  • The correlation coefficient (r) measures the strength and direction of the linear relationship between two variables.
  • Formula: $r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$
    • Where:
      • $x_i$ and $y_i$ are individual data points.
      • $\bar{x}$ and $\bar{y}$ are the means of the x and y variables.
      • $s_x$ and $s_y$ are the standard deviations of the x and y variables.
  • It is also called the Pearson correlation.
  • Example: For the poverty and graduation rate data, the correlation coefficient is -0.747, indicating a negative and reasonably strong linear association.
  • For a scatter plot with no association, the correlation coefficient is close to zero.
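The correlation formula can be checked by hand in R. The sketch below standardizes each variable, averages the products with an n - 1 divisor, and compares the result with base R's `cor()`. The vectors `x` and `y` are hypothetical toy values, not the poverty and graduation data.

```r
# Hypothetical toy data, invented for illustration.
x <- c(6, 7, 8, 9, 12, 14, 16, 18)
y <- c(90, 88, 89, 86, 84, 82, 81, 79)
n <- length(x)

# Standardize each variable: subtract the mean, divide by the standard deviation.
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum the products of the standardized values and divide by n - 1.
r_manual <- sum(z_x * z_y) / (n - 1)

# cor() computes the Pearson correlation by default.
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

The two values should agree up to floating-point error, since `sd()` also uses the n - 1 divisor.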

Simple Linear Model

  • A simple linear model represents a line drawn on a scatter plot to summarize the linear association between two variables.
  • The line is defined by its slope and y-intercept.
  • Equation: $\hat{y} = b_0 + b_1 x$
    • Where:
      • $\hat{y}$ is the predicted value of the y variable.
      • $x$ is the value of the x variable.
      • $b_0$ is the y-intercept.
      • $b_1$ is the slope.
  • The slope ($b_1$) and y-intercept ($b_0$) are the summary statistics.
  • The slope represents the expected change in $y$ for every one-unit increase in $x$.
  • The y-intercept represents the predicted value of $y$ when $x$ is zero.
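The roles of the slope and intercept can be illustrated with a short R sketch. The coefficient values and the helper `predict_y()` are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Hypothetical coefficients, chosen only for illustration.
b0 <- 10   # y-intercept
b1 <- -2   # slope

# Predicted y for a given x under the linear model y-hat = b0 + b1 * x.
predict_y <- function(x) b0 + b1 * x

# At x = 0 the prediction equals the intercept.
at_zero <- predict_y(0)

# Increasing x by one unit changes the prediction by the slope.
one_unit_change <- predict_y(4) - predict_y(3)

print(c(at_zero = at_zero, one_unit_change = one_unit_change))
```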

Least Squares Line

  • The least squares line is a precisely defined linear model that minimizes the sum of the squares of the residuals.

  • Slope: b<em>1=rs</em>ysxb<em>1 = r \frac{s</em>y}{s_x}

  • Intercept: b<em>0=yˉb</em>1xˉb<em>0 = \bar{y} - b</em>1\bar{x}

  • The lm() function in R calculates the least squares slope and intercept.

    • Syntax: lm(y ~ x, data = data_frame)
  • Example: Using the lm() function on the poverty and graduation rate data yields an intercept of 96.2022 and a slope of -0.8979.

    • This means that for every 1 percentage point increase in the poverty rate, the graduation rate is expected to decrease by approximately 0.89 percentage points.
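The slope and intercept formulas can be compared against `lm()` directly. The following is a sketch on hypothetical toy data (the data frame `toy` is invented here, so its coefficients will not match the 96.2022 and -0.8979 reported above).

```r
# Hypothetical toy data, invented for illustration.
toy <- data.frame(
  x = c(6, 7, 8, 9, 12, 14, 16, 18),
  y = c(90, 88, 89, 86, 84, 82, 81, 79)
)

# Slope and intercept from the summary-statistic formulas.
r  <- cor(toy$x, toy$y)
b1 <- r * sd(toy$y) / sd(toy$x)
b0 <- mean(toy$y) - b1 * mean(toy$x)

# The same quantities from lm(); coef() returns the intercept, then the slope.
fit <- lm(y ~ x, data = toy)

print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

Both approaches should print the same pair of numbers up to floating-point error.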

Residuals

  • A residual is the difference between the observed value of a data point and the value predicted by the linear model.
  • Formula: $e_i = y_i - \hat{y}_i$
    • Where:
      • $e_i$ is the residual for the i-th observation.
      • $y_i$ is the observed value of the y variable for the i-th observation.
      • $\hat{y}_i$ is the predicted value of the y variable for the i-th observation.
  • Residuals indicate whether an individual observation's y-value is higher or lower than expected based on its x-value.
  • Example: For California, with a poverty rate of 12.8, the predicted graduation rate is 96.2022 - 0.8979 × 12.8 = 84.70908, or about 84.7. The actual graduation rate is 81.1, so the residual is 81.1 - 84.70908 = -3.60908.
    • This indicates that California's graduation rate is about 3.6 percentage points lower than expected, given its poverty rate.
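The California calculation can be reproduced in R from the coefficients and values quoted in these notes. The second half of the sketch uses a hypothetical toy data frame (`toy`, invented here) to show that, for a fitted model, observed minus fitted values match what `resid()` returns.

```r
# Coefficients and California values as quoted in the notes.
b0 <- 96.2022
b1 <- -0.8979
poverty_ca  <- 12.8
observed_ca <- 81.1

# Predicted graduation rate, then the residual (observed minus predicted).
predicted_ca <- b0 + b1 * poverty_ca
residual_ca  <- observed_ca - predicted_ca
print(c(predicted = predicted_ca, residual = residual_ca))

# Hypothetical toy data, invented for illustration.
toy <- data.frame(
  x = c(6, 7, 8, 9, 12, 14, 16, 18),
  y = c(90, 88, 89, 86, 84, 82, 81, 79)
)
fit <- lm(y ~ x, data = toy)

# Residuals computed by hand agree with resid() on the fitted model.
manual_resid <- toy$y - fitted(fit)
print(all.equal(unname(manual_resid), unname(resid(fit))))
```

A negative residual, as for California, means the observation sits below the line; a positive one means it sits above.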

Summary

  • The correlation coefficient and the least squares line are two approaches to summarizing the linear relationship between two numerical variables.
  • The correlation coefficient captures the strength and direction of the linear trend.
  • The least squares line provides an expectation for the y-value of every observation, allowing for the calculation of residuals.
  • Residuals express whether each observation is higher or lower than expected.