Summarizing Numerical Associations

Association exists between two variables if the conditional distribution of one variable changes across different values of the other.
Associations can be detected in scatter plots by observing changes in the conditional distribution of the y-variable as you move along the x-axis.
Example: A scatter plot of high school graduation rate vs. poverty rate shows states with higher poverty rates tend to have lower graduation rates.
When focusing on states with low poverty rates (Poverty < 9), the conditional distribution of the graduation rate is concentrated at high values (roughly 85%-90%).
When focusing on states with high poverty rates, the conditional distribution of the graduation rate shifts downward (to the low 80s).
If the conditional distribution of the graduation rate were the same regardless of poverty rate, there would be no association.
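
The comparison of conditional distributions can be sketched in R. The data frame `states`, its column names, every value in it, and the cutoff of 15 for "high poverty" are assumptions made for illustration; only the cutoff Poverty < 9 comes from the example above.

```r
# Hypothetical values for illustration only; not the actual state data.
states <- data.frame(
  poverty    = c(6.5, 7.8, 8.4, 10.1, 11.9, 13.2, 15.0, 17.3),
  graduation = c(89.0, 87.5, 86.8, 85.0, 84.1, 82.6, 81.9, 79.4)
)

# Graduation rates conditional on low poverty (cutoff from the example above).
low <- states$graduation[states$poverty < 9]

# Graduation rates conditional on high poverty (cutoff of 15 is an assumption).
high <- states$graduation[states$poverty >= 15]

# If the two summaries differ noticeably, that suggests an association.
summary(low)
summary(high)
```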

Graphical summaries can be enhanced with numerical summaries to capture associations in numbers.
Two approaches: the correlation coefficient and the simple linear model.

The correlation coefficient (r) measures the strength and direction of the linear relationship between two variables.
Formula: $r = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$
- Where:
 - $n$ is the number of observations.
 - $x_i$ and $y_i$ are individual data points.
 - $\bar{x}$ and $\bar{y}$ are the means of the x and y variables.
 - $s_x$ and $s_y$ are the standard deviations of the x and y variables.
It is also called the Pearson correlation.
Example: For the poverty and graduation rate data, the correlation coefficient is -0.747, indicating a negative and reasonably strong linear association.
For a scatter plot with no association, the correlation coefficient is close to zero.
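
As a rough check on the formula, the correlation can be computed term by term and compared with R's built-in `cor()`. The vectors `x` and `y` below are made-up values, not the poverty and graduation data.

```r
# Hypothetical values for illustration only.
x <- c(6, 8, 10, 12, 14, 16)
y <- c(89, 88, 86, 84, 83, 80)
n <- length(x)

# Standardize each variable; sd() uses the n - 1 denominator.
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Sum of the products of standardized values, divided by n - 1.
r_manual <- sum(z_x * z_y) / (n - 1)

# Built-in Pearson correlation for comparison.
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)
```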

A simple linear model represents a line drawn on a scatter plot to summarize the linear association between two variables.
The line is defined by its slope and y-intercept.
Equation: $\hat{y} = b_0 + b_1 x$
- Where:
 - $\hat{y}$ is the predicted value of the y variable.
 - $x$ is the value of the x variable.
 - $b_0$ is the y-intercept.
 - $b_1$ is the slope.
The slope ($b_1$) and y-intercept ($b_0$) are the summary statistics.
The slope represents the predicted change in $y$ for every one-unit increase in $x$.
The y-intercept represents the predicted value of $y$ when $x$ is zero.

The least squares line is a precisely defined linear model that minimizes the sum of the squares of the residuals.
Slope: $b_1 = r \frac{s_y}{s_x}$
Intercept: $b_0 = \bar{y} - b_1 \bar{x}$
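
One way to see the minimizing property is to compute the slope and intercept from these formulas and compare the sum of squared residuals with that of slightly shifted lines. This is a sketch on made-up data; the helper `ssr()` and the shift sizes are hypothetical choices for illustration.

```r
# Hypothetical values for illustration only.
x <- c(6, 8, 10, 12, 14, 16)
y <- c(89, 88, 86, 84, 83, 80)

# Least squares slope and intercept from the formulas above.
b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

# Sum of squared residuals for an arbitrary line (hypothetical helper).
ssr <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

ssr(b0, b1)        # least squares line
ssr(b0 + 0.5, b1)  # same slope, shifted intercept: larger
ssr(b0, b1 + 0.1)  # same intercept, shifted slope: larger
```
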
The lm() function in R calculates the least squares slope and intercept.
- Syntax: lm(y ~ x, data = data_frame)
Example: Using the lm() function on the poverty and graduation rate data yields an intercept of 96.2022 and a slope of -0.8979.
- This means that for every 1 percentage point increase in the poverty rate, the graduation rate is expected to decrease by approximately 0.90 percentage points.
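
A minimal sketch of the `lm()` call follows. The data frame `states` and its columns `poverty` and `graduation` are hypothetical stand-ins for the real data, so the coefficients will not match the values quoted above.

```r
# Hypothetical values for illustration only; not the actual state data.
states <- data.frame(
  poverty    = c(6, 8, 10, 12, 14, 16),
  graduation = c(89, 88, 86, 84, 83, 80)
)

# Fit the least squares line: graduation is y, poverty is x.
fit <- lm(graduation ~ poverty, data = states)

# Named vector holding the intercept and the slope.
coef(fit)

# The same values from the slope and intercept formulas.
b1 <- cor(states$poverty, states$graduation) *
  sd(states$graduation) / sd(states$poverty)
b0 <- mean(states$graduation) - b1 * mean(states$poverty)

all.equal(unname(coef(fit)), c(b0, b1))
```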

A residual is the difference between the observed value of a data point and the value predicted by the linear model.
Formula: $e_i = y_i - \hat{y}_i$
- Where:
 - $e_i$ is the residual for the i-th observation.
 - $y_i$ is the observed value of the y variable for the i-th observation.
 - $\hat{y}_i$ is the predicted value of the y variable for the i-th observation.
Residuals indicate whether an individual observation's y-value is higher or lower than expected based on its x-value.
Example: For California, with a poverty rate of 12.8, the predicted graduation rate is 84.7. The actual graduation rate is 81.1, so the residual is -3.60908.
- This indicates that California's graduation rate is 3.6 percentage points lower than expected, given its poverty rate.
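
The California residual can be reproduced from the intercept and slope reported earlier. The variable names below are chosen for illustration; the numbers are the ones quoted in the text.

```r
# Coefficients reported for the poverty and graduation rate data.
b0 <- 96.2022
b1 <- -0.8979

# California's poverty rate and observed graduation rate.
poverty_ca    <- 12.8
graduation_ca <- 81.1

# Predicted graduation rate from the least squares line.
predicted_ca <- b0 + b1 * poverty_ca

# Residual: observed minus predicted; negative means lower than expected.
residual_ca <- graduation_ca - predicted_ca

predicted_ca
residual_ca
```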

The correlation coefficient and the least squares line are two approaches to summarizing the linear relationship between two numerical variables.
The correlation coefficient captures the strength and direction of the linear trend.
The least squares line provides an expectation for the y-value of every observation, allowing for the calculation of residuals.
Residuals express whether each observation is higher or lower than expected.