Summarizing Numerical Associations
Association Between Two Variables
- Association exists between two variables if the conditional distribution of one variable changes across different values of the other.
- Associations can be detected in scatter plots by observing changes in the conditional distribution of the y-variable as you move along the x-axis.
- Example: A scatter plot of high school graduation rate vs. poverty rate shows states with higher poverty rates tend to have lower graduation rates.
- Among states with low poverty rates (Poverty < 9), the conditional distribution of the graduation rate is concentrated at high values (85%-90%).
- Among states with high poverty rates, the conditional distribution of the graduation rate shifts downward (to the low 80s).
- If the conditional distribution of the graduation rate were the same at every poverty rate, there would be no association.
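One way to check this idea numerically is to split the x-variable into bands and summarize the y-variable within each band. The sketch below uses a small made-up data frame; the values, the column names `poverty` and `graduates`, and the cutoff of 9 are assumptions for illustration, not the state data described above.

```r
# Hypothetical toy data; NOT the actual state-level figures.
toy <- data.frame(
  poverty   = c(6, 7, 8, 10, 12, 14, 16, 18),
  graduates = c(90, 88, 89, 86, 85, 83, 81, 80)
)

# Split the x-variable into two bands at an illustrative cutoff of 9.
toy$band <- ifelse(toy$poverty < 9, "low poverty", "high poverty")

# Summarize the conditional distribution of y within each band.
band_means <- tapply(toy$graduates, toy$band, mean)
print(band_means)
```

If the band means differ noticeably, the conditional distribution of y is shifting with x, which suggests an association. Similar summaries in every band would be consistent with no association.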
Numerical Summaries
- Graphical summaries can be enhanced with numerical summaries to capture associations in numbers.
- Two approaches: the correlation coefficient and the simple linear model.
Correlation Coefficient
- The correlation coefficient (r) measures the strength and direction of the linear relationship between two variables.
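- Formula: $r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$
- Where:
  - $x_i$ and $y_i$ are individual data points.
  - $\bar{x}$ and $\bar{y}$ are the means of the x and y variables.
  - $s_x$ and $s_y$ are the standard deviations of the x and y variables.
  - $n$ is the number of observations.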
- It is also called the Pearson correlation.
- Example: For the poverty and graduation rate data, the correlation coefficient is -0.747, indicating a negative and reasonably strong linear association.
- For a scatter plot with no association, the correlation coefficient is close to zero.
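As a rough check on the definition, the Pearson correlation can be computed by hand (standardize each variable, multiply the pairs, sum, and divide by n - 1) and compared with R's built-in `cor()`. The vectors `x` and `y` below are made-up values used only for illustration.

```r
# Hypothetical toy data for illustration only.
x <- c(6, 7, 8, 10, 12, 14, 16, 18)
y <- c(90, 88, 89, 86, 85, 83, 81, 80)
n <- length(x)

# Standardize each variable, multiply pairwise, sum, and divide by n - 1.
# sd() uses the n - 1 denominator, which matches this form of the formula.
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# Built-in Pearson correlation for comparison.
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

The two values should agree up to floating-point error, and the negative sign reflects the downward trend in the toy data.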
Simple Linear Model
- A simple linear model represents a line drawn on a scatter plot to summarize the linear association between two variables.
- The line is defined by its slope and y-intercept.
- Equation: $\hat{y} = b_0 + b_1 x$
- Where:
  - $\hat{y}$ is the predicted value of the y variable.
  - $x$ is the value of the x variable.
  - $b_0$ is the y-intercept.
  - $b_1$ is the slope.
- The slope ($b_1$) and y-intercept ($b_0$) are the summary statistics.
- The slope represents the change in the predicted value $\hat{y}$ for every one-unit change in $x$.
- The y-intercept represents the value of $\hat{y}$ when $x$ is zero.
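The two interpretations above can be seen directly by evaluating a line at a few points. This is a minimal sketch: the coefficients and the helper `predict_line()` are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical coefficients chosen only for illustration.
b0 <- 10  # y-intercept
b1 <- 2   # slope

# Predicted value from the line: y-hat = b0 + b1 * x.
predict_line <- function(x, b0, b1) {
  b0 + b1 * x
}

# The prediction at x = 0 is the intercept.
at_zero <- predict_line(0, b0, b1)

# Increasing x by one unit changes the prediction by the slope.
one_unit_change <- predict_line(5, b0, b1) - predict_line(4, b0, b1)

print(c(at_zero = at_zero, one_unit_change = one_unit_change))
```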
Least Squares Line
- The least squares line is a precisely defined linear model that minimizes the sum of the squares of the residuals.
- Slope: $b_1 = r \frac{s_y}{s_x}$
- Intercept: $b_0 = \bar{y} - b_1 \bar{x}$
- The `lm()` function in R calculates the least squares slope and intercept.
  - Syntax: `lm(y ~ x, data = data_frame)`
- Example: Using the `lm()` function on the poverty and graduation rate data yields an intercept of 96.2022 and a slope of -0.8979.
  - This means that for every 1 percentage point increase in the poverty rate, the graduation rate is expected to decrease by approximately 0.90 percentage points.
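To see how `lm()` relates to the summary statistics, the sketch below fits a line to a small made-up data frame and recomputes the coefficients from the standard least squares identities: slope = r * s_y / s_x and intercept = mean(y) - slope * mean(x). The data values and the column names `poverty` and `graduates` are assumptions for illustration, so the output will not match the 96.2022 and -0.8979 reported for the real data.

```r
# Hypothetical toy data; NOT the actual poverty and graduation figures.
toy <- data.frame(
  poverty   = c(6, 7, 8, 10, 12, 14, 16, 18),
  graduates = c(90, 88, 89, 86, 85, 83, 81, 80)
)

# Fit the least squares line with lm().
fit <- lm(graduates ~ poverty, data = toy)
b <- coef(fit)  # named vector: "(Intercept)" and "poverty"

# Recompute the slope and intercept from summary statistics.
r <- cor(toy$poverty, toy$graduates)
slope_manual <- r * sd(toy$graduates) / sd(toy$poverty)
intercept_manual <- mean(toy$graduates) - slope_manual * mean(toy$poverty)

print(b)
print(c(intercept = intercept_manual, slope = slope_manual))
```

Both approaches should give the same coefficients up to floating-point error.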
Residuals
- A residual is the difference between the observed value of a data point and the value predicted by the linear model.
- Formula: $e_i = y_i - \hat{y}_i$
- Where:
  - $e_i$ is the residual for the i-th observation.
  - $y_i$ is the observed value of the y variable for the i-th observation.
  - $\hat{y}_i$ is the predicted value of the y variable for the i-th observation.
- Residuals indicate whether an individual observation's y-value is higher or lower than expected based on its x-value.
- Example: For California, with a poverty rate of 12.8, the predicted graduation rate is 84.7. The actual graduation rate is 81.1, so the residual is -3.60908.
- This indicates that California's graduation rate is 3.6 percentage points lower than expected, given its poverty rate.
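The California figures can be reproduced from the coefficients reported earlier (intercept 96.2022, slope -0.8979). The variable names below are illustrative; the numbers come from the text.

```r
# Coefficients reported in the text for the poverty and graduation data.
b0 <- 96.2022
b1 <- -0.8979

# California's values as given in the text.
poverty_ca  <- 12.8
observed_ca <- 81.1

# Predicted value from the line, then residual = observed - predicted.
predicted_ca <- b0 + b1 * poverty_ca
residual_ca  <- observed_ca - predicted_ca

print(round(predicted_ca, 5))
print(round(residual_ca, 5))
```

A negative residual means the observed value falls below the line. With a fitted model object, `residuals(fit)` returns this quantity for every observation at once.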
Summary
- The correlation coefficient and the least squares line are two approaches to summarizing the linear relationship between two numerical variables.
- The correlation coefficient captures the strength and direction of the linear trend.
- The least squares line provides an expectation for the y-value of every observation, allowing for the calculation of residuals.
- Residuals express whether each observation is higher or lower than expected.