Scatter Plots and Correlation Notes

Objective 1: Drawing Scatter Plots

  • A scatter plot is a graph of ordered pairs used to determine if a relationship exists between two variables.
  • Types of linear relationships:
    • Positive linear relationship: As one variable increases, the other variable also tends to increase.
    • Negative linear relationship: As one variable increases, the other variable tends to decrease.
    • No linear relationship: The points show no apparent straight-line pattern.
  • No relationship: The points show no discernible pattern of any kind, so there is no apparent connection between the variables.

Example: Age and Systolic Blood Pressure

  • Data:
    • Age (x): 43, 48, 56, 61, 67, 70
    • Blood Pressure (y): 128, 120, 135, 143, 141, 152
  • Task: Draw a scatter plot for the data and determine if there appears to be a linear relationship.
  • Assessment: Determine if the relationship is positive or negative.
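
As a rough way to look at the data before judging the relationship, the sketch below draws a text-only scatter plot using just the Python standard library. The helper name `ascii_scatter` and the grid size are illustrative assumptions, not part of the notes, and the scaling assumes the x values (and the y values) are not all identical.

```python
# Text-only scatter plot of the age / blood pressure data (standard library only).
ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]


def ascii_scatter(xs, ys, width=40, height=12):
    """Return a list of text rows with '*' marking each (x, y) pair.

    Hypothetical helper for illustration; assumes max > min on both axes.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column / row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return ["|" + "".join(r) for r in grid] + ["+" + "-" * width]


plot_rows = ascii_scatter(ages, pressures)
print("\n".join(plot_rows))
```

The points climb from the lower left to the upper right, which is the pattern described above as a positive linear relationship. A plotting library such as matplotlib would normally be used for a proper figure.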

Objective 2: Computing the Correlation Coefficient

  • Population Correlation Coefficient:
    • Denoted by the Greek letter ρ (rho).
    • Computed using all possible pairs of data values (x, y) from a population.
  • Sample Correlation Coefficient:
    • Symbolized by r. Measures the strength and direction of a linear relationship between two quantitative variables.
  • Formula for Sample Correlation Coefficient (r):
    • $r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2][n \sum y^2 - (\sum y)^2]}}$
    • Where n is the number of data pairs.
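
The formula for r translates almost term by term into code. The following is a minimal sketch; the function name `pearson_r` and the toy data are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r from the computational formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    # The denominator is 0 when all x values (or all y values) are equal;
    # r is undefined in that case and this sketch does not handle it.
    return numerator / denominator


# Made-up toy data (not from the notes): y = 2x exactly, so r should be 1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```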

Example: Compute the Correlation Coefficient for Age and Systolic Blood Pressure

  • Data:
    • Age (x): 43, 48, 56, 61, 67, 70
    • Blood Pressure (y): 128, 120, 135, 143, 141, 152
    • x^2: 1849, 2304, 3136, 3721, 4489, 4900
    • y^2: 16384, 14400, 18225, 20449, 19881, 23104
    • xy: 5504, 5760, 7560, 8723, 9447, 10640
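
Summing each row of the table and substituting into the formula for r, with n = 6, gives the following worked computation (a sketch of the arithmetic, rounded to three decimal places at the end):

```latex
\begin{aligned}
\sum x &= 345, \quad \sum y = 819, \quad \sum xy = 47634, \quad
\sum x^2 = 20399, \quad \sum y^2 = 112443 \\[4pt]
r &= \frac{6(47634) - (345)(819)}
          {\sqrt{[6(20399) - 345^2]\,[6(112443) - 819^2]}} \\[4pt]
  &= \frac{285804 - 282555}{\sqrt{(3369)(3897)}}
   = \frac{3249}{\sqrt{13128993}} \approx 0.897
\end{aligned}
```

A value this close to 1 indicates a strong positive linear relationship between age and systolic blood pressure in this sample.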

Properties of the Linear Correlation Coefficient

  • The value of r will always be between -1 and 1 inclusive: -1 ≤ r ≤ 1
  • The closer r is to -1 or 1, the stronger the linear association between the variables.
  • The closer r is to 0, the weaker the linear association between the variables.
  • If the values of x and y are interchanged, the value of r will not change.
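
These properties can be checked numerically. The sketch below reuses the age and blood pressure data from the notes; the helper `pearson_r` and the made-up straight-line data `2 * a + 1` are assumptions introduced only for this illustration.

```python
import math

ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]


def pearson_r(xs, ys):
    """Sample correlation coefficient r (computational formula)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )


r_xy = pearson_r(ages, pressures)
r_yx = pearson_r(pressures, ages)  # x and y interchanged
# Hypothetical data lying exactly on a rising line, so r should be 1.
r_line = pearson_r(ages, [2 * a + 1 for a in ages])

print(math.isclose(r_xy, r_yx))  # interchanging x and y leaves r unchanged
print(-1 <= r_xy <= 1)           # r stays within [-1, 1]
print(round(r_line, 6))          # perfect positive linear association
```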

Objective 3: Hypothesis Testing for Correlation

  • Null Hypothesis (H0):
    • H0: ρ = 0 (There is no linear correlation between the two variables.)
  • Alternative Hypothesis (H1):
    • H1: ρ ≠ 0 (There is a linear correlation between the two variables.)
  • Test Value (t-statistic):
    • $t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$
  • P-value:
    • Found from the t distribution with degrees of freedom df = n - 2. Because H1 is two-sided, use the two-tailed P-value.
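
The test value and a two-tailed P-value can be sketched with the standard library alone. The function names below are illustrative assumptions, and the P-value is obtained by numerically integrating the t density with Simpson's rule, which is a swapped-in technique for this sketch rather than the table lookup the notes imply. In practice a statistics package would normally supply the t distribution.

```python
import math


def t_statistic(r, n):
    """Test value t = r / sqrt((1 - r^2) / (n - 2)) for H0: rho = 0.

    Undefined when r is exactly 1 or -1 (division by zero).
    """
    return r / math.sqrt((1 - r * r) / (n - 2))


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed P-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # approximates P(0 <= T <= |t|)
    return 1 - 2 * area


# Hypothetical inputs (r = 0.6 from n = 10 pairs), chosen only to exercise the functions.
t_value = t_statistic(0.6, 10)
p_value = two_tailed_p(t_value, df=10 - 2)
print(round(t_value, 3), round(p_value, 3))
```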

Example: Hypothesis Test for Age and Systolic Blood Pressure

  • Using a significance level of 5% (α = 0.05), test if there is a linear correlation between age and systolic blood pressure.
  • Steps:
    1. State the null and alternative hypotheses.
    2. Calculate the test statistic (t).
    3. Determine the degrees of freedom (df).
    4. Find the p-value.
    5. Make a decision based on the p-value and significance level.
    6. Draw a conclusion.
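
The six steps can be worked through as follows. This sketch assumes r ≈ 0.897, the value obtained by applying the sample correlation formula to the six (age, blood pressure) pairs, and brackets the P-value with a standard t table rather than computing it exactly.

```latex
\begin{aligned}
&\text{1. } H_0: \rho = 0 \qquad H_1: \rho \neq 0 \\[4pt]
&\text{2. } t = \frac{0.897}{\sqrt{\dfrac{1 - 0.897^2}{6 - 2}}} \approx 4.059 \\[4pt]
&\text{3. } df = 6 - 2 = 4 \\[4pt]
&\text{4. } 0.01 < P\text{-value} < 0.02 \quad \text{(two-tailed, } df = 4\text{)} \\[4pt]
&\text{5. } P\text{-value} < \alpha = 0.05 \;\Rightarrow\; \text{reject } H_0 \\[4pt]
&\text{6. There is sufficient evidence of a linear correlation between age and systolic blood pressure.}
\end{aligned}
```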

Example: Driver Age and Accidents

  • The data give the ages of drivers and the number of accidents reported for each age group in Pennsylvania for a selected year.
    • Age (x): 16, 17, 18, 19, 20, 21
    • Number of accidents (y): 6605, 8932, 8506, 7349, 6458, 5974
  • Tasks:
    • Draw a scatter plot.
    • Compute the correlation coefficient (round to three decimal places).
    • Test the claim that there is a linear correlation between ages of drivers and the number of accidents reported using α = 10%.
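
One way to carry out the last two tasks is sketched below. The helper `pearson_r` is an illustrative assumption, and the critical value 2.132 (two-tailed, df = 4, α = 0.10) is copied from a standard t table, so the decision uses the critical-value approach rather than a P-value.

```python
import math

ages = [16, 17, 18, 19, 20, 21]
accidents = [6605, 8932, 8506, 7349, 6458, 5974]


def pearson_r(xs, ys):
    """Sample correlation coefficient r (computational formula)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sxx - sx ** 2) * (n * syy - sy ** 2)
    )


n = len(ages)
r = pearson_r(ages, accidents)
t = r / math.sqrt((1 - r * r) / (n - 2))  # test value, df = n - 2 = 4

# Assumed table value: two-tailed critical t for df = 4 at alpha = 0.10.
T_CRITICAL = 2.132
reject_h0 = abs(t) > T_CRITICAL

print("r =", round(r, 3))
print("t =", round(t, 2))
print("reject H0:", reject_h0)
```

With all six age groups included, r rounds to -0.527 and |t| is about 1.24, which is below 2.132, so H0 is not rejected at α = 10%.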

Example: Driver Age and Accidents (Excluding 16-Year-Olds)

  • Calculate the correlation coefficient without the ordered pair (16, 6605).
    • Age (x): 17, 18, 19, 20, 21
    • Number of accidents (y): 8932, 8506, 7349, 6458, 5974
    • x^2: 289, 324, 361, 400, 441
    • y^2: 79780624, 72352036, 54007801, 41705764, 35688676
    • xy: 151844, 153108, 139631, 129160, 125454
  • Determine if there is a linear correlation between the ages of drivers and the number of accidents reported when 16-year-olds are left out using α = 10%.
    • Follow the same hypothesis testing steps as before.
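
A worked version of this calculation, using the column sums of the five remaining pairs (n = 5), might look like the following. The critical values ±2.353 (two-tailed, df = 3, α = 0.10) are taken from a standard t table, and t is computed from the rounded value of r.

```latex
\begin{aligned}
\sum x &= 95, \quad \sum y = 37219, \quad \sum xy = 699197, \quad
\sum x^2 = 1815, \quad \sum y^2 = 283534901 \\[4pt]
r &= \frac{5(699197) - (95)(37219)}
          {\sqrt{[5(1815) - 95^2]\,[5(283534901) - 37219^2]}} \\[4pt]
  &= \frac{3495985 - 3535805}{\sqrt{(50)(32420544)}}
   = \frac{-39820}{\sqrt{1621027200}} \approx -0.989 \\[4pt]
t &= \frac{-0.989}{\sqrt{\dfrac{1 - (-0.989)^2}{5 - 2}}} \approx -11.58,
\qquad df = 5 - 2 = 3 \\[4pt]
|t| &= 11.58 > 2.353 \;\Rightarrow\; \text{reject } H_0
\end{aligned}
```

Leaving out the 16-year-olds changes the result sharply: r moves from a moderate negative value to one close to -1, and the test now supports a negative linear correlation between driver age and the number of accidents reported.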