Module 4

Scatter Diagrams & Correlation

Response variable: The variable whose value can be explained by the value of the explanatory/predictor variable

Scatter diagram: A graph that shows the relationship between 2 quantitative variables measured on the same individual

  • Each individual in the data set is represented by a point.

  • The explanatory variable is plotted on the horizontal axis & the response variable is plotted on the vertical axis

2 variables that are linearly related are positively associated when: (positive slope)

  • Above average values of one variable are associated w/ above average values of the other variable

  • Below average values of one variable are associated w/ below average values of the other variable.

2 variables that are linearly related are negatively associated when: (negative slope)

  • Above average values of one variable are associated with below average values of the other variable

2 variables are negatively associated if:

  • Whenever the value of one variable increases, the value of the other variable decreases.

Linear correlation coefficient/Pearson product moment correlation coefficient: A measure of strength & direction of linear relation between 2 quantitative variables

The Greek letter ρ (rho) represents the population correlation coefficient

r = the sample correlation coefficient
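As a quick sketch (not part of the course materials), r can be computed directly from its definition, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²); the data here are the x, y pairs used in the least-squares regression example later in these notes:

```python
import math

def sample_correlation(x, y):
    """Pearson sample correlation coefficient r, computed from its definition."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Data from the least-squares regression example later in these notes
x = [0, 2, 3, 5, 6, 6]
y = [5.8, 5.7, 5.2, 2.8, 1.9, 2.2]
r = sample_correlation(x, y)   # strong negative linear association, about -0.95
```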

Properties of the Linear Correlation Coefficient

  1. The linear correlation coefficient is always between -1 and 1, inclusive.

-1 ≤ r ≤ 1

  2. If r = +1, then a perfect positive linear relation exists between the 2 variables.

  3. If r = -1, then a perfect negative linear relation exists between the 2 variables.

  4. The closer r is to +1, the stronger the evidence is of a positive association between the 2 variables.

  5. The closer r is to -1, the stronger the evidence is of a negative association between the 2 variables.

  6. If r is close to 0, then little to no evidence exists of a linear relation between the 2 variables.

  7. Note: r close to 0 does not imply no relation, just no linear relation.

  8. The linear correlation coefficient is a unitless measure of association. The unit of measure for x & y plays no role in the interpretation of r.

  9. The correlation coefficient is not resistant. An observation that does not follow the overall pattern of the data could affect the value of the linear correlation coefficient.

Testing for a linear relation

  1. Determine the absolute value of the correlation coefficient → the nonnegative version of it

  2. Find the critical value in Table II for the given sample size

  3. If the absolute value of the correlation coefficient is greater than the critical value, we say a linear relation exists between the two variables. Otherwise, no linear relation exists.
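The decision rule in step 3 can be sketched as a one-line function; the critical value 0.811 used below is an assumed Table II entry for a sample of size n = 6, shown for illustration only:

```python
def linear_relation_exists(r, critical_value):
    """Step 3 of the test: compare |r| against the table's critical value."""
    return abs(r) > critical_value

# Assumed Table II critical value for n = 6 (illustrative only)
critical = 0.811
linear_relation_exists(-0.95, critical)   # True: |-0.95| > 0.811
linear_relation_exists(0.40, critical)    # False: |0.40| <= 0.811
```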

Example:

The correlation between the percentage of the female population w/ a bachelor’s degree and the percentage of births to unmarried mothers since 1990 is 0.940.

r = 0.940

Does this mean that a higher percentage of females w/ bachelor’s degrees causes a higher percentage of births to unmarried mothers?

No b/c in observational studies, we can’t claim cause. In a designed experiment, we can.

2 variables can be related even though no causal relation exists between them; one way this happens is through a lurking variable

  • Lurking variable: A variable that is related to both the explanatory and response variable

Example:

Ice cream sales and crime rates have a very high correlation. Does this mean that local governments should shut down all ice cream shops?

No, the lurking variable is temperature. As air temperatures rise, both ice cream sales and crime rates rise.

Least-squares Regression

X: 0, 2, 3, 5, 6, 6

Y: 5.8, 5.7, 5.2, 2.8, 1.9, 2.2

a) Find a linear equation that relates x (the explanatory variable) and y (the response variable) by selecting two points and finding the equation of the line containing the points.

Two points: (2, 5.7) (6, 1.9)

m = (5.7 - 1.9) / (2 - 6) = -0.95

y - 5.7 = -0.95(x - 2)

y - 5.7 = -0.95x + 1.9

y = -0.95x + 7.6

b) Graph the equation on the scatter diagram

c) Use the equation to predict y if x=3.

y = -0.95(3) + 7.6

y = 4.75 ← Prediction

The difference between the observed value of y and the predicted value of y is the error, or the residual.

Using the line from the last example, and the predicted value at x=3:

residual = observed y - predicted y; the observed point at x = 3 is (3, 5.2)

residual = 5.2 - 4.75

= 0.45
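The whole worked example (two-point line, prediction at x = 3, residual at the observed point (3, 5.2)) can be checked with a short script:

```python
# Two selected points from the data
x1, y1 = 2, 5.7
x2, y2 = 6, 1.9

m = (y1 - y2) / (x1 - x2)   # slope: (5.7 - 1.9) / (2 - 6) = -0.95
b = y1 - m * x1             # y-intercept: 5.7 - (-0.95)(2) = 7.6

predicted = m * 3 + b       # -0.95(3) + 7.6 = 4.75
residual = 5.2 - predicted  # observed - predicted = 0.45
```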

Least-Squares Regression Criterion:

  • The least-squares regression line is the line that minimizes the sum of the squared errors or residuals.

  • This line minimizes the sum of the squared vertical distances between the observed values of y and those predicted by the line, ŷ (y-hat). We represent this as:

"minimizes Σ residuals²"

The Least-Squares Regression Line:

The equation of the least-squares regression line is given by

ŷ = b₁x + b₀ (b₁ → slope, b₀ → y-intercept)

where b₁ = r · (s_y / s_x) is the slope of the least-squares regression line

and b₀ = ȳ - b₁x̄ is the y-intercept of the least-squares regression line

Note: x̄ is the sample mean and s_x is the sample standard deviation of the explanatory variable x;

ȳ is the sample mean and s_y is the sample standard deviation of the response variable y.
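A sketch of computing b₁ and b₀ from these formulas for the x, y data in the example above; it uses the algebraically equivalent form b₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², which avoids computing r separately:

```python
from statistics import mean

x = [0, 2, 3, 5, 6, 6]
y = [5.8, 5.7, 5.2, 2.8, 1.9, 2.2]

x_bar, y_bar = mean(x), mean(y)

# b1 = r * (s_y / s_x), which simplifies to the ratio of sums below
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar   # y-intercept: y_bar minus slope times x_bar

# Least-squares line for this data: roughly y-hat = -0.714x + 6.55
```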

Time = 5.5272658 + 0.01155301 Depth ← regression output: y-intercept & slope

ŷ = 0.0116x + 5.5273 ← Least-squares regression line

y=mx+by=mx+b

To interpret slope:

m = change in y per 1-unit change in x

The y-intercept of the regression line is 5.5273

To interpret it, ask 2 questions:

  1. Is 0 a reasonable value for the explanatory variable?

  2. Do any observations near x=0 exist in the data set?

If the least-squares regression line is used to make predictions based on values of the explanatory variable that are much larger or smaller than the observed values, we say the researcher is working outside the scope of the model.

  • Never use a least-squares regression line to make predictions outside the scope of the model, because we can't be sure the linear relation continues to exist.

Diagnostics on the Least-squares Regression Line

  • Coefficient of determination, R²: Measures the proportion of total variation in the response variable that is explained by the least-squares regression line

  • It's a number between 0 and 1, inclusive: 0 ≤ R² ≤ 1

  • R² = 0 means the line has NO explanatory value

  • R² = 1 means the line explains all of the variation in the response variable

  • The difference between the observed value of the response variable & the mean value of the response variable: total deviation → y - ȳ

  • The difference between the predicted value of the response variable and the mean value of the response variable: explained deviation → ŷ - ȳ

  • The difference between the observed value of the response variable and the predicted value of the response variable: unexplained deviation → y - ŷ

Total Deviation= Unexplained Deviation + Explained Deviation

y - ȳ = (y - ŷ) + (ŷ - ȳ)

1= Unexplained variation/total variation + Explained variation/total variation

R² = Explained variation/total variation = 1 - Unexplained variation/total variation

To determine R2R^2 for the linear regression model, just square the value of the linear correlation coefficient.

Squaring the linear correlation coefficient to obtain the coefficient of determination works only for the least-squares linear regression model.
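For instance, applying this shortcut to the earlier bachelor's-degree example (assuming a least-squares linear regression model fits that data):

```python
r = 0.940          # correlation from the bachelor's-degree example above
r_squared = r**2   # coefficient of determination: 0.8836
# About 88.4% of the variation in the response variable is explained by the line
```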

When interpreting R², always express it as a percentage of the variation in the response variable.

For models other than least-squares linear regression, this shortcut (squaring r) does not work, and R² must be computed differently.