# Chapter 12: Linear Regression and Correlation

## Introductory

• Bivariate data: two variable data

• Multivariate data: more than two variables

## 12.1 Linear Equations

• y = a + bx: linear regression for two variables is based on a linear equation with one independent variable.

• Independent variable: x

• Dependent variable: y

• Slope: b

• y-intercept: a

• Graph form: a straight line or linear

• B > 0: slopes to the right

• b = 0: horizontal line

• b < 0: slopes downward to the right

## 12.2 Scatter Plots

• Scatterplot: uses dots to represent values for two different numeric variables.

• Calculator steps for scatter plot

• Enter your X data into list L1 and your Y data into list L2.

• Press 2nd STATPLOT ENTER to use Plot 1. On the input screen for PLOT 1, highlight On and press ENTER. (Make sure the other plots are OFF.)

• For TYPE: highlight the first icon, the scatter plot, and press ENTER.

• For X List, enter L1 ENTER and for Ylist: L2 ENTER.

• For Mark: it does not matter which symbol you highlight, but the square is the easiest to see. Press ENTER.

• Make sure there are no other equations that could be plotted. Press Y = and clear any equations out.

• Press the ZOOM key and then the number 9 (for menu item "ZoomStat"); the calculator will fit the window to the data. You can press WINDOW to see the scaling of the axes.

• Scatterplot Direction: High values of one variable occurring with high values of the other variable or low values of one variable occurring with low values of the other variable

• Strength: Looking at how close the points are to the line

• Linear regression: shows the relationship between a dependent and independent variable(s)

## 12.3 The Regression Equation

• Least-Squares Line: You have a set of data whose scatter plot appears to "fit" a straight line

• Least-squares regression line: Helps obtain a line of best fit

• y hat: estimates value of y

• y0 – ŷ0 = ε0: error or residual

• Absolute value of a residual: measures the vertical distance between the actual value of y and the estimated value of y

• ε: the Greek letter epsilon

• Slope equation: b = r (sy / sx)

• sx = the standard deviation of the x values.

• sy = the standard deviation of the y values

• Interpretation of the Slope: “The slope of the best-fit line tells us how the dependent variable (y) changes for every one unit increase in the independent (x) variable, on average.”

• Using the Linear Regression T Test

• In the STAT list editor, enter the X data in list L1 and the Y data in list L2, paired so that the corresponding (x,y) values are next to each other in the lists.

• On the STAT TESTS menu, scroll down with the cursor to select the LinRegTTest.

• On the LinRegTTest input screen enter: Xlist: L1 ; Ylist: L2 ; Freq: 1

• On the next line, at the prompt β or ρ, highlight "≠ 0" and press ENTER

• Leave the line for "RegEq:" blank

• Highlight Calculate and press ENTER.

• Correlation coefficient (r): is numerical and provides a measure of strength and direction of the linear association between the independent variable x and the dependent variable y.

• The value of r is always between –1 and +1: –1 ≤ r ≤ 1.

• The size of the correlation r indicates the strength of the linear relationship between x and y. Values of r close to –1 or to +1 indicate a stronger linear relationship between x and y.

• If r = 0 there is likely no linear correlation. It is important to view the scatterplot, however, because data that exhibit a curved or horizontal pattern may have a correlation of 0.

• If r = 1, there is perfect positive correlation. If r = –1, there is perfect negative correlation. In both these cases, all of the original data points lie on a straight line.

• Positive correlation: A positive value of r means that when x increases, y tends to increase and when x decreases, y tends to decrease.

• Positive correlation: A negative value of r means that when x increases, y tends to decrease and when x decreases, y tends to increase

• Correlation does not imply causation

• 0 < r < 1: A scatter plot showing data with a positive correlation.

• –1 < r < 0: A scatter plot showing data with a negative correlation.

• r = 0: A scatter plot showing data with zero correlation.

• Coefficient of determination: a number between 0 and 1 that measures how well a statistical model predicts an outcome

• r^2 interpretation: when expressed as a percent, represents the percent of variation in the dependent (predicted) variable y that can be explained by variation in the independent (explanatory) variable x using the regression (best-fit) line.

• 1 - r^2 Interpretation: when expressed as a percentage, represents the percent of the variation in y that is NOT explained by variation in x using the regression line.

## 12.4 Testing the Significance of the Correlation Coefficient

• Significance of the correlation coefficient: to decide whether the linear relationship in the sample data is strong enough to use to model the relationship in the population.

• ρ: population correlation coefficient

• r: sample correlation coefficient

• Conclusion for Significant: There is sufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is significantly different from zero.

• Conclusion for Not Significant: There is insufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is not significantly different from zero.

• Null Hypothesis: H0ρ = 0

• Alternative Hypothesis: Haρ ≠ 0

• Interpreting Null Hypothesis: The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between x and y in the population.

• Interpreting Alternate Hypothesis: The population correlation coefficient IS significantly DIFFERENT FROM zero. There IS A SIGNIFICANT LINEAR RELATIONSHIP (correlation) between x and y in the population.

• To calculate the p-value using LinRegTTEST

• On the LinRegTTEST input screen, on the line prompt for β or ρ, highlight "≠ 0"

• The output screen shows the p-value on the line that reads "p ="

• p-value is less than the significance level: We reject the null hypothesis. There is sufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is significantly different from zero

• p**-value is NOT less than the significance level**: DO NOT REJECT the null hypothesis. There is insufficient evidence to conclude that there is a significant linear relationship between x and y because the correlation coefficient is NOT significantly different from zero.

## 12.6 Outliers

• Outliers: are observed data points that are far from the least squares line.

• Influential points: observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line.

• Degrees of freedom: n - 2