Chapter 10 - Correlation and Regression

**10-1 Correlation**

A **correlation** exists between two variables when the values of one variable are somehow associated with the values of the other variable.

A **linear correlation** exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line. The closer |r| is to 1, the stronger the linear correlation.

The **linear correlation coefficient r** measures the strength of the linear correlation between the paired quantitative x values and y values in a sample.

Properties of r:

The value of r is always between -1 and 1 inclusive

If all values of either variable are converted to a different scale (a linear change of units), the value of r does not change

The value of r is not affected by the choice of x or y

r measures the strength of a linear relationship

r is very sensitive to outliers in the sense that a single outlier could dramatically affect its value

A p-value less than or equal to alpha supports the claim of a linear correlation

A p-value greater than alpha does not support the claim of a linear correlation

The value of r squared is the proportion of the variation in y that is explained by the linear relationship between x and y

CORRELATION DOES NOT IMPLY CAUSALITY

Common errors involving correlation: assuming that correlation implies causality, using data based on averages, and ignoring the possibility of a nonlinear relationship

A **lurking variable** is one that affects the variables being studied but is not included in the study.

If conducting a formal hypothesis test for linear correlation, use H0: ρ = 0 and Ha: ρ ≠ 0, where ρ (rho) is the population linear correlation coefficient.
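As a sketch, r can be computed directly from its defining sums; the paired data below are made up purely for illustration:

```python
import math

def linear_correlation(xs, ys):
    """Sample linear correlation coefficient r, computed from the defining sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

r = linear_correlation(x, y)  # always between -1 and 1
print(round(r, 4))            # prints 0.8528
```

Rescaling every x (say, feet to meters) leaves this value of r unchanged, and swapping the roles of x and y gives the same r, matching the properties listed above.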

**10-2 Regression**

The best-fitting line is called the regression line, and its equation is called the regression equation.

The regression equation: y hat = b0 + b1x

The regression equation algebraically describes the regression line. x is called the explanatory/predictor/independent variable, and y hat is called the response/dependent variable

The **marginal change** in a variable is the amount that it changes when the other variable changes by exactly one unit; for the regression line, the slope b1 is the marginal change in y hat when x increases by 1 unit.

An **outlier** is a point lying far away from the other data points. Paired sample data may include one or more **influential points**, which are points that strongly affect the graph of the regression line.

For a pair of sample x and y values, the **residual** is the difference between the observed sample value of y and the y value that is predicted by using the regression equation. That is, residual = observed y - predicted y = y - y hat.

A straight line satisfies the **least-squares property** if the sum of the squares of the residuals is the smallest sum possible.

A **residual plot** is a scatterplot of the (x, y) values after each of the y-coordinate values has been replaced by the residual value y - y hat
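A minimal sketch of the least-squares fit, using the standard formulas b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1·x̄ on hypothetical data:

```python
def regression_line(xs, ys):
    """Least-squares estimates b0, b1 for the line y-hat = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

# Hypothetical paired sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

b0, b1 = regression_line(x, y)
# Residuals: observed y minus predicted y-hat for each point
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
```

A residual plot is then just the (x, residual) pairs plotted; for a least-squares fit the residuals always sum to zero.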

**10-3 Prediction Intervals and Variation**

A **prediction interval** is a range of values used to estimate the value of a variable.

A **confidence interval** is a range of values used to estimate a population parameter.

The **total deviation** of (x, y) is the vertical distance y - y bar, which is the distance between the point (x, y) and the horizontal line passing through the sample mean y bar.

The **explained deviation** is the vertical distance y hat - y bar.

The **unexplained deviation** is the vertical distance y - y hat.

Total variation = explained variation + unexplained variation

The **coefficient of determination** is the proportion of the variation in y that is explained by the regression line: r^2 = explained variation / total variation
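The variation breakdown can be verified numerically; this self-contained sketch fits the least-squares line to made-up data and checks that total variation splits into explained plus unexplained variation:

```python
# Hypothetical paired sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Least-squares line y-hat = b0 + b1*x
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
     sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
y_hat = [b0 + b1 * xi for xi in x]

explained = sum((yh - my) ** 2 for yh in y_hat)          # sum of (y-hat - y-bar)^2
unexplained = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of (y - y-hat)^2
total = sum((yi - my) ** 2 for yi in y)                  # sum of (y - y-bar)^2

r_squared = explained / total   # coefficient of determination
```

Here total = explained + unexplained holds exactly, and r_squared equals the square of the linear correlation coefficient r for the same data.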

**10-4 Multiple Regression**

A **multiple regression equation** expresses a linear relationship between a response variable y and two or more predictor variables (x1, x2, ..., xk).

The **adjusted coefficient of determination** is the multiple coefficient of determination R^2 modified to account for the number of predictor variables and the sample size.

A **dummy variable** is a variable having only the values 0 and 1, used to represent the two categories of a qualitative variable
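A minimal sketch of fitting a multiple regression equation with one numeric predictor and one dummy variable, using NumPy's least-squares solver; the data are fabricated with an exact relationship so the coefficients are recovered exactly:

```python
import numpy as np

# Hypothetical data: dummy = 0 for one category, 1 for the other
x1 = np.array([1.0, 2.0, 3.0, 4.0])
dummy = np.array([0.0, 1.0, 0.0, 1.0])
y = 2 + 3 * x1 + 5 * dummy      # exact relationship, for illustration only

# Design matrix: a column of ones (intercept), then the predictors
A = np.column_stack([np.ones_like(x1), x1, dummy])
b0, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]
# Fitted equation: y-hat = b0 + b1*x1 + b2*dummy
```

The dummy coefficient b2 is the estimated shift in y hat between the two categories, holding x1 fixed.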

**10-5 Nonlinear Regression**

Not all relationships are linear; nonlinear functions such as logarithmic, power, quadratic, and exponential models can also fit sample data

To identify a good mathematical model: look for a pattern in the graph, compare values of R^2, and think about whether the model makes sense for the data.
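Comparing R^2 values between candidate models can be sketched as follows; the data are made up and deliberately quadratic, so the quadratic model should score higher than the linear one:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 2                      # made-up, clearly nonlinear data

def r_squared(y, y_hat):
    """R^2 = 1 - (unexplained variation / total variation)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Fit a straight line and a quadratic, then compare R^2 values
linear_fit = np.polyval(np.polyfit(x, y, 1), x)
quadratic_fit = np.polyval(np.polyfit(x, y, 2), x)

r2_linear = r_squared(y, linear_fit)
r2_quadratic = r_squared(y, quadratic_fit)   # essentially 1.0 here
```

A higher R^2 alone is not decisive (adding terms never lowers it), which is why the pattern in the graph and common sense about the model also matter.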
