Chapter 6: Two-Variable Data Analysis
In this chapter, we consider techniques of data analysis for two-variable (bivariate) quantitative data.
A scatterplot is a two-dimensional graph of ordered pairs.
We put one variable (explanatory variable) on the horizontal axis and the other (response variable) on the vertical axis.
The explanatory variable is also called the independent variable, and the response variable is called the dependent variable.
The two variables are positively associated if higher than average values for one variable are generally paired with higher than average values of the other variable.
They are negatively associated if higher than average values for one variable tend to be paired with lower than average values of the other variable.
Two variables are linearly related to the extent that their relationship can be modeled by a line.
The first statistic we use to quantify a linear relationship is the Pearson product-moment correlation or, more simply, the correlation coefficient, denoted by the letter r.
It is a measure of the strength of the linear relationship between two variables as well as an indicator of the direction of the linear relationship.
For a sample of n paired data points (x, y), r is given by:
r = [1/(n − 1)] Σ [(xᵢ − x̄)/sₓ][(yᵢ − ȳ)/s_y],
where x̄ and ȳ are the sample means and sₓ and s_y are the sample standard deviations of x and y.
If r = −1 or r = 1, the points all lie on a line.
If r > 0, it indicates that the variables are positively associated.
If r < 0, it indicates that the variables are negatively associated.
If r = 0, it indicates that there is no linear association that would allow us to predict y from x.
r depends only on how the values are paired, not on which member of each pair is treated as x and which as y; swapping the roles of the two variables leaves r unchanged.
r does not depend on the units of measurement.
r is not resistant to extreme values because its calculation is based on the means and standard deviations.
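As a concrete illustration, here is a minimal Python sketch that computes r directly from the formula above. The data are hypothetical (hours-studied and exam-score values invented for illustration), not the study data used in the examples below.

```python
import numpy as np

# Hypothetical paired data: hours studied (x) and exam score (y)
x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 5.0])
y = np.array([62, 68, 71, 75, 83, 90])

n = len(x)
# Convert each value to a z-score using the sample mean and sample standard deviation
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# r is the sum of the products of the z-scores, divided by (n - 1)
r = np.sum(zx * zy) / (n - 1)
print(round(r, 3))   # agrees with np.corrcoef(x, y)[0, 1]
```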
Just because two things seem to go together does not mean that one caused the other—some third variable may be influencing them both.
A regression line is a line that can be used for predicting response values from explanatory values.
If we wanted to predict the score of a person that studied for 2.75 hours, we could use a regression line.
The least-squares regression line (LSRL) is the line that minimizes the sum of squared errors.
Let ŷ = the predicted value of y for a given x; then (y − ŷ) = the error in the prediction.
To make the predictions as accurate as possible overall, we choose the line that minimizes Σ(y − ŷ)².
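The minimizing slope and intercept have a closed form. Here is a minimal sketch, again using hypothetical data, showing the closed-form least-squares estimates and confirming that they match what np.polyfit returns:

```python
import numpy as np

# Hypothetical paired data (x = hours studied, y = exam score)
x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 5.0])
y = np.array([62, 68, 71, 75, 83, 90])

# Closed-form least-squares estimates that minimize sum((y - y_hat)^2):
#   slope b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   intercept a = y_bar - b * x_bar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# np.polyfit minimizes the same criterion, so it should agree
slope, intercept = np.polyfit(x, y, 1)
print(round(a, 3), round(b, 3))
print(round(intercept, 3), round(slope, 3))
```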
Example:
Given below is the data from a study that looked at hours studied versus score on an exam:
Here, ŷ = 59.03 + 6.77x
If we want to predict the score of someone that studied 2.75 hours, we plug this value into the above equation.
ŷ = 59.03 + 6.77(2.75)=77.63
The formal name for (y - ŷ) is the residual.
Note that the order is always “actual” − “predicted”.
A positive residual means that the prediction was too small and a negative residual means that the prediction was too large.
If a line is an appropriate model, we would expect to find the residuals more or less randomly scattered about 0 (for an LSRL, the mean of the residuals is always 0).
A pattern of residuals that does not appear to be more or less randomly distributed about 0 is evidence that a line is not a good model for the data.
In the graph on the left, the line appears to fit the data well, and the residual plot on the right shows no obvious pattern, so the line is a good model for the data.
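A residual plot like the one described here can be produced in a few lines. This is a sketch only, using the same hypothetical data as above rather than any data from this chapter:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data and a fitted least-squares line
x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 5.0])
y = np.array([62, 68, 71, 75, 83, 90])
slope, intercept = np.polyfit(x, y, 1)

y_hat = intercept + slope * x
residuals = y - y_hat            # actual minus predicted

# Residual plot: residuals versus x, with a reference line at 0
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Hours studied")
plt.ylabel("Residual (y - y_hat)")
plt.show()
```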
Predicting a value of y from a value of x is called interpolation if we are predicting from an x-value within the range of the observed x-values.
It is called extrapolation if we are predicting from an x-value outside that range.
The proportion of the total variability in y that is explained by the regression of y on x is called the coefficient of determination, symbolized by r².
SST = total sum of squares, Σ(y − ȳ)². It represents the total error from using ȳ as the prediction for every observation.
SSE = sum of squared errors, Σ(y − ŷ)². It represents the error that remains when the LSRL is used for prediction.
SST − SSE represents the benefit of using the regression line rather than ȳ for prediction, so r² = (SST − SSE)/SST.
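These quantities are straightforward to compute directly. A minimal sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data and fitted LSRL
x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 5.0])
y = np.array([62, 68, 71, 75, 83, 90])
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)   # total error using y_bar as the prediction
sse = np.sum((y - y_hat) ** 2)      # error remaining after using the LSRL
r_squared = (sst - sse) / sst       # proportion of variability in y explained
print(round(r_squared, 3))
```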
Outliers and Influential Observations
An outlier lies outside of the general pattern of the data.
An outlier can influence the correlation and may also exert an influence on the slope of the regression line.
An influential observation is one that has a strong influence on the regression model.
Often, the most influential points are extreme in the x -direction. We call these points high-leverage points.
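A quick, hypothetical demonstration of leverage: fitting the line with and without a single point that is extreme in the x-direction shows how much one such point can pull the slope.

```python
import numpy as np

# Hypothetical data following a clear linear pattern
x = np.array([1.0, 2.0, 2.5, 3.0, 4.0, 5.0])
y = np.array([62, 68, 71, 75, 83, 90])
slope_before, _ = np.polyfit(x, y, 1)

# Add one high-leverage point: extreme in x and far from the existing pattern
x2 = np.append(x, 15.0)
y2 = np.append(y, 60.0)
slope_after, _ = np.polyfit(x2, y2, 1)

print(round(slope_before, 2), round(slope_after, 2))   # the slope changes noticeably
```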
There are many two-variable relationships that are nonlinear.
The path of an object thrown in the air is parabolic (quadratic).
Population tends to grow exponentially.
We can often transform such data so that the transformed data are well modeled by a line, for example by taking logarithms or by raising one of the variables to a power.
Example 1:
The number of a certain type of bacteria present after a given number of hours is shown in the following chart. What would be the predicted number of bacteria after 3.75 hours?
Solution:
The scatterplot and residual plot for these data come out as follows:
The line is NOT a good model for the data.
Now, if we take the natural log (ln) of the number of bacteria:
The scatterplot and residual plot for the transformed data come out as follows:
Now the scatterplot is more linear and the residual plot no longer has any distinct pattern.
The regression equation of the transformed data is:
ln(Number) = −0.047 + 0.586(Hours)
Therefore, after 3.75 hours,
ln(Number) = −0.047 + 0.586(3.75) = 2.15.
Now, to get back to the original variable Number, we undo the log by exponentiating:
Number = e^2.15 ≈ 8.6.
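The transformation and back-transformation can also be written out in code. This sketch simply evaluates the fitted equation above at 3.75 hours and then exponentiates to undo the log:

```python
import math

# Fitted model on the transformed data: ln(Number) = -0.047 + 0.586 * Hours
intercept, slope = -0.047, 0.586

hours = 3.75
ln_number = intercept + slope * hours   # about 2.15
number = math.exp(ln_number)            # undo the log: about 8.6 (original units)
print(round(ln_number, 2), round(number, 1))
```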
Example 2:
A researcher finds that the LSRL for predicting GPA based on average hours studied per week is
GPA = 1.75 + 0.11(hours studied). Interpret the slope of the regression line in the context of the problem.
Solution:
A student who studies one hour more per week than another student is predicted to have a GPA that is 0.11 points higher.
Example 3:
What is the regression equation for predicting weight from height in the following computer printout, and what is the correlation between height and weight?
Solution:
Weight = -104.64 + 3.4715(Height)
r^2 = 84.8% = 0.848.
Since the slope of the regression line (3.4715) is positive, r is the positive square root of 0.848: r = √0.848 ≈ 0.921.
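As a quick check of this calculation, a computer printout typically reports r² and the slope, and r is recovered as the square root of r² with the sign of the slope:

```python
import math

r_squared = 0.848   # value reported in the printout
slope = 3.4715      # slope of the fitted line

# r has the same sign as the slope; here the slope is positive
r = math.copysign(math.sqrt(r_squared), slope)
print(round(r, 3))   # 0.921
```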