Chapter 6: Two-Variable Data Analysis

In this chapter, we consider techniques of data analysis for two-variable (bivariate) quantitative data.

Scatterplots

  • A scatterplot is a two-dimensional graph of ordered pairs.

  • We put one variable (explanatory variable) on the horizontal axis and the other (response variable) on the vertical axis.

  • The variable on the horizontal axis is considered the independent variable, while the variable on the vertical axis is the dependent variable.


  • The two variables are positively associated if higher than average values for one variable are generally paired with higher than average values of the other variable.

  • They are negatively associated if higher than average values for one variable tend to be paired with lower than average values of the other variable.

Correlation

  • Two variables are linearly related to the extent that their relationship can be modeled by a line.

  • The first statistic we have to quantify a linear relationship is the Pearson product-moment correlation or, more simply, the correlation coefficient, denoted by r.

  • It is a measure of the strength of the linear relationship between two variables as well as an indicator of the direction of the linear relationship.

  • For a sample of n paired data values (x, y), r is given by:

    r = (1/(n−1)) · Σ [(x − x̄)/s_x] · [(y − ȳ)/s_y]

    Equivalently, in terms of z-scores: r = (1/(n−1)) · Σ z_x z_y

Properties of “r”

  • If r = −1 or r = 1, the points all lie on a line.
  • If r > 0, it indicates that the variables are positively associated.
  • If r < 0, it indicates that the variables are negatively associated.
  • If r = 0, it indicates that there is no linear association that would allow us to predict y from x.
  • r depends only on the paired values; it does not change if we swap which variable is called x and which is called y.
  • r does not depend on the units of measurement.
  • r is not resistant to extreme values because it is based on the mean.
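To make the formula concrete, here is a minimal Python sketch (not from the chapter; the hours/scores data are hypothetical) that computes r via the z-score form:

```python
# Correlation via the z-score form: r = (1/(n-1)) * Σ z_x * z_y
# The data below are hypothetical, for illustration only.
import statistics

def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs (divisor n - 1)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]        # hypothetical hours studied
scores = [65, 70, 78, 85, 95]  # hypothetical exam scores
print(round(pearson_r(hours, scores), 3))  # ≈ 0.994: a strong positive linear association
```

Because r is built from z-scores, rescaling either variable (changing units) leaves it unchanged, which matches the properties above.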

Correlation and Causation

==Just because two things seem to go together does not mean that one caused the other—some third variable may be influencing them both.==

Linear Models

Least-Squares Regression Line

  • A line that can be used for predicting response values from explanatory values.
  • If we wanted to predict the score of a person who studied for 2.75 hours, we could use a regression line.
  • The least-squares regression line (LSRL) is the line that minimizes the sum of squared errors.

==Let ŷ = the predicted value of y for a given x.==

==Then (y − ŷ) = the error in the prediction.==

==To reduce the errors in prediction, we minimize Σ(y − ŷ)^2; the LSRL is the line that achieves this minimum.==
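As a minimal sketch of this minimization (the data here are made up, not the chapter's), the LSRL coefficients come from the standard closed-form least-squares solution:

```python
# Fit the LSRL ŷ = a + bx by least squares; the data are hypothetical.
import statistics

def lsrl(xs, ys):
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    # Slope minimizing Σ(y − ŷ)^2:  b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)^2
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar  # the LSRL always passes through (x̄, ȳ)
    return a, b

hours = [0.5, 1.0, 2.0, 3.0, 4.0]  # hypothetical hours studied
scores = [60, 65, 74, 80, 88]      # hypothetical exam scores
a, b = lsrl(hours, scores)
print(f"ŷ = {a:.2f} + {b:.2f}x")
```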

Example:

Given below is the data from a study that looked at hours studied versus score on an exam:

Here, ŷ = 59.03 + 6.77x

If we want to predict the score of someone who studied 2.75 hours, we plug this value into the above equation.

ŷ = 59.03 + 6.77(2.75) ≈ 77.65

Residuals

  • The formal name for (y − ŷ) is the residual (see the computational sketch after this list).

  • Note that the order is always “actual” − “predicted”.

  • ==A positive residual means that the prediction was too small and a negative residual means that the prediction was too large.==

  • If a line is an appropriate model, we would expect to find the residuals scattered more or less randomly about 0.

  • A pattern of residuals that does not appear to be more or less randomly distributed about 0 is evidence that a line is not a good model for the data.

  • [Figure: a scatterplot with a fitted line (left) and the corresponding residual plot (right). The residuals show no obvious pattern, so the line is a good fit for the data.]

  • Predicting a value of y from a value of x is called interpolation if the x-value lies within the range of the observed x-values.

  • It is called extrapolation if the x-value lies outside the range of the observed x-values.
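As a quick illustration of the residual calculation, the sketch below applies the worked example's line ŷ = 59.03 + 6.77x to a few hypothetical (hours, score) pairs:

```python
# Residual = actual − predicted, here using the line ŷ = 59.03 + 6.77x
# from the worked example; the data pairs are hypothetical.
data = [(1.0, 68), (2.0, 71), (3.0, 80), (4.0, 85)]

def predict(x):
    return 59.03 + 6.77 * x

residuals = [y - predict(x) for x, y in data]
# Positive residual: prediction too small; negative: prediction too large.
print([round(r, 2) for r in residuals])  # [2.2, -1.57, 0.66, -1.11]
```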

Coefficient of Determination

  • The proportion of the total variability in y that is explained by the regression of y on x is called the coefficient of determination.

  • The coefficient of determination is symbolized by r^2.

SST = the total sum of squares, Σ(y − ȳ)^2. It represents the total error from using ȳ as the basis for prediction.

SSE = the sum of squared errors, Σ(y − ŷ)^2. It represents the total error remaining when the LSRL is used for prediction.

SST − SSE represents the benefit of using the regression line rather than ȳ for prediction, so the proportion of variability explained is r^2 = (SST − SSE)/SST.
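Putting these pieces together, here is a minimal sketch (hypothetical data) that computes r^2 as (SST − SSE)/SST:

```python
# r^2 = (SST − SSE) / SST — the proportion of variability in y
# explained by the regression. All data are hypothetical.
import statistics

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [65, 70, 78, 85, 95]

x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

sst = sum((y - y_bar) ** 2 for y in ys)                    # total error using ȳ alone
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # error remaining with the LSRL
print(round((sst - sse) / sst, 3))                         # ≈ 0.988
```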

Outliers and Influential Observations

  • An outlier lies outside of the general pattern of the data.
  • An outlier can influence the correlation and may also exert an influence on the slope of the regression line.
  • An influential observation is one that has a strong influence on the regression model.
  • Often, the most influential points are extreme in the x-direction. We call these points high-leverage points (see the sketch below).
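To see why points that are extreme in the x-direction matter so much, this sketch (made-up data) fits a slope with and without a single high-leverage point:

```python
# One high-leverage point (extreme x, off the trend) can swamp the slope.
# All data are hypothetical.
import statistics

def slope(xs, ys):
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(slope(xs, ys), 2))                 # ≈ 1.96 for the main cluster

# Add one point far out in the x-direction that does not follow the trend:
print(round(slope(xs + [20], ys + [5.0]), 2))  # ≈ 0.02 — the slope collapses
```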

Transformations to Achieve Linearity

  • There are many two-variable relationships that are nonlinear.
  • The path of an object thrown in the air is parabolic (quadratic).
  • Population tends to grow exponentially.
  • We can transform such data so that the transformed data are well modeled by a line; this can be done by taking logarithms or by raising the variables to a power.

Example 1:

The number of a certain type of bacteria present after a given number of hours is given in the following chart. What would be the predicted number of bacteria after 3.75 hours?

Solution:

The scatterplot and residual plot for this data come out as follows:

The residual plot shows a distinct pattern, so a line is NOT a good model for the data.

Now if we take the natural log of the number of bacteria:

The scatterplot and residual plot for the transformed data come out as follows:

Now the scatterplot is more linear and the residual plot no longer has any distinct pattern.

The regression equation of the transformed data is:

ln(Number) = -0.047 + 0.586(Hours)

Therefore, after 3.75 hours,

ln(Number) = -0.047 + 0.586(3.75) ≈ 2.15.

Now, to get back the original variable Number, we exponentiate both sides:

Number = e^2.15 ≈ 8.6.
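The same transform-then-fit idea in code, as a minimal sketch (the counts are hypothetical, not the chapter's chart): regress ln(count) on hours, then exponentiate the prediction to return to the original scale.

```python
# Fit a line to (hours, ln(count)), then back-transform with exp().
# All data are hypothetical.
import math
import statistics

hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
counts = [1.3, 1.7, 2.4, 3.1, 4.2, 5.6]
log_counts = [math.log(c) for c in counts]

h_bar, l_bar = statistics.mean(hours), statistics.mean(log_counts)
b = (sum((h - h_bar) * (l - l_bar) for h, l in zip(hours, log_counts))
     / sum((h - h_bar) ** 2 for h in hours))
a = l_bar - b * h_bar

pred_log = a + b * 3.75              # prediction on the log (linear) scale
print(round(math.exp(pred_log), 2))  # back-transformed predicted count
```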

Example 2:

A researcher finds that the LSRL for predicting GPA based on average hours studied per week is

GPA = 1.75 + 0.11(hours studied)

Interpret the slope of the regression line in the context of the problem.

Solution:

A student who studies one hour more per week than another is predicted to have a GPA 0.11 points higher than the other student.

Example 3:

What is the regression equation for predicting weight from height in the following computer printout, and what is the correlation between height and weight?

Solution:

Weight = -104.64 + 3.4715(Height)

r^2 = 84.8% = 0.848, so r = ±√0.848 = ±0.921.

Since the slope of the regression line (3.4715) is positive, r is also positive: r = 0.921.
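As a quick arithmetic check (a sketch, not part of the printout), r can be recovered from r^2 by taking the square root and attaching the slope's sign:

```python
# r = ±sqrt(r^2), with the sign taken from the regression slope.
import math

r_squared, slope = 0.848, 3.4715   # values read from the printout above
r = math.copysign(math.sqrt(r_squared), slope)
print(round(r, 3))  # 0.921
```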
