response variable
measures an outcome of a study. dependent variable.
explanatory variable
attempts to explain the observed outcomes. independent variable.
how to examine data
plot the data. use numerical summaries. look for overall patterns and striking deviations (outliers). if overall pattern is regular, use a compact mathematical model to describe it.
scatterplot
shows the relationship between two quantitative variables measured on the same individuals. explanatory variable on x-axis. response variable on y-axis.
explanatory/response variables
a change in x helps explain or predict a change in y, but does not by itself show that x causes y. x is used to predict the values of y.
how to make a scatterplot
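put the explanatory variable on the x-axis and the response variable on the y-axis. scale and label the axes. plot each individual as a point. then look for an overall pattern and striking deviations (outliers) and describe the form.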
how to describe a scatterplot
form is the pattern (linear or curved or clusters). direction is the association (positive or negative). strength is how closely the points follow a clear form such as a line (strong or moderately strong or weak).
outlier
an individual value that falls outside the overall pattern of the relationship.
positively associated
when above-average values of one tend to accompany above-average values of the other and below-average values also tend to occur together.
negatively associated
when above-average values of one tend to accompany below-average values of the other, and vice versa.
how to display categorical values in a scatterplot
use a different plotting symbol or color for each category to differentiate the values.
correlation
measures the direction and strength of the linear relationship between two quantitative variables. a numerical measure that supplements the graph; it does not prove that the relationship is linear. standardized, so it has no units. written r.
r
positive=positive association between variables. negative=negative association between variables.
makes no distinction between explanatory and response variables. x or y does not matter.
requires that both variables be quantitative.
always between -1 and 1.
does not describe curved relationships.
like mean and SD, not resistant. strongly affected by a few outlying observations.
r=0
no linear relationship. scattered.
r=.99
strong, positive linear relationship.
r=-.99
strong, negative linear relationship.
how to use correlation
correlation is not a complete description of two-variable data, even when the relationship is linear. give the means and SDs of both x and y along with the correlation. do not base conclusions on correlation alone; describe the data more fully.
r=1, r=-1
points lie exactly on a straight line.
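worked example for the correlation cards above. a minimal sketch, assuming r is computed as the sum of the products of standardized x and y values divided by n - 1. the function name `correlation` and the small data set are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = sum of (z-score of x) * (z-score of y), divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = ((x - x_bar) / s_x * (y - y_bar) / s_y for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)                               # about 0.77
r_swapped = correlation(ys, xs)                       # x or y does not matter
r_rescaled = correlation([100 * x for x in xs], ys)   # no units: rescaling x changes nothing
r_perfect = correlation(xs, [3 + 2 * x for x in xs])  # points exactly on a line
r_with_outlier = correlation(xs + [20], ys + [0])     # one outlying point added

print(r, r_swapped, r_rescaled, r_perfect, r_with_outlier)
```

in this data the single added point (20, 0) turns a positive r into a negative one, which is what "not resistant" means.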
least-squares regression
a straight line that describes how a response variable y changes as an explanatory variable x changes. often used to predict the value of y for a given value x. unlike correlation, requires an explanatory variable and a response variable.
least-squares regression line
the line that makes the sum of the squared vertical distances of the points in a scatterplot from the line as small as possible.
LSRL
ŷ=a + bx
ŷ
predicted value.
y
observed value.
r²
_____% of the variation in the response variable (y) is accounted for by the regression line. a measure of how successful the regression was in explaining the response.
correlation and slope of LSRL
along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in the predicted y. slope b = r(SD y / SD x).
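worked example for the slope card above. a sketch that builds the LSRL from summary statistics, assuming the standard formulas b = r(SD y / SD x) and a = ȳ - b·x̄. the helper names (`correlation`, `lsrl`, `predict`) and the data are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r from standardized values (same formula as a hand calculation)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)

def lsrl(xs, ys):
    """Return intercept a and slope b of y-hat = a + b*x."""
    r = correlation(xs, ys)
    b = r * stdev(ys) / stdev(xs)   # slope from r and the two SDs
    a = mean(ys) - b * mean(xs)     # line passes through (x-bar, y-bar)
    return a, b

# hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = lsrl(xs, ys)

def predict(x):
    return a + b * x

# move one SD in x along the line and see how far y-hat moves
rise = predict(mean(xs) + stdev(xs)) - predict(mean(xs))
r_times_sd_y = correlation(xs, ys) * stdev(ys)

print(a, b, rise, r_times_sd_y)
```

`rise` and `r_times_sd_y` agree, so one SD in x moves the prediction by r SDs of y.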
residual
the difference between an observed value of the response variable and the value predicted by the regression line. y - ŷ. the mean of the least-squares residuals of an LSRL is always zero. a computed mean that is not exactly zero is due to roundoff error.
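worked example for the residual and r² cards. a sketch, assuming the line is fit with the usual least-squares formulas and that r² is computed as 1 - (sum of squared residuals) / (total variation in y). function and variable names are hypothetical, and the data is made up.

```python
from statistics import mean

def least_squares(xs, ys):
    """Intercept a and slope b that minimize the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares(xs, ys)
predicted = [a + b * x for x in xs]                # y-hat for each x
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

mean_residual = mean(residuals)                    # zero, up to roundoff

sse = sum(e ** 2 for e in residuals)               # variation left over
sst = sum((y - mean(ys)) ** 2 for y in ys)         # total variation in y
r_squared = 1 - sse / sst                          # fraction accounted for by the line

print(residuals, mean_residual, r_squared)
```

here r² comes out to 0.6, so 60% of the variation in y is accounted for by the regression line.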
residual plot
a scatterplot of the regression residuals against the explanatory variable. helps us assess the fit of a regression line.
how to make a residual plot
plot the x values on the x-axis and the residuals on the y-axis. draw a line at zero. label the axes.
how to examine a residual plot
a curved pattern shows the relationship is not linear. thus, a straight line is an inappropriate model.
increasing spread about the line as x increases shows that prediction of y will be less accurate for larger x. decreasing spread shows that prediction will be less accurate for smaller x.
individual points with large residuals are outliers in the vertical (y) direction because they lie far from the line that describes the overall pattern.
individual points that are extreme in the x direction may not have large residuals, but can be very important.
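worked example for examining residuals. a sketch with made-up curved data (y = x²), showing that r can be high while the residuals still reveal a curve. names are hypothetical, and the residuals are printed rather than plotted to keep the example dependency-free.

```python
from statistics import mean

def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# hypothetical curved data: y is exactly x squared
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# r squared from the leftover and total variation
sse = sum(e ** 2 for e in residuals)
sst = sum((y - mean(ys)) ** 2 for y in ys)
r_squared = 1 - sse / sst

# crude text version of a residual plot: x, residual
for x, e in zip(xs, residuals):
    print(x, round(e, 2))
print("r squared:", round(r_squared, 3))
```

the residuals are positive at both ends and negative in the middle. that curved pattern says a straight line is the wrong model even though r² is above 0.95.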
outlier
observation that lies outside the overall pattern of the other observations.
influential observation
removing the observation would markedly change the result of the calculation. points that are outliers in the x direction of a scatterplot are often influential observations for the LSRL. such a point often has a small residual because it pulls the regression line toward itself.
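worked example for influential observations. a sketch that fits the line with and without one point that is extreme in x. the data, the added point (20, 2), and all names are made up for illustration.

```python
from statistics import mean

def least_squares(xs, ys):
    """Intercept a and slope b of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
extreme_x, extreme_y = 20, 2          # outlier in the x direction

a_without, b_without = least_squares(xs, ys)
a_with, b_with = least_squares(xs + [extreme_x], ys + [extreme_y])

# residual of the extreme point measured from each line
resid_from_old_line = extreme_y - (a_without + b_without * extreme_x)
resid_from_new_line = extreme_y - (a_with + b_with * extreme_x)

print("slope without point:", b_without)
print("slope with point:   ", b_with)
print("residual from old line:", resid_from_old_line)
print("residual from new line:", resid_from_new_line)
```

in this data the one point flips the slope from positive to negative, yet its residual from the refitted line is small because the line has been pulled toward it.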
how to analyze data for two variables
plot your data in a scatterplot.
interpret what you see: direction, form, strength. linear?
numerical summary? x bar, y bar, SD x, SD y and r?
mathematical model? regression line?