3.1: Correlation and Variate Relationships
Two-Variable (Bi-Variate) Relationships
- Explanatory variable: a variable that attempts to explain or influence observed outcomes
  * What is being used to make the prediction
  * Displayed on the x-axis
- Response variable: a variable that measures some outcome
  * What is being predicted
  * Displayed on the y-axis
Describing Scatterplots and Bi-Variate Data: FUDS
- Form: linear, curved, U-shaped, etc.
- Unusual points: outliers, influential points
  * Outlier: a point with a large residual (usually decreases the correlation)
  * Influential point: a point that pulls the line toward itself (usually increases the correlation)
- Direction: positive or negative association (or neither)
  * Positive association: as one variable increases, so does the other
  * Negative association: as one variable increases, the other decreases
- Strength: how closely the points follow the form
  * Strong, weak, moderately strong/weak
Residuals
- Individual points with large residuals (observed y minus the y predicted by the line) are outliers in the y direction because they lie far from the line that describes the overall pattern
- Individual points that are extreme in the x direction may not have large residuals, but can be very important; such points are influential if removing them would markedly change the results of the calculation
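The difference between a large residual and an influential point can be checked numerically. The sketch below is a minimal illustration, assuming that "the line that describes the overall pattern" is a least-squares line. The helper names `fit_line` and `residuals` and the six data points are hypothetical, chosen so that the last point is extreme in the x direction.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals(xs, ys):
    """Residual = observed y minus the y predicted by the fitted line."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data: the first five points lie on y = 2x; the last is extreme in x.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 6, 8, 10, 12]

_, slope_all = fit_line(xs, ys)
_, slope_without = fit_line(xs[:-1], ys[:-1])
res = residuals(xs, ys)

print(f"slope with the extreme point:    {slope_all:.2f}")
print(f"slope without the extreme point: {slope_without:.2f}")
print(f"residual of the extreme point:   {res[-1]:.2f}")
print(f"largest other residual (abs):    {max(abs(r) for r in res[:-1]):.2f}")
```

With these made-up numbers, removing the extreme point changes the slope from about 0.42 to 2, even though that point's own residual (about -0.93) is smaller in size than several of the others. That is the sense in which a point can be influential without being an outlier in the y direction.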
Correlation (r)
- Gives the direction and strength of a linear relationship
  * Does not imply causation
- Makes no distinction between explanatory and response variables
  * Swapping x and y leaves r unchanged
- Both variables must be quantitative
- Standardized: r does not change when the units of measurement of x, y, or both are converted
- r itself has no units
- Positive r = positive association; negative r = negative association
- Correlation only measures the strength and direction of linear relationships; a strong curved relationship can still have r near 0
- -1 ≤ r ≤ 1 always
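- The closer r is to 1 or -1, the stronger the linear form
  * The closer r is to 0, the weaker the linear form and the more scattered the points are
- r does not tell the whole story: always look at the scatterplot as well, because r is strongly affected by outliers and says nothing about nonlinear form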
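Several of these properties can be verified directly. These notes do not give a formula for r, so the sketch below assumes a common one: the sum of the products of the standardized x and y values, divided by n - 1. The function name `correlation` and all data values are hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample correlation r: sum of z-score products divided by (n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical heights (inches) and weights (pounds) for five people.
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 140, 155, 165]

r = correlation(heights_in, weights_lb)

# Swapping explanatory and response variables leaves r unchanged.
r_swapped = correlation(weights_lb, heights_in)

# Converting units (inches -> cm, pounds -> kg) leaves r unchanged.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]
r_metric = correlation(heights_cm, weights_kg)

# A perfect but U-shaped pattern: r is 0 because there is no linear trend.
curve_x = [-2, -1, 0, 1, 2]
curve_y = [x ** 2 for x in curve_x]
r_curve = correlation(curve_x, curve_y)

print(round(r, 4), round(r_swapped, 4), round(r_metric, 4), round(r_curve, 4))
```

The first three values agree up to floating-point rounding, and the last is zero even though y is completely determined by x. This is why a scatterplot should accompany any reported r.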
Displaying Data
Two-Way Tables
- Two-way table: a table that displays data for two categorical variables about the same group of individuals
- Marginal distribution: the distribution of one categorical variable alone, given by the row or column totals of the two-way table
- The yellow box shows the marginal distribution for gender, and the purple box is the marginal distribution of opinions
- Conditional distribution: the distribution of one variable among only the individuals who share a particular value of the other variable
  * Often uses the language of the probability of A “given” B
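Marginal and conditional distributions can both be computed from the same table of counts. The sketch below is a minimal illustration using a gender-by-opinion table. The counts, the category labels, and the helper name `conditional_opinion` are all hypothetical and are not taken from the table in these notes.

```python
# Hypothetical counts: rows are gender, columns are opinion.
table = {
    "Female": {"Agree": 30, "Disagree": 20},
    "Male": {"Agree": 10, "Disagree": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of gender: row totals as a share of everyone.
marginal_gender = {g: sum(row.values()) / grand_total for g, row in table.items()}

# Marginal distribution of opinion: column totals as a share of everyone.
opinions = list(next(iter(table.values())))
marginal_opinion = {
    o: sum(row[o] for row in table.values()) / grand_total for o in opinions
}


def conditional_opinion(gender):
    """Distribution of opinion given one gender (shares within that row)."""
    row = table[gender]
    row_total = sum(row.values())
    return {o: count / row_total for o, count in row.items()}


print(marginal_gender)
print(marginal_opinion)
print(conditional_opinion("Female"))
print(conditional_opinion("Male"))
```

The marginal distributions divide by the grand total, while each conditional distribution divides by the total of a single row. In this made-up table, 40% of everyone agrees, but 60% agree "given" Female and only 20% agree "given" Male.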
Segmented Bar Graphs
- Also known as segmented bar charts
- Segmented bar graph: a chart that displays categorical data as percentages of the whole
  * Similar to a pie chart

Mosaic Plots
- Mosaic plot: a segmented bar graph used to compare groups where the widths of the bars are proportional to the size of the groups
- Mosaic plots of the same data from the previous section:
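The geometry behind a mosaic plot can be sketched without any plotting library. The example below is a minimal illustration, assuming each group's bar width is its share of all individuals and each segment's height is the share of that category within the group. The counts and the helper name `mosaic_rectangles` are hypothetical.

```python
# Hypothetical counts for two groups of unequal size.
counts = {
    "Female": {"Agree": 45, "Disagree": 15},
    "Male": {"Agree": 10, "Disagree": 30},
}


def mosaic_rectangles(counts):
    """Return {(group, category): (width, height)} for a mosaic plot."""
    grand_total = sum(sum(row.values()) for row in counts.values())
    rects = {}
    for group, row in counts.items():
        group_total = sum(row.values())
        width = group_total / grand_total    # bar width = group's share of everyone
        for category, n in row.items():
            height = n / group_total         # segment height = share within the group
            rects[(group, category)] = (width, height)
    return rects


rects = mosaic_rectangles(counts)
for (group, category), (width, height) in rects.items():
    area = width * height
    print(f"{group:6} {category:8} width={width:.2f} height={height:.2f} area={area:.2f}")
```

Because width times height equals that cell's share of the grand total, the area of each rectangle shows how common the combination is overall. A segmented bar graph of the same counts would use the same heights but give every bar the same width.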
