9/3: SOCI 252 - Scatterplots and Correlations
Scatter plots and correlation in R: comprehensive notes
The relationship between two (or more) variables is often analyzed through scatter plots and correlation coefficients.
Today’s focus: two main tools for two-variable relationships: the scatter plot and the correlation coefficient.
Scatter plots: what they are and how to read them
Purpose: visualize the relationship between two variables in a 2D space (x vs y).
Points represent observations (e.g., students in an experiment). If the plot shows an upward trend, higher x tends to correspond to higher y; a downward trend indicates higher x tends to correspond to lower y.
Example points described in the transcript (as an illustration of plotting by hand):
(x, y) = (4, 2)
(8, 5)
(10, 3)
How to construct a scatter plot in R:
Two required arguments: the x-values and the y-values.
You can either supply the two vectors directly, or use an object (e.g., a data frame) for the variables.
If you specify an object for x-values and an object for y-values, the plot shows those values against each other.
You can also place x first and y second explicitly (plot(x, y)); or, if you have column names in a data frame, you can reference them (e.g., plot(data$X, data$Y)). If you supply the variables in order, R assumes the first is x and the second is y.
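As a minimal sketch of the two calling styles described above, using the three hand-plotted toy points from the earlier example (not real data):

```r
# Toy data: the three hand-plotted points (4,2), (8,5), (10,3)
x <- c(4, 8, 10)
y <- c(2, 5, 3)

# Style 1: supply the two vectors directly; R takes the first as x, second as y
plot(x, y)

# Style 2: keep the variables in a data frame and reference columns by name
toy <- data.frame(X = x, Y = y)
plot(toy$X, toy$Y)
```

Both calls produce the same scatter plot; the data-frame style is more common once you are working with a real dataset.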
Example dataset discussed: project STAR (Tennessee) — data from experiments on class size effects in elementary school and later outcomes.
Each row represents a student (an observation).
Reading scores on the x-axis and math scores on the y-axis were used to explore the relationship.
Accessing variables in the STAR data frame: star$reading and star$math (e.g., plot(star$reading, star$math)).
Alternative approach: plot(reading, math) works if the variables are directly accessible, e.g., after attach(star) or inside with(star, plot(reading, math)); the ordering still implies x then y.
What the scatter plot typically shows for STAR reading vs math:
A general positive association: as reading scores go up, math scores tend to go up as well.
This does not imply causation; it only indicates a linear-like association in the observed data.
Key caveat: correlation and scatter plots summarize relationships but do not establish causality. Strong association does not mean one variable causes changes in the other.
The correlation coefficient: direction, strength, and interpretation
What it is: a statistic that summarizes the linear relationship between two variables.
Common notation: the correlation coefficient is often denoted as r .
Mathematical definition: r = \frac{\text{cov}(X, Y)}{s_X s_Y}, where \text{cov}(X, Y) is the covariance of X and Y, and s_X, s_Y are the standard deviations of X and Y.
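The definition can be checked directly in R: computing covariance divided by the product of standard deviations gives the same value as the built-in cor() (a sketch on simulated data, since no real dataset is needed for the identity):

```r
# Simulate two linearly related variables
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Correlation by the formula: covariance over the product of the SDs
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)  # TRUE: the two computations agree
```

The agreement is exact because cov() and sd() both use the same n - 1 denominator, which cancels in the ratio.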
Range and interpretation:
Range: -1 \le r \le 1
Sign indicates direction:
If r > 0, there is a positive linear relationship (as x increases, y tends to increase).
If r < 0, there is a negative linear relationship (as x increases, y tends to decrease).
Magnitude indicates strength of linear association:
The closer |r| is to 1, the stronger the linear association.
The closer |r| is to 0, the weaker the linear association.
Note: A strong non-linear relationship can have a low or even zero correlation; correlation primarily captures linear associations.
Example interpretation from the STAR discussion:
If the scatter plot shows a clear straight-line pattern, the correlation is high (positive if the line slopes up, negative if it slopes down).
A near-straight-line pattern corresponds to a correlation close to ±1; a cloud-like pattern corresponds to a correlation near 0.
Important nuance about variance explained (noted in the discussion):
The relationship between correlation and variance explained is via the coefficient of determination, R^2 = r^2 .
When a demonstration yielded an observed correlation of r \approx 0.71 , the correct interpretation is:
R^2 = r^2 \approx 0.5041 , i.e., about 50% of the variance in one variable is explained by the linear relationship with the other variable (in the context of the linear model). The transcript’s claim that “about 70% of the variance is explained” is not correct for this value of r.
Therefore, with r = 0.71 , roughly half of the variance in one variable is explained by the other, in a linear sense, not 70%.
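The arithmetic behind this correction is a one-liner in R:

```r
# Variance explained is r squared, not r itself
r <- 0.71
r_squared <- r^2
r_squared  # 0.5041: about 50% of the variance, not 70%
```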
An illustrative comparison of two graphs:
One plot may have a steep slope but substantial scatter around the line; despite the steep slope, its correlation is lower.
Another plot may have a shallower slope but points tightly aligned to a line, resulting in a higher correlation: correlation measures how tightly points follow a line, not how steep the line is.
Example scenarios:
Parabolic relationship (e.g., y = x^2 ) can have a strong non-linear relationship but a correlation near zero because the data do not lie along a single straight line.
This reinforces the point that correlation captures linear association, not all types of relationships.
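The parabola case is easy to demonstrate: with x values symmetric around zero, y = x^2 has a perfect (deterministic) relationship with x, yet the correlation comes out to zero.

```r
# A perfect parabola: strong non-linear relationship, zero correlation
x <- -5:5
y <- x^2
cor(x, y)  # 0: the symmetric U-shape has no linear trend for r to capture
```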
Practical use and interpretation tips
Always plot first: a scatter plot helps you visually assess whether a linear relationship is appropriate to summarize with a correlation coefficient.
Correlation does not imply causation: a high correlation indicates association, not that changes in one variable cause changes in the other.
Handling real data issues:
When computing correlation in R with missing values, you can specify how to handle them; to use only the complete observations, pass use = "complete.obs" to cor().
Example in R: cor(star$reading, star$math, use = "complete.obs")
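A small self-contained illustration of why the option matters (toy vectors, since the STAR data are not reproduced here): by default, any NA makes cor() return NA, while use = "complete.obs" drops the incomplete pair first.

```r
# Toy vectors with one missing value
a <- c(1, 2, 3, NA, 5)
b <- c(2, 4, 5, 7, 9)

cor(a, b)                        # NA: the default propagates missing values
cor(a, b, use = "complete.obs")  # drops the NA pair, then computes r (close to 1 here)
```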
Relationship with binary variables (not deeply covered in the transcript): correlation between a binary and a continuous variable is possible but interpretation is more nuanced; the transcript notes that the presence of binary categories can complicate straightforward correlation in some cases.
STAR dataset: concrete walkthrough and findings
Data context: reading vs. math scores for students in the STAR experiment in Tennessee; end-of-year tests used to assess outcomes and potential later effects (e.g., high school graduation rates).
Procedure described in class:
Create a scatter plot with reading as the x-variable and math as the y-variable (e.g., plot(star$reading, star$math)).
Observe the general positive association: higher reading tends to accompany higher math scores.
Compute the correlation using R’s cor function, e.g., cor(star$reading, star$math), yielding a value around 0.71 (strong positive linear association).
With such a value, the linear model explanation is that as reading increases, math generally increases, though not perfectly (not 100% deterministic).
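The class procedure can be sketched end to end. Since the actual STAR data frame is not reproduced in these notes, the sketch simulates reading and math scores with a built-in correlation near 0.71; the column names reading and math match the real dataset, but the numbers here are illustrative only.

```r
# Hedged sketch: simulate STAR-like scores with correlation approx. 0.71
set.seed(42)
reading <- rnorm(500, mean = 500, sd = 50)
math <- 500 + 0.71 * (reading - 500) + rnorm(500, sd = 50 * sqrt(1 - 0.71^2))
star <- data.frame(reading = reading, math = math)

# Step 1: scatter plot, reading on x, math on y
plot(star$reading, star$math)

# Step 2: compute the correlation (roughly 0.71 by construction)
r <- cor(star$reading, star$math)

# Step 3: variance explained under a linear model: about half, not 70%
r^2
```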
Practical takeaway:
The correlation reflects linear association strength and direction; attention should be paid to potential non-linear relationships or lurking variables if the scatter deviates from a clear line.
Additional examples discussed (beyond STAR)
Mother’s age vs baby birth weight:
Scatter plot described as “blobby” with a very weak positive correlation (weak linear association).
Interpretation: as mother’s age increases, baby birth weight shows little or no strong linear pattern.
Weeks at birth vs birth weight:
Positive association observed; correlation around 0.67 (moderate-to-strong linear relationship).
Interpretation: longer gestation generally relates to higher birth weight; note that with r ≈ 0.67, R^2 = r^2 ≈ 0.45, so about 45% (not two-thirds) of the variance in birth weight is explained by gestational length in this simple linear sense, leaving substantial remaining variance.
Mother’s age vs father’s age:
Scatter plot shows a clear positive direction; correlation around 0.78 (strong linear correlation).
Interpretation: generally, older mothers tend to have older fathers, with notable but not perfect variance.
Binary/gender variable discussion (brief): dataset includes binary categories; the narrative notes how binary variables are handled and that interpretation of correlation with a binary variable is less straightforward in simple terms.
Practical practice recommendations mentioned
After choosing a pair of variables from the NCBIRT dataset, plot them and discuss:
The shape and direction of the scatter plot (positive/negative/no clear pattern).
The computed correlation and what it implies about linear association.
Whether the correlation aligns with the visual impression from the plot.
If missing values are present, use complete observations to compute correlation to avoid NA-related issues.
Examples used during practice:
Mother’s age vs weight: blob-like scatter, very weak positive correlation.
Weeks at birth vs birth weight: moderate positive correlation (~0.67).
Mother’s age vs father’s age: strong positive correlation (~0.78).
Time allocation and interactive practice: the instructor allowed a few minutes for students to pick variable pairs, plot, and discuss results.
Final takeaways and cautions
Scatter plots provide intuitive visuals of relationships; they should be inspected before calculating correlation.
The correlation coefficient summarizes linear association, with sign indicating direction and magnitude indicating strength, but it does not capture non-linear relationships.
R^2 = r^2 gives the proportion of variance in one variable explained by the other under a linear model; interpret with care (for r = 0.71, R^2 ≈ 0.50, not 0.70).
Correlation is not causation; multiple variables and confounding factors can influence observed associations.
When working with real data, consider data cleaning steps (e.g., handling missing values via complete observations) to avoid misleading results.
Practice with datasets (like NCBIRT) to build intuition about how plots and correlations align or diverge for different variable pairs.
Quick reference: common R commands mentioned
Scatter plot: plot(x, y) or plot(star$reading, star$math)
Correlation: cor(x, y) or cor(star$reading, star$math)
Correlation with complete observations: cor(x, y, use = "complete.obs")
Scatter plot caveat: always visually inspect the pattern; correlation measures linear association and can miss non-linear structure
Notes on accuracy observed in the transcript:
The transcript states that a correlation of around 0.71 implies about 70% of the variance is explained. The correct interpretation is R^2 = r^2 ≈ 0.5041, i.e., about 50% of the variance in the dependent variable is explained by the linear relationship with the predictor in this example. Keep in mind this is a simplification and depends on the context and model used.