9/3: SOCI 252 - Scatterplots and Correlations

Scatter plots and correlation in R: comprehensive notes

  • The relationship between two (or more) variables is often analyzed through scatter plots and correlation coefficients.

  • Today’s focus: two main tools for two-variable relationships: scatter plots and the correlation coefficient.

Scatter plots: what they are and how to read them

  • Purpose: visualize the relationship between two variables in a 2D space (x vs y).

  • Points represent observations (e.g., students in an experiment). If the plot shows an upward trend, higher x tends to correspond to higher y; a downward trend indicates higher x tends to correspond to lower y.

  • Example points described in the transcript (as an illustration of plotting by hand):

    • (x, y) = (4, 2)

    • (8, 5)

    • (10, 3)

  • How to construct a scatter plot in R:

    • Two required arguments: the x-values and the y-values.

    • You can supply two vectors directly or pull the variables from an object such as a data frame.

    • Whatever objects you pass for the x-values and the y-values, R plots them against each other, pairing values by position.

    • Arguments are matched by position: in plot(x, y), R treats the first argument as x and the second as y. With a data frame, reference the columns by name (e.g., plot(data$X, data$Y)).
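
As a minimal sketch of the calls described above, the three hand-plotted example points from these notes can be drawn in base R. The vector names x and y and the data frame name pts are illustrative choices, not names used in the lecture.

```r
# The three example points from the notes: (4, 2), (8, 5), (10, 3)
x <- c(4, 8, 10)
y <- c(2, 5, 3)

# First argument goes on the x-axis, second on the y-axis
plot(x, y)

# Same plot from data frame columns; `pts` is a hypothetical name
pts <- data.frame(X = x, Y = y)
plot(pts$X, pts$Y)

# Naming the arguments makes their order irrelevant
plot(y = pts$Y, x = pts$X)
```

All three calls draw the same picture; only the way the variables are supplied differs.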

  • Example dataset discussed: project STAR (Tennessee) — data from experiments on class size effects in elementary school and later outcomes.

    • Each row represents a student (an observation).

    • Reading scores on the x-axis and math scores on the y-axis were used to explore the relationship.

    • Accessing variables in the STAR data frame: star$reading and star$math (e.g., plot(star$reading, star$math)).

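    • Alternative approach: plot(reading, math) works only if reading and math are available as standalone vectors; otherwise, with(star, plot(reading, math)) evaluates the column names inside the data frame. In either form, the first argument is x and the second is y.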

  • What the scatter plot typically shows for STAR reading vs math:

    • A general positive association: as reading scores go up, math scores tend to go up as well.

    • This does not imply causation; it only indicates a roughly linear association in the observed data.

  • Key caveat: correlation and scatter plots summarize relationships but do not establish causality. Strong association does not mean one variable causes changes in the other.

The correlation coefficient: direction, strength, and interpretation

  • What it is: a statistic that summarizes the linear relationship between two variables.

  • Common notation: the correlation coefficient is often denoted as r .

  • Mathematical definition: r = \frac{\text{cov}(X, Y)}{s_X s_Y} , where \text{cov}(X, Y) is the covariance of X and Y, and s_X , s_Y are the standard deviations of X and Y.
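
To make the definition concrete, the sketch below computes r by hand from the covariance and the standard deviations and compares it with R's built-in cor(). The data are the three example points from the scatter plot section, reused here purely as an illustration.

```r
x <- c(4, 8, 10)
y <- c(2, 5, 3)

# r = cov(X, Y) / (s_X * s_Y), using the sample covariance and sample SDs
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

print(r_by_hand)
print(r_builtin)
```

Both approaches give 0.5 for these three points, since cor() computes the Pearson correlation by default.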

  • Range and interpretation:

    • Range: -1 \le r \le 1

    • Sign indicates direction:

      • If r > 0, there is a positive linear relationship (as x increases, y tends to increase).

      • If r < 0, there is a negative linear relationship (as x increases, y tends to decrease).

    • Magnitude indicates strength of linear association:

      • The closer |r| is to 1, the stronger the linear association.

      • The closer |r| is to 0, the weaker the linear association.

    • Note: A strong non-linear relationship can have a low or even zero correlation; correlation primarily captures linear associations.
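
A small sketch of the sign rule at its two extremes, using made-up vectors (x, y_up, and y_down are illustrative names). Points that fall exactly on a rising line give r = 1, and points exactly on a falling line give r = -1.

```r
x <- 1:10

y_up   <- 2 * x + 1      # exact rising line
y_down <- -0.5 * x + 3   # exact falling line

r_up   <- cor(x, y_up)
r_down <- cor(x, y_down)

print(r_up)
print(r_down)
```

Real data rarely sit exactly on a line, so observed values of r fall strictly between these two limits.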

  • Example interpretation from the STAR discussion:

    • If the scatter plot shows a clear straight-line pattern, the correlation is high (positive if the line slopes up, negative if it slopes down).

    • A near-straight-line pattern corresponds to a correlation close to ±1; a cloud-like pattern corresponds to a correlation near 0.

  • Important nuance about variance explained (noted in the discussion):

    • The relationship between correlation and variance explained is via the coefficient of determination, R^2 = r^2 .

    • When a demonstration yielded an observed correlation of r \approx 0.71 , the correct interpretation is R^2 = r^2 \approx 0.50 , i.e., about 50% of the variance in one variable is explained by its linear relationship with the other (in the context of the linear model). The transcript’s claim that “about 70% of the variance is explained” is not correct for this value of r.
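
The arithmetic behind this correction, as a short sketch. The second half checks, on a small made-up data set, that the R^2 reported by a simple linear regression equals the squared correlation; the vectors x and y are illustrative and are not the STAR data.

```r
r <- 0.71
r_squared <- r^2
print(r_squared)  # 0.5041

# Illustrative data (not from the lecture)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)

# R^2 from a one-predictor linear model versus the squared correlation
fit <- lm(y ~ x)
r2_model <- summary(fit)$r.squared
r2_cor   <- cor(x, y)^2

print(r2_model)
print(r2_cor)
```

The two printed R^2 values match, which is why squaring r is the right way to talk about variance explained in the two-variable case.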

  • An illustrative comparison of two graphs:

    • One plot may show a steep slope yet have a lower correlation, because its points scatter widely around the line.

    • Another plot may show a shallower slope but tighter alignment to a line, resulting in a higher correlation. The correlation reflects how closely the points follow a line, not how steep that line is.
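
This comparison can be sketched with two made-up data sets. The alternating offsets in e are a deterministic stand-in for random scatter so that the example is reproducible; all names and numbers here are illustrative.

```r
x <- 1:10
e <- rep(c(1, -1), times = 5)    # fixed up/down offsets standing in for noise

y_steep   <- 5 * x + 10 * e      # steep slope, wide scatter around the line
y_shallow <- 0.5 * x + 0.1 * e   # shallow slope, points hug the line

r_steep   <- cor(x, y_steep)
r_shallow <- cor(x, y_shallow)

print(c(steep = r_steep, shallow = r_shallow))
```

The shallow-slope data produce the larger r, because the scatter is small relative to the trend.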

  • Example scenarios:

    • A parabolic relationship (e.g., y = x^2 with x values spread symmetrically around 0) can be strong yet have a correlation near zero: the falling left half and the rising right half cancel out, so no single straight line summarizes the data.

    • This reinforces the point that correlation captures linear association, not all types of relationships.
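
A quick sketch of the parabola case, assuming x values placed symmetrically around zero (an illustrative choice):

```r
x <- -5:5    # symmetric around zero
y <- x^2     # exact parabola: y is fully determined by x

r_parabola <- cor(x, y)
print(r_parabola)  # essentially zero despite the perfect relationship

# The curve is obvious in the plot even though r misses it
plot(x, y)
```

If x covered only positive values, the same curve would be rising throughout and r would be large and positive, so the near-zero result depends on the symmetry of x.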

Practical use and interpretation tips

  • Always plot first: a scatter plot helps you visually assess whether a linear relationship is appropriate to summarize with a correlation coefficient.

  • Correlation does not imply causation: a high correlation indicates association, not that changes in one variable cause changes in the other.

  • Handling real data issues:

    • When computing correlation in R with missing values, you can specify how to handle them, e.g., using the complete observations only: use the option use = "complete.obs" in cor().

    • Example in R: cor(star$reading, star$math, use = "complete.obs")
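
A minimal sketch of the missing-value behavior. The data frame below is a tiny made-up stand-in that only borrows the column names reading and math; its values are invented for illustration and are not STAR scores.

```r
# Hypothetical stand-in data; values are invented, not real STAR scores
toy <- data.frame(
  reading = c(520, 540, NA, 580, 600, 610),
  math    = c(510, 535, 560, NA, 590, 620)
)

# Default behavior: any NA in the inputs makes the result NA
r_default <- cor(toy$reading, toy$math)
print(r_default)

# Keep only rows where both values are present
r_complete <- cor(toy$reading, toy$math, use = "complete.obs")
print(r_complete)

# Equivalent by hand: drop incomplete rows first
ok <- complete.cases(toy$reading, toy$math)
r_manual <- cor(toy$reading[ok], toy$math[ok])
print(r_manual)
```

The last two results agree, which shows that use = "complete.obs" simply drops every row with a missing value in either variable.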

  • Relationship with binary variables (not deeply covered in the transcript): correlation between a binary and a continuous variable is possible but interpretation is more nuanced; the transcript notes that the presence of binary categories can complicate straightforward correlation in some cases.

STAR dataset: concrete walkthrough and findings

  • Data context: reading vs. math scores for students in the STAR experiment in Tennessee; end-of-year tests used to assess outcomes and potential later effects (e.g., high school graduation rates).

  • Procedure described in class:

    • Create a scatter plot with reading as the x-variable and math as the y-variable (e.g., plot(star$reading, star$math)).

    • Observe the general positive association: higher reading tends to accompany higher math scores.

    • Compute the correlation using R’s cor function: e.g., r = cor(star$reading, star$math) , yielding a value around 0.71 (strong positive linear association).

    • Interpretation: as reading scores increase, math scores generally increase as well, though the relationship is not perfect (r is below 1, so reading does not fully determine math).

  • Practical takeaway:

    • The correlation reflects linear association strength and direction; attention should be paid to potential non-linear relationships or lurking variables if the scatter deviates from a clear line.

Additional examples discussed (beyond STAR)

  • Mother’s age vs baby birth weight:

    • Scatter plot described as “blobby” with a very weak positive correlation (weak linear association).

    • Interpretation: baby birth weight shows little linear relationship with mother’s age.

  • Weeks at birth vs birth weight:

    • Positive association observed; correlation around 0.67 (moderate-to-strong linear relationship).

    • Interpretation: longer gestation generally relates to higher birth weight. With r \approx 0.67 , r^2 \approx 0.45 , so about 45% (not two-thirds) of the variance in birth weight is explained by gestational length in this simple linear sense, leaving substantial unexplained variance.

  • Mother’s age vs father’s age:

    • Scatter plot shows a clear positive direction; correlation around 0.78 (strong linear correlation).

    • Interpretation: older mothers tend to be paired with older fathers, though the relationship is not perfect.

  • Binary/gender variable discussion (brief): dataset includes binary categories; the narrative notes how binary variables are handled and that interpretation of correlation with a binary variable is less straightforward in simple terms.

Practical practice recommendations mentioned

  • After choosing a pair of variables from the NCBIRT dataset, plot them and discuss:

    • The shape and direction of the scatter plot (positive/negative/no clear pattern).

    • The computed correlation and what it implies about linear association.

    • Whether the correlation aligns with the visual impression from the plot.

  • If missing values are present, use complete observations to compute correlation to avoid NA-related issues.

  • Examples used during practice:

    • Mother’s age vs birth weight: blob-like scatter, very weak positive correlation.

    • Weeks at birth vs birth weight: moderate positive correlation (~0.67).

    • Mother’s age vs father’s age: strong positive correlation (~0.78).

  • Time allocation and interactive practice: the instructor allowed a few minutes for students to pick variable pairs, plot, and discuss results.
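
The practice steps above can be wrapped in a small helper. This is only a sketch: plot_and_cor is a hypothetical function name, and the demo data frame is invented so the example runs without the class dataset.

```r
# Hypothetical helper: scatter plot plus correlation for two columns
plot_and_cor <- function(data, xvar, yvar) {
  x <- data[[xvar]]
  y <- data[[yvar]]
  # Look at the shape and direction first
  plot(x, y, xlab = xvar, ylab = yvar)
  # Then compute r, dropping rows with missing values
  cor(x, y, use = "complete.obs")
}

# Invented demo data (not the class dataset)
demo <- data.frame(
  a = c(1, 2, 3, 4, 5, NA),
  b = c(2, 4, 5, 4, 6, 7)
)

r <- plot_and_cor(demo, "a", "b")
print(r)
```

Passing column names as strings lets the same call be reused for any pair of variables; comparing the printed r with the plot covers the "does the number match the picture" check.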

Final takeaways and cautions

  • Scatter plots provide intuitive visuals of relationships; they should be inspected before calculating correlation.

  • The correlation coefficient summarizes linear association, with sign indicating direction and magnitude indicating strength, but it does not capture non-linear relationships.

  • R^2 = r^2 gives the proportion of variance in one variable explained by the other under a linear model; interpret with care (for r = 0.71, R^2 ≈ 0.50, not 0.70).

  • Correlation is not causation; multiple variables and confounding factors can influence observed associations.

  • When working with real data, consider data cleaning steps (e.g., handling missing values via complete observations) to avoid misleading results.

  • Practice with datasets (like NCBIRT) to build intuition about how plots and correlations align or diverge for different variable pairs.

Quick reference: common R commands mentioned

  • Scatter plot: plot(x, y) or plot(star$reading, star$math)

  • Correlation: cor(x, y) or cor(star$reading, star$math)

  • Correlation with complete observations: cor(x, y, use = "complete.obs")

  • Scatter plot caveat: always visually inspect the pattern; correlation measures linearity, not non-linear structure

Notes on accuracy observed in the transcript:

  • The transcript states that a correlation of around 0.71 implies about 70% of the variance is explained. The correct interpretation is R^2 = r^2 ≈ 0.5041, i.e., about 50% of the variance in the dependent variable is explained by the linear relationship with the predictor in this example. The identity R^2 = r^2 applies to a simple linear model with a single predictor.