Correlation, Variables, and Scatter Plots — Study Notes

Correlation and Variables: Core Ideas

General definition of correlation: two things are correlated when there is some general relationship between them. A correlation is an observation that two traits or attributes are related to one another, i.e., they are co-related.
Distinction highlighted in the transcript:
- Correlation as a general relationship between variables (a conceptual link).
- Correlation as a numerical measure of how closely two variables co-vary and how well you can predict change in one by observing change in the other.
Correlation describes how two things move together, and the relationship can run in either direction:
- Positive (direct) relationship: as one variable increases, the other increases; equivalently, both decrease together.
- Negative (inverse) relationship: as one variable increases, the other decreases. The transcript's phrase "the correlation inverts under this context" can be read as referring to an inverse relationship.
Important nuance: the correlation you observe depends on the context and on which variables are being examined.
Variables: a fundamental concept in this topic
- A variable is anything that can vary or change. If something can take on different values across observations, it is a variable.
- Examples of variables: height, temperature, exam score, time spent, etc. Anything that varies can be treated as a variable.
Scatter plots as a visualization tool
- Scatter plots are used to monitor and inspect correlations between pairs of variables.
- The strength and direction of a relationship are inferred from the pattern of points on the scatter plot.
- A clustered, line-like pattern indicates a stronger relationship; a widely dispersed pattern indicates a weaker or no linear correlation.
Example reference from the transcript
- The phrase "birdshot" is used to describe a scatter plot with no clear relationship, i.e., data points scattered around with no discernible pattern, centered near the middle of the plot.
Key takeaway: correlation strength is related to how tightly the data points align along a pattern (often a line) in the scatter plot.
Related but important caveats (real-world context, not explicitly stated in the transcript):
- Correlation does not imply causation: two variables may move together without one causing the other.
- Non-linear relationships may have low linear correlation even when there is a relationship (e.g., curved patterns).
- When linear correlation is insufficient, nonparametric or rank-based measures (e.g., Spearman’s rho) or nonlinear models may be appropriate.
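The difference between a linear measure and a rank-based one can be sketched in a few lines. This is a minimal illustration, not a library implementation: the helper names (`pearson_r`, `ranks`, `spearman_rho`) and the data are made up for these notes, and the ranking helper assumes there are no tied values.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def ranks(values):
    """Rank each value from 1 (smallest) upward. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho computed as Pearson r on the ranks (no-ties case)."""
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical data: y always rises with x, but along a curve, not a line.
x = [1, 2, 3, 4, 5, 6]
y = [math.exp(v) for v in x]

r_linear = pearson_r(x, y)    # below 1, because the points are not on a line
rho_rank = spearman_rho(x, y)  # 1, because the ordering is perfectly preserved
print(r_linear, rho_rank)
```

Because y increases every time x increases, the rank-based measure reports a perfect monotonic relationship, while Pearson r is noticeably lower for the same points.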
Quick recap of concepts introduced: correlation, variables, scatter plots, positive vs negative correlation, no (zero) correlation, and interpretation of scatter patterns.

Key Concepts and Definitions

Correlation (general concept): a relationship or association between two variables.
Correlation as a measure: a numerical value that quantifies how closely two variables vary together and how well you can predict one from the other.
Direct vs inverse relationship:
- Direct (positive) correlation: both variables increase together or decrease together.
- Inverse (negative) correlation: one variable increases while the other decreases.
Variable: anything that can vary across observations.
Scatter plot: a graphical representation used to visualize the relationship between two variables.
Zero (no) correlation: data show no discernible linear relationship; points do not cluster along a line.

Formulas and Quantitative Details

Pearson correlation coefficient (to quantify linear association):
$r \,=\, \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} \,=\, \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}$
Range of r: $-1 \le r \le 1$
- r = 1: perfect positive linear relationship
- r = -1: perfect negative linear relationship
- r = 0: no linear relationship (but may still have a non-linear relationship)
Visual interpretation guide:
- Points closely along an upward-sloping line -> strong positive correlation
- Points closely along a downward-sloping line -> strong negative correlation
- Points scattered with no clear pattern -> weak or zero correlation
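The sum form of the Pearson formula translates almost line for line into code. The sketch below is one straightforward way to compute it in plain Python; the function name `pearson_r` and the hours/scores data are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed directly from the sum formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the two means.
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the square roots of the summed squared deviations.
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)) * \
          math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if den == 0:
        raise ValueError("r is undefined when either variable is constant")
    return num / den

# Made-up data: hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = pearson_r(hours, scores)
print(round(r, 2))  # close to 1: points lie near an upward-sloping line
```

Note that r is undefined when either variable never changes, which is why the sketch raises an error instead of dividing by zero.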

How to Interpret Correlation Strength and Direction

Direction:
- Positive correlation: as X increases, Y tends to increase; slope of the best-fit line is positive.
- Negative correlation: as X increases, Y tends to decrease; slope of the best-fit line is negative.
Strength (for linear correlation):
- |r| close to 1: strong linear relationship
- |r| around 0.3 to 0.7: moderate linear relationship (a common rule of thumb; exact cutoffs vary by field)
- |r| close to 0: weak linear relationship
Important caveats:
- A high |r| does not imply causation.
- A low |r| does not imply no relationship if the relationship is non-linear.
- Outliers can substantially affect r.
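The outlier caveat is easy to see numerically. In the sketch below, the data are invented and `pearson_r` is a hypothetical helper written for these notes; a single extra point is enough to pull a strong correlation down to a weak one.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

# Made-up data with a clear upward trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 4, 5, 4, 6, 7, 8, 9]
r_clean = pearson_r(x, y)

# Same data plus one point far below the trend.
x_out = x + [9]
y_out = y + [0]
r_with_outlier = pearson_r(x_out, y_out)

print(round(r_clean, 2), round(r_with_outlier, 2))
```

Here r falls from roughly 0.97 to roughly 0.25 after adding one point, which is why it is worth inspecting the scatter plot before trusting the coefficient.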

Practical Examples (Illustrative Scenarios)

Example 1: Positive correlation
- Variables: hours studied (X) and exam score (Y)
- Expectation: more hours studied tends to be associated with higher exam scores.
- Scatter plot pattern: upward trend; r > 0.
Example 2: Negative correlation
- Variables: number of hours of video game playing per day (X) and sleep duration (Y)
- Expectation: more gaming hours tends to be associated with less sleep.
- Scatter plot pattern: downward trend; r < 0.
Example 3: No correlation (zero correlation)
- Variables: shoe size (X) and exam score (Y) of a randomly chosen adult (in the transcript's terms, a "birdshot" pattern)
- Scatter plot pattern: no clear pattern; points scattered around without a linear trend; r ≈ 0.
Example 4: Non-linear relationship (note for interpretation)
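- Variables: X and Y with a curved relationship (e.g., Y = X^2 with X values spread symmetrically around zero)
- Pearson r may be near 0 even though Y is fully determined by X; linear correlation misses the pattern because the rising and falling halves of the curve cancel out.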
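A quick check of the Y = X^2 case, using made-up X values and a hypothetical `pearson_r` helper. The sketch also shows that the near-zero result depends on X being spread symmetrically around zero; with only non-negative X values the same curve gives a high r.

```python
import math

def pearson_r(xs, ys):
    """Pearson r from deviations about the means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

# X symmetric about zero: the falling and rising halves cancel in the numerator.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_sym = [v ** 2 for v in x_sym]
r_symmetric = pearson_r(x_sym, y_sym)

# X restricted to non-negative values: the curve only rises, so r is high.
x_pos = [0, 1, 2, 3, 4, 5, 6]
y_pos = [v ** 2 for v in x_pos]
r_positive_only = pearson_r(x_pos, y_pos)

print(r_symmetric, round(r_positive_only, 2))
```

In both cases Y is an exact function of X; only the range of X being observed changes what the linear coefficient reports.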

Connections to Broader Concepts

Relationship to regression: correlation provides a foundation for understanding linear regression, which models the relationship between X and Y via a linear equation.
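One concrete form of this connection, not stated in the transcript but standard for simple linear regression: the least-squares slope equals r multiplied by the ratio of the standard deviations of Y and X. The sketch below checks that on made-up data; all variable names are illustrative.

```python
import math

# Made-up data: hours studied and exam scores.
x = [1, 2, 3, 4, 5]
y = [52, 60, 61, 70, 77]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # summed cross-deviations
sxx = sum((a - mean_x) ** 2 for a in x)                       # summed squared deviations of x
syy = sum((b - mean_y) ** 2 for b in y)                       # summed squared deviations of y

# Least-squares fit of y = intercept + slope * x.
slope_ls = sxy / sxx
intercept = mean_y - slope_ls * mean_x

# The same slope recovered from r and the two standard deviations.
# (The 1/(n-1) factors in the standard deviations cancel in the ratio.)
r = sxy / math.sqrt(sxx * syy)
slope_from_r = r * math.sqrt(syy / sxx)

print(slope_ls, slope_from_r, intercept)
```

The sign of r and the sign of the slope therefore always agree, which matches the direction rules given earlier in these notes.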
Foundational principle: it is important to distinguish correlation (association) from causation; an observed correlation alone is not sufficient to establish causation.
Real-world relevance: helps in making predictions, understanding relationships between measurements, and identifying potential confounding factors in observational studies.
Ethical and practical implications:
- Misinterpreting correlation can lead to false conclusions or misguided decisions (e.g., assuming causation from correlation alone).
- In research and policy, careful study design and consideration of confounders are essential when using correlation for decision making.

Quick Reference: Summary of Takeaways

Correlation is a measure of how two variables relate and co-vary.
Variables are anything that can vary across observations.
Scatter plots visualize the strength and direction of the relationship.
Positive vs negative correlation describes the direction of association; zero correlation indicates no linear association.
The strength of correlation is captured by the coefficient $r$, with values in the range $-1 \le r \le 1$.
High scatter (no clear pattern) corresponds to weak or zero correlation; tight linear patterns correspond to strong correlation.
Always consider non-linearity and potential confounding factors; correlation does not imply causation.