Topic 2.5-2.8: Correlation, Causation, Linear Regression, and Residuals

Topic 2.5: Correlation Coefficient (r)

Understanding the Correlation Coefficient (r)

  • Definition: The correlation coefficient, denoted by r, is a numerical measure that quantifies the direction and strength of a linear relationship between two quantitative variables. Its values range from -1 to 1.

  • Calculation: While a complex formula exists, in practice, technology (calculators or statistical software) is used to calculate r. Focus is on interpretation rather than manual calculation.
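
  • Illustrative sketch (not from the source notes): how r is typically obtained with technology. The paired data below are made up purely for demonstration; only the np.corrcoef call matters.

```python
import numpy as np

# Made-up paired quantitative data (illustrative only)
x = np.array([62, 70, 75, 81, 88, 94])   # e.g., percent of days attended
y = np.array([12, 15, 17, 19, 22, 24])   # e.g., questions answered correctly

# np.corrcoef returns a 2x2 correlation matrix; r is the off-diagonal entry
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")  # sign gives direction, magnitude gives strength
```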

Interpreting the Correlation Coefficient (r)

  • Direction: The sign of r indicates the direction of the linear relationship:

    • Positive r values: Indicate a positive correlation. As the explanatory variable (x) increases, the response variable (y) tends to increase.

    • Negative r values: Indicate a negative correlation. As the explanatory variable (x) increases, the response variable (y) tends to decrease.

  • Strength: The magnitude (absolute value) of r quantifies the strength of the linear relationship:

    • Closer to 1 or -1: Stronger linear correlation.

    • Closer to 0: Weaker linear correlation.

  • Example: Attendance vs. Test Scores: A random sample of 11 students showed a strong positive linear relationship between the percent of school days attended and the number of questions answered correctly on an exam. The correlation coefficient was r = 0.95, indicating a very strong positive relationship.

Limitations of Correlation Coefficient (r) Alone

  • The correlation coefficient r alone does not provide sufficient evidence to determine the form of the relationship (e.g., whether it is truly linear or curved) or to identify unusual features (like outliers, which can heavily influence r; see the sketch after this list).

  • Practical Implications: Even with a strong positive correlation (r = 0.95) between attendance and test scores, real-world school initiatives to raise attendance often resulted in flat test scores. This highlights a critical limitation: correlation does not imply causation.
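
  • Illustrative sketch (made-up data): a single outlier can change r dramatically, which is one reason r should always be paired with a scatterplot.

```python
import numpy as np

# Roughly linear made-up data
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9, 9.1])
print(f"r without outlier: {np.corrcoef(x, y)[0, 1]:.3f}")   # near 1

# Add one extreme point far from the pattern and recompute
x2, y2 = np.append(x, 20.0), np.append(y, 2.0)
print(f"r with outlier:    {np.corrcoef(x2, y2)[0, 1]:.3f}")  # pulled far from 1
```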

Critical Thinking in Data Analysis

  • Always be critical, cautious, and compassionate when analyzing data.

Topic 2.5 (Continued): Correlation vs. Causation

Distinguishing Correlation and Causation

  • Correlation: Simply means two variables are associated; they tend to move together in a predictable way.

  • Causation: Means that one variable directly causes a change in another variable.

  • Fundamental Concept: Correlation does not equal causation. Just because two variables are strongly correlated does not mean one causes the other.

Alternative Explanations and Confounding Variables

  • Example: Attendance Initiatives Revisited: Despite a strong positive correlation between attendance and test scores, school initiatives to raise attendance (e.g., calling programs, attendance managers, ride-sharing) failed to boost test scores. Attendance rose, but scores stayed flat.

  • Causal Chain Hypothesis: School leaders initially hypothesized: Poverty --> Low Attendance --> Low Test Scores. If Low Attendance is fixed, Low Test Scores should improve.

  • Alternative Explanation: The true causal chain might be more complex, involving confounding variables. Poverty could cause multiple factors that truly lead to low test scores, independent of attendance:

    • Hunger and lack of nutrition.

    • Zoning to worse-resourced schools.

    • Reduced study time due to working to support family or caring for family members.

  • Implication: Fixing attendance alone does not address these underlying issues, so test scores remain unchanged.

Coincidental Correlations

  • Sometimes, strong correlations can be purely coincidental, with no logical link or a hidden common cause.

  • Example: Divorce rate in Maine and per capita consumption of margarine show a strong positive correlation. This is clearly coincidental and does not imply any causal relationship.

Inferences from Correlated Data

  • When strong correlations are observed, it's crucial to investigate potential underlying causal mechanisms rather than assuming direct causation. Causal inference techniques (covered in later topics) are needed to establish causation.

Investigating Educational Inequity

  • As statisticians, it's important to rigorously investigate the driving factors of educational inequity to understand them and propose effective solutions, moving beyond simple correlation.

Topic 2.6: Linear Regression Model and Predictions

Constructing a Linear Regression Model

  • Purpose: To model the linear relationship between an explanatory variable (x) and a response variable (y) and to make predictions.

  • Algebraic Form: The familiar linear equation from algebra is y = mx + b.

  • Statistical Form (Linear Regression Equation): predicted y = y-intercept + (slope × x), written symbolically as ŷ = a + bx.

    • ŷ (read "y-hat") is the predicted response value (not the exact observed value, due to variability).

    • a is the y-intercept.

    • b is the slope.

  • Technology Use: Statistical software or calculators are typically used to determine the exact equation of the regression line.

  • Example: Supermarket Organic Items: Linda Sauceto's study found a positive trend between average income in a ZIP code and the number of organic items offered in local supermarkets. A linear model could be ŷ = -14.7 + 0.001x.
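
  • Illustrative sketch: how technology produces an equation like the one above. The (income, organic items) pairs below are invented, so the fitted intercept and slope are only illustrative, not the study's actual values.

```python
from scipy.stats import linregress

# Invented (average income, organic items) pairs -- illustrative only
income = [35000, 48000, 60000, 72000, 90000, 110000]
items = [21, 34, 45, 58, 74, 96]

fit = linregress(income, items)
print(f"predicted items = {fit.intercept:.1f} + {fit.slope:.4f} * income")
print(f"r = {fit.rvalue:.3f}")
```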

Making Predictions with the Model

  • To make a prediction, substitute a specific value for the explanatory variable x into the regression equation.

  • Example: If a ZIP code's average income is $90,000, the predicted number of organic items would be ŷ = -14.7 + 0.001(90,000) = 75.3 items.

  • Nature of Predictions: Predictions are average or expected values, not exact observed data points. This is why a decimal such as 75.3 organic items is acceptable; it represents an average tendency, not a discrete count for a single store.
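
  • Illustrative sketch of the prediction step, using the model above (the helper function name is made up):

```python
def predicted_organic_items(income):
    """Plug an average income (in dollars) into the regression model above."""
    return -14.7 + 0.001 * income

# 75.3 items -- an average tendency, not a discrete count for a single store
print(f"{predicted_organic_items(90_000):.1f}")
```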

Gauging Reliability of Predictions: Dangers of Extrapolation

  • Extrapolation: Making predictions for x-values that are outside the range of the observed x-values in the original dataset.

  • Danger: Predictions made through extrapolation are often unreliable because the observed linear trend may not continue beyond the range of the data.

    • Example: A study concluded that 100% of Americans would be overweight by 2048 by linearly extrapolating current trends. This prediction is unreliable because the trend is unlikely to continue all the way to 100%; it would more likely level off or change direction.

    • This danger applies to both time-based explanatory variables and other types (e.g., income).
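
  • Illustrative sketch of one way to flag extrapolation before trusting a prediction (the function name and the assumed observed income range are made up):

```python
def predict_with_extrapolation_check(x, model, x_min, x_max):
    """Apply a fitted model y-hat = model(x), warning if x lies outside the
    range of explanatory values actually observed in the data."""
    if not (x_min <= x <= x_max):
        print(f"Warning: x = {x} lies outside the observed range "
              f"[{x_min}, {x_max}] -- this is extrapolation.")
    return model(x)

# Organic-items model from earlier; assume observed incomes ranged from
# $30,000 to $120,000 (an assumption for illustration)
pred = predict_with_extrapolation_check(
    250_000, lambda income: -14.7 + 0.001 * income, 30_000, 120_000)
print(f"predicted items: {pred:.1f}")  # prints a warning, then 235.3
```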

Topic 2.6 (Continued): Free Response Questions and Extrapolation

Strategies for Free Response Questions (FRQs)

  • Annotate the Problem: Understand the motivation/context, identify variables, and clarify their roles (explanatory/response).

  • Include Context: All answers and interpretations must be presented within the specific context of the problem's variables.

  • Show Pertinent Work: Clearly show calculations, setups for graphs, and reasoning.

  • Practice: Consistent practice with past FRQs improves performance.

Example FRQ: Swine Population and Ammonia Concentration (2002 AP Exam Form B, Q1)

  • Context: Investigating if swine population size (in thousands) affects atmospheric ammonia concentration (in parts per million).

  • Part A: Construct a Scatterplot

    • Requirements: Plot all data points, label both axes with variable context and units, include a scale on both axes, and provide a title.

    • Scoring: Essentially Correct (E) for all components; Partially Correct (P) for one missing component; Incorrect (I) for major errors.

  • Part B: Interpret r = 0.85

    • Interpretation: Describe the strength (e.g., strong/moderately strong), direction (positive/negative), and context (swine population size and atmospheric ammonia concentration).

    • Model Response: "There is a strong positive linear relationship between swine population size and atmospheric ammonia concentration."

    • Scoring: E for all three components; P for two; I for one or zero.

  • Part C: Assess Linearity

    • Requirements: Refer to both the visual pattern in the scatterplot and the magnitude of the r value.

    • Model Response: "Because the data in the scatterplot appear to follow an approximately linear pattern and the magnitude of the r value (0.85) is relatively high, the relationship does appear to be linear."

    • Scoring: E for both components; P for one.

  • Part D: Prediction and Reliability

    • Problem: Predict the ammonia concentration for a swine population of 200 and comment on the reliability of the prediction, using the model ŷ = 0.01 + 0.72x.

    • Prediction: The swine population is recorded in thousands, so 200 pigs corresponds to x = 0.2. Predicted ammonia: ŷ = 0.01 + 0.72(0.2) = 0.154 ppm (verified in the sketch after this list).

    • Reliability: This is extrapolation because x = 0.2 falls outside the range of the given explanatory variable data (e.g., if data ranged from 0.3 to 1.9). Therefore, the prediction is not reliable as trends may not continue.

    • Scoring: E for correct prediction, correctly stating unreliability, and providing the reason (extrapolation with context); P for two components; I for one or zero.

  • Overall Scoring: E (1 point), P (0.5 points), I (0 points) for each part. Total score rounded holistically.
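
  • Illustrative sketch verifying the Part D arithmetic and the extrapolation call (the 0.3 to 1.9 observed range is the example range quoted above):

```python
slope, intercept = 0.72, 0.01
x = 200 / 1000                      # 200 pigs, recorded in thousands
y_hat = intercept + slope * x
print(f"predicted ammonia: {y_hat:.3f} ppm")          # 0.154 ppm

x_min, x_max = 0.3, 1.9             # example observed range from the notes
print("extrapolation" if not (x_min <= x <= x_max) else "within data range")
```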

Topic 2.7: Residuals and Residual Plots

Calculating and Interpreting Residuals

  • Definition: A residual is the difference between an actual observed response value (y) and the value predicted by the linear model (ŷ).

  • Formula: Residual = y − ŷ

  • Example: San Antonio Supermarkets

    • Model: ŷ = -14.7 + 0.001x (x = average income, y = organic items).

    • For a ZIP code with average income x = 66,073:

      • Predicted organic items: ŷ = -14.7 + 0.001(66,073) ≈ 51.4

    • If the actual observed number of organic items at a store in that ZIP code was y = 84, then:

      • Residual: 84 - 51.4 = 32.6 items.

  • Interpretation: The actual number of organic items offered (84) was 32.6 greater than the number the model predicted (51.4).

    • Positive Residual: Indicates the model underestimated the actual response value (y > ŷ).

    • Negative Residual: Indicates the model overestimated the actual response value (y < ŷ).
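
  • Illustrative sketch reproducing the residual calculation above:

```python
income = 66_073
y_hat = -14.7 + 0.001 * income      # predicted organic items, about 51.4
y_actual = 84                       # observed organic items
residual = y_actual - y_hat         # positive -> the model underestimated
print(f"predicted: {y_hat:.1f}, residual: {residual:.1f}")  # 51.4 and 32.6
```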

Constructing and Using Residual Plots

  • Construction: A residual plot is a scatterplot where the x-axis represents the explanatory variable (or predicted y-values) and the y-axis represents the residuals.

  • Purpose: Residual plots visualize and accentuate the residuals, providing a focused view to assess how well the linear model fits the data.

Assessing Model Fit with Residual Plots

  • Good Model Fit: A residual plot suggesting a good fit will show a random scatter of points centered around zero, with no clear pattern or structure.

    • This indicates that the linear model has successfully captured the systematic linear trend in the data, and the remaining residuals are just random noise that cannot be modeled further by a linear relationship.

  • Bad Model Fit: A residual plot suggesting a bad fit will show a distinct pattern among the residual values (e.g., a curved pattern, a fanning out or fanning in pattern).

    • This indicates that the linear model failed to capture a systematic pattern in the data, suggesting that a linear model may not be appropriate (e.g., the true relationship might be nonlinear).

  • Action for Bad Fit: When a residual plot shows a pattern, it signals that a linear model might not be the best choice, and alternative non-linear models or other modeling procedures should be considered.
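
  • Illustrative sketch of constructing a residual plot from made-up data (matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: fit a degree-1 (linear) least-squares model, then plot residuals
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 4.1, 5.8, 8.4, 9.7, 12.1, 13.8, 16.2, 17.9, 20.3])

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)     # actual minus predicted

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")              # reference line at residual = 0
plt.xlabel("x (explanatory variable)")
plt.ylabel("Residual")
plt.title("Residual plot: look for random scatter around zero")
plt.show()
```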

Topic 2.8: Least Squares Regression Line (LSRL) Determination

Determining the Least Squares Regression Line (LSRL)

  • Objective: To find the line that best fits the data by minimizing the sum of the squared residuals.