Linear Regression: Interpretation and Extrapolation

  • Regression Line Fundamentals

    • The least squares regression line provides the best linear fit for the observed data. Its output is a slope ($a$) and a y-intercept ($b$).

    • The general algebraic form is $y = ax + b$, where:

      • $a$ represents the slope of the line.

      • $b$ represents the y-intercept.

    • Every least squares regression line passes through the point $(\bar{x}, \bar{y})$ (the mean of the x-values and the mean of the y-values). This point represents the best central estimate from the data.

    • Rounding: Textbooks often use four decimal places. Always adhere to specific rounding instructions given for homework or exams.
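
The fundamentals above can be sketched in a few lines of Python. This is a minimal illustration rather than a replacement for the calculator workflow: the function name `least_squares_line` and the five data points are hypothetical, made up only to show that the fitted line passes through $(\bar{x}, \bar{y})$.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x-values are identical; slope is undefined")
    a = sxy / sxx
    # Intercept chosen so the line passes through (x_bar, y_bar).
    b = y_bar - a * x_bar
    return a, b


# Hypothetical illustrative data (not from any real study).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
predicted_at_mean = a * x_bar + b

print(round(a, 4), round(b, 4))                # slope and y-intercept
print(abs(predicted_at_mean - y_bar) < 1e-9)   # line passes through the mean point
```

Rounding is applied only when printing, so the stored values of `a` and `b` keep full precision for later calculations.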

  • Interpreting the Slope ($a$)

    • Definition: The slope is defined as "rise over run," the vertical change in the y-variable per one-unit horizontal change in the x-variable. It shows the rate of change of the response variable with respect to the explanatory variable.

    • Contextual Interpretation - Key Phraseology: For every one-unit increase in the explanatory variable ($x$), we expect the response variable ($y$) to change by the value of the slope.

    • Cautionary Language:

      • It is crucial to use language that reflects the uncertainty in statistical models, such as "we expect this to happen."

      • Avoid definitive statements like "this will happen," as statistical models are based on samples and inherent uncertainty, not entire populations.

    • Algebraic Analogy: In the equation $y = 2x + 1$, if $x$ increases from 0 to 1, $y$ increases from 1 to 3 (an increase of 2, which is the slope). Each one-unit increase in $x$ leads to a 2-unit increase in $y$.

    • Example (Burger): If x (fat in grams) increases by one gram, we expect y (calories) to increase by the value of the slope.

    • General Interpretation Template: "For every extra one [unit of x-variable], the [y-variable] is expected to [increase/decrease] by $|\text{slope value}|$."

      • Use "increase" if the slope is positive ($a > 0$).

      • Use "decrease" if the slope is negative ($a < 0$), but use the absolute value of the slope (do not include the negative sign with the word "decrease").
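
The interpretation template can be sketched as a small helper. The function name `interpret_slope` and its parameters are hypothetical, and the handling of a slope of exactly zero is an assumption, since the notes only cover positive and negative slopes.

```python
def interpret_slope(slope, x_unit, y_name, y_unit):
    """Fill in the slope interpretation template with hedged wording."""
    if slope == 0:
        # Assumption: the notes do not cover a zero slope; report no expected change.
        return f"For every extra one {x_unit}, the {y_name} is not expected to change."
    direction = "increase" if slope > 0 else "decrease"
    # Report the absolute value so a negative sign never follows "decrease".
    return (f"For every extra one {x_unit}, the {y_name} is expected to "
            f"{direction} by {abs(slope)} {y_unit}.")


# Positive slope (the 5.6428 value is the illustrative slope used in these notes).
positive_example = interpret_slope(5.6428, "inch of height", "weight", "pounds")
# Negative slope (hypothetical value, for illustration only).
negative_example = interpret_slope(-1.5, "hour", "score", "points")

print(positive_example)
print(negative_example)
```

The wording "is expected to" keeps the cautionary language built into every sentence the helper produces.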

  • Interpreting the Y-intercept ($b$)

    • Definition: Algebraically, the y-intercept is the value of $y$ when $x = 0$.

    • Contextual Interpretation: If the explanatory variable ($x$) is zero, we expect the response variable ($y$) to be the value of the y-intercept.

    • Conditions for Appropriate Interpretation: It is not always appropriate to interpret the y-intercept in context. Always consider these two questions:

      1. Is zero a reasonable value for the explanatory variable ($x$) in the real world?

        • Example: If $x$ is height, a person cannot be 0 inches tall, so it is inappropriate to interpret the y-intercept.

        • If not reasonable, the interpretation should state: "It is inappropriate to interpret the y-intercept" or "It does not make sense to interpret the y-intercept."

      2. Do we have any observed data points near $x = 0$?

        • Example: If $x$ is temperature and all collected data is from summer months (70°F to 90°F), even if 0°F is a reasonable temperature, we have no data near $x = 0$.

        • Reasoning: The relationship between $x$ and $y$ may not remain consistent at extreme values outside the observed data range. Extrapolating to $x = 0$ in such cases can lead to untrustworthy predictions.

    • If the answer to either question is "no," it is inappropriate to interpret the y-intercept. Only one condition needs to fail.

    • Self-Correction: If you accidentally interpret the y-intercept when $x = 0$ is unreasonable, the resulting statement often sounds illogical (e.g., negative weight for zero height), prompting self-correction. However, for the second condition (no data near $x = 0$), the interpretation might sound logical but still be untrustworthy.
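
The two-question check can be sketched as follows. The function name `can_interpret_intercept` is hypothetical, and so is the rule used for "near": zero counts as near the data only if it falls within `tolerance` of the observed range. That rule is an assumption made for the sketch; the notes do not define "near" numerically, and the first question still requires human judgment.

```python
def can_interpret_intercept(zero_is_reasonable, observed_xs, tolerance=0.0):
    """Return (ok, reason) for whether the y-intercept should be interpreted.

    zero_is_reasonable: your own judgment on question 1 (is x = 0 realistic?).
    observed_xs: the x-values actually collected.
    tolerance: assumed cutoff for how far outside the observed range still counts as "near".
    """
    if not zero_is_reasonable:
        return False, "x = 0 is not a reasonable value"
    lowest, highest = min(observed_xs), max(observed_xs)
    if not (lowest - tolerance <= 0 <= highest + tolerance):
        return False, "no observed data near x = 0"
    return True, "x = 0 is reasonable and within the observed data"


# Height in inches: zero is not a reasonable height.
height_check = can_interpret_intercept(False, [54, 72])
# Summer temperatures in °F: zero is reasonable, but no data is near it.
temperature_check = can_interpret_intercept(True, [70, 90])
# Hypothetical fat-in-grams data that includes zero.
fat_check = can_interpret_intercept(True, [0, 12, 25, 40])

print(height_check)
print(temperature_check)
print(fat_check)
```

Either failure is enough to return `False`, matching the rule that only one condition needs to fail.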

  • Extrapolation: Using the Model Beyond Its Bounds

    • Definition: Extrapolation occurs when a regression model is used to make predictions for values of the explanatory variable ($x$) that lie outside the range of the observed data used to create the model, especially values far larger or smaller than any observed value.

    • Problem: We lack certainty that the established linear relationship (or any relationship) between x and y will continue to hold true outside the observed data range. The behavior might change.

    • Trustworthiness: While a linear equation can mathematically provide a $y$ value for any $x$, the trustworthiness of that prediction decreases as the $x$ value moves further away from the mean of the collected $x$ data.

    • Examples of Volatility/Non-linearity at Extremes:

      • Oil Prices: Highly volatile, making long-term predictions unreliable.

      • Weather Patterns: Relationships (e.g., temperature and rainfall) differ significantly between seasons (e.g., summer rain vs. winter snow/sleet).

      • Public Opinion: Can shift rapidly and unexpectedly.
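
A simple guard against extrapolation is to compare a new x-value with the observed range before trusting a prediction. The sketch below is one possible way to do that; the function name `is_extrapolation` and the temperature readings are hypothetical.

```python
def is_extrapolation(x_new, observed_xs):
    """Return True if x_new lies outside the range of the observed x-values."""
    return x_new < min(observed_xs) or x_new > max(observed_xs)


# Hypothetical summer temperature readings in °F.
observed_temps = [70, 75, 82, 90]

inside = is_extrapolation(80, observed_temps)    # within the observed range
outside = is_extrapolation(0, observed_temps)    # far below the observed range

print(inside)    # False
print(outside)   # True
```

A `False` result only means the value is inside the observed range; it does not guarantee that the linear model fits well there.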

  • Example: Height and Weight Regression Analysis

    • Scenario: A sample of 10 people with recorded height (inches) and weight (pounds).

      • Explanatory variable ($x$): Height

      • Response variable ($y$): Weight

    • Steps for Finding the Least Squares Regression Line (Calculator Usage):

      1. Enter height data into List 1 (L1) and weight data into List 2 (L2) of the calculator. Ensure lists are of equal length and corresponding values are aligned.

      2. Use the LinReg(ax+b) function (e.g., STAT > CALC > 4:LinReg(ax+b)), specifying L1 as the Xlist and L2 as the Ylist.

    • Example Calculator Output (Illustrative Values):

      • Slope ($a$): 5.6428 (rounded to four decimal places)

      • Y-intercept ($b$): -200.4343 (rounded to four decimal places)

      • Note: Pay close attention to negative signs for $a$ and $b$.

    • Constructing the Regression Equation:

      • The predicted weight ($\widehat{\text{Weight}}$) equation: $\widehat{\text{Weight}} = 5.6428(\text{Height}) - 200.4343$

      • Alternatively, using generic variables: $\hat{y} = 5.6428x - 200.4343$ (Remember to include the hat on $\hat{y}$ and the variable $x$.)
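
Once the equation is written down, a prediction is a single substitution. The sketch below uses the illustrative slope and y-intercept from this example; the function name `predict_weight` and the 66-inch input are hypothetical.

```python
SLOPE = 5.6428         # illustrative slope from the calculator output
INTERCEPT = -200.4343  # illustrative y-intercept from the calculator output


def predict_weight(height_inches):
    """Predicted weight in pounds from the regression equation y-hat = a*x + b."""
    return SLOPE * height_inches + INTERCEPT


# Hypothetical height of 66 inches.
predicted = predict_weight(66)
print(round(predicted, 4))  # about 171.9905 pounds
```

In words: for a person who is 66 inches tall, we expect a weight of about 171.99 pounds.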

    • Interpreting the Slope for Height and Weight:

      • Slope: $a = 5.6428$

      • Interpretation: "For every extra one inch of height, we expect a person's weight to increase by 5.6428 pounds." (Since the slope is positive, we use "increase").

    • Interpreting the Y-intercept for Height and Weight:

      • Y-intercept: $b = -200.4343$

      • Condition 1 Check (Reasonable value for $x = 0$): Can a person be 0 inches tall? No, this is not a reasonable value.

      • Condition 2 Check (Data near $x = 0$): The observed height data ranges from 54 inches to 72 inches. There are no observations near $x = 0$.

      • Conclusion: It is inappropriate to interpret the y-intercept. A statement like "If a person is 0 inches tall, we expect their weight to be -200.4343 pounds" is illogical (negative weight).