Linear Regression: Interpretation and Extrapolation

Regression Line Fundamentals
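- The least squares regression line is the best linear fit for the observed data: it minimizes the sum of the squared vertical distances (residuals) between the data points and the line. Fitting it yields two values, the slope ( $a$ ) and the y-intercept ( $b$ ).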
- The general algebraic form is $y = ax + b$ , where:
  - $a$ represents the slope of the line.
  - $b$ represents the y-intercept.
- Every least squares regression line passes through the point $(\bar{x}, \bar{y})$ (the mean of the x-values and the mean of the y-values). In other words, at the average $x$ the line predicts the average $y$.
- Rounding: Textbooks often use four decimal places. Always adhere to specific rounding instructions given for homework or exams.
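
As a minimal sketch of the fundamentals above, the following Python computes the slope and y-intercept from the standard least squares formulas, $a = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $b = \bar{y} - a\bar{x}$, and then checks that the fitted line passes through $(\bar{x}, \bar{y})$. The function name `least_squares_line` and the data values are hypothetical, chosen only for illustration.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a*x + b."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must have equal length of at least 2")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x-values are identical; slope is undefined")
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
x_bar, y_bar = mean(xs), mean(ys)

# Report to four decimal places, matching the textbook convention.
print(f"slope a = {a:.4f}, intercept b = {b:.4f}")
# The line evaluated at x_bar reproduces y_bar (up to floating-point rounding).
print(abs((a * x_bar + b) - y_bar) < 1e-9)
```

Because $b$ is defined as $\bar{y} - a\bar{x}$, the final check holds for any data set, not just this one.
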
Interpreting the Slope ( $a$ )
- Definition: The slope is "rise over run": the vertical change in the y-variable per one-unit horizontal change in the x-variable. It gives the rate of change of the response variable with respect to the explanatory variable.
- Contextual Interpretation - Key Phraseology: For every one-unit increase in the explanatory variable ( $x$ ), we expect the response variable ( $y$ ) to change by the value of the slope.
- Cautionary Language:
  - Use language that reflects the uncertainty in statistical models, such as "we expect."
  - Avoid definitive statements like "this will happen." A regression model is estimated from a sample, not the entire population, so its predictions carry uncertainty.
- Algebraic Analogy: In the equation $y = 2x + 1$ , if $x$ increases from $0$ to $1$ , $y$ increases from $1$ to $3$ (an increase of $2$ , which is the slope). Each unit increase in $x$ leads to a $2$ unit increase in $y$ .
- Example (Burger): If x (fat in grams) increases by one gram, we expect y (calories) to increase by the value of the slope.
- General Interpretation Template: "For every extra one [unit of x-variable], the [y-variable] is expected to [increase/decrease] by $|\text{slope}|$ [units of y-variable]."
  - Use "increase" if the slope is positive (a > 0).
  - Use "decrease" if the slope is negative (a < 0), but use the absolute value of the slope (do not include the negative sign with the word "decrease").
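
The template above can be sketched as a small helper that picks the verb from the sign of the slope and reports only its magnitude. The function name `interpret_slope`, its parameters, and the second example's values are hypothetical; the zero-slope branch is an added assumption, since the notes only cover positive and negative slopes.

```python
def interpret_slope(slope, x_unit, y_name, y_unit):
    """Fill in the slope interpretation template with hedged wording."""
    if slope == 0:
        # Assumption: the notes do not cover a zero slope; this wording is ours.
        return (f"For every extra one {x_unit}, the {y_name} "
                f"is not expected to change.")
    direction = "increase" if slope > 0 else "decrease"
    # Report the magnitude only; the word "decrease" already carries the sign.
    return (f"For every extra one {x_unit}, the {y_name} "
            f"is expected to {direction} by {abs(slope):g} {y_unit}.")


# Slope 2 from the algebraic analogy y = 2x + 1.
print(interpret_slope(2, "unit of x", "value of y", "units"))
# A hypothetical negative slope, to show the sign being dropped.
print(interpret_slope(-1.5, "hour", "score", "points"))
```
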
Interpreting the Y-intercept ( $b$ )
- Definition: Algebraically, the y-intercept is the value of $y$ when $x=0$ .
- Contextual Interpretation: If the explanatory variable ( $x$ ) is zero, we expect the response variable ( $y$ ) to be the value of the y-intercept.
- Conditions for Appropriate Interpretation: It is not always appropriate to interpret the y-intercept in context. Always consider these two questions:
  1. Is zero a reasonable value for the explanatory variable ( $x$ ) in the real world?
    - Example: If x is height, a person cannot be $0$ inches tall, so it is inappropriate to interpret the y-intercept.
    - If not reasonable, the interpretation should state: "It is inappropriate to interpret the y-intercept" or "It does not make sense to interpret the y-intercept."
  2. Do we have any observed data points near $x=0$ ?
    - Example: If x is temperature and all collected data is from summer months ( $70^\circ \text{F}-90^\circ \text{F}$ ), even if $0^\circ \text{F}$ is a reasonable temperature, we have no data near $x=0$ .
    - Reasoning: The relationship between x and y may not remain consistent at extreme values outside the observed data range. Extrapolating to $x=0$ in such cases can lead to untrustworthy predictions.
- If the answer to either question is "no," it is inappropriate to interpret the y-intercept. Only one of the two needs to fail.
- Self-Correction: If you accidentally interpret the y-intercept when x=0 is unreasonable, the resulting statement often sounds illogical (e.g., negative weight for zero height), prompting self-correction. However, for the second condition (no data near $x=0$ ), the interpretation might sound logical but still be untrustworthy.
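
The two questions above can be sketched as a simple check. This is an illustration under stated assumptions, not a rule from the notes: whether zero is reasonable is a judgment the analyst supplies, and `near_zero_margin` is a hypothetical parameter standing in for what counts as "near" $x=0$, which the notes leave to judgment. The data values are also hypothetical.

```python
def intercept_is_interpretable(zero_is_reasonable, x_values, near_zero_margin):
    """Apply the two y-intercept checks; return (ok, reason)."""
    # Question 1: is x = 0 a reasonable real-world value? (analyst's judgment)
    if not zero_is_reasonable:
        return False, "x = 0 is not a reasonable real-world value"
    # Question 2: is any observed x within the chosen margin of zero?
    has_data_near_zero = any(abs(x) <= near_zero_margin for x in x_values)
    if not has_data_near_zero:
        return False, "no observed data near x = 0"
    return True, "both checks pass"


# Hypothetical summer temperatures (deg F) inside the 70-90 range from the example.
summer_temps = [70, 75, 82, 90]
print(intercept_is_interpretable(True, summer_temps, near_zero_margin=10))

# Height: a person cannot be 0 inches tall, so the first check already fails.
heights = [54, 60, 66, 72]  # hypothetical values
print(intercept_is_interpretable(False, heights, near_zero_margin=10))
```
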
Extrapolation: Using the Model Beyond Its Bounds
- Definition: Extrapolation is using a regression model to make predictions for values of the explanatory variable ( $x$ ) that lie well outside the range of the observed data used to build the model, either far above or far below it.
- Problem: We lack certainty that the established linear relationship (or any relationship) between x and y will continue to hold true outside the observed data range. The behavior might change.
- Trustworthiness: While a linear equation can mathematically provide a y value for any x, the trustworthiness of that prediction decreases as the $x$ value moves further away from the mean of the collected $x$ data.
- Examples of Volatility/Non-linearity at Extremes:
  - Oil Prices: Highly volatile, making long-term predictions unreliable.
  - Weather Patterns: Relationships (e.g., temperature and rainfall) differ significantly between seasons (e.g., summer rain vs. winter snow/sleet).
  - Public Opinion: Can shift rapidly and unexpectedly.
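
One way to guard against accidental extrapolation is to return a flag alongside every prediction. The sketch below is a simplification: it flags any $x$ outside the observed minimum and maximum, whereas the notes describe trustworthiness as falling off gradually with distance from the data. The function name `predict_with_range_check` and the observed values are hypothetical; the line $y = 2x + 1$ is borrowed from the algebraic analogy.

```python
def predict_with_range_check(a, b, x_new, x_observed):
    """Return (prediction, is_extrapolation) for the line y = a*x + b."""
    lo, hi = min(x_observed), max(x_observed)
    # Simplifying assumption: anything outside [lo, hi] counts as extrapolation.
    is_extrapolation = x_new < lo or x_new > hi
    return a * x_new + b, is_extrapolation


x_observed = [10, 12, 15, 20]  # hypothetical observed x-values

# Inside the observed range: the prediction is not flagged.
print(predict_with_range_check(2, 1, 14, x_observed))
# Far outside the observed range: the line still returns a number, but it is flagged.
print(predict_with_range_check(2, 1, 40, x_observed))
```

The equation produces a value in both cases; the flag is what signals that the second one should not be trusted.
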
Example: Height and Weight Regression Analysis
- Scenario: A sample of $10$ people with recorded height (inches) and weight (pounds).
  - Explanatory variable ( $x$ ): Height
  - Response variable ( $y$ ): Weight
- Steps for Finding the Least Squares Regression Line (Calculator Usage):
  1. Enter height data into List 1 (L1) and weight data into List 2 (L2) of the calculator. Ensure lists are of equal length and corresponding values are aligned.
  2. Use the LinReg(ax+b) function (e.g., STAT > CALC > 4:LinReg(ax+b)), specifying L1 as the Xlist and L2 as the Ylist.
- Example Calculator Output (Illustrative Values):
  - Slope ( $a$ ): $5.6428$ (rounded to four decimal places)
  - Y-intercept ( $b$ ): $-200.4343$ (rounded to four decimal places)
  - Note: Pay close attention to negative signs for $a$ and $b$ .
- Constructing the Regression Equation:
  - The predicted weight ( $\widehat{\text{Weight}}$ ) equation: $\widehat{\text{Weight}} = 5.6428(\text{Height}) - 200.4343$
  - Alternatively, using generic variables: $\hat{y} = 5.6428x - 200.4343$ (Remember to write the hat on $\hat{y}$, which marks it as a predicted value, and to include the $x$ variable.)
- Interpreting the Slope for Height and Weight:
  - Slope: $a = 5.6428$
  - Interpretation: "For every additional inch of height, we expect a person's weight to increase by $5.6428$ pounds." (Since the slope is positive, we use "increase.")
- Interpreting the Y-intercept for Height and Weight:
  - Y-intercept: $b = -200.4343$
  - Condition 1 Check (Reasonable value for $x=0$ ): Can a person be $0$ inches tall? No, this is not a reasonable value.
  - Condition 2 Check (Data near $x=0$ ): The observed height data ranges from $54$ inches to $72$ inches. There are no observations near $x=0$ .
  - Conclusion: It is inappropriate to interpret the y-intercept. A statement like "If a person is $0$ inches tall, we expect their weight to be $-200.4343$ pounds" is illogical (negative weight).
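
To make the example concrete, the sketch below plugs values into the fitted equation. The slope, y-intercept, and observed height range come from the example above; the height of $65$ inches is a hypothetical input chosen because it falls inside that range, and the name `predicted_weight` is ours.

```python
A = 5.6428        # slope from the example output (pounds per inch)
B = -200.4343     # y-intercept from the example output (pounds)
HEIGHT_MIN, HEIGHT_MAX = 54, 72  # observed height range (inches)


def predicted_weight(height_in):
    """Predicted weight in pounds for a height in inches."""
    return A * height_in + B


h = 65  # hypothetical height, inside the observed range
print(HEIGHT_MIN <= h <= HEIGHT_MAX)            # True: not an extrapolation
print(f"{predicted_weight(h):.4f}")             # 166.3477

# One extra inch changes the prediction by exactly the slope.
print(f"{predicted_weight(h + 1) - predicted_weight(h):.4f}")  # 5.6428

# At x = 0 the equation returns the y-intercept, a negative weight,
# which is why the intercept is not interpreted here.
print(f"{predicted_weight(0):.4f}")             # -200.4343
```
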