
Simple Linear Regression

Overview

  • Statistical models describe relationships between two or more variables.

  • Common question: "Does x influence y?"

  • Predictor variable x's influence on response variable y.

Simple Linear Regression

  • Simplest approach: Estimate a “best-fit” straight line through data.

  • Assess whether the line is flat (slope = 0) or has a non-zero slope.

  • Does not prove causality, but indicates a relationship between x and y.

  • Example: The similarity between the heights of men and their fathers may reflect shared environmental/lifestyle factors rather than direct causation.

Purposes of Regression Modelling

  1. Describe Relationship - Does height pass from father to son?

    • Investigate heritability of human male height.

  2. Explain Variation - How much of y's variability is due to father's height?

  3. Predict New Values for y - Estimating height for men based on father's height.

    • Example: Predict height of a man whose father is 5’10”.

Basic Regression Model

  • Equation: ( y = b_0 + b_1 x + \epsilon )

    • ( b_0 ): Intercept

    • ( b_1 ): Slope

    • ( y ): Response variable

    • ( x ): Predictor variable

    • ( \epsilon ): Random error term
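
  • As a concrete illustration, here is a minimal Python sketch of this model; the coefficient values, sample size, and noise level are hypothetical, chosen only to mimic the father-son height setting:

    import numpy as np

    rng = np.random.default_rng(0)

    b0, b1 = 38.0, 0.45            # hypothetical intercept and slope (inches)
    n = 50                         # hypothetical sample size
    x = rng.normal(68, 2.5, n)     # predictor: fathers' heights
    eps = rng.normal(0, 2.0, n)    # random error term
    y = b0 + b1 * x + eps          # response: sons' heights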

Least-Squares Estimation

  • Goal: Minimize the Residual Sum of Squares (RSS).

    • Formula: ( RSS = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 )

  • Residual ( \epsilon_i ): Difference between observed value ( y_i ) and predicted (fitted) value ( \hat{y}_i ).
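
  • A minimal NumPy sketch of least-squares estimation using the standard closed-form solutions for the intercept and slope that minimize the RSS (the data here are simulated stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(68, 2.5, 50)                  # predictor (simulated)
    y = 38 + 0.45 * x + rng.normal(0, 2, 50)     # response (simulated)

    # Closed-form least-squares estimates minimizing RSS = sum((y - b0 - b1*x)**2)
    b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0_hat = y.mean() - b1_hat * x.mean()

    y_hat = b0_hat + b1_hat * x                  # fitted values
    rss = np.sum((y - y_hat) ** 2)               # residual sum of squares
    print(b0_hat, b1_hat, rss)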

Regression of Father-Son Heights

  • Examines effect of father's height on son's height.

Making Predictions

  • Example prediction for a man whose father's height is 5’10” (70 inches), using the fitted coefficients (see the sketch below):

    • ( \hat{y} = b_0 + b_1 x \approx 37.6 + 0.45 \cdot 70 \approx 69.4 ) inches
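
  • A small helper for this calculation; the coefficients below are the slide's rounded values, so the result comes out near 69 inches (the slide's 69.4 presumably reflects unrounded estimates):

    def predict_son_height(father_height_in, b0=37.6, b1=0.45):
        """Predicted son's height in inches, given father's height in inches."""
        return b0 + b1 * father_height_in

    print(predict_son_height(70))   # father 5'10" (70 in) -> about 69 in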

Goodness-of-Fit

  • Evaluate how well the model fits the data.

  • Partitioning sum-of-squares formula:

    • ( \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 )

  • Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares.

  • The smaller the RSS is relative to the total sum of squares, the more accurate the model's predictions.
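
  • A quick numerical check of this partition on simulated data; any least-squares fit satisfies it up to floating-point error:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(68, 2.5, 50)                  # simulated predictor
    y = 38 + 0.45 * x + rng.normal(0, 2, 50)     # simulated response

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x

    tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
    reg_ss = np.sum((y_hat - y.mean()) ** 2)     # regression sum of squares
    rss = np.sum((y - y_hat) ** 2)               # residual sum of squares
    print(np.isclose(tss, reg_ss + rss))         # True: TSS = RegSS + RSS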

Coefficient of Determination ( R^2 )

  • Measures the proportion of the variance in y explained by the model.

  • Formula: ( R^2 = 1 - \frac{RSS}{TSS} )

  • Ranges from 0 to 1; values closer to 1 indicate the model explains more of the variance in y.
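
  • A minimal helper computing ( R^2 ) from observed and fitted values:

    import numpy as np

    def r_squared(y, y_hat):
        """R^2 = 1 - RSS/TSS: proportion of the variance in y explained by the fit."""
        y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
        rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
        tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
        return 1.0 - rss / tss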

Explaining Variation

  • Coefficient of determination for the father-son height model: ( R^2 \approx 0.35 ), i.e., fathers' heights explain about 35% of the variation in sons' heights.

Null-Hypothesis Testing

  • To determine whether there is a significant relationship between y and x:

    • Null Hypothesis: ( H_0: b_1 = 0 ) (no effect).

    • Methods:

      1. t-statistic for regression coefficients (focus on slope).

      2. F-statistic based on sum-of-squares:

      • Formula: ( F = \frac{SS_{Regression} / p}{SS_{Residual} / (N - p - 1)} )
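
  • Both tests are produced by standard regression software; for example, a sketch with statsmodels on simulated stand-in data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = rng.normal(68, 2.5, 50)                  # simulated fathers' heights
    y = 38 + 0.45 * x + rng.normal(0, 2, 50)     # simulated sons' heights

    X = sm.add_constant(x)                       # add the intercept column
    fit = sm.OLS(y, X).fit()

    print(fit.tvalues[1], fit.pvalues[1])        # t-test on the slope b1
    print(fit.fvalue, fit.f_pvalue)              # overall F-test
    # In simple regression the F-statistic equals the square of the slope's t-statistic.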

F Distribution

  • Under the null hypothesis, the F-statistic follows an F distribution with (1, 48) degrees of freedom (here p = 1 predictor and N = 50 observations).

  • The observed F-statistic of 25.4 lies far in the upper tail of this distribution, so the null hypothesis is rejected.
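
  • The tail probability of the observed statistic can be computed directly from the F(1, 48) distribution; a short SciPy sketch:

    from scipy import stats

    f_obs = 25.4                        # observed F-statistic
    p_value = stats.f.sf(f_obs, 1, 48)  # upper-tail probability under F(1, 48)
    print(p_value)                      # a very small probability, so H0 is rejected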

Regression of Father-Son Heights: P-value

  • Data shows significant relationship (p=0.0000007).

  • Taller fathers tend to have taller sons.

Assumptions for Valid P-values

  1. Residuals should be normally distributed.

  2. Residuals need to be independent.

  3. Homoscedasticity: residuals must have equal variance.

Residual Analysis

  • Residuals should be roughly normally distributed and symmetric around 0.

  • Independence is violated if the data were collected in groups (e.g., clustered or repeated measurements).

  • Residuals should show constant variance around the regression line (homoscedasticity).
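
  • A brief diagnostic sketch on simulated data; in practice one would also plot the residuals against the fitted values and consider how the data were collected:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.normal(68, 2.5, 50)                  # simulated predictor
    y = 38 + 0.45 * x + rng.normal(0, 2, 50)     # simulated response

    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)                    # residuals

    print(stats.shapiro(resid))                  # normality check on the residuals
    # Rough homoscedasticity check: split residuals at the median of x, compare spreads
    lo, hi = resid[x <= np.median(x)], resid[x > np.median(x)]
    print(lo.std(ddof=1), hi.std(ddof=1))        # roughly equal spreads are expected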

Transforming Predictor Variables

  • Example: data on soybean yields and rainfall in Illinois, 1930-62.

Raw Predictor Analysis

  • Regression equation: ( y = 15.8 + 1.9x )

  • Coefficient of determination: ( R^2 = 0.53 ).

Slope and Intercept Interpretation

  • Common ways to transform the predictor include shifting, scaling, standardization, normalization, log, and threshold transforms:

    1. Shifted: Subtract a constant from x (e.g., its minimum); changes the intercept but not the slope or ( R^2 ).

    2. Scaled: Change of units (e.g., inches to cm) rescales the slope but does not affect ( R^2 ).

    3. Standardized: Rescale x to have mean zero and standard deviation one.

    4. Normalized: Rescale x to the 0-1 range.

    5. Logarithm: Replace x with ( \log(x) ).

    6. Threshold: Convert x to an indicator based on a cutoff (e.g., rainfall > 4).

Specific Transformations

  • Each transformation changes the estimated intercept and/or slope while the underlying relationship is retained:

    • Example: Shifting x by its minimum or changing its units gives the same fitted values and ( R^2 ), only with different coefficient values.

  • A log transform of rainfall can capture diminishing returns: each additional unit of rainfall adds progressively less to yield (see the sketch below).
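
  • A sketch applying these transformations to a hypothetical rainfall predictor; the affine ones (shift, scale, standardize, normalize) leave ( R^2 ) unchanged, while the log and threshold versions change the shape of the fitted relationship:

    import numpy as np

    rng = np.random.default_rng(4)
    rain = rng.uniform(2, 8, 33)                            # hypothetical rainfall, 33 seasons
    crop = 15.8 + 1.9 * rain + rng.normal(0, 2, 33)         # hypothetical soybean yields

    def fit_r2(x, y):
        """Least-squares fit of y on x; returns (intercept, slope, R^2)."""
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        b0 = y.mean() - b1 * x.mean()
        rss = np.sum((y - (b0 + b1 * x)) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        return b0, b1, 1 - rss / tss

    transforms = {
        "raw": rain,
        "shifted": rain - rain.min(),                       # changes the intercept only
        "scaled": rain * 2.54,                              # unit change rescales the slope
        "standardized": (rain - rain.mean()) / rain.std(),  # mean 0, standard deviation 1
        "normalized": (rain - rain.min()) / (rain.max() - rain.min()),  # 0-1 range
        "log": np.log(rain),                                # diminishing returns
        "threshold": (rain > 4).astype(float),              # indicator for rainfall > 4
    }
    for name, xt in transforms.items():
        print(name, fit_r2(xt, crop))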

Conclusion

  • Proper regression techniques can yield insights into complex relationships between variables, helping guide predictions and statistical inference.
