Simple Linear Regression
Overview
Statistical models describe relationships between two or more variables.
Common question: "Does x influence y?"
That is, the influence of the predictor variable x on the response variable y.
Simple Linear Regression
Simplest approach: Estimate a “best-fit” straight line through data.
Assess whether the line is flat (slope = 0) or has a nonzero slope.
Does not prove causality, but indicates a relationship between x and y.
Example: the heights of men and their fathers may be related through shared environmental/lifestyle factors rather than direct causation.
Purposes of Regression Modelling
Describe Relationship - Does height pass from father to son?
Investigate heritability of human male height.
Explain Variation - How much of y's variability is due to father's height?
Predict New Values for y - Estimating height for men based on father's height.
Example: Predict height of a man whose father is 5’10”.
Basic Regression Model
Equation: ( y = b_0 + b_1 x + \epsilon )
( b_0 ): Intercept
( b_1 ): Slope
( y ): Response variable
( x ): Predictor variable
( \epsilon ): Random error term
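As a concrete illustration, here is a minimal Python sketch that simulates data from this model; the coefficient values, noise level, and sample size are assumptions chosen for illustration, not estimates from real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative values: intercept, slope, residual standard deviation
b0, b1, sigma = 37.6, 0.45, 2.5

x = rng.uniform(60, 78, size=50)       # predictor: fathers' heights (inches)
eps = rng.normal(0.0, sigma, size=50)  # random error term
y = b0 + b1 * x + eps                  # response: sons' heights from the model
```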
Least-Squares Estimation
Goal: Minimize the Residual Sum of Squares (RSS).
Formula: ( RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 )
Residual ( \hat{\epsilon}_i ): difference between observed value ( y_i ) and predicted value ( \hat{y}_i ).
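A minimal sketch of least-squares estimation using the standard closed-form formulas for simple regression (slope = cov(x, y)/var(x); intercept = mean(y) - slope * mean(x)); the function and variable names are my own.

```python
import numpy as np

def least_squares(x, y):
    """Closed-form least-squares estimates for y = b0 + b1*x + eps."""
    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar
    rss = np.sum((y - (b0 + b1 * x)) ** 2)  # residual sum of squares minimized
    return b0, b1, rss
```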
Regression of Father-Son Heights
Examines effect of father's height on son's height.
Making Predictions
Example prediction for a man whose father's height is 5’10”:
( \hat{y} = b_0 + b_1 x = 37.6 + 0.45 \cdot 70 \approx 69.1 ) inches (about 5’9”).
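A one-line check of this prediction in Python, using the rounded coefficients quoted above (a fit with full-precision coefficients would give a slightly different value):

```python
b0, b1 = 37.6, 0.45         # rounded coefficients from the fitted model
father_height = 70          # 5'10" in inches
predicted_son = b0 + b1 * father_height
print(f"{predicted_son:.1f} inches")  # 69.1 with these rounded values
```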
Goodness-of-Fit
Evaluate how well the model fits the data.
Partitioning sum-of-squares formula:
( \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 )
Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares.
A small RSS relative to the total sum of squares indicates accurate predictions.
Coefficient of Determination ( R^2 )
Measures the proportion of variance in y explained by the model.
Formula: ( R^2 = 1 - \frac{RSS}{TSS} = \frac{RegSS}{TSS} )
Ranges from 0 to 1; values closer to 1 indicate a better fit (more variance explained).
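A short sketch that computes the three sums of squares and ( R^2 ) for any fitted line; for least-squares fits the decomposition TSS = RegSS + RSS holds exactly (up to rounding). The names here are my own.

```python
import numpy as np

def goodness_of_fit(x, y, b0, b1):
    yhat = b0 + b1 * x
    tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
    reg_ss = np.sum((yhat - y.mean()) ** 2)  # regression sum of squares
    rss = np.sum((y - yhat) ** 2)            # residual sum of squares
    r2 = 1.0 - rss / tss                     # equivalently reg_ss / tss
    return tss, reg_ss, rss, r2
```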
Explaining Variation
Coefficient of determination for the father-son height model: ( R^2 = 0.35 ), i.e., father's height explains 35% of the variation in son's height.
Null-Hypothesis Testing
To determine whether there is a significant relationship between y and x:
Null Hypothesis: ( H_0: b_1 = 0 ) (no effect).
Methods:
t-statistic for regression coefficients (here, the slope ( b_1 )).
F-statistic based on the sum-of-squares partition (sketched below):
Formula: ( F = \frac{SS_{Regression} / p}{SS_{Residual} / (N - p - 1)} ), where ( p ) is the number of predictors.
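A sketch of this F-test in Python; scipy's F distribution supplies the p-value. The function name and argument layout are assumptions for illustration.

```python
from scipy import stats

def f_test(reg_ss, rss, n, p=1):
    """F-statistic and p-value for H0: all slopes are zero (p predictors)."""
    f = (reg_ss / p) / (rss / (n - p - 1))
    p_value = stats.f.sf(f, p, n - p - 1)  # upper-tail area of F(p, n-p-1)
    return f, p_value
```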
F Distribution
Under ( H_0 ), the F-statistic follows an F distribution with (1, 48) degrees of freedom.
The observed F-statistic of 25.4 lies far in the upper tail, so the null hypothesis is rejected.
Regression of Father-Son Heights: P-value
Data shows a significant relationship (p = 0.0000007).
Taller fathers tend to have taller sons.
Assumptions for Valid P-values
Residuals should be normally distributed.
Residuals need to be independent.
Homoscedasticity: residuals must have equal variance.
Residual Analysis
Residuals should be symmetric around 0.
Independence can be violated if data are collected in groups (clustered observations).
Residuals should show constant variance around regression line.
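A minimal residual-diagnostics sketch: a Shapiro-Wilk test for normality and a residuals-vs-fitted plot to eyeball constant variance. Note that it cannot check independence, which depends on how the data were collected.

```python
import matplotlib.pyplot as plt
from scipy import stats

def check_residuals(x, y, b0, b1):
    fitted = b0 + b1 * x
    resid = y - fitted
    # Normality: a small p-value suggests the residuals are not normal
    print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)
    # Homoscedasticity: look for a fan/funnel shape around the zero line
    plt.scatter(fitted, resid)
    plt.axhline(0, color="gray", linestyle="--")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()
```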
Transforming Predictor Variables
Example data: soybean yields and rainfall in Illinois, 1930–1962.
Raw Predictor Analysis
Regression equation: ( y = 15.8 + 1.9x )
Coefficient of determination: ( R^2 = 0.53 ).
Slope and Intercept Interpretation
Analysis includes shifting, scaling, standardizing, and normalizing the predictor (see the sketch after this list):
Shifted: subtracting a constant from x changes the intercept but not the slope or ( R^2 ).
Scaled: changing units (inches to cm) changes the slope but not ( R^2 ).
Standardized: rescales x to mean zero, standard deviation one.
Normalized: rescales x to the 0-1 range.
Logarithm: replaces x with ( \log x ).
Threshold: converts x to a binary indicator based on a cutoff (e.g., rainfall > 4 inches).
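The transformations above in Python, applied to a placeholder rainfall vector; the data values and the cutoff are illustrative.

```python
import numpy as np

x = np.array([2.2, 3.1, 4.0, 4.8, 5.9, 7.3])  # illustrative rainfall (inches)

shifted      = x - x.min()                          # shift by a constant
scaled       = x * 2.54                             # inches -> centimeters
standardized = (x - x.mean()) / x.std()             # mean 0, SD 1
normalized   = (x - x.min()) / (x.max() - x.min())  # 0-1 range
logged       = np.log(x)                            # diminishing returns
threshold    = (x > 4).astype(int)                  # 1 if rainfall > 4 inches
```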
Specific Transformations
Linear transformations (shift, scale, standardize, normalize) change the parameter estimates but preserve the underlying relationship:
Example: shifting x or changing its units alters ( b_0 ) and/or ( b_1 ) but leaves the fitted values and ( R^2 ) unchanged.
Nonlinear transformations change the model itself: a log transform of rainfall captures diminishing returns, where each additional inch of rain adds progressively less yield.
Conclusion
Proper regression techniques can yield insights into complex relationships between variables, helping guide predictions and statistical inference.