Statistical models describe relationships between two or more variables.
Common question: "Does x influence y?"
x is the predictor variable; y is the response variable.
Simplest approach: Estimate a “best-fit” straight line through data.
Assess whether the line is flat (slope = 0) or has a nonzero slope.
Does not prove causality, but indicates a relationship between x and y.
Example: the heights of men and their fathers may be linked through shared environmental/lifestyle factors rather than direct causation.
Describe Relationship - Does height pass from father to son?
Investigate heritability of human male height.
Explain Variation - How much of y's variability is due to father's height?
Predict New Values for y - Estimating height for men based on father's height.
Example: Predict height of a man whose father is 5’10”.
Equation: ( y = b_0 + b_1 x + \epsilon )
( b_0 ): Intercept
( b_1 ): Slope
( y ): Response variable
( x ): Predictor variable
Goal: Minimize the Residual Sum of Squares (RSS).
Formula: ( RSS = \sum_{i=1}^{n} \epsilon_i^2 )
Residual ( \epsilon_i = y_i - \hat{y}_i ): difference between observed value ( y_i ) and predicted value ( \hat{y}_i ).
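A minimal sketch of the least-squares fit using the standard closed-form formulas that minimize RSS; the height values below are made up for illustration, not the actual father-son data.

```python
import numpy as np

# Hypothetical illustrative data (inches); not the actual father-son dataset.
x = np.array([65.0, 67.0, 68.0, 70.0, 71.0, 72.0, 74.0])   # father heights
y = np.array([66.5, 67.0, 69.0, 69.5, 70.0, 71.5, 72.5])   # son heights

# Closed-form least-squares estimates that minimize RSS = sum of squared residuals
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x            # fitted values
residuals = y - y_hat          # e_i = y_i - y_hat_i
rss = np.sum(residuals ** 2)   # residual sum of squares
print(b0, b1, rss)
```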
Examines effect of father's height on son's height.
Example prediction for a man whose father's height is 5’10” (70 inches):
( \hat{y} = b_0 + b_1 x \approx 37.6 + 0.45 \cdot 70 \approx 69.4 ) inches (coefficients rounded for display).
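A minimal sketch of the prediction step using the rounded coefficients above; plugging in x = 70 gives roughly 69 inches, in line with the value quoted.

```python
b0, b1 = 37.6, 0.45          # reported intercept and slope (rounded)
x_new = 70                   # father's height: 5'10" = 70 inches
son_pred = b0 + b1 * x_new   # predicted son's height, roughly 69 inches
print(son_pred)
```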
Evaluate how well the model fits the data.
Partitioning sum-of-squares formula:
( \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 )
Total Sum of Squares = Regression Sum of Squares + Residual Sum of Squares.
RSS should be minimal for accurate predictions.
( R^2 ) (coefficient of determination): measures the proportion of variance in y explained by the model.
Formula: ( R^2 = 1 - \frac{RSS}{TSS} )
Ranges from 0 to 1; values closer to 1 indicate the model explains more of the variation.
Coefficient of determination for father-son height model: 35%.
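A minimal sketch of computing ( R^2 ) from the sum-of-squares partition; the 0.35 figure in the comment is the value reported in the text, not something this code reproduces on its own.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination from the sum-of-squares partition."""
    tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    return 1 - rss / tss                  # equivalently RegSS / TSS

# For the father-son height model described in the text, R^2 is about 0.35.
```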
To determine significant relationship between y and x:
Null Hypothesis: ( H_0: b_1 = 0 ) (no effect).
Methods:
t-statistic for regression coefficients (focus on slope).
F-statistic based on sum-of-squares:
Formula: ( F = \frac{SS_{Regression}/p}{SS_{Residual}/(N - p - 1)} )
Under the null hypothesis, the F-statistic follows an F-distribution with (1, 48) degrees of freedom.
The observed F-statistic of 25.4 leads to rejection of the null hypothesis.
The data show a significant relationship (p = 0.0000007).
Taller men have taller sons.
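A minimal sketch of the F-test, assuming the sum-of-squares pieces have already been computed; `scipy.stats.f.sf` gives the upper-tail probability of the F-distribution.

```python
from scipy.stats import f

def f_test(reg_ss, resid_ss, n, p):
    """F-statistic and p-value for H0: b1 = 0, with p predictors and n observations."""
    f_stat = (reg_ss / p) / (resid_ss / (n - p - 1))
    p_value = f.sf(f_stat, p, n - p - 1)   # upper-tail probability under H0
    return f_stat, p_value

# With 1 predictor and 50 observations the reference distribution is F(1, 48);
# an observed F-statistic of about 25.4 (as in the text) lies far in the upper
# tail, so the null hypothesis is rejected.
```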
Residuals should be normally distributed.
Residuals need to be independent.
Homoscedasticity: residuals must have equal variance.
Residuals should be symmetric around 0.
Independence violated if data collected in groups.
Residuals should show constant variance across the range of fitted values (see the diagnostics sketch below).
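A minimal sketch of residual diagnostics for these assumptions, assuming `y` and `y_hat` are NumPy arrays from a fitted model as above; it uses scipy's Shapiro-Wilk test for normality and a residuals-vs-fitted plot to eyeball constant variance.

```python
import matplotlib.pyplot as plt
from scipy.stats import shapiro

def residual_diagnostics(y, y_hat):
    """Basic checks: normality/symmetry of residuals and constant variance."""
    resid = y - y_hat

    # Normality and symmetry around 0: Shapiro-Wilk test plus the mean residual.
    stat, p_normal = shapiro(resid)
    print(f"Shapiro-Wilk p = {p_normal:.3f}, mean residual = {resid.mean():.3f}")

    # Homoscedasticity: residuals vs fitted values should show no funnel shape.
    plt.scatter(y_hat, resid)
    plt.axhline(0, linestyle="--")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()
```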
Example data: soybean yields and rainfall in Illinois, 1930-62.
Regression equation: ( y = 15.8 + 1.9x )
Coefficient of determination: ( R^2 = 0.53 ).
Analysis considers several transformations of the predictor x: shifting, scaling, standardization, normalization, logarithm, and thresholding:
Shifted: Subtracts a constant from x (e.g., its minimum); changes the intercept but not the slope or ( R^2 ).
Scaled: Changes the unit of x (e.g., inches to cm); rescales the slope but leaves ( R^2 ) unchanged.
Standardized: Creates mean of zero, standard deviation of one.
Normalized: Adjusts to a 0-1 range.
Logarithm: Transforms x logarithmically.
Threshold: Converts x to an indicator based on a cutoff (e.g., rainfall > 4 inches).
Linear transformations (shift, scale, standardize, normalize) change the estimated coefficients but leave the fitted values and ( R^2 ) unchanged.
Example: shifting x by its minimum or converting inches to cm gives the same predictions, just expressed through different coefficients (see the sketch below).
Nonlinear transformations change the model itself: the log transform captures diminishing returns as rainfall increases, and the threshold reduces x to above/below a cutoff.
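A sketch contrasting these transformations on made-up rainfall data generated around the text's fitted equation; the exact numbers are illustrative, but the pattern (linear rescalings of x change the coefficients without changing ( R^2 ), while log and threshold change the model itself) holds in general.

```python
import numpy as np

rng = np.random.default_rng(0)
rain = rng.uniform(2, 10, size=33)                 # hypothetical rainfall (inches), one value per year
yield_ = 15.8 + 1.9 * rain + rng.normal(0, 3, 33)  # yields simulated around the text's equation

def fit_r2(x, y):
    """Simple least-squares fit; returns (b0, b1, R^2)."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return b0, b1, r2

transforms = {
    "raw":          rain,
    "shifted":      rain - rain.min(),                                # changes b0 only
    "scaled (cm)":  rain * 2.54,                                      # rescales b1, same R^2
    "standardized": (rain - rain.mean()) / rain.std(),                # mean 0, sd 1
    "normalized":   (rain - rain.min()) / (rain.max() - rain.min()),  # 0-1 range
    "log":          np.log(rain),                                     # diminishing returns
    "threshold":    (rain > 4).astype(float),                         # cutoff at 4 inches
}

for name, x in transforms.items():
    b0, b1, r2 = fit_r2(x, yield_)
    print(f"{name:>12}: b0 = {b0:6.2f}, b1 = {b1:6.2f}, R^2 = {r2:.2f}")
```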
Proper regression techniques can yield insights into complex relationships between variables, helping guide predictions and statistical inference.