Lecture 4: Introduction to Linear Models and Simple Linear Regression
Course Context and Sequential Learning Structure
Sequential Development: Unlike other courses (such as marine biology at UQ) where different lecturers can teach components in varying orders, this course is strictly sequential. Content builds directly on previous material, making every session foundational.
Foundational Importance: Linear models are described as the "bread and butter" of statistical analysis. A deep understanding of these concepts is considered essential for a career involving data analysis.
Classifications of Linear Models
General Linear Models: Characterized by a normal error structure. The standard formula is $y = \beta_0 + \beta_1 x + \varepsilon$, where the error term $\varepsilon$ is normally distributed. It assumes that the residuals (the vertical deviations of the points from the fitted line) follow a normal distribution once the model is fitted.
Simple Linear Regression: Contains one response variable ($y$) and exactly one continuous predictor variable ($x$).
Multiple Linear Regression: Features one response variable ($y$) but utilizes multiple continuous predictor variables ($x_1, x_2, \dots$).
Categorical Variables (Factors): When variables are discrete (e.g., eye color, hair color) rather than continuous, the analysis changes:
T-test: Used when a categorical predictor has only two levels.
ANOVA (Analysis of Variance): Used when a categorical predictor has multiple levels (e.g., four different hair colors).
Relationship: A T-test is essentially a simplified version of an ANOVA.
ANCOVA (Analysis of Covariance): A model that incorporates both continuous and categorical variables as predictors.
Generalized Linear Models (GLMs): These extend beyond the normal error structure to include different types of error distributions:
Poisson: Used for count data, where the response variable ($y$) is discrete ($0, 1, 2, \dots$) rather than continuous.
Binomial: For binary outcomes.
Multinomial and Exponential: Other specialized error structures.
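The model families listed above map onto a small number of R calls. The following is a minimal sketch on simulated data, not the lecture's own example: the data frame d, its column names (y, x1, x2, group, counts), and all simulated effect sizes are hypothetical.

```r
# Hypothetical simulated data, for illustration only
set.seed(1)
n <- 60
d <- data.frame(
  x1    = runif(n, 0, 10),
  x2    = runif(n, 0, 5),
  group = factor(rep(c("a", "b", "c"), each = n / 3))
)
d$y      <- 2 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(n)
d$counts <- rpois(n, lambda = exp(0.2 + 0.1 * d$x1))

m_simple   <- lm(y ~ x1, data = d)          # simple linear regression
m_multiple <- lm(y ~ x1 + x2, data = d)     # multiple linear regression
m_anova    <- lm(y ~ group, data = d)       # one-way ANOVA (three-level factor)
m_ancova   <- lm(y ~ x1 + group, data = d)  # ANCOVA (continuous + factor)
m_poisson  <- glm(counts ~ x1, family = poisson, data = d)  # Poisson GLM

# Two-level factor: the equal-variance t-test and the linear model agree
d2  <- droplevels(d[d$group != "c", ])
tt  <- t.test(y ~ group, data = d2, var.equal = TRUE)
m_t <- lm(y ~ group, data = d2)
abs(unname(tt$statistic))                     # t statistic from t.test()
abs(coef(summary(m_t))["groupb", "t value"])  # same magnitude from lm()
```

The last two lines illustrate the point that a T-test is the two-level special case of the same linear model machinery.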
Purposes of Statistical Modeling
Understanding Relationships: Determining how different variables interact within a dataset.
Testing Scientific Hypotheses: Utilizing models like T-tests or ANOVAs to find statistical significance.
Prediction: Estimating future outcomes based on current data trends.
Estimating Hidden Parameters: Calculating values for variables that cannot be measured directly through a mathematical understanding of the system.
Controlling for Variation and Confounding: Identifying which predictors are actual drivers vs. those influenced by other variables.
Confounding Example: Ice cream sales in Australia have risen over 50 years. While climate change might show a correlation, the true driver is likely population size. By putting both population and climate change in a model, researchers adjust for the confounding variable to see which effect disappears or remains significant.
Decision Making: Using statistical evidence to inform choices.
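The confounding example above can be sketched with simulated data. Everything here is hypothetical (the variable names, units, and effect sizes are invented for illustration): sales are generated from population only, while temperature merely trends upward over the same years.

```r
# Hypothetical simulation: sales depend on population, not on temperature
set.seed(42)
year        <- 1:50
population  <- 10 + 0.3 * year + rnorm(50, sd = 0.5)
temperature <- 20 + 0.02 * year + rnorm(50, sd = 0.3)
sales       <- 5 * population + rnorm(50, sd = 2)

m_temp_only <- lm(sales ~ temperature)               # confounded model
m_both      <- lm(sales ~ population + temperature)  # adjusts for population

b_alone    <- unname(coef(m_temp_only)["temperature"])
b_adjusted <- unname(coef(m_both)["temperature"])
print(c(alone = b_alone, adjusted = b_adjusted))

# Estimates, standard errors, T-values and p-values for the adjusted model
coef(summary(m_both))
```

With temperature alone, the model attributes the shared time trend to temperature. Once population is included, the temperature coefficient shrinks toward zero, which is the "effect disappears" pattern described above.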
Model Development and Parameter Estimation
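Formula Representation: $y = \beta_0 + \beta_1 x + \varepsilon$.
$\beta_0$ represents the intercept (often written $a$ or $c$ in other notation).
$\beta_1$ represents the slope (often written $b$ or $m$ in other notation).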
Testing against Zero: The primary goal is to determine if the beta parameters are significantly different from zero. If a slope ($\beta_1$) is zero, the relationship is flat, implying the predictor has no effect on the response.
Explanatory Power: Terms are dropped from the model if their removal does not significantly decrease the $R^2$, that is, the explanatory power. If the model is not significantly worse without a term, that predictor is not considered important.
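Dropping a term and checking whether the model gets significantly worse can be sketched as a comparison of nested models. This is an illustrative example on simulated data; the predictor x2 is hypothetical and is generated with no real effect on y.

```r
# Hypothetical data: y depends on x1 only; x2 is pure noise
set.seed(7)
n  <- 80
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 1 + 2 * x1 + rnorm(n)

m_full    <- lm(y ~ x1 + x2)
m_reduced <- lm(y ~ x1)   # x2 dropped

# Each parameter tested against zero (estimate, std. error, T-value, p-value)
coef(summary(m_full))

# F-test comparing the nested models: is the fit significantly worse without x2?
anova(m_reduced, m_full)

# R-squared can only stay the same or drop when a term is removed
r2 <- c(full = summary(m_full)$r.squared, reduced = summary(m_reduced)$r.squared)
print(r2)
```

If the comparison is not significant and the drop in $R^2$ is negligible, the simpler model is usually preferred.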
The Mechanics of Line Fitting
The Centroid: The line of best fit for any linear regression with an intercept always passes through the "centroid," which is the point defined by the mean of $x$ ($\bar{x}$) and the mean of $y$ ($\bar{y}$).
Mental Rotation: Fitting a line is conceptually similar to rotating a line around the centroid until the residuals (the distances between individual points and the line) are minimized.
Residuals: Represented visually as dotted lines between the observed data points and the regression line. A horizontal line passing through the centroid would have very large residuals, making it a poor fit.
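The centroid property and the poor fit of a flat line can both be checked numerically. The six data points below are made up for illustration.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)

m <- lm(y ~ x)

# Prediction at the mean of x equals the mean of y: the line passes through the centroid
pred_at_mean_x <- unname(predict(m, newdata = data.frame(x = mean(x))))
print(c(prediction = pred_at_mean_x, mean_y = mean(y)))

# Squared residuals for the fitted line vs. a horizontal line through the centroid
sse_fitted <- sum(resid(m)^2)
sse_flat   <- sum((y - mean(y))^2)
print(c(fitted_line = sse_fitted, flat_line = sse_flat))
```

The flat line through the centroid is the "no relationship" reference; rotating away from it toward the fitted line is what shrinks the residuals.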
Sums of Squares and Variance Decomposition
Variable Definitions:
$y_i$: The observed data points.
$\bar{y}$: The overall mean of the response variable.
$\hat{y}_i$: The predicted estimate from the model for a specific observation.
Components of Variance:
Total Sums of Squares (SST): Calculated as $\sum_i (y_i - \bar{y})^2$. This represents the total variation in the data. We square each difference because the raw deviations from the mean, $\sum_i (y_i - \bar{y})$, would otherwise sum to zero.
Regression Sums of Squares (SSR): Variation explained by the regression model, $\sum_i (\hat{y}_i - \bar{y})^2$. As the line gets steeper (deviating from the flat mean), this value increases.
Error Sums of Squares (SSE): Also known as Residual Sums of Squares, $\sum_i (y_i - \hat{y}_i)^2$. This is the variation not explained by the model (noise). The model-fitting process rotates the line until this quantity is minimized.
The Relationship: Total Variation = Explained Variation (Regression) + Unexplained Variation (Error), i.e. SST = SSR + SSE.
Mean Squares (MS): This is the variance estimate, calculated as $MS = SS / df$ (a sum of squares divided by its degrees of freedom). Mean Squares Total is equivalent to the total variance of the data.
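The decomposition can be verified by hand in R. This sketch uses made-up data and computes each sum of squares directly from its definition, then compares the results with what R reports.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
n <- length(y)

m     <- lm(y ~ x)
y_hat <- fitted(m)   # model predictions
y_bar <- mean(y)     # overall mean

sst <- sum((y - y_bar)^2)      # total variation
ssr <- sum((y_hat - y_bar)^2)  # explained by the regression
sse <- sum((y - y_hat)^2)      # unexplained (residual)

print(c(SST = sst, SSR_plus_SSE = ssr + sse))

# Mean squares: each sum of squares divided by its degrees of freedom
ms_total <- sst / (n - 1)
ms_reg   <- ssr / 1
ms_err   <- sse / (n - 2)
print(c(MS_total = ms_total, var_y = var(y)))
```

Mean Squares Total matches var(y) because both divide the same sum of squared deviations by $n - 1$.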
Understanding Degrees of Freedom ($df$)
Conceptual Definition: The number of values in a final calculation that are free to vary.
Variance Example: If you have $n = 5$ data points and you know the mean is $\bar{y}$, the total sum must be $n\bar{y} = 5\bar{y}$. You can freely choose any numbers for the first four positions, but the fifth number is "locked" to ensure the result matches the pre-determined mean. Therefore, $df = n - 1 = 4$.
Regression Degrees of Freedom: For simple linear regression ($y = \beta_0 + \beta_1 x$), we estimate two parameters ($\beta_0$ and $\beta_1$). Since $\beta_0$ is mathematically tied to the mean, only the slope costs a degree of freedom, making the regression $df = 1$. If there were three parameters (e.g., two slopes), the regression $df$ would be $2$.
Error Degrees of Freedom: This is the total $df$ minus the regression $df$. For a simple linear model, it is $n - 2$, because you need to estimate two parameters ($\beta_0$ and $\beta_1$) to calculate the predicted values ($\hat{y}_i$).
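Both ideas can be checked in a few lines of R. The numbers below (a target mean of 10 and four freely chosen values) are hypothetical and are not the values used in the lecture.

```r
# Mean example with hypothetical numbers: n = 5, mean fixed at 10
n           <- 5
target_mean <- 10
first_four  <- c(7, 12, 9, 14)                      # chosen freely
fifth       <- n * target_mean - sum(first_four)    # forced by the fixed mean
values      <- c(first_four, fifth)
print(fifth)
print(mean(values))

# Degrees of freedom in a simple linear regression on hypothetical data (n = 6)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
m <- lm(y ~ x)

reg_and_error_df <- as.numeric(anova(m)$Df)  # regression df, then error df
print(reg_and_error_df)
print(df.residual(m))                        # error df = n - 2
```

The regression and error degrees of freedom add up to the total, $n - 1$.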
Linear Models in R: Syntax and Concepts
The lm() Function: Stands for Linear Model.
Formula Syntax: Y ~ X (where ~ is the tilde operator, meaning "modeled as a function of").
Tilde Logic: It does not mean "equals"; it specifies the relationship between response and predictors.
Intercept Defaults: R automatically includes an intercept in every model (represented by the value 1 internally). It is advised never to remove the intercept unless you have a specific "a priori" reason to know it is zero.
Interactions: Denoted by a colon : (e.g., A:B). This represents synergistic or antagonistic effects where the impact of one predictor depends on the level of another.
Example: Days off ($y$) based on Stress ($A$) and Sickness ($B$). If you are both stressed and sick, you might take more days off than the simple sum of the two individual effects.
Star Operator: A * B is shorthand for A + B + A:B (main effects plus the interaction).
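The formula operators can be tried out on simulated data. In this sketch the data frame d and the columns days_off, stress, and sickness are hypothetical stand-ins for the days-off example, with an interaction effect built into the simulation.

```r
# Hypothetical data with a built-in stress-by-sickness interaction
set.seed(3)
n <- 100
d <- data.frame(stress = runif(n, 0, 10), sickness = runif(n, 0, 10))
d$days_off <- 1 + 0.2 * d$stress + 0.3 * d$sickness +
  0.1 * d$stress * d$sickness + rnorm(n)

# The star operator expands to main effects plus the interaction
m_star <- lm(days_off ~ stress * sickness, data = d)
m_long <- lm(days_off ~ stress + sickness + stress:sickness, data = d)
print(names(coef(m_star)))
print(all.equal(coef(m_star), coef(m_long)))

# The intercept is included by default; "0 +" (or "- 1") removes it
m_default <- lm(days_off ~ stress, data = d)
m_no_int  <- lm(days_off ~ 0 + stress, data = d)
print(names(coef(m_default)))
print(names(coef(m_no_int)))
```

The two interaction models are the same fit written two ways, so their coefficients are identical.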
Mathematical Linearity vs. Geometric Shape
Linear Combination: A model is "linear" if it is additive in its parameters. It must follow the form: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \varepsilon$.
What is NOT Linear:
Parameters multiplied together (e.g., $y = \beta_0 \beta_1 x$).
Parameters in exponents or denominators.
Linear Shapes can be Curved:
Polynomials: $y = \beta_0 + \beta_1 x + \beta_2 x^2$ is a linear model because it is linear in its parameters, even though its shape is a curve (quadratic).
Logs: Linear in parameters after transformation.
Exponential Growth: $y = \beta_0 e^{\beta_1 x}$ is non-linear in its raw form, but taking the log of both sides ($\ln y = \ln \beta_0 + \beta_1 x$) "linearizes" the equation.
R Functions: Linear models use lm(), while truly non-linear models require nls() (Non-linear Least Squares), which estimates the parameters iteratively because there is no closed-form solution.
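The distinction between curved shapes and non-linear parameters can be sketched as follows. The data are simulated, the true parameter values (such as a growth rate of 0.5) are hypothetical, and the nls() starting values are taken from the log-linear fit as one reasonable choice rather than a required one.

```r
set.seed(11)
x <- seq(0, 5, length.out = 40)

# Quadratic: curved shape, but linear in its parameters, so lm() applies
y_quad <- 1 + 2 * x - 0.5 * x^2 + rnorm(40, sd = 0.2)
m_quad <- lm(y_quad ~ x + I(x^2))
print(coef(m_quad))

# Exponential growth with multiplicative noise (hypothetical true rate 0.5)
y_exp <- 2 * exp(0.5 * x) * exp(rnorm(40, sd = 0.05))

# Option 1: linearize by taking logs, log(y) = log(a) + b * x, then use lm()
m_log <- lm(log(y_exp) ~ x)
print(coef(m_log))

# Option 2: fit the raw non-linear form iteratively with nls()
start_vals <- list(a = exp(unname(coef(m_log)[1])), b = unname(coef(m_log)[2]))
m_nls <- nls(y_exp ~ a * exp(b * x), start = start_vals)
print(coef(m_nls))
```

Note that nls() needs starting values because it searches for the solution step by step, whereas lm() solves for the estimates directly.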
Interpreting R Model Summary Output
Signal to Noise Ratio: The T-value is calculated as $t = \text{estimate} / \text{standard error}$.
Testing Parameters: Each parameter (intercept and slope) is tested against zero.
If the estimate is large relative to its standard error, the T-value will be high and the p-value will be significant.
Interpretation of Slope: A slope of $\beta_1$ means that for every one-unit increase in $x$, $y$ is predicted to increase by $\beta_1$ units.
ANOVA Tables in R: Using the anova(model) command breaks down the sums of squares.
F-test: This is the ratio of two variances ($F = MS_{\text{regression}} / MS_{\text{error}}$).
If the variation explained by the regression is significantly larger than the random error (noise), the F-ratio will be high, indicating a significant model.
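The T-values and the F-ratio reported by R can be reproduced from the pieces of the output. This sketch uses made-up data; the column names ("Estimate", "Std. Error", "t value", "Mean Sq", "F value") are the ones R uses in its summary and ANOVA tables.

```r
# Hypothetical data
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
m <- lm(y ~ x)

# Coefficient table: one row per parameter (intercept and slope)
ctab <- coef(summary(m))
print(ctab)

# T-value = estimate divided by its standard error
t_manual <- ctab[, "Estimate"] / ctab[, "Std. Error"]
print(t_manual)

# ANOVA table: F = mean square regression / mean square error
atab <- anova(m)
print(atab)
f_manual <- atab["x", "Mean Sq"] / atab["Residuals", "Mean Sq"]
print(f_manual)

# With a single predictor, F equals the squared T-value of the slope
print(unname(t_manual["x"])^2)
```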
Core Assumptions of Linear Models
Normality: Residuals should be normally distributed around the regression line.
Homogeneity of Variance (Homoscedasticity): The variance of the error should be constant across all levels of the predictor variable.
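One way to look at both assumptions is to extract the residuals from a fitted model. The sketch below uses simulated data; the Shapiro-Wilk test and the split-half comparison of residual spread are illustrative choices made here, not checks prescribed in the lecture.

```r
# Hypothetical data that satisfies both assumptions by construction
set.seed(5)
x <- runif(100, 0, 10)
y <- 3 + 1.5 * x + rnorm(100)
m <- lm(y ~ x)

res <- resid(m)

# Normality: Shapiro-Wilk test on the residuals (a small p-value suggests non-normality)
sw <- shapiro.test(res)
print(sw$p.value)

# Constant variance: compare residual spread for low vs. high fitted values
lower  <- fitted(m) <= median(fitted(m))
spread <- c(sd_lower = sd(res[lower]), sd_upper = sd(res[!lower]))
print(spread)

# In an interactive session, the standard diagnostic plots show the same things:
# plot(m, which = 1:2)   # residuals vs. fitted, and normal Q-Q plot
```

Similar spread in both halves is consistent with homoscedasticity; a fan shape in the residuals-vs-fitted plot would suggest the opposite.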
Questions & Discussion
Residual Calculation: A student asked about calculations regarding residual tables. The speaker confirmed that to show sums of squares are additive, one must take the difference between the observed value ($y_i$) and the model estimate ($\hat{y}_i$) and square it to get the error component, then do the same for the difference between the model estimate and the overall mean ($\bar{y}$) to get the regression component. Summed over all observations, the two components add up to the total sum of squares.
Degrees of Freedom Illustration: Reiteration of the $n=5$ mean example. If four numbers are chosen and the mean must equal a fixed value, the fifth number cannot be chosen freely; it is locked to ensure the set averages to the required value.