Comprehensive Study Guide on Linear and Non-linear Regression

Introduction to Statistical Relationships and Regression

  • This session marks the penultimate lecture for this specific lecturer. Future lectures will be conducted by David Moreau, focusing on quantitative research and probability theory.
  • The current focus transitions away from hypothesis testing toward exploratory methods aimed at understanding and quantifying patterns in data, specifically relationships between variables.
  • Correlation Review:
    • Pearson correlation is the standard tool for measuring the strength and direction of a linear relationship between two variables.
    • It assesses if a significant relationship exists (via the correlation test) and how strong the relationship is (via the correlation value).
  • Regression Intro: Linear regression is a flexible tool that measures linear relationships but can be extended to analyze multiple variables and complex non-linear patterns.

The Equation for a Straight Line

  • Linear regression is based on the high school mathematical formula for a straight line: $y = m \times x + c$.
    • $y$: The outcome variable, plotted on the vertical axis (y-axis). Regression aims to predict $y$ from $x$.
    • $x$: The predictor variable, plotted on the horizontal axis (x-axis).
    • $m$: The slope parameter. It controls the steepness and direction (positive or negative) of the line.
    • $c$: The intercept parameter. It determines the point where the line intersects the y-axis.
  • Every possible straight line is defined entirely by these two numbers (slope and intercept).
  • Formal Regression Formulation in Textbooks: $y = \text{intercept} + \text{slope} \times x + \text{error}$.
    • The logic remains identical to the slope-intercept form, but adds an error term.
    • Error Term: Captures the "residual" or leftover error. Real data rarely fit a line perfectly because of measurement error, individual differences, or unobserved factors. Instead, the data points typically scatter around a "line of best fit."
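  • Illustrative sketch (not from the lecture): the short Python example below simulates data that follow the textbook formulation, a straight line plus an error term. The intercept, slope, and noise level are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical parameter values, chosen only for illustration.
intercept = 1.0
slope = 0.5

x = np.linspace(0, 10, 20)                           # predictor values
error = rng.normal(loc=0.0, scale=0.4, size=x.size)  # random "leftover" error
y = intercept + slope * x + error                    # observed outcomes

line = intercept + slope * x   # the straight line itself (no error)
residuals = y - line           # vertical gap between each point and the line
```

  • In this sketch the residuals are exactly the simulated error values, which is what the error term in the formula stands for.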

Fitting the Model: The Method of Least Squares

  • "Fitting" a model refers to finding the specific values for the slope and intercept that create the line best representing the collected data.
  • Logic of the Process:
    • Step 1: Measure distances. The software (such as jamovi or R) calculates how far each individual data point is from a candidate line in terms of the outcome ($y$). These are vertical distances representing the difference between the prediction (the line) and the reality (the data point).
    • Step 2: Square the distances. Because some points are above the line (positive difference) and some are below (negative difference), the distances are squared to ensure all values are positive ($\text{negative} \times \text{negative} = \text{positive}$). This removes directionality and focuses on distance magnitude.
    • Step 3: Calculate the sum of squares. Add all squared distances together to get the "residual sum of squares" ($SS_{\text{residual}}$).
  • Definition of Best Fit: The line of best fit is the line that minimizes the residual sum of squares. Conceptually, the software adjusts the slope and intercept until it finds the values that give the smallest possible $SS_{\text{residual}}$; for a straight line these values can also be computed directly from a formula.
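  • Illustrative sketch (not from the lecture): the three steps can be written out in a few lines of Python. The data values are invented, and the function name `residual_sum_of_squares` is a hypothetical helper. The sketch uses the standard closed-form least-squares formulas for a straight line rather than an iterative search; both approaches target the same minimum.

```python
import numpy as np

# Hypothetical data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def residual_sum_of_squares(slope, intercept, x, y):
    """Steps 1-3: vertical distances, squared, then summed."""
    predicted = intercept + slope * x
    distances = y - predicted   # Step 1: signed vertical distances
    squared = distances ** 2    # Step 2: squaring removes the sign
    return squared.sum()        # Step 3: residual sum of squares

# Closed-form least-squares estimates for a straight line.
x_mean, y_mean = x.mean(), y.mean()
best_slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
best_intercept = y_mean - best_slope * x_mean

# Compare the best-fit line with a slightly different candidate line.
ss_best = residual_sum_of_squares(best_slope, best_intercept, x, y)
ss_other = residual_sum_of_squares(best_slope + 0.3, best_intercept, x, y)
print(ss_best < ss_other)  # the best-fit line has the smaller sum of squares
```

  • Any candidate line other than the least-squares line gives a larger residual sum of squares, which is what "best fit" means here.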

Quantifying Model Quality: Goodness of Fit and $R^2$

  • Once a line is fitted, its quality is measured using $R^2$ (R-squared).
  • $R^2$ is a goodness of fit measure. A high value indicates a good fit (points close to the line); a low value indicates a bad fit (points scattered far from the line).
  • Mathematical Formula for $R^2$:
    • $R^2 = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$
    • $SS_{\text{residual}}$: Variability in $y$ not explained by the line.
    • $SS_{\text{total}}$: Total variability in $y$, measured as the sum of squared differences between data points and their mean.
  • Interpretation of $R^2$:
    • It represents the proportion of variance in the outcome variable ($y$) that can be explained by the predictor variable ($x$).
    • An $R^2$ of 1 means 100% of the variance is explained.
  • Relationship with Correlation: For a straight-line fit, $R^2$ is exactly the square of the Pearson correlation coefficient ($r$). Thus, a strong correlation, whether positive or negative, necessarily results in a high $R^2$ (for example, $r = 0.8$ or $r = -0.8$ both give $R^2 = 0.64$).
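  • Illustrative sketch (not from the lecture): the Python example below computes $R^2$ from the two sums of squares and checks it against the squared Pearson correlation. The data values are invented for illustration; `np.polyfit` and `np.corrcoef` are standard NumPy functions.

```python
import numpy as np

# Hypothetical data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.3, 2.9, 4.4, 4.8, 6.1])

# Fit a straight line by least squares (returns slope first, then intercept).
slope, intercept = np.polyfit(x, y, deg=1)
predicted = intercept + slope * x

ss_residual = ((y - predicted) ** 2).sum()  # variability not explained by the line
ss_total = ((y - y.mean()) ** 2).sum()      # total variability around the mean
r_squared = 1 - ss_residual / ss_total

pearson_r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_squared, pearson_r ** 2))  # R^2 equals r squared for a line
```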

Prediction in Linear Regression

  • The advantage of regression over correlation is the ability to predict new, unobserved values.
  • By establishing the specific parameters of the line, researchers can use the resulting equation as a model.
  • Example Study: Relationship between Quiz 1 scores ($x$) and Quiz 2 scores ($y$).
    • Scenario: A new student scores 3.5 on Quiz 1.
    • By plugging $x = 3.5$ into the regression equation, the model provides an estimate of their Quiz 2 score (e.g., 3.4).
    • Visually, this is equivalent to finding 3.5 on the x-axis, moving up to the regression line, and identifying the corresponding value on the y-axis.
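  • Illustrative sketch (not from the lecture): prediction is just plugging a new $x$ into the fitted equation. The notes do not give the fitted intercept and slope for the quiz example, so the values below are hypothetical, picked only so that a Quiz 1 score of 3.5 maps to roughly 3.4. The function name `predict_quiz2` is also made up.

```python
# Hypothetical fitted parameters (NOT the values from the lecture data).
intercept = 0.6
slope = 0.8

def predict_quiz2(quiz1_score):
    """Estimate a Quiz 2 score from a Quiz 1 score using the fitted line."""
    return intercept + slope * quiz1_score

estimate = predict_quiz2(3.5)
print(round(estimate, 2))  # about 3.4 with these made-up parameters
```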

Multiple Regression and Dimensionality

  • Regression can be extended beyond simple two-variable relationships into higher dimensions.
  • Dimensional Logic:
    • 1 variable = 1 dimension (data points on a line).
    • 2 variables = 2 dimensions (scatter plot with a line).
    • 3 variables = 3 dimensions (3D plot where the "line" becomes a 2D plane or "sheet").
    • 4+ variables = Higher dimensions (cannot be visualized, but mathematically calculable).
  • Multiple Regression Formulation:
    • $y = \text{intercept} + (\text{slope}_1 \times X_1) + (\text{slope}_2 \times X_2) + \dots + (\text{slope}_n \times X_n)$
    • This allows a researcher to use multiple predictors (e.g., income, IQ, and age) to predict a single outcome (e.g., happiness).
    • The result is an $R^2$ value indicating how much of the variance in the outcome is explained by the best combination of all predictors.
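  • Illustrative sketch (not from the lecture): a multiple regression with three predictors can be fitted by least squares using NumPy's `np.linalg.lstsq`. All data here are simulated, and the "true" slopes are made-up numbers; the variable names simply echo the income, IQ, age, and happiness example.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
n = 200

# Simulated predictors in arbitrary standardised units (hypothetical data).
income = rng.normal(size=n)
iq = rng.normal(size=n)
age = rng.normal(size=n)

# Simulated outcome built from made-up slopes plus random error.
happiness = (2.0 + 0.5 * income + 0.3 * iq - 0.2 * age
             + rng.normal(scale=0.5, size=n))

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(n), income, iq, age])
coefficients, *_ = np.linalg.lstsq(X, happiness, rcond=None)

predicted = X @ coefficients
ss_residual = ((happiness - predicted) ** 2).sum()
ss_total = ((happiness - happiness.mean()) ** 2).sum()
r_squared = 1 - ss_residual / ss_total

print(coefficients)  # intercept first, then one slope per predictor
print(r_squared)     # variance explained by all predictors together
```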

Non-linear and Non-monotonic Relationships

  • Regression is not limited to straight lines; any mathematical function can be fitted to data using the method of least squares.
  • Types of Non-linear Models:
    • Quadratic Functions: U-shaped (or inverted U-shaped) parabolic functions useful for non-monotonic data.
    • Polynomial Regression: Uses increasingly complex functions with extra parameters to fit arbitrarily shaped patterns.
    • Sinusoidal Functions: Useful for cyclical data (e.g., seasonal rainfall patterns).
  • Monotonic vs. Non-monotonic:
    • Monotonic: The relationship moves in one direction only (always up or always down). Spearman correlation is suitable when a relationship is monotonic but not linear.
    • Non-monotonic: The relationship changes direction (e.g., increases then decreases).
  • Named Examples:
    • Yerkes-Dodson Law: An upside-down U-shape describing the relationship between arousal/stress and performance/accuracy.
      • Low arousal: Poor performance due to lack of focus.
      • High arousal: Poor performance due to being overwhelmed.
      • Sweet Spot: Optimal arousal levels lead to peak accuracy.
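  • Illustrative sketch (not from the lecture): an inverted-U pattern like the Yerkes-Dodson curve can be fitted with a quadratic function by least squares. The arousal and performance values are simulated with a made-up peak at arousal = 5, and `r_squared` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable

# Simulated inverted-U data (hypothetical): performance peaks at arousal = 5.
arousal = np.linspace(0, 10, 40)
performance = (90 - 2.0 * (arousal - 5) ** 2
               + rng.normal(scale=3.0, size=arousal.size))

def r_squared(y, predicted):
    """Proportion of variance in y explained by the predictions."""
    ss_residual = ((y - predicted) ** 2).sum()
    ss_total = ((y - y.mean()) ** 2).sum()
    return 1 - ss_residual / ss_total

# Least-squares fits: a straight line (degree 1) and a parabola (degree 2).
line_coeffs = np.polyfit(arousal, performance, deg=1)
quad_coeffs = np.polyfit(arousal, performance, deg=2)

r2_line = r_squared(performance, np.polyval(line_coeffs, arousal))
r2_quad = r_squared(performance, np.polyval(quad_coeffs, arousal))

# The fitted parabola a*x^2 + b*x + c has its turning point at x = -b / (2a).
a, b, c = quad_coeffs
peak_arousal = -b / (2 * a)

print(r2_line, r2_quad, peak_arousal)
```

  • On data like these, the straight line explains very little of the variance while the quadratic captures the rise and fall, and the turning point of the fitted parabola estimates the "sweet spot".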

Model Complexity: Overfitting and Underfitting

  • A regression model is a simplified summary of the relationship between variables.
  • Underfitting: The model is too simple to capture the underlying pattern (e.g., fitting a straight line to data that clearly follows a curve). This results in a low $R^2$.
  • Overfitting: The model is too complex and begins to capture the random noise in the data rather than just the signal.
    • An overfitted model might have a very high $R^2$ on the original dataset because it wiggles through every data point.
    • However, an overfitted model is poor at generalizing to new data because it treated random noise as meaningful signal.
  • Evaluation Strategy:
    • A good model captures the overall pattern but ignores the noise.
    • The goal is to create a model that is "as simple as possible but no simpler."
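  • Illustrative sketch (not from the lecture): fitting polynomials of increasing degree to the same data shows why a high $R^2$ on the original dataset is not enough. The data are simulated from a made-up quadratic signal plus noise; degrees 1, 2, and 9 stand in for an underfitted, a reasonable, and an overly complex model.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed so the sketch is repeatable

# Simulated data (hypothetical): a curved signal plus random noise.
x = np.linspace(-1, 1, 15)
y = 1.0 - 2.0 * x ** 2 + rng.normal(scale=0.3, size=x.size)

def r_squared(y, predicted):
    """Proportion of variance in y explained by the predictions."""
    ss_residual = ((y - predicted) ** 2).sum()
    ss_total = ((y - y.mean()) ** 2).sum()
    return 1 - ss_residual / ss_total

# R^2 on the SAME data used for fitting, for three model complexities.
train_r2 = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x, y, deg=degree)
    train_r2[degree] = r_squared(y, np.polyval(coeffs, x))

print(train_r2)
```

  • Because each extra polynomial term gives the curve more freedom, $R^2$ on the fitting data cannot go down as the degree increases, even when the extra terms are only chasing noise.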

Validation and Generalization

  • The standard method to avoid overfitting is to test the model on new data that was not used during the initial fitting process.
  • Workflow:
    • Phase 1 (Exploratory): Use a portion of the data to test different functions and complexities to find a good fit.
    • Phase 2 (Hypothesis Testing/Confirmation): Apply the finalized model to a "held-out" portion of the data or entirely new data.
    • If the model accurately predicts outcomes ($y$) in the new dataset with a high $R^2$, it is confirmed as a generalizable, well-fitted model.
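  • Illustrative sketch (not from the lecture): the two-phase workflow can be mimicked by randomly splitting a dataset, fitting candidate models on one part, and scoring them on the held-out part. The data are simulated, the 40/20 split is an arbitrary choice, and `r_squared` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(4)  # fixed seed so the sketch is repeatable

# Simulated data (hypothetical): a curved signal plus random noise.
x = rng.uniform(-1, 1, size=60)
y = 1.0 - 2.0 * x ** 2 + rng.normal(scale=0.3, size=x.size)

# Random split: 40 points for fitting (Phase 1), 20 held out (Phase 2).
indices = rng.permutation(x.size)
train_idx, test_idx = indices[:40], indices[40:]

def r_squared(y, predicted):
    """Proportion of variance in y explained by the predictions."""
    ss_residual = ((y - predicted) ** 2).sum()
    ss_total = ((y - y.mean()) ** 2).sum()
    return 1 - ss_residual / ss_total

results = {}
for degree in (1, 2, 10):
    # Fit on the training portion only.
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    results[degree] = {
        "train": r_squared(y[train_idx], np.polyval(coeffs, x[train_idx])),
        "test": r_squared(y[test_idx], np.polyval(coeffs, x[test_idx])),
    }

for degree, scores in results.items():
    print(degree, scores)
```

  • The held-out $R^2$ is the number to trust: an underfitted model scores poorly on both portions, while an overfitted model typically scores well on the fitting portion and noticeably worse on the held-out portion. Note that $R^2$ computed on held-out data can even be negative when predictions are worse than simply guessing the mean.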

Questions & Discussion

  • Q: What would you like to cover in the last lecture on Thursday?
  • Suggestions from Audience:
    • Recap of everything covered.
    • Hints for the exam.
    • Review of harder quiz questions.
    • Examples of tricky statistics questions.
  • Q: What is happening next week?
  • A: David Moreau will be teaching sessions focused on probability and quantitative research.
  • Note on Final Schedule: A separate exam revision and wrap-up session will occur at the very end of the course; the Thursday session will incorporate popular requested topics from the voting results (recap, tricky questions, etc.).