Linear Regression Comprehensive Notes

Linear Regression

  • Linear regression is a key topic in statistics, assessed in both tests and assignments.
  • It's a widely used tool across various fields and industries.

Relationships Between Quantitative Variables

  • Linear regression explores relationships between two quantitative variables.
  • Example: Relationship between a hospitalized patient's stress level and depression level.
  • Process: Data is plotted, and a straight line is fitted to model the relationship. This can be done using tools like StatKey or Excel.

Linear Model

  • A model is created to predict the y variable using the x variable.
  • The model is based on sample data, which is an estimate of the true relationship.

Cricket Chirp Rate Example

  • Example: Predicting temperature (y) based on cricket chirp rate (x).
  • Data points represent measurements of chirp rate and temperature.
  • A straight line is fitted to the data to create a model.
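The fitting step described above can be sketched in Python. The chirp-rate and temperature values below are hypothetical placeholders, not the actual dataset from these notes; `np.polyfit` with degree 1 performs the least-squares straight-line fit.

```python
import numpy as np

# Hypothetical chirp-rate (chirps per minute) and temperature data,
# for illustration only -- not the dataset used in the notes.
chirps = np.array([200.0, 230.0, 260.0, 290.0, 320.0, 350.0, 380.0])
temp = np.array([29.0, 33.5, 36.0, 40.5, 44.0, 48.5, 51.0])

# Degree-1 polyfit returns the least-squares slope and intercept.
b1, b0 = np.polyfit(chirps, temp, 1)
print(f"model: temp_hat = {b0:.3f} + {b1:.4f} * chirps")
```

A tool like StatKey or Excel's trendline produces the same slope and intercept for the same data.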

Model Parameters

  • The general form of the linear model is: \hat{y} = b_0 + b_1 x where
    • \hat{y} = predicted y-value
    • b_0 = y-intercept
    • b_1 = slope.
  • b_0 and b_1 are coefficients estimated from sample data.
  • These are estimates of the true population parameters, \beta_0 (population y-intercept) and \beta_1 (population slope).

Population Parameters

  • \beta_0 and \beta_1 represent population parameters for the relationship.
  • These values are generally unknown and inferred from sample data.

Random Errors (Residuals)

  • Residuals: The difference between observed and predicted y values.
  • Formula: Residual = Observed \space y - Predicted \space y
  • Residuals should be normally distributed with a mean of zero and some standard deviation.
  • Strong relationship: Residuals have a small standard deviation.
  • Larger scatter: Residuals have a larger standard deviation.
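The residual calculation above is a one-line subtraction once a model is fitted. A minimal sketch, using hypothetical data: for a least-squares fit, the residuals always average to zero, and their standard deviation measures the scatter around the line.

```python
import numpy as np

# Hypothetical (x, y) data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

b1, b0 = np.polyfit(x, y, 1)   # least-squares line of best fit
predicted = b0 + b1 * x
residuals = y - predicted       # residual = observed y - predicted y

# Mean is zero by construction; std (with n - 2 degrees of freedom)
# measures the strength of the relationship.
print(residuals.mean(), residuals.std(ddof=2))
```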

Residual Distribution

  • Residuals around the regression line should be normally distributed.
  • For a given x value (e.g., x_1), most residuals should be close to the model, clustering around the line of best fit.
  • The residuals should have a similar distribution around the line of best fit regardless of where we are on the x-axis.

Key Points for Linear Regression

  • Constant variability: Residual spread should be consistent across different x values.
  • Consistent band of data above and below the regression line without systematic patterns.

Departures from Linearity

  • Data should approximately follow a straight line.
  • Non-linear relationships (e.g., quadratic, exponential) are not suitable for linear regression across the entire x-axis, but a linear model can still be fitted over a local range.

Changing Variability

  • Variability of residuals should remain constant as x changes.

Outliers and Influential Points

  • Outliers: Data points far from the regression line.
  • Influential points: Outliers that disproportionately affect the line of best fit.

Examples of Linear Regression Candidates

  • Good candidate: Consistent band of residuals around the line of best fit.
  • Poor candidate: Data does not follow a straight line, indicating a non-linear relationship.

Residual Analysis for Model Assessment

  • Consistent bands of residuals are desired as x values change.
  • Example of a problem: if all residuals for x between 18 and 20 are positive, while residuals for x between 12 and 18 are mostly negative, the residuals show a systematic pattern rather than averaging to zero.
  • In that case, the average residual above x = 18 is positive and the average residual between 12 and 18 is negative. We want the residuals to average out to about zero at every x value.

Problems with Changing Variability

  • Fanning: Residuals' variability changes with x.
  • As x increases (or decreases), the spread of the residuals grows or shrinks, producing a fan shape.

Outliers and Their Impact

  • Outliers have large residuals and can disproportionately affect the regression line.
  • We want all data to exert similar amounts of influence on the location of the regression line.

Best Practices

  • To examine the relationship between two quantitative variables, display the data on a scatter plot first.

Confidence Intervals for the Slope

  • The slope (b_1) is an estimate of the population slope \beta_1.
  • Confidence intervals estimate the true population slope.
  • Components: Sample slope, t* value, and standard error.

Formula for Confidence Interval

  • Confidence Interval: b_1 \pm t^* \cdot SE where
    • t^* is the t-critical value.
    • SE is the standard error.

Degrees of Freedom

  • For inferences about the slope, degrees of freedom are calculated as n - 2.
  • At least two points are needed to define a line of best fit.
  • If we were doing inference for the difference in two means, as in Section 6.4, we took the smaller sample size minus one.

Standard Error

  • The formula to calculate the standard error for a slope is fairly involved.
  • Rely on statistical software to calculate the standard error of the slope.

Statistical Software Output

  • Software like R provides output for linear regression analysis.
  • Coefficients: b_0 (intercept) and b_1 (slope).
  • Output includes slope, intercept, standard error, and other statistics.
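The notes reference R output; the same quantities can be produced in Python with `scipy.stats.linregress`, which returns the slope, intercept, correlation, two-sided p-value for the slope, and the standard error of the slope in one call. The data below are hypothetical, for illustration only.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical (x, y) data for illustration.
x = np.array([20.0, 22.0, 24.0, 26.0, 28.0, 30.0, 32.0])
y = np.array([5.8, 6.0, 6.4, 6.5, 6.9, 7.0, 7.3])

result = linregress(x, y)
# result bundles slope, intercept, r, the two-sided p-value for the
# slope test, and the standard error of the slope.
print(result.slope, result.intercept, result.rvalue,
      result.pvalue, result.stderr)
```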

Example Using Cricket Chirp Data

  • Recall: Temperature = 3.2 + 0.13 * Chirp Rate.
  • Software output provides more precise values for slope and intercept.
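Using the rounded model stated above, a prediction is a single substitution into \hat{y} = b_0 + b_1 x. The chirp rate of 200 below is an arbitrary illustrative input.

```python
# Rounded model from the notes: predicted temperature = 3.2 + 0.13 * chirp rate.
def predict_temp(chirp_rate):
    return 3.2 + 0.13 * chirp_rate

# Arbitrary example input: a chirp rate of 200.
print(predict_temp(200))  # 3.2 + 0.13 * 200 = 29.2
```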

Confidence Interval Calculation Example

  • Estimate: 0.12774 (slope) \pm t^* \cdot 0.00789 (standard error).
  • Degrees of freedom: 7 (sample size) - 2 = 5.
  • For a 95% confidence interval, t* = 2.571.
  • Margin of error \approx 0.02.
  • Confidence interval: (0.11, 0.15).
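The arithmetic above can be checked directly; `scipy.stats.t.ppf` supplies the t-critical value for 5 degrees of freedom, and the interval is slope \pm t^* \cdot SE.

```python
from scipy.stats import t

b1 = 0.12774   # sample slope from the software output
se = 0.00789   # standard error of the slope
df = 7 - 2     # n - 2 degrees of freedom

t_star = t.ppf(0.975, df)        # two-tailed 95% critical value
margin = t_star * se             # margin of error
lower, upper = b1 - margin, b1 + margin
print(round(t_star, 3), round(lower, 2), round(upper, 2))
```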

Confidence Interval Interpretation

  • We are 95% confident that the true population slope falls within this interval.

Hypothesis Testing for the Slope

  • We can do a hypothesis test for the slope.

Hypotheses

  • Null hypothesis: H_0: \beta_1 = 0 (no relationship).
  • Alternative hypothesis: H_a: \beta_1 \neq 0 (there is a relationship).
  • One-tailed tests are also possible.

Implications of the Null Hypothesis

  • If \beta_1 = 0, there is no relationship between x and y.
  • Changes in x do not result in changes in y.

Implications of a Non-Zero Slope

  • If \beta_1 \neq 0, there is a relationship between x and y.
  • As x changes, y also changes.

Test Statistic

  • The test statistic is a standardized t test statistic:
    t = \frac{statistic - null \space value}{standard \space error}
  • Which can be simplified to:
    t = \frac{slope \space estimate}{standard \space error}

P-Value

  • The p-value is from a t-distribution with n - 2 degrees of freedom.

Hypothesis Test Example: Cricket Chirps

  • Null hypothesis: \beta_1 = 0 (no relationship).
  • Alternative hypothesis: \beta_1 \neq 0 (there is a relationship).

Level of Significance

  • Set \alpha = 0.01 (1%).

Calculation

  • t = \frac{0.12774}{0.00789} \approx 16.19

Interpretation

  • A test statistic can theoretically fall anywhere between negative and positive infinity; it measures the number of standard errors the statistic is away from the null value.
  • P-value from software output: 1.64 * 10^(-5).
  • P-value < alpha: Reject the null hypothesis.
  • Conclusion: Strong evidence of a relationship between cricket chirp rate and temperature.
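The test statistic and p-value above can be reproduced from the slope and standard error; `t.sf` gives the upper-tail probability, doubled for a two-tailed test with n - 2 = 5 degrees of freedom.

```python
from scipy.stats import t

slope = 0.12774   # sample slope
se = 0.00789      # standard error of the slope
df = 5            # n - 2

t_stat = slope / se                    # standardized test statistic
p_value = 2 * t.sf(abs(t_stat), df)    # two-tailed p-value
print(round(t_stat, 2), p_value)
```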

Test for Correlation

  • Tests for correlation and slope yield the same test statistic and p-value.
  • Sample correlation r = 0.99062.
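The equivalence of the two tests can be seen numerically using the standard identity t = r \sqrt{n-2} / \sqrt{1 - r^2}; plugging in the sample correlation from the notes recovers (up to rounding) the same test statistic as the slope test.

```python
import math

r = 0.99062   # sample correlation from the notes
n = 7         # sample size

# t statistic for testing H_0: correlation = 0; it matches
# the slope test statistic up to rounding of r.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(t_stat, 1))
```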

Coefficient of Determination

  • To be discussed further in the next session.