Linear Regression Comprehensive Notes
Linear Regression
- Linear regression is a key topic in statistics, assessed in both tests and assignments.
- It's a widely used tool across various fields and industries.
Relationships Between Quantitative Variables
- Linear regression explores relationships between two quantitative variables.
- Example: Relationship between a hospitalized patient's stress level and depression level.
- Process: Data is plotted, and a straight line is fitted to model the relationship. This can be done using tools like StatKey or Excel.
Linear Model
- A model is created to predict the y variable using the x variable.
- The model is based on sample data, which is an estimate of the true relationship.
Cricket Chirp Rate Example
- Example: Predicting temperature (y) based on cricket chirp rate (x).
- Data points represent measurements of chirp rate and temperature.
- A straight line is fitted to the data to create a model.
Model Parameters
- The general form of the linear model is: \hat{y} = b_0 + b_1 x where
- \hat{y} = predicted y-value
- b_0 = y-intercept
- b_1 = slope.
- b_0 and b_1 are coefficients estimated from sample data.
- These are estimates of the true population parameters, \beta_0 (population y-intercept) and \beta_1 (population slope).
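The coefficients b_0 and b_1 come from the usual least-squares formulas: b_1 = \sum(x - \bar{x})(y - \bar{y}) / \sum(x - \bar{x})^2 and b_0 = \bar{y} - b_1\bar{x}. A minimal sketch in Python (the data values below are made up for illustration, not the cricket data):

```python
def fit_line(xs, ys):
    """Least-squares estimates of intercept b0 and slope b1."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

# Illustrative data lying exactly on y = 1 + 2x
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```

Statistical software (StatKey, Excel, R) computes the same quantities; this just makes the formulas concrete.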
Population Parameters
- \beta_0 and \beta_1 represent population parameters for the relationship.
- These values are generally unknown and inferred from sample data.
Random Errors (Residuals)
- Residuals: The difference between observed and predicted y values.
- Formula: \text{Residual} = \text{observed } y - \text{predicted } y = y - \hat{y}
- Residuals should be normally distributed with a mean of zero and a common standard deviation.
- Strong relationship: Residuals have a small standard deviation.
- Larger scatter: Residuals have a larger standard deviation.
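Each residual is just observed y minus predicted y. A short sketch, assuming a hypothetical fitted line \hat{y} = 1 + 2x (the data and coefficients here are invented for illustration):

```python
def residuals(xs, ys, b0, b1):
    # residual = observed y - predicted y = y - (b0 + b1*x)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

res = residuals([1, 2, 3], [3.1, 4.8, 7.2], b0=1.0, b1=2.0)
print(res)  # close to [0.1, -0.2, 0.2] (floating-point rounding)
```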
Residual Distribution
- Residuals around the regression line should be normally distributed.
- For a given x value (e.g., x_1), most observations should fall close to the model, so the residuals cluster around the line of best fit.
- The residuals should have a similar distribution around the line of best fit regardless of where we are on the x-axis.
Key Points for Linear Regression
- Constant variability: Residual spread should be consistent across different x values.
- Consistent band of data above and below the regression line without systematic patterns.
Departures from Linearity
- Data should approximately follow a straight line.
- Non-linear relationships (e.g., quadratic, exponential) are not suitable for linear regression across the entire x-axis, although a linear model can be fitted within a local range.
Changing Variability
- Variability of residuals should remain constant as x changes.
Outliers and Influential Points
- Outliers: Data points far from the regression line.
- Influential points: Outliers that disproportionately affect the line of best fit.
Examples of Linear Regression Candidates
- Good candidate: Consistent band of residuals around the line of best fit.
- Poor candidate: Data does not follow a straight line, indicating a non-linear relationship.
- Consistent bands of residuals are desired as x values change.
- Example of a systematic pattern: if x is between 18 and 20, all of the residuals are positive; between 12 and 18, most are negative.
- The average residual above x = 18 would be positive, and between 12 and 18 it would be negative; residuals should average out to about zero at every x.
Problems with Changing Variability
- Fanning: Residuals' variability changes with x.
- As x increases or decreases, the spread of the residuals grows or shrinks rather than staying constant.
Outliers and Their Impact
- Outliers have large residuals and can disproportionately affect the regression line.
- We want all data to exert similar amounts of influence on the location of the regression line.
Best Practices
- To examine the relationship between two quantitative variables, display the data on a scatter plot.
Confidence Intervals for the Slope
- The slope (b_1) is an estimate of the population slope \beta_1.
- Confidence intervals estimate the true population slope.
- Components: Sample slope, t* value, and standard error.
- Confidence Interval: b_1 \pm t^* \cdot SE where
- t^* is the t-critical value.
- SE is the standard error.
Degrees of Freedom
- For inferences about the slope, degrees of freedom are calculated as n - 2.
- Two degrees of freedom are lost because two parameters (the intercept and the slope) are estimated; at least two points are needed to define a line.
- If we were doing an inference for the difference in two means, as in section 6.4, we took the smaller sample size minus one.
Standard Error
- The formula for the standard error of a slope is fairly involved.
- Rely on statistical software to calculate the standard error of the slope.
Statistical Software Output
- Software like R provides output for linear regression analysis.
- Coefficients: b_0 (intercept) and b_1 (slope).
- Output includes slope, intercept, standard error, and other statistics.
Example Using Cricket Chirp Data
- Recall: Temperature = 3.2 + 0.13 * Chirp Rate.
- Software output provides more precise values for slope and intercept.
Confidence Interval Calculation Example
- Estimate: b_1 \pm t^* \cdot SE = 0.12774 \pm t^* \cdot 0.00789.
- Degrees of freedom: 7 (sample size) - 2 = 5.
- For a 95% confidence interval, t* = 2.571.
- Margin of error \approx 0.02.
- Confidence interval: (0.11, 0.15).
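The interval can be reproduced directly from the quoted slope, standard error, and t* (all three values come from the example above; t* = 2.571 is the critical value for 95% confidence with 5 degrees of freedom):

```python
b1, se, t_star = 0.12774, 0.00789, 2.571

moe = t_star * se          # margin of error, about 0.02
lo, hi = b1 - moe, b1 + moe
print(round(lo, 2), round(hi, 2))  # 0.11 0.15
```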
Confidence Interval Interpretation
- We are 95% confident that the true population slope falls within this interval.
Hypothesis Testing for the Slope
- We can do a hypothesis test for the slope.
Hypotheses
- Null hypothesis: H_0: \beta_1 = 0 (no relationship).
- Alternative hypothesis: H_a: \beta_1 \neq 0 (there is a relationship).
- One-tailed tests are also possible.
Implications of the Null Hypothesis
- If \beta_1 = 0, there is no relationship between x and y.
- Changes in x do not result in changes in y.
Implications of a Non-Zero Slope
- If \beta_1 \neq 0, there is a relationship between x and y.
- As x changes, y also changes.
Test Statistic
- The test statistic is a standardized t test statistic:
t = \frac{\text{statistic} - \text{null value}}{\text{standard error}}
- Since the null value is 0, this simplifies to:
t = \frac{\text{slope estimate}}{\text{standard error}}
P-Value
- The p-value is from a t-distribution with n - 2 degrees of freedom.
Hypothesis Test Example: Cricket Chirps
- Null hypothesis: \beta_1 = 0 (no relationship).
- Alternative hypothesis: \beta_1 \neq 0 (there is a relationship).
Level of Significance
- Choose a significance level \alpha (e.g., 0.05) before comparing it with the p-value.
Calculation
- t = \frac{0.12774}{0.00789} \approx 16.19
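The test statistic can be checked from the same software values used for the confidence interval:

```python
slope, se = 0.12774, 0.00789  # values from the software output above
t = slope / se                # standard errors away from the null value of 0
print(round(t, 2))            # 16.19
```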
Interpretation
- A test statistic can theoretically fall anywhere between negative and positive infinity; it measures the number of standard errors the statistic is away from the null value.
- P-value from software output: 1.64 \times 10^{-5}.
- P-value < alpha: Reject the null hypothesis.
- Conclusion: Strong evidence of a relationship between cricket chirp rate and temperature.
Test for Correlation
- Tests for correlation and slope yield the same test statistic and p-value.
- Sample correlation r = 0.99062.
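The equivalence can be illustrated with the standard correlation test statistic, t = r\sqrt{n-2} / \sqrt{1-r^2}; plugging in r = 0.99062 and n = 7 recovers, up to rounding of r, the same t as the slope test:

```python
import math

r, n = 0.99062, 7
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 1))  # about 16.2, matching the slope test's 16.19
```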
Coefficient of Determination
- To be discussed further in the next session.