Linear Regression Comprehensive Notes
Linear Regression
- Linear regression is a key topic in statistics, assessed in both tests and assignments.
- It's a widely used tool across various fields and industries.
Relationships Between Quantitative Variables
- Linear regression explores relationships between two quantitative variables.
- Example: Relationship between a hospitalized patient's stress level and depression level.
- Process: Data is plotted, and a straight line is fitted to model the relationship. This can be done using tools like StatKey or Excel.
Linear Model
- A model is created to predict the y variable using the x variable.
- The model is based on sample data, which is an estimate of the true relationship.
Cricket Chirp Rate Example
- Example: Predicting temperature (y) based on cricket chirp rate (x).
- Data points represent measurements of chirp rate and temperature.
- A straight line is fitted to the data to create a model.
Model Parameters
- The general form of the linear model is: \hat{y} = b_0 + b_1 x where
- \hat{y} = predicted y-value
- b_0 = y-intercept
- b_1 = slope.
- b_0 and b_1 are coefficients estimated from sample data.
- These are estimates of the true population parameters, \beta_0 (population y-intercept) and \beta_1 (population slope).
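The coefficients b_0 and b_1 come from the usual least-squares formulas: b_1 = \sum(x - \bar{x})(y - \bar{y}) / \sum(x - \bar{x})^2 and b_0 = \bar{y} - b_1\bar{x}. A minimal sketch in Python (the data values below are made up for illustration, not the cricket data):

```python
def fit_line(xs, ys):
    """Least-squares estimates of intercept b0 and slope b1."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

# Illustrative data lying exactly on y = 1 + 2x
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```

Statistical software (StatKey, Excel, R) computes the same quantities; this just makes the formulas concrete.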
Population Parameters
- \beta_0 and \beta_1 represent population parameters for the relationship.
- These values are generally unknown and inferred from sample data.
Random Errors (Residuals)
- Residuals: The difference between observed and predicted y values.
- Formula: \text{Residual} = \text{observed } y - \text{predicted } y = y - \hat{y}
- Residuals should be normally distributed with a mean of zero and a common standard deviation.
- Strong relationship: Residuals have a small standard deviation.
- Larger scatter: Residuals have a larger standard deviation.
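Each residual is just observed y minus predicted y. A short sketch, assuming a hypothetical fitted line \hat{y} = 1 + 2x (the data and coefficients here are invented for illustration):

```python
def residuals(xs, ys, b0, b1):
    # residual = observed y - predicted y = y - (b0 + b1*x)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

res = residuals([1, 2, 3], [3.1, 4.8, 7.2], b0=1.0, b1=2.0)
print(res)  # close to [0.1, -0.2, 0.2] (floating-point rounding)
```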
Residual Distribution
- Residuals around the regression line should be normally distributed.
- For a given x value (e.g., x_1), most observations should fall close to the model, so the residuals cluster around the line of best fit.
- The residuals should have a similar distribution around the line of best fit regardless of where we are on the x-axis.
Key Points for Linear Regression
- Constant variability: Residual spread should be consistent across different x values.
- Consistent band of data above and below the regression line without systematic patterns.
Departures from Linearity
- Data should approximately follow a straight line.
- Non-linear relationships (e.g., quadratic, exponential) are not suitable for linear regression across the entire x-axis, although a linear model can be fitted within a local range.
Changing Variability
- Variability of residuals should remain constant as x changes.
Outliers and Influential Points
- Outliers: Data points far from the regression line.
- Influential points: Outliers that disproportionately affect the line of best fit.
Examples of Linear Regression Candidates
- Good candidate: Consistent band of residuals around the line of best fit.
- Poor candidate: Data does not follow a straight line, indicating a non-linear relationship.
- Consistent bands of residuals are desired as x values change.
- Example of a systematic pattern: if x is between 18 and 20, all of the residuals are positive; between 12 and 18, most are negative.
- The average residual above x = 18 would be positive, and between 12 and 18 it would be negative; residuals should average out to about zero at every x.
Problems with Changing Variability
- Fanning: Residuals' variability changes with x.
- As x increases or decreases, the spread of the residuals grows or shrinks rather than staying constant.
Outliers and Their Impact
- Outliers have large residuals and can disproportionately affect the regression line.
- We want all data to exert similar amounts of influence on the location of the regression line.
Best Practices
- To examine the relationship between two quantitative variables, display the data on a scatter plot.
Confidence Intervals for the Slope
- The slope (b_1) is an estimate of the population slope \beta_1.
- Confidence intervals estimate the true population slope.
- Components: Sample slope, t* value, and standard error.
- Confidence Interval: b_1 \pm t^* \cdot SE where
- t^* is the t-critical value.
- SE is the standard error.
Degrees of Freedom
- For inferences about the slope, degrees of freedom are calculated as n - 2.
- Two degrees of freedom are lost because two parameters (the intercept and the slope) are estimated; at least two points are needed to define a line.
- If we were doing an inference for the difference in two means, as in section 6.4, we took the smaller sample size minus one.
Standard Error
- The formula for the standard error of a slope is fairly involved.
- Rely on statistical software to calculate the standard error of the slope.
Statistical Software Output
- Software like R provides output for linear regression analysis.
- Coefficients: b_0 (intercept) and b_1 (slope).
- Output includes slope, intercept, standard error, and other statistics.
Example Using Cricket Chirp Data
- Recall: Temperature = 3.2 + 0.13 * Chirp Rate.
- Software output provides more precise values for slope and intercept.
Confidence Interval Calculation Example
- Estimate: b_1 \pm t^* \cdot SE = 0.12774 \pm t^* \cdot 0.00789.
- Degrees of freedom: 7 (sample size) - 2 = 5.
- For a 95% confidence interval, t* = 2.571.
- Margin of error \approx 0.02.
- Confidence interval: (0.11, 0.15).
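The interval can be reproduced directly from the quoted slope, standard error, and t* (all three values come from the example above; t* = 2.571 is the critical value for 95% confidence with 5 degrees of freedom):

```python
b1, se, t_star = 0.12774, 0.00789, 2.571

moe = t_star * se          # margin of error, about 0.02
lo, hi = b1 - moe, b1 + moe
print(round(lo, 2), round(hi, 2))  # 0.11 0.15
```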
Confidence Interval Interpretation
- We are 95% confident that the true population slope falls within this interval.
Hypothesis Testing for the Slope
- We can do a hypothesis test for the slope.
Hypotheses
- Null hypothesis: H_0: \beta_1 = 0 (no relationship).
- Alternative hypothesis: H_a: \beta_1 \neq 0 (there is a relationship).
- One-tailed tests are also possible.
Implications of the Null Hypothesis
- If \beta_1 = 0, there is no relationship between x and y.
- Changes in x do not result in changes in y.
Implications of a Non-Zero Slope
- If \beta_1 \neq 0, there is a relationship between x and y.
- As x changes, y also changes.
Test Statistic
- The test statistic is a standardized t test statistic:
t = \frac{\text{statistic} - \text{null value}}{\text{standard error}}
- Since the null value is 0, this simplifies to:
t = \frac{\text{slope estimate}}{\text{standard error}}
P-Value
- The p-value is from a t-distribution with n - 2 degrees of freedom.
Hypothesis Test Example: Cricket Chirps
- Null hypothesis: \beta_1 = 0 (no relationship).
- Alternative hypothesis: \beta_1 \neq 0 (there is a relationship).
Level of Significance
- Choose a significance level \alpha (e.g., 0.05) before comparing it with the p-value.
Calculation
- t = \frac{0.12774}{0.00789} \approx 16.19
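The test statistic can be checked from the same software values used for the confidence interval:

```python
slope, se = 0.12774, 0.00789  # values from the software output above
t = slope / se                # standard errors away from the null value of 0
print(round(t, 2))            # 16.19
```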
Interpretation
- A test statistic can theoretically fall anywhere between negative and positive infinity; it measures the number of standard errors the statistic is away from the null value.
- P-value from software output: 1.64 \times 10^{-5}.
- P-value < alpha: Reject the null hypothesis.
- Conclusion: Strong evidence of a relationship between cricket chirp rate and temperature.
Test for Correlation
- Tests for correlation and slope yield the same test statistic and p-value.
- Sample correlation r = 0.99062.
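The equivalence can be illustrated with the standard correlation test statistic, t = r\sqrt{n-2} / \sqrt{1-r^2}; plugging in r = 0.99062 and n = 7 recovers, up to rounding of r, the same t as the slope test:

```python
import math

r, n = 0.99062, 7
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(round(t, 1))  # about 16.2, matching the slope test's 16.19
```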
Coefficient of Determination
- To be discussed further in the next session.