Regression Notes

Week 9: Regression

Calculating Correlation

Correlation in Excel is used to analyze the relationships between different variables. An example is provided with the following variables:

  • Number of weekly riders

  • Price per week

  • Population of city

  • Monthly income of riders

  • Average parking rates per month

A correlation matrix shows the correlations between each pair of variables. For instance:

  • The correlation between the number of weekly riders and price per week is -0.966.

  • The correlation between the number of weekly riders and the population of the city is 0.898.

  • The correlation between the number of weekly riders and the monthly income of riders is -0.873.

  • The correlation between the number of weekly riders and the average parking rates per month is -0.793.
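Outside Excel, a correlation matrix like this can be sketched in a few lines of Python. The block below is only an illustration: the helper names `pearson_r` and `correlation_matrix` are made up for this example, and the `toy` values are invented, not the data set behind these notes.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of named columns."""
    return {a: {b: pearson_r(columns[a], columns[b]) for b in columns}
            for a in columns}

# Made-up illustrative values, NOT the data behind the notes.
toy = {
    "riders": [10, 8, 6, 4, 2],
    "price": [1, 2, 3, 4, 5],
    "population": [5, 4, 4, 2, 1],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

Each variable correlates perfectly with itself, and the matrix is symmetric, so only the values on one side of the diagonal need to be read.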

Correlation

Correlation describes the strength and direction of the linear relationship between two numerical variables, specifically using Pearson correlation. The sample correlation is represented by $r$, while the population correlation is represented by $\rho$.

Mathematical Notation

The equation of the regression line is given by: $\hat{y} = b_0 + b_1 x$.

  • Intercept ($b_0$):

    • The value of $y$ when $x = 0$.

    • The point where the regression line crosses the y-axis.

  • Slope ($b_1$):

    • Defines the steepness of the regression line.

    • Indicates the direction of the line (positive or negative).
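The notes do not say how $b_0$ and $b_1$ are obtained. Assuming the line is fitted by ordinary least squares (the usual method, and the one Excel's regression tool uses), a minimal sketch looks like this. The function name `fit_line` and the data are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope: steepness and direction of the line
    b0 = mean_y - b1 * mean_x    # intercept: predicted y when x = 0
    return b0, b1

# Made-up points that lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

b0, b1 = fit_line(xs, ys)
print("intercept:", b0, "slope:", b1)
```

Because the toy points sit exactly on a line, the fit recovers that line's intercept and slope; real data would scatter around the fitted line.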

Assumptions of a Linear Model

The assumptions are:

  1. (L) Linearity: The relationship between the variables must be linear.

  2. (I) Independence: The residuals must be independent.

  3. (N) Normality: The residuals must be normally distributed.

  4. (E) Equality of Variance: The residuals must have equal variance.

Simple Example

The regression line gives a predicted value $\hat{y}$ for any given value of $x$, and each observed $x$ also has an observed value $y$. The closer the observed values are to the predicted values, the better the model predicts $y$ and the less variability there is around the line. The further the observed values are from the line, the worse the model predicts $y$ and the more variability there is around the line. The difference between the observed $y$ and the predicted $\hat{y}$ at a given value of $x$ is called a residual.
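As a small illustration of residuals, the sketch below uses a hypothetical line and made-up observations. It assumes the common sign convention residual = observed minus predicted.

```python
# Hypothetical line and observations, for illustration only.
b0, b1 = 0.0, 2.0
xs = [1, 2, 3, 4]
ys = [2.5, 3.5, 6.5, 7.5]

predicted = [b0 + b1 * x for x in xs]
# Residual = observed y minus predicted y at the same x (assumed convention).
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

for x, y, y_hat, e in zip(xs, ys, predicted, residuals):
    print(f"x={x}  observed={y}  predicted={y_hat}  residual={e}")
```

Small residuals mean the observations sit close to the line; large residuals mean more variability around it.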

Terminology

$\hat{y}$ represents the predicted or fitted value.

Sources of Variation

$R^2$ (R-squared), also known as the coefficient of determination, represents the proportion of variability in the outcome variable that is explained by the regression model. $R^2$ can range between 0 and 1.

  • An $R^2$ value of 0 means the model explains none of the variability.

  • An $R^2$ value of 1 means the model explains all of the variability.

The correlation $r$ and $R^2$ are related but not interchangeable. The correlation $r$ describes the strength and direction of a linear relationship between two variables, while $R^2$ represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). In simple linear regression (one predictor), $R^2 = r^2$, so $r = \pm\sqrt{R^2}$, with the sign matching the sign of the slope.
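To make the link between $r$ and $R^2$ concrete, the sketch below computes $R^2$ as 1 minus the ratio of residual to total sum of squares for a least-squares line, and compares it with the squared Pearson correlation. The function name and the data are made up for illustration.

```python
from math import sqrt

def r_and_r_squared(xs, ys):
    """Return (r, R^2) for a simple least-squares fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sst = sum((y - mean_y) ** 2 for y in ys)   # total variation in y
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Variation left unexplained by the line (sum of squared residuals).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * sst)
    return r, 1 - sse / sst

# Made-up illustrative data.
r, r2 = r_and_r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print("r =", round(r, 4), " R^2 =", round(r2, 4), " r^2 =", round(r * r, 4))
```

With one predictor the two quantities agree: squaring $r$ gives $R^2$, but $R^2$ alone does not tell you the direction of the relationship.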

Hypothesis Testing

In hypothesis testing for regression, we are primarily interested in the slope of the regression line, not the y-intercept.

  • Hypotheses:

    • Null Hypothesis ($H_0$): $\beta_1 = 0$ (there is no relationship between X and Y).

    • Alternative Hypothesis ($H_1$): $\beta_1 \neq 0$ (there is a relationship between X and Y).

  • Significance Level: $\alpha = 5\%$.

  • Assumptions:

    • Linear relationship between X and Y.

    • Residuals are normally distributed.

    • Constant variance in the residuals.

An example of Excel output is given:

  • Multiple R: 0.89756531

  • R Square: 0.80562348

  • Adjusted R Square: 0.79784842

  • Standard Error: 9577.24378

  • Observations: 27
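The summary figures above can be cross-checked against each other. The sketch below assumes a single predictor ($k = 1$) and uses the standard adjusted R-squared formula, $1 - (1 - R^2)(n - 1)/(n - k - 1)$. Only the numbers come from the output; the variable names are illustrative.

```python
# Values copied from the Excel summary output.
multiple_r = 0.89756531
r_square = 0.80562348
adjusted_r_square = 0.79784842
n = 27   # observations
k = 1    # number of predictors (assumed: population of city only)

# With one predictor, R Square is Multiple R squared.
r_square_check = multiple_r ** 2

# Adjusted R Square penalises R Square for the number of predictors.
adjusted_check = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

print("R Square from Multiple R:", round(r_square_check, 6))
print("Adjusted R Square check: ", round(adjusted_check, 6))
```

Both recomputed values agree with the reported ones to within rounding, which is a quick way to confirm the output was transcribed correctly.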

The ANOVA table includes degrees of freedom (df), sum of squares (SS), mean square (MS), the F statistic, and Significance F.
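The notes do not reproduce the ANOVA values, but the F statistic can be recovered approximately from R Square and the sample size using the standard identity $F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)}$. The sketch below assumes one predictor; because it works from rounded figures, treat the result as a rough check against the actual table rather than an exact value.

```python
r_square = 0.80562348   # R Square from the summary output
n = 27                  # observations
k = 1                   # predictors (assumed: population of city only)

df_regression = k
df_residual = n - k - 1

# F compares explained variation per regression df
# with unexplained variation per residual df.
f_stat = (r_square / df_regression) / ((1 - r_square) / df_residual)

print("df:", df_regression, "and", df_residual)
print("F statistic (approx):", round(f_stat, 1))
```

A large F relative to its degrees of freedom corresponds to a small Significance F, that is, evidence that the model explains more variation than chance alone.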

Coefficients table includes:

  • Intercept: -313732.056

  • Population of city: 0.28198024

Each coefficient is reported along with its standard error, t Stat, P-value, Lower 95%, and Upper 95%.

Conclusion

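If the p-value for the slope is less than 0.05, we reject the null hypothesis and conclude that there is a statistically significant linear relationship. The slope is interpreted as follows: for each additional person in the city's population, the predicted number of weekly riders increases by 0.28198024 on average.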

95% Confidence Interval

The 95% confidence interval (CI) for the slope is [0.22492777, 0.339032704]. The link to the p-value is:

  • If the CI contains 0, then the p-value will be > 0.05.

  • If the CI does not contain 0, then the p-value will be < 0.05.

We can be 95% confident that the true slope $\beta_1$ is greater than 0.22492777 but less than 0.339032704.
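The rule linking the interval to the p-value can be written as a tiny check. The helper name `contains` is made up for this sketch, and the numbers are the slope and interval reported in the Excel output.

```python
def contains(lower, upper, value=0.0):
    """True if value lies inside the closed interval [lower, upper]."""
    return lower <= value <= upper

slope = 0.28198024
lower, upper = 0.22492777, 0.339032704   # 95% CI for the slope

# If 0 is outside the 95% CI, the slope is significant at the 5% level.
significant = not contains(lower, upper)

# The interval is centred on the estimated slope.
midpoint = (lower + upper) / 2
half_width = (upper - lower) / 2

print("significant at 5%:", significant)
print("midpoint:", round(midpoint, 6), " half-width:", round(half_width, 6))
```

Here the whole interval is above 0, so the p-value for the slope is below 0.05, and the midpoint of the interval reproduces the estimated slope.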

Predictions

To make predictions:

  1. Ensure all assumptions are satisfied.

  2. Ensure a significant linear relationship exists and the model explains a reasonable amount of variation.

  3. Ensure that any predictions are within the observed range of the independent variable.

The equation of the regression line is: $\hat{y} = b_0 + b_1 x$

  • $b_0 = -313732.056$

  • $b_1 = 0.28198024$

So, $\hat{y} = -313732.056 + 0.28198024x$

Example: Predict the number of weekly riders for a city with a population of 1,680,000 people.

  • $x = 1{,}680{,}000$

  • $\hat{y} = -313732.056 + 0.28198024 \times 1680000$

  • $\hat{y} \approx 159994.75$

The predicted number of weekly riders in a city with a population of 1,680,000 people is approximately 159,995.
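The same prediction can be sketched in Python using the coefficients from the Excel output. The function name `predict_riders` and the optional `observed_min` / `observed_max` parameters are hypothetical: the notes do not state the population range in the data, so the range check is off unless you supply those bounds yourself.

```python
B0 = -313732.056   # intercept from the coefficients table
B1 = 0.28198024    # slope for "Population of city"

def predict_riders(population, observed_min=None, observed_max=None):
    """Predict weekly riders; optionally refuse to extrapolate.

    observed_min / observed_max are hypothetical bounds for the populations
    seen in the data. The notes do not give them, so they default to None.
    """
    if observed_min is not None and population < observed_min:
        raise ValueError("population is below the observed range")
    if observed_max is not None and population > observed_max:
        raise ValueError("population is above the observed range")
    return B0 + B1 * population

predicted = predict_riders(1_680_000)
print("predicted weekly riders:", round(predicted))
```

Passing the observed range makes the third prediction rule explicit: a population outside that range raises an error instead of silently extrapolating.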