Simple Linear Regression Notes

Symbols used in this class

  • Sample size: n

  • Population size: N (Rarely used in statistics, since most population sizes are effectively infinite)

  • Population standard deviation: \sigma

  • Sample standard deviation: s

  • Population mean: \mu

  • Sample mean: \bar{x}

  • Standard normal variable: Z (often used for z-scores)

  • Correlation between two numerical variables: r

  • Regression slope: \beta_1

  • Regression intercept: \beta_0

  • Probability, under the null hypothesis, of observing a result at least as extreme as the sample (p-value): p

  • Type I error rate (significance level): \alpha


Descriptive Statistics and Simple Linear Regression (Chap 5)

  • Population parameters vs. sample statistics

    • Population parameters include things like the population mean \mu, the population standard deviation \sigma, etc.

    • Population concept: parameter values that characterize the entire population (usually unknown)

    • Sample statistics include sample mean \bar{x}, sample standard deviation s, etc.

  • Population vs. Sampling framing

    • Population: entire group of interest

    • Sample: subset of the population used to estimate parameters

    • Inference: using sample information to draw conclusions about population

    • Probability distributions: models for how data are generated; they underpin inference


Learning Objectives

  • Perform Simple Linear Regression between two numerical variables and interpret the regression equation (slope \beta_1 and intercept \beta_0).

  • Understand the Coefficient of Determination (R-squared, R^2) as a measure of goodness of fit for the regression line.

  • Use the regression equation to predict one variable from the other.

  • Be aware of the risk of extrapolation in regression (predictions outside the observed data range may be unreliable).


Recall: Linear function and regression basics

  • Linear function in math: y = mx + b

  • Simple linear regression shows a linear relationship between a response variable (y) and an explanatory variable (x) (typically two numerical variables).

  • Regression line is used to predict y for a given x.

  • In regression notation: y = \beta_0 + \beta_1 x, where \beta_1 is the slope and \beta_0 is the intercept.

  • Interpretation of the slope: if \beta_1 > 0, y increases with x; if \beta_1 < 0, y decreases with x. The slope is measured in units of y per unit of x (e.g., pounds per year).

  • The Coefficient of Determination (R-squared) indicates how well the model explains the variability in y (what percent of the variability in y is explained by the model?).


Example: Age and Weight

  • Data example (age, weight):

    • 1 → 21.7, 2 → 18.6, 3 → 25.1, 4 → 36.4, 5 → 38.0, 6 → 41.6, 7 → 43.1, 8 → 52.1, 9 → 67.4, 10 → 67.9, 11 → 67.0, 12 → 74.6, 13 → 82.5, 14 → 86.9, 15 → 91.1, 16 → 98.1

  • Regression Equation obtained: \text{weight} = 11.3 + 5.4 \times \text{age}

  • Slope: \beta_1 = 5.4 — body weight increases by about 5.4 pounds per additional year of age.

  • Prediction examples:

    • Predicted weight for an 11-year-old: \hat{y} = 11.3 + 5.4\times 11 = 11.3 + 59.4 = 70.7 pounds

    • Predicted weight for a 60-year-old: \hat{y} = 11.3 + 5.4\times 60 = 11.3 + 324 = 335.3 pounds

  • Limitations: Extrapolation beyond the observed data range may not be valid; the model fits well only within the observed ages.
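The notes contain no code, but the prediction step is easy to sketch in Python. A minimal illustration (the coefficients 11.3 and 5.4 are copied from the fitted equation above, not re-estimated, and the function name is mine):

```python
# Predictions from the fitted line in the notes: weight = 11.3 + 5.4 * age.
# Coefficients are taken from the transcript, not re-estimated here.

def predict_weight(age):
    """Predicted weight in pounds for a given age in years."""
    return 11.3 + 5.4 * age

print(round(predict_weight(11), 1))  # 70.7 -- inside the observed range (ages 1-16)
print(round(predict_weight(60), 1))  # 335.3 -- far outside the range: extrapolation
```

The second call illustrates the limitation above: the arithmetic is valid, but the model was fit only on ages 1–16, so the 60-year-old prediction is not trustworthy.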


Data Exploration: Scatter Plots

  • Scatter plots help to:

    • Assess whether simple linear regression is appropriate

    • Detect outliers


Regression: Estimation and Key Formulas

  • Regression model form: y = \beta_0 + \beta_1 x

  • Least Squares estimator (to fit the line):

    • Slope: \hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}

    • Intercept: \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

  • Another common presentation (equivalent):

    • Let s_x = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n-1}} and s_y = \sqrt{\dfrac{\sum (y_i - \bar{y})^2}{n-1}}

    • Correlation: r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\sqrt{\sum (y_i - \bar{y})^2}}

    • Then \hat{\beta}_1 = r \dfrac{s_y}{s_x} and \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

  • Goodness of fit:

    • R^2 = r^2 (the proportion of variance in y explained by the model)

  • Residuals:

    • Residual for observation i: e_i = y_i - \hat{y}_i

  • Objective of least squares: minimize the sum of squared residuals

    • \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2
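The least squares formulas above translate directly into a few lines of Python. A minimal pure-Python sketch (the function name least_squares is mine, not from the notes):

```python
# Least squares fit for simple linear regression, following the formulas above:
# slope = sum((x_i - xbar)(y_i - ybar)) / sum((x_i - xbar)^2),
# intercept = ybar - slope * xbar.

def least_squares(x, y):
    """Return (b0, b1): intercept and slope minimizing the sum of squared residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1

# Sanity check on points lying exactly on y = 1 + 2x: all residuals are zero.
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```

Because the intercept is computed as \bar{y} - b_1 \bar{x}, the fitted line always passes through the point (\bar{x}, \bar{y}).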


Example: Knee height vs. overall height (n = 6)

  • Data (x = knee height in cm, y = overall height in cm):

    • (57.7, 192.1), (47.4, 153.3), (43.5, 146.4), (44.8, 162.7), (55.2, 169.1), (54.6, 177.8)

  • Tasks:
    1) Make a scatter plot and examine features.
    2) Find the least squares regression line: y = b_0 + b_1 x
    3) What percent of variability in y is explained by the least squares regression line? (Compute R^2.)

  • Worked results (from transcript):

    • Final regression line: y = 43.175 + 2.45 x

    • From table sums: \bar{x} = 50.5, \bar{y} = 166.9

    • \sum (x_i - \bar{x})^2 = 181.84, \sum (x_i - \bar{x})(y_i - \bar{y}) = 446.07, hence

    • b_1 = \dfrac{446.07}{181.84} \approx 2.45

    • b_0 = \bar{y} - b_1 \bar{x} = 166.9 - 2.45 \times 50.5 = 43.175

    • Standard deviations: s_x = \sqrt{\dfrac{181.84}{6-1}} \approx 6.03, s_y = \sqrt{\dfrac{1381.54}{6-1}} \approx 16.62

    • Correlation: r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)s_x s_y} = \dfrac{446.07}{5 \times 6.03 \times 16.62} \approx 0.89

    • R^2 = r^2 \approx 0.7921 (about 79.21% of the variability in the data is explained by the regression line)
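The worked numbers can be checked in pure Python. A sketch (note that carrying full precision gives \bar{x} \approx 50.53 and an intercept near 42.9, slightly different from the transcript's 43.175, because the transcript rounds \bar{x} and b_1 before computing b_0):

```python
# Knee height (x, cm) vs. overall height (y, cm), n = 6, from the notes.
x = [57.7, 47.4, 43.5, 44.8, 55.2, 54.6]
y = [192.1, 153.3, 146.4, 162.7, 169.1, 177.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

b1 = sxy / sxx                  # about 2.45, as in the notes
b0 = ybar - b1 * xbar           # about 42.93 at full precision (notes: 43.175)
r = sxy / (sxx * syy) ** 0.5    # about 0.89

print(round(b1, 2), round(r, 2), round(r ** 2, 4))  # 2.45 0.89 0.7921
```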


Example: Chicago monthly temperature (extrapolation exercise)

  • Data: Months as integers 1–7, Temperatures (°F):

    • (1, 21.0), (2, 25.4), (3, 37.2), (4, 48.6), (5, 58.9), (6, 68.6), (7, 73.2)

  • Tasks:
    1) Plot with month on x-axis, temperature on y-axis; describe relationship (form, direction, strength, outliers).
    2) Find the least squares regression line.
    3) Predict temperature in December (x = 12) using the regression line.
    4) Does the prediction in 3 make sense? Why?
    5) Discuss extrapolation risk.
    6) What percent of variability is explained by the regression line?

  • Worked results (from transcript):

    • Regression line: y = 9.8 + 9.45 x where x = month, y = temperature

    • Means: \bar{x} = 4, \bar{y} = 47.6

    • Extrapolation check: December (x = 12) would give y = 9.8 + 9.45\times 12 = 123.2^{\circ}F which is not realistic for December in Chicago; extrapolation can be risky.

    • Sample size: n = 7

    • Sums used: \sum (x_i - \bar{x})^2 = 28, \sum (x_i - \bar{x})(y_i - \bar{y}) = 264.7

    • Slope: b_1 = \dfrac{264.7}{28} \approx 9.45

    • Intercept: b_0 = \bar{y} - b_1 \bar{x} = 47.6 - 9.45 \times 4 = 9.8

    • Regression line: y = 9.8 + 9.45 x

    • Standard deviations: s_x = \sqrt{\dfrac{28}{7-1}} \approx 2.16, s_y = \sqrt{\dfrac{2533.61}{7-1}} \approx 20.55

    • Correlation: r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1) s_x s_y} = \dfrac{264.7}{6 \times 2.16 \times 20.55} \approx 0.99

    • R^2 = r^2 \approx 0.99^2 = 0.9801 (about 98% of the variability explained by the regression line; carrying full precision gives R^2 \approx 0.988)
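The extrapolation failure is easy to demonstrate in code. A sketch (full precision gives b_0 \approx 9.74 rather than the transcript's rounded 9.8, but the December prediction is essentially the same):

```python
# Chicago monthly temperatures (F), months 1-7, from the notes.
x = list(range(1, 8))
y = [21.0, 25.4, 37.2, 48.6, 58.9, 68.6, 73.2]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar  # about 9.74 at full precision (notes: 9.8)

december = b0 + b1 * 12  # extrapolating 5 months past the data
print(round(b1, 2), round(december, 1))  # 9.45 123.2
```

Within months 1–7 the fit is nearly perfect, but the linear upward trend cannot continue: temperature is seasonal, so the December "prediction" of about 123°F is absurd.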


Exercise: Speed vs. MPG (Car dataset)

  • Data (Speed mph, MPG):

    • (50, 35), (60, 38), (70, 30), (80, 24), (90, 20), (100, 18)

  • Tasks:
    1) Plot the scatter, comment on form, direction, strength, and outliers.
    2) Find the least squares regression line to fit the data.
    3) Using the regression line, what is the predicted MPG at speed 75 mph?
    4) Using the regression line, what speed corresponds to MPG = 32?
    5) What percent of variability is explained by the regression line?

  • Worked results (from transcript):

    • Regression line: y = 58.25 - 0.41 x where x = speed, y = MPG

    • Predictions:

    • At x = 75: \hat{y} = 58.25 - 0.41 \times 75 = 27.5\,\text{MPG}

    • For y = 32, solve 32 = 58.25 - 0.41 x ⇒ x \approx 64\,\text{mph}

    • Sample size: n = 6

    • Sums: \sum (x_i - \bar{x})^2 = 1750, \sum (y_i - \bar{y})^2 = 331.5, \sum (x_i - \bar{x})(y_i - \bar{y}) = -725

    • Slope: b_1 = \dfrac{-725}{1750} \approx -0.41

    • Intercept: b_0 = \bar{y} - b_1 \bar{x} = 27.5 - (-0.41) \times 75 = 58.25

    • Regression line: \hat{y} = 58.25 - 0.41 x

    • Standard deviations: s_x = \sqrt{\dfrac{1750}{6-1}} \approx 18.7, s_y = \sqrt{\dfrac{331.5}{6-1}} \approx 8.1

    • Correlation: r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)s_x s_y} = \dfrac{-725}{5 \times 18.7 \times 8.1} \approx -0.96

    • R^2 = r^2 \approx 0.9216 (about 92% of the variability explained; carrying full precision rather than the rounded s_x, s_y gives r \approx -0.95 and R^2 \approx 0.906)
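The same check in Python (a sketch; at full precision the slope is about -0.414 and R^2 \approx 0.906, a bit below the transcript's 0.9216, which inherits rounding from s_x and s_y). It also shows a useful property: because b_0 = \bar{y} - b_1 \bar{x}, the fitted line passes through (\bar{x}, \bar{y}), and here \bar{x} = 75, so the 75 mph prediction is exactly \bar{y} = 27.5 MPG:

```python
# Speed (mph) vs. MPG data from the exercise.
x = [50, 60, 70, 80, 90, 100]
y = [35, 38, 30, 24, 20, 18]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
syy = sum((b - ybar) ** 2 for b in y)

b1 = sxy / sxx               # about -0.414 (notes round to -0.41)
b0 = ybar - b1 * xbar
r2 = sxy ** 2 / (sxx * syy)  # about 0.906 at full precision

# Prediction at 75 mph, which happens to equal xbar: exactly ybar.
print(round(b0 + b1 * 75, 1), round(r2, 3))  # 27.5 0.906
```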


Quick reference: Formulas (summary)

  • Regression line: \hat{y} = b_0 + b_1 x

  • Slope (least squares): b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}

  • Intercept: b_0 = \bar{y} - b_1 \bar{x}

  • Correlation: r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \; \sum (y_i - \bar{y})^2}}

  • Standard deviations (sample):

    • s_x = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n-1}}

    • s_y = \sqrt{\dfrac{\sum (y_i - \bar{y})^2}{n-1}}

  • Relationship between r and R^2:

    • R^2 = r^2

  • Residuals:

    • e_i = y_i - \hat{y}_i

  • Extrapolation caution: predictions far outside the observed data range can be unreliable.
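The residual formula above can be exercised on the Speed vs. MPG data. A sketch (using the transcript's rounded line \hat{y} = 58.25 - 0.41x); because the intercept is computed as \bar{y} - b_1 \bar{x}, the residuals sum to zero (up to floating-point noise) regardless of how the slope is rounded:

```python
# Residuals e_i = y_i - yhat_i for the Speed vs. MPG data, using the
# transcript's rounded fitted line yhat = 58.25 - 0.41 x.
x = [50, 60, 70, 80, 90, 100]
y = [35, 38, 30, 24, 20, 18]

residuals = [yi - (58.25 - 0.41 * xi) for xi, yi in zip(x, y)]
print([round(e, 2) for e in residuals])  # [-2.75, 4.35, 0.45, -1.45, -1.35, 0.75]

# With an intercept of the form ybar - b1 * xbar, the residuals always
# average to zero; minimizing their sum of squares is what fixes the slope.
```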


Notes on interpretation and practice

  • Always plot data (scatter plot) before fitting a regression model to check linearity and identify outliers.

  • R^2 interpretation: the proportion of variability in the response that the model explains. Higher is better, but not a complete measure of model quality.

  • Extrapolation awareness: the regression line is most reliable within the range of observed x-values; predictions outside that range should be treated with caution.

  • Regression assumptions (not deeply covered in the transcript, but important in practice): linear relationship, independent errors, constant variance of errors (homoscedasticity), normality of errors (for inference).


Summary of key numerical results from the transcript

  • Knee height vs. height (n=6):

    • Line: \hat{y} = 43.175 + 2.45 x

    • \bar{x} = 50.5, \bar{y} = 166.9

    • \sum (x_i - \bar{x})^2 = 181.84, \sum (x_i - \bar{x})(y_i - \bar{y}) = 446.07

    • s_x \approx 6.03, \; s_y \approx 16.62

    • r \approx 0.89, \; R^2 \approx 0.7921

  • Chicago temperature (months 1–7) (n=7):

    • Line: \hat{y} = 9.8 + 9.45 x

    • \bar{x} = 4, \; \bar{y} = 47.6

    • Predicted December temperature: \hat{y}(x=12) = 123.2^{\circ}F (not realistic; extrapolation risk)

    • r \approx 0.99, \; R^2 \approx 0.9801

  • Speed vs. MPG (n=6):

    • Line: \hat{y} = 58.25 - 0.41 x

    • Predictions: \hat{y}(75) = 27.5\;\text{MPG}, \; x(\text{for } y=32) \approx 64

    • \sum (x_i - \bar{x})^2 = 1750, \; \sum (y_i - \bar{y})^2 = 331.5, \sum (x_i - \bar{x})(y_i - \bar{y}) = -725

    • s_x \approx 18.7, \; s_y \approx 8.1

    • r \approx -0.96, \; R^2 \approx 0.9216 (full precision: r \approx -0.95, R^2 \approx 0.906)

  • General lesson: use scatter plots to validate assumptions, compute regression line via least squares, interpret slope/intercept, assess fit with R^2, and beware extrapolation.