Lecture 9 Inference for Linear Regression – Comprehensive Study Notes

Lecture 1 – Simple Regression Modelling

• Overarching aim of course: understand Statistics as the science of collecting, analysing & interpreting data.

• Previous topic: chi-square inference for association between two categorical variables.

• Today’s focus: linear relationship between two quantitative variables (simple linear regression).

• Learning outcomes

– Distinguish regression line at population level (true line) from regression line estimated in a sample.

– Know & visualise assumptions of linear regression model.

• Terminology revision (Week 2)

– Scatterplot: primary graphical tool for two quantitative variables.

– Explanatory variable = $x$ (first-jump score, catchment area, number of drinks…).

– Response variable = $y$ (total score, water quality, BAC…).

– Linear relationship ⇒ fit the “line of best fit” / “least-squares regression line”.

• Least-squares principle

– Choose intercept $b_0$ and slope $b_1$ that minimise $\sum (y_i-\hat y_i)^2$.

– Closed-form estimates: $b_1 = r \, \dfrac{s_y}{s_x}$ and $b_0 = \bar y - b_1\bar x$.

– Interpretations

• Point $(\bar x,\bar y)$ always lies on fitted line.

• One $s_x$ increase in $x$ → expected change $r s_y$ in $y$.
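The closed-form estimates can be checked numerically. The sketch below uses a small simulated data set (the seed, sample size and "true" coefficients are illustrative assumptions, not course data) and compares the hand-computed values with those returned by lm().

```r
# Simulated data for illustration only (not one of the course data sets)
set.seed(1)
x <- runif(20, min = 0, max = 10)
y <- 3 + 0.5 * x + rnorm(20, sd = 1)

# Closed-form least-squares estimates
r  <- cor(x, y)
b1 <- r * sd(y) / sd(x)          # slope
b0 <- mean(y) - b1 * mean(x)     # intercept

# Same estimates from lm()
fit <- lm(y ~ x)
rbind(by_hand = c(b0, b1), lm = unname(coef(fit)))

# The fitted line passes through (x-bar, y-bar)
y_at_xbar <- b0 + b1 * mean(x)
c(y_at_xbar = y_at_xbar, y_bar = mean(y))
```

The two rows of the printed table should agree up to floating-point error, and the fitted value at $\bar x$ should equal $\bar y$.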

• Coefficient of determination $r^2$

– $r^2 = \dfrac{\mathrm{Var}(\hat Y)}{\mathrm{Var}(Y)}$ .

– Proportion of total variation in $Y$ explained by linear model (range 0–1).
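As a rough check, the different ways of writing $r^2$ can be compared on simulated data (the data and seed below are assumptions made for illustration). For simple linear regression they should all give the same number.

```r
# Simulated data for illustration only
set.seed(2)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

r2_cor   <- cor(x, y)^2                                    # squared correlation
r2_var   <- var(fitted(fit)) / var(y)                      # Var(Y-hat) / Var(Y)
r2_resid <- 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)   # 1 - SSE / SST
r2_lm    <- summary(fit)$r.squared                         # value reported by R

c(r2_cor = r2_cor, r2_var = r2_var, r2_resid = r2_resid, r2_lm = r2_lm)
```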

• Connection to i.i.d. mean model

– Previously: $Y_i \stackrel{\text{i.i.d.}}{\sim} (\mu,\sigma^2)$.

– Now: keep independence but allow means to differ linearly with $X$: $E[Y_i|X_i=x_i]=\mu_{Y_i}=\beta_0+\beta_1 x_i$.

– Setting $\beta_1=0$ collapses to i.i.d. mean model.

• Sample vs population notation

– Population (unknown, fixed): $\mu_y = \beta_0+\beta_1 x$ ("true line").

– Sample (observable, varies): $\hat y = b_0 + b_1 x$.

– “Error” = deviation from true mean; “residual” = deviation from fitted line.
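The error/residual distinction is easiest to see in a simulation, where the true line is known. In the sketch below the values of beta0, beta1 and the noise SD are hypothetical choices; with real data the errors cannot be computed because the true line is unknown.

```r
# "True" parameters chosen for illustration only
set.seed(3)
beta0 <- 2
beta1 <- 0.8
x <- seq(1, 10, length.out = 15)
y <- beta0 + beta1 * x + rnorm(15, sd = 1)

fit <- lm(y ~ x)
err <- y - (beta0 + beta1 * x)   # errors: deviations from the true line
res <- y - fitted(fit)           # residuals: deviations from the fitted line

head(cbind(error = err, residual = res))
sum(res)   # essentially zero, a by-product of least squares
sum(err)   # generally not zero
```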

• Motivating data sets

– Water quality vs catchment area ( $n=20$ )

– Drinks vs BAC ( $n=22$ )

• Why inference?

– Each sample produces a different $b_0,b_1$; goal is to infer $\beta_0,\beta_1$.

• Linear regression model assumptions (initial statement)

Linearity of means: points $\big(x_i,\,E[Y_i]\big)$ lie on a straight line.
Errors at each $x$ are normally distributed with mean $0$ .
Independent observations.
Homoscedasticity – common error variance $\sigma^2$ .
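To see why inference is needed, one can repeatedly generate samples that satisfy all four assumptions and watch the fitted slope vary. This is a minimal sketch: the population values (beta0, beta1, sigma) and the design points are hypothetical and not taken from any course data set.

```r
# Hypothetical population values and a fixed design with n = 20
set.seed(4)
beta0 <- 50
beta1 <- 0.5
sigma <- 10
x <- seq(5, 60, length.out = 20)

# Draw one sample under the model and return the fitted slope b1
one_slope <- function() {
  y <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = sigma)
  unname(coef(lm(y ~ x))[2])
}

slopes <- replicate(2000, one_slope())

mean(slopes)                          # close to beta1
sd(slopes)                            # sampling variability of b1
sigma / sqrt(sum((x - mean(x))^2))    # theoretical SD of b1 under the model
```

Each simulated sample gives a different $b_1$, but the values centre on the true slope, which is what makes inference about $\beta_1$ possible.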

Lecture 2 – Inference for the Slope & Prediction

• Primary parameter of interest: slope $\beta_1$

– Measures expected change in $Y$ for one-unit change in $X$.

– Signs: $\beta_1=0$ (no linear relation); $\beta_1>0$ increasing; $\beta_1<0$ decreasing.

• Sampling distribution of estimator

– Proposition: $\displaystyle T=\frac{\hat\beta_1-\beta_1}{SE(\hat\beta_1)} \sim t(n-2)$ under model.

– R supplies $\hat\beta_1$, $se(\hat\beta_1)$, $t$ value, two-sided $p$ value.

• Hypothesis test for linear relationship

– $H_0: \beta_1=0$ vs $H_a$ (one- or two-sided).

– Compute observed $t_{obs} = \dfrac{b_1}{se(\hat\beta_1)}$.

– P-value rules

• Right-tailed $H_a: \beta_1>0$ ⇒ $P(T\ge t_{obs})$.

• Left-tailed $H_a: \beta_1<0$ ⇒ $P(T\le t_{obs})$.

• Two-sided $H_a: \beta_1\neq0$ ⇒ $2P(T\le -|t_{obs}|)$.
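The three P-value rules map directly onto pt(). The helper below is a hypothetical convenience function (slope_p_value is not part of R or of the course materials), and the inputs t_obs = 2.1 and df = 18 are illustrative values only.

```r
# Hypothetical helper: P-value for H0: beta1 = 0, given t_obs and df = n - 2
slope_p_value <- function(t_obs, df,
                          alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    greater   = pt(t_obs, df, lower.tail = FALSE),   # P(T >= t_obs)
    less      = pt(t_obs, df),                       # P(T <= t_obs)
    two.sided = 2 * pt(-abs(t_obs), df)              # 2 P(T <= -|t_obs|)
  )
}

# Illustrative values: t_obs = 2.1 with n = 20, so df = 18
p_greater <- slope_p_value(2.1, df = 18, alternative = "greater")
p_less    <- slope_p_value(2.1, df = 18, alternative = "less")
p_two     <- slope_p_value(2.1, df = 18, alternative = "two.sided")
c(greater = p_greater, less = p_less, two.sided = p_two)
```

For a positive $t_{obs}$ the two-sided P-value is twice the right-tailed one, and the two one-sided P-values sum to 1.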

• Reading R output (example – Water quality)

– Intercept 49.79 (SE 8.52; ***).

– Slope 0.460 (SE 0.297; $t=1.552; p=0.138$ ⇒ no evidence at 5 %).

– $R^2=0.118$ (≈ 12 % of variability explained).

• Worked supermarket example (display space $\to$ coffee sales)

– $b_1=28.0, se=6.1, n=9$.

– $t_{obs}=4.593, df=7$ ⇒ one-sided $p=0.00125$ for $H_a: \beta_1>0$ (strong evidence sales↑ with space).

– 95 % CI: $[13.6,42.4]$ extra dollars per extra ft² (derived later in Lecture 3).
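The supermarket figures can be reproduced from the quoted summary values alone. The sketch below uses only $b_1$, its standard error and $n$ as given in the notes; because these inputs are rounded, the computed $t_{obs}$ differs slightly in the third decimal from the value R would report from the raw data.

```r
# Summary values quoted in the notes (rounded)
b1 <- 28.0
se <- 6.1
n  <- 9
df <- n - 2

t_obs   <- b1 / se                              # about 4.59
p_right <- pt(t_obs, df, lower.tail = FALSE)    # right-tailed, Ha: beta1 > 0
t_star  <- qt(0.975, df)                        # critical value for 95% CI
ci      <- b1 + c(-1, 1) * t_star * se          # estimate +/- margin of error

c(t_obs = t_obs, p_right = p_right, t_star = t_star)
round(ci, 1)
```

Rounded to one decimal place, the interval matches the $[13.6, 42.4]$ quoted above.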

• Large-scale regression application – fMRI study

– 0/1 stimulus regressor – run separate regressions for 26 033 voxels.

– Multiplicity problem: many simultaneous tests.

– Bonferroni correction: test each of the $m$ hypotheses at level $\alpha/m$ to control the family-wise error rate at $\alpha$.

– Visualisation: 3-D brain maps of voxels with adjusted p<0.05.

• Multiple testing concepts

– Under $H_0$ , P-values ~ Uniform(0,1).

– If many tests, expect ≈10 % of $p$ ’s <0.1 just by chance.

– Bonferroni is simple but conservative; other procedures exist (Holm, Benjamini-Hochberg, knockoffs, etc.).
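A small simulation illustrates both points: under $H_0$ about 10 % of P-values fall below 0.1, and the Bonferroni adjustment removes most of these chance findings. The number of tests m = 1000 and the sample size are illustrative assumptions (far fewer tests than the fMRI study), and every regression here is on pure noise so that $H_0$ is true by construction.

```r
set.seed(7)
m <- 1000   # number of tests (illustrative)
n <- 20     # observations per regression (illustrative)

# Each test regresses pure-noise y on x, so H0: beta1 = 0 is true
p_values <- replicate(m, {
  x <- rnorm(n)
  y <- rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

mean(p_values < 0.1)    # roughly 0.10 under H0
mean(p_values < 0.05)   # roughly 0.05 under H0

# Bonferroni: compare raw p to alpha/m, or adjusted p to alpha
alpha  <- 0.05
p_bonf <- p.adjust(p_values, method = "bonferroni")
sum(p_values < alpha / m)   # rejections using the alpha/m threshold
sum(p_bonf < alpha)         # rejections using adjusted P-values
```

p.adjust() caps the adjusted values at 1, so its output equals the smaller of $mp$ and 1.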

Lecture 3 – Checking Model Assumptions

• Expanded assumption list (Assumptions 9.6)

(1) Linearity of $E[Y|X]$.
(2) Normality of errors.
(3) Independence of responses.
(4) Constant variance.

• Importance

– (1) & (4) are crucial for unbiased estimates & valid SE’s.

– (2) less critical for large $n$ thanks to CLT, but watch out for severe skew/outliers.

– (3) depends on study design – random sampling / random allocation.

• Diagnostic plots

– Residual vs fitted

• Detect non-linearity (curvature), heteroscedasticity (fan-shape), outliers.

– Normal Q-Q plot of residuals

• Assess normality; rule-of-thumb sample-size guidance:

◦ $n<15$ need near-normal; $15\le n\le40$ ok if not strongly skewed; $n>40$ robust except gross outliers.
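Both diagnostic plots can be drawn by hand from a fitted model. The sketch below uses simulated data that satisfy the model (an assumption made so the plots show the "healthy" pattern); with real data, replace x and y with the observed variables.

```r
# Simulated data satisfying the model, for illustration only
set.seed(8)
x <- runif(40, 0, 10)
y <- 5 + 2 * x + rnorm(40, sd = 2)
fit <- lm(y ~ x)

op <- par(mfrow = c(1, 2))           # two plots side by side

# Residuals vs fitted: look for curvature, fan shapes, outliers
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the line suggest approximate normality
qqnorm(resid(fit))
qqline(resid(fit))

par(op)                              # restore previous plotting settings
```

A patternless band around zero in the first plot and points close to the reference line in the second are what the assumptions lead one to expect.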

• Examples

– Water quality residual plot: roughly random, variance constant → assumptions ok.

– Simulated U-shape & fan-shape patterns shown; warn when assumptions fail.

– Supermarket example: small $n=9$ , residual plot looks acceptable, Q-Q fairly linear → proceed, but results sensitive to independence assumption (sampling scheme not given).

• Confidence interval for slope

– General form: $CI_C(\beta_1)=\hat\beta_1 \pm t^* SE(\hat\beta_1)$ with $df=n-2$.

• Standard error expression

$SE(\hat\beta_1)=\dfrac{S_\epsilon}{S_x}\,\sqrt{\dfrac{1}{n-1}}$ so precision improves when

– $n$ increases,

– Spread in $X$ ($S_x$) increases,

– Residual SD $S_\epsilon$ decreases.
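The standard error formula and the confidence interval can be verified against R's own output. This is a sketch on simulated data (the seed and coefficients are assumptions); it takes $S_\epsilon$ to be the residual SD computed with $n-2$ degrees of freedom, which is the quantity R reports as the residual standard error.

```r
# Simulated data for illustration only
set.seed(9)
n <- 25
x <- runif(n, 0, 10)
y <- 1 + 0.7 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

# Residual SD uses n - 2 degrees of freedom
s_eps <- sqrt(sum(resid(fit)^2) / (n - 2))
se_b1 <- s_eps / sd(x) * sqrt(1 / (n - 1))

# 95% confidence interval for the slope, by hand
b1     <- unname(coef(fit)["x"])
t_star <- qt(0.975, df = n - 2)
ci     <- b1 + c(-1, 1) * t_star * se_b1

# Compare with R's own output
c(by_hand = se_b1, from_R = summary(fit)$coefficients["x", "Std. Error"])
ci
confint(fit, "x", level = 0.95)
```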

Lecture 4 – A Bit More on Regression & Exam-style Synthesis

• Why CLT ensures $\hat\beta_1$ ~ Normal (approximately, for large $n$) – $\hat\beta_1$ is a weighted sum of the responses $Y_i$; CLT extends to such linear combinations.

• Full worked BAC example ( $n=22$ )

– R output: $b_0=-0.0044$, $b_1=0.0109$, $se=0.0030$, $t=3.644$, $p=0.0016$ two-sided.

– One-sided test $H_a: \beta_1>0$ ⇒ $p=0.0008$; very strong evidence BAC rises with drinks.

– $R^2=0.399$ ⇒ ≈40 % variation explained.

– 95 % CI for slope: $[0.0046,0.0171]$ BAC units per drink.

– Prediction at 4 drinks: $\hat y= -0.0044+0.0109\times4 = 0.039$ (below legal 0.05 limit).

– Diagnostics: slight fan-shape heteroscedasticity; independence unclear (convenience sample from one university) ⇒ results tentative.
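The BAC calculations can be retraced from the rounded summary values quoted above. Because the inputs are rounded, the results may differ from the R output in the last reported digit (for example, $t_{obs}$ comes out near 3.63 rather than 3.644); the raw data are not used here.

```r
# Rounded summary values quoted in the notes
b0 <- -0.0044
b1 <- 0.0109
se <- 0.0030
n  <- 22
df <- n - 2

t_obs <- b1 / se                                # about 3.63 from rounded inputs
p_two <- 2 * pt(-abs(t_obs), df)                # Ha: beta1 != 0
p_one <- pt(t_obs, df, lower.tail = FALSE)      # Ha: beta1 > 0
ci    <- b1 + c(-1, 1) * qt(0.975, df) * se     # 95% CI for the slope
y_hat <- b0 + b1 * 4                            # fitted BAC at 4 drinks

c(t_obs = t_obs, p_two = p_two, p_one = p_one)
ci
y_hat
```

The one-sided P-value is half the two-sided one because $t_{obs}$ is positive and the alternative is $\beta_1>0$.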

• Study design advice

– To narrow CI for $\beta_1$ , recruit larger $n$ and/or plan wider range of $X$ values.

• Comprehensive inference menu (one vs two variables table) – reminds which test/CI to use given variable types; regression occupies “both quantitative” cell.

• R workflow summary

– Fit model: fit <- lm(y ~ x)

– Output: summary(fit) provides estimates, SE’s, t, P, $R^2$ .

– Diagnostics: plot(fit) or custom residual & Q-Q plots.

– Critical values: qt(prob, df) for t, qnorm(prob) for z.
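As a quick illustration of the critical-value functions, the sketch below tabulates the 95 % two-sided $t^*$ for several degrees of freedom next to the corresponding $z^*$. The df values 7, 18 and 20 correspond to the supermarket, water quality and BAC examples ($df=n-2$); 100 and 1000 are added only to show the trend.

```r
# 95% two-sided critical values: t* for several df versus z*
dfs    <- c(7, 18, 20, 100, 1000)
t_star <- qt(0.975, df = dfs)
z_star <- qnorm(0.975)

data.frame(df = dfs,
           t_star = round(t_star, 3),
           z_star = round(z_star, 3))
```

The $t^*$ values shrink towards $z^*$ as df grows, which is why small samples give wider intervals for the same standard error.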

• Sanity-check list before finalising answers

– $0\le p\le1$

– $|t| > 1$ for non-trivial results.

– Sign of $t_{obs}$ matches alternative.

– CI limits in logical order; margin of error positive.

– Pooled $s_p$ must lie between group SD’s, etc.

Model Assumptions – Concise Checklist

• Linearity – Residual vs fitted shows no systematic curve.

• Independence – Random sample / independent units ensured by design.

• Normality – Residual Q-Q roughly straight or sample size sufficiently large.

• Equal variance – Residual spread appears constant across fitted values.

Failures require: transformations, alternative models (e.g. weighted least squares), or non-parametric / permutation methods beyond scope of MATH1041.

Key Formulas (LaTeX ready)

• Fitted line: $\hat y = b_0 + b_1 x$.

• Population line: $\mu_y = \beta_0 + \beta_1 x$.

• Slope estimator: $b_1 = r \dfrac{s_y}{s_x}$.

• Intercept estimator: $b_0 = \bar y - b_1 \bar x$.

• Test statistic: $T = \dfrac{\hat\beta_1-\beta_1}{SE(\hat\beta_1)} \sim t(n-2)$.

• Confidence interval: $\hat\beta_1 \pm t^* SE(\hat\beta_1)$.

• Standard error: $SE(\hat\beta_1)=\dfrac{S_\epsilon}{S_x}\sqrt{\dfrac{1}{n-1}}$.

• Coefficient of determination: $r^2 = 1-\dfrac{\sum (y_i-\hat y_i)^2}{\sum (y_i-\bar y)^2}$.

• Bonferroni threshold: $\alpha/m$; adjusted P-value: $p_{adj}=\min(1,\,m p)$.

Ethical & Practical Considerations

• Misinterpretation of $p$ -values (see Wasserstein et al. 2019): always accompany with effect size & CI.

• Multiple testing inflates false discoveries – correction mandatory in large-scale studies (e.g. neuro-imaging, genomics).

• Linearity assumption is conceptual: ensure scientific plausibility before applying model.

• Sample selection bias (e.g. all participants from one university) limits generalisability.

Quick-Reference R Commands

• Fit model lm(y ~ x)

• Summary summary(fit)

• Coefficients coef(fit)

• Residuals residuals(fit)

• Fitted values fitted.values(fit)

• Diagnostics plot(fit) or manual residual & Q-Q plots

• t critical value qt(0.975, df=n-2)

• Bonferroni adjust p.adjust(pvec, method="bonferroni")

What You Should Be Able To Do After Chapter 9

• Fit and interpret a simple linear regression in R.

• Test $H_0: \beta_1=0$ and construct CI’s for $\beta_1$.

• Predict mean response at specified $x$ using fitted line.

• Generate & interpret residual and Q-Q plots.

• Assess when model assumptions are (not) met.

• Apply Bonferroni (or describe need for) in multiple-comparison settings.

• Choose correct inference tool given data type & research question (see
