Chapter 9: Inference for Linear Regression – Comprehensive Study Notes

Lecture 1 – Simple Regression Modelling

• Overarching aim of course: understand Statistics as the science of collecting, analysing & interpreting data.

• Previous topic: chi-square inference for association between two categorical variables.

• Today’s focus: linear relationship between two quantitative variables (simple linear regression).

• Learning outcomes

– Distinguish regression line at population level (true line) from regression line estimated in a sample.

– Know & visualise assumptions of linear regression model.

• Terminology revision (Week 2)

– Scatterplot: primary graphical tool for two quantitative variables.

– Explanatory variable = x (first-jump score, catchment area, number of drinks…).

– Response variable = y (total score, water quality, BAC…).

– Linear relationship ⇒ fit the “line of best fit” / “least-squares regression line”.

• Least-squares principle

– Choose intercept b_0 and slope b_1 that minimise \sum_i (y_i-\hat y_i)^2.

– Closed-form estimates: b_1 = r \, \dfrac{s_y}{s_x} and b_0 = \bar y - b_1\bar x.

– Interpretations

• The point (\bar x,\bar y) always lies on the fitted line.

• A one-s_x increase in x → expected change of r\,s_y in y.
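
The closed-form estimates can be checked numerically. A minimal sketch in Python (rather than R, purely to illustrate the arithmetic) on a small made-up data set, not one of the lecture examples:

```python
import math

# Hypothetical toy data (for illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Sample standard deviations and correlation, computed from first principles
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

# Least-squares estimates: b1 = r * sy / sx, b0 = ybar - b1 * xbar
b1 = r * sy / sx
b0 = ybar - b1 * xbar
print(b0, b1)   # ≈ 2.2 and 0.6 for this data

# The point (xbar, ybar) always lies on the fitted line
assert abs((b0 + b1 * xbar) - ybar) < 1e-9
```

In R the same fit is `lm(y ~ x)`; the point of the sketch is only that the slope formula r·s_y/s_x and the "mean point on the line" property can be verified directly.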

• Coefficient of determination r^2

– r^2 = \dfrac{\mathrm{Var}(\hat Y)}{\mathrm{Var}(Y)}.

– Proportion of total variation in Y explained by linear model (range 0–1).
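
The two characterisations of r^2 — the variance ratio \mathrm{Var}(\hat Y)/\mathrm{Var}(Y) and the "proportion of variation explained" form 1 - SSE/SST — give the same number. A short Python check on the same hypothetical toy data as above:

```python
# Toy data and its least-squares fit (hypothetical values, for illustration)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6                    # least-squares estimates for this data
yhat = [b0 + b1 * xi for xi in x]
ybar = sum(y) / len(y)

sst = sum((yi - ybar) ** 2 for yi in y)                 # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))    # residual variation
var_fitted = sum((yh - ybar) ** 2 for yh in yhat)       # variation of fitted values

r2_a = var_fitted / sst     # Var(yhat) / Var(y)  (the (n-1) factors cancel)
r2_b = 1 - sse / sst        # 1 - SSE/SST
print(r2_a, r2_b)           # both ≈ 0.6 here
```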

• Connection to i.i.d. mean model

– Previously: Y_i \stackrel{\text{i.i.d.}}{\sim} (\mu,\sigma^2).

– Now: keep independence but allow means to differ linearly with X: E[Y_i \mid X_i=x_i]=\mu_{Y_i}=\beta_0+\beta_1 x_i.

– Setting \beta_1=0 collapses to the i.i.d. mean model.

• Sample vs population notation

– Population (unknown, fixed): \mu_y = \beta_0+\beta_1 x (“true line”).

– Sample (observable, varies): \hat y = b_0 + b_1 x.

– “Error” = deviation from true mean; “residual” = deviation from fitted line.

• Motivating data sets

– Water quality vs catchment area (n=20)

– Drinks vs BAC (n=22)

• Why inference?

– Each sample produces a different b_0, b_1; the goal is to infer \beta_0, \beta_1.

• Linear regression model assumptions (initial statement)

  1. Linearity of means: points (x_i,\,E[Y_i]) lie on a straight line.

  2. Errors at each x are normally distributed with mean 0.

  3. Independent observations.

  4. Homoscedasticity – common error variance \sigma^2.


Lecture 2 – Inference for the Slope & Prediction

• Primary parameter of interest: slope \beta_1

– Measures expected change in Y for a one-unit change in X.

– Signs: \beta_1=0 (no linear relation); >0 increasing; <0 decreasing.

• Sampling distribution of estimator

– Proposition: \displaystyle T=\frac{\hat\beta_1-\beta_1}{SE(\hat\beta_1)} \sim t(n-2) under the model.

– R supplies \hat\beta_1, se(\hat\beta_1), the t value, and the two-sided p-value.

• Hypothesis test for linear relationship

– H_0: \beta_1=0 vs H_a (one- or two-sided).

– Compute observed t_{obs} = \dfrac{b_1}{se(\hat\beta_1)}.

– P-value rules

• Right-tailed H_a: \beta_1>0 ⇒ P(T\ge t_{obs}).

• Left-tailed H_a: \beta_1<0 ⇒ P(T\le t_{obs}).

• Two-sided H_a: \beta_1\neq0 ⇒ 2P(T\le -|t_{obs}|).
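
These three rules can be made concrete with the water-quality output below (t = 1.552, df = n-2 = 18). In practice you would use R's `pt()` (or `scipy.stats.t.sf` in Python); the stdlib-only sketch below instead integrates the t density numerically, just to show what the tail probabilities mean:

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    return c / math.sqrt(df * math.pi) * (1 + t * t / df) ** (-(df + 1) / 2)

def right_tail(t_obs, df, upper=60.0, steps=100000):
    """P(T >= t_obs), by trapezoidal integration of the density
    (the tail beyond `upper` is negligible for moderate df)."""
    h = (upper - t_obs) / steps
    area = 0.5 * (t_pdf(t_obs, df) + t_pdf(upper, df))
    for i in range(1, steps):
        area += t_pdf(t_obs + i * h, df)
    return area * h

# Water-quality slope test from the R output: t_obs = 1.552, df = 18
t_obs, df = 1.552, 18
p_right = right_tail(t_obs, df)           # Ha: beta1 > 0
p_left  = 1 - p_right                     # Ha: beta1 < 0
p_two   = 2 * right_tail(abs(t_obs), df)  # Ha: beta1 != 0
print(round(p_two, 3))                    # ≈ 0.138, matching the R output
```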

• Reading R output (example – Water quality)

– Intercept 49.79 (SE 8.52; significant, ‘***’).

– Slope 0.460 (SE 0.297; t=1.552; p=0.138 ⇒ no evidence of a linear relationship at the 5 % level).

– R^2=0.118 (≈ 12 % of variability explained).

• Worked supermarket example (display space \to coffee sales)

– b_1=28.0, se=6.1, n=9.

– t_{obs}=4.593, df=7 ⇒ p=0.00125 (strong evidence sales increase with display space).

– 95 % CI: [13.6,42.4] extra dollars per extra ft² (derived later in L3).
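
The supermarket numbers can be reproduced by hand. A sketch in Python (for the arithmetic only; the course workflow is R), taking t^* = qt(0.975, 7) ≈ 2.365 from t tables:

```python
# Supermarket example: values as reported in the notes
b1, se, n = 28.0, 6.1, 9
df = n - 2

# Observed test statistic
t_obs = b1 / se
print(round(t_obs, 2))    # ≈ 4.59

# 95% CI: b1 ± t* × se, with t* = qt(0.975, 7) ≈ 2.365
t_star = 2.365
lo, hi = b1 - t_star * se, b1 + t_star * se
print(round(lo, 1), round(hi, 1))    # ≈ 13.6, 42.4 — the interval quoted above
```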

• Large-scale regression application – fMRI study

– 0/1 stimulus regressor; separate regressions run for 26 033 voxels.

– Multiplicity problem: many simultaneous tests.

– Bonferroni correction: test each of the m hypotheses at level \alpha/m, controlling the family-wise error rate at \alpha.

– Visualisation: 3-D brain maps of voxels with adjusted p<0.05.

• Multiple testing concepts

– Under H_0, P-values ~ Uniform(0,1).

– If many tests are run, expect ≈10 % of p-values below 0.1 just by chance.

– Bonferroni is simple but conservative; other procedures exist (Holm, Benjamini–Hochberg, knockoffs, etc.).
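
The "≈10 % below 0.1 by chance" point and the Bonferroni adjustment can both be sketched in a few lines of Python (simulation with an arbitrary seed, for illustration):

```python
import random

random.seed(1)
m = 10000
# Under H0 every p-value is Uniform(0,1): simulate m "null" tests
pvals = [random.random() for _ in range(m)]

# Roughly 10% fall below 0.1 purely by chance
frac = sum(p < 0.1 for p in pvals) / m
print(round(frac, 2))    # close to 0.10

# Bonferroni: compare each p to alpha/m — equivalently, adjust p to min(1, m*p),
# which is what R's p.adjust(pvec, method="bonferroni") returns
alpha = 0.05
discoveries = sum(p < alpha / m for p in pvals)
adjusted = [min(1.0, m * p) for p in pvals]
print(discoveries)       # typically 0: pure-noise tests rarely survive the correction
```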


Lecture 3 – Checking Model Assumptions

• Expanded assumption list (Assumptions 9.6)

  1. Linearity of E[Y|X].

  2. Normality of errors.

  3. Independence of responses.

  4. Constant variance.


• Importance

– (1) & (4) are crucial for unbiased estimates & valid SE’s.

– (2) less critical for large n thanks to CLT, but watch out for severe skew/outliers.

– (3) depends on study design – random sampling / random allocation.

• Diagnostic plots

– Residual vs fitted

• Detect non-linearity (curvature), heteroscedasticity (fan-shape), outliers.

– Normal Q-Q plot of residuals

• Assess normality; rule-of-thumb sample-size guidance:

◦ n<15: need near-normal data; 15\le n\le40: ok if not strongly skewed; n>40: robust except for gross outliers.

• Examples

– Water quality residual plot: roughly random, variance constant → assumptions ok.

– Simulated U-shape & fan-shape patterns shown; warn when assumptions fail.

– Supermarket example: small n=9, residual plot looks acceptable, Q-Q fairly linear → proceed, but results sensitive to independence assumption (sampling scheme not given).

• Confidence interval for slope

– General form: CI(\beta_1)=\hat\beta_1 \pm t^*\, SE(\hat\beta_1) with df=n-2.

• Standard error expression

SE(\hat\beta_1)=\dfrac{S_\epsilon}{S_x}\,\sqrt{\dfrac{1}{n-1}}, so precision improves when

– n increases,

– Spread in X (S_x) increases,

– Residual SD S_\epsilon decreases.
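
Since S_x^2 = \sum(x_i-\bar x)^2/(n-1), the form above equals the common textbook form S_\epsilon/\sqrt{\sum(x_i-\bar x)^2}. A quick numeric check in Python on a hypothetical toy data set:

```python
import math

# Hypothetical toy data with fitted line yhat = 2.2 + 0.6x (illustration only)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar = sum(x) / n
yhat = [2.2 + 0.6 * xi for xi in x]

# Residual SD uses df = n - 2 (two estimated coefficients)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
s_eps = math.sqrt(sse / (n - 2))

sxx = sum((xi - xbar) ** 2 for xi in x)
s_x = math.sqrt(sxx / (n - 1))

se_notes = (s_eps / s_x) * math.sqrt(1 / (n - 1))   # form used in these notes
se_text  = s_eps / math.sqrt(sxx)                   # equivalent textbook form
print(se_notes, se_text)                            # equal, ≈ 0.283 here
```

Either way, the formula makes the design lesson explicit: more data (larger n) and a wider spread of x values (larger S_x) both shrink the standard error.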


Lecture 4 – A Bit More on Regression & Exam-style Synthesis

• Why the CLT ensures \hat\beta_1 is approximately Normal

– \hat\beta_1 is a weighted sum of the individual errors; the CLT extends to such linear combinations.

• Full worked BAC example (n=22)

– R output: b_0=-0.0044, b_1=0.0109, se=0.0030, t=3.644, p=0.0016 (two-sided).

– One-sided test H_a: \beta_1>0 ⇒ p=0.0008; very strong evidence BAC rises with drinks.

– R^2=0.399 ⇒ ≈40 % variation explained.

– 95 % CI for slope: [0.0046,0.0171] BAC units per drink.

– Prediction at 4 drinks: \hat y= -0.0044+0.0109\times4 = 0.039 (below legal 0.05 limit).

– Diagnostics: slight fan-shape heteroscedasticity; independence unclear (convenience sample from one university) ⇒ results tentative.
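
The BAC numbers above can also be reproduced by hand (Python sketch, for the arithmetic only). Small discrepancies arise because the reported se = 0.0030 is rounded; t^* = qt(0.975, 20) ≈ 2.086 is taken from t tables:

```python
# BAC example: values as reported in the (rounded) R output
b0, b1, se, n = -0.0044, 0.0109, 0.0030, 22
df = n - 2

t_obs = b1 / se
print(round(t_obs, 2))    # ≈ 3.63 (output shows 3.644: se is rounded here)

# 95% CI with t* = qt(0.975, 20) ≈ 2.086
t_star = 2.086
lo, hi = b1 - t_star * se, b1 + t_star * se
print(round(lo, 4), round(hi, 4))   # ≈ 0.0046, 0.0172 (quoted: [0.0046, 0.0171])

# Predicted mean BAC after 4 drinks
yhat = b0 + b1 * 4
print(round(yhat, 3))     # ≈ 0.039, below the legal 0.05 limit
```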

• Study design advice

– To narrow CI for \beta_1, recruit larger n and/or plan wider range of X values.

• Comprehensive inference menu (one vs two variables table) – shows which test/CI to use given the variable types; regression occupies the “both quantitative” cell.

• R workflow summary

– Fit model: fit <- lm(y ~ x)

– Output: summary(fit) provides estimates, SE’s, t, P, R^2.

– Diagnostics: plot(fit) or custom residual & Q-Q plots.

– Critical values: qt(prob, df) for t, qnorm(prob) for z.

• Sanity-check list before finalising answers

– 0\le p\le1

– |t| > 1 for non-trivial results.

– Sign of t_{obs} matches the alternative.

– CI limits in logical order; margin of error positive.

– Pooled s_p must lie between the group SDs, etc.


Model Assumptions – Concise Checklist

• Linearity – Residual vs fitted shows no systematic curve.

• Independence – Random sample / independent units ensured by design.

• Normality – Residual Q-Q roughly straight or sample size sufficiently large.

• Equal variance – Residual spread appears constant across fitted values.

Failures require: transformations, alternative models (e.g. weighted least squares), or non-parametric / permutation methods beyond the scope of MATH1041.


Key Formulas (LaTeX ready)

• Fitted line: \hat y = b_0 + b_1 x.

• Population line: \mu_y = \beta_0 + \beta_1 x.

• Slope estimator: b_1 = r\,\dfrac{s_y}{s_x}.

• Intercept estimator: b_0 = \bar y - b_1 \bar x.

• Test statistic: T = \dfrac{\hat\beta_1-\beta_1}{SE(\hat\beta_1)} \sim t(n-2).

• Confidence interval: \hat\beta_1 \pm t^*\, SE(\hat\beta_1).

• Standard error: SE(\hat\beta_1)=\dfrac{S_\epsilon}{S_x}\sqrt{\dfrac{1}{n-1}}.

• Coefficient of determination: r^2 = 1-\dfrac{\sum (y_i-\hat y_i)^2}{\sum (y_i-\bar y)^2}.

• Bonferroni threshold: \alpha/m; adjusted P-value: p_{adj}=m p.


Ethical & Practical Considerations

• Misinterpretation of p-values (see Wasserstein et al. 2019): always accompany with effect size & CI.

• Multiple testing inflates false discoveries – correction mandatory in large-scale studies (e.g. neuro-imaging, genomics).

• Linearity assumption is conceptual: ensure scientific plausibility before applying model.

• Sample selection bias (e.g. all participants from one university) limits generalisability.


Quick-Reference R Commands

• Fit model lm(y ~ x)

• Summary summary(fit)

• Coefficients coef(fit)

• Residuals residuals(fit)

• Fitted values fitted.values(fit)

• Diagnostics plot(fit) or manual residual & Q-Q plots

• t critical value qt(0.975, df=n-2)

• Bonferroni adjust p.adjust(pvec, method="bonferroni")


What You Should Be Able To Do After Chapter 9

• Fit and interpret a simple linear regression in R.

• Test H_0: \beta_1=0 and construct CIs for \beta_1.

• Predict mean response at specified x using fitted line.

• Generate & interpret residual and Q-Q plots.

• Assess when model assumptions are (not) met.

• Apply Bonferroni (or describe need for) in multiple-comparison settings.

• Choose the correct inference tool given data type & research question (see the comprehensive inference menu in Lecture 4).
