Lecture 9: Inference for Linear Regression – Comprehensive Study Notes
Lecture 1 – Simple Regression Modelling
• Overarching aim of course: understand Statistics as the science of collecting, analysing & interpreting data.
• Previous topic: chi-square inference for association between two categorical variables.
• Today’s focus: linear relationship between two quantitative variables (simple linear regression).
• Learning outcomes
– Distinguish regression line at population level (true line) from regression line estimated in a sample.
– Know & visualise assumptions of linear regression model.
• Terminology revision (Week 2)
– Scatterplot: primary graphical tool for two quantitative variables.
– Explanatory variable = x (first-jump score, catchment area, number of drinks…).
– Response variable = y (total score, water quality, BAC…).
– Linear relationship ⇒ fit the “line of best fit” / “least-squares regression line”.
• Least-squares principle
– Choose intercept b_0 and slope b_1 that minimise \sum_i (y_i-\hat y_i)^2.
– Closed-form estimates: b_1 = r\,\dfrac{s_y}{s_x} and b_0 = \bar y - b_1\bar x.
– Interpretations
• The point (\bar x,\bar y) always lies on the fitted line.
• An increase of one s_x in x predicts a change of r\,s_y in y.
• Coefficient of determination r^2
– r^2 = \dfrac{\mathrm{Var}(\hat Y)}{\mathrm{Var}(Y)}.
– Proportion of total variation in Y explained by linear model (range 0–1).
• Connection to i.i.d. mean model
– Previously: Y_i \stackrel{\text{i.i.d.}}{\sim} (\mu,\sigma^2).
– Now: keep independence but allow means to differ linearly with X: E[Y_i \mid X_i=x_i]=\mu_{Y_i}=\beta_0+\beta_1 x_i.
– Setting \beta_1=0 collapses to the i.i.d. mean model.
• Sample vs population notation
– Population (unknown, fixed): \mu_y = \beta_0+\beta_1 x ("true line").
– Sample (observable, varies): \hat y = b_0 + b_1 x.
– “Error” = deviation from true mean; “residual” = deviation from fitted line.
• Motivating data sets
– Water quality vs catchment area (n=20)
– Drinks vs BAC (n=22)
• Why inference?
– Each sample produces a different b_0, b_1; the goal is to infer \beta_0, \beta_1.
• Linear regression model assumptions (initial statement)
Linearity of means: points \big(x_i,\,E[Y_i]\big) lie on a straight line.
Errors at each x are normally distributed with mean 0.
Independent observations.
Homoscedasticity – common error variance \sigma^2.
Lecture 2 – Inference for the Slope & Prediction
• Primary parameter of interest: slope \beta_1
– Measures expected change in Y for a one-unit change in X.
– Signs: \beta_1=0 (no linear relation); \beta_1>0 increasing; \beta_1<0 decreasing.
• Sampling distribution of estimator
– Proposition: \displaystyle T=\frac{\hat\beta_1-\beta_1}{SE(\hat\beta_1)} \sim t(n-2) under the model.
– R supplies \hat\beta_1, se(\hat\beta_1), the t value and the two-sided p value.
• Hypothesis test for linear relationship
– H_0: \beta_1=0 vs H_a (one- or two-sided).
– Compute observed t_{obs} = \dfrac{b_1}{se(\hat\beta_1)}.
– P-value rules
• Right-tailed H_a: \beta_1>0 ⇒ P(T\ge t_{obs}).
• Left-tailed H_a: \beta_1<0 ⇒ P(T\le t_{obs}).
• Two-sided H_a: \beta_1\neq0 ⇒ 2P(T\le -|t_{obs}|).
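These tail probabilities can be evaluated without R. A stdlib-only Python sketch that integrates the t density numerically (the values t_obs = 2.5 and df = 18 are hypothetical, chosen only to illustrate the three rules; in the course you would use R's pt()):

```python
import math

def t_sf(t_obs, df, steps=20000, upper=80.0):
    """P(T >= t_obs) for a t(df) distribution, by trapezoidal
    integration of the density from t_obs out to a far upper limit."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = (upper - t_obs) / steps
    s = 0.5 * (pdf(t_obs) + pdf(upper))
    s += sum(pdf(t_obs + i * h) for i in range(1, steps))
    return s * h

t_obs, df = 2.5, 18                 # hypothetical statistic and df = n - 2

p_right = t_sf(t_obs, df)           # Ha: beta1 > 0
p_left = 1 - p_right                # Ha: beta1 < 0 (valid since T is continuous)
p_two = 2 * t_sf(abs(t_obs), df)    # Ha: beta1 != 0

print(round(p_right, 4), round(p_left, 4), round(p_two, 4))
```

The two-sided p is exactly twice the smaller tail, which is why the rule above doubles one tail probability.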
• Reading R output (example – Water quality)
– Intercept 49.79 (SE 8.52; significant, ***).
– Slope 0.460 (SE 0.297; t=1.552; p=0.138 ⇒ no evidence of a linear relationship at the 5 % level).
– R^2=0.118 (≈ 12 % of variability explained).
• Worked supermarket example (display space \to coffee sales)
– b_1=28.0, se=6.1, n=9.
– t_{obs}=4.593, df=7 ⇒ p=0.00125 (strong evidence sales increase with display space).
– 95 % CI: [13.6,42.4] extra dollars per extra ft² (derived later in L3).
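The supermarket numbers can be reproduced from b_1 and its SE alone. A stdlib-only Python check (the t helpers are hand-rolled stand-ins for R's pt/qt; the quoted p = 0.00125 lines up with the one-sided tail for H_a: \beta_1 > 0, and small discrepancies come from using the rounded b_1 and se):

```python
import math

def t_sf(t_obs, df, steps=20000, upper=80.0):
    """P(T >= t_obs) for a t(df) distribution (trapezoidal integration)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = (upper - t_obs) / steps
    s = 0.5 * (pdf(t_obs) + pdf(upper))
    s += sum(pdf(t_obs + i * h) for i in range(1, steps))
    return s * h

def t_crit(p_upper, df):
    """Upper-tail critical value: t* with P(T >= t*) = p_upper (bisection)."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_sf(mid, df) > p_upper:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Values quoted in the notes (rounded)
b1, se, n = 28.0, 6.1, 9
df = n - 2

t_obs = b1 / se            # ~4.590; the lecture's 4.593 comes from unrounded values
p_one = t_sf(t_obs, df)    # one-sided p for Ha: beta1 > 0, ~0.0013

tstar = t_crit(0.025, df)                 # ~2.365 (95% CI, df = 7)
ci = (b1 - tstar * se, b1 + tstar * se)   # ~[13.6, 42.4]
print(round(t_obs, 3), round(p_one, 5), [round(v, 1) for v in ci])
```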
• Large-scale regression application – fMRI study
– A 0/1 stimulus regressor; a separate regression is run for each of 26 033 voxels.
– Multiplicity problem: many simultaneous tests.
– Bonferroni correction: test each voxel at level \alpha/m to control the family-wise error rate.
– Visualisation: 3-D brain maps of voxels with adjusted p<0.05.
• Multiple testing concepts
– Under H_0, P-values ~ Uniform(0,1).
– If many tests, expect ≈10 % of p’s <0.1 just by chance.
– Bonferroni is simple but conservative; other procedures (Holm, Benjamini-Hochberg, knockoffs, etc.).
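The "≈10 % of p-values below 0.1 under H_0" claim is easy to simulate, and Bonferroni adjustment is one line. A stdlib-only sketch (m = 10000 is an arbitrary illustrative number of tests; R's p.adjust applies the same min(m·p, 1) capping):

```python
import random

random.seed(1)
m = 10000
# Under H0 every p-value is Uniform(0,1)
pvals = [random.random() for _ in range(m)]

frac_below = sum(p < 0.1 for p in pvals) / m    # about 0.10 by chance alone
p_adj = [min(m * p, 1.0) for p in pvals]        # Bonferroni adjustment

n_raw = sum(p < 0.05 for p in pvals)   # many spurious "discoveries" (~500)
n_adj = sum(p < 0.05 for p in p_adj)   # typically zero or near zero
print(round(frac_below, 3), n_raw, n_adj)
```

This is exactly the multiplicity problem in the fMRI example: with thousands of voxels, unadjusted testing guarantees false positives.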
Lecture 3 – Checking Model Assumptions
• Expanded assumption list (Assumptions 9.6)
1. Linearity of E[Y|X].
2. Normality of errors.
3. Independence of responses.
4. Constant variance.
• Importance
– (1) & (4) are crucial for unbiased estimates & valid SE’s.
– (2) less critical for large n thanks to CLT, but watch out for severe skew/outliers.
– (3) depends on study design – random sampling / random allocation.
• Diagnostic plots
– Residual vs fitted
• Detect non-linearity (curvature), heteroscedasticity (fan-shape), outliers.
– Normal Q-Q plot of residuals
• Assess normality; rule-of-thumb sample-size guidance:
◦ n<15: need near-normal data; 15\le n\le 40: ok if not strongly skewed; n>40: robust except for gross outliers.
• Examples
– Water quality residual plot: roughly random, variance constant → assumptions ok.
– Simulated U-shape & fan-shape patterns shown; warn when assumptions fail.
– Supermarket example: small n=9, residual plot looks acceptable, Q-Q fairly linear → proceed, but results sensitive to independence assumption (sampling scheme not given).
• Confidence interval for slope
– General form: CI for \beta_1: \hat\beta_1 \pm t^*\,SE(\hat\beta_1) with df=n-2.
• Standard error expression
SE(\hat\beta_1)=\dfrac{S_\epsilon}{S_x}\,\sqrt{\dfrac{1}{n-1}}, so precision improves when
– n increases,
– spread in X (S_x) increases,
– residual SD S_\epsilon decreases.
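The SE expression above is the usual SE(\hat\beta_1)=S_\epsilon/\sqrt{\sum_i(x_i-\bar x)^2} rewritten via S_x, since S_x\sqrt{n-1}=\sqrt{\sum_i(x_i-\bar x)^2}. A quick stdlib Python check on made-up data (the x and y values are invented for illustration only):

```python
import math

# Made-up data for illustration
x = [2.0, 4.0, 5.0, 7.0, 8.0, 11.0]
y = [3.1, 5.2, 5.9, 8.4, 8.9, 12.3]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s_eps = math.sqrt(sum(e * e for e in resid) / (n - 2))   # residual SD
s_x = math.sqrt(sxx / (n - 1))                           # sample SD of x

se_textbook = (s_eps / s_x) * math.sqrt(1 / (n - 1))     # formula from the notes
se_direct = s_eps / math.sqrt(sxx)                       # standard form

# The two forms are algebraically identical
assert abs(se_textbook - se_direct) < 1e-12
print(round(se_textbook, 4))
```

Reading off the factors: a larger n (via n-2 and n-1), a wider spread in x (larger S_x), or a smaller residual SD all shrink the SE.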
Lecture 4 – A Bit More on Regression & Exam-style Synthesis
• Why the CLT ensures \hat\beta_1 is approximately Normal
– \hat\beta_1 is a weighted sum of the responses Y_i; the CLT extends to such linear combinations.
• Full worked BAC example (n=22)
– R output: b_0=-0.0044, b_1=0.0109, se=0.0030, t=3.644, two-sided p=0.0016.
– One-sided test H_a: \beta_1>0 ⇒ p=0.0008; very strong evidence BAC rises with drinks.
– R^2=0.399 ⇒ ≈40 % variation explained.
– 95 % CI for slope: [0.0046,0.0171] BAC units per drink.
– Prediction at 4 drinks: \hat y= -0.0044+0.0109\times4 = 0.039 (below legal 0.05 limit).
– Diagnostics: slight fan-shape heteroscedasticity; independence unclear (sample convenience from one university) ⇒ results tentative.
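The BAC numbers above can be re-derived from the rounded coefficients. A stdlib-only Python check (the t helpers are hand-rolled stand-ins for R's pt/qt; small discrepancies versus the quoted output come from rounding b_1 and its SE):

```python
import math

def t_sf(t_obs, df, steps=20000, upper=80.0):
    """P(T >= t_obs) for a t(df) distribution (trapezoidal integration)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = (upper - t_obs) / steps
    s = 0.5 * (pdf(t_obs) + pdf(upper))
    s += sum(pdf(t_obs + i * h) for i in range(1, steps))
    return s * h

def t_crit(p_upper, df):
    """Upper-tail critical value: t* with P(T >= t*) = p_upper (bisection)."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_sf(mid, df) > p_upper:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Rounded coefficients quoted in the notes
b0, b1, se, n = -0.0044, 0.0109, 0.0030, 22
df = n - 2

t_obs = b1 / se               # ~3.633 (lecture's 3.644 uses unrounded values)
p_two = 2 * t_sf(t_obs, df)   # ~0.0016, matching the R output

tstar = t_crit(0.025, df)     # ~2.086 for df = 20
ci = (b1 - tstar * se, b1 + tstar * se)

yhat_4 = b0 + b1 * 4          # predicted BAC after 4 drinks, ~0.039
print(round(p_two, 4), [round(v, 4) for v in ci], round(yhat_4, 4))
```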
• Study design advice
– To narrow CI for \beta_1, recruit larger n and/or plan wider range of X values.
• Comprehensive inference menu (one vs two variables table) – shows which test/CI to use given the variable types; regression occupies the “both quantitative” cell.
• R workflow summary
– Fit model: fit <- lm(y ~ x)
– Output: summary(fit) provides estimates, SE’s, t, P, R^2.
– Diagnostics: plot(fit) or custom residual & Q-Q plots.
– Critical values: qt(prob, df) for t, qnorm(prob) for z.
• Sanity-check list before finalising answers
– 0\le p\le1
– |t| > 1 for non-trivial results.
– Sign of t_{obs} matches the alternative.
– CI limits in logical order; margin of error positive.
– Pooled s_p must lie between the group SDs, etc.
Model Assumptions – Concise Checklist
• Linearity – Residual vs fitted shows no systematic curve.
• Independence – Random sample / independent units ensured by design.
• Normality – Residual Q-Q roughly straight or sample size sufficiently large.
• Equal variance – Residual spread appears constant across fitted values.
Failures require: transformations, alternative models (e.g. weighted least squares), or non-parametric / permutation methods beyond the scope of MATH1041.
Key Formulas (LaTeX ready)
• Fitted line: \hat y = b_0 + b_1 x.
• Population line: \mu_y = \beta_0 + \beta_1 x.
• Slope estimator: b_1 = r\,\dfrac{s_y}{s_x}.
• Intercept estimator: b_0 = \bar y - b_1 \bar x.
• Test statistic: T = \dfrac{\hat\beta_1-\beta_1}{SE(\hat\beta_1)} \sim t(n-2).
• Confidence interval: \hat\beta_1 \pm t^*\,SE(\hat\beta_1).
• Standard error: SE(\hat\beta_1)=\dfrac{S_\epsilon}{S_x}\sqrt{\dfrac{1}{n-1}}.
• Coefficient of determination: r^2 = 1-\dfrac{\sum_i (y_i-\hat y_i)^2}{\sum_i (y_i-\bar y)^2}.
• Bonferroni threshold: \alpha/m; adjusted P-value: p_{adj}=m p.
Ethical & Practical Considerations
• Misinterpretation of p-values (see Wasserstein et al. 2019): always accompany with effect size & CI.
• Multiple testing inflates false discoveries – correction mandatory in large-scale studies (e.g. neuro-imaging, genomics).
• Linearity assumption is conceptual: ensure scientific plausibility before applying model.
• Sample selection bias (e.g. all participants from one university) limits generalisability.
Quick-Reference R Commands
• Fit model lm(y ~ x)
• Summary summary(fit)
• Coefficients coef(fit)
• Residuals residuals(fit)
• Fitted values fitted.values(fit)
• Diagnostics plot(fit) or manual residual & Q-Q plots
• t critical value qt(0.975, df=n-2)
• Bonferroni adjust p.adjust(pvec, method="bonferroni")
What You Should Be Able To Do After Chapter 9
• Fit and interpret a simple linear regression in R.
• Test H_0: \beta_1=0 and construct CIs for \beta_1.
• Predict mean response at specified x using fitted line.
• Generate & interpret residual and Q-Q plots.
• Assess when model assumptions are (not) met.
• Apply Bonferroni (or describe need for) in multiple-comparison settings.
• Choose correct inference tool given data type & research question (see the comprehensive inference menu above).