Lecture 9 Inference for Linear Regression – Comprehensive Study Notes
Lecture 1 – Simple Regression Modelling
• Overarching aim of course: understand Statistics as the science of collecting, analysing & interpreting data.
• Previous topic: chi-square inference for association between two categorical variables.
• Today’s focus: linear relationship between two quantitative variables (simple linear regression).
• Learning outcomes
– Distinguish regression line at population level (true line) from regression line estimated in a sample.
– Know & visualise assumptions of linear regression model.
• Terminology revision (Week 2)
– Scatterplot: primary graphical tool for two quantitative variables.
– Explanatory variable = $x$ (first-jump score, catchment area, number of drinks…).
– Response variable = $y$ (total score, water quality, BAC…).
– Linear relationship ⇒ fit the “line of best fit” / “least-squares regression line”.
• Least-squares principle
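– Choose intercept $b_0$ and slope $b_1$ that minimise $\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$.
– Closed-form estimates: $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = r\,\frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$.
– Interpretations
• Point $(\bar{x}, \bar{y})$ always lies on fitted line.
• One-unit increase in $x$ → expected change of $b_1$ in $y$.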
• Coefficient of determination
– $R^2 = r^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$.
– Proportion of total variation in $y$ explained by linear model (range 0–1).
• Connection to i.i.d. mean model
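– Previously: $Y_i = \mu + \varepsilon_i$ with i.i.d. errors $\varepsilon_i$.
– Now: keep independence but allow means to differ linearly with $x$: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.
– Setting $\beta_1 = 0$ collapses to i.i.d. mean model (with $\mu = \beta_0$).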
• Sample vs population notation
– Population (unknown, fixed): $\mu_y = \beta_0 + \beta_1 x$ ("true line").
– Sample (observable, varies): $\hat{y} = b_0 + b_1 x$.
– “Error” = deviation from true mean; “residual” = deviation from fitted line.
• Motivating data sets
– Water quality vs catchment area ()
– Drinks vs BAC ()
• Why inference?
– Each sample produces a different $(b_0, b_1)$; goal is to infer $(\beta_0, \beta_1)$.
• Linear regression model assumptions (initial statement)
(1) Linearity of means: the means $\mu_y$ at each $x$ lie on a straight line, $\mu_y = \beta_0 + \beta_1 x$.
(2) Errors at each $x$ are normally distributed with mean $0$.
(3) Independent observations.
(4) Homoscedasticity – common error variance $\sigma^2$.
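The least-squares estimates and the coefficient of determination from this lecture can be checked numerically. The following is a minimal R sketch on simulated data; the "true line" values (beta0, beta1, sigma), the sample size and the seed are assumptions chosen only for illustration and are not taken from the lecture data sets.

```r
# Simulated data from an assumed population line (all values hypothetical)
set.seed(1)
n     <- 30
beta0 <- 2     # assumed population intercept
beta1 <- 0.5   # assumed population slope
sigma <- 1     # assumed common error SD
x <- runif(n, 0, 10)
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)

# Closed-form least-squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Equivalent slope formula: r * s_y / s_x
b1_alt <- cor(x, y) * sd(y) / sd(x)

# lm() should reproduce the hand calculation
fit <- lm(y ~ x)
print(coef(fit))
print(c(b0 = b0, b1 = b1))

# R^2 as 1 - SSE/SST; in simple regression this equals r^2
r2 <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
print(r2)
```

Re-running with a different seed gives a different $(b_0, b_1)$ while beta0 and beta1 stay fixed, which is the sample-versus-population distinction drawn above.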
Lecture 2 – Inference for the Slope & Prediction
• Primary parameter of interest: slope $\beta_1$
– Measures expected change in $y$ for one-unit change in $x$.
– Signs: $\beta_1 = 0$ (no linear relation); $\beta_1 > 0$ increasing; $\beta_1 < 0$ decreasing.
• Sampling distribution of estimator
– Proposition: $\frac{b_1 - \beta_1}{SE_{b_1}} \sim t_{n-2}$ under model.
– R supplies $b_1$, $SE_{b_1}$, $t$ value, two-sided $P$ value.
• Hypothesis test for linear relationship
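– $H_0: \beta_1 = 0$ vs $H_a$ (one- or two-sided).
– Compute observed $t = b_1 / SE_{b_1}$.
– P-value rules (with $T \sim t_{n-2}$)
• Right-tailed $H_a: \beta_1 > 0$ ⇒ $P = P(T \ge t)$.
• Left-tailed $H_a: \beta_1 < 0$ ⇒ $P = P(T \le t)$.
• Two-sided $H_a: \beta_1 \ne 0$ ⇒ $P = 2\,P(T \ge |t|)$.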
• Reading R output (example – Water quality)
– Intercept 49.79 (SE 8.52; ***).
– Slope 0.460 (SE 0.297; $t = 0.460/0.297 \approx 1.55$, $P > 0.05$ ⇒ no evidence of a linear relationship at 5 %).
– $R^2 \approx 0.12$ (≈ 12 % of variability explained).
• Worked supermarket example (display space vs coffee sales)
– . – ⇒ (strong evidence sales↑ with space).
– 95 % CI: extra dollars per extra ft² (derived later in L3).
• Large-scale regression application – fMRI study
– 0/1 stimulus regressor – run separate regressions for 26 033 voxels.
– Multiplicity problem: many simultaneous tests.
– Bonferroni correction: control family-wise error rate $\le \alpha$ by testing each voxel at level $\alpha/m$, where $m = 26\,033$ is the number of tests.
– Visualisation: 3-D brain maps of voxels with adjusted p<0.05.
• Multiple testing concepts
– Under $H_0$, P-values ~ Uniform(0,1).
– If many tests, expect ≈10 % of $P$'s <0.1 just by chance.
– Bonferroni is simple but conservative; other procedures exist (Holm, Benjamini-Hochberg, knockoffs, etc.).
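The slope test and its P-value rules can be reproduced from the coefficient table that summary() returns. This is a hedged sketch on simulated data: the true line, error SD, sample size and seed are assumptions for illustration, not values from the water quality or supermarket examples.

```r
# Simulated data (assumed true line and error SD, for illustration only)
set.seed(2)
n <- 25
x <- runif(n, 0, 10)
y <- 1 + 0.3 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)
tab <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)

b1    <- tab["x", "Estimate"]
se_b1 <- tab["x", "Std. Error"]
t_obs <- b1 / se_b1         # observed t statistic for H0: beta1 = 0
df    <- n - 2              # degrees of freedom in simple linear regression

# P-value rules for the three alternatives
p_right <- pt(t_obs, df, lower.tail = FALSE)           # Ha: beta1 > 0
p_left  <- pt(t_obs, df)                               # Ha: beta1 < 0
p_two   <- 2 * pt(abs(t_obs), df, lower.tail = FALSE)  # Ha: beta1 != 0

print(c(t = t_obs, right = p_right, left = p_left, two_sided = p_two))
```

R reports only the two-sided P-value; when the sign of $t$ agrees with a one-sided alternative, the one-sided P-value is half the reported value.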
Lecture 3 – Checking Model Assumptions
• Expanded assumption list (Assumptions 9.6)
(1) Linearity of $\mu_y$ (the mean response) in $x$.
(2) Normality of errors.
(3) Independence of responses.
(4) Constant variance.
• Importance
– (1) & (4) are crucial for unbiased estimates & valid SE’s.
– (2) less critical for large $n$ thanks to CLT, but watch out for severe skew/outliers.
– (3) depends on study design – random sampling / random allocation.
• Diagnostic plots
– Residual vs fitted
• Detect non-linearity (curvature), heteroscedasticity (fan-shape), outliers.
– Normal Q-Q plot of residuals
• Assess normality; rule-of-thumb sample-size guidance:
◦ n<15 need near-normal; 15 ≤ n ≤ 40 ok if not strongly skewed; n>40 robust except gross outliers.
• Examples
– Water quality residual plot: roughly random, variance constant → assumptions ok.
– Simulated U-shape & fan-shape patterns shown; warn when assumptions fail.
– Supermarket example: small $n$, residual plot looks acceptable, Q-Q fairly linear → proceed, but results sensitive to independence assumption (sampling scheme not given).
• Confidence interval for slope
– General form: $b_1 \pm t^{*}\,SE_{b_1}$ with $t^{*}$ the critical value of the $t_{n-2}$ distribution.
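• Standard error expression
– $SE_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{s}{s_x \sqrt{n-1}}$, where $s$ is the residual SD, so precision improves when
• $n$ increases,
• Spread in $x$ ($s_x$) increases,
• Residual SD $s$ decreases.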
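The confidence interval and standard error expressions above can be computed by hand and compared with R's built-in confint(). A minimal sketch on simulated data follows; the true line, error SD and seed are assumptions for illustration only.

```r
# Simulated data (hypothetical values, not from the lecture examples)
set.seed(3)
n <- 20
x <- seq(1, 20)
y <- 5 + 1.2 * x + rnorm(n, sd = 3)

fit <- lm(y ~ x)
s   <- summary(fit)$sigma                    # residual SD, uses n - 2 df
b1  <- unname(coef(fit)["x"])

# Standard error of the slope: s / sqrt(sum of squared x deviations)
se_b1 <- s / sqrt(sum((x - mean(x))^2))

# 95% CI: estimate +/- t* x SE, with t* from the t distribution on n - 2 df
tstar <- qt(0.975, df = n - 2)
ci    <- c(lower = b1 - tstar * se_b1, upper = b1 + tstar * se_b1)
print(ci)

# Built-in equivalent
print(confint(fit, "x", level = 0.95))
```

Widening the range of x or increasing n in the simulation shrinks se_b1, matching the three precision factors listed above.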
Lecture 4 – A Bit More on Regression & Exam-style Synthesis
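• Why CLT makes $b_1$ approximately Normal for large $n$ – $b_1$ is a weighted sum of the responses $y_i$ (so $b_1 - \beta_1$ is a weighted sum of the errors, not the residuals); CLT extends to such linear combinations.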
• Full worked BAC example ()
– R output: , , , , two-sided.
– One-sided test $H_a: \beta_1 > 0$ ⇒ $P$ is half the two-sided value (since $t > 0$); very strong evidence BAC rises with drinks.
– $R^2 \approx 0.40$ ⇒ ≈40 % variation explained.
– 95 % CI for slope: BAC units per drink.
– Prediction at 4 drinks: (below legal 0.05 limit).
– Diagnostics: slight fan-shape heteroscedasticity; independence unclear (convenience sample from one university) ⇒ results tentative.
• Study design advice
– To narrow CI for $\beta_1$, recruit larger $n$ and/or plan wider range of $x$ values.
• Comprehensive inference menu (one vs two variables table) – reminds which test/CI to use given variable types; regression occupies “both quantitative” cell.
• R workflow summary
– Fit model: fit <- lm(y ~ x)
– Output: summary(fit) provides estimates, SE’s, t, P, $R^2$.
– Diagnostics: plot(fit) or custom residual & Q-Q plots.
– Critical values: qt(prob, df) for t, qnorm(prob) for z.
• Sanity-check list before finalising answers
–
– > 1 for non-trivial results.
– Sign of $t$ (same as sign of $b_1$) matches alternative.
– CI limits in logical order; margin of error positive.
– Pooled SD must lie between group SD’s, etc.
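Predicting the mean response at a specified $x$, as in the "prediction at 4 drinks" step of the BAC example, amounts to plugging $x$ into the fitted line. The sketch below uses simulated data: the variable names drinks and bac, the design and every numeric value are hypothetical stand-ins and are not the lecture's BAC data.

```r
# Simulated stand-in data (NOT the lecture's BAC data set)
set.seed(4)
drinks <- rep(1:8, each = 2)                              # hypothetical design
bac    <- 0.01 + 0.015 * drinks + rnorm(16, sd = 0.01)    # assumed true line
dat    <- data.frame(drinks = drinks, bac = bac)

fit <- lm(bac ~ drinks, data = dat)

# Predicted mean response at x = 4 via predict()
new  <- data.frame(drinks = 4)
pred <- unname(predict(fit, newdata = new))

# Same prediction by hand: b0 + b1 * 4
by_hand <- unname(coef(fit)["(Intercept)"] + coef(fit)["drinks"] * 4)

print(c(predict = pred, by_hand = by_hand))
```

The column name in newdata must match the explanatory variable name used in the model formula, otherwise predict() cannot find it.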
Model Assumptions – Concise Checklist
• Linearity – Residual vs fitted shows no systematic curve.
• Independence – Random sample / independent units ensured by design.
• Normality – Residual Q-Q roughly straight or sample size sufficiently large.
• Equal variance – Residual spread appears constant across fitted values.
Failures require: transformations, alternative models (e.g. weighted least squares), or non-parametric / permutation methods beyond scope of MATH1041.
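The two diagnostic plots in the checklist can be drawn manually from the residuals and fitted values. This is a minimal sketch on simulated data that satisfies the assumptions by construction; the true line, error SD, sample size and seed are illustrative assumptions.

```r
# Simulated data meeting the model assumptions (hypothetical values)
set.seed(5)
n <- 40
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)

fit <- lm(y ~ x)
res <- residuals(fit)
fv  <- fitted(fit)

# Residuals vs fitted: look for curvature (non-linearity) or a fan shape
plot(fv, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normal Q-Q plot of residuals: points near the line suggest normal errors
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)
```

Replacing the line for y with, say, a quadratic in x or an error SD that grows with x is a quick way to see what the U-shape and fan-shape failures look like.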
Key Formulas (LaTeX ready)
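• Fitted line: $\hat{y} = b_0 + b_1 x$.
• Population line: $\mu_y = \beta_0 + \beta_1 x$.
• Slope estimator: $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = r\,\frac{s_y}{s_x}$.
• Intercept estimator: $b_0 = \bar{y} - b_1 \bar{x}$.
• Test statistic: $t = \frac{b_1}{SE_{b_1}}$, with $n - 2$ degrees of freedom under $H_0: \beta_1 = 0$.
• Confidence interval: $b_1 \pm t^{*}_{n-2}\,SE_{b_1}$.
• Standard error: $SE_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$, where $s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$.
• Coefficient of determination: $R^2 = r^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$.
• Bonferroni threshold: $\alpha/m$ for $m$ tests; adjusted P-value: $\min(1, m\,P)$.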
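The Bonferroni rule and the "P-values are Uniform(0,1) under the null" fact can be illustrated together. In this hedged sketch the number of tests m, the level alpha and the seed are assumptions, and the P-values are simulated directly rather than taken from any real study.

```r
# Simulated P-values for m tests in which every null hypothesis is true
set.seed(6)
m     <- 1000          # hypothetical number of tests
alpha <- 0.05
pvec  <- runif(m)      # Uniform(0,1) P-values under the null

# About 10% of null P-values fall below 0.1 purely by chance
print(mean(pvec < 0.1))

# Bonferroni-adjusted P-values: built-in and by hand
p_bonf   <- p.adjust(pvec, method = "bonferroni")
p_manual <- pmin(1, m * pvec)

# Two equivalent decision rules
reject_threshold <- pvec < alpha / m    # raw P-value vs alpha / m
reject_adjusted  <- p_bonf < alpha      # adjusted P-value vs alpha

print(sum(reject_adjusted))
```

Comparing raw P-values with alpha / m and comparing adjusted P-values with alpha give the same rejections, so either form can be reported.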
Ethical & Practical Considerations
• Misinterpretation of $P$-values (see Wasserstein et al. 2019): always accompany with effect size & CI.
• Multiple testing inflates false discoveries – correction mandatory in large-scale studies (e.g. neuro-imaging, genomics).
• Linearity assumption is conceptual: ensure scientific plausibility before applying model.
• Sample selection bias (e.g. all participants from one university) limits generalisability.
Quick-Reference R Commands
• Fit model: fit <- lm(y ~ x)
• Summary: summary(fit)
• Coefficients: coef(fit)
• Residuals: residuals(fit)
• Fitted values: fitted.values(fit)
• Diagnostics: plot(fit) or manual residual & Q-Q plots
• t critical value: qt(0.975, df=n-2)
• Bonferroni adjust: p.adjust(pvec, method="bonferroni")
What You Should Be Able To Do After Chapter 9
• Fit and interpret a simple linear regression in R.
• Test and construct CI’s for $\beta_1$.
• Predict mean response at specified $x$ using fitted line.
• Generate & interpret residual and Q-Q plots.
• Assess when model assumptions are (not) met.
• Apply Bonferroni (or describe need for) in multiple-comparison settings.
• Choose correct inference tool given data type & research question (see