Lecture 9 Inference for Linear Regression – Comprehensive Study Notes

Lecture 1 – Simple Regression Modelling

• Overarching aim of course: understand Statistics as the science of collecting, analysing & interpreting data.

• Previous topic: chi-square inference for association between two categorical variables.

• Today’s focus: linear relationship between two quantitative variables (simple linear regression).

• Learning outcomes

– Distinguish regression line at population level (true line) from regression line estimated in a sample.

– Know & visualise assumptions of linear regression model.

• Terminology revision (Week 2)

– Scatterplot: primary graphical tool for two quantitative variables.

– Explanatory variable = $x$ (first-jump score, catchment area, number of drinks…).

– Response variable = $y$ (total score, water quality, BAC…).

– Linear relationship ⇒ fit the “line of best fit” / “least-squares regression line”.

• Least-squares principle

– Choose intercept $b_0$ and slope $b_1$ that minimise $\sum (y_i-\hat y_i)^2$.

– Closed-form estimates: $b_1 = r \, \dfrac{s_y}{s_x}$ and $b_0 = \bar y - b_1\bar x$.

– Interpretations

• Point $(\bar x,\bar y)$ always lies on fitted line.

• One $s_x$ increase in $x$ → expected change $r s_y$ in $y$.
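The closed-form estimates can be checked numerically. The following is a minimal R sketch on simulated data (the seed, sample size and true coefficients are illustrative assumptions, not one of the lecture data sets); it compares $r\,s_y/s_x$ and $\bar y - b_1\bar x$ with what lm() returns.

```r
# Simulated data: purely illustrative, not one of the lecture data sets
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 1)

# Closed-form least-squares estimates
b1 <- cor(x, y) * sd(y) / sd(x)   # slope: r * s_y / s_x
b0 <- mean(y) - b1 * mean(x)      # intercept: ybar - b1 * xbar

# Same estimates from lm()
fit <- lm(y ~ x)
coef(fit)

# The point (xbar, ybar) lies on the fitted line
b0 + b1 * mean(x) - mean(y)       # zero up to rounding error
```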

• Coefficient of determination $r^2$

– $r^2 = \dfrac{\mathrm{Var}(\hat Y)}{\mathrm{Var}(Y)}$.

– Proportion of total variation in $Y$ explained by linear model (range 0–1).
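As a rough check, the different ways of writing $r^2$ give the same number for a least-squares fit. The sketch below uses simulated data (all values are assumptions for illustration) and compares the squared correlation, the variance ratio, $1 - SSE/SST$ and the value reported by summary().

```r
# Simulated data: illustrative only
set.seed(2)
x <- runif(40, 0, 10)
y <- 1 + 0.8 * x + rnorm(40, sd = 2)
fit <- lm(y ~ x)

r2_cor <- cor(x, y)^2                                        # squared correlation
r2_var <- var(fitted(fit)) / var(y)                          # Var(yhat) / Var(y)
r2_sse <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)   # 1 - SSE / SST

# All four values agree
c(r2_cor, r2_var, r2_sse, summary(fit)$r.squared)
```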

• Connection to i.i.d. mean model

– Previously: $Y_i \stackrel{\text{i.i.d.}}{\sim} (\mu,\sigma^2)$.

– Now: keep independence but allow means to differ linearly with $X$: $E[Y_i|X_i=x_i]=\mu_{Y_i}=\beta_0+\beta_1 x_i$.

– Setting $\beta_1=0$ collapses to i.i.d. mean model.

• Sample vs population notation

– Population (unknown, fixed): $\mu_y = \beta_0+\beta_1 x$ ("true line").

– Sample (observable, varies): $\hat y = b_0 + b_1 x$.

– “Error” = deviation from true mean; “residual” = deviation from fitted line.

• Motivating data sets

– Water quality vs catchment area ($n=20$)

– Drinks vs BAC ($n=22$)

• Why inference?

– Each sample produces a different $b_0,b_1$; goal is to infer $\beta_0,\beta_1$.
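The sample-to-sample variation in the slope can be made concrete with a small simulation. This is a sketch under stated assumptions: the "true" line, error SD and design points below are hypothetical values chosen for illustration, not estimates from the lecture data.

```r
set.seed(3)
beta0 <- 50; beta1 <- 0.5; sigma <- 10   # hypothetical population values
x <- seq(10, 100, length.out = 20)       # fixed design, n = 20

# Draw one sample from the model and return the estimated slope b1
one_slope <- function() {
  y <- beta0 + beta1 * x + rnorm(length(x), sd = sigma)
  unname(coef(lm(y ~ x))[2])
}

slopes <- replicate(2000, one_slope())
mean(slopes)                        # close to beta1
sd(slopes)                          # sample-to-sample spread of b1
quantile(slopes, c(0.025, 0.975))   # range covering most estimates
```

Each call to one_slope() plays the role of collecting a new sample: the population line stays fixed while $b_1$ changes.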

• Linear regression model assumptions (initial statement)

  1. Linearity of means: points $\big(x_i,\,E[Y_i]\big)$ lie on a straight line.

  2. Errors at each $x$ are normally distributed with mean $0$.

  3. Independent observations.

  4. Homoscedasticity – common error variance $\sigma^2$.


Lecture 2 – Inference for the Slope & Prediction

• Primary parameter of interest: slope $\beta_1$

– Measures expected change in $Y$ for one-unit change in $X$.

– Signs: $\beta_1=0$ (no linear relation); $\beta_1>0$ increasing; $\beta_1<0$ decreasing.

• Sampling distribution of estimator

– Proposition: $\displaystyle T=\frac{\hat\beta_1-\beta_1}{SE(\hat\beta_1)} \sim t(n-2)$ under model.

– R supplies $\hat\beta_1$, $se(\hat\beta_1)$, $t$ value, two-sided $p$ value.

• Hypothesis test for linear relationship

– $H_0: \beta_1=0$ vs $H_a$ (one- or two-sided).

– Compute observed $t_{obs} = \dfrac{b_1}{se(\hat\beta_1)}$.

– P-value rules

• Right-tailed $H_a: \beta_1>0$ ⇒ $P(T\ge t_{obs})$.

• Left-tailed $H_a: \beta_1<0$ ⇒ $P(T\le t_{obs})$.

• Two-sided $H_a: \beta_1\neq0$ ⇒ $2P(T\le -|t_{obs}|)$.
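The three P-value rules map directly onto pt(). The helper below is a hypothetical convenience function written for these notes (slope_p_value is not part of base R), and the value $t_{obs}=2.1$ with $n=20$ is made up for illustration.

```r
# P-value for a slope test from an observed t statistic.
# `alternative` is "greater" (beta1 > 0), "less" (beta1 < 0) or "two.sided".
slope_p_value <- function(t_obs, n, alternative = "two.sided") {
  df <- n - 2
  switch(alternative,
    greater   = pt(t_obs, df, lower.tail = FALSE),   # P(T >= t_obs)
    less      = pt(t_obs, df),                       # P(T <= t_obs)
    two.sided = 2 * pt(-abs(t_obs), df),             # 2 P(T <= -|t_obs|)
    stop("unknown alternative")
  )
}

# Illustrative values only
slope_p_value(2.1, n = 20, alternative = "greater")
slope_p_value(2.1, n = 20, alternative = "less")
slope_p_value(2.1, n = 20)
```

For a positive $t_{obs}$ the two-sided P-value is twice the right-tailed one, which is why R's two-sided output can be halved when the alternative is $\beta_1>0$ and the estimate has the expected sign.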

• Reading R output (example – Water quality)

– Intercept 49.79 (SE 8.52; * * *).

– Slope 0.460 (SE 0.297; $t=1.552$; $p=0.138$ ⇒ no significant evidence of a linear relationship at the 5 % level).

– $R^2=0.118$ (≈ 12 % of variability explained).

• Worked supermarket example (display space → coffee sales)

– $b_1=28.0$, $se=6.1$, $n=9$.

– $t_{obs}=4.593$, $df=7$ ⇒ one-sided $p=0.00125$ (strong evidence sales↑ with space).

– 95 % CI: $[13.6,42.4]$ extra dollars per extra ft² (derived later in Lecture 3).

• Large-scale regression application – fMRI study

– 0/1 stimulus regressor – run separate regressions for 26 033 voxels.

– Multiplicity problem: many simultaneous tests.

– Bonferroni correction: control the family-wise error rate at $\alpha$ by testing each of the $m$ hypotheses at level $\alpha/m$.

– Visualisation: 3-D brain maps of voxels with adjusted p<0.05.

• Multiple testing concepts

– Under $H_0$, P-values ~ Uniform(0,1).

– With many tests and every $H_0$ true, expect ≈10 % of $p$-values below 0.1 just by chance.

– Bonferroni is simple but conservative; other procedures exist (Holm, Benjamini-Hochberg, knockoffs, etc.).
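The multiple-testing points above can be sketched in R. The simulation is an assumption-laden toy, not the fMRI analysis: it fits $m = 1000$ regressions (a hypothetical count) on a 0/1 regressor where every null hypothesis is true, then applies corrections through p.adjust().

```r
set.seed(4)
m <- 1000                          # number of tests (hypothetical)
n <- 20                            # observations per regression
x <- rep(c(0, 1), each = n / 2)    # 0/1 regressor, as in a stimulus design

# Each replicate: response unrelated to x, so H0: beta1 = 0 is true
p_values <- replicate(m, {
  y <- rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

mean(p_values < 0.1)               # roughly 0.10 by chance alone
sum(p_values < 0.05)               # false positives with no correction

p_bonf <- p.adjust(p_values, method = "bonferroni")   # min(1, m * p)
p_holm <- p.adjust(p_values, method = "holm")
p_bh   <- p.adjust(p_values, method = "BH")           # Benjamini-Hochberg
c(bonferroni = sum(p_bonf < 0.05),
  holm       = sum(p_holm < 0.05),
  BH         = sum(p_bh   < 0.05))
```

Holm's adjusted P-values are never larger than Bonferroni's, which is one reason it is often preferred when family-wise control is wanted.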


Lecture 3 – Checking Model Assumptions

• Expanded assumption list (Assumptions 9.6)

  1. Linearity of $E[Y|X]$.

  2. Normality of errors.

  3. Independence of responses.

  4. Constant variance.


• Importance

– (1) & (4) are crucial for unbiased estimates & valid SE’s.

– (2) less critical for large $n$ thanks to CLT, but watch out for severe skew/outliers.

– (3) depends on study design – random sampling / random allocation.

• Diagnostic plots

– Residual vs fitted

• Detect non-linearity (curvature), heteroscedasticity (fan-shape), outliers.

– Normal Q-Q plot of residuals

• Assess normality; rule-of-thumb sample-size guidance:

◦ $n<15$ need near-normal; $15\le n\le40$ ok if not strongly skew; $n>40$ robust except gross outliers.
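Both diagnostic plots can be produced by hand. The sketch below uses simulated data only (the coefficients and error SDs are assumptions): one response satisfies the model, the other has error spread growing with $x$ so that the fan shape is visible.

```r
set.seed(5)
x <- runif(50, 0, 10)
y_ok  <- 3 + 2 * x + rnorm(50, sd = 2)         # assumptions hold
y_fan <- 3 + 2 * x + rnorm(50, sd = 0.5 * x)   # spread grows with x (fan shape)

fit_ok  <- lm(y_ok ~ x)
fit_fan <- lm(y_fan ~ x)

par(mfrow = c(2, 2))
# Residual vs fitted, then normal Q-Q plot, for the well-behaved fit
plot(fitted(fit_ok), residuals(fit_ok),
     xlab = "Fitted values", ylab = "Residuals", main = "Assumptions hold")
abline(h = 0, lty = 2)
qqnorm(residuals(fit_ok)); qqline(residuals(fit_ok))

# Same two plots for the heteroscedastic fit
plot(fitted(fit_fan), residuals(fit_fan),
     xlab = "Fitted values", ylab = "Residuals", main = "Fan shape")
abline(h = 0, lty = 2)
qqnorm(residuals(fit_fan)); qqline(residuals(fit_fan))
par(mfrow = c(1, 1))
```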

• Examples

– Water quality residual plot: roughly random, variance constant → assumptions ok.

– Simulated U-shape & fan-shape patterns shown; warn when assumptions fail.

– Supermarket example: small $n=9$, residual plot looks acceptable, Q-Q fairly linear → proceed, but results sensitive to independence assumption (sampling scheme not given).

• Confidence interval for slope

– General form: $CI_C(\beta_1)=\hat\beta_1 \pm t^* SE(\hat\beta_1)$ with $df=n-2$.
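Applied to the supermarket summary statistics quoted in Lecture 2 ($b_1=28.0$, $se=6.1$, $n=9$), the general form can be evaluated in a few lines of R. This is a sketch from the rounded summary values, not from the raw data.

```r
# Summary statistics quoted in the supermarket example
b1 <- 28.0
se <- 6.1
n  <- 9

t_star <- qt(0.975, df = n - 2)       # critical value for 95 % confidence
ci <- b1 + c(-1, 1) * t_star * se     # estimate +/- margin of error
round(ci, 1)                          # 13.6 42.4
```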

• Standard error expression

– $SE(\hat\beta_1)=\dfrac{S_\epsilon}{S_x}\,\sqrt{\dfrac1{n-1}}$, so precision improves when

• $n$ increases,

• Spread in $X$ ($S_x$) increases,

• Residual SD $S_\epsilon$ decreases.
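The standard error expression can be compared with the value R reports. In this sketch the data are simulated, and $S_\epsilon$ is taken to be the residual standard deviation with $n-2$ degrees of freedom (an assumption about the notation, matching summary(fit)$sigma).

```r
set.seed(6)
n <- 25
x <- runif(n, 0, 10)
y <- 1 + 0.7 * x + rnorm(n, sd = 1.5)
fit <- lm(y ~ x)

s_eps <- sqrt(sum(residuals(fit)^2) / (n - 2))   # residual SD on n - 2 df
se_formula <- s_eps / (sd(x) * sqrt(n - 1))      # S_eps / (S_x * sqrt(n - 1))
se_lm <- summary(fit)$coefficients["x", "Std. Error"]

c(formula = se_formula, lm = se_lm)              # the two agree
```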


Lecture 4 – A Bit More on Regression & Exam-style Synthesis

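• Why the CLT makes $\hat\beta_1$ approximately Normal for large $n$ – $\hat\beta_1$ is a weighted sum of the responses $Y_i$ (equivalently, $\beta_1$ plus a weighted sum of the errors, not of the residuals); the CLT extends to such linear combinations.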
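The weighted-sum representation behind this argument can be written out explicitly. This is the standard least-squares algebra, treating the $x_i$ as fixed; the weights $w_i$ are notation introduced here for illustration.

```latex
\hat\beta_1
  = \frac{\sum_{i=1}^n (x_i-\bar x)(Y_i-\bar Y)}{\sum_{j=1}^n (x_j-\bar x)^2}
  = \sum_{i=1}^n w_i Y_i,
\qquad
w_i = \frac{x_i-\bar x}{\sum_{j=1}^n (x_j-\bar x)^2}.

% Because \sum_i w_i = 0 and \sum_i w_i x_i = 1, substituting
% Y_i = \beta_0 + \beta_1 x_i + \epsilon_i gives
\hat\beta_1 = \beta_1 + \sum_{i=1}^n w_i \epsilon_i,
\qquad
\mathrm{Var}(\hat\beta_1) = \sigma^2 \sum_{i=1}^n w_i^2
  = \frac{\sigma^2}{(n-1)\,S_x^2}.
```

The variance agrees with the standard error expression $\dfrac{S_\epsilon}{S_x}\sqrt{\dfrac1{n-1}}$ once $\sigma$ is replaced by its estimate $S_\epsilon$.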

• Full worked BAC example ($n=22$)

– R output: $b_0=-0.0044$, $b_1=0.0109$, $se=0.0030$, $t=3.644$, $p=0.0016$ two-sided.

– One-sided test $H_a: \beta_1>0$ ⇒ $p=0.0008$; very strong evidence BAC rises with drinks.

– $R^2=0.399$ ⇒ ≈40 % variation explained.

– 95 % CI for slope: $[0.0046,0.0171]$ BAC units per drink.

– Prediction at 4 drinks: $\hat y= -0.0044+0.0109\times4 = 0.039$ (below legal 0.05 limit).

– Diagnostics: slight fan-shape heteroscedasticity; independence unclear (convenience sample from one university) ⇒ results tentative.
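The one-sided P-value and the prediction in the BAC example can be reproduced from the quoted output. This sketch works from the rounded values in the notes, so results may differ from the full-precision R output in the last digit.

```r
# Values quoted from the R output in the BAC example
b0 <- -0.0044
b1 <-  0.0109
t_obs <- 3.644
n <- 22

# Two-sided P-value, and one-sided P-value for Ha: beta1 > 0
p_two <- 2 * pt(-abs(t_obs), df = n - 2)
p_one <- pt(t_obs, df = n - 2, lower.tail = FALSE)   # half of p_two, since t_obs > 0
round(c(two_sided = p_two, one_sided = p_one), 4)

# Predicted mean BAC after 4 drinks
y_hat <- b0 + b1 * 4
y_hat            # 0.0392
y_hat < 0.05     # TRUE: below the 0.05 limit quoted in the notes
```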

• Study design advice

– To narrow CI for $\beta_1$, recruit larger $n$ and/or plan wider range of $X$ values.

• Comprehensive inference menu (one vs two variables table) – reminds which test/CI to use given variable types; regression occupies “both quantitative” cell.

• R workflow summary

– Fit model: fit <- lm(y ~ x)

– Output: summary(fit) provides estimates, SE’s, t, P, $R^2$.

– Diagnostics: plot(fit) or custom residual & Q-Q plots.

– Critical values: qt(prob, df) for t, qnorm(prob) for z.
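Put together, the workflow might look like the sketch below. The data frame dat and its columns are simulated stand-ins (assumptions for illustration); with real data only the lm() call and what follows would be needed.

```r
# Simulated stand-in for a real data frame (hypothetical columns x and y)
set.seed(7)
dat <- data.frame(x = runif(30, 0, 10))
dat$y <- 5 + 1.2 * dat$x + rnorm(30, sd = 2)

fit <- lm(y ~ x, data = dat)
summary(fit)                       # estimates, SEs, t, P, R^2
confint(fit, "x", level = 0.95)    # 95 % CI for the slope

# Same interval by hand: estimate +/- t* times SE
est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = nrow(dat) - 2) * se

# Estimated mean response at x = 4
predict(fit, newdata = data.frame(x = 4))
```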

• Sanity-check list before finalising answers

– $0\le p\le1$.

– $\lvert t \rvert > 1$ for non-trivial results.

– Sign of $t_{obs}$ matches alternative.

– CI limits in logical order; margin of error positive.

– Pooled $s_p$ must lie between group SD’s, etc.


Model Assumptions – Concise Checklist

• Linearity – Residual vs fitted shows no systematic curve.

• Independence – Random sample / independent units ensured by design.

• Normality – Residual Q-Q roughly straight or sample size sufficiently large.

• Equal variance – Residual spread appears constant across fitted values.

Failures require: transformations, alternative models (e.g. weighted least squares), or non-parametric / permutation methods beyond scope of MATH1041.


Key Formulas (LaTeX ready)

• Fitted line: $\hat y = b_0 + b_1 x$.

• Population line: $\mu_y = \beta_0 + \beta_1 x$.

• Slope estimator: $b_1 = r \dfrac{s_y}{s_x}$.

• Intercept estimator: $b_0 = \bar y - b_1 \bar x$.

• Test statistic: $T = \dfrac{\hat\beta_1-\beta_1}{SE(\hat\beta_1)} \sim t(n-2)$.

• Confidence interval: $\hat\beta_1 \pm t^* SE(\hat\beta_1)$.

• Standard error: $SE(\hat\beta_1)=\dfrac{S_\epsilon}{S_x}\sqrt{\dfrac1{n-1}}$.

• Coefficient of determination: $r^2 = 1-\dfrac{\sum (y_i-\hat y_i)^2}{\sum (y_i-\bar y)^2}$.

• Bonferroni threshold: $\alpha/m$; adjusted P-value: $p_{adj}=\min(1,\, m p)$.


Ethical & Practical Considerations

• Misinterpretation of $p$-values (see Wasserstein et al. 2019): always accompany with effect size & CI.

• Multiple testing inflates false discoveries – correction mandatory in large-scale studies (e.g. neuro-imaging, genomics).

• Linearity assumption is conceptual: ensure scientific plausibility before applying model.

• Sample selection bias (e.g. all participants from one university) limits generalisability.


Quick-Reference R Commands

• Fit model: lm(y ~ x)

• Summary: summary(fit)

• Coefficients: coef(fit)

• Residuals: residuals(fit)

• Fitted values: fitted.values(fit)

• Diagnostics: plot(fit) or manual residual & Q-Q plots

• t critical value: qt(0.975, df=n-2)

• Bonferroni adjust: p.adjust(pvec, method="bonferroni")


What You Should Be Able To Do After Chapter 9

• Fit and interpret a simple linear regression in R.

• Test $H_0: \beta_1=0$ and construct CI’s for $\beta_1$.

• Predict mean response at specified $x$ using fitted line.

• Generate & interpret residual and Q-Q plots.

• Assess when model assumptions are (not) met.

• Apply Bonferroni (or describe need for) in multiple-comparison settings.

• Choose correct inference tool given data type & research question (see
