Outcome Y’s distribution
- Y∼f(y;θ): Y is a random variable that follows a probability distribution with pdf f(⋅) and parameter vector θ (vector or scalar).
- We often parametrize θ=(μ,ψ), where:
- μ=E(Y) is the expected value (mean),
- ψ are zero or more additional parameters.
- The focus is typically on modeling the mean of the distribution (μ).
How to think of regression modelling as specifications for conditional distributions (Y given X)?
Y∼f(y∣x; θ)
Regression modeling provides a framework for defining conditional distributions of the response variable Y based on predictor variables X.
It allows for estimating how the expected value of Y changes with respect to variations in X and the parameters θ of the conditional distribution.
Models for the mean parameter μ
Focus of this course: the mean parameter μ of Y depends on X through a linear combination of the X's.
g(·) is the link function; for linear regression it is the identity, g(μ) = μ.
The β's are a vector of parameters (the regression coefficients).
How do we model the mean μ of Y based on x?
The mean μ depends on x through a linear combination:
g(μ)=x′β=x1β1+x2β2+⋯+xpβp
What is g(μ)?
- g(μ) is a link function, a specified function applied to the mean μ of Y.
- Example: In linear regression, g(μ)=μ (identity link).
What is β0?
- When the first element of x is 1, β0 is the constant term or intercept in the model.
- x=(1,x1,…,xp−1)′,β=(β0,β1,…,βp−1)′
What is the distribution of Y in a standard linear regression model?
- Y∼N(x′β,σ2)
- The mean is a linear combination of explanatory variables.
- The variance σ2 is constant (homoskedastic).
How does the expected value μ=E(Y) depend on x?
The mean is a linear combination, where the regression coeffs describe how the mean depends on each of the explanatory variables:
μ=x′β=x1β1+x2β2+⋯+xpβp
What is assumed about the variance σ2 in linear regression?
- The conditional variance σ2 does not depend on x.
- Variance is constant across all values of x (homoskedasticity).
What model do you use for continuous (normal) Y?
Linear regression
What model do you use for binary (binomial) Y?
Binary logistic regression
What model do you use for multinomial Y? (Provide distinction between nominal and ordinal model)
Nominal - multinomial logit model
Ordinal - cumulative (ordinal) logit model
What model do you use for Y where it is count data?
Poisson, negative binomial models, log linear models
What model do you use for duration Y?
Survival analysis models
What do we need to understand for each model?
Linear reg. overview
Linear regression is a GLM for a normally distributed Y with the identity link (see the sketch after this list)
Estimation done with maximum likelihood
Inference using asymptotic results for ML estimation
Unique properties for linear regression
MLE is also ordinary least squares estimation
Normality → exact distributional results for test statistics and inference (as opposed to asymptotically)
Model can also be motivated without distributional assumptions (using only mean and var. of Y)
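As a quick illustration of the GLM view, this minimal sketch (simulated data, illustrative names only) shows that lm() and glm() with a Gaussian family and identity link give the same fit:

```r
## Minimal sketch with simulated data: linear regression is the Gaussian GLM
## with identity link, so lm() and glm(..., family = gaussian) agree.
set.seed(3)
x <- rnorm(80)
y <- 2 + 1.2 * x + rnorm(80)

coef(lm(y ~ x))
coef(glm(y ~ x, family = gaussian(link = "identity")))  # same coefficient estimates (MLE = OLS)
```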
What are the random variables and explanatory variables in the linear regression model?
Y1,…,Yn are independent random variables (responses).
Each Yi has associated observed explanatory variables xi=(xi1,…,xip)′.
p≤n, and the xi are treated as fixed.
Linear Regression Model: Distribution of Yi
- What is the distribution of each Yi?
Yi∼N(μi,σ2)
where μi=E(Yi)=xi′β=xi1β1+⋯+xipβp.
σ2 is constant across all i (homoskedastic).
Expected Value E(Yi)
- How is E(Yi) expressed in terms of xi and β?
- E(Yi)=μi=xi′β=xi1β1+⋯+xipβp
- A linear combination of explanatory variables.
Unknown Parameters in Linear Regression
- What are the unknown parameters in the linear regression model?
- The regression coefficients β=(β1,…,βp)′.
- The residual variance σ2.
Linear Regression Model: Matrix Form Setup
- How do we define Y, μ, and X in matrix form?
* Y=(Y1,…,Yn)′
* μ=(μ1,…,μn)′
* X=[x1…xn]′ (an n×p matrix of explanatory variables)
What is the linear model in matrix notation? What is the expression for μ, and what is the distribution of Y?
Y=Xβ+ϵ
μ=Xβ
* Y∼N(Xβ,σ2In)
* In is the n×n identity matrix.
Definition as Mean + Residuals
- How can we rewrite the linear regression model using residuals?
* Y=Xβ+ϵ
* where ϵ=(ϵ1,…,ϵn)′∼N(0,σ2In)
- What are the properties of the residuals ϵi?
* Each ϵi∼N(0,σ2)
* Residuals are **independent** of each other.
* All randomness in Y is attributed to ϵ.
* The residuals are the errors: they capture the deviation of the observed Yi from its mean μi = xi′β.
What assumption regarding residuals for linear regression?
ϵ∼N(0,σ2I)
What does ϵ∼N(0,σ2I) imply? (4 items)
1. E(ϵi)=0 for all i
2. var(ϵi)=σ2 for all i
3. ϵi and ϵj are independent for all i ≠ j
4. Each ϵi is normally distributed
Estimation of β: Method
- How is β estimated in linear regression?
By Ordinary Least Squares (OLS)
No distributional assumptions are needed (unlike Maximum Likelihood).
Ordinary Least Squares (OLS): Objective
- What is the goal of the OLS method?
Find β that minimizes the **sum of squared differences** between observed yi and expected E(Yi∣xi;β)=xi′β.
Sum of Squared Errors: Formula
- What is the formula for the sum of squared errors in OLS? (equation and matrix notation)
S(β) = ∑_{i=1}^{n} (yi − xi′β)²
Equivalently,
S(β)=(y−Xβ)′(y−Xβ)
Definition of β^
- What does β^ represent?
* β^=(β^1,…,β^p)′
* It is the value of β that minimizes S(β) (the sum of squared errors).
OLS Estimator β^: Derivation
- How is β^ obtained?
* By solving:
∂/∂β [(y − Xβ)′(y − Xβ)] = 0
Involves taking a **partial derivative** with respect to β and setting it to zero.
Full derivation of β^
What is the final formula for β^?
* β^=(X′X)−1X′y
* Requires that (X′X)−1 exists (i.e., X′X must be invertible).
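A minimal sketch with simulated data (illustrative names and values, not the course data) showing the closed-form OLS solution next to lm():

```r
## beta-hat = (X'X)^{-1} X'y, computed directly from the design matrix.
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # first column of 1s gives the intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # solve() avoids forming the inverse explicitly

drop(beta_hat)
coef(lm(y ~ x1 + x2))                      # same estimates
```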
Connection Between OLS and MLE
- When is the OLS estimator also the Maximum Likelihood Estimator (MLE)?
When the Yi are normally distributed
Under normality, OLS and MLE give the same estimate for β
What is the formula for β^0 in simple linear regression?
β^0=yˉ−xˉβ^1
Estimator for Slope β^1
- What is the formula for β^1 in simple linear regression?
β^1 = ∑i (yi − yˉ)(xi − xˉ) / ∑i (xi − xˉ)²
This is the sample covariance of x and y divided by the sample variance of x.
Alternative Expression for β^1
- How else can β^1 be written as a function of the sample standard deviation and correlations?
* β^1 = (sy / sx) · rxy
* where:
* sx = sample standard deviation of xi
* sy = sample standard deviation of yi
* rxy = sample correlation between xi and yi
Unbiased Estimator of Variance σ2
- What is the unbiased estimator for σ2 in linear regression?
* σ^2 = (1/(n−p)) ∑_{i=1}^{n} (yi − xi′β^)²
* Alternatively: σ^2 = e′e / (n−p)
* Divides by n−p for unbiasedness.
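A minimal sketch (simulated data, illustrative only) of the unbiased variance estimate, dividing the residual sum of squares by n − p:

```r
## sigma2-hat = e'e / (n - p), where p counts all estimated coefficients.
set.seed(2)
n <- 120
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)
e   <- residuals(fit)
p   <- length(coef(fit))

sum(e^2) / (n - p)
summary(fit)$sigma^2   # lm reports the square root of this as "Residual standard error"
```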
Definition of Residuals
- How are the residuals e defined?
e=y−Xβ^=y−y^
Each residual: ei=yi−xi′β^=yi−y^i
* e = vector of residuals (errors) for all observations
* y = vector of observed outcomes
* X = matrix of explanatory variables
* β̂ = vector of estimated regression coefficients
* ŷ = vector of fitted (predicted) values
* eᵢ = residual for the i-th observation
* yᵢ = observed outcome for the i-th observation
* xᵢ'β̂ = ŷᵢ = fitted value for the i-th observation
MLE Estimator of Variance σ2
- What is the maximum likelihood estimator (MLE) of σ2?
* σ^2_ML = e′e / n
* Divides by n (not n−p).
Formula for Residual Vector e
- What is the formula for the residual vector e?
* e=y−Xβ^
* Also written as: e = y − y^, where y^ is the vector of fitted values
Formula for Individual Residual ei
- What is the formula for the residual ei?
* ei=yi−xi′β^
* Also written as: ei=yi−y^i
What is the expectation of β^?
E(β^)=β
Thus, β^ is an unbiased estimator of β.
Derivation of E(β^):
E(β^) = E[(X′X)−1X′Y] = (X′X)−1X′E(Y) = (X′X)−1X′Xβ = β
Variance of β^
- What is the variance of β^? (the true variance of β^, not its estimate)
Var(β^)=σ2(X′X)−1
Unbiased Estimator of Var(β^)
- What is the unbiased estimator for Var(β^)? This is the estimated variance based on our data. We use sigma hat instead of sigma.
* Var(β^)=σ^2(X′X)−1
* where σ^2 is the unbiased estimator of σ2.
* σ^2 = (1/(n−p)) ∑_{i=1}^{n} (yi − y^i)²
* σ^2 = e′e / (n−p)
* where ei=yi−y^i and e=(e1,…,en)′
Derivation of Var(β^):
Var(β^) = Var[(X′X)−1X′Y] = (X′X)−1X′ Var(Y) X(X′X)−1 = σ2(X′X)−1X′X(X′X)−1 = σ2(X′X)−1
Condition on X for OLS
- What condition must X satisfy for β^ to be uniquely defined?
* X must be full rank: rank(X)=p
* The inverse (X′X)−1 must exist.
* Condition fails if:
* Columns of X are linearly dependent (perfect multicollinearity)
* or if p>n (more variables than observations)
Distribution of β^
β^∼N(β,σ2(X′X)−1)
β^ is normally distributed because it is a linear combination of normally distributed Y.
In R, how do you run a linear regression using the faculty data, with response variable salary and explanatory variables market and yearsdg?
lm(salary ~ market + yearsdg, data = faculty)
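A minimal sketch of fitting and inspecting this model, assuming the faculty data frame used in the course is already loaded:

```r
## Assumes the course's faculty data frame is available in the session.
fit <- lm(salary ~ market + yearsdg, data = faculty)

summary(fit)   # coefficient estimates, standard errors, t-values and p-values,
               # residual standard error with its df, R-squared, overall F-test
coef(fit)      # just the estimated regression coefficients
```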
Where can you find the distribution of residuals?
In the "Residuals:" block at the top of the summary() output, which reports the minimum, quartiles, median, and maximum of the residuals.
Label Beta_0, Beta_1, and Beta_2
In the "Coefficients:" table: β0 is the (Intercept) estimate, β1 is the coefficient on market, and β2 is the coefficient on yearsdg.
What is the “residual standard error” part telling you? How did it get 511 df?
It is the square root of the unbiased estimate σ^2 of σ2.
Residual standard error = σ^ = √(σ^2), where σ^2 = e′e / (n−p).
We get 511 df because n − p = 514 observations − 3 parameters (β0, β1, β2) = 511.
Explain how we get the standard errors for the betas?
They are the square roots of the diagonal elements of the estimated variance matrix Var^(β^) = σ^2(X′X)−1 (see the sketch below).
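A minimal sketch of recovering those standard errors from the estimated variance matrix, assuming the same faculty model fit as above:

```r
## Assumes fit <- lm(salary ~ market + yearsdg, data = faculty) from above.
V  <- vcov(fit)               # estimated Var(beta-hat) = sigma2-hat * (X'X)^{-1}
se <- sqrt(diag(V))           # standard errors = square roots of the diagonal

se
summary(fit)$coefficients[, "Std. Error"]   # same values as reported by summary()
```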
General interpretation of regression coefficient on xj in a linear regression model
If xj increases by a units, while controlling for other explanatory variables, the expected value of Y changes by a⋅βj units.
Interpret the coefficient on market, which is the marketability rating of one's discipline. salary is in $1,000 units.
Holding the other explanatory variables constant, a one-unit increase in the marketability of one's discipline increases the expected salary by $396 (0.396 × 1,000).
What does TSS represent in regression?
* TSS = Total Sum of Squares
* TSS=∑i=1n(yi−yˉ)2
* Measures total variability in y around its mean
* It’s the baseline variability before fitting any model
What is the decomposition of TSS in linear regression?
* TSS=XSS+RSS
* XSS = Explained (Regression) Sum of Squares
* RSS = Residual Sum of Squares
* This breaks total variation into model + leftover:
* ∑(yi−yˉ)2=∑(y^i−yˉ)2+∑(yi−y^i)2
What does XSS represent?
* XSS = ∑(y^i−yˉ)2
* Measures improvement from using predictors
* Variation explained by the model (fitted values)
Amount of the variation explained when the fitted values are allowed to depend on xi and thus vary by i.
What does RSS represent?
* RSS = ∑(yi−y^i)2
* Measures how far the actual yi are from the model’s predictions
* Unexplained variation = residuals
What does TSS represent?
Describes the total variation in the values of yi in the sample.
What is the formula for R2 (coefficient of determination)?
* R2 = XSS / TSS = 1 − RSS / TSS
* Measures goodness of fit
* Tells how much of the total variation in y is explained by the model
R2 is the proportion of total variation in y that is explained by variation in the explanatory variables.
* R2∈[0,1]
* Closer to 1 → better model fit
How else can R2 be calculated?
* R2=(cor(y,y^))2
* It’s the square of the correlation between observed and predicted y values
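A minimal sketch with simulated data (not the course's faculty data) verifying both ways of computing R²:

```r
## R-squared from the TSS/RSS decomposition and as a squared correlation.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n)

fit  <- lm(y ~ x)
yhat <- fitted(fit)

TSS <- sum((y - mean(y))^2)
RSS <- sum((y - yhat)^2)

1 - RSS / TSS            # R-squared from the decomposition
cor(y, yhat)^2           # same value: squared correlation of observed and fitted values
summary(fit)$r.squared   # matches lm's reported R-squared
```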
Interpret the R²
Interpretation example: R2=0.6795
About 68% of the variation in salaries in this sample is accounted for by variation in years since PhD and marketability.
What is the formula for the adjusted R2 statistic?
Radj2 = ((n−1)R2 − (p−1)) / (n−p)
How is adjusted R2 different from regular R2?
Adjusted R2 does not necessarily increase when we add new explanatory variables.
It accounts for model complexity using a penalty factor.
It is more similar to penalised model assessment criteria such as AIC.
What general form do most hypothesis tests in regression take?
Rβ=r
R is a known matrix and r is a known vector — together they define constraints on β
What is a null hypothesis for testing a single coefficient?
H0:βj=0
Implies the coefficient of xj is 0, so xj can be omitted without loss
What is a hypothesis involving a linear combination of coefficients?
H0:β1=β2
Matrix form: R = [0 1 −1] (columns corresponding to β0, β1, β2) and r = 0.
More generally, that some coeffs are equal to each other.
What is an example of multiple simultaneous coefficient tests?
H0:β1=0 and β2=0
R = [0 1 0; 0 0 1] (a 2×3 matrix) and r = (0, 0)′
What is the sampling distribution of β^ in normal linear regression?
β^∼N(β,σ2(X′X)−1)
The estimated coefficients follow a normal distribution with mean equal to the true coefficients and a variance that depends on the error variance and the design matrix.
How do you compute se(β^j)?
se(β^j) = √( σ^2 [(X′X)−1]jj ), the square root of the j-th diagonal element of the estimated variance matrix σ^2(X′X)−1
Formula for t-statistic (general)?
t = (β^j − βj) / se(β^j)
What is the formula for the t-statistic used to test H0:βj=r or 0?
t = (β^j − r) / se(β^j) ∼ t_{n−p} if H0 is true
What distribution does the test statistic follow under the null?
It follows a tn−p distribution
What happens to the t-distribution if n is moderately large?
It can be approximated by a standard normal distribution N(0,1)
What is the formula for a (1−α)×100% confidence interval for βj?
β^j±tn−p(1−α/2)⋅se^(β^j)
What does tn−p(1−α/2) represent?
It is the 1−α/2 quantile of the tn−p distribution used to determine the critical value for constructing confidence intervals.
What is the value of tn−p(0.975) for a 95% confidence interval?
For large n, it approximates 1.96 from N(0,1) .
In particular, for a 95% CI, α = 0.05, so we use the 1 − α/2 = 0.975 quantile.
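A minimal sketch, again assuming the faculty model fit from the course example; confint() constructs exactly these t-based intervals:

```r
## Assumes fit <- lm(salary ~ market + yearsdg, data = faculty).
confint(fit, level = 0.95)    # beta-hat_j +/- t_{n-p}(0.975) * se(beta-hat_j)

## The same interval by hand for one coefficient (e.g. "market"):
b  <- coef(fit)["market"]
se <- sqrt(diag(vcov(fit)))["market"]
df <- df.residual(fit)        # n - p
b + c(-1, 1) * qt(0.975, df) * se
```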
What is the null hypothesis tested using Rβ=r when R is a 1×p row vector?
It tests a single linear constraint on β, like βj=βk
What is the distribution of Rβ^−r under H0?
Rβ^−r∼N(0,σ2R(X′X)−1R′)
What is the formula for the t-statistic for testing Rβ=r?
t = (Rβ^ − r) / √( σ^2 R(X′X)−1R′ ) ∼ t_{n−p}
What is the key idea behind using matrix notation Rβ=r for hypothesis testing?
It allows us to test any single linear constraint on β — regardless of how many coefficients are involved — using a t-test
What kind of hypothesis does the F-test handle that the t-test cannot?
The F-test is used to jointly test multiple constraints — i.e., q>1 constraints on β
What does the null hypothesis H0:Rβ=r look like in an F-test?
R is a q×p matrix and r is a q×1 vector, with q>1 indicating multiple linear constraints on the coefficients.
What is the distribution of Rβ^−r under H0?
Rβ^−r∼N(0,σ2R(X′X)−1R′)
What is the formula for the F-statistic used in this test?
F = (1/q) (Rβ^ − r)′ (σ^2 R(X′X)−1R′)−1 (Rβ^ − r) ∼ F_{q, n−p}
How are the t-test and F-test related when q=1?
F=t2 and the tests give the same p-value since F1,n−p=tn−p2
What is the big-picture idea of the F-test as a comparison of nested models?
The F-test can compare two nested models by testing whether a subset of regression coefficients are all zero — i.e., whether adding predictors significantly improves model fit
What is the null hypothesis in the nested model version of the F-test?
H0: βj = 0 for all j ∈ S, where S ⊂ {1, …, p}; i.e., a subset of the coefficients are all zero.
How are Model 0 and Model 1 defined in a nested model comparison?
Model 0 is the restricted model under H0 (e.g. with predictors x1, x2), the model under the null.
Model 1 is the full model under H1 (e.g. with predictors x1, x2, x3, x4), the model under the alternative.
Model 0 is obtained from Model 1 by setting β3 = β4 = 0, so Model 0 is nested in Model 1.
What is the general null hypothesis when comparing nested models by F-test?
Suppose Model 0 has p0 parameters and Model 1 has p1 parameters, where p1 > p0.
Let β∗ denote the coefficients that appear in Model 1 but not in Model 0.
H0: β∗ = 0 (Model 0)
H1: at least one of the coefficients in β∗ is non-zero (Model 1)
What is the F-statistic formula using residual sums of squares (RSS)?
F = [ (RSS0 − RSS1) / (p1 − p0) ] / [ RSS1 / (n − p1) ] ∼ F_{p1−p0, n−p1}
F = [ (R²_1 − R²_0) / (p1 − p0) ] / [ (1 − R²_1) / (n − p1) ]
F = [ (n − p1) / (p1 − p0) ] · [ (RSS0 − RSS1) / RSS1 ]
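A minimal sketch of the nested-model F-test with simulated data (illustrative variables x1 to x4, not the course data); anova() on two nested lm fits carries out exactly this comparison:

```r
## F-test comparing a restricted model (Model 0) with the full model (Model 1).
set.seed(7)
n  <- 150
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)   # x3 and x4 truly have zero coefficients

model0 <- lm(y ~ x1 + x2)                  # restricted model under H0: beta3 = beta4 = 0
model1 <- lm(y ~ x1 + x2 + x3 + x4)        # full model under H1

anova(model0, model1)                      # F-statistic and p-value for the nested comparison

## The same F-statistic by hand from the residual sums of squares:
RSS0 <- sum(residuals(model0)^2)
RSS1 <- sum(residuals(model1)^2)
p0 <- length(coef(model0)); p1 <- length(coef(model1))
((RSS0 - RSS1) / (p1 - p0)) / (RSS1 / (n - p1))
```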
What are the two formulations of the F-test, and how do they differ in logic?
1. The Wald (t-test) form tests whether β^∗ are close to 0 relative to their variances
2. The nested model form tests whether Model 0 explains the data nearly as well as Model 1 (which includes β∗).
Wald form ≈ Wald test
Nested model form ≈ Likelihood ratio test
IN LINEAR REGRESSION, THE WALD AND LR TEST VERSIONS OF THE F-TEST ARE EQUIVALENT. The Wald test evaluates the significance of individual coefficients, while the nested model form assesses overall model fit by comparing explanatory power between the two models.
What happens when the F-test has only one constraint (q=1)?
It becomes a test of H0:βj=0 and the F-statistic reduces to F=t2 — the test is equivalent to the t-test
What is the null hypothesis in the F-test when β∗ includes all coefficients except the intercept?
That all explanatory variable coefficients are zero — i.e., H0:β1=⋯=βp−1=0
What is the F-statistic formula when testing if all explanatory variable coefficients are zero?
F = [ (n − p) / (p − 1) ] · [ (TSS − RSS) / RSS ] = [ R² / (p − 1) ] / [ (1 − R²) / (n − p) ] ∼ F_{p−1, n−p}
where n is the sample size, and p is the number of parameters. In this case, RSS0 = TSS and R²_0 = 0.
This test is usually reported in standard regression output, but it is rarely an interesting hypothesis in itself.
What are residuals, and why are they useful in regression diagnostics?
Residuals are ei=yi−xi′β^.
They are used to check model assumptions such as normality, constant variance, and correct model specification, and to assess the influence of individual observations.
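A minimal sketch of the standard residual checks in R, assuming a fitted lm object called fit (for example the faculty model from earlier):

```r
## Assumes an existing lm fit, e.g. fit <- lm(salary ~ market + yearsdg, data = faculty).
e <- residuals(fit)

par(mfrow = c(2, 2))
plot(fit)                      # residuals vs fitted, normal Q-Q, scale-location, leverage

par(mfrow = c(1, 2))
hist(e, breaks = 20)           # rough check of normality
plot(fitted(fit), e,
     xlab = "Fitted values", ylab = "Residuals")  # look for non-constant variance
abline(h = 0, lty = 2)
```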