Linear Regression Models: Key Concepts, Assumptions, and Interpretation

1
New cards

What is the simple linear regression model with one predictor?

yᵢ = β₀ + β₁xᵢ + εᵢ.

2
New cards

What is the multiple linear regression model with p predictors?

yᵢ = β₀ + β₁xᵢ₁ + ... + βₚxᵢₚ + εᵢ.

3
New cards

In the linear regression model, which part is treated as random?

The error term εᵢ is random, so the response yᵢ is random.

4
New cards

In the regression model, how do we usually treat the predictors xᵢⱼ?

As fixed, known values once the dataset is given.

5
New cards

In the regression model, how do we treat the coefficients βⱼ?

As fixed but unknown constants that we estimate from the data.

6
New cards

What is the matrix form of the linear regression model?

y = Xβ + ε.

7
New cards

What are the dimensions of X, β, and y in linear regression?

X is n × p, β is p × 1, and y is n × 1, where p counts the columns of X (including the intercept column).

8
New cards

What is the fitted value ŷᵢ for observation i?

ŷᵢ = β̂₀ + β̂₁xᵢ₁ + ... + β̂ₚxᵢₚ.

9
New cards

What is the residual for observation i?

eᵢ = yᵢ − ŷᵢ.

10
New cards

What is the Residual Sum of Squares (RSS)?

RSS(β) = Σᵢ (yᵢ − β₀ − β₁xᵢ₁ − ... − βₚxᵢₚ)², which equals Σᵢ (yᵢ − ŷᵢ)² when evaluated at the fitted coefficients.

11
New cards

What optimization problem defines the least squares estimator?

β̂ = arg minᵦ RSS(β).

12
New cards

What is the normal equation solution for linear regression (when it exists)?

β̂ = (XᵀX)⁻¹Xᵀy.
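
As a sanity check, here is a minimal R sketch (the simulated predictors x1, x2 and response y are purely illustrative) that computes β̂ from the normal equations and compares it with lm():

set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                              # design matrix with intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) beta = X'y
beta_hat
coef(lm(y ~ x1 + x2))                              # should match up to rounding

Solving the linear system with solve(A, b) rather than explicitly inverting XᵀX is the more stable route, which is essentially the concern raised in the next card.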

13
New cards

Why might we not use the normal equation directly in practice for very large problems?

Computing and inverting XᵀX can be very expensive and numerically unstable with many predictors or strong collinearity.

14
New cards

How is training RSS defined?

RSS_train = Σᵢ∈train (yᵢ − ŷᵢ)².

15
New cards

How is test RSS defined?

RSS_test = Σⱼ∈test (yⱼ − ŷⱼ)².

16
New cards

How is training MSE defined?

MSE_train = RSS_train ÷ n_train.

17
New cards

How is test MSE defined?

MSE_test = RSS_test ÷ n_test.
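
A minimal R sketch of the bookkeeping in the last four cards, assuming a data frame dat with response y (the 70/30 split is an arbitrary choice):

set.seed(1)
n <- nrow(dat)
train_idx <- sample(n, size = round(0.7 * n))   # 70% of rows for training
fit <- lm(y ~ ., data = dat[train_idx, ])

pred_train <- predict(fit, newdata = dat[train_idx, ])
pred_test  <- predict(fit, newdata = dat[-train_idx, ])

mse_train <- mean((dat$y[train_idx]  - pred_train)^2)   # training MSE
mse_test  <- mean((dat$y[-train_idx] - pred_test)^2)    # test MSE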

18
New cards

As you add more predictors, what happens to training RSS?

Training RSS always decreases or stays the same as predictors are added; it never increases.

19
New cards

As the model becomes more complex, how does test error typically behave?

Test error usually decreases at first and then increases, giving a U-shaped curve.

20
New cards

What is overfitting?

A model that is too flexible fits noise in the training data, so it has very low training error but predicts poorly on new data.

21
New cards

What is underfitting?

When a model is too simple, cannot capture the true relationship, and has high error on both training and test data.

22
New cards

Does an extremely small training RSS guarantee small test RSS?

No, it may indicate overfitting and poor generalization.

23
New cards

What is the Total Sum of Squares (TSS)?

TSS = Σᵢ (yᵢ − ȳ)².

24
New cards

What is R² in regression?

R² = 1 − RSS ÷ TSS, the proportion of variability in the response explained by the model.

25
New cards

How do you interpret R² = 0.8?

About 80 percent of the variability in the response is explained by the model.

26
New cards

Why might adjusted R² be preferred over R² for model comparison?

Adjusted R² penalizes adding predictors, so it does not automatically increase when you add useless variables.

27
New cards

What is the general form of adjusted R²?

Adj R² = 1 − (RSS ÷ (n − p)) ÷ (TSS ÷ (n − 1)), where p is the number of parameters including the intercept.
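
Both quantities can be computed by hand and checked against summary(); a hedged R sketch (dat, y, x1, x2 are placeholder names, and p counts coefficients including the intercept):

fit <- lm(y ~ x1 + x2, data = dat)
rss <- sum(residuals(fit)^2)
tss <- sum((dat$y - mean(dat$y))^2)
n   <- nrow(dat)
p   <- length(coef(fit))                # parameters including the intercept

r2     <- 1 - rss / tss
adj_r2 <- 1 - (rss / (n - p)) / (tss / (n - 1))

c(r2,     summary(fit)$r.squared)       # should agree
c(adj_r2, summary(fit)$adj.r.squared)   # should agree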

28
New cards

What are the main assumptions of the standard linear regression model?

Linearity of mean response in predictors, independent errors, constant error variance (homoscedasticity), and normally distributed errors for inference.

29
New cards

What does the linearity assumption mean?

The conditional mean of Y given X is a linear function of the predictors: E[Y ∣ X] = β₀ + β₁x₁ + ... + βₚxₚ.

30
New cards

What does homoscedasticity mean?

The error variance is constant across all observations: Var(εᵢ) = σ² for all i.

31
New cards

How can heteroscedasticity appear in a residual vs fitted plot?

Residuals may fan out or funnel in as fitted values increase, instead of having roughly constant spread.

32
New cards

What pattern in a residual plot suggests nonlinearity in the mean relationship?

A clear curved pattern or systematic shape instead of random scatter around zero.

33
New cards

In a simple linear regression, how do you interpret the slope β₁?

It is the expected change in the response for a one-unit increase in x.

34
New cards

In a simple linear regression, how do you interpret the intercept β₀?

It is the expected value of Y when x = 0.

35
New cards

In multiple regression, how do you interpret a slope coefficient βⱼ?

It is the expected change in the response for a one-unit increase in xⱼ, holding all other predictors constant.

36
New cards

For a factor with two levels (A and B) coded with a dummy variable for B, how do you interpret the dummy coefficient?

Holding other predictors fixed, it is the average difference in response between level B and the baseline level A.
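
A small R illustration of this reading, with a made-up two-level factor (the variable names group, x, y and the true difference of 3 are hypothetical):

set.seed(1)
group <- factor(rep(c("A", "B"), each = 50))
x     <- rnorm(100)
y     <- 1 + 2 * x + 3 * (group == "B") + rnorm(100)

fit <- lm(y ~ x + group)
coef(fit)["groupB"]   # estimated B-minus-A difference, holding x fixed (about 3)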

37
New cards

In a model y = β₀ + β₁x + β₂z + β₃xz + ε, what does the interaction coefficient β₃ represent?

It represents how the effect (slope) of x on y changes depending on z; if β₃ ≠ 0, the slope of x differs across values of z.

38
New cards

In the interaction model above, what is the slope of x when z = 0?

The slope is β₁.

39
New cards

In the interaction model above, what is the slope of x when z = 1?

The slope is β₁ + β₃.

40
New cards

If β₃ > 0 in an interaction between size and a neighborhood dummy, what does that mean?

The effect of size on price is stronger (slope is steeper) in that neighborhood than in the baseline neighborhood.

41
New cards

In R, what does the formula y ~ x1 + x2 + x3 mean?

A linear model for y with predictors x1, x2, x3 and an intercept.

42
New cards

In R, what does the formula y ~ x1 * x2 expand to?

y ~ x1 + x2 + x1:x2, including both main effects and their interaction.

43
New cards

In R, what does the formula y ~ x1:x2 (with a colon only) mean?

A model with only the interaction between x1 and x2, no main effects.
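
One way to see what these three formulas expand to is to look at the columns of the design matrix each one generates; a hedged R sketch with placeholder variables x1 and x2:

d <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10))

colnames(model.matrix(y ~ x1 + x2, data = d))  # "(Intercept)" "x1" "x2"
colnames(model.matrix(y ~ x1 * x2, data = d))  # adds "x1:x2" to the above
colnames(model.matrix(y ~ x1:x2,   data = d))  # "(Intercept)" "x1:x2" only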

44
New cards

In R, what does I(x^2) do in a formula like y ~ x + I(x^2)?

It tells R to include the literal squared term x² as a predictor instead of interpreting ^ as a formula operator.

45
New cards

How would you include a quadratic effect of x in a linear regression model in R?

Use lm(y ~ x + I(x^2), data = ...).

46
New cards

What does the coefficient table from summary(lm(...)) give you?

Estimates of coefficients, their standard errors, t values, and p values.

47
New cards

What null hypothesis is tested by the p value for a coefficient in the regression summary?

H₀: βⱼ = 0 versus Hₐ: βⱼ ≠ 0.

48
New cards

What does a very small p value for a coefficient suggest?

Strong evidence that the corresponding predictor is associated with the response, after controlling for other predictors.

49
New cards

What does the F statistic in the regression summary test?

The null hypothesis that all non-intercept coefficients are zero, versus the alternative that at least one of them is nonzero.

50
New cards

What is the approximate formula for the residual standard error in linear regression?

σ̂ = √(RSS ÷ (n − p)), the estimated standard deviation of the errors.
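
A quick R check of this formula against the value reported by summary() (fit is any lm object; p counts coefficients including the intercept):

rss <- sum(residuals(fit)^2)
n   <- length(residuals(fit))
p   <- length(coef(fit))
sigma_hat <- sqrt(rss / (n - p))

c(sigma_hat, summary(fit)$sigma)   # should agree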

51
New cards

What is the bias variance decomposition for prediction at a point x₀?

E[(Y − f̂(x₀))²] = σ² + Bias[f̂(x₀)]² + Var[f̂(x₀)].

52
New cards

In the bias variance tradeoff, how does increasing model flexibility affect bias?

Bias tends to decrease as model flexibility increases.

53
New cards

In the bias variance tradeoff, how does increasing model flexibility affect variance?

Variance tends to increase as model flexibility increases.

54
New cards

What combination of bias and variance do very simple models usually have?

High bias and low variance.

55
New cards

What combination of bias and variance do very complex models usually have?

Low bias and high variance.

56
New cards

What does the U-shaped test error curve represent?

Test error is high for very simple models, decreases to a minimum at intermediate complexity, then increases again for very complex models.

57
New cards

What kind of tuning knobs typically control the bias variance tradeoff in models?

Tuning parameters such as polynomial degree, number of predictors, or penalty parameters like λ in Ridge and Lasso.

58
New cards

Does changing the optimization algorithm (for example from gradient descent to Newton) directly change the bias variance tradeoff of the model?

No, it changes how we compute the solution, not the statistical complexity of the model.

59
New cards

What is Mallows Cₚ conceptually in linear regression?

A model selection criterion that combines fit and complexity: up to scaling and constants, Cₚ = RSS + 2pσ̂², where σ̂² is the error variance estimated from the full model.

60
New cards

How is Mallows Cₚ used to choose a model?

Compute Cₚ for each candidate model and choose the model with the smallest Cₚ.

61
New cards

What is the general form of AIC?

AIC = −2 log L + 2p, where L is the likelihood and p is the number of parameters.

62
New cards

For linear regression with Gaussian errors, how does AIC relate to RSS?

AIC = n log(RSS ÷ n) + 2p + constant.
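
Because the constant is shared by models fit to the same data, differences in this hand formula should match differences in R's AIC(); a hedged sketch comparing two nested fits (fit1, fit2, and dat are placeholder names):

fit1 <- lm(y ~ x1,      data = dat)
fit2 <- lm(y ~ x1 + x2, data = dat)
n <- nrow(dat)

hand_aic <- function(fit) {
  rss <- sum(residuals(fit)^2)
  p   <- length(coef(fit))
  n * log(rss / n) + 2 * p          # constant omitted
}

hand_aic(fit2) - hand_aic(fit1)     # should equal ...
AIC(fit2) - AIC(fit1)               # ... this difference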

63
New cards

What is the general form of BIC?

BIC = −2 log L + (log n)p.

64
New cards

How does the BIC penalty compare to the AIC penalty as sample size n grows?

The BIC penalty (log n)p exceeds the AIC penalty 2p whenever n > e² ≈ 7.4, so BIC tends to pick smaller models.

65
New cards

When comparing two models with the same RSS, which has smaller AIC or BIC, the simpler or the more complex model?

The simpler model, because it has fewer parameters and therefore a smaller penalty.

66
New cards

Which criteria tend to choose larger models and which tend to choose smaller models: AIC vs BIC?

AIC tends to choose larger models, BIC tends to choose smaller models.

67
New cards

What is best subset selection?

A model selection method that fits all possible subsets of predictors and chooses the best model according to a criterion such as AIC, BIC, or Cₚ.

68
New cards

What is a main disadvantage of best subset selection?

It is computationally expensive for many predictors, since it considers about 2ᵖ models.

69
New cards

What is forward stepwise selection?

Start from the null model, add predictors one by one, each time adding the predictor that most improves the chosen criterion, and stop when no addition improves it.

70
New cards

In forward stepwise selection, once a predictor is added, can it later be removed?

No, once added it remains in the model.

71
New cards

What is backward stepwise selection?

Start from the full model, remove predictors one by one, each time removing the predictor that most improves the criterion, and stop when removing any predictor makes the criterion worse.

72
New cards

Why can backward selection be problematic when the number of predictors is larger than the number of observations?

Because the full model cannot be fit reliably when p is greater than n (XᵀX is singular, so the least squares solution is not unique).

73
New cards

What does the R function step() do by default when given a full model?

It performs stepwise model selection using AIC; the default direction is "both", and when only a full model is supplied the search effectively works backward, dropping (and possibly re-adding) terms.

74
New cards

How can you make step() do forward selection in R?

Start from a null model, specify a scope that includes all predictors, and use direction = "forward".
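
A hedged R sketch of this recipe, with placeholder predictors x1, x2, x3 in a data frame dat:

null_model <- lm(y ~ 1, data = dat)    # intercept-only starting model
full_scope <- ~ x1 + x2 + x3           # predictors allowed to enter

fwd <- step(null_model,
            scope = full_scope,
            direction = "forward")
summary(fwd)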

75
New cards

In best subset selection output, how do you choose a model size using BIC or Cₚ?

Look at the criterion value for each model size and choose the size where the criterion is minimized.
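
If the best subsets are computed with the leaps package (an assumption; regsubsets() is its workhorse), this selection step looks roughly like the sketch below:

library(leaps)                                     # assumed to be installed

subs <- regsubsets(y ~ ., data = dat, nvmax = 10)  # best model of each size
sm   <- summary(subs)

which.min(sm$bic)               # model size minimizing BIC
which.min(sm$cp)                # model size minimizing Mallows' Cp
coef(subs, which.min(sm$bic))   # coefficients of the BIC-chosen model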

76
New cards

What is the gradient of a function f(θ)?

The vector of partial derivatives ∇f(θ) = (∂f/∂θ₁, ..., ∂f/∂θₚ)ᵀ.

77
New cards

What is the first order condition for a local minimum of a smooth function?

The gradient is zero at the minimum: ∇f(θ*) = 0.

78
New cards

What is the Hessian matrix of a function f?

The matrix of second partial derivatives ∇²f, whose (i, j) entry is ∂²f / (∂θᵢ ∂θⱼ).

79
New cards

In one dimension, what equation do we solve in calculus to find candidate minima of a function f?

Solve f′(θ) = 0.

80
New cards

What is the gradient descent update rule?

θₖ₊₁ = θₖ − η ∇f(θₖ), where η is the step size.
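
A minimal R sketch of this update applied to the linear regression RSS, whose gradient is −2Xᵀ(y − Xβ); the simulated X and y, the step size, and the iteration count are all illustrative choices:

set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))                # intercept plus one predictor
y <- X %*% c(1, 2) + rnorm(n)

eta  <- 1e-3                           # step size
beta <- c(0, 0)                        # starting point
for (k in 1:5000) {
  grad <- -2 * crossprod(X, y - X %*% beta)   # gradient of RSS at current beta
  beta <- beta - eta * as.vector(grad)
}
beta                                   # should be close to lm's estimate
coef(lm(y ~ X[, 2]))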

81
New cards

Why do we subtract the gradient in gradient descent?

Because the gradient points toward steepest increase, and we want to move in the opposite direction to decrease the function.

82
New cards

What happens if the step size η in gradient descent is too small?

The algorithm converges very slowly.

83
New cards

What happens if the step size η in gradient descent is too large?

The algorithm can overshoot the minimum and possibly diverge.

84
New cards

Is gradient descent a first order or second order optimization method?

First order; it uses only the gradient.

85
New cards

What is the Newton update in one dimension?

θₖ₊₁ = θₖ − f′(θₖ) ÷ f″(θₖ).

86
New cards

What is the Newton update in multiple dimensions?

θₖ₊₁ = θₖ − [∇²f(θₖ)]⁻¹ ∇f(θₖ).
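
For the quadratic RSS objective, the Hessian is the constant matrix 2XᵀX, so a single Newton step from any starting point lands exactly on the least squares solution; a small R check, reusing the simulated X and y from the gradient descent sketch above:

beta0 <- c(0, 0)                                   # arbitrary starting point
grad  <- -2 * crossprod(X, y - X %*% beta0)        # gradient of RSS at beta0
hess  <- 2 * crossprod(X)                          # Hessian of RSS (2 X'X)

beta1 <- beta0 - solve(hess, grad)                 # one Newton step
cbind(newton = as.vector(beta1),
      ols    = as.vector(solve(crossprod(X), crossprod(X, y))))   # identical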

87
New cards

Is Newton's method a first order or second order method?

Second order; it uses both the gradient and the Hessian.

88
New cards

Why might gradient descent be preferred over Newton's method for very large models?

Because Newton's method requires computing and inverting the Hessian, which is expensive in time and memory when there are many parameters, while gradient descent only needs the gradient.

89
New cards

In R's optim function, what does the par argument represent?

The initial guess for the parameter vector.

90
New cards

In R's optim function, what does the fn argument represent?

The function to be minimized, for example the loss such as RSS.

91
New cards

In R's optim function, what does the optional gr argument represent?

A function that returns the gradient of the objective function with respect to the parameters.

92
New cards

In R's optim output, what does out$par represent?

The parameter values at the minimum, that is the estimate θ̂ (in regression, the β̂ vector).

93
New cards

In R's optim output, what does out$convergence == 0 indicate?

That the algorithm claims to have successfully converged.
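
Putting the last few cards together, a hedged sketch of minimizing an RSS with optim() (X and y are assumed to be a design matrix and response, as in the earlier simulated examples):

rss_fn <- function(beta) sum((y - X %*% beta)^2)                        # fn: objective
rss_gr <- function(beta) as.vector(-2 * crossprod(X, y - X %*% beta))   # gr: gradient

out <- optim(par = c(0, 0),       # initial guess for beta
             fn  = rss_fn,
             gr  = rss_gr,
             method = "BFGS")

out$par           # estimated coefficients (beta hat)
out$convergence   # 0 means optim reports successful convergence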

94
New cards

What type of method is BFGS in optim?

A quasi-Newton method that builds an approximation to the Hessian (or its inverse) instead of computing it exactly.

95
New cards

What are the three main ways to compute derivatives mentioned in this unit?

Manual or analytic derivatives, numerical differentiation (finite differences), and automatic differentiation (autograd).

96
New cards

What is numerical differentiation?

Approximating derivatives using finite differences, for example f′(θ) ≈ [f(θ + h) − f(θ)] ÷ h for small h.
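
A tiny R illustration of a forward finite difference against a known analytic derivative (f(θ) = θ² is just an example function):

f <- function(theta) theta^2
h <- 1e-6
theta <- 3

(f(theta + h) - f(theta)) / h   # finite-difference approximation
2 * theta                       # analytic derivative f'(theta) = 2*theta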

97
New cards

What is automatic differentiation?

A technique where a library tracks the operations used to compute a function and applies the chain rule automatically to compute exact gradients up to machine precision.

98
New cards

Why is automatic differentiation preferred over numerical differentiation for complex models?

It is usually much faster and more accurate, especially when there are many parameters.

99
New cards

In PyTorch with autograd, what does setting requires_grad = True on a tensor do?

It tells PyTorch to track operations on that tensor so that gradients can be computed with respect to it.

100
New cards

In PyTorch with autograd, what does calling loss.backward() do?

It computes the gradient of loss with respect to all parameters that have requires_grad = True, storing results in their .grad fields.