What is the simple linear regression model with one predictor?
yᵢ = β₀ + β₁xᵢ + εᵢ.
What is the multiple linear regression model with p predictors?
yᵢ = β₀ + β₁xᵢ₁ + ... + βₚxᵢₚ + εᵢ.
In the linear regression model, which part is treated as random?
The error term εᵢ is random, so the response yᵢ is random.
In the regression model, how do we usually treat the predictors xᵢⱼ?
As fixed, known values once the dataset is given.
In the regression model, how do we treat the coefficients βⱼ?
As fixed but unknown constants that we estimate from the data.
What is the matrix form of the linear regression model?
y = Xβ + ε.
What are the dimensions of X, β, and y in linear regression?
X is n × (p + 1) once the intercept column is included, β is (p + 1) × 1, and y is n × 1 (often written n × p and p × 1 when p counts the intercept column).
What is the fitted value ŷᵢ for observation i?
ŷᵢ = β̂₀ + β̂₁xᵢ₁ + ... + β̂ₚxᵢₚ.
What is the residual for observation i?
eᵢ = yᵢ − ŷᵢ.
What is the Residual Sum of Squares (RSS)?
RSS(β) = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β₀ − ... − βₚxᵢₚ)².
What optimization problem defines the least squares estimator?
β̂ = arg minᵦ RSS(β).
What is the normal equation solution for linear regression (when it exists)?
β̂ = (XᵀX)⁻¹Xᵀy.
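For reference, a minimal R sketch of the normal-equation solution on simulated data (the predictors x1 and x2 are made up for illustration); solve(crossprod(X), crossprod(X, y)) solves the normal equations without forming an explicit inverse:

    # simulated data, purely illustrative
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

    X <- cbind(1, x1, x2)                              # design matrix with intercept column
    beta_hat <- solve(crossprod(X), crossprod(X, y))   # solves (X'X) beta = X'y
    beta_hat

    coef(lm(y ~ x1 + x2))                              # lm's least squares fit agrees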
Why might we not use the normal equation directly in practice for very large problems?
Computing and inverting XᵀX can be very expensive and numerically unstable with many predictors or strong collinearity.
How is training RSS defined?
RSS_train = Σᵢ∈train (yᵢ − ŷᵢ)².
How is test RSS defined?
RSS_test = Σⱼ∈test (yⱼ − ŷⱼ)².
How is training MSE defined?
MSE_train = RSS_train ÷ n_train.
How is test MSE defined?
MSE_test = RSS_test ÷ n_test.
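As a worked illustration, a small self-contained R sketch of a random train/test split (the 70/30 split and the data-generating model are arbitrary assumptions):

    set.seed(2)
    dat <- data.frame(x = runif(200))
    dat$y <- 3 + 2 * dat$x + rnorm(200)

    train_id <- sample(nrow(dat), 0.7 * nrow(dat))     # 70/30 split
    fit <- lm(y ~ x, data = dat[train_id, ])

    mse <- function(obs, pred) mean((obs - pred)^2)
    mse_train <- mse(dat$y[train_id],  fitted(fit))
    mse_test  <- mse(dat$y[-train_id], predict(fit, newdata = dat[-train_id, ]))
    c(train = mse_train, test = mse_test)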
As you add more predictors, what happens to training RSS?
Training RSS always decreases or stays the same; it never increases.
As the model becomes more complex, how does test error typically behave?
Test error usually decreases at first and then increases, giving a U-shaped curve.
What is overfitting?
A model that is too flexible fits the noise in the training data: it has very low training error but predicts poorly on new data.
What is underfitting?
When a model is too simple, cannot capture the true relationship, and has high error on both training and test data.
Does an extremely small training RSS guarantee small test RSS?
No, it may indicate overfitting and poor generalization.
What is the Total Sum of Squares (TSS)?
TSS = Σᵢ (yᵢ − ȳ)².
What is R² in regression?
R² = 1 − RSS ÷ TSS, the proportion of variability in the response explained by the model.
How do you interpret R² = 0.8?
About 80 percent of the variability in the response is explained by the model.
Why might adjusted R² be preferred over R² for model comparison?
Adjusted R² penalizes adding predictors, so it does not automatically increase when you add useless variables.
What is the general form of adjusted R²?
Adj R² = 1 − (RSS ÷ (n − p)) ÷ (TSS ÷ (n − 1)), where p is the number of parameters including the intercept.
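Both quantities can be checked by hand in R; a short sketch on simulated data (the results should match summary(fit)$r.squared and summary(fit)$adj.r.squared):

    set.seed(3)
    x <- rnorm(100); y <- 1 + x + rnorm(100)
    fit <- lm(y ~ x)

    rss <- sum(resid(fit)^2)
    tss <- sum((y - mean(y))^2)
    n   <- length(y)
    p   <- length(coef(fit))                  # parameters including the intercept

    r2     <- 1 - rss / tss
    adj_r2 <- 1 - (rss / (n - p)) / (tss / (n - 1))
    c(r2, adj_r2)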
What are the main assumptions of the standard linear regression model?
Linearity of mean response in predictors, independent errors, constant error variance (homoscedasticity), and normally distributed errors for inference.
What does the linearity assumption mean?
The conditional mean of Y given X is a linear function of the predictors: E[Y ∣ X] = β₀ + β₁x₁ + ... + βₚxₚ.
What does homoscedasticity mean?
The error variance is constant across all observations: Var(εᵢ) = σ² for all i.
How can heteroscedasticity appear in a residual vs fitted plot?
Residuals may fan out or funnel in as fitted values increase, instead of having roughly constant spread.
What pattern in a residual plot suggests nonlinearity in the mean relationship?
A clear curved pattern or systematic shape instead of random scatter around zero.
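In R, the usual diagnostic is a residuals-versus-fitted plot; a minimal sketch on simulated data (plot(fit, which = 1) gives the same plot with a smoother added):

    x <- runif(100); y <- 1 + 2 * x + rnorm(100, sd = 0.3)
    fit <- lm(y ~ x)

    plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)     # roughly even scatter around this line is what we hope to see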
In a simple linear regression, how do you interpret the slope β₁?
It is the expected change in the response for a one-unit increase in x.
In a simple linear regression, how do you interpret the intercept β₀?
It is the expected value of Y when x = 0.
In multiple regression, how do you interpret a slope coefficient βⱼ?
It is the expected change in the response for a one unit increase in xⱼ, holding all other predictors constant.
For a factor with two levels (A and B) coded with a dummy variable for B, how do you interpret the dummy coefficient?
Holding other predictors fixed, it is the average difference in response between level B and the baseline level A.
In a model y = β₀ + β₁x + β₂z + β₃xz + ε, what does the interaction coefficient β₃ represent?
It represents how the effect (slope) of x on y changes depending on z; if β₃ ≠ 0, the slope of x differs across values of z.
In the interaction model above, what is the slope of x when z = 0?
The slope is β₁.
In the interaction model above, what is the slope of x when z = 1?
The slope is β₁ + β₃.
If β₃ > 0 in an interaction between size and a neighborhood dummy, what does that mean?
The effect of size on price is stronger (slope is steeper) in that neighborhood than in the baseline neighborhood.
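A small R sketch of such an interaction, with a made-up binary dummy z (0 = baseline group); the x coefficient is the z = 0 slope and the x:z coefficient is the change in slope:

    set.seed(5)
    x <- runif(200, 0, 10)
    z <- rbinom(200, 1, 0.5)                      # 0 = baseline, 1 = other group
    y <- 1 + 2 * x + 3 * z + 1.5 * x * z + rnorm(200)

    fit <- lm(y ~ x * z)                          # expands to x + z + x:z
    coef(fit)
    # slope of x when z = 0: coef(fit)["x"]
    # slope of x when z = 1: coef(fit)["x"] + coef(fit)["x:z"]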
In R, what does the formula y ~ x1 + x2 + x3 mean?
A linear model for y with predictors x1, x2, x3 and an intercept.
In R, what does the formula y ~ x1 * x2 expand to?
y ~ x1 + x2 + x1:x2, including both main effects and their interaction.
In R, what does the formula y ~ x1:x2 (with a colon only) mean?
A model with only the interaction between x1 and x2, no main effects.
In R, what does I(x^2) do in a formula like y ~ x + I(x^2)?
It tells R to include the literal squared term x² as a predictor instead of interpreting ^ as a formula operator.
How would you include a quadratic effect of x in a linear regression model in R?
Use lm(y ~ x + I(x^2), data = ...).
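A self-contained illustration on simulated data; poly(x, 2, raw = TRUE) is shown as an equivalent way to write the same quadratic model:

    set.seed(6)
    x <- runif(150, -2, 2)
    y <- 1 + x - 2 * x^2 + rnorm(150, sd = 0.5)

    fit_quad <- lm(y ~ x + I(x^2))
    coef(fit_quad)

    fit_poly <- lm(y ~ poly(x, 2, raw = TRUE))    # same model, different spelling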
What does the coefficient table from summary(lm(...)) give you?
Estimates of coefficients, their standard errors, t values, and p values.
What null hypothesis is tested by the p value for a coefficient in the regression summary?
H₀: βⱼ = 0 versus Hₐ: βⱼ ≠ 0.
What does a very small p value for a coefficient suggest?
Strong evidence that the corresponding predictor is associated with the response, after controlling for other predictors.
What does the F statistic in the regression summary test?
The null hypothesis that all non-intercept (slope) coefficients are zero versus the alternative that at least one is nonzero.
What is the approximate formula for the residual standard error in linear regression?
σ̂ = √(RSS ÷ (n − p)), the estimated standard deviation of the errors.
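This can be verified against summary(lm(...)); a quick sketch on simulated data, with p counting the intercept:

    set.seed(7)
    x <- rnorm(80); y <- 2 + 3 * x + rnorm(80, sd = 1.5)
    fit <- lm(y ~ x)

    n <- length(y)
    p <- length(coef(fit))                        # parameters including the intercept
    sigma_hat <- sqrt(sum(resid(fit)^2) / (n - p))
    c(sigma_hat, summary(fit)$sigma)              # the two values agree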
What is the bias variance decomposition for prediction at a point x₀?
E[(Y − f̂(x₀))²] = σ² + Bias[f̂(x₀)]² + Var[f̂(x₀)].
In the bias variance tradeoff, how does increasing model flexibility affect bias?
Bias tends to decrease as model flexibility increases.
In the bias variance tradeoff, how does increasing model flexibility affect variance?
Variance tends to increase as model flexibility increases.
What combination of bias and variance do very simple models usually have?
High bias and low variance.
What combination of bias and variance do very complex models usually have?
Low bias and high variance.
What does the U-shaped test error curve represent?
Test error is high for very simple models, decreases to a minimum at intermediate complexity, then increases again for very complex models.
What kind of tuning knobs typically control the bias variance tradeoff in models?
Tuning parameters such as polynomial degree, number of predictors, or penalty parameters like λ in Ridge and Lasso.
Does changing the optimization algorithm (for example from gradient descent to Newton) directly change the bias variance tradeoff of the model?
No, it changes how we compute the solution, not the statistical complexity of the model.
What is Mallows Cₚ conceptually in linear regression?
A model selection criterion that trades off fit against complexity: up to constants, Cₚ = RSS + 2pσ̂²_full, where σ̂²_full is the error variance estimated from the full model.
How is Mallows Cₚ used to choose a model?
Compute Cₚ for each candidate model and choose the model with the smallest Cₚ.
What is the general form of AIC?
AIC = −2 log L + 2p, where L is the likelihood and p is the number of parameters.
For linear regression with Gaussian errors, how does AIC relate to RSS?
AIC = n log(RSS ÷ n) + 2p + constant.
What is the general form of BIC?
BIC = −2 log L + (log n)p.
How does the BIC penalty compare to the AIC penalty as sample size n grows?
The BIC penalty (log n)p exceeds the AIC penalty 2p once log n > 2 (roughly n ≥ 8), so BIC tends to pick smaller models.
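In R, AIC() and BIC() compute both criteria directly for lm fits; a small sketch comparing a model with and without a pure-noise predictor (the example is made up):

    set.seed(8)
    n <- 100
    x1 <- rnorm(n); x2 <- rnorm(n)                # x2 is unrelated to y
    y  <- 1 + 2 * x1 + rnorm(n)

    fit1 <- lm(y ~ x1)
    fit2 <- lm(y ~ x1 + x2)

    AIC(fit1, fit2)                               # lower is better
    BIC(fit1, fit2)                               # BIC penalizes the extra parameter more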
When comparing two models with the same RSS, which has smaller AIC or BIC, the simpler or the more complex model?
The simpler model, because it has fewer parameters and therefore a smaller penalty.
Which criteria tend to choose larger models and which tend to choose smaller models: AIC vs BIC?
AIC tends to choose larger models, BIC tends to choose smaller models.
What is best subset selection?
A model selection method that fits all possible subsets of predictors and chooses the best model according to a criterion such as AIC, BIC, or Cₚ.
What is a main disadvantage of best subset selection?
It is computationally expensive for many predictors, since it considers about 2ᵖ models.
What is forward stepwise selection?
Start from the null model, add predictors one by one, each time adding the predictor that most improves the chosen criterion, and stop when no addition improves it.
In forward stepwise selection, once a predictor is added, can it later be removed?
No, once added it remains in the model.
What is backward stepwise selection?
Start from the full model, remove predictors one by one, each time removing the predictor that most improves the criterion, and stop when removing any predictor makes the criterion worse.
Why can backward selection be problematic when the number of predictors is larger than the number of observations?
Because you cannot reliably fit the full model when p is greater than n.
What does the R function step() do by default when given a full model?
It performs stepwise model selection using AIC; when only a full model is supplied (no scope), it works in the backward direction by default.
How can you make step() do forward selection in R?
Start from a null model, specify a scope that includes all predictors, and use direction = "forward".
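A hedged sketch of that recipe on simulated data (the predictor names x1, x2, x3 are made up); the scope formula defines the largest model forward selection may reach:

    set.seed(9)
    dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
    dat$y <- 1 + 2 * dat$x1 + rnorm(100)          # only x1 matters

    null_fit <- lm(y ~ 1, data = dat)             # intercept-only starting model
    fwd <- step(null_fit,
                scope     = ~ x1 + x2 + x3,       # upper limit of the search
                direction = "forward")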
In best subset selection output, how do you choose a model size using BIC or Cₚ?
Look at the criterion value for each model size and choose the size where the criterion is minimized.
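One common implementation is regsubsets() from the leaps package (assumed to be installed); a sketch of choosing the size that minimizes BIC or Cₚ:

    library(leaps)

    set.seed(10)
    dat <- data.frame(matrix(rnorm(100 * 5), ncol = 5))
    names(dat) <- paste0("x", 1:5)
    dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(100)

    best <- regsubsets(y ~ ., data = dat, nvmax = 5)
    s    <- summary(best)

    which.min(s$bic)                              # model size minimizing BIC
    which.min(s$cp)                               # model size minimizing Cp
    coef(best, which.min(s$bic))                  # coefficients of that model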
What is the gradient of a function f(θ)?
The vector of partial derivatives ∇f(θ) = (∂f/∂θ₁, ..., ∂f/∂θₚ)ᵀ.
What is the first order condition for a local minimum of a smooth function?
The gradient is zero at the minimum: ∇f(θ*) = 0.
What is the Hessian matrix of a function f?
The matrix of second derivatives ∇²f, whose (i, j) entry is ∂²f/∂θᵢ∂θⱼ.
In one dimension, what equation do we solve in calculus to find candidate minima of a function f?
Solve f′(θ) = 0.
What is the gradient descent update rule?
θₖ₊₁ = θₖ − η ∇f(θₖ), where η is the step size.
Why do we subtract the gradient in gradient descent?
Because the gradient points toward steepest increase, and we want to move in the opposite direction to decrease the function.
What happens if the step size η in gradient descent is too small?
The algorithm converges very slowly.
What happens if the step size η in gradient descent is too large?
The algorithm can overshoot the minimum and possibly diverge.
Is gradient descent a first order or second order optimization method?
First order, it uses only the gradient.
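A minimal gradient-descent sketch in R for the regression RSS, using the analytic gradient ∇RSS(β) = −2Xᵀ(y − Xβ); the step size and iteration count are arbitrary choices for this toy problem:

    set.seed(11)
    n <- 200
    X <- cbind(1, rnorm(n))                       # intercept plus one predictor
    y <- as.vector(X %*% c(1, 2) + rnorm(n))

    beta <- c(0, 0)                               # starting point
    eta  <- 1e-3                                  # step size
    for (k in 1:5000) {
      grad <- -2 * t(X) %*% (y - X %*% beta)      # gradient of RSS at beta
      beta <- beta - eta * as.vector(grad)
    }
    beta                                          # close to the least squares solution
    coef(lm(y ~ X[, 2]))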
What is the Newton update in one dimension?
θₖ₊₁ = θₖ − f′(θₖ) ÷ f″(θₖ).
What is the Newton update in multiple dimensions?
θₖ₊₁ = θₖ − [∇²f(θₖ)]⁻¹ ∇f(θₖ).
Is Newton's method a first order or second order method?
Second order, it uses both gradient and Hessian.
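A one-dimensional Newton sketch in R for the quadratic f(θ) = (θ − 3)² + 1, whose minimizer is θ = 3 (for a quadratic, a single Newton step lands exactly on it):

    f_prime  <- function(theta) 2 * (theta - 3)   # f'(theta)
    f_second <- function(theta) 2                 # f''(theta), constant here

    theta <- 0
    for (k in 1:5) {
      theta <- theta - f_prime(theta) / f_second(theta)
    }
    theta                                         # 3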
Why might gradient descent be preferred over Newton's method for very large models?
Because Newton's method requires computing and inverting the Hessian, which is expensive in time and memory when there are many parameters, while gradient descent only needs the gradient.
In R's optim function, what does the par argument represent?
The initial guess for the parameter vector.
In R's optim function, what does the fn argument represent?
The function to be minimized, for example the loss such as RSS.
In R's optim function, what does the optional gr argument represent?
A function that returns the gradient of the objective function with respect to the parameters.
In R's optim output, what does out$par represent?
The parameter values at the minimum, that is the estimate θ̂ (in regression, the β̂ vector).
In R's optim output, what does out$convergence == 0 indicate?
That the algorithm claims to have successfully converged.
What type of method is BFGS in optim?
A quasi-Newton method that approximates the Hessian instead of computing it exactly.
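Putting the optim() pieces together, a minimal sketch that minimizes a regression RSS with BFGS, supplying fn and the analytic gr (data and names are illustrative only):

    set.seed(13)
    x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)
    X <- cbind(1, x)

    rss      <- function(beta) sum((y - X %*% beta)^2)
    rss_grad <- function(beta) as.vector(-2 * t(X) %*% (y - X %*% beta))

    out <- optim(par = c(0, 0),    # initial guess
                 fn  = rss,        # objective to minimize
                 gr  = rss_grad,   # its gradient
                 method = "BFGS")

    out$par                        # close to coef(lm(y ~ x))
    out$convergence                # 0 means optim reports success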
What are the three main ways to compute derivatives mentioned in this unit?
Manual or analytic derivatives, numerical differentiation (finite differences), and automatic differentiation (autograd).
What is numerical differentiation?
Approximating derivatives using finite differences, for example f′(θ) ≈ [f(θ + h) − f(θ)] ÷ h for small h.
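A tiny R sketch of a forward finite difference for f(θ) = θ², checked against the analytic derivative 2θ:

    f <- function(theta) theta^2
    num_deriv <- function(f, theta, h = 1e-6) (f(theta + h) - f(theta)) / h

    num_deriv(f, 3)    # approximately 6
    2 * 3              # analytic derivative at theta = 3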
What is automatic differentiation?
A technique where a library tracks the operations used to compute a function and applies the chain rule automatically to compute exact gradients up to machine precision.
Why is automatic differentiation preferred over numerical differentiation for complex models?
It is usually much faster and more accurate, especially when there are many parameters.
In PyTorch with autograd, what does setting requires_grad = True on a tensor do?
It tells PyTorch to track operations on that tensor so that gradients can be computed with respect to it.
In PyTorch with autograd, what does calling loss.backward() do?
It computes the gradient of loss with respect to all parameters that have requires_grad = True, storing results in their .grad fields.