MAS Section 3

Last updated 12:07 AM on 7/26/26

67 Terms

New cards

Determine which of the following statements about the handling of dispersion and its alternatives are NOT true in the context of Poisson models. (Select 2 options.)

A) The Poisson model assumes equidispersion.
B) In the model where Var[Yi] = φμi, overdispersion is indicated when φ > 1.
C) Underdispersion is when the mean of data is less than the variance.
D) The negative binomial distribution can be a solution for datasets where variance is greater than the mean.
E) The parameter φ can be used to scale the variance in relation to the mean, regardless of the dataset.



Answers: C and E

EXPLANATIONS:

A — True. The Poisson model is defined with equidispersion where the mean equals the variance.

B — True. One approach to handle the lack of equidispersion is to let Var[Yi] = φμi, where φ > 0 is a parameter that captures overdispersion (if φ > 1) or underdispersion (if φ < 1). The statement matches this definition, so it is not one of the answers.

C — False. Underdispersion is when the variance is less than the mean (not the other way around). When variance exceeds the mean, that is overdispersion.

D — True. A common alternative for overdispersed data is the negative binomial distribution, where the variance is inherently greater than the mean.

E — False. While φ can scale variance, assuming it is constant across different datasets is overly simplistic. The actual dispersion level can vary widely, so a more flexible model like the negative binomial may be necessary.
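A quick way to see which case applies is to compare the sample variance with the sample mean. The sketch below uses made-up counts, and the variance-to-mean ratio is only a crude moment-based stand-in for φ, not an estimator named in the card.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; close to 1 under equidispersion."""
    return variance(counts) / mean(counts)


# Hypothetical claim counts, made up for illustration.
counts = [0, 1, 0, 3, 7, 0, 2, 9, 1, 0]
ratio = dispersion_ratio(counts)

if ratio > 1:
    label = "overdispersed"
elif ratio < 1:
    label = "underdispersed"
else:
    label = "equidispersed"

print(round(ratio, 2), label)
```

A ratio well above 1, as with these counts, is the situation where the negative binomial in option D becomes attractive.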



New cards

The tuning parameter λ in a smoothing spline model. Which of the following are TRUE? (Select all that apply.)

1. Larger values of λ result in smoother splines.
2. Larger values of λ result in greater effective degrees of freedom for the model.
3. Larger values of λ result in a more biased model.

A) 1 only
B) 1 and 2
C) 1 and 3
D) 2 and 3
E) All of the above

Answer: C) 1 and 3

1. TRUE — λ controls the roughness penalty. λ = 0 imposes no penalty; larger λ imposes higher roughness penalties, producing smoother splines.
2. FALSE — As λ increases from 0 to ∞, the effective degrees of freedom decrease from n down to 2. More smoothness constraints = fewer degrees of freedom.
3. TRUE — As λ increases, the model becomes smoother but more biased. This reflects the bias-variance tradeoff: larger λ → higher bias, lower variance.
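For reference, the roughness penalty enters the usual smoothing spline criterion as shown below. The symbols g, x_i, and y_i are notation assumed here; the card itself only names λ.

```latex
\hat{g} \;=\; \arg\min_{g} \; \sum_{i=1}^{n} \bigl( y_i - g(x_i) \bigr)^2 \;+\; \lambda \int g''(t)^2 \, dt
```

With λ = 0 only the residual sum of squares matters, so g can interpolate the data. As λ → ∞ the penalty forces g'' toward zero, leaving a straight line, which is why the effective degrees of freedom fall to 2.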

New cards

The third principal component will be orthogonal to the first principal component.

TRUE. Each principal component is orthogonal to each of the others.

New cards

You are performing a principal components analysis on a data set with 50 observations from three independent continuous variables. The maximum number of principal components that can be extracted from this data is three.

TRUE. In general, at most min(n − 1, p) principal components can be extracted; with n = 50 observations and p = 3 variables this is 3, the number of variables in the data set.

New cards

Determine which of the following statements about regression and classification problems are TRUE. (Select 2 options.)

A) Statistical learning methods are universally applicable to any type of response variable, making the distinction between quantitative and qualitative responses irrelevant.

B) Logistic regression can also be viewed as a regression method.

C) Variables can be categorized as either quantitative or qualitative, with regression problems typically involving quantitative responses and classification problems involving qualitative responses.

D) The distinction between regression and classification problems is based on the nature of the predictors rather than the response.

E) K-nearest neighbors and boosting are methods exclusively used for classification problems.

Answer: B and C

A — False. Statistical learning methods do vary in applicability depending on the type of response variable. The distinction between quantitative and qualitative responses is crucial.

B — True. Despite its name, logistic regression is primarily used for classification (particularly binary/qualitative responses), but since it estimates class probabilities, it can also be thought of as a regression method.

C — True. Variables are either quantitative or qualitative. Quantitative responses lead to regression problems; qualitative responses lead to classification problems.

D — False. The regression vs. classification distinction is based on the type of response variable (quantitative vs. qualitative), not the predictors.

E — False. K-nearest neighbors and boosting can both be applied to regression (quantitative response) and classification (qualitative response) problems.

New cards

An analyst is conducting 10-fold cross-validation on a dataset containing 150 observations and 25 variables.

Determine which of the following statements are TRUE with regard to the mechanics of k-fold cross-validation. (Select 2 options.)

A) 10-fold cross-validation involves dividing the set of observations into 25 groups of approximately equal size.

B) In each iteration of the cross-validation, 1 fold is used as the validation set.

C) In each iteration of the cross-validation, 1 fold is used as the training set.

D) The approximate number of observations in each validation set is 135.

E) The approximate number of observations in each training set is 135.

Answer: B and E

A — False. 10-fold CV divides the data into 10 groups (not 25), each containing approximately 150/10 = 15 observations.

B — True. In each iteration, 1 fold is held out as the validation set, and the remaining 9 folds are used for training.

C — False. 1 fold is used as the validation set, not the training set. The remaining 9 folds form the training set.

D — False. Each validation set contains approximately 15 x 1 = 15 observations (not 135).

E — True. Each training set contains approximately 15 x 9 = 135 observations.
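The fold arithmetic above can be checked with a short sketch. The helper `fold_sizes` is a hypothetical name introduced here, and it assumes folds are made as equal as possible.

```python
def fold_sizes(n, k):
    """Sizes of k folds that differ by at most one observation."""
    base, extra = divmod(n, k)
    return [base + 1 if i < extra else base for i in range(k)]


n, k = 150, 10
sizes = fold_sizes(n, k)

for i, val_size in enumerate(sizes, start=1):
    train_size = n - val_size  # the other k - 1 folds
    print(f"iteration {i}: validation={val_size}, training={train_size}")
```

The 25 variables play no role in the split; only n and k determine the fold sizes.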

New cards

An actuary fits a Poisson distribution to a sample of data X, where f(x) = (θ^x * e^(-θ)) / x!

To assure convergence of the maximum likelihood fitting procedure, the actuary plots three quantities across different values of θ:

Plot I: Score Function, U
Plot II: Deviance
Plot III: Information, J

Which of the three plots can be used to visually approximate the maximum likelihood estimate (MLE) of θ?

A) Plot I only
B) Plots I and II
C) Plots II and III
D) All three plots
E) Plot III only

Answer: B (Plots I and II)

Plot I (Score Function) — YES. The MLE of θ occurs when the score function U = 0. So the MLE can be identified where the curve crosses zero.

Plot II (Deviance) — YES. As a function of θ, the deviance equals -2 · l(θ) plus a constant that does not depend on θ, where l(θ) is the log-likelihood. Since l(θ) is maximized at the MLE, the deviance is minimized there. The minimum of the deviance curve gives the MLE.

Plot III (Information) — NO. The information plot tells us about the variance of the MLE of θ, but nothing about the MLE itself.
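The three plots can be mimicked numerically. The sketch below evaluates each quantity on a grid of θ values for a made-up sample; the grid, the sample, and the function names are assumptions for illustration, and -2 · l(θ) is used in place of the deviance, following the explanation above.

```python
import math

x = [2, 0, 3, 1, 4, 2]  # hypothetical sample, made up for illustration
n, total = len(x), sum(x)


def loglik(theta):
    """Poisson log-likelihood l(theta) for the sample x."""
    return sum(xi * math.log(theta) - theta - math.lgamma(xi + 1) for xi in x)


def score(theta):
    """Score function U = dl/dtheta = sum(x)/theta - n."""
    return total / theta - n


def information(theta):
    """Expected information J = n / theta."""
    return n / theta


grid = [0.5 + 0.01 * i for i in range(400)]  # theta from 0.50 to 4.49

theta_from_deviance = min(grid, key=lambda t: -2 * loglik(t))  # lowest point
theta_from_score = min(grid, key=lambda t: abs(score(t)))      # zero crossing

print(theta_from_deviance, theta_from_score, total / n)
```

Both grid searches land at the sample mean, which is the Poisson MLE. The information n/θ just decreases steadily in θ, so its curve has no feature that marks the MLE.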

New cards

Determine which of the following statements is/are true for a simple linear relationship, y = β0 + β1x + ε.

I. If ε = 0, the 95% confidence interval is equal to the 95% prediction interval.
II. The prediction interval is always at least as wide as the confidence interval.
III. The prediction interval quantifies the possible range for E(y | x).

A) I only
B) II only
C) III only
D) I, II, and III
E) The correct answer is not given by (A), (B), (C), or (D).

Answer: E (I and II are true)

I — True. The prediction interval accounts for the irreducible error ε. If ε = 0, the irreducible error vanishes, so the prediction interval collapses to equal the confidence interval.

II — True. The prediction interval is never narrower than the confidence interval because it must account for both the uncertainty in estimating E(y | x) and the irreducible error ε.

III — False. The confidence interval quantifies the possible range for E(y | x). The prediction interval quantifies the possible range for y | x (an individual response), not the mean.
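Both intervals use the same t quantile, so comparing their standard errors is enough to compare widths. The sketch below uses the textbook standard-error formulas for simple linear regression on made-up data; all variable names are chosen here for illustration.

```python
import math

# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
# Residual variance estimate: the part attributable to the error term.
s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)


def se_mean(x0):
    """Standard error behind the confidence interval for E(y | x0)."""
    return math.sqrt(s2 * (1 / n + (x0 - x_bar) ** 2 / sxx))


def se_pred(x0):
    """Standard error behind the prediction interval for a new y at x0."""
    return math.sqrt(s2 * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))


for x0 in (1.0, 3.5, 8.0):
    print(x0, round(se_mean(x0), 4), round(se_pred(x0), 4))
```

The extra `1` inside `se_pred` is the contribution of the error term, which is why the prediction interval can never be narrower.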

New cards

You are given the following information:

f(x; θ) is the pdf of X, where θ represents one or more unknown parameters.
Ω is the set of all possible parameters for θ.
H0 : θ ∈ ω where ω is a subset of Ω
H1 : θ ∈ Ω
λ = L(ω-hat) / L(Ω-hat), where the numerator is the maximum likelihood function with respect to θ under the null hypothesis.

Determine which of the following are true.

I. λ ≤ 1
II. λ can be less than zero.
III. As λ increases, the likelihood of rejecting the null hypothesis increases.

A) I only
B) II only
C) III only
D) I and III
E) All of the above

Answer: A (I only)

I — True. Since the null hypothesis is a special case of the alternative hypothesis, the likelihood under H0 can never exceed the likelihood under H1. Therefore the likelihood ratio λ can never be greater than 1, so λ ≤ 1.

II — False. The likelihood ratio λ is a ratio of two likelihood values (which are probabilities/densities and always non-negative). Therefore λ can never be negative.

III — False. The test statistic used is -2 ln λ. As λ increases (gets closer to 1), -2 ln λ decreases, meaning there is less evidence against H0. So the likelihood of rejecting the null hypothesis actually decreases as λ increases.
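As a concrete instance, take a Poisson sample with a simple null hypothesis θ = θ₀, so that ω contains a single point. The sample and the null value below are made up, and the Poisson model is an assumption chosen only to make the ratio computable.

```python
import math

x = [2, 0, 3, 1, 4, 2]  # hypothetical sample, made up for illustration
theta_0 = 1.0           # hypothetical null value, so omega = {theta_0}


def loglik(theta):
    """Poisson log-likelihood for the sample x."""
    return sum(xi * math.log(theta) - theta - math.lgamma(xi + 1) for xi in x)


theta_hat = sum(x) / len(x)  # unrestricted MLE over Omega

# Likelihood ratio: restricted maximum over unrestricted maximum.
lam = math.exp(loglik(theta_0) - loglik(theta_hat))
test_stat = -2 * math.log(lam)

print(round(lam, 4), round(test_stat, 4))
```

Moving `theta_0` closer to the sample mean pushes `lam` toward 1 and `test_stat` toward 0, which is the direction of weaker evidence against H0.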

New cards

Determine which of the following statements is/are true.

I. If x1, x2, ..., xn denote a random sample from a probability distribution with unknown parameter θ, then the statistic Y is said to be sufficient for θ if the conditional distribution of x1, x2, ..., xn given Y depends on θ.

II. A sufficient statistic is an unbiased estimator.

III. If Y is a complete sufficient statistic for θ, and if two different unbiased estimators for θ exist: θ-hat1 and θ-hat2 = g(Y), then θ-hat2 has no larger variance than θ-hat1.

A) I only
B) II only
C) III only
D) II and III only
E) None are true

Answer: C (III only)

I — False. Y is sufficient for θ if the conditional distribution of x1, x2, ..., xn given Y does NOT depend on θ. The whole point of sufficiency is that once you know Y, the rest of the sample gives no additional information about θ.

II — False. A sufficient statistic is not necessarily an unbiased estimator. Sufficiency and unbiasedness are separate properties — a statistic can be sufficient but still biased.

III — True. If Y is a complete sufficient statistic for θ, and g(Y) is an unbiased estimator of θ, then by the Lehmann-Scheffé theorem, g(Y) is the Minimum Variance Unbiased Estimator (MVUE) of θ. This means it has the lowest variance among all unbiased estimators, so θ-hat2 has no larger variance than θ-hat1.

New cards

Determine which of the following statements about k-fold cross-validation and leave-one-out cross-validation are NOT true. (Select 3 options.)

A. LOOCV may be preferred over k-fold CV as it is computationally less intensive.

B. LOOCV may be preferred over k-fold CV as it tends to have lower bias.

C. k-fold CV has higher variance than LOOCV when k < n.

D. LOOCV is computationally more efficient than k-fold CV when used to validate linear models fitted by ordinary least squares.

E. LOOCV overestimates the test error more than k-fold CV.

Answer: A, C, and E

(A) is NOT true because k-fold CV (with k < n) is generally computationally less intensive than LOOCV.

(B) is true as LOOCV, using almost all data for training, tends to have lower bias.

(C) is NOT true. The test error estimate of k-fold CV has lower variance than that of LOOCV due to less overlap in training sets.

(D) is true. For polynomial or linear regression, LOOCV is computationally efficient because of the shortcut formula available. The shortcut formula enables the calculation of the estimated test error from just a single round of fitting, which makes the cost of LOOCV the same as a single model fit. In contrast, k-fold CV involves fitting the model k times, each time using a different training set.

(E) is NOT true. It is actually k-fold CV that tends to overestimate the test error more than LOOCV due to its smaller training sets.
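The shortcut mentioned in (D) divides each residual from the single full fit by one minus that observation's leverage. The sketch below checks it against brute-force LOOCV for simple linear regression on made-up data; the function names are hypothetical.

```python
def fit(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b1 = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    return yb - b1 * xb, b1


def loocv_brute_force(x, y):
    """Refit n times, each time leaving one observation out."""
    errs = []
    for i in range(len(x)):
        b0, b1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        errs.append((y[i] - b0 - b1 * x[i]) ** 2)
    return sum(errs) / len(errs)


def loocv_shortcut(x, y):
    """One fit only: scale each residual by 1 / (1 - leverage)."""
    n = len(x)
    xb = sum(x) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    b0, b1 = fit(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        h = 1 / n + (xi - xb) ** 2 / sxx  # leverage of this observation
        total += ((yi - b0 - b1 * xi) / (1 - h)) ** 2
    return total / n


# Hypothetical data, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.2, 2.9, 3.1, 5.4, 5.9, 8.2, 8.8]

print(loocv_brute_force(x, y), loocv_shortcut(x, y))
```

The two numbers agree up to floating-point error, but the shortcut needs one fit instead of n.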

New cards

Determine which of the following statements about the canonical links for distributions in GLMs are false. (Select 2 options.)

A. The identity link is the canonical link for the normal distribution.

B. The logit link is the canonical link for the Bernoulli distribution.

C. The logarithmic link is the canonical link for the exponential distribution.

D. The inverse link is the canonical link for the gamma distribution.

E. The identity squared link is the canonical link for the inverse Gaussian distribution.

Answer: C and E

(A) is true as the identity link is indeed the canonical link for the normal distribution.

(B) is true since the logit link is the canonical link for the Bernoulli distribution.

(C) is NOT true; the exponential distribution is a special case of the gamma distribution, so its canonical link is the inverse link, not the logarithmic link.

(D) is true as the inverse link is the canonical link for the gamma distribution.

(E) is NOT true; the canonical link for the inverse Gaussian distribution is the inverse squared link (1/μ²), not the identity squared link.

New cards

Determine which of the following statements about partial least squares (PLS) is true. (Select 2 options)

A. PLS is a subset selection method.

B. PLS identifies new features in an unsupervised way by approximating the original predictors, similar to principal components analysis.

C. After standardizing the predictors, PLS computes the first direction by setting its loadings to the coefficients from the simple linear regression of the response onto each original predictor.

D. In computing the first direction, PLS places the highest weight on the variables that are least strongly related to the response.

E. The loadings for the first direction are proportional to the covariances between the response and each standardized predictor.

Answer: C and E

Option A is false. Like principal components analysis (PCA), PLS is a dimension reduction method, not a subset selection method.

Option B is false. Unlike principal components analysis (PCA), PLS identifies new features in a supervised way; it uses the target variable to create new features that not only approximate the old features well, but also that are related to the target variable.

Option C is true. Each ϕ_{j1} in Z_1 = ∑_{j=1}^{p} ϕ_{j1} X_j is equal to the slope coefficient from the simple linear regression of Y onto X_j.

Option D is false. Since the slope coefficients for each simple linear regression are used for the first direction, a larger value indicates a stronger relationship with the target variable.

Option E is true. Because the predictors are standardized, each simple regression slope is proportional to the covariance (and hence the correlation) between the response and that predictor.

New cards

You are given a dataset with an equal number of predictors and observations. You are considering a subset selection procedure to determine the best subset of predictors.

Determine which of the following statements is/are true.

I. Forward stepwise selection cannot be used.
II. Backward stepwise selection cannot be used.
III. Both forward and backward stepwise selection can result in the same best model.

A. I only
B. II only
C. III only
D. I, II, and III
E. The correct answer is not given by (A), (B), (C), or (D).

Answer: B. II only

Explanation:

I is FALSE. Forward stepwise selection can always be used. The equal number of predictors and observations (n = p) does not restrict forward stepwise selection because it starts with no predictors and adds one at a time. It does not require fitting the full model with all p predictors initially.
II is TRUE. Backward stepwise selection cannot be used. A multiple linear regression model with p predictors has p + 1 coefficients to estimate (including the intercept). Since n = p, we have n < p + 1 (because p < p + 1). Therefore, there is not enough information to obtain a unique solution to the score (normal) equations, so the full model with all p predictors is invalid (it has zero degrees of freedom for error). Since backward selection begins by fitting the full model with all predictors, it cannot be used.
III is FALSE. While it is generally possible for forward and backward stepwise selection to result in the same best model, in this specific scenario backward selection is impossible. Therefore, there is no way to compare the results of both approaches, and the statement as a general truth does not hold here.

New cards

You are given the following statements on supervised learning:

I. The variance and the squared bias are inversely related.
II. The squared bias of a statistical learning method increases as the method’s flexibility decreases.
III. As model flexibility increases, the test mean squared error (MSE) monotonically decreases.

Determine which of the following statements is/are true.

A. I only
B. I and II only
C. I and III only
D. I, II, and III
E. The correct answer is not given by (A), (B), (C), or (D).

Answer: B. I and II only

Explanation:

I is TRUE. This is the bias-variance trade-off: variance and squared bias are inversely related. As one increases, the other tends to decrease (all else being equal).
II is TRUE. As model flexibility decreases, the model becomes more rigid and less able to capture complex patterns, so the squared bias increases (while variance typically decreases).
III is FALSE. As model flexibility increases, the training MSE monotonically decreases. However, the test MSE does not monotonically decrease; it follows a U-shaped curve. It decreases initially due to reduced bias, then increases due to growing variance (the bias-variance trade-off).

New cards

Determine which of the following statements about linear models is/are true.

I. Residual plots are a useful graphical tool for identifying non-linearity.

II. If there is a correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors.

III. The variance of error terms doesn't have to be constant.

A. None
B. I and II only
C. I and III only
D. II and III only
E. The correct answer is not given by (A), (B), (C), or (D)

Answer: B. I and II only

Explanation:

I is TRUE. Residual plots are useful for identifying non-linearity. For example, if a residual plot shows a quadratic (U-shaped or inverted U-shaped) pattern against fitted values, that suggests a nonlinear relationship, and adding a squared predictor term can help address it.
II is TRUE. If there is correlation among the error terms (autocorrelation), the estimated standard errors tend to underestimate the true standard errors. This can lead to narrower confidence intervals and smaller p-values than are actually appropriate, increasing the risk of false positives.
III is FALSE. Under the standard linear regression model assumptions, the error terms must have constant variance (homoscedasticity). This is a key assumption; if violated, the model may need adjustments such as weighted least squares or transformations.

New cards

A researcher has n = 500 samples and p = 5,000 biomarkers. Goal: select ≤ 50 biomarkers. Which 2 methods could feasibly achieve this?

A. Best subset selection
B. Forward stepwise selection
C. Backward stepwise selection
D. Ridge regression
E. Lasso regression

Answer: B and E (Forward stepwise selection & Lasso regression)

Why each option:

A (Best subset) – NO. Computationally infeasible. Evaluating all subsets of size ≤ 50 from 5,000 predictors is combinatorially explosive.
B (Forward stepwise) – YES. Starts with 0 predictors and adds one at a time. Can stop at 50 predictors. Feasible even when p > n.
C (Backward stepwise) – NO. Starts with all p predictors in the model. Requires fitting a full model first, which is impossible when p > n (5,000 > 500).
D (Ridge) – NO. Shrinks coefficients toward zero but never sets them exactly to zero. It does not perform feature selection, so it cannot produce a model with at most 50 biomarkers.
E (Lasso) – YES. Shrinks some coefficients exactly to zero, performing built-in feature selection. By tuning the penalty parameter, it can produce a model with ≤ 50 selected predictors.
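To put numbers on "combinatorially explosive", the sketch below counts how many models each search would have to fit. It assumes forward stepwise selection stops at 50 predictors and counts the null model once.

```python
import math

p, max_size = 5000, 50

# Best subset: every subset of size 0, 1, ..., 50.
best_subset_models = sum(math.comb(p, k) for k in range(max_size + 1))

# Forward stepwise: the null model, then p - k candidate models at step k.
forward_stepwise_models = 1 + sum(p - k for k in range(max_size))

print(f"best subset: about 10^{len(str(best_subset_models)) - 1} models")
print(f"forward stepwise: {forward_stepwise_models} models")
```

Forward stepwise needs a few hundred thousand fits, while best subset needs a count with more than a hundred digits.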

New cards

Which of the following most likely suggests the presence of multicollinearity in a multiple linear regression model?

A. Small R² and small t-statistics
B. Small R² and large t-statistics
C. Large R² and small t-statistics
D. Large R² and large t-statistics
E. Small variance inflation factors (VIFs) for all predictors

Answer: C. Large R² and small t-statistics

Explanation:

Large R² means the model explains a lot of the variance in the response variable overall.
Small t-statistics for individual coefficients mean the predictors are not individually significant. This happens because multicollinearity inflates the standard errors of the coefficients, reducing the t-statistics.
The combination is the classic diagnostic sign of multicollinearity: the model fits well globally (large R²), but you cannot trust the individual predictor contributions (small t-stats).

Why others are wrong:

A & B: Small R² does not suggest multicollinearity; it suggests poor overall fit.
D: Large R² and large t-stats suggest a well-specified model without multicollinearity.
E: Small VIFs indicate absence of multicollinearity, not presence.

New cards

A simple linear regression of exam scores (y) on hours studied (x) gives R² = 0.55. Which 2 statements are true?

A. 55% of the variation in exam scores is explained by the linear relationship with hours studied.
B. 45% of the variation in hours studied is explained by the linear relationship with exam scores.
C. The correlation between hours studied and exam scores is approximately -0.74 or 0.74.
D. Adding more predictors to the model will likely decrease R².
E. The adjusted R² is equal to the R² because the model is a simple linear regression.

Answer: A and C

Explanation:

A is TRUE. R² = 0.55 means 55% of the variance in the response variable (exam scores) is explained by the predictor (hours studied).
B is FALSE. The 45% is the unexplained variation in exam scores, not in hours studied. R² is not symmetric in that way.
C is TRUE. In simple linear regression, the correlation r satisfies r² = R². So |r| = √0.55 ≈ 0.74. The sign could be positive or negative, so r ≈ ±0.74.
D is FALSE. Adding more predictors never decreases R²; it stays the same or increases (even if the new predictors are weak).
E is FALSE. With one predictor and R² < 1, adjusted R² is strictly smaller than R², although the gap shrinks as n grows. Formula:
R²_adj = 1 − (1 − R²) × [(n − 1) / (n − p − 1)]
Since (n − 1)/(n − 2) > 1 when p = 1, R²_adj < R².
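The arithmetic in C and E can be reproduced directly. The sample sizes below are hypothetical, since the card does not give n.

```python
import math

r_squared, p = 0.55, 1

# Statement C: |r| is the square root of R² in simple linear regression.
abs_r = math.sqrt(r_squared)
print(round(abs_r, 4))


def adjusted_r_squared(r2, n, p):
    """R²_adj = 1 - (1 - R²)(n - 1)/(n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Statement E: adjusted R² stays below R² for every sample size tried.
for n in (10, 100, 10_000):  # hypothetical sample sizes
    print(n, round(adjusted_r_squared(r_squared, n, p), 4))
```

The adjusted value approaches 0.55 as n grows but never reaches it.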

New cards

Q: For any statistical learning method, which of the following increases monotonically as flexibility increases?

I. Training MSE
II. Test MSE
III. Bias squared
IV. Variance

A. I and III
B. I and IV
C. II and III
D. II and IV

E. The correct answer is not given by (A), (B), (C), or (D).

Answer: E. The correct answer is not given by (A), (B), (C), or (D).

Explanation:

Quantity	Behavior as flexibility ↑
I. Training MSE	Monotonically decreases (not increases)
II. Test MSE	U-shaped: decreases then increases (not monotonic)
III. Bias squared	Monotonically decreases (not increases)
IV. Variance	Monotonically increases ✓

Only IV (Variance) increases monotonically with flexibility.
Since none of the answer choices list only IV, the correct choice is E.

New cards

Q: Which of the following statements about the logit model is FALSE?

A. It gives numerical results that are quite similar to those given by the probit model.

B. The success probability is:

         e^(xᵢᵀβ)
πᵢ = ─────────────
       1 + e^(xᵢᵀβ)

C. logit(πᵢ) = ln(πᵢ / (1 − πᵢ)) = xᵢᵀβ, where πᵢ is the success probability.

D. The model applies when the response variable counts the number of events occurring.

E. The model applies when the explanatory variables are both continuous and categorical.

Answer: D

Explanation:

A is TRUE. Logit and probit models produce similar results in practice, especially when the data are not extreme.
B is TRUE. This is the standard logistic function for the success probability πᵢ.
C is TRUE. This defines the logit link function, which is the natural logarithm of the odds.
D is FALSE. The logit model applies when the response variable is binary (success/failure). For count data (number of events), a Poisson regression model would be more appropriate.
E is TRUE. Logit models can handle both continuous and categorical explanatory variables (using dummy variables for categorical predictors).
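Statements B and C describe the same relationship from two directions: the logistic function maps the linear predictor to a probability, and the logit maps it back. A minimal sketch, with `eta` standing in for xᵢᵀβ (a name chosen here):

```python
import math


def logistic(eta):
    """Success probability from the linear predictor (statement B)."""
    return math.exp(eta) / (1 + math.exp(eta))


def logit(pi):
    """Log-odds of the success probability (statement C)."""
    return math.log(pi / (1 - pi))


for eta in (-5.0, 0.0, 1.3, 5.0):
    pi = logistic(eta)
    print(eta, round(pi, 4), round(logit(pi), 4))
```

Every output probability lies strictly between 0 and 1, which suits a binary response and not a count.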

New cards

Q: Determine which of the following statements is/are true about resampling methods.

I. Unlike the validation set approach, each observation is used for training at some point during the k-fold cross-validation procedure.

II. Bootstrapping can be used to select the appropriate level of model flexibility.

III. Cross-validation is used to measure the accuracy of a parameter estimate.

A. I only
B. II only
C. III only
D. I, II, and III
E. The correct answer is not given by (A), (B), (C), or (D).

Answer: A. I only

Explanation:

I is TRUE. In the validation set approach, each observation is used either for training or for validation (never both). In k-fold CV, each observation is used for training (k − 1) times and for validation exactly once.
II is FALSE. Bootstrapping is used to measure the accuracy (standard error/confidence intervals) of a parameter estimate, not to select model flexibility. Cross-validation is used for selecting flexibility.
III is FALSE. Cross-validation is used to select the appropriate level of model flexibility (e.g., tuning parameters), not to measure the accuracy of a parameter estimate. Bootstrapping is used for measuring accuracy of estimates.

Method	Primary Use
Cross-validation	Model selection / tuning flexibility
Bootstrap	Estimating accuracy (SE, CI) of estimates

New cards

Which of the following statements about Poisson regression models is FALSE?

A. They are used to model count or frequency data.

B. They assume exposure is constant when modeling the rate at which events occur.

C. Explanatory variables (besides exposure) can be either continuous or categorical.

D. They incorporate a logarithmic link function.

E. For a binary explanatory variable xⱼ, the rate ratio is:

E[Yᵢ | xⱼ = 1]     
──────────────── = e^(βⱼ)
E[Yᵢ | xⱼ = 0]

Answer: B

Explanation:

A is TRUE. Poisson regression is designed for count or frequency data (e.g., number of accidents, hospital visits).
B is FALSE. Poisson regression does not assume constant exposure. It can handle varying exposure times by including an offset term (log(exposure)) in the model.
C is TRUE. Explanatory variables can be continuous, categorical, or a mix.
D is TRUE. The link function is the natural logarithm: log(μᵢ) = xᵢᵀβ.
E is TRUE. For a binary predictor, the rate ratio is e^(βⱼ), which compares the expected count when xⱼ = 1 versus when xⱼ = 0.
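The offset in B and the rate ratio in E fit together as follows. The coefficients and the exposure are made-up values, and `expected_count` is a hypothetical helper.

```python
import math

alpha, beta = -2.0, 0.4  # hypothetical coefficients, made up for illustration


def expected_count(x, exposure):
    """Mean count under log(mu) = log(exposure) + alpha + beta * x."""
    return math.exp(math.log(exposure) + alpha + beta * x)


exposure = 3.5  # hypothetical exposure, e.g. policy-years

# Rate ratio for the binary predictor at a fixed exposure.
rate_ratio = expected_count(1, exposure) / expected_count(0, exposure)
print(rate_ratio, math.exp(beta))

# Doubling exposure doubles the expected count; the rate itself is unchanged.
print(expected_count(0, 2 * exposure) / expected_count(0, exposure))
```

The exposure cancels in the ratio, so e^(βⱼ) is the rate ratio whatever the exposure is.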

New cards

You are performing PCA on a data set with 50 observations from three independent continuous variables. Which statements are true?

I. The maximum number of principal components that can be extracted from this data is three.

II. The first principal component represents the direction along which the data vary the most.

III. The third principal component will be orthogonal to the first principal component.

A. I only
B. II only
C. III only
D. I, II, and III
E. The answer is not given by (A), (B), (C), or (D).

Answer: D. I, II, and III

Explanation:

I is TRUE. The maximum number of principal components equals the number of variables (p = 3), regardless of the number of observations (n = 50). (Technically, the maximum is min(n − 1, p), which is min(49, 3) = 3.)
II is TRUE. The first principal component is the direction in feature space that maximizes the variance of the projected data — i.e., the direction of greatest variation.
III is TRUE. All principal component directions are mutually orthogonal, and the resulting component scores are uncorrelated. Therefore, PC3 is orthogonal to PC1.

New cards

In a linear model, coefficients are estimated by minimizing:

         n
   min   ∑ (yᵢ − β₀ − β₁xᵢ,₁ − ... − βₚxᵢ,ₚ)²
  β₀,…,βₚ i=1

subject to:

  p
  ∑ |βⱼ| ≤ s,   for s ≥ 0
 j=1

Which 2 statements are true?

A. This shrinkage method is known as ridge regression.

B. This shrinkage method performs variable selection on the predictors.

C. This shrinkage method is not useful when dealing with high dimensions.

D. As s increases, the squared bias will decrease.

E. As s increases, the test RSS will increase.

Answer: B and D

Explanation:

A is FALSE. This is Lasso regression (L₁ penalty). Ridge uses:

  p
  ∑ βⱼ² ≤ s
 j=1

B is TRUE. Lasso performs variable selection because it shrinks some coefficients exactly to 0, effectively removing those predictors from the model.
C is FALSE. Lasso is very useful in high dimensions because it:
- performs variable selection (removes irrelevant features)
- helps prevent overfitting
- improves interpretability
- is computationally efficient to fit
D is TRUE. As s increases, the constraint becomes less restrictive → model gains flexibility → bias² decreases (model fits the training data better).
E is FALSE. As s increases, test error (RSS) typically follows a U-shape: it decreases then increases. Test RSS does not monotonically increase.
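The contrast between A and B is easiest to see in a special case where both methods have closed forms: an identity design matrix, with the penalized form of each problem (penalty λ in place of budget s, where larger λ corresponds to smaller s). That special case, the λ value, and the starting estimates are all assumptions for illustration.

```python
def lasso_coef(b, lam):
    """Soft-thresholding: lasso solution when the design matrix is the identity."""
    if b > lam / 2:
        return b - lam / 2
    if b < -lam / 2:
        return b + lam / 2
    return 0.0  # small coefficients are set exactly to zero


def ridge_coef(b, lam):
    """Ridge solution in the same special case: proportional shrinkage."""
    return b / (1 + lam)


ols = [3.0, -0.4, 0.2, -1.5]  # hypothetical least squares estimates
lam = 1.0                     # hypothetical penalty

lasso = [lasso_coef(b, lam) for b in ols]
ridge = [ridge_coef(b, lam) for b in ols]

print("lasso:", lasso)
print("ridge:", ridge)
```

Lasso zeroes out the two small coefficients, while ridge shrinks all four and leaves none at zero.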

New cards

Given independent Poisson random variables Y₁, Y₂, …, Yₙ with means μ₁, μ₂, …, μₙ. The following Poisson regression with log link is performed:

ln(μ) = α + βX

where μ = E[Y] and X is the explanatory variable.

Which statements are true?

I. If β = 0, the model claims that there is no correlation between X and Y.

II. If β > 0, the model claims that there is a positive correlation between X and Y.

III. If β < 0, then μ = E[Y] can be less than 0.

A. I only
B. II only
C. III only
D. I, II, and III
E. The correct answer is not given by (A), (B), (C), or (D).

Answer: E. The correct answer is not given by (A), (B), (C), or (D).

Explanation:

I is TRUE. If β = 0, then ln(μ) = α, so μ is constant and does not depend on X. Thus, there is no relationship (correlation) between X and Y.
II is TRUE. If β > 0, then as X increases, ln(μ) increases, so μ = e^(α+βX) increases. This indicates a positive relationship between X and Y.
III is FALSE. The log link ensures that μ = e^(α+βX) > 0 for all values of α and β. The mean of a Poisson distribution cannot be negative, regardless of the sign of β.

New cards

Determine which of the following statements are true.

I. For a binary response variable with a continuous explanatory variable, logistic regression is an inappropriate method of statistical analysis.

II. Ordinal variables are a type of continuous explanatory variable.

III. ANOVA is a useful approach for analyzing the means of groups of continuous response variables, where the groups are categorical.

A. I only
B. II only
C. III only
D. I, II, and III
E. The answer is not given by (A), (B), (C), or (D).

Answer: C. III only

Explanation:

I is FALSE. For a binary response variable, logistic regression is appropriate and is actually the standard method of analysis. It models the probability of success using the logit link.
II is FALSE. Ordinal variables are a type of categorical explanatory variable (with ordered categories), not continuous. Continuous variables take numeric values on an interval/ratio scale (e.g., height, temperature).
III is TRUE. ANOVA (Analysis of Variance) is used to compare the means of a continuous response variable across groups defined by one or more categorical predictors.

New cards

A GLM with n observations and p predictors exhibits severe overdispersion. You plan to address the issue by implementing the quasi-likelihood approach where φ is an extra parameter.

Which statements are true?

I. The deviance of the model is greater than n − p − 1.

II. Implementing the quasi-likelihood approach changes the coefficient estimates from β̂ to φ·β̂.

III. Implementing the quasi-likelihood approach changes the variance-covariance matrix from (XᵀWX)⁻¹ to φ·(XᵀWX)⁻¹.

A. None are true
B. I and II only
C. I and III only
D. II and III only
E. The correct answer is not given by (A), (B), (C), or (D).

Answer: C. I and III only

Explanation:

I is TRUE. To detect overdispersion, compare the deviance to its degrees of freedom (n − p − 1). If deviance > n − p − 1, this indicates overdispersion (the variance exceeds what the assumed distribution allows).
II is FALSE. The quasi-likelihood approach does not change the coefficient estimates β̂. The estimates remain the same as from the standard GLM; only the standard errors are adjusted.
III is TRUE. The quasi-likelihood approach scales the variance-covariance matrix by φ:

Var(β̂) = φ · (XᵀWX)⁻¹

This inflates the standard errors to account for the extra variability in the data.
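A minimal sketch of the adjustment, assuming an intercept-only Poisson model (so p = 0) and assuming φ is estimated as the Pearson chi-square statistic divided by n − p − 1, which is one common choice. The counts in `y` are hypothetical.

```python
import math

y = [0, 1, 1, 2, 3, 5, 8, 12]   # hypothetical overdispersed counts
n, p = len(y), 0                # intercept-only model, so p = 0 predictors

mu_hat = sum(y) / n             # Poisson MLE of the mean for an intercept-only model
alpha_hat = math.log(mu_hat)    # coefficient on the log-link scale (not changed by phi)

# Assumed dispersion estimate: Pearson chi-square over residual degrees of freedom
pearson = sum((yi - mu_hat) ** 2 / mu_hat for yi in y)
phi_hat = pearson / (n - p - 1)

# Var(alpha_hat) = 1 / (n * mu_hat) under Poisson; quasi-likelihood multiplies it by phi
se_poisson = math.sqrt(1.0 / (n * mu_hat))
se_quasi = math.sqrt(phi_hat) * se_poisson

print(round(phi_hat, 3), round(se_poisson, 4), round(se_quasi, 4))
```

The coefficient estimate `alpha_hat` is computed once and never rescaled; only its standard error grows, by a factor of √φ̂.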

New cards

Determine which one of the following statements about Principal Component Regression (PCR) is FALSE.

A. When performing PCR it is recommended that the modeler standardize each predictor prior to generating the principal components.

B. PCR is useful for performing feature selection.

C. PCR assumes that the directions in which features show the most variation are the directions that are associated with the target.

D. PCR can reduce overfitting.

E. The first principal component direction of the data is that along which the observations vary the most.

Answer: B

Explanation:

A is TRUE. Standardization is recommended so that high-variance variables (e.g., measured in large units) do not dominate the principal components.
B is FALSE. PCR does not perform feature selection. It creates linear combinations (principal components) of all predictors. Unlike Lasso, it does not set any coefficients to zero or exclude original variables.
C is TRUE. PCR relies on the assumption that the directions of greatest variation in the predictors are also the most relevant for predicting the response.
D is TRUE. By using only the first few principal components (rather than all p predictors), PCR reduces the effective number of parameters, which helps reduce overfitting.
E is TRUE. The first principal component is the direction that maximizes the variance of the projected data.
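Statements B and E can be illustrated with two standardized predictors, where the first principal component direction has a closed form. This is a sketch on hypothetical data, not a general PCA routine.

```python
import math

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]        # hypothetical predictor
x2 = [12.0, 9.0, 33.0, 38.0, 41.0, 65.0]   # hypothetical predictor on a larger scale

def standardize(v):
    m = sum(v) / len(v)
    s = math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))
    return [(a - m) / s for a in v]

z1, z2 = standardize(x1), standardize(x2)
n = len(z1)

def projected_variance(theta):
    """Sample variance of the data projected onto the unit vector (cos theta, sin theta)."""
    scores = [math.cos(theta) * a + math.sin(theta) * b for a, b in zip(z1, z2)]
    return sum(s ** 2 for s in scores) / (n - 1)   # scores have mean 0 after standardizing

r = sum(a * b for a, b in zip(z1, z2)) / (n - 1)   # sample correlation
# For a 2x2 correlation matrix the first PC points along (1, 1) or (1, -1), rescaled to unit length
theta1 = math.pi / 4 if r >= 0 else -math.pi / 4
loadings = (math.cos(theta1), math.sin(theta1))
pc1_variance = projected_variance(theta1)

# Compare against every whole-degree direction in the half circle
grid_max = max(projected_variance(k * math.pi / 180) for k in range(180))
print(loadings, round(pc1_variance, 4), round(grid_max, 4))
```

Both loadings are nonzero, so the component still uses every original predictor, and no other direction on the grid has a larger projected variance.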

New cards

Determine which of the following statements about flexibility and interpretability of statistical learning methods are FALSE. (Select 2 options.)

A. Linear regression is a relatively inflexible approach because it can only generate linear relationships between predictors and the response.

B. If we are mainly interested in inference, then restrictive models are much more interpretable.

C. Subset selection is more interpretable than linear regression.

D. Bagging and boosting are more restrictive than linear regression.

E. Highly flexible methods make it easy to discern how changes in predictors affect the response.

Answer: D and E

Explanation:

A is TRUE. Linear regression is inflexible because it assumes a linear relationship between predictors and the response.
B is TRUE. For inference, simpler/restrictive models are preferred because they are easier to interpret. Complex models obscure how individual predictors affect the response.
C is TRUE. Subset selection is more interpretable than linear regression because it reduces the number of predictors in the model.

Flexibility/Interpretability Spectrum:

More Interpretable / Less Flexible	Moderately Flexible	Less Interpretable / More Flexible
Lasso	Least Squares	Bagging
Subset selection	Regression Trees	Boosting
	Classification Trees

D is FALSE. Bagging and boosting are more flexible (less restrictive) than linear regression, not more restrictive.
E is FALSE. Highly flexible methods are less interpretable. Their complexity makes it difficult to understand how changes in predictors affect the response.

New cards

A linear regression model is fitted to a dataset. The coefficients are estimated by minimizing:

min Σ (yᵢ − β₀ − Σ βⱼ xᵢ,ⱼ)²   subject to   Σ βⱼ² ≤ s

Which scenario will likely occur as s increases from 0?

A. The training residual sum of squares will steadily increase.

B. The test residual sum of squares will increase initially, and then finally decrease, following an inverted U shape.

C. The variance will steadily decrease.

D. The squared bias will decrease initially, and then finally increase, following a U shape.

E. The irreducible error will remain constant.

Answer: E

Explanation:

This is Ridge Regression (L₂ penalty). As s increases:

Constraint becomes less restrictive → model gains flexibility
Variance → increases
Squared bias → decreases
Training RSS → decreases
Test RSS → U-shaped (decreases then increases)
Irreducible error → remains constant ✓

Irreducible error is the noise inherent in the data (Var(ε)) and does not depend on model flexibility.

New cards

In a linear model, coefficients are estimated by minimizing:

min Σ (yᵢ − β₀ − β₁xᵢ,₁ − ... − βₚxᵢ,ₚ)² + λ Σ βⱼ²

where λ ≥ 0.

Which 3 statements are true?

A. This shrinkage method is known as ridge regression.

B. As λ increases, Σ βⱼ² decreases.

C. As λ increases, it is not possible for an individual βⱼ to increase in absolute value.

D. As λ increases, the training RSS will increase.

E. As λ increases, the variance will increase.

Answer: A, B, and D

Explanation:

A is TRUE. This is Ridge Regression (L₂ penalty). Lasso would use Σ|βⱼ| instead.
B is TRUE. As λ increases, the penalty becomes stronger, forcing the overall sum of squared coefficients (Σ βⱼ²) to decrease toward zero.
C is FALSE. While the overall Σ βⱼ² must decrease, it is possible for an individual coefficient βⱼ to increase in absolute value as λ increases, as long as other coefficients shrink enough to compensate.
D is TRUE. Larger λ → more shrinkage → less flexible model → worse fit to training data → training RSS increases.
E is FALSE. As λ increases, variance decreases (model becomes more stable) while bias increases (the bias-variance trade-off).
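Statements B, C, and D can be seen in a small two-predictor example. The sketch assumes centered data with no intercept and solves the 2×2 ridge normal equations directly; all data values are hypothetical.

```python
def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

x1 = center([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])      # hypothetical predictor
x2 = center([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])      # hypothetical, correlated with x1
y = center([3.1, 2.9, 7.2, 6.8, 11.1, 10.9])     # hypothetical response

def ridge_fit(lam):
    """Solve (X'X + lam * I) beta = X'y for two centered predictors."""
    a = sum(u * u for u in x1) + lam
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2) + lam
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))
    det = a * d - b * b
    return ((d * c1 - b * c2) / det, (a * c2 - b * c1) / det)

def training_rss(beta):
    return sum((t - beta[0] * u - beta[1] * v) ** 2 for u, v, t in zip(x1, x2, y))

lambdas = [0.0, 1.0, 10.0, 100.0]
fits = [ridge_fit(lam) for lam in lambdas]
sum_sq = [b1 ** 2 + b2 ** 2 for b1, b2 in fits]
rss = [training_rss(beta) for beta in fits]

for lam, beta, s, r in zip(lambdas, fits, sum_sq, rss):
    print(lam, [round(b, 4) for b in beta], round(s, 4), round(r, 4))
```

On these numbers the sum of squared coefficients falls and the training RSS rises at every step, yet the first coefficient is slightly larger at λ = 1 than at λ = 0, which is the situation statement C rules out incorrectly.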

New cards

Determine which of the following statements about goodness-of-fit in Poisson regression is FALSE.

A. Pearson residuals are used because they are approximately homoscedastic.

B. The Pearson goodness-of-fit test statistic has a chi-squared distribution if the Poisson model is the correct model.

C. The coefficient of determination R² from linear regression can be used in Poisson regression.

D. The Akaike information criterion is used to assess model fit and penalizes for model complexity.

E. Pearson's goodness-of-fit test uses the difference between observed and expected counts squared, divided by the expected counts.

Answer: C

Explanation:

A is TRUE. Pearson residuals are constructed to be approximately homoscedastic (constant variance), which aids in assessing model fit.
B is TRUE. Under the correct Poisson model, the Pearson goodness-of-fit statistic approximately follows a chi-squared distribution (with appropriate degrees of freedom).
C is FALSE. R² from linear regression is not applicable in Poisson regression (or other GLMs). Instead, model fit is assessed using:
- Deviance
- Pearson chi-square statistic
- AIC (Akaike Information Criterion)
- BIC (Bayesian Information Criterion)
D is TRUE. AIC assesses goodness-of-fit and includes a penalty for the number of parameters to prevent overfitting.
E is TRUE. This correctly describes the Pearson goodness-of-fit statistic:

χ² = Σ (yᵢ − μ̂ᵢ)² / μ̂ᵢ

where yᵢ is the observed count and μ̂ᵢ is the expected (fitted) count.
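As a small worked illustration, the Pearson statistic is the sum of squared Pearson residuals. The observed counts and fitted means below are hypothetical.

```python
import math

observed = [2, 0, 5, 3, 7]             # hypothetical counts
expected = [1.5, 1.0, 4.0, 3.5, 7.0]   # hypothetical fitted Poisson means

# Pearson residual: raw residual scaled by the Poisson standard deviation sqrt(mu)
pearson_residuals = [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]

# Pearson goodness-of-fit statistic: (observed - expected)^2 / expected, summed
pearson_chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print([round(r, 3) for r in pearson_residuals], round(pearson_chi2, 4))
```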

New cards

Determine which of the following statements about leave-one-out cross-validation (LOOCV) are true. (Select 3 options.)

A. LOOCV involves using almost the entire dataset for training, with just one observation set aside for validation.

B. The LOOCV approach does not depend on random sampling of the data, thus yielding consistent results upon repeated applications.

C. LOOCV uses a single observation from the dataset for the validation each time, which leads to high variance in the error estimates.

D. LOOCV is computationally less expensive than k-fold cross-validation.

E. The estimate for test MSE in LOOCV is the average of the errors from each single validation.

Answer: A, B, and C

Explanation:

A is TRUE. Each LOOCV model is trained on n − 1 observations (almost the entire dataset) and validated on the single left-out observation.
B is TRUE. LOOCV is deterministic — there is no random splitting. Each observation is left out exactly once, so repeated applications yield identical results.
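C is TRUE. Each validation error is based on a single observation, so it is noisy, and the n fitted models are trained on nearly identical data, so their errors are highly correlated. Averaging highly correlated quantities reduces variance very little, which gives the LOOCV error estimate high variance.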
D is FALSE. LOOCV requires fitting the model n times (once per observation). This is computationally more expensive than k-fold CV (which requires only k fits), especially for large n or complex models.
E is FALSE. The LOOCV estimate for test MSE is the average of the squared errors from each validation step, not the average of the raw errors.
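A minimal sketch of LOOCV for simple linear regression on hypothetical data. It refits the line n times, averages the squared errors, and compares the result with the leverage-based shortcut that holds for least squares fits.

```python
def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b1 * xbar, b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # hypothetical data
ys = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.6]
n = len(xs)

def loocv_mse(xs, ys):
    """Leave each observation out once, refit, and average the squared errors."""
    sq_errors = []
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        sq_errors.append((ys[i] - (b0 + b1 * xs[i])) ** 2)
    return sum(sq_errors) / len(sq_errors)

cv_brute = loocv_mse(xs, ys)

# Shortcut for least squares: one fit on all data, residuals inflated by 1 / (1 - leverage)
b0, b1 = fit_line(xs, ys)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)
leverages = [1 / n + (x - xbar) ** 2 / sxx for x in xs]
cv_shortcut = sum(((y - (b0 + b1 * x)) / (1 - h)) ** 2
                  for x, y, h in zip(xs, ys, leverages)) / n

print(round(cv_brute, 6), round(cv_shortcut, 6))
```

There is no random split anywhere in the procedure, so rerunning it returns the same number. The shortcut avoids the n refits in the least squares case only; for other models the n fits are generally required.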

New cards

You have fit a smoothing spline model using the tuning parameter λ.

Which statements are true?

I. Larger values of λ result in smoother splines.

II. Larger values of λ result in greater effective degrees of freedom for the model.

III. Larger values of λ result in a more biased model.

A. I only
B. II only
C. III only
D. I, II, and III
E. The answer is not given by (A), (B), (C), or (D).

Answer: E (I and III are true; II is false)

Explanation:

I is TRUE. λ controls the roughness penalty:
- λ = 0 → no penalty → interpolates the data (very wiggly)
- λ → ∞ → infinite penalty → forces a linear fit (very smooth)
So larger λ → smoother spline.
II is FALSE. As λ increases, the effective degrees of freedom decreases from n (at λ = 0) down to 2 (at λ = ∞). Larger λ imposes more constraints, reducing flexibility.
III is TRUE. Bias-variance trade-off:
- Larger λ → smoother (more constrained) → higher bias
- Smaller λ → more wiggly (less constrained) → higher variance

New cards

You want to perform a regression of Y onto predictors X₁, X₂, ..., Xₚ, using a large number of observations, and are considering the following modeling techniques:

Lasso Regression
Partial Least Squares
Principal Component Analysis
Ridge Regression

How many of the above modeling procedures perform variable selection?

A. 0
B. 1
C. 2
D. 3
E. 4

Answer: B (1)

Explanation:

Method	Variable Selection?
Lasso Regression	✅ YES — L₁ penalty shrinks some coefficients exactly to 0, effectively removing predictors from the model.
Partial Least Squares	❌ NO — Uses linear combinations of all predictors to create latent variables; does not exclude original variables.
Principal Component Analysis	❌ NO — PCA is a dimension reduction technique, but it uses all variables to create principal components. It does not perform variable selection (it creates new variables).
Ridge Regression	❌ NO — L₂ penalty shrinks coefficients toward 0 but never sets them exactly to 0. All predictors remain in the model.
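The difference between the two penalties is easiest to see in a special case. The sketch assumes an orthonormal design (n = p, identity design matrix, no intercept), where ridge and lasso have closed forms; the coefficient values and λ are hypothetical.

```python
def ridge_orthonormal(y_j, lam):
    """Ridge solution in the orthonormal case: proportional shrinkage."""
    return y_j / (1.0 + lam)

def lasso_orthonormal(y_j, lam):
    """Lasso solution in the orthonormal case: soft-thresholding at lam / 2."""
    if y_j > lam / 2:
        return y_j - lam / 2
    if y_j < -lam / 2:
        return y_j + lam / 2
    return 0.0

ols = [3.0, -0.4, 0.9, -2.5]   # hypothetical least squares coefficients
lam = 2.0                      # hypothetical tuning parameter

ridge = [ridge_orthonormal(b, lam) for b in ols]
lasso = [lasso_orthonormal(b, lam) for b in ols]

print([round(b, 4) for b in ridge])
print(lasso)
```

Ridge shrinks every coefficient but leaves all of them nonzero, while the lasso sets the two small coefficients exactly to 0.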

New cards

Consider the simple linear regression model:

Y = β₀ + β₁x + ε,   where ε ~ Normal(0, σ²)

Which statements about the OLS estimators β̂₀ and β̂₁ are true? (Select 3 options.)

A. β̂₁ is an unbiased estimator of β₁.

B. β̂₀ equals the sample mean of Y for an observation with x = 0.

C. The formula for β̂₁ is:

Σ (xᵢ − x̄)(Yᵢ − Ȳ)
─────────────────────
   Σ (xᵢ − x̄)²

D. β̂₀ and β̂₁ are not independent.

E. β̂₀ is assumed to have a normal distribution with variance σ².

Answer: A, C, and D

Explanation:

A is TRUE. β̂₁ is unbiased: E[β̂₁] = β₁.
B is FALSE. β̂₀ = Ȳ − β̂₁·x̄. It equals Ȳ only when x̄ = 0, not when an individual observation x = 0.
C is TRUE. This is the standard OLS formula for the slope coefficient.
D is TRUE. β̂₀ and β̂₁ are not independent in general: β̂₀ = Ȳ − β̂₁·x̄ depends on β̂₁, and Cov(β̂₀, β̂₁) = −σ²·x̄ / Σ(xᵢ − x̄)², which is nonzero unless x̄ = 0.
E is FALSE. β̂₀ is normally distributed with:

Mean:     β₀

Variance: σ² · [ 1/n + x̄² / Σ(xᵢ − x̄)² ]
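The formulas in this card can be evaluated on a small hypothetical data set. The error variance `sigma2` is assumed known purely for illustration.

```python
xs = [1.0, 2.0, 4.0, 5.0, 8.0]    # hypothetical predictor values
ys = [2.0, 3.0, 6.0, 7.0, 12.0]   # hypothetical responses
n = len(xs)

xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# Slope and intercept from the OLS formulas
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar

sigma2 = 1.0                                  # assumed known error variance
var_b0 = sigma2 * (1 / n + xbar ** 2 / sxx)   # variance of the intercept estimator
var_b1 = sigma2 / sxx                         # variance of the slope estimator
cov_b0_b1 = -sigma2 * xbar / sxx              # covariance between the two estimators

print(round(b0, 4), round(b1, 4), round(var_b0, 4), round(cov_b0_b1, 4))
```

Here Var(β̂₀) is not σ², and the covariance is negative because x̄ > 0, so the two estimators are not independent.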

New cards

Determine which of the following statements about forward stepwise selection are true. (Select 2 options.)

A. The most statistically significant variable is dropped at each step.

B. If p is the number of potential predictors, then p(p+1)/2 models have to be fitted.

C. The predictors in the k-variable model must be a subset of the predictors in the (k+1)-variable model.

D. At each iteration, the variable chosen is the one that minimizes the training residual sum of squares.

E. Forward subset selection cannot be used when the number of variables is greater than the number of observations.

Answer: C and D

Explanation:

A is FALSE. Forward stepwise adds variables; it does not drop them. Dropping the least significant variable is characteristic of backward stepwise selection.
B is FALSE. The number of models fitted in forward stepwise selection is:

1 + p(p+1)/2

(not p(p+1)/2). This includes the null model (with 0 predictors).

C is TRUE. Forward stepwise builds models sequentially by adding one variable at a time. Therefore, the k-variable model is always a nested subset of the (k+1)-variable model.
D is TRUE. At each step, the variable that gives the greatest improvement in fit is added. This is equivalent to choosing the variable that minimizes the training RSS (or maximizes R²).
E is FALSE. Forward stepwise can be used when p > n because it starts with 0 predictors and adds one at a time. It is backward stepwise that cannot be used when p > n (since it starts with all predictors).
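The count in statement B can be verified by adding up the fits at each step. The value p = 10 is just an example, and the 2^p count for best subset selection is included for comparison.

```python
def forward_stepwise_model_count(p):
    """Null model plus (p - k) candidate fits at step k, for k = 0, ..., p - 1."""
    return 1 + sum(p - k for k in range(p))

def best_subset_model_count(p):
    """Best subset selection considers every subset of the p predictors."""
    return 2 ** p

p = 10   # example number of candidate predictors
forward_count = forward_stepwise_model_count(p)
best_subset_count = best_subset_model_count(p)

print(forward_count, best_subset_count)
```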

New cards

Determine which of the following results indicates model overfitting.

I. A pseudo R² of 1

II. A deviance of 0

III. A maximized log-likelihood of 0

A. I only
B. II only
C. III only
D. I, II, and III
E. The correct answer is not given by (A), (B), (C), or (D).

Answer: B (II only)

Explanation:

Definitions:

Null model: model with no predictors (intercept only), log-likelihood = l_null
Saturated model: perfect fit (one parameter per observation), log-likelihood = l_sat
Fitted model: log-likelihood = l(β̂)

l_null ≤ l(β̂) ≤ l_sat

Overfitting occurs when l(β̂) ≈ l_sat.

I is FALSE. Pseudo R² is defined as:

R_pse² = 1 − l(β̂) / l_null

If l(β̂) = l_sat, then R_pse² = 1 − l_sat/l_null. This equals 1 only if l_sat = 0, which is not generally true. So R_pse² = 1 does not necessarily indicate overfitting.

II is TRUE. Deviance is defined as:

D = 2 · [l_sat − l(β̂)]

If the model is overfitted, l(β̂) ≈ l_sat, so D ≈ 0. A deviance of 0 indicates perfect fit → overfitting.

III is FALSE. A maximized log-likelihood of 0 (l(β̂) = 0) does not inherently indicate overfitting. It depends on the values of l_null and l_sat.
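A numeric check of statements I and II, assuming a Poisson response and using the pseudo R² definition given in this card. The counts are hypothetical and chosen to be positive so that log(y) is defined for the saturated model.

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood: sum of y*log(mu) - mu - log(y!)."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1) for yi, mi in zip(y, mu))

y = [1, 2, 4, 7]            # hypothetical counts
ybar = sum(y) / len(y)

l_null = poisson_loglik(y, [ybar] * len(y))        # intercept-only: every mean equals ybar
l_sat = poisson_loglik(y, [float(v) for v in y])   # saturated: each mean equals its own y

def deviance(l_fit):
    return 2.0 * (l_sat - l_fit)

def pseudo_r2(l_fit):
    return 1.0 - l_fit / l_null   # the definition used in this card

print(round(l_null, 4), round(l_sat, 4))
print(deviance(l_sat), round(pseudo_r2(l_sat), 4))
```

A model that reaches the saturated log-likelihood has deviance exactly 0, but its pseudo R² under this definition is well below 1 on these numbers.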

New cards

Determine which of the following statements regarding the adjusted R² are true. (Select 2 options.)

A. Adjusted R² never falls below 0.

B. The maximum possible value for the adjusted R² is 1.

C. Adjusted R² is always greater than or equal to the associated unadjusted R².

D. Compared to AIC and BIC, adjusted R² is not well supported in statistical theory.

E. Adding more explanatory variables to a model has no impact on the adjusted R².

Answer: B and D

Explanation:

A is FALSE. Adjusted R² can be negative if the model is poor (worse than using just the mean). Unadjusted R² is always between 0 and 1, but adjusted R² is not bounded below by 0.
B is TRUE. The maximum possible value for adjusted R² is 1 (just like unadjusted R²). It cannot exceed 1.
C is FALSE. The relationship is:

R²_adj ≤ R² ≤ 1

Adjusted R² is always less than or equal to unadjusted R², not greater.

D is TRUE. Adjusted R² is considered less theoretically motivated compared to AIC and BIC, which are grounded in information theory and statistical principles.
E is FALSE. Adding variables can decrease adjusted R² if the new variables do not sufficiently improve the model fit. This is the key feature of adjusted R² — it penalizes for adding unnecessary predictors.
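The behavior described above follows directly from the formula R²adj = 1 − (1 − R²)·(n − 1)/(n − p − 1). The R², n, and p values below are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

weak = adjusted_r2(0.05, n=20, p=3)      # poor fit: adjusted R-squared goes negative
perfect = adjusted_r2(1.0, n=20, p=3)    # perfect fit: the maximum value of 1
before = adjusted_r2(0.600, n=20, p=3)   # model before adding a predictor
after = adjusted_r2(0.605, n=20, p=4)    # R-squared barely improves with one more predictor

print(round(weak, 4), perfect, round(before, 4), round(after, 4))
```

The extra predictor raises R² from 0.600 to 0.605 but lowers adjusted R², because the gain in fit does not offset the lost degree of freedom.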

New cards

Determine which of the following statements about GLMs are true.

I. The saturated model has the highest possible deviance.

II. Deviance follows a chi-square distribution for all models in the exponential family.

III. Deviance is a useful measure of goodness of fit for all models in the exponential family.

A. I only
B. II only
C. III only
D. I, II, and III
E. The correct answer is not given by (A), (B), (C), or (D)

Answer: E (None are true)

Explanation:

I is FALSE. The saturated model has the lowest possible deviance (deviance = 0), as it perfectly fits the data. Deviance measures the difference between the fitted model and the saturated model — smaller deviance = better fit.
II is FALSE. Deviance follows a chi-square distribution only asymptotically for non-normal responses. For a normal response, deviance is:

D = Σ (yᵢ − μ̂ᵢ)² / σ²

which follows a chi-square distribution if σ² is known, but not for all exponential family models generally.

III is FALSE. Deviance is not useful for all exponential family models. For the normal distribution, deviance depends on the unknown parameter σ², making it difficult to use for goodness-of-fit assessment. (Other distributions may have similar issues, but this is the classic example.)

New cards

Determine which of the following is NOT an assumption of a simple linear regression model. (Select 2 options.)

(A) The error terms are correlated.
(B) The error terms are normally distributed.
(C) The error terms have a constant variance.
(D) The predictor variable is normally distributed.
(E) The relationship between the predictor variable and the response variable is linear.

Answer: A and D are NOT assumptions.

A — NOT an assumption: The error terms are uncorrelated, not correlated.

B — IS an assumption: The error terms are normally distributed.

C — IS an assumption: The error terms have a constant variance. This property is known as homoscedasticity.

D — NOT an assumption: The predictor variable is not assumed to have a specific distribution; it is assumed to be non-random (fixed/deterministic).

E — IS an assumption: The relationship between the predictor variable and the response variable is assumed to be linear.

Simple linear regression assumptions (summary):

Linearity: E[Y] = β₀ + β₁x
Predictor x is fixed/non-random
Errors are uncorrelated (independent)
Errors have constant variance (homoscedasticity): Var(ε) = σ²
Errors are normally distributed: ε ~ N(0, σ²)

New cards

Determine which of the following statements represents assumptions commonly associated with generalized linear models (GLMs).

(A) The response variable is normally distributed.
(B) The expected value of the response is a linear function of the predictors.
(C) The variance of the response is the same across all levels of predictors.
(D) Explanatory variables are fixed and not random.
(E) The response observations are dependent on each other.

Answer: D

A — FALSE. GLMs do not assume the response variable must be normally distributed. Instead, GLMs allow the response variable to have any distribution from the exponential family (e.g., normal, binomial, Poisson, etc.).

B — FALSE. In GLMs, the expected value of the response is linked to the linear predictors through a link function: η = g(μ), where η is a linear combination of the predictors. The relationship between the predictors and the response mean is not necessarily linear without the transformation by the link function.

C — FALSE. GLMs allow the variance of the response to be a function of its mean, meaning variance can change across different levels of predictors (heteroscedasticity). This differs from the constant variance (homoscedasticity) assumption in ordinary linear regression.

D — TRUE. This is a common assumption in GLMs, similar to ordinary linear regression: explanatory variables are treated as non-random and fixed.

E — FALSE. GLMs generally assume the response observations are independent of each other.

New cards

Determine which of the following statements regarding the model comparison statistics are true. (Select 3 options.)

(A) If the MSE of the full model containing all k predictors is an unbiased estimator of the true error variance, then Cₚ is an unbiased estimator of the test MSE.

(B) When BIC and Cₚ are used as model selection criteria, they give equivalent results.

(C) BIC usually places a heavier penalty per parameter than AIC and Cₚ, favoring models with fewer variables.

(D) AIC and BIC are more general than Cₚ as they are applicable to linear, non-linear, and other general types of models fitted by maximum likelihood.

(E) AIC and BIC provide an indirect estimate of the test error, while Cₚ provides a direct estimate of the test error.

Answer: A, C, D

A — TRUE. This statement is about an unbiasedness property of Cₚ and reinforces that selecting a model with a small Cₚ tends to lead to a model with a small test MSE.

B — FALSE. For linear models fitted by least squares, AIC is proportional to Cₚ, so those two criteria select the same model. BIC uses a different penalty and is not equivalent to Cₚ.

C — TRUE. BIC usually places a heavier penalty per parameter than AIC and Cₚ, favoring models with fewer variables.

D — TRUE. AIC and BIC are more general than Cₚ: they apply to linear, non-linear, and other general types of models fitted by maximum likelihood.

E — FALSE. All four model comparison statistics (adjusted R², Cₚ, AIC, BIC) aim to indirectly estimate the test error by adjusting the training error to account for model complexity. Direct estimation of test error can be achieved through resampling methods such as the validation set approach and cross-validation.

New cards

Determine which of the following results indicates model overfitting.

I. R² = 1
II. Adjusted R² = 1
III. An error sum of squares of 0

(A) None
(B) I and II only
(C) I and III only
(D) II and III only
(E) The correct answer is not given by (A), (B), (C), or (D).

Answer: E — All three statements (I, II, and III) are true.

Overfitting occurs when the model fits the training data too closely. This is evident when the model fits the training data perfectly. When the data is fitted perfectly, the following occurs:

SSR = SST
SSE = 0

As a result:

R² = SSR/SST = 1

R²adj. = 1 − (1 − R²)·[(n − 1)/(n − p − 1)] = 1

New cards

Determine which of the following is a correct pairing of a distribution and its canonical link function. (Select 3 options)

(A) Bernoulli distribution, probit link function
(B) Exponential distribution, inverse link function
(C) Gaussian distribution, identity link function
(D) Inverse Gaussian distribution, inverse squared link function
(E) Poisson distribution, complementary log-log link function

Answer: B, C, D

Canonical link function table:

Distribution	Canonical Link Function
Normal/Gaussian	Identity, μ
Bernoulli	Logit, ln(μ/(1−μ))
Poisson	Logarithmic, ln μ
Gamma (the exponential is a special case)	Inverse, 1/μ
Inverse Gaussian	Inverse Squared, 1/μ²
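The table can be written as a lookup of link functions. This is only an illustrative sketch; the dictionary keys are made-up labels, and the exponential entry reuses the gamma link because the exponential distribution is a gamma distribution with shape 1.

```python
import math

# Canonical link g(mu) for each distribution in the table above
canonical_links = {
    "gaussian": lambda mu: mu,                          # identity
    "bernoulli": lambda mu: math.log(mu / (1.0 - mu)),  # logit
    "poisson": lambda mu: math.log(mu),                 # log
    "gamma": lambda mu: 1.0 / mu,                       # inverse
    "inverse_gaussian": lambda mu: 1.0 / mu ** 2,       # inverse squared
}
canonical_links["exponential"] = canonical_links["gamma"]

# Evaluate every link at the same mean to compare them
eta = {name: g(0.5) for name, g in canonical_links.items()}
print(eta)
```

The probit and complementary log-log links are valid link functions, but they do not appear here because they are not canonical for any of these distributions.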

New cards

Analyst	Model Type
Analyst P	Poisson GLM
Analyst Q	Quasi-Poisson GLM
Analyst R	Negative Binomial GLM

Which of the following statements correctly describe the variance assumptions made by each analyst? (Select 2 options.)

(A) Analyst P assumes that the variance is constant and does not depend on the mean.
(B) Analyst Q allows the variance to be greater than the mean through an additional dispersion parameter.
(C) Analyst Q allows the variance to be smaller than the mean through an additional dispersion parameter.
(D) Analyst R assumes that the variance is equal to the mean.
(E) Analyst R allows the variance to be proportional to the square of the mean.

Answer: B and C

Analyst P (Poisson GLM): In the Poisson distribution, the fundamental property is that the variance equals the mean: Var[Y] = μ. The variance is not constant — it changes with the mean. Therefore, statement A is incorrect.

Analyst Q (Quasi-Poisson GLM): This model relaxes the strict Poisson assumption by allowing the variance to be larger or smaller than the mean through a dispersion parameter. Therefore, statements B and C are both correct.

Analyst R (Negative Binomial GLM): This model allows for overdispersion in a way that increases with the mean, meaning the variance grows faster than the mean as the mean increases. Note that the variance is not proportional to the square of the mean. Hence, statements D and E are both incorrect.

Quick summary — variance structure:

Poisson: Var[Y] = μ (fixed, no dispersion parameter)
Quasi-Poisson: Var[Y] = φμ (φ can be > 1 or < 1)
Negative Binomial: Var[Y] = μ + αμ² (variance grows faster than the mean, α > 0)
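The three variance structures can be compared by looking at the ratio of variance to mean. The values of `phi`, `alpha`, and the means are hypothetical.

```python
def poisson_variance(mu):
    return mu

def quasi_poisson_variance(mu, phi):
    return phi * mu

def negative_binomial_variance(mu, alpha):
    return mu + alpha * mu ** 2

means = [1.0, 5.0, 25.0]   # hypothetical means
phi, alpha = 2.0, 0.5      # hypothetical dispersion parameters

ratios_qp = [quasi_poisson_variance(m, phi) / m for m in means]
ratios_nb = [negative_binomial_variance(m, alpha) / m for m in means]

print(ratios_qp)   # constant ratio: variance is proportional to the mean
print(ratios_nb)   # ratio grows with the mean: variance outpaces the mean
```

With φ < 1 the quasi-Poisson variance falls below the mean, for example `quasi_poisson_variance(4.0, 0.5)` is 2.0, which is why both B and C hold for Analyst Q.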

New cards

A modeler creates a local regression model. After reviewing the results, the fitted line appears too wiggly, over-responding to trends in nearby data points. The modeler would like to adjust the model to produce more intuitive results.

Determine which one of the following adjustments the modeler should make.

(A) Add a linear constraint in the regions before and after the first knot
(B) Increase the number of orders in the regression equation
(C) Increase the number of knots in the model
(D) Reduce the number of knots in the model
(E) Increase the span, s, of the model

Answer: E

Increasing the span, s, would make the fitted line less wiggly (smoother), which is what's needed here.

Answer choices A, C, and D are related to splines, not local regression — they don't apply to fixing a wiggly local regression fit.

Answer choice B would make the fitted line more wiggly (increasing model flexibility), which is the opposite of what's needed.

Key concept — Local Regression (LOESS/LOWESS):

Span (s): the fraction of data points used in each local fit
Smaller span → more wiggly fit (overfits local noise, high variance)
Larger span → smoother fit (less responsive to local fluctuations, lower variance, more bias)
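The effect of the span can be sketched with a bare-bones local linear smoother. This is not a full LOESS implementation: it assumes tricube weights on the nearest ceil(span · n) points, uses hypothetical data, and measures wiggliness with a made-up `roughness` helper based on second differences.

```python
import math

def local_linear_fit(xs, ys, x0, span):
    """Fit a tricube-weighted line to the ceil(span * n) points nearest x0; return its value at x0."""
    n = len(xs)
    k = math.ceil(span * n)
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    h = max(abs(xs[i] - x0) for i in nearest)   # bandwidth: distance to the farthest neighbor
    w = [(1 - (abs(xs[i] - x0) / h) ** 3) ** 3 for i in nearest]
    sw = sum(w)
    xw = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
    yw = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
    sxx = sum(wi * (xs[i] - xw) ** 2 for wi, i in zip(w, nearest))
    sxy = sum(wi * (xs[i] - xw) * (ys[i] - yw) for wi, i in zip(w, nearest))
    return yw + (sxy / sxx) * (x0 - xw)

def roughness(fitted):
    """Sum of squared second differences: larger means a more wiggly curve."""
    return sum((fitted[i - 1] - 2 * fitted[i] + fitted[i + 1]) ** 2
               for i in range(1, len(fitted) - 1))

xs = [float(i) for i in range(21)]
ys = [0.5 * x + math.sin(x) for x in xs]   # hypothetical data: a trend plus a local oscillation

fit_small_span = [local_linear_fit(xs, ys, x0, span=0.2) for x0 in xs]
fit_large_span = [local_linear_fit(xs, ys, x0, span=1.0) for x0 in xs]

print(roughness(fit_small_span), roughness(fit_large_span))
```

The small span chases the local oscillation, while the large span averages over it and returns a curve close to the underlying trend.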

New cards

Determine which of the following statements is/are true about subset selection.

I. Best subset selection results in a nested set of best models, each with different number of predictors.
II. Residual sum of squares is a suitable metric for selecting the best model among models with different number of predictors.
III. Forward stepwise selection cannot be used in high-dimensional settings.

(A) None
(B) I and II only
(C) I and III only
(D) II and III only
(E) The answer is not given by (A), (B), (C), or (D)

Answer: A — None (all three statements are false)

I — FALSE. Best subset selection does not necessarily result in a nested set of best models; only forward stepwise selection and backward stepwise selection are guaranteed to do so.

II — FALSE. Residual sum of squares (RSS) is not a suitable metric because it decreases monotonically as the number of predictors increases (adding predictors can only reduce or keep RSS the same, never increase it).

III — FALSE. Forward stepwise selection can be used in high-dimensional settings where n ≤ p + 1, while backward stepwise selection cannot be used in such settings (it requires n > p to fit the full model).

Key concepts — Subset selection methods:

Best subset selection: fits all possible 2^p models; models with the same number of predictors can be compared, but the "best" models across different sizes are not necessarily nested
Forward stepwise selection: starts with no predictors, adds one at a time → produces a nested sequence; works even when n ≤ p + 1
Backward stepwise selection: starts with all predictors, removes one at a time → produces a nested sequence; requires n > p
RSS/R² are training-based metrics — not suitable for comparing models of different sizes (always favor larger models); use Cₚ, AIC, BIC, adjusted R², or cross-validation instead

New cards

Which of the following statements are true?

A. Principal components regression (PCR) produces simpler and more interpretable models than the lasso.
B. The first principal component direction of the data is that along which the response variable varies the most.
C. In partial least squares (PLS) one need not first standardize each of the predictor variables.
D. Partial Least Squares (PLS) attempts to find directions that help explain both the response and the predictors.
E. Partial Least Squares (PLS) is a generalization of principal components regression (PCR).

Answer: D only

A — FALSE. PCR does not produce simpler, more interpretable models than the lasso. PCR components are linear combinations of all predictors (no variable selection), while the lasso shrinks some coefficients to exactly zero, making it more interpretable.

B — FALSE. PCA is unsupervised — the first principal component direction is the direction along which the predictors (data) vary the most, not the response variable. The response plays no role in determining PCA directions.

C — FALSE. In PLS, predictor variables should be standardized first (just as in PCR), since PLS is sensitive to the scale of the variables.

D — TRUE. PLS is a supervised dimension reduction technique — it identifies directions that explain variance in the predictors while also being related to the response, unlike PCR which ignores the response when constructing components.

E — FALSE. PLS is not a generalization of PCR. They are two distinct approaches: PCR is unsupervised (ignores Y when finding directions); PLS is supervised (uses Y to find directions). Neither contains the other as a special case.

New cards

A modeler is considering revising a GLM for claim size with value of vehicle as an explanatory variable. Value of vehicle is currently being included in the model as a categorical variable with five levels and no interactions.

Determine which of the following statements is false.

A. The Hat matrix can be used to identify high leverage points.
B. A plot of the standardized residuals versus value of vehicle is one way of assessing multicollinearity.
C. Including value of vehicle instead as a continuous variable would decrease the model degrees of freedom.
D. Including several interaction terms involving value of vehicle and other variables may decrease model deviance.
E. None of A, B, C, or D is false.

Answer: B

A — TRUE. The diagonal elements of the Hat matrix are the leverages. The average leverage is (p+1)/N, where p is the number of slopes and N is the number of observations. As a rule of thumb for a high leverage point, one can use two or three times the average as cutoff values.

B — FALSE. A pattern in the residuals plotted against value of vehicle indicates that the model has not captured all of the variation in the response associated with value of vehicle; a transformation of value of vehicle may be useful. This addresses non-linearity, not multicollinearity.

Multicollinearity is assessed with the variance inflation factor (VIF), which is calculated by regressing value of vehicle on all other explanatory variables:

VIFⱼ = 1 / (1 − Rⱼ²)

where Rⱼ² is the coefficient of determination from that regression. If one or more VIFs are large, that indicates multicollinearity.

C — TRUE. Degrees of freedom of the model = number of fitted betas. Assuming an intercept is already in the model, value of vehicle with 5 levels adds 4 degrees of freedom. A continuous variable instead adds only 1 degree of freedom, so switching to continuous decreases the model degrees of freedom.

D — TRUE. Adding additional terms to a model usually improves the fit and thus decreases the deviance (the model without interactions is a special case of the model with interactions — nested models).
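The leverage cutoff and the VIF formula can be illustrated in the simplest setting. The sketch assumes an ordinary least squares fit with one predictor (so p = 1), where the hat-matrix diagonal has a closed form; a GLM's hat matrix also involves the working weights, which are omitted here. The data are hypothetical.

```python
xs = [1.0, 2.0, 3.0, 4.0, 20.0]   # hypothetical predictor with one extreme value
n, p = len(xs), 1

xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# Diagonal of the hat matrix for simple linear regression
leverages = [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]
average_leverage = (p + 1) / n

# Rule of thumb from the card: flag points above twice the average leverage
high_leverage = [x for x, h in zip(xs, leverages) if h > 2 * average_leverage]

def vif(r_squared_j):
    """Variance inflation factor from the R-squared of predictor j regressed on the others."""
    return 1.0 / (1.0 - r_squared_j)

print([round(h, 3) for h in leverages], high_leverage)
print(vif(0.0), round(vif(0.9), 2))
```

The leverages sum to p + 1, only the extreme point exceeds the cutoff, and a predictor that is 90% explained by the others has a VIF of about 10.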

New cards

Determine which of the following statements describe the advantages of using an alternative fitting procedure, such as subset selection and shrinkage, instead of least squares.

I. Doing so will likely result in a simpler model.
II. Doing so will likely improve prediction accuracy.
III. The results are likely easier to interpret.

(A) I only
(B) II only
(C) III only
(D) I, II, and III
(E) The correct answer is not given by (A), (B), (C), or (D).

Answer: D — All statements are true

Alternative fitting procedures tend to remove irrelevant variables from the predictors, resulting in simpler models and results that are easier to interpret.

These procedures also result in a reduction of the variance of the predictions, thus increasing accuracy.

Key takeaway — Why alternatives to least squares (subset selection, shrinkage) help:

Simpler models (I): by removing/shrinking irrelevant predictors → fewer effective variables
Better prediction accuracy (II): by trading a small increase in bias for a larger reduction in variance (the bias-variance tradeoff) — especially valuable when p is large relative to n, or predictors are collinear
Easier interpretation (III): fewer variables in the model → clearer picture of which predictors matter

New cards

Determine which of the following statements is true.

(A) Linear regression is a flexible approach.
(B) Lasso is more flexible than a linear regression approach.
(C) Spline is a low flexibility approach.
(D) There are methods that have high flexibility and are also easy to interpret.
(E) None of (A), (B), (C), or (D) are true.

Answer: E — None of A, B, C, or D are true

A — FALSE. Linear regression is a relatively inflexible approach. The mean is restricted to a linear combination of the predictors.

B — FALSE. Lasso is a less flexible approach than linear regression. Lasso is more restrictive in estimating the coefficients, and sets some of the coefficients to zero.

C — FALSE. Spline is a relatively flexible approach.

D — FALSE. There is a tradeoff between flexibility and interpretability of a model. A model with high flexibility is not easily interpretable, while a model with low flexibility is easily interpretable.

New cards

From an investigation of the residuals of fitting a linear regression by ordinary least squares, it is clear that the spread of the residuals increases as the predicted values increase. Observed values of the dependent variable range from 0 to 100.

Determine which of the following statements is/are true with regard to transforming the dependent variable to make the variance of the residuals more constant.

I. Because the logarithm of zero is negative infinity, a logarithm transformation cannot be used.

II. A square root transformation may make the variance of the residuals more constant.

III. A logit transformation may make the variance of the residuals more constant.

Answer: II only

I — FALSE. A logarithm transformation can still be used. Adding a constant to the dependent variable, e.g., ln(1 + Y), before the transformation accommodates the possibility of the dependent variable being zero.

II — TRUE. A square root transformation, √Y, may help reduce the severity of heteroscedasticity.

III — FALSE. A logit transformation requires values strictly between 0 and 1, so it cannot be applied directly to a variable that ranges from 0 to 100. Besides that, a logit transformation does not help reduce heteroscedasticity, since it does not shrink the range of values of the dependent variable.
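
A small simulation can illustrate statements I and II. It assumes, purely for illustration, that the response behaves like Poisson counts (so its variance grows with its mean); the question itself does not specify a distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical response whose spread grows with its mean (Poisson counts).
low = rng.poisson(lam=10, size=20000)
high = rng.poisson(lam=60, size=20000)

# Ratio of variances between the high-mean and low-mean groups.
ratio_raw = high.var() / low.var()                      # far from 1
ratio_sqrt = np.sqrt(high).var() / np.sqrt(low).var()   # close to 1

# ln(1 + Y) is defined at Y = 0, unlike ln(Y).
log1p_at_zero = np.log1p(0.0)
```

After the square root transformation the two groups have nearly equal variance, which is the sense in which √Y can reduce heteroscedasticity.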

New cards

Consider the following statements:

I. Principal Component Analysis (PCA) provides low-dimensional linear surfaces that are closest to the observations.
II. The first principal component is the line in p-dimensional space that is closest to the observations.
III. PCA finds a low dimension representation of a dataset that contains as much variation as possible.
IV. PCA serves as a tool for data visualization.

Determine which of the statements are correct.

(A) Statements I, II, and III only
(B) Statements I, II, and IV only
(C) Statements I, III, and IV only
(D) Statements II, III, and IV only
(E) Statements I, II, III, and IV are all correct

Answer: E — All statements are true

I — TRUE. PCA provides low-dimensional linear surfaces (lines, planes, or hyperplanes) that are closest to the observations, measured using average squared Euclidean distance.

II — TRUE. The first principal component is the line in p-dimensional space that is closest to the observations — equivalently, it's the direction along which the data has the most variance.

III — TRUE. PCA finds a low-dimensional representation of a dataset that captures as much variation as possible — this is the core goal of dimension reduction via PCA.

IV — TRUE. PCA serves as a useful tool for data visualization, since projecting high-dimensional data onto the first 2 or 3 principal components allows the data to be plotted and visually examined.
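
The four statements can be illustrated with a minimal PCA computed from the singular value decomposition. This is a sketch on simulated data; the mixing matrix, the comparison direction other, and the helper mean_sq_distance are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 200 observations, 3 correlated features.
Z = rng.normal(size=(200, 3))
X = Z @ np.array([[3.0, 1.0, 0.5], [0.0, 1.0, 0.2], [0.0, 0.0, 0.3]])
Xc = X - X.mean(axis=0)                  # PCA works on centered data

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                              # first principal component direction
scores = Xc @ Vt[:2].T                   # 2-D representation, e.g. for plotting
explained = s**2 / np.sum(s**2)          # proportion of variance per component

def mean_sq_distance(direction):
    """Average squared distance from the centered points to the line along direction."""
    d = direction / np.linalg.norm(direction)
    proj = np.outer(Xc @ d, d)
    return np.mean(np.sum((Xc - proj) ** 2, axis=1))

other = np.array([0.0, 1.0, 0.0])        # an arbitrary competing direction
```

The line along pc1 has a smaller average squared distance to the points than the line along other, and the projection onto pc1 has the larger variance; these are two views of the same optimum.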

New cards

Determine which one of the following statements regarding Generalized Additive Models (GAMs) is false.

(A) Natural Splines, Regression Splines, Smoothing Splines, Local Regression, Polynomial Regression, and Step Functions are all types of models that can be used as building blocks for GAMs.

(B) A limitation of GAMs is that interactions cannot be added to the model.

(D) GAMs are a useful representation if we are interested in inference, since you can examine the effect of the predictor variables on the response while holding all of the other predictor variables constant.

(E) GAMs allow for non-linear relationships between each predictor variable and the response.

Answer: B

A — TRUE. Natural Splines, Regression Splines, Smoothing Splines, Local Regression, Polynomial Regression, and Step Functions are all valid building blocks that can be used within GAMs to model individual predictors.

B — FALSE. While it is true that GAMs are restricted to be additive, like linear regressions, we can manually add interaction terms to the model.

C — TRUE. A smooth function indicates a small number of degrees of freedom.

D — TRUE. This is possible because the model is additive.

E — TRUE. Using polynomial functions (or splines/local regression) in GAMs allows the modeling of non-linear relationships between each predictor and the response.

New cards

Which of the following statements about the lasso is false.

(A) A penalty term λΣⱼ₌₁ᵖ|βⱼ| is added to the residual sum of squares; the sum is minimized.

(B) The larger the ratio ‖β̂ᴸλ‖₁ / ‖β̂‖₁, the less flexible the model.

(C) The lasso is equivalent to assuming a prior distribution for each of the slopes that is Double-Exponential (Laplace) with mean zero.

(D) One can use cross-validation to select the best value of the tuning parameter λ.

(E) We expect the lasso to perform better than ridge regression in a setting where a relatively small number of predictors have substantial coefficients.

Answer: B

A — TRUE. The lasso minimizes RSS + λΣⱼ₌₁ᵖ|βⱼ|, where the penalty term is added to the residual sum of squares.

B — FALSE. ‖β̂ᴸλ‖₁ is the sum of the absolute values of the fitted slopes for the lasso.

‖β̂‖₁ is the sum of the absolute values of the fitted slopes for least squares regression.

The ratio ‖β̂ᴸλ‖₁ / ‖β̂‖₁ is a measure of the shrinkage of the slopes towards zero; larger values of this ratio correspond to smaller values of the tuning parameter λ and thus a more flexible model.

As λ gets bigger, the method gets more restrictive (so a larger ratio means less shrinkage = more flexible, not less flexible as the statement claims).

C — TRUE. The lasso is equivalent to assuming a Double-Exponential (Laplace) prior distribution with mean zero for each of the slopes (from a Bayesian perspective).

D — TRUE. Cross-validation can be used to select the best value of the tuning parameter λ.

E — TRUE. The lasso tends to outperform ridge when a relatively small number of predictors have substantial coefficients (i.e., when the true model is sparse), since lasso can zero out the irrelevant predictors' coefficients.
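
The behaviour of the ratio in B can be seen in a special case. The sketch assumes an orthonormal design (XᵀX = I), where the lasso solution is a soft-threshold of the least squares slopes; the slope values and the helper lasso_orthonormal are hypothetical.

```python
import numpy as np

# Hypothetical least squares slopes, assuming an orthonormal design (X'X = I).
beta_ols = np.array([3.0, -1.5, 0.4, 0.1])

def lasso_orthonormal(beta, lam):
    """Minimizer of RSS + lam * sum|b_j| when X'X = I: soft-threshold at lam / 2."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam / 2, 0.0)

lambdas = [0.0, 0.5, 1.0, 4.0, 8.0]
# Ratio of the lasso L1 norm to the least squares L1 norm for each lambda.
ratios = [
    np.abs(lasso_orthonormal(beta_ols, lam)).sum() / np.abs(beta_ols).sum()
    for lam in lambdas
]
```

The ratio starts at 1 when λ = 0 (no shrinkage, same as least squares) and falls to 0 once λ is large enough to zero out every slope, so a larger ratio goes with a smaller λ and a more flexible fit.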

New cards

You are given the following three statistical learning tools:

I. Linear Regression
II. Spline
III. Ridge Regression

Rank these statistical learning tools based on their flexibility in descending order.

(A) I > II > III
(B) II > III > I
(C) II > I > III
(D) III > I > II
(E) The correct answer is not given by (A), (B), (C), or (D).

Answer: C — II > I > III

Spline is the most flexible, followed by linear regression; and ridge regression is the least flexible statistical learning tool.

Key takeaway — Flexibility ranking:

Spline > Linear Regression > Ridge Regression

Spline: flexible, non-parametric approach that can capture non-linear relationships via piecewise polynomial fits with knots
Linear Regression (OLS): unconstrained coefficients, but restricted to a linear relationship between predictors and response
Ridge Regression: adds an L2 penalty (λΣβⱼ²), shrinking coefficients toward zero; more restrictive than OLS since it constrains coefficient magnitudes
Lasso (not part of this question) is likewise less flexible than OLS because of its L1 penalty; whether it is more or less restrictive than ridge depends on the value of λ chosen for each, so the two are not strictly ordered

New cards

Determine which of the following pairs of distribution in the linear exponential family and link function is the most appropriate to model the total number of auto accidents experienced by a driver in a year.

(A) Normal distribution with identity link function
(B) Normal distribution with log link function
(C) Poisson distribution with identity link function
(D) Binomial distribution with logit link function
(E) Negative binomial distribution with log link function

Answer: E

The number of auto accidents is a non-negative integer (a count), so a Poisson, binomial, or negative binomial distribution should be selected. This eliminates options A and B.

Option C is eliminated because a Poisson distribution should be paired with a log link function to avoid negative predictions (identity link can produce negative fitted means, which is invalid for count data).

Option D is eliminated because a binomial distribution has bounded support: the total number of accidents can be at most m, the number of trials, which may not suit a particular dataset. Besides that, a Bernoulli distribution (a special case of the binomial) is better suited to modeling a probability.

Option E is the most appropriate pair. The negative binomial is often preferred over the Poisson when there is overdispersion (variance > mean) in the count data, which is common in real-world accident data.
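
The link-function argument can be made concrete with a few numbers. The coefficients b0 and b1, the predictor values, and the size parameter r below are hypothetical, and the variance formula assumes the common negative binomial parameterization Var = μ + μ²/r.

```python
import numpy as np

# Hypothetical coefficients and predictor values, chosen only for illustration.
b0, b1 = 0.2, -0.5
x = np.array([0.0, 0.5, 1.0, 2.0])
eta = b0 + b1 * x                 # linear predictor

mean_identity = eta               # identity link: mean = eta, can go negative
mean_log = np.exp(eta)            # log link: mean = exp(eta), always positive

# Negative binomial variance, assuming Var = mu + mu^2 / r
# (r is a hypothetical size parameter).
r = 2.0
var_negbin = mean_log + mean_log**2 / r
```

The identity link yields negative fitted means for some x, while the log link cannot, and the negative binomial variance exceeds the mean at every point.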

New cards

Determine which of the following statements are true regarding a simple linear regression model with a response variable y and an explanatory variable x. (Select 3 options)

(A) The least squares line passes through the point (x̄, ȳ).
(B) The sample correlation between x and y is equal to the coefficient of determination of the model.
(C) The choice of explanatory variable x affects the total sum of squares.
(D) The F-statistic of the model is equal to the square of the t-statistic of the slope parameter.
(E) A random pattern in the scatterplot of y against x indicates a coefficient of determination close to zero.

Answer: A, D, E

Option A is TRUE. Under ordinary least squares, the forecasting equation is ŷ = b₀ + b₁x where b₀ = ȳ − b₁x̄. Therefore, when x = x̄:

ŷ = b₀ + b₁x̄
= ȳ − b₁x̄ + b₁x̄
= ȳ

So the least squares line passes through (x̄, ȳ).

Option B is FALSE. For a simple linear regression model, the squared sample correlation between x and y is equal to the coefficient of determination of the model (r² = R², not r = R²).

Option C is FALSE. The total sum of squares is Σᵢ₌₁ⁿ(yᵢ − ȳ)². This value is not a function of the explanatory variable x — it only depends on the response variable y.

Option D is TRUE. In simple linear regression, the overall F-test and the t-test on the slope test the same hypothesis (H₀: β₁ = 0), and the statistics satisfy F = t².

Option E is TRUE. A random pattern in the scatterplot means that there isn't a linear relationship between x and y. Hence, the coefficient of determination, R², is close to zero.
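
Options A, B, and D can be verified on any small dataset. The x and y values below are hypothetical and serve only to check the identities numerically.

```python
import numpy as np

# Hypothetical data for a simple linear regression.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)        # total SS: depends on y only
sse = np.sum((y - fitted) ** 2)          # residual SS
ssr = sst - sse                          # regression SS
r_squared = ssr / sst

r = np.corrcoef(x, y)[0, 1]              # sample correlation
s2 = sse / (n - 2)                       # residual mean square
t_stat = b1 / np.sqrt(s2 / sxx)          # t-statistic for H0: slope = 0
f_stat = ssr / s2                        # F-statistic with 1 and n - 2 df
```

The fitted line evaluated at x̄ returns ȳ, r² equals R² (while r itself does not), and f_stat equals t_stat squared up to rounding.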

New cards

You are given the following statements concerning the relationship between a response Y and p predictors, X = (X₁, X₂, ..., Xₚ):

Y = f(X) + ε

I. The accuracy of the prediction for Y depends on both the reducible error and the irreducible error.

II. The variability of ε can be reduced by using the most appropriate learning method to estimate f.

III. ε has a mean of zero.

Determine which of the statements is/are false.

(A) I only
(B) II only
(C) III only
(D) None of them are false.
(E) The answer is not given by (A), (B), (C), or (D).

Answer: B — II only

I — TRUE. The accuracy of the prediction for Y depends on both the reducible error (error due to f̂ not perfectly estimating f) and the irreducible error (error due to ε, which cannot be eliminated no matter how well f is estimated).

II — FALSE. The variability of ε cannot be reduced — this is the irreducible error. The reducible error, on the other hand, can be reduced by using the most appropriate learning method to estimate f.

III — TRUE. ε has a mean of zero (by assumption in this model).

Key takeaway — Reducible vs. Irreducible Error:

In the model Y = f(X) + ε, the expected squared prediction error decomposes as:

E[(Y − Ŷ)²] = [f(X) − f̂(X)]² + Var(ε)

Reducible error = [f(X) − f̂(X)]² — can be reduced by using a more accurate estimate f̂ (i.e., a better learning method/model)
Irreducible error = Var(ε) — arises from unmeasured variables or inherent randomness; cannot be reduced, no matter how well f is estimated
ε is assumed to have mean zero, so it represents random noise around the true function f(X)
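
The decomposition can be checked by simulation at a single point. Everything here is a hypothetical setup: the true function f, the imperfect estimate f_hat, the noise level sigma, and the point x0 are chosen only to make the two error terms visible.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    """Hypothetical true function."""
    return 2.0 + 3.0 * x

def f_hat(x):
    """Hypothetical fixed, imperfect estimate of f."""
    return 2.5 + 2.5 * x

sigma = 1.5
x0 = 2.0

eps = rng.normal(0.0, sigma, size=400_000)   # mean-zero noise
y = f(x0) + eps
mse = np.mean((y - f_hat(x0)) ** 2)          # Monte Carlo estimate of E[(Y - Yhat)^2]

reducible = (f(x0) - f_hat(x0)) ** 2         # shrinks as f_hat improves
irreducible = sigma ** 2                     # Var(eps), unaffected by f_hat
```

Replacing f_hat with f drives the reducible term to zero, but the simulated error would still settle near sigma², the irreducible part.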

New cards

Determine which of the following statements is not a drawback of using linear probability models over nonlinear functions of explanatory variables to model a Bernoulli response.

(A) Heteroscedasticity is an issue since the variance of the response varies with the observations.

(B) The fitted values from the model may be unreasonable and nonsensical.

(D) Explanatory variables that are highly correlated can result in severe multicollinearity.

(E) (A), (B), (C), and (D) are all drawbacks of using linear probability models over nonlinear functions of explanatory variables to model a Bernoulli response.

Answer: D

(A) is a drawback. With a Bernoulli response, the variance is related to the mean. This means that the variance of each observation varies, and this is known as heteroscedasticity.

(B) is a drawback. The fitted values can vary between negative and positive infinity. With a Bernoulli response, the mean varies from 0 to 1. Hence, fitted values outside this bound are unreasonable.

(C) is a drawback. Linear probability models assume that the error terms follow a normal distribution. For a Bernoulli response, each error term can take only two values, so it cannot be normally distributed. Any test that relies on the t-distribution is therefore inappropriate.

(D) is NOT a specific drawback of linear probability models. Multicollinearity is a potential issue for any regression model — it's not a drawback unique to using a linear probability model instead of a nonlinear model for a Bernoulli response.
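
Drawbacks (A) and (B) show up even on a tiny example. The data below are hypothetical: a 0/1 response that switches from 0 to 1 as x increases, fitted by ordinary least squares.

```python
import numpy as np

# Hypothetical Bernoulli response that switches from 0 to 1 as x grows.
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares fit
fitted = X @ beta                              # linear probability model predictions

# Bernoulli variance p(1 - p) changes with the mean, so errors are heteroscedastic.
p = np.clip(fitted, 0.0, 1.0)
bernoulli_var = p * (1 - p)
```

The fitted value at x = 0 is negative and the fitted value at x = 9 exceeds 1, neither of which is a valid probability, and the implied variance differs from observation to observation.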

New cards

Determine which of the following statements are true.

I. The minimal model has the highest possible deviance.

II. The deviance for normal distributions is proportional to the regression sum of squares.

III. The deviance is defined as the difference in loglikelihood between the saturated model and the fitted model.

(A) I only
(B) II only
(C) III only
(D) All but III
(E) All

Answer: A — I only

I — TRUE. The minimal model has just an intercept, and ŷᵢ = overall mean. The minimal model has the highest possible deviance.

II — FALSE. The deviance for normal distributions is proportional to the residual (error) sum of squares, not the regression sum of squares.

III — FALSE. The deviance is defined as twice the difference in loglikelihood between the saturated model and the fitted model. (Statement III omits the factor of 2.)
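
The factor of 2 in statement III and the residual sum of squares in statement II can both be seen in a short calculation. The counts are hypothetical, the Poisson model is used only as an example family, and the normal case assumes a known variance sigma2.

```python
import math
import numpy as np

# Hypothetical Poisson counts (all positive so that y * log(y) is defined).
y = np.array([2.0, 5.0, 1.0, 7.0, 3.0, 4.0])

def poisson_loglik(y, mu):
    """Poisson log-likelihood: sum of y*log(mu) - mu - log(y!)."""
    return float(np.sum(y * np.log(mu) - mu) - sum(math.lgamma(v + 1) for v in y))

mu_minimal = np.full_like(y, y.mean())   # intercept-only (minimal) model
mu_saturated = y                         # one parameter per observation

# Deviance is TWICE the log-likelihood gap to the saturated model.
deviance_minimal = 2 * (poisson_loglik(y, mu_saturated) - poisson_loglik(y, mu_minimal))
deviance_closed = 2 * np.sum(y * np.log(y / mu_minimal) - (y - mu_minimal))
deviance_saturated = 2 * (poisson_loglik(y, mu_saturated) - poisson_loglik(y, mu_saturated))

# For a normal model with known variance sigma2, the deviance is RSS / sigma2.
sigma2 = 1.0
rss = np.sum((y - mu_minimal) ** 2)      # residual, not regression, sum of squares
normal_deviance = rss / sigma2
```

The two Poisson expressions agree, the saturated model has zero deviance, and the normal-model deviance is built from the residual sum of squares.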

New cards

Which of the following is the basis form for a cubic spline with knots at 1, 4, and 9?

(x − y)₊ = (x − y) if x > y, and 0 otherwise.

(A) f(x) = β₁x + β₂x² + β₃x³ + β₄(x−1)₊³ + β₅(x−4)₊³ + β₆(x−9)₊³

(B) f(x) = β₀ + β₁x + β₂x² + β₃x³ + β₄(x−1)₊³ + β₅(x−4)₊³ + β₆(x−9)₊³

(D) f(x) = β₀ + β₁x + β₂x² + β₃x³ + β₄(x−1)₊³ + β₅(x−1)₊(x−4)₊² + β₆(x−1)₊(x−4)₊(x−9)₊

(E) None of A, B, C, or D

Answer: B

A cubic spline is continuous, with continuous first and second derivatives, at each of the knots.

The basis also includes an intercept term β₀, which (A) omits.

Key takeaway — Cubic spline basis with K knots:

f(x) = β₀ + β₁x + β₂x² + β₃x³ + Σₖ₌₁ᴷ βₖ₊₃(x − ξₖ)₊³

where ξₖ are the knot locations (here: 1, 4, and 9).
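
The basis in the key takeaway translates directly into a design matrix. This is a minimal sketch; the helper name cubic_spline_basis and the evaluation grid are hypothetical.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Columns: 1, x, x^2, x^3, then (x - knot)_+^3 for each knot."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)

knots = [1.0, 4.0, 9.0]
x = np.linspace(0.0, 10.0, 101)
B = cubic_spline_basis(x, knots)     # 4 + K = 7 columns for K = 3 knots
```

Regressing a response on the columns of B by least squares would give the seven coefficients β₀ through β₆; each truncated term is zero to the left of its knot and cubic to the right.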

New cards

Which of the following statements about ridge regression is false.

(A) A penalty term λΣⱼ₌₁ᵖβⱼ² is added to the residual sum of squares; the sum is minimized.

(B) The larger the tuning parameter λ, the more the fitted coefficients are shrunk towards zero.

(D) Compared to lasso regression, ridge regression produces simpler and more interpretable models.

(E) It is best to first standardize each variable by dividing by the variable's estimated standard deviation.

Answer: D

A — TRUE. Ridge regression minimizes RSS + λΣⱼ₌₁ᵖβⱼ², where the penalty term is added to the residual sum of squares.

B — TRUE. The larger the tuning parameter λ, the more the fitted coefficients are shrunk towards zero.

C — TRUE. Ridge regression is more restrictive than ordinary least squares (it constrains the coefficient estimates via the penalty).

D — FALSE. Compared to ridge regression, the lasso produces simpler and more interpretable models that involve only a subset of the predictors; in the case of the lasso, some fitted parameters can be exactly zero (ridge coefficients are shrunk but never exactly zero).

E — TRUE. It is best to first standardize each variable by dividing by the variable's estimated standard deviation, since ridge regression coefficient estimates depend on the scale of the predictors.
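
Statements B, D, and E can be illustrated with the closed-form ridge solution. The sketch uses simulated data with hypothetical scales and coefficients; predictors are standardized and the response is centered so that the intercept is left unpenalized.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 4
# Hypothetical predictors on very different scales.
X = rng.normal(size=(n, p)) * np.array([1.0, 10.0, 0.1, 5.0])
y = X @ np.array([2.0, 0.3, -8.0, 0.0]) + rng.normal(size=n)

# Standardize predictors and center the response, so the intercept
# drops out of the penalized problem.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

def ridge(Xs, yc, lam):
    """Minimizer of RSS + lam * sum(b_j^2): solves (X'X + lam I) b = X'y."""
    k = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + lam * np.eye(k), Xs.T @ yc)

lambdas = [0.0, 10.0, 100.0, 1000.0]
norms = [np.linalg.norm(ridge(Xs, yc, lam)) for lam in lambdas]
beta_large_lam = ridge(Xs, yc, 1000.0)   # heavily shrunk, yet no exact zeros
```

The coefficient norm falls as λ grows, but even at λ = 1000 every coefficient remains nonzero, which is why ridge does not perform variable selection the way the lasso does.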

New cards

Determine which one of the following statements about ridge regression is true.

(A) As the tuning parameter λ → ∞, some of the coefficients become and remain zero.

(B) The ridge regression coefficients can be calculated by determining the coefficients β̂₁ᴿ, β̂₂ᴿ, ..., β̂ₚᴿ that minimize:

Σᵢ₌₁ⁿ(yᵢ − β₀ − Σⱼ₌₁ᵖβⱼxᵢⱼ)² + λΣᵢ₌₁ᵖ|βᵢ|

(D) Ridge regression shrinks the coefficient estimates, which has the benefit of reducing the variance.

(E) The shrinkage penalty is applied to all of the coefficient estimates.

Answer: D

A — FALSE. Ridge regression shrinks the coefficient estimates toward zero as λ grows, but for any finite λ they do not become exactly zero. Setting some coefficients exactly to zero is a property of the lasso.

B — FALSE. The formula shown (with Σ|βᵢ|, the L1 penalty) applies to the Lasso. For ridge regression, we minimize:

Σᵢ₌₁ⁿ(yᵢ − β₀ − Σⱼ₌₁ᵖβⱼxᵢⱼ)² + λΣᵢ₌₁ᵖβᵢ²

(using the squared penalty, Σβᵢ², not Σ|βᵢ|).

C — FALSE. Unlike standard least squares coefficients, ridge regression coefficients are not scale equivariant.

D — TRUE. Ridge regression shrinks the coefficient estimates, which has the benefit of reducing the variance (at the cost of a small increase in bias).

E — FALSE. The shrinkage penalty is applied to all the coefficient estimates except for the intercept.
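
The scale-equivariance point in C and the unpenalized intercept in E can be sketched as follows. The data are simulated, and the helper fit, the penalty value 50, and the rescaling factor 1000 are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)

def fit(X, y, lam):
    """Ridge fit with an unpenalized intercept; lam = 0 gives least squares."""
    xbar, ybar = X.mean(axis=0), y.mean()
    Xc, yc = X - xbar, y - ybar
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = ybar - xbar @ beta       # recovered afterwards, never penalized
    return intercept, beta

X_scaled = X * np.array([1000.0, 1.0])   # first predictor expressed in other units

_, b_ols = fit(X, y, 0.0)
_, b_ols_scaled = fit(X_scaled, y, 0.0)
_, b_ridge = fit(X, y, 50.0)
_, b_ridge_scaled = fit(X_scaled, y, 50.0)

# If a fit is scale equivariant, multiplying x_1 by 1000 divides its slope by 1000.
ols_gap = abs(b_ols[0] - 1000.0 * b_ols_scaled[0])
ridge_gap = abs(b_ridge[0] - 1000.0 * b_ridge_scaled[0])
```

The least squares gap is zero up to rounding, while the ridge gap is clearly nonzero, which is the practical reason for standardizing predictors before applying ridge regression.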

New cards

Which of the following is NOT an advantage of Generalized Additive Models (GAMs)?

(A) GAMs allow one to fit a nonlinear function to each predictor.

(B) The nonlinear fits can potentially make more accurate predictions for the response Y.

(C) Because the model is additive, we can still examine the effect of each predictor on the response individually while holding all of the other variables fixed.

(D) The smoothness of each function on a predictor can be summarized via degrees of freedom.

(E) All of the above are advantages of GAMs.

Answer: E — All of these are advantages of GAMs

A — advantage. GAMs allow one to fit a nonlinear function to each predictor, capturing relationships that a linear model would miss.

B — advantage. The nonlinear fits can potentially make more accurate predictions for the response Y compared to a purely linear model.

C — advantage. Because the model is additive, we can still examine the effect of each predictor individually on the response while holding all other variables fixed — this preserves interpretability.

D — advantage. The smoothness of each function on a predictor can be summarized via degrees of freedom, giving a convenient single number to describe the flexibility of each term.

Since none of A–D describe a disadvantage, the correct answer is E (all are advantages).