MAS-I ISLR Conceptual Questions

Last updated 10:59 PM on 6/19/26

157 Terms

1
New cards

Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method

The sample size n is extremely large, and the number of predictors p is small

better - a flexible approach fits the data more closely, and with a very large sample size it is much less likely to overfit, so it would obtain a better fit than an inflexible approach

2
New cards

Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method

The number of predictors p is extremely large, and the number of observations n is small

worse - a flexible method would overfit the small number of observations

3
New cards

Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method

The relationship between the predictors and response is highly non-linear

better - with more degrees of freedom, a flexible model would obtain a better fit

4
New cards

Indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method

The variance of the error terms (σ^2) is extremely high

worse - flexible methods fit to the noise in the error terms and increase variance

5
New cards

Problems with a quantitative response

Regression problems

6
New cards

Problems with a qualitative response

Classification problems

logistic regression is typically used with a qualitative (two-class or binary) response, so it is often treated as a classification method; because it estimates class probabilities, it can also be thought of as a regression method

7
New cards

As model flexibility increases, what happens to the training MSE and test MSE?

Training MSE will decrease

Test MSE may not
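A minimal sketch of this behaviour, assuming a made-up true function sin(x), Gaussian noise, and polynomial degree as the flexibility knob (all of these are illustrative choices, not part of the card):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed data-generating process: y = sin(x) + noise
    x = rng.uniform(-3, 3, n)
    y = np.sin(x) + rng.normal(0, 0.5, n)
    return x, y

x_train, y_train = make_data(50)
x_test, y_test = make_data(1000)

train_mse, test_mse = [], []
for degree in range(1, 11):  # higher degree = more flexible fit
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse.append(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_mse.append(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))

for degree, (tr, te) in enumerate(zip(train_mse, test_mse), start=1):
    print(degree, round(tr, 4), round(te, 4))
```

Because each polynomial model nests the previous one, the training MSE cannot go up as the degree grows; the test MSE carries no such guarantee.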

8
New cards

Small training MSE & large test MSE means

we are overfitting the data; may be picking up some patterns that are just caused by random chance

regardless of overfitting, we almost always expect the training MSE to be smaller than the test MSE

9
New cards

reducible error

we can potentially improve the accuracy of the estimate of f by using the most appropriate statistical learning technique to estimate f

[f(X) - fhat(X)]^2

10
New cards

irreducible error

no matter how well we estimate f, we cannot reduce the error introduced by ɛ

var(ɛ)

11
New cards

parametric

12
New cards

non-parametric

no assumption about the form of f is made; estimate f that gets as close to the data points as possible without being too rough or wiggly

requires a very large number of observations to accurately estimate f

13
New cards

parametric potential disadvantage

the model we choose will usually not match the true unknown form of f

14
New cards

non-parametric disadvantage

since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations is required in order to obtain an accurate estimate for f

15
New cards

non-parametric advantage

potential to accurately fit a wider range of possible shapes for f

16
New cards

From high to low interpretability & low to high flexibility

subset selection

lasso

least squares

generalized additive models

linear regression comes after lasso, but before GAM

17
New cards

why would we ever choose to use a more restrictive method instead of a very flexible approach?

if we are mainly interested in inference, then restrictive models are much more interpretable

(linear model)

18
New cards

inference - what kind of model do we want?

inflexible because easier to interpret

want to better understand the relationship between the response and the predictors

19
New cards

prediction - what kind of model do we want?

more flexible, but not always most flexible; interpretability not of interest

aim is to accurately predict the response for future observations

20
New cards

supervised learning

for each observation of the predictor measurement(s), there is an associated response measurement

21
New cards

unsupervised learning

no associated response

cluster analysis

22
New cards

The expected test MSE

the sum of the variance, squared bias, and variance of the error terms

can never lie below var(ɛ)
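Written out for a test point x_0 (notation assumed here: \hat{f} is the fitted function and y_0 the response at x_0), the decomposition described in words above is:

```latex
E\bigl(y_0 - \hat{f}(x_0)\bigr)^2
  = \operatorname{Var}\bigl(\hat{f}(x_0)\bigr)
  + \bigl[\operatorname{Bias}\bigl(\hat{f}(x_0)\bigr)\bigr]^2
  + \operatorname{Var}(\varepsilon)
```

The first two terms are non-negative, which is why the expected test MSE cannot fall below Var(ɛ).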

23
New cards

More flexible methods have higher or lower variance? higher or lower bias?

More flexible methods have higher variance and lower bias

As flexibility increases, bias tends to initially decrease faster than the variance increases (test MSE decreases), but then bias evens out, and variance significantly increases (test MSE increases)

24
New cards

Bias-variance trade-off

the relationship between bias, variance, and test set MSE

challenge lies in finding a method for which both the variance and the squared bias are low

25
New cards

advantages for a flexible approach

obtaining a better fit for non-linear models, decreasing bias

26
New cards

disadvantages for a flexible approach

requires estimating a greater number of parameters; may follow the noise too closely (overfit), increasing variance

27
New cards

Residual Sum of Squares

the sum of each residual squared for all the observations in the sample. This reflects the amount of variation in the dependent variable not explained by the regression equation

28
New cards

least squares coefficient estimates for simple linear regression

Beta hat sub one

Beta hat sub zero

they characterize the least squares line
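A short sketch of the standard closed-form estimates for simple linear regression; the function name `simple_least_squares` and the toy data are hypothetical, chosen only for illustration:

```python
import numpy as np

def simple_least_squares(x, y):
    """Return (beta0_hat, beta1_hat) for the least squares line y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sample covariance of x and y divided by sample variance of x
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: forces the line through (x_bar, y_bar)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Toy data lying exactly on y = 1 + 2x
b0, b1 = simple_least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```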

29
New cards

True or False: since the coefficient for an interaction term is very small, there is very little evidence of an interaction effect

False. We must examine the p-value of the regression coefficient to determine if the interaction term is statistically significant or not

30
New cards

polynomial regression vs. linear regression; underlying true relationship is linear; would you expect one training RSS to be lower/higher than the other?

I would expect the polynomial regression to have a lower training RSS than the linear regression: its extra flexibility lets it follow the noise (the irreducible error, var(ɛ)) in the training data more closely

31
New cards

polynomial regression vs. linear regression; underlying true relationship is linear; would you expect one test RSS to be lower/higher than the other?

I would expect the polynomial regression to have a higher test RSS, because it overfits the training noise and so makes larger errors on new data than the linear regression

32
New cards

polynomial regression vs. linear regression; underlying true relationship is not linear, it's unknown; would you expect one training RSS to be lower/higher than the other?

polynomial regression will still have a lower training RSS than the linear fit because of its higher flexibility, no matter what the underlying true relationship is; the more flexible model will follow the points more closely and reduce the training RSS

33
New cards

polynomial regression vs. linear regression; underlying true relationship is not linear, it's unknown; would you expect one test RSS to be lower/higher than the other?

Not enough info to tell; whichever model the underlying relationship is closer to could have the lower RSS

34
New cards

logistic regression

p(X) = e^(β0 + β1X1 + ... + βpXp) / (1 + e^(β0 + β1X1 + ... + βpXp))

log(p(X) / (1 - p(X))) = β0 + β1X1 + ... + βpXp

35
New cards

odds --> probability

odds = prob / (1 - prob), so prob = odds / (1 + odds)
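A small sketch of the conversions between probability, odds, and log-odds; the helper names are hypothetical and `eta` stands for the linear predictor (the sum of coefficient-times-predictor terms):

```python
import math

def odds_from_prob(p):
    # odds = p / (1 - p)
    return p / (1 - p)

def prob_from_odds(odds):
    # Inverse of the line above: p = odds / (1 + odds)
    return odds / (1 + odds)

def logistic(eta):
    # p(X) = e^eta / (1 + e^eta), written in the equivalent form 1 / (1 + e^-eta)
    return 1 / (1 + math.exp(-eta))

def log_odds(p):
    # The logit: log(p / (1 - p)); recovers eta from p
    return math.log(odds_from_prob(p))

print(odds_from_prob(0.75))     # probability 0.75 corresponds to odds of 3 to 1
print(prob_from_odds(3.0))
print(log_odds(logistic(1.2)))  # round trip back to roughly 1.2
```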

36
New cards

What is the probability that the first bootstrap observation is not the jth observation from the original sample?

the probability that the jth observation is selected as the first bootstrap observation is 1/n

therefore the probability that the jth observation is not the first bootstrap observation is 1-(1/n)

37
New cards

what is the probability that the 2nd bootstrap observation is not the jth observation from the original sample

1-(1/n)

bootstrap sampling is sampling with replacement

38
New cards

probability that the jth observation is not in the bootstrap sample

(1-1/n)^n

39
New cards

probability that the jth observation is in the bootstrap sample

1-(1-1/n)^n

40
New cards

probability that the jth observation is in the bootstrap sample tends to...

(limit)

1-1/e
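A quick numeric check of the formula and its limit; the function name `prob_in_bootstrap` is a hypothetical helper:

```python
import math

def prob_in_bootstrap(n):
    # P(jth observation appears at least once in a bootstrap sample of size n)
    return 1 - (1 - 1 / n) ** n

limit = 1 - 1 / math.e  # roughly 0.632

for n in (5, 100, 10_000):
    print(n, prob_in_bootstrap(n), limit)
```

The probability starts above the limit for small n and approaches it from above as n grows.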

41
New cards

Explain how k-fold cross-validation is implemented

The data is segmented into k distinct, (usually) equal-sized 'folds'. A model is trained on k−1 of the folds and tested on the remaining fold. This process is repeated k times, such that each of the k folds acts as the test data once. The test performance is recorded and averaged, giving the 'cross-validation' or 'out-of-sample' metric.
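The procedure can be sketched as follows, assuming a polynomial fit as the model and MSE as the performance metric (both are illustrative choices; `k_fold_mse` is a hypothetical helper):

```python
import numpy as np

def k_fold_mse(x, y, k, degree=1, seed=0):
    """Average held-out MSE of a degree-`degree` polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))    # shuffle before splitting
    folds = np.array_split(idx, k)   # k roughly equal-sized folds
    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[train_idx], y[train_idx], degree)  # train on k-1 folds
        pred = np.polyval(coefs, x[test_idx])                   # test on the held-out fold
        fold_mse.append(np.mean((pred - y[test_idx]) ** 2))
    return float(np.mean(fold_mse))  # average over the k folds

x = np.linspace(0, 1, 20)
y = 2 * x + 1                        # noiseless line, so the CV error should be near 0
print(k_fold_mse(x, y, k=5))
```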

42
New cards

What are the advantages of k-fold cross-validation relative to the validation set approach?

k-fold CV has much lower variability, and all the data is used to both train and test model performance

the validation set approach can over-estimate the test error

43
New cards

What are the disadvantages of k-fold cross-validation relative to the validation set approach?

The validation set approach is conceptually easier to grasp and has a computational advantage - a model is trained once and tested once (less time consuming)

44
New cards

What are the advantages of k-fold cross-validation relative to LOOCV?

k-fold CV is less computationally demanding

bias-variance tradeoff (LOOCV has lower bias but higher variance)

45
New cards

What are the disadvantages of k-fold cross-validation relative to LOOCV?

k-fold cv has an element of randomness

loocv can require less computational power in some cases (least squares regression)
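For least squares, the LOOCV estimate can be obtained from a single fit using the leverages h_i: CV = (1/n) Σ ((y_i − ŷ_i) / (1 − h_i))^2. The sketch below compares that shortcut with refitting n times; the function names and simulated data are assumptions for illustration:

```python
import numpy as np

def loocv_brute_force(X, y):
    # Refit the model n times, each time leaving one observation out
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append((y[i] - X[i] @ beta) ** 2)
    return float(np.mean(errs))

def loocv_shortcut(X, y):
    # One fit: scale each residual by 1 / (1 - leverage)
    H = X @ np.linalg.pinv(X)   # hat matrix
    h = np.diag(H)              # leverages
    resid = y - H @ y
    return float(np.mean((resid / (1 - h)) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 40)
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
y = 1 + 2 * x + rng.normal(0, 1, 40)
print(loocv_brute_force(X, y), loocv_shortcut(X, y))
```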

46
New cards

Suppose that we use some statistical learning method to make a prediction for the response Y for a particular value of the predictor X. Carefully describe how we might estimate the standard deviation of our prediction.

The bootstrap approach would be appropriate here. If the original data contains n observations, we create B bootstrap samples from the data (sampling n observations with replacement, repeated B times). On each of these datasets, we would then train a supervised learning method and use it to make our estimate for the 'particular value of X'. Once we have these B estimates, we can calculate the standard deviation of them. Doing so provides the bootstrap estimate for the standard error of our estimate.
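A minimal sketch of the bootstrap steps described above, assuming a simple linear fit as the learning method; `bootstrap_prediction_se`, `x0`, and the simulated data are hypothetical:

```python
import numpy as np

def bootstrap_prediction_se(x, y, x0, B=1000, seed=0):
    """Bootstrap standard error of the fitted value at x0 for a straight-line fit."""
    rng = np.random.default_rng(seed)
    n = len(x)
    preds = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)    # sample n observations with replacement
        slope, intercept = np.polyfit(x[idx], y[idx], 1)
        preds[b] = intercept + slope * x0   # prediction from this bootstrap fit
    return float(preds.std(ddof=1))         # spread of the B predictions

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 1 + 2 * x + rng.normal(0, 1, 100)
se = bootstrap_prediction_se(x, y, x0=5.0)
print(se)
```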

47
New cards

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors.

Which of the three models with k predictors has the smallest training RSS?

The smallest training RSS will be for the model with best subset approach. This is because the model will be chosen after considering all the possible models with k parameters for best subset. This is not true for either backward stepwise or forward stepwise.
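The claim can be checked on simulated data. This sketch (hypothetical helper names, made-up data) computes the best training RSS at each size k by exhaustive search and by forward stepwise selection:

```python
import itertools
import numpy as np

def rss(X, y, cols):
    # Training RSS of least squares on an intercept plus the chosen columns
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset(X, y):
    # For each k, search every subset of k predictors
    p = X.shape[1]
    return {k: min(rss(X, y, c) for c in itertools.combinations(range(p), k))
            for k in range(p + 1)}

def forward_stepwise(X, y):
    # Greedily add the predictor that lowers the RSS the most
    p = X.shape[1]
    chosen, out = [], {0: rss(X, y, [])}
    for k in range(1, p + 1):
        remaining = [j for j in range(p) if j not in chosen]
        best_j = min(remaining, key=lambda j: rss(X, y, chosen + [j]))
        chosen.append(best_j)
        out[k] = rss(X, y, chosen)
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X[:, 0] - 2 * X[:, 2] + rng.normal(size=60)
best, fwd = best_subset(X, y), forward_stepwise(X, y)
for k in best:
    print(k, round(best[k], 3), round(fwd[k], 3))
```

Forward stepwise only ever considers models that contain its earlier choices, so at each k its RSS can match best subset but never beat it.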

48
New cards

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors.

Which of the three models with k predictors has the smallest test RSS?

Can't say

49
New cards

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors.

TRUE/FALSE: The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.

TRUE

50
New cards

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors.

TRUE/FALSE: The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by backward stepwise selection.

TRUE

51
New cards

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors.

TRUE/FALSE: The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by forward stepwise selection.

FALSE

52
New cards

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors.

TRUE/FALSE: The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.

FALSE

53
New cards

We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors.

TRUE/FALSE: The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.

FALSE

54
New cards

The lasso, relative to least squares, is

Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

55
New cards

Ridge regression, relative to least squares, is

Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

56
New cards

Non-linear methods, relative to least squares, are

More flexible and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias

57
New cards

Ridge regression tends to give similar/different coefficient values to correlated variables

similar
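An extreme illustration, assuming two identical (perfectly correlated) predictors and the closed-form ridge solution on centered data; the `ridge` helper and the simulated data are hypothetical:

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge on centered data: (X'X + lam * I)^-1 X'y (intercept not penalized)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1])          # second predictor is a copy of the first
y = 3 * x1 + rng.normal(size=200)

beta = ridge(X, y, lam=1.0)
print(beta)                            # the total effect of about 3 is split evenly
```

Least squares has no unique solution here, while the ridge penalty is smallest when the shared effect is divided equally between the two copies.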

58
New cards

Lasso tends to give similar/different coefficient values to correlated variables

different

59
New cards

basis function

60
New cards

smoothing splines

result from minimizing a residual sum of squares criterion subject to a smoothness penalty
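In symbols, with g the candidate function and λ ≥ 0 the tuning parameter that controls the smoothness penalty, the criterion being minimized is:

```latex
\sum_{i=1}^{n} \bigl(y_i - g(x_i)\bigr)^2 + \lambda \int g''(t)^2 \, dt
```

The first term is the residual sum of squares; the second penalizes roughness, so larger λ gives a smoother g.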

61
New cards

local regression

similar to splines, but the regions are allowed to overlap in a very smooth way

62
New cards

polynomial regression

extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power

63
New cards

generalized additive models

extend single-predictor non-linear methods (polynomial regression, splines, local regression) to deal with multiple predictors

64
New cards

piecewise polynomial regression

involves fitting separate low-degree polynomials over different regions of X

picture = a piecewise cubic polynomial with a single knot at a point c

65
New cards

knots

The points where the coefficients change

more knots = more flexible piecewise polynomial

66
New cards

cubic spline

a cubic spline with K knots uses K + 4 degrees of freedom
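One way to see the count is to build a truncated power basis: four columns for the cubic polynomial plus one truncated cubic per knot. This is a sketch with a hypothetical helper name and arbitrary knot locations:

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis: 1, x, x^2, x^3, and (x - knot)^3_+ for each knot."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    # (x - knot)^3 where x > knot, and 0 otherwise
    cols += [np.clip(x - knot, 0, None) ** 3 for knot in knots]
    return np.column_stack(cols)

B = cubic_spline_basis(np.linspace(0, 10, 50), knots=[2.5, 5.0, 7.5])
print(B.shape)  # 3 knots give 3 + 4 = 7 columns
```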

67
New cards

degree-d spline

a piecewise degree-d polynomial, with continuity in derivatives up to degree d-1 at each knot

68
New cards

T/F

The lasso is equivalent to assuming a prior distribution for each of the slopes that is double-exponential (laplace) with mean zero

True

69
New cards

T/F

One can use cross-validation to select the best value of the tuning parameter λ

True

70
New cards

T/F

We expect the lasso to perform better than ridge regression in a setting where a relatively small number of predictors have substantial coefficients

True

71
New cards

T/F about Lasso

The larger the ratio (see picture; ignore OLS), the less flexible the model

False

Larger values of this ratio correspond to a smaller tuning parameter, λ, and thus a more flexible model

72
New cards

T/F about Lasso

A penalty term (second term in picture) is added to the residual sum of squares; the sum is minimized

True

73
New cards

AIC is best for nested or non-nested models? With the same number or different number of parameters?

AIC is best for non-nested models with different number of parameters

74
New cards

Deviance is best for nested or non-nested models?

Deviance is best for nested models

75
New cards

The Pearson chi-squared statistic, likelihood ratio chi-squared statistic and the pseudo-R^2 statistic are best for nested or non-nested models? With the same number or different number of parameters?

The Pearson chi-squared statistic, likelihood ratio chi-squared statistic and the pseudo-R^2 statistic are best for non-nested models with the same number of parameters

76
New cards

T/F

If theta-hat is an unbiased estimator of theta and the variance of theta hat = 1/n, it must be a consistent estimator of theta

TRUE

An unbiased estimator is automatically asymptotically unbiased, and its variance 1/n tends to 0 as n grows; together these imply consistency

77
New cards

T/F

If theta-hat is a consistent estimator of theta, it must be an unbiased estimator of theta

FALSE

78
New cards

T/F

If theta-hat is a biased estimator of theta, it is not a consistent estimator of theta

FALSE

We can have a biased estimator that is consistent (e.g. xbar + 1/n)

79
New cards

T/F

If we have asymptotic unbiasedness, then we have consistency

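TRUE when the variance also tends to 0 as n grows (as in the 1/n case); asymptotic unbiasedness alone is not sufficient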

80
New cards

T/F

Regression splines normally produce more stable results than polynomial regression

True

Regression splines can increase flexibility by increasing knots, but keeping the degree of X fixed at for example 3 in cubic splines

81
New cards

T/F

The more knots, the more flexible is a regression spline

True

82
New cards

T/F

The natural cubic spline avoids the variance in the tails of the (ordinary) cubic spline

True

The natural cubic spline has the constraint that it is linear below the first knot and above the last knot

83
New cards

Polynomial Regression Disadvantage

In order to get more flexibility, we have to use higher powers of X

84
New cards

T/F

The first principal component direction of the data is that along which the observations vary the most

TRUE
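A small numerical check of this statement, assuming the first principal direction is taken from the SVD of the centered data matrix; the helper name and simulated data are hypothetical:

```python
import numpy as np

def first_principal_direction(X):
    # The first right singular vector of the centered data is the first PC direction
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(0)
z = rng.normal(size=300)
# Two correlated features plus one noisy, unrelated feature
X = np.column_stack([z + 0.1 * rng.normal(size=300),
                     2 * z + 0.1 * rng.normal(size=300),
                     0.5 * rng.normal(size=300)])

v1 = first_principal_direction(X)
Xc = X - X.mean(axis=0)
var_along_pc1 = np.var(Xc @ v1)
var_along_axes = np.var(Xc, axis=0)
print(var_along_pc1, var_along_axes)
```

The variance of the projections onto the first direction is at least as large as the variance along any single original feature.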

85
New cards

T/F

PCR often leads to overfitting

False

PCR usually reduces the number of dimensions; this reduces the tendency to overfit the model

86
New cards

T/F

PCR assumes that the directions in which the features show the least variation are the directions that are associated with the target

False

the most variation

87
New cards

T/F

PLS is useful for performing feature selection

False

PLS identifies linear combinations of the features that not only approximate these features well, but also that are related to the response

Neither is PCR

88
New cards

An influential point has...

A large absolute value of the standardized residual AND a large leverage

89
New cards

T/F

It is common and expected that the residual series will be autocorrelated

True

90
New cards

T/F

Using least squares fitting, for a positive serial correlation in the residual series, the standard errors of the estimated parameters are likely underestimated

True

91
New cards

T/F

For a positive serial correlation in the residual series, the confidence intervals for parameters using Generalized Least Squares will be narrower than those using least squares.

False

Wider

GLS corrects the underestimate of the standard errors made by least squares

92
New cards

T/F

Usually k-fold cross-validation is less computationally intense than LOOCV with k < n

True

LOOCV fits the model n times

k-fold fits the model k times

93
New cards

T/F

LOOCV tends to overestimate the test error rate in comparison to k-fold cross-validation when k < n

False

k-fold overestimates more

94
New cards

Heteroskedastic

The variance of the error term given the independent variable, x, is not constant

95
New cards

Homoskedasticity

the variance of the error term is constant (the same) around the regression line, whether the values of x are small, medium, or large

96
New cards

T/F with respect to high-dimensional situations

Regularization or shrinkage plays a key role

True

97
New cards

T/F with respect to high-dimensional situations

For ridge regression or the lasso, choosing the tuning parameter is crucial

True

98
New cards

T/F with respect to high-dimensional situations

The test MSE increases as the dimensionality of the problem increases, unless the additional predictors are truly associated with the response

True

99
New cards

Local Regression T/F

At each point in the original data, one fits a weighted least squares regression using only a fraction of the observations

True

100
New cards

Local Regression T/F

One could use the Gaussian (Normal) kernel to determine the weights used in the local regression

False