Assessing Model Accuracy

In order to evaluate the performance of a statistical learning method on

a given data set, we need some way to measure how well its predictions

actually match the observed data. That is, we need to quantify the extent

to which the predicted response value for a given observation is close to

the true response value for that observation. In the regression setting, the

most commonly-used measure is the mean squared error (MSE), given by

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{f}(x_i)\bigr)^2, \tag{2.5}$$

where ˆf(xi) is the prediction that ˆf gives for the ith observation. The MSE

will be small if the predicted responses are very close to the true responses,

and will be large if for some of the observations, the predicted and true

responses differ substantially.
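
As a small illustration of the MSE formula, here is a minimal Python sketch. The function name mean_squared_error and the three observed and predicted values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def mean_squared_error(y, y_hat):
    """Average squared difference between observed and predicted responses."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

# Hypothetical observed responses and predictions, for illustration only.
y_obs = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Squared errors are 1, 0 and 4, so the MSE is their average, 5/3.
print(mean_squared_error(y_obs, y_pred))
```

One large error (here the third observation) dominates the average, which is why the MSE is large when predictions differ substantially for even a few observations.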

The MSE in (2.5) is computed using the training data that was used to

fit the model, and so should more accurately be referred to as the training

MSE. But in general, we do not really care how well the method works

on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen

test data. Why is this what we care about? Suppose that we are interested in developing an algorithm to predict a stock’s price based on previous

stock returns. We can train the method using stock returns from the past

6 months. But we don’t really care how well our method predicts last week’s

stock price. We instead care about how well it will predict tomorrow’s price

or next month’s price. On a similar note, suppose that we have clinical

measurements (e.g. weight, blood pressure, height, age, family history of

disease) for a number of patients, as well as information about whether each

patient has diabetes. We can use these patients to train a statistical learning method to predict risk of diabetes based on clinical measurements. In

practice, we want this method to accurately predict diabetes risk for future

patients based on their clinical measurements. We are not very interested

in whether or not the method accurately predicts diabetes risk for patients

used to train the model, since we already know which of those patients

have diabetes.

To state it more mathematically, suppose that we fit our statistical learning method on our training observations {(x1, y1),(x2, y2),...,(xn, yn)},

and we obtain the estimate ˆf. We can then compute ˆf(x1), ˆf(x2),..., ˆf(xn).

FIGURE 2.9. Left: Data simulated from f, shown in black. Three estimates of f are shown: the linear regression line (orange curve), and two smoothing spline fits (blue and green curves). Right: Training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line), plotted as mean squared error against flexibility. Squares represent the training and test MSEs for the three fits shown in the left-hand panel.

If these are approximately equal to y1, y2,...,yn, then the training MSE

given by (2.5) is small. However, we are really not interested in whether

ˆf(xi) ≈ yi; instead, we want to know whether ˆf(x0) is approximately equal

to y0, where (x0, y0) is a previously unseen test observation not used to train

the statistical learning method. We want to choose the method that gives

the lowest test MSE, as opposed to the lowest training MSE. In other words, if we had a large number of test observations, we could compute

$$\mathrm{Ave}\bigl(y_0 - \hat{f}(x_0)\bigr)^2, \tag{2.6}$$

the average squared prediction error for these test observations (x0, y0).

We’d like to select the model for which this quantity is as small as possible.

How can we go about trying to select a method that minimizes the test

MSE? In some settings, we may have a test data set available—that is,

we may have access to a set of observations that were not used to train

the statistical learning method. We can then simply evaluate (2.6) on the

test observations, and select the learning method for which the test MSE is

smallest. But what if no test observations are available? In that case, one

might imagine simply selecting a statistical learning method that minimizes

the training MSE (2.5). This seems like it might be a sensible approach,

since the training MSE and the test MSE appear to be closely related.

Unfortunately, there is a fundamental problem with this strategy: there

is no guarantee that the method with the lowest training MSE will also

have the lowest test MSE. Roughly speaking, the problem is that many

statistical methods specifically estimate coefficients so as to minimize the

training set MSE. For these methods, the training set MSE can be quite

small, but the test MSE is often much larger.
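
The gap between training and test MSE can be seen in a small simulation. The sketch below is not the book's experiment: it assumes a hypothetical sine-shaped truth (f_true), Gaussian noise with standard deviation 0.3, and least squares polynomials of increasing degree as a stand-in for increasing flexibility.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def f_true(x):
    # Hypothetical non-linear truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def simulate(n, sigma=0.3):
    """Draw n observations y = f(x) + noise with x uniform on [0, 1]."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = f_true(x) + rng.normal(0.0, sigma, size=n)
    return x, y

x_train, y_train = simulate(50)       # data used to fit the models
x_test, y_test = simulate(10_000)     # large, previously unseen test set

degrees = list(range(1, 11))
train_mse, test_mse = [], []
for d in degrees:
    f_hat = Polynomial.fit(x_train, y_train, deg=d)  # least squares fit
    train_mse.append(np.mean((y_train - f_hat(x_train)) ** 2))
    test_mse.append(np.mean((y_test - f_hat(x_test)) ** 2))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d:2d}  training MSE {tr:.4f}  test MSE {te:.4f}")
```

Because each polynomial family contains the lower-degree ones, the training MSE cannot increase with the degree, whereas nothing forces the test MSE to follow it.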

Figure 2.9 illustrates this phenomenon on a simple example. In the left-hand panel of Figure 2.9, we have generated observations from (2.1) with the true f given by the black curve. The orange, blue and green curves illustrate three possible estimates for f obtained using methods with increasing levels of flexibility. The orange line is the linear regression fit, which is relatively inflexible. The blue and green curves were produced using smoothing splines, discussed in Chapter 7, with different levels of smoothness. It is clear that as the level of flexibility increases, the curves fit the observed data more closely. The green curve is the most flexible and matches the data very well; however, we observe that it fits the true f (shown in black) poorly because it is too wiggly. By adjusting the level of flexibility of the smoothing spline fit, we can produce many different fits to this data.

We now move on to the right-hand panel of Figure 2.9. The grey curve displays the average training MSE as a function of flexibility, or more formally the degrees of freedom, for a number of smoothing splines. The degrees of freedom is a quantity that summarizes the flexibility of a curve; it is discussed more fully in Chapter 7. The orange, blue and green squares indicate the MSEs associated with the corresponding curves in the left-hand panel. A more restricted and hence smoother curve has fewer degrees of freedom than a wiggly curve; note that in Figure 2.9, linear regression is at the most restrictive end, with two degrees of freedom. The training MSE declines monotonically as flexibility increases. In this example the true f is non-linear, and so the orange linear fit is not flexible enough to estimate f well. The green curve has the lowest training MSE of all three methods, since it corresponds to the most flexible of the three curves fit in the left-hand panel.

In this example, we know the true function f, and so we can also compute the test MSE over a very large test set, as a function of flexibility. (Of course, in general f is unknown, so this will not be possible.) The test MSE is displayed using the red curve in the right-hand panel of Figure 2.9. As with the training MSE, the test MSE initially declines as the level of flexibility increases. However, at some point the test MSE levels off and then starts to increase again. Consequently, the orange and green curves both have high test MSE. The blue curve minimizes the test MSE, which should not be surprising given that visually it appears to estimate f the best in the left-hand panel of Figure 2.9. The horizontal dashed line indicates Var(ε), the irreducible error in (2.3), which corresponds to the lowest achievable test MSE among all possible methods. Hence, the smoothing spline represented by the blue curve is close to optimal.
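
To see why Var(ε) acts as a floor, one can evaluate the test MSE of the true f itself on simulated data. This is only a sketch under assumed settings: a hypothetical quadratic truth, noise with standard deviation 1 (so Var(ε) = 1), and a test set of 200,000 points.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0  # assumed noise standard deviation, so Var(eps) = 1.0

def f_true(x):
    # Hypothetical truth, chosen only for illustration.
    return x ** 2

# A very large simulated test set.
x0 = rng.uniform(-2.0, 2.0, size=200_000)
y0 = f_true(x0) + rng.normal(0.0, sigma, size=x0.size)

# Even predicting with the true f leaves the noise unexplained.
oracle_test_mse = np.mean((y0 - f_true(x0)) ** 2)

# Any systematic departure from f adds to the error (here a shift of 0.5).
shifted_test_mse = np.mean((y0 - (f_true(x0) + 0.5)) ** 2)

print(f"Var(eps)             {sigma ** 2:.3f}")
print(f"test MSE using f     {oracle_test_mse:.3f}")
print(f"test MSE using f+0.5 {shifted_test_mse:.3f}")
```

The first test MSE should sit close to Var(ε), and the shifted predictor should sit above it by roughly the squared shift.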

In the right-hand panel of Figure 2.9, as the flexibility of the statistical learning method increases, we observe a monotone decrease in the training MSE and a U-shape in the test MSE. This is a fundamental property of statistical learning that holds regardless of the particular data set at hand and regardless of the statistical method being used. As model flexibility increases, the training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f. When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don’t exist in the test data. Note that regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.

FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is much closer to linear. In this setting, linear regression provides a very good fit to the data.

Figure 2.10 provides another example in which the true f is approximately linear. Again we observe that the training MSE decreases monotonically as the model flexibility increases, and that there is a U-shape in

the test MSE. However, because the truth is close to linear, the test MSE

only decreases slightly before increasing again, so that the orange least

squares fit is substantially better than the highly flexible green curve. Finally, Figure 2.11 displays an example in which f is highly non-linear. The

training and test MSE curves still exhibit the same general patterns, but

now there is a rapid decrease in both curves before the test MSE starts to

increase slowly.

In practice, one can usually compute the training MSE with relative

ease, but estimating the test MSE is considerably more difficult because

usually no test data are available. As the previous three examples illustrate,

the flexibility level corresponding to the model with the minimal test MSE

can vary considerably among data sets. Throughout this book, we discuss a

variety of approaches that can be used in practice to estimate this minimum

point. One important method is cross-validation (Chapter 5), which is a method for estimating the test MSE using the training data.
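
Cross-validation is treated properly in Chapter 5; the following is only a rough sketch of the idea, assuming a basic k-fold scheme, polynomial degree as the measure of flexibility, and simulated data from a hypothetical sine-shaped truth. The helper name kfold_cv_mse is an assumption for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

def kfold_cv_mse(x, y, degree, k=5, seed=0):
    """Estimate the test MSE of a degree-`degree` polynomial fit by k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    fold_mse = []
    for held_out in folds:
        # Fit on every observation that is not in the held-out fold.
        train = np.setdiff1d(np.arange(len(x)), held_out)
        f_hat = Polynomial.fit(x[train], y[train], deg=degree)
        # Score on the held-out fold, which the fit never saw.
        fold_mse.append(np.mean((y[held_out] - f_hat(x[held_out])) ** 2))
    return float(np.mean(fold_mse))

# Simulated training data (hypothetical truth and noise level).
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=100)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=100)

cv = {d: kfold_cv_mse(x, y, d) for d in range(1, 9)}
best_degree = min(cv, key=cv.get)

for d, mse in cv.items():
    print(f"degree {d}: estimated test MSE {mse:.4f}")
print("degree with the smallest estimate:", best_degree)
```

Only the training observations are used, yet each fold's score comes from points held out of the corresponding fit, which is what lets the average stand in for a test MSE.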

2.2.2 The Bias-Variance Trade-Off

The U-shape observed in the test MSE curves (Figures 2.9–2.11) turns out

to be the result of two competing properties of statistical learning methods.

FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from linear. In this setting, linear regression provides a very poor fit to the data.

Though the mathematical proof is beyond the scope of this book, it is

possible to show that the expected test MSE, for a given value x0, can

always be decomposed into the sum of three fundamental quantities: the

variance of ˆf(x0), the squared bias of ˆf(x0) and the variance of the error

terms ε. That is,

$$E\bigl(y_0 - \hat{f}(x_0)\bigr)^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\varepsilon). \tag{2.7}$$

Here the notation $E\bigl(y_0 - \hat{f}(x_0)\bigr)^2$ defines the expected test MSE at x0, and refers to the average test MSE that we would obtain if we repeatedly estimated f using a large number of training sets, and tested each at x0. The overall expected test MSE can be computed by averaging $E\bigl(y_0 - \hat{f}(x_0)\bigr)^2$ over all possible values of x0 in the test set.

Equation 2.7 tells us that in order to minimize the expected test error,

we need to select a statistical learning method that simultaneously achieves

low variance and low bias. Note that variance is inherently a nonnegative

quantity, and squared bias is also nonnegative. Hence, we see that the

expected test MSE can never lie below Var(ε), the irreducible error from

(2.3).
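
Although the full proof is left out of scope, a brief sketch shows where the three terms come from. It assumes that y0 = f(x0) + ε with E(ε) = 0, that ε is independent of the training data (and hence of the fitted ˆf), and that Bias(ˆf(x0)) means E ˆf(x0) − f(x0); these conventions are assumptions of the sketch rather than statements taken from this section.

```latex
\begin{aligned}
E\bigl(y_0 - \hat{f}(x_0)\bigr)^2
  &= E\bigl(f(x_0) + \varepsilon - \hat{f}(x_0)\bigr)^2 \\
  &= E\bigl(f(x_0) - \hat{f}(x_0)\bigr)^2
     + 2\,E[\varepsilon]\,E\bigl[f(x_0) - \hat{f}(x_0)\bigr]
     + E[\varepsilon^2] \\
  &= E\bigl(f(x_0) - \hat{f}(x_0)\bigr)^2 + \mathrm{Var}(\varepsilon), \\[4pt]
E\bigl(f(x_0) - \hat{f}(x_0)\bigr)^2
  &= \bigl(f(x_0) - E\hat{f}(x_0)\bigr)^2
     + E\bigl(\hat{f}(x_0) - E\hat{f}(x_0)\bigr)^2 \\
  &= [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\hat{f}(x_0)).
\end{aligned}
```

The cross term in the second line vanishes because E(ε) = 0, and the cross term in the fourth line vanishes because ˆf(x0) − E ˆf(x0) has mean zero. Substituting the second result into the first gives the three-term sum.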

What do we mean by the variance and bias of a statistical learning

method? Variance refers to the amount by which ˆf would change if we

estimated it using a different training data set. Since the training data

are used to fit the statistical learning method, different training data sets

will result in a different ˆf. But ideally the estimate for f should not vary

too much between training sets. However, if a method has high variance

then small changes in the training data can result in large changes in ˆf. In

general, more flexible statistical methods have higher variance. Consider the

green and orange curves in Figure 2.9. The flexible green curve is following

the observations very closely. It has high variance because changing any

one of these data points may cause the estimate ˆf to change considerably.

FIGURE 2.12. Squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11, each plotted against flexibility. The vertical dotted line indicates the flexibility level corresponding to the smallest test MSE.

In contrast, the orange least squares line is relatively inflexible and has low

variance, because moving any single observation will likely cause only a

small shift in the position of the line.
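
The notion of variance can be made concrete by refitting on many simulated training sets and watching how much the prediction at one point moves. This is a sketch under assumed settings (a hypothetical sine-shaped truth, 50 observations per training set, and a degree-10 polynomial standing in for a flexible method), not a reproduction of the book's figures.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

def f_true(x):
    # Hypothetical truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def predictions_at(x0, degree, n_sets=500, n=50, sigma=0.3):
    """Fit one polynomial per simulated training set and return each f_hat(x0)."""
    preds = np.empty(n_sets)
    for s in range(n_sets):
        x = rng.uniform(0.0, 1.0, size=n)
        y = f_true(x) + rng.normal(0.0, sigma, size=n)
        preds[s] = Polynomial.fit(x, y, deg=degree)(x0)
    return preds

x0 = 0.5
var_linear = predictions_at(x0, degree=1).var()     # inflexible fit
var_flexible = predictions_at(x0, degree=10).var()  # flexible fit

print(f"variance of linear fit at x0:    {var_linear:.4f}")
print(f"variance of degree-10 fit at x0: {var_flexible:.4f}")
```

The spread of the predictions across training sets is the variance of ˆf(x0); under these settings the flexible fit should show the larger spread.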

On the other hand, bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much

simpler model. For example, linear regression assumes that there is a linear

relationship between Y and X1, X2,...,Xp. It is unlikely that any real-life

problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f. In

Figure 2.11, the true f is substantially non-linear, so no matter how many

training observations we are given, it will not be possible to produce an

accurate estimate using linear regression. In other words, linear regression

results in high bias in this example. However, in Figure 2.10 the true f

is very close to linear, and so given enough data, it should be possible for

linear regression to produce an accurate estimate. Generally, more flexible

methods result in less bias.
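
The claim that more data cannot remove the bias of a linear fit to a non-linear f can be checked by simulation. The sketch below assumes a hypothetical sine-shaped truth and estimates the bias at a single point x0 = 0.25 by averaging predictions over many training sets; the helper name bias_at is an assumption for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(4)

def f_true(x):
    # Hypothetical, strongly non-linear truth.
    return np.sin(2 * np.pi * x)

def bias_at(x0, degree, n, n_sets=300, sigma=0.3):
    """Monte Carlo estimate of E[f_hat(x0)] - f(x0) for a polynomial fit."""
    preds = np.empty(n_sets)
    for s in range(n_sets):
        x = rng.uniform(0.0, 1.0, size=n)
        y = f_true(x) + rng.normal(0.0, sigma, size=n)
        preds[s] = Polynomial.fit(x, y, deg=degree)(x0)
    return float(preds.mean() - f_true(x0))

x0 = 0.25
bias_linear_small_n = bias_at(x0, degree=1, n=50)
bias_linear_large_n = bias_at(x0, degree=1, n=5000)
bias_quintic = bias_at(x0, degree=5, n=50)

print(f"linear, n=50:     bias {bias_linear_small_n:+.3f}")
print(f"linear, n=5000:   bias {bias_linear_large_n:+.3f}")
print(f"degree 5, n=50:   bias {bias_quintic:+.3f}")
```

Increasing n from 50 to 5000 shrinks the variance of the linear fit but should leave its bias essentially unchanged, while the more flexible degree-5 fit has much smaller bias even at n = 50.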

As a general rule, as we use more flexible methods, the variance will

increase and the bias will decrease. The relative rate of change of these

two quantities determines whether the test MSE increases or decreases. As

we increase the flexibility of a class of methods, the bias tends to initially

decrease faster than the variance increases. Consequently, the expected

test MSE declines. However, at some point increasing flexibility has little

impact on the bias but starts to significantly increase the variance. When

this happens the test MSE increases. Note that we observed this pattern

of decreasing test MSE followed by increasing test MSE in the right-hand

panels of Figures 2.9–2.11.

The three plots in Figure 2.12 illustrate Equation 2.7 for the examples in Figures 2.9–2.11. In each case the blue solid curve represents the squared bias, for different levels of flexibility, while the orange curve corresponds to the variance. The horizontal dashed line represents Var(ε), the irreducible error. Finally, the red curve, corresponding to the test set MSE, is the sum of these three quantities. In all three cases, the variance increases and the bias decreases as the method’s flexibility increases. However, the flexibility level corresponding to the optimal test MSE differs considerably among the three data sets, because the squared bias and variance change at different rates in each of the data sets. In the left-hand panel of Figure 2.12, the bias initially decreases rapidly, resulting in an initial sharp decrease in the expected test MSE. On the other hand, in the center panel of Figure 2.12 the true f is close to linear, so there is only a small decrease in bias as flexibility increases, and the test MSE only declines slightly before increasing rapidly as the variance increases. Finally, in the right-hand panel of Figure 2.12, as flexibility increases, there is a dramatic decline in bias because the true f is very non-linear. There is also very little increase in variance as flexibility increases. Consequently, the test MSE declines substantially before experiencing a small increase as model flexibility increases.
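
The statement that the test MSE is the sum of variance, squared bias and Var(ε) can be checked numerically at a single point. The sketch below assumes a hypothetical sine-shaped truth, a cubic polynomial fit, noise standard deviation 0.3, and 2000 simulated training sets; both sides of Equation 2.7 are Monte Carlo estimates, so they should agree only approximately.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(5)
sigma = 0.3      # assumed noise standard deviation
x0 = 0.25        # point at which the decomposition is checked
n_sets = 2000    # number of simulated training sets

def f_true(x):
    # Hypothetical truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)

# One fitted prediction at x0 per training set.
preds = np.empty(n_sets)
for s in range(n_sets):
    x = rng.uniform(0.0, 1.0, size=50)
    y = f_true(x) + rng.normal(0.0, sigma, size=50)
    preds[s] = Polynomial.fit(x, y, deg=3)(x0)

# One fresh test response at x0 per training set.
y0 = f_true(x0) + rng.normal(0.0, sigma, size=n_sets)

variance = preds.var()
squared_bias = (preds.mean() - f_true(x0)) ** 2
rhs = variance + squared_bias + sigma ** 2   # right-hand side of (2.7)
lhs = np.mean((y0 - preds) ** 2)             # expected test MSE at x0

print(f"variance {variance:.4f}  squared bias {squared_bias:.4f}  Var(eps) {sigma ** 2:.4f}")
print(f"sum of the three terms {rhs:.4f}  simulated test MSE {lhs:.4f}")
```

Repeating this for several degrees and plotting the three terms against the degree would give curves in the spirit of Figure 2.12, under these assumed settings.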

The relationship between bias, variance, and test set MSE given in Equation 2.7 and displayed in Figure 2.12 is referred to as the bias-variance trade-off. Good test set performance of a statistical learning method requires low variance as well as low squared bias. This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low. This trade-off is one of the most important recurring themes in this book.
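
The two extremes mentioned above can be simulated directly. In this sketch the curve through every training observation is taken to be a piecewise-linear interpolant (one of many possible such curves), the horizontal line is the mean of the training responses, and the truth is a hypothetical sine curve; all of these are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(6)

def f_true(x):
    # Hypothetical truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)

def fit_interpolant(x, y):
    """Curve passing through every training observation (piecewise linear)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    return lambda x_new: np.interp(x_new, xs, ys)

def fit_horizontal(x, y):
    """Horizontal line at the mean training response."""
    level = y.mean()
    return lambda x_new: np.full(np.shape(x_new), level)

def bias2_and_variance(fit, x0, n_sets=1000, n=50, sigma=0.3):
    """Monte Carlo squared bias and variance of the prediction at x0."""
    preds = np.empty(n_sets)
    for s in range(n_sets):
        x = rng.uniform(0.0, 1.0, size=n)
        y = f_true(x) + rng.normal(0.0, sigma, size=n)
        preds[s] = fit(x, y)(x0)
    return float((preds.mean() - f_true(x0)) ** 2), float(preds.var())

x0 = 0.25
b2_interp, var_interp = bias2_and_variance(fit_interpolant, x0)
b2_flat, var_flat = bias2_and_variance(fit_horizontal, x0)

print(f"interpolant:     squared bias {b2_interp:.4f}  variance {var_interp:.4f}")
print(f"horizontal line: squared bias {b2_flat:.4f}  variance {var_flat:.4f}")
```

Under these settings the interpolant should show small squared bias but comparatively large variance, and the horizontal line the reverse, which is the trade-off in its starkest form.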

In a real-life situation in which f is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical

learning method. Nevertheless, one should always keep the bias-variance

trade-off in mind. In this book we explore methods that are extremely

flexible and hence can essentially eliminate bias. However, this does not

guarantee that they will outperform a much simpler method such as linear

regression. To take an extreme example, suppose that the true f is linear.

In this situation linear regression will have no bias, making it very hard

for a more flexible method to compete. In contrast, if the true f is highly

non-linear and we have an ample number of training observations, then

we may do better using a highly flexible approach, as in Figure 2.11. In

Chapter 5 we discuss cross-validation, which is a way to estimate the test

MSE using the training data.

To evaluate the performance of a statistical learning method, we use the mean squared error (MSE), which quantifies the accuracy of predictions. The training MSE, calculated on the training data, is of limited interest; rather, we seek the test MSE on unseen data, as it better reflects prediction accuracy for future observations. Overfitting occurs when a model captures noise in the training data, leading to a large test MSE despite a small training MSE. This results from the trade-off between bias and variance: as model flexibility increases, training MSE decreases, but test MSE may initially decrease and then increase, creating a U-shape in the curve. The goal is to find a model with low bias and low variance, which often requires a balance in flexibility. The bias-variance trade-off is crucial in model selection, where a simpler model is preferred if the true function is close to its assumptions.