Assessing Model Accuracy
In order to evaluate the performance of a statistical learning method on
a given data set, we need some way to measure how well its predictions
actually match the observed data. That is, we need to quantify the extent
to which the predicted response value for a given observation is close to
the true response value for that observation. In the regression setting, the
most commonly-used measure is the mean squared error (MSE), given by
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{f}(x_i)\bigr)^2, \qquad (2.5)$$
where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the $i$th observation. The MSE
will be small if the predicted responses are very close to the true responses,
and will be large if for some of the observations, the predicted and true
responses differ substantially.
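As a small illustration of the MSE formula, the sketch below computes the MSE for a handful of made-up responses and two made-up sets of predictions. The function name `mean_squared_error` and all of the numbers are assumptions chosen for this example, not values from the text.

```python
def mean_squared_error(observed, predicted):
    """MSE: the average squared gap between each y_i and its prediction."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / len(observed)


# Made-up responses and two made-up sets of predictions, for illustration only
observed = [3.0, 5.0, 7.0]
close_predictions = [3.1, 4.9, 7.2]   # every prediction is near the observed value
poor_predictions = [1.0, 8.0, 4.0]    # some predictions are far off

mse_close = mean_squared_error(observed, close_predictions)
mse_poor = mean_squared_error(observed, poor_predictions)
print(f"MSE with close predictions: {mse_close:.3f}")
print(f"MSE with poor predictions:  {mse_poor:.3f}")
```

The first value is 0.02 and the second is 22/3, or about 7.33, which matches the description above: the MSE is small when predictions are close to the observed responses and large when some of them differ substantially.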
The MSE in (2.5) is computed using the training data that was used to
fit the model, and so should more accurately be referred to as the training
MSE. But in general, we do not really care how well the method works
on the training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously unseen
test data. Why is this what we care about? Suppose that we are interested in developing an algorithm to predict a stock’s price based on previous
stock returns. We can train the method using stock returns from the past
6 months. But we don’t really care how well our method predicts last week’s
stock price. We instead care about how well it will predict tomorrow’s price
or next month’s price. On a similar note, suppose that we have clinical
measurements (e.g. weight, blood pressure, height, age, family history of
disease) for a number of patients, as well as information about whether each
patient has diabetes. We can use these patients to train a statistical learning method to predict risk of diabetes based on clinical measurements. In
practice, we want this method to accurately predict diabetes risk for future
patients based on their clinical measurements. We are not very interested
in whether or not the method accurately predicts diabetes risk for patients
used to train the model, since we already know which of those patients
have diabetes.
To state it more mathematically, suppose that we fit our statistical learning method on our training observations $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$,
and we obtain the estimate $\hat{f}$. We can then compute $\hat{f}(x_1), \hat{f}(x_2), \ldots, \hat{f}(x_n)$.
FIGURE 2.9. Left: Data simulated from f, shown in black, with X on the horizontal axis and Y on the vertical axis. Three estimates of
f are shown: the linear regression line (orange curve), and two smoothing spline
fits (blue and green curves). Right: Training MSE (grey curve), test MSE (red
curve), and minimum possible test MSE over all methods (dashed line), plotted as mean squared error against flexibility. Squares
represent the training and test MSEs for the three fits shown in the left-hand
panel.
If these are approximately equal to $y_1, y_2, \ldots, y_n$, then the training MSE
given by (2.5) is small. However, we are really not interested in whether
$\hat{f}(x_i) \approx y_i$; instead, we want to know whether $\hat{f}(x_0)$ is approximately equal
to $y_0$, where $(x_0, y_0)$ is a previously unseen test observation not used to train
the statistical learning method. We want to choose the method that gives
the lowest test MSE, as opposed to the lowest training MSE. In other words, if we had a large number of test observations, we could compute
$$\mathrm{Ave}\bigl(y_0 - \hat{f}(x_0)\bigr)^2, \qquad (2.6)$$
the average squared prediction error for these test observations $(x_0, y_0)$.
We’d like to select the model for which this quantity is as small as possible.
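To make the distinction between (2.5) and (2.6) concrete, the following sketch fits a least squares line to simulated training data and then evaluates the same fit on fresh test observations. The data-generating model (Y = 2 + 3X + ε with Var(ε) = 1), the sample sizes, the seed, and the helper names `fit_line`, `mse` and `simulate` are all assumptions made for this illustration; only the Python standard library is used.

```python
import random


def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mse(f_hat, xs, ys):
    """Average squared prediction error of f_hat on the observations (xs, ys)."""
    return sum((y - f_hat(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n, rng):
    """Hypothetical data: Y = 2 + 3X + eps, with X uniform on [0, 10] and Var(eps) = 1."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = simulate(50, rng)     # used to fit the model
x_test, y_test = simulate(1000, rng)     # never seen during fitting

f_hat = fit_line(x_train, y_train)
training_mse = mse(f_hat, x_train, y_train)   # the quantity in (2.5)
test_mse = mse(f_hat, x_test, y_test)         # the quantity in (2.6)

# Reference point: a horizontal line at the mean training response
y_bar = sum(y_train) / len(y_train)
baseline_training_mse = mse(lambda x: y_bar, x_train, y_train)

print(f"training MSE: {training_mse:.3f}")
print(f"test MSE:     {test_mse:.3f}")
```

Because the line is chosen to minimize the training MSE, its training MSE cannot exceed that of the horizontal reference line. The test MSE carries no such guarantee, since the test observations played no part in the fit.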
How can we go about trying to select a method that minimizes the test
MSE? In some settings, we may have a test data set available—that is,
we may have access to a set of observations that were not used to train
the statistical learning method. We can then simply evaluate (2.6) on the
test observations, and select the learning method for which the test MSE is
smallest. But what if no test observations are available? In that case, one
might imagine simply selecting a statistical learning method that minimizes
the training MSE (2.5). This seems like it might be a sensible approach,
since the training MSE and the test MSE appear to be closely related.
Unfortunately, there is a fundamental problem with this strategy: there
is no guarantee that the method with the lowest training MSE will also
have the lowest test MSE. Roughly speaking, the problem is that many
statistical methods specifically estimate coefficients so as to minimize the
training set MSE. For these methods, the training set MSE can be quite
small, but the test MSE is often much larger.
Figure 2.9 illustrates this phenomenon on a simple example. In the left-hand panel of Figure 2.9, we have generated observations from (2.1) with
the true f given by the black curve. The orange, blue and green curves illustrate three possible estimates for f obtained using methods with increasing
levels of flexibility. The orange line is the linear regression fit, which is relatively inflexible. The blue and green curves were produced using smoothing
splines, discussed in Chapter 7, with different levels of smoothness. It is
clear that as the level of flexibility increases, the curves fit the observed
data more closely. The green curve is the most flexible and matches the
data very well; however, we observe that it fits the true f (shown in black)
poorly because it is too wiggly. By adjusting the level of flexibility of the
smoothing spline fit, we can produce many different fits to this data.
We now move on to the right-hand panel of Figure 2.9. The grey curve
displays the average training MSE as a function of flexibility, or more
formally the degrees of freedom, for a number of smoothing splines. The
degrees of freedom is a quantity that summarizes the flexibility of a curve;
it is discussed more fully in Chapter 7. The orange, blue and green squares
indicate the MSEs associated with the corresponding curves in the left-hand panel. A more restricted and hence smoother curve has fewer degrees
of freedom than a wiggly curve; note that in Figure 2.9, linear regression
is at the most restrictive end, with two degrees of freedom. The training
MSE declines monotonically as flexibility increases. In this example the
true f is non-linear, and so the orange linear fit is not flexible enough to
estimate f well. The green curve has the lowest training MSE of all three
methods, since it corresponds to the most flexible of the three curves fit in
the left-hand panel.
In this example, we know the true function f, and so we can also compute the test MSE over a very large test set, as a function of flexibility. (Of
course, in general f is unknown, so this will not be possible.) The test MSE
is displayed using the red curve in the right-hand panel of Figure 2.9. As
with the training MSE, the test MSE initially declines as the level of flexibility increases. However, at some point the test MSE levels off and then
starts to increase again. Consequently, the orange and green curves both
have high test MSE. The blue curve minimizes the test MSE, which should
not be surprising given that visually it appears to estimate f the best in the
left-hand panel of Figure 2.9. The horizontal dashed line indicates $\mathrm{Var}(\epsilon)$,
the irreducible error in (2.3), which corresponds to the lowest achievable
test MSE among all possible methods. Hence, the smoothing spline represented by the blue curve is close to optimal.
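The pattern described for Figure 2.9 can be imitated with a small simulation. The sketch below is not the smoothing spline analysis behind the figure: it swaps in k-nearest-neighbour averaging, where a smaller k gives a more flexible fit, because that method needs only the Python standard library. The true function f(x) = 3 sin(x), the noise variance of 1, the sample sizes, the seed and the grid of k values are all assumptions made for illustration.

```python
import math
import random


def simulate(n, rng):
    """Hypothetical data: Y = 3 sin(X) + eps, with X uniform on [0, 10] and Var(eps) = 1."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [3.0 * math.sin(x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def knn_mse_by_k(x_train, y_train, x_eval, y_eval, ks):
    """MSE on (x_eval, y_eval) of k-nearest-neighbour averaging, for each k in ks."""
    squared_errors = {k: 0.0 for k in ks}
    for x0, y0 in zip(x_eval, y_eval):
        # Training responses ordered by distance from x0, closest first
        distances = (abs(x - x0) for x in x_train)
        neighbours = [y for _, y in sorted(zip(distances, y_train))]
        for k in ks:
            prediction = sum(neighbours[:k]) / k
            squared_errors[k] += (y0 - prediction) ** 2
    return {k: total / len(x_eval) for k, total in squared_errors.items()}


rng = random.Random(1)
x_train, y_train = simulate(200, rng)
x_test, y_test = simulate(1000, rng)

ks = [200, 100, 50, 25, 15, 9, 5, 3, 1]   # ordered from least to most flexible
train_mse = knn_mse_by_k(x_train, y_train, x_train, y_train, ks)
test_mse = knn_mse_by_k(x_train, y_train, x_test, y_test, ks)
best_k = min(ks, key=lambda k: test_mse[k])

for k in ks:
    print(f"k={k:3d}  training MSE={train_mse[k]:.3f}  test MSE={test_mse[k]:.3f}")
print("k with the smallest test MSE:", best_k)
```

With k = 1 each training point is its own nearest neighbour, so the training MSE is zero even though the test MSE is not. For nearest-neighbour averaging the training MSE is not guaranteed to fall at every step, but the broad picture is the same: the most flexible fit has the smallest training MSE, while the smallest test MSE occurs at an intermediate level of flexibility.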
In the right-hand panel of Figure 2.9, as the flexibility of the statistical
learning method increases, we observe a monotone decrease in the training
MSE and a U-shape in the test MSE. This is a fundamental property of
statistical learning that holds regardless of the particular data set at hand
and regardless of the statistical method being used. As model flexibility
increases, the training MSE will decrease, but the test MSE may not. When
a given method yields a small training MSE but a large test MSE, we are
said to be overfitting the data. This happens because our statistical learning
procedure is working too hard to find patterns in the training data, and
may be picking up some patterns that are just caused by random chance
rather than by true properties of the unknown function f. When we overfit
the training data, the test MSE will be very large because the supposed
patterns that the method found in the training data simply don’t exist
in the test data. Note that regardless of whether or not overfitting has
occurred, we almost always expect the training MSE to be smaller than
the test MSE because most statistical learning methods either directly or
indirectly seek to minimize the training MSE. Overfitting refers specifically
to the case in which a less flexible model would have yielded a smaller
test MSE.
FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is
much closer to linear. In this setting, linear regression provides a very good fit to
the data.
Figure 2.10 provides another example in which the true f is approximately linear. Again we observe that the training MSE decreases monotonically as the model flexibility increases, and that there is a U-shape in
the test MSE. However, because the truth is close to linear, the test MSE
only decreases slightly before increasing again, so that the orange least
squares fit is substantially better than the highly flexible green curve. Finally, Figure 2.11 displays an example in which f is highly non-linear. The
training and test MSE curves still exhibit the same general patterns, but
now there is a rapid decrease in both curves before the test MSE starts to
increase slowly.
In practice, one can usually compute the training MSE with relative
ease, but estimating the test MSE is considerably more difficult because
usually no test data are available. As the previous three examples illustrate,
the flexibility level corresponding to the model with the minimal test MSE
can vary considerably among data sets. Throughout this book, we discuss a
variety of approaches that can be used in practice to estimate this minimum
point. One important method is cross-validation (Chapter 5), which is a method for estimating the test MSE using the training data.
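Cross-validation is treated properly in Chapter 5, but the basic idea can be sketched in a few lines: hold out part of the training data, fit on the rest, and measure the squared error on the part that was held out. The version below is one simple variant and not necessarily the procedure developed later. The helper names `fit_line` and `cross_validated_mse`, the five interleaved folds, and the simulated data (Y = 2 + 3X + ε with Var(ε) = 1) are assumptions made for this illustration.

```python
import random


def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def cross_validated_mse(xs, ys, fit, n_folds=5):
    """Estimate the test MSE by holding out each fold once and fitting on the rest.

    Folds are interleaved (every n_folds-th observation), which assumes that the
    observations are already in random order.
    """
    n = len(xs)
    total_squared_error = 0.0
    for fold in range(n_folds):
        held_out = set(range(fold, n, n_folds))
        x_fit = [x for i, x in enumerate(xs) if i not in held_out]
        y_fit = [y for i, y in enumerate(ys) if i not in held_out]
        f_hat = fit(x_fit, y_fit)
        total_squared_error += sum((ys[i] - f_hat(xs[i])) ** 2 for i in held_out)
    return total_squared_error / n   # each observation is held out exactly once


# Hypothetical training data; no separate test set is used anywhere below
rng = random.Random(6)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]

f_hat = fit_line(xs, ys)
training_mse = sum((y - f_hat(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
cv_mse = cross_validated_mse(xs, ys, fit_line)

print(f"training MSE:        {training_mse:.3f}")
print(f"cross-validated MSE: {cv_mse:.3f}")
```

Every prediction that enters the cross-validated MSE is made for an observation that was left out of the corresponding fit, which is why this quantity is a more honest stand-in for the test MSE than the training MSE is.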
2.2.2 The Bias-Variance Trade-Off
The U-shape observed in the test MSE curves (Figures 2.9–2.11) turns out
to be the result of two competing properties of statistical learning methods.
FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from
linear. In this setting, linear regression provides a very poor fit to the data.
Though the mathematical proof is beyond the scope of this book, it is
possible to show that the expected test MSE, for a given value $x_0$, can
always be decomposed into the sum of three fundamental quantities: the
variance of $\hat{f}(x_0)$, the squared bias of $\hat{f}(x_0)$ and the variance of the error
terms $\epsilon$. That is,
$$E\bigl(y_0 - \hat{f}(x_0)\bigr)^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon). \qquad (2.7)$$
Here the notation $E\bigl(y_0 - \hat{f}(x_0)\bigr)^2$ defines the expected test MSE at $x_0$,
and refers to the average test MSE that we would obtain if we repeatedly
estimated f using a large number of training sets, and tested each at $x_0$. The
overall expected test MSE can be computed by averaging $E\bigl(y_0 - \hat{f}(x_0)\bigr)^2$
over all possible values of $x_0$ in the test set.
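For readers who want to see where the three terms come from, here is an informal sketch of the argument rather than a full proof. It assumes that the test response satisfies $y_0 = f(x_0) + \epsilon$, that $\epsilon$ has mean zero and is independent of the training data (and hence of $\hat{f}$), and that $\mathrm{Bias}(\hat{f}(x_0))$ means $E\hat{f}(x_0) - f(x_0)$, with expectations taken over training sets and over $\epsilon$.

```latex
\begin{aligned}
E\bigl(y_0 - \hat{f}(x_0)\bigr)^2
  &= E\bigl(f(x_0) - \hat{f}(x_0) + \epsilon\bigr)^2 \\
  &= E\bigl(f(x_0) - \hat{f}(x_0)\bigr)^2
     + 2\,E[\epsilon]\,E\bigl[f(x_0) - \hat{f}(x_0)\bigr]
     + E[\epsilon^2] \\
  &= E\bigl(f(x_0) - \hat{f}(x_0)\bigr)^2 + \mathrm{Var}(\epsilon), \\[4pt]
E\bigl(f(x_0) - \hat{f}(x_0)\bigr)^2
  &= \bigl(f(x_0) - E\hat{f}(x_0)\bigr)^2
     + E\bigl(\hat{f}(x_0) - E\hat{f}(x_0)\bigr)^2 \\
  &= [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\hat{f}(x_0)).
\end{aligned}
```

The cross term in the first block vanishes because $E[\epsilon] = 0$, and the cross term in the second block vanishes because $\hat{f}(x_0) - E\hat{f}(x_0)$ has mean zero. Substituting the second result into the first gives the three-term decomposition.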
Equation 2.7 tells us that in order to minimize the expected test error,
we need to select a statistical learning method that simultaneously achieves
low variance and low bias. Note that variance is inherently a nonnegative
quantity, and squared bias is also nonnegative. Hence, we see that the
expected test MSE can never lie below $\mathrm{Var}(\epsilon)$, the irreducible error from
(2.3).
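The decomposition can also be checked numerically by doing what its definition describes: repeatedly draw a training set, estimate f, and test at a fixed x0. The sketch below is only an illustration under assumed settings: the true function f(x) = x², the point x0 = 0, a noise variance of 1, training sets of 50 points on [-2, 2], and a least squares line as the learning method.

```python
import random
import statistics


def true_f(x):
    return x ** 2            # hypothetical non-linear truth


def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


NOISE_SD = 1.0
X0 = 0.0
N_TRAIN = 50
N_REPEATS = 5000

rng = random.Random(2)
predictions = []       # f_hat(x0) from each training set
squared_errors = []    # (y0 - f_hat(x0))^2 from each training set
for _ in range(N_REPEATS):
    xs = [rng.uniform(-2.0, 2.0) for _ in range(N_TRAIN)]
    ys = [true_f(x) + rng.gauss(0.0, NOISE_SD) for x in xs]
    prediction = fit_line(xs, ys)(X0)
    y0 = true_f(X0) + rng.gauss(0.0, NOISE_SD)   # fresh test response at x0
    predictions.append(prediction)
    squared_errors.append((y0 - prediction) ** 2)

expected_test_mse = statistics.fmean(squared_errors)
variance = statistics.pvariance(predictions)
squared_bias = (statistics.fmean(predictions) - true_f(X0)) ** 2
irreducible = NOISE_SD ** 2

print(f"expected test MSE at x0:         {expected_test_mse:.3f}")
print(f"variance + bias^2 + irreducible: {variance + squared_bias + irreducible:.3f}")
print(f"  variance    = {variance:.3f}")
print(f"  bias^2      = {squared_bias:.3f}")
print(f"  irreducible = {irreducible:.3f}")
```

The two totals agree only up to simulation error, since both are averages over a finite number of training sets. In this setting most of the expected test MSE comes from squared bias, because a straight line cannot follow the curvature of the assumed f near x0.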
What do we mean by the variance and bias of a statistical learning
method? Variance refers to the amount by which $\hat{f}$ would change if we
estimated it using a different training data set. Since the training data
are used to fit the statistical learning method, different training data sets
will result in a different $\hat{f}$. But ideally the estimate for f should not vary
too much between training sets. However, if a method has high variance
then small changes in the training data can result in large changes in $\hat{f}$. In
general, more flexible statistical methods have higher variance. Consider the
green and orange curves in Figure 2.9. The flexible green curve is following
the observations very closely. It has high variance because changing any
one of these data points may cause the estimate $\hat{f}$ to change considerably.
FIGURE 2.12. Squared bias (blue curve), variance (orange curve), $\mathrm{Var}(\epsilon)$
(dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11, each plotted against flexibility.
The vertical dotted line indicates the flexibility level corresponding to the smallest
test MSE.
In contrast, the orange least squares line is relatively inflexible and has low
variance, because moving any single observation will likely cause only a
small shift in the position of the line.
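The effect of moving a single observation can be tried directly. In the sketch below, a least squares line plays the role of the inflexible method, and a nearest-neighbour rule, which copies the response of the closest training point, stands in for a very flexible fit; it is not the smoothing spline from Figure 2.9. The simulated data, the size of the shift, and the helper names are assumptions made for this illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Least squares line (inflexible); returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_nearest_neighbour(xs, ys):
    """Stand-in for a very flexible fit: copy the response of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x0: min(pairs, key=lambda pair: abs(pair[0] - x0))[1]


# Hypothetical training data: Y = 3 sin(X) + eps on [0, 10]
rng = random.Random(3)
xs = [rng.uniform(0.0, 10.0) for _ in range(30)]
ys = [3.0 * math.sin(x) + rng.gauss(0.0, 1.0) for x in xs]

SHIFT = 5.0
ys_moved = list(ys)
ys_moved[0] += SHIFT          # move a single observation upwards

# Compare the fits at the training inputs and on an even grid over [0, 10]
eval_points = xs + [i / 10 for i in range(101)]


def largest_change(fit):
    """Largest absolute change in the fitted value caused by moving one observation."""
    before = fit(xs, ys)
    after = fit(xs, ys_moved)
    return max(abs(after(x) - before(x)) for x in eval_points)


change_line = largest_change(fit_line)
change_flexible = largest_change(fit_nearest_neighbour)
print(f"largest change in the least squares line:    {change_line:.3f}")
print(f"largest change in the nearest-neighbour fit: {change_flexible:.3f}")
```

The nearest-neighbour fit passes the full shift on to its prediction at the moved point, whereas the line spreads the influence of that one observation across all thirty, so its fitted values move by only a fraction of the shift.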
On the other hand, bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much
simpler model. For example, linear regression assumes that there is a linear
relationship between Y and X1, X2,...,Xp. It is unlikely that any real-life
problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f. In
Figure 2.11, the true f is substantially non-linear, so no matter how many
training observations we are given, it will not be possible to produce an
accurate estimate using linear regression. In other words, linear regression
results in high bias in this example. However, in Figure 2.10 the true f
is very close to linear, and so given enough data, it should be possible for
linear regression to produce an accurate estimate. Generally, more flexible
methods result in less bias.
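The claim that more training data cannot remove the bias of linear regression when f is non-linear can be illustrated by simulation. The sketch below estimates the bias of a least squares line at x0 = 0 for two assumed truths, an exactly linear function (a simplification of the nearly linear case in Figure 2.10) and the curved function f(x) = x², using small and large training sets. The functions, sample sizes, noise level and seed are all assumptions chosen for this example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def bias_at_zero(true_f, n_train, rng, n_repeats=200):
    """Average of f_hat(0) - f(0) over repeated training sets of size n_train."""
    estimates = []
    for _ in range(n_repeats):
        xs = [rng.uniform(-2.0, 2.0) for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0.0, 1.0) for x in xs]
        estimates.append(fit_line(xs, ys)(0.0))
    return statistics.fmean(estimates) - true_f(0.0)


def linear_truth(x):
    return 1.0 + 2.0 * x      # hypothetical linear f


def curved_truth(x):
    return x ** 2             # hypothetical non-linear f


rng = random.Random(4)
bias_linear = {}
bias_curved = {}
for n_train in (20, 500):
    bias_linear[n_train] = bias_at_zero(linear_truth, n_train, rng)
    bias_curved[n_train] = bias_at_zero(curved_truth, n_train, rng)
    print(f"n={n_train:3d}  bias when f is linear: {bias_linear[n_train]:+.3f}"
          f"   bias when f is curved: {bias_curved[n_train]:+.3f}")
```

When f is linear the estimated bias is close to zero at both sample sizes. When f is curved the bias stays well away from zero even with 500 observations, because the error comes from the shape of the model rather than from a shortage of data.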
As a general rule, as we use more flexible methods, the variance will
increase and the bias will decrease. The relative rate of change of these
two quantities determines whether the test MSE increases or decreases. As
we increase the flexibility of a class of methods, the bias tends to initially
decrease faster than the variance increases. Consequently, the expected
test MSE declines. However, at some point increasing flexibility has little
impact on the bias but starts to significantly increase the variance. When
this happens the test MSE increases. Note that we observed this pattern
of decreasing test MSE followed by increasing test MSE in the right-hand
panels of Figures 2.9–2.11.
The three plots in Figure 2.12 illustrate Equation 2.7 for the examples in
Figures 2.9–2.11. In each case the blue solid curve represents the squared
bias, for different levels of flexibility, while the orange curve corresponds to
the variance. The horizontal dashed line represents $\mathrm{Var}(\epsilon)$, the irreducible
error. Finally, the red curve, corresponding to the test set MSE, is the sum
of these three quantities. In all three cases, the variance increases and the
bias decreases as the method’s flexibility increases. However, the flexibility
level corresponding to the optimal test MSE differs considerably among the
three data sets, because the squared bias and variance change at different
rates in each of the data sets. In the left-hand panel of Figure 2.12, the
bias initially decreases rapidly, resulting in an initial sharp decrease in the
expected test MSE. On the other hand, in the center panel of Figure 2.12
the true f is close to linear, so there is only a small decrease in bias as flexibility increases, and the test MSE only declines slightly before increasing
rapidly as the variance increases. Finally, in the right-hand panel of Figure 2.12, as flexibility increases, there is a dramatic decline in bias because
the true f is very non-linear. There is also very little increase in variance
as flexibility increases. Consequently, the test MSE declines substantially
before experiencing a small increase as model flexibility increases.
The relationship between bias, variance, and test set MSE given in Equation 2.7 and displayed in Figure 2.12 is referred to as the bias-variance
trade-off. Good test set performance of a statistical learning method requires low variance as well as low squared bias. This is referred to as a
trade-off because it is easy to obtain a method with extremely low bias but
high variance (for instance, by drawing a curve that passes through every
single training observation) or a method with very low variance but high
bias (by fitting a horizontal line to the data). The challenge lies in finding
a method for which both the variance and the squared bias are low. This
trade-off is one of the most important recurring themes in this book.
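The two extremes mentioned above, a curve through every training observation and a horizontal line, are easy to simulate. In the sketch below the interpolating method copies the response of the nearest training point, and the horizontal line predicts the mean training response. The true function f(x) = 3 sin(x), the evaluation point x0 = 1.5, the noise variance of 1 and the sample sizes are assumptions chosen for illustration.

```python
import math
import random
import statistics


def true_f(x):
    return 3.0 * math.sin(x)      # hypothetical truth


def predict_interpolating(xs, ys, x0):
    """Extreme flexibility: copy the response of the nearest training point."""
    return min(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[1]


def predict_horizontal(xs, ys, x0):
    """Extreme rigidity: ignore x0 and predict the mean training response."""
    return statistics.fmean(ys)


def squared_bias_and_variance(predict, x0, rng, n_train=50, n_repeats=1000):
    """Squared bias and variance of the prediction at x0 over repeated training sets."""
    predictions = []
    for _ in range(n_repeats):
        xs = [rng.uniform(0.0, 10.0) for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0.0, 1.0) for x in xs]
        predictions.append(predict(xs, ys, x0))
    squared_bias = (statistics.fmean(predictions) - true_f(x0)) ** 2
    return squared_bias, statistics.pvariance(predictions)


rng = random.Random(5)
X0 = 1.5      # a point where the truth is far from its overall average
bias2_interp, var_interp = squared_bias_and_variance(predict_interpolating, X0, rng)
bias2_flat, var_flat = squared_bias_and_variance(predict_horizontal, X0, rng)

print(f"interpolating fit: bias^2={bias2_interp:.3f}  variance={var_interp:.3f}")
print(f"horizontal line:   bias^2={bias2_flat:.3f}  variance={var_flat:.3f}")
```

Neither extreme does well on both counts at once: the interpolating fit has almost no bias but inherits the full noise of a single observation, while the horizontal line is stable across training sets but sits far from the truth at x0.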
In a real-life situation in which f is unobserved, it is generally not possible to explicitly compute the test MSE, bias, or variance for a statistical
learning method. Nevertheless, one should always keep the bias-variance
trade-off in mind. In this book we explore methods that are extremely
flexible and hence can essentially eliminate bias. However, this does not
guarantee that they will outperform a much simpler method such as linear
regression. To take an extreme example, suppose that the true f is linear.
In this situation linear regression will have no bias, making it very hard
for a more flexible method to compete. In contrast, if the true f is highly
non-linear and we have an ample number of training observations, then
we may do better using a highly flexible approach, as in Figure 2.11. In
Chapter 5 we discuss cross-validation, which is a way to estimate the test
MSE using the training data.
To evaluate the performance of a statistical learning method, we use the mean squared error (MSE), which quantifies the accuracy of predictions. The training MSE, calculated on the training data, is of limited interest; rather, we seek the test MSE on unseen data, as it better reflects prediction accuracy for future observations. Overfitting occurs when a model captures noise in the training data, leading to a large test MSE despite a small training MSE. This results from the trade-off between bias and variance: as model flexibility increases, training MSE decreases, but test MSE may initially decrease and then increase, creating a U-shape in the curve. The goal is to find a model with low bias and low variance, which often requires a balance in flexibility. The bias-variance trade-off is crucial in model selection, where a simpler model is preferred if the true function is close to its assumptions.