ML
The typical workflow of developing a machine learning system is that you have an
idea and you train the model, and you almost always find that it doesn't work as
well as you wish yet. When I'm training a machine learning model, it pretty much
never works that well the first time. Key to the process of building a machine
learning system is deciding what to do next in order to improve its
performance. I've found across many different applications that looking at the bias
and variance of a learning algorithm gives you very good guidance on what to try
next. Let's take a look at what this means. You might remember this example from
the first course on linear regression, where, given this dataset, if you were to fit a
straight line to it, it doesn't do that well. We said that this algorithm has high bias
or that it underfits this dataset. If you were to fit a fourth-order polynomial, then it
has high-variance or it overfits. In the middle if you fit a quadratic polynomial, then
it looks pretty good. Then I said that was just right. Because this is a problem with
just a single feature x, we could plot the function f and look at it like this. But if you
had more features, you can't plot f and visualize whether it's doing well as easily.
Instead of trying to look at plots like this, a more systematic way to diagnose or to
find out if your algorithm has high bias or high variance will be to look at the
performance of your algorithm on the training set and on the cross validation set.
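As a concrete sketch of this diagnostic, here's a small Python example. The dataset, the train/cross-validation split, and the degree values are all made up for illustration; it fits polynomials of degree 1, 2, and 4 to a noisy quadratic target and compares J_train with J_cv, mirroring the three cases discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy quadratic target with a single feature x,
# split into a training set and a cross-validation set.
x_train = np.linspace(0.0, 4.0, 20)
y_train = 1.0 + 2.0 * x_train - 0.5 * x_train**2 + rng.normal(0.0, 0.3, x_train.size)
x_cv = np.linspace(0.1, 3.9, 10)
y_cv = 1.0 + 2.0 * x_cv - 0.5 * x_cv**2 + rng.normal(0.0, 0.3, x_cv.size)

def cost(coeffs, x, y):
    """Squared-error cost J = (1 / 2m) * sum((f(x) - y)^2)."""
    preds = np.polyval(coeffs, x)
    return float(np.mean((preds - y) ** 2) / 2.0)

for d in (1, 2, 4):
    coeffs = np.polyfit(x_train, y_train, deg=d)  # fit a degree-d polynomial
    j_train = cost(coeffs, x_train, y_train)
    j_cv = cost(coeffs, x_cv, y_cv)
    print(f"d={d}: J_train={j_train:.3f}  J_cv={j_cv:.3f}")
```

On data like this, the straight line (d equals 1) leaves J_train high, which is the high-bias signature; higher degrees drive J_train down, and the gap between J_cv and J_train is what flags high variance.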
In particular, let's look at the example on the left. If you were to compute J_train,
how well does the algorithm do on the training set? Not that well. I'd say J_train
here would be high because there are actually pretty large errors between the
examples and the actual predictions of the model. How about J_cv? J_cv would be
if we had a few new examples, maybe examples like that, that the algorithm had
not previously seen. Here the algorithm also doesn't do that well on examples that
it had not previously seen, so J_cv will also be high. One characteristic of an
algorithm with high bias, something that is underfitting, is that it's not even doing
that well on the training set. When J_train is high, that is your strong indicator that
this algorithm has high bias. Let's now look at the example on the right. If you
were to compute J_train, how well is this doing on the training set? Well, it's
actually doing great on the training set. It fits the training data really well. J_train
here will be low. But if you were to evaluate this model on other houses not in the
training set, then you find that J_cv, the cross-validation error, will be quite high. A
characteristic signature or a characteristic cue that your algorithm has high variance
will be if J_cv is much higher than J_train. In other words, it does much better on
data it has seen than on data it has not seen. This turns out to be a strong
indicator that your algorithm has high variance. Again, the point of what we're
doing is that I'm computing J_train and J_cv and seeing if J_train is high or if J_cv
is much higher than J_train. This gives you a sense, even if you can't plot the
function f, of whether your algorithm has high bias or high variance. Finally, the
case in the middle. If you look at J_train, it's pretty low, so this is doing quite well
on the training set. If you were to look at a few new examples, like those from, say,
your cross-validation set, you find that J_cv is also pretty low. J_train not being
too high indicates this doesn't have a high bias problem, and J_cv not being much
worse than J_train indicates that it doesn't have a high variance problem
either. Which is why the quadratic model seems to be a pretty good one for this
application. To summarize, when d equals 1 for a linear polynomial, J_train was
high and J_cv was high. When d equals 4, J_train was low, but J_cv was high. When
d equals 2, both were pretty low. Let's now take a different view on bias and
variance. In particular, on the next slide I'd like to show you how J_train and J_cv
vary as a function of the degree of the polynomial you're fitting. Let me draw
a figure where the horizontal axis, this d here, will be the degree of polynomial
that we're fitting to the data. Over on the left we'll correspond to a small value of
d, like d equals 1, which corresponds to fitting a straight line. Over to the right we'll
correspond to, say, d equals 4 or even higher values of d. We're fitting this high
order polynomial. So if you were to plot J_train of w, b as a function of the degree
of polynomial, what you find is that as you fit a higher and higher degree
polynomial, here I'm assuming we're not using regularization, but as you fit a
higher and higher order polynomial, the training error will tend to go down
because when you have a very simple linear function, it doesn't fit the training
data that well; when you fit a quadratic function, third-order polynomial, or
fourth-order polynomial, it fits the training data better and better. As the degree of
polynomial increases, J_train will typically go down. Next, let's look at J_cv, which
is how well does it do on data that it did not get to fit to? What we saw was when d
equals one, when the degree of polynomial was very low, J_cv was pretty high
because it underfits, so it didn't do well on the cross validation set. Here on the
right as well, when the degree of polynomial is very large, say four, it doesn't do
well on the cross-validation set either, and so it's also high. But if d was in-
between say, a second-order polynomial, then it actually did much better. If you
were to vary the degree of polynomial, you'd actually get a curve that looks like
this, which comes down and then goes back up. Where if the degree of
polynomial is too low, it underfits and so doesn't do well on the cross validation set; if it is
too high, it overfits and also doesn't do well on the cross validation set. It's only if
it's somewhere in the middle, that is just right, which is why the second-order
polynomial in our example ends up with a lower cross-validation error and neither
high bias nor high-variance. To summarize, how do you diagnose bias and
variance in your learning algorithm? If your learning algorithm has high bias or it
has underfit the data, the key indicator will be if J_train is high. That corresponds
to this leftmost portion of the curve, which is where J_train is high. Usually
J_train and J_cv will be close to each other. How do you diagnose if you have
high variance? Well, the key indicator for high variance will be if J_cv is much
greater than J_train. This double greater-than sign in math refers to much greater
than, so this is greater, and this means much greater. This rightmost portion of the
plot is where J_cv is much greater than J_train. Usually J_train will be pretty low,
but the key indicator is whether J_cv is much greater than J_train. That's what
happens when we had fit a very high order polynomial to this small dataset. Even
though we've just seen bias and variance separately, it turns out, in some cases, it's possible to
simultaneously have high bias and have high-variance. You won't see this happen
that much for linear regression, but it turns out that if you're training a neural
network, there are some applications where unfortunately you have high bias and
high variance. One way to recognize that situation will be if J_train is high, so
you're not doing that well on the training set, but even worse, the cross-validation
error is much larger still than the training error. The notion of high bias and
high variance doesn't really happen for linear models applied to one-dimensional data. But to give
intuition about what it looks like, it would be as if for part of the input, you had a
very complicated model that overfit, so it overfits to part of the inputs. But then for
some reason, for other parts of the input, it doesn't even fit the training data well,
and so it underfits for part of the input. In this example, which looks artificial
because it's a single-feature input, we fit the training set really well and overfit
for part of the input, while we don't even fit the training data well and underfit
for another part of the input. That's how in some applications you can unfortunately end up
with both high bias and high variance. The indicator for that will be if the algorithm
does poorly on the training set, and the cross-validation error is even much worse
than the training error. For most learning applications, you probably have primarily a high bias or high
variance problem rather than both at the same time. But it is possible sometimes
they're both at the same time. I know that there's a lot going on, there are a lot
of concepts on these slides, but the key takeaways are: high bias means it's not even
doing well on the training set, and high variance means it does much worse on
the cross validation set than the training set. Whenever I'm training a machine
learning algorithm, I will almost always try to figure out to what extent the
algorithm has a high bias or underfitting problem versus a high-variance or overfitting
problem. This will give good guidance, as we'll see later this week, on how you
can improve the performance of the algorithm. But first, let's take a look at how
regularization affects the bias and variance of a learning algorithm because that
will help you better understand w