W2 - 2.4 - Model Selection

In the previous lesson we talked about the confusion matrix and classification accuracy. Those are for categorical values, that is, for classification. As I mentioned previously, prediction generally refers to numerical values: you're trying to predict a value rather than a class label. In this case, instead of using the previous metrics, we typically talk about some kind of error, the prediction error.

What you're looking at here is not an exact match. If you're trying to predict a stock price, or the temperature, or the wind speed, the point is not whether the prediction hits the exact value, with anything else counted as incorrect. Rather, you're looking at the error: how far away you are from the actual value.

That's the key notion, and there are actually a few different measures you can use. As you can see, they all look at some kind of difference between the predicted value and the actual value, but in different ways. You can take the absolute value or the square of the difference, giving the absolute error and the squared error, or average them, giving the mean absolute error and the mean squared error. There are also the relative absolute error and the relative squared error; as you can see, they are related, but they look at different terms.

This, again, goes back to your application scenario: you need to consider whether you care more about the absolute error, that is, the actual size of the difference, or the relative error.

For example, take stock prices. One stock usually trades in the range of hundreds, so if your error is maybe ten, that's probably pretty good. But if you're looking at another stock whose price is usually around tens, and now you have an error of ten, that is considered much bigger. In this case, maybe the relative error is more reasonable.

Anyway, those are just two examples, but always keep in mind that there are different error measures, and one may be more suitable than another depending on your scenario.
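As a concrete sketch, the error measures above can be written out in plain Python; the function names and the toy stock-price values are my own illustrations, not from the lecture:

```python
# Minimal sketch of the error measures discussed above, assuming plain
# Python lists of actual and predicted numerical values.

def mean_absolute_error(actual, predicted):
    """Average of |predicted - actual| over all test instances."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of (predicted - actual)^2 over all test instances."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def relative_absolute_error(actual, predicted):
    """Total absolute error relative to always predicting the mean of the actuals."""
    mean_a = sum(actual) / len(actual)
    return (sum(abs(p - a) for a, p in zip(actual, predicted))
            / sum(abs(a - mean_a) for a in actual))

def relative_squared_error(actual, predicted):
    """Total squared error relative to always predicting the mean of the actuals."""
    mean_a = sum(actual) / len(actual)
    return (sum((p - a) ** 2 for a, p in zip(actual, predicted))
            / sum((a - mean_a) ** 2 for a in actual))

# A stock trading in the hundreds: an absolute error of ~10 is fairly small.
actual = [100.0, 110.0, 120.0, 90.0]
predicted = [110.0, 100.0, 130.0, 100.0]
print(mean_absolute_error(actual, predicted))  # 10.0
```

The relative measures normalize by how much the actual values themselves vary, which is exactly why they behave differently for a stock in the hundreds versus one in the tens.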


There's another approach which is particularly useful when you're trying to compare across models. It's a visual mechanism referred to as the ROC curve. What does it plot? On the x-axis you have the false positive rate, defined as the number of false positives divided by the actual number of negatives. On the y-axis you have the true positive rate: the number of true positives over the actual number of positives.

Once you plot this, as you can see, a good model should have a smaller false positive rate. That means you want your x value to be smaller, towards the left. On the y-axis you're looking at the true positive rate, and of course you want that to be higher. The ideal case would be this purple curve: it goes straight up at the lowest false positive rate, then runs horizontally at the perfect true positive rate. That is, no false positives and all true positives. Of course, that's the perfect case; in many real-world settings you're looking at some curve in between.

What's also important is this random classifier. Typically we're talking about binary classification, so if you're making a random guess, you're looking at this diagonal line. That is, if I don't do anything intelligent, my random classifier gives you this line, and now you're basically trying to find a model that pushes you towards the top-left region, because the top left means a higher true positive rate and a lower false positive rate.

In this particular case, as you can see, I have this green curve, an orange curve, and a blue curve. You can see the progression: green is slightly better than my random classifier, orange is better than green, and the blue one is better than my orange one. So this gives you a very easy way to visualize the differences and pick the model that performs better.
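The rates plotted above can be sketched in code, assuming the classifier outputs a score per instance and we sweep the decision threshold from high to low; the function name and the toy labels and scores are illustrative, not from the lecture:

```python
# Sketch of computing ROC points: FPR = FP / actual negatives on the x-axis,
# TPR = TP / actual positives on the y-axis, one point per threshold.

def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) pairs, starting
    at (0, 0), by admitting one instance at a time in score order."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)  # highest score first
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _score, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# A model that ranks most positives above the negatives hugs the top left.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
for fpr, tpr in roc_points(labels, scores):
    print(fpr, tpr)
```

Plotting these points for each model on the same axes reproduces the visual comparison described above: the curve closer to the top-left corner belongs to the better model.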

There's another approach for model selection, which is more of a statistical tool based on probability calculations. It's called the t-test. In this case the setting is like this: you have two models, Model 1 and Model 2.

I did my k-fold cross-validation. That means I ran k different rounds of training and testing splits, and now I have the corresponding error values: k different error values for each model, and they correspond to the same rounds of tests. Once I have that, my goal is to pick which model is better.
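That setup, where both models are scored on the same k splits, can be sketched roughly as follows; the helper names, the data, and the toy "always predict the training mean" scorer are my own illustrations, not the lecture's code:

```python
# Sketch of collecting per-fold errors so that two models can later be
# compared pairwise: both models see exactly the same k train/test splits.

def k_fold_errors(data, k, train_and_score):
    """Split data into k folds; for each fold, train on the rest,
    test on the fold, and collect that fold's error."""
    fold_size = len(data) // k
    errors = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        errors.append(train_and_score(train, test))
    return errors

def mean_predictor_mae(train, test):
    """Toy model: always predict the mean target seen in training."""
    pred = sum(y for _, y in train) / len(train)
    return sum(abs(y - pred) for _, y in test) / len(test)

data = [(i, float(i)) for i in range(20)]  # (feature, target) pairs
errors = k_fold_errors(data, 10, mean_predictor_mae)
print(len(errors))  # 10 error values, one per round
```

Running `k_fold_errors` once per model yields the two aligned lists of k error values that the t-test below operates on.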

The straightforward approach would be: let's just compute the mean error. That means you take the average across all k runs for each model, and then say the one with the smaller error is better. Usually, yes; but then there's this question: did it happen by chance, or is the difference actually statistically significant? That's the notion here.

Because remember, with k-fold cross-validation we are in a way just randomly partitioning the data, so there may be some randomness: it may just happen that one split works better, or worse, for a particular model. You want to understand: yes, there is some difference, but is the difference between the mean errors of the two models statistically significant, or did it occur by chance?

By chance means there is a minor difference, but it doesn't say that one model is particularly better than the other one. So you want to measure how significant it is that this model is better than the other one. Is it statistically significant? That's the key notion of the t-test approach.

The t-test, in a way, is basically testing the hypothesis that these two models are actually not better than each other, that is, their errors are coming from the same distribution. That's the null hypothesis, and if the errors are indeed drawn from the same distribution, you can then measure the difference.

The notion is that if the difference is small, then it's more likely that they come from the same distribution; but if the difference is pretty big, then the hypothesis that they come from the same distribution is probably not true, and you reject that hypothesis. That's the key idea.

So what we do is use this t-test. We're basically computing the t value, which is the mean error difference divided by the square root of the variance term. The variance is based on a pairwise comparison, because you have k folds and the error values corresponding to each of the k runs.
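A minimal sketch of that paired t computation; the variable names and the per-fold error values are my own illustrations:

```python
import math

# Paired t statistic as described above: the mean of the per-fold error
# differences divided by the square root of (variance of differences / k).

def paired_t(errors_1, errors_2):
    """t value for the per-fold differences between two models' errors,
    where errors_1[i] and errors_2[i] come from the same fold."""
    k = len(errors_1)
    diffs = [e1 - e2 for e1, e2 in zip(errors_1, errors_2)]
    mean_d = sum(diffs) / k
    # Sample variance of the differences (k - 1 in the denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
    return mean_d / math.sqrt(var_d / k)

# Ten folds of errors for two hypothetical models.
model_1 = [0.20, 0.22, 0.19, 0.21, 0.23, 0.20, 0.18, 0.22, 0.21, 0.19]
model_2 = [0.24, 0.25, 0.22, 0.26, 0.27, 0.23, 0.22, 0.25, 0.26, 0.24]
print(paired_t(model_1, model_2))  # strongly negative: model 1's errors are lower
```

Pairing matters here: because both error values in each pair come from the same split, the split-to-split randomness largely cancels out in the differences.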

So the question is: I can compute this t, but then how do I make a decision? Let's look at a concrete example. I have ten-fold cross-validation, so in k-fold terms, k is ten.

Then, to determine the t-test threshold, I usually consult a t-table. This is available in almost any statistics textbook. Here is one example: you can go to this link and check out the table. What's shown here is just a subset of that table.

To use this table you need two important values. One of them is v, the degrees of freedom. Think about your 10-fold cross-validation: once one value is fixed, you have nine degrees of freedom. That's the value nine for v.

The other one is your significance level: for your particular scenario, how significant do you need this to be? Is it 0.10, 0.005, 0.001? As you can see, a lower value is more stringent, meaning the chance of that occurring is smaller.

In our problem setting, there's this difference between two-sided and one-sided tests. In our case, since we're comparing whether one model is better than the other, we actually have a two-sided test. There are additional materials talking about the one-sided test, and you can check those out. But in our setting, we are basically checking whether one is better than the other, and the direction doesn't really matter, so it's two-sided.

In this case, we would take 1 - a/2, and that gives us 0.975, and that is what we're looking for in the columns.

As you can see, the columns in my t-table are indexed by the significance value. In my case, I'll be looking at v equal to 9, which is the last row here, and then 1 - a/2, which corresponds to 0.975, and this gives me a t value of 2.262. That is the critical value for that particular v and a pair.

That basically says: if the null hypothesis is true at that significance level, then the t value should be below the critical value, no bigger than 2.262.

So you compute the t value for your particular problem setting, and then you compare t to this threshold. If t is above your 2.262 threshold, you reject your hypothesis.

Rejecting the hypothesis means that the hypothesis that the two models' errors are drawn from the same distribution is not valid, so you reject it. In that case, yes, one model is statistically significantly better than the other one.
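Putting the decision rule together as a sketch: the 2.262 threshold is the critical value read off the t-table above for v = 9 at the 0.975 column; the function name and the toy per-fold errors are my own assumptions:

```python
import math

# Decision rule: reject the "same distribution" hypothesis when |t| exceeds
# the critical value from the t-table (here v = 9, 1 - a/2 = 0.975).

CRITICAL_T = 2.262  # from the t-table in the lecture's 10-fold example

def significantly_different(errors_1, errors_2, critical_t=CRITICAL_T):
    """Two-sided paired t-test decision: True means the difference between
    the two models' per-fold errors is statistically significant."""
    k = len(errors_1)
    diffs = [e1 - e2 for e1, e2 in zip(errors_1, errors_2)]
    mean_d = sum(diffs) / k
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
    t = mean_d / math.sqrt(var_d / k)
    return abs(t) > critical_t

model_1 = [0.20, 0.22, 0.19, 0.21, 0.23, 0.20, 0.18, 0.22, 0.21, 0.19]
model_2 = [0.24, 0.25, 0.22, 0.26, 0.27, 0.23, 0.22, 0.25, 0.26, 0.24]
print(significantly_different(model_1, model_2))  # True: reject the hypothesis
```

When the function returns True, the sign of the mean difference tells you which model has the lower error; when it returns False, the observed gap could plausibly be chance from the random partitioning.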

To summarize, we have now finished our discussion of classification, which is one of the core approaches that we use in data mining. For the key takeaway messages: first, remember that classification is a supervised learning approach. That means you have predefined class labels, and you have a training set along with the corresponding class labels for you to construct a model. That's important.

Then the specific methods: we talked about decision trees, Bayesian classification, support vector machines, neural networks, and ensemble methods. Those are all core methods. They are widely used in many real-world settings and have been shown to be effective. Of course, there are always differences in their actual performance on particular problems.

This naturally leads us to the important discussion about evaluating your models, to be able to compare and select the models that work better. Because usually you don't just take one generic dataset, compare on it, and pick the one that's best. You want an evaluation setting that is specific to your problem, so that you are testing the different methods for your application scenarios.

Specifically, think about how we use the confusion matrix to compare the actual classes against the predicted classes, and see how often you're making mistakes or hitting the exact correct label. There's this notion of accuracy when you're talking about classification, because you have categorical values. But if you're talking about numerical values in the prediction scenario, then you need to think about the error: how big the difference between the predicted value and the true value is.

We also talked about using the ROC curve as a visual way of choosing the best model, and also the t-test, which is a statistical calculation to estimate the statistical significance of one model being better than another.

All right, that is all for classification. We will continue later with clustering.

