W2 - 2.4 - Model Selection
In the previous lesson we talked about the confusion matrix and classification accuracy. Those are for categorical values; that's classification. As I mentioned previously, prediction generally refers to numerical values: you're trying to predict a value rather than a class label. In this case, instead of using the previous measures, typically we talk about some kind of error, the prediction error.
What you're looking at here is not an exact match. If you're trying to predict the stock price, or the temperature, or the wind speed, the point is not to hit the exact value, where anything else counts as incorrect.
Rather, you're looking at the error: how far away you are from the actual value. That's the key notion, and there are actually a few different measures you can use. As you can see, they all look at some kind of difference between the predicted value and the actual value, but in different ways. Either you take the absolute value or the square, as in the absolute error and the squared error, or the mean absolute error and the mean squared error. There are also the relative absolute error and the relative squared error. As you can see, they are related, but they look at different terms.
This again goes back to your application scenario: you need to consider whether you care more about the absolute error, the actual difference in value, or the relative error. For example, take stock prices. One stock usually trades in the range of hundreds, so, like I said, an error of maybe ten is probably pretty good. But if you're looking at another stock whose price is usually around ten, an error of ten is considered much bigger. In that case, maybe the relative error is more reasonable. Anyway, those are just two examples, but always keep in mind that there are different error measures, and one may be more suitable than another depending on your scenario.
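To make the measures concrete, here is a minimal sketch in Python. The function name and the dictionary output are my own choices, not from the lecture; the relative measures are normalized by the error of a trivial predictor that always outputs the mean of the actual values, which is the usual convention.

```python
import numpy as np

def error_measures(y_true, y_pred):
    """Common prediction-error measures for numerical prediction.

    RAE and RSE normalize by the error of a trivial baseline that
    always predicts the mean of the actual values (usual convention).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean = y_true.mean()
    mae = np.mean(np.abs(y_pred - y_true))            # mean absolute error
    mse = np.mean((y_pred - y_true) ** 2)             # mean squared error
    rae = np.sum(np.abs(y_pred - y_true)) / np.sum(np.abs(y_true - mean))
    rse = np.sum((y_pred - y_true) ** 2) / np.sum((y_true - mean) ** 2)
    return {"MAE": mae, "MSE": mse, "RAE": rae, "RSE": rse}

# Example: a stock trading around 100-120 with errors of 5, 2, 2
m = error_measures([100, 110, 120], [105, 108, 118])
```

Note how the two relative measures rescale the same raw differences, which is what makes them comparable across stocks trading at very different price levels.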
There's another approach which is particularly useful when you're trying to compare across models. It's a visual mechanism referred to as the ROC curve. What are we plotting? On the X-axis you have the false positive rate, defined as the number of false positives divided by the actual number of negatives. On the Y-axis you have the true positive rate: the number of true positives over the actual number of positives.
Once you plot this, as you can see, a good model should have a smaller false positive rate. That means you want your X value to be smaller, towards the left. With the Y-axis, the true positive rate, of course you want that to be higher. The ideal case would be this purple curve: it goes straight up at the lowest false positive rate and then runs horizontally at the perfect true positive rate. That is, no false positives and all true positives. Of course, that's the perfect case; in many real-world settings you're looking at some curve in between.
What's also important is this random classifier. Typically we're talking about binary classification: if you're making a random guess, you get this diagonal line. That is, if I don't do anything intelligent, my random classifier gives me this line, and now you're basically trying to find a model that pushes you towards the top-left region, because the top left means a higher true positive rate and a lower false positive rate.
In this particular case, as you can see, I have this green curve, an orange curve, and a blue curve, and you can see them going in this direction: green is slightly better than my random classifier, orange is better than green, and the blue one is better than the orange one. That gives you a very easy way to visualize the differences and pick the model that performs better.
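As a sketch of how the points on an ROC curve can be computed from a classifier's scores (the names here are my own; this assumes a higher score means "more likely positive", and each cut-off position in the ranked list yields one (FPR, TPR) point):

```python
import numpy as np

def roc_points(y_true, scores):
    """Return arrays (fpr, tpr) tracing an ROC curve.

    y_true: binary labels (1 = positive), scores: classifier scores,
    where a higher score means the example is ranked as more positive.
    """
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float))  # descending score
    y_sorted = y_true[order]
    P = y_sorted.sum()                      # actual number of positives
    N = len(y_sorted) - P                   # actual number of negatives
    tps = np.cumsum(y_sorted)               # true positives at each cut-off
    fps = np.arange(1, len(y_sorted) + 1) - tps  # false positives at each cut-off
    fpr = np.concatenate(([0.0], fps / N))  # false positive rate = FP / N
    tpr = np.concatenate(([0.0], tps / P))  # true positive rate  = TP / P
    return fpr, tpr

# A classifier that ranks both positives above both negatives
# passes through the ideal top-left corner (0, 1).
fpr, tpr = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```

A curve hugging the top-left corner corresponds to the better models (the blue curve in the lecture's figure), while scores that are random with respect to the labels stay near the diagonal.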
There's another approach for model selection, which is more of a statistical method based on probability calculation. It's called the t-test. In this case the setting is like this: you have two models, Model 1 and Model 2.
I did my k-fold cross-validation. That means I ran K different rounds of training and testing splits, and now I have the corresponding error values: K different error values for each model, where the values correspond to the same rounds of testing. Once I have that, my goal is to pick which model is better.
The straightforward approach would be to just compute the mean error: take the average across all K runs for each model, and say the one with the smaller error is better. Usually, yes, but then there's this question: is the difference happening by chance, or is it actually statistically significant? That's the notion here.
Because remember, with K-fold cross-validation we are in a way just randomly partitioning the data, so there may be some randomness involved: it may just happen that a particular split works better for a particular model. You want to understand: yes, there is some difference, but is the difference between the mean errors of the two models statistically significant, or did it happen by chance? By chance means there's a minor difference, but it doesn't mean one model is actually much better than the other.
So you want to measure how significant the improvement of one model over the other is. Is it statistically significant? That's the key notion of this t-test approach.
The t-test, in a way, is basically testing the hypothesis that the two models are actually not better than each other, that is, their errors come from the same distribution. That's the null hypothesis. If the errors are indeed drawn from the same distribution, you then measure the difference. The notion is that if the difference is small, it's more likely that they come from the same distribution; but if the difference is pretty big, then the hypothesis that they come from the same distribution is probably not true, and you reject that hypothesis. That's the key idea.
So what we do is use this t-test: we compute the t value, which is the mean error difference divided by the square root of the variance term. The variance is based on a pairwise comparison, because you have K folds and the error values correspond to each of the K runs. So the question is: I can compute this t, but how do I make a decision?
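The t value described here can be sketched as follows. This assumes the standard paired t-test formula, where the variance is the sample variance of the per-fold error differences; the function name is my own.

```python
import math

def paired_t(errors1, errors2):
    """Paired t statistic for two models' per-fold errors.

    errors1[i] and errors2[i] must come from the SAME cross-validation
    fold, which is what makes the comparison pairwise.
    """
    k = len(errors1)
    d = [a - b for a, b in zip(errors1, errors2)]        # per-fold differences
    mean_d = sum(d) / k                                  # mean error difference
    var_d = sum((x - mean_d) ** 2 for x in d) / (k - 1)  # sample variance of d
    return mean_d / math.sqrt(var_d / k)                 # t = d_bar / sqrt(var/k)

t = paired_t([0.20, 0.25, 0.30], [0.10, 0.05, 0.10])
```

A t close to zero means the per-fold differences are small relative to their spread, consistent with the null hypothesis; a large |t| is the evidence you use to reject it.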
So let's look at a concrete example here. I have ten-fold cross-validation, so K is ten. Then, to determine the critical t value, I usually consult a t table. This is available in almost any statistics textbook; here is one example, and you can go to this link and check out the table. This is just a subset of that table. To use this table you need two important values. One of them is v, the degrees of freedom. Think about your 10-fold cross-validation: once you fix one value, you have nine degrees of freedom. That's the value nine for v.
The other one is that you need to tell me your significance level: for your particular scenario, how significant do you need it to be? Is it 0.10, 0.05, 0.01? As you can see, a lower value is more stringent; the chance of the difference occurring by chance is smaller.
In our problem setting, there's this difference between a two-sided and a one-sided test. In our case, since we're comparing one model being better than the other without caring about the direction, we have a two-sided test. There are other materials talking about the one-sided test, and you can check those out. But in our setting we are basically checking whether one is better than the other, and the direction doesn't really matter, so it's two-sided. In this case, we take 1 - a/2, which gives us 0.975, and that is what we're going to look for in the columns.
As you can see, the columns in my t table are indexed by the significance value. In my case, I'll be looking at v equal to 9, which is the last row, and then 1 - a/2, which corresponds to the 0.975 column. This gives me a t value of 2.262. That is the critical value for that particular pair of v and significance level. It basically says that if the null hypothesis is true at that significance level, the computed value should be no bigger than the critical value, 2.262.
So you compute the t value for your particular problem setting, and then you compare t to this threshold. If t is above your 2.262 threshold, you reject your hypothesis. Rejecting the hypothesis means that the hypothesis that the two models' errors are drawn from the same distribution is not valid, so you reject it. In this case, yes, one model is actually statistically significantly better than the other one.
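The table lookup and decision step can be sketched like this. The hard-coded entries are a small subset of the 0.975 column of a standard t table, so this sketch only covers a two-sided test at a = 0.05 for the listed degrees of freedom; the names are my own.

```python
# Critical values t_{0.975, v}: the 0.975 column of a standard t table,
# i.e. a TWO-sided test at significance level a = 0.05 (subset only).
T_CRIT_0975 = {1: 12.706, 4: 2.776, 9: 2.262, 19: 2.093}

def reject_same_distribution(t_value, k):
    """Two-sided decision at a = 0.05 for k-fold cross-validation.

    Degrees of freedom v = k - 1. Returns True if |t| exceeds the
    critical value, i.e. we reject the hypothesis that the two
    models' errors come from the same distribution.
    """
    crit = T_CRIT_0975[k - 1]        # table lookup for v = k - 1
    return abs(t_value) > crit

# With 10-fold cross-validation (v = 9), the threshold is 2.262.
print(reject_same_distribution(3.0, 10))   # exceeds 2.262 -> reject
print(reject_same_distribution(1.5, 10))   # below 2.262 -> keep null
```

In practice a statistics library can compute the critical value for any v and a directly, but the lookup above mirrors exactly what the lecture does with the printed table.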
To summarize, we have now finished our discussion of classification, which is one of the core approaches that we use in data mining. To pick out the key takeaway messages: first, think about how classification is a supervised learning approach. That means you have predefined class labels: you have a training set along with the corresponding class labels for you to construct a model. That's important. Then the specific methods: we talked about decision trees, Bayesian classification, support vector machines, neural networks, and ensemble methods. Those are all core methods. They are widely used in many real-world settings and have been shown to be effective.
Of course, there are always differences in terms of their actual performance on particular problems. This naturally leads us to the important discussion about evaluating your models, to be able to compare and select the models that work better. Usually you don't just take one generic dataset, compare on it, and pick the one that's best, because you want an evaluation setting that is specific to your problem, so that you are testing the different methods for your application scenarios.
Specifically, think about how we use the confusion matrix to see, for the particular classes, how often you're making mistakes and how often you're hitting the exact correct label. There's the notion of accuracy when you're talking about classification, because you have categorical values. But if you're talking about numerical values in the prediction scenario, then you need to think about the error: how big the difference between the predicted value and the true value is.
We also talked about using the ROC curve as a visual way of choosing the best model, and the t-test, which is a statistical calculation to estimate the statistical significance of one model being better than another. All right, that is all for classification. We will continue later to talk about clustering.