W2 - 2.3 - Ensemble, Model Evaluation
So far we have talked about
quite a few classification methods.
Let's just quickly summarize them and look
at their key features or characteristics.
We start with the Decision tree induction.
We know how it works but
generally like I said
decision trees are fairly efficient.
What you're doing is more like a top-down,
divide-and-conquer approach,
but it's a greedy approach, which may not
give you the optimal solution but
usually gives a very good solution.
It's fairly efficient, and also one advantage
of decision tree induction is
that when you have the tree,
you know exactly how the decisions are made.
That's why it's very easy to interpret
the decision rather than saying
that well it's a black box I don't know how it's done.
With a decision tree you know
exactly how a particular
classification decision is
derived by following the path from the root node:
which attribute you look at
first, and based on that value,
what's the next attribute you check.
Because of all those combinations of attribute values,
that is why you are making
that particular classification decision.
That actually is very important and also
makes it very useful in many real-world applications.
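To make the "follow the path from the root" idea concrete, here is a minimal sketch. The tree, its attribute names (age, student, credit) and its labels are all made up for illustration; they are not from the lecture example.

```python
# Hypothetical toy tree: internal nodes test one attribute, leaves hold a class label.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": {"label": "buys"}, "no": {"label": "does_not_buy"}},
        },
        "middle_aged": {"label": "buys"},
        "senior": {
            "attribute": "credit",
            "branches": {"fair": {"label": "buys"}, "excellent": {"label": "does_not_buy"}},
        },
    },
}


def classify_with_path(node, record):
    """Walk from the root to a leaf, recording each attribute test on the way."""
    path = []
    while "label" not in node:
        attribute = node["attribute"]
        value = record[attribute]
        path.append((attribute, value))
        node = node["branches"][value]
    return node["label"], path


label, path = classify_with_path(
    tree, {"age": "youth", "student": "yes", "credit": "fair"}
)
print(label, path)
```

The returned path is the explanation: it lists which attributes were checked, in which order, and which values led to the final label.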
The other one, Bayesian classification.
This actually has a solid probabilistic foundation.
It's very efficient to calculate.
Basically you're just calculating probabilities.
It's also explainable, just like we
have done in the previous lecture.
When we went through the example,
you are computing the probability
of each of those classes,
given the particular object's attribute values.
That also gives you
an explicit way of saying why you make that decision.
Say it should be this class,
because it has a high probability
of this class and because
of all the individual
probabilities that you're leveraging.
Another key feature of Bayesian classification is
that it's incremental. What does that mean?
As we mentioned earlier,
the question for an incremental classification process
basically is: if you have new data,
do you need to redo your modeling process altogether, or
can you take the model
that you have already constructed and just update it?
If you think about decision trees, for example,
a decision tree is not incremental because the decisions or
the choices you make at each layer
would change if you have more data.
That's why if you have more data
and you're using the decision tree
induction method, you have to
redo everything, that is, reconstruct the whole tree.
While with Bayesian classification,
you are calculating
probabilities, and with probabilities the good thing is
that if you already have
your current probabilities and now you have
some new information, all you need to do is
take them and update them with the new information.
That's why it's incremental: you can take
your current Bayesian classifier, where you
have all the probabilities that you already calculated,
and you only need to update it with
the new information so you have
the updated probabilities, and that's all you need.
That's why it's incremental, and
incremental classification can be useful
particularly in scenarios where you
expect new data to keep
being added and you expect
your model to adapt quickly when you have new data.
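Here is a minimal sketch of why the update is cheap, assuming a naive Bayes classifier over categorical attributes that stores only counts. The class name, the weather-style records, and the lack of smoothing are simplifications chosen for illustration.

```python
from collections import defaultdict


class IncrementalNaiveBayes:
    """Categorical naive Bayes that stores only counts, so new data is a cheap update."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # (class label, attribute index, attribute value) -> count
        self.value_counts = defaultdict(int)
        self.total = 0

    def update(self, record, label):
        """Fold one labeled record into the counts; nothing is rebuilt."""
        self.class_counts[label] += 1
        self.total += 1
        for i, value in enumerate(record):
            self.value_counts[(label, i, value)] += 1

    def scores(self, record):
        """Unnormalized P(class) * product of P(value | class), without smoothing."""
        result = {}
        for label, n_label in self.class_counts.items():
            score = n_label / self.total
            for i, value in enumerate(record):
                score *= self.value_counts[(label, i, value)] / n_label
            result[label] = score
        return result

    def predict(self, record):
        scores = self.scores(record)
        return max(scores, key=scores.get)


model = IncrementalNaiveBayes()
for record, label in [
    (("sunny", "hot"), "no"),
    (("rainy", "mild"), "yes"),
    (("sunny", "mild"), "yes"),
]:
    model.update(record, label)
before = model.predict(("sunny", "hot"))

# New data arrives: only the counts change, the old records are never revisited.
model.update(("sunny", "hot"), "yes")
model.update(("sunny", "hot"), "yes")
after = model.predict(("sunny", "hot"))
print(before, after)
```

The per-class scores are also what makes the decision explainable: you can print them and see which class had the higher probability and which individual counts drove it.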
Next, we talked about support
vector machines as well.
These generally have very good performance.
If you look at some of the research papers around,
many times when they are comparing
across multiple types of methods,
the support vector machine usually is
the best-performing or one of the best-performing methods.
It has pretty good performance.
Next one, Neural networks,
and we just mentioned the design can be complex.
It is really trying to capture
some inherent non-linear relationship
in this hidden-layer structure.
It's complex, but it has been shown to be
very effective in many settings.
That is not to say that it is suitable for all problems,
but it generally has very good performance.
But neural networks, compared to other methods,
are much harder to interpret.
That means you may take
your inputs, which go through many layers of
hidden-layer transformation,
and then you have a good output.
That means that your results are good,
but it's really difficult to explain
how you came up with that or what the key factors were.
There is some information in
your hidden layers, and there is actually
very active research
that really focuses on
explainable AI, and neural networks in particular:
can you actually explain
how a particular decision is made?
So far this is still very limited,
so it's an area for continued research.
All right, so we have talked about different methods.
Those are, of course, individual methods
where you're building one model
to make your classification decision,
but there's also a
[inaudible] widely used approach that
sits on top of the individual methods.
It's called ensemble. As the name suggests,
instead of using a single model,
I'm trying to build multiple models and then I'm
trying to use those multiple models
together to make my classification decision.
Think about this analogy.
Usually if you're a patient
and you have some symptoms, and
maybe it's not straightforward,
instead of consulting just a single doctor,
you may go to multiple doctors and
then try to aggregate their information,
so that you know what you should do or
what the potential best treatment would be.
It's just natural and it's intuitively reasonable
for us to leverage
multiple models when we're doing
classification because here, similarly,
we're trying to make a decision.
If one model is not good enough,
maybe I can try multiple and leverage
the joint wisdom of multiple models.
Depending on how you actually combine the models,
there are two main categories.
The first one is called bagging.
The general idea is just simple aggregation.
That means that if I build multiple models,
I just view them as equally important,
so this is about equal weights,
and because they are equally important,
I may basically just do a majority vote.
For example, I asked five doctors,
four of them said that this is the type of treatment they
would recommend and one has
something different, then you
probably go with the majority, something like that.
This can usually just be accomplished by using
random sampling with replacement.
The idea is that you have your training set,
and every time you take a sample from it
and use that to build a model.
You repeat that multiple times,
so now you have multiple models,
and that's why I say it's equal weight: they
are all based on some sample set
and no one is particularly better than the others,
so you basically just take the majority vote. But there
are scenarios where you may
think some of the models may be better.
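The bagging procedure just described (sample with replacement, build a model per sample, take an equal-weight majority vote) can be sketched roughly as follows. The one-attribute threshold learner `train_stump` and the toy data are hypothetical stand-ins for whatever base classifier and dataset you actually use.

```python
import random
from collections import Counter


def train_stump(sample):
    """Hypothetical weak learner: pick the threshold on x with the fewest errors."""
    best = None
    for threshold, _ in sample:
        for low, high in ((0, 1), (1, 0)):
            errors = sum((low if x < threshold else high) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x < threshold else high


def bagging(train, n_models, seed=0):
    """Build n_models models, each on its own sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]  # same size as the training set
        models.append(train_stump(sample))
    return models


def majority_vote(models, x):
    """Every model gets one equal vote; the most common label wins."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy data: x below 10 is class 0, x from 10 upward is class 1.
train = [(x, 0) for x in range(0, 10)] + [(x, 1) for x in range(10, 20)]
models = bagging(train, n_models=5)
print(majority_vote(models, 2), majority_vote(models, 17))
```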
This is similar to the medical doctors scenario.
In many cases you'll say, yeah, they're all good,
so I just weigh them equally. But
if there are scenarios where, say,
some subset of the doctors
are actually more experienced or they have shown
success with this particular type
of disease or treatment,
then you may weigh
their decisions or recommendations more heavily.
That's the general idea,
and that's the boosting idea.
Instead of using equal weights,
you are trying to weigh them differently.
You're still building multiple models
but you want to say that
different models have different weights and that's how
you're going to combine their votes.
The high-level idea is about
how you adjust the votes, and in particular
that you want to weight things differently
based on the misclassified cases.
What we're trying to say here is that if you have, say,
a set of objects, maybe half of
them are reasonably easy, so they are readily
captured by
your previous models, but
there are some cases that are just harder,
meaning that all my previous models
just make mistakes on those cases.
Now you want to weight
those cases higher, because you will then
focus on those cases and train
a model that will do better on them.
When you combine the different models,
of course you have the corresponding weights
based on the error rate,
so you can then weight the models differently.
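The lecture does not name a specific boosting algorithm. As one well-known instance, here is a sketch of a single AdaBoost-style round, which is an assumption on my part: the model's vote weight comes from its error rate, and the misclassified objects get a higher weight for the next model. It assumes the weighted error is strictly between 0 and 1.

```python
import math


def boosting_round(weights, correct):
    """One AdaBoost-style round (assumes 0 < weighted error < 1).

    weights: current per-object weights, summing to 1
    correct: booleans, True where the current model classified the object correctly
    Returns the model's vote weight and the updated, renormalized object weights.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Lower error rate gives this model a larger vote weight.
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified objects are scaled up, correctly classified ones scaled down.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


def weighted_vote(predictions, alphas):
    """Combine +1/-1 predictions from several models using their vote weights."""
    score = sum(a * p for p, a in zip(predictions, alphas))
    return 1 if score >= 0 else -1


weights = [0.25, 0.25, 0.25, 0.25]
alpha, new_weights = boosting_round(weights, [True, True, True, False])
print(alpha, new_weights)
```

In this toy round the single misclassified object ends up carrying half of the total weight, so the next model is pushed to get it right.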
That's the high-level idea about
ensemble using multiple models,
but then using bagging
if you think the equal weight will be
good or you can do
boosting if you want to weight them differently.
All right, so next,
let's talk about model evaluation.
As I said, with classification,
you have many methods you can choose.
Sometimes, by design,
one is better than the other,
but other times they are all reasonably good
and would all be applicable for your scenario.
Then you're really talking about the actual performance.
So the starting point is actually how you do your evaluation.
The basic idea, of course, as I said with classification,
is that you have the training set.
The training set is used to construct the model, so the
model would see the labels already.
Then you have your testing set.
The important general idea is that with any evaluation,
you have this hold-out idea.
You would hold out a subset of your initial dataset so
that only a part of
it is being used or being shown to your model,
so your model has seen those already,
and then you're testing your model with the holdout set.
That's the set that has
not been seen by the model.
Generally, what you do is this splitting.
You can do, say, an 80/20
split, so 80% for training,
20% for testing,
so that's your holdout part,
or you can do some other ratio,
but it's important for you to have the holdout set.
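A minimal sketch of the 80/20 hold-out split. The function name, the shuffling step, and the fixed seed are illustrative choices, not something the lecture prescribes.

```python
import random


def holdout_split(data, test_ratio=0.2, seed=0):
    """Shuffle a copy of the data, then hold out the last test_ratio share for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    split = len(shuffled) - n_test
    return shuffled[:split], shuffled[split:]


train, test = holdout_split(range(100))
print(len(train), len(test))  # 80 20
```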
Another related approach is
referred to as cross-validation,
where the data is usually divided into k folds, or k partitions.
The idea is that I have my original set,
but I can then split it into k partitions.
Then I do k runs.
In each run, I would select one partition as
the holdout set and the remaining k-1 as my training set.
That way, you rotate through the k partitions,
you have k folds,
so that's why it's called k-fold cross-validation.
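The rotation through the k partitions can be sketched like this. It is a bare-bones version that assumes the data is already in random order; the helper name is made up.

```python
def k_fold_splits(data, k):
    """Yield (training set, holdout set) pairs, one per fold."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal partitions
    for i in range(k):
        holdout = folds[i]
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield training, holdout


data = list(range(10))
splits = list(k_fold_splits(data, k=5))
for training, holdout in splits:
    print(holdout, training)
```

Each object lands in the holdout set exactly once, so after the k runs every object has been used for testing once and for training k-1 times.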
Another one that's also typically used
is called bootstrapping.
In this case, you are doing
sampling, but with replacement.
For example, if you have n objects,
now you do random sampling n times
with replacement. That means
each object has a 1/n chance of being selected in each draw,
but once you have sampled an object you put it back,
so that means that the same object may be selected again.
When you do this process,
you will get roughly
0.632 of your objects
selected as your training set,
and then the remaining part would be your testing set.
That's why it's also referred to as
the 0.632 bootstrapping approach.
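Where does 0.632 come from? An object is missed by all n draws with probability (1 - 1/n)^n, which approaches e^-1, about 0.368, so roughly 0.632 of the objects get picked at least once. A small sketch, with made-up function names, that checks the number and builds one bootstrap split:

```python
import random


def expected_selected_fraction(n):
    """Chance that a given object is picked at least once in n draws with replacement."""
    return 1 - (1 - 1 / n) ** n


def bootstrap_split(data, seed=0):
    """Draw len(data) indices with replacement; objects never drawn form the test set."""
    rng = random.Random(seed)
    drawn = [rng.randrange(len(data)) for _ in data]
    chosen = set(drawn)
    training = [data[i] for i in drawn]  # may contain repeats
    testing = [item for i, item in enumerate(data) if i not in chosen]
    return training, testing


print(round(expected_selected_fraction(1000), 3))  # 0.632
training, testing = bootstrap_split(list(range(1000)))
print(len(set(training)), len(testing))
```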
Anyway, all those different methods here are really
about splitting your training set and your testing set.
Now once you have your sets set up,
you train your models and then you need to evaluate them.
Here we're talking about what kind of measures to use,
how you measure the quality of the different models.
Typically, we use this notion of accuracy with
classification because you have
categories, the class labels.
We usually use this confusion matrix.
What it is showing is that you have
the actual class label, because that's what you
already have from your set.
This is your test set,
which has the corresponding labels.
For the test set, you have the labels for
evaluation even though the labels
are not shown to the model.
So you'll have your actual class,
this could be yes or no if it's a binary classification,
and then you have the predicted
class label by your model.
That can also be yes or no,
so you basically have this two-by-two combination.
If it's actually yes
and the predicted class label is yes,
that's great, that's a true positive.
But then you would have the mistaken cases,
the false positives and false negatives,
and then you have the true negatives.
So these are the four cases.
Obviously, the true positives
and the true negatives are the good ones;
ideally every case would fall there.
You want to minimize
the false negatives and the false positives because those
are the cases where you're making a mistake
about the class label.
Think about fraud detection,
where I'm trying to determine whether
a particular transaction is a fraud activity or not.
It's binary, so you can either make a mistake or
correctly classify it as fraud or not.
Now with this confusion matrix,
you can then compute a few different measures.
Sensitivity is defined as the true
positives over the total number of actual positives.
Say, in total,
I have 1,000 positives and I found 900 of them.
That ratio, 900 out of 1,000, is your sensitivity.
But then you look at the negative cases.
Among all the negative cases,
which ones are true negatives?
This is specificity,
so that's about looking at the negative cases.
There's also precision.
Precision basically is that
my classifier flags some objects as the positive cases.
But this is a combination of
the true positives and
false positives, because I made mistakes.
It's the ratio of true positives
divided by all the positives
that are flagged by the classifier.
If you say I'm flagging
100 transactions as fraud transactions,
the best-case would be that all 100,
they are indeed fraud,
then I have a perfect precision.
But many times you make mistakes.
The false positive would then contribute to
a lower than perfect precision value.
When you consider, of course, both the
positive and the negative cases,
we use accuracy,
which means across all the different classes,
what are the scenarios where I'm correct?
Correct here could be a true
positive or a true negative.
You're basically calculating your accuracy,
which is also equivalent to this formula
combining your sensitivity and specificity.
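The four measures can be computed directly from the confusion matrix counts. The sketch below uses made-up counts that extend the 1,000-positives example; the last lines check that accuracy equals sensitivity and specificity weighted by the share of actual positives and actual negatives.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute the standard measures from the four confusion matrix counts."""
    positives = tp + fn  # actual positives
    negatives = fp + tn  # actual negatives
    sensitivity = tp / positives
    specificity = tn / negatives
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (positives + negatives)
    return sensitivity, specificity, precision, accuracy


# Hypothetical counts: 1,000 actual positives (900 found), 9,000 actual negatives.
sens, spec, prec, acc = confusion_metrics(tp=900, fn=100, fp=300, tn=8700)

# Accuracy as a weighted combination of sensitivity and specificity.
combined = sens * 1000 / 10000 + spec * 9000 / 10000
print(sens, spec, prec, acc, combined)
```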
There's also actually an important point
to make about the four cases,
the true positive, true negative,
false positive, and false negatives.
In the previous case,
we're just combining all of them
and viewing them as equally important.
But there are many real-world settings
where you need to
really think through
the different scenarios and whether
one scenario is more important than the other.
Let's think about these scenarios: fraud detection,
medical diagnosis.
Think about fraud detection, where
I'm flagging a transaction as fraud or not.
My false positive basically says that
I say it's a positive, so I'm saying this is fraud,
but it turns out it's
not, so it's actually a false positive.
It's a normal case I flagged as fraud by mistake.
While if you look at the false negative,
that means I didn't report
this as fraud. That means I missed it.
It was actually a fraud,
but I missed it.
By mistake I thought
it was normal, so I just let it pass.
As you can see, these two,
of course both of them are error cases.
We would like to get them down to zero if we can.
But if not, then you're
talking about potential trade-offs.
You may have different models.
One model may be better in
terms of having a low false positive.
The other one may do better in terms of false negative.
Then you need to decide,
if I have to choose which one's more
important for me, if I have to trade off.
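One simple way to make that trade-off explicit is to put a cost on each type of error and compare the totals. This is a sketch under assumed numbers: the error counts and the cost values below are hypothetical.

```python
def total_cost(errors, cost_fp, cost_fn):
    """Total cost of a model's mistakes under the given per-error costs."""
    return errors["fp"] * cost_fp + errors["fn"] * cost_fn


def pick_model(models, cost_fp, cost_fn):
    """Return the name of the model with the lowest total cost."""
    return min(models, key=lambda name: total_cost(models[name], cost_fp, cost_fn))


# Hypothetical error counts for two models on the same test set.
models = {
    "model_a": {"fp": 50, "fn": 20},   # fewer false positives
    "model_b": {"fp": 120, "fn": 5},   # fewer false negatives
}

equal = pick_model(models, cost_fp=1, cost_fn=1)
fn_heavy = pick_model(models, cost_fp=1, cost_fn=10)
print(equal, fn_heavy)
```

With equal costs the model with fewer total errors wins, but once a missed fraud is treated as ten times worse than a false alarm, the choice flips.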
There's also this notion when you
have a multi-class classification.
When it's binary, it's
true or false, yes or no.
But if it's multi-class, in the general case,
you can think about
the exact match scenario.
That means I only consider it a success
if you can give me
the correct class label; otherwise it's false.
I don't care about the details, it's just an error.
I only care about the exact match.
But there are scenarios where it's
fine that you get a class that's not exactly the same,
but it's close, or closer than the other ones.
Think about a scenario for activity recognition.
There are different activities that
a user may be performing or doing,
and as you're trying to make
this multi-class classification,
you may make mistakes.
If you're mistaking one activity as another
versus mistaking it as yet another type of activity,
that difference may actually matter.
That means it's okay if
you misclassify activity one
as activity two; that's fine or not too bad.
But if it's activity one
being misclassified as activity three,
then it's not good.
You can actually weight them differently.
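A sketch of the difference between exact-match error and a weighted error, assuming a hand-written penalty table. The activity names and the penalty values are invented for illustration; any pair not listed counts as a full error.

```python
# Hypothetical penalties: confusing activity1 with activity2 is a mild mistake.
penalty = {
    ("activity1", "activity2"): 0.2,
    ("activity2", "activity1"): 0.2,
}


def exact_match_error(actual, predicted):
    """Every mismatch counts as one full error."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def weighted_error(actual, predicted, penalty):
    """Mismatches are charged by the penalty table; unlisted pairs cost 1.0."""
    total = 0.0
    for a, p in zip(actual, predicted):
        if a != p:
            total += penalty.get((a, p), 1.0)
    return total / len(actual)


actual = ["activity1", "activity1", "activity2", "activity3"]
mild = ["activity2", "activity1", "activity2", "activity3"]    # one close mistake
severe = ["activity3", "activity1", "activity2", "activity3"]  # one far mistake

print(exact_match_error(actual, mild), exact_match_error(actual, severe))
print(weighted_error(actual, mild, penalty), weighted_error(actual, severe, penalty))
```

Under exact match both predictions look equally bad, while the weighted error separates the close mistake from the far one.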
Anyway, the takeaway message is really that there's
not one measure or
one calculation for you to do to compare everything.
Rather, you need to understand
your scenario and then decide which measures
or metrics are more important in your scenario.
Then use those measures to
compare across methods and
pick the ones that are most suitable.