W2 - 2.3 - Ensemble, Model Evaluation

So far we have talked about

quite a few classification methods.

Let's just quickly summarize them and look

at their key features or characteristics.

We start with decision tree induction.

We know how it works, and

generally, like I said,

decision trees are fairly efficient.

What you're doing is more like a top-down,

divide-and-conquer approach,

but it's a greedy approach, which may not

give you the optimal solution but

usually gives a very good solution.

It's fairly efficient, and one advantage

of decision tree induction is

that when you have the tree,

you know exactly how the decisions are made.

That's why it's very easy to interpret

the decision rather than saying

that well it's a black box I don't know how it's done.

With a decision tree you know

exactly how a particular

classification decision is

derived by following the path from the root node:

which attribute you look at

first and, based on that value,

what's the next attribute you check. Because of

all those combinations of attribute values,

you are making

that particular classification decision.

That is actually very important, and it also

makes decision trees very useful in many real-world applications.
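
To make that path-following idea concrete, here is a minimal sketch in Python. The tree, the attribute names, and the values are made up for illustration and are not from the lecture; the point is only that the prediction comes with the list of tests that led to it.

```python
# Hypothetical hand-built tree; attribute names and values are illustrative only.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "middle_aged": "buys",
        "senior": {
            "attribute": "credit",
            "branches": {"fair": "buys", "excellent": "does_not_buy"},
        },
    },
}


def classify_with_path(node, record):
    """Follow the path from the root; return (label, list of tests taken)."""
    path = []
    while isinstance(node, dict):  # internal nodes are dicts, leaves are labels
        attribute = node["attribute"]
        value = record[attribute]
        path.append(f"{attribute} = {value}")
        node = node["branches"][value]
    return node, path


record = {"age": "youth", "student": "yes", "credit": "fair"}
label, path = classify_with_path(tree, record)
print(" -> ".join(path), "=>", label)
```

The returned path is the explanation: it lists which attribute was checked first and which one was checked next because of that value.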

The other one is Bayesian classification.

This actually has a solid probabilistic foundation.

It's very efficient to calculate.

Basically you're just calculating probabilities.

It's also explainable, as we

saw in the previous lecture.

When we went through the example,

you were computing the probability

of each of those classes,

given the particular object's attribute values.

That also gives you

an explicit way of saying why you make that decision.

Say it should be this class,

because it has a high probability

of this class and because

of all the individual

probabilities that you're leveraging.

Another key feature of Bayesian classification is

that it's incremental. What does that mean?

As we mentioned earlier,

the question with incremental classification

is basically this: if you have new data,

do you need to redo your modeling process altogether, or

can you take the model

that you have already constructed and just update it?

If you think about decision trees, for example,

decision tree induction is not incremental because the

choices you make at each layer

would change if you have more data.

That's why, if you have more data and

you're using the decision tree

induction method, you have to

redo everything and reconstruct the whole tree.

With Bayesian classification,

since you are calculating

probabilities, the good thing is

that if you already have

your current probabilities and now you have

some new information, all you need to do is

take them and update them with the new information.

That's why it's incremental: you can take

your current Bayes classifier, where you

have all the probabilities that are already calculated,

and you only need to update it with

the new information so you have

the updated probabilities, and that's all you need.

That's why it's incremental.

Incremental classification can also be useful in

scenarios where you

expect new data to keep

being added and you expect

your model to adapt quickly when you have new data.
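
One way to picture the incremental update is a classifier that only stores counts. The sketch below is an assumption about how this could be done, not the lecture's implementation: the class name `CountingNaiveBayes` and the weather-style records are hypothetical, and there is no smoothing. Adding a new labeled object just increments a few counters; nothing is rebuilt.

```python
from collections import Counter


class CountingNaiveBayes:
    """Naive Bayes over categorical attributes, stored purely as counts."""

    def __init__(self):
        self.total = 0
        self.class_counts = Counter()
        self.value_counts = Counter()  # (label, attribute index, value) -> count

    def update(self, record, label):
        """Fold one new labeled object into the existing counts."""
        self.total += 1
        self.class_counts[label] += 1
        for index, value in enumerate(record):
            self.value_counts[(label, index, value)] += 1

    def score(self, record, label):
        """P(label) times the product of P(value | label), without smoothing."""
        n_label = self.class_counts[label]
        if n_label == 0:
            return 0.0
        result = n_label / self.total
        for index, value in enumerate(record):
            result *= self.value_counts[(label, index, value)] / n_label
        return result

    def predict(self, record):
        return max(self.class_counts, key=lambda label: self.score(record, label))


model = CountingNaiveBayes()
initial = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes"), (("sunny", "mild"), "yes")]
for record, label in initial:
    model.update(record, label)
before = model.predict(("sunny", "hot"))

# New data arrives: update the counts only, no retraining from scratch.
for _ in range(2):
    model.update(("sunny", "hot"), "yes")
after = model.predict(("sunny", "hot"))
print(before, after)
```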

Next, we talked about support

vector machines, or SVMs.

They generally have very good performance.

If you look at research papers

that compare

across multiple types of methods,

the support vector machine is usually

the best-performing or one of the best-performing methods.

It has pretty good performance.

Next, neural networks.

As we just mentioned, the design can be complex.

It is really trying to capture

some inherent non-linear relationships

in its hidden-layer structure.

It's complex, but it has been shown to be

very effective in many settings.

That is not to say that it is suitable for all problems,

but it generally has very good performance.

But neural networks, compared to other methods,

are much harder to interpret.

That means your inputs may

go through many layers of

hidden-layer transformations

and then you have a good output.

Your results are good,

but it's really difficult to explain

how you came up with them or what the key factors were.

There is some information in

your hidden layers, and there is actually

very active research

that focuses on

explainable AI, and neural networks in particular:

can you actually explain

how a particular decision is made?

So far this is still very limited,

so it's an area for continued research.

All right, so we have talked about different methods.

Those are, of course, individual methods

where you're building one model

to make your classification decision,

but there's also a

[inaudible] widely used approach that

sits on top of the individual methods.

It's called ensemble. As the name suggests,

instead of using a single model,

I'm trying to build multiple models and then

use those multiple models

together to make my classification decision.

Think about this analogy.

If you're a patient and

you have some symptoms, and

maybe it's not straightforward,

instead of consulting just a single doctor,

you may go to multiple doctors and

then try to aggregate their information,

so that you know what you should do or

what the best treatment would be.

It's natural and intuitively reasonable

for us to leverage

multiple models when we're doing

classification, because here, similarly,

we're trying to make a decision.

If one model is not good enough,

maybe I can try multiple models and leverage

the joint wisdom of those models.

Depending on how you actually combine the models,

there are two main categories.

The first one is called bagging.

The general idea is just simple aggregation.

That means that if I build multiple models,

I just view them as equally important,

so this is about equal weights,

and because they are equally important,

I basically just take a majority vote.

For example, say I asked five doctors.

If four of them said that this is the type of treatment they

would recommend and one has

something different, then you

probably go with the majority, something like that.

This can usually be accomplished by using

random sampling with replacement.

The idea is that you have your training set,

every time you take a sample of it,

use that to build a model,

and you repeat that multiple times,

so now you have multiple models.

That's why I say it's equal weight: they

are all based on some sample set

and none is particularly better than the others,

so you basically just take the majority vote. But there

are scenarios where you may

think some of the models may be better.
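
The bagging steps just described (sample with replacement, build a model, repeat, take the majority vote) can be sketched roughly as follows. The base learner here is a one-feature threshold rule chosen only to keep the example short; it, the toy data, and the function names are assumptions for illustration, not something the lecture specifies.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a threshold rule on (x, y) pairs: predict 1 if x >= threshold, else 0."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging(train, n_models, rng):
    """Build n_models stumps, each from a random sample drawn with replacement."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        models.append(train_stump(sample))
    return models


def predict_majority(models, x):
    """Every model gets one equal vote; the most common answer wins."""
    votes = Counter(int(x >= threshold) for threshold in models)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
train = [(x, int(x >= 5)) for x in range(10)]  # toy data: label is 1 when x >= 5
models = bagging(train, n_models=5, rng=rng)
print(predict_majority(models, 9), predict_majority(models, 0))
```

Using an odd number of models avoids ties in a two-class vote.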

This is similar to the medical doctors scenario.

In many cases you'll say, yeah, they're all good,

so I'll just weigh them equally. But

there are scenarios where, say,

some subset of the doctors

are actually more experienced or have shown

success with this particular type

of disease or treatment.

Then you may weigh

their decisions or recommendations more heavily.

That's the general idea of boosting.

Instead of using equal weights,

you are trying to weigh them differently.

You're still building multiple models

but you want to say that

different models have different weights and that's how

you're going to combine their votes.

The high-level idea is about

how you adjust the weights. In particular,

you want to weight things differently

based on the misclassified cases.

What we're trying to say here is that if you have, say,

one million objects, maybe half of

them are reasonably easy, so they are readily

captured by

your previous models, but

there are some cases that are just harder,

meaning that all my previous models

just make mistakes on those cases.

Now you want to weigh

those cases higher, because you will now

focus on those cases and train a

model that will do better on them.

When you combine the different models,

you of course have those corresponding weights

based on the error rate, so

you can then weight the models differently.
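
The lecture does not name a specific boosting algorithm, so the sketch below assumes the AdaBoost-style formulas as one common way to do this: the model's vote weight comes from its weighted error rate, and the misclassified objects get a higher weight for the next round. The function name and the four-object example are hypothetical.

```python
import math


def boosting_round(weights, correct):
    """One AdaBoost-style reweighting step (an assumed formulation).

    weights: current object weights, summing to 1
    correct: True where the current model classified the object correctly
    Returns (model_weight, new_object_weights).
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("sketch assumes a weighted error strictly between 0 and 0.5")
    # Lower error rate -> larger say in the final weighted vote.
    model_weight = 0.5 * math.log((1 - error) / error)
    # Shrink weights of correctly classified objects, grow the misclassified ones.
    raised = [
        w * math.exp(-model_weight if ok else model_weight)
        for w, ok in zip(weights, correct)
    ]
    total = sum(raised)
    return model_weight, [w / total for w in raised]


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]  # the model got the last object wrong
model_weight, new_weights = boosting_round(weights, correct)
print(round(model_weight, 3), [round(w, 3) for w in new_weights])
```

After the update, the single hard case carries as much weight as the three easy ones combined, so the next model is pushed to focus on it.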

That's the high-level idea of

ensembles: using multiple models,

and then using bagging

if you think equal weights will be

good, or

boosting if you want to weight them differently.

All right, so next,

let's talk about model evaluation.

As I said, with classification,

you have many methods you can choose from.

Sometimes, by design,

one is better than the other,

but other times they are all reasonably good

and would be applicable for your scenario.

Then you're really talking about the actual performance.

So the starting point is how you do your evaluation.

The basic idea, as I said: with classification,

you have the training set.

The training set is used to construct the model, so the model

would see those labels already.

Then you have your testing set.

The general idea that is important with any evaluation

is this hold-out idea.

You hold out a subset of your initial dataset so

that only a part of

it is used to train or shown to your model,

so your model has seen that part already,

and then you test your model with the holdout set.

That's the set that has

not been seen by the model.

Generally, what you do is this splitting.

You can do, say, an 80/20

split, so 80% for training and

20% for testing,

so that's your holdout part,

or you can do some other ratio

but it's important for you to have the holdout set.
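
A minimal sketch of such a holdout split in plain Python follows. The function name `holdout_split`, the fixed seed, and the use of integers as stand-in objects are assumptions made for the example.

```python
import random


def holdout_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data, then hold out a fraction of it for testing."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, holdout set)


train, test = holdout_split(range(100))  # 80/20 split of 100 toy objects
print(len(train), len(test))
```

Shuffling before the cut matters when the original data is ordered, for example by class or by time of collection.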

Another related approach is

referred to as cross-validation, where the data

is usually divided into k folds, or k partitions.

The idea is that I have my original set,

but I can then split it into k partitions.

Then I do k runs.

In each run, I select one partition as

the holdout set and the remaining k-1 as my training set.

That way, you rotate through the k partitions,

so you have k folds,

and that's why it's called k-fold cross-validation.
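
The rotation through the k partitions can be sketched like this. The helper `k_fold_splits` is a hypothetical name, and the sketch assigns objects to folds in round-robin order without shuffling, which is a simplification.

```python
def k_fold_splits(data, k):
    """Yield (train, test) pairs, holding out each of the k partitions once."""
    items = list(data)
    folds = [items[i::k] for i in range(k)]  # round-robin assignment to folds
    for held_out in range(k):
        test = folds[held_out]
        train = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
        yield train, test


splits = list(k_fold_splits(range(10), k=5))
for train, test in splits:
    print(len(train), len(test))
```

Every object ends up in the holdout set exactly once across the k runs, and the k evaluation results are typically averaged.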

Another one that's also typically used

is called bootstrapping.

In this case, you are doing

sampling, but with replacement.

For example, if you have n objects,

you do n random draws

with replacement. That means

each object has a 1/n chance of being selected on each draw,

but once you have sampled an object you put it back,

so the same object may be selected again.

When you do this process,

roughly

0.632 of your objects

will be selected into your training set,

and then the remaining part would be your testing set.

That's also referred to as

this 0.632 bootstrapping approach.
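
Where the 0.632 comes from: an object is missed by one draw with probability 1 - 1/n, so it is missed by all n draws with probability (1 - 1/n)^n, which approaches 1/e, about 0.368, as n grows. The rest, about 0.632, gets selected at least once. The sketch below checks this numerically; the function name and the choice of n are assumptions for the example.

```python
import random


def bootstrap_split(n, rng):
    """Draw n object indices with replacement; never-drawn objects form the test set."""
    draws = [rng.randrange(n) for _ in range(n)]
    selected = set(draws)
    test = [i for i in range(n) if i not in selected]
    return draws, test


n = 10_000
expected_fraction = 1 - (1 - 1 / n) ** n  # chance an object is drawn at least once
draws, test = bootstrap_split(n, random.Random(0))
observed_fraction = len(set(draws)) / n
print(round(expected_fraction, 3), round(observed_fraction, 3))
```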

Anyway, all those different methods here are really

about splitting your data into a training set and a testing set.

Now once you have that set up,

you train your models and then you need to evaluate them.

Here we're talking about what kind of measures to use:

how do you measure the quality of the different models?

Typically, we use this notion of accuracy with

classification, because you have

categories, the class labels.

We usually use this confusion matrix.

What it is showing is that you have

the actual class label, because that's what you

already have from your set.

This is your test set, so the objects

have the corresponding labels.

For the test set, you have the labels for

evaluation even though the labels

are not shown to the model.

So you'll have your actual class,

which could be yes or no if it's binary classification,

and then you have the predicted

class label from your model.

That can also be yes or no,

so you basically have this two-by-two combination.

If the actual class is yes

and the predicted class label is also yes,

that's great, that's a true positive.

But then you would have the mistaken cases,

the false positives and false negatives,

and then you have the true negatives.

So these are the four cases.
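
As a small illustration, the four cells can be counted directly from the actual and predicted labels. The label lists below are made-up toy data, and `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="yes"):
    """Count the four cells of a binary confusion matrix."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts


actual = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]
counts = confusion_counts(actual, predicted)
print(counts["TP"], counts["FN"], counts["FP"], counts["TN"])
```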

Obviously, the true positives

and the true negatives are the good ones,

and you would ideally like every case to be one of those.

You want to minimize

the false negatives and the false positives because those

are the cases where you're making a mistake

about the class label.

Think about fraud detection,

where I'm trying to determine whether

a particular transaction is a fraudulent activity or not.

It's binary, so you can make a mistake or you can

correctly classify it as fraud or not.

Now with this confusion matrix,

you can then compute a few different measures.

Sensitivity is defined as the true

positives over the total number of positives.

Say, in total,

I have 1,000 positives and I found 900 of them.

That ratio is your sensitivity.

But then you look at the negative cases.

Among all the negative cases,

how many are true negatives?

That is specificity,

so that's about looking at the negative cases.

There's also precision.

Precision is basically this:

my classifier flags some objects as positive cases.

But this is a combination of

the true positives and

false positives, because I made mistakes.

Precision is the ratio of true positives

divided by all the positives

that are flagged by the classifier.

If you say I'm flagging

100 transactions as fraud transactions,

the best-case would be that all 100,

they are indeed fraud,

then I have a perfect precision.

But many times you make mistakes.

The false positive would then contribute to

a lower than perfect precision value.
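
The three measures can be written as small functions of the confusion-matrix counts. The 1,000 positives with 900 found come from the example above; the false positive and true negative counts are invented here just so the other two measures have something to work with.

```python
def sensitivity(tp, fn):
    """True positives over all actual positives."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negatives over all actual negatives."""
    return tn / (tn + fp)


def precision(tp, fp):
    """True positives over everything the classifier flagged as positive."""
    return tp / (tp + fp)


tp, fn = 900, 100    # 1,000 actual positives, 900 of them found
fp, tn = 300, 8700   # hypothetical counts for the negative cases

print(sensitivity(tp, fn), round(specificity(tn, fp), 3), precision(tp, fp))
```

Note that sensitivity and precision share the same numerator but answer different questions: how many of the real positives were found, versus how many of the flagged cases were real.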

When you consider both the

positive and the negative cases,

we use accuracy.

That means, across all the different classes,

what are the cases where I'm correct?

The correct ones could be true

positives and true negatives.

You're basically calculating your accuracy,

which is also equivalent to this formula

combining your sensitivity and specificity.
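
The formula on the slide is not captured in the transcript. A standard identity that fits the description, assuming P = TP + FN is the number of actual positives and N = TN + FP is the number of actual negatives, is sketched below: accuracy is a weighted average of sensitivity and specificity, with the class proportions as the weights.

```latex
\[
\text{accuracy}
  = \frac{TP + TN}{P + N}
  = \underbrace{\frac{TP}{P}}_{\text{sensitivity}} \cdot \frac{P}{P + N}
  + \underbrace{\frac{TN}{N}}_{\text{specificity}} \cdot \frac{N}{P + N}
\]
```

One consequence is that when negatives vastly outnumber positives, as in fraud detection, accuracy is dominated by specificity and can look high even if sensitivity is poor.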

There's also actually an important point

to make about the four cases,

the true positive, true negative,

false positive, and false negatives.

In the previous case,

we were just combining all of them

and viewing them as equally important.

But there are many real-world settings

where you need to

really think through

the different scenarios and whether

one scenario is more important than the other.

Let's think about scenarios like fraud detection

or medical diagnosis.

Think about fraud detection:

I'm flagging a transaction as fraud or not.

A false positive basically says that

I flagged it as positive, so I'm saying this is fraud,

but it turns out it's

[inaudible], so it's actually a false positive.

It's a normal case that I flagged as fraud by mistake.

While if you look at the false negative,

that means I didn't report

this as fraud. That means I missed it.

It was actually fraud,

but I missed it.

That's again my mistake.

I thought it was normal and I just let it pass.

As you can see, these two

are of course both error cases.

We would like to get them down to zero if we can.

But if not, then you're

talking about potential trade-offs.

You may have different models.

One model may be better in

terms of having a low false positive.

The other one may do better in terms of false negative.

Then you need to decide,

if I have to choose which one's more

important for me, if I have to trade off.
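
One simple way to make that trade-off explicit is to attach a cost to each kind of error and compare the totals. The two models, their error counts, and the cost values below are all hypothetical; the sketch only shows that the preferred model can flip once the two errors are no longer treated as equally important.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Weighted sum of the two kinds of errors."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical error counts for two models on the same test set.
model_a = {"fp": 20, "fn": 60}  # few false alarms, misses more fraud
model_b = {"fp": 80, "fn": 10}  # more false alarms, misses less fraud

# Treating both errors as equally important.
equal_a = total_cost(**model_a, cost_fp=1, cost_fn=1)
equal_b = total_cost(**model_b, cost_fp=1, cost_fn=1)

# Assuming a missed fraud is ten times as costly as a false alarm.
weighted_a = total_cost(**model_a, cost_fp=1, cost_fn=10)
weighted_b = total_cost(**model_b, cost_fp=1, cost_fn=10)

print(equal_a, equal_b, weighted_a, weighted_b)
```

With equal costs model A looks better; once missed fraud is weighted more heavily, model B wins.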

There's also this notion when you

have multi-class classification.

If it's binary, it's

true or false, yes or no.

But if it's multiclass, in the general case,

you can think about

the exact-match scenario.

That means I only consider it a success

if you can give me

the correct class label. Otherwise it's false.

I don't care what else it says; it's just an error.

I only care about the exact match.

But there are scenarios where it's

fine that you get a class that's not exactly the same,

but it's close, or closer than other ones.

Think about a scenario for activity recognition.

There are different activities that

a user may be performing,

and as you're trying to do

this multi-class classification,

you may make mistakes.

If you're mistaking one activity as another

versus one activity as yet another type of activity,

those differences may actually matter.

That means it's okay if

you misclassify activity one

as activity two; that's fine or not too bad.

But if it's activity one

being misclassified as activity three,

then it's not good.

You can actually weight them differently.
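
A rough sketch of weighting multi-class mistakes differently is a cost table indexed by (actual, predicted) pairs. The activity names and the cost values are placeholders chosen for illustration; an exact match costs nothing, a near miss costs a little, and a far miss costs more.

```python
# Hypothetical costs: confusing activity 1 with 2 is a near miss, with 3 a far miss.
costs = {
    ("activity_1", "activity_2"): 1,
    ("activity_1", "activity_3"): 5,
}


def exact_match_errors(actual, predicted):
    """Every mismatch counts the same."""
    return sum(a != p for a, p in zip(actual, predicted))


def weighted_errors(actual, predicted, costs, default_cost=1):
    """Mismatches are charged according to how bad the confusion is."""
    return sum(
        0 if a == p else costs.get((a, p), default_cost)
        for a, p in zip(actual, predicted)
    )


actual = ["activity_1", "activity_1", "activity_1"]
predicted = ["activity_1", "activity_2", "activity_3"]
print(exact_match_errors(actual, predicted), weighted_errors(actual, predicted, costs))
```

Under exact match the two mistakes look identical; under the cost table the far miss dominates the total.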

Anyway, the takeaway message is really that there's

not one measure or

one calculation for you to compare everything.

Rather, you need to understand

your scenario and then decide which measures

or metrics are more important in your scenario.

Then

use those measures to

compare across methods and

pick the ones that are most suitable.