Decision Trees


79 Terms

1

classification techniques

  1. decision tree methods

  2. nearest neighbor

  3. neural networks

  4. naive bayes

  5. support vector machines

New cards
2

what is a decision tree?

  1. classifier w tree structure

  2. essentially a flowchart (each ex moves from root to leaf)

New cards
3

DT definitions

  1. root node- start point

  2. decision node- test on single attribute

  3. leaf node- indicates value of class

  4. path- conjunction of conditions leading to a final decision (the tree overall is a disjunction of its paths)

New cards
4

2 phases of tree generation

  1. tree construction

    • at start all training ex are at root

    • partition ex recursively based on attributes alg selects

  2. tree pruning

    • improves performance on new examples

    • best tree on training data may not be best on new test data due to overfitting

New cards
5

decision boundaries

-DTs form rectangular regions → each region is a leaf

-border b/n regions = decision boundary

-decision boundary = axis parallel

New cards
6

expressiveness

flexibility of the classifier in forming decision boundaries

  • linear models- form boundaries w 1 line/plane

  • DTs- form rectangular regions

New cards
7

are linear models or DTs more expressive?

DTs- form many regions

  • but DT boundaries can only be axis-parallel, while a linear model can form 1 angled linear boundary

  • so which is more appropriate depends on the data

New cards
8

downside to flexibility?

can lead to more overfitting

New cards
9

when are DTs good?

strong interactions b/n many features (accurate modeling=must have complete trees)

New cards
10

when are DTs bad?

target class is based on continuous functions that involve 2 variables (unless simple)

New cards
11

DT learning concept x<y

-not good → cannot learn perfectly unless tree has infinite # leaves

-can only approximate w many small rectangles

New cards
12

tree construction

top-down recursive; divide and conquer

  • start w all training ex at root

  • partitioned recursively based on selected attribute

    • best attr selected using a metric

  • recursive process

    • partition ex at root based on ‘best’ feat

    • if binary split recurse on right then left child, etc
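
A minimal sketch of this greedy, recursive construction (toy code, not any particular library's implementation; examples are assumed to be (feature-dict, label) pairs and the gain metric is included just for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels (0 = pure, higher = more mixed)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, feature):
    """Reduction in entropy obtained by splitting on one categorical feature."""
    labels = [y for _, y in examples]
    after = 0.0
    for value in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == value]
        after += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - after

def build_tree(examples, features):
    """Greedy top-down (divide and conquer) construction."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not features:        # stopping criteria
        return Counter(labels).most_common(1)[0][0]  # leaf = majority class
    best = max(features, key=lambda f: info_gain(examples, f))
    tree = {}
    for value in {x[best] for x, _ in examples}:     # multi-way split on 'best'
        subset = [(x, y) for x, y in examples if x[best] == value]
        tree[(best, value)] = build_tree(subset, [f for f in features if f != best])
    return tree

# toy usage: two categorical features, three training examples
data = [({"outlook": "sunny", "windy": "no"}, "play"),
        ({"outlook": "sunny", "windy": "yes"}, "stay"),
        ({"outlook": "rainy", "windy": "no"}, "stay")]
print(build_tree(data, ["outlook", "windy"]))
```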

New cards
13

tree building is

greedy

  • each decision = locally optimal (not globally)

  • 1-level lookahead

  • heuristic methods = for difficult problems

  • benefit to greedy heuristic = fast

New cards
14

splitting on categorical attributes

  1. multi-way split- split on every value

  2. binary split- divides into 2 subsets

New cards
15

splitting on continuous attributes

  1. partition dynamically into ordinal values using equal freq or width

  2. binary split- (A<v) or (A>=v); sort values and try each split b/n successive values
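
e.g., if a continuous attribute takes the sorted values 60, 70, 75, 85 in the training data, the candidate binary splits are the midpoints A<65, A<72.5, and A<80; pick the one with the best metric.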

New cards
16

when does tree construction/partitioning/splitting stop?

  1. all samples at a node belong to same class

  2. no remaining attributes to split on (majority voting used to assign class)

  3. no ex left

  4. stop when splitting is not beneficial (the basic method ignores this and handles it w pruning, but pre-pruning does this)

New cards
17

picking locally optimal split

hunt’s alg- recursively partition training records into successively purer subsets

  • pure (homogeneous) set if all ex belong to same class

  • there are levels of purity

New cards
18

which split is best ex.

(figure: example comparing candidate splits)
New cards
19

purity is measured using

statistical metric

  • entropy

  • gini

  • classification error rate

New cards
20

avail DT algs

  1. hunt’s alg (one of earliest)

  2. CART (classification and regression trees)

    • regression tree- can predict numerical value so not just a classifier

    • sklearn

  3. ID3, C4.5 (pretty old)

New cards
21

converting DT to rules

1 rule per leaf

New cards
22

DT algs in practice

-Weka = free, open-source Java DM suite

-Python scikit-learn includes a DT alg based on CART; doesn't handle categorical data, so it must be converted to #s
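
A minimal sketch with scikit-learn (the tiny dataset and feature values are made up; categorical features are encoded to numbers first, as noted above):

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

X_raw = [["sunny", "hot"], ["rainy", "mild"], ["sunny", "mild"], ["rainy", "hot"]]
y = ["no", "yes", "yes", "no"]

X = OrdinalEncoder().fit_transform(X_raw)   # categorical values -> numbers
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf.fit(X, y)
print(clf.predict(X))                        # predictions on the training data
```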

New cards
23

entropy

-measures impurity/randomness of set of examples

-ranges from 0 (least random) to 1 (most random, for a 2-class problem)

-want close to 0 in classification
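
quick check: a node with 5 +/5 − ex → −(0.5·log2 0.5 + 0.5·log2 0.5) = 1.0 (most random); a node with 9 +/1 − → −(0.9·log2 0.9 + 0.1·log2 0.1) ≈ 0.47; a pure node → 0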

New cards
24

entropy is 0 if outcome is

certain

New cards
25

entropy is maximum if

we have no knowledge of the system (any outcome is equally likely)

New cards
26

we want entropy to be

low → more orderly

New cards
27

entropy to select feature

-entropy is defined on a sample (the entire training set at the root, or the subset of ex at one leaf)

-consider entropy after DT split to identify best feature to split on (c samples for a c-way split)

-use information gain that measures reduction in entropy → split on feature w max info gain

New cards
28

information gain

entropy of parent - entropy of children

  • k-way split → k child nodes

  • sum child entropies weighted by relative freq (weighted avg)
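
worked example: parent with 10 +/10 − (entropy 1.0) split into children with 8 +/2 − and 2 +/8 − (entropy ≈ 0.72 each) → gain = 1.0 − (10/20)(0.72) − (10/20)(0.72) ≈ 0.28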

New cards
29

log2x calculations

log2(x) = log10(x) / log10(2)

log2(x) ≈ log10(x) / 0.301
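
e.g., log2(8) = log10(8) / log10(2) = 0.903 / 0.301 ≈ 3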

New cards
30

GINI index

-node t with c splits

-like entropy (lower=better)

-split on feature w lowest GINI after split

-compute weighted GINI avg of children

-0 is best, 0.5 is worst
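
for reference: GINI(t) = 1 − Σᵢ [p(i|t)]²; e.g., a node with 4 +/6 − → 1 − (0.4² + 0.6²) = 0.48; a pure node → 0; a 50/50 node → 0.5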

New cards
31

classification error rate

-at a node t

-classify node based on most freq value, count errors, and divide by # of ex

-measures misclassification error made by a node

  • for binary- 0 (best), 0.5 (worst)

  • for k splits- 0 (best), 1-1/k (worst)

    • 5 splits = 0.8 error rate

-must compute error rate of children after split
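
for reference: Error(t) = 1 − maxᵢ p(i|t); e.g., a node with 7 ex of class A and 3 of class B → 1 − 0.7 = 0.3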

New cards
32

splitting metric comparison

comparison for 2 class problem as probability p of one class varies

New cards
33

error rate often used to evaluate results, but never for

selecting features

  • DTs use a greedy strat, so we need to use a splitting metric for globally better results

New cards
34

occam’s razor

-among competing hypotheses → one with fewest assumptions should be selected

-the more complex the model, the greater the chance it was fitted accidentally by errors in data (consider complexity when evaluating a model)

New cards
35

errors

  1. training errors- errors on training set

  2. test errors- errors on test set

  3. generalization errors- expected error of model over random selection of ex from same distribution (measured using test set)

New cards
36

model overfitting occurs when

model too complex

  • may minimize training errors at expense of more test set/generalization errors

  • difficult to predict when this will occur

New cards
37

overfitting ex) Ohm’s law

says V=IR (current b/n 2 points=directly proportional to voltage and inversely proportional to resistance)

New cards
38

underfitting

model is too simple → training and test errors large

New cards
39

overfitting

model is too complex → training error low but test error high

  • good for training error

New cards
40

the right fit

in figure- best generalization performance around 130 nodes (test set error rate=lowest)

New cards
41

a synthetic data set

a data set artificially generated for which we know the correct underlying concept

  • if condition met → class +

  • if not → class -

  • data set defined using gaussian distribution

New cards
42

determine which tree is better

look at generalization (test set) error

in ex) 50 nodes cover many areas outside gaussian region and covers noise → 4 nodes prob better

New cards
43

if you have more training examples to learn from you have more

data → more complex model

New cards
44

inc amt of training data allows

DT to identify new regions of concept to be learned (more complex model can be okay)

New cards
45

what will generally happen to test set performance w more training data?

improves → more data=good

New cards
46

more data leads to more complex (usually better) models,

but not overfitting

New cards
47

overfitting avoidance strategies

  1. pre-pruning- avoid overfitting in tree building process

    • difficult, rarely used

  2. post-pruning- allow tree to overfit data then prune it back

    • common

New cards
48

pre-pruning strategies

-early stopping criteria

  • if class distribution is independent of avail features

    • if splitting the node does not improve purity (use info gain / gini)

  • assign penalty for model complexity and stop when benefit does not outweigh penalty

New cards
49

pre-pruning with complexity estimate

generalization error(model) = training error(model) + α × complexity(model)

  • training error(model)=optimistic

    • reasonably the best to hope for on test set

    • error rate will be higher

New cards
50

pessimistic error estimate

a more realistic estimate of the generalization error rate than the optimistic training error
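
one common form (an assumption here, since the card's figure is missing) charges a fixed penalty Ω (often 0.5) per leaf: pessimistic error(T) = [ total training errors + Ω × (# of leaves) ] / (# of training examples)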

New cards
51

minimum description length (MDL)

-imagine you need to transmit a dataset; options:

  1. send entire data set

  2. send complex model with no/few errors

  3. send simpler model and misclassified ex

-best option=requires fewest bits=MDL

-tradeoff b/n complexity and errors = less arbitrary than for prepruning
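
in bits: Cost(model, data) = Cost(model) + Cost(data | model), i.e., the bits to encode the tree plus the bits to encode its misclassified ex; pick the model minimizing the total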

New cards
52

post-pruning

-grow DT w few limits and then trim in bottom-up fashion

  • reduced error pruning- replace subtree w leaf node if generalization error will improve (improvement assessed w a validation set that's separate from training/test sets)

  • disadvantage- less data left for training

-alternatively use MDL / statistic estimates to decide whether to simplify

New cards
53

classifier performance

how constructed classifier performs at classifying new examples

New cards
54

confusion matrix

-provides detailed results

-can be used to generate many metrics

-for k class problem → k x k matrix
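
A small sketch with scikit-learn (made-up labels; rows = actual class, columns = predicted class):

```python
from sklearn.metrics import confusion_matrix

y_true = ["pos", "pos", "neg", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(confusion_matrix(y_true, y_pred, labels=["pos", "neg"]))
# [[2 1]    2 TP, 1 FN
#  [1 2]]   1 FP, 2 TN
```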

New cards
55

accuracy and error rate

accuracy = (TP + TN) / (TP + TN + FP + FN)

error rate = (FP + FN) / (TP + TN + FP + FN) = 1 − accuracy
New cards
56

accuracy not that useful when

classes = imbalanced

ie. 9990 neg class examples and 10 pos → if model always predicts neg, accuracy is 9990/10000 = 99.9%

New cards
57

minority class often more important because

costs more to misclassify minority (positive)

ie. medical diagnosis- FN worse than FP

New cards
58

precision, recall, f-measure

-useful when class imbalance

-precision- accuracy when predict pos

-recall- fraction of pos you ‘catch’ (predict right)

-pos class = primary interest

-tradeoff b/n precision and recall
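
for reference: precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure = 2 · precision · recall / (precision + recall)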

New cards
59

cross validation (xVal)

-partition data into k disjoint subsets

-train on k-1 partitions, test on remaining 1

  • iterate k times so every ex appears in exactly 1 test set

-most common is 10 fold xVal

-3-fold xVal in ex) 3 iterations, 2/3 train, 1/3 test each iteration; report performance over all k(3) runs
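
A minimal 10-fold cross-validation sketch with scikit-learn (synthetic data just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())   # accuracy averaged over the 10 folds
```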

New cards
60

leave-one-out xVal

if n labeled ex, perform n-fold xVal

ie. if 1000 ex then use 1000 iterations each with 999 training and 1 test ex

New cards
61

why use xVal?

  1. better use of data for training the model

    • uses more data for training → more accurate

    • tests on every ex → more reliable

  2. multiple estimates = better reliability

    • individual estimates form a distribution

    • can use distributions to determine if 1 model better than another (t-test)

New cards
62

xVal disadvantage

extra computation

New cards
63

why do you need a validation set?

-training and test sets = not always sufficient

-problem when generating many models → parameter tuning / diff algs

  • model w best test set performance may not be best because- only an estimate, small diffs may be due to chance → multiple comparisons problem

New cards
64

multiple comparisons problem

-with enough analysts guessing randomly, expect at least 1 to get results much better than random

-using test data to choose model instead of to build it

New cards
65

fixing multiple comparisons

use another data set to pick ‘best model’ → validation set (just like another test set but only used to pick best model)

New cards
66

using validation set process

-build multiple models w training data

-use validation set to decide best

-evaluate best model on test set and report

New cards
67

hyper-parameters

parameters that control the classification alg

ie. for DTs- splitting criteria, max depth, etc

New cards
68

validation set needed to choose

hyperparameter values

New cards
69

xVal can handle 3 partitions

-split data to training and test sets

-training set forms training and validation sets → build multiple models and pick best parameter

-choose model w best validation performance and evaluate it on the test set
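
A sketch of this process with scikit-learn (synthetic data; GridSearchCV runs the inner cross-validation on the training portion to pick the hyper-parameters):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

grid = {"max_depth": [2, 4, 8, None], "criterion": ["gini", "entropy"]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X_tr, y_tr)                       # inner CV picks the best params
print(search.best_params_, search.score(X_te, y_te))   # report test-set score once
```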

New cards
70

expressiveness

relates to flexibility and complexity of decision boundaries

New cards
71

for DTs all boundaries are axis-parallel which

somewhat limits expressive power

  • most other classifiers more expressive (exception- linear classifier normally limited to 1 line/plane but not limited to axis parallel)

😔

New cards
72

computational efficiency

  1. time to build DT

    • exponentially large search space → greedy heuristics allow quick construction

    • pruning = more time than build but still quick

  2. time to classify a test ex

    • given a model → ex classified very quick

    • modest # of comparisons (most trees less than 10 levels)

-overall DTs very fast to build + apply

😄

New cards
73

explainability and comprehensibility

  1. DTs very explainable

    • can explain/justify individual classifications (follow path to leaf and examine splits)

  2. DTs generate comprehensible global model

    • overview of entire model and how it works

    • diff from justifying individual decisions (nearest neighbor is explainable but no global model)

😄

New cards
74

applicability

  1. no prior assumptions about data → no need to apply transformations to features 😄

  2. handles discrete/continuous features → not ideal if many continuous features 😄

  3. can natively handle >2 classes (multiclass) → linear regression and SVM natively handle 2 😄

  4. doesn’t normally handle regression tasks → variants like CART via regression trees; an idea is to avg numerical class values at a leaf 😔

New cards
75

feature issues

  1. handling missing feature values

    • not required to impute missing values

    • can treat missing values as special value → diff from numerical methods that build global func that requires each feature to have value

  2. handling irrelevant/redundant features

    • good at this bc feature selection methods

    • irrelevance- ignored

    • redundancy- only 1 used in path to leaf

    • other methods like nearest neighbor have big problems

😁

New cards
76

data fragmentation

  1. # of ex gets smaller as you go down the tree

    • at some point=too few to make statistically valid decision

  2. worse when more features, fewer ex

  3. not all methods are susceptible to this (methods that don't use divide and conquer)- linear regression only uses 1 line so divides once

😔

New cards
77

data fragmentation problem happens since

keep partitioning data in DT

New cards
78

deterministic vs probabilistic classifiers

  1. probabilistic- produces a probability for each possible class (sums to 1.0)

  2. DTs are deterministic but produce probabilities as side effect (ratio of classes at leaf)

New cards
79

global vs local model

  1. global- fits a single model to entire dataset

    • less susceptible to overfitting

    • linear classifiers considered global

  2. local- partitions input space into smaller regions and fits a model to ex in each region

    • DTs considered local

New cards