Decision Trees


1
New cards

classification techniques

  1. decision tree methods

  2. nearest neighbor

  3. neural networks

  4. naive bayes

  5. support vector machines

2
New cards

what is a decision tree?

  1. classifier w tree structure

  2. essentially a flowchart (each ex moves from root to leaf)

3
New cards

DT definitions

  1. root node- start point

  2. decision node- test on single attribute

  3. leaf node- indicates value of class

  4. path- conjunction of conditions leading to a final decision (the tree overall is a disjunction of these paths)

4
New cards

2 phases of tree generation

  1. tree construction

    • at start all training ex are at root

    • partition ex recursively based on attributes alg selects

  2. tree pruning

    • improves performance on new examples

    • best tree on training data may not be best on new test data due to overfitting

5
New cards

decision boundaries

-DTs form rectangular regions → each region is a leaf

-border b/n regions = decision boundary

-decision boundary = axis parallel

6
New cards

expressiveness

flexibility of the classifier in forming decision boundaries

  • linear models- form boundaries w 1 line/plane

  • DTs- form rectangular regions

7
New cards

are linear models or DTs more expressive?

DTs- form many regions

  • but DT boundaries can only be axis-parallel, while a linear model can form 1 angled (oblique) linear boundary

  • so which is more appropriate depends on the problem

8
New cards

downside to flexibility?

can lead to more overfitting

9
New cards

when are DTs good?

strong interactions b/n many features (accurate modeling requires fairly complete trees)

10
New cards

when are DTs bad?

target class based on continuous funcs that involve 2 variables (unless simple)

11
New cards

DT learning concept x<y

-not good → cannot learn perfectly unless tree has infinite # leaves

-can only approximate w many small rectangles

12
New cards

tree construction

top-down recursive; divide and conquer

  • start w all training ex at root

  • partitioned recursively based on selected attribute

    • best attr selected using a metric

  • recursive process

    • partition ex at root based on ‘best’ feat

    • if binary split recurse on right then left child, etc
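
a minimal, self-contained sketch of this recursive divide-and-conquer construction (the toy data, the use of weighted Gini inside best_split, and the depth limit are assumptions for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Greedy 1-level lookahead: try every feature/threshold, keep the purest split."""
    best = None
    for f in range(len(X[0])):
        for t in sorted(set(row[f] for row in X)):
            left = [lab for row, lab in zip(X, y) if row[f] < t]
            right = [lab for row, lab in zip(X, y) if row[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best  # (weighted child impurity, feature index, threshold) or None

def build_tree(X, y, depth=0, max_depth=3):
    """Top-down construction: stop on pure nodes or depth limit, majority-vote leaves."""
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]
    split = best_split(X, y)
    if split is None:                      # no useful split left
        return Counter(y).most_common(1)[0][0]
    _, f, t = split
    left = [(row, lab) for row, lab in zip(X, y) if row[f] < t]
    right = [(row, lab) for row, lab in zip(X, y) if row[f] >= t]
    return {"feature": f, "threshold": t,
            "left": build_tree(*zip(*left), depth + 1, max_depth),
            "right": build_tree(*zip(*right), depth + 1, max_depth)}

print(build_tree([[2.0], [3.0], [7.0], [8.0]], ["a", "a", "b", "b"]))
```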

13
New cards

tree building is

greedy

  • each decision = locally optimal (not globally)

  • 1-level lookahead

  • heuristic methods = for difficult problems

  • benefit to greedy heuristic = fast

14
New cards

splitting on categorical attributes

  1. multi-way split- split on every value

  2. binary split- divides into 2 subsets

15
New cards

splitting on continuous attributes

  1. partition dynamically into ordinal values using equal freq or width

  2. binary split- (A<v) or (A>=v); sort #s and try each split b/n successive #s
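
a small sketch of generating candidate thresholds for one continuous attribute (using midpoints b/n successive distinct values is an assumption; any point b/n them works):

```python
def candidate_thresholds(values):
    """Midpoints between successive distinct sorted values of a continuous attribute."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

print(candidate_thresholds([60, 70, 75, 85, 90, 95]))  # [65.0, 72.5, 80.0, 87.5, 92.5]
```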

16
New cards

when does tree construction/partitioning/splitting stop?

  1. all samples at a node belong to same class

  2. no remaining attributes to split on (majority voting used to assign class)

  3. no ex left

  4. stop when splitting is no longer beneficial (the basic method ignores this and handles it w post-pruning, but pre-pruning does this)

17
New cards

picking locally optimal split

hunt’s alg- recursively partition training records into successively purer subsets

  • pure (homogeneous) set if all ex belong to same class

  • there are levels of purity

18
New cards

which split is best ex.

19
New cards

purity is measured using

statistical metric

  • entropy

  • gini

  • classification error rate

20
New cards

avail DT algs

  1. hunt’s alg (one of earliest)

  2. CART (classification and regression trees)

    • regression tree- can predict numerical value so not just a classifier

    • sklearn

  3. ID3, C4.5 (pretty old)

21
New cards

converting DT to rules

1 rule per leaf

22
New cards

DT algs in practice

-Weka = free open-source Java DM suite

-Python Scikit includes DT alg = based on CART; doesn't handle categorical data so need to convert to #s
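
a minimal scikit-learn sketch (assuming scikit-learn is installed); the toy data and feature names are made up, and the categorical "outlook" feature is converted to numeric columns by hand as the card suggests:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# toy data: [outlook=sunny, outlook=overcast, outlook=rain, humidity] -> play?
X = [[1, 0, 0, 85], [1, 0, 0, 90], [0, 1, 0, 78], [0, 0, 1, 96],
     [0, 0, 1, 80], [0, 1, 0, 65], [1, 0, 0, 70], [0, 0, 1, 70]]
y = ["no", "no", "yes", "no", "yes", "yes", "yes", "yes"]

clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X, y)
print(export_text(clf, feature_names=["sunny", "overcast", "rain", "humidity"]))
print(clf.predict([[1, 0, 0, 72]]))   # classify a new example
```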

23
New cards

entropy

-measures impurity/randomness of set of examples

-ranges from 0 (least random) to 1 (most random, for 2 classes)

-want close to 0 in classification
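
as a formula, entropy of a set with class proportions p_i is -Σ p_i log2(p_i); a small sketch (the 0-to-1 range holds for 2 classes; with c classes the maximum is log2 c):

```python
import math

def entropy(counts):
    """Entropy of a node given class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([10, 0]))   # 0.0    -> certain outcome
print(entropy([5, 5]))    # 1.0    -> maximum for 2 classes
print(entropy([9, 5]))    # ~0.940
```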

24
New cards

entropy is 0 if outcome is

certain

25
New cards

entropy is maximum if

we have no knowledge of the system (i.e. any outcome is equally likely)

26
New cards

we want entropy to be

low → more orderly

27
New cards

entropy to select feature

-entropy is defined on a sample (e.g. the entire training set at the root, or the subset at one leaf)

-consider entropy after a DT split to identify the best feature to split on (c child samples for a c-way split)

-use information gain that measures reduction in entropy → split on feature w max info gain

28
New cards

information gain

entropy of parent - entropy of children

  • k-way split → k child nodes

  • sum child entropies weighted by relative freq (weighted avg)
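
a worked sketch of information gain = parent entropy minus the weighted average of child entropies (the example counts are made up):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, children_counts):
    """Gain = entropy(parent) - weighted avg of child entropies (weights = relative freq)."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# parent [9+, 5-] split 2 ways into [6+, 2-] and [3+, 3-]
print(round(info_gain([9, 5], [[6, 2], [3, 3]]), 3))   # 0.048
```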

29
New cards

log2x calculations

log2x = log10x / log102

log2x ≈ log10x / 0.301

30
New cards

GINI index

-node t with c splits

-like entropy (lower=better)

-split on feature w lowest GINI after split

-compute weighted GINI avg of children

-0 is best, 0.5 is worst
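
GINI for class proportions p_i is 1 - Σ p_i²; a small sketch including the weighted average over children (example counts are made up):

```python
def gini(counts):
    """GINI index of a node from class counts: 1 - sum(p_i^2); 0 = pure, 0.5 = worst for 2 classes."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(children_counts):
    """Weighted GINI avg of the children after a split (lower = better)."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)

print(gini([5, 5]))                              # 0.5 -> worst for 2 classes
print(round(gini_split([[6, 2], [3, 3]]), 3))    # 0.429
```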

31
New cards

classification error rate

-at a node t

-classify node based on most freq value, count errors, and divide by # of ex

-measures misclassification error made by a node

  • for binary- 0 (best), 0.5 (worst)

  • for k classes- 0 (best), 1-1/k (worst)

    • e.g. 5 classes → 0.8 worst-case error rate

-must compute error rate of children after split
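
classification error rate at a node is 1 minus the fraction of the most frequent class; a small sketch matching the 1 - 1/k worst case noted above:

```python
def error_rate(counts):
    """Misclassification error at a node: predict the most frequent class, count the rest."""
    return 1 - max(counts) / sum(counts)

print(error_rate([5, 5]))            # 0.5  -> worst case for 2 classes
print(error_rate([2, 2, 2, 2, 2]))   # 0.8  -> worst case for 5 classes (1 - 1/5)
print(round(error_rate([9, 5]), 3))  # 0.357
```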

32
New cards

splitting metric comparison

comparison for 2 class problem as probability p of one class varies

33
New cards

error rate often used to evaluate results, but never for

selecting features

  • DTs use a greedy strat so we need to use a splitting metric for globally better results

34
New cards

occam’s razor

-among competing hypotheses → one with fewest assumptions should be selected

-complex models = greater chance it was fitted accidentally by errors in data (consider complexity when eval model)

35
New cards

errors

  1. training errors- errors on training set

  2. test errors- errors on test set

  3. generalization errors- expected error of model over random selection of ex from same distribution (measured using test set)

36
New cards

model overfitting occurs when

model too complex

  • may minimize training errors at expense of more test set/generalization errors

  • difficult to predict when this will occur

37
New cards

overfitting ex) Ohm’s law

says V=IR (current b/n 2 points=directly proportional to voltage and inversely proportional to resistance)

38
New cards

underfitting

model is too simple → training and test errors large

39
New cards

overfitting

model is too complex → training error low but test error high

  • good for training error

40
New cards

the right fit

in figure- best generalization performance around 130 nodes (test set error rate=lowest)

41
New cards

a synthetic data set

a data set artificially generated for which we know the correct underlying concept

  • if condition met → class +

  • if not → class -

  • data set defined using gaussian distribution

42
New cards

determine which tree is better

look at generalization (test set) error

in ex) 50 nodes cover many areas outside gaussian region and covers noise → 4 nodes prob better

43
New cards

if you have more training examples to learn from you have more

data → more complex model

44
New cards

inc amt of training data allows

DT to identify new regions of concept to be learned (more complex model can be okay)

45
New cards

what will generally happen to test set performance w more training data?

improves → more data=good

46
New cards

more data leads to more complex (usually better) models,

but not overfitting

47
New cards

overfitting avoidance strategies

  1. pre-pruning- avoid overfitting in tree building process

    • difficult, rarely used

  2. post-pruning- allow tree to overfit data then prune it back

    • common

48
New cards

pre-pruning strategies

-early stopping criteria

  • if class distribution is independent of avail features

    • if splitting node does not improve purity (use info gain / gini)

  • assign penalty for model complexity and stop when benefit does not outweigh penalty

49
New cards

pre-pruning with complexity estimate

generalization error(model) = training error(model) + α × complexity(model)

  • training error(model)=optimistic

    • reasonably the best to hope for on test set

    • error rate will be higher

50
New cards

pessimistic error estimate

an estimate of the generalization error rate obtained by adding a complexity penalty to the training error (see the sketch below)
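
the card's figure is missing; a common textbook form of the pessimistic estimate (the per-leaf penalty value, e.g. 0.5, is an assumption) charges a fixed cost for every leaf:

```python
def pessimistic_error(train_errors, num_leaves, num_examples, penalty_per_leaf=0.5):
    """Pessimistic generalization-error estimate: training errors plus a per-leaf
    complexity penalty, divided by the number of training examples.
    (Common textbook form; the 0.5 default penalty is an assumption.)"""
    return (train_errors + penalty_per_leaf * num_leaves) / num_examples

# e.g. 4 training errors, 7 leaves, 100 training examples -> (4 + 3.5) / 100
print(pessimistic_error(4, 7, 100))   # 0.075
```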

51
New cards

minimum description length (MDL)

-imagine u need to transmit dataset; options:

  1. send entire data set

  2. send complex model with no/few errors

  3. send simpler model and misclassified ex

-best option=requires fewest bits=MDL

-tradeoff b/n complexity and errors = less arbitrary than for prepruning

52
New cards

post-pruning

-grow DT w few limits and then trim in bottom-up fashion

  • reduced error pruning- replace subtree w leaf node if generalization error will improve (improvement assessed w a validation set that's separate from training/test sets)

  • disadvantage- less data left for training

-alternatively use MDL / statistic estimates to decide whether to simplify

53
New cards

classifier performance

how constructed classifier performs at classifying new examples

54
New cards

confusion matrix

-provides detailed results

-can be used to generate many metrics

-for k class problem → k x k matrix

55
New cards

accuracy and error rate

accuracy = (TP + TN) / (TP + TN + FP + FN)

error rate = (FP + FN) / (TP + TN + FP + FN) = 1 - accuracy
56
New cards

accuracy not that useful when

classes = imbalanced

ie. 9990 neg class examples and 10 pos → if model always predicts neg, accuracy is 9990/10000 = 99.9%

57
New cards

minority class often more important because

costs more to misclassify minority (positive)

ie. medical diagnosis- FN worse than FP

58
New cards

precision, recall, f-measure

-useful when class imbalance

-precision- accuracy when predict pos

-recall- fraction of pos you ‘catch’ (predict right)

-pos class = primary interest

-tradeoff b/n precision and recall
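
the standard formulas in terms of confusion-matrix counts, with F-measure as the harmonic mean of the two; a small sketch with made-up counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = accuracy when predicting positive; recall = fraction of positives caught."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 8 true positives, 2 false positives, 4 false negatives
print(precision_recall_f1(8, 2, 4))   # (0.8, 0.666..., 0.727...)
```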

59
New cards

cross validation (xVal)

-partition data into k disjoint subsets

-train on k-1 partitions, test on remaining 1

  • iterate k times so every ex appears in exactly 1 test set

-most common is 10 fold xVal

-3-fold xVal in ex) 3 iterations, 2/3 train, 1/3 test each iteration; report performance over all k(3) runs
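
a minimal scikit-learn sketch of k-fold xVal (scikit-learn assumed; iris is just a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)  # 10-fold xVal
print(scores.mean(), scores.std())   # report performance over all k runs
```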

60
New cards

leave-one-out xVal

if n labeled ex, perform n-fold xVal

ie. if 1000 ex then use 1000 iterations each with 999 training and 1 test ex

61
New cards

why use xVal?

  1. better use of data for training the model

    • uses more data for training → more accurate

    • tests on every ex → more reliable

  2. multiple estimates = better reliability

    • individual estimates form a distribution

    • can use distributions to determine if 1 model better than another (t-test)

62
New cards

xVal disadvantage

extra computation

63
New cards

why do you need a validation set?

-training and test sets = not always sufficient

-problem when generating many models → parameter tuning / diff algs

  • model w best test set performance may not truly be best because it's only an estimate, and small diffs may be due to chance → multiple comparisons problem

64
New cards

multiple comparisons problem

-random guessing → expect at least 1 analyst to get results much better than random just by chance

-using test data to choose model instead of to build it

65
New cards

fixing multiple comparisons

use another data set to pick ‘best model’ → validation set (just like another test set but only used to pick best model)

66
New cards

using validation set process

-build multiple models w training data

-use validation set to decide best

-evaluate best model on test set and report

67
New cards

hyper-parameters

parameters that control the classification alg

ie. for DTs- splitting criteria, max depth, etc

68
New cards

validation set needed to choose

hyperparameter values

69
New cards

xVal can handle 3 partitions

-split data into training and test sets

-training set is further split into training and validation sets → build multiple models and pick best parameter

-choose model w best validation performance and evaluate it on the test set
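
a sketch of this process with scikit-learn (assumed): hold out a test set, let cross-validation inside the training set act as the validation step that picks the hyper-parameter, then report once on the test set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# build multiple models on the training portion; validation folds pick the best max_depth
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 3, 5, None]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)            # hyper-parameter chosen using the validation folds
print(search.score(X_test, y_test))   # final, one-time evaluation on the held-out test set
```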

70
New cards

expressiveness

relates to flexibility and complexity of decision boundaries

71
New cards

for DTs all boundaries are axis-parallel which

somewhat limits expressive power

  • most other classifiers more expressive (exception- linear classifier normally limited to 1 line/plane but not limited to axis parallel)

😔

72
New cards

computational efficiency

  1. time to build DT

    • exponentially large search space → greedy heuristics allow quick construction

    • pruning = more time than build but still quick

  2. time to classify a test ex

    • given a model → ex classified very quick

    • modest # of comparisons (most trees less than 10 levels)

-overall DTs very fast to build + apply

😄

73
New cards

explainability and comprehensibility

  1. DTs very explainable

    • can explain/justify individual classifications (follow path to leaf and examine splits)

  2. DTs generate comprehensible global model

    • overview of entire model and how it works

    • diff from justifying individual decisions (nearest neighbor is explainable but no global model)

😄

74
New cards

applicability

  1. no prior assumptions about data → no need to apply transformations to features 😄

  2. handles discrete/continuous features → not ideal if many continuous features 😄

  3. can natively handle >2 classes (multiclass) → linear regression and SVM natively handle 2 😄

  4. doesn’t normally handle regression tasks → variants like CART via regression trees; an idea is to avg numerical class values at a leaf 😔

75
New cards

feature issues

  1. handling missing feature values

    • not required to impute missing values in order to function

    • can treat missing values as a special value → diff from numerical methods that build a global func requiring every feature to have a value

  2. handling irrelevant/redundant features

    • good at this bc feature selection methods

    • irrelevance- ignored

    • redundancy- only 1 used in path to leaf

    • other methods like nearest neighbor have big problems

😁

76
New cards

data fragmentation

  1. # of ex gets smaller as you go down the tree

    • at some point=too few to make statistically valid decision

  2. worse when more features, fewer ex

  3. not all methods susceptible to this (methods that don't use divide and conquer)- linear regression only uses 1 line so divides once

😔

77
New cards

data fragmentation problem happens since

keep partitioning data in DT

78
New cards

deterministic vs probabilistic classifiers

  1. probabilistic- produces a probability for each possible class (sums to 1.0)

  2. DTs are deterministic but produce probabilities as side effect (ratio of classes at leaf)
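
in scikit-learn this side effect is exposed by predict_proba, which returns the class ratios at the leaf the example reaches (a small sketch, iris as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(clf.predict(X[:1]))         # deterministic: a single class label
print(clf.predict_proba(X[:1]))   # class ratios at the reached leaf; rows sum to 1.0
```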

79
New cards

global vs local model

  1. global- fits a single model to entire dataset

    • less susceptible to overfitting

    • linear classifiers considered global

  2. local- partitions input space into smaller regions and fits a model to ex in each region

    • DTs considered local
