What is the difference between supervised and unsupervised data mining?
Supervised data mining uses known labels to build predictive models (classification/regression), while unsupervised data mining finds patterns or structures in data without predefined labels (e.g., clustering, association rules).
→ the results of an unsupervised method can be used as input to a supervised method
What is a supervised method in data mining?
A learning method that uses a training set of instances (input data) with known target values to learn a function mapping features to outputs.
What are the two main subclasses in supervised learning, and how do they differ?
A: Classification assigns inputs to discrete categories (e.g., purchase or not), while regression predicts continuous numeric outcomes (e.g., customer spend).
In a decision tree, what do nodes, edges, and leaves represent?
A:
Nodes: features (attributes)
Edges: the distinct values of that attribute
Leaves: predicted class labels/decisions
Which functions can decision trees represent?
A: Any Boolean function of the input attributes:
AND function, OR function, XOR function
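A quick check of the XOR claim, as a minimal sketch assuming scikit-learn is available (the toy data is mine, not from the card):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # all input combinations
y = [0, 1, 1, 0]                      # XOR target

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict(X))   # [0 1 1 0]: the tree reproduces XOR exactly
print(tree.get_depth())  # 2: XOR cannot be expressed by a single split
```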
How do we choose the best attribute for a decision tree?
We want the resulting groups to be as pure as possible:
Homogeneous w.r.t. the target variable
If every member of a group has the same value for the target, then the group is pure
→ minimize uncertainty + maximize information gain
Why do we use formulas based on a purity measure?
When choosing the best attribute, technical problems are encountered:
real data rarely splits into pure groups
some attributes are not binary
some attributes are continuous
→ all of these are addressed by a purity measure
Entropy
entropy = −Σₖ pₖ log₂(pₖ), where pₖ is the proportion of examples in class k.
It measures impurity or disorder of a single dataset.
High entropy: uniform distribution & flat histogram (more uncertainty)
Low entropy: varied distribution & histogram w/ clear highs and lows (less uncertainty)
→ want low!
Other purity measures exist, like Gini impurity
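A minimal sketch of both purity measures, assuming numpy; the label lists are invented for illustration:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(entropy(["yes", "no"]))          # 1.0: maximal uncertainty for 2 classes
print(entropy(["yes", "yes", "yes"]))  # -0.0 (i.e., zero): a pure group
```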
Information gain
IG(parent, children) = entropy(parent) − Σⱼ p(cⱼ) * entropy(cⱼ), where p(cⱼ) is the proportion of the parent's examples that fall into child cⱼ
→ we want the attribute that maximally decreases uncertainty
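The same formula as a minimal sketch in code (assuming numpy; entropy() repeats the definition above, and the candidate split is an invented example):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """IG = entropy(parent) - sum_j p(c_j) * entropy(c_j)."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5                        # entropy = 1.0
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]  # one candidate split
print(information_gain(parent, children))  # ~0.278: uncertainty reduced
```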
What is the ID3 algorithm in decision trees?
The ID3 (Iterative Dichotomiser 3) algorithm builds a decision tree by:
Calculating the entropy of the training data under each candidate attribute.
Selecting the attribute with the highest information gain (i.e., the one that reduces uncertainty the most).
Splitting the data on that attribute and creating a decision node.
Recursively repeating the process on the subsets.
Termination criteria: when do we stop?
When every instance in a node belongs to the same class (the node is pure), or when there are no more attributes to be selected.
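Putting the last two cards together, a hedged sketch of ID3 over categorical attributes; the function names and the tiny dataset are mine, not from the source:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    gain = entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:             # pure node -> leaf
        return labels[0]
    if not attrs:                         # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {}
    for v in set(r[best] for r in rows):  # one edge per attribute value
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        sub_rows, sub_labels = zip(*sub)
        node[(best, v)] = id3(list(sub_rows), list(sub_labels),
                              [a for a in attrs if a != best])
    return node

rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"}]
labels = ["play", "stay", "play"]
print(id3(rows, labels, ["outlook", "windy"]))
# {('windy', 'no'): 'play', ('windy', 'yes'): 'stay'}
```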
What is the problem with using information gain alone when selecting attributes?
Attributes with many distinct values (e.g., an ID) tend to get high information gain because they split the data too finely, which leads to overfitting.
Especially for ID3:
can overfit
no guarantee of globally optimal tree (local optima)
can fail to generalize
Solutions to overfitting w/decision trees
Limit the depth of the tree
Limit the number of leaves
Stop growing when the split is not statistically significant
Grow full tree & then prune subtrees
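As a sketch, these remedies map roughly onto scikit-learn's DecisionTreeClassifier parameters (assuming scikit-learn; the concrete values are illustrative, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,                 # limit the depth of the tree
    max_leaf_nodes=16,           # limit the number of leaves
    min_impurity_decrease=0.01,  # stop growing on near-useless splits
                                 # (a rough stand-in for a significance test)
    ccp_alpha=0.005,             # cost-complexity pruning of the grown tree
)
```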
Differences between C4.5 and ID3?
C4.5:
Is an improvement of ID3
Handles both continuous & discrete attributes, and selects splits by gain ratio instead of raw information gain
Handles training data w/ missing values
Prunes the tree after creation
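A minimal sketch of the gain ratio (information gain normalized by the split's own entropy, which penalizes many-valued attributes); the example split is invented:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent, children):
    n = len(parent)
    ig = entropy(parent) - sum(len(c) / n * entropy(c) for c in children)
    split_info = -sum(len(c) / n * math.log2(len(c) / n) for c in children)
    return ig / split_info if split_info > 0 else 0.0

parent = ["yes"] * 4 + ["no"] * 4
children = [["yes"] * 4, ["no"] * 4]  # a clean 2-way split
print(gain_ratio(parent, children))   # 1.0: full gain, modest split info
```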
What are alternatives to a single decision tree?
Tree Bagging: Builds several trees using random training samples; predictions are averaged or voted.
Random Forest: Like bagging, but also selects a random subset of features at each split to reduce correlation between trees.
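Both ensembles in a minimal sketch, assuming scikit-learn; the synthetic dataset is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Tree bagging: many trees, each on a bootstrap sample; majority vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0).fit(X, y)

# Random forest: bagging + a random feature subset at every split.
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                random_state=0).fit(X, y)
```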
What are the pros & cons of a single decision tree, and why use ensembles?
Advantages
Handles noisy training data
Interpretability of the model
They are fast & robust
Limitations
A single decision tree will overfit → motivates bagging/random forests
Less useful for continuous targets (regression)
How do we evaluate performance (for classification)?
Training set: Trains the model using input-output pairs.
Validation set: Separate set, tunes hyper-parameters and evaluates model during training.
Test set: Assesses final model performance on unseen data.
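A minimal sketch of the three-way split, assuming scikit-learn; the 60/20/20 proportions are an assumption, not from the card:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)  # toy data

# First carve off 40%, then halve it: 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4,
                                                  random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                                random_state=0)
# Train on (X_train, y_train), tune hyper-parameters on (X_val, y_val),
# report final performance once on (X_test, y_test).
```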
Define a confusion matrix and its components.
A: A table cross-tabulating predicted vs. actual classes; its cells are:
True Positive (TP): Correctly predicted positive instances
True Negative (TN): Correctly predicted negative instances
False Positive (FP): Incorrectly predicted positives (actually negative)
False Negative (FN): Incorrectly predicted negatives (actually positive)
Define and explain precision.
A: Precision = TP / (TP + FP).
It measures the proportion of predicted positives that are actually positive.
Define and explain recall.
A: Recall = TP / (TP + FN).
It measures the proportion of actual positives correctly identified by the model.
What is the F1 score and when is it useful?
A: F1 = 2 * (Precision * Recall) / (Precision + Recall).
It balances precision and recall, useful when classes are imbalanced.
What is accuracy, and why might it be misleading?
A: Accuracy = (TP + TN) / (TP + TN + FP + FN).
Fraction of instances correctly classified by the model.
It may be misleading in imbalanced datasets because it ignores the distribution of false positives/negatives.
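All four metrics above from raw counts, as a minimal sketch; the TP/TN/FP/FN values are invented for illustration:

```python
tp, tn, fp, fn = 40, 30, 20, 10  # invented counts

precision = tp / (tp + fp)                                 # 0.667
recall    = tp / (tp + fn)                                 # 0.800
f1        = 2 * precision * recall / (precision + recall)  # 0.727
accuracy  = (tp + tn) / (tp + tn + fp + fn)                # 0.700
print(precision, recall, f1, accuracy)
```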
What does the ROC curve represent?
A: A curve showing the trade-off between true positive rate and false positive rate at various classification thresholds.
What is AUC and how is it interpreted?
A: AUC (Area Under the Curve) quantifies the overall ability of the classifier to discriminate between classes; 1.0 = perfect, 0.5 = random.
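A minimal sketch of both cards, assuming scikit-learn; the labels and scores are invented:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]               # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted positive probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) per threshold
print(roc_auc_score(y_true, y_score))  # ~0.889: well above 0.5 (random)
```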