What is the difference between supervised and unsupervised data mining?
Supervised data mining uses known labels to build predictive models (classification/regression), while unsupervised data mining finds patterns or structures in data without predefined labels (e.g., clustering, association rules).
→ the results of an unsupervised method can be used as input to a supervised method
What is a supervised method in data mining?
A learning method that uses a training set of instances (input data) with known target values to learn a function mapping features to outputs.
What are the two main subclasses in supervised learning, and how do they differ?
A: Classification assigns inputs to discrete categories (e.g., purchase or not), while regression predicts continuous numeric outcomes (e.g., customer spend).
In a decision tree, what do nodes, edges, and leaves represent?
A:
Nodes: features (attributes)
Edges: the distinct values of that attribute
Leaves: predicted class labels/decisions
Which functions can decision trees represent?
A: Any Boolean function of the input attributes:
AND function, OR function, XOR function
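A quick check of the XOR claim, as a minimal sketch assuming scikit-learn is available (the toy data is mine, not from the card):

```python
from sklearn.tree import DecisionTreeClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # all input combinations
y = [0, 1, 1, 0]                      # XOR target

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict(X))   # [0 1 1 0]: the tree reproduces XOR exactly
print(tree.get_depth())  # 2: XOR cannot be expressed by a single split
```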
How do we choose the best attribute for a decision tree?
We want the resulting groups to be as pure as possible:
Homogeneous w.r.t. the target variable
If every member of a group has the same value for the target, then the group is pure
→ minimize uncertainty + maximize information gain
Why do we use formulas based on a purity measure?
When choosing the best attribute, technical problems are encountered:
real data rarely splits into pure groups
some attributes are not binary
some attributes are continuous
→ all of these are addressed by a purity measure
Entropy
entropy = −Σₖ pₖ log₂(pₖ), where pₖ is the proportion of examples in class k.
It measures impurity or disorder of a single dataset.
High entropy: uniform distribution & flat histogram (more uncertainty)
Low entropy: varied distribution & histogram w/ clear highs and lows (less uncertainty)
→ want low!
Other purity measures exist, like Gini impurity
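A minimal sketch of both purity measures, assuming numpy; the label lists are invented for illustration:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(entropy(["yes", "no"]))          # 1.0: maximal uncertainty for 2 classes
print(entropy(["yes", "yes", "yes"]))  # -0.0 (i.e., zero): a pure group
```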
Information gain
IG(parent, children) = entropy(parent) − Σⱼ p(cⱼ) * entropy(cⱼ), where p(cⱼ) is the proportion of the parent's examples that fall into child cⱼ
→ we want the attribute that maximally decreases uncertainty
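The same formula as a minimal sketch in code (assuming numpy; entropy() repeats the definition above, and the candidate split is an invented example):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """IG = entropy(parent) - sum_j p(c_j) * entropy(c_j)."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5                        # entropy = 1.0
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]  # one candidate split
print(information_gain(parent, children))  # ~0.278: uncertainty reduced
```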
What is the ID3 algorithm in decision trees?
The ID3 (Iterative Dichotomiser 3) algorithm builds a decision tree by:
Calculating the entropy of the training data under each candidate attribute.
Selecting the attribute with the highest information gain (i.e., the one that reduces uncertainty the most).
Splitting the data on that attribute and creating a decision node.
Recursively repeating the process on the subsets.
Termination criteria: when do we stop?
When every instance in a node belongs to the same class (the node is pure), or when there are no more attributes to be selected.
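Putting the last two cards together, a hedged sketch of ID3 over categorical attributes; the function names and the tiny dataset are mine, not from the source:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(labels)
    gain = entropy(labels)
    for v in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:             # pure node -> leaf
        return labels[0]
    if not attrs:                         # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {}
    for v in set(r[best] for r in rows):  # one edge per attribute value
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        sub_rows, sub_labels = zip(*sub)
        node[(best, v)] = id3(list(sub_rows), list(sub_labels),
                              [a for a in attrs if a != best])
    return node

rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"}]
labels = ["play", "stay", "play"]
print(id3(rows, labels, ["outlook", "windy"]))
# {('windy', 'no'): 'play', ('windy', 'yes'): 'stay'}
```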
What is the problem with using information gain alone when selecting attributes?
Attributes with many distinct values (e.g., an ID) tend to get high information gain because they split the data too finely, which leads to overfitting.
Especially for ID3:
can overfit
no guarantee of globally optimal tree (local optima)
can fail to generalize
Solutions to overfitting w/decision trees
Limit the depth of the tree
Limit the number of leaves
Stop growing when the split is not statistically significant
Grow full tree & then prune subtrees
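As a sketch, these remedies map roughly onto scikit-learn's DecisionTreeClassifier parameters (assuming scikit-learn; the concrete values are illustrative, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,                 # limit the depth of the tree
    max_leaf_nodes=16,           # limit the number of leaves
    min_impurity_decrease=0.01,  # stop growing on near-useless splits
                                 # (a rough stand-in for a significance test)
    ccp_alpha=0.005,             # cost-complexity pruning of the grown tree
)
```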
Differences between C4.5 and ID3?
C4.5:
Is an improvement of ID3
Handles both continuous & discrete attributes, and selects splits by gain ratio instead of raw information gain
Handles training data w/ missing values
Prunes the tree after creation
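A minimal sketch of the gain ratio (information gain normalized by the split's own entropy, which penalizes many-valued attributes); the example split is invented:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent, children):
    n = len(parent)
    ig = entropy(parent) - sum(len(c) / n * entropy(c) for c in children)
    split_info = -sum(len(c) / n * math.log2(len(c) / n) for c in children)
    return ig / split_info if split_info > 0 else 0.0

parent = ["yes"] * 4 + ["no"] * 4
children = [["yes"] * 4, ["no"] * 4]  # a clean 2-way split
print(gain_ratio(parent, children))   # 1.0: full gain, modest split info
```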
What are alternatives to a single decision tree?
Tree Bagging: Builds several trees using random training samples; predictions are averaged or voted.
Random Forest: Like bagging, but also selects a random subset of features at each split to reduce correlation between trees.
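Both ensembles in a minimal sketch, assuming scikit-learn; the synthetic dataset is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Tree bagging: many trees, each on a bootstrap sample; majority vote.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0).fit(X, y)

# Random forest: bagging + a random feature subset at every split.
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                random_state=0).fit(X, y)
```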
What are the pros & cons of a single decision tree, and why use ensembles?
Advantages
Handles noisy training data
Interpretability of the model
They are fast & robust
Limitations
A single decision tree will overfit → motivates bagging/random forests
Less useful for continuous targets (regression)
How do we evaluate performance (for classification)?
Training set: Trains the model using input-output pairs.
Validation set: Separate set, tunes hyper-parameters and evaluates model during training.
Test set: Assesses final model performance on unseen data.
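A minimal sketch of the three-way split, assuming scikit-learn; the 60/20/20 proportions are an assumption, not from the card:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)  # toy data

# First carve off 40%, then halve it: 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4,
                                                  random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5,
                                                random_state=0)
# Train on (X_train, y_train), tune hyper-parameters on (X_val, y_val),
# report final performance once on (X_test, y_test).
```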
Define a confusion matrix and its components.
A: A table cross-tabulating predicted vs. actual classes; its cells are:
True Positive (TP): Correctly predicted positive instances
True Negative (TN): Correctly predicted negative instances
False Positive (FP): Incorrectly predicted positives (actually negative)
False Negative (FN): Incorrectly predicted negatives (actually positive)
Define and explain precision.
A: Precision = TP / (TP + FP).
It measures the proportion of predicted positives that are actually positive.
Define and explain recall.
A: Recall = TP / (TP + FN).
It measures the proportion of actual positives correctly identified by the model.
What is the F1 score and when is it useful?
A: F1 = 2 * (Precision * Recall) / (Precision + Recall).
It balances precision and recall, useful when classes are imbalanced.
What is accuracy, and why might it be misleading?
A: Accuracy = (TP + TN) / (TP + TN + FP + FN).
Fraction of instances correctly classified by the model.
It may be misleading in imbalanced datasets because it ignores the distribution of false positives/negatives.
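All four metrics above from raw counts, as a minimal sketch; the TP/TN/FP/FN values are invented for illustration:

```python
tp, tn, fp, fn = 40, 30, 20, 10  # invented counts

precision = tp / (tp + fp)                                 # 0.667
recall    = tp / (tp + fn)                                 # 0.800
f1        = 2 * precision * recall / (precision + recall)  # 0.727
accuracy  = (tp + tn) / (tp + tn + fp + fn)                # 0.700
print(precision, recall, f1, accuracy)
```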
What does the ROC curve represent?
A: A curve showing the trade-off between true positive rate and false positive rate at various classification thresholds.
What is AUC and how is it interpreted?
A: AUC (Area Under the Curve) quantifies the overall ability of the classifier to discriminate between classes; 1.0 = perfect, 0.5 = random.
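A minimal sketch of both cards, assuming scikit-learn; the labels and scores are invented:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]               # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # predicted positive probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) per threshold
print(roc_auc_score(y_true, y_score))  # ~0.889: well above 0.5 (random)
```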