6-Model Evaluation

Phases of Data Mining Process

  • Stages:

    • Business Understanding

    • Data Understanding

    • Data Preparation

    • Data Modeling

    • Evaluation

    • Deployment

Model Evaluation Essentials

  • Evaluation Points:

    • Final evaluation on the test set

    • Hyperparameter tuning evaluation on the validation set
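
A minimal sketch of this train/validation/test setup, assuming scikit-learn and a synthetic toy dataset; the 60/20/20 ratios and the random seed are illustrative choices, not something prescribed by the notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1,000 examples, 5 features, binary labels (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First hold out the test set (touched once, for the final evaluation).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Then split the remainder into training and validation sets;
# the validation set is used for hyperparameter tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```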

Benchmarks for Model Performance

  • Random Guess Model:

    • Refers to the simplest prediction method without any learning.

  • Majority-Class Classifier:

    • Predicts the most common class label in the training set.

    • Example: In direct mail marketing, if only 1% of households respond, the model will classify every household as a non-responder (default prediction).

    • The most intuitive and easiest benchmark, but not a useful model on its own: it puts all of its weight on the majority class, which is usually not the class of interest (see the sketch below).
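
A minimal sketch of the majority-class baseline, assuming scikit-learn's DummyClassifier and synthetic labels with roughly a 1% positive (responder) rate to mirror the example above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic imbalanced labels: about 1% responders (class 1), rest non-responders.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = (rng.random(10_000) < 0.01).astype(int)

# "most_frequent" always predicts the majority class seen during training.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = baseline.predict(X)

print(preds.sum())          # 0 -- every household predicted as a non-responder
print((preds == y).mean())  # about 0.99 accuracy, with no learning at all
```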

Confusion Matrix

  • Definition:

    • A table used to describe the performance of a classification model.

  • Entries:

    • True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN).

  • Key Information from Confusion Matrix:

    1. Total predicted examples for each class.

    2. Counts of correctly versus incorrectly predicted examples.
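
A minimal sketch of building a confusion matrix with scikit-learn; the label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # actual labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # model predictions

# Rows are true classes, columns are predicted classes (labels ordered 0, 1).
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```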

Defining Positive Class

  • Binary Classification Labels:

    • Defined using 1 for positive and 0 for negative.

  • Examples of Labeling:

    • Spam (1) vs. Non-Spam (0)

    • Disease presence (1) vs. absence (0)

    • Default status (1) vs. non-default (0)

Accuracy Calculation

  • Formula:

    • [ \text{Accuracy} = \frac{TP + TN}{\text{Total Instances}} ]

    • [ \text{Error Rate} = 1 - \text{Accuracy} ]

  • Contextual Example:

    • In an imbalanced dataset scenario (e.g., less than 1% response rates), high accuracy might be misleading.
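
A small worked example of accuracy and error rate, assuming scikit-learn and the same made-up labels as above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)   # (TP + TN) / total = (3 + 3) / 8
print(f"accuracy   = {acc:.3f}")       # 0.750
print(f"error rate = {1 - acc:.3f}")   # 0.250
```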

Limitations of Accuracy

  • Issue with Imbalanced Data:

    • High accuracy can be achieved even by trivial models that never predict the positive class.

  • Case Study:

    • Decision tree model: 97.8% accuracy

    • Majority-class model: 99% accuracy (but without learning).

Precision and Recall Measures

  • Precision:

    • [ \text{Precision} = \frac{TP}{TP + FP} ]

    • Proportion of true positive predictions in all predicted positive cases.

  • Recall:

    • [ \text{Recall} = \frac{TP}{TP + FN} ]

    • Proportion of true positives in all actual positive cases.

  • Importance:

    • Particularly relevant in scenarios where the detection of positives is crucial.
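
A short sketch of precision and recall on the same made-up labels, assuming scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Precision: of the 4 predicted positives, 3 are truly positive -> 0.75
print(precision_score(y_true, y_pred))
# Recall: of the 4 actual positives, 3 are recovered -> 0.75
print(recall_score(y_true, y_pred))
```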

Decision Thresholds

  • Definition:

    • The score (probability) cutoff above which an example is predicted as positive; together with the model's scores, it determines the entire confusion matrix.

    • Varying the threshold gives different confusion matrices, and therefore different precision and recall values.
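
A minimal sketch, in plain NumPy with made-up scores, of how a single threshold fixes the confusion matrix and how moving it changes precision and recall.

```python
import numpy as np

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # actual labels
y_score = np.array([0.9, 0.4, 0.35, 0.8, 0.1, 0.6, 0.7, 0.2])  # predicted P(positive)

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_score >= threshold).astype(int)  # label positive iff score >= threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# Raising the threshold here lifts precision (0.67 -> 1.00) while recall drops (1.00 -> 0.75).
```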

Precision and Recall Dynamics

  • Inverse Relationship and Adjustment of Decision Thresholds:

    • Improving precision tends to reduce recall and vice versa.

  • Graphical Representation:

    • A Precision-Recall curve shows performance across thresholds and illustrates the trade-off: as the decision threshold is adjusted to increase precision, some recall is generally sacrificed, so the curve trends downward rather than upward. (See the sketch below.)
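
A sketch of the precision-recall points behind such a curve, assuming scikit-learn's precision_recall_curve and the same made-up scores; plotting recall against precision (e.g., with matplotlib) draws the curve itself.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.35, 0.8, 0.1, 0.6, 0.7, 0.2])

# One (recall, precision) pair per candidate threshold; the final
# (recall=0, precision=1) point is appended by convention and has no threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r, t in zip(precision, recall, np.append(thresholds, np.inf)):
    print(f"score >= {t:.2f}: precision={p:.2f}, recall={r:.2f}")
```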

ROC (Receiver Operating Characteristic) Curve Analysis

  • Definition:

    • A graphical representation of model performance across all thresholds, plotting True Positive Rate (TPR) against False Positive Rate (FPR).

  • Each decision threshold yields one (FPR, TPR) point; changing the threshold moves that point.

  • The full ROC curve is traced out by sweeping the decision threshold over all possible values.

  • Rates from the Confusion Matrix (uses TP, TN, FP, FN):

    • [ \text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN} ]
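
A sketch of tracing the ROC points, assuming scikit-learn's roc_curve and the same made-up scores; each printed row is one (FPR, TPR) point obtained at one threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.35, 0.8, 0.1, 0.6, 0.7, 0.2])

# Sweeping the decision threshold traces out the ROC curve point by point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr}: FPR={f:.2f}, TPR={t:.2f}")
```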

ROC Curve Points

  • Significance of Points:

    • (0,0): Everything classified as negative (FPR = 0, TPR = 0); the decision threshold is pushed to the top of the score range (e.g., 1 for probabilities).

    • (1,1): Everything classified as positive (FPR = 1, TPR = 1); the decision threshold is pushed to the bottom of the score range (e.g., 0 for probabilities).

    • (0,1): Perfect model with no incorrect predictions (FPR = 0, TPR = 1); 100% accuracy.

    • (1,0): Worst possible model: every negative is a false positive and every positive is a false negative (FPR = 1, TPR = 0); 0% accuracy.
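
A small sketch reproducing the (0,0) and (1,1) corner points by pushing the threshold past either end of the score range, in plain NumPy with the same made-up scores.

```python
import numpy as np

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.35, 0.8, 0.1, 0.6, 0.7, 0.2])

def roc_point(threshold):
    """Return the (FPR, TPR) point produced by one decision threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tpr = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)
    return float(fpr), float(tpr)

print(roc_point(1.01))  # threshold above every score -> (0.0, 0.0), everything negative
print(roc_point(0.0))   # threshold at/below every score -> (1.0, 1.0), everything positive
```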

AUC (Area Under ROC Curve)

  • Definition:

    • Measures the entire area under the ROC curve.

  • Interpreting AUC:

    • Greater area indicates better model performance; AUC ranges from 0 to 1, where 0.5 corresponds to random guessing and values below 0.5 indicate performance worse than random guessing.
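
A minimal sketch of computing AUC with scikit-learn's roc_auc_score on the same made-up scores; a constant score cannot rank positives above negatives, so it lands at the 0.5 random-guess level.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.35, 0.8, 0.1, 0.6, 0.7, 0.2])

print(roc_auc_score(y_true, y_score))          # 0.875 for these toy scores
print(roc_auc_score(y_true, np.full(8, 0.5)))  # constant scores -> 0.5 (random-guess level)
```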

Comparison of PR and ROC Curves

  1. Focus on Class Labels:

    • PR Curves: Primarily evaluate the performance of a model by focusing on the positive class only. This is particularly useful in imbalanced datasets where the positive class (minority) is of higher interest.

    • ROC Curves: Assess the performance of a model across both classes (positive and negative), providing a more general view of how the model performs regardless of class distribution.

  2. True Positive Rate vs. Precision:

    • Precision-Recall Curves: The y-axis represents precision (TP / (TP + FP)), which shows the accuracy of positive predictions only, relative to the total predicted positives.

    • ROC Curves: The y-axis measures the True Positive Rate (TPR), also known as sensitivity or recall, which quantifies how well the model identifies actual positive cases.

  3. Sensitivity to Class Imbalance:

    • PR Curves: More informative in cases of class imbalance since they directly illustrate the trade-offs between precision and recall when focusing specifically on the positive class.

    • ROC Curves: Can give an overly optimistic picture under class imbalance, because a large number of negatives keeps the false positive rate low even when the absolute number of false positives is substantial.

  4. Business Alignment:

    • PR Curves: Often align better with business needs when detecting positives is crucial, such as in fraud detection or medical diagnosis, where missing a positive instance can have severe consequences.

    • ROC Curves: Provide a broader perspective which may not focus on specific business priorities, sometimes diluting the significance of positive classes when they are not the main concern.
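
A sketch of the imbalance point above, comparing ROC AUC with average precision (a summary of the PR curve) on synthetic data with about 1% positives; the score generator and the resulting numbers are illustrative assumptions, not results from the notes.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic imbalanced problem: ~1% positives, scored by a mediocre model that
# ranks positives only slightly higher than negatives on average.
rng = np.random.default_rng(0)
y_true = (rng.random(20_000) < 0.01).astype(int)
y_score = rng.normal(loc=y_true * 1.0, scale=1.0)

print(f"ROC AUC           : {roc_auc_score(y_true, y_score):.3f}")
print(f"Average precision : {average_precision_score(y_true, y_score):.3f}")
# The ROC AUC looks moderately good (roughly 0.75 here), while average precision
# stays far lower, reflecting how hard the minority class actually is to retrieve.
```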