Stages:
Business Understanding
Data Understanding
Data Preparation
Data Modeling
Evaluation
Deployment
Evaluation Points:
Final evaluation on testing set
Hyperparameter tuning evaluation on the validation set
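A minimal sketch of how these two evaluation points can be wired up, assuming scikit-learn's train_test_split is available; the split ratios, random seed, and synthetic data are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice X and y come from Data Preparation.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# Hold out the test set first: it is touched only for the final evaluation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Split the remainder into training and validation sets;
# the validation set is used for hyperparameter tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
# Resulting proportions: 60% train, 20% validation, 20% test.
```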
Random Guess Model:
Refers to the simplest prediction method without any learning.
Majority-Class Classifier:
Predicts the most common class label in the training set.
Example: In direct mail marketing, if only 1% of households respond, the model will classify every household as a non-responder (default prediction).
The most intuitive and simplest baseline, but not a good model: it puts all of its weight on the majority class, which is usually not the class of interest.
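As a sketch of this baseline (assuming scikit-learn), DummyClassifier with strategy="most_frequent" reproduces the majority-class behaviour on data shaped like the direct-mail example:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Simulated direct-mail data: 1% responders (1), 99% non-responders (0).
y_train = np.array([1] * 10 + [0] * 990)
X_train = np.zeros((1000, 1))  # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# Every household is predicted to be a non-responder.
print(baseline.predict(np.zeros((5, 1))))  # [0 0 0 0 0]
```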
Definition:
A table used to describe the performance of a classification model.
Entries:
True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN).
Key Information from Confusion Matrix:
Total predicted examples for each class.
Counts of correctly versus incorrectly predicted examples.
Binary Classification Labels:
Defined using 1 for positive and 0 for negative.
Examples of Labeling:
Spam (1) vs. Non-Spam (0)
Disease presence (1) vs. absence (0)
Default status (1) vs. non-default (0)
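A small sketch of building a confusion matrix with the 1/0 labeling convention above, assuming scikit-learn; the toy labels are made up.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels using 1 = positive, 0 = negative.
y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# With labels=[0, 1], scikit-learn orders the flattened matrix as TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")  # TP=3  TN=5  FP=1  FN=1
```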
Formula:
[ \text{Accuracy} = \frac{TP + TN}{\text{Total Instances}} ]
[ \text{Error Rate} = 1 - \text{Accuracy} ]
Contextual Example:
In an imbalanced dataset scenario (e.g., less than 1% response rates), high accuracy might be misleading.
Issue with Imbalanced Data:
High accuracy can be achieved even by trivial models that completely ignore the positive class.
Case Study:
Decision tree model: 97.8% accuracy
Majority-class model: 99% accuracy, despite doing no learning at all.
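The imbalance trap is easy to reproduce: the sketch below scores a majority-class predictor on data with a 1% response rate (the 97.8% decision-tree figure above comes from the case study and is not recomputed here).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 1% positives, as in the direct-mail example.
y_true = np.array([1] * 10 + [0] * 990)

# Majority-class model: predict 0 for everyone.
y_pred = np.zeros_like(y_true)

# Accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 990) / 1000 = 0.99
print(accuracy_score(y_true, y_pred))      # 0.99
print(1 - accuracy_score(y_true, y_pred))  # error rate: 0.01
```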
Precision:
[ \text{Precision} = \frac{TP}{TP + FP} ]
The proportion of predicted positive cases that are truly positive.
Recall:
[ \text{Recall} = \frac{TP}{TP + FN} ]
The proportion of actual positive cases that the model correctly identifies.
Importance:
Particularly relevant in scenarios where the detection of positives is crucial.
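A quick sketch of both metrics on the same toy predictions as the confusion-matrix example (scikit-learn assumed; pos_label defaults to 1):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]

# Precision = TP / (TP + FP) = 3 / (3 + 1) = 0.75
# Recall    = TP / (TP + FN) = 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
```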
Definition:
The decision threshold is the cutoff applied to the model's predicted score or probability: examples scoring at or above it are classified as positive, the rest as negative. The choice of threshold therefore determines the entire confusion matrix.
Varying the threshold will give different confusion matrices and thus different precisions and recalls.
Inverse Relationship and Adjustment of Decision Thresholds:
Improving precision tends to reduce recall and vice versa.
Graphical Representation:
A Precision-Recall curve plots precision against recall across the range of decision thresholds, making the trade-off visible: the curve generally slopes downward, since raising the threshold to gain precision typically sacrifices some recall, and vice versa.
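A sketch of the threshold's effect, assuming the model outputs a probability of the positive class; the scores and thresholds below are invented for illustration. Each cutoff produces a different set of predictions (and therefore a different confusion matrix), and precision_recall_curve sweeps all cutoffs at once to trace the trade-off.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true   = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])  # P(positive)

# Each threshold yields different predictions, hence a different confusion matrix.
for t in (0.3, 0.5, 0.7):
    y_pred = (y_scores >= t).astype(int)
    print(f"threshold={t}: predictions={y_pred}")

# The full precision-recall trade-off across all thresholds.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print(precision)
print(recall)
```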
Definition:
A graphical representation of model performance across all thresholds, plotting True Positive Rate (TPR) against False Positive Rate (FPR).
Changing the decision threshold may change the location of the point.
An ROC curve is made by changing the decision threshold.
Use of Confusion Matrix Elements:
Incorporates TP, TN, FP, FN in analysis.
Significance of Points:
(0,0): Everything classified as negative; corresponds to a decision threshold of 1 (no score clears the cutoff).
(1,1): Everything classified as positive; corresponds to a decision threshold of 0.
(0,1): Perfect model: TPR = 1 and FPR = 0, i.e., no incorrect predictions (100% accuracy).
(1,0): Worst possible model: every prediction is wrong, with all positives missed and all negatives flagged as positive (0% accuracy).
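A sketch of computing the curve's points with scikit-learn's roc_curve, reusing the toy scores from the precision-recall example; each returned point is one (FPR, TPR) pair at a particular threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true   = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
# The highest threshold classifies everything as negative -> point (0, 0);
# a threshold of 0 classifies everything as positive -> point (1, 1);
# a perfect ranking would pass through (0, 1).
```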
Definition:
Measures the entire area under the ROC curve.
Interpreting AUC:
Greater area indicates better model performance; AUC ranges over [0, 1], with 0.5 corresponding to random guessing and values below 0.5 indicating performance worse than random guessing.
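A sketch using roc_auc_score on the same toy scores as above; the printed value follows from that specific data.

```python
from sklearn.metrics import roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

auc = roc_auc_score(y_true, y_scores)
print(auc)  # 0.875 here: well above the 0.5 of random guessing, below the perfect 1.0
```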
Focus on Class Labels:
PR Curves: Primarily evaluate the performance of a model by focusing on the positive class only. This is particularly useful in imbalanced datasets where the positive class (minority) is of higher interest.
ROC Curves: Assess the performance of a model across both classes (positive and negative), providing a more general view of how the model performs regardless of class distribution.
True Positive Rate vs. Precision:
Precision-Recall Curves: The y-axis represents precision (TP / (TP + FP)), which shows the accuracy of positive predictions only, relative to the total predicted positives.
ROC Curves: The y-axis measures the True Positive Rate (TPR), also known as sensitivity or recall, which quantifies how well the model identifies actual positive cases.
Sensitivity to Class Imbalance:
PR Curves: More informative in cases of class imbalance since they directly illustrate the trade-offs between precision and recall when focusing specifically on the positive class.
ROC Curves: Can give an overly optimistic picture of model performance under class imbalance, because the large number of negatives keeps the false positive rate low even when the model produces many false positives in absolute terms.
Business Alignment:
PR Curves: Often align better with business needs when the true positive rate is crucial, such as in fraud detection or medical diagnosis where missing a positive instance can have severe consequences.
ROC Curves: Provide a broader perspective which may not focus on specific business priorities, sometimes diluting the significance of positive classes when they are not the main concern.
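To make the contrast concrete, the sketch below compares ROC AUC with average precision (a single-number summary of the PR curve) on a heavily imbalanced toy set; the data generation and score distributions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# 1% positives, 99% negatives.
y_true = np.concatenate([np.ones(10), np.zeros(990)]).astype(int)

# A mediocre scorer: positives score only slightly higher on average.
y_scores = np.concatenate([rng.normal(0.6, 0.2, 10),
                           rng.normal(0.4, 0.2, 990)])

print("ROC AUC          :", roc_auc_score(y_true, y_scores))
print("Average precision:", average_precision_score(y_true, y_scores))
# ROC AUC typically lands well above 0.5 here, while average precision stays
# much lower, reflecting the many false positives relative to the few true
# positives that the PR view emphasises.
```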