Chapter 11 – Performance Evaluation (11.3) and Related Concepts
11.1 Data Mining Overview
- Goal: transform raw data into actionable knowledge through an iterative, staged data mining process.
- Stages (in order):
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
- Flow and interdependencies:
- Data preparation is interconnected with modeling and evaluation, indicating feedback between stages as insights emerge.
- Key takeaway: A structured, iterative workflow is essential for building predictive systems, with feedback loops between preparation, modeling, and evaluation.
11.2 Similarity Measures
- Context: Similarity measures are used to compare objects/entities; slide provides a visual example (text alternative available).
- Example plot (from slide): three points at (3, 4), (4, 5), and (10, 1), plotted on axes spanning x = 0–12 and y = 0–6.
- Note: The slide emphasizes the visual representation of similarity measures; no explicit formulas are provided in the transcript. A small distance/cosine sketch follows.
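Since the transcript gives no formulas, here is a minimal sketch of two common similarity/dissimilarity measures, assuming Euclidean distance and cosine similarity as illustrative choices; only the three point coordinates come from the slide, everything else (NumPy, the helper functions) is assumed.

```python
import numpy as np

# Points taken from the slide's example plot.
points = np.array([[3, 4], [4, 5], [10, 1]])

# Euclidean distance: a dissimilarity measure; smaller values mean more similar observations.
def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Cosine similarity: 1 means the two vectors point in exactly the same direction.
def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

for i in range(len(points)):
    for j in range(i + 1, len(points)):
        d = euclidean(points[i], points[j])
        c = cosine(points[i], points[j])
        print(f"points {i} and {j}: distance = {d:.2f}, cosine = {c:.3f}")
```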
11.3 Performance Evaluation
- Data partitioning (core idea):
- Purpose: Assess predictive performance on unseen data and guard against overfitting.
- Two-way random partitioning: create a training set and a validation set (commonly used splits: 60/40 or 70/30).
- Rationale for random partitioning: avoid bias in training/validation split selection.
- Training set: used to train the data mining model; contains a larger portion of data; aims to learn relationships between predictors and the target variable.
- Validation set: used to provide an unbiased assessment of predictive performance; compare predicted vs actual target values; helps fine-tune and compare models.
- Primary role of the validation set: identify the optimal model and provide an unbiased evaluation of predictive performance.
- Three-way random partitioning (supervised data mining):
- Adds a test set in addition to training and validation.
- Test set: not involved in either model building or model selection; used to evaluate how well the final model would perform on new data.
- Typically implemented with the holdout method rather than k-fold cross-validation.
- Cross-validation vs holdout:
- Holdout methods (two-way or three-way) partition the data once.
- k-fold cross-validation partitions the data into k subsets and rotates the validation set across folds to obtain more robust estimates.
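To make the partitioning ideas concrete, here is a minimal sketch assuming scikit-learn is available; the synthetic X and y, the 60/20/20 split proportions, and the logistic regression model are all illustrative assumptions, not from the slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical data: X (features) and y (binary target).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# Three-way random holdout: 60% training, 20% validation, 20% test.
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=1)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1)

model = LogisticRegression().fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_valid, model.predict(X_valid)))

# k-fold cross-validation: rotate the validation fold to obtain a more robust estimate.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for tr_idx, va_idx in kf.split(X):
    m = LogisticRegression().fit(X[tr_idx], y[tr_idx])
    scores.append(accuracy_score(y[va_idx], m.predict(X[va_idx])))
print("5-fold mean accuracy:", np.mean(scores))
```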
- Handling rare target classes (oversampling):
- Problem: models may ignore rare target classes if overall accuracy is high due to majority class.
- Oversampling approach: oversample the rare class in the training data to overweight it relative to other classes.
- Important constraints:
- Only the training set is oversampled; the validation and test sets retain the original class distribution.
- Objective: produce models that are more useful for predicting all target classes, not just the majority class.
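A minimal sketch of oversampling only the training partition, assuming pandas and scikit-learn's resample utility; the tiny training DataFrame is hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical training partition with a rare positive class (y == 1).
train = pd.DataFrame({
    "x1": range(20),
    "y":  [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

majority = train[train["y"] == 0]
minority = train[train["y"] == 1]

# Oversample the rare class with replacement until the two classes are balanced.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=1)
train_balanced = pd.concat([majority, minority_up])

print(train_balanced["y"].value_counts())
# The validation and test sets are left alone so they keep the original class distribution.
```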
- Overfitting and model complexity:
- Overfitting: a model that fits in-sample data too closely, including noise, leading to poor generalization to new observations.
- As model complexity increases, training error typically decreases, but validation error decreases initially and then rises after a point.
- The rise in validation error beyond the optimal point signals overfitting.
- The concept of an optimal model complexity is often identified via validation performance.
- Detecting overfitting with partitioning and cross-validation:
- Use training data to fit models and validation data to monitor generalization performance.
- A model that performs well on training data but poorly on validation data is overfitting.
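The complexity-versus-error pattern can be reproduced with a short sketch, assuming scikit-learn; the synthetic data and the choice of a decision tree whose max_depth serves as the complexity knob are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Increase complexity (tree depth) and watch the two error rates diverge:
# training error keeps falling while validation error eventually stops improving.
for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_err = 1 - tree.score(X_tr, y_tr)
    valid_err = 1 - tree.score(X_va, y_va)
    print(f"depth={depth:2d}  train error={train_err:.3f}  validation error={valid_err:.3f}")
```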
- Performance measures and evaluation framework:
- Performance measures derived from a confusion matrix (see next sub-section).
- These measures help compare competing models and guide model tuning.
- Confusion matrix (classification outcomes):
- Structure (assuming class 1 is the target/success):
- True positives (TP): class 1 correctly classified as class 1
- True negatives (TN): class 0 correctly classified as class 0
- False positives (FP): class 0 incorrectly classified as class 1
- False negatives (FN): class 1 incorrectly classified as class 0
- The confusion matrix is the basis for most classification performance metrics.
- Baseline / naïve rule:
- A common baseline is to compare model performance against a naive rule that always assigns the most predominant class.
- This helps quantify the value added by the model.
- FashionTech example (confusion matrix and interpretations):
- Given:
- TP = 30, FN = 18, FP = 19, TN = 133 (Total N = 200)
- Calculations:
- Accuracy: \text{Accuracy} = \frac{TP + TN}{N} = \frac{30 + 133}{200} = \frac{163}{200} = 0.815 \ (81.5\%)
- Sensitivity (Recall): \text{Sensitivity} = \frac{TP}{TP + FN} = \frac{30}{30 + 18} = \frac{30}{48} \approx 0.625
- Specificity: \text{Specificity} = \frac{TN}{TN + FP} = \frac{133}{133 + 19} = \frac{133}{152} \approx 0.875
- Precision (Positive Predictive Value): \text{Precision} = \frac{TP}{TP + FP} = \frac{30}{30 + 19} = \frac{30}{49} \approx 0.612
- Interpretation:
- The model correctly identifies 30 of 48 actual positives (Sensitivity ~ 0.625).
- It correctly identifies 133 of 152 actual negatives (Specificity ~ 0.875).
- Overall accuracy is about 81.5% on the 200-observation validation set.
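A quick check of the FashionTech calculations in code; only the four counts come from the example, the rest is a trivial sketch.

```python
# Counts from the FashionTech confusion matrix on the validation set.
TP, FN, FP, TN = 30, 18, 19, 133
N = TP + FN + FP + TN          # 200 observations

accuracy    = (TP + TN) / N    # 163 / 200 = 0.815
sensitivity = TP / (TP + FN)   # 30 / 48   ~ 0.625
specificity = TN / (TN + FP)   # 133 / 152 ~ 0.875
precision   = TP / (TP + FP)   # 30 / 49   ~ 0.612

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} "
      f"specificity={specificity:.3f} precision={precision:.3f}")
```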
- Cutoff value and predicted class probability:
- Class membership is determined by comparing the predicted probability of the target class to a cutoff value (default commonly 0.5).
- Adjusting the cutoff changes the confusion matrix and all derived performance measures (sensitivity, specificity, precision, etc.).
- Rationale: different misclassification costs or class distributions may justify moving the cutoff away from 0.5.
- Cutoff example table (illustrative values from the slide):

| Cutoff | Misclassification rate | Accuracy | Sensitivity | Precision | Specificity |
|--------|------------------------|----------|-------------|-----------|-------------|
| 0.15   | 0.20                   | 0.80     | 1.00        | 0.714     | 0.60        |
| 0.25   | 0.10                   | 0.90     | 1.00        | 0.833     | 0.80        |
| 0.50   | 0.30                   | 0.70     | 0.60        | 0.75      | 0.30        |
| 0.75   | 0.60                   | 0.40     | 1.00        | 0.85      | 0.40        |
| 0.85   | 1.00                   | 0.60     | 1.00        | 0.60      | 0.20        |
- Practical takeaway: Lower cutoffs tend to increase sensitivity (catch more positives) but reduce specificity, while higher cutoffs increase specificity but can reduce sensitivity.
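A minimal sketch of how moving the cutoff reshapes the confusion matrix and the derived measures; the predicted probabilities and labels are hypothetical and do not reproduce the slide's table.

```python
import numpy as np

# Hypothetical predicted class-1 probabilities and actual labels.
probs  = np.array([0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95])
actual = np.array([0,    0,    0,    1,    0,    1,    1,    0,    1,    1])

def metrics_at_cutoff(cutoff):
    pred = (probs >= cutoff).astype(int)   # classify as 1 when score >= cutoff
    tp = np.sum((pred == 1) & (actual == 1))
    tn = np.sum((pred == 0) & (actual == 0))
    fp = np.sum((pred == 1) & (actual == 0))
    fn = np.sum((pred == 0) & (actual == 1))
    return {
        "accuracy":    (tp + tn) / len(actual),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "precision":   tp / (tp + fp) if tp + fp else float("nan"),
    }

for c in [0.15, 0.25, 0.50, 0.75, 0.85]:
    print(c, metrics_at_cutoff(c))
```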
- Receiver Operating Characteristic (ROC) curve and AUC
- ROC curve plots: x-axis = 1 − Specificity, y-axis = Sensitivity, across all cutoff values.
- Baseline: diagonal line represents random classification using prior probabilities.
- Perfect point on ROC: (0, 1) indicates 100% sensitivity and 100% specificity.
- A good model has a ROC curve above the baseline; the larger the area between the ROC curve and the baseline, the better the model.
- Area Under the Curve (AUC): a single scalar measure of overall performance, ranging from 0 to 1.
- Interpretation: AUC = 1 is perfect; AUC = 0.5 corresponds to random guessing.
- FashionTech example: AUC = 0.9457 (high, indicates strong discriminative ability).
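A minimal ROC/AUC sketch, assuming scikit-learn's roc_curve and roc_auc_score; the probabilities and labels are hypothetical and will not reproduce the FashionTech AUC of 0.9457.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical predicted probabilities and actual labels on a validation set.
actual = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
probs  = np.array([0.1, 0.3, 0.4, 0.2, 0.8, 0.7, 0.5, 0.9, 0.6, 0.35])

fpr, tpr, thresholds = roc_curve(actual, probs)   # fpr = 1 - specificity, tpr = sensitivity
auc = roc_auc_score(actual, probs)

print("AUC:", auc)   # 1.0 is perfect, 0.5 corresponds to random guessing
# Plotting tpr against fpr (e.g., with matplotlib) would reproduce the ROC curve itself.
```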
- Lift charts and decile-wise lift
- Cumulative lift chart:
- Purpose: show the improvement of the model over random selection in capturing target class cases.
- Axes: x-axis = number (or percent) of cases selected; y-axis = cumulative number of target class cases identified by the model.
- Baseline: orange diagonal line representing random selection.
- Lift curve: blue curve showing model performance; lift is the ratio of target class cases captured by the model to those captured by random selection, and it increases when the model captures more target cases among fewer selected observations.
- Interpretation: A lift curve above the baseline indicates good predictive performance; the higher the lift, the better the model identifies the target class with fewer observations.
- Decile-wise lift chart:
- Divides data into 10 equal-sized intervals (deciles).
- Bar chart where y-axis shows the ratio of target class cases identified by the model to those identified by random selection within each decile.
- Use: identify at which deciles the model is most effective and where performance deteriorates.
- Example (from slide): decile 1 lift ≈ 2.3, decile 2 ≈ 3.2, decile 3 ≈ 1.5, deciles 4–6 ≈ 1.1–0.3, deciles 7–10 near 0, with a general decreasing trend as decile increases.
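Cumulative and decile-wise lift can be computed directly from cases sorted by predicted score. Below is a minimal NumPy sketch with hypothetical scores and labels, following the Lift(p) definition given later in this chapter.

```python
import numpy as np

# Hypothetical validation scores and actual class labels.
rng = np.random.default_rng(3)
probs  = rng.uniform(size=100)
actual = (rng.uniform(size=100) < probs).astype(int)   # higher score -> more likely positive

# Sort cases from highest to lowest predicted probability.
order = np.argsort(-probs)
sorted_actual = actual[order]
total_pos = sorted_actual.sum()

# Cumulative lift at the top p% of cases: positives found vs. random expectation.
for p in [0.1, 0.2, 0.5]:
    k = int(p * len(actual))
    lift = sorted_actual[:k].sum() / (p * total_pos)
    print(f"top {int(p * 100)}%: lift = {lift:.2f}")

# Decile-wise lift: positives captured in each decile vs. the random expectation per decile.
deciles = np.array_split(sorted_actual, 10)
expected_per_decile = total_pos / 10
for i, d in enumerate(deciles, start=1):
    print(f"decile {i}: lift = {d.sum() / expected_per_decile:.2f}")
```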
- Model evaluation with error measures (for regression/prediction tasks); a small numeric sketch follows this list.
- Let e_i = y_i - \hat{y}_i denote the prediction error for observation i.
- Root Mean Square Error (RMSE):
- Definition: \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2}
- Mean Error (ME):
- Definition: \text{ME} = \frac{1}{n} \sum_{i=1}^{n} e_i
- Interpretation: the sign indicates bias; a positive ME means the model underpredicts on average, while a negative ME means it overpredicts.
- Mean Absolute Deviation (MAD):
- Definition: \text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |e_i|
- Mean Percentage Error (MPE):
- Definition: \text{MPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{e_i}{y_i}
- Mean Absolute Percentage Error (MAPE):
- Definition: \text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{e_i}{y_i} \right|
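A minimal sketch computing the five error measures, with hypothetical actual/predicted values and the error defined as e_i = y_i - \hat{y}_i to match the interpretation above.

```python
import numpy as np

# Hypothetical actual and predicted values for a numeric target.
actual    = np.array([120.0, 150.0, 90.0, 200.0, 170.0])
predicted = np.array([110.0, 160.0, 95.0, 180.0, 175.0])

e = actual - predicted                       # prediction errors e_i = y_i - yhat_i

rmse = np.sqrt(np.mean(e ** 2))
me   = np.mean(e)                            # positive -> underprediction on average
mad  = np.mean(np.abs(e))
mpe  = np.mean(e / actual) * 100             # in percent
mape = np.mean(np.abs(e / actual)) * 100     # in percent

print(f"RMSE={rmse:.2f} ME={me:.2f} MAD={mad:.2f} MPE={mpe:.2f}% MAPE={mape:.2f}%")
```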
- Example—FashionTech model prediction comparison (two models)
- Prediction data set: 200 observations with actual values ActVal and two model predictions PredVal1, PredVal2.
- Model 1 vs Model 2 reported metrics:
- RMSE: Model 1 = 171.3489, Model 2 = 174.1758
- ME: Model 1 = 11.2530, Model 2 = 12.0480
- MAD: Model 1 = 115.1650, Model 2 = 117.9920
- MPE: Model 1 = -2.05\%, Model 2 = -2.08\%
- MAPE: Model 1 = 15.51\%, Model 2 = 15.95\%
- Summary of predictive performance framework
- Use a combination of partitioning, cross-validation, and held-out test data to obtain unbiased estimates of predictive performance.
- Compare models using confusion-matrix-based metrics (accuracy, misclassification rate, sensitivity/recall, specificity, precision) and threshold-dependent measures.
- Use ROC/AUC, lift curves, and decile lifts to assess ranking quality and ability to identify target cases at scale.
- Consider the impact of class imbalance and misclassification costs when choosing evaluation metrics and cutoff thresholds.
11.4 Principal Component Analysis (PCA)
- Purpose of PCA:
- PCA reduces dimensionality of data by projecting onto principal components that capture the maximum variance in the data.
- Visualization example (text alternative):
- Plot shows two principal components, PC1 and PC2, as axes:
- PC1 arrow extends diagonally upward to the right, starting near (0,0) and ending around (19, 5.8).
- PC2 arrow is nearly vertical, starting near (12, 0) and pointing upward toward (9, 6).
- Data distribution: most data concentrates between X1 values 10–15 and X2 values 1.5–5.
- Interpretation of PCA plot:
- PC1 captures the direction of greatest variance along an axis roughly aligned with increasing X1.
- PC2 captures the second most variance, oriented roughly perpendicular to PC1, adding information about the spread in the X2 direction.
- The concentration of points around the center with spread along the PC axes suggests that the data can be represented with a smaller number of components without substantial loss of information.
- Practical takeaway:
- PCA helps reduce dimensionality for visualization, noise reduction, and as a preprocessing step for modeling when many correlated features are present.
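A minimal PCA sketch, assuming scikit-learn; the two correlated features are synthetic stand-ins for X1 and X2, and standardizing before PCA is an illustrative (common) preprocessing choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two correlated features, similar in spirit to the X1/X2 plot.
rng = np.random.default_rng(7)
x1 = rng.normal(12.5, 1.5, size=100)
x2 = 0.4 * x1 + rng.normal(0, 0.5, size=100)
X  = np.column_stack([x1, x2])

# Standardize, then project the observations onto the principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

print("explained variance ratio:", pca.explained_variance_ratio_)
# A large first ratio means PC1 alone captures most of the spread,
# so the data can be summarized with fewer components.
scores = pca.transform(X_std)   # coordinates of each observation on PC1 and PC2
```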
Connections to broader data mining theory and practice
- Data mining process and evaluation framework align with foundational principles:
- Validation and test data are essential for estimating generalization performance.
- Cross-validation provides robust estimates when data are limited.
- Handling class imbalance via oversampling (in training) improves model usefulness across target classes.
- Model selection should balance bias (underfitting) and variance (overfitting) by selecting an optimal model complexity.
- Practical evaluation tools:
- Confusion matrix-based metrics summarize predictive performance for classification tasks.
- Threshold tuning (cutoff adjustment) provides a way to align model predictions with business costs and risk preferences.
- ROC/AUC, lift charts, and decile-wise lift complement accuracy-based metrics by focusing on ranking and targeting performance.
- Real-world relevance and ethics/practical implications:
- Proper validation avoids deploying models that look good on historical data but fail in production, reducing harm in decision-making contexts.
- Handling class imbalance ethically ensures minority groups receive attention in predictive systems (e.g., fraud, rare events).
- Threshold selection should reflect cost-sensitive values and fairness considerations where applicable.
- Overall accuracy (classification):
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
- Misclassification (error) rate:
\text{Misclassification Rate} = 1 - \text{Accuracy} = \frac{FP + FN}{TP + TN + FP + FN}
- Sensitivity (Recall):
\text{Sensitivity} = \frac{TP}{TP + FN}
- Specificity:
\text{Specificity} = \frac{TN}{TN + FP}
- Precision (Positive Predictive Value):
\text{Precision} = \frac{TP}{TP + FP}
- Confusion matrix (for reference):
- TP: true positives
- FP: false positives
- TN: true negatives
- FN: false negatives
- Threshold / cutoff for class membership:
- If score = predicted probability of class 1, classify as 1 if score ≥ cutoff, else 0.
- Lift (top-p% of cases):
- Let S(p) be the set of top p% cases by predicted score.
- Lift(p) is the ratio of positives found in S(p) to those that would be found by random selection:
\text{Lift}(p) = \frac{\#\text{ positives in } S(p)}{p \times \text{total positives}}
- ROC and AUC (conceptual):
- ROC curve: plot sensitivity vs 1 − specificity across cutoffs.
- AUC: \text{AUC} = \text{Area under the ROC curve} \in [0, 1]
- PCA (conceptual):
- PCA seeks projection onto principal components that maximize captured variance; the first principal component (PC1) explains the largest variance, the second (PC2) explains the next largest, subject to being orthogonal to PC1.