
Chapter 11 – Performance Evaluation (11.3) and Related Concepts

11.1 Data Mining Overview

  • Core idea: the data mining process is an iterative sequence of stages that transforms raw data into actionable knowledge.
  • Stages (in order):
    • Business understanding
    • Data understanding
    • Data preparation
    • Modeling
    • Evaluation
    • Deployment
  • Flow and interdependencies:
    • Data preparation is interconnected with modeling and evaluation, indicating feedback between stages as insights emerge.
  • Key takeaway: A structured, iterative workflow is essential for building predictive systems, with feedback loops between preparation, modeling, and evaluation.

11.2 Similarity Measures

  • Context: Similarity measures are used to compare objects/entities; slide provides a visual example (text alternative available).
  • Example plot (from slide): a scatter of three points, (3, 4), (4, 5), and (10, 1), on axes spanning roughly 0–12 (x) by 0–6 (y).
  • Note: The slide conveys the visual intuition that nearby points are more similar; no explicit formulas appear in the transcript (a distance-based sketch follows below).
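  • Minimal sketch (Python): pairwise Euclidean distances among the three plotted points. Euclidean distance is an assumed choice here, since the transcript names no specific similarity measure; smaller distance means greater similarity.

        import numpy as np

        # Points taken from the slide's example plot
        points = np.array([[3, 4], [4, 5], [10, 1]])

        # Pairwise Euclidean distances: smaller distance = more similar
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                d = np.linalg.norm(points[i] - points[j])
                print(f"distance {points[i]} -> {points[j]}: {d:.3f}")

  • As expected from the plot, (3, 4) and (4, 5) are close (distance ≈ 1.41), while (10, 1) is far from both.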

11.3 Performance Evaluation

  • Data partitioning (core idea):
    • Purpose: Assess predictive performance on unseen data and guard against overfitting.
    • Two-way random partitioning: create a training set and a validation set (commonly used splits: 60/40 or 70/30).
    • Rationale for random partitioning: avoid bias in training/validation split selection.
    • Training set: used to train the data mining model; contains a larger portion of data; aims to learn relationships between predictors and the target variable.
    • Validation set: used to provide an unbiased assessment of predictive performance; compare predicted vs actual target values; helps fine-tune and compare models.
    • In practice, the validation set is used to compare candidate models and identify the best-performing one.
  • Three-way random partitioning (supervised data mining):
    • Adds a test set in addition to training and validation.
    • Test set: not involved in either model building or model selection; used to evaluate how well the final model would perform on new data.
    • Typically implemented with the holdout method rather than k-fold cross-validation.
  • Cross-validation vs holdout:
    • Holdout methods (two-way or three-way) partition the data once.
    • k-fold cross-validation partitions the data into k subsets (folds) and rotates the validation role across the folds, averaging the results for a more robust estimate (see the sketch below).
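  • Minimal sketch (Python; assumes scikit-learn and synthetic data, since the transcript names no specific tool): a 70/30 holdout split versus 5-fold cross-validation with a simple classifier.

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split, cross_val_score
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=200, random_state=1)

        # Two-way holdout: one random 70/30 split into training and validation sets
        X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=1)
        model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)
        print("holdout validation accuracy:", model.score(X_valid, y_valid))

        # 5-fold cross-validation: the validation role rotates across the 5 folds
        scores = cross_val_score(DecisionTreeClassifier(max_depth=4, random_state=1), X, y, cv=5)
        print("cross-validation accuracies:", scores.round(3), "mean:", round(scores.mean(), 3))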
  • Handling rare target classes (oversampling):
    • Problem: models may ignore rare target classes if overall accuracy is high due to majority class.
    • Oversampling approach: oversample the rare class in the training data to overweight it relative to other classes.
    • Important constraints:
    • Only the training set is oversampled; the validation and test sets retain the original class distribution.
    • Objective: produce models that are more useful for predicting all target classes, not just the majority class.
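  • Minimal sketch (Python; pandas and scikit-learn, with a tiny hypothetical training partition) of oversampling the rare class in the training data only:

        import pandas as pd
        from sklearn.utils import resample

        # Hypothetical training partition with a rare positive class (Target == 1)
        train_df = pd.DataFrame({
            "x1":     range(10),
            "Target": [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        })

        majority = train_df[train_df["Target"] == 0]
        minority = train_df[train_df["Target"] == 1]

        # Duplicate minority rows (sampling with replacement) until the classes are balanced;
        # the validation and test partitions keep their original class distribution
        minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=1)
        train_balanced = pd.concat([majority, minority_up])
        print(train_balanced["Target"].value_counts())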
  • Overfitting and model complexity:
    • Overfitting: a model that fits in-sample data too closely, including noise, leading to poor generalization to new observations.
    • As model complexity increases, training error typically decreases, but validation error decreases initially and then rises after a point.
    • The rise in validation error beyond the optimal point signals overfitting.
    • The concept of an optimal model complexity is often identified via validation performance.
  • Detecting overfitting with partitioning and cross-validation:
    • Use training data to fit models and validation data to monitor generalization performance.
    • A model that performs well on training data but poorly on validation data is overfitting.
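  • Minimal sketch (Python; assumes scikit-learn, with tree depth standing in for "model complexity," an illustrative choice): training error keeps falling as complexity grows, while validation error eventually turns upward.

        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=500, n_informative=5, random_state=1)
        X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=1)

        # The depth at which validation error starts rising marks the onset of overfitting
        for depth in [1, 2, 4, 8, 16, None]:
            model = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
            print(depth,
                  "train error:", round(1 - model.score(X_train, y_train), 3),
                  "validation error:", round(1 - model.score(X_valid, y_valid), 3))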
  • Performance measures and evaluation framework:
    • Performance measures derived from a confusion matrix (see next sub-section).
    • These measures help compare competing models and guide model tuning.
  • Confusion matrix (classification outcomes):
    • Structure (assuming class 1 is the target/success):
    • True positives (TP): class 1 correctly classified as class 1
    • True negatives (TN): class 0 correctly classified as class 0
    • False positives (FP): class 0 incorrectly classified as class 1
    • False negatives (FN): class 1 incorrectly classified as class 0
    • The confusion matrix is the basis for most classification performance metrics.
  • Baseline / naïve rule:
    • A common baseline is to compare model performance against a naive rule that always assigns the most predominant class.
    • This helps quantify the value added by the model.
  • FashionTech example (confusion matrix and interpretations):
    • Given:
    • TP = 30, FN = 18, FP = 19, TN = 133 (Total N = 200)
    • Calculations:
    • Accuracy: \text{Accuracy} = \frac{TP + TN}{N} = \frac{30 + 133}{200} = \frac{163}{200} = 0.815 \, (81.5\%)
    • Sensitivity (Recall): \text{Sensitivity} = \frac{TP}{TP + FN} = \frac{30}{30 + 18} = \frac{30}{48} \approx 0.625
    • Specificity: \text{Specificity} = \frac{TN}{TN + FP} = \frac{133}{133 + 19} = \frac{133}{152} \approx 0.875
    • Precision (Positive Predictive Value): \text{Precision} = \frac{TP}{TP + FP} = \frac{30}{30 + 19} = \frac{30}{49} \approx 0.612
    • Interpretation:
    • The model correctly identifies 30 of 48 actual positives (Sensitivity ~ 0.625).
    • It correctly identifies 133 of 152 actual negatives (Specificity ~ 0.875).
    • Overall accuracy is about 81.5% on the 200-observation validation set.
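  • The calculations above follow directly from the four counts; a minimal sketch (Python):

        # FashionTech confusion-matrix counts from the example
        TP, FN, FP, TN = 30, 18, 19, 133
        N = TP + FN + FP + TN  # 200

        accuracy    = (TP + TN) / N    # 163/200 = 0.815
        sensitivity = TP / (TP + FN)   # 30/48   = 0.625
        specificity = TN / (TN + FP)   # 133/152 ≈ 0.875
        precision   = TP / (TP + FP)   # 30/49   ≈ 0.612

        print(f"Accuracy={accuracy:.3f}  Sensitivity={sensitivity:.3f}  "
              f"Specificity={specificity:.3f}  Precision={precision:.3f}")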
  • Cutoff value and predicted class probability:
    • Class membership is determined by comparing the predicted probability of the target class to a cutoff value (default commonly 0.5).
    • Adjusting the cutoff changes the confusion matrix and all derived performance measures (sensitivity, specificity, precision, etc.).
    • Rationale: different misclassification costs or class distributions may justify moving the cutoff away from 0.5.
  • Cutoff example table (illustrative values from the slide):

        Cutoff   Misclass. rate   Accuracy   Sensitivity   Precision   Specificity
        0.15     0.20             0.80       1.00          0.714       0.60
        0.25     0.10             0.90       1.00          0.833       0.80
        0.50     0.30             0.70       0.60          0.75        0.30
        0.75     0.60             0.40       1.00          0.85        0.40
        0.85     1.00             0.60       1.00          0.60        0.20
    • Practical takeaway: Lower cutoffs tend to increase sensitivity (catch more positives) but reduce specificity, while higher cutoffs increase specificity but can reduce sensitivity.
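  • To see the cutoff's effect mechanically, the sketch below (Python) recomputes the confusion matrix at several cutoffs from a small set of hypothetical predicted probabilities; the numbers are invented for illustration and are not the slide's values.

        import numpy as np

        # Hypothetical predicted probabilities of class 1 and the corresponding actual labels
        prob   = np.array([0.92, 0.81, 0.64, 0.55, 0.47, 0.33, 0.28, 0.16, 0.09, 0.05])
        actual = np.array([1,    1,    1,    0,    1,    0,    0,    1,    0,    0])

        for cutoff in [0.15, 0.25, 0.50, 0.75]:
            pred = (prob >= cutoff).astype(int)   # classify as 1 when the probability meets the cutoff
            TP = np.sum((pred == 1) & (actual == 1))
            FP = np.sum((pred == 1) & (actual == 0))
            FN = np.sum((pred == 0) & (actual == 1))
            TN = np.sum((pred == 0) & (actual == 0))
            print(f"cutoff {cutoff:.2f}: sensitivity {TP / (TP + FN):.2f}, specificity {TN / (TN + FP):.2f}")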
  • Receiver Operating Characteristic (ROC) curve and AUC
    • ROC curve plots: x-axis = 1 − Specificity, y-axis = Sensitivity, across all cutoff values.
    • Baseline: diagonal line represents random classification using prior probabilities.
    • Perfect point on ROC: (0, 1) indicates 100% sensitivity and 100% specificity.
    • A good model has a ROC curve above the baseline; the larger the area between the ROC curve and the baseline, the better the model.
    • Area Under the Curve (AUC): a single scalar measure of overall performance, ranging from 0 to 1.
    • Interpretation: AUC = 1 is perfect; AUC = 0.5 corresponds to random guessing.
    • FashionTech example: AUC = 0.9457 (high, indicates strong discriminative ability).
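  • Minimal sketch (Python; assumes scikit-learn, reusing the same hypothetical probabilities as in the cutoff sketch above, not the FashionTech data):

        import numpy as np
        from sklearn.metrics import roc_curve, roc_auc_score

        actual = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
        prob   = np.array([0.92, 0.81, 0.64, 0.55, 0.47, 0.33, 0.28, 0.16, 0.09, 0.05])

        # ROC curve: false positive rate (1 - specificity) vs. true positive rate (sensitivity)
        fpr, tpr, cutoffs = roc_curve(actual, prob)
        print("AUC:", roc_auc_score(actual, prob))  # 0.5 ≈ random guessing, 1.0 = perfect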
  • Lift charts and decile-wise lift
    • Cumulative lift chart:
    • Purpose: show the improvement of the model over random selection in capturing target class cases.
    • Axes: x-axis = number (or percent) of cases selected; y-axis = cumulative number of target class cases identified by the model.
    • Baseline: orange diagonal line representing random selection.
    • Lift curve: the blue curve showing the model's cumulative captures; lift is the ratio of target cases captured by the model to those captured by random selection, and it is higher when the model finds more target cases among fewer selected observations.
    • Interpretation: A lift curve above the baseline indicates good predictive performance; the higher the lift, the better the model identifies the target class with fewer observations.
    • Decile-wise lift chart:
    • Divides data into 10 equal-sized intervals (deciles).
    • Bar chart where y-axis shows the ratio of target class cases identified by the model to those identified by random selection within each decile.
    • Use: identify at which deciles the model is most effective and where performance deteriorates.
    • Example (from slide): decile 1 lift ≈ 2.3, decile 2 ≈ 3.2, decile 3 ≈ 1.5, deciles 4–6 ≈ 1.1–0.3, deciles 7–10 near 0, with a general decreasing trend as decile increases.
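  • Minimal sketch (Python) of a decile-wise lift calculation on hypothetical scores and labels; the logic, not the data, is the point.

        import numpy as np

        rng = np.random.default_rng(1)
        prob   = rng.random(200)                        # hypothetical predicted scores
        actual = (rng.random(200) < prob).astype(int)   # labels loosely tied to the scores

        # Sort cases from highest to lowest score, then split into 10 equal deciles
        order     = np.argsort(-prob)
        deciles   = np.array_split(actual[order], 10)
        base_rate = actual.mean()                       # share of positives under random selection

        for i, d in enumerate(deciles, start=1):
            print(f"decile {i}: lift = {d.mean() / base_rate:.2f}")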
  • Model evaluation with RMSE-based error measures (for regression/prediction tasks)
    • Root Mean Square Error (RMSE):
    • Definition: \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}
    • Mean Error (ME):
    • Definition: \text{ME} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)
    • Interpretation: the sign indicates bias; with the error defined as \hat{y}_i - y_i, a positive ME means the model overpredicts on average and a negative ME means it underpredicts.
    • Mean Absolute Deviation (MAD):
    • Definition: \text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|
    • Mean Percentage Error (MPE):
    • Definition: \text{MPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{\hat{y}_i - y_i}{y_i}
    • Mean Absolute Percentage Error (MAPE):
    • Definition: \text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right|
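  • These definitions translate directly into code; a minimal sketch (Python, numpy), with the error taken as predicted minus actual to match the formulas above, and hypothetical values:

        import numpy as np

        def error_measures(actual, predicted):
            e = predicted - actual            # error defined as predicted minus actual
            return {
                "RMSE": np.sqrt(np.mean(e ** 2)),
                "ME":   np.mean(e),
                "MAD":  np.mean(np.abs(e)),
                "MPE":  np.mean(e / actual) * 100,          # percentage (requires nonzero actuals)
                "MAPE": np.mean(np.abs(e / actual)) * 100,  # percentage (requires nonzero actuals)
            }

        actual    = np.array([120.0, 95.0, 210.0, 150.0])   # hypothetical actual values
        predicted = np.array([130.0, 90.0, 200.0, 165.0])   # hypothetical predictions
        print(error_measures(actual, predicted))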
  • Example—FashionTech model prediction comparison (two models)
    • Prediction data set: 200 observations with actual values ActVal and two model predictions PredVal1, PredVal2.
    • Model 1 vs Model 2 reported metrics:
    • RMSE: Model 1 = 171.3489, Model 2 = 174.1758
    • ME: Model 1 = 11.2530, Model 2 = 12.0480
    • MAD: Model 1 = 115.1650, Model 2 = 117.9920
    • MPE: Model 1 = −2.05%, Model 2 = −2.08%
    • MAPE: Model 1 = 15.51%, Model 2 = 15.95%
    • Interpretation: Model 1 performs slightly better on every reported measure, so it would be preferred.
  • Summary of predictive performance framework
    • Use a combination of partitioning, cross-validation, and held-out test data to obtain unbiased estimates of predictive performance.
    • Compare models using confusion-matrix-based metrics (accuracy, misclassification rate, sensitivity/recall, specificity, precision) and threshold-dependent measures.
    • Use ROC/AUC, lift curves, and decile lifts to assess ranking quality and ability to identify target cases at scale.
    • Consider the impact of class imbalance and misclassification costs when choosing evaluation metrics and cutoff thresholds.

11.4 Principal Component Analysis (PCA)

  • Purpose of PCA:
    • PCA reduces dimensionality of data by projecting onto principal components that capture the maximum variance in the data.
  • Visualization example (text alternative):
    • The slide's plot overlays two arrows, PC1 and PC2, on a scatter of X1 versus X2; PC1 points along the direction of greatest spread (roughly the direction of increasing X1), and PC2 is roughly perpendicular to it.
    • Most of the data concentrates around X1 values of 10–15 and X2 values of 1.5–5.
  • Interpretation of PCA plot:
    • PC1 captures the direction of greatest variance along an axis roughly aligned with increasing X1.
    • PC2 captures the second most variance, oriented roughly perpendicular to PC1, adding information about the spread in the X2 direction.
    • The concentration of points around the center with spread along the PC axes suggests that the data can be represented with a smaller number of components without substantial loss of information.
  • Practical takeaway:
    • PCA helps reduce dimensionality for visualization, noise reduction, and as a preprocessing step for modeling when many correlated features are present.
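  • Minimal PCA sketch (Python; assumes scikit-learn, with synthetic correlated data standing in for the slide's X1/X2 scatter; standardizing first is a common, assumed preprocessing choice):

        import numpy as np
        from sklearn.preprocessing import StandardScaler
        from sklearn.decomposition import PCA

        # Two correlated features, roughly mimicking the slide's X1/X2 ranges
        rng = np.random.default_rng(1)
        x1 = rng.normal(12.5, 1.5, 100)
        x2 = 0.3 * x1 + rng.normal(0, 0.5, 100)
        X = np.column_stack([x1, x2])

        # Standardize, then project onto the principal components
        X_std  = StandardScaler().fit_transform(X)
        pca    = PCA(n_components=2).fit(X_std)
        scores = pca.transform(X_std)   # data expressed in PC1/PC2 coordinates
        print("variance explained by PC1, PC2:", pca.explained_variance_ratio_.round(3))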

Connections to broader data mining theory and practice

  • Data mining process and evaluation framework align with foundational principles:
    • Validation and test data are essential for estimating generalization performance.
    • Cross-validation provides robust estimates when data are limited.
    • Handling class imbalance via oversampling (in training) improves model usefulness across target classes.
    • Model selection should balance bias (underfitting) and variance (overfitting) by selecting an optimal model complexity.
  • Practical evaluation tools:
    • Confusion matrix-based metrics summarize predictive performance for classification tasks.
    • Threshold tuning (cutoff adjustment) provides a way to align model predictions with business costs and risk preferences.
    • ROC/AUC, lift charts, and decile-wise lift complement accuracy-based metrics by focusing on ranking and targeting performance.
  • Real-world relevance and ethics/practical implications:
    • Proper validation avoids deploying models that look good on historical data but fail in production, reducing harm in decision-making contexts.
    • Handling class imbalance ethically ensures minority groups receive attention in predictive systems (e.g., fraud, rare events).
    • Threshold selection should reflect cost-sensitive values and fairness considerations where applicable.

Equations and key formulas (LaTeX)

  • Overall accuracy (classification):
    \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
  • Misclassification (error) rate:
    \text{Misclassification Rate} = 1 - \text{Accuracy} = \frac{FP + FN}{TP + TN + FP + FN}
  • Sensitivity (Recall):
    \text{Sensitivity} = \frac{TP}{TP + FN}
  • Specificity:
    \text{Specificity} = \frac{TN}{TN + FP}
  • Precision (Positive Predictive Value):
    \text{Precision} = \frac{TP}{TP + FP}
  • Confusion matrix (for reference):
    • TP: true positives
    • FP: false positives
    • TN: true negatives
    • FN: false negatives
  • Threshold / cutoff for class membership:
    • If score = predicted probability of class 1, classify as 1 if score ≥ cutoff, else 0.
  • Lift (top-p% of cases):
    • Let S(p) be the set of top p% cases by predicted score.
    • Lift(p) is the ratio of positives found in S(p) to those that would be found by random selection:
      \text{Lift}(p) = \frac{\text{number of positives in } S(p)}{p \times \text{total number of positives}}
  • ROC and AUC (conceptual):
    • ROC curve: plot sensitivity vs 1 − specificity across cutoffs.
    • AUC: \text{AUC} = \text{area under the ROC curve} \in [0, 1]
  • PCA (conceptual):
    • PCA seeks projection onto principal components that maximize captured variance; the first principal component (PC1) explains the largest variance, the second (PC2) explains the next largest, subject to being orthogonal to PC1.