Data Mining Overview
Data Mining Definition: The process of discovering patterns and knowledge from large amounts of data.
Types of Data Mining Tasks:
Descriptive Tasks: Provide an overview of the data by summarizing its main characteristics.
Predictive Tasks: Predict unknown or future values based on known past information.
Model Learning:
Models learn from a training set of data.
The model is assessed on its performance on a separate test set that was not used for training.
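A minimal sketch of the train/test idea, using only the standard library. The helper name train_test_split, the 25% test fraction, and the integer record IDs are assumptions made for illustration, not part of these notes.

```python
import random

def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data: 20 record IDs standing in for patient records.
records = list(range(20))
train, test = train_test_split(records)
print(len(train), len(test))  # 15 5
```

The model would be fitted on train only; test is held back so the assessment reflects data the model has not seen.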
Feature Selection and Construction
Attribute/Feature Selection: Process of selecting the most relevant variables for model building.
Feature Construction: Creating new features from existing ones to improve model performance.
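One way to picture both steps, as a sketch only: the field names (weight_kg, height_m, age), the constructed bmi feature, and the choice of which features to keep are hypothetical examples rather than anything prescribed here.

```python
def add_bmi(record):
    """Feature construction: return a copy of the record with a new 'bmi' feature."""
    enriched = dict(record)
    enriched["bmi"] = record["weight_kg"] / record["height_m"] ** 2
    return enriched

def select_features(record, keep):
    """Feature selection: keep only the named features."""
    return {name: record[name] for name in keep}

# Hypothetical patient record.
patient = {"id": 17, "weight_kg": 81.0, "height_m": 1.80, "age": 54}
row = select_features(add_bmi(patient), keep=["age", "bmi"])
print(row)
```

Here the raw weight and height are replaced by a single constructed feature, and the identifier is dropped because it carries no predictive information.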
Model Evaluation
Accuracy:
Definition: The proportion of correct predictions made by the model.
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
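While common, accuracy alone is often insufficient for healthcare applications: when a disease is rare, a model that labels every patient as healthy can achieve high accuracy while missing every diseased patient.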
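A small worked example of why accuracy can mislead. The numbers are invented for illustration: 1000 patients, 10 of whom have the disease, and a trivial "model" that predicts healthy for everyone.

```python
# Hypothetical screening data: 1 = diseased, 0 = healthy.
actual = [1] * 10 + [0] * 990
# A trivial "model" that predicts every patient is healthy.
predicted = [0] * 1000

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
diseased_found = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(accuracy)        # 0.99
print(diseased_found)  # 0
```

The model is 99% accurate yet detects none of the diseased patients, which is why additional measures are needed.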
Confusion Matrix: A table used to evaluate the performance of a classification model by comparing predicted and actual results.
Consists of 4 cells:
True Positives (TP): Cases predicted as positive that are actually positive.
True Negatives (TN): Cases predicted as negative that are actually negative.
False Positives (FP): Cases predicted as positive that are actually negative (Type I error).
False Negatives (FN): Cases predicted as negative that are actually positive (Type II error).
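The four cells can be counted directly from paired labels. This is a sketch under the assumption that labels are encoded as 1 (positive) and 0 (negative); the function name confusion_counts and the sample labels are made up for the example.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 0 and a == 0:
            tn += 1
        elif p == 1 and a == 0:
            fp += 1
        else:  # predicted negative, actually positive
            fn += 1
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical labels for eight cases.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'TN': 3, 'FP': 1, 'FN': 1}

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (counts["TP"] + counts["TN"]) / sum(counts.values())
print(accuracy)  # 0.75
```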
Sensitivity (True Positive Rate):
Definition: The likelihood that a diseased patient has a positive test.
Can be expressed as conditional probability: P(positive test | disease is present)
Formula: Sensitivity = TP / (TP + FN)
Specificity (True Negative Rate):
Definition: The likelihood that a healthy patient has a negative test.
Can be expressed as conditional probability: P(negative test | disease is absent)
Formula: Specificity = TN / (TN + FP)
Measure of Test Performance:
A desirable test has both:
High TPR (sensitivity)
High TNR (specificity)
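The two formulas above translate directly into code. The counts below are hypothetical (100 diseased and 900 healthy patients) and serve only to show the arithmetic.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical counts: 100 diseased patients, 900 healthy patients.
tp, fn = 90, 10
tn, fp = 855, 45

print(sensitivity(tp, fn))  # 0.9
print(specificity(tn, fp))  # 0.95
```

Sensitivity uses only the diseased patients (TP and FN), and specificity uses only the healthy ones (TN and FP), so neither is affected by how rare the disease is in the sample.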
Decision Threshold in Classification Models
Continuous Output: Most prediction models give continuous values that need a threshold to define positive or negative cases.
Threshold Setting Implications: The values of sensitivity and specificity are dependent on the particular cut-off value or threshold chosen to distinguish normal and abnormal results.
Threshold Trade-offs:
Reducing false positives (a stricter threshold) increases specificity but tends to increase false negatives, which decreases sensitivity.
Reducing false negatives (a more lenient threshold) increases sensitivity but tends to increase false positives, which decreases specificity.
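The trade-off can be seen by sweeping the threshold over a fixed set of scores. This sketch assumes that a higher score means "more likely positive" and that a case is called positive when its score is at or above the threshold; the scores and labels are invented.

```python
def rates_at_threshold(scores, actual, threshold):
    """Return (sensitivity, specificity) when score >= threshold means positive."""
    tp = sum(s >= threshold and a == 1 for s, a in zip(scores, actual))
    fn = sum(s < threshold and a == 1 for s, a in zip(scores, actual))
    tn = sum(s < threshold and a == 0 for s, a in zip(scores, actual))
    fp = sum(s >= threshold and a == 0 for s, a in zip(scores, actual))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical model scores and true labels (1 = diseased, 0 = healthy).
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
actual = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]

for threshold in (0.25, 0.50, 0.75):
    sens, spec = rates_at_threshold(scores, actual, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
# threshold=0.25 sensitivity=0.80 specificity=0.40
# threshold=0.50 sensitivity=0.60 specificity=0.80
# threshold=0.75 sensitivity=0.40 specificity=1.00
```

As the threshold rises, specificity climbs from 0.40 to 1.00 while sensitivity falls from 0.80 to 0.40.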
Clinical Context for Threshold Selection:
If the disease is serious and life-saving therapy is available, minimize false negatives (higher sensitivity preferred).
If the disease is not serious and the therapy has risks, minimize false positives (higher specificity preferred).
Example: Home pregnancy tests have high specificity ("If the test result is positive, you're almost certainly pregnant") but lower sensitivity ("If the test result is negative, you may still be pregnant").
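One possible way to turn a clinical preference into a threshold, offered as an illustration rather than a standard procedure: for a serious disease, fix a minimum acceptable sensitivity and take the strictest threshold that still meets it. The function pick_threshold, the target values, and the data are all hypothetical, and the sketch assumes a case is positive when its score is at or above the threshold.

```python
def pick_threshold(scores, actual, min_sensitivity):
    """Return the highest observed score that, used as threshold, meets the target sensitivity.

    Assumes at least one positive case (label 1) is present.
    """
    n_pos = sum(actual)
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and a == 1 for s, a in zip(scores, actual))
        if tp / n_pos >= min_sensitivity:
            return threshold
    # Only reached if the target exceeds 1.0; fall back to the most lenient threshold.
    return min(scores)

# Hypothetical model scores and true labels (1 = diseased, 0 = healthy).
scores = [0.95, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
actual = [1,    1,    0,    1,    1,    0,    0,    1,    0,    0]

print(pick_threshold(scores, actual, min_sensitivity=0.8))  # 0.45
print(pick_threshold(scores, actual, min_sensitivity=1.0))  # 0.2
```

Demanding that no diseased patient is missed pushes the threshold down to 0.2, at the cost of more false positives. The mirror-image rule, fixing a minimum specificity, would suit a mild disease with a risky therapy.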
Model Interpretability
Black Box Models:
Definition: Models that are not easily interpretable by humans.
Examples: Artificial Neural Networks, Support Vector Machines
Disadvantage: Difficult to validate the reasoning behind predictions
White Box Models:
Definition: Models that provide clear reasoning for predictions.
Examples: Decision Trees
Advantage: Predictions can be justified and validated by domain experts (e.g., medical doctors)
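To make the white box idea concrete, here is a tiny hand-written decision tree that returns its reasoning along with its prediction. The feature names, cut-off values, and risk labels are hypothetical and chosen only for illustration; they are not clinical guidance.

```python
def classify_with_explanation(patient):
    """Apply a small hand-written decision tree; return (label, list of rules that fired)."""
    path = []
    if patient["glucose"] >= 126:
        path.append("glucose >= 126")
        if patient["bmi"] >= 30:
            path.append("bmi >= 30")
            return "high risk", path
        path.append("bmi < 30")
        return "medium risk", path
    path.append("glucose < 126")
    return "low risk", path

# Hypothetical patient.
label, path = classify_with_explanation({"glucose": 140, "bmi": 32})
print(label)  # high risk
print(path)   # ['glucose >= 126', 'bmi >= 30']
```

Because every prediction comes with the exact rules that produced it, a domain expert can check each step against their own knowledge.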
Importance of Explainability in Healthcare:
In healthcare systems, model predictions need justification and validation by medical professionals.
Example of importance: A frequently retold anecdote describes a military tank classification system that appeared accurate during testing but had actually learned to distinguish weather conditions (sunny vs. overcast days) rather than tanks. With an interpretable model, a flaw like this would have been much easier to spot.