Module Recap and Exam Preparation Notes
Recap of Module Topics
- Today's lecture summarizes all topics covered in the module.
- Students are encouraged to complete the homework before Thursday's discussion.
- Students will be shown a past exam paper and will work on two specific questions from it before Thursday.
Session Structure
- The upcoming session on Thursday will be student-led to ensure engagement and understanding.
- This format depends on students putting effort into their preparation.
Overview of Medical Data
- Medical data is crucial in healthcare delivery, including in primary care, hospitals, and referrals.
- Functions of medical data:
  - Categorizing problems.
  - Understanding disease development and spread.
  - Supporting decisions on treatments.
- Types of medical data:
  - Narrative text data (historically dominant).
  - Numerical measurements (e.g., blood pressure, glucose levels).
  - Signal recordings (e.g., EEG).
  - Images (e.g., MRI scans).
Difference Between Data, Information, and Knowledge
- Data: Raw entities or values (e.g., temperature, medical history).
- Information: Interpretation of data (e.g., determining if a temperature reading is high).
- Knowledge: Insights gained through reasoning, studies, and comparisons of data (e.g., the link between high blood sugar levels and diabetes risk).
Data Mining Process
- Definition: Extracting meaningful information from data.
- Knowledge Discovery in Databases (KDD) process:
  - Stages, in order: raw data collection, pre-processing, data mining (e.g., clustering or classification), and evaluation.
- Emphasis on structured data: rows (records) and columns (attributes).
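The KDD stages listed above can be sketched as a small pipeline over row-and-column data. This is only an illustration: the records, the attribute names, the mean-fill cleaning rule, and the glucose cutoff of 7.0 are all hypothetical, and the "mining" step is a deliberately trivial stand-in for a real clustering or classification algorithm.

```python
from statistics import mean

# Hypothetical raw data: each dict is a row (record), each key a column (attribute).
raw_records = [
    {"id": 1, "systolic_bp": 118, "glucose": 5.1},
    {"id": 2, "systolic_bp": None, "glucose": 7.9},  # missing measurement
    {"id": 3, "systolic_bp": 152, "glucose": 8.4},
    {"id": 4, "systolic_bp": 121, "glucose": 5.4},
]

def preprocess(records, attribute):
    """Pre-processing: fill missing values of one attribute with the observed mean."""
    observed = [r[attribute] for r in records if r[attribute] is not None]
    fill = mean(observed)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in records
    ]

def mine(records, attribute, cutoff):
    """Data mining (toy stand-in): split record ids into two groups by a cutoff."""
    groups = {"below": [], "at_or_above": []}
    for r in records:
        key = "below" if r[attribute] < cutoff else "at_or_above"
        groups[key].append(r["id"])
    return groups

def evaluate(groups):
    """Evaluation: report group sizes so the result can be inspected."""
    return {name: len(ids) for name, ids in groups.items()}

clean = preprocess(raw_records, "systolic_bp")
groups = mine(clean, "glucose", cutoff=7.0)  # cutoff is an assumed value
summary = evaluate(groups)
```

Each stage takes the output of the previous one, which mirrors the order of the KDD process: collection, pre-processing, mining, evaluation.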
Data Mining Tasks
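- Tasks are either descriptive (finding structure in the data, e.g., clustering) or predictive (estimating an unknown outcome, e.g., classification).
- Classification: Learning patterns from labeled examples to predict outcomes.
  - Training set: Records with known outcomes, used to build the model.
  - Test set: Unseen records whose outcomes are predicted by the model and compared with the true outcomes.
  - Evaluation: A confusion matrix counts true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), from which accuracy and related metrics are computed.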
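As a minimal sketch of the training/test workflow, the example below fits a one-attribute threshold classifier on a training set and tallies a confusion matrix on a test set. The data values are invented for illustration, and the midpoint-of-class-means rule is an assumed toy model, not a method from the module.

```python
from statistics import mean

# Hypothetical (glucose value, has_condition) pairs; 1 = condition present.
training_set = [(4.8, 0), (5.2, 0), (5.6, 0), (7.4, 1), (8.1, 1), (9.0, 1)]
test_set = [(5.0, 0), (6.9, 0), (6.1, 1), (8.5, 1), (5.5, 0)]

def fit_threshold(examples):
    """Build the model: cutoff at the midpoint between the two class means."""
    negative_mean = mean(x for x, y in examples if y == 0)
    positive_mean = mean(x for x, y in examples if y == 1)
    return (negative_mean + positive_mean) / 2

def predict(x, threshold):
    """Predict 1 (positive) when the value reaches the learned cutoff."""
    return 1 if x >= threshold else 0

def confusion_matrix(examples, threshold):
    """Compare predictions with the known outcomes and count each case."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for x, actual in examples:
        predicted = predict(x, threshold)
        if predicted == 1 and actual == 1:
            counts["TP"] += 1
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts

threshold = fit_threshold(training_set)         # learned from known outcomes only
counts = confusion_matrix(test_set, threshold)  # evaluated on unseen records
```

The test records are never used to choose the threshold, so the counts reflect how the model behaves on data it has not seen.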
Model Evaluation
- Accuracy calculation:
  Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
- Sensitivity (true positive rate) and specificity (true negative rate) are also calculated to assess model performance, especially with imbalanced classes.
- Trade-offs: Adjusting the decision threshold shifts the balance between sensitivity and specificity; raising one typically lowers the other.
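The three metrics can be computed directly from the confusion-matrix counts. The sketch below uses made-up counts and scores to show two points from this section: accuracy can look high on imbalanced classes while sensitivity is zero, and moving the threshold trades sensitivity against specificity. The function names are illustrative, and the code assumes each class has at least one example so that no denominator is zero.

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity

# Imbalanced case (made-up counts): 10 of 1000 patients have the condition
# and the model predicts "negative" for everyone.
all_negative = metrics(tp=0, fp=0, tn=990, fn=10)

# Hypothetical (model score, true label) pairs for the threshold trade-off.
scored = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 0), (0.7, 1), (0.9, 1)]

def metrics_at(threshold, scored_examples):
    """Classify as positive when score >= threshold, then compute the metrics."""
    tp = sum(1 for s, y in scored_examples if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored_examples if s >= threshold and y == 0)
    tn = sum(1 for s, y in scored_examples if s < threshold and y == 0)
    fn = sum(1 for s, y in scored_examples if s < threshold and y == 1)
    return metrics(tp, fp, tn, fn)

low_threshold = metrics_at(0.35, scored)   # catches every positive, one false alarm
high_threshold = metrics_at(0.65, scored)  # no false alarms, misses one positive
```

In the imbalanced case the accuracy is 0.99 even though every true case is missed, which is why sensitivity and specificity are reported alongside it.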
Importance of Interpretability in Classification Algorithms
- White box models (e.g., decision trees) vs. black box models (e.g., neural networks).
- White box models are preferred in healthcare because clinicians can follow how a prediction was reached.
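To illustrate what makes a decision tree a white box, the sketch below writes a tiny tree as explicit rules and returns the path taken alongside the prediction. The attributes, cutoffs, and risk labels are hypothetical values chosen only for the example; a real tree would be learned from data.

```python
def classify_with_explanation(glucose_mmol_l, bmi):
    """Return (label, path), where path lists the rules that led to the label."""
    path = []
    if glucose_mmol_l >= 7.0:  # assumed cutoff, for illustration only
        path.append("glucose >= 7.0")
        return "higher risk", path
    path.append("glucose < 7.0")
    if bmi >= 30:  # assumed cutoff, for illustration only
        path.append("bmi >= 30")
        return "moderate risk", path
    path.append("bmi < 30")
    return "lower risk", path

label, path = classify_with_explanation(glucose_mmol_l=5.2, bmi=32)
```

Here `path` records each test that was applied, so a clinician can read off why the label was assigned. A neural network offers no comparable rule trace.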
Clustering Techniques
- Definition: Grouping similar data points by their characteristics, with no target variable (unsupervised learning).
- Examples:
  - Hierarchical clustering (building dendrograms).
  - K-means and PAM (Partitioning Around Medoids) clustering (distance-based clustering).
  - Fuzzy C-means (allowing a point to belong to more than one cluster).
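As a minimal sketch of distance-based clustering, the example below implements the basic k-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. The data points and the fixed starting centroids are assumptions made so the result is reproducible; practical implementations usually choose starting centroids randomly and run several restarts.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Basic k-means on tuples of coordinates; returns (centroids, clusters)."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-attribute records forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

No outcome labels are used anywhere, which is what makes this unsupervised. PAM follows the same assign-and-update idea but restricts each cluster center to an actual data point (a medoid), and fuzzy C-means replaces the hard assignment with a degree of membership in every cluster.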
Decision Making Under Uncertainty
- Introduction to Bayes' Theorem for updating probabilities based on evidence.
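- Combines the prior probability of disease with a test's sensitivity and specificity to give the probability of disease after the test result.
- Applied across multiple tests, with each result updating the probability further, to improve diagnostic accuracy.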
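The update described above can be sketched numerically. For a positive result, Bayes' Theorem gives P(disease | positive) = sensitivity × prior / (sensitivity × prior + (1 − specificity) × (1 − prior)). The prior, sensitivity, and specificity values below are made up for illustration. Chaining two tests this way assumes the results are conditionally independent given the true disease status.

```python
def posterior(prior, sensitivity, specificity, test_positive=True):
    """Probability of disease after one test result, by Bayes' Theorem."""
    if test_positive:
        true_pos = sensitivity * prior                # P(positive and disease)
        false_pos = (1 - specificity) * (1 - prior)   # P(positive and no disease)
        return true_pos / (true_pos + false_pos)
    false_neg = (1 - sensitivity) * prior             # P(negative and disease)
    true_neg = specificity * (1 - prior)              # P(negative and no disease)
    return false_neg / (false_neg + true_neg)

prior = 0.01  # assumed prevalence of 1%
after_first = posterior(prior, sensitivity=0.90, specificity=0.95)

# Multi-test scenario: the posterior from the first test becomes the prior
# for the second (assumes the two tests are conditionally independent).
after_second = posterior(after_first, sensitivity=0.90, specificity=0.95)
```

With these assumed numbers, one positive result raises the probability from 1% to roughly 15%, and a second positive result raises it to roughly 77%. A single positive test on a rare condition is therefore far from conclusive.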
Key Takeaways and Homework
- Students are encouraged to review the materials, especially the case studies and examples covered in class.
- Homework:
  - Download and review the 2021 past exam paper from Moodle.
  - Attempt questions three and four for discussion in Thursday's session.