Lecture 7
Data Mining for eHealth - Comprehensive Notes
Data Mining Overview
Data Mining: "The process of discovering patterns in data. The process must be automatic or (more usually) semi-automatic. The patterns discovered must be meaningful." (Witten and Frank, 2005)
Data mining focuses on extracting meaningful information from medical data
Various types of medical data can be collected in healthcare environments
There's a growing abundance of data through connected devices (IoT)
A knowledge gap exists between data availability and systems for data analytics
Knowledge Discovery in Databases (KDD) Process
KDD (Knowledge Discovery in Databases): A structured process that involves data collection, preprocessing, data mining techniques application, and evaluation/interpretation to extract knowledge from data.
The process includes:
Data Collection: Gathering relevant datasets
Preprocessing:
Dealing with missing values
Creating additional attributes
Performing attribute selection
Data Mining Techniques: Applying algorithms to extract patterns
Evaluation and Interpretation: Analyzing results to gain knowledge
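As an illustration of the "Dealing with missing values" preprocessing step, here is a minimal sketch that fills gaps with the mean of the observed values. Mean imputation is only one possible strategy and is not named in the lecture; the records, the attribute name tumor_size_mm, and the helper impute_mean are all hypothetical.

```python
from statistics import mean

# Hypothetical patient records; None marks a missing value.
records = [
    {"age": 52, "tumor_size_mm": 20.0},
    {"age": 61, "tumor_size_mm": None},
    {"age": 47, "tumor_size_mm": 30.0},
]

def impute_mean(rows, attribute):
    """Return copies of rows where missing values of `attribute`
    are replaced by the mean of the observed values."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]

cleaned = impute_mean(records, "tumor_size_mm")
```

The original records are left untouched, so the raw data stays available if a different imputation strategy is tried later.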
Data Structure in Healthcare
Structured Data:
Rows = Records/Examples/Instances (e.g., patient data)
Columns = Features/Attributes (e.g., age, symptoms)
Example attributes from breast cancer study: tumor location, size, age at diagnosis, irradiated status, recurrence
Attribute Selection:
In theory: more attributes should result in more accurate patterns
In practice: irrelevant data may "confuse" algorithms
Selection methods:
Manual selection: using domain knowledge from experts
Automatic selection: reduces dimensionality by deleting unsuitable attributes based on heuristics
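One simple heuristic for automatic selection is to drop attributes that barely vary across records, since a constant attribute cannot help distinguish examples. The sketch below assumes numeric attributes; the variance threshold, the sample rows, and the helper select_attributes are illustrative assumptions, not a method prescribed in the lecture.

```python
from statistics import pvariance

def select_attributes(rows, min_variance=1e-9):
    """Keep attributes whose values vary across records;
    drop (near-)constant ones."""
    names = list(rows[0].keys())
    return [n for n in names if pvariance([r[n] for r in rows]) > min_variance]

# Hypothetical records: hospital_id is identical for every patient.
rows = [
    {"age": 52, "tumor_size_mm": 20, "hospital_id": 1},
    {"age": 61, "tumor_size_mm": 35, "hospital_id": 1},
    {"age": 47, "tumor_size_mm": 30, "hospital_id": 1},
]
kept = select_attributes(rows)
```

Here hospital_id is removed because it takes the same value in every record, which reduces the dimensionality from three attributes to two.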
Attribute Construction:
Creating new attributes to make regularities more apparent
Uses simple relational rules to reduce complexity in datasets
Can achieve dimensionality reduction
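A small sketch of attribute construction: two raw attributes (weight and height) are replaced by one derived attribute (BMI), which makes the relevant regularity explicit and reduces the number of columns. The attribute names and the helper add_bmi are hypothetical.

```python
def add_bmi(record):
    """Return a copy of the record where weight and height are
    replaced by a single derived BMI attribute (kg / m^2)."""
    bmi = record["weight_kg"] / record["height_m"] ** 2
    rest = {k: v for k, v in record.items()
            if k not in ("weight_kg", "height_m")}
    return {**rest, "bmi": round(bmi, 1)}

patient = {"age": 52, "weight_kg": 81.0, "height_m": 1.80}
result = add_bmi(patient)
```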
Data Mining Tasks
Descriptive Data Mining
Clustering: The process of identifying groups where data points within a cluster are similar while those across clusters are dissimilar.
Finds finite sets of categories (clusters) to describe data
Groups examples so similarity within clusters is maximized and similarity between clusters is minimized
Applications:
Market basket analysis (items bought together)
Healthcare: identifying patients with similar treatment needs
Example: Grouping patients based on LDL, HDL, and BMI measurements
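The LDL/HDL/BMI example can be sketched with k-means, one common clustering algorithm (the lecture does not say which algorithm is used, so this choice is an assumption). The measurements and starting centroids below are made up, and the attributes are left unscaled for brevity; in practice they would usually be normalized first so that no single attribute dominates the distance.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical (LDL mg/dL, HDL mg/dL, BMI) measurements.
patients = [(100, 60, 22), (105, 58, 23), (170, 38, 31), (165, 40, 30)]
clusters, centroids = k_means(patients, centroids=[patients[0], patients[2]])
```

With these toy values the first two patients end up in one cluster and the last two in the other, matching the intuition of a lower-risk and a higher-risk group.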
Predictive Data Mining
Regression: Finding a model (function) that maps a given input (attributes values) to a numeric prediction.
Often applied to data with a time component, which makes it useful for forecasting
Applications:
Forecasting economy growth based on market indicators
Predicting patient recovery trajectory after injury or surgery
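A minimal regression sketch, assuming the simplest possible model: a straight line fitted by ordinary least squares. The recovery data (weeks since surgery against a mobility score) is invented for illustration, and fit_line is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

# Hypothetical data: weeks since surgery vs. a mobility score.
weeks = [1, 2, 3, 4]
score = [30, 40, 50, 60]
slope, intercept = fit_line(weeks, score)

# Numeric prediction for an unseen input (week 6).
predicted_week_6 = slope * 6 + intercept
```

The fitted function maps an input attribute value to a numeric prediction, which is exactly what distinguishes regression from classification.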
Classification: Finding a model that is able to predict the value of the class attribute of an example based on the values of a set of attributes.
Each record belongs to a predefined class
Each example consists of predictor attributes and a class attribute
Requires training sets for model building and test sets for evaluation
Goal: discover relationships that allow prediction of class values
Process flow:
Use training set (known-class examples) to build a model
Apply model to test set (examples whose class is withheld from the model) to predict classes, then compare the predictions with the actual classes
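The training/test flow can be sketched with a 1-nearest-neighbour classifier, chosen here only because it is short; it is not one of the algorithms covered in the lecture. All examples are invented. The test examples carry their true class so that predictions can be scored, but the model only ever sees their predictor attributes.

```python
from math import dist

def predict_1nn(training_set, attributes):
    """Predict the class of the closest training example."""
    _, label = min(training_set, key=lambda ex: dist(ex[0], attributes))
    return label

# Hypothetical examples: (tumor size in mm, age at diagnosis) -> recurrence.
training_set = [((10, 45), "no"), ((12, 50), "no"),
                ((35, 62), "yes"), ((40, 58), "yes")]
test_set = [((11, 48), "no"), ((38, 60), "yes")]

# Predict from the predictor attributes only, then score against the true class.
predictions = [predict_1nn(training_set, attrs) for attrs, _ in test_set]
accuracy = sum(p == actual
               for p, (_, actual) in zip(predictions, test_set)) / len(test_set)
```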
Classification Algorithms in Healthcare
Artificial Neural Networks (ANNs):
Inspired by complex neural connections in biological systems
Built from densely interconnected simple units
Each unit takes real-valued inputs and produces a single real-valued output
Well-suited for noisy, complex sensor data often found in healthcare contexts
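A single unit of such a network can be sketched as a weighted sum followed by an activation function. The sigmoid activation and the specific weights are assumptions made for illustration; the lecture does not specify them.

```python
from math import exp

def unit_output(inputs, weights, bias):
    """One unit: weighted sum of real-valued inputs passed through
    a sigmoid, giving a single real-valued output in (0, 1)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + exp(-total))

# Hypothetical inputs and weights; the weighted sum here is exactly 0.
out = unit_output([0.5, -1.0], [2.0, 1.0], 0.0)
```

A full network connects many such units in layers, with the outputs of one layer serving as inputs to the next.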
Support Vector Machines (SVMs):
In its basic form, a binary linear classifier
Goal: find the maximum margin hyperplane (plane with greatest separation between classes)
Identifies support vectors to define decision boundaries
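The sketch below does not train an SVM; it assumes a hyperplane (w, b) has already been found and only shows how that hyperplane classifies points, which training points act as support vectors, and how wide the margin is. The toy points and the values of w and b are hypothetical, and the hyperplane uses the usual scaling in which support vectors score exactly +1 or -1.

```python
from math import sqrt

def decision(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))

# Hypothetical training points with classes -1 / +1.
points = [((2.0, 1.0), -1), ((1.0, 3.0), -1), ((4.0, 1.0), 1), ((5.0, 2.0), 1)]

# Assumed, already-trained hyperplane: the vertical line x1 = 3.
w, b = (1.0, 0.0), -3.0

predictions = [1 if decision(w, b, x) > 0 else -1 for x, _ in points]
# Support vectors are the points lying exactly on the margin boundaries.
support_vectors = [x for x, _ in points
                   if abs(abs(decision(w, b, x)) - 1.0) < 1e-9]
width = margin_width(w)
```

Only the two support vectors determine the boundary; moving the other points further away from it would leave the hyperplane unchanged.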
Decision Trees:
Comprehensible graphical representation of a classification model
Internal nodes correspond to attribute tests (decision nodes)
Leaf nodes correspond to predicted class values
Classification process:
Tree is traversed top-down from root to leaf
Branches selected according to attribute test outcomes
Class value at leaf node is the predicted class
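The traversal above can be sketched with a tree stored as nested dictionaries. The tree itself is invented to show the mechanics and carries no medical meaning; the node layout (keys attribute, branches, class) is an assumption of this sketch.

```python
# Hypothetical tree: internal nodes test one attribute,
# leaf nodes hold a predicted class value.
tree = {
    "attribute": "irradiated",
    "branches": {
        "yes": {"class": "recurrence"},
        "no": {
            "attribute": "tumor_size",
            "branches": {
                "small": {"class": "no recurrence"},
                "large": {"class": "recurrence"},
            },
        },
    },
}

def classify(node, example):
    """Walk top-down from the root to a leaf, following the branch
    that matches the outcome of each attribute test."""
    while "class" not in node:
        outcome = example[node["attribute"]]
        node = node["branches"][outcome]
    return node["class"]

label = classify(tree, {"irradiated": "no", "tumor_size": "small"})
```

Because every prediction corresponds to one readable path of attribute tests, the reasoning behind it can be shown directly to a clinician.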
Rule Induction: mentioned in the lecture but not covered in detail
Importance of Interpretability and Trustworthiness
Crucial for adoption of predictive systems in healthcare
Need for clear reasoning behind model predictions to build trust with healthcare professionals
Decision trees offer better interpretability compared to "black box" models like neural networks
Additional Information
The "Bioinformatics Knowledge Gap" chart shows the increasing disparity between available data and our ability to extract knowledge from it
The lecture notes that specific methods for evaluating classification model performance will be discussed in a future class
Attribute maintenance is important (e.g., storing date of birth instead of age: a stored age goes out of date, whereas age can always be recomputed from date of birth)
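To illustrate the date-of-birth point, a short sketch that derives age on demand from a stored date of birth. The helper age_on and the dates are hypothetical.

```python
from datetime import date

def age_on(date_of_birth, reference):
    """Age in whole years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (
        date_of_birth.month, date_of_birth.day)
    return reference.year - date_of_birth.year - (0 if had_birthday else 1)

# The day before and the day of the 54th birthday.
age_before = age_on(date(1970, 6, 15), date(2024, 6, 14))
age_after = age_on(date(1970, 6, 15), date(2024, 6, 15))
```

The stored attribute never needs updating; only the reference date changes.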