HS

Lecture 7

Data Mining for eHealth - Comprehensive Notes

Data Mining Overview

  • Data Mining: "The process of discovering patterns in data. The process must be automatic or (more usually) semi-automatic. The patterns discovered must be meaningful." (Witten and Frank, 2005)

  • Data mining focuses on extracting meaningful information from medical data

  • Various types of medical data can be collected in healthcare environments

  • There's a growing abundance of data through connected devices (IoT)

  • A knowledge gap exists between the amount of data available and the capacity of analytics systems to extract knowledge from it

Knowledge Discovery in Databases (KDD) Process

  • KDD (Knowledge Discovery in Databases): A structured process that involves data collection, preprocessing, data mining techniques application, and evaluation/interpretation to extract knowledge from data.

The process includes:

  1. Data Collection: Gathering relevant datasets

  2. Preprocessing:

    • Dealing with missing values

    • Creating additional attributes

    • Performing attribute selection

  3. Data Mining Techniques: Applying algorithms to extract patterns

  4. Evaluation and Interpretation: Analyzing results to gain knowledge
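As an illustration of the preprocessing step, here is a minimal sketch of one common way to deal with missing values: replacing them with the mean of the observed values. The records, the attribute name `tumor_size_mm`, and the helper `impute_mean` are hypothetical and chosen only for this example; the lecture does not prescribe a specific imputation method.

```python
from statistics import mean

# Hypothetical patient records; None marks a missing value.
records = [
    {"age": 54, "tumor_size_mm": 20.0},
    {"age": 61, "tumor_size_mm": None},
    {"age": 47, "tumor_size_mm": 30.0},
]

def impute_mean(rows, attribute):
    """Replace missing values of one attribute with the mean of the observed ones."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed)
    # Build new dicts so the original records stay unchanged.
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]

cleaned = impute_mean(records, "tumor_size_mm")
print(cleaned[1]["tumor_size_mm"])
```

Mean imputation is only one option; dropping incomplete records or using a dedicated "unknown" value are alternatives whose suitability depends on the dataset.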

Data Structure in Healthcare

  • Structured Data:

    • Rows = Records/Examples/Instances (e.g., patient data)

    • Columns = Features/Attributes (e.g., age, symptoms)

    • Example attributes from breast cancer study: tumor location, size, age at diagnosis, irradiated status, recurrence

  • Attribute Selection:

    • In theory: more attributes should result in more accurate patterns

    • In practice: irrelevant data may "confuse" algorithms

    • Selection methods:

      • Manual selection: using domain knowledge from experts

      • Automatic selection: reduces dimensionality by deleting unsuitable attributes based on heuristics

  • Attribute Construction:

    • Creating new attributes to make regularities more apparent

    • Uses simple relational rules to reduce complexity in datasets

    • Can achieve dimensionality reduction
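The two ideas above can be sketched together. In this hypothetical example, attribute construction replaces weight and height with a single BMI attribute, and a simple automatic selection heuristic drops any attribute that never varies. The attribute names, the `site_id` column, and the zero-variance heuristic are assumptions made for illustration, not methods named in the lecture.

```python
from statistics import pvariance

# Hypothetical records with three numeric attributes.
rows = [
    {"weight_kg": 70.0, "height_m": 1.75, "site_id": 1.0},
    {"weight_kg": 90.0, "height_m": 1.80, "site_id": 1.0},
    {"weight_kg": 60.0, "height_m": 1.60, "site_id": 1.0},
]

def construct_bmi(row):
    """Attribute construction: replace weight and height with a single BMI attribute."""
    bmi = row["weight_kg"] / row["height_m"] ** 2
    rest = {k: v for k, v in row.items() if k not in ("weight_kg", "height_m")}
    return {**rest, "bmi": round(bmi, 1)}

def drop_constant_attributes(rows):
    """Automatic selection heuristic: drop attributes whose variance is zero."""
    keep = [k for k in rows[0] if pvariance([r[k] for r in rows]) > 0]
    return [{k: r[k] for k in keep} for r in rows]

reduced = drop_constant_attributes([construct_bmi(r) for r in rows])
print(reduced)
```

Here three original attributes collapse into one, which shows how construction and selection can both reduce dimensionality.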

Data Mining Tasks

Descriptive Data Mining

  • Clustering: The process of identifying groups where data points within a cluster are similar while those across clusters are dissimilar.

    • Finds finite sets of categories (clusters) to describe data

    • Groups examples so similarity within clusters is maximized and similarity between clusters is minimized

    • Applications:

      • Market basket analysis (items bought together)

      • Healthcare: identifying patients with similar treatment needs

    • Example: Grouping patients based on LDL, HDL, and BMI measurements
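The patient-grouping example can be sketched with k-means, one widely used clustering algorithm (the lecture does not name a specific one, so this choice is an assumption). The measurements and the starting centroids below are invented for illustration.

```python
from math import dist

# Hypothetical (LDL mg/dL, HDL mg/dL, BMI) measurements for six patients.
patients = [
    (100.0, 60.0, 22.0), (105.0, 58.0, 23.0), (98.0, 62.0, 21.0),
    (170.0, 38.0, 31.0), (165.0, 40.0, 30.0), (175.0, 36.0, 33.0),
]

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Start from one patient in each apparent group (an arbitrary choice for the sketch).
labels, centroids = k_means(patients, centroids=[patients[0], patients[3]])
print(labels)
```

In practice the attributes would usually be rescaled first, since LDL values span a much larger numeric range than BMI and would otherwise dominate the distance.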

Predictive Data Mining

  • Regression: Finding a model (function) that maps a given input (attribute values) to a numeric prediction.

    • Often applied to data with a time component, which makes it useful for forecasting

    • Applications:

      • Forecasting economic growth based on market indicators

      • Predicting patient recovery trajectory after injury or surgery
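To make "a function that maps input to a numeric prediction" concrete, here is a minimal sketch of a least-squares line fit with a single input attribute. The data (days since surgery against a mobility score) and the helper `fit_line` are hypothetical; real recovery trajectories are unlikely to be this linear.

```python
from statistics import mean

# Hypothetical data: days since surgery vs. a mobility score.
days = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [12.0, 14.0, 16.0, 18.0, 20.0]

def fit_line(xs, ys):
    """Ordinary least squares for one input attribute: y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(days, score)
print(slope * 7 + intercept)  # numeric prediction for day 7
```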

  • Classification: Finding a model that is able to predict the value of the class attribute of an example based on the values of a set of attributes.

    • Each record belongs to a predefined class

    • Each example consists of predictor attributes and a class attribute

    • Requires training sets for model building and test sets for evaluation

    • Goal: discover relationships that allow prediction of class values

    • Process flow:

      1. Use training set (known-class examples) to build a model

      2. Apply the model to the test set (examples whose class is withheld from the model) to predict their classes, then compare the predictions with the true classes
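The two-step process flow can be sketched as follows. A 1-nearest-neighbour classifier is used here only because it is short to write; it is not one of the algorithms covered in the lecture. The examples (tumor size and age at diagnosis, with a recurrence class) are invented.

```python
from math import dist

# Hypothetical examples: (tumor size in mm, age at diagnosis) -> recurrence class.
training_set = [
    ((10.0, 40.0), "no"), ((12.0, 45.0), "no"),
    ((35.0, 60.0), "yes"), ((40.0, 65.0), "yes"),
]
test_set = [((11.0, 42.0), "no"), ((38.0, 62.0), "yes")]

def predict(model, attributes):
    """1-nearest-neighbour: return the class of the closest training example."""
    _, label = min(model, key=lambda example: dist(example[0], attributes))
    return label

# Step 1: build the model. For 1-NN this is just storing the training examples.
model = training_set

# Step 2: predict the class of each test example without looking at its true class.
predictions = [predict(model, x) for x, _ in test_set]

# Evaluation: compare the predictions with the withheld true classes.
accuracy = sum(p == y for p, (_, y) in zip(predictions, test_set)) / len(test_set)
print(predictions, accuracy)
```

Keeping the test examples out of the model-building step is what makes the measured accuracy a fair estimate of performance on new patients.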

Classification Algorithms in Healthcare

  • Artificial Neural Networks (ANNs):

    • Inspired by complex neural connections in biological systems

    • Built from densely interconnected simple units

    • Each unit takes real-valued inputs and produces a single real-valued output

    • Well-suited for noisy, complex sensor data often found in healthcare contexts
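A single unit of such a network can be sketched as a weighted sum of its inputs passed through an activation function. The sigmoid activation and the specific weights below are assumptions for illustration; the lecture does not fix either.

```python
from math import exp

def unit_output(inputs, weights, bias):
    """One unit: weighted sum of real-valued inputs passed through a sigmoid."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + exp(-activation))

# Hypothetical inputs and weights; the weighted sum here is exactly 0.
out = unit_output(inputs=[0.5, -1.0], weights=[2.0, 1.0], bias=0.0)
print(out)
```

A full network connects many of these units in layers and learns the weights from training data.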

  • Support Vector Machines (SVMs):

    • In its basic form, a binary linear classifier

    • Goal: find the maximum margin hyperplane (plane with greatest separation between classes)

    • Identifies support vectors to define decision boundaries
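The sketch below illustrates the terms "margin" and "support vectors" for a hyperplane that is given by hand, not learned; a real SVM finds `w` and `b` by solving an optimization problem. The points and the values of `w` and `b` are hypothetical, chosen so that the hyperplane separates the two classes with the closest points of each class at a score of exactly +1 or -1.

```python
from math import hypot

# Hypothetical 2-D points with classes +1 / -1, and a hand-picked hyperplane w.x + b = 0.
points = [((1.0, 1.0), -1), ((0.0, 0.0), -1), ((3.0, 3.0), +1), ((5.0, 4.0), +1)]
w, b = (0.5, 0.5), -2.0

def decision(x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Width of the margin between the two classes is 2 / ||w||.
margin_width = 2.0 / hypot(*w)

# Support vectors are the points lying exactly on the margin: y * (w.x + b) == 1.
support_vectors = [x for x, y in points if abs(y * decision(x) - 1.0) < 1e-9]
print(margin_width, support_vectors)
```

Only the support vectors determine the decision boundary; moving the other points (without crossing the margin) would leave it unchanged.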

  • Decision Trees:

    • Comprehensible graphical representation of a classification model

    • Internal nodes correspond to attribute tests (decision nodes)

    • Leaf nodes correspond to predicted class values

    • Classification process:

      • Tree is traversed top-down from root to leaf

      • Branches selected according to attribute test outcomes

      • Class value at leaf node is the predicted class
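The top-down traversal described above can be sketched with a tree stored as nested dictionaries. The tree itself is invented to show the mechanics and has no clinical meaning; the attribute names and the `classify` helper are hypothetical.

```python
# Hypothetical tree: internal nodes test one attribute, leaves hold a class value.
tree = {
    "attribute": "irradiated",
    "branches": {
        "yes": {"leaf": "recurrence"},
        "no": {
            "attribute": "tumor_size",
            "branches": {
                "small": {"leaf": "no recurrence"},
                "large": {"leaf": "recurrence"},
            },
        },
    },
}

def classify(node, example):
    """Walk from the root to a leaf, following the branch that matches each test outcome."""
    while "leaf" not in node:
        outcome = example[node["attribute"]]
        node = node["branches"][outcome]
    return node["leaf"]

print(classify(tree, {"irradiated": "no", "tumor_size": "small"}))
```

The sequence of tests followed on the way down doubles as a human-readable explanation of the prediction.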

  • Rule Induction: mentioned in the lecture but not covered in detail

Importance of Interpretability and Trustworthiness

  • Crucial for adoption of predictive systems in healthcare

  • Need for clear reasoning behind model predictions to build trust with healthcare professionals

  • Decision trees offer better interpretability compared to "black box" models like neural networks

Additional Information

  • The "Bioinformatics Knowledge Gap" chart shows the increasing disparity between available data and our ability to extract knowledge from it

  • Specific methods for evaluating classification model performance will be discussed in a future class

  • Attribute maintenance is important (e.g., storing date of birth instead of age, since a stored age goes out of date while age can always be recomputed from the date of birth)