Lecture 7

Data Mining Overview

Data Mining: "The process of discovering patterns in data. The process must be automatic or (more usually) semi-automatic. The patterns discovered must be meaningful." (Witten and Frank, 2005)
In healthcare, data mining focuses on extracting meaningful information from medical data
Many types of medical data can be collected in healthcare environments
Connected devices (the Internet of Things, IoT) generate a growing abundance of data
A knowledge gap exists between the amount of data available and the systems able to analyze it

Knowledge Discovery in Databases (KDD) Process

KDD (Knowledge Discovery in Databases): A structured process for extracting knowledge from data, consisting of data collection, preprocessing, application of data mining techniques, and evaluation/interpretation of the results.

The process includes:

Data Collection: Gathering relevant datasets
Preprocessing:
- Dealing with missing values
- Creating additional attributes
- Performing attribute selection
Data Mining Techniques: Applying algorithms to extract patterns
Evaluation and Interpretation: Analyzing results to gain knowledge
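
As a rough illustration of the "dealing with missing values" preprocessing step, the sketch below fills gaps with the attribute mean. Mean imputation is only one possible strategy and is an assumption here, not something the lecture prescribes; the records, the attribute names and the helper impute_mean are all hypothetical.

```python
from statistics import mean

# Hypothetical patient records; None marks a missing value.
records = [
    {"age": 54, "tumor_size_mm": 22.0},
    {"age": 61, "tumor_size_mm": None},
    {"age": 47, "tumor_size_mm": 18.0},
]

def impute_mean(rows, attribute):
    """Return copies of rows with missing values of `attribute` replaced by the mean."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed)
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in rows
    ]

cleaned = impute_mean(records, "tumor_size_mm")
```

The original list is left untouched, so the raw data stays available if a different imputation strategy is tried later.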

Data Structure in Healthcare

Structured Data:
- Rows = Records/Examples/Instances (e.g., patient data)
- Columns = Features/Attributes (e.g., age, symptoms)
- Example attributes from breast cancer study: tumor location, size, age at diagnosis, irradiated status, recurrence
Attribute Selection:
- In theory: more attributes should result in more accurate patterns
- In practice: irrelevant data may "confuse" algorithms
- Selection methods:
  - Manual selection: using domain knowledge from experts
  - Automatic selection: reduces dimensionality by deleting unsuitable attributes based on heuristics
Attribute Construction:
- Creating new attributes to make regularities more apparent
- Uses simple relational rules to reduce complexity in datasets
- Can achieve dimensionality reduction
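
A minimal sketch of automatic attribute selection and attribute construction, under stated assumptions: the selection heuristic (drop attributes with zero variance) and the constructed attribute (an LDL/HDL ratio) are illustrative choices, not the methods from the lecture. The data and the helper names select_attributes and add_ratio are hypothetical.

```python
from statistics import pvariance

# Hypothetical dataset: each dict is one patient record.
patients = [
    {"ldl": 130.0, "hdl": 40.0, "hospital_id": 1.0},
    {"ldl": 100.0, "hdl": 50.0, "hospital_id": 1.0},
    {"ldl": 160.0, "hdl": 40.0, "hospital_id": 1.0},
]

def select_attributes(rows, min_variance=1e-9):
    """Keep attributes whose variance exceeds a threshold (a simple heuristic)."""
    names = rows[0].keys()
    return [n for n in names if pvariance([r[n] for r in rows]) > min_variance]

def add_ratio(rows, numerator, denominator, new_name):
    """Construct a new attribute as the ratio of two existing ones."""
    return [{**r, new_name: r[numerator] / r[denominator]} for r in rows]

kept = select_attributes(patients)  # the constant hospital_id is dropped
enriched = add_ratio(patients, "ldl", "hdl", "ldl_hdl_ratio")
```

If the ratio then replaces the two attributes it was built from, the dataset ends up with fewer columns, which is one way attribute construction can reduce dimensionality.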

Data Mining Tasks

Descriptive Data Mining

Clustering: The process of identifying groups where data points within a cluster are similar while those across clusters are dissimilar.
- Finds finite sets of categories (clusters) to describe data
- Groups examples so similarity within clusters is maximized and similarity between clusters is minimized
- Applications:
  - Market basket analysis (items bought together; more commonly treated as association rule mining)
  - Healthcare: identifying patients with similar treatment needs
- Example: Grouping patients based on LDL, HDL, and BMI measurements
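
The patient-grouping example can be sketched with k-means, one common clustering algorithm. The lecture does not name a specific algorithm, so k-means is an assumption here, as are the measurements and the choice to seed the centroids with the first k points.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical (LDL mg/dL, HDL mg/dL, BMI) measurements.
patients = [
    (100.0, 60.0, 22.0),
    (180.0, 35.0, 31.0),
    (105.0, 58.0, 23.0),
    (175.0, 38.0, 30.0),
]
labels, centers = kmeans(patients, k=2)
```

In practice the attributes would usually be rescaled first, since LDL values are numerically much larger than BMI values and would otherwise dominate the distance.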

Predictive Data Mining

Regression: Finding a model (function) that maps a given input (attribute values) to a numeric prediction.
- Often applied to data with a time component, which makes it useful for forecasting
- Applications:
  - Forecasting economic growth based on market indicators
  - Predicting patient recovery trajectory after injury or surgery
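
To make "a function that maps input to a numeric prediction" concrete, here is a sketch of simple linear regression fitted by ordinary least squares. The single-attribute linear model is an assumption chosen for brevity, and the recovery data and the helper fit_line are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: days since surgery vs. a mobility score.
days = [1, 2, 3, 4]
score = [12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_line(days, score)

# Numeric prediction for an unseen input (day 6).
predicted_day_6 = slope * 6 + intercept
```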
Classification: Finding a model that is able to predict the value of the class attribute of an example based on the values of a set of attributes.
- Each record belongs to a predefined class
- Each example consists of predictor attributes and a class attribute
- Requires training sets for model building and test sets for evaluation
- Goal: discover relationships that allow prediction of class values
- Process flow:
  1. Use training set (known-class examples) to build a model
  2. Apply model to test set (examples whose class is hidden from the model) to predict classes, then compare the predictions with the true classes
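
The two-step process flow can be sketched with a 1-nearest-neighbour classifier. This classifier is swapped in purely because it is short; it is not one of the algorithms the lecture covers. The examples, attribute choices and class names are hypothetical.

```python
import math

def predict_1nn(training, query):
    """Predict the class of `query` as the class of its nearest training example."""
    _, label = min(training, key=lambda ex: math.dist(ex[0], query))
    return label

# Hypothetical examples: (tumor size in mm, age at diagnosis) -> class attribute.
training_set = [
    ((10.0, 45.0), "no-recurrence"),
    ((12.0, 50.0), "no-recurrence"),
    ((35.0, 60.0), "recurrence"),
    ((40.0, 65.0), "recurrence"),
]
test_set = [
    ((11.0, 48.0), "no-recurrence"),
    ((39.0, 63.0), "recurrence"),
]

# Step 1: for 1-NN, "building the model" just means storing the training set.
# Step 2: predict each test example without looking at its class,
# then compare the prediction with the class that was held back.
predictions = [predict_1nn(training_set, features) for features, _ in test_set]
correct = sum(p == actual for p, (_, actual) in zip(predictions, test_set))
accuracy = correct / len(test_set)
```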

Classification Algorithms in Healthcare

Artificial Neural Networks (ANNs):
- Inspired by complex neural connections in biological systems
- Built from densely interconnected simple units
- Each unit takes real-valued inputs and produces a single real-valued output
- Well-suited for noisy, complex sensor data often found in healthcare contexts
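
A single unit, and a very small network built from such units, might look like the sketch below. The sigmoid activation, the network shape and all weights are assumptions made for illustration; the lecture does not specify them, and no training is shown.

```python
import math

def unit(inputs, weights, bias):
    """One unit: weighted sum of real-valued inputs passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def tiny_network(inputs):
    """Two hidden units feeding one output unit; all weights are made up."""
    h1 = unit(inputs, [0.5, -0.25], 0.0)
    h2 = unit(inputs, [-0.5, 0.25], 0.0)
    return unit([h1, h2], [1.0, -1.0], 0.0)

# Each unit produces a single real-valued output between 0 and 1.
output = tiny_network([2.0, 4.0])
```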
Support Vector Machines (SVMs):
- In its basic form, a binary linear classifier
- Goal: find the maximum margin hyperplane (plane with greatest separation between classes)
- Identifies support vectors to define decision boundaries
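
The sketch below shows how a separating hyperplane classifies points and how the margin and support vectors can be read off. It does not train an SVM: the weights w and offset b are picked by hand for this hypothetical toy data, where the closest pair of points from opposite classes lies exactly across the hyperplane.

```python
import math

# Hypothetical, hand-picked hyperplane w . x + b = 0 (not learned here).
w = (1.0, 0.0)
b = -3.0

def decision(x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x):
    """Binary prediction: +1 or -1, depending on the side of the hyperplane."""
    return 1 if decision(x) >= 0 else -1

# Hypothetical labelled points; labels are +1 or -1.
data = [((2.0, 2.0), -1), ((1.0, 4.0), -1), ((4.0, 2.0), 1), ((5.0, 0.0), 1)]

norm_w = math.hypot(*w)
# Geometric distance of each point from the hyperplane.
distances = [abs(decision(x)) / norm_w for x, _ in data]
closest = min(distances)
# The points at the minimum distance play the role of support vectors.
support_vectors = [x for (x, _), d in zip(data, distances) if math.isclose(d, closest)]
margin_width = 2 * closest
```

An actual SVM implementation would search for the w and b that make this margin as wide as possible instead of taking them as given.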
Decision Trees:
- Comprehensible graphical representation of a classification model
- Internal nodes correspond to attribute tests (decision nodes)
- Leaf nodes correspond to predicted class values
- Classification process:
  - Tree is traversed top-down from root to leaf
  - Branches selected according to attribute test outcomes
  - Class value at leaf node is the predicted class
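
The top-down traversal can be sketched as follows. The tree is hand-written and hypothetical (its attributes, thresholds and classes carry no clinical meaning), and the nested-dictionary representation is just one convenient assumption.

```python
# Hypothetical tree: internal nodes test one attribute, leaves hold a class value.
tree = {
    "attribute": "tumor_size_mm",
    "threshold": 30.0,
    "below": {"leaf": "no-recurrence"},
    "at_or_above": {
        "attribute": "irradiated",
        "threshold": 0.5,
        "below": {"leaf": "recurrence"},
        "at_or_above": {"leaf": "no-recurrence"},
    },
}

def classify(node, example):
    """Walk from the root to a leaf, following the outcome of each attribute test."""
    while "leaf" not in node:
        passed = example[node["attribute"]] >= node["threshold"]
        node = node["at_or_above" if passed else "below"]
    return node["leaf"]

prediction = classify(tree, {"tumor_size_mm": 42.0, "irradiated": 0.0})
```

Because every prediction corresponds to one readable path of attribute tests, the reasoning behind it can be shown directly to a domain expert.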
Rule Induction:
- Mentioned as a further classification approach, but not covered in detail in the lecture material

Importance of Interpretability and Trustworthiness

Interpretability and trustworthiness are crucial for the adoption of predictive systems in healthcare
Healthcare professionals need clear reasoning behind model predictions before they can trust them
Decision trees are easier to interpret than "black box" models like neural networks

Additional Information

The "Bioinformatics Knowledge Gap" chart shows the increasing disparity between available data and our ability to extract knowledge from it
The lecture mentions that specific methods for evaluating classification model performance will be discussed in a future class
Attribute maintenance is important (e.g., storing date of birth instead of age, because a stored age goes out of date while date of birth never changes)
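
A short sketch of the date-of-birth point: if the date of birth is stored, the age can be derived for any reference date whenever it is needed. The helper age_on and the dates are hypothetical.

```python
from datetime import date

def age_on(date_of_birth, reference):
    """Age in whole years on `reference`, derived from the stored date of birth."""
    had_birthday = (reference.month, reference.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return reference.year - date_of_birth.year - (0 if had_birthday else 1)

dob = date(1970, 6, 15)
age_day_before_birthday = age_on(dob, date(2020, 6, 14))
age_on_birthday = age_on(dob, date(2020, 6, 15))
```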