BE-530: Machine Learning in Python - Lecture 1 Introduction

Course Introduction, Scope, and Logistics

Course Details:
- Department: University of Louisville Bioengineering Department.
- Course Number: BE-530.
- Course Title: Machine Learning (ML) in Python.
- Term: Summer – 2026.
- Lecture Title: Lecture 1 Welcome: What This Course Trains You To Do.
Core Objective:
- The course aims to build end-to-end Machine Learning capability that is defensible.
- It maintains a biomedical and applied focus where all claims must be supported by evidence.
Defensible Workflow Components:
- Framing: Defining the problem effectively.
- Evaluation: Rigorously testing the model's performance.
- Reporting: Clearly documenting results.
- Reproducibility: Ensuring findings can be replicated.
Course Rhythm and Resources:
- The course design is predictable on a week-to-week basis.
- Blackboard: Serves as the central hub for readings, announcements, submissions, and grades.
- Deliverables: Weekly deliverables and their respective due dates are posted via Blackboard.
Grading Weightage:
- Homework + Quizzes: $40\%$ of the final grade.
- Projects (4 total): $60\%$ of the final grade.
- What Is Rewarded: Grades are based on methodology, justification, and reproducibility, not on raw performance alone.
Submission Requirements:
- All submissions must be runnable and interpretable.
- Notebooks/Scripts: Must run correctly; outputs should be saved unless a clean run is specifically requested by the instructor.
- Reports: Must follow the required format and include plots or tables as evidence of performance.
- Documentation: Every submission must include a “How to Run” note and a record of environment versions to ensure reproducibility.
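One way to produce the record of environment versions is to generate it from the running interpreter. The following is a minimal sketch, not a required course format; the helper name `environment_report` and the package list are assumptions chosen for illustration.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Return a dict with the Python version, platform, and each package's installed version."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Hypothetical package list; adjust it to what the submission actually imports.
report = environment_report(["numpy", "pandas", "scikit-learn"])
for name, version in report.items():
    print(f"{name}=={version}")
```

Pasting this output into the "How to Run" note gives a reader the versions needed to rebuild the environment.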

Fundamental Definitions and Principles of Machine Learning

Course Definition of ML:
- Machine Learning involves learning a mapping from features $X$ to outcomes $y$ that generalizes to new data.
- Inputs ( $X$ ): Information available at the time of prediction.
- Output ( $y$ ): The value or state we want to predict.
- Model Prediction ( $\hat{y}$ ): The estimate or label produced by the model.
The Goal of ML:
- The primary goal is achieving high performance on unseen, future-like data.
- Generalization: Defined as performance on new, previously unseen data.
- Generalization vs. Memorization:
  - Training performance is not the ultimate goal.
  - Memorization: Characterized by high performance on training data but low performance on new data.
  - Generalization: Characterized by stable performance on unseen data.
  - Evaluation Logic: The sole purpose of evaluation is to estimate generalization.
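The gap between memorization and generalization can be seen in a small experiment. This is an illustrative sketch on synthetic data (the dataset, its label-noise level, and the choice of an unconstrained decision tree are all assumptions): the tree fits the training set perfectly but scores lower on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A fully grown tree can memorize every training row, including the noisy labels.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

Only the held-out number says anything about how the model would behave on new patients.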
The Supervised Learning Loop (Intuition):
- Model parameters are adjusted to minimize errors within the training dataset.
- Process Steps:
  1. Choose a model family (e.g., linear models, tree-based models, neural networks).
  2. Define the Loss (the penalty assigned for errors).
  3. Optimize parameters to minimize that loss.
  4. Evaluate on held-out data to estimate generalization capacity.
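The four steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions (synthetic two-feature data, logistic regression as the model family, binary cross-entropy as the loss, fixed-step gradient descent as the optimizer), not the course's reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): labels follow a noisy linear rule.
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(float)

# Hold out the last 100 rows for evaluation.
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]


def predict_proba(X, w, b):
    # Step 1: model family = logistic regression (linear score through a sigmoid).
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


def log_loss(y, p, eps=1e-12):
    # Step 2: loss = mean binary cross-entropy.
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


# Step 3: optimize parameters with gradient descent on the training loss.
w, b = np.zeros(2), 0.0
initial_loss = log_loss(y_train, predict_proba(X_train, w, b))
for _ in range(500):
    p = predict_proba(X_train, w, b)
    w -= 0.5 * (X_train.T @ (p - y_train) / len(y_train))
    b -= 0.5 * np.mean(p - y_train)
final_loss = log_loss(y_train, predict_proba(X_train, w, b))

# Step 4: evaluate on held-out data to estimate generalization.
test_accuracy = np.mean((predict_proba(X_test, w, b) >= 0.5) == y_test)
print(f"train loss {initial_loss:.3f} -> {final_loss:.3f}, test accuracy {test_accuracy:.2f}")
```

Note that the loop optimizes the loss, while the number reported at the end is a metric (accuracy) computed on data the optimizer never saw.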
Loss vs. Metrics:
- Loss: Specifically used to guide the updating of model parameters during training.
- Metric: Used to summarize model behavior for human decision-making and reporting. One model can be evaluated using multiple metrics simultaneously.

Biomedical Data: Features (X) and Labels (y)

Biomedical Feature Categories ( $X$ ):
- Electronic Health Records (EHR) / Structured and Tabular Data: Includes time-stamped laboratory results, vital signs, and medications. Features often present as irregular time series with varying units, reference ranges, and missing data points. Examples include creatinine levels, Blood Pressure (BP), and insulin doses.
- Imaging: 2D or 3D pixel arrays from modalities like CT scan, MRI, or X-ray. Features may include learned embeddings from Convolutional Neural Networks (CNN) or Vision Transformers (ViT), used for segmentation or classification.
- Physiologic Signals: Continuous waveforms such as ECG (Electrocardiogram) or EEG (Electroencephalogram). Features are extracted using fixed-length windows (with potential overlap) or processed through deep learning models.
- Clinical Text: Clinical notes and reports. Features are derived through Natural Language Processing (NLP) using tokens, named entities, or section information, as well as transformer-based embeddings. Examples from text include "no acute distress," "pneumonia," and "EF 35%".
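As an illustration of the fixed-length windowing mentioned for physiologic signals, the sketch below cuts a synthetic waveform into overlapping windows and computes a few simple per-window features. The sampling rate, window length, overlap, and feature choice are assumptions for the example, and `sliding_windows` is a hypothetical helper.

```python
import numpy as np


def sliding_windows(signal, window_size, step):
    """Split a 1-D signal into fixed-length windows; step < window_size gives overlap.

    Assumes len(signal) >= window_size; trailing samples that do not fill a window are dropped.
    """
    signal = np.asarray(signal)
    starts = range(0, len(signal) - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])


# Hypothetical example: 10 s of a synthetic waveform sampled at 250 Hz,
# cut into 2 s windows with 50% overlap.
fs = 250
t = np.arange(10 * fs) / fs
ecg_like = np.sin(2 * np.pi * 1.2 * t)
windows = sliding_windows(ecg_like, window_size=2 * fs, step=fs)

# One feature row per window: mean, standard deviation, peak-to-peak amplitude.
features = np.column_stack(
    [windows.mean(axis=1), windows.std(axis=1), np.ptp(windows, axis=1)]
)
print(windows.shape, features.shape)  # (9, 500) (9, 3)
```

Each row of `features` then plays the role of one row of $X$ at the time-window unit of analysis.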
Supervised Task Types:
- Classification: The output is a categorical label (e.g., disease present: yes/no).
- Regression: The output is a continuous number (e.g., a risk score or a specific lab value).
- Note: The same dataset can support different task types, but the researcher must be explicit about the choice.

Workflow Standards, Framing, and Bias

Workflow Standards and Failure Prevention:
- Most ML failures result from poor evaluation and workflow errors.
- Leakage/Peeking: Occurs when information that would not be available at prediction time is inadvertently included in training, which artificially inflates measured performance.
- Mandatory Standards:
  - Explicitly document data split and Cross-Validation (CV) protocols to avoid test peeking.
  - Fit all preprocessing (e.g., scaling) on training data only; the use of Scikit-learn Pipelines is required.
  - Capture all environment versions and provide run instructions.
  - Maintain logs/notes to track changes between model runs.
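The "fit preprocessing on training data only" standard can be checked directly. In this sketch (synthetic data; the split ratio is an assumption) the scaler inside a Scikit-learn pipeline is fit together with the model on the training split, and its learned mean is compared against the training-only mean and the full-data mean.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic, illustrative only
y = (X[:, 0] > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)  # scaler statistics come from X_train only

scaler = pipe.named_steps["standardscaler"]
uses_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
matches_full_data = np.allclose(scaler.mean_, X.mean(axis=0))
test_accuracy = pipe.score(X_test, y_test)  # test rows are scaled with train statistics

print("scaler fit on train only:", uses_train_only)
print("scaler matches full-data mean:", matches_full_data)
```

Calling `StandardScaler().fit(X)` on the full dataset before splitting would instead let test-set statistics influence training, which is the leakage the standard is meant to prevent.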
Shared Vocabulary:
- Feature: The model input variable.
- Label: The target outcome.
- Baseline: The simplest possible comparator to judge if ML adds value.
- Leakage: Information used during training that is unavailable at the point of prediction.
The Framing Checklist (5 Questions):
1. What is the Unit of Analysis? (e.g., patient-level, visit-level, time-window-level, or image-level).
2. What Inputs are available at prediction time? (Use timestamps; enforce "known by time $t$ ").
3. How is the Output/Label defined? (Measurement quality, observation window, handling of missing labels).
4. What is the Time Horizon?
5. What is the Baseline Comparator?
Label Quality and Challenges:
- Label Definition: Clearly define what counts as a positive case and specify the observation window.
- Missing Labels: Decide if they are treated as negative or unknown.
- Label Noise: High noise requires more cautious claims.
- Distribution Shift: Always question if the model will be used in the same setting it was trained on.
- Error Analysis: Systematic analysis to identify what needs fixing (features, label definitions, thresholds, or workflow).

Data Splitting and Baselines

Roles of Data Splits:
- Train Set: Used to fit model parameters.
- Validation Set: Used to tune hyperparameters and decision thresholds. Choices should be made here or via cross-validation.
- Test Set: Provides the final, unbiased estimate of performance. Never iterate on the test set: peeking at it makes results optimistically biased, so they no longer give a reliable estimate of generalization.
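A three-way split can be built with two calls to `train_test_split`. The 60/20/20 proportions and the synthetic data below are assumptions for illustration; the point is that the test set is carved off first and then left alone.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                 # synthetic features
y = (rng.random(1000) < 0.2).astype(int)       # roughly 20% positives

# First carve off the test set (20%), then split the remainder into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Hyperparameters and thresholds are then chosen using `X_val`, and `X_test` is scored once at the end.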
Holdout vs. Cross-Validation (CV):
- Holdout: A single split. It is simple and fast.
- Cross-Validation: Rotates through validation folds and averages the estimates. It is more data-efficient and useful when datasets are small or results vary significantly between splits (unstable estimates).
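The difference between the two protocols can be sketched on a small synthetic dataset (dataset size, fold count, and model are assumptions). Repeating a single holdout split with different seeds shows how much one split can vary, while cross-validation averages over folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=150, n_features=10, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Holdout: one split per seed; each run gives a single accuracy estimate.
holdout_scores = []
for seed in range(5):
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    holdout_scores.append(pipe.fit(X_tr, y_tr).score(X_va, y_va))

# Cross-validation: every row is used for validation exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X, y, cv=cv)

print("holdout estimates:", np.round(holdout_scores, 3))
print(f"5-fold CV: mean {cv_scores.mean():.3f}, std {cv_scores.std():.3f}")
```

Because the pipeline is passed to `cross_val_score`, the scaler is refit inside each training fold rather than on the full dataset.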
Baselines:
- These determine if ML adds any value over simple methods.
- Classification Baselines: Predict the majority class for every case (or output the class prevalence as a constant probability), or use rule-based logic.
- Regression Baselines: Use the mean or median of the outcome.
- Always report the baseline alongside the best model performance.
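Scikit-learn ships simple baseline estimators that match these definitions. The sketch below uses synthetic targets (a 10% prevalence and a made-up continuous outcome are assumptions) to compute a majority-class baseline and a median baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                     # features are ignored by dummy models
y_class = np.array([1] * 20 + [0] * 180)          # 10% prevalence (illustrative)
y_reg = rng.normal(loc=1.0, scale=0.3, size=200)  # synthetic continuous outcome

# Classification baseline: always predict the majority class.
clf_baseline = DummyClassifier(strategy="most_frequent").fit(X, y_class)
baseline_accuracy = clf_baseline.score(X, y_class)

# Regression baseline: always predict the median of the outcome.
reg_baseline = DummyRegressor(strategy="median").fit(X, y_reg)
baseline_mae = np.mean(np.abs(y_reg - reg_baseline.predict(X)))

print(f"majority-class accuracy: {baseline_accuracy:.2f}")
print(f"median-predictor MAE:    {baseline_mae:.3f}")
```

Any trained model should be reported next to these numbers, computed on the same evaluation split.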

Performance Metrics and the Confusion Matrix

The Confusion Matrix (Error Types):
- TP (True Positive): Correctly predicted positive.
- FP (False Positive): A false alarm; predicted positive but actually negative.
- TN (True Negative): Correctly predicted negative.
- FN (False Negative): A missed positive; predicted negative but actually positive.
The Accuracy Trap:
- Accuracy can be misleading under class imbalance. If $95\%$ of patients are negative, a model that predicts every patient as negative will have $95\%$ accuracy but $0\%$ recall.
Formulas for Metrics:
- Sensitivity / Recall / TPR (True Positive Rate): $\text{Recall} = \frac{TP}{TP + FN}$ (Important for screening tests).
- Specificity / TNR (True Negative Rate): $\text{Specificity} = \frac{TN}{TN + FP}$
- Precision / PPV (Positive Predictive Value): $\text{Precision} = \frac{TP}{TP + FP}$
- NPV (Negative Predictive Value): $\text{NPV} = \frac{TN}{TN + FN}$
- FPR (False Positive Rate): $\text{FPR} = \frac{FP}{FP + TN}$
- AUROC (Area Under the ROC Curve): Measures the area under the plot of TPR vs. FPR across various thresholds.
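The count-based formulas above translate directly into code. This is a small sketch; `confusion_metrics` is a hypothetical helper and the counts passed to it are made up for illustration.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Compute the count-based metrics; a zero denominator yields NaN."""

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "recall": ratio(tp, tp + fn),        # sensitivity / TPR
        "specificity": ratio(tn, tn + fp),   # TNR
        "precision": ratio(tp, tp + fp),     # PPV
        "npv": ratio(tn, tn + fn),
        "fpr": ratio(fp, fp + tn),
    }


# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Since specificity and FPR share the denominator $TN + FP$, they always sum to 1, which is a quick sanity check on any reported pair.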
Threshold Tradeoffs:
- Choosing a threshold changes the balance between FP and FN.
- Lowering Threshold: Increases recall (more positives are caught) but also increases FP.
- Raising Threshold: Reduces FP and usually increases precision, but also increases FN.
- Threshold selection should be based on intervention costs and system capacity.
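The tradeoff can be made concrete by sweeping a threshold over predicted scores. The sketch below uses synthetic scores, and the per-error costs are hypothetical placeholders; real costs depend on the intervention and on system capacity.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = np.array([1] * 50 + [0] * 450)
# Synthetic scores: positives tend to score higher than negatives.
scores = np.concatenate([rng.normal(0.65, 0.15, 50), rng.normal(0.35, 0.15, 450)])

COST_FP, COST_FN = 1.0, 10.0  # hypothetical costs: a miss is 10x worse than a false alarm

thresholds = [0.7, 0.5, 0.3]
rows = []
for threshold in thresholds:
    y_pred = scores >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    rows.append({
        "threshold": threshold,
        "recall": tp / (tp + fn),
        "fp": fp,
        "fn": fn,
        "cost": COST_FP * fp + COST_FN * fn,
    })
    print(rows[-1])

# Pick the threshold with the lowest total cost under the assumed costs.
best = min(rows, key=lambda r: r["cost"])
print("lowest-cost threshold:", best["threshold"])
```

In practice this selection would be done on the validation set, not the test set.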
ROC vs. Precision-Recall (PR):
- ROC: Plots TPR vs. FPR.
- PR: Plots Precision vs. Recall. Use PR when dealing with rare positives (imbalanced data), as ROC can often hide poor performance in these scenarios.
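To see why the PR view matters for rare positives, the sketch below scores a synthetic dataset with 1% prevalence (the score distributions are assumptions). It reports AUROC alongside average precision, a common single-number summary of the PR curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_pos, n_neg = 500, 49500  # 1% prevalence (illustrative)
y_true = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
# Synthetic scores: positives are shifted upward but overlap heavily with negatives.
scores = np.concatenate([rng.normal(1.5, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)])

auroc = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)
prevalence = y_true.mean()

print(f"AUROC:             {auroc:.3f}")
print(f"average precision: {ap:.3f}")
print(f"prevalence:        {prevalence:.3f}  (average precision of a random ranking)")
```

The AUROC looks respectable, while the average precision shows that most flagged cases would still be false alarms; comparing it with the prevalence shows how far the model is above chance.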

Worked Examples and Collaborative Activities

Worked Example: Patient Deterioration Framing:
- Task: Predict patient deterioration within $6\,\text{hours}$.
- Unit: Patient-timepoint.
- Inputs: Vitals and Labs up to time $t$.
- Label: An event occurring in the window $(t,\ t + 6\,\text{hours})$.
- Baseline: Prevalence (overall rate of positive class) or a threshold rule.
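The framing above can be turned into a table-building step. The following pandas sketch uses entirely made-up toy tables and column names (`timepoints`, `events`, `vitals`, `heart_rate` are hypothetical), and it assumes an event exactly at the end of the window counts as positive.

```python
import pandas as pd

# Hypothetical toy data: prediction timepoints, one deterioration event, and vitals.
timepoints = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "t": pd.to_datetime([
        "2026-01-01 00:00", "2026-01-01 04:00", "2026-01-01 08:00",
        "2026-01-01 00:00", "2026-01-01 04:00",
    ]),
})
events = pd.DataFrame({
    "patient_id": [1],
    "event_time": pd.to_datetime(["2026-01-01 09:30"]),
})
vitals = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "measured_at": pd.to_datetime([
        "2025-12-31 23:00", "2026-01-01 03:00", "2026-01-01 07:30", "2025-12-31 22:00",
    ]),
    "heart_rate": [80, 95, 110, 72],
})

horizon = pd.Timedelta(hours=6)

# Label: 1 if the patient's event falls after t and no later than t + 6 h.
# (Assumes at most one event per patient, so the left merge keeps one row per timepoint.)
merged = timepoints.merge(events, on="patient_id", how="left")
in_window = (merged["event_time"] > merged["t"]) & (
    merged["event_time"] <= merged["t"] + horizon
)
timepoints["label"] = in_window.astype(int).to_numpy()

# Inputs: the most recent vital sign measured at or before t ("known by time t").
features = pd.merge_asof(
    timepoints.sort_values("t"),
    vitals.sort_values("measured_at"),
    left_on="t", right_on="measured_at", by="patient_id", direction="backward",
)
features = features.sort_values(["patient_id", "t"]).reset_index(drop=True)
print(features)
```

Using a backward as-of join means no feature value can carry a timestamp later than its prediction time, which is the timestamp rule from the framing checklist.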
Activity 1: Framing Readmission:
- Task: Predict unplanned readmission within $30\,\text{days}$.
- Unit of Analysis: Discharge encounter (patient-discharge); one row per discharge event.
- Time ( $T$ ): Discharge time and date.
- Inputs: Demographics, comorbidities, prior utilization, and hospitalization summary up to time $T$.
- Label: Readmission occurring in $(T,\ T + 30\,\text{days})$, excluding planned readmissions.
- Baseline: Prevalence (predicting "no" for everyone) or a rule like the LACE threshold.
- Leakage Risk: Including information from after discharge, such as a "30-day readmission flag," insurance claims filed after the discharge, or clinical notes mentioning the follow-up admission.
Leakage Audit Checklist:
1. Are there future timestamps in the features?
2. Is there patient overlap across data splits?
3. Were transformations (scaling/imputation) fit on the full dataset instead of just the training split?
4. Is the label derived from one of the features (circular reasoning)?
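Item 2 of the checklist (patient overlap across splits) is easy to test in code. This sketch uses synthetic patient IDs with several rows per patient (an assumption) and compares a row-level random split with a group-aware split from Scikit-learn.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
patient_ids = np.repeat(np.arange(50), 4)  # 50 hypothetical patients, 4 rows each
X = rng.normal(size=(200, 3))

# Row-level random split: rows from the same patient can land on both sides.
train_idx, test_idx = train_test_split(np.arange(200), test_size=0.25, random_state=0)
naive_overlap = set(patient_ids[train_idx]) & set(patient_ids[test_idx])

# Group-aware split: each patient appears in exactly one side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
g_train_idx, g_test_idx = next(splitter.split(X, groups=patient_ids))
group_overlap = set(patient_ids[g_train_idx]) & set(patient_ids[g_test_idx])

print("patients in both splits (row-level):  ", len(naive_overlap))
print("patients in both splits (group-aware):", len(group_overlap))
```

The same set-intersection check can be run on any existing split as a quick audit before trusting its results.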
Worked Example: Accuracy Trap Step-by-Step:
- Scenario: $1000$ total patients; $50$ are positive ( $5\%$ prevalence).
- Model A (Predict all negative): $TP = 0$ , $FN = 50$ , $TN = 950$ , $FP = 0$ .
  - $\text{Accuracy} = \frac{0 + 950}{1000} = 0.95$
  - $\text{Recall} = \frac{0}{0 + 50} = 0$
- Model B (Detects some positives): $TP = 30$ , $FN = 20$ , $FP = 80$ , $TN = 870$ .
  - $\text{Accuracy} = 0.90$
  - $\text{Recall} = 0.60$
  - $\text{Precision} \approx 0.27$
- Conclusion: Model B is more useful for screening despite lower accuracy.
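The numbers in this worked example can be reproduced by building label arrays that match the stated counts and passing them to Scikit-learn's metric functions. The array layout below is just one arrangement consistent with those counts.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 50 positives followed by 950 negatives.
y_true = np.array([1] * 50 + [0] * 950)

# Model A: predicts negative for everyone.
pred_a = np.zeros(1000, dtype=int)
# Model B: TP = 30, FN = 20 among the positives; FP = 80, TN = 870 among the negatives.
pred_b = np.array([1] * 30 + [0] * 20 + [1] * 80 + [0] * 870)

results = {}
for name, y_pred in [("A", pred_a), ("B", pred_b)]:
    results[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        # zero_division=0 avoids a warning when a model predicts no positives.
        "precision": precision_score(y_true, y_pred, zero_division=0),
    }
    print(name, {k: round(v, 2) for k, v in results[name].items()})
```

Model A's precision is undefined ($0/0$); it is reported as 0 here only because of the `zero_division=0` setting.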

Code Implementation Basics

Pipeline-Safe Preprocessing (Mini Code Preview):
```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Enforces consistent preprocessing learned on train only
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
```
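Computing Metrics (Mini Code Preview):
```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative placeholder labels; replace with real ground truth and model predictions.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print("CM:\n", cm)
print("precision:", prec, "recall:", rec)
```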

Flashcard Terms (key concepts and definitions from this lecture):

End-to-end Machine Learning
Defensible Workflow Components
Generalization
Supervised Learning Loop
Loss Function
True Positives, False Positives, True Negatives, False Negatives
Confusion Matrix
Sensitivity (Recall)
Precision
ROC Curve
Cross-Validation
Baseline Model
Label Noise
Data Leakage
Feature Selection
Hyperparameters
Reproducibility
Biomedical Data
Clinical Text Processing
