BE-530: Machine Learning in Python - Lecture 1 Introduction

Description and Tags

Foundational vocabulary and concepts for BE-530 Lecture 1, covering ML definitions, biomedical data types, workflow standards, evaluation metrics, and framing strategies.

Last updated 1:22 AM on 5/20/26

46 Terms

Machine Learning (ML)

The process of learning a mapping from features $X$ to outcomes $y$ that generalizes to unseen, future-like data.

Generalization

The performance of a model on new data that was not seen during training.

Supervised Learning Loop

The iterative process of choosing a model family, defining a loss function, optimizing parameters to minimize loss, and evaluating on held-out data.
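
A minimal sketch of the loop described above, using only the standard library. The toy data, the straight-line model family, the learning rate, and the step count are all assumptions chosen for illustration, not values from the lecture.

```python
# Hypothetical toy data following y = 2x + 1, split into train and held-out sets.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
held_out = [(4.0, 9.0), (5.0, 11.0)]

def predict(w, b, x):
    # Model family: straight lines y_hat = w * x + b.
    return w * x + b

def mse(w, b, data):
    # Loss: mean squared error over a dataset.
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)

def fit(data, lr=0.05, steps=2000):
    # Optimization: plain gradient descent on the MSE loss.
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit(train)                    # parameters are fit on training data only
train_loss = mse(w, b, train)
held_out_loss = mse(w, b, held_out)  # evaluation on data not used for fitting
```

Comparing `train_loss` with `held_out_loss` is the evaluation step: a large gap between them would suggest the model is not generalizing.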

Loss

A penalty for errors that is optimized during training to guide parameter updates.

Metric

A value used for decisions and reporting that summarizes model behavior for a specific goal.

Classification

A task type where the output is a discrete label, such as a "yes/no" disease diagnosis.

Regression

A task type where the output is a continuous number, such as a risk score or lab value.

Leakage (or Peeking)

A workflow failure in which information unavailable at prediction time is included in training, producing inflated performance estimates that do not hold up on new data.

Baseline

The simplest comparator (e.g., majority class for classification or mean/median for regression) used to see if ML adds value.
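
As a rough illustration, both baselines can be computed in a few lines. The label and target values below are made up, and mean absolute error is just one possible way to score the regression baseline.

```python
from collections import Counter
from statistics import mean

def majority_class_baseline(train_labels):
    # Classification baseline: always predict the most frequent training label.
    return Counter(train_labels).most_common(1)[0][0]

def mean_baseline(train_targets):
    # Regression baseline: always predict the mean of the training targets.
    return mean(train_targets)

# Hypothetical classification data.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_labels = [0, 1, 0, 0]
constant_label = majority_class_baseline(train_labels)
baseline_accuracy = sum(y == constant_label for y in test_labels) / len(test_labels)

# Hypothetical regression data.
train_targets = [4.0, 6.0, 5.0, 9.0]
test_targets = [5.0, 7.0]
constant_value = mean_baseline(train_targets)
baseline_mae = mean(abs(y - constant_value) for y in test_targets)
```

A trained model adds value only if it beats `baseline_accuracy` or `baseline_mae` on the same held-out data.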

Unit of Analysis

The level at which data is defined and split, such as patient-level, visit-level, time-window level, or image-level.
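
One way to respect a patient-level unit of analysis is to split on patient IDs rather than on rows, so that no patient appears in both sets. The sketch below assumes a hypothetical record schema with a `patient_id` key; the function name and defaults are illustrative.

```python
import random

def split_by_patient(records, test_fraction=0.25, seed=0):
    # Shuffle unique patient IDs (not rows), then assign whole patients to a set.
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test

# Hypothetical visit-level records: some patients have several visits.
records = [
    {"patient_id": "p1", "visit": 1},
    {"patient_id": "p1", "visit": 2},
    {"patient_id": "p2", "visit": 1},
    {"patient_id": "p3", "visit": 1},
    {"patient_id": "p3", "visit": 2},
    {"patient_id": "p4", "visit": 1},
]
train_records, test_records = split_by_patient(records)
```

Splitting the same records row by row could place visit 1 of a patient in training and visit 2 in testing, which is a form of leakage.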

Prevalence

The overall rate of the positive class in a dataset.

Prevalence Baseline

A baseline strategy that ignores all features and always predicts the most common class.

Train Set

The portion of data used to fit model parameters.

Validation Set

The portion of data used to tune hyperparameters and thresholds.

Test Set

The held-out portion of data used only once, at the end of modeling, to obtain a final unbiased performance estimate; it must not be used to guide choices.
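
The three roles (train, validation, test) can be sketched as a single seeded shuffle followed by slicing. The 60/20/20 proportions and the seed are assumptions for illustration, and the items being split are assumed to already be at the correct unit of analysis.

```python
import random

def three_way_split(items, val_fraction=0.2, test_fraction=0.2, seed=42):
    # Shuffle a copy with a fixed seed so the split can be reproduced.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                 # touched once, at the very end
    val = shuffled[n_test:n_test + n_val]    # used for tuning
    train = shuffled[n_test + n_val:]        # used for fitting parameters
    return train, val, test

train_items, val_items, test_items = three_way_split(range(100))
```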

Confusion Matrix

A table showing the breakdown of error types: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
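
For a binary task, the four cells can be tallied directly from true and predicted labels. This sketch assumes labels are encoded as 1 (positive) and 0 (negative); the example labels are made up.

```python
def confusion_counts(y_true, y_pred):
    # Tally each (truth, prediction) pair into one of the four cells.
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
# counts == {"TP": 3, "FP": 1, "TN": 3, "FN": 1}
```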

Accuracy Trap

A situation under data imbalance where high accuracy is misleading because the model fails to detect the minority class.
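
A small made-up example shows the trap: with 2% prevalence, a model that predicts "negative" for everyone looks accurate while detecting no positives.

```python
# Hypothetical cohort: 1000 patients, 20 with the disease (2% prevalence).
y_true = [1] * 20 + [0] * 980
# A degenerate model that always predicts the negative class.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
sensitivity = tp / (tp + fn)

# accuracy is 0.98, yet sensitivity is 0.0: every positive case is missed.
```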

Sensitivity (Recall or TPR)

A metric that measures how many actual positives the model catches, calculated as $\frac{TP}{TP + FN}$.

Specificity (TNR)

A metric that measures how many actual negatives the model correctly identifies, calculated as $\frac{TN}{TN + FP}$.

Precision (PPV)

A metric that measures how much to trust a predicted positive outcome, calculated as $\frac{TP}{TP + FP}$.

NPV (Negative Predictive Value)

A metric that measures how much to trust a predicted negative outcome, calculated as $\frac{TN}{TN + FN}$.
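
The four ratio metrics (sensitivity, specificity, precision, NPV) all come from the same confusion-matrix counts. The sketch below uses made-up counts; returning NaN for an empty denominator is one possible convention, not the only one.

```python
def safe_ratio(numerator, denominator):
    # Return NaN instead of raising when a denominator is zero.
    return numerator / denominator if denominator else float("nan")

def classification_metrics(tp, fp, tn, fn):
    return {
        "sensitivity": safe_ratio(tp, tp + fn),  # TP / (TP + FN)
        "specificity": safe_ratio(tn, tn + fp),  # TN / (TN + FP)
        "precision": safe_ratio(tp, tp + fp),    # TP / (TP + FP)
        "npv": safe_ratio(tn, tn + fn),          # TN / (TN + FN)
    }

# Hypothetical counts.
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
```

Sensitivity and specificity condition on the true class, while precision and NPV condition on the predicted class, so the latter two shift with prevalence.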

AUROC

The Area Under the Receiver Operating Characteristic curve, which plots the True Positive Rate (TPR) vs. the False Positive Rate (FPR) across thresholds.
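
AUROC can also be read as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. The sketch below computes it that way by comparing every positive/negative pair; it is a simple quadratic-time illustration on made-up scores, not an optimized implementation.

```python
def auroc(y_true, scores):
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
area = auroc(y_true, scores)  # 3 of 4 pairs are ordered correctly: 0.75
```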

Memorization

A state where a model has high training performance but low performance on new data.

Pipelines

A code structure (e.g., from Scikit-Learn) that reduces accidental leakage by ensuring preprocessing transforms are fit only on the training data.
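
The idea behind a pipeline can be sketched without any library: a preprocessing step learns its statistics from the training data only and then applies them unchanged to other data. The `Standardizer` class below is a hypothetical stand-in written for illustration; it is not scikit-learn's API, although scikit-learn's `Pipeline` follows the same fit-then-transform pattern.

```python
from statistics import mean, pstdev

class Standardizer:
    """Hypothetical preprocessing step with a fit/transform interface."""

    def fit(self, values):
        # Learn the statistics from whatever data is passed in.
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0  # fall back to 1.0 if std is zero
        return self

    def transform(self, values):
        # Apply the already-learned statistics without refitting.
        return [(v - self.mean_) / self.std_ for v in values]

# Made-up feature values.
x_train = [1.0, 2.0, 3.0, 4.0]
x_test = [10.0]

scaler = Standardizer().fit(x_train)        # fit on training data only
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)    # test data is transformed, never fit
```

Calling `fit` on the combined train and test values would let test-set statistics influence preprocessing, which is the accidental leakage a pipeline is meant to prevent.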

End-to-end Machine Learning

Building a complete machine learning system that can go from data input to prediction output.

Defensible Workflow Components

Key elements of a machine learning process including framing, evaluation, reporting, and reproducibility.

Generalization

The ability of a model to perform well on unseen data, as opposed to just learning the training data.

Supervised Learning Loop

A process in machine learning where model parameters are repeatedly adjusted to reduce errors on the training data, with performance then checked on held-out data.

Loss Function

A function that quantifies how far a model's predictions are from the actual outcomes; training adjusts parameters to make it smaller.

TP (True Positive)

The number of positive cases correctly predicted by the model.

FP (False Positive)

The number of negative cases incorrectly predicted as positive by the model.

TN (True Negative)

The number of negative cases correctly predicted by the model.

FN (False Negative)

The number of positive cases incorrectly predicted as negative by the model.

Confusion Matrix

A table used to describe the performance of a classification model by highlighting true vs. predicted classifications.

Sensitivity (Recall)

The ability of a model to correctly identify positive cases.

Precision

The ratio of correctly predicted positive observations to the total predicted positives.

ROC Curve

A plot of the True Positive Rate against the False Positive Rate at every classification threshold, showing how a model trades sensitivity for specificity.

Cross-Validation

A technique for estimating how well a model will generalize to independent data by repeatedly fitting it on some folds of the data and evaluating it on the remaining held-out fold.
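
A minimal sketch of k-fold index generation, assuming shuffled folds of near-equal size. The function name, the choice of five folds, and the seed are illustrative assumptions.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    # Shuffle indices once, then deal them into k folds.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        # Training indices are every fold except the held-out one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

splits = list(k_fold_indices(10, k=5))
```

Each sample is held out exactly once, so averaging the metric over the `k` held-out folds uses all of the data for evaluation without ever scoring a model on data it was fit on.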

Baseline Model

The simplest model used to compare the performance of more complex machine learning models.

Label Noise

The inaccuracies present in the labels of the dataset which can degrade model performance.

Data Leakage

The unintentional use, during model training, of information that would not be available at prediction time, such as data from the validation or test set.

Feature Selection

The process of identifying and selecting a subset of relevant features for model training.

Hyperparameters

Settings or configurations that are set before training a model, influencing its learning process.

Reproducibility

The ability to obtain consistent results when the same methods are applied to the same data in repeated runs.

Biomedical Data

Data that is used in machine learning models pertaining to health and medical conditions.

Clinical Text Processing

The use of natural language processing techniques to analyze clinical texts.