Block 1

Last updated 3:32 PM on 5/24/26

32 Terms

New cards

Statistical Learning

Focuses on the foundations, model interpretation, and understanding how predictors relate to the outcome (Y). It provides long-lasting structural frameworks.

New cards

Machine Learning / AI

Prioritizes prediction accuracy, generalization, and the use of highly complex, non-linear algorithms (like deep learning), where individual feature interpretation is often sacrificed ("black box" models).

New cards

Supervised Learning

You possess data containing both the predictors (X) and a known, observed outcome (Y). The model's performance is evaluated directly against this known outcome.

New cards

Unsupervised Learning

You only possess predictor data (X); there is no observed outcome variable Y. The goal shifts from prediction to discovering underlying groupings or patterns within the data. (CFA)

New cards

Training set

The portion of data used to build and fit your model.

New cards

Test set

A separate data split used to evaluate how well the model predicts on unseen observations. Evaluating performance only on your training data creates a severe risk of overfitting, as complex models will simply "memorize" data noise.

New cards

Internal Validation

Splitting a single, cohesive dataset into a training slice and a testing slice (e.g., a 70/30 split).
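
A minimal sketch of a 70/30 split, assuming the observations fit in a plain Python list. The function name `train_test_split`, the seed, and the stand-in data are illustrative assumptions, not part of the original card.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))                        # stand-in for 100 observations
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before slicing matters when the rows are stored in a meaningful order (for example, by collection date), because an unshuffled slice would give the two sets different distributions.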

New cards

External Validation

Taking a model fully trained on one dataset (e.g., NHANES) and testing its predictive validity on an entirely separate population dataset collected elsewhere (e.g., Framingham).

New cards

Why does external validation often fail? What methods compensate for it?

Because they feature different parameter ranges, demographics, baseline mean values, or geographic variations. Methods to compensate for this include recalibration (adjusting model parameters or intercepts to match the new population's scale).

New cards

What is Data Leakage and what are solutions?

It is information from the test dataset that spills into the training workflow. For example, if you perform data imputation or scaling on the full dataset before splitting it into train/test, your training step has already seen the structure of the test data. The solution is to split first, then impute and scale the test data using only the "recipe" generated from the training data.

New cards

What happens to interpretability when prediction accuracy increases?

As model complexity increases (e.g., moving from a simple linear fit to a flexible tree-based method like XGBoost), the model's accuracy often improves, but the ability to interpret each variable's impact on the output drops.

New cards

The Bias-Variance Trade-off

Components: Error = Bias² + Variance + Irreducible Noise.

The principle that you cannot simultaneously minimize both reducible sources of prediction error. As model complexity increases to reduce bias (better fitting), variance increases (higher sensitivity to the training data). For good generalization we want both to be low, so the goal is the level of complexity that balances them.
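
For squared-error loss, the decomposition above is commonly written as follows. The notation is an assumption added here: the outcome is y = f(x) + ε with noise variance σ², f̂ is the fitted model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```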

New cards

Bias (Underfitting)

False assumptions in your model type (e.g., fitting a perfectly straight line through data that actually follows a curved path). High-bias models miss major structural data patterns.

New cards

Variance (Overfitting)

Extreme model sensitivity to minor variations in your specific training dataset. High-variance models "connect the dots" perfectly, tailoring themselves to random data noise rather than the true signal.

New cards

Irreducible Noise

The fundamental variation in the environment or measurement tools that can never be filtered out, regardless of how flawless your model is.

New cards

The Parsimony Principle ("Less is More")

Models with fewer features are easier to interpret and less likely to suffer from high variance.

New cards

Classical Selection Methods

All Subsets: A naïve strategy that tests every mathematical combination of your variables.

Forward Selection: Begins with an empty model, tests predictors one-by-one, extracts the single best variable, and iteratively adds variables until a specific stopping criterion (like a p-value threshold or a drop in Adjusted R²) is hit.

Backward Selection: Begins with all possible variables included in a single complex model, and iteratively strips away the variable providing the lowest statistical contribution to the fit.

New cards

Regularization / Penalization Concept

A way to stop a model from being too flexible (overfitting). It adds a mathematical penalty to the objective function that constrains (shrinks) the size of the regression coefficients (β), so the model is simpler and generalizes better.
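
One common way to write the penalized objective, given here as a sketch. The symbols are assumptions added for illustration: n observations, p predictors, and a tuning parameter λ ≥ 0 that sets the penalty strength.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^2
  + \lambda\, P(\beta),
\qquad
P_{\text{ridge}}(\beta) = \sum_{j=1}^{p} \beta_j^2,
\quad
P_{\text{lasso}}(\beta) = \sum_{j=1}^{p} |\beta_j|
```

With λ = 0 this reduces to ordinary least squares; larger λ shrinks the coefficients more strongly. The intercept β₀ is usually left unpenalized.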

New cards

Ridge Regression (L2 Penalty)

Adds a penalty proportional to the sum of the squared coefficients. It shrinks coefficients toward zero but never forces them to exactly zero; all variables remain in the final model.

New cards

Lasso Regression (L1 Penalty)

Adds a penalty proportional to the sum of the absolute values of the coefficients. Because of the geometry of this penalty, Lasso forces less important coefficients to exactly zero, effectively acting as an automated variable selection tool.
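
The difference between the two penalties can be seen in the simplest possible case: one predictor and no intercept, where both estimators have a closed form. This is an illustrative sketch only; the function names and toy data are assumptions, and real models with many predictors are fitted numerically.

```python
def ridge_coef(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for one predictor, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)                   # shrinks, but only reaches 0 as lam -> infinity

def lasso_coef(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b| for one predictor (soft-thresholding)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    shrunk = max(abs(sxy) - lam / 2, 0.0)      # clipped at zero once lam is large enough
    return (1 if sxy >= 0 else -1) * shrunk / sxx

x = [-1.0, 0.0, 1.0]                           # toy, already centered predictor
y = [-0.9, 0.1, 1.1]
for lam in (0.0, 1.0, 10.0):
    print(lam, ridge_coef(x, y, lam), lasso_coef(x, y, lam))
```

At the largest penalty in this toy example the ridge coefficient is small but still positive, while the lasso coefficient is exactly zero, which is the variable-selection behaviour described on the card.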

New cards

The Scaling Mandate

Regularization methods are highly sensitive to data scales. If one variable is measured in thousands (e.g., blood cell counts) and another ranges from 0 to 1 (e.g., biological sex), the penalty term will unfairly squeeze the smaller-scaled variable. Therefore, features must be standard-scaled (mean = 0, SD = 1) before running Ridge or Lasso.
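
A minimal sketch of standard scaling, assuming a single numeric feature stored in a list. The function names and the blood cell counts are hypothetical, and the population standard deviation (`pstdev`) is used here as one reasonable choice. The mean and SD are learned from the training values only and then reused on the test values.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn the scaling recipe (mean and SD) from training data only."""
    return mean(train_values), pstdev(train_values)

def apply_scaler(values, recipe):
    """Standardize values with a previously learned (mean, SD) recipe."""
    mu, sd = recipe
    return [(v - mu) / sd for v in values]

train_counts = [4000.0, 5000.0, 6000.0, 7000.0]   # hypothetical blood cell counts
test_counts = [5500.0, 9000.0]

recipe = fit_scaler(train_counts)
train_scaled = apply_scaler(train_counts, recipe)  # mean 0, SD 1 by construction
test_scaled = apply_scaler(test_counts, recipe)    # same recipe, no refitting
print(train_scaled, test_scaled)
```

The scaled test values need not have mean 0 and SD 1; that is expected, because the test set played no part in building the recipe.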

New cards

Non-Linear Concepts (Tree-based methods)

Algorithms that use a series of "if-then" rules to split complex, non-linear data into distinct, manageable regions. Unlike linear models, they generally do not require transformations of the predictors (such as scaling or log transforms).

New cards

Tree-Based Interactions

Unlike classical linear regressions which assume relationships follow standard straight lines, tree-based models (like XGBoost) naturally capture step-wise, non-linear boundaries and intricate multi-variable interactions without requiring you to manually write explicit interaction parameters (like age * BMI) into your formulas.

New cards

Handling Extreme Outliers (tree-based methods)

In linear regression, a severe outlier can radically skew your entire slope line. In tree-based models, outliers have a highly restricted impact because the model splits data based on internal value intervals (e.g., X > 150), meaning an extreme value of 1,000 behaves identically to a value of 151 during that split.
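
A tiny sketch of a single tree split (a "stump") makes the point concrete. The threshold of 150 comes from the card; the two leaf predictions are made-up values for illustration.

```python
def stump_predict(x, threshold=150.0, low_value=120.0, high_value=160.0):
    """One tree split: every x above the threshold receives the same prediction."""
    return high_value if x > threshold else low_value

just_above = stump_predict(151.0)
extreme = stump_predict(1000.0)
print(just_above, extreme)   # the split cannot tell these two inputs apart
```

Only the side of the threshold matters to the split, so the magnitude of the extreme value never enters the prediction.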

New cards

Shapley (SHAP) Values (tree-based methods)

A method to measure how much each feature contributes to one specific prediction. A SHAP value quantifies how much a variable shifted the final prediction for an individual person relative to a baseline prediction.
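
The idea can be sketched with an exact Shapley calculation on a toy model. This is not the algorithm used by SHAP libraries for tree models; it is the textbook definition, assuming a single reference "baseline" person and a hypothetical prediction function with two features, so that every feature ordering can be enumerated.

```python
from itertools import permutations

def shapley_values(predict, person, baseline):
    """Exact Shapley values: average marginal contribution over all feature orders."""
    names = list(person)
    totals = {name: 0.0 for name in names}
    orders = list(permutations(names))
    for order in orders:
        current = dict(baseline)              # start from the reference person
        previous = predict(current)
        for name in order:
            current[name] = person[name]      # switch one feature to the person's value
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orders) for name, total in totals.items()}

def predict_sysbp(f):
    # Hypothetical model with an age-by-BMI interaction; not a fitted model.
    return 100.0 + 0.5 * f["age"] + 1.0 * f["bmi"] + 0.01 * f["age"] * f["bmi"]

baseline = {"age": 40.0, "bmi": 25.0}
person = {"age": 60.0, "bmi": 30.0}
phi = shapley_values(predict_sysbp, person, baseline)
print(phi)
```

The contributions add up to the difference between this person's prediction and the baseline prediction, which is what lets a SHAP value be read as "how much this variable moved the prediction".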

New cards

Missing Data & Imputation Theory

Missing data occur when values are unrecorded for variables of interest. Imputation theory provides statistical frameworks to estimate and replace these missing values, minimizing bias and preserving dataset size.

New cards

The Feasibility of Imputation

Imputation is feasible when a variable has a manageable amount of missingness (e.g., around 30%). However, if a variable is missing 50% or more of its data, or if the missingness is tied to an unobserved systematic bias, imputation becomes problematic and can itself introduce bias.

New cards

Imputing Predictors vs. Outcomes

You should only use algorithms like K-Nearest Neighbors (K-NN) to impute missing independent predictors (X), leveraging information from related variables (e.g., using waist and hip measurements to impute a missing BMI value). You must never impute a missing target outcome variable (Y, such as sysBP). If the outcome is missing, that observation must be omitted from model training.
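
A minimal K-NN imputation sketch for the waist/hip/BMI example, assuming rows are stored as dictionaries. The function name, the measurements, and the choice of k are hypothetical, and the outcome variable is deliberately not involved.

```python
import math

def knn_impute(target_row, complete_rows, predictors, missing, k=3):
    """Fill target_row[missing] with the mean of its k nearest complete rows."""
    def distance(row):
        # Euclidean distance on the observed predictors only
        return math.sqrt(sum((row[p] - target_row[p]) ** 2 for p in predictors))
    nearest = sorted(complete_rows, key=distance)[:k]
    return sum(row[missing] for row in nearest) / len(nearest)

# Hypothetical measurements (waist and hip in cm); not real study data.
complete = [
    {"waist": 80.0, "hip": 95.0, "bmi": 22.0},
    {"waist": 82.0, "hip": 97.0, "bmi": 23.0},
    {"waist": 100.0, "hip": 110.0, "bmi": 30.0},
    {"waist": 104.0, "hip": 112.0, "bmi": 32.0},
]
incomplete = {"waist": 81.0, "hip": 96.0, "bmi": None}
incomplete["bmi"] = knn_impute(incomplete, complete, ["waist", "hip"], "bmi", k=2)
print(incomplete)
```

In practice the predictors would usually be scaled before computing distances, and the pool of complete rows would come from the training set only.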

New cards

The "Recipe" and Data Leakage Prevention

When performing validation, you split your data into training and test sets before imputation. You then create an imputation "recipe" based on the training data. This recipe is applied to the test data. If you mix the two datasets before imputing, you create data leakage.
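
The split-then-impute order can be sketched as follows. Mean imputation is used here as a deliberately simple stand-in for whatever imputation method the recipe actually contains; the function names and BMI values are hypothetical.

```python
from statistics import mean

def fit_mean_imputer(train_rows, columns):
    """Build the recipe: per-column means from observed training values only."""
    return {c: mean(r[c] for r in train_rows if r[c] is not None) for c in columns}

def apply_imputer(rows, recipe):
    """Return copies of rows with missing values filled in from the recipe."""
    return [
        {c: (recipe[c] if c in recipe and v is None else v) for c, v in r.items()}
        for r in rows
    ]

# The split has already happened before any imputation is done.
train = [{"bmi": 22.0}, {"bmi": None}, {"bmi": 28.0}]
test = [{"bmi": None}, {"bmi": 40.0}]

recipe = fit_mean_imputer(train, ["bmi"])      # learned from training rows only
train_filled = apply_imputer(train, recipe)
test_filled = apply_imputer(test, recipe)      # test rows reuse the training recipe
print(recipe, test_filled)
```

Had the mean been computed on the pooled data, the test value of 40.0 would have influenced the fill-in value used during training, which is exactly the leakage the card warns about.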

New cards

Why External Validation Fails

Distribution and Range Discrepancies: If the external set contains a wider range or completely different extreme values of a variable such as systolic blood pressure (e.g., up to 295 mmHg vs. NHANES' 231 mmHg), the model is forced to extrapolate outside its learned boundaries.

Historical/Temporal Shifts: Baseline population statistics shift over generations due to public health interventions or cleaner living habits.

Compensating for Failure (Recalibration): If your model has solid predictive discrimination but systematically over- or under-predicts when applied to a new population, you can perform recalibration. This involves taking the frozen model and adjusting its parameters or updating the intercept (β₀) to align perfectly with the baseline risk metrics of the new geography or population.
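
A minimal sketch of the simplest form of recalibration, an intercept update, assuming a frozen linear model for a continuous outcome. The coefficients and the new-population data are hypothetical. The slopes are left untouched and only β₀ is shifted by the average prediction error observed in the new population.

```python
def recalibrate_intercept(intercept, slopes, new_rows, outcome):
    """Shift the intercept so the mean prediction matches the mean outcome in new data."""
    def predict(row, b0):
        return b0 + sum(b * row[name] for name, b in slopes.items())
    mean_error = sum(row[outcome] - predict(row, intercept) for row in new_rows) / len(new_rows)
    return intercept + mean_error

# Hypothetical frozen model: sysBP = 100 + 0.5 * age
slopes = {"age": 0.5}
new_population = [
    {"age": 40.0, "sysbp": 130.0},
    {"age": 60.0, "sysbp": 138.0},
    {"age": 80.0, "sysbp": 152.0},
]
new_intercept = recalibrate_intercept(100.0, slopes, new_population, "sysbp")
print(new_intercept)
```

Because the slopes are unchanged, the ranking of individuals (discrimination) stays the same; only the systematic over- or under-prediction is corrected.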

New cards

Cross Sectional

Data collected from subjects at a single, fixed point in time. Example: baseline demographic surveys.

New cards

Longitudinal

The exact same subjects are tracked and measured repeatedly over a span of time. Example: the Framingham Heart Study tracking cardiovascular health over decades.