Definition: Model Formulation
Intuition: How do different variables relate to an outcome
Definition: The core assumption is a linear relationship between a dependent variable Y and one or more independent variables X_i: Y = B_0 + B_1 X_1 + ... + B_p X_p + e
Components:
Y: Dependent variable
X_i: Independent variables (predictors)
B_0, B_i, e: Intercept, coefficients, error term (residual)
Used: The basis for models like CAPM, factor models, and many trading strategies
Definition: Ordinary Least Squares (OLS) Estimation - Matrix Form
Intuition: Picks the coefficients (B) that make the total squared error as small as possible
Definition: Finds the coefficients B that minimize the Residual Sum of Squares
Components:
Data matrix X
Response vector y
Closed-form solution using a transpose, a product, and an inverse: B = (X^T X)^(-1) X^T y
Uses: Provides the coefficient estimates B that the linear regression model needs
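A minimal sketch of the matrix-form estimate, assuming NumPy and a made-up toy dataset generated exactly from y = 1 + 2x (the data and variable names are illustrative, not from the notes). It solves the normal equations (X^T X) B = X^T y directly instead of forming the inverse, which is usually the more numerically stable choice.

```python
import numpy as np

# Hypothetical toy data: 5 observations, 1 predictor, generated exactly from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Data matrix X with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (X^T X) B = X^T y rather than inverting X^T X explicitly.
B = np.linalg.solve(X.T @ X, X.T @ y)

# Residual Sum of Squares at the fitted coefficients.
residuals = y - X @ B
rss = float(residuals @ residuals)
```

Because the toy data has no noise, B should recover the intercept and slope used to build it, and the RSS should be numerically zero.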
Definition: OLS Assumption - Linearity
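Intuition: Each coefficient scales its own term directly; no coefficient is squared, multiplied by another coefficient, or placed inside a function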
Definition: The model is linear in the parameters B
Components:
Coefficients B
Unobservable random error ϵ
Uses: Check that the relationship is not non-linear in the parameters; the predictors themselves may still be transformed (e.g., X^2 or log X)
Definition: OLS Assumption - No multicollinearity
Intuition: No predictor can be written as an exact linear combination of the others (strong but imperfect correlation is allowed, although it hurts precision)
Definition: X^TX is invertible. There is no perfect linear relationship between predictors.
Components:
Data matrix X
Data matrix X transposed (^T)
Their product X^T X, which must be invertible
Uses: Prevents inflated standard errors and unstable coefficient estimates
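One possible way to check this assumption, sketched with NumPy on made-up predictors (all names and values here are hypothetical): if the data matrix has fewer linearly independent columns than columns, X^T X cannot be inverted.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1                          # exact multiple of x1: perfect collinearity
x3 = np.array([1.0, 0.0, 2.0, 1.0])    # not a linear combination of the others

# Both data matrices have 3 columns (intercept plus two predictors).
X_bad = np.column_stack([np.ones(4), x1, x2])
X_ok = np.column_stack([np.ones(4), x1, x3])

# Full column rank (3 here) is what makes X^T X invertible.
rank_bad = np.linalg.matrix_rank(X_bad)
rank_ok = np.linalg.matrix_rank(X_ok)
```

Here rank_bad is 2 and rank_ok is 3, so only X_ok satisfies the assumption. With real data, near-collinearity rarely shows up as an exact rank drop, so a large condition number (np.linalg.cond) is a common warning sign instead.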
Definition: OLS Assumption - Homoscedasticity
Intuition: The spread of the errors stays the same no matter which values the predictors take or which observation we look at
Definition: The error variance is constant across all observations
Components:
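Conditional variance of each error term given X: Var(ϵ_i | X)
Must equal the same constant σ^2 for every observation i
Uses: Often violated in finance, e.g., when high-return periods also have high volatility; OLS remains unbiased, but the usual standard errors are no longer reliable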
Definition: OLS Assumption - No autocorrelation
Intuition: The error for one observation tells you nothing about the error for any other observation
Definition: Errors are uncorrelated across observations
Components:
Covariance between two error terms given X: Cov(ϵ_i, ϵ_j | X)
Must equal zero
Applies to any two distinct observations (i ≠ j)
Uses: Violations are common in time series data (e.g., momentum strategies), where errors in adjacent periods tend to move together
Definition: OLS Assumption - Maximum Likelihood Estimator
Intuition: If the errors follow a bell curve centered at 0, the OLS estimator coincides with the maximum likelihood estimator (MLE)
Definition: If we add the assumption that ϵ ~ N(0, σ^2), the OLS estimator is also the MLE
Components:
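First five assumptions (linearity, strict exogeneity, no multicollinearity, homoscedasticity, no autocorrelation)
Error term is normally distributed
Uses: Confirms that the OLS B is also the most likely B under normal errors; the exact finite-sample t-test and F-test rely on this, and without normality they hold only approximately in large samples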
Definition: OLS Assumption - Strictly Exogenous
Intuition: The error term has an expected value of 0 no matter what values the independent variables take
Definition: The error term is uncorrelated with the predictors
Components:
The expectation of each error term given the data set X is equal to zero
Uses: Frequently violated in finance (e.g., through omitted variables); a violation leads to biased estimators
Definition: R^2
Intuition: How much variance in the result is explained by the data matrix
Definition: Proportion of the variance in Y that is predictable from X
Components:
1 - (RSS / TSS)
RSS: Residual Sum of Squares (variation in error between observed data and modeled values)
TSS: Total Sum of Squares (variation in the observed data)
Uses: Compare models with same number of predictors
Definition: Adjusted R^2
Intuition: How much variance in the result is explained by the data matrix, with a penalty for adding irrelevant predictors
Definition: Proportion of the variance in Y that is predictable from X, adjusted for the number of predictors
Components:
1 - ( (RSS / (m - p - 1)) / ( TSS / (m - 1) ) )
RSS and TSS
Number of observations (m) and number of predictors (p)
Uses: Compare models with different numbers of predictors, higher is better
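The two formulas, 1 - (RSS / TSS) and its adjusted version, can be sketched in plain Python as follows. The function names and the toy numbers are illustrative assumptions, with m taken to be the number of observations.

```python
def sums_of_squares(y, y_hat):
    """Return (RSS, TSS) for observed values y and fitted values y_hat."""
    mean_y = sum(y) / len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return rss, tss

def r_squared(y, y_hat):
    rss, tss = sums_of_squares(y, y_hat)
    return 1.0 - rss / tss

def adjusted_r_squared(y, y_hat, p):
    """p is the number of predictors; m = len(y) is the number of observations."""
    m = len(y)
    rss, tss = sums_of_squares(y, y_hat)
    return 1.0 - (rss / (m - p - 1)) / (tss / (m - 1))

# Hypothetical observed and fitted values from a one-predictor model.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.5, 2.0, 3.0, 4.0, 4.5]

r2 = r_squared(y, y_hat)                     # RSS = 0.5, TSS = 10, so about 0.95
adj_r2 = adjusted_r_squared(y, y_hat, p=1)   # slightly lower because of the penalty
```

The adjusted value is below the plain value whenever the fit is imperfect and p is at least 1, which is what lets it penalize predictors that add little.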
Definition: Standard Error (SE) of βi
Intuition: How much the estimate of βi would bounce around its true value if we re-estimated it on different samples
Definition: Estimated standard deviation of a parameter estimate
Components:
Square root of variance of coefficient
Uses: Construct confidence intervals and perform hypothesis tests on individual coefficients
Definition: t-statistic βi
Intuition: if the true βi were actually 0, how far away is our estimate of βi from 0? That distance is the t-statistic, and a high one means a low probability that βi is actually 0.
Definition: the ratio of the difference in a number’s (coefficient’s) estimated value from its assumed value (0) to its standard error
Components:
t = (β / SE(β))
Uses: test null hypothesis H0: βi = 0. Follows a t-distribution with m-p-1 degrees of freedom
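A rough sketch of how the standard errors and t-statistics could be computed from the OLS fit, assuming NumPy, homoscedastic errors, and a small made-up dataset (all variable names are hypothetical). The coefficient variance is estimated as sigma^2 * (X^T X)^(-1), with sigma^2 estimated by RSS / (m - p - 1).

```python
import numpy as np

# Hypothetical data: one predictor, five observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

X = np.column_stack([np.ones_like(x), x])   # intercept column plus predictor
m = X.shape[0]                              # number of observations
p = X.shape[1] - 1                          # number of predictors (excluding intercept)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (m - p - 1)   # estimated error variance

# Standard error = square root of each coefficient's estimated variance.
se = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# t-statistic for H0: coefficient = 0.
t_stats = beta_hat / se
```

For this toy data the slope estimate is 1.99 and its t-statistic is large, so the null hypothesis of a zero slope would be rejected at conventional levels.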
Definition: F-statistic
Intuition: Does regression model explain a meaningful amount of variation in the dependent variable compared to noise
Definition: Ratio that compares explained variance per parameter to unexplained variance per remaining degree of freedom
Components:
Numerator: Variation explained by the model per predictor, (TSS - RSS) / p
Denominator: Unexplained variation per remaining degree of freedom, RSS / (m - p - 1)
Uses: tests the null hypothesis that all slope coefficients are jointly equal to zero
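As a small illustration, the ratio can be written directly from RSS and TSS. This is a sketch assuming the standard overall F-test with m observations and p predictors; the function name and inputs are hypothetical.

```python
def f_statistic(rss, tss, m, p):
    """Overall F-statistic: explained variation per predictor over
    unexplained variation per remaining degree of freedom."""
    explained_per_predictor = (tss - rss) / p
    unexplained_per_df = rss / (m - p - 1)
    return explained_per_predictor / unexplained_per_df

# Hypothetical values: RSS = 0.5, TSS = 10, 5 observations, 1 predictor.
f_value = f_statistic(rss=0.5, tss=10.0, m=5, p=1)   # about 57
```

A value far above 1 suggests the predictors jointly explain much more variation than noise alone would.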
Definition: Ridge Regression
Intuition: Improves prediction by shrinking coefficient magnitudes to reduce variance at the cost of introducing some bias
Definition: Regularized linear regression that minimizes squared errors plus an L2 penalty on the coefficients
Components:
Loss function: RSS measuring fit to the data
L2 penalty: Squared magnitude of coefficients that discourages large weights
Regularization parameter (lambda): Controls strength of coefficient shrinkage
Uses: Handles multicollinearity and improves out-of-sample performance in high-dimensional regressions
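A minimal sketch of the ridge estimate in closed form, (X^T X + lambda * I)^(-1) X^T y, assuming NumPy, simulated data, and no intercept (in practice the intercept is usually left unpenalized and predictors are standardized first). The function name ridge_fit and the data are illustrative only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients; lam = 0 reduces to OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data with two nearly collinear predictors (columns 1 and 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 1] + 0.01 * rng.normal(size=50)
y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=50)

beta_ols = ridge_fit(X, y, lam=0.0)
beta_ridge = ridge_fit(X, y, lam=10.0)

# The L2 penalty shrinks the overall size of the coefficient vector.
norm_ols = np.linalg.norm(beta_ols)
norm_ridge = np.linalg.norm(beta_ridge)
```

The ridge coefficient vector always has a smaller norm than the OLS one for a positive lambda, which is the shrinkage the card describes.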
Definition: Lasso Regression
Intuition: Performs both shrinkage and variable selection by forcing some coefficients exactly to zero
Definition: Regularized linear regression that minimizes squared errors plus an L1 penalty on the coefficients
Components:
Loss function: Residual sum of squares capturing model fit
L1 penalty: Absolute values of coefficients that promote sparsity
Regularization parameter (lambda): Determines shrinkage and variable elimination
Uses: Feature selection, especially when many predictors may be irrelevant
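One common way to fit the lasso is coordinate descent with soft-thresholding. The sketch below assumes NumPy, simulated data, no intercept, and the objective (1/2) * RSS + lambda * sum(|B_j|); the function names are hypothetical and other scalings of the penalty are also in use.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 when |z| <= lam."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize 0.5 * RSS + lam * L1 norm by updating one coefficient at a time."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # squared norm of each column
    for _ in range(n_iters):
        for j in range(p):
            # Residual with predictor j's current contribution added back in.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

# Simulated data where only the first of three predictors matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

beta_lasso = lasso_coordinate_descent(X, y, lam=20.0)
```

With this penalty strength the two irrelevant coefficients are set exactly to zero, while the relevant one is shrunk somewhat below its true value of 3.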
Definition: Bias
Intuition: Error from approximating a real-world function with a simpler model
Definition: Error when the expected value of an estimator does not equal true parameter value
Components:
True parameter: The actual coefficient values generating the data
Estimator expectation: Average value of the estimated coefficients across samples
Model constraints: Assumptions or regularization that distort the estimator toward simpler models
Uses: Understand the bias-variance trade off and to justify regularization methods like ridge and lasso
Definition: Variance
Intuition: Error from model being too sensitive to training data
Definition: Expected squared deviation of a model’s prediction from its own average prediction across different training datasets
Components:
Training sample randomness: Different datasets drawn from the same process lead to different fitted models.
Estimator instability: Sensitivity of coefficients or prediction to changes in the data
Uses: Understand overfitting risk and to motivate regularization methods that stabilize model estimates
Definition: Bias-Variance Tradeoff
Intuition: Finding the middle ground of how complex or simple a model should be
Definition: Finding the optimal balance between complex models (low bias, high variance) and simple models (high bias, low variance)
Components:
Complex model (high degree polynomial): low bias but high variance (overfitting)
Simpler model (OLS): high bias but low variance (underfitting)
Uses: Obtain the least amount of prediction error
Definition: Decision Tree (Regression)
Intuition: A sequence of if-else rules that split the data into groups, where each group predicts the average outcome of the observations inside it.
Definition: Partitions the feature space into a set of non-overlapping regions. For any given observation, the model predicts the mean of the response values of the training points that fall into the same region.
Components:
Splits: At each node, the algorithm chooses a feature and a split point. The goal is to make the response values in the resulting child nodes as similar as possible.
Splitting criterion: For regression, splits are chosen to minimize squared error, usually measured by RSS or MSE. In short, we split it where it reduces variability the most.
Leaves: Each terminal node outputs a constant prediction, equal to the mean of the response values in that node.
Training: The tree is built greedily, one split at a time, choosing the locally best split at each step
Uses: Nonlinear regression, baseline model before moving onto more complex ones
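The split search for a single feature can be sketched in plain Python. This is an illustrative brute-force version with hypothetical function names; real implementations are more efficient and search across all features.

```python
def rss(values):
    """Residual sum of squares around the mean of a node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, total_rss) for the split on x that minimizes
    the combined RSS of the two child nodes."""
    pairs = sorted(zip(x, y))
    best_threshold, best_rss = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no valid threshold between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_threshold, best_rss = threshold, total
    return best_threshold, best_rss

# Hypothetical data with an obvious jump between x = 3 and x = 10.
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
threshold, split_rss = best_split(x, y)   # threshold 6.5, RSS 0.0
```

A full tree would apply the same search recursively to each child node, which is the greedy, one-split-at-a-time training described above.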
Decision Tree Pros
Easy to interpret (white-box model)
Start at root, ask a series of yes/no questions, end in a leaf with a fixed numerical prediction (the mean)
You can interpret locally and don’t need to understand entire model, just path your observation took
Can handle non-linear relationship
Represents the response function as a piece-wise constant function over a partitioned feature space
Instead of fitting one smooth curve everywhere, the tree says “in this part of the space, behave this way; in another part, behave differently.”
Decision Tree Cons
High variance (small changes in data can lead to a very different tree)
Prone to overfitting
Generally lower predictive accuracy than ensemble methods
Ensemble Methods
Intuition: Combining many trees reduces variance (bagging, random forests) and bias (boosting)
Definition: Combines the predictions of multiple individual decision trees into one model
Types:
Bagging (bootstrap aggregating)
Random forests (improve bagging)
Boosting (sequentially building trees)
Uses: Improve overall predictive performance and robustness
Bagging (Bootstrap Aggregating)
Intuition: Decision trees are noisy - small changes in the data can change them a lot. Bagging reduces that noise by training many trees on slightly different versions of the data and then averaging their predictions
Definition: A technique that reduces the variance of a learning algorithm by repeatedly resampling the training data with replacement, fitting the model on each resample, and aggregating the predictions
Components:
Bootstrapped samples: Each model is trained on a dataset created by sampling with replacement from the original data
Base learner (decision trees): Uses full, unpruned decision trees, which are high-variance learners
Aggregation: predictions are averaged (regression)
Out-of-bag error: Each tree sees roughly 2/3 of the distinct observations; the remaining 1/3 or so can be used to estimate test error without cross-validation
Uses: Used primarily to reduce variance in unstable models like decision trees
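A quick simulation, assuming plain sampling with replacement, suggests where the roughly two-thirds in-bag share comes from: the expected share of distinct observations in a bootstrap sample approaches 1 - 1/e, about 0.632, as the dataset grows. The sizes and names below are arbitrary choices for illustration, and no trees are fitted here.

```python
import random

random.seed(0)

n = 1000        # hypothetical number of training observations
n_trees = 200   # hypothetical number of bootstrap samples (one per tree)

in_bag_fractions = []
for _ in range(n_trees):
    # Bootstrap sample: draw n row indices with replacement.
    sample = [random.randrange(n) for _ in range(n)]
    # Fraction of distinct original observations that made it into the sample.
    in_bag_fractions.append(len(set(sample)) / n)

avg_in_bag = sum(in_bag_fractions) / n_trees
avg_out_of_bag = 1.0 - avg_in_bag
```

The observations a given tree never saw (its out-of-bag set) act as a built-in validation set for that tree.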
Random Forests
Intuition: Improve on bagging by making individual trees less similar. They do this by randomly limiting which features a tree is allowed to consider at each split, so different trees learn different structures.
Definition: An ensemble of decision trees trained on bootstrapped samples, where each split considers only a random subset of predictors. The final prediction is the average of the individual tree predictions.
Components:
Bootstrap sampling: same as bagging - each tree is trained on a resampled dataset
Random feature selection: At each split, only m out of p predictors are considered
Hyperparameter m: controls how aggressive the decorrelation is
Smaller m → more randomness → lower correlation → lower variance
Aggregation: Predictions are averaged over trees
Feature importance: Random forests naturally produce variable importance measures by tracking how much splits reduce errors across trees
Uses: When you need a strong, general-purpose regression model that works well when relationships are nonlinear, interactions are important, and interpretability is secondary to performance
Boosting
Intuition: Reduces bias by fitting weak learners sequentially, where each learner is trained to correct the residual errors of the current ensemble
Definition: Builds models sequentially by fitting new trees to the residuals of the current model, with each tree’s contribution scaled by a learning rate to control overfitting.
Components:
Residuals: The errors made by the current ensemble
Weak learners: Typically small, shallow trees
Learning rate (gamma): controls how much each new tree contributes to the ensemble
Smaller learning rate → slower learning → often better generalization, at the cost of needing more trees
Number of trees: controls model complexity
Uses:
Often used when maximum predictive accuracy is required and interpretability is secondary
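The residual-fitting loop can be sketched in plain Python with one-split trees (stumps) as the weak learners. This is a simplified illustration for squared-error loss on a single feature; the function names, data, and default settings are assumptions, not a reference implementation.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals.
    Returns (threshold, left_mean, right_mean)."""
    best, best_rss = None, float("inf")
    for threshold in sorted(set(x))[:-1]:   # keep the right side non-empty
        left = [r for xv, r in zip(x, residuals) if xv <= threshold]
        right = [r for xv, r in zip(x, residuals) if xv > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        rss = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if rss < best_rss:
            best_rss, best = rss, (threshold, lm, rm)
    return best

def predict_stump(stump, xv):
    threshold, lm, rm = stump
    return lm if xv <= threshold else rm

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Sequentially fit stumps to the current residuals."""
    prediction = [sum(y) / len(y)] * len(y)   # start from the mean of y
    stumps, rss_history = [], []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Each tree's contribution is scaled by the learning rate.
        prediction = [pi + learning_rate * predict_stump(stump, xv)
                      for pi, xv in zip(prediction, x)]
        rss_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)))
    return stumps, prediction, rss_history

# Hypothetical nonlinear data: y = x^2.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
stumps, fitted, rss_history = boost(x, y)
```

The training RSS falls with each added tree. On real data the number of trees and the learning rate would normally be chosen with a validation set, since training error alone keeps improving even when the model starts to overfit.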
Definition: p-value
Intuition: If the true slope were actually zero, how surprising would this result be?
Definition: The probability of observing a t-statistic at least as extreme as the one computed, assuming the true coefficient is zero.
Components:
Null Hypothesis H0 : Typically β1 = 0, meaning no linear relationship between predictor and response
Test statistic (t-statistic): how many standard errors away from zero is the estimate
Uses: If the p-value is below the chosen significance level (commonly 0.05), the observed effect is unlikely if coefficient = 0 and we can reject the null hypothesis
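A rough, standard-library-only sketch of the calculation. It uses the normal distribution as a large-sample stand-in for the t-distribution, which is an assumption made here for simplicity; with few degrees of freedom the exact t-distribution gives somewhat larger p-values. The function name is hypothetical.

```python
import math

def two_sided_p_value_normal_approx(t_stat):
    """Two-sided p-value for H0: coefficient = 0, using the standard normal
    as an approximation to the t-distribution (reasonable for large samples)."""
    # 2 * (1 - Phi(|t|)) equals erfc(|t| / sqrt(2)).
    return math.erfc(abs(t_stat) / math.sqrt(2.0))

p_small_effect = two_sided_p_value_normal_approx(0.5)    # far from significant
p_borderline = two_sided_p_value_normal_approx(1.96)     # close to 0.05
p_strong_effect = two_sided_p_value_normal_approx(4.0)   # well below 0.05
```

The familiar 1.96 cutoff corresponds to a p-value of about 0.05 under this approximation.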
Definition: Stationarity
Intuition: A time series is stationary if its statistical behavior does not change over time.
Definition: Stationary if its mean, variance, and autocovariance structure do not depend on time.
Components:
Constant mean: no drift/trend
Constant volatility: variability stays the same and won’t explode
Autocovariance depends only on lag: Dependence between today and tomorrow is the same regardless of when today occurs (yesterday affects today the same way in 2010 as in 2025)
Uses: ARMA models assume this, and ARIMA models difference the series to reach it; also supports forecasting (stable prediction)
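To make the contrast concrete, here is a small simulation assuming NumPy and standard normal shocks (sizes and names are arbitrary). White noise keeps the same variance at every time step, while a random walk built from the same shocks has variance that grows with time, so it is not stationary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 2000, 100
shocks = rng.normal(size=(n_paths, n_steps))

white_noise = shocks                   # stationary: mean 0, variance 1 at every step
random_walk = shocks.cumsum(axis=1)    # non-stationary: variance grows with time

# Compare the cross-path variance early (step 10) and late (step 100).
wn_var_early, wn_var_late = white_noise[:, 9].var(), white_noise[:, 99].var()
rw_var_early, rw_var_late = random_walk[:, 9].var(), random_walk[:, 99].var()
```

Differencing the random walk recovers the white-noise shocks, which is the idea behind the "I" in ARIMA.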