SRM Formula Sheet

Statistical Learning

Modeling Problems

  • Types of Variables

    • Response Variable: Denoted as y or Y, it is the variable of primary interest.
    • Explanatory Variable: Denoted as x or X, it is used to study the response variable.
    • Count Variable: A quantitative variable, usually valid on non-negative integers.
    • Continuous Variable: A real-valued quantitative variable.
    • Nominal Variable: A categorical/qualitative variable with categories that lack a meaningful or logical order.
    • Ordinal Variable: A categorical/qualitative variable with categories possessing a meaningful or logical order.
  • Notation

    • y, Y: Response variable
    • x, X: Explanatory variable
    • i: Subscript for observations
    • n: Number of observations
    • j: Subscript for variables (excluding the response)
    • p: Number of variables (excluding the response)
    • \epsilon: Error term
    • \hat{y}, \hat{Y}, \hat{f}(x): Estimate/Estimator of f(x)
  • Regression Problems:

    • Y = f(x_1, \ldots, x_p) + \epsilon, where E[\epsilon] = 0, so E[Y] = f(x_1, \ldots, x_p)
    • Test MSE = E[(Y - \hat{Y})^2], which can be estimated using \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
    • For fixed inputs x_1, \ldots, x_p, the test MSE is Var(\hat{f}(x_1, \ldots, x_p)) + Bias(\hat{f}(x_1, \ldots, x_p))^2 + Var(\epsilon)
  • Classification Problems:

    • Test Error Rate = E[I(Y \neq \hat{Y})], which can be estimated using \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)
    • Bayes Classifier: f(x_1, \ldots, x_p) = \underset{c}{\text{arg max}} \Pr(Y = c \mid X_1 = x_1, \ldots, X_p = x_p)
  • Key Ideas

    • Parametric methods are disadvantaged by the risk of choosing a form for f that poorly approximates the truth.
    • Non-parametric methods are disadvantaged by the need for an abundance of observations.
    • Flexibility and interpretability are generally inversely related.
    • As flexibility increases, the training MSE (or error rate) decreases, but the test MSE (or error rate) follows a U-shaped pattern.
    • Low flexibility leads to a method with low variance and high bias; high flexibility leads to a method with high variance and low bias.
    • Overfitting occurs when a model is too complex, with too many parameters relative to the amount of data. This results in high accuracy on training data but poor performance on test data.
    • Underfitting occurs when a model is too simple, failing to represent data adequately and capture underlying patterns. This leads to poor fit on training data and poor performance on new data.

Contrasting Statistical Learning Elements

  • Flexibility and Interpretability:

    • Less flexible, more interpretable: Lasso, Subset selection, Least squares
    • Moderately flexible and interpretable: Regression trees, Classification trees
    • More flexible, less interpretable: Bagging, Boosting
  • Statistical Learning Method

    • Supervised Learning: Methods include SLR, MLR, GLM, Ridge, Lasso, Weighted Least Squares, Partial Least Squares, K-Nearest Neighbors, Decision Trees, Bagging, Random Forest, Boosting; all require a response variable.

    • Unsupervised Learning: Methods include Cluster Analysis and Principal Components Analysis; do not require a response variable.

    • Parametric:

      • SLR, MLR, GLM
      • Ridge, Lasso
      • Weighted Least Squares
      • Partial Least Squares
    • Non-Parametric:

      • K-Nearest Neighbors
      • Decision Trees
      • Bagging, Random Forest, Boosting
      • Cluster Analysis
      • Principal Components Analysis
      • Principal Components Regression
  • Data

    • Training Observations: Used to train/obtain \hat{f}
    • Test Observations: Not used to train/obtain \hat{f}
  • Statistical Learning Problems

    • Supervised: Has response variable.
    • Unsupervised: No response variable.
    • Regression: Quantitative response variable.
    • Classification: Categorical response variable.
  • Method Properties

    • Parametric: Functional form of f specified.
    • Non-Parametric: Functional form of f not specified.
    • Prediction: Output of \hat{f}
    • Inference: Comprehension of \hat{f}
    • Flexibility: \hat{f}'s ability to follow the data.
    • Interpretability: \hat{f}'s ability to be understood.

Linear Models

Simple Linear Regression (SLR)

  • Special case of MLR where p = 1
  • Estimation:
    • \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
    • \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
  • Other Numerical Results
    • R^2 = r_{x,y}^2
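The SLR formulas above can be checked numerically. A minimal sketch (the data values are hypothetical):

```python
import numpy as np

# Toy data (assumed for illustration): fit y = b0 + b1*x by the SLR formulas.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# \hat{\beta}_1 and \hat{\beta}_0 from the estimation formulas above
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# R^2 equals the squared sample correlation r_{x,y}^2
y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r_xy = np.corrcoef(x, y)[0, 1]
```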

SLR Inferences

  • Standard Errors
    • SE(\hat{\beta}_0) = \sqrt{MSE \cdot \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]}
    • SE(\hat{\beta}_1) = \sqrt{\frac{MSE}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
    • SE(\hat{y}) = \sqrt{MSE \cdot \left[ \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]}
    • SE(\hat{y}_{new}) = \sqrt{MSE \cdot \left[ 1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]}

Multiple Linear Regression (MLR)

  • Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon
  • Notation
    • \beta_j : The j^{th} regression coefficient
    • \hat{\beta}_j : Estimate of \beta_j
    • \sigma^2 : Variance of response / Irreducible error
    • MSE : Estimate of \sigma^2
    • X : Design matrix
    • H : Hat matrix
    • e : Residual
    • SST : Total sum of squares
    • SSR : Regression sum of squares
    • SSE : Error sum of squares
    • r_{x,y} : Sample correlation between two variables x and y
  • Assumptions
    1. Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \epsilon_i
    2. x_{i,j}'s are non-random
    3. E[\epsilon_i] = 0
    4. Var[\epsilon_i] = \sigma^2
    5. \epsilon_i's are independent
    6. \epsilon_i's are normally distributed
    7. The predictor x_j is not a linear combination of the other p predictors, for j = 0, 1, \ldots, p
  • Estimation – Ordinary Least Squares (OLS)
    • \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_p x_p
    • \begin{bmatrix} \hat{\beta}_0 \\ \vdots \\ \hat{\beta}_p \end{bmatrix} = \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}
    • MSE = \frac{SSE}{n - p - 1}
    • residual standard error = \sqrt{MSE}
  • Other Numerical Results
    • H = X(X^TX)^{-1}X^T
    • \hat{\mathbf{y}} = H\mathbf{y}
    • e = y - \hat{y}
    • SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = total variability
    • SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = explained
    • SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = unexplained
    • SST = SSR + SSE
    • R^2 = \frac{SSR}{SST} = r_{\hat{y},y}^2
    • R_{adj}^2 = 1 - \frac{MSE}{s_y^2} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}
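The matrix results above can be verified on a small example. A sketch with randomly generated data (all values assumed):

```python
import numpy as np

# Hypothetical data: n = 6 observations, 2 predictors plus an intercept column.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])  # design matrix
y = rng.normal(size=6)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
y_hat = H @ y                                  # equals X @ beta_hat

SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)                 # SST = SSR + SSE for OLS with intercept
```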
  • Key Facts on R2 and Adjusted R2
    • R^2 is a poor measure for model comparison because it will increase simply by adding more predictors to a model.
    • Similar to R^2, a high R_{adj}^2 is desirable.
    • R_{adj}^2 < R^2, except for two cases; R_{adj}^2 = R^2 occurs when R^2 = 1 or when p = 0.
    • \frac{n-1}{n-p-1} > 1 except when p = 0. This makes it possible for R_{adj}^2 to decrease for larger values of p.
    • R_{adj}^2 does not have to be between 0 and 1.
  • Other Key Ideas
    • A polynomial's fitted response does not change by a constant amount for each unit increase in its variable, i.e. there is no constant slope.
    • Only w - 1 dummy variables are needed to represent w classes of a categorical predictor; one of the classes acts as a baseline.
    • In effect, dummy variables define a distinct intercept for each class. Without the interaction between a dummy variable and a predictor, the dummy variable cannot additionally affect that predictor's regression coefficient.

MLR Inferences

  • Notation

    • \hat{\beta}_j : Estimator for \beta_j
    • \hat{Y} : Estimator for E[Y]
    • SE : Estimated standard error
    • H_0 : Null hypothesis
    • H_1 : Alternative hypothesis
    • df : Degrees of freedom
    • t_{\alpha, df} : \alpha quantile of a t-distribution
    • \alpha : Significance level
    • k : Confidence level
    • ndf : Numerator degrees of freedom
    • ddf : Denominator degrees of freedom
    • F_{\alpha, ndf, ddf} : \alpha quantile of an F-distribution
    • Y_{new} : Response of new observation
    • Subscript r : Reduced model
    • Subscript f : Full model
  • Standard Errors

    • SE(\hat{\beta}_j) = \sqrt{Var(\hat{\beta}_j)}
  • Variance-Covariance Matrix

    • Var(\hat{\boldsymbol{\beta}}) = MSE \cdot (X^T X)^{-1} = \begin{bmatrix} Var(\hat{\beta}_0) & Cov(\hat{\beta}_0, \hat{\beta}_1) & \cdots & Cov(\hat{\beta}_0, \hat{\beta}_p) \\ Cov(\hat{\beta}_0, \hat{\beta}_1) & Var(\hat{\beta}_1) & \cdots & Cov(\hat{\beta}_1, \hat{\beta}_p) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(\hat{\beta}_0, \hat{\beta}_p) & Cov(\hat{\beta}_1, \hat{\beta}_p) & \cdots & Var(\hat{\beta}_p) \end{bmatrix}
  • t Tests

    • t \text{ statistic} = \frac{\text{estimate } - \text{hypothesized value}}{\text{standard error}}
    • H_0: \beta_j = \text{hypothesized value}

      Test Type      Rejection Region
      Two-tailed     |t \text{ statistic}| \geq t_{\alpha/2, n-p-1}
      Left-tailed    t \text{ statistic} \leq -t_{\alpha, n-p-1}
      Right-tailed   t \text{ statistic} \geq t_{\alpha, n-p-1}
  • F Tests

    • F \text{ statistic} = \frac{MSR}{MSE} = \frac{SSR/p}{SSE/(n-p-1)}
    • H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0
    • Reject H_0 if F \text{ statistic} \geq F_{\alpha, p, n-p-1}
    • ndf = p
    • ddf = n - p - 1
  • Partial F Tests

    • F \text{ statistic} = \frac{(SSE_r - SSE_f)/(p_f - p_r)}{SSE_f/(n - p_f - 1)}
    • H_0: \text{Some } \beta_j \text{'s } = 0
    • Reject H_0 if F \text{ statistic} \geq F_{\alpha, p_f - p_r, n - p_f - 1}
    • ndf = p_f - p_r
    • ddf = n - p_f - 1
    • For all hypothesis tests, reject H_0 if p-value \leq \alpha.
  • Confidence and Prediction Intervals

    • estimate \pm (t quantile)(standard error)

      Quantity    Interval Expression
      \beta_j     \hat{\beta}_j \pm t_{\alpha/2, n-p-1} \cdot SE(\hat{\beta}_j)
      E[Y]        \hat{y} \pm t_{\alpha/2, n-p-1} \cdot SE(\hat{y})
      Y_{new}     \hat{y}_{new} \pm t_{\alpha/2, n-p-1} \cdot SE(\hat{y}_{new})
    • Prediction intervals are wider than confidence intervals.

Linear Model Assumptions Violations and Issues

  • Concerns when handling a multiple linear regression model:

    • Misspecified model equation
    • Residuals with non-zero averages
    • Heteroscedasticity
    • Dependent errors
    • Non-normal errors
    • Multicollinearity
    • Outliers
    • High leverage points
    • High dimensions
  • Leverage

    • Measures an observation’s influence in predicting the response.
    • h_i = \mathbf{x}_i^T (X^T X)^{-1} \mathbf{x}_i = \frac{SE(\hat{y}_i)^2}{MSE}
    • For SLR, h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}
    • \frac{1}{n} \leq h_i \leq 1
    • \sum_{i=1}^{n} h_i = p + 1
    • Frees' rule of thumb: An observation is a high leverage point if h_i > \frac{3(p+1)}{n}.
  • Studentized and Standardized Residuals

    • Unitless versions of residuals.
    • e_{stud,i} = \frac{e_i}{\sqrt{MSE_{(i)}(1 - h_i)}}
    • e_{stand,i} = \frac{e_i}{\sqrt{MSE(1 - h_i)}}
    • Outliers are observations with unusual values of the response variable relative to their predicted values.
    • Frees' rule of thumb: An observation is an outlier if |e_{stand,i}| > 2.
  • Cook’s Distance

    • Combines leverage and residuals into a single measure.
    • D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{(i)j})^2}{MSE(p+1)} = \frac{e_i^2}{MSE(p+1)} \cdot \frac{h_i}{(1-h_i)^2}
    • An observation with typical influence has D_i \approx \frac{1}{n}.
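The leverage identities and the residual/leverage form of Cook's distance can be checked directly from the hat matrix. A sketch on randomly generated data (all values assumed):

```python
import numpy as np

# Hypothetical data: n = 8 observations, p = 2 predictors plus intercept.
rng = np.random.default_rng(1)
n, p = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                  # leverages h_i; each lies in [1/n, 1], sum = p + 1
e = y - H @ y                   # residuals
MSE = e @ e / (n - p - 1)

# Cook's distance via the second (residual/leverage) formula above
D = e ** 2 / (MSE * (p + 1)) * h / (1 - h) ** 2
```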
  • Plots of Residuals

    • e versus \hat{y}: Residuals are well-behaved if
      • Points appear to be randomly scattered
      • Residuals seem to average to 0
      • Spread of residuals does not change
    • e versus i: Detects dependence of error terms
    • QQ plot of e: Detects non-normality of error terms
  • Variance Inflation Factor

    • VIF_j = \frac{1}{1 - R_j^2} = \frac{s_{x_j}^2 (n-1) \cdot SE(\hat{\beta}_j)^2}{MSE}
    • Tolerance is the reciprocal of VIF.
    • Frees' rule of thumb: Severe multicollinearity exists for any VIF_j \geq 10.
    • R_j^2 is the R^2 for the model where x_j is regressed against the other predictors.
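VIF via the auxiliary-regression definition can be sketched as follows. The data are assumed, with x3 deliberately constructed as nearly x1 + x2 so that severe multicollinearity appears:

```python
import numpy as np

# Hypothetical predictors with built-in collinearity: x3 ≈ x1 + x2.
rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)
Xp = np.column_stack([x1, x2, x3])

def vif(Xp, j):
    """VIF_j = 1 / (1 - R_j^2), where x_j is regressed on the other predictors."""
    y = Xp[:, j]
    others = np.delete(Xp, j, axis=1)
    X = np.column_stack([np.ones(len(y)), others])
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

vifs = [vif(Xp, j) for j in range(3)]   # vifs[2] far exceeds the threshold of 10
```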
  • Breusch-Pagan Test for Heteroscedasticity

    • LM = \frac{SSR^*}{2}

    • H_0: \beta_1^* = \beta_2^* = \cdots = \beta_p^* = 0

    • Reject H_0 if LM \geq \chi_{\alpha, p}^2

    • SSR^* is the regression sum of squares for the model where the squared standardized residuals are regressed against the predictors.

    • squared standardized residuals ~ predictors

  • Key Ideas

    • As realizations of a t-distribution, studentized residuals can help identify outliers.
    • When residuals have a larger spread for larger predictions, one solution is to transform the response variable with a concave function.
    • Standard errors can be adjusted to account for heteroscedasticity by employing heteroscedasticity-consistent standard errors.
    • There is no universal approach to handling multicollinearity; it is even possible to accept it, such as when there is a suppressor variable. On the other hand, it can be eliminated by using a set of orthogonal predictors.

Model Selection

  • Notation

    • g: Total no. of predictors in consideration
    • p: No. of predictors for a specific model
    • MSE_N: MSE of the model that uses all g predictors
    • \mathcal{M}_p: The "best" model with p predictors
  • Best Subset Selection

    • Considers all 2^g models.

    • Algorithm:

      1. For p = 0, 1, …, g, fit all \binom{g}{p} models with p predictors. The model with the largest R^2 is \mathcal{M}_p.
      2. Choose the best model among \mathcal{M}_0, \ldots, \mathcal{M}_g using a selection criterion of choice.
  • Forward Stepwise Selection

    • Considers only 1 + \frac{g(g+1)}{2} models.
    • Only adds the next best predictor as p increases; a greedy approach.
    • Algorithm:
      1. Fit all g simple linear regression models. The model with the largest R^2 is \mathcal{M}_1.
      2. For p = 2, \ldots, g, fit the models that add one of the remaining predictors to \mathcal{M}_{p-1}. The model with the largest R^2 is \mathcal{M}_p.
      3. Choose the best model among \mathcal{M}_1, \ldots, \mathcal{M}_g using a selection criterion of choice.
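The forward stepwise algorithm above can be sketched compactly. The data, seed, and true coefficients are assumptions for illustration:

```python
import numpy as np

# Hypothetical data with g = 4 candidate predictors; only columns 1 and 3 matter.
rng = np.random.default_rng(7)
n, g = 60, 4
X_all = rng.normal(size=(n, g))
y = 3 * X_all[:, 1] - 2 * X_all[:, 3] + rng.normal(size=n)

def r2(cols):
    """R^2 of the OLS fit using the predictor columns in `cols` (plus intercept)."""
    X = np.column_stack([np.ones(n)] + [X_all[:, j] for j in cols])
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

selected, path = [], []
remaining = set(range(g))
while remaining:
    # Greedy step: add the single predictor that maximizes R^2.
    best = max(remaining, key=lambda j: r2(selected + [j]))
    selected.append(best)
    remaining.remove(best)
    path.append((tuple(selected), r2(selected)))   # M_1, ..., M_g
```

Note the greedy nature: once a predictor enters, it is never removed, which is why only 1 + g(g+1)/2 models are ever fit.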
  • Backward Stepwise Selection

    • Considers only 1 + \frac{g(g+1)}{2} models.
    • Algorithm:
      1. Fit the model with all g predictors, \mathcal{M}_g.
      2. For p = g - 1, \ldots, 1, fit the models that drop one of the predictors from \mathcal{M}_{p+1}. The model with the largest R^2 is \mathcal{M}_p.
      3. Choose the best model among \mathcal{M}_1, \ldots, \mathcal{M}_g using a selection criterion of choice.
  • Stepwise Regression

    • Relative to forward or backward selection, a wider scope of models is considered.
    • Algorithm:
      1. Start with the null model.
      2. Using two-tailed t tests, keep the predictor that produces the smallest p-value when added to the current model. However, do not keep the predictor if the p-value is above a specified threshold.
      3. Using two-tailed t tests, remove the predictor with the largest p-value from the current model. However, do not remove the predictor if the p-value is below a specified threshold.
      4. Repeat the previous two steps until the current model becomes stable.
  • Drawbacks of Subset Selection Procedures

    1. The algorithm fails to account for any special knowledge that a researcher has.
    2. Data snooping.
    3. The model found by the algorithm is not guaranteed to be the best model.
    4. The algorithm doesn't consider all 2^g possible models.
    5. The algorithm relies only on using p-values of two-tailed t tests.
    6. The algorithm involves a sequence of two-tailed t tests.
    7. The algorithm ignores the joint effect of predictors.
    8. Computationally intensive.
    9. Should not or cannot be used in high-dimensional settings.
  • Selection Criteria

    • Mallows’ C_p

      • C_p = \frac{1}{n}(SSE + 2p \cdot MSE_N)
      • C_p = \frac{SSE}{MSE_N} - n + 2(p+1)
    • Akaike information criterion

      • AIC = \frac{1}{n}(SSE + 2p \cdot MSE_N)
    • Bayesian information criterion

      • BIC = \frac{1}{n}(SSE + \ln(n) \cdot p \cdot MSE_N)
    • Adjusted R^2

    • Cross-validation error

  • Key Ideas

    • Cross-validation error is a more direct estimate of the test MSE.
    • Smaller values of C_p, AIC, BIC, and cross-validation error are preferred.
    • The best model by AIC will also be the best model by C_p.
    • For n \geq 8, BIC favors models with a smaller p compared to AIC and C_p.
    • C_p, AIC, and BIC have theoretical justifications for being good measures of model quality, whereas R_{adj}^2 does not.
    • For overfitted models due to high dimensions, C_p, AIC, BIC, and R_{adj}^2 are not reliable.
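The two C_p formulas above differ only by an affine transform, C_p^{(1)} = \frac{MSE_N}{n}(C_p^{(2)} + n - 2), so they always rank models identically. A numeric sketch (the SSE values, predictor counts, n, and MSE_N are assumed):

```python
import numpy as np

# Hypothetical model comparison: four candidate models.
n = 50
MSE_N = 1.3
SSE = np.array([80.0, 62.0, 55.0, 54.5])   # each candidate model's SSE
p = np.array([1, 2, 3, 4])                 # each candidate model's predictor count

cp_v1 = (SSE + 2 * p * MSE_N) / n          # first C_p formula
cp_v2 = SSE / MSE_N - n + 2 * (p + 1)      # second C_p formula
```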
  • Validation Set

    • Randomly splits all available observations into two groups: the training set and the validation set.
    • Only the observations in the training set are used to attain the fitted model, and those in the validation set are used to estimate the test MSE.
    • Advantage: conceptually simple and easy to implement.
    • Disadvantages:
      • The results are fickle.
      • The validation set error tends to overestimate the test MSE.
  • k-fold Cross-Validation

    • Algorithm:
      1. Randomly divide all available observations into k folds.
      2. For v = 1, …, k, obtain the v^{th} fit by training with all observations except those in the v^{th} fold.
      3. For v = 1, …, k, use \hat{y} from the v^{th} fit to calculate a test MSE estimate with observations in the v^{th} fold.
      4. To calculate CV error, average the k test MSE estimates in the previous step.
    • Advantages:
      • The results are typically less fickle because the CV error is an average that involves all observations.
      • As every fit uses a majority of the observations, they exhibit less bias and thus should not overestimate the test MSE as much.
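The four steps above can be sketched for an OLS fit; the data, seed, and k = 5 are assumptions:

```python
import numpy as np

# Hypothetical data for 5-fold cross-validation of a simple linear model.
rng = np.random.default_rng(3)
n, k = 40, 5
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)

idx = rng.permutation(n)                   # step 1: random fold assignment
folds = np.array_split(idx, k)

mse_estimates = []
for v in range(k):
    test = folds[v]
    train = np.concatenate([folds[u] for u in range(k) if u != v])
    X_tr = np.column_stack([np.ones(train.size), x[train]])
    beta = np.linalg.lstsq(X_tr, y[train], rcond=None)[0]   # step 2: fit without fold v
    y_hat = beta[0] + beta[1] * x[test]                     # step 3: predict fold v
    mse_estimates.append(np.mean((y[test] - y_hat) ** 2))

cv_error = np.mean(mse_estimates)          # step 4: average the k test MSE estimates
```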
  • Leave-one-out Cross-Validation (LOOCV)

    • LOOCV is a special case of k-fold cross-validation where k = n.
    • Advantages:
      • Less bias than validation set approach; doesn’t overestimate the test MSE as much.
      • Performing LOOCV multiple times always yields the same result since there is no randomization of observations.
    • Disadvantage: expensive to implement as it is computationally intensive for a large n
    • Ordinary least squares estimation simplifies the LOOCV error calculation, making the cost of LOOCV the same as a single model fit.
  • LOOCV Error = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2
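For OLS this leverage shortcut reproduces brute-force LOOCV exactly, which can be verified numerically (the data and seed are assumed):

```python
import numpy as np

# Hypothetical data for an SLR fit.
rng = np.random.default_rng(4)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)

# Shortcut: single fit, using leverages from the hat matrix.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
y_hat = H @ y
shortcut = np.mean(((y - y_hat) / (1 - h)) ** 2)

# Brute force: refit n times, each time leaving one observation out.
brute = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    brute += (y[i] - X[i] @ beta) ** 2
brute /= n
```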

  • Key Ideas on Cross-Validation

    • With respect to bias, LOOCV < k-fold CV < Validation Set.
    • With respect to variance, LOOCV > k-fold CV > Validation Set.
    • To balance between the bias and variance, a rule of thumb is to use k = 5 or 10 folds in CV.
  • Other Regression Approaches

    • Standardizing Variables
      • A centered variable is the result of subtracting the sample mean from a variable.
      • A scaled variable is the result of dividing a variable by its sample standard deviation.
      • A standardized variable is the result of first centering a variable, then scaling it.
    • Ridge Regression
      • Coefficients are estimated by minimizing the SSE while constrained by \sum_{j=1}^{p} \beta_j^2 \leq a or equivalently, by minimizing the expression
        SSE + \lambda \sum_{j=1}^{p} \beta_j^2
      • a: budget parameter
      • \lambda: tuning parameter
    • Lasso Regression
      • Coefficients are estimated by minimizing the SSE while constrained by \sum_{j=1}^{p} |\beta_j| \leq a or equivalently, by minimizing the expression SSE + \lambda \sum_{j=1}^{p} |\beta_j|
      • a: budget parameter
      • \lambda: tuning parameter
  • Key Ideas on Ridge and Lasso

    • x_1, \ldots, x_p are scaled predictors.
    • Increasing the budget parameter is equivalent to decreasing the tuning parameter.
    • \lambda is inversely related to flexibility.
    • With a finite \lambda, none of the ridge estimates will equal 0, but the lasso estimates could equal 0.
    • Lasso performs variable selection, whereas ridge does not.
    • Lasso tends to yield models that are easier to interpret than ridge.
    • Lasso uses an \ell_1 shrinkage penalty, whereas ridge uses an \ell_2 shrinkage penalty.
    • Both are useful in dealing with high dimensions.
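Ridge has the closed form \hat{\boldsymbol{\beta}}_\lambda = (X^T X + \lambda I)^{-1} X^T \mathbf{y}, which makes the shrinkage behavior easy to see. A sketch on standardized, hypothetical data (intercept omitted by centering so it is not penalized):

```python
import numpy as np

# Hypothetical data: 3 standardized predictors, centered response.
rng = np.random.default_rng(5)
n, p = 30, 3
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize predictors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
y = y - y.mean()

def ridge(lam):
    """Closed-form ridge estimate: (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols = ridge(0.0)        # lambda = 0 recovers OLS
b_small = ridge(1.0)
b_large = ridge(1000.0)   # heavy shrinkage toward, but never exactly, zero
```

Increasing \lambda shrinks the coefficient vector's norm monotonically, yet no coefficient is set exactly to 0, consistent with the key ideas above.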
  • Partial Least Squares

    • The first partial least squares direction, z_1, is a linear combination of standardized predictors x_1, \ldots, x_p, with coefficients based on the relation between x_j and y.
    • Every subsequent partial least squares direction is calculated iteratively as a linear combination of "updated predictors", which are the residuals of fits in which the "previous predictors" are explained by the previous direction.
    • The directions z_1, \ldots, z_p are used as predictors in a multiple linear regression. The number of directions, g, is a measure of flexibility.
  • Weighted Least Squares

    • Var[\epsilon_i] = \frac{\sigma^2}{w_i}
    • Equivalent to running OLS with \sqrt{w_i} y_i as the response and \sqrt{w_i} \mathbf{x}_i as the predictors, hence minimizing \sum_{i=1}^{n} w_i (y_i - \hat{y}_i)^2.
    • \hat{\boldsymbol{\beta}} = (X^T W X)^{-1} X^T W \mathbf{y}, where W is the diagonal matrix of the weights.
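The stated equivalence, WLS equals OLS on \sqrt{w_i}-scaled data, can be verified directly (data and weights are assumed):

```python
import numpy as np

# Hypothetical data and positive weights.
rng = np.random.default_rng(6)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)
W = np.diag(w)

# WLS normal equations: (X^T W X)^{-1} X^T W y
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# OLS after multiplying each row of X and y by sqrt(w_i)
sw = np.sqrt(w)
beta_ols = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```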
  • 𝑘-Nearest Neighbors (KNN)

    1. Identify the "center of the neighborhood", i.e. the location of an observation with inputs x_1, \ldots, x_p.
    2. Starting from the "center of the neighborhood", identify the k nearest training observations.
    3. For classification, \hat{y} is the most frequent category among the k observations; for regression, \hat{y} is the average of the response among the k observations.
    • k is inversely related to flexibility.
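The KNN steps above, for regression with a single predictor, can be sketched as (the training points and query value are assumed):

```python
import numpy as np

# Toy training data (assumed for illustration).
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.0, 2.0, 2.0, 4.0, 10.0])

def knn_predict(x0, k):
    """Average the responses of the k training points nearest to the center x0."""
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

pred_k1 = knn_predict(2.4, k=1)   # nearest point is x = 2, so prediction is 2.0
pred_k3 = knn_predict(2.4, k=3)   # neighborhood {x = 2, 3, 1}, mean of {2, 2, 1}
```

Smaller k tracks the training data more closely (more flexible); larger k averages over a wider neighborhood (less flexible).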


Key Results for Distributions in the Exponential Family

Distribution   Probability Function                                                           \theta   \phi       b(\theta)            Canonical Link, b'^{-1}(\mu)
Normal         \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)   \mu      \sigma^2   \frac{\theta^2}{2}   \mu