
Advanced Applied Statistics

Bayesian Inference ch. 3, 6 & 13

Bayes Theorem 

Bayes' theorem: p(θ | x) = p(x | θ) p(θ) / p(x), i.e. posterior ∝ likelihood × prior

Prior = p(θ), what is believed about the parameter before seeing the data

Likelihood = p(x | θ), the probability of the observed data given the parameter


Jeffreys Prior

The Fisher information is I(θ) = E[(∂ log f(x | θ)/∂θ)²] = −E[∂² log f(x | θ)/∂θ²]

The Jeffreys uninformative prior is given by: p(θ) ∝ √I(θ)

The Jeffreys prior is used whenever there is no good reason to use another prior


Conjugate priors 

Connection to exponential families 

For certain likelihood functions, selecting a specific prior results in the posterior sharing the same distribution as the prior. This type of prior is then called a conjugate prior because it is conjugate to the likelihood.



Empirical Bayes

Know the standard approach

Baseball example 

  • A batting average of 1/2 is not the same as 50/100: both equal ½, but there is far more uncertainty in an estimate based on 2 at-bats than in one based on 100 

  • Step 1: Estimate a prior by fitting a chosen distribution to the whole data set, for example by maximum likelihood estimation (in this example a Beta distribution) 

  • Step 2: Use that fitted distribution as the prior for each individual estimate

    • This is a shrinkage method: uncertain observations are pulled strongly towards the overall average, while well-determined observations barely move (see the sketch below) 
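
A minimal Python sketch of the two steps, assuming a hypothetical table of hits and at-bats per player (the numbers and the use of scipy's beta.fit are illustrative, not from the notes):

import numpy as np
from scipy import stats

# Hypothetical career hits and at-bats for a few players.
hits = np.array([45, 60, 1, 25, 80])
at_bats = np.array([150, 200, 2, 100, 300])

# Step 1: fit a Beta prior by maximum likelihood to the averages of players
# with enough at-bats for their averages to be stable.
stable = at_bats >= 100
alpha0, beta0, _, _ = stats.beta.fit(hits[stable] / at_bats[stable], floc=0, fscale=1)

# Step 2: use Beta(alpha0, beta0) as the prior for every player.
# With a binomial likelihood the posterior mean is a shrunken estimate:
shrunken = (hits + alpha0) / (at_bats + alpha0 + beta0)

# Players with few at-bats are pulled strongly towards the prior mean,
# players with many at-bats barely move.
print(np.round(shrunken, 3))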

Robbins Formula 

  • With no knowledge of the prior density it is still possible to estimate it from the data; for Poisson counts this gives Robbins' formula E[θ | x] = (x + 1) f(x + 1) / f(x), where f is the marginal frequency of the observed counts


Objective Bayes 

BIC - Bayes Information Criterion 

  • A criterion for model selection 

  • Introduces a penalty term for the number of parameters in the model: BIC = −2 log L̂ + k log n, where k is the number of parameters and n the sample size 

  • Models with lower BIC values are preferred over models with higher BIC values

  • An alternative is the AIC (Akaike information criterion), whose penalty 2k does not grow with the sample size


MCMC - Markov Chain Monte Carlo 

  • Start with a current state theta_current 

  • Propose a new state theta_new based on a proposal distribution

  • Calculate the acceptance probability which is a function of the likelihoods of the new and current states and the proposal distribution

  • Generate a random number u from a uniform distribution over [0, 1]. 

  • If u is less than the acceptance probability, accept the new state; otherwise retain the current state

  • Repeat the steps for a large number of iterations to generate a chain of samples (a minimal sketch follows)
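
A minimal Metropolis sketch of this loop, assuming (purely for illustration) a standard normal target density and a Gaussian random-walk proposal:

import numpy as np

rng = np.random.default_rng(0)

def log_target(theta):
    # Log-density of the target distribution (here a standard normal, as an illustration).
    return -0.5 * theta**2

theta_current = 0.0
samples = []
for _ in range(10_000):
    # Propose a new state from a symmetric (random-walk) proposal distribution.
    theta_new = theta_current + rng.normal(scale=1.0)
    # Acceptance probability for a symmetric proposal: ratio of target densities.
    accept_prob = np.exp(min(0.0, log_target(theta_new) - log_target(theta_current)))
    u = rng.uniform()               # random number in [0, 1]
    if u < accept_prob:             # accept the new state ...
        theta_current = theta_new
    samples.append(theta_current)   # ... else retain the current state

samples = np.array(samples)         # chain of samples from the target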


Maximum-likelihood estimation, exponential families and generalised linear models


Maximum likelihood estimation 


Steps in the maximum likelihood estimation:

  • Choose a likelihood function: a distribution whose parameters are to be estimated from the data

  • Take the logarithm of the likelihood function (the joint PDF of the data)

  • Differentiate the logarithm of the likelihood function with respect to the model parameters: This gives the score function, which is a vector of partial derivatives of the logarithm of the likelihood function with respect to each model parameter

  • Set the score function equal to zero and solve for the model parameters: This gives the maximum likelihood estimates of the model parameters
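
A small worked example of these steps for Poisson data (the counts are made up): the log-likelihood is Σ(xi log λ − λ − log xi!), the score is Σxi/λ − n, and setting it to zero gives λ-hat = x̄.

import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([2, 3, 0, 4, 1, 2])   # hypothetical Poisson counts

def neg_log_lik(lam):
    # Negative Poisson log-likelihood (the log(x!) term does not depend on lam).
    return -(np.sum(x * np.log(lam)) - len(x) * lam)

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20), method="bounded")
print(res.x, x.mean())   # numerical MLE agrees with the analytic answer, the sample mean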


Score function 

  • The score function is the derivative of the log-likelihood function

  • The score function indicates how much the log-likelihood changes if you vary the parameter estimate by an infinitesimal amount, given data x

  • The variance of the score is called the Fisher information



Fisher Information 

  •  It is defined as the variance of the score function, or the expected value of the observed information

  • The observed information is a sample-based version of the Fisher information and is used to estimate it. The Fisher information is a theoretical measure of the amount of information that an observable random variable carries about an unknown parameter of the distribution that models the variable

Expectation of the score at the true parameter value: E[score] = 0

Higher Fisher information is good: it means the data are more informative about the parameter, so the maximum-likelihood estimate has lower variance


Observed Fisher information 

  • It is the negative of the second derivative of the log-likelihood function with respect to the model parameters


Cramér Rao lower bound 

  • Cramér-Rao: Any unbiased estimator θ-hat must have variance at least equal to the reciprocal of the Fisher information: Var(θ-hat) ≥ 1/I(θ)


Conditional inference 

  • Experiments which were not actually performed are statistically irrelevant.

  • Ancillary statistics: a statistic that contains “no direct information by itself”, but describes the experiment that was performed, e.g. the sample size

  • In conditional inference, we condition on the ancillary statistics, hoping to draw inferences that are simpler and more relevant.


Randomisation 


Exponential families 

From video

From book


An exponential family has densities of the form f_η(x) = h(x) exp(ηᵀ T(x) − A(η)), with the components:

h(x): a non-negative function of x which may make some x-values more likely regardless of θ; often just 1

T(x): the sufficient statistic, a vector capturing everything within x that matters for inference about the parameters

A(η): the normalizer → ensures that the probability integrates to 1 over the domain of x

η: the natural parameter vector 



Parametric families 

Basic univariate families

  • Normal 

  • Poisson 

  • Binomial 

  • Gamma 

  • Beta 


Multivariate normal distribution 

Is a Gaussian in p dimensions: f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))

Σ is a real symmetric positive definite matrix, called the covariance matrix. 

Because of Gaussian integral identities, you can obtain the following analytically:

  • The conditional distribution of a Gaussian is also Gaussian.

  • The marginal distribution of a Gaussian is also Gaussian


Law of large numbers

  • The law of large numbers says that if we sample from a distribution with a finite expectation, then the sample means will converge in probability (at least) to the population mean.


Central Limit Theorem 

  • The sample mean of a large random sample of random variables with mean µ and finite variance σ² has approximately a normal distribution with mean µ and variance σ²/n.



Generalised Linear models 

Logistic regression

  • Predicting either 0 or 1 

  • S-shaped (logistic) curve mapping the linear predictor to a probability between 0 and 1 

  • Uses MLE to find the curve that gives the maximum likelihood (see the sketch below) 

  • The underlying exponential family is the binomial distribution. Binomial distribution: Probability of k successes in n trials, given an independent success probability of π.
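
A minimal sketch (made-up data) of the MLE fit behind logistic regression, using Newton-Raphson on the binomial log-likelihood:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: one predictor, binary response.
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))
y = rng.binomial(1, p_true)

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
beta = np.zeros(2)

# Newton-Raphson on the binomial log-likelihood.
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    grad = X.T @ (y - p)                    # score function
    hess = X.T @ (X * W[:, None])           # observed information
    beta = beta + np.linalg.solve(hess, grad)

print(beta)   # should be close to the true values (0.5, 2.0)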



Poisson regression 

  • Also uses maximum likelihood estimation to find the best fit, but in this case with a Poisson likelihood 

  • Used when the data are counts and therefore always non-negative


Generalised linear models 

  • GLMs are a principled way to apply regression to quantities that are not normally distributed; the data are assumed to come from a one-parameter exponential family

  • A linear predictor is fitted and mapped to the mean of the response through a link function

  • In the general case X is a vector of predictors


Deviance 

  • The deviance between two densities f1(x) and f2(x) is a measure of how different the two distributions are (twice the Kullback–Leibler divergence)


Survival Analysis and Expectation Maximization 

Expectation Maximization 

Gaussian Mixture Models 

  • Assumes that data points come from k Gaussians with unknown parameters


Expectation and maximization models

  • The Expectation-Maximization (EM) algorithm is an iterative method used to estimate the parameters of a GMM 

  • E and M step:

    • E step:

      • “completes the data” by assigning (expected) values to hidden variables.

    • M step:

      • Updates the model by maximising the expected log-likelihood
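
A minimal sketch of EM for a two-component 1-D Gaussian mixture (the data and starting values are made up):

import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data from two Gaussians with unknown parameters.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 200)])

# Initial guesses for weights, means and standard deviations of k = 2 components.
w, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E step: "complete the data" with expected component memberships (responsibilities).
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: update the parameters by maximising the expected log-likelihood.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(w, mu, sigma)   # should be close to the mixture that generated the data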



Latent variables:

  • Latent (or hidden) variables are random variables that facilitate the description of the model, but that you don’t have access to.





Survival Analysis 

Life tables & Hazard rates 


  • X is the year of death 

  • fi = P(X = i) is the probability of dying at age i. 

  • Si = P(X > i) is the probability of surviving past age i. 

  • The hazard rate at age i is the probability of dying at age i given you made it to age i − 1: hi = fi / Si−1

  • Probability of surviving past age j given survival past i − 1: the product over k = i, …, j of (1 − hk)



The Kaplan Meier Estimate

  • The Kaplan-Meier estimator is a non-parametric statistic used to estimate the survival function from lifetime data.

  • Let t(1) < t(2) < ... < t(n) be the ordered survival times. The Kaplan-Meier formula estimates the survival probability as Ŝ(t) = ∏ over ti ≤ t of (1 − di / ni)

  • where ti is a time at which at least one event happened, di is the number of events (e.g. deaths) that happened at time ti, and ni is the number of individuals known to be at risk (still alive and under observation) just before ti
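
A minimal Kaplan-Meier sketch in Python (made-up survival times and censoring indicators):

import numpy as np

# Hypothetical survival times and event indicators (1 = death observed, 0 = censored).
times = np.array([3, 5, 5, 8, 10, 12, 12, 15])
event = np.array([1, 1, 0, 1, 1, 0, 1, 1])

surv = 1.0
for t in np.unique(times[event == 1]):          # event times t(1) < t(2) < ...
    d = np.sum((times == t) & (event == 1))     # deaths at time t
    n = np.sum(times >= t)                      # number at risk just before t
    surv *= (1 - d / n)                         # Kaplan-Meier product
    print(t, round(surv, 3))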


The log-rank test

  • The Log-Rank test is good at getting a quantitative estimate of how likely it is that one treatment is better than another. 

  • Statistical hypothesis test used to compare the survival distributions of two or more groups. 

  • It is a non-parametric test



Non-parametric methods and bootstrap

Non-parametric methods

  • a type of statistical analysis that makes minimal assumptions about the underlying distribution of the data being studied

Parametric Bootstrap

  • Assumes the data follows a specific distribution

  • Resamples in the same spirit as the non-parametric bootstrap, but new samples are drawn from the fitted model rather than from the data:

  • Choose a parametric model 

  • Estimate parameters based on the given sample set. 

  • Generate new samples from the fitted model 

  • Calculate the statistic.

Non-parametric Bootstrap

  • Does not assume that data comes from a specific distribution 

  • Sampling with replacement 

  • Original data should be a reasonable representation of the population

  • Resample with replacement many times and calculate the relevant statistic on each bootstrap sample (see the sketch below)
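
A minimal non-parametric bootstrap sketch (made-up data); it also illustrates the percentile interval and the bootstrap standard error discussed below:

import numpy as np

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=100)     # hypothetical sample

boot_medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))  # resample with replacement
    for _ in range(5000)
])

se = boot_medians.std(ddof=1)                           # bootstrap standard error
lo, hi = np.percentile(boot_medians, [2.5, 97.5])       # percentile 95% confidence interval
print(np.median(data), se, (lo, hi))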


Bootstrap confidence intervals

  • Bootstrap samples 

  • Calculate the relevant statistic on each bootstrap sample, e.g. R² from a regression, the median, or the mean

  • The percentile method: take the 2.5 and 97.5 percentiles of the bootstrap distribution of the statistic to get a 95% confidence interval


Bootstrap error:

  • The bootstrap standard error is the standard deviation of the statistic (e.g. mean or median) across the bootstrap samples


Jackknife-Estimate

  • When the underlying distribution is unknown 

  • Leave one out resampling

  • Remove the i'th observation, for i = 1, …, n, giving n leave-one-out samples, and recompute the statistic on each 

  • The standard error formula carries a factor (n − 1)/n because each jackknife sample omits one observation: se_jack = √[ ((n − 1)/n) Σi (θ̂(i) − θ̂(·))² ]
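
A minimal jackknife sketch (made-up data); for the mean, the jackknife standard error reproduces the usual s/√n:

import numpy as np

data = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3])   # hypothetical observations
n = len(data)

# Leave-one-out estimates of the statistic (here the mean).
theta_i = np.array([np.delete(data, i).mean() for i in range(n)])
theta_dot = theta_i.mean()

# Jackknife standard error, with the (n - 1)/n factor.
se_jack = np.sqrt((n - 1) / n * np.sum((theta_i - theta_dot) ** 2))
print(se_jack, data.std(ddof=1) / np.sqrt(n))   # for the mean these agree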



Neyman's construction

  • Repeated Sampling: Draw multiple samples and compute a confidence interval from each sample.

  • Visualize Intervals: Plot all intervals—those covering the true parameter are one color (e.g., green), those missing it are another (e.g., red).

  • Frequentist Interpretation: In the long run, the proportion of intervals containing the true parameter matches the confidence level (e.g., 95% of intervals contain the true value for a 95% confidence interval).


Bias adjusted confidence interval 

  • Calculate the bias-correction parameter z0 as the standard-normal quantile of the proportion of bootstrap replicates falling below the original estimate: z0 = Φ⁻¹(#{θ̂* < θ̂} / B)

  • where Φ is the cdf of the standard normal distribution

  • Calculate the acceleration parameter a, which measures how fast the standard error of the estimate changes with the parameter; it is commonly estimated from the skewness of jackknife replicates of the estimate

  • The bias-corrected (BC) confidence interval endpoint is the Φ(2z0 + z(α)) percentile of the bootstrap distribution rather than the raw α percentile (the BCa version additionally rescales by the acceleration a)


Bootstrap-t intervals

  • Take n many bootstrap samples from the original dataset.

  • For each sample, calculate the mean.

  • Calculate the standard error of the distribution of means from the bootstrap samples.

  • Calculate the t-value for the desired level of confidence and degrees of freedom.

  • Multiply the standard error by the t-value and add and subtract the resulting value from the sample mean to obtain the confidence interval.


Confidence intervals and Cross-validation 

Confidence intervals

  • Suppose you know the underlying process, given by fµ(x). The 95% confidence interval is obtained by finding the values xmin and xmax such that P(X < xmin) ≤ .025 and P(X > xmax) ≤ .025.

  • Example: the normal distribution with µ = 0 and σ = 1:

    • P(X < x) = 0.025 → x ≈ −1.96 

    • P(X > x) = 0.025 → x ≈ 1.96 


Cross-validation

Prediction problems 

  • A normal ML prediction problem 

  • The input consists of a set of predictors (xi) and the output consists of a set of responses (yi). The goal is to find a function rd(x) that accurately predicts the response y for a given new predictor x.


The overall prediction error (quantifying prediction error) 

  • A discrepancy D(y, ŷ) that measures how far the prediction is from the actual value

  • Most common for continuous responses is the squared error: D(y, ŷ) = (y − ŷ)²

    • with ŷ being the prediction and y being the actual value

  • In the discrete case the most used is the misclassification (0–1) loss: D(y, ŷ) = 1 if ŷ ≠ y, and 0 otherwise

Cross-validation in general

  • Non-parametric method

  • When a part of the training set is held out as the validation set

  • Err = E_F{D(y0, ŷ0)} for a new case (x0, y0) obtained independently of the training data d. 

  • If we can spare some samples from the training set, then we can define a validation set dval = {(x0i, y0i)} and define the unbiased estimate Êrr_val = (1/Nval) Σi D(y0i, ŷ0i)

  • The cross-validation estimate of prediction error is Êrr_cv = (1/n) Σi D(yi, ŷ(i)), where ŷ(i) is the prediction for case i from a model fitted without case i (see the sketch below)
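
A minimal K-fold cross-validation sketch for a linear model (made-up data, K = 5):

import numpy as np

rng = np.random.default_rng(4)
# Hypothetical regression data.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

K = 5
folds = np.array_split(rng.permutation(len(y)), K)
errs = []
for k in range(K):
    val = folds[k]                                   # held-out validation fold
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    errs.append(np.mean((y[val] - X[val] @ beta) ** 2))  # squared-error discrepancy

print(np.mean(errs))   # cross-validation estimate of prediction error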


Covariance penalties 

  • The covariance penalty approach treats prediction error estimation in a regression framework.

  • It is based on the covariance between data points and their corresponding prediction

  • In the context of the covariance penalty, it is used to measure the relationship between the observed values of the dependent variable and the predicted values of the dependent variable

  • The overall prediction error is assessed in terms of the squared discrepancy: 

    • where the expectation is given by;

    • The overall prediction error is then estimated as:


The linear model and Mallows Cp estimate

  • Is used to estimate the prediction error of a linear regression model. 

  • Applied in the context of model selection, where a number of predictor variables are available for predicting some outcome, and the goal is to find the best model involving a subset of these predictors. A small value of Cp means that the model is relatively precise.

  • This formula is called Mallows' Cp estimate of prediction error. For least-squares estimation, M = X(XᵀX)⁻¹Xᵀ has tr[M] = p, the number of predictors, so the penalty added to the apparent error is 2σ²p/n



Hypothesis testing and false-discovery rates

Hypothesis testing

  • Null hypothesis (H0) - rejected if the p-value is below the chosen significance level α

  • Alternative hypothesis - the hypothesis accepted if H0 is rejected

  • Collect unbiased samples and make a statistical test on the data

  • Type 1 error (false positive): 

    • Null hypothesis was rejected, but shouldn’t have been

  • Type 2 error (false negative):

    • Null hypothesis should have been rejected, but wasn’t.

  • Different tests:

    • Z-test: A z-test is used to test the mean of a population when the population standard deviation is known

    • T-test: A t-test is used to test the mean of a population when the population standard deviation is unknown

    • Chi-squared test: A chi-squared test is used to test the independence of two categorical variables

    • Mann-Whitney U test: The Mann-Whitney U test is another nonparametric test used to compare two independent samples 

    • Kolmogorov-Smirnov test: The Kolmogorov-Smirnov test is used to test whether two samples come from the same distribution 

    • Fisher’s exact test: Fisher’s exact test is used to test the independence of two categorical variables when the sample size is small


Large scale testing

  • When testing many hypotheses

  • Uncorrected testing will give many false positives (Type I errors) 

  • If we correct for them too aggressively, we’ll make many Type II errors (lose power).

  • How to balance between type 1 and type 2 errors when testing multiple hypotheses? 


Bonferroni correction 

  • Divide significance level α by number of tests N, test each hypothesis at level α/N. 

  • Very conservative for large N


Holm’s procedure

  • Uniformly more powerful than Bonferroni bound. 

  • The procedure: 

    • Order observed p-values from smallest to largest  

    • Going through the p-values in increasing order: 

    • reject the i’th smallest null hypothesis if its p-value < α/(N − i + 1); stop at the first hypothesis that is not rejected


family-wise error rate

  • FWER - Probability of making a type 1 error among a specified group or “family” of tests. 



False-discovery rate

  • Proportion of false positives among all significant results

  • The false-discovery rate (FDR) of a decision procedure D is its expected false-discovery proportion. 

  • Controlling the FDR increases the power of our tests while still keeping Type I error under (less stringent) control. 

  • FDR control is less conservative than FWER control. 

  • Fdp:  false-discovery proportion 

    • Is unobservable 


Benjamini-Hochberg procedure

  • Used to control the FDR 

  • Arrange the p-values in order from smallest to largest, assigning a rank to each one – the smallest p-value has a rank of 1, the next smallest has a rank of 2, etc.

  • q = the chosen false-discovery rate

  • Compare each p-value to its critical value (i/N)·q, where i is its rank and N the number of tests. Find the largest p-value that is less than its critical value, and designate it and every smaller p-value as significant (see the sketch below).
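
A small sketch of the procedure on hypothetical p-values:

import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of the hypotheses declared significant at FDR level q."""
    p = np.asarray(pvals)
    order = np.argsort(p)                       # ranks: smallest p-value first
    ranks = np.arange(1, len(p) + 1)
    below = p[order] <= ranks / len(p) * q      # compare p_(i) with its critical value (i/N)q
    significant = np.zeros(len(p), dtype=bool)
    if below.any():
        cutoff = np.where(below)[0].max()       # largest i with p_(i) below its critical value
        significant[order[: cutoff + 1]] = True # that p-value and all smaller ones
    return significant

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.6], q=0.05))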


Two group model

  • The two-groups model is a simple Bayesian construction used in large-scale testing: each test statistic comes either from the null or from the alternative distribution, giving the mixture density

  • f(z) = π0f0(z) + π1f1(z)

  • π0: Prior probability of null hypothesis being true.

  • π1 = 1 − π0: Prior probability of alternative being true.

  • Test statistics where H0 is true are drawn iid from density f0(z) with cdf F0(z).

  • Test statistics where H1 is true are drawn iid from density f1(z) with cdf F1(z).




Local fdr

  • local false-discovery rate

  • Is a Bayesian approach that provides a more refined measure of significance than the traditional false discovery rate (FDR)

  • The lfdr is defined as the posterior probability that the null hypothesis is true given the observed test statistic: lfdr(z) = π0 f0(z) / f(z)


null distribution 

  • The empirical null estimates the null distribution f0 directly from the data (typically from the central bulk of the z-values) instead of relying on the theoretical N(0, 1) null

  • Is used to calculate the probability of observing a certain number of false positives (type 1 error), given the number of tests performed


null estimation 



Sparse Modeling and Regularization 

James Stein Estimator 

Shrinkage estimation 

  • Shrinkage estimation is a statistical technique that involves “shrinking” extreme values in a sample towards a central value, such as the sample mean, to obtain a better estimate of the true population mean


James-Stein Estimator

  • The James-Stein estimator makes this concrete for a vector of sample means: each component is pulled towards the grand mean by a data-determined shrinkage factor, and for a sufficiently large number of parameters it has uniformly lower total squared error than the MLE (see the sketch below)
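
A minimal sketch on simulated data, shrinking towards the grand mean and comparing total squared error against the MLE (the data are made up):

import numpy as np

def james_stein(z, sigma2):
    """Shrink each observation towards the grand mean (assumes z_i ~ N(theta_i, sigma2))."""
    n = len(z)
    zbar = z.mean()
    s = np.sum((z - zbar) ** 2)
    shrink = 1 - (n - 3) * sigma2 / s          # data-determined shrinkage factor
    return zbar + shrink * (z - zbar)

rng = np.random.default_rng(5)
theta = rng.normal(0, 1, size=50)              # hypothetical true means
z = theta + rng.normal(0, 1, size=50)          # one noisy observation per mean (sigma2 = 1)

mle_err = np.sum((z - theta) ** 2)
js_err = np.sum((james_stein(z, 1.0) - theta) ** 2)
print(mle_err, js_err)                         # the James-Stein error is typically smaller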


Least squares estimator

  • When using least squares to find the best-fitting line, the estimator is β̂ = (XᵀX)⁻¹Xᵀy

  • X is the design matrix, and y is the response vector

  • The design matrix is a matrix of values of explanatory variables of a set of objects.  Each row represents an individual object, with the successive columns corresponding to the variables and their specific values for that object. 


Ridge Regression 

https://campus.datacamp.com/courses/practicing-machine-learning-interview-questions-in-python/supervised-learning-0fe20a83-ea2f-4455-84fa-7fbd7a5ed99c?ex=5 


  • Ridge regression is linear regression with squared loss and a penalty on the ℓ2-norm of β: minimise ||y − Xβ||² + λ||β||²

  • Effects of ridge regression: 

    • Biased parameter estimates: Estimated coefficients are shrunk towards zero.

    • The model requires “stronger evidence” to grow the weights. 

    • Avoids overfitting. 

    • Potentially better performance on new datasets.

  • Ridge regression overcomes the problem of overfitting by adding a shrinkage penalty term to the least squares objective function, which shrinks the magnitude of the regression coefficients towards zero. This results in a more stable and generalizable model that is less prone to overfitting.
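
A minimal sketch of the ridge solution via its closed form, on made-up data with the predictors standardized first (see the note on standardization below):

import numpy as np

rng = np.random.default_rng(6)
# Hypothetical data.
X = rng.normal(size=(60, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=60)

# Standardize predictors and centre the response.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

lam = 10.0   # penalty strength
# Ridge solution: beta = (X'X + lam I)^(-1) X'y; coefficients are shrunk towards zero.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)
print(beta_ridge)   # same signs, smaller magnitudes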


Standardization in Ridge Regression

  • In Ridge Regression it is important to standardize all variables 

  • Otherwise the penalty depends on the scale of each variable, and variables measured on larger scales can be unfairly favoured or penalized



Sparse Modeling and the Lasso 

Best subset regression 

  • Best subset regression is a variable selection method that involves fitting all possible regression models derived from all possible combinations of the candidate predictors, and then selecting the subset of predictors that do the best at meeting some well-defined objective criterion, such as having the largest R^2-value or the smallest MSE



Forward stepwise regression 

  • Forward stepwise regression is a variable selection method that starts with a model that contains no variables (called the Null Model) and then adds the most significant variables one after the other until a pre-specified stopping rule is reached or until all the variables under consideration are included in the model.

  • The most significant variable can be chosen by different criteria, for example the variable that, when added to the model:

    • It has the smallest p-value, or

    • It provides the highest increase in R^2

  • The stopping rule is satisfied when all remaining variables to consider have a p-value larger than some specified threshold, if added to the model.



Lasso regression 

  • LASSO regression uses squared loss and a penalty on the ℓ1 norm of β; unlike the ℓ2 penalty, the ℓ1 penalty can set coefficients exactly to zero, giving sparse models 

Regularization 

  • We want our models to learn generalisable patterns, not idiosyncrasies of samples.

  • Key idea of regularization: Limit the capacity of the model, so it only learns patterns for which it has good evidence.


L1 and L2 norms

  • L1 norm: ||β||₁ = Σj |βj|

    • Also Lasso regression

  • L2 norm: ||β||₂ = √(Σj βj²)

    • Also Ridge regression 




Other 

What is the difference between Bayesian and Frequentist approach? 



What is the difference between parametric and non-parametric methods?

