Applied Bayesian

Last updated 3:18 PM on 4/19/26
63 Terms

1
New cards

What is convergence?

Where the chain has forgotten its initial starting position and is generating samples from the target posterior distribution.

Traceplots should look like a random scatter about a stable value, and the Gelman-Rubin statistic \hat{R} should be below 1.05.

2
New cards

What is mixing?

Mixing is the rate at which the MCMC sampler explores the target distribution’s parameter space.

Good mixing is where the chain moves rapidly across the space, resulting in low correlation between successive samples.

Poor mixing is where the chain moves slowly, often due to high autocorrelation between samples (e.g. small step sizes in a random walk, or too large step sizes leading to high rejection), leading to a slow exploration of the target distribution

3
New cards

What is the extended form of the Bayes’ theorem?

Use this in practice

P(A|B) \propto P(A)P(B|A)

P(A'|B) \propto P(A')P(B|A')

\therefore P(A|B)=\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|A')P(A')}
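
As a quick worked example (the numbers below are hypothetical, not from the cards), the extended form with the partition \{A, A'\} can be evaluated directly:

```python
# Hypothetical screening-test numbers, purely for illustration
p_A = 0.01            # P(A): prior probability of the event
p_B_given_A = 0.95    # P(B|A)
p_B_given_Ac = 0.10   # P(B|A')

# Extended Bayes' theorem: the denominator expands P(B) over {A, A'}
p_A_given_B = (p_B_given_A * p_A) / (
    p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)
)
```

Despite a high P(B|A), the posterior P(A|B) comes out small here because the prior P(A) is small.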

4
New cards

What is the difference between the frequentist and Bayesian interpretations of probability?

The frequentist interpretation is

P[A] = \lim_{n \to \infty} \frac{m}{n}

where m is the number of times the event A occurs in a sequence of n independent and identical ‘experiments’

However, this sequence of experiments is hypothetical, and does not actually occur

Frequentist interpretation is based on (potentially) observable events, but we often wish to consider the probabilities of unobservable quantities.

The Bayesian interpretation is that the probability of an event A is a measure of someone’s degree of belief that A will occur.

5
New cards

How do Bayesian and frequentist statistics treat unknown parameters?

knowt flashcard image
6
New cards

What are the mean and variance of a Beta distribution?

m = \frac{\alpha}{\alpha+\beta}

v = \frac{\alpha \beta}{(\alpha+\beta)^2(\alpha + \beta + 1)} = \frac{m(1-m)}{\alpha + \beta + 1}

Useful trick for priors: it is easier to solve for \alpha and \beta after expressing v in terms of m
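
A small sketch (the \alpha, \beta values are arbitrary) checking both formulas by Monte Carlo, and running the elicitation trick in reverse to recover \alpha and \beta from a stated mean and variance:

```python
import random
random.seed(0)

alpha, beta = 2.0, 5.0
m = alpha / (alpha + beta)               # mean
v = m * (1 - m) / (alpha + beta + 1)     # variance via the m-based form

# Monte Carlo check of the formulas
draws = [random.betavariate(alpha, beta) for _ in range(200_000)]
mean_hat = sum(draws) / len(draws)
var_hat = sum((x - mean_hat) ** 2 for x in draws) / len(draws)

# The elicitation trick in reverse: recover alpha, beta from (m, v)
s = m * (1 - m) / v - 1                  # implied alpha + beta
alpha_rec, beta_rec = m * s, (1 - m) * s
```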

7
New cards

What are Bayesian point estimates?

A point estimate is a numerical summary of the ‘location’ of the posterior distribution

Common choices include the mean, median or mode of the posterior

8
New cards

What is a credible interval?

A set C \subset \mathbb{R} is a 100(1-\alpha)\% credible interval for \theta if

P(\theta \in C | y) = 1 - \alpha

There are many 100(1-\alpha)\% credible intervals

The most widely used is the central credible interval for \theta, which is

[\theta_{\alpha/2}, \theta_{1-\alpha/2}]

where \theta_q is the (100 \times q)th percentile of p(\theta | y), i.e. P(\theta \le \theta_q | y) = q

We can alternatively use the highest posterior density (HPD) credible interval

C = \{\theta : p(\theta|y) > b\}, where b is set such that P(\theta \in C | y) = 1 - \alpha

The 100(1-\alpha)\% HPD interval is the 100(1-\alpha)\% credible interval with the shortest width
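
Given posterior samples (a standard normal stands in for p(\theta | y) in this sketch), both intervals can be read off the sorted draws:

```python
import random
random.seed(1)

# Posterior samples; a standard normal stands in for p(theta | y)
draws = sorted(random.gauss(0, 1) for _ in range(100_000))
a = 0.05
n = len(draws)
k = int((1 - a) * n)   # number of draws inside a 95% interval

# Central credible interval: alpha/2 and 1-alpha/2 empirical percentiles
central = (draws[int(a / 2 * n)], draws[int((1 - a / 2) * n)])

# HPD interval: the shortest window containing 95% of the draws
width, i = min((draws[j + k] - draws[j], j) for j in range(n - k))
hpd = (draws[i], draws[i + k])
```

For this symmetric target both intervals are roughly (-1.96, 1.96); for a skewed posterior the HPD interval would be strictly shorter than the central one.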

9
New cards

What is the predictive distribution of \tilde{Y}, following the same process as observed data y?

p(\tilde{Y}=\tilde{y}|Y=y)=\int p(\tilde{y}|\theta)\,p(\theta|y)\,d\theta
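
The integral can be approximated by Monte Carlo: draw \theta from the posterior, then \tilde{y} from p(\tilde{y}|\theta), and average. A sketch with an assumed Beta(3, 5) posterior for a Bernoulli success probability (conjugate, so the exact answer is known):

```python
import random
random.seed(2)

# Assume a Beta(3, 5) posterior for a Bernoulli success probability theta
a, b = 3.0, 5.0

# Monte Carlo version of the predictive integral: draw theta from
# p(theta | y), then y_tilde from p(y_tilde | theta), and average
hits = sum(random.random() < random.betavariate(a, b) for _ in range(200_000))
pred = hits / 200_000

exact = a / (a + b)   # closed form for this conjugate case
```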

10
New cards

How does sequential analysis work in Bayesian statistics?

knowt flashcard image
11
New cards

Let Y|\theta \sim N(\theta, \tau^{-1}), where \tau is known and \theta is unknown, and let the prior be \theta \sim N(\mu_0, \phi_0^{-1}). How do we obtain the posterior of \theta?

If n \to \infty with \phi_0 fixed, or \phi_0 \to 0 with n fixed (either lots of data or very diffuse prior beliefs), then approximately \theta | y \sim N(\bar{y}, (n\tau)^{-1}): the sampling distribution of the MLE.

If we write \phi_0 = \kappa_0 \tau, then

\theta | y \sim N\left(\frac{n}{n+\kappa_0} \bar{y} + \frac{\kappa_0}{n+\kappa_0} \mu_0, ((n + \kappa_0) \tau)^{-1}\right)

Hence \kappa_0 may be viewed as a ‘prior sample size’
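
A quick consistency sketch (all numbers arbitrary): the \kappa_0 form of the posterior mean should agree with the generic precision-weighted form using \phi_0 = \kappa_0 \tau:

```python
import random
random.seed(3)

tau = 4.0                 # known precision of the data
mu0, kappa0 = 1.0, 2.0    # prior mean and 'prior sample size' (phi0 = kappa0*tau)
n = 50
y = [random.gauss(2.5, tau ** -0.5) for _ in range(n)]
ybar = sum(y) / n

# Posterior mean in the kappa0 ('prior sample size') form
post_mean = n / (n + kappa0) * ybar + kappa0 / (n + kappa0) * mu0
post_prec = (n + kappa0) * tau

# Same thing in the generic precision-weighted form with phi0 = kappa0 * tau
phi0 = kappa0 * tau
post_mean_alt = (n * tau * ybar + phi0 * mu0) / (n * tau + phi0)
```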

12
New cards

Let Y|\tau \sim N(\theta, \tau^{-1}), where \tau is unknown and \theta is known, and let the prior be \tau \sim \text{Gamma}(\alpha, \beta). How do we obtain the posterior of \tau?

13
New cards

How do we find the posterior distribution of a function of a parameter?

Rewrite p_{\theta | y}(\theta | y) in terms of \phi and then multiply by the Jacobian \left|\frac{d\theta}{d\phi}\right|

14
New cards

How do we do hypothesis tests in Bayesian statistics?

Choose the hypothesis with the largest posterior probability

Alternatively, can set losses for type I and type II error and calculate expected losses

15
New cards

How can we take into account the support of \theta when choosing a prior p(\theta)?

For vague priors

For \theta \in (-\infty, \infty), use \theta \sim N(0, \sigma^2) where \sigma^2 is very large

For \theta \in (0, \infty), use \theta \sim \text{Gamma}(\epsilon, \epsilon) where \epsilon is very small (however, this is peaked at 0, so highly informative when the likelihood is not negligible near 0)

For \theta \in [0,1], use \theta \sim \text{Beta}(\epsilon, \epsilon) where \epsilon is very small (however, this is peaked at 0 and 1, so highly informative when the likelihood is not negligible near 0 or 1)

16
New cards

What is a conjugate prior?

A prior is a conjugate prior for a likelihood function p(x | \theta) if the prior p(\theta) is in the same family of distributions as the posterior p(\theta | x)

17
New cards

What is a natural conjugate prior?

knowt flashcard image
18
New cards

What is the relationship between exponential families of distributions and conjugate priors?

Can use this to find conjugate priors

19
New cards

What is an improper prior?

Improper priors are used in Bayesian inference, but can only be used if the posterior will be proper for all possible observable data

20
New cards

What are non-informative priors?

Priors that will be dominated by the likelihood, such that the posterior depends on the data as much as possible. Do not depend on previously obtained information.

e.g. uniform priors (which may be improper), vague / diffuse priors (priors with very high variance; p(\theta) does not change much over the values of \theta for which the likelihood is non-negligible) or Jeffreys’ prior

21
New cards

How does Haldane’s prior differ from the uniform prior?

22
New cards

What is the “vague” proper prior for values between 0 and 1?

knowt flashcard image
23
New cards

Are uniform priors invariant to transformations?

No.

A prior for a parameter \theta, p(\theta), implies that the prior distribution of \phi = g(\theta) is

p_{\Phi}(\phi) = p_\Theta(\theta) \left|\frac{d \theta}{d \phi}\right|

For a uniform prior, p(\theta) \propto 1, so p_{\Phi}(\phi) \propto \left|\frac{d \theta}{d \phi}\right|. This is constant only if \left|\frac{d \theta}{d \phi}\right| is constant, i.e. g(\cdot) is a linear transformation

Therefore we can’t have uniform priors for both \theta and a non-linear transformation g(\theta)

24
New cards

What is Jeffreys’ prior?

knowt flashcard image
25
New cards

If we know the distribution of \theta, what is the distribution of \phi = g(\theta)?

f_\Phi(\phi) = f_\Theta(\theta) \left|\frac{d \theta}{d \phi}\right| = f_\Theta(g^{-1}(\phi)) \left|\frac{d \theta}{d \phi}\right|

26
New cards

When is the posterior mode approximately equal to the MLE?

knowt flashcard image
27
New cards

What is the difference between credible intervals and confidence intervals?

knowt flashcard image
28
New cards

What is the Likelihood principle?

The LP implies that it matters only what was observed, and not what might have been observed. However, the frequentist approach depends not only on what was observed, but also on the design of the study (e.g. how the experiment was stopped, binomial and negative binomial likelihoods differ)

29
New cards

What is marginal independence?

knowt flashcard image
30
New cards

What is conditional independence?

knowt flashcard image
31
New cards

What is a DAG?

A Directed Acyclic Graph is a directed graph (all nodes are random variables, all edges are arrows) that contains no directed cycles

A directed edge / arrow from one node to another indicates that the first variable causes / influences the second. Dashed arrows denote deterministic dependencies, solid arrows denote stochastic dependencies.

DAGs are useful for visualising and investigating conditional dependence, e.g. causal relationships between random variables, where X causes / influences Y, but Y does not cause / influence X

32
New cards

What are parents, children, ancestors, descendants and founders in DAGs?

knowt flashcard image
33
New cards

What are the dependence properties of DAGs?

knowt flashcard image
34
New cards

What is moralising a DAG?

knowt flashcard image
35
New cards

How can we use DAGs to determine whether variables are conditionally independent?

c.i. graph = conditional independence graph

Only draw the relevant variables and their ancestors in the partial DAG

36
New cards

What is the factorisation theorem?

Proceeds from the fact that a variable is conditionally independent of its ancestors given its parents

37
New cards

What is a Markov Blanket?

knowt flashcard image
38
New cards

What is the full-conditional distribution of X_k?

P(X_{k}|X_{\backslash X_{k}}) \propto P(X)=\prod_{i=1}^{K}P(X_{i}|\text{parents}[X_{i}]) (factorisation theorem)

P(X_{i}|\text{parents}[X_{i}]) is constant with respect to X_k if X_k is neither X_i nor a parent of X_i

So P(X_k | X_{\backslash X_k}) \propto P(X_k | \text{parents}[X_k]) \prod_{w\in \text{children}[X_k]} P(w|\text{parents}[w])

39
New cards

What is the motivation for MCMC?

knowt flashcard image
40
New cards

What is a Markov chain?

knowt flashcard image
41
New cards

What is MCMC?

For any function of the parameters, simply calculate f^{(i)} = f(\theta^{(i)}) to obtain a sample from its posterior distribution. We can then calculate the posterior mean / median / mode, etc.

42
New cards

What is the Gibbs sampler algorithm?

The full conditional distribution of a parameter may be a known distribution, which can be sampled from simply

If the full conditional distribution is not proportional to a kernel of a known distribution, the full conditional distribution would need to be sampled using a method such as the Metropolis-Hastings algorithm (Metropolis within Gibbs) or rejection sampling
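
As a minimal sketch (the target here is a standard bivariate normal with correlation \rho, not an example from the cards), where both full conditionals are known normals that can be sampled from directly:

```python
import random
random.seed(4)

rho = 0.8                      # correlation of the bivariate normal target
sd = (1 - rho ** 2) ** 0.5     # full-conditional standard deviation
x1, x2 = 0.0, 0.0
samples = []
for _ in range(20_000):
    # Sample each coordinate from its full conditional in turn
    x1 = random.gauss(rho * x2, sd)   # X1 | X2 ~ N(rho*X2, 1 - rho^2)
    x2 = random.gauss(rho * x1, sd)   # X2 | X1 ~ N(rho*X1, 1 - rho^2)
    samples.append((x1, x2))

kept = samples[1_000:]               # discard a short burn-in
m1 = sum(s[0] for s in kept) / len(kept)
m2 = sum(s[1] for s in kept) / len(kept)
corr = (sum((s[0] - m1) * (s[1] - m2) for s in kept) / len(kept)) / (
    (sum((s[0] - m1) ** 2 for s in kept) / len(kept)) ** 0.5
    * (sum((s[1] - m2) ** 2 for s in kept) / len(kept)) ** 0.5
)
```

The sample means and correlation should be close to the target's (0, 0, \rho); the larger \rho is, the slower the chain mixes.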

43
New cards

What are the two ways of obtaining the full-conditional distribution for a variable?

P(C | V \backslash C) \propto terms in the joint distribution containing C

Factorisation theorem - joint distribution P(V) = \prod_{v \in V} P(v |\text{parents}[v])

P(C | V \backslash C) \propto P(C | \text{parents}[C]) \prod _{w \in \text{children}[C]} P(w| \text{parents}[w])

44
New cards

How can we use DAGs to represent hierarchical models?

knowt flashcard image
45
New cards

What are the issues that arise due to dependence between MCMC samples?

knowt flashcard image
46
New cards

What is burn-in?

The burn-in comprises the first M samples of the chain, which are discarded because it is believed that the chain has not yet converged and that these values are still dependent on the initial values.

Strictly speaking, convergence is only achieved for M = \infty

In practice, we can only detect lack of convergence

If no evidence of lack of convergence is found, we are more confident that the chain has converged

We can check this:

  • using traceplots: once convergence has been reached, samples should look like a random scatter about a stable value

  • using convergence diagnostics: if the Gelman-Rubin diagnostic \hat{R} < 1.05, this indicates practical convergence

Use these to set M, the length of the burn-in

47
New cards

What is the Gelman-Rubin statistic?

Compares within chain and between chain variation
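
One common form of the statistic (a sketch; exact conventions vary between texts) pools m chains of length n and compares the between-chain and within-chain variance estimates:

```python
import random
random.seed(5)

def gelman_rubin(chains):
    """Potential scale reduction: between- vs within-chain variance."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)      # between
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)               # within
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n   # pooled variance estimate
    return (var_hat / W) ** 0.5

# Four chains already sampling the same target: R-hat should be near 1
good = [[random.gauss(0, 1) for _ in range(2_000)] for _ in range(4)]
# One chain stuck around a different mode: R-hat well above 1.05
bad = good[:3] + [[x + 3 for x in good[3]]]
```

If the chains disagree on the mean, B inflates var_hat relative to W and \hat{R} rises above 1.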

48
New cards

How can we determine how long to run MCMC chains for?

knowt flashcard image
49
New cards

What are batch mean SEs?

Divide by Q(Q-1) because we are estimating the variance of \hat{b}, not an individual b

Can also account for autocorrelation via time series SEs
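
A sketch of the batch-means calculation, assuming Q equal batches; the chain here is i.i.d. noise purely to check the scale of the answer:

```python
import random
random.seed(6)

def batch_means_se(x, Q=20):
    """Batch-means standard error of the chain's overall mean."""
    n = len(x) // Q
    bmeans = [sum(x[q * n:(q + 1) * n]) / n for q in range(Q)]
    bbar = sum(bmeans) / Q
    # Divide by Q(Q-1): we estimate the variance of b-bar, not of a single b
    return (sum((b - bbar) ** 2 for b in bmeans) / (Q * (Q - 1))) ** 0.5

chain = [random.gauss(0, 1) for _ in range(10_000)]
se = batch_means_se(chain)   # for i.i.d. N(0,1) draws, roughly 1/sqrt(10000)
```

For an autocorrelated MCMC chain the batch means absorb the short-range dependence, so the SE comes out larger than the naive i.i.d. formula would suggest.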

50
New cards

What are Bayesian hierarchical models?

Also called random effect or multi-level models

We assume that the parameters \theta of the groups are a random sample from a common population distribution, and then estimate the (hyper-)parameters of that population distribution

We assign a (hierarchical) prior distribution to the hyper-parameters.

We have:

  • Likelihood p(y|\theta) (1st level)

  • Prior p(\theta | \phi_2) with higher-level parameter \phi_2 (2nd level)

  • Prior p(\phi_2) (3rd level)

We can add further levels, with \phi_k being the k-th level hyper-parameters. A non-informative prior is usually specified for the top-level parameters.

These models imply that \theta_i is different for every group, but similar - the \theta_i are not marginally independent, but are exchangeable

By assuming that the parameters are drawn from a common population distribution, the more extreme parameters are shrunk towards the overall mean. The posterior distribution for each \theta_i borrows strength from the likelihood contributions of all groups, via their influence on the unknown population parameters, and reflects our full uncertainty about the true values of the population parameters. These models are also useful if we are interested in the population parameters themselves.

This is better than assuming a common \theta between all groups, or than assuming the parameters \theta_i for each group are independent (we want to use information about \theta_{\backslash i} to estimate \theta_i)

Can obtain full conditional distribution for each parameter and then use Gibbs sampler

51
New cards

What is exchangeability and the general representation theorem?

A sequence of random variables \theta_1, …, \theta_n is exchangeable if, for any permutation \{ i_1, …, i_n\} of \{1, …, n\}, (\theta_{i_1}, …, \theta_{i_n}) has the same n-dimensional joint probability distribution as (\theta_1, …, \theta_n)

i.e. for all a_1, …, a_n,

p(\theta_1=a_1,\ldots,\theta_{n}=a_{n})=p(\theta_{i_1}=a_1,\ldots,\theta_{i_{n}}=a_{n})

If \theta_1, …, \theta_n are marginally independent and have the same marginal distribution (they are i.i.d.), then they are exchangeable

The general representation theorem shows that if \theta_1, …, \theta_n are exchangeable, then there exists a parametric model p(\theta | \phi) with prior p(\phi) for \phi such that \theta_i \perp\!\!\!\!\perp \theta_j | \phi and therefore \theta_i | \phi \stackrel{\text{iid}}{\sim} p(\theta | \phi)

Exchangeability therefore implies a hierarchical model: this is equivalent to \theta_1, …, \theta_n being a random sample from a model p(\theta|\phi) with prior p(\phi)

52
New cards

When is it reasonable to assume exchangeability?

The parameters of groups, \theta_1, …, \theta_n, are exchangeable if, for any permutation \{ i_1, …, i_n\} of \{1, …, n\}, (\theta_{i_1}, …, \theta_{i_n}) has the same n-dimensional joint probability distribution as (\theta_1, …, \theta_n): e.g. there is no reason to believe that \theta_i=a and \theta_j = b is more likely than \theta_i = b and \theta_j = a

However, if we know additional information about the groups, exchangeability may not be a reasonable assumption; we may have reason to expect one group’s parameter to be greater than another’s.

Whether it is reasonable to assume exchangeability depends on our degree of knowledge / ignorance.

We can address this by including covariates in our hierarchical models, e.g. a generalised linear mixed model

53
New cards

What is the Metropolis-Hastings sampler?

The Metropolis-Hastings sampler produces a chain of values \theta^{(1)}, \theta^{(2)}, … from a distribution with density g(\theta) (e.g. full conditional or posterior)

Choose \theta^{(0)} then, at the i-th iteration

  • Propose \theta' from a proposal distribution q(\theta^{(i-1)}, \theta')

  • (Via a simulated uniform random variable) Accept this proposal with probability \alpha(\theta^{(i-1)},\theta')=\min\left\lbrace 1,\frac{g(\theta')q(\theta',\theta^{(i-1)})}{g(\theta^{(i-1)})q(\theta^{(i-1)},\theta')}\right\rbrace

There are two main types of proposal distribution:

The random walk sampler, where \theta' = \theta^{(i-1)}+\epsilon_i and the distribution of \epsilon_i is zero-mean and symmetric (e.g. normal, t-distribution)

Then q(\theta^{(i-1)}, \theta') = q(\theta', \theta^{(i-1)}): always accept uphill moves, sometimes accept downhill moves

The variance of \epsilon_i is a tuning parameter. Too small a step size will lead to proposals of small deviations and a high acceptance rate, causing random walk behaviour and poor mixing (the chain explores the sample space slowly). Too large a step size will lead to proposals of large deviations and a low acceptance rate, causing a ‘sticky’ chain with high autocorrelation and poor mixing. High autocorrelation in the chain will result in a high variance of estimators.

The independence sampler, where the proposal distribution does not depend on the previous state \theta^{(i-1)}, i.e. q(\theta^{(i-1)}, \theta') = q(\theta')

This method works well when q(\theta) is a good approximation of g(\theta). It can move from one side of the distribution to the other, giving better mixing than a random walk. However, if it is a poor approximation, we will reject most proposals, leading to a sticky chain
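
A minimal random-walk Metropolis sketch: a standard normal stands in for g(\theta), and the step size 2.4 is an assumed tuning choice, not a recommendation from the cards. With a symmetric proposal the q terms cancel, so only the ratio of target densities matters:

```python
import math
import random
random.seed(7)

def log_g(theta):
    # Unnormalised log target; a standard normal stands in for the posterior
    return -0.5 * theta ** 2

theta, step = 0.0, 2.4     # step (proposal sd) is the tuning parameter
chain, accepted = [], 0
for _ in range(50_000):
    prop = theta + random.gauss(0, step)      # symmetric random-walk proposal
    # Accept with probability min(1, g(prop)/g(theta)); done on the log scale
    if math.log(random.random()) < log_g(prop) - log_g(theta):
        theta = prop
        accepted += 1
    chain.append(theta)

rate = accepted / len(chain)
mean_hat = sum(chain) / len(chain)
```

Rerunning with a much smaller or much larger step shows the two failure modes described above: near-certain acceptance with tiny moves, or a sticky chain with rare large moves.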

54
New cards

How do we tune Metropolis Hastings random walk samplers?

The random walk sampler is \theta' = \theta^{(i-1)}+\epsilon_i where the distribution of \epsilon_i is zero-mean and symmetric (e.g. normal, t-distribution)

Then q(\theta^{(i-1)}, \theta') = q(\theta', \theta^{(i-1)}): always accept uphill moves, sometimes accept downhill moves

The variance of \epsilon_i is a tuning parameter. Too small a step size will lead to proposals of small deviations and a high acceptance rate, causing random walk behaviour and poor mixing (the chain explores the sample space slowly). Too large a step size will lead to proposals of large deviations and a low acceptance rate, causing a ‘sticky’ chain with high autocorrelation and poor mixing.

High autocorrelation in the chain will result in a high variance of estimators.

It can be shown that, in general, samplers with the lowest variability have an average acceptance rate between 0.2 and 0.3

55
New cards

What problems does correlation cause in the Gibbs sampler?

The Gibbs sampler works by sampling from each full conditional distribution. This can lead to slow mixing if the posterior distribution has high correlation between some variables.

e.g. let Var(X_1)=Var(X_2)=1 and Cov(X_1, X_2) = \rho

The full conditional distribution of X_1 is N(\rho X_2, 1-\rho^2), and vice versa

If \rho is large, the conditional variance of X_1 | X_2, 1-\rho^2, is small relative to the marginal variance 1, which leads to slower convergence

A second problem is that higher correlation between variables will lead to higher dependence between samples. This autocorrelation will cause the variance of estimators to increase.

There are two main remedies:

  • Thinning - only take every kth value of the chain, leading to reduced autocorrelation between samples

  • Reparameterisation - transform from X to some variables with lower correlation

e.g. in the model y_i \sim \text{Poisson}(\mu_i)

with \mu_i = \alpha + \beta X_i

\alpha and \beta are highly correlated. If we centre X and instead specify \mu_i = \alpha + \beta (X_i - \bar{X}), then this removes the correlation between \alpha and \beta
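
The effect of centring can be sketched via the intercept-slope correlation implied by the design matrix of a linear predictor (a normal linear model stands in here; the covariate values are arbitrary):

```python
import random
random.seed(8)

def intercept_slope_corr(xs):
    # Correlation between intercept and slope implied by (X^T X)^{-1}
    # for the design [1, x]: corr = -sum(x) / sqrt(n * sum(x^2))
    n, sx = len(xs), sum(xs)
    sxx = sum(v * v for v in xs)
    return -sx / (n * sxx) ** 0.5

X = [random.gauss(10, 1) for _ in range(500)]   # covariate far from zero
xbar = sum(X) / len(X)

raw = intercept_slope_corr(X)                        # near -1: slow mixing
centred = intercept_slope_corr([v - xbar for v in X])  # essentially zero
```

With the raw covariate, raising \beta can be almost perfectly offset by lowering \alpha, which is exactly the ridge of high posterior correlation that slows the Gibbs sampler; centring removes it.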

56
New cards

For hierarchical models, what is the predictive distribution of a new value in a group, a parameter for a new group, and a new value in a new group?

knowt flashcard image
57
New cards

What is the Bayes factor?

Disadvantages: very difficult to work out Bayes factor analytically for more complex models

Lindley’s paradox: if comparing a model M1 where a parameter is set with a model M2 where a parameter is given a prior, we can choose a weight on that prior to make the Bayes factor in favour of M2 arbitrarily small. Bayes factors are sensitive to the choice of the weight on the prior: should choose a sensible prior variance of any parameters.

58
New cards

Why can’t we simply use number of parameters in Bayesian hierarchical models?

knowt flashcard image
59
New cards

What is the Deviance Information Criterion (DIC)?

knowt flashcard image
60
New cards

What is the Widely applicable information criterion (WAIC) ?

More stable than the Deviance Information Criterion

61
New cards

How can we carry out cross-validation of Bayesian models?

knowt flashcard image
62
New cards

What are the pros and cons of using Bayes factor, DIC / WAIC and cross-validation to evaluate Bayesian models?

knowt flashcard image
63
New cards

What is the label switching problem in mixture models?

knowt flashcard image