Applied Bayesian

Last updated 3:18 PM on 4/19/26
63 Terms

1
New cards

What is convergence?

Where the chain has forgotten its initial starting position and is generating samples from the target posterior distribution.

Traceplots should look like a random scatter about a stable value, and the Gelman-Rubin statistic \hat{R} should be below 1.05.

2
New cards

What is mixing?

Mixing is the rate at which the MCMC sampler explores the target distribution’s parameter space.

Good mixing is where the chain moves rapidly across the space, resulting in low correlation between successive samples.

Poor mixing is where the chain moves slowly, often due to high autocorrelation between samples (e.g. small step sizes in a random walk, or too large step sizes leading to high rejection), leading to a slow exploration of the target distribution

3
New cards

What is the extended form of the Bayes’ theorem?

Use this in practice

P(A|B) \propto P(A)P(B|A)

P(A'|B) \propto P(A')P(B|A')

\therefore P(A|B)=\frac{P(B|A)P(A)}{P(B|A)P(A)+P(B|A')P(A')}
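
As a quick worked example (the numbers below are hypothetical, not from the cards), the extended form with the partition \{A, A'\} can be evaluated directly:

```python
# Hypothetical screening-test numbers, purely for illustration
p_A = 0.01            # P(A): prior probability of the event
p_B_given_A = 0.95    # P(B|A)
p_B_given_Ac = 0.10   # P(B|A')

# Extended Bayes' theorem: the denominator expands P(B) over {A, A'}
p_A_given_B = (p_B_given_A * p_A) / (
    p_B_given_A * p_A + p_B_given_Ac * (1 - p_A)
)
```

Despite a high P(B|A), the posterior P(A|B) comes out small here because the prior P(A) is small.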

4
New cards

What is the difference between the frequentist and Bayesian interpretations of probability?

The frequentist interpretation is

P[A] = \lim_{n \to \infty} \frac{m}{n}

where m is the number of times the event A occurs in a sequence of n independent and identical ‘experiments’

However, this sequence of experiments is hypothetical, and does not actually occur

Frequentist interpretation is based on (potentially) observable events, but we often wish to consider the probabilities of unobservable quantities.

The Bayesian interpretation is that the probability of an event A is a measure of someone’s degree of belief that A will occur.

5
New cards

How do Bayesian and frequentist statistics treat unknown parameters?

knowt flashcard image
6
New cards

What are the mean and variance of a Beta distribution?

m = \frac{\alpha}{\alpha+\beta}

v = \frac{\alpha \beta}{(\alpha+\beta)^2(\alpha + \beta + 1)} = \frac{m(1-m)}{\alpha + \beta + 1}

Useful trick for priors: it is easier to solve for \alpha and \beta after expressing v in terms of m
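
A small sketch (the \alpha, \beta values are arbitrary) checking both formulas by Monte Carlo, and running the elicitation trick in reverse to recover \alpha and \beta from a stated mean and variance:

```python
import random
random.seed(0)

alpha, beta = 2.0, 5.0
m = alpha / (alpha + beta)               # mean
v = m * (1 - m) / (alpha + beta + 1)     # variance via the m-based form

# Monte Carlo check of the formulas
draws = [random.betavariate(alpha, beta) for _ in range(200_000)]
mean_hat = sum(draws) / len(draws)
var_hat = sum((x - mean_hat) ** 2 for x in draws) / len(draws)

# The elicitation trick in reverse: recover alpha, beta from (m, v)
s = m * (1 - m) / v - 1                  # implied alpha + beta
alpha_rec, beta_rec = m * s, (1 - m) * s
```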

7
New cards

What are Bayesian point estimates?

A point estimate is a numerical summary of the ‘location’ of the posterior distribution

Common choices include the mean, median or mode of the posterior

8
New cards

What is a credible interval?

A set C \subset \mathbb{R} is a 100(1-\alpha)\% credible interval for \theta if

P(\theta \in C | y) = 1 - \alpha

There are many 100(1-\alpha)\% credible intervals

The most widely used is the central credible interval for \theta, which is

[\theta_{\alpha/2}, \theta_{1-\alpha/2}]

where \theta_q is the (100 \times q)th percentile of p(\theta | y), i.e. P(\theta \le \theta_q | y) = q

We can alternatively use the highest posterior density (HPD) credible interval

C = \{\theta : p(\theta|y) > b\}, where b is set such that P(\theta \in C | y) = 1 - \alpha

The 100(1-\alpha)\% HPD interval is the 100(1-\alpha)\% credible interval with the shortest width
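
Given posterior samples (a standard normal stands in for p(\theta | y) in this sketch), both intervals can be read off the sorted draws:

```python
import random
random.seed(1)

# Posterior samples; a standard normal stands in for p(theta | y)
draws = sorted(random.gauss(0, 1) for _ in range(100_000))
a = 0.05
n = len(draws)
k = int((1 - a) * n)   # number of draws inside a 95% interval

# Central credible interval: alpha/2 and 1-alpha/2 empirical percentiles
central = (draws[int(a / 2 * n)], draws[int((1 - a / 2) * n)])

# HPD interval: the shortest window containing 95% of the draws
width, i = min((draws[j + k] - draws[j], j) for j in range(n - k))
hpd = (draws[i], draws[i + k])
```

For this symmetric target both intervals are roughly (-1.96, 1.96); for a skewed posterior the HPD interval would be strictly shorter than the central one.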

9
New cards

What is the predictive distribution of \tilde{Y}, following the same process as observed data y?

p(\tilde{Y}=\tilde{y}|Y=y)=\int p(\tilde{y}|\theta)\,p(\theta|y)\,d\theta
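
The integral can be approximated by Monte Carlo: draw \theta from the posterior, then \tilde{y} from p(\tilde{y}|\theta), and average. A sketch with an assumed Beta(3, 5) posterior for a Bernoulli success probability (conjugate, so the exact answer is known):

```python
import random
random.seed(2)

# Assume a Beta(3, 5) posterior for a Bernoulli success probability theta
a, b = 3.0, 5.0

# Monte Carlo version of the predictive integral: draw theta from
# p(theta | y), then y_tilde from p(y_tilde | theta), and average
hits = sum(random.random() < random.betavariate(a, b) for _ in range(200_000))
pred = hits / 200_000

exact = a / (a + b)   # closed form for this conjugate case
```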

10
New cards

How does sequential analysis work in Bayesian statistics?

knowt flashcard image
11
New cards

Let Y|\theta \sim N(\theta, \tau^{-1}), where \tau is known and \theta is unknown, and let the prior be \theta \sim N(\mu_0, \phi_0^{-1}). How do we obtain the posterior of \theta?

If n \to \infty with \phi_0 fixed, or \phi_0 \to 0 with n fixed (either lots of data or very diffuse prior beliefs), then approximately \theta | y \sim N(\bar{y}, (n\tau)^{-1}): the sampling distribution of the MLE.

If we write \phi_0 = \kappa_0 \tau, then

\theta | y \sim N\left(\frac{n}{n+\kappa_0} \bar{y} + \frac{\kappa_0}{n+\kappa_0} \mu_0, ((n + \kappa_0) \tau)^{-1}\right)

Hence \kappa_0 may be viewed as a ‘prior sample size’
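
A quick consistency sketch (all numbers arbitrary): the \kappa_0 form of the posterior mean should agree with the generic precision-weighted form using \phi_0 = \kappa_0 \tau:

```python
import random
random.seed(3)

tau = 4.0                 # known precision of the data
mu0, kappa0 = 1.0, 2.0    # prior mean and 'prior sample size' (phi0 = kappa0*tau)
n = 50
y = [random.gauss(2.5, tau ** -0.5) for _ in range(n)]
ybar = sum(y) / n

# Posterior mean in the kappa0 ('prior sample size') form
post_mean = n / (n + kappa0) * ybar + kappa0 / (n + kappa0) * mu0
post_prec = (n + kappa0) * tau

# Same thing in the generic precision-weighted form with phi0 = kappa0 * tau
phi0 = kappa0 * tau
post_mean_alt = (n * tau * ybar + phi0 * mu0) / (n * tau + phi0)
```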

12
New cards

Let Y|\tau \sim N(\theta, \tau^{-1}), where \tau is unknown and \theta is known, and let the prior be \tau \sim \text{Gamma}(\alpha, \beta). How do we obtain the posterior of \tau?

13
New cards

How do we find the posterior distribution of a function of a parameter?

Rewrite p_{\theta | y}(\theta | y) in terms of \phi and then multiply by the Jacobian \left|\frac{d\theta}{d\phi}\right|

14
New cards

How do we do hypothesis tests in Bayesian statistics?

Choose the hypothesis with the largest posterior probability

Alternatively, can set losses for type I and type II error and calculate expected losses

15
New cards

How can we take into account the support of \theta when choosing a prior p(\theta)?

For vague priors

For \theta \in (-\infty, \infty), use \theta \sim N(0, \sigma^2) where \sigma^2 is very large

For \theta \in (0, \infty), use \theta \sim \text{Gamma}(\epsilon, \epsilon) where \epsilon is very small (however, this is peaked at 0, so highly informative when the likelihood is not negligible near 0)

For \theta \in [0,1], use \theta \sim \text{Beta}(\epsilon, \epsilon) where \epsilon is very small (however, this is peaked at 0 and 1, so highly informative when the likelihood is not negligible near 0 or 1)

16
New cards

What is a conjugate prior?

A prior is a conjugate prior for a likelihood function p(x | \theta) if the prior p(\theta) is in the same family of distributions as the posterior p(\theta | x)

17
New cards

What is a natural conjugate prior?

knowt flashcard image
18
New cards

What is the relationship between exponential families of distributions and conjugate priors?

Can use this to find conjugate priors

19
New cards

What is an improper prior?

Improper priors are used in Bayesian inference, but can only be used if the posterior will be proper for all possible observable data

20
New cards

What are non-informative priors?

Priors that will be dominated by the likelihood, such that the posterior depends on the data as much as possible. Do not depend on previously obtained information.

e.g. uniform priors (which may be improper), vague / diffuse priors (priors with very high variance; p(\theta) does not change much over the values of \theta for which the likelihood is non-negligible) or Jeffreys’ prior

21
New cards

How does Haldane’s prior differ from the uniform prior?

22
New cards

What is the “vague” proper prior for values between 0 and 1?

knowt flashcard image
23
New cards

Are uniform priors invariant to transformations?

No.

A prior for a parameter \theta, p(\theta), implies that the prior distribution of \phi = g(\theta) is

p_{\Phi}(\phi) = p_\Theta(\theta) \left|\frac{d \theta}{d \phi}\right|

For a uniform prior, p(\theta) \propto 1, so p_{\Phi}(\phi) \propto \left|\frac{d \theta}{d \phi}\right|. This is constant only if \left|\frac{d \theta}{d \phi}\right| is constant, i.e. g(\cdot) is a linear transformation

Therefore we can’t have uniform priors for both \theta and a non-linear transformation g(\theta)

24
New cards

What is Jeffreys’ prior?

knowt flashcard image
25
New cards

If we know the distribution of \theta, what is the distribution of \phi = g(\theta)?

f_\Phi(\phi) = f_\Theta(\theta) \left|\frac{d \theta}{d \phi}\right| = f_\Theta(g^{-1}(\phi)) \left|\frac{d \theta}{d \phi}\right|

26
New cards

When is the posterior mode approximately equal to the MLE?

knowt flashcard image
27
New cards

What is the difference between credible intervals and confidence intervals?

knowt flashcard image
28
New cards

What is the Likelihood principle?

The LP implies that it matters only what was observed, and not what might have been observed. However, the frequentist approach depends not only on what was observed, but also on the design of the study (e.g. how the experiment was stopped, binomial and negative binomial likelihoods differ)

29
New cards

What is marginal independence?

knowt flashcard image
30
New cards

What is conditional independence?

knowt flashcard image
31
New cards

What is a DAG?

A Directed Acyclic Graph is a directed graph (all nodes are random variables, all edges are arrows) that contains no directed cycles

A directed edge / arrow from one node to another indicates that the first variable causes / influences the second. Dashed arrows denote deterministic dependencies, solid arrows denote stochastic dependencies.

DAGs are useful for visualising and investigating conditional dependence, e.g. causal relationships between random variables, where X causes / influences Y, but Y does not cause / influence X

32
New cards

What are parents, children, ancestors, descendants and founders in DAGs?

knowt flashcard image
33
New cards

What are the dependence properties of DAGs?

knowt flashcard image
34
New cards

What is moralising a DAG?

knowt flashcard image
35
New cards

How can we use DAGs to determine whether variables are conditionally independent?

c.i. graph = conditional independence graph

Only draw the relevant variables and their ancestors in the partial DAG

36
New cards

What is the factorisation theorem?

Proceeds from the fact that a variable is conditionally independent of its ancestors given its parents

37
New cards

What is a Markov Blanket?

knowt flashcard image
38
New cards

What is the full-conditional distribution of X_k?

P(X_{k}|X_{\backslash X_{k}}) \propto P(X)=\prod_{i=1}^{K}P(X_{i}|\text{parents}[X_{i}]) (factorisation theorem)

P(X_{i}|\text{parents}[X_{i}]) is constant with respect to X_k if X_k is neither X_i nor a parent of X_i

So P(X_k | X_{\backslash X_k}) \propto P(X_k | \text{parents}[X_k]) \prod_{w\in \text{children}[X_k]} P(w|\text{parents}[w])

39
New cards

What is the motivation for MCMC?

knowt flashcard image
40
New cards

What is a Markov chain?

knowt flashcard image
41
New cards

What is MCMC?

For any function of the parameters, simply calculate f^{(i)} = f(\theta^{(i)}) to obtain a sample from its posterior distribution. We can then calculate the posterior mean / median / mode, etc.

42
New cards

What is the Gibbs sampler algorithm?

The full conditional distribution of a parameter may be a known distribution, which can be sampled from simply

If the full conditional distribution is not proportional to a kernel of a known distribution, the full conditional distribution would need to be sampled using a method such as the Metropolis-Hastings algorithm (Metropolis within Gibbs) or rejection sampling
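
As a minimal sketch (the target here is a standard bivariate normal with correlation \rho, not an example from the cards), where both full conditionals are known normals that can be sampled from directly:

```python
import random
random.seed(4)

rho = 0.8                      # correlation of the bivariate normal target
sd = (1 - rho ** 2) ** 0.5     # full-conditional standard deviation
x1, x2 = 0.0, 0.0
samples = []
for _ in range(20_000):
    # Sample each coordinate from its full conditional in turn
    x1 = random.gauss(rho * x2, sd)   # X1 | X2 ~ N(rho*X2, 1 - rho^2)
    x2 = random.gauss(rho * x1, sd)   # X2 | X1 ~ N(rho*X1, 1 - rho^2)
    samples.append((x1, x2))

kept = samples[1_000:]               # discard a short burn-in
m1 = sum(s[0] for s in kept) / len(kept)
m2 = sum(s[1] for s in kept) / len(kept)
corr = (sum((s[0] - m1) * (s[1] - m2) for s in kept) / len(kept)) / (
    (sum((s[0] - m1) ** 2 for s in kept) / len(kept)) ** 0.5
    * (sum((s[1] - m2) ** 2 for s in kept) / len(kept)) ** 0.5
)
```

The sample means and correlation should be close to the target's (0, 0, \rho); the larger \rho is, the slower the chain mixes.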

43
New cards

What are the two ways of obtaining the full-conditional distribution for a variable?

P(C | V \backslash C) \propto terms in the joint distribution containing C

Factorisation theorem - joint distribution P(V) = \prod_{v \in V} P(v |\text{parents}[v])

P(C | V \backslash C) \propto P(C | \text{parents}[C]) \prod _{w \in \text{children}[C]} P(w| \text{parents}[w])

44
New cards

How can we use DAGs to represent hierarchical models?

knowt flashcard image
45
New cards

What are the issues that arise due to dependence between MCMC samples?

knowt flashcard image
46
New cards

What is burn-in?

The burn-in comprises the first M samples of the chain, which are discarded because it is believed that the chain has not yet converged and that these values are still dependent on the initial values.

Strictly speaking, convergence is only achieved for M = \infty

In practice, we can only detect lack of convergence

If no evidence of lack of convergence is found, we are more confident that the chain has converged

We can check this:

  • using traceplots: once convergence has been reached, samples should look like a random scatter about a stable value

  • using convergence diagnostics: if the Gelman-Rubin diagnostic \hat{R} < 1.05, this indicates practical convergence

Use these to set M, the length of the burn-in

47
New cards

What is the Gelman-Rubin statistic?

Compares within chain and between chain variation
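
One common form of the statistic (a sketch; exact conventions vary between texts) pools m chains of length n and compares the between-chain and within-chain variance estimates:

```python
import random
random.seed(5)

def gelman_rubin(chains):
    """Potential scale reduction: between- vs within-chain variance."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)      # between
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)               # within
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n   # pooled variance estimate
    return (var_hat / W) ** 0.5

# Four chains already sampling the same target: R-hat should be near 1
good = [[random.gauss(0, 1) for _ in range(2_000)] for _ in range(4)]
# One chain stuck around a different mode: R-hat well above 1.05
bad = good[:3] + [[x + 3 for x in good[3]]]
```

If the chains disagree on the mean, B inflates var_hat relative to W and \hat{R} rises above 1.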

48
New cards

How can we determine how long to run MCMC chains for?

knowt flashcard image
49
New cards

What are batch mean SEs?

Divide by Q(Q-1) because we are estimating the variance of \hat{b}, not an individual b

Can also account for autocorrelation via time series SEs
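
A sketch of the batch-means calculation, assuming Q equal batches; the chain here is i.i.d. noise purely to check the scale of the answer:

```python
import random
random.seed(6)

def batch_means_se(x, Q=20):
    """Batch-means standard error of the chain's overall mean."""
    n = len(x) // Q
    bmeans = [sum(x[q * n:(q + 1) * n]) / n for q in range(Q)]
    bbar = sum(bmeans) / Q
    # Divide by Q(Q-1): we estimate the variance of b-bar, not of a single b
    return (sum((b - bbar) ** 2 for b in bmeans) / (Q * (Q - 1))) ** 0.5

chain = [random.gauss(0, 1) for _ in range(10_000)]
se = batch_means_se(chain)   # for i.i.d. N(0,1) draws, roughly 1/sqrt(10000)
```

For an autocorrelated MCMC chain the batch means absorb the short-range dependence, so the SE comes out larger than the naive i.i.d. formula would suggest.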

50
New cards

What are Bayesian hierarchical models?

Also called random effect or multi-level models

We assume that the parameters \theta of the groups are a random sample from a common population distribution, and then estimate the (hyper-)parameters of that population distribution

We assign a (hierarchical) prior distribution to the hyper-parameters.

We have:

  • Likelihood p(y|\theta) (1st level)

  • Prior p(\theta | \phi_2) with higher-level parameter \phi_2 (2nd level)

  • Prior p(\phi_2) (3rd level)

We can add further levels, with \phi_k being the k-th level hyper-parameters. A non-informative prior is usually specified for the top-level parameters.

These models imply that \theta_i is different for every group, but similar - the \theta_i are not marginally independent, but are exchangeable

By assuming that the parameters are drawn from a common population distribution, the more extreme parameters are shrunk towards the overall mean. The posterior distribution for each \theta_i borrows strength from the likelihood contributions of all groups, via their influence on the unknown population parameters, and reflects our full uncertainty about the true values of the population parameters. These models are also useful if we are interested in the population parameters themselves.

This is better than assuming a common \theta between all groups, or than assuming the parameters \theta_i for each group are independent (we want to use information about \theta_{\backslash i} to estimate \theta_i)

Can obtain full conditional distribution for each parameter and then use Gibbs sampler

51
New cards

What is exchangeability and the general representation theorem?

A sequence of random variables \theta_1, …, \theta_n is exchangeable if, for any permutation \{ i_1, …, i_n\} of \{1, …, n\}, (\theta_{i_1}, …, \theta_{i_n}) has the same n-dimensional joint probability distribution as (\theta_1, …, \theta_n)

i.e. for all a_1, …, a_n,

p(\theta_1=a_1,\ldots,\theta_{n}=a_{n})=p(\theta_{i_1}=a_1,\ldots,\theta_{i_{n}}=a_{n})

If \theta_1, …, \theta_n are marginally independent and have the same marginal distribution (they are i.i.d.), then they are exchangeable

The general representation theorem shows that if \theta_1, …, \theta_n are exchangeable, then there exists a parametric model p(\theta | \phi) with prior p(\phi) for \phi such that \theta_i \perp\!\!\!\!\perp \theta_j | \phi and therefore \theta_i | \phi \stackrel{\text{iid}}{\sim} p(\theta | \phi)

Exchangeability therefore implies a hierarchical model: this is equivalent to \theta_1, …, \theta_n being a random sample from a model p(\theta|\phi) with prior p(\phi)

52
New cards

When is it reasonable to assume exchangeability?

The parameters of groups, \theta_1, …, \theta_n, are exchangeable if, for any permutation \{ i_1, …, i_n\} of \{1, …, n\}, (\theta_{i_1}, …, \theta_{i_n}) has the same n-dimensional joint probability distribution as (\theta_1, …, \theta_n): e.g. there is no reason to believe that \theta_i=a and \theta_j = b is more likely than \theta_i = b and \theta_j = a

However, if we know additional information about the groups, exchangeability may not be a reasonable assumption; we may have reason to expect one group’s parameter to be greater than another’s.

Whether it is reasonable to assume exchangeability depends on our degree of knowledge / ignorance.

We can address this by including covariates in our hierarchical models, e.g. a generalised linear mixed model

53
New cards

What is the Metropolis-Hastings sampler?

The Metropolis-Hastings sampler produces a chain of values \theta^{(1)}, \theta^{(2)}, … from a distribution with density g(\theta) (e.g. full conditional or posterior)

Choose \theta^{(0)} then, at the i-th iteration

  • Propose \theta' from a proposal distribution q(\theta^{(i-1)}, \theta')

  • (Via a simulated uniform random variable) Accept this proposal with probability \alpha(\theta^{(i-1)},\theta')=\min\left\lbrace 1,\frac{g(\theta')q(\theta',\theta^{(i-1)})}{g(\theta^{(i-1)})q(\theta^{(i-1)},\theta')}\right\rbrace

There are two main types of proposal distribution:

The random walk sampler, where \theta' = \theta^{(i-1)}+\epsilon_i and the distribution of \epsilon_i is zero-mean and symmetric (e.g. normal, t-distribution)

Then q(\theta^{(i-1)}, \theta') = q(\theta', \theta^{(i-1)}): always accept uphill moves, sometimes accept downhill moves

The variance of \epsilon_i is a tuning parameter. Too small a step size will lead to proposals of small deviations and a high acceptance rate, causing random walk behaviour and poor mixing (the chain explores the sample space slowly). Too large a step size will lead to proposals of large deviations and a low acceptance rate, causing a ‘sticky’ chain with high autocorrelation and poor mixing. High autocorrelation in the chain will result in a high variance of estimators.

The independence sampler, where the proposal distribution does not depend on the previous state \theta^{(i-1)}, i.e. q(\theta^{(i-1)}, \theta') = q(\theta')

This method works well when q(\theta) is a good approximation of g(\theta). It can move from one side of the distribution to the other, giving better mixing than a random walk. However, if it is a poor approximation, we will reject most proposals, leading to a sticky chain
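
A minimal random-walk Metropolis sketch: a standard normal stands in for g(\theta), and the step size 2.4 is an assumed tuning choice, not a recommendation from the cards. With a symmetric proposal the q terms cancel, so only the ratio of target densities matters:

```python
import math
import random
random.seed(7)

def log_g(theta):
    # Unnormalised log target; a standard normal stands in for the posterior
    return -0.5 * theta ** 2

theta, step = 0.0, 2.4     # step (proposal sd) is the tuning parameter
chain, accepted = [], 0
for _ in range(50_000):
    prop = theta + random.gauss(0, step)      # symmetric random-walk proposal
    # Accept with probability min(1, g(prop)/g(theta)); done on the log scale
    if math.log(random.random()) < log_g(prop) - log_g(theta):
        theta = prop
        accepted += 1
    chain.append(theta)

rate = accepted / len(chain)
mean_hat = sum(chain) / len(chain)
```

Rerunning with a much smaller or much larger step shows the two failure modes described above: near-certain acceptance with tiny moves, or a sticky chain with rare large moves.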

54
New cards

How do we tune Metropolis Hastings random walk samplers?

The random walk sampler is \theta' = \theta^{(i-1)}+\epsilon_i where the distribution of \epsilon_i is zero-mean and symmetric (e.g. normal, t-distribution)

Then q(\theta^{(i-1)}, \theta') = q(\theta', \theta^{(i-1)}): always accept uphill moves, sometimes accept downhill moves

The variance of \epsilon_i is a tuning parameter. Too small a step size will lead to proposals of small deviations and a high acceptance rate, causing random walk behaviour and poor mixing (the chain explores the sample space slowly). Too large a step size will lead to proposals of large deviations and a low acceptance rate, causing a ‘sticky’ chain with high autocorrelation and poor mixing.

High autocorrelation in the chain will result in a high variance of estimators.

It can be shown that, in general, samplers with the lowest variability have an average acceptance rate between 0.2 and 0.3

55
New cards

What problems does correlation cause in the Gibbs sampler?

The Gibbs sampler works by sampling from each full conditional distribution. This can lead to slow mixing if the posterior distribution has high correlation between some variables.

e.g. let Var(X_1)=Var(X_2)=1 and Cov(X_1, X_2) = \rho

The full conditional distribution of X_1 is N(\rho X_2, 1-\rho^2), and vice versa

If \rho is large, the conditional variance of X_1 | X_2, 1-\rho^2, is small relative to the marginal variance 1, which leads to slower convergence

A second problem is that higher correlation between variables will lead to higher dependence between samples. This autocorrelation will cause the variance of estimators to increase.

There are two main remedies:

  • Thinning - only take every kth value of the chain, leading to reduced autocorrelation between samples

  • Reparameterisation - transform from X to some variables with lower correlation

e.g. in the model y_i \sim \text{Poisson}(\mu_i)

with \mu_i = \alpha + \beta X_i

\alpha and \beta are highly correlated. If we centre X and instead specify \mu_i = \alpha + \beta (X_i - \bar{X}), then this removes the correlation between \alpha and \beta
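
The effect of centring can be sketched via the intercept-slope correlation implied by the design matrix of a linear predictor (a normal linear model stands in here; the covariate values are arbitrary):

```python
import random
random.seed(8)

def intercept_slope_corr(xs):
    # Correlation between intercept and slope implied by (X^T X)^{-1}
    # for the design [1, x]: corr = -sum(x) / sqrt(n * sum(x^2))
    n, sx = len(xs), sum(xs)
    sxx = sum(v * v for v in xs)
    return -sx / (n * sxx) ** 0.5

X = [random.gauss(10, 1) for _ in range(500)]   # covariate far from zero
xbar = sum(X) / len(X)

raw = intercept_slope_corr(X)                        # near -1: slow mixing
centred = intercept_slope_corr([v - xbar for v in X])  # essentially zero
```

With the raw covariate, raising \beta can be almost perfectly offset by lowering \alpha, which is exactly the ridge of high posterior correlation that slows the Gibbs sampler; centring removes it.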

56
New cards

For hierarchical models, what is the predictive distribution of a new value in a group, a parameter for a new group, and a new value in a new group?

knowt flashcard image
57
New cards

What is the Bayes factor?

Disadvantages: very difficult to work out Bayes factor analytically for more complex models

Lindley’s paradox: if comparing a model M1 where a parameter is set with a model M2 where a parameter is given a prior, we can choose a weight on that prior to make the Bayes factor in favour of M2 arbitrarily small. Bayes factors are sensitive to the choice of the weight on the prior: should choose a sensible prior variance of any parameters.

58
New cards

Why can’t we simply use number of parameters in Bayesian hierarchical models?

knowt flashcard image
59
New cards

What is the Deviance Information Criterion (DIC)?

knowt flashcard image
60
New cards

What is the Widely applicable information criterion (WAIC) ?

More stable than the Deviance Information Criterion

61
New cards

How can we carry out cross-validation of Bayesian models?

knowt flashcard image
62
New cards

What are the pros and cons of using Bayes factor, DIC / WAIC and cross-validation to evaluate Bayesian models?

knowt flashcard image
63
New cards

What is the label switching problem in mixture models?

knowt flashcard image