variational bayes
1. prereqs
2. main idea
- Given data \(X\) and unobserved variables \(Z\), we want to approximate the true posterior \(P(Z \mid X)\) with \(Q(Z)\), where \(Q\) comes from a simpler family of distributions and is selected to have the smallest "distance" to \(P(Z \mid X)\). Most of the time, we use the KL divergence as our distance measure
The KL divergence between \(Q\) and \(P\) is: \[ D_{KL}(Q || P) = \mathbb{E}_Q\left[ \log \frac{Q(Z)}{P(Z\mid X)}\right] \] Note that this expectation is taken over \(Q\), so if \(Q\) deviates from \(P\) in a region where \(Q\) itself is small, the deviation does not affect the expectation much. As a consequence, the optimal \(Q\) can approximate \(P\) poorly in exactly those regions where \(Q\) places little mass (see the sketch below).
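A minimal numerical sketch of this expectation-over-\(Q\) point (the Gaussian densities below are arbitrary stand-ins for \(Q(Z)\) and \(P(Z \mid X)\), not part of any particular model):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Arbitrary stand-ins: Q is a narrow Gaussian, P a wider one.
q = norm(loc=0.0, scale=1.0)
p = norm(loc=0.0, scale=3.0)

# Monte Carlo estimate of D_KL(Q || P) = E_Q[log Q(z) - log P(z)].
# The expectation is over Q, so mismatches where Q has little mass barely contribute.
z = q.rvs(size=100_000, random_state=rng)
kl_mc = np.mean(q.logpdf(z) - p.logpdf(z))

# Closed form for two Gaussians, to check the estimate.
kl_exact = np.log(p.std() / q.std()) + (q.var() + (q.mean() - p.mean()) ** 2) / (2 * p.var()) - 0.5

print(f"Monte Carlo: {kl_mc:.4f}, exact: {kl_exact:.4f}")
```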
2.1. motivation
Usually \(Q\) comes from a family of distributions for which quantities like the mode and variance are easy to compute.
2.2. evidence lower bound
- If we use the KL divergence, then we can write:
- \(\log P(X) = D_{KL}(Q || P) + \mathcal{L}(Q)\)
- where \(\mathcal{L}(Q) = \mathbb{E}_{Z\sim Q}[\log P(Z,X) - \log Q(Z)]\) is called the evidence lower bound (ELBO)
- the name comes from the fact that \(\mathcal{L}(Q)\) is a lower bound for \(\log P(X)\), since \(D_{KL} \geq 0\)
- If we consider the model that generates \(X\) to be fixed, \(P(X)\) is the marginal likelihood (the evidence), obtained by marginalizing the joint over the unobserved variables \(Z\). In this way, we can use variational bayes for model selection: we should choose the model with the highest marginal likelihood, with the ELBO serving as a tractable proxy for \(\log P(X)\).
- since \(\log P(X)\) is fixed with respect to the choice of \(Q\), minimizing \(D_{KL}(Q || P)\) is equivalent to maximizing \(\mathcal{L}(Q)\). Importantly, computing \(\mathcal{L}(Q)\) does not require us to compute \(P(X)\) or \(P(Z\mid X)\), only the joint \(P(Z, X)\) (see the sketch after this list)
- see also evidence lower bound
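A quick sanity check of this decomposition (a minimal sketch with an arbitrary toy model, not taken from any reference): with a discrete latent \(Z\) we can enumerate the joint, so \(\log P(X)\), the exact posterior, the ELBO, and the KL term are all computable, and the identity \(\log P(X) = D_{KL}(Q || P) + \mathcal{L}(Q)\) can be checked for any choice of \(Q\).

```python
import numpy as np
from scipy.stats import norm

# Toy model (arbitrary choices): Z in {0, 1, 2} with prior p(z),
# and X | Z = z  ~  Normal(mu[z], 1). Observe a single x.
prior = np.array([0.5, 0.3, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
x = 0.7

# Joint p(z, x), evidence p(x) = sum_z p(z, x), and exact posterior p(z | x).
joint = prior * norm.pdf(x, loc=mu, scale=1.0)
evidence = joint.sum()
posterior = joint / evidence

# Any distribution Q over {0, 1, 2}; here just an arbitrary one.
q = np.array([0.2, 0.5, 0.3])

# ELBO = E_Q[log p(Z, x) - log Q(Z)]  and  KL(Q || posterior).
elbo = np.sum(q * (np.log(joint) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

# The identity log p(x) = KL(Q || P) + ELBO(Q) holds for any Q;
# the ELBO is maximized (KL = 0) exactly when Q equals the posterior.
assert np.isclose(np.log(evidence), kl + elbo)
print(np.log(evidence), kl + elbo)
```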
3. notes
- as noted by jason eisner, \(Q\) only has to approximate \(P(Z \mid X = x)\) for the particular observed \(x\), not for all possible \(x\)
4. related
- Expectation Maximization – which tries to find the setting of the parameters \(\theta\) that maximizes the likelihood \(P(X; \theta)\), i.e., a point estimate. In comparison, variational bayes tries to find an analytical form for an approximation to the posterior.
- Variational inference stands in contrast to Metropolis-Hastings and other sampling methods in that we are looking for an exact analytical form of an approximation, rather than a collection of samples from the posterior
5. questions
- what is the difference between a parameter and a latent variable?
- Doesn't computing the ELBO take as much computation as computing the posterior? Because we are still doing an integral over \(Z\)?
- Why not directly try to approximate \(P(Z \mid X)\)?
- Because computing the posterior exactly requires the normalizing constant \(P(X)\), which requires marginalizing over \(Z\), and that marginalization is usually intractable (see the sketch at the end of these questions)
- Why not directly try to optimize \(Z\) for \(P(Z \mid X)\)?
- as noted in this reddit thread, this does not give a distribution over \(Z\); it gives a MAP estimate, i.e., a single value of \(Z\)
- Why is computing the evidence hard?
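On the last question, a rough sketch (the model below, \(N\) independent Bernoulli latent bits with a Gaussian likelihood, is made up purely for illustration): exact computation of the evidence sums the joint over every latent configuration, so the cost grows as \(2^N\); for continuous \(Z\), the analogous integral typically has no closed form.

```python
import itertools
import math

# Made-up model: N independent Bernoulli(0.5) latent bits z, and a Gaussian
# likelihood p(x | z) whose mean is the number of bits that are on.
def log_joint(z_bits, x=3.0):
    log_prior = len(z_bits) * math.log(0.5)
    mean = sum(z_bits)
    log_lik = -0.5 * (x - mean) ** 2 - 0.5 * math.log(2 * math.pi)
    return log_prior + log_lik

def log_evidence(n):
    # Exact marginalization: one term per latent configuration, i.e. 2**n terms.
    terms = [log_joint(z) for z in itertools.product([0, 1], repeat=n)]
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

for n in (4, 8, 12, 16):
    # The number of terms doubles with every extra latent variable.
    print(f"N={n:2d}  terms={2**n:6d}  log p(x)={log_evidence(n):.4f}")
```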