variational bayes
1. prereqs
2. main idea
- Given data \(X\) and unobserved variables \(Z\), we want to approximate the true posterior \(P(Z \mid X)\) with \(Q(Z)\), where \(Q\) comes from a simpler family of distributions and is selected to have the smallest "distance" to \(P(Z \mid X)\). Most of the time, we use the KL divergence as our distance measure
The KL divergence between \(Q\) and \(P\) is: \[ D_{KL}(Q || P) = \mathbb{E}_Q\left[ \log \frac{Q(Z)}{P(Z\mid X)}\right] \] Note that this expectation is taken over \(Q\), so if \(Q\) deviates from \(P\) in a region where \(Q\) itself is small, the deviation does not affect the expectation much. As a consequence, the optimal \(Q\) can approximate \(P\) poorly in exactly those regions where \(Q\) places little mass (see the sketch below).
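A minimal numerical sketch of this expectation-over-\(Q\) point (the Gaussian densities below are arbitrary stand-ins for \(Q(Z)\) and \(P(Z \mid X)\), not part of any particular model):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Arbitrary stand-ins: Q is a narrow Gaussian, P a wider one.
q = norm(loc=0.0, scale=1.0)
p = norm(loc=0.0, scale=3.0)

# Monte Carlo estimate of D_KL(Q || P) = E_Q[log Q(z) - log P(z)].
# The expectation is over Q, so mismatches where Q has little mass barely contribute.
z = q.rvs(size=100_000, random_state=rng)
kl_mc = np.mean(q.logpdf(z) - p.logpdf(z))

# Closed form for two Gaussians, to check the estimate.
kl_exact = np.log(p.std() / q.std()) + (q.var() + (q.mean() - p.mean()) ** 2) / (2 * p.var()) - 0.5

print(f"Monte Carlo: {kl_mc:.4f}, exact: {kl_exact:.4f}")
```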
2.1. motivation
Usually \(Q\) comes from a family of distributions for which quantities like the mode and variance are easy to compute.
2.2. evidence lower bound
- If we use the KL divergence, then we can write:
- \(\log P(X) = D_{KL}(Q || P) + \mathcal{L}(Q)\)
- where \(\mathcal{L}(Q) = \mathbb{E}_{Z\sim Q}[\log P(Z,X) - \log Q(Z)]\) is called the evidence lower bound (ELBO)
- the name comes from the fact that \(\mathcal{L}(Q)\) is a lower bound for \(\log P(X)\), since \(D_{KL} \geq 0\)
- If we consider the model that generates \(X\) to be fixed, \(P(X)\) is the marginal likelihood (the evidence), obtained by marginalizing the joint over the unobserved variables \(Z\). In this way, we can use variational bayes for model selection: we should choose the model with the highest marginal likelihood, with the ELBO serving as a tractable proxy for \(\log P(X)\).
- since \(\log P(X)\) is fixed with respect to the choice of \(Q\), minimizing \(D_{KL}(Q || P)\) is equivalent to maximizing \(\mathcal{L}(Q)\). Importantly, computing \(\mathcal{L}(Q)\) does not require us to compute \(P(X)\) or \(P(Z\mid X)\), only the joint \(P(Z, X)\) (see the sketch after this list)
- see also evidence lower bound
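A quick sanity check of this decomposition (a minimal sketch with an arbitrary toy model, not taken from any reference): with a discrete latent \(Z\) we can enumerate the joint, so \(\log P(X)\), the exact posterior, the ELBO, and the KL term are all computable, and the identity \(\log P(X) = D_{KL}(Q || P) + \mathcal{L}(Q)\) can be checked for any choice of \(Q\).

```python
import numpy as np
from scipy.stats import norm

# Toy model (arbitrary choices): Z in {0, 1, 2} with prior p(z),
# and X | Z = z  ~  Normal(mu[z], 1). Observe a single x.
prior = np.array([0.5, 0.3, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
x = 0.7

# Joint p(z, x), evidence p(x) = sum_z p(z, x), and exact posterior p(z | x).
joint = prior * norm.pdf(x, loc=mu, scale=1.0)
evidence = joint.sum()
posterior = joint / evidence

# Any distribution Q over {0, 1, 2}; here just an arbitrary one.
q = np.array([0.2, 0.5, 0.3])

# ELBO = E_Q[log p(Z, x) - log Q(Z)]  and  KL(Q || posterior).
elbo = np.sum(q * (np.log(joint) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

# The identity log p(x) = KL(Q || P) + ELBO(Q) holds for any Q;
# the ELBO is maximized (KL = 0) exactly when Q equals the posterior.
assert np.isclose(np.log(evidence), kl + elbo)
print(np.log(evidence), kl + elbo)
```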
3. notes
- as noted by jason eisner, \(Q\) only has to approximate \(P(Z \mid X = x)\) for the particular observed \(x\), not for all possible \(x\)
4. related
- Expectation Maximization – which tries to find the setting of the parameters \(\theta\) that maximizes the likelihood \(P(X; \theta)\), i.e., a point estimate. In comparison, variational bayes tries to find an analytical form for an approximation to the posterior.
- Variational inference stands in contrast to Metropolis-Hastings and other sampling methods in that we are looking for an exact analytical form of an approximation, rather than a collection of samples from the posterior
5. questions
- what is the difference between a parameter and a latent variable?
- Doesn't computing the ELBO take as much computation as computing the posterior? Because we are still doing an integral over \(Z\)?
- Why not directly try to approximate \(P(Z \mid X)\)?
- Because computing the posterior exactly requires the normalizing constant \(P(X)\), which requires marginalizing over \(Z\), and that marginalization is usually intractable (see the sketch at the end of these questions)
- Why not directly try to optimize \(Z\) for \(P(Z \mid X)\)?
- as noted in this reddit thread, this does not give a distribution over \(Z\); it gives a MAP estimate, i.e., a single value of \(Z\)
- Why is computing the evidence hard?
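On the last question, a rough sketch (the model below, \(N\) independent Bernoulli latent bits with a Gaussian likelihood, is made up purely for illustration): exact computation of the evidence sums the joint over every latent configuration, so the cost grows as \(2^N\); for continuous \(Z\), the analogous integral typically has no closed form.

```python
import itertools
import math

# Made-up model: N independent Bernoulli(0.5) latent bits z, and a Gaussian
# likelihood p(x | z) whose mean is the number of bits that are on.
def log_joint(z_bits, x=3.0):
    log_prior = len(z_bits) * math.log(0.5)
    mean = sum(z_bits)
    log_lik = -0.5 * (x - mean) ** 2 - 0.5 * math.log(2 * math.pi)
    return log_prior + log_lik

def log_evidence(n):
    # Exact marginalization: one term per latent configuration, i.e. 2**n terms.
    terms = [log_joint(z) for z in itertools.product([0, 1], repeat=n)]
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

for n in (4, 8, 12, 16):
    # The number of terms doubles with every extra latent variable.
    print(f"N={n:2d}  terms={2**n:6d}  log p(x)={log_evidence(n):.4f}")
```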