variational autoencoders
1. motivation
1.1. machine learning perspective
- We want our latent space to be continuous and complete
- continuous: nearby points in the latent space decode to similar outputs
- complete: every point in the latent space can be decoded to a reasonable output
- our loss is composed of a reconstruction term and a regularization term
- the reconstruction term encourages the output to be similar to the input
- the regularization term encourages the latent representations to be continuous and complete
- namely, the distribution of latent variables \(q_{\phi}(z\mid x)\) should be close to a standard normal – zero mean and unit variance (a minimal loss sketch follows this list)
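A minimal sketch of this two-term loss, assuming a Gaussian encoder that outputs `mu` and `logvar`, a standard-normal prior, and PyTorch (the function name and the binary cross-entropy reconstruction are illustrative choices, not prescribed above):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction term: push the decoded output back toward the input
    # (binary cross-entropy suits pixel data in [0, 1]; MSE also works)
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")

    # Regularization term: closed-form KL( N(mu, sigma^2) || N(0, I) ),
    # which pulls each latent dimension toward zero mean and unit variance
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon + kl
```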
1.2. graphical models perspective
TODO
2. loss
- Jointly optimize \(\theta\) to make the reconstruction likelihood \(p_{\theta}(x\mid z)\) high (i.e. the reconstruction loss low)…
- As well as tune \(\phi\) to make our approximation \(q_{\phi}(z\mid x)\) as close as possible to the true posterior \(p_{\theta}(z\mid x)\) – as measured by KL divergence
- see Wikipedia for the full derivation (a compressed version is sketched after this list)
- but we eventually end up with this equation:
- \(-\log p_{\theta}(x) + D_{KL}(q_{\phi}(z\mid x) || p_{\theta}(z\mid x)) = -E_{z\sim q_{\phi}(z\mid x)}(\log p_{\theta}(x \mid z)) + D_{KL}(q_{\phi}(z\mid x) || p_{\theta}(z))\)
- The LHS is the quantity we actually want to minimize, but it is intractable on its own (the true posterior \(p_{\theta}(z\mid x)\) is unknown)
- The RHS is the equivalent, tractable form we will end up computing and back-propagating as the loss
- On the LHS we see two terms
- One is the negative log likelihood of the data, \(-\log p_{\theta}(x)\) – exactly what maximum-likelihood training would minimize
- The other is the KL divergence between our approximation \(q_{\phi}(z\mid x)\) and the true posterior – as parameterized by \(\theta\) at least
- On the RHS we see two terms
- One is an expectation over \(q_{\phi}(z\mid x)\) that can be estimated by sampling at runtime – via the reparameterization trick for Gaussian latents, or the Gumbel-softmax relaxation of the Gumbel-max trick for discrete ones (see the sampling sketch after this list)
- Where does the other come from? It falls out of applying Bayes' rule to \(p_{\theta}(z\mid x)\) inside the posterior KL on the LHS (see the derivation sketch after this list)
- Question: I can see that we will eventually pull \(q_{\phi}(z\mid x)\) toward \(p_{\theta}(z\mid x)\), but what about this loss encourages \(p_{\theta}(z)\) and \(p_{\theta}(z\mid x)\) to be anything sensible? I guess I just have to trust the starting equation and believe that \(p_{\theta}(x)\) is really being optimized
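A compressed version of the derivation referenced above: expand the posterior KL and apply Bayes' rule \(p_{\theta}(z\mid x) = p_{\theta}(x\mid z)\,p_{\theta}(z)/p_{\theta}(x)\); the \(\log p_{\theta}(x)\) term does not depend on \(z\), so it comes out of the expectation:

\[
\begin{aligned}
D_{KL}(q_{\phi}(z\mid x) || p_{\theta}(z\mid x))
&= E_{z\sim q_{\phi}(z\mid x)}\big(\log q_{\phi}(z\mid x) - \log p_{\theta}(z\mid x)\big) \\
&= E_{z\sim q_{\phi}(z\mid x)}\big(\log q_{\phi}(z\mid x) - \log p_{\theta}(x\mid z) - \log p_{\theta}(z)\big) + \log p_{\theta}(x) \\
&= \log p_{\theta}(x) - E_{z\sim q_{\phi}(z\mid x)}\big(\log p_{\theta}(x\mid z)\big) + D_{KL}(q_{\phi}(z\mid x) || p_{\theta}(z))
\end{aligned}
\]

Moving \(\log p_{\theta}(x)\) to the left-hand side gives the equation above.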
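And a minimal sketch of how the expectation term is estimated in practice for Gaussian latents, assuming PyTorch (`sample_latent` is a hypothetical helper, not taken from the cited paper):

```python
import torch

def sample_latent(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so the randomness sits in eps and gradients flow through mu and logvar
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```

A single sample \(z\) from this, fed through the decoder, gives a one-sample Monte Carlo estimate of \(E_{z\sim q_{\phi}(z\mid x)}(\log p_{\theta}(x\mid z))\).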
3. useful links
4. sources
- Neural Discrete Representation Learning – van den Oord et al., 2017
- Wikipedia