variational autoencoders
1. motivation
1.1. machine learning perspective
- We want our latent space to be continuous and complete
- continuous: nearby points in the latent space decode to similar outputs
- complete: every point in the latent space can be decoded to a reasonable output
- our loss is composed of a reconstruction term and a regularization term
- the reconstruction term encourages the output to be similar to the input
- the regularization term encourages the latent representations to be continuous and complete
- namely, the distribution of latent variables \(q_{\phi}(z\mid x)\) should be close to a standard normal – zero mean and unit variance (a minimal loss sketch follows this list)
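A minimal sketch of this two-term loss, assuming a Gaussian encoder that outputs `mu` and `logvar`, a standard-normal prior, and PyTorch (the function name and the binary cross-entropy reconstruction are illustrative choices, not prescribed above):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction term: push the decoded output back toward the input
    # (binary cross-entropy suits pixel data in [0, 1]; MSE also works)
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")

    # Regularization term: closed-form KL( N(mu, sigma^2) || N(0, I) ),
    # which pulls each latent dimension toward zero mean and unit variance
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon + kl
```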
1.2. graphical models perspective
TODO
2. loss
- Jointly optimize \(\theta\) to make the reconstruction likelihood \(p_{\theta}(x\mid z)\) high (i.e. the reconstruction loss low)…
- As well as tune \(\phi\) to make our approximation \(q_{\phi}(z\mid x)\) as close as possible to the true posterior \(p_{\theta}(z\mid x)\) – as measured by KL divergence
- see Wikipedia for the full derivation (a compressed version is sketched after this list)
- but we eventually end up with this equation:
- \(-\log p_{\theta}(x) + D_{KL}(q_{\phi}(z\mid x) || p_{\theta}(z\mid x)) = -E_{z\sim q_{\phi}(z\mid x)}(\log p_{\theta}(x \mid z)) + D_{KL}(q_{\phi}(z\mid x) || p_{\theta}(z))\)
- The LHS is the quantity we actually want to minimize, but it is intractable on its own (the true posterior \(p_{\theta}(z\mid x)\) is unknown)
- The RHS is the equivalent, tractable form we will end up computing and back-propagating as the loss
- On the LHS we see two terms
- One is the negative log likelihood of the data, \(-\log p_{\theta}(x)\) – exactly what maximum-likelihood training would minimize
- The other is the KL divergence between our approximation \(q_{\phi}(z\mid x)\) and the true posterior – as parameterized by \(\theta\) at least
- On the RHS we see two terms
- One is an expectation over \(q_{\phi}(z\mid x)\) that can be estimated by sampling at runtime – via the reparameterization trick for Gaussian latents, or the Gumbel-softmax relaxation of the Gumbel-max trick for discrete ones (see the sampling sketch after this list)
- Where does the other come from? It falls out of applying Bayes' rule to \(p_{\theta}(z\mid x)\) inside the posterior KL on the LHS (see the derivation sketch after this list)
- Question: I can see that we will eventually pull \(q_{\phi}(z\mid x)\) toward \(p_{\theta}(z\mid x)\), but what about this loss encourages \(p_{\theta}(z)\) and \(p_{\theta}(z\mid x)\) to be anything sensible? I guess I just have to trust the starting equation and believe that \(p_{\theta}(x)\) is really being optimized
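A compressed version of the derivation referenced above: expand the posterior KL and apply Bayes' rule \(p_{\theta}(z\mid x) = p_{\theta}(x\mid z)\,p_{\theta}(z)/p_{\theta}(x)\); the \(\log p_{\theta}(x)\) term does not depend on \(z\), so it comes out of the expectation:

\[
\begin{aligned}
D_{KL}(q_{\phi}(z\mid x) || p_{\theta}(z\mid x))
&= E_{z\sim q_{\phi}(z\mid x)}\big(\log q_{\phi}(z\mid x) - \log p_{\theta}(z\mid x)\big) \\
&= E_{z\sim q_{\phi}(z\mid x)}\big(\log q_{\phi}(z\mid x) - \log p_{\theta}(x\mid z) - \log p_{\theta}(z)\big) + \log p_{\theta}(x) \\
&= \log p_{\theta}(x) - E_{z\sim q_{\phi}(z\mid x)}\big(\log p_{\theta}(x\mid z)\big) + D_{KL}(q_{\phi}(z\mid x) || p_{\theta}(z))
\end{aligned}
\]

Moving \(\log p_{\theta}(x)\) to the left-hand side gives the equation above.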
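And a minimal sketch of how the expectation term is estimated in practice for Gaussian latents, assuming PyTorch (`sample_latent` is a hypothetical helper, not taken from the cited paper):

```python
import torch

def sample_latent(mu, logvar):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so the randomness sits in eps and gradients flow through mu and logvar
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```

A single sample \(z\) from this, fed through the decoder, gives a one-sample Monte Carlo estimate of \(E_{z\sim q_{\phi}(z\mid x)}(\log p_{\theta}(x\mid z))\).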
3. useful links
4. sources
- Neural Discrete Representation Learning – van den Oord et al., 2017
- Wikipedia