# variational autoencoders

## 1. motivation

### 1.1. machine learning perspective

- We want our latent space to be continuous and complete
- continuous: similar representations correspond to similar entities
- complete: every point in the latent space can be decoded to a reasonable output

- our loss is composed of a reconstruction term and a regularization term
- the reconstruction term encourages the output to be similar to the input
- the regularization term encourages the latent representations to be continuous and complete
- namely, the distribution of latent variables should be close to a standard normal (zero mean, unit variance)
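The two loss terms above can be sketched concretely. A minimal numpy sketch, assuming a diagonal Gaussian encoder output (`mu`, `log_var`), a standard normal prior, and a squared-error reconstruction term (the function names are illustrative, not from any library):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal
    # Gaussian, summed over latent dimensions:
    #   0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def vae_loss(x, x_hat, mu, log_var):
    # Reconstruction term: encourages the output to be similar to the input
    # (squared error corresponds to a fixed-variance Gaussian decoder).
    recon = np.sum((x - x_hat) ** 2)
    # Regularization term: pulls the latent distribution toward N(0, I),
    # which is what keeps the latent space continuous and complete.
    return recon + kl_to_standard_normal(mu, log_var)
```

Note the KL term is exactly zero when the encoder already outputs a standard normal (`mu = 0`, `log_var = 0`), and grows as the latents drift away from it.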

### 1.2. graphical models perspective

TODO

## 2. Loss

- Jointly optimize \(\theta\) to make the reconstruction likelihood \(p_{\theta}(x\mid z)\) high…
- As well as tune \(\phi\) to make our approximation \(q_{\phi}(z\mid x)\) as close as possible to the posterior \(p_{\theta}(z\mid x)\) – as measured by KL divergence
- see wikipedia for the full derivation

- But we eventually end up with this equation:
- \(-\log p_{\theta}(x) + D_{KL}(q_{\phi}(z\mid x) \,\|\, p_{\theta}(z\mid x)) = -\mathbb{E}_{z\sim q_{\phi}(z\mid x)}[\log p_{\theta}(x \mid z)] + D_{KL}(q_{\phi}(z\mid x) \,\|\, p_{\theta}(z))\)
- The LHS is the quantity that we will use as our loss
- The RHS is the way we will end up computing/back-propagating this loss
- On the LHS we see two terms
- One is the negative log likelihood of the data under the model, \(-\log p_{\theta}(x)\) – the quantity we ultimately want to minimize
- The other is the distance between our approximation to the posterior and the true posterior – as parameterized by \(\theta\) at least

- On the RHS we see two terms
- One is an expectation that can be estimated by sampling at runtime (via the reparameterization trick; the Gumbel-max/softmax trick plays the same role for discrete latents)
- Where does the other come from?
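On the first RHS term: the expectation is estimated with one (or a few) reparameterized samples, so the randomness sits in a noise variable rather than in the encoder outputs. A minimal numpy sketch (the encoder and decoder are stand-ins for real networks, which this snippet does not include):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps, with eps ~ N(0, I).
    # Gradients can flow through mu and log_var, while the
    # sampling noise is isolated in eps.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

With `log_var` very negative (tiny variance) the sample collapses onto `mu`, which is a quick sanity check that the formula is wired up correctly.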

- Question: I can see that we will eventually pull \(q\) under \(p(z\mid x)\), but what about this loss is encouraging \(p_{\theta}(z)\) and \(p_{\theta}(z\mid x)\) to be anything sensible? I guess I just have to really trust the starting equation, and believe that \(p_{\theta}(x)\) is being optimized
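One partial answer to the question above: because the KL term on the LHS is nonnegative, the RHS upper-bounds the negative log likelihood,

\[-\log p_{\theta}(x) \le -\mathbb{E}_{z\sim q_{\phi}(z\mid x)}[\log p_{\theta}(x \mid z)] + D_{KL}(q_{\phi}(z\mid x) \,\|\, p_{\theta}(z))\]

so minimizing the RHS simultaneously pushes up \(p_{\theta}(x)\) and shrinks the gap \(D_{KL}(q_{\phi}(z\mid x) \,\|\, p_{\theta}(z\mid x))\); this is the evidence lower bound (ELBO) viewpoint.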

## 3. useful links

## 4. sources

- Neural Discrete Representation Learning – van den Oord et al., 2017
- Wikipedia