
Expectation Maximization

The EM algorithm is used for finding the maximum likelihood parameters \(\theta\) for a distribution \(p(x;\theta)\) where \(x\) is observed, but there are also unobserved latent variables \(z\), so that \(p(x;\theta) = \sum_z p(x,z;\theta)\).

Normally, if there were no latent variables, we could take the derivative with respect to \(\theta\), set it to 0, and solve for \(\theta\). But we can't do that here. In the case of Gaussian mixture models, the equations for \(\phi\), which selects the mixture component, and \(\mu\), which gives the mean of each component, become intertwined, so there's no way to solve them analytically.

Instead, if we knew what the latent variables \(z\) were, we could solve using our usual method very easily. So, the EM algorithm:

  1. Makes guesses for \(z\) based on \(\theta^{(t)}\)
  2. Optimizes \(l(\theta)\) to get \(\theta^{(t+1)}\) for those guesses

However, to reflect our uncertainty, we actually make soft guesses and optimize for the expected likelihood, where the expectation is over our uncertainty in guessing.

So, the real algorithm looks like:

  1. Find the distribution over the latent variables given \(\theta^{(t)}\), which is \(p(z \mid x; \theta^{(t)})\). This is our soft guess for \(z\) at \(x\).
  2. Optimize the expected likelihood, where the expectation is taken with respect to the above distribution (see the sketch below)
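A minimal sketch of this loop in Python (the `e_step` / `m_step` callables, the stopping rule, and all names here are assumptions made for illustration, not from any particular source):

```python
import numpy as np

def em(x, theta, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM loop.

    e_step(x, theta) -> (Q, ll): soft guesses Q = p(z | x; theta) and current log-likelihood
    m_step(x, Q)     -> theta maximizing the expected (w.r.t. Q) complete-data log-likelihood
    """
    prev_ll = -np.inf
    for _ in range(max_iters):
        Q, ll = e_step(x, theta)   # E step: soft guesses under the current theta
        theta = m_step(x, Q)       # M step: re-fit theta against those guesses
        if ll - prev_ll < tol:     # l(theta) never decreases, so this eventually triggers
            break
        prev_ll = ll
    return theta
```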

0.1. another way to think of the problem setup

In 6.437, the problem was presented this way: imagine you have some parameters \(\theta\) that generate data \(z\). If you had access to \(z\), you could easily find the MLE for \(\theta\). But you don't. Instead, you have \(y = f(z)\), where \(f\) is a many-to-one function. Now computing the MLE is harder, because to get the likelihood of \(y\) you need to sum over all \(z\) such that \(f(z) = y\). Without loss of generality, we can also think of this as partitioning the complete data into an observed part \(X\) and an unobserved part \(Z\), with \(f\) being the many-to-one map that discards \(Z\).
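As a toy illustration (an example made up for this note, not from 6.437): let \(z = (z_1, z_2)\) be two independent coin flips with bias \(\theta\), and let \(y = f(z) = z_1 + z_2\). Observing \(y = 1\) does not tell you which flip came up heads, so the likelihood sums over the preimage \(\{(1,0), (0,1)\}\): \(p(y=1;\theta) = \theta(1-\theta) + (1-\theta)\theta = 2\theta(1-\theta)\).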

1. Derivation

1.1. Input

Given \(m\) training examples \(\{x^{(1)},...,x^{(m)}\}\) and corresponding unobserved latent variables \(\{z^{(1)},...,z^{(m)}\}\), we want to find the setting of parameters \(\theta\) that maximizes the likelihood: \[\begin{align} l(\theta) &= \sum_{i=1}^m \log p(x^{(i)}; \theta)\\ &= \sum_{i=1}^m \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) \end{align}\]
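As a concrete instance (a 1-D Gaussian mixture with a handful of components, chosen here purely for illustration), the inner sum over \(z^{(i)}\) has only a few terms, so \(l(\theta)\) can be computed directly; a sketch:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(x, phi, mu, sigma):
    """l(theta) = sum_i log sum_z p(x_i, z; theta) for a 1-D Gaussian mixture.

    Here p(x, z; theta) = phi[z] * N(x; mu[z], sigma[z]^2), with theta = (phi, mu, sigma).
    """
    # joint[i, z] = p(x_i, z; theta)
    joint = np.stack([phi[z] * norm.pdf(x, mu[z], sigma[z]) for z in range(len(phi))], axis=1)
    return np.sum(np.log(joint.sum(axis=1)))   # marginalize over z, then sum the logs
```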

1.2. Algorithm

1.2.1. E step

For each \(i\), we select a distribution \(Q_i\) over \(z^{(i)}\). Then,

\[\begin{align} \sum_{i=1}^m \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) &= \sum_{i=1}^m \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \\ &\geq \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \end{align}\]

The last line comes from:

  • Jensen's Inequality
  • The fact that \(\log\) is concave
  • The fact that \(\sum_{z^{(i)}} Q_i(z^{(i)})[\cdots]\) is an expectation according to \(Q_i\)

The last line holds for any choice of \(Q_i\). But when does it hold with equality? If \(\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}\) were a constant (as a function of \(z^{(i)}\)), it would equal its own expectation, so the expectation of its \(\log\) would equal the \(\log\) of its expectation and Jensen's inequality would be tight.

So, the last line holds with equality when: \[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c \] for some constant \(c\) that does not depend on \(z^{(i)}\).

That is, since \(Q_i\) must normalize to 1, we should choose \(Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z}p(x^{(i)}, z; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)\).

So, for this choice of \(Q_i\), the lower bound in the last line is exactly \(l(\theta)\).
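For the toy Gaussian mixture from the Input section, this E step is just Bayes' rule applied to each example; a sketch (the variable names are this note's assumptions):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, phi, mu, sigma):
    """Q_i(z) = p(z | x_i; theta) = p(x_i, z; theta) / sum_z' p(x_i, z'; theta)."""
    # joint[i, z] = p(x_i, z; theta) = phi[z] * N(x_i; mu[z], sigma[z]^2)
    joint = np.stack([phi[z] * norm.pdf(x, mu[z], sigma[z]) for z in range(len(phi))], axis=1)
    Q = joint / joint.sum(axis=1, keepdims=True)   # normalize over z for each i
    # For this Q, the Jensen lower bound sum_i sum_z Q[i,z] * log(joint[i,z] / Q[i,z])
    # equals l(theta) = sum_i log sum_z joint[i,z], i.e. the bound is tight.
    return Q
```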

1.2.2. M step

Now, with this choice of \(Q_i\), we find new parameters \(\theta'\) that maximize the quantity: \[ \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \]

We then update the parameters: \(\theta^{(t+1)} = \theta'\).
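For the Gaussian mixture example this maximization has a closed form: since \(Q_i\) is held fixed, only the \(\log p(x^{(i)}, z^{(i)}; \theta)\) term depends on \(\theta\), and the maximizers are responsibility-weighted MLEs. A sketch, continuing the toy example above:

```python
import numpy as np

def m_step(x, Q):
    """argmax_theta sum_i sum_z Q_i(z) log p(x_i, z; theta) for a 1-D Gaussian mixture."""
    Nk = Q.sum(axis=0)                          # effective number of points per component
    phi = Nk / len(x)                           # mixing weights
    mu = (Q * x[:, None]).sum(axis=0) / Nk      # responsibility-weighted means
    sigma = np.sqrt((Q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)  # weighted std devs
    return phi, mu, sigma
```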

1.2.3. Convergence

We see that \(l(\theta^{(t+1)}) \geq l(\theta^{(t)})\) because,

\[\begin{align} l(\theta^{(t+1)}) &\geq \sum_{i=1}^{m} \sum_{z^{(i)}} Q^t_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q^t_i(z^{(i)})}\\ &\geq \sum_{i=1}^{m} \sum_{z^{(i)}} Q^t_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q^t_i(z^{(i)})}\\ &= l(\theta^{(t)}) \end{align}\]

The last equality comes from our choice of \(Q^t_i\), which made Jensen's inequality hold with equality. The second line comes from the fact that in the M step, we specifically chose \(\theta^{(t+1)}\) to maximize the summation given \(Q^{t}_i\). The first line holds for any \(Q_i\) (see the work in the E step section above), and in particular for \(Q_i^t\).
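Putting the pieces together for the toy mixture and checking numerically that \(l(\theta^{(t)})\) never decreases (the synthetic data and initial values below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def joint(x, phi, mu, sigma):
    # joint[i, z] = p(x_i, z; theta) for a 1-D Gaussian mixture
    return np.stack([phi[z] * norm.pdf(x, mu[z], sigma[z]) for z in range(len(phi))], axis=1)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])   # synthetic data
phi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

prev_ll = -np.inf
for t in range(50):
    J = joint(x, phi, mu, sigma)
    ll = np.sum(np.log(J.sum(axis=1)))        # l(theta^(t))
    assert ll >= prev_ll - 1e-9               # monotonically non-decreasing
    Q = J / J.sum(axis=1, keepdims=True)      # E step: responsibilities
    Nk = Q.sum(axis=0)                        # M step: weighted MLEs
    phi, mu = Nk / len(x), (Q * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((Q * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    prev_ll = ll
```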

2. see also

  • note that something conceptually similar happens in the k-means algorithm, where we iterate between assigning points to the nearest cluster center (a hard version of the E step's soft guesses) and moving each center to the mean of its assigned points (the analogue of the M step).

3. Useful Links

Created: 2024-07-15 Mon 01:28