van den Oord et al., 2019 – Representation Learning with Contrastive Predictive Coding
1. contrastive predictive coding
The goal is to learn a scoring function \(f_k(x_{t+k}, c_{t}) \propto \frac{p(x_{t+k} \mid c_{t})}{p(x_{t+k})}\), i.e., one proportional to the density ratio between the conditional and the marginal over the future.
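A minimal sketch of how such a ratio can be modelled, assuming the log-bilinear form \(f_k(x_{t+k}, c_t) = \exp(z_{t+k}^{\top} W_k c_t)\) from the paper and its InfoNCE training objective: each context in a batch is paired with its true future, and the other futures in the batch serve as negatives. The shapes and random data here are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce_loss(z_future, c_context, W):
    """InfoNCE loss for one prediction step k.

    scores[j, i] = z_j^T W c_i models log f_k(x_j, c_i) up to a
    constant; for context i, the matching future sits at j = i and
    the other rows of the batch act as negative samples.
    """
    scores = z_future @ W @ c_context.T            # (N, N) score matrix
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    log_softmax = scores - np.log(np.exp(scores).sum(axis=0, keepdims=True))
    # Negative log-probability of picking the true future for each context.
    return -np.mean(np.diag(log_softmax))

N, dz, dc = 8, 4, 4
z = rng.normal(size=(N, dz))   # stand-ins for encoded futures z_{t+k}
c = rng.normal(size=(N, dc))   # stand-ins for contexts c_t
W = rng.normal(size=(dz, dc))  # the step-k bilinear map W_k
loss = info_nce_loss(z, c, W)
```

With untrained random inputs the loss sits near \(\log N\) (chance level); minimizing it pushes the scores toward the density ratio above.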
2. my summary on first pass
For the original context \(c'_t\) and the original future \(x_{t+k}\), we want to find representations \(c_t\) and \(z_{t+k}\) such that the mutual information between \(z_{t+k}\) and \(c_t\) matches the mutual information between \(x_{t+k}\) and \(c'_t\).
3. my summary on second pass
We show that we can find \(z_{t+k}\) and \(c_{t}\) from which the density ratio \(\frac{p(x_{t+k} \mid c_{t})}{p(x_{t+k})}\) can be recovered. Note two things. First, at best we have shown that the embeddings are useful for recovering information about the mutual information between the random variables \(c'_t\) and \(x_{t+k}\), not that the embeddings themselves stand in the same mutual-information relationship. Second, even granting that, we have only shown that the embeddings recover information about the MI relationship between the learned \(c_t\) and the original \(x_{t+k}\). So there seems to be an a priori assumption that the MI relationship between the learned \(c_t\) and the original \(x_{t+k}\) will be close to the MI relationship between the original signals.
4. my summary on third pass
A lot of the time, people train their generative models to match \(p(x \mid c)\); however, if \(x\) is high-dimensional, like an image, the model has to do a lot of work learning what makes a good-quality image. This means learning a lot about low-level features.
Instead, what if we picked a target that only required the model to know something about the relationship between \(x\) and \(c\)? Say the model needs to learn a "compatibility score" that quantifies how well \(x\) fits with the context. For example, suppose the context is a picture of a child throwing \(x\). Then we would assign a high compatibility score to anything that looks light enough to be throwable. Note that this is a very different task from requiring the model to learn how to generate an image of a throwable object.
If this "compatibility score" is accurate, it is actually enough to recover the mutual information between the \(X\) and \(C\) random variables. For \(C=c\) where \(c\) has a lot of influence over \(x\), e.g., \(c\) shows a child throwing something, the compatibility score will peak where \(X=x\) is "throwable". On the other hand, when \(C=c\) is not informative, e.g., it's an image of a man looking at \(x\), then the compatibility score will be a plateau over \(X\). If the compatibility score is the density ratio \(\frac{p(x \mid c)}{p(x)}\), then the expected value of its log under the joint distribution is exactly the mutual information.
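As a sanity check (my own toy example, not from the paper): take a small discrete joint over \((x, c)\), form the density ratio \(\frac{p(x \mid c)}{p(x)} = \frac{p(x,c)}{p(x)\,p(c)}\) as the "compatibility score", and confirm that its expected log under the joint equals the mutual information computed from entropies.

```python
import numpy as np

# Hypothetical 2x2 joint distribution over (x, c); rows index x, columns c.
p_xc = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xc.sum(axis=1)  # marginal p(x)
p_c = p_xc.sum(axis=0)  # marginal p(c)

# "Compatibility score" = density ratio p(x|c)/p(x) = p(x,c)/(p(x) p(c)).
ratio = p_xc / np.outer(p_x, p_c)

# Expected log-ratio under the joint ...
mi_from_ratio = np.sum(p_xc * np.log(ratio))

# ... agrees with the entropy form I(X;C) = H(X) + H(C) - H(X,C).
H = lambda p: -np.sum(p * np.log(p))
mi_entropy = H(p_x) + H(p_c) - H(p_xc)
```

Here the off-diagonal mass is small, so \(X\) and \(C\) are correlated and both computations give the same strictly positive value (in nats).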
Note that the product of our training will be the representations the model learns, not any prediction the model makes. These representations will be useful for judging "compatibility" as detailed above, so we hope they will also be useful for informing us of relationships in the input.