Peters et al. 2018 - Deep contextualized word representations
Notes for peters18_deep_contex_word_repres.
1. ELMo
Recall that for a stacked bi-LSTM, each layer produces a sequence of forward and backward hidden states. So, for token \(t_k\), a model with \(L\) layers produces \(2L + 1\) embeddings: \[R_k = \left\{\mathbf{x}_{k}, \overset{\leftarrow}{\mathbf{h}}_{j,k}, \overset{\rightarrow}{\mathbf{h}}_{j,k} \mid j=1,\ldots, L \right \}\] where \(\mathbf{x}_k\) is the context-independent (initial) token embedding.
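A minimal PyTorch sketch (not the paper's implementation) of collecting these \(2L + 1\) per-token representations; the vocabulary size, layer sizes, and the choice to make the token embedding match the bi-LSTM output width are illustrative assumptions.

```python
import torch
import torch.nn as nn

L, d_hid = 2, 128                       # assumed number of layers and hidden size
d_emb = 2 * d_hid                       # token embedding sized to match bi-LSTM output

embed = nn.Embedding(num_embeddings=1000, embedding_dim=d_emb)  # toy vocabulary
# One bidirectional LSTM per layer so every layer's hidden states are exposed
# (a single nn.LSTM with num_layers > 1 only returns the top layer's outputs).
layers = nn.ModuleList([
    nn.LSTM(d_emb if j == 0 else 2 * d_hid, d_hid,
            batch_first=True, bidirectional=True)
    for j in range(L)
])

tokens = torch.randint(0, 1000, (1, 7))    # (batch=1, seq_len=7) dummy token ids
x = embed(tokens)                          # x_k: context-independent embeddings

reps = [x]                                 # layer j = 0 of R_k
h = x
for lstm in layers:
    h, _ = lstm(h)                         # (1, 7, 2*d_hid): forward & backward concatenated
    reps.append(h)                         # h_{j,k} for layer j

# reps holds L + 1 tensors; splitting each bidirectional tensor into its two
# halves recovers the 2L + 1 embeddings per token described above.
```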
Then for a given language task and token \(t_k\), ELMo receives \(R_k\) as input and learns to compute a weighted sum over the embedding layers: \[ ELMo^{task}_{k} = E(R_k, \Theta^{task}) = \gamma^{task}\sum_{j=0}^{L} s^{task}_j \mathbf{h}_{j,k} \] where the \(s^{task}_j\) are softmax-normalized weights, \(\Theta^{task}\) comprises these weights together with the scalar \(\gamma^{task}\), \(\mathbf{h}_{j,k}\) is the concatenation of the forward and backward hidden states at layer \(j\), and \(\mathbf{h}_{0,k} = \mathbf{x}_k\).
This weighted sum can serve as a token embedding for downstream tasks.
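Continuing the sketch above, the weighted sum can be written as a small scalar-mix module with learned weights \(s^{task}_j\) and scale \(\gamma^{task}\); the module name and initialization are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted sum over the L + 1 layer representations."""
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # softmax-ed into s_j^task
        self.gamma = nn.Parameter(torch.ones(1))        # gamma^task

    def forward(self, reps):
        # reps: list of L + 1 tensors, each (batch, seq_len, dim)
        weights = torch.softmax(self.s, dim=0)
        mixed = sum(w * h for w, h in zip(weights, reps))
        return self.gamma * mixed                       # ELMo_k^task for every token

mix = ScalarMix(num_layers=len(reps))       # reps from the previous sketch
elmo_embeddings = mix(reps)                 # (batch, seq_len, 2*d_hid)
```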
2. Training using the embedding
When learning under supervision, the weights of the pretrained stacked bi-LSTM are frozen; only the task-specific weights are learned, i.e. the classifier together with the ELMo mixing parameters \(s^{task}_j\) and \(\gamma^{task}\). In effect, the hidden representations of the bi-LSTM serve as context-aware embeddings for the sequence.
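A sketch of this supervised setup, reusing the toy modules from the earlier snippets: the bi-LSTM parameters are frozen, and only a hypothetical per-token classifier head plus the scalar-mix parameters are optimized. The head, label count, and optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Freeze the pretrained bi-LSTM (and its token embeddings).
for p in list(embed.parameters()) + list(layers.parameters()):
    p.requires_grad = False

classifier = nn.Linear(2 * d_hid, 5)        # hypothetical 5-way per-token classifier
optimizer = torch.optim.Adam(
    list(classifier.parameters()) + list(mix.parameters()), lr=1e-3)

with torch.no_grad():                       # frozen biLM forward pass
    reps = [embed(tokens)]
    h = reps[0]
    for lstm in layers:
        h, _ = lstm(h)
        reps.append(h)

logits = classifier(mix(reps))              # gradients reach only mix and classifier
targets = torch.randint(0, 5, (1, 7))       # dummy labels for illustration
loss = nn.functional.cross_entropy(logits.view(-1, 5), targets.view(-1))
loss.backward()
optimizer.step()
```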