Attention
1. Attention
Bahdanau attention, described in bahdanau14_neural_machin_trans_by_joint, works as follows:
For a source of length \(M\), the encoder produces outputs \(\mathbf{h}_1, \dots, \mathbf{h}_M\). Now, for the \(i\)-th output of the decoder, we would like to produce a context vector \(\mathbf{c}_i\) that summarizes the content of the input in a way that allows the decoder to attend to the most relevant pieces of the input.
Then, for each \(\mathbf{h}_j\) we compute a coefficient \(\alpha_{ij}\) that encodes how much the \(i\)-th output should attend to the \(j\)-th word representation.
At time \(i\), the state of the decoder is \(\mathbf{s}_i\). For every \(j\), \(\mathbf{h}_j\) and \(\mathbf{s}_i\) are concatenated and fed through a multi-layer perceptron with a tanh activation to produce a single score \(e_{ij}\). For each \(i\), a softmax over \(j\) normalizes the \(e_{ij}\); these normalized values are the \(\alpha_{ij}\).
Then, the context vector is \(\mathbf{c}_i = \sum_{j=1}^{M} \alpha_{ij} \mathbf{h}_j\).
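As a minimal NumPy sketch of one decoder step: the weight names and shapes (\(W_a\), \(U_a\), \(v_a\), \(d_{att}\)) are chosen here for illustration, and the score is written in the additive form \(v_a^\top \tanh(W_a \mathbf{s}_i + U_a \mathbf{h}_j)\), which is the same as one tanh hidden layer applied to the concatenation \([\mathbf{s}_i; \mathbf{h}_j]\) followed by a linear readout to a scalar.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def bahdanau_context(s_i, H, W_a, U_a, v_a):
    """One decoder step of additive (Bahdanau) attention.

    s_i : (d_dec,)        decoder state at time i
    H   : (M, d_enc)      encoder outputs h_1..h_M
    W_a : (d_att, d_dec)  projects the decoder state      (illustrative names/shapes)
    U_a : (d_att, d_enc)  projects the encoder outputs
    v_a : (d_att,)        maps the tanh features to a scalar score e_ij
    """
    # e_ij = v_a . tanh(W_a s_i + U_a h_j), computed for every j at once
    e = np.tanh(W_a @ s_i + H @ U_a.T) @ v_a     # (M,)
    alpha = softmax(e)                           # attention weights over the source, sum to 1
    c_i = alpha @ H                              # context vector c_i = sum_j alpha_ij h_j, shape (d_enc,)
    return c_i, alpha
```

Called with matrices of the shapes above, this returns a \(d_{enc}\)-dimensional context vector together with the \(M\) attention weights, which sum to one.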
2. Matrices
All these operations can be done with matrices:
- \(Q\), the query matrix, is \(n \times d\), where \(n\) is the length of the target sequence and \(d\) is the embedding dimension
- \(K\), the key matrix, is \(m \times d\), where \(m\) is the length of the input sequence
- \(V\), the value matrix, is also \(m \times d\)
Then, the attention weights are:
\begin{equation} \alpha = \text{softmax}\left( \frac{QK^T}{\sqrt{d}} \right) \end{equation}
Then, the context is given by
\begin{equation} \text{Attention}(Q,K,V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d}} \right)V \end{equation}
We can check that the dimensions make sense. \(QK^T\) is an \(n \times m\) matrix, with \((QK^T)_{ij} \propto e_{ij}\). And \(\text{softmax}\left( \frac{QK^T}{\sqrt{d}} \right)V\) is an \(n \times d\) matrix whose \(i\)-th row is the context vector for the \(i\)-th token of the target sequence.
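The same check can be run numerically. Below is a minimal NumPy sketch of the formula above (single head, no masking; the function name is mine):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d), K: (m, d), V: (m, d)  ->  (n, d) context matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, m); row i holds the scores e_ij for query i
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # row-wise softmax: each row sums to 1
    return alpha @ V                              # row i is the context vector for target token i

# Dimension check from the text
n, m, d = 3, 5, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(m, d)), rng.normal(size=(m, d))
assert (Q @ K.T).shape == (n, m)
assert scaled_dot_product_attention(Q, K, V).shape == (n, d)
```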
Helpful links: