Attention
1. Attention
Bahdanau attention, described in bahdanau14_neural_machin_trans_by_joint, works as follows:
For a source of length \(M\), the encoder produces outputs \(\mathbf{h}_1, \dots, \mathbf{h}_M\). Now, for the \(i\)-th output of the decoder, we would like to produce a context vector \(\mathbf{c}_i\) that summarizes the content of the input in a way that allows the decoder to attend to the most relevant pieces of the input.
Then, for each \(\mathbf{h}_j\) we compute a coefficient \(\alpha_{ij}\) that encodes how much the \(i\)-th output should attend to the \(j\)-th word representation.
At time \(i\), the state of the decoder is \(\mathbf{s}_i\). For every \(j\), \(\mathbf{h}_j\) and \(\mathbf{s}_i\) are concatenated and fed through a multi-layer perceptron with a tanh activation to produce a single score \(e_{ij}\). For each \(i\), a softmax over \(j\) normalizes the \(e_{ij}\); these normalized values are the \(\alpha_{ij}\).
Then, the context vector is \(\mathbf{c}_i = \sum_{j=1}^{M} \alpha_{ij} \mathbf{h}_j\).
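As a minimal NumPy sketch of one decoder step: the weight names and shapes (\(W_a\), \(U_a\), \(v_a\), \(d_{att}\)) are chosen here for illustration, and the score is written in the additive form \(v_a^\top \tanh(W_a \mathbf{s}_i + U_a \mathbf{h}_j)\), which is the same as one tanh hidden layer applied to the concatenation \([\mathbf{s}_i; \mathbf{h}_j]\) followed by a linear readout to a scalar.

```python
import numpy as np

def softmax(x):
    x = x - x.max()                  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def bahdanau_context(s_i, H, W_a, U_a, v_a):
    """One decoder step of additive (Bahdanau) attention.

    s_i : (d_dec,)        decoder state at time i
    H   : (M, d_enc)      encoder outputs h_1..h_M
    W_a : (d_att, d_dec)  projects the decoder state      (illustrative names/shapes)
    U_a : (d_att, d_enc)  projects the encoder outputs
    v_a : (d_att,)        maps the tanh features to a scalar score e_ij
    """
    # e_ij = v_a . tanh(W_a s_i + U_a h_j), computed for every j at once
    e = np.tanh(W_a @ s_i + H @ U_a.T) @ v_a     # (M,)
    alpha = softmax(e)                           # attention weights over the source, sum to 1
    c_i = alpha @ H                              # context vector c_i = sum_j alpha_ij h_j, shape (d_enc,)
    return c_i, alpha
```

Called with matrices of the shapes above, this returns a \(d_{enc}\)-dimensional context vector together with the \(M\) attention weights, which sum to one.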
2. Matrices
All these operations can be done with matrices:
- \(Q\), the query matrix, is \(n \times d\), where \(n\) is the length of the target sequence and \(d\) is the embedding dimension
- \(K\), the key matrix, is \(m \times d\), where \(m\) is the length of the input sequence
- \(V\), the value matrix, is also \(m \times d\)
Then, the attention weights are:
\begin{equation} \alpha = \text{softmax}\left( \frac{QK^T}{\sqrt{d}} \right) \end{equation}
Then, the context is given by
\begin{equation} \text{Attention}(Q,K,V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d}} \right)V \end{equation}
We can check that the dimensions make sense. \(QK^T\) is an \(n \times m\) matrix, with \((QK^T)_{ij} \propto e_{ij}\). And \(\text{softmax}\left( \frac{QK^T}{\sqrt{d}} \right)V\) is an \(n \times d\) matrix whose \(i\)-th row is the context vector for the \(i\)-th token of the target sequence.
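The same check can be run numerically. Below is a minimal NumPy sketch of the formula above (single head, no masking; the function name is mine):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d), K: (m, d), V: (m, d)  ->  (n, d) context matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (n, m); row i holds the scores e_ij for query i
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # row-wise softmax: each row sums to 1
    return alpha @ V                              # row i is the context vector for target token i

# Dimension check from the text
n, m, d = 3, 5, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(m, d)), rng.normal(size=(m, d))
assert (Q @ K.T).shape == (n, m)
assert scaled_dot_product_attention(Q, K, V).shape == (n, d)
```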
Helpful links: