Transformer

1. Key points

Introduced by Vaswani et al. (2017), "Attention Is All You Need" (vaswani17_atten_is_all_you_need).
Uses only an attention mechanism and no recurrent modules.

2. Encoder

The encoder is composed of $N$ stacked encoder modules. That is, the output of the $i$-th module is the input to the $(i+1)$-th module.

2.1. Positional Encoding

Each position $t$ is encoded as a $d$-dimensional vector $p_t$. The $i$-th dimension of the vector for position $t$ is given by: \[p_t(i) = \begin{cases} \sin(\omega_k \cdot t) & \text{if } i = 2k\\ \cos(\omega_k \cdot t) & \text{if } i = 2k + 1 \end{cases}\] where \[\omega_k = \frac{1}{10000^{2k/d}}\]
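As a rough illustration of the formula above, the following minimal Python sketch computes the encoding vector for a single position. The function name `positional_encoding` and the `base` parameter are assumptions made for this example; `base` defaults to 10000, the constant used in the original paper.

```python
import math

def positional_encoding(t, d, base=10000.0):
    """Return the d-dimensional sinusoidal encoding of position t (hypothetical helper)."""
    p = []
    for i in range(d):
        k = i // 2  # dimensions 2k and 2k+1 share the same frequency
        omega = 1.0 / base ** (2 * k / d)
        # even dimensions use sine, odd dimensions use cosine
        p.append(math.sin(omega * t) if i % 2 == 0 else math.cos(omega * t))
    return p

# Position 0 alternates sin(0) = 0 and cos(0) = 1 in every dimension pair.
p0 = positional_encoding(0, 8)
p1 = positional_encoding(1, 4)
```

Every component lies in $[-1, 1]$, so the encoding can be added to the token embeddings without changing their scale much.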

2.1.1. Intuition

Note that frequency decreases with dimension. Imagine counting in binary: the least significant bit changes very quickly, while the most significant bit changes very slowly. The encoding is a continuous version of that. The lowest dimensions of $p_t$ have a very short period, $2\pi$ (since $\omega_0 = 1$), while the highest dimensions have a very long period, $2\pi/\omega_k$ for the largest $k$, which approaches $10000 \cdot 2\pi$ with the base of 10000 used in the original paper.
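To make the binary-counter analogy concrete, this small sketch lists the period $2\pi/\omega_k$ of each sine/cosine pair. The helper name `period` and the choice $d = 8$ are assumptions for illustration only, as is the default base of 10000.

```python
import math

def period(k, d, base=10000.0):
    """Period of the k-th frequency pair: 2*pi / omega_k (hypothetical helper)."""
    return 2 * math.pi * base ** (2 * k / d)

d = 8
# One period per pair of dimensions (2k, 2k+1); k runs from 0 to d/2 - 1.
periods = [period(k, d) for k in range(d // 2)]
```

The list grows geometrically with $k$, which is the continuous counterpart of each binary digit flipping half as often as the one before it.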

2.2. Encoder Module

Each encoder module uses multi-head attention. In single-head attention, we have projection matrices $W^Q$, $W^K$, and $W^V$ for the query, key, and value embeddings respectively. In multi-head attention, each head $i$ has its own $W^Q_i$, $W^K_i$, and $W^V_i$. Let $n$ be the source length, $m$ be the target length, and $d$ the embedding dimension. Each attention head outputs $Z_i$, which has one row per query position: $n$ rows for encoder self-attention and $m$ rows in the decoder. In the original paper each of the $h$ heads works in a reduced dimension $d/h$, so concatenating the $Z_i$ restores width $d$. The concatenation is then multiplied by a weight matrix $W^o$.
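The following is a minimal pure-Python sketch of multi-head self-attention as described above, not a reference implementation. It assumes each head uses dimension $d/h$ and scales the scores by $\sqrt{d_k}$ as in the original paper; the helper names (`matmul`, `attention_head`, `multi_head_attention`, `rand_matrix`) and the random weights are hypothetical.

```python
import math
import random

def matmul(A, B):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, Wq, Wk, Wv):
    """One head: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)  # one row per query, one column per key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, Wo):
    """Run every head, concatenate the Z_i along the feature axis, project with Wo."""
    Zs = [attention_head(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    Z = [sum((Zi[r] for Zi in Zs), []) for r in range(len(X))]
    return matmul(Z, Wo)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d, h = 3, 4, 2  # sequence length, embedding dimension, number of heads
X = rand_matrix(n, d)
heads = [tuple(rand_matrix(d, d // h) for _ in range(3)) for _ in range(h)]
Wo = rand_matrix(d, d)
out = multi_head_attention(X, heads, Wo)  # n rows, d columns
```

Because every head maps $d$ down to $d/h$, the concatenated matrix has the same width as the input, which is what lets the modules be stacked.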

The input to the first module is the sequence of embedded input tokens. Added to these embeddings is a positional encoding that represents the position of each token in the sequence.

The output $Z$ passes through a feed forward network. There are residual connections and layer normalization operations around the self-attention layer as well as around the feed forward network.
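The "residual connection plus layer normalization" pattern can be sketched for a single token vector as below. This assumes the post-norm arrangement of the original paper, $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, and leaves out the learned gain and bias of layer normalization; `layer_norm`, `add_and_norm`, and the toy sublayer are hypothetical names used only for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection around `sublayer`, followed by layer normalization."""
    residual = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual)

# Toy stand-in for the attention or feed-forward sublayer.
x = [1.0, 2.0, 3.0, 4.0]
y = add_and_norm(x, lambda v: [0.5 * a for a in v])
```

In an encoder module this wrapper would be applied twice: once around self-attention and once around the feed-forward network.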

3. Decoder

The decoder is composed of $M$ stacked decoder modules.

3.1. Decoder module

Each decoder module takes as input the output of the top-most encoder module, from which the key and value matrices $K$ and $V$ are computed, and the output of the previous decoder module. Each decoder module attends to the encoder output with an encoder-decoder attention layer, in which the queries come from the decoder. Tokens are predicted one at a time. At time $i$, each decoder module attends only to the outputs produced up to $i$, using a masked self-attention layer.
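One common way to enforce "attend only to outputs produced so far" is a causal mask on the self-attention weights. The sketch below is a simplified illustration under that assumption: it takes a square matrix of raw attention scores and applies the softmax only over positions $0, \dots, i$ in row $i$. The function name `causal_attention_weights` is hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Row i is a softmax over columns 0..i; later (future) columns get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # positions produced up to step i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)  # future positions are hidden
        weights.append([e / total for e in exps] + masked)
    return weights

# With equal scores, each row spreads its weight evenly over the visible prefix.
W = causal_attention_weights([[0.0, 0.0, 0.0] for _ in range(3)])
```

The result is lower triangular, so the whole target sequence can be processed in one pass during training while each position still sees only its own past.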

A linear layer followed by a softmax sits at the top of the decoder stack and produces a probability distribution over the output vocabulary.

4. Training

The model is typically trained with full supervision and cross-entropy loss.
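As a minimal sketch of the loss, assuming the model emits one vector of logits per target position and the supervision is the index of the correct token, the cross-entropy is the average negative log-probability of the correct tokens. The function names and toy inputs below are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(logits_per_step, targets):
    """Mean negative log-likelihood of the target token at each step."""
    losses = [-log_softmax(logits)[t] for logits, t in zip(logits_per_step, targets)]
    return sum(losses) / len(losses)

# Two target positions over a 4-token vocabulary, with uninformative (all-zero) logits.
loss = cross_entropy([[0.0] * 4, [0.0] * 4], targets=[1, 3])
```

With uniform predictions the loss equals the log of the vocabulary size, which is a handy sanity check for the first training step of a freshly initialized model.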

4.1. Using the PyTorch transformer

docs for encoder layer
very helpful thread: by default the input needs to be [seq, batch, feature]; pass batch_first=True to use [batch, seq, feature]
- along the same lines
- helpful student written guide
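A small usage sketch of the PyTorch encoder classes, mainly to show the expected input layout. It assumes a PyTorch version that provides `nn.TransformerEncoderLayer` and `nn.TransformerEncoder`; the sizes are arbitrary values picked for illustration.

```python
import torch
import torch.nn as nn

d_model, n_heads, seq_len, batch = 16, 4, 5, 2  # arbitrary illustrative sizes

# One encoder module: multi-head self-attention plus a feed-forward network.
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, dim_feedforward=32)
encoder = nn.TransformerEncoder(layer, num_layers=2)  # stack of 2 modules

# Default layout is (seq, batch, feature); batch_first=True on the layer swaps the first two.
x = torch.randn(seq_len, batch, d_model)
out = encoder(x)  # same shape as the input
```

The input here is assumed to already contain token embeddings plus positional encodings; these classes do not add either.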

5. Useful links:

bibliography/references.bib