
Transformer

1. Key points

2. Encoder

The encoder is composed of \(N\) stacked encoder modules. That is, the \(i\)th module produces the input for the \((i+1)\)th module.

2.1. Positional Encoding

Each position \(t\) is encoded as a \(d\)-dimensional vector \(p_t\). To encode the \(i\)th dimension of the vector for the \(t\)th position, the following formula is used: \[p_t(i) = \begin{cases} \sin(\omega_k \cdot t) & \text{if } i = 2k\\ \cos(\omega_k \cdot t) & \text{if } i = 2k + 1\\ \end{cases}\] where \[\omega_k = \frac{1}{10000^{2k/d}}\]
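
A minimal NumPy sketch of this encoding (the function name is illustrative; it assumes an even embedding dimension \(d\)):

  import numpy as np

  def sinusoidal_positional_encoding(max_len, d):
      # Returns a (max_len, d) matrix whose row t is the encoding p_t.
      # Assumes d is even.
      p = np.zeros((max_len, d))
      t = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
      two_k = np.arange(0, d, 2)[None, :]          # the even indices 2k
      omega = 1.0 / (10000 ** (two_k / d))         # omega_k = 1 / 10000^(2k/d)
      p[:, 0::2] = np.sin(omega * t)               # p_t(2k)   = sin(omega_k * t)
      p[:, 1::2] = np.cos(omega * t)               # p_t(2k+1) = cos(omega_k * t)
      return p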

2.1.1. Intuition

Note that the frequency decreases with the dimension index. Imagine counting in binary: the least significant bit changes very quickly, while the most significant bit changes very slowly. Think of the positional encoding as a continuous version of that. The lowest dimension of \(p_t\) has a very short period, \(2\pi\), while the highest dimension has a very long period, approximately \(10000\cdot2\pi\).

2.2. Encoder Module

Each encoder module uses multi-head attention. What is multi-head attention? In single-head attention, we have projection matrices \(W^Q\), \(W^K\), and \(W^V\) for the query, key, and value embeddings respectively. In multi-head attention, we have \(W^Q_i\), \(W^K_i\), and \(W^V_i\) for each head \(i\). Let \(n\) be the source length, \(m\) be the target length, \(d\) the embedding dimension, and \(h\) the number of heads. Each attention head outputs \(Z_i\) of dimension \(m \times d_k\) (typically \(d_k = d/h\)). These outputs are concatenated into an \(m \times d\) matrix and multiplied by a weight matrix \(W^O\).
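
A minimal NumPy sketch of multi-head attention with an explicit loop over heads (the function names and the per-head dimension \(d_k\) are illustrative; the \(\sqrt{d_k}\) scaling follows the standard scaled dot-product formulation):

  import numpy as np

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def multi_head_attention(X_q, X_kv, W_Q, W_K, W_V, W_O):
      # X_q: (m, d) target-side input, X_kv: (n, d) source-side input.
      # W_Q, W_K, W_V: lists of h per-head (d, d_k) projection matrices.
      # W_O: (h * d_k, d) output projection.
      heads = []
      d_k = W_Q[0].shape[1]
      for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
          Q, K, V = X_q @ Wq, X_kv @ Wk, X_kv @ Wv   # (m, d_k), (n, d_k), (n, d_k)
          scores = Q @ K.T / np.sqrt(d_k)            # (m, n)
          Z_i = softmax(scores, axis=-1) @ V         # (m, d_k)
          heads.append(Z_i)
      return np.concatenate(heads, axis=-1) @ W_O    # (m, d)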

The input to the first module is the sequence of embedded input tokens. Added to these embeddings is a positional encoding that encodes the position of each token in the sequence.

The concatenated multi-head output \(Z\) then passes through a feed-forward network. There are residual connections and layer-normalization operations around the self-attention layer as well as around the feed-forward network.
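
A minimal PyTorch sketch of one encoder module under these assumptions (post-norm ordering; the class name and the feed-forward width d_ff are illustrative):

  import torch.nn as nn

  class EncoderModule(nn.Module):
      def __init__(self, d, h, d_ff):
          super().__init__()
          self.self_attn = nn.MultiheadAttention(d, h, batch_first=True)
          self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
          self.norm1 = nn.LayerNorm(d)
          self.norm2 = nn.LayerNorm(d)

      def forward(self, x):                    # x: (batch, n, d)
          z, _ = self.self_attn(x, x, x)       # multi-head self-attention
          x = self.norm1(x + z)                # residual connection + layer norm
          x = self.norm2(x + self.ff(x))       # residual connection + layer norm around the FFN
          return x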

3. Decoder

The decoder is composed of \(M\) stacked decoder modules.

3.1. Decoder Module

Each decoder module takes two inputs: the \(K\) and \(V\) matrices derived from the output of the top-most encoder module, and the output of the previous decoder module. Each decoder module attends to the encoder output with an encoder-decoder attention layer. Tokens are predicted one at a time. At time \(i\), each decoder module attends to every output produced up to position \(i\) with a masked self-attention layer.

A linear layer followed by a softmax sits on top of the final decoder module and produces a probability distribution over the output vocabulary.
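
A minimal PyTorch sketch of one decoder module plus the final linear/softmax head, mirroring the encoder sketch above (the causal mask implements the masked self-attention; all names and hyperparameters are illustrative):

  import torch
  import torch.nn as nn

  class DecoderModule(nn.Module):
      def __init__(self, d, h, d_ff):
          super().__init__()
          self.self_attn = nn.MultiheadAttention(d, h, batch_first=True)
          self.cross_attn = nn.MultiheadAttention(d, h, batch_first=True)
          self.ff = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
          self.norm1, self.norm2, self.norm3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

      def forward(self, y, enc_out):           # y: (batch, m, d), enc_out: (batch, n, d)
          m = y.size(1)
          # Boolean mask: True above the diagonal blocks attention to future positions.
          causal = torch.triu(torch.ones(m, m, dtype=torch.bool, device=y.device), diagonal=1)
          z, _ = self.self_attn(y, y, y, attn_mask=causal)    # masked self-attention
          y = self.norm1(y + z)
          z, _ = self.cross_attn(y, enc_out, enc_out)         # encoder-decoder attention
          y = self.norm2(y + z)
          return self.norm3(y + self.ff(y))

  # Final output head: project each decoder state to the vocabulary and apply softmax.
  d, h, d_ff, vocab_size = 512, 8, 2048, 10000        # illustrative hyperparameters
  decoder = DecoderModule(d, h, d_ff)
  to_vocab = nn.Linear(d, vocab_size)
  enc_out = torch.randn(2, 10, d)                     # (batch, n, d) from the encoder stack
  y = torch.randn(2, 7, d)                            # (batch, m, d) embedded target prefix
  probs = to_vocab(decoder(y, enc_out)).softmax(dim=-1)   # (batch, m, vocab_size)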

4. Training

The model is typically trained with full supervision and a cross-entropy loss, using teacher forcing: the gold target sequence, shifted right, is fed to the decoder, and the model is trained to predict the next token at each position.
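
A minimal PyTorch sketch of one teacher-forced training step (the model interface, pad_id, and batch layout are assumptions, not from the source):

  import torch.nn as nn

  pad_id = 0                                           # assumed padding token id
  criterion = nn.CrossEntropyLoss(ignore_index=pad_id)

  def train_step(model, optimizer, src, tgt):
      # Teacher forcing: the decoder sees the gold target shifted right
      # and is trained to predict the next token at every position.
      tgt_in, tgt_out = tgt[:, :-1], tgt[:, 1:]
      logits = model(src, tgt_in)                      # (batch, m, vocab_size)
      loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      return loss.item()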

5. Multi-head attention

  • Typically, Transformers have multiple attention heads. Each head needs its own projections for its keys, values, and queries.
  • These projections can be computed all at once, for all attention heads, by stacking the projection matrices.
  • Then, to ensure that attention is computed per head, the tensors can be reshaped (see this nice blog post); a sketch of this reshape trick follows the list.
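
A PyTorch sketch of the stacked-projection-plus-reshape trick described above (the dimensions and layer names are illustrative; attention masking is omitted for brevity):

  import torch
  import torch.nn as nn

  d, h = 512, 8
  d_k = d // h                                      # per-head dimension

  # One big projection per Q/K/V covers all heads at once.
  W_q, W_k, W_v, W_o = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

  def multi_head(x):                                # x: (batch, seq, d)
      b, n, _ = x.shape
      # Project once, then split the last dimension into (h, d_k) and move the
      # head axis in front of the sequence so attention is computed per head.
      q = W_q(x).view(b, n, h, d_k).transpose(1, 2)   # (batch, h, n, d_k)
      k = W_k(x).view(b, n, h, d_k).transpose(1, 2)
      v = W_v(x).view(b, n, h, d_k).transpose(1, 2)
      scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, h, n, n)
      z = scores.softmax(dim=-1) @ v                  # (batch, h, n, d_k)
      z = z.transpose(1, 2).reshape(b, n, d)          # concatenate the heads again
      return W_o(z)                                   # (batch, n, d)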

5.1. Using the PyTorch Transformer
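
A minimal sketch of driving torch.nn.Transformer directly (hyperparameters, vocabulary size, and the embedding/output layers are illustrative; nn.Transformer does not add embeddings or positional encodings itself, so the sinusoidal encoding from section 2.1 would be added to the embeddings in a real model):

  import torch
  import torch.nn as nn

  d, h, vocab_size = 512, 8, 10000                    # illustrative hyperparameters

  transformer = nn.Transformer(d_model=d, nhead=h,
                               num_encoder_layers=6, num_decoder_layers=6,
                               batch_first=True)
  embed = nn.Embedding(vocab_size, d)                 # token embeddings (positional encodings omitted here)
  to_vocab = nn.Linear(d, vocab_size)

  src = torch.randint(0, vocab_size, (2, 10))         # (batch, n) source token ids
  tgt = torch.randint(0, vocab_size, (2, 7))          # (batch, m) target token ids

  causal = transformer.generate_square_subsequent_mask(tgt.size(1))
  out = transformer(embed(src), embed(tgt), tgt_mask=causal)   # (batch, m, d)
  logits = to_vocab(out)                              # (batch, m, vocab_size)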

6. Useful links:

Created: 2025-11-02 Sun 18:55