LSTM
At each time-step \(t\), the LSTM sees:
- \(x_{t}\) the input token
- \(C_{t-1}\) the cell state. Intuitively, this is internal memory that the LSTM carries from timestep to timestep. Some timesteps will change the cell a lot, and others will barely touch it; at the timesteps that barely touch it, gradients can flow back through the cell largely undiminished, which is what helps with vanishing gradients.
- \(h_{t-1}\) the hidden state. Intuitively, this is the output of the LSTM right before it's projected into the vocabulary dimension.
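To make the notation concrete, here is a minimal NumPy sketch of the per-timestep interface. The names `lstm_step`, `update_cell`, `compute_output`, and the `params` dict are illustrative assumptions, not from the original text; the two helpers are filled in under steps 1 and 2 below.

```python
import numpy as np

# One LSTM step consumes the current input x_t plus the carried state
# (C_{t-1}, h_{t-1}) and returns the updated state (h_t, C_t).
def lstm_step(x_t, h_prev, C_prev, params):
    C_t = update_cell(x_t, h_prev, C_prev, params)    # step 1 below
    h_t = compute_output(x_t, h_prev, C_t, params)    # step 2 below
    return h_t, C_t
```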
1. Updating the cell
First, we need to determine what part of the old cell state we want to keep. We compute a forget vector: \[f_t = \sigma(W_f \cdot [h_{t-1}, x_{t}] + b_f)\] where \(W_f\) is a weight matrix, \(b_f\) is a bias vector, and \([h_{t-1}, x_t]\) denotes concatenation. The result \(f_t\) is a vector of values between 0 and 1: entries near 1 mean "keep this component of the cell" and entries near 0 mean "forget it".
Then, we need to determine what updates we are adding to the new cell state. Let \(\tilde{C}_t\) be the candidate cell state and \(i_t\) be the input gate that determines how much of the candidate gets added: \[ \tilde{C}_t = \tanh(W_C[h_{t-1},x_t] + b_C) \] and \[ i_t = \sigma(W_i[h_{t-1},x_t] + b_i) \]
Then, we have our updated cell state! \[ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \]
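Putting step 1 into code, here is a hedged NumPy sketch of the cell update. It assumes the `params` dict and shapes from the interface sketch above, with hidden size H, input size D, and every weight matrix of shape (H, H + D) acting on the concatenation \([h_{t-1}, x_t]\).

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: forget part of the old cell, then add a gated candidate.
def update_cell(x_t, h_prev, C_prev, params):
    hx = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t], shape (H + D,)
    f_t = sigmoid(params["W_f"] @ hx + params["b_f"])       # forget gate, values in (0, 1)
    i_t = sigmoid(params["W_i"] @ hx + params["b_i"])       # input gate, values in (0, 1)
    C_tilde = np.tanh(params["W_C"] @ hx + params["b_C"])   # candidate cell state, in (-1, 1)
    return f_t * C_prev + i_t * C_tilde                     # keep some old state, add some new
```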
2. Creating output
First, we need to decide which pieces of the cell state to output. So, we create an output vector: \[ o_t = \sigma(W_o[h_{t-1},x_{t}] + b_o) \]
Then, the output is: \[ h_t = o_t * \tanh(C_t) \]
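Step 2 is one more gate and a squash; here is a matching sketch under the same assumed `params` dict.

```python
# Step 2: gate the squashed cell state to produce the hidden state h_t.
def compute_output(x_t, h_prev, C_t, params):
    hx = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    o_t = sigmoid(params["W_o"] @ hx + params["b_o"])       # output gate, values in (0, 1)
    return o_t * np.tanh(C_t)                               # h_t
```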
Note that all four weight matrices \(W_f\), \(W_i\), \(W_C\), and \(W_o\) act only on the previous hidden state and the current input, i.e. on the concatenation \([h_{t-1}, x_t]\).
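For illustration, here is how the pieces above might be wired together on a toy sequence. The sizes D = 8 and H = 16 and the random initialization are made up purely to show the shapes: every gate's matrix is (H, H + D) because each one reads the same concatenation.

```python
D, H = 8, 16                                   # hypothetical input and hidden sizes
rng = np.random.default_rng(0)

params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(H, H + D))
    params[f"b_{name}"] = np.zeros(H)

h_t, C_t = np.zeros(H), np.zeros(H)            # initial hidden and cell state
for x_t in rng.normal(size=(5, D)):            # a toy sequence of 5 embedded tokens
    h_t, C_t = lstm_step(x_t, h_t, C_t, params)
print(h_t.shape, C_t.shape)                    # (16,) (16,)
```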