KV cache
1. Why don't I have to recompute the keys and values after each new generated token?
- This is something that confused me when I first learned about the KV cache.
- At inference time, I'm generating token by token (see Transformer).
- Every time I generate a token, I keep its key and value around.
- Then, when I generate the next token, attention still needs keys and values for the whole sequence. But the keys and values for the first \(n-1\) tokens have already been computed, so I can reuse them instead of recomputing.
- But shouldn't those keys and values change, now that attention ranges over one additional token? Actually, no! The key (and value) for a token is computed from that token's hidden state, which, under causal masking, depends only on the token itself and the tokens before it. So a token that comes later never requires the old keys and values to be updated (see the sketch after this list).
- Technically, I suppose it would be possible to recompute all the keys and values every time a new token is generated. This would be more expensive, but might lead to a more expressive model? I think people don't do this because it wouldn't match the causal masking that happens at train time.
- See this answer
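A minimal sketch of the point above, under simplifying assumptions: a single attention head, one layer, and random NumPy matrices (`W_q`, `W_k`, `W_v` are hypothetical stand-ins for trained weights). The loop only projects the newest token and appends its key and value to the cache; the final `assert` checks that this reproduces a full causally masked recompute over the whole sequence, i.e. the old keys and values never needed updating.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # model / head dimension
n = 5   # number of tokens generated so far

# Hypothetical per-token embeddings and projection weights.
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# --- Naive path: rerun attention over the whole prefix. ---
def attend_full(X):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((len(X), len(X)), dtype=bool), k=1)
    scores[mask] = -np.inf          # causal mask: no attending to later tokens
    return softmax(scores) @ V

# --- Cached path: keep K/V from earlier steps, only project the new token. ---
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
cached_outputs = []
for t in range(n):
    x_t = X[t:t + 1]                          # only the newest token
    q_t = x_t @ W_q
    K_cache = np.vstack([K_cache, x_t @ W_k])  # append; old rows never change
    V_cache = np.vstack([V_cache, x_t @ W_v])
    attn = softmax(q_t @ K_cache.T / np.sqrt(d))
    cached_outputs.append(attn @ V_cache)

# The cached outputs match the full recompute exactly.
assert np.allclose(attend_full(X), np.vstack(cached_outputs))
print("cached and recomputed outputs match")
```

In a deep transformer the same argument applies layer by layer: causal masking guarantees that a token's hidden state at every layer, and hence its key and value, depends only on itself and earlier tokens.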