model soup
1. procedure
- Take a pretrained model and fine-tune it many times with different hyperparameters (e.g. learning rate, seed, data augmentation), producing several checkpoints. Averaging the weights of all these checkpoints often gives better performance than any individual checkpoint (see the sketch after this list).
- Crucially, every fine-tuning run must start from the same pretrained checkpoint. Models trained from different initializations can have their neurons permuted relative to one another, so naively averaging their weights is meaningless.
- Why does this work? You can think of the fine-tuned models as all landing in a single basin of the loss landscape that is locally convex. If the checkpoints sit on the "rim" of that basin, a convex combination of their weights lies in the interior, and by convexity its loss is at most the average of their losses: L(sum_i a_i * theta_i) <= sum_i a_i * L(theta_i). When the basin is strictly convex, the interior point does strictly better than the rim.
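
A minimal sketch of the "uniform soup", assuming PyTorch and that each fine-tuning run was saved as a plain `state_dict` file. The function name and checkpoint paths are hypothetical, not from any particular codebase:

```python
import torch

def uniform_soup(model, checkpoint_paths):
    """Average several fine-tuned checkpoints into `model` (a uniform soup).

    All checkpoints must come from fine-tuning the same pretrained model,
    so their parameter tensors line up key-for-key.
    """
    soup = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            # Start a running sum with the first checkpoint's weights.
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    # Divide the running sum by the number of checkpoints.
    for k in soup:
        soup[k] /= len(checkpoint_paths)
    model.load_state_dict(soup)
    return model

# Hypothetical usage:
# model = MyModel()
# uniform_soup(model, ["ft_lr1e-5.pt", "ft_lr3e-5.pt", "ft_seed2.pt"])
```

The uniform average is the simplest variant; the model soups paper also describes a "greedy soup" that adds checkpoints one at a time, keeping each only if it improves held-out accuracy.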