model soup

1. procedure

  • Take a pretrained model and fine-tune it many times (e.g. with different hyperparameters or random seeds) to obtain several checkpoints. Averaging the weights of all the checkpoints together can yield better performance than any individual checkpoint (a minimal sketch of the averaging follows this list).
  • Crucially, every fine-tuning run must start from the same pretrained checkpoint. Otherwise, the neurons of each model could be permuted with respect to each other, and averaging would mix incompatible features.
  • Why does this work? You can think of the fine-tuned models as all sitting in a region of the loss landscape that is locally convex. If they sit on the "rim" of this convex basin, then by Jensen's inequality the loss of any convex combination satisfies L(Σᵢ αᵢθᵢ) ≤ Σᵢ αᵢ L(θᵢ), so the average lands inside the basin with a loss no worse than the average of the checkpoints' losses, and typically lower than each of them.
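
A minimal sketch of the uniform weight average in PyTorch, assuming the checkpoints were saved as plain state_dicts of floating-point tensors with identical keys and shapes (i.e. all fine-tuned from the same pretrained model). The names checkpoint_paths and make_model are hypothetical placeholders, not from any particular library.

    # Sketch only: assumes each path holds a state_dict of float tensors
    # from models fine-tuned off the same pretrained checkpoint.
    import torch

    def uniform_soup(checkpoint_paths, make_model):
        """Average the weights of several fine-tuned checkpoints."""
        soup = None
        for path in checkpoint_paths:
            state = torch.load(path, map_location="cpu")
            if soup is None:
                # Initialize the running sum with copies of the first checkpoint.
                soup = {k: v.clone().float() for k, v in state.items()}
            else:
                # Accumulate the remaining checkpoints elementwise.
                for k in soup:
                    soup[k] += state[k].float()
        # Divide by the number of checkpoints to get the uniform average.
        soup = {k: v / len(checkpoint_paths) for k, v in soup.items()}
        model = make_model()           # fresh model with the right architecture
        model.load_state_dict(soup)    # the "soup" replaces its weights
        return model

Summing on CPU in float avoids holding all models in memory at once; integer buffers (e.g. batch-norm counters), if present, would need to be handled separately since they cannot be meaningfully averaged.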

Created: 2025-11-02 Sun 18:54