model soup

1. procedure

  • Take a pretrained model and fine-tune it many times (e.g. with different hyperparameters or random seeds) to obtain several checkpoints. Averaging the weights of all the checkpoints together can yield better performance than any individual checkpoint (a minimal sketch of the averaging follows this list).
  • Crucially, every fine-tuning run must start from the same pretrained checkpoint. Otherwise, the neurons of each model could be permuted with respect to each other, and averaging would mix incompatible features.
  • Why does this work? You can think of the fine-tuned models as all sitting in a region of the loss landscape that is locally convex. If they sit on the "rim" of this convex basin, then by Jensen's inequality the loss of any convex combination satisfies L(Σᵢ αᵢθᵢ) ≤ Σᵢ αᵢ L(θᵢ), so the average lands inside the basin with a loss no worse than the average of the checkpoints' losses, and typically lower than each of them.
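
A minimal sketch of the uniform weight average in PyTorch, assuming the checkpoints were saved as plain state_dicts of floating-point tensors with identical keys and shapes (i.e. all fine-tuned from the same pretrained model). The names checkpoint_paths and make_model are hypothetical placeholders, not from any particular library.

    # Sketch only: assumes each path holds a state_dict of float tensors
    # from models fine-tuned off the same pretrained checkpoint.
    import torch

    def uniform_soup(checkpoint_paths, make_model):
        """Average the weights of several fine-tuned checkpoints."""
        soup = None
        for path in checkpoint_paths:
            state = torch.load(path, map_location="cpu")
            if soup is None:
                # Initialize the running sum with copies of the first checkpoint.
                soup = {k: v.clone().float() for k, v in state.items()}
            else:
                # Accumulate the remaining checkpoints elementwise.
                for k in soup:
                    soup[k] += state[k].float()
        # Divide by the number of checkpoints to get the uniform average.
        soup = {k: v / len(checkpoint_paths) for k, v in soup.items()}
        model = make_model()           # fresh model with the right architecture
        model.load_state_dict(soup)    # the "soup" replaces its weights
        return model

Summing on CPU in float avoids holding all models in memory at once; integer buffers (e.g. batch-norm counters), if present, would need to be handled separately since they cannot be meaningfully averaged.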

Created: 2025-11-02 Sun 18:54