Dehghani et al 2021 – The Benchmark Lottery
A few ways that ML benchmarks may be misleading the community.
1. task selection bias
Algorithms/methods are sensitive to the specific tasks included in aggregate benchmarks. A particular approach may top the rankings largely because of its coincidental alignment with the tasks a particular benchmark happens to include.
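A minimal sketch of this effect, with entirely made-up model names and scores: the "winner" of an aggregate benchmark can flip depending on which tasks are included in the average.

```python
# Hypothetical per-task scores for two models (all numbers are illustrative).
scores = {
    "model_A": {"task1": 0.90, "task2": 0.88, "task3": 0.40},
    "model_B": {"task1": 0.80, "task2": 0.82, "task3": 0.85},
}

def aggregate(model, tasks):
    """Mean score over the chosen subset of tasks."""
    return sum(scores[model][t] for t in tasks) / len(tasks)

# Benchmark variant 1: only tasks that happen to align with model_A.
print(aggregate("model_A", ["task1", "task2"]))  # 0.89 -> A "wins"
print(aggregate("model_B", ["task1", "task2"]))  # 0.81

# Benchmark variant 2: add task3 and the ranking flips.
print(aggregate("model_A", ["task1", "task2", "task3"]))  # ~0.73
print(aggregate("model_B", ["task1", "task2", "task3"]))  # ~0.82 -> B "wins"
```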
2. community bias
Reviewers will ask to see results on specific benchmarks until only a handful of benchmarks are being optimized for. The community as a whole then begins to overfit to these particular benchmarks.
3. benchmarks are stateful
Let's say that one model tops a leaderboard, i.e., it gets very good performance on the held-out test set. Subsequent approaches may then begin with this model's parameters or checkpoints, and in this way future approaches indirectly overfit to the test set. After a while, it is hard to say whether doing better on the benchmark actually corresponds to making algorithmic progress on the original problem.
4. rigging the lottery
In fields without standardized, widely accepted benchmarks, such as RL or recommender systems, authors may fit the benchmark to their approach rather than the other way around.
5. what can we do
5.1. statistical significance testing
Given approaches \(A\) and \(B\) that produce models \(m_a\) and \(m_b\), we are not interested in whether a particular trained \(m_a\) scores worse than a particular \(m_b\), but rather in whether the distribution \(p(m_a)\) is more likely to produce a better model than \(p(m_b)\).
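A minimal sketch of what this looks like in practice, assuming we have (hypothetical) benchmark scores from several independently trained seeds of each approach: compare the two score distributions with a significance test instead of comparing two single checkpoints.

```python
import numpy as np
from scipy import stats

# Hypothetical scores from retraining each approach with 5 random seeds.
scores_a = np.array([0.812, 0.807, 0.820, 0.798, 0.815])  # approach A
scores_b = np.array([0.809, 0.801, 0.805, 0.799, 0.803])  # approach B

# Welch's t-test: is the difference in mean score likely real or just noise?
t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)
print(f"mean A = {scores_a.mean():.3f}, mean B = {scores_b.mean():.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# With so few seeds, a nonparametric test is often a safer complement.
u_stat, p_value = stats.mannwhitneyu(scores_a, scores_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```

The point is that the unit of comparison is the approach (a distribution over trained models), not a single lucky checkpoint.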
6. other trends
Aggregate benchmarks that test how well an approach does on many disparate tasks are biased against approaches that do a few things poorly but excel at specific applications, as the sketch below illustrates.
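A small illustration with hypothetical scores: the specialist never tops the aggregate leaderboard, even though it is clearly the right model for anyone who only cares about one of the tasks.

```python
# Hypothetical per-task scores (all numbers are made up for illustration).
scores = {
    "generalist": {"qa": 0.75, "nli": 0.80, "summ": 0.78},
    "specialist": {"qa": 0.92, "nli": 0.55, "summ": 0.50},
}

for name, per_task in scores.items():
    mean = sum(per_task.values()) / len(per_task)
    print(name, f"aggregate={mean:.2f}", per_task)

# The generalist wins the aggregate (~0.78 vs ~0.66), yet the specialist is
# the better choice for anyone whose application is only "qa".
```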