Dehghani et al 2021 – The Benchmark Lottery
A few ways that ML benchmarks may be misleading the community.
1. task selection bias
Algorithms/methods are sensitive to the specific tasks included in aggregate benchmarks. A particular approach may top the rankings largely because of its coincidental alignment with the tasks a particular benchmark happens to include.
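A minimal sketch of this effect, with entirely made-up model names and scores: the "winner" of an aggregate benchmark can flip depending on which tasks are included in the average.

```python
# Hypothetical per-task scores for two models (all numbers are illustrative).
scores = {
    "model_A": {"task1": 0.90, "task2": 0.88, "task3": 0.40},
    "model_B": {"task1": 0.80, "task2": 0.82, "task3": 0.85},
}

def aggregate(model, tasks):
    """Mean score over the chosen subset of tasks."""
    return sum(scores[model][t] for t in tasks) / len(tasks)

# Benchmark variant 1: only tasks that happen to align with model_A.
print(aggregate("model_A", ["task1", "task2"]))  # 0.89 -> A "wins"
print(aggregate("model_B", ["task1", "task2"]))  # 0.81

# Benchmark variant 2: add task3 and the ranking flips.
print(aggregate("model_A", ["task1", "task2", "task3"]))  # ~0.73
print(aggregate("model_B", ["task1", "task2", "task3"]))  # ~0.82 -> B "wins"
```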
2. community bias
Reviewers will ask to see results on specific benchmarks until only a handful of benchmarks are being optimized for. The community as a whole then begins to overfit to these particular benchmarks.
3. benchmarks are stateful
Let's say that one model tops a leaderboard, i.e., it gets very good performance on the held-out test set. Subsequent approaches may then begin with this model's parameters or checkpoints, and in this way future approaches indirectly overfit to the test set. After a while, it is hard to say whether doing better on the benchmark actually corresponds to making algorithmic progress on the original problem.
4. rigging the lottery
In fields without standardized, widely accepted benchmarks, such as RL or recommender systems, authors may fit the benchmark to their approach rather than the other way around.
5. what can we do
5.1. statistical significance testing
Given approaches \(A\) and \(B\) that produce models \(m_a\) and \(m_b\), we are not interested in whether a particular trained \(m_a\) scores worse than a particular \(m_b\), but rather in whether the distribution \(p(m_a)\) is more likely to produce a better model than \(p(m_b)\).
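A minimal sketch of what this looks like in practice, assuming we have (hypothetical) benchmark scores from several independently trained seeds of each approach: compare the two score distributions with a significance test instead of comparing two single checkpoints.

```python
import numpy as np
from scipy import stats

# Hypothetical scores from retraining each approach with 5 random seeds.
scores_a = np.array([0.812, 0.807, 0.820, 0.798, 0.815])  # approach A
scores_b = np.array([0.809, 0.801, 0.805, 0.799, 0.803])  # approach B

# Welch's t-test: is the difference in mean score likely real or just noise?
t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)
print(f"mean A = {scores_a.mean():.3f}, mean B = {scores_b.mean():.3f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# With so few seeds, a nonparametric test is often a safer complement.
u_stat, p_value = stats.mannwhitneyu(scores_a, scores_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
```

The point is that the unit of comparison is the approach (a distribution over trained models), not a single lucky checkpoint.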
6. other trends
Aggregate benchmarks that test how well an approach does on many disparate tasks are biased against approaches that do a few things poorly but excel at specific applications, as the sketch below illustrates.
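A small illustration with hypothetical scores: the specialist never tops the aggregate leaderboard, even though it is clearly the right model for anyone who only cares about one of the tasks.

```python
# Hypothetical per-task scores (all numbers are made up for illustration).
scores = {
    "generalist": {"qa": 0.75, "nli": 0.80, "summ": 0.78},
    "specialist": {"qa": 0.92, "nli": 0.55, "summ": 0.50},
}

for name, per_task in scores.items():
    mean = sum(per_task.values()) / len(per_task)
    print(name, f"aggregate={mean:.2f}", per_task)

# The generalist wins the aggregate (~0.78 vs ~0.66), yet the specialist is
# the better choice for anyone whose application is only "qa".
```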