Dunietz et al. 2020 - To Test Machine Comprehension, Start by Defining Comprehension
1. Motivation
Models based on BERT (devlin18_bert) are matching and even exceeding human performance on MRC datasets, yet these models still make baffling errors that humans never would, and they fail to generalize well.
2. Argument: existing machine reading comprehension datasets do not effectively capture comprehension
Machine reading comprehension (MRC) datasets contain natural-language question-answer pairs grounded in a passage to be comprehended (see the sketch after this list). In past approaches, the questions, answers, and passages have been:
- handwritten by humans, including questions that require multi-hop reasoning
- collected from the wild, e.g. Reddit
- taken from tests for humans, e.g. trivia bowl questions
- machine-generated
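For concreteness, an MRC item can be thought of as a (passage, question, answer) triple. Below is a minimal sketch in Python; the class and field names are illustrative assumptions, not the schema of any specific dataset.

```python
from dataclasses import dataclass

@dataclass
class MRCExample:
    """One machine reading comprehension item: a question-answer pair
    grounded in a passage. Field names are illustrative only."""
    passage: str   # the text to be comprehended
    question: str  # natural-language question about the passage
    answer: str    # reference answer (often a span of the passage)

# A toy instance in the spirit of the "obvious to humans" questions
# the authors argue most datasets leave out.
example = MRCExample(
    passage="Maria peeled a ripe banana and ate it on her way to work.",
    question="What color was the banana?",
    answer="yellow",  # needs everyday world knowledge, not just span extraction
)
print(example.question, "->", example.answer)
```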
Main contention: performing well on difficult questions does not necessarily indicate good understanding. Many datasets do not cover questions that humans would find obvious, e.g. "what color is a banana?" A system's sophistication matters less than whether it actually comprehends the passage in its context.
3. Approach: questions about stories
As their context, the authors propose narrative stories. For this purpose, they define a template of understanding: a baseline set of relationships that a reading comprehension system should attend to:
- temporal relationships
- causal relationships
- spatial relationships
- motivational relationships
In this way, for narrative fiction, they present a systematized way to probe a machine's understanding of both the passage and the world it describes.
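A rough sketch of how such a template could be encoded as a fixed set of question categories for probing a story is shown below; the enum values and question phrasings are illustrative assumptions, not the authors' exact wording.

```python
from enum import Enum

class ToUDimension(Enum):
    """The four dimensions of the template of understanding (ToU).
    Prompt phrasings are illustrative, not the paper's exact wording."""
    TEMPORAL = "What events happened, and in what order?"
    CAUSAL = "What caused each event or state to occur?"
    SPATIAL = "Where are the entities, and how do their locations change?"
    MOTIVATIONAL = "Why did the characters act as they did?"

def probe_questions(story_title: str) -> dict:
    """Generate one probe question per ToU dimension for a given story."""
    return {dim.name.lower(): f"[{story_title}] {dim.value}" for dim in ToUDimension}

for name, question in probe_questions("The Gift of the Magi").items():
    print(f"{name:13s} {question}")
```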
4. Useful Links
- "The field of natural language processing is chasing the wrong goal" - opinion piece by the first author