Prasanna et al. 2020 – When BERT Plays the Lottery, All Tickets Are Winning
Notes for prasanna-etal-2020-bert
1. pruning BERT
- BERT can be pruned by removing either:
- the weights with the smallest magnitudes (call this magnitude pruning)
- the self-attention heads with the lowest importance scores (call this structural pruning); a sketch of both styles follows this list
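A minimal sketch of the two pruning styles, assuming PyTorch's `torch.nn.utils.prune` and the HuggingFace `transformers` BertModel rather than the authors' own code; the 30% sparsity level and the chosen heads are arbitrary illustrations:

```python
# Two independent pruning styles, shown on separate copies of BERT.
import torch
import torch.nn.utils.prune as prune
from transformers import BertModel

# Magnitude pruning: zero out the 30% smallest-magnitude weights in every
# linear layer (the 30% sparsity level is just an illustration).
model_a = BertModel.from_pretrained("bert-base-uncased")
for module in model_a.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)

# Structural pruning: remove whole self-attention heads, here heads 0 and 3
# of layer 0 and head 7 of layer 5 (arbitrary choices for illustration).
model_b = BertModel.from_pretrained("bert-base-uncased")
model_b.prune_heads({0: [0, 3], 5: [7]})
```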
2. methods
- fine-tune BERT plus a task-specific classifier layer on downstream (GLUE) tasks
- iteratively prune, checking that dev-set performance stays within roughly 90% of the full fine-tuned model (see the sketch after this list)
- investigate whether the pruned heads/weights are invariant across:
- random initializations of the task-specific top layer
- different tasks
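One step of that iterative loop might look like the following, using the gradient-of-head-mask importance score from Michel et al. (2019) that the paper builds on. The model, the toy batch, and the 10% prune fraction per step are illustrative assumptions, not the authors' exact setup:

```python
# Hedged sketch: gradient-based head importance + one structural-pruning step.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads

# One mask entry per head; |d loss / d mask| serves as that head's importance.
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)

batch = tokenizer(["a toy example sentence"], return_tensors="pt")
loss = model(**batch, labels=torch.tensor([0]), head_mask=head_mask).loss
loss.backward()
importance = head_mask.grad.abs()            # shape: (n_layers, n_heads)

# Prune the 10% least-important heads; the real loop would then re-evaluate
# on dev data and repeat while performance stays above ~90% of the full model.
k = max(1, int(0.1 * importance.numel()))
to_prune = {}
for idx in importance.flatten().argsort()[:k].tolist():
    layer, head = divmod(idx, n_heads)
    to_prune.setdefault(layer, []).append(head)
model.prune_heads(to_prune)
```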
3. Key takeaways from this paper:
- the heads that survive structural pruning do not seem to encode notably more linguistic/structural information than the ones that get pruned
- the pruned heads/weights, fine-tuned on their own, perform about as well as the surviving ones ("all tickets are winning"), which suggests they were doing largely redundant work
- this points away from the view that BERT is composed of specialized modules, each doing a distinct job, and towards the view that language processing is distributed across many heads