Prasanna et al. 2020 – When BERT Plays the Lottery, All Tickets Are Winning
Notes for prasanna-etal-2020-bert
1. pruning BERT
- BERT can be pruned by removing either:
- the weights with the smallest magnitudes (call this magnitude pruning)
- the self-attention heads with the lowest importance scores (call this structural pruning); a sketch of both styles follows this list
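A minimal sketch of the two pruning styles, assuming PyTorch's `torch.nn.utils.prune` and the HuggingFace `transformers` BertModel rather than the authors' own code; the 30% sparsity level and the chosen heads are arbitrary illustrations:

```python
# Two independent pruning styles, shown on separate copies of BERT.
import torch
import torch.nn.utils.prune as prune
from transformers import BertModel

# Magnitude pruning: zero out the 30% smallest-magnitude weights in every
# linear layer (the 30% sparsity level is just an illustration).
model_a = BertModel.from_pretrained("bert-base-uncased")
for module in model_a.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)

# Structural pruning: remove whole self-attention heads, here heads 0 and 3
# of layer 0 and head 7 of layer 5 (arbitrary choices for illustration).
model_b = BertModel.from_pretrained("bert-base-uncased")
model_b.prune_heads({0: [0, 3], 5: [7]})
```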
2. methods
- fine-tune BERT plus a task-specific classifier layer on downstream (GLUE) tasks
- iteratively prune, checking that dev-set performance stays within roughly 90% of the full fine-tuned model (see the sketch after this list)
- investigate whether the pruned heads/weights are invariant across:
- random initializations of the task-specific top layer
- different tasks
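One step of that iterative loop might look like the following, using the gradient-of-head-mask importance score from Michel et al. (2019) that the paper builds on. The model, the toy batch, and the 10% prune fraction per step are illustrative assumptions, not the authors' exact setup:

```python
# Hedged sketch: gradient-based head importance + one structural-pruning step.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads

# One mask entry per head; |d loss / d mask| serves as that head's importance.
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)

batch = tokenizer(["a toy example sentence"], return_tensors="pt")
loss = model(**batch, labels=torch.tensor([0]), head_mask=head_mask).loss
loss.backward()
importance = head_mask.grad.abs()            # shape: (n_layers, n_heads)

# Prune the 10% least-important heads; the real loop would then re-evaluate
# on dev data and repeat while performance stays above ~90% of the full model.
k = max(1, int(0.1 * importance.numel()))
to_prune = {}
for idx in importance.flatten().argsort()[:k].tolist():
    layer, head = divmod(idx, n_heads)
    to_prune.setdefault(layer, []).append(head)
model.prune_heads(to_prune)
```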
3. Key takeaways from this paper:
- the heads that survive structural pruning do not seem to encode notably more linguistic/structural information than the ones that get pruned
- the pruned heads/weights, fine-tuned on their own, perform about as well as the surviving ones ("all tickets are winning"), which suggests they were doing largely redundant work
- this points away from the view that BERT is composed of specialized modules, each doing a distinct job, and towards the view that language processing is distributed across many heads