
Prasanna 2020 – When BERT Plays the Lottery, All Tickets Are Winners

Notes for prasanna-etal-2020-bert

1. pruning BERT

  • BERT can be pruned by removing either:
    • the individual weights with the smallest magnitude (call this magnitude pruning)
    • the self-attention heads of least importance (call this structured pruning); a rough sketch of both is given after this list
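  Below is a minimal sketch of the two pruning styles, assuming PyTorch and a HuggingFace-style BERT whose forward pass accepts a head_mask argument and returns a loss when labels are supplied (those wiring details, and the function names, are my assumptions, not the paper's code). The head-importance score follows the sensitivity heuristic the paper borrows from Michel et al. (2019): the expected absolute gradient of the loss with respect to a per-head mask.

```python
import torch

def magnitude_prune(model, sparsity=0.5):
    """Zero out the globally smallest-magnitude weights (a one-shot sketch;
    the paper prunes iteratively and keeps the binary masks so pruned
    weights stay at zero during further fine-tuning)."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    k = max(int(sparsity * all_weights.numel()), 1)
    threshold = all_weights.kthvalue(k).values   # k-th smallest magnitude
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() > 1:
            mask = (p.detach().abs() > threshold).float()
            p.data.mul_(mask)        # zero the pruned weights in place
            masks[name] = mask
    return masks

def head_importance(model, num_layers, num_heads, dataloader, device="cpu"):
    """Score each self-attention head as the expected |d(loss)/d(mask)|;
    low-scoring heads are candidates for structured pruning."""
    head_mask = torch.ones(num_layers, num_heads, device=device,
                           requires_grad=True)
    importance = torch.zeros(num_layers, num_heads, device=device)
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch, head_mask=head_mask).loss  # batch must include labels
        loss.backward()
        importance += head_mask.grad.abs().detach()
        head_mask.grad = None        # reset before the next batch
    return importance
```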

2. methods

  • fine-tune BERT, with a task-specific classification layer on top, on downstream (GLUE) tasks
  • iteratively prune while checking that performance stays close to that of the full model (see the sketch after this list)
  • investigate whether the pruned heads/weights are invariant across:
    • random initializations of the task-specific top layer
    • different tasks
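  A minimal sketch of the iterative pruning loop and of a simple overlap check across seeds/tasks. The evaluate and prune_lowest_heads callables, and the default tolerance, are hypothetical stand-ins for the task-specific evaluation loop and masking step rather than the paper's actual code.

```python
def prune_until_degraded(model, evaluate, prune_lowest_heads, full_score,
                         tolerance=0.10):
    """Keep masking the least-important heads until dev performance falls
    more than `tolerance` below the unpruned model's score."""
    surviving_heads = None
    while True:
        candidate = prune_lowest_heads(model)         # mask next batch of heads
        if evaluate(model) < (1.0 - tolerance) * full_score:
            break                                     # quality dropped too far
        surviving_heads = candidate                   # accept this pruning step
    return surviving_heads

def survival_overlap(heads_a, heads_b):
    """Jaccard overlap between two sets of surviving heads: one crude way to
    check whether the same heads survive across random seeds or tasks."""
    a, b = set(heads_a), set(heads_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```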

3. Key takeaways from this paper:

  • the heads that survive structured pruning do not seem to encode much linguistic/structural information
  • the heads that can be pruned contribute about as much as those that survive, which suggests they were doing largely redundant work
  • this points away from the view that BERT is composed of specialized modules, each doing an individual job, and towards the view that language processing is distributed across many heads


Created: 2024-07-15 Mon 01:28