Tenney et al 2019 - BERT Rediscovers the Classical NLP Pipeline
notes for: tenney19_bert_redis_class_nlp_pipel
This paper examines the BERT model from devlin18_bert to determine what linguistic structure the model learns and makes use of. They do this by:
- freezing the encoder weights of BERT
- training a probing classifier per task on the hidden representations produced by the encoder
- examining which encoder layers the classifiers make use of (a minimal sketch of this follows the list)
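A minimal sketch of the first two steps, assuming a HuggingFace `bert-base-uncased` encoder in PyTorch (the names and shapes here are illustrative, not the paper's code):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pretrained BERT encoder and freeze all of its weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

# The frozen encoder only supplies representations: one tensor per layer.
inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs, output_hidden_states=True)

# hidden_states is a tuple of 13 tensors (embedding layer + 12 transformer layers),
# each of shape [batch, seq_len, hidden]; the probing classifiers are trained on top of these.
hidden_states = outputs.hidden_states
```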
1. Setup
The tasks that they perform are:
- POS tagging
- constituent labeling
- dependencies
- semantic role labeling
- entities
- coreference
- semantic proto-roles
- relation classification
Given one or more spans, the classifier produces a label. They use two probing analyses: one based on scalar mixing weights and one based on cumulative scoring.
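A toy version of such a span classifier, assuming PyTorch; the paper's probing architecture is more elaborate (learned span pooling and an MLP), whereas this sketch just mean-pools the span and applies a linear layer:

```python
import torch
from torch import nn

class SpanProbe(nn.Module):
    """Toy probe: mean-pool the token vectors in a span, then predict a label."""
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_reprs, span):
        start, end = span                                   # token indices of the span
        pooled = token_reprs[:, start:end, :].mean(dim=1)   # [batch, hidden]
        return self.classifier(pooled)                      # [batch, num_labels]

# e.g. a POS probe applied to one token span of some layer's representations
probe = SpanProbe(hidden_size=768, num_labels=45)           # 45 is illustrative (Penn Treebank POS tags)
logits = probe(torch.randn(1, 8, 768), span=(1, 2))
```

Only the probe's parameters are trained; the encoder stays frozen, so whatever the probe can predict must already be present in the encoder's representations.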
1.1. Scalar mixing weights
Per task \(\tau\), the classifier's input for the \(i\)th token is a learned weighted sum of the encoder layers, \(\mathbf{h}_{i,\tau} = \gamma_\tau \sum_{l=0}^{L} s_\tau^{(l)} \mathbf{h}_i^{(l)}\), where the \(l\)th layer \(\mathbf{h}_i^{(l)}\) is weighted by the scalar \(s^{(l)}_{\tau}\) (the scalars are softmax-normalized) and \(\gamma_\tau\) is a global scale. This method of weighting is borrowed from ELMo (peters18_deep_contex_word_repres).
For a given task, a higher weight on a layer can be interpreted as evidence that that layer is more important for the task.
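A small PyTorch sketch of this scalar mix (the `ScalarMix` class and its shapes are illustrative, not the paper's code):

```python
import torch
from torch import nn

class ScalarMix(nn.Module):
    """ELMo-style scalar mix: softmax-weighted sum of layer activations, times a global scale."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # a_tau: one scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))               # gamma_tau: global scale

    def forward(self, layers):
        # layers: list of [batch, seq_len, hidden] tensors, one per encoder layer
        s = torch.softmax(self.weights, dim=0)                 # s_tau = softmax(a_tau)
        return self.gamma * sum(w * h for w, h in zip(s, layers))

# Example with fake activations for 13 layers (embeddings + 12 transformer layers).
layers = [torch.randn(2, 7, 768) for _ in range(13)]
mix = ScalarMix(num_layers=13)
h_mixed = mix(layers)  # [2, 7, 768]
```

After the probe is trained, `torch.softmax(mix.weights, dim=0)` gives the per-layer weight profile that is inspected for each task.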
1.2. Cumulative scoring
For a given task, a collection \(\{P_\tau^{(l)}\}_l\) of classifiers is trained. The model \(P_\tau^{(0)}\) only has access to layer 0 of the encoder, the model \(P_\tau^{(1)}\) to layers 0 through 1, and so on. The difference in score achieved by \(P_\tau^{(l+1)}\) over \(P_\tau^{(l)}\) then tells us how much layer \(l+1\) contributes, and hence how many layers are needed for good performance.
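A toy illustration of the resulting differential scores, with made-up numbers standing in for the scores of \(P_\tau^{(0)}, \dots, P_\tau^{(5)}\):

```python
# Made-up scores for probes P^(0)..P^(5), where P^(l) sees encoder layers 0..l.
cumulative_scores = [0.62, 0.78, 0.88, 0.93, 0.95, 0.95]

# Differential score for layer l+1: how much it adds on top of layers 0..l.
deltas = [curr - prev for prev, curr in zip(cumulative_scores, cumulative_scores[1:])]
print(deltas)  # roughly [0.16, 0.10, 0.05, 0.02, 0.0] -> most of the gain comes from early layers
```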
2. Findings
They find that for simple tasks, the shallow layers of the encoder are sufficient, while more complex tasks require deeper layers. In other words, basic syntactic information appears early in the network, whereas semantic information is spread throughout all layers.