Tenney et al 2019 - BERT Rediscovers the Classical NLP Pipeline
notes for: tenney19_bert_redis_class_nlp_pipel
This paper examines the BERT model from devlin18_bert to determine what linguistic structure the model learns and makes use of. They do this by:
- freezing the encoder weights of BERT
- training a probing classifier per task on the hidden representations produced by the encoder
- examining which encoder layers the classifiers make use of (a minimal sketch of this follows the list)
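A minimal sketch of the first two steps, assuming a HuggingFace `bert-base-uncased` encoder in PyTorch (the names and shapes here are illustrative, not the paper's code):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pretrained BERT encoder and freeze all of its weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

# The frozen encoder only supplies representations: one tensor per layer.
inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs, output_hidden_states=True)

# hidden_states is a tuple of 13 tensors (embedding layer + 12 transformer layers),
# each of shape [batch, seq_len, hidden]; the probing classifiers are trained on top of these.
hidden_states = outputs.hidden_states
```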
1. Setup
The tasks that they perform are:
- POS tagging
- constituent labeling
- dependencies
- semantic role labeling
- entities
- coreference
- semantic proto-roles
- relation classification
Given one or more spans, the classifier produces a label. They use two probing analyses: one based on scalar mixing weights and one based on cumulative scoring.
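A toy version of such a span classifier, assuming PyTorch; the paper's probing architecture is more elaborate (learned span pooling and an MLP), whereas this sketch just mean-pools the span and applies a linear layer:

```python
import torch
from torch import nn

class SpanProbe(nn.Module):
    """Toy probe: mean-pool the token vectors in a span, then predict a label."""
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, token_reprs, span):
        start, end = span                                   # token indices of the span
        pooled = token_reprs[:, start:end, :].mean(dim=1)   # [batch, hidden]
        return self.classifier(pooled)                      # [batch, num_labels]

# e.g. a POS probe applied to one token span of some layer's representations
probe = SpanProbe(hidden_size=768, num_labels=45)           # 45 is illustrative (Penn Treebank POS tags)
logits = probe(torch.randn(1, 8, 768), span=(1, 2))
```

Only the probe's parameters are trained; the encoder stays frozen, so whatever the probe can predict must already be present in the encoder's representations.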
1.1. Scalar mixing weights
Per task \(\tau\), the classifier's input for the \(i\)th token is a learned weighted sum of the encoder layers, \(\mathbf{h}_{i,\tau} = \gamma_\tau \sum_{l=0}^{L} s_\tau^{(l)} \mathbf{h}_i^{(l)}\), where the \(l\)th layer \(\mathbf{h}_i^{(l)}\) is weighted by the scalar \(s^{(l)}_{\tau}\) (the scalars are softmax-normalized) and \(\gamma_\tau\) is a global scale. This method of weighting is borrowed from ELMo (peters18_deep_contex_word_repres).
For a given task, a higher weight on a layer can be interpreted as evidence that that layer is more important for the task.
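A small PyTorch sketch of this scalar mix (the `ScalarMix` class and its shapes are illustrative, not the paper's code):

```python
import torch
from torch import nn

class ScalarMix(nn.Module):
    """ELMo-style scalar mix: softmax-weighted sum of layer activations, times a global scale."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # a_tau: one scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))               # gamma_tau: global scale

    def forward(self, layers):
        # layers: list of [batch, seq_len, hidden] tensors, one per encoder layer
        s = torch.softmax(self.weights, dim=0)                 # s_tau = softmax(a_tau)
        return self.gamma * sum(w * h for w, h in zip(s, layers))

# Example with fake activations for 13 layers (embeddings + 12 transformer layers).
layers = [torch.randn(2, 7, 768) for _ in range(13)]
mix = ScalarMix(num_layers=13)
h_mixed = mix(layers)  # [2, 7, 768]
```

After the probe is trained, `torch.softmax(mix.weights, dim=0)` gives the per-layer weight profile that is inspected for each task.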
1.2. Cumulative scoring
For a given task, a collection \(\{P_\tau^{(l)}\}_l\) of classifiers is trained. The model \(P_\tau^{(0)}\) only has access to layer 0 of the encoder, the model \(P_\tau^{(1)}\) to layers 0 through 1, and so on. The difference in score achieved by \(P_\tau^{(l+1)}\) over \(P_\tau^{(l)}\) then tells us how much layer \(l+1\) contributes, and hence how many layers are needed for good performance.
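A toy illustration of the resulting differential scores, with made-up numbers standing in for the scores of \(P_\tau^{(0)}, \dots, P_\tau^{(5)}\):

```python
# Made-up scores for probes P^(0)..P^(5), where P^(l) sees encoder layers 0..l.
cumulative_scores = [0.62, 0.78, 0.88, 0.93, 0.95, 0.95]

# Differential score for layer l+1: how much it adds on top of layers 0..l.
deltas = [curr - prev for prev, curr in zip(cumulative_scores, cumulative_scores[1:])]
print(deltas)  # roughly [0.16, 0.10, 0.05, 0.02, 0.0] -> most of the gain comes from early layers
```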
2. Findings
They find that for simple tasks, the shallow layers of the encoder are sufficient, while more complex tasks require deeper layers. In other words, basic syntactic information appears early in the network, whereas semantic information is spread throughout all layers.