Kim 2021 – Sequence-to-Sequence Learning with Latent Neural Grammars

kim21_sequen_to_sequen_learn_with

1. QCFGs

For a full description of Quasi-Synchronous CFGs (QCFGs), see smith-eisner-2006-quasi

1.1. motivation

Synchronous CFGs are often used for sequence-to-sequence tasks, e.g. machine translation
but synchronous CFGs require the source and target trees to be generated in lock-step
so the source and target trees must be isomorphic
instead, we would like the nodes in the target tree to have arbitrary (possibly one-to-many, possibly null) alignments with the nodes of the source tree

1.2. specification

for a given source sentence, take a parse tree \(T_1\) of the sentence
Then, given this tree, we dynamically generate a set of CFG rules to produce parse trees in the target
- The rules of this grammar are the usual rules of a monolingual grammar, but for each rule, there is now a family of rules where each non-terminal is annotated with a subset of the source \(T_1\) nodes
- That is, the rules now make reference to arbitrary sets of nodes in the source tree
Q: How can our model generalize? The number of possible sets of source nodes is exponential in the size of the source tree:
- A1: limit the set of source nodes per non-terminal to be of size 1
- A2: our model only cares about the features of these source node trees. The features can be hand-crafted or given by a NN
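
A minimal Python sketch of A1: every non-terminal in a monolingual rule is paired with exactly one source node, so each rule A -> B C expands into a family A[a] -> B[b] C[c]. The toy source nodes and base rules are hypothetical, and no rule probabilities or features (A2) are modeled here.

```python
from itertools import product

# Hypothetical toy source tree T_1: node id -> source tokens the node covers.
source_nodes = {
    "n0": ("the", "dog"),
    "n1": ("the",),
    "n2": ("dog",),
}

# Hypothetical monolingual binary rules of the form (A, B, C), meaning A -> B C.
base_rules = [("S", "NP", "VP"), ("NP", "D", "N")]


def annotate_rules(base_rules, source_nodes):
    """Expand each rule A -> B C into A[a] -> B[b] C[c] for all source nodes a, b, c."""
    nodes = sorted(source_nodes)
    annotated = []
    for lhs, left, right in base_rules:
        # A1: each non-terminal is aligned to a single source node (a set of size 1).
        for a, b, c in product(nodes, repeat=3):
            annotated.append(((lhs, a), (left, b), (right, c)))
    return annotated


rules = annotate_rules(base_rules, source_nodes)
print(len(rules))  # 2 base rules * 3^3 node assignments = 54
```

Even with sets of size 1, the rule family grows cubically in the number of source nodes, which is presumably why A2 matters: scoring rules from node features avoids learning a separate parameter per annotated rule.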

2. learning

maximize the log marginal likelihood \(\log p_{\theta, \phi}(y\mid x)\)
- we get this by marginalizing over all possible trees \(s\) for the source \(x\) and all possible target trees \(t\) that produce the target \(y\).
Actually maximize the lower bound \(\mathbb{E}_{s \sim p_{\phi}(s\mid x)}\left[ \log p_{\theta} (y \mid s) \right]\)
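
The lower bound follows from Jensen's inequality applied to the marginalization over source trees. A short derivation, assuming \(\phi\) parameterizes the source parser and \(\theta\) the target grammar; the notation \(p_{\theta}(t \mid s)\) and \(\mathrm{yield}(t)\) is introduced here for illustration:

```latex
\begin{align*}
\log p_{\theta,\phi}(y \mid x)
  &= \log \sum_{s} p_{\phi}(s \mid x)\, p_{\theta}(y \mid s) \\
  &= \log \mathbb{E}_{s \sim p_{\phi}(s \mid x)}\left[ p_{\theta}(y \mid s) \right] \\
  &\ge \mathbb{E}_{s \sim p_{\phi}(s \mid x)}\left[ \log p_{\theta}(y \mid s) \right],
\qquad
p_{\theta}(y \mid s) = \sum_{t \,:\, \mathrm{yield}(t) = y} p_{\theta}(t \mid s)
\end{align*}
```

The inner sum over target trees \(t\) stays inside the logarithm; only the sum over source trees \(s\) is moved outside by the bound.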

3. prediction

Given a source sentence \(x\), find the \(\arg\max\) parse \(\hat{s}\). For \(\hat{s}\), we would like the most likely sequence \(y\) produced by the specific grammar for \(\hat{s}\). Finding that sequence exactly is NP-hard, so we instead sample \(K\) target trees from this grammar.
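
A minimal Python sketch of the sampling step, using a hypothetical toy PCFG in place of the grammar induced by \(\hat{s}\). The selection rule (keep the sampled derivation with the highest log-probability) is an assumption for illustration and is not taken from these notes.

```python
import math
import random

# Hypothetical toy PCFG standing in for the grammar built from the source parse.
# Each non-terminal maps to a list of (right-hand side, probability) pairs.
GRAMMAR = {
    "S": [(("A", "B"), 1.0)],
    "A": [(("x",), 0.7), (("y",), 0.3)],
    "B": [(("z",), 0.6), (("A", "A"), 0.4)],
}


def sample_yield(symbol, rng):
    """Sample one derivation top-down; return (terminals, log-probability)."""
    if symbol not in GRAMMAR:  # anything without rules is a terminal
        return [symbol], 0.0
    options = GRAMMAR[symbol]
    rhs, prob = rng.choices(options, weights=[p for _, p in options], k=1)[0]
    words, logp = [], math.log(prob)
    for child in rhs:
        child_words, child_logp = sample_yield(child, rng)
        words.extend(child_words)
        logp += child_logp
    return words, logp


def approximate_decode(k, seed=0):
    """Sample k derivations and keep the one with the highest log-probability."""
    rng = random.Random(seed)
    samples = [sample_yield("S", rng) for _ in range(k)]
    return max(samples, key=lambda sample: sample[1])


best_words, best_logp = approximate_decode(k=50)
print(" ".join(best_words), best_logp)
```

Sampling only visits derivations the grammar itself considers likely, so a modest \(K\) often covers the high-probability outputs, though nothing guarantees the true \(\arg\max\) is among them.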

4. conclusion

performs decently on:
- machine translation
- style transfer
- compositional generalization
as compared to:
- LSTM baseline
- fine-tuned transformer
can be useful for:
- generating additional data for more unconstrained NN models

5. thoughts

Note that the grammar is completely latent. The tasks are learned end-to-end. What sort of hand-tuned/pre-trained constraints could be put on the monolingual grammars?
The spectrum of how much inductive bias an approach encodes runs, from most to least:
- CFGs -> QCFGs -> QCFGs + Neural Net features (this paper) -> transformers (no constraints at all on what sort of dependencies can be represented)
- QCFGs were originally introduced to handle weird dependencies of the target on the source. Currently, the most performant approaches are based on transformers, which can model any dependency. But QCFGs give more explicit insight into what exact dependencies are being represented.

6. basics that I need to more fully understand

synchronous tree adjoining grammars
inside algorithm
evidence lower bound
unbiased Monte Carlo approximation of the gradient – see reinforcement learning
perplexity
posterior regularization

7. bib

bibliography/references.bib