Kim 2021 – Sequence-to-Sequence Learning with Latent Neural Grammars
kim21_sequen_to_sequen_learn_with
1. QCFGs
For a full description of Quasi-Synchronous CFGs, see smith-eisner-2006-quasi
1.1. motivation
- Synchronous CFGs are often used for sequence-to-sequence tasks, e.g. machine translation
- but synchronous CFGs require the source and target trees to be generated in lock-step
- so there is an isomorphism between the source and target trees
- instead, we would like the nodes in the target tree to have arbitrary (possibly one-to-many, possibly null) alignments with the nodes of the source tree
1.2. specification
- for a given source sentence, take a parse tree \(T_1\) of the sentence
- Then, given this tree, we dynamically generate a set of CFG rules to produce parse trees in the target
- The rules of this grammar are the usual rules of a monolingual grammar, but for each rule there is now a family of rules in which each non-terminal is annotated with a subset of the nodes of the source tree \(T_1\)
- That is, the rules now make reference to arbitrary sets of nodes in the source tree
- Q: How can our model generalize? There are so many possible sets of source nodes:
- A1: limit the set of source nodes per non-terminal to be of size 1
- A2: our model only cares about the features of these source nodes (i.e. of the subtrees rooted at them). The features can be hand-crafted or given by a NN (a small sketch follows this list)
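To make the rule-family idea concrete, here is a minimal Python sketch of how the target-side rules could be instantiated for one source tree under restriction A1 (one source node per non-terminal). The names `SourceNode`, `base_rules`, and `instantiate_qcfg` are my own illustration, not from the paper's code; in the actual model, rule scores are additionally computed from (neural) features of the aligned source nodes rather than stored per rule.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class SourceNode:
    span: tuple   # (i, j) span of the node in the source sentence
    label: str    # nonterminal label of the source node

# Target-side monolingual rules A -> B C (binary for simplicity)
base_rules = [("S", "NP", "VP"), ("VP", "V", "NP")]

def instantiate_qcfg(base_rules, source_nodes):
    """For each base rule A -> B C, emit the family of annotated rules
    A[a_i] -> B[a_j] C[a_k], one per triple of source nodes
    (restriction A1: each non-terminal aligns to exactly one source node)."""
    rules = []
    for A, B, C in base_rules:
        for a_i, a_j, a_k in product(source_nodes, repeat=3):
            rules.append(((A, a_i), (B, a_j), (C, a_k)))
    return rules

# Example: two source nodes already give 2 rules * 2^3 alignments = 16 rules,
# which is why rule scores must come from features of the aligned nodes
# rather than a separate parameter per annotated rule.
nodes = [SourceNode((0, 2), "NP"), SourceNode((2, 5), "VP")]
print(len(instantiate_qcfg(base_rules, nodes)))  # 16
```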
2. learning
- maximize \(\log p_{\theta, \phi}(y \mid x)\)
- we get this by marginalizing over all possible source trees \(s\) of \(x\) and all possible target trees \(t\) that yield \(y\)
- in practice, maximize the lower bound \(\mathbb{E}_{s \sim p_{\phi}(s\mid x)}\left[ \log p_{\theta} (y \mid s) \right]\), which follows from Jensen's inequality (a sketch of one gradient step follows this list)
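A minimal sketch of one stochastic gradient step on this lower bound, assuming PyTorch and hypothetical `phi_parser.sample` / `theta_qcfg.log_prob` interfaces (neither is from the paper's code): the sampled source tree gives a score-function (REINFORCE-style) gradient for \(\phi\), while \(\theta\) gets an ordinary pathwise gradient.

```python
import torch  # assumes both log-probabilities below are PyTorch tensors

def lower_bound_step(x, y, phi_parser, theta_qcfg, optimizer):
    """One Monte Carlo gradient step on
    E_{s ~ p_phi(s|x)} [ log p_theta(y|s) ].
    phi_parser.sample and theta_qcfg.log_prob are assumed interfaces:
    the former returns a sampled source tree and its log-probability,
    the latter scores the target string via the inside algorithm."""
    s, log_q = phi_parser.sample(x)        # s ~ p_phi(s|x), log_q = log p_phi(s|x)
    log_p = theta_qcfg.log_prob(y, s)      # log p_theta(y|s)

    # Surrogate loss: backprop through log_p for theta, and use the
    # score-function term log_p.detach() * log_q for phi.
    loss = -(log_p + log_p.detach() * log_q)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return log_p.item()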
3. prediction
Given a source sentence \(x\), find the \(\arg\max\) parse \(\hat{s}\) under \(p_{\phi}(s \mid x)\). Then, find the most likely target sequence \(y\) under the grammar induced by \(\hat{s}\). This exact search is NP-hard, so we instead sample \(K\) target trees from the grammar.
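A rough illustration of this approximate decoding, reusing the hypothetical interfaces from the learning sketch plus assumed `argmax_parse`, `sample_tree`, and `yield_string` methods; reranking the \(K\) sampled yields by their log-probability is one plausible way to pick the output, and is my assumption rather than something stated in these notes.

```python
def approximate_decode(x, phi_parser, theta_qcfg, K=10):
    """Approximate MAP decoding: take the argmax source parse, sample K
    target trees from the grammar it induces, and return the yield
    (target string) whose probability under that grammar is highest."""
    s_hat = phi_parser.argmax_parse(x)                    # \hat{s}
    samples = [theta_qcfg.sample_tree(s_hat) for _ in range(K)]
    candidates = {t.yield_string() for t in samples}      # dedupe the yields
    return max(candidates, key=lambda y: theta_qcfg.log_prob(y, s_hat))
```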
4. conclusion
- performs decently on:
- machine translation
- style transfer
- compositional generalization
- as compared to:
- LSTM baseline
- fine-tuned transformer
- can be useful for:
- generating additional data for more unconstrained NN models
5. thoughts
- Note that the grammar is completely latent. The tasks are learned end-to-end. What sort of hand-tuned/pre-trained constraints could be put on the monolingual grammars?
- The spectrum of how much inductive bias an approach encodes runs:
- CFGs –> QCFGs –> QCFGs + neural-net features (this paper) –> transformers (no constraints on what sort of dependencies can be represented)
- QCFGs were originally introduced to handle weird dependencies of the target on the source. Currently, the most performant approaches are based on transformers, which can model any dependency. But QCFGs give more explicit insight into what exact dependencies are being represented.
6. basics that I need to more fully understand
- synchronous tree adjoining grammars
- inside algorithm
- evidence lower bound
- unbiased Monte Carlo approximation of the gradient – see the score-function (REINFORCE) estimator from reinforcement learning
- perplexity
- posterior regularization