
Merrill 2021 – Formal Language Theory Meets Modern NLP

Notes for merrill21_formal_languag_theor_meets_moder_nlp

Really nice overview of formal languages and complexity

1. regular languages

  • built from union, concatenation, and Kleene star operations
  • Kleene's Theorem: regular languages are the languages recognized by DFA and NFA (note that DFA and NFA recognize same set of languages!)
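
A minimal DFA sketch (my own example, not from the talk): a two-state transition table for the regular language of strings over \(\{a, b\}\) with an even number of a's.

  # DFA as a transition table (illustrative sketch): accepts strings over
  # {a, b} that contain an even number of a's.
  TRANSITIONS = {
      ("even", "a"): "odd",
      ("even", "b"): "even",
      ("odd", "a"): "even",
      ("odd", "b"): "odd",
  }
  START, ACCEPTING = "even", {"even"}

  def dfa_accepts(string):
      state = START
      for symbol in string:
          state = TRANSITIONS[(state, symbol)]
      return state in ACCEPTING

  assert dfa_accepts("abba")     # two a's -> accept
  assert not dfa_accepts("ab")   # one a  -> reject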

1.1. learning

  • trying to learn the smallest DFA that matches a set of labeled strings is NP-complete
  • but learning the smallest DFA is possible given a polynomial number of acceptance (membership) and equivalence queries

2. data structures

  • with an unbounded data structure (a stack or a counter), an automaton can recognize more complex languages
  • languages recognized by pushdown automata and counter automata include \(a^nb^n\), which no DFA can recognize (a counter sketch follows below)
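
A minimal sketch (my own, assuming a single counter that is checked for zero at the end) of a counter-style recognizer for \(a^nb^n\):

  # One-counter recognizer for a^n b^n (illustrative sketch).
  # Increment on 'a', decrement on 'b'; reject if an 'a' follows a 'b' or the
  # counter goes negative; accept iff the counter ends at exactly zero.
  def counter_accepts_anbn(string):
      count = 0
      seen_b = False
      for symbol in string:
          if symbol == "a":
              if seen_b:           # an 'a' after a 'b' breaks the a^n b^n shape
                  return False
              count += 1
          elif symbol == "b":
              seen_b = True
              count -= 1
              if count < 0:        # more b's than a's so far
                  return False
          else:
              return False
      return count == 0

  assert counter_accepts_anbn("aaabbb")
  assert not counter_accepts_anbn("aabbb")
  assert not counter_accepts_anbn("abab")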

2.1. limitations of counter

  • the \(k\)-Dyck language is the set of all balanced bracket strings built from \(k\) types of brackets
  • counter automata can't recognize the \(k\)-Dyck language (for \(k \geq 2\))
    • why can't you just keep one counter per bracket type? Because it matters which bracket was opened more recently, not just how many of each are open; counters alone will accept a string like [(]) (see the sketch below)
  • important: counters can't recognize context-free languages that require this kind of compositional, nested structure
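
A small sketch (my own) of the failure mode: a per-type counter check happily accepts the ill-nested string [(]), while a stack-based check rejects it.

  # Counters vs. a stack on 2-Dyck-style strings (illustrative sketch).
  PAIRS = {")": "(", "]": "["}

  def counters_balanced(s):
      counts = {"(": 0, "[": 0}
      for ch in s:
          if ch in counts:
              counts[ch] += 1
          else:
              counts[PAIRS[ch]] -= 1
              if counts[PAIRS[ch]] < 0:
                  return False
      return all(c == 0 for c in counts.values())

  def stack_balanced(s):
      stack = []
      for ch in s:
          if ch in PAIRS.values():
              stack.append(ch)
          elif not stack or stack.pop() != PAIRS[ch]:
              return False
      return not stack

  assert counters_balanced("[(])")    # the counters are fooled
  assert not stack_balanced("[(])")   # the stack correctly rejects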

2.2. limitations of stack

  • a deterministic pushdown automaton can recognize any deterministic context-free language (DCFL), which is a strict subset of the CFLs
  • pushdown automata are not strictly more powerful than counter automata: each can recognize languages the other can't

2.3. non-determinism

  • CFLs are a strict superset of DCFLs
  • the full class of CFLs is recognized by non-deterministic pushdown automata; deterministic PDAs only recognize the DCFLs

2.4. learning

  • grammar induction problem: learn grammar rules from samples of the language
  • one approach: frame data structures (stacks, counters) as differentiable objects
  • the model then learns to use the data structure end-to-end (see the sketch below)
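
A minimal sketch (my own, not the specific architecture from the talk) of a differentiable stack update: the stack is a matrix of slot vectors, and push/pop/no-op become soft, weighted combinations, so gradients can flow through the choice of action.

  import numpy as np

  # Soft stack update (illustrative sketch): the action weights sum to 1,
  # so the new stack is a convex combination of push / pop / no-op results.
  def soft_stack_step(stack, value, a_push, a_pop, a_noop):
      """stack: (depth, dim) array; value: (dim,) vector to (softly) push."""
      shifted_down = np.vstack([value, stack[:-1]])              # hard push result
      shifted_up = np.vstack([stack[1:], np.zeros_like(value)])  # hard pop result
      return a_push * shifted_down + a_pop * shifted_up + a_noop * stack

  stack = np.zeros((4, 3))
  stack = soft_stack_step(stack, np.array([1.0, 0.0, 0.0]),
                          a_push=0.9, a_pop=0.05, a_noop=0.05)
  print(stack[0])  # the top slot is now mostly the pushed vector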

3. aside: why is it called context free?

  • it's called context free because each production rule is applied without attending to context. That is, when I apply \(S \rightarrow a\), I don't need to know whether the \(S\) is being expanded as part of \(aSa\) or \(aaSaa\) (the toy generator below makes this concrete)
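
A toy illustration (my own): the generator below expands \(S\) by looking only at the rule it picked, never at the symbols surrounding the nonterminal.

  import random

  # Toy CFG generator for S -> aSa | a (illustrative sketch).
  GRAMMAR = {"S": [["a", "S", "a"], ["a"]]}

  def generate(symbols=("S",)):
      out = []
      for sym in symbols:
          if sym in GRAMMAR:                         # nonterminal: rewrite it
              out.extend(generate(random.choice(GRAMMAR[sym])))
          else:                                      # terminal: keep it
              out.append(sym)
      return out

  print("".join(generate()))  # e.g. "a", "aaa", "aaaaa": odd-length runs of a's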

4. where does language fall in chomsky hierarchy?

  • Turing machines, and tape-bounded Turing machines (linear bounded automata, LBAs), are at the top of the hierarchy
  • linguists believe natural language is mildly context-sensitive, which puts it beyond context-free but within the power of an LBA
  • what is the right formalism? Tree Adjoining Grammar or Combinatory Categorial Grammar?

5. hypercomputation

  • more powerful than TMs? Doesn't that violate the claim that TMs are universal model of computation?
  • hypercomputation models are infinite machines, whereas TMs are finitely specified machines that can nonetheless recognize infinitely many strings
  • there are only countably many TMs, so only countably many languages can be recognized; but there are uncountably many formal languages, so some languages are recognized by no TM at all (see the counting argument below)
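
The counting argument, spelled out (standard cardinality facts, filling in the uncertainty above):

\[
|\{\text{Turing machines}\}| = \aleph_0,
\qquad
|\{\text{languages over } \Sigma\}| = |\mathcal{P}(\Sigma^*)| = 2^{\aleph_0} > \aleph_0,
\]

so almost every formal language is unrecognizable by any TM.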

6. weighted finite state automata

  • Goal: given a string, assign a number
  • Architecture: Like a DFA, but the transitions are weighted
  • How do you get the number? Sum the path-scores over all paths through the automaton induced by the string.
  • How do you get the path-score? Multiply the transition weights along the path (see the matrix sketch below).
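
A small sketch in my own notation: represent the WFA by an initial vector \(\alpha\), one transition matrix \(A_\sigma\) per symbol, and a final vector \(\beta\); the score of a string is then \(\alpha^\top A_{x_1} \cdots A_{x_n} \beta\), which implicitly sums the products of transition weights over all paths.

  import numpy as np

  # A 2-state WFA (illustrative sketch) that computes the number of 'a's in a string.
  alpha = np.array([1.0, 0.0])             # initial weights
  beta = np.array([0.0, 1.0])              # final weights
  A = {
      "a": np.array([[1.0, 1.0],           # reading 'a' adds 1 to the running count
                     [0.0, 1.0]]),
      "b": np.array([[1.0, 0.0],           # reading 'b' leaves the count unchanged
                     [0.0, 1.0]]),
  }

  def wfa_score(string):
      vec = alpha
      for symbol in string:
          vec = vec @ A[symbol]
      return vec @ beta

  assert wfa_score("abab") == 2.0
  assert wfa_score("bbb") == 0.0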

7. Hankel matrices

  • an infinite matrix, indexed by prefixes and suffixes, that can represent any function on strings (and hence any formal language)
  • connection to WFAs: for a function \(f: \Sigma^* \rightarrow \mathbb{Q}\), the Hankel matrix \(H_f\) has finite rank iff there exists a WFA computing \(f\). Furthermore, \(\text{rank}(H_f)\) is the number of states of the minimal such WFA (small numerical check below).
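
A small numerical check (my own example): build a finite block of the Hankel matrix for \(f(w) =\) number of a's in \(w\) (the function computed by the 2-state WFA above) and verify that the block has rank 2.

  import numpy as np
  from itertools import product

  # Finite block of the Hankel matrix H[p, s] = f(p + s) for f(w) = count of 'a's.
  def f(w):
      return w.count("a")

  words = [""] + ["".join(t) for n in (1, 2) for t in product("ab", repeat=n)]
  H = np.array([[f(p + s) for s in words] for p in words])
  print(np.linalg.matrix_rank(H))  # 2, matching the 2-state WFA for this function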

8. Neural networks

8.1. hypercomputation

Say that the neurons can be specified by real-number weights of arbitrary precision. Then the network is capable of hypercomputation, since a single real weight can encode an infinite amount of information (for example, an infinite, even uncomputable, sequence of bits).

8.2. restriction to rational number weights

If the weights are rational numbers of arbitrary precision, then the neural networks are Turing-complete. However, arbitrary precision is not available in any practical setting, and it is unlikely that the right weights would be found by gradient descent anyway.

8.3. binarized neurons

  • weights can still be rational numbers, but activations are restricted to \(\{0,1\}\).
  • much weaker than TM

9. saturated networks

  • previously, when we talked about NNs, we included the possibility of infinite-precision weights
  • to move away from this assumption, consider saturated networks, in which all the weights are multiplied by \(\rho\) and we take the limit \(\rho \rightarrow \infty\)
  • then, e.g., a sigmoid activation becomes a step function (see the numerical illustration below)
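
A quick numerical illustration (my own): scale a single sigmoid unit's pre-activation by a growing \(\rho\) and the output approaches a hard step.

  import numpy as np

  # As rho grows, sigmoid(rho * (w*x + b)) approaches a step at w*x + b = 0,
  # i.e. at x = 0.5 for these weights -- the "saturated" behavior.
  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  w, b = 2.0, -1.0
  x = np.array([0.3, 0.49, 0.51, 0.7])
  for rho in (1, 10, 500):
      print(rho, np.round(sigmoid(rho * (w * x + b)), 3))
  # rho = 500 gives outputs that are effectively 0 or 1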

10. saturated networks as automata

  • empirical conjecture: the saturated network captures what is most stable about the original network

10.1. sRNNs

  • empirically equivalent to finite-state automata

10.2. sLSTMs

  • resemble counter automata

10.3. saturated stack RNNs

10.4. sCNNs

  • less expressive than FSA

10.5. sTransformers

  • capable only of a limited form of counting

11. open questions

  • how to tell whether a model has learned deep formal structure or only spurious surface patterns

12. Bib

Created: 2024-07-15 Mon 01:28