# Merrill 2021 – Formal Language Theory Meets Modern NLP

Notes for merrill21_formal_languag_theor_meets_moder_nlp

Really nice overview of formal languages and complexity

## 1. regular languages

- built from union, concatenation, and Kleene star operations
- Kleene's Theorem: the regular languages are exactly the languages recognized by DFAs and NFAs (note that DFAs and NFAs recognize the same set of languages!) — see the small DFA sketch below
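
For concreteness, a minimal DFA sketch (my own toy example, not from the paper): two states suffice to recognize the regular language of strings over \(\{a, b\}\) with an even number of \(a\)'s.

```python
# Hypothetical example: a DFA accepting strings over {a, b}
# with an even number of a's (a regular language).
DFA = {
    "start": "even",
    "accept": {"even"},
    "delta": {
        ("even", "a"): "odd",
        ("even", "b"): "even",
        ("odd", "a"): "even",
        ("odd", "b"): "odd",
    },
}

def dfa_accepts(dfa, string):
    state = dfa["start"]
    for symbol in string:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accept"]

assert dfa_accepts(DFA, "abba")       # two a's -> accept
assert not dfa_accepts(DFA, "abb")    # one a -> reject
```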

### 1.1. learning

- learning the smallest DFA that matches a given set of strings is NP-complete
- but learning the smallest DFA is possible with a polynomial number of membership and equivalence queries (Angluin-style learning; interface sketch below)
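
A rough sketch of the query interface behind Angluin-style (L\*) learning (my own illustrative stub, not the algorithm itself): the learner can ask membership queries ("is this string in the language?") and equivalence queries ("is this hypothesis correct?", answered with a counterexample when it is not).

```python
# Sketch of the teacher interface assumed by Angluin-style DFA learning.
# Illustrative stub only, not the L* algorithm itself.
from itertools import product

class Teacher:
    def __init__(self, target):
        self.target = target            # target: function str -> bool

    def membership(self, s):
        # "Is the string s in the target language?"
        return self.target(s)

    def equivalence(self, hypothesis, alphabet="ab", max_len=6):
        # Approximate equivalence query: brute-force search for a
        # counterexample up to a bounded length (a real teacher would
        # decide exact equivalence).
        for n in range(max_len + 1):
            for chars in product(alphabet, repeat=n):
                s = "".join(chars)
                if hypothesis(s) != self.target(s):
                    return s            # counterexample
        return None                     # no disagreement found

# Toy target language: strings with an even number of a's
teacher = Teacher(lambda s: s.count("a") % 2 == 0)
print(teacher.membership("abba"))              # True
print(teacher.equivalence(lambda s: True))     # a counterexample, e.g. "a"
```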

## 2. data structures

- with an infinitely long data structure (stack or counter), we can have more complex languages
- languages recognized by pushdown automata and counter automata include \(a^nb^n\), which can't be recognized by any DFA (sketch below)
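
A minimal sketch of the counter idea (my own illustration, not from the paper): a single counter recognizes \(a^nb^n\) by incrementing on `a`, decrementing on `b`, and rejecting on any out-of-order symbol or negative count.

```python
# Hypothetical sketch: a one-counter recognizer for a^n b^n (n >= 0).
def accepts_anbn(s):
    count = 0
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:          # an 'a' after a 'b' is out of order
                return False
            count += 1
        elif ch == "b":
            seen_b = True
            count -= 1
            if count < 0:       # more b's than a's so far
                return False
        else:
            return False
    return count == 0           # all a's matched by b's

assert accepts_anbn("aaabbb")
assert not accepts_anbn("aabbb")
assert not accepts_anbn("abab")
```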

### 2.1. limitations of counter

- the \(k\)-Dyck language is the set of all well-matched paren strings with \(k\) types of parens
- counter automata can't recognize the \(k\)-Dyck language (for \(k \geq 2\))
- why can't you just keep a counter for each type of paren? Because it matters *which* paren came earlier in the string, not just how many are open. For example, counters alone can't reject `[(])` (see the sketch after this list)

- important: counter automata can't recognize context-free languages whose structure requires compositional (nested) matching
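
To make the failure concrete, here's a small contrast (my own sketch, not from the paper): a per-type counter check wrongly accepts `[(])` because both counts balance and never go negative, while a stack-based check rejects it.

```python
# Sketch: why per-type counters can't check 2-Dyck, but a stack can.
PAIRS = {")": "(", "]": "["}

def counters_accept(s):
    # One counter per bracket type; ignores relative order across types.
    counts = {"(": 0, "[": 0}
    for ch in s:
        if ch in counts:
            counts[ch] += 1
        else:
            counts[PAIRS[ch]] -= 1
            if counts[PAIRS[ch]] < 0:
                return False
    return all(v == 0 for v in counts.values())

def stack_accepts(s):
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif not stack or stack.pop() != PAIRS[ch]:
            return False
    return not stack

print(counters_accept("[(])"))  # True  -- wrongly accepted
print(stack_accepts("[(])"))    # False -- correctly rejected
```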

### 2.2. limitations of stack

- a stack (deterministic pushdown automaton) can match any deterministic context-free language (DCFL), which is a strict subset of the CFLs
- pushdown automata are not strictly more powerful than counter automata: counter machines can recognize some non-context-free languages, e.g. \(a^nb^nc^n\)

### 2.3. non-determinism

- CFLs are a strict superset of DCFLs
- the full class of CFLs is recognized by non-deterministic pushdown automata; deterministic pushdown automata only get the DCFLs

### 2.4. learning

- grammar induction problem: learn grammar rules from samples
- one approach: frame the data structure (stack, counter) as a differentiable object
- the model then learns to use the data structure end-to-end (see the sketch below)
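
A rough numpy sketch of the differentiable-stack idea (my own illustration, loosely in the spirit of continuous stacks such as Joulin & Mikolov 2015, not a construction from the paper): push, pop, and no-op become soft operations mixed by gates in \([0, 1]\), so the controller's use of the stack is differentiable.

```python
# Sketch of a continuous ("differentiable") stack: the stack is a fixed-size
# buffer, and each update is a convex combination of push / pop / no-op,
# weighted by soft gates the model could learn.
import numpy as np

def soft_stack_update(stack, new_top, gates):
    """stack: (depth, dim) array; new_top: (dim,) vector;
    gates: (push, pop, noop) nonnegative weights summing to 1."""
    push, pop, noop = gates
    pushed = np.vstack([new_top, stack[:-1]])                   # shift down, insert on top
    popped = np.vstack([stack[1:], np.zeros_like(stack[:1])])   # shift up, zero at the bottom
    return push * pushed + pop * popped + noop * stack

stack = np.zeros((4, 3))
stack = soft_stack_update(stack, np.ones(3), gates=(1.0, 0.0, 0.0))    # hard push
stack = soft_stack_update(stack, 2 * np.ones(3), gates=(0.7, 0.1, 0.2))
print(stack[0])   # soft "top of stack": mostly the newly pushed vector
```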

## 3. aside: why is it called context free?

- it's called context free because each of the production rules is applied without attending to context. That is, when I apply \(S \rightarrow a\), I don't need to know whether the \(S\) is being expanded inside \(aSa\) or \(aaSaa\) (toy example below)
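
A toy illustration (mine, not the paper's): rewriting the nonterminal `S` works the same way no matter what surrounds it.

```python
# Toy illustration of context-free rule application: S -> b.
def apply_rule(sentential_form, lhs="S", rhs="b"):
    # Context-free: replace the nonterminal without inspecting
    # the symbols around it.
    return sentential_form.replace(lhs, rhs, 1)

print(apply_rule("aSa"))     # 'aba'
print(apply_rule("aaSaa"))   # 'aabaa'
```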

## 4. where does language fall in chomsky hierarchy?

- Turing machines, and Turing machines with linearly bounded tape (linear bounded automata, LBAs), sit at the top of the hierarchy
- linguists believe natural language is mildly context-sensitive, which would require an LBA
- what is the right formalism? Tree Adjoining Grammar or Combinatory Categorial Grammar?

## 5. hypercomputation

- more powerful than TMs? Doesn't that violate the claim that TMs are universal model of computation?
- hypercomputation models are effectively infinite computers, vs. TMs, which are finite machines that can nonetheless recognize infinitely many strings
- there are countably many TMs, so only countably many languages can be recognized by TMs; but there are uncountably many formal languages (counting argument below)
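
A quick version of the counting argument (standard set theory, not specific to the paper): TMs have finite descriptions, so they can be enumerated, while languages are arbitrary subsets of \(\Sigma^*\).

\[
|\{\text{Turing machines}\}| = \aleph_0
\qquad\text{vs.}\qquad
|\{\text{languages over } \Sigma\}| = |2^{\Sigma^*}| = 2^{\aleph_0} > \aleph_0
\]

So there must be (uncountably many) languages that no TM recognizes.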

## 6. weighted finite state automata

- Goal: given a string, assign a number
- Architecture: Like a DFA, but the transitions are weighted
- How do you get the number? Sum the path-scores over all paths the string induces in the automaton (sketch below).
- How do you get the path-score? Multiply the transition weights along the path.
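
A small numpy sketch of the linear-algebra view (my own illustration): with one transition matrix per symbol, an initial vector, and a final vector, the matrix product automatically sums the path-scores (products of transition weights) over all paths.

```python
# Sketch: scoring a string with a weighted finite-state automaton (WFA).
# score(w) = alpha^T * A[w1] * A[w2] * ... * A[wn] * beta
# (the matrix form sums path-scores over all paths automatically).
import numpy as np

alpha = np.array([1.0, 0.0])          # initial weights over 2 states
beta  = np.array([0.0, 1.0])          # final weights
A = {                                  # one transition matrix per symbol
    "a": np.array([[0.5, 0.5],
                   [0.0, 1.0]]),
    "b": np.array([[1.0, 0.0],
                   [0.2, 0.8]]),
}

def wfa_score(string):
    vec = alpha
    for symbol in string:
        vec = vec @ A[symbol]          # advance all paths one step
    return float(vec @ beta)

print(wfa_score("ab"))   # sum over every path for "ab" of its weight product
```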

## 7. Hankel matrices

- an infinite matrix that represents a function over strings: rows are indexed by prefixes, columns by suffixes, and \(H_f[x, y] = f(xy)\)
- connection to WFAs: for a function \(f: \Sigma^* \rightarrow \mathbb{Q}\), the Hankel matrix \(H_f\) has finite rank iff there exists a WFA computing \(f\); furthermore, \(\text{rank}(H_f)\) is the number of states of the minimal such WFA (toy check below)
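
A toy numerical check (hypothetical example, mine): build a finite sub-block of \(H_f\) for \(f(w) = 0.5^{|w|}\), which is computable by a one-state WFA, and confirm the block has rank 1.

```python
# Sketch: a finite sub-block of the Hankel matrix H_f[x, y] = f(xy)
# for the toy function f(w) = 0.5 ** len(w) (computable by a 1-state WFA).
import numpy as np

f = lambda w: 0.5 ** len(w)
prefixes = ["", "a", "b", "ab"]
suffixes = ["", "a", "b", "ba"]

H = np.array([[f(x + y) for y in suffixes] for x in prefixes])
print(np.linalg.matrix_rank(H))   # 1, matching the 1-state WFA for f
```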

## 8. Neural networks

### 8.1. hypercomputation

Say that the neurons can be specified by real-number weights of arbitrary precision. Then the network is capable of hypercomputation, since each real weight can encode infinitely many bits of information.

### 8.2. restriction to rational number weights

If the weights are rational numbers of arbitrary precision, then the neural networks are Turing-complete. However, note that arbitrary precision is not possible for any practical application. And it's unlikely that the right weights will be found by gradient descent.

### 8.3. binarized neurons

- weights can still be rational numbers, but activations are restricted to \(\{0, 1\}\)
- much weaker than a TM

## 9. saturated networks

- previously, when we talked about NNs, we included the possibility of infinite-precision weights
- to relax this constraint, consider *saturated* networks, in which all the weights are multiplied by \(\rho\) where \(\rho \rightarrow \infty\)
- then, e.g., anything like a \(\sigma\) (sigmoid) activation becomes a step function (quick check below)
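
A quick numerical check of the saturation idea (my own sketch): scaling the weights by larger and larger \(\rho\) drives a sigmoid activation toward a 0/1 step.

```python
# Sketch: sigmoid(rho * w * x) approaches a step function as rho grows.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-2.0, -0.1, 0.1, 2.0])
w = 1.0

for rho in [1, 10, 100]:
    print(rho, np.round(sigmoid(rho * w * x), 3))
# As rho grows, the outputs collapse toward {0, 1}: a binary step.
```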

## 10. saturated networks as automata

- empirical conjecture: the saturated network captures what is most stable about the original network

### 10.1. sRNNs

- empirically equivalent to finite-state automata

### 10.2. sLSTMs

- resemble counter automata

### 10.3. saturated stack RNNs

- stack augmented RNNs are RNNs equipped with a stack data structure
- can model Dyck languages

### 10.4. sCNNs

- less expressive than FSA

### 10.5. sTransformers

- limited form of counting

## 11. open questions

- how to tell whether a model has learned deep formal structure or just spurious surface patterns?