Finite State Automata

Language Modeling


Most modern neural LMs define the probability \( \plm\left(\str\right) \) of a string \( \str \in \kleene{\alphabet} \) as a product of conditional probability distributions \( \plm \), i.e., \begin{equation} \plm\left(\str\right) \defeq \plm\left(\eos\mid\str\right) \prod_{t = 1}^{|\str|} \plm\left(\symt \mid \strlt\right), \end{equation} where \( \eos \notin \alphabet \) is a special end-of-sequence symbol.

A language model expressed as in Equation 1 is called autoregressive. Let \( \eosalphabet \defeq \alphabet \cup \left\{\eos\right\} \). Then, each \( \plm\left(\eossym_t \mid \strlt\right) \) is a distribution over \( \eosalphabet \). We may also consider \( \varepsilon \)-augmented language models, in which each \( \plm \) is a distribution over \( \eosalphabet \cup \{ \varepsilon \} \), where \( \varepsilon \notin \alphabet \) is a special symbol representing the empty string. This allows the language model to perform more computation steps than the length of the string it generates. An autoregressive language model is called real-time if each \( \plm \) is a distribution over \( \eosalphabet \) only, i.e., not over \( \eosalphabet \cup \{ \varepsilon \} \).
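To make the factorization in Equation 1 concrete, the following minimal Python sketch (ours, not the paper's) scores a string autoregressively; \verb|cond_prob(symbol, prefix)| is a hypothetical callable returning the conditional probability of a symbol given the prefix, and \verb|EOS| is an illustrative stand-in for \( \eos \).

\begin{verbatim}
EOS = "<eos>"  # illustrative stand-in for the end-of-sequence symbol

def string_probability(string, cond_prob):
    """Equation 1: p(y) = p(EOS | y) * prod_t p(y_t | y_<t).

    `cond_prob(symbol, prefix)` is assumed to return the conditional
    probability of `symbol` given the prefix generated so far.
    """
    prob = 1.0
    for t, symbol in enumerate(string):
        prob *= cond_prob(symbol, string[:t])  # p(y_t | y_<t)
    return prob * cond_prob(EOS, string)       # multiply in p(EOS | y)
\end{verbatim}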

At a high level, we are interested in encoding real-time LMs as RNN LMs. To do so, we need a notion of equivalence between language models. In this paper, we will work with weak equivalence.

Definition: Two LMs \( \plm \) and \( \qlm \) over \( \kleene{\alphabet} \) are weakly equivalent if \( \plm\left(\str\right) = \qlm\left(\str\right) \) for all \( \str \in \kleene{\alphabet} \).

Regular Language Models

Probabilistic Finite-State Automata

Probabilistic finite-state automata are a well-understood real-time computational model. Definition: A probabilistic finite-state automaton (PFSA) is a 5-tuple \( \wfsatuple \), where \( \alphabet \) is an alphabet, \( \states \) is a finite set of states, \( \trans \subseteq \states \times \alphabet \times \Qnonneg \times \states \) is a finite set of weighted transitions, where we write transitions \( \left(\stateq, \sym, w, \stateq^\prime\right) \in \trans \) as \( \edge{\stateq}{\sym}{w}{\stateq^\prime} \), and \( \initialf, \finalf\colon \states \rightarrow \Qnonneg \) are functions that assign each state its initial and final weight, respectively. Moreover, \( \trans \), \( \initialf \), and \( \finalf \) satisfy \( \sum_{\stateq \in \states} \initialf\left(\stateq\right) = 1 \) and, for all states \( \stateq \in \states \), \( \sum\limits_{\edge{\stateq}{\sym}{w}{\stateq^\prime} \in \trans} w + \finalf\left(\stateq\right) = 1 \).
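The two normalization conditions can be checked mechanically. The following Python sketch uses an illustrative encoding of a PFSA (ours, not the paper's): transitions are stored as \( \left(\stateq, \sym, w, \stateq'\right) \) tuples, and \verb|is_normalized| verifies that the initial weights sum to one and that, for every state, the outgoing transition weights plus the final weight sum to one.

\begin{verbatim}
from collections import defaultdict

class PFSA:
    """Illustrative PFSA container: states, alphabet, weighted transitions,
    and initial/final weight functions represented as dictionaries."""

    def __init__(self, states, alphabet, transitions, initial, final):
        self.states = states            # finite set of states
        self.alphabet = alphabet        # alphabet
        self.transitions = transitions  # list of (q, a, w, q') tuples
        self.initial = initial          # dict: state -> initial weight
        self.final = final              # dict: state -> final weight

    def is_normalized(self, tol=1e-9):
        # initial weights form a probability distribution over states
        if abs(sum(self.initial.values()) - 1.0) > tol:
            return False
        # per state: outgoing transition weights + final weight sum to 1
        out = defaultdict(float)
        for q, _a, w, _q_next in self.transitions:
            out[q] += w
        return all(abs(out[q] + self.final.get(q, 0.0) - 1.0) <= tol
                   for q in self.states)
\end{verbatim}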

Basic Concepts

We next define some basic concepts. A PFSA \( \fsa = \wfsatuple \) is deterministic if \( |\set{\stateq \mid \initialfFun{\stateq} > 0}| = 1 \) and, for every \( \stateq \in \states, \sym \in \alphabet \), there is at most one \( \stateq^\prime \in \states \) such that \( \edge{\stateq}{\sym}{w}{\stateq^\prime} \in \trans \) with \( w > 0 \). In this paper, we do not distinguish between a transition for a given symbol with weight \( w=0 \) and the absence of a transition on that symbol. That is, we assume there always exists a transition \( \edge{\stateq}{\sym}{w}{\stateq'} \in \trans \) for any \( \stateq, \stateq' \in \states \) and \( \sym \in \alphabet \), albeit possibly with \( w = 0 \). Such a choice turns out to be useful in our technical exposition; see Algorithm 1. Any state \( \stateq \) where \( \initialfFun{\stateq}>0 \) is called an initial state, and if \( \finalfFun{\stateq} > 0 \), it is called a final state. A path \( \apath \) of length \( \pathlen \) is a sequence of consecutive transitions in \( \fsa \), denoted as \[ \edge{\stateq_1}{\sym_1}{w_1}{\edge{\stateq_2}{\sym_2}{w_2}{\stateq_3} \cdots \edge{\stateq_{\pathlen}}{\sym_{\pathlen}}{w_{\pathlen}}{\stateq_{\pathlen + 1}}}. \] The yield of a path is \( \yield\left(\apath\right)\defeq \sym_1 \cdots \sym_{\pathlen} \). The prefix weight \( \prefixweight \) of a path \( \apath \) is the product of the transition and initial weights, whereas the weight of a path additionally includes the final weight. In symbols, this means

\[ \prefixweight(\apath)\defeq \prod_{t = 0}^\pathlen w_t, \]

\[ \weight(\apath)\defeq \prod_{t = 0}^{\pathlen+1} w_t, \]

with \( w_0 \defeq \initialf(\stateq_1) \) and \( w_{\pathlen+1} \defeq \finalf(\stateq_{\pathlen+1}) \). We write \( \paths(\fsa, \str) \) for the set of all paths in \( \fsa \) with yield \( \str \). The sum of the weights of all paths that yield a given string \( \str\in\kleene{\alphabet} \) is called the stringsum:

\[ \fsa \left( \str \right) \defeq \sum_{\apath \in \paths\left( \fsa, \str \right) } \weight \left( \apath \right). \]

The stringsum gives the probability of the string \( \str \). We say a state \( \stateq \in \states \) is accessible if there exists a path with non-zero weight from an initial state to \( \stateq \). A state \( \stateq \in \states \) is co-accessible if there exists a path with non-zero weight from \( \stateq \) to a final state. An automaton in which all states are accessible and co-accessible is called trim.
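As an illustration, the stringsum can be computed without enumerating paths via the standard forward recursion over states. The sketch below reuses the illustrative \verb|PFSA| class from above: it accumulates the prefix weights of all paths reading the string and multiplies in the final weights at the end.

\begin{verbatim}
def stringsum(pfsa, string):
    """Total weight of all paths in `pfsa` whose yield is `string`."""
    # alpha[q]: total prefix weight of paths reading the prefix so far and ending in q
    alpha = {q: pfsa.initial.get(q, 0.0) for q in pfsa.states}
    for symbol in string:
        new_alpha = {q: 0.0 for q in pfsa.states}
        for q, a, w, q_next in pfsa.transitions:
            if a == symbol:
                new_alpha[q_next] += alpha[q] * w
        alpha = new_alpha
    # multiply in the final (halting) weights and sum over states
    return sum(alpha[q] * pfsa.final.get(q, 0.0) for q in pfsa.states)
\end{verbatim}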


PFSAs as Autoregressive Language Models. We now show how a PFSA $\fsa$ induces an LM $\plmA$ over strings $\str\in\kleene{\alphabet}$. The weights of all available transitions of a PFSA in state $\stateq$, together with the final weight, define a probability distribution over the next action, i.e., taking a transition or halting. We can translate this into a distribution over $\eosalphabet$ where $\eos$ corresponds to halting. In the following, we use the notation $\eossym\in\eosalphabet$ and $\sym\in\alphabet$ to distinguish between symbols that could be $\eos$ and those that cannot, respectively. Specifically, we can define the probability over $\eosalphabet$ as follows: \begin{equation} \plmACFun{{\eossym_t}}{{\stateq}, \strlt} = \begin{cases} {\displaystyle \sum_{\edge{{\stateq}}{{\eossym_t}}{w}{\stateq'}}} w & \ifcond {\eossym_t}\in\alphabet\\ \finalf\left({\stateq}\right) & \ifcond {\eossym_t} = \eos. \end{cases}\label{eq:plmpfsa} \end{equation} Moreover, in a PFSA, the probability of $\eossym_t$ is conditionally independent of $\strlt$ given the state $\stateq$, i.e., \begin{equation} \plmACFun{\eossym_t}{\stateq, \strlt} = \plmACFun{\eossym_t}{\stateq}.\label{eq:markov} \end{equation} Finally, using the law of total probability, we can define an autoregressive language model as \begin{align}\label{eq:pfsa-autoregressive} &\plmACFun{\eossym_t}{\strlt}\defeq \sum_{\stateq \in \states} \plmACFun{\eossym_t}{\stateq}\,\plmACFun{\stateq}{\strlt} \nonumber \\ &= \sum_{\stateq \in \states} \plmACFun{\eossym_t}{\stateq}\,\frac{\plmA\left(\stateq, \strlt\right)}{\plmA\left(\strlt\right)} \\ &= \sum_{\stateq \in \states} \plmACFun{\eossym_t}{\stateq}\,\frac{\plmA\left(\stateq, \strlt\right)}{\sum_{\stateq' \in \states} \plmA\left(\stateq', \strlt\right)}, \label{eq:pfsa-autoregressive-last} \end{align} where $\plmA\left(\stateq, \strlt\right)$ can be written as \begin{equation}\label{eq:next-state} \plmA\left(\stateq, \strlt\right) = \left(\overset{\rightarrow}{\initialf}^{\top} \prod_{s=1}^t \mT^{\left(\sym_s\right)}\right)_\stateq, \end{equation} where $\overset{\rightarrow}{\initialf}$ and $\mT^{\left(\sym_s\right)}$ refer to the vectorized initial function and symbol-specific transition matrices of the PFSA $\fsa$, respectively; see \cref{app:pfsa} for a full specification.
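The construction above can be phrased directly in terms of the vectorized quantities. The following numpy sketch assumes an illustrative encoding (the paper's formal one is in the referenced appendix): \verb|init_vec| and \verb|final_vec| are the vectors of initial and final weights, and \verb|trans_mats[a]| is the symbol-specific transition matrix whose row sums give the per-state probabilities of emitting that symbol.

\begin{verbatim}
import numpy as np

def next_symbol_distribution(init_vec, trans_mats, final_vec, prefix):
    """Next-symbol distribution of the LM induced by a PFSA, given a prefix."""
    # p(q, y_<t): push the initial weights through the transition matrices
    state = init_vec.copy()
    for symbol in prefix:
        state = state @ trans_mats[symbol]
    # p(q | y_<t); assumes the prefix has non-zero probability
    state_dist = state / state.sum()
    # p(a | y_<t) = sum_q p(q | y_<t) * (total weight of a-transitions leaving q)
    dist = {a: float(state_dist @ T.sum(axis=1)) for a, T in trans_mats.items()}
    dist["<eos>"] = float(state_dist @ final_vec)  # halting corresponds to EOS
    return dist
\end{verbatim}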

Definition: An LM $\plm$ is a regular language model if there exists a PFSA $\fsa$ whose induced language model $\plmA$ is weakly equivalent to $\plm$.

PFSAs and FSAs. Although PFSAs share many properties with unweighted (Boolean-weighted) finite-state automata, one important difference concerns determinization. In the unweighted case, deterministic and non-deterministic FSAs are equally expressive, i.e., any non-deterministic FSA has an equivalent deterministic FSA that accepts the same language. This result, however, does not hold for PFSAs: there exist PFSAs that admit no deterministic equivalent \citep{mohri-1997-finite, Buchsbaum1998}, meaning that non-deterministic PFSAs are strictly more expressive than deterministic ones. For example, the PFSA in \cref{fig:example-fslm} is non-deterministic and does not admit a deterministic equivalent, since $\plm(a b^n)$ cannot be expressed as a single term for arbitrary $n \in \NZero$.
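To give a rough intuition for the last claim (the constants below are illustrative and are not the weights of the automaton in the figure): if a deterministic PFSA reads \( a \) and then, for simplicity, loops on \( b \) in a single state, it assigns \[ \plm\left(a b^n\right) = c\, \gamma^n \] for fixed \( c, \gamma \in \Qnonneg \), i.e., a single geometric term in \( n \). A non-deterministic PFSA can instead split the probability mass over two paths and realize, e.g., \[ \plm\left(a b^n\right) = c_1 \gamma_1^n + c_2 \gamma_2^n, \qquad \gamma_1 \neq \gamma_2, \] and a sum of two geometric terms with distinct ratios cannot be collapsed into a single such term.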

Parity Finite State Automaton