Most modern neural LMs define the probability \( \plm\left(\str\right) \) of a string \( \str \in \kleene{\alphabet} \) as a product of conditional probability distributions \( \plm \), i.e.,
\begin{equation}
\plm\left(\str\right) \defeq \plm\left(\eos\mid\str\right) \prod_{t = 1}^{|\str|} \plm\left(\symt \mid \strlt\right),
\end{equation}
where \( \eos \notin \alphabet \) is a special __e__nd-__o__f-__s__equence symbol.

A language model expressed as in Equation 1 is called **autoregressive**.
Let \( \eosalphabet \defeq \alphabet \cup \left\{\eos\right\} \).
Then, each \( \plm\left(\eossym_t \mid \strlt\right) \) is a distribution over \( \eosalphabet \).
Additionally, we may also consider \( \varepsilon \)-augmented language models where each \( \plm \) is a distribution over \( \eosalphabet \cup \{ \varepsilon \} \) where \( \varepsilon \not\in \alphabet \) is a special symbol not in the alphabet that represents the empty string.
This allows the language model to perform computations *longer* than the length of the string it generates.
An autoregressive language model is called **real-time** if each \( \plm \) is *only* a distribution over \( \eosalphabet \), i.e., not over \( \eosalphabet \cup \{ \varepsilon \} \).

At a high level, we are interested in encoding real-time LMs as RNN LMs. To do so, we need a notion of equivalence between language models. In this paper, we will work with weak equivalence.

**Definition:** Two LMs \( \plm \) and \( \qlm \) over \( \kleene{\alphabet} \) are **weakly equivalent** if \( \plm\left(\str\right) = \qlm\left(\str\right) \) for all \( \str \in \kleene{\alphabet} \).

Probabilistic finite-state automata are a well-understood real-time computational model.
**Definition:** A **probabilistic finite-state automaton** (PFSA) is a 5-tuple \( \wfsatuple \) where \( \alphabet \) is an alphabet, \( \states \) is a finite set of states, \( \trans \subseteq \states \times \alphabet \times \Qnonneg \times \states \) is a finite set of weighted transitions
where we write transitions \( \left(\stateq, \sym, w, \stateq^\prime\right) \in \trans \) as \( \edge{\stateq}{\sym}{w}{\stateq^\prime} \),
and \( \initialf, \finalf\colon \states \rightarrow \Qnonneg \) are functions that assign each state its initial and final weight, respectively.
Moreover, for all states \( \stateq \in \states \), \( \trans, \initialf \) and \( \finalf \) satisfy \( \sum_{\stateq \in \states} \initialf\left(\stateq\right) = 1 \), and \( \sum\limits_{\edge{\stateq}{\sym}{w}{\stateq^\prime} \in \trans} w + \finalf\left(\stateq\right) = 1 \).

We next define some basic concepts.
A PFSA \( \fsa = \wfsatuple \) is **deterministic** if \( |\set{\stateq \mid \initialfFun{\stateq} > 0}| = 1 \) and, for every \( \stateq \in \states, \sym \in \alphabet \), there is at most one \( \stateq^\prime \in \states \) such that \( \edge{\stateq}{\sym}{w}{\stateq^\prime} \in \trans \) with \( w > 0 \).
*not* distinguish between a transition for a given symbol with weight \( w=0 \) and the absence of a transition on that symbol.
That is, we assume there always exists a transition \( \edge{\stateq}{\sym}{w}{\stateq'} \in \trans \) for any \( \stateq, \stateq' \in \states \) and \( \sym \in \alphabet \), albeit possibly with \( w = 0 \).
Such a choice turns out to be useful in our technical exposition; see Algorithm 1.
**initial state**, and if \( \finalfFun{\stateq} > 0 \), it is called a **final state**.
A **path** \( \apath \) of length \( \pathlen \) is a sequence of subsequent transitions in \( \fsa \), denoted as
\[
\edge{\stateq_1}{\sym_1}{w_1}{\edge{\stateq_2}{\sym_2}{w_2}{\stateq_3} \cdots \edge{\stateq_{\pathlen}}{\sym_{\pathlen}}{w_{\pathlen}}{\stateq_{\pathlen + 1}}}.
\]
The **yield** of a path is \( \yield\left(\apath\right)\defeq \sym_1 \cdots \sym_{\pathlen} \).
The **prefix weight** \( \prefixweight \) of a path \( \apath \) is the product of the transition and initial weights, whereas the **weight** of a path additionally has the final weight multiplied in.
In symbols, this means

\[ \weight(\apath)\defeq \prod_{t = 0}^{\pathlen+1} w_t, \]

with \( w_0 \defeq \initialf(\stateq_1) \) and \( w_{\pathlen+1} \defeq \finalf(\stateq_{\pathlen+1}) \).
We write \( \paths(\fsa, \str) \) for the set of all paths in \( \fsa \) with yield \( \str \).
The sum of weights of all paths that yield a certain string \( \str\in\kleene{\alphabet} \) is called the **stringsum**, given in the notation below:

\[ \fsa \left( \str \right) \defeq \sum_{\apath \in \paths\left( \fsa, \str \right) } \weight \left( \apath \right). \]

The stringsum gives the probability of the string \( \str \).
We say a state \( \stateq \in \states \) is **accessible** if there exists a path with non-zero weight from an initial state to \( \stateq \).
A state \( \stateq \in \states \) is **co-accessible** if there exists a path with non-zero weight from \( \stateq \) to a final state.
An automaton in which all states are accessible and co-accessible is called **trim**.

**PFSAs as Autoregressive Language Models.**
We now show how a PFSA $\fsa$ induces a LM $\plmA$ over strings $\str\in\kleene{\alphabet}$.
The weights of all available transitions of a PFSA in state $\stateq$, together with the final weight, define a probability distribution over the next action, i.e., taking a transition or halting.
We can translate this into a distribution over $\eosalphabet$ where $\eos$ corresponds to halting.
In the following, we use the notation $\eossym\in\eosalphabet$ and $\sym\in\alphabet$ to distinguish between symbols that could be $\eos$ and those that cannot, respectively.
Specifically, we can define the probability over $\eosalphabet$ as follows
\begin{equation}
\plmACFun{{\eossym_t}}{{\stateq}, \strlt} =
\begin{cases}
{\displaystyle \sum_{\edge{{\stateq}}{{\eossym_t}}{w}{\stateq'}}} w & \ifcond {\eossym_t}\in\alphabet\\
\finalf\left({\stateq}\right) & \ifcond {\eossym_t} = \eos.
\end{cases}\label{eq:plmpfsa}
\end{equation}
Moreover, in a PFSA, the probability of $\eossym$ is conditionally independent of $\strlt$ given the state $q$, i.e.,
\begin{equation}
\plmACFun{\eossym_t}{\stateq, \strlt} = \plmACFun{\eossym_t}{\stateq}.\label{eq:markov}
\end{equation}
Finally, using the law of total probability, we can define an autoregressive language model as
\begin{align}\label{eq:pfsa-autoregressive}
&\plmACFun{\eossym_t}{\strlt}\defeq \sum_{\stateq \in \states} \plmACFun{\eossym_t}{\stateq}\,\plmACFun{\stateq}{\strlt} \nonumber \\
&= \sum_{\stateq \in \states} \plmACFun{\eossym_t}{\stateq}\,\frac{\plmA\left(\stateq, \strlt\right)}{\plmA\left(\strlt\right)} \\
&= \sum_{\stateq \in \states} \plmACFun{\eossym_t}{\stateq}\,\frac{\plmA\left(\stateq, \strlt\right)}{\sum_{\stateq' \in \states} \plmA\left(\stateq', \strlt\right)}, \label{eq:pfsa-autoregressive-last}
\end{align}
where $\plmA\left(\stateq, \strlt\right)$ can be written as
\begin{equation}\label{eq:next-state}
\plmA\left(\stateq, \strlt\right) = \left(\overset{\rightarrow}{\initialf}^{\top} \prod_{s=1}^t \mT^{\left(\sym_s\right)}\right)_\stateq,
\end{equation}
where $\overset{\rightarrow}{\initialf}$ and $\mT^{\left(\sym_s\right)}$ refer to the vectorized initial function and symbol-specific transition matrices of the PFSA $\fsa$, respectively; see \cref{app:pfsa} for a full specification.

**Definition:**
An LM $\plm$ is a **regular** language model if there exists a PFSA $\fsa$ whose induced language model $\plmA$ is weakly equivalent to $\plmA$.

**PFSAs and FSAs.**
Although PFSAs share many properties with unweighted (boolean-weighted) finite-state automata, one important difference relates to determinization.
In the unweighted case, the class of deterministic and non-deterministic FSAs are equivalent, i.e., any non-deterministic FSA has an equivalent deterministic FSA that accepts the same language.
This result, however, does not hold for PFSAs: There exist PFSAs that admit no deterministic equivalent \citep{mohri-1997-finite, Buchsbaum1998}, meaning that non-deterministic PFSAs are strictly more expressive than deterministic ones.
For example, the PFSA in \cref{fig:example-fslm} is non-deterministic and does not admit a deterministic equivalent, since $\plm(a b^n)$ cannot be expressed as a single term for arbitrary $n \in \NZero$.