DPX4J2SS2HPH6GR4MALYCLRQL2NTDIQA5PCSAW5PPVURGRMSWSUQC \begin{abstract}We show that formal grammars can naturally be given semantics in thecategory of presheaves on strings. The monoid structure of stringsinduces Day convolution structure on grammars that models thesequencing operation of regular expressions and other forms offormal grammar. The closed structure on this category can be used todefine the \emph{derivative} of a language, etc.%% Presheaves over a monoidal category is a common way to construct%% models of separation logic. We show that we can use (ordered)%% separation logic propositions as grammars and proofs in ordered%% separation logic as parses and more generally parse-transformers, so%% that parsing is equivalent to a proof search in separation logic.\end{abstract}n
A grammar $g$ is\begin{enumerate}\item \emph{total} if for every $w$ there is at least one parse\item \emph{deterministic} if for every $w$ there is at most one parse\item \emph{empty} if for every $w$ there are no parses\end{enumerate}%% \begin{definition}%% A grammar $g$ is \emph{uniquely decidable} if there exists a%% ``complement'' $g'$ such that $g \vee g'$ is total and%% deterministic.%% \end{definition}
\section{Formal Grammars and their Semantics}
The presheaf of sets model of grammars allows us to reason about notjust whether a string parses, but also about \emph{ambiguity} of agrammar: whether there exists \emph{more than one} parse. This ismodeled by the grammar $g$ having more than one parse of the samestring $w$. We can use this as above to reason about when a grammar isambiguous or not, and if the grammar is unambiguous, we have a clear,deterministic specification for the parser.But what about parsing \emph{ambiguous} grammars? We can simply saythat a parser should produce any parse, or perhaps all of them, but inpractice many parser generators and programming languages supportinginfix/mixfix notation allow for \emph{associativity} and\emph{precedence} declarations. These are described by providing anambiguous grammar and disambiguating not by rewriting the grammar, butinstead by expressing a \emph{preference} between the multiplepossible parses. For instance, one of the most common sources ofambiguity in parsing is in associativity and precedence of binaryoperators. For an example of associativity, a grammar for subtractionmight initially be written as:\begin{mathpar}\begin{array}{rcl}E &::= & E - E \,|\,(E)\,|\,Number\\\end{array}\end{mathpar}But then we note that there are multiple, semantically differentparses of $3 - 4 - 5$: trees semantically equivalent to $(3 - 4) - 5$or $3 - (4 - 5)$. We can rule out the right-nested version by\emph{factoring} the grammar:\begin{mathpar}\begin{array}{rcl}E &::= & E - L \,|\,L\\L &::= & ( E ) \,|\, Number\end{array}\end{mathpar}This grammar has fewer parses than the previous: now $3 - 4 - 5$ willonly parse as a left-nested term since $4 - 5$ does not match the``leaf'' production $L$. Instead, most languages declare that $-$ is aleft associative operator. We can think of this as expressing a\emph{preference} for certain parses over others, specifically the tree% https://q.uiver.app/?q=WzAsNSxbMiwyLCJFXzIiXSxbMywxLCJFXzMiXSxbMiwwLCItIl0sWzEsMSwiLSJdLFswLDIsIkVfMSJdLFsyLDFdLFszLDRdLFszLDBdLFsyLDNdXQ==\[\begin{tikzcd}&& {-} \\& {-} && {E_3} \\{E_1} && {E_2}\arrow[from=1-3, to=2-4]\arrow[from=2-2, to=3-1]\arrow[from=2-2, to=3-3]\arrow[from=1-3, to=2-2]\end{tikzcd}\]is prefered over% https://q.uiver.app/?q=WzAsNSxbMSwwLCItIl0sWzAsMSwiRV8xIl0sWzIsMSwiLSJdLFsxLDIsIkVfMiJdLFszLDIsIkVfMyJdLFswLDFdLFswLDJdLFsyLDNdLFsyLDRdXQ==\[\begin{tikzcd}& {-} \\{E_1} && {-} \\& {E_2} && {E_3}\arrow[from=1-2, to=2-1]\arrow[from=1-2, to=2-3]\arrow[from=2-3, to=3-2]\arrow[from=2-3, to=3-4]\end{tikzcd}\]We can express this as an \emph{ordering}, where the prefered tree is``bigger''.Fortunately, we can express this with a small change. Rather thantaking functors from $\Sigma^*$ to \emph{sets} we can take functors to\emph{partially ordered sets}, or even more generally into\emph{categories}. So a grammar $g$ defines for each string $w$ acategory $g w$ whose objects are parses and for any two parses$p_1,p_2 \in g w$, we have a set $(g w)(p_1,p_2)$ of ``reasons toprefer $p_2$ over $p_1$''. Furthermore, preference is transitive andreflexive, with associativity and unit of transitive reasoning.Then the parsing problem can be rephrased as, given a syntacticdescription of a grammar $g$ and a string $w$, what is the \emph{best}parse in $g w$, if any? In categorical language, can we construct aterminal object of $g w$?%% \section{Ordered Separation Logic for Formal Grammars}%% Next, based on our algebraic constructions in the previous section, we%% describe a \emph{polymorphic ordered separation logic} for%% grammars. While this may seem a surprising application of separation%% logic, it is quite natural when considering the semantics: Day%% convolution of presheaves over a \emph{commutative} monoid is one of%% the oldest semantic interpretations of separation logic proofs. For%% parsing, we are taking presheaves over a non-commutative monoid (if%% $|\Sigma| \ge 2$). This non-commutativity is obvious for instance in%% regular expressions: $r_1r_2$ almost never presents the same language%% as $r_2r_1$.%% This consists of three judgments.%% \begin{enumerate}%% \item An \emph{open} grammar $g$ parameterized by a context of grammar%% variables $\Gamma = \gamma_1,\ldots$:%% \[ \Gamma \vdash g \]%% \item For each $\Gamma$ a judgment of \emph{parse terms} of a grammar%% $g$ parameterized by a ``bunch'' of parsed words $\bunch$, where $g$%% and $\bunch$ are both parameterized by $\Gamma$:%% \[ \Gamma | \bunch \vdash p : g \]%% \end{enumerate}
%% The intended denotational semantics is%% \begin{enumerate}%% \item A grammar $\Gamma \vdash g$ denotes a (mixed-variance?) functor%% $\sem{g} : \Lang^{|\Gamma|} \to \Lang$%% \item A parse term $\Gamma | \mathcal W \vdash p : g$ denotes a%% natural transformation $\mathcal W \Rightarrow g$ where a bunch $\mathcal W$%% is interpreted appropriately using the two monoidal structures on%% $\Lang$:%% \begin{align*}%% \sem{\cdot_\epsilon} &= \epsilon\\%% \sem{\mathcal W_m,w : g} &= \sem{\mathcal W_m}\sepconj \sem{g}\\%% \sem{\mathcal W_m,\mathcal W_a} &= \sem{\mathcal W_m}\sepconj \sem{\mathcal W_a}\\%% \sem{\cdot_\top} &= \top\\%% \sem{\mathcal W_a;\mathcal W_m} &= \sem{\mathcal W_a} \times \sem{\mathcal W_m}%% \end{align*}%% \end{enumerate}
%% To get a feel for the differences between the multiplicative and%% additive structure here are some examples.%% A parse term%% \[ \cdot_\epsilon \vdash p : g\]%% denotes a $g$-parse of the empty string, whereas a term%% \[ \cdot_\top \vdash p : g\]%% denotes a function from strings $w \in \Sigma^*$ to $g$-parses of $w$.%% The grammar $g \sepconj g'$ presents the language that parses words%% $w$ that can be split as $w_1w_2$ with a pair of parses $gw_1$ and%% $g'w_2$. On the other hand, $g \times g'$ is the intersection of the%% two languages: a parse of a word $w$ is a pair of a $g$ parse and a%% $g'$ parse of the entire word. $g \times 1$ is equivalent to $g$,%% whereas a parse of $w$ using $g \sepconj 1$ is a $g$-parse of a prefix%% $w'$ of $w$, and correspondingly $1 \sepconj g$ parses words with a%% $g$ suffix. On the other hand $\epsilon \times g$ is the restriction%% of $g$ to only the parses of the empty string.%% \begin{figure}%% \begin{mathpar}%% \inferrule%% {\Delta, \gamma \vdash g \and \textrm{where $\gamma$ strictly positive}}%% {\Delta \vdash \mu \gamma. g}%% \end{mathpar}%% \caption{Ordered Separation Logic}%% \end{figure}
%% Given any string $w$ over our alphabet, there is a corresponding%% grammar $\lceil w \rceil$ that accepts precisely $w$ with a single%% parse. Then we can formulate the \emph{parsing problem} as a%% \emph{program synthesis} problem:%% \begin{quote}%% Given a closed grammar $\cdot \vdash g$ and a word $w$, attempt to%% construct a parse $x : \lceil w \rceil \vdash p : g$%% \end{quote}%% Or, even more naturally, we can break w up into its constituent%% characters:%% \begin{quote}%% Given a closed grammar $\cdot \vdash g$ and a word $c_1\ldots_n$,%% construct a parse $x_1 : \lceil c_1 \rceil,\ldots_n \vdash p%% : g$%% \end{quote}%% \begin{figure}%% \begin{mathpar}%% \inferrule{g \in \mathcal G}{\mathcal G \vdash g}%% \end{mathpar}%% \caption{Context-free Expressions}%% \end{figure}\section{Semantic Actions and Grammar/Parser Combinators}%% While the presheaf presentation directly captures the notion of a%% parse tree, in a program the parse tree described by a grammar is%% often massaged to express precedence data and so the tree structure of%% a parse is more complex than the actual data being parsed. For%% instance, an AST data type for a single binary arithmetic operation%% can be represented as a binary tree with numbers at the leaves, such%% as the following Haskell/Agda-like data definition:%% \begin{verbatim}%% data Tree where%% op : Tree -> Tree -> Tree%% leaf : Num -> Tree%% \end{verbatim}%% If the single operator op is defined to be left-associative (such as%% subtraction/division), a typical grammar for parsing these expressions%% would be defined by \emph{factoring} the grammar into two mutually%% recursive non-terminals, \emph{E} for expression and \emph{L} for%% leaf:%% \begin{mathpar}%% \begin{array}{rcl}%% E &::= & E - L \,|\,L\\%% L &::= & ( E ) \,|\, Number%% \end{array}%% \end{mathpar}%% This roughly corresponds to the following mutually recursive datatypes:%% \begin{verbatim}%% data Expr where%% op : Expr -> Leaf -> Expr%% leaf : Leaf -> Expr%% data Leaf where%% parens : Expr -> Leaf%% num : Number -> Leaf%% \end{verbatim}%% I.e., an \texttt{Expr} is a non-empty list of \texttt{Expr + Number}.%% This mismatch is usually handled by \emph{semantic actions}, a kind of%% fold over the parse tree that is compositionally expressed alongside%% the definition of the parser itself. This can be seen as a manually%% performed optimization to make up for the compiler's inability to%% \emph{fuse} the two functions of parsing (an unfold) with constructing%% the intended AST (a fold). However, semantic actions become more%% useful when we have a \emph{compositional} language of grammars with%% corresponding semantic actions, such as in parser combinator%% libraries.%% Parser combinator libraries are typically based on some notion of%% backtracking error monad: a term of type \texttt{Parser Sigma A} is a%% parser of strings from the alphabet Sigma that on correct parses produces%% an associated value of type \texttt{A}.%% %%% Intuitively, this should be semantically (though not performantly)%% isomorphic to a grammar $g$ paired with a function from $g$-parses to%% $A$.%% We can formalize this by noting that there is a functor, the%% ``Konstant'' functor $K$ from $\Set$ to $\Set^{\Sigma^*}$ that sends a%% set $A$ to the constant presheaf $K(A)_w = A$. Then we can take the%% comma category $\Set^{\Sigma^*}/K$. The objects of this category are%% pairs of a grammar $g$ and a set $A$ with a function%% \[ \prod_{w \in \Sigma^*} g_w \to A \]%% Then the projection functor $cod : \Set^{\Sigma^*}/K \to \Set$ is an%% opfibration. The pushforward is the ``map'' operation on grammars.\section{From Grammars to Parsers}A grammar $g$ defines for each string $w$ a category of parses andpreferences. The task of a \emph{parser} is, given a syntacticdescription of a grammar and a string, to produce a \emph{best} parse,i.e., a terminal object of the category $g w$ of parses.