POPL 2025
Sun 19 - Sat 25 January 2025 Denver, Colorado, United States

\emph{Context-free language (CFL)} reachability is a standard approach in static analyses, where the analysis question (e.g., is there a dataflow from $x$ to $y$?) is phrased as a language reachability problem on a graph $G$ wrt a CFL $\mathcal{L}$. However, CFLs lack the expressiveness needed for high analysis precision. On the other hand, common formalisms for \emph{context-sensitive} languages are too expressive, in the sense that the corresponding reachability problem becomes undecidable. \emph{Are there useful context-sensitive language-reachability models for static analysis?}

In this paper, we introduce \emph{Multiple Context-Free Language (MCFL)} reachability as an \emph{expressive} yet \emph{tractable} model for static program analysis. MCFLs form an infinite hierarchy of mildly context sensitive languages parameterized by a \emph{dimension} $d$ and a \emph{rank} $r$. Larger $d$ and $r$ yield progressively more expressive MCFLs, offering \emph{tunable} analysis precision. We showcase the utility of MCFL reachability by developing a family of MCFLs that approximate interleaved Dyck reachability, a common but undecidable static analysis problem.

Given the increased expressiveness of MCFLs, one natural question pertains to their algorithmic complexity, i.e., \emph{how fast can MCFL reachability be computed?} We show that the problem takes $O(n^{2d+1})$ time on a graph of $n$ nodes when $r=1$, and in $O(n^{d(r+1)})$ time when $r>1$. Moreover, we show that when $r=1$, even the simpler membership problem has a lower bound of $n^{2d}$ based on the Strong Exponential Time Hypothesis, while reachability for $d=1$ has a lower bound of $n^{3}$ based on the combinatorial Boolean Matrix Multiplication Hypothesis. Thus, for $r=1$, our algorithm is optimal within a factor $n$ for all levels of the hierarchy based on the dimension $d$ (and fully optimal for $d=1$).

We implement our MCFL reachability algorithm and evaluate it on underapproximating interleaved Dyck reachability for a standard taint analysis for Android. When combined with existing overapproximate methods, MCFL reachability discovers \emph{all} tainted information on 8 out of 11 benchmarks, while it has remarkable coverage (confirming $94.3%$ of the reachable pairs reported by the overapproximation) on the remaining 3. To our knowledge, this is the first report of high and provable coverage for this challenging benchmark set.