October 28, 2010

When writing LL(*): The Foundation of the ANTLR Parser Generator, we collected data about open-source projects using ANTLR grammars and also collected grammar analysis data about six large grammars. This document describes our methodology. See the paper for full detail/results. We also include material here that did not fit into the 10 page submission paper.

Actions in open-source grammars

We found 110 grammars and removed duplicates and anything else that would blur the statistics. Result:

22/89 = .247 No actions
67/89 = .752 Actions

41 grammars are ANTLR v2 and 48 are ANTLR v3. 43/89 grammars are tree grammars.

Open-source grammar review to discover those with/without actions (Format is filename 0|1. 1 indicates the grammar was deemed to have actions.)

Methodology used to examine open-source grammars.

Project directory names (we cannot publish actual projects for license reasons). We did "antlr grammar" searches at google code and sourceforge. A search for these names yields the projects. We pulled 110 grammars out of these roughly 30 projects.

EduXcomp JUnC Noop abtools-GRAMMAR booggie-1.0.1
cesta-0.1.2 dsl4j funky-1.3.00 gwt-xml-editor
highlighter jai-tools kconfig-g koopa laurentpetit-ccw-b432012
makeme-0.02 modsl-read-only qvtparser_1.0.0 qvtrel2op_1.0.0 ontostep-plugin-0.9.0
php-grammar-in-antlr sandbox spapper_v1.2 sparkle-g spydoc_1_0_pre-alpha
svparser v2kparse-1.7 xpa xqgrammar xruby-0.3.3a-src

ANTLR download statistics

Here are snapshots from Google Analytics (data only includes January 9, 2008 - October 28, 2010):

Empirical grammar and parser runtime analysis

Links to sample grammars Links to sample input Here are the raw results of static and runtime analysis for the 6 sample grammars:

Theoretical results

The paper focuses on the practical issues of nondeterministic parsing and introduces our algorithm for constructing LL(*) parser lookahead DFA. We could not fit any theoretical results, but in the following section, we show that we can convert any PEG to an equivalent ANTLR grammar. With the addition of unrestricted semantic predicates, we can reach further into the context-sensitive languages than PEGs can, however.


Discovery of visible semantic predicates

The formal semantics of predicated grammars and analysis algorithm in the submitted paper require that disambiguating predicates appear at the left edge of ambiguous productions. This is cumbersome in practice and forces programmers to duplicate predicates. Fortunately, grammar analysis can automatically discover and hoist predicates from productions further down the derivation chain into parsing decisions without predicates.

Here is a description of visible predicates and modified DFA construction algorithm.

ANTLR v3 LL(*) Lookahead DFA Construction Implementation

Class NFAToDFAConverter in the ANTLR 3.3 distribution has a Java implementation of this algorithm.