October 28, 2010
When writing LL(*): The Foundation of the ANTLR Parser Generator, we collected data about open-source projects using ANTLR grammars and also collected grammar analysis data about six large grammars. This document describes our methodology. See the paper for full detail/results. We also include material here that did not fit into the 10 page submission paper.
Actions in open-source grammars
We found 110 grammars and removed duplicates and anything else that would blur
the statistics. Result:
22/89 = .247 No actions
67/89 = .752 Actions
41 grammars are ANTLR v2 and 48 are ANTLR v3. 43/89 grammars are tree
Open-source grammar review to discover those with/without
actions (Format is filename 0|1. 1
indicates the grammar was deemed to have actions.)
Methodology used to examine open-source grammars.
Project directory names (we cannot publish actual projects for license reasons). We did "antlr grammar" searches at google code and sourceforge. A search for these names yields the projects. We pulled 110 grammars out of these roughly 30 projects.
ANTLR download statistics
Here are snapshots from Google Analytics (data only includes January 9, 2008 - October 28, 2010):
Empirical grammar and parser runtime analysis
Links to sample grammars
Links to sample input
Here are the raw results of static and runtime analysis for the 6 sample grammars:
- Java input, RatsJavaParser.java, is ANTLR-generated parser for RatsJava.g, which we derived from GPL Rats! (Empirical results in paper were run w/o GPL license on top of this file.)
- C input, pre_javaParser.c, is ANTLR-generated parser for Java.g run through C preprocessor
- LingToSqlSamples.cs taken from Microsoft C# Samples
- TSQL sample cobbled together from Microsoft Transact-SQL Reference; includes table creation, complicated stored procedures, updates, insertions, and queries. Copyright prevents reproduction here.
- Northwind.vb taken from Microsoft VB.NET Samples
The paper focuses on the practical issues of nondeterministic parsing
and introduces our algorithm for constructing LL(*) parser lookahead DFA.
We could not fit any theoretical results, but in the following section,
we show that we can convert any PEG to an equivalent ANTLR grammar. With
the addition of unrestricted semantic predicates, we can reach further into the
context-sensitive languages than PEGs can, however.
Discovery of visible semantic predicates
The formal semantics of predicated grammars and analysis algorithm in the submitted paper require that disambiguating predicates appear at the left edge of ambiguous productions. This is cumbersome in practice and forces programmers to duplicate predicates. Fortunately, grammar analysis can automatically discover and hoist predicates from productions further down the derivation chain into parsing decisions without predicates.
Here is a description of visible predicates and modified DFA construction algorithm.
ANTLR v3 LL(*) Lookahead DFA Construction Implementation
Class NFAToDFAConverter in the ANTLR 3.3 distribution has a Java implementation of this algorithm.