Test-Driven Development with ANTLR

Just as your code can benefit from complete unit tests, so can your grammars. Here's an example of developing a simple grammar for CSV using Java, ANTLR and TestNG.

CSV defined

CSV (Comma Separated Values) is a file format commonly used for exchanging spreadsheet data. It generally follows these rules:

  1. A record consists of multiple fields separated by commas. Each record ends with a line feed (a/k/a newline) or carriage return + line feed.
  2. A field may not contain spaces, commas, double-quotes, line feeds, or carriage returns unless the field itself is wrapped in double-quotes on each end.
  3. To insert a double quote into a quoted field, double it (i.e. use "")
  4. Spaces are ignored immediately before and after a comma.
  5. White space at the front or end of the record is not allowed unless part of a quoted field.
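
To make the rules concrete before writing any grammar, here is a minimal plain-Java sketch that applies them to a single record. It is a hand-written stand-in, not the ANTLR-generated parser this tutorial builds. The names CsvRecordSketch and parseLine are hypothetical, and treating an empty record as one empty field is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reference splitter for one CSV record; not the ANTLR-generated parser. */
final class CsvRecordSketch {

    /** Splits one record, including its LF or CRLF terminator, into fields per rules 1-5. */
    static List<String> parseLine(String line) {
        // Rule 1: a record ends with LF or CR+LF.
        String body;
        if (line.endsWith("\r\n")) {
            body = line.substring(0, line.length() - 2);
        } else if (line.endsWith("\n")) {
            body = line.substring(0, line.length() - 1);
        } else {
            throw new IllegalArgumentException("record must end with LF or CRLF");
        }
        // Rule 5: no bare white space at the front or end of the record.
        if (body.startsWith(" ") || body.endsWith(" ")) {
            throw new IllegalArgumentException("leading or trailing space outside quotes");
        }

        List<String> fields = new ArrayList<>();
        int i = 0;
        int n = body.length();
        while (true) {
            StringBuilder field = new StringBuilder();
            if (i < n && body.charAt(i) == '"') {
                i++; // consume the opening quote
                while (true) {
                    if (i >= n) {
                        throw new IllegalArgumentException("unterminated quoted field");
                    }
                    char c = body.charAt(i++);
                    if (c != '"') {
                        field.append(c);          // Rule 2: anything goes inside quotes
                    } else if (i < n && body.charAt(i) == '"') {
                        field.append('"');        // Rule 3: "" stands for one "
                        i++;
                    } else {
                        break;                    // closing quote
                    }
                }
            } else {
                // Rule 2: an unquoted field stops at a comma or space and may not
                // contain double-quotes, line feeds, or carriage returns.
                while (i < n && body.charAt(i) != ',' && body.charAt(i) != ' ') {
                    char c = body.charAt(i++);
                    if (c == '"' || c == '\r' || c == '\n') {
                        throw new IllegalArgumentException("illegal character in unquoted field");
                    }
                    field.append(c);
                }
            }
            fields.add(field.toString());

            // Rule 4: spaces immediately before and after a comma are ignored.
            while (i < n && body.charAt(i) == ' ') {
                i++;
            }
            if (i == n) {
                return fields;
            }
            if (body.charAt(i) != ',') {
                throw new IllegalArgumentException("expected comma at index " + i);
            }
            i++; // consume the comma
            while (i < n && body.charAt(i) == ' ') {
                i++;
            }
        }
    }
}
```

For example, parseLine("a,,b\n") returns three fields: a, an empty string, and b. A sketch like this can also serve as a cross-check for the expectations you write in the grammar tests.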

Prerequisites for this tutorial

  • You have ANTLR and TestNG installed. (You can use the same ideas with other unit testing frameworks such as JUnit.)
  • You can run ANTLR to generate Java sources from a grammar.
  • You can build and run Java code.
  • (optional) You have JDK 5 or later installed. If not, you won't be able to use some of the constructs in this example and will need to translate back to 1.4 or earlier.

Develop a basic CSV parser

Basic setup

Every CSV line ends in a newline (line feed) or carriage return + line feed. Let's set that up as our basic grammar.

CSV.g

Create the corresponding test harness.

CSVTests.java

Generate the Java files for this grammar, then run the test.

(tick) The test should pass.

Return a result

We'll want to get a list of strings back from each line of CSV. In true TDD form, we'll alter the test then update the grammar to make the test correct again.

CSVTests.java fragment

(minus) As expected, this fails to compile: line is declared as final public void line

Now update the grammar so this compiles.

CSV.g fragment

(minus) Regenerate the grammar. The test compiles, but you get a NullPointerException. Look at the generated code in CSVParser.java:

Generated code...

Oops! We need to initialize the result. Edit your grammar again:

CSV.g fragment

(tick) Run ANTLR to regenerate your files and run the test again. The test should pass.

Extract a single word

A record in CSV is a series of fields separated by commas and ending in a newline or CRLF. We'll start by testing a single field in isolation, then build back up to testing a whole record's worth of fields.

Start by adding a new test:

CSVTests.java

(minus) Of course, this won't compile. We need to define field in the grammar, doing just enough to keep the tests working:

CSV.g

What's that funny character?

"~" means "not" and is used to match any item that's not in a set.

See The Definitive ANTLR Reference, page 95.

Support multiple fields

Let's try multiple simple fields:

CSVTests.java fragment

This is going to take a few more changes than just defining a line as field,field,field...

  1. We need to add each field's value to the line.
  2. A record such as a,,b should write out an empty field in the middle.
  3. Testing shows a potential nondeterminism between <empty field>NEWLINE and NEWLINE.

CSV.g

(tick) This works, but line is getting cluttered.

Introducing scopes

We've been passing a value back from field to line, but there's another way to pass information between rules: dynamic scopes. This is a good place to see how they work.

CSV.g with scope

Since field no longer returns a string, we'll need to alter the test to pass the value through line and add a newline to the end of the line:

CSVTests.java, new field test via line

Quoting, part 1

CSV requires that fields that contain special characters (newline, return, double-quote, comma, space) be surrounded by double quotes.

CSVTests.java fragment

You can treat this like a multi-line comment, grabbing all characters between the opening quote and the first closing quote found ("nongreedy" behavior):

CSV.g

(tick) This gets the job done.

Quoting, part 2

Let's allow quotes inside quoted fields. CSV uses "" to represent " in the final output.

CSVTests.java fragment

The QUOTED lexer rule is clearly the place to put this. But think about how you would do it. [Seriously, go try some solutions before using mine. – RDC]

Here's one solution, including a bit of post-processing code to convert "" to ":

CSV.g fragment
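
The post-processing step can also be sketched on its own in plain Java, outside the grammar. This is only an illustration of the conversion, not the grammar action itself; the names QuotedFieldSketch and unquote are hypothetical, and the sketch assumes the token text still carries its enclosing quotes.

```java
/** Hypothetical helper showing the "" to " conversion for a quoted CSV token. */
final class QuotedFieldSketch {

    /** Strips the enclosing quotes from a quoted token and collapses each "" into ". */
    static String unquote(String token) {
        boolean quoted = token.length() >= 2
                && token.charAt(0) == '"'
                && token.charAt(token.length() - 1) == '"';
        if (!quoted) {
            throw new IllegalArgumentException("not a quoted field: " + token);
        }
        // Drop the first and last character, then replace every doubled quote.
        String inner = token.substring(1, token.length() - 1);
        return inner.replace("\"\"", "\"");
    }
}
```

Keeping the conversion in a small method like this makes it easy to test separately from the lexer rule that matches the token.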

Remove spaces around commas

CSV ignores leading and trailing spaces, so we should do the same. Here's the test case:

CSVTests.java fragment

Surprise – it passes! This is unexpected, but a look at the error output shows what's happening:

line 1:3 no viable alternative at character ' '
line 1:4 no viable alternative at character ' '
line 1:6 no viable alternative at character ' '
line 1:7 no viable alternative at character ' '
line 1:8 no viable alternative at character ' '
line 1:15 no viable alternative at character ' '

ANTLR 3's error recovery is taking care of this by skipping the space characters (which are unrecognized by the UNQUOTED rule). If we want to catch the error instead, we'll have to alter the error recovery mechanism.

Do you use continuous builds?

Consider disabling automatic error recovery while developing and testing your grammar. Otherwise, you may miss problems that crop up during automated unit testing (e.g. when using CruiseControl).

Intercepting errors

Since the last test passed when it should have failed, but printed a series of warning messages, let's intercept the error information and retain it for later use.

Start by altering the test to extract stored error messages from the lexer. This also requires changing the code that creates the parser and lexer:

CSVTests.java fragment

(minus) Of course, this doesn't compile: CSVLexer doesn't implement getExceptions.

A little work with a debugger shows that the lexer calls reportError when it encounters an error:

Override this method in the lexer and add the exception to a list (all of the changes are in @lexer::members):

CSV.g
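
The shape of that override can be shown without the ANTLR runtime. In the sketch below, PrintingLexerSketch is a made-up stand-in for the generated lexer's base class, and its scan method only pretends to lex by reporting every space. In a real grammar the overriding method would live in @lexer::members and take the ANTLR runtime's exception type rather than RuntimeException. Only the names reportError and getExceptions come from the text above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Made-up stand-in for a lexer base class; the real one comes from the ANTLR runtime. */
class PrintingLexerSketch {

    /** Default behaviour: print the problem and carry on. */
    void reportError(RuntimeException e) {
        System.err.println(e.getMessage());
    }

    /** Pretends to scan the input, reporting every space as unrecognized. */
    void scan(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == ' ') {
                reportError(new IllegalStateException(
                        "line 1:" + i + " no viable alternative at character ' '"));
            }
        }
    }
}

/** Overrides the reporting hook so errors are kept for the test instead of printed. */
class CollectingLexerSketch extends PrintingLexerSketch {

    private final List<RuntimeException> exceptions = new ArrayList<>();

    @Override
    void reportError(RuntimeException e) {
        exceptions.add(e); // remember the error rather than printing it
    }

    /** Read-only view of everything reported so far. */
    List<RuntimeException> getExceptions() {
        return Collections.unmodifiableList(exceptions);
    }
}
```

A test can then scan its input and assert that getExceptions() is empty, which turns silently recovered errors into test failures.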

Regenerate the lexer and parser, then run the tests.

(minus) The test fails, as we expected.

Ignore whitespace around commas (we mean it this time...)

Now we can alter the grammar to ignore leading and trailing spaces around commas. The answer turns out to be ridiculously simple:

CSV.g fragment

(tick) The test passes.

  1. Nov 15, 2008

    Is it possible to add a Maven pom.xml file for this, so that it is easier for beginners like me to get started? That would be great. Thanks.

  2. May 28, 2009

    There's a typo at the end of the CSV.g snippet under the Extract a single word section. Currently reads:

     but should read:


  3. Sep 11, 2009

    Dedicated to other newbies who failed to reach the first "green tick" above:-

    1. Copy and Paste the text of CSV.g into ANTLRworks then do a:-
    Generate->Generate Code  // or do control-shift-G
    This creates a directory called
    output
    and inserts the generated files:-
    CSVParser.java
    CSVLexer.java

    2. Copy and paste CSVTests into e.g. WordPad and add these lines at the top:-
    import java.io.*;
    import org.testng.TestNG;
    import org.testng.annotations.*;
    import org.apache.tools.ant.*;
    import org.antlr.runtime.*;

    and then save it as "CSVTests.java" in the same directory i.e. in "output".

    3. At the command line, cd into
    output
    and compile (and "link") all three .java files by doing this:-
    javac  -cp  \TestNG\testng-5.10\testng-5.10-jdk15.jar;\ANT\apache-ant-1.7.1\lib\ant.jar;\antlr-3.1.3.jar  CSVLexer.java  CSVParser.java  CSVTests.java

    Then run the tests by doing this:-
    java  -ea  -cp  \TestNG\testng-5.10\testng-5.10-jdk15.jar;\ANT\apache-ant-1.7.1\lib\ant.jar;\antlr-3.1.3.jar;.  org.testng.TestNG  -testclass  CSVTests.class

    Evidently you will need to change various names and paths to suit your versions and "positions" of ANT, ANTLR and TestNG.

    And if it doesn't work, either I made a mistake (how do you do a control-shift-C/V in a Windows console?) or you did, e.g. did you miss out the current directory in the argument following the -cp (classpath), i.e. the dot after the semicolon:-
    ;.
    in other words allow CSVTests.class (in current directory) to be found too.