Terence Parr Blog from Apr 11, 2011

Skip to end of sidebar Go to start of sidebar
Sample Context-sensitive Lexer in ANTLR v3

Scannerless parsing generators have an advantage over separate lexers and parsers: it's much easier to create Island grammars, combine components of grammars, and deal with context-sensitive lexical constructs. I still think I prefer tokenizing the input, but thought I would run an experiment to see what a scannerless ANTLR grammar would look like.

I started out with the grammar that contained an LL(*) but non-LL(k) rule (stat). Because we're looking at characters as tokens, referencing rule id on the left edge of the second two alternatives of stat represents an infinite, left prefix. id also conflicts with the keyword rule kreturn. Here's the grammar:

T.g

The DFA that predicts alternatives of rule stat looks complicated, but it's really just trying to see past the letters to the characters that follow. LL(*) handles this with no problem but LL(k), with its fixed k lookahead, couldn't predict alternatives. Here is the DFA:

The trick to making this work is to create a stream of tokens that are really characters. The problem is that, since were making a parser grammar not a lexer grammar, ANTLR thinks that the character 'a' is really some random token type instead of the ASCII code. Using the tokenNames array in the generated parser, the following class figures out what the right token type is for each input character.

CharsAsTokens.java

To try this out, the following test rig works. The only difference is that we are using a weird kind of lexer:

Test.java

For input:

input
return 23;
a = b;

We get the following output (the hash table printed out first is the mapping of ASCII code to token type). we ignore any input character for which the grammar has no reference, such as newline whitespace. Finally, I print out the token list before parsing and then, generating with the trace option, we see the entry and exit rule events during the parse:

$ java org.antlr.Tool T.g
$ javac *.java
$ java Test input
{48=4, 49=5, 50=6, 51=7, 52=8, 53=9, 54=10, 55=11, 56=12, 57=13, 59=14, 61=15, 97=16, 98=17, 99=18, 100=19, 101=20, 110=21, 114=22, 116=23, 117=24}
no token type for char ' '
no token type for char '
'
no token type for char ' '
no token type for char ' '
no token type for char '
'
[[@0,0:0='r',<22>,1:0], [@1,1:1='e',<20>,1:1], [@2,2:2='t',<23>,1:2], [@3,3:3='u',<24>,1:3], [@4,4:4='r',<22>,1:4], [@5,5:5='n',<21>,1:5], [@6,7:7='2',<6>,1:7], [@7,8:8='3',<7>,1:8], [@8,9:9=';',<14>,1:9], [@9,11:11='a',<16>,2:1], [@10,13:13='=',<15>,2:3], [@11,15:15='b',<17>,2:5], [@12,16:16=';',<14>,2:6], [@13,0:0='<no text>',<-1>,3:1]]
enter prog [@0,0:0='r',<22>,1:0]
enter stat [@0,0:0='r',<22>,1:0]
enter kreturn [@0,0:0='r',<22>,1:0]
exit kreturn [@6,7:7='2',<6>,1:7]
enter e [@6,7:7='2',<6>,1:7]
enter integer [@6,7:7='2',<6>,1:7]
enter digit [@6,7:7='2',<6>,1:7]
exit digit [@7,8:8='3',<7>,1:8]
enter digit [@7,8:8='3',<7>,1:8]
exit digit [@8,9:9=';',<14>,1:9]
exit integer [@8,9:9=';',<14>,1:9]
exit e [@8,9:9=';',<14>,1:9]
exit stat [@9,11:11='a',<16>,2:1]
enter stat [@9,11:11='a',<16>,2:1]
enter id [@9,11:11='a',<16>,2:1]
enter letter [@9,11:11='a',<16>,2:1]
exit letter [@10,13:13='=',<15>,2:3]
exit id [@10,13:13='=',<15>,2:3]
enter e [@11,15:15='b',<17>,2:5]
enter id [@11,15:15='b',<17>,2:5]
enter letter [@11,15:15='b',<17>,2:5]
exit letter [@12,16:16=';',<14>,2:6]
exit id [@12,16:16=';',<14>,2:6]
exit e [@12,16:16=';',<14>,2:6]
exit stat [@13,0:0='<no text>',<-1>,3:1]
exit prog [@14,0:0='<no text>',<-1>,3:2]
ST v4 speed test with ANTLR

I just tested the new version of ANTLR that uses ST v4 not v3. In terms of code generation, it's just about twice as fast given plenty of memory (750M). To process Jim Idle's 15296 line TSQL grammar, it takes 1760ms instead of 3624ms, though that doesn't alter the overall wall clock performance much. It still takes about 12 seconds to process the grammar. It generates a whopping 168,677 lines of Java code (not including the lexer). That gives us about 95,830 generated lines of code per second versus 46,540 lines per second previously.

Given more restrictive memory, 175M, it's a completely different story. The wall clock for ANTLR with ST v4 on TSQL is only about 15.5s vs 24s for the version with ST v3. Looking at just the code generation component, ST v4 generates code in 2667ms vs 11394ms. It's about 4.2x faster with restricted memory (smile) This can matter if you use ST in a larger application that needs lots of RAM. I took the ram up to 230M before the ST v3 version was able to complete without speed problems from memory pressure.

So, ST v4 seems to be about 2x faster than v3 when there's no memory pressure / GC thrashing. With memory constraints, ST v4 is much faster because it seems to use less memory.