Scannerless parser generators have an advantage over separate lexers and parsers: it's much easier to create island grammars, combine components of grammars, and deal with context-sensitive lexical constructs. I still think I prefer tokenizing the input, but I thought I would run an experiment to see what a scannerless ANTLR grammar would look like.
I started out with a grammar that contains an LL(*) but non-LL(k) rule (stat). Because we're treating characters as tokens, referencing rule id on the left edge of the second and third alternatives of stat represents an unbounded left prefix; id also conflicts with the keyword rule kreturn. Here's the grammar:
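(What follows is only a sketch: the rule bodies and character sets are reconstructed from the rule names in the trace and the token-type map shown below, so the details are approximate.)

```
grammar T;

prog    : stat+ ;

// LL(*) but not LL(k): the second and third alternatives both start with
// an arbitrarily long id, and id also overlaps with the keyword rule kreturn.
stat    : kreturn e ';'
        | id '=' e ';'
        | id ';'
        ;

e       : integer
        | id
        ;

kreturn : 'r' 'e' 't' 'u' 'r' 'n' ;

id      : letter+ ;
integer : digit+ ;

letter  : 'a' | 'b' | 'c' | 'd' | 'e' | 'n' | 'r' | 't' | 'u' ;
digit   : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
```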
The DFA that predicts the alternatives of rule stat looks complicated, but it's really just trying to see past the letters to the characters that follow. LL(*) handles this with no problem, but LL(k), with its fixed lookahead of k, couldn't predict the alternatives. Here is the DFA:
*(DFA predicting the alternatives of rule stat)*
The trick to making this work is to create a stream of tokens that are really characters. The problem is that, since we're making a parser grammar rather than a lexer grammar, ANTLR thinks that the character 'a' is some arbitrary token type instead of the ASCII code. Using the tokenNames array in the generated parser, the following class figures out the right token type for each input character.
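Something along these lines does the job (a sketch against the ANTLR v3 runtime; the class name CharTokenSource is made up):

```java
import java.util.HashMap;
import java.util.Map;
import org.antlr.runtime.*;

/** A TokenSource that emits one token per input character.  It derives the
 *  char -> token-type mapping from the generated parser's tokenNames array,
 *  where character literals referenced by the grammar show up as
 *  single-quoted, single-character entries such as "'a'".
 */
public class CharTokenSource implements TokenSource {
    CharStream input;
    Map<Integer, Integer> charToTokenType = new HashMap<Integer, Integer>();

    public CharTokenSource(CharStream input, String[] tokenNames) {
        this.input = input;
        for (int ttype = 0; ttype < tokenNames.length; ttype++) {
            String name = tokenNames[ttype];
            if (name.length() == 3 && name.startsWith("'") && name.endsWith("'")) {
                charToTokenType.put((int) name.charAt(1), ttype);
            }
        }
        System.out.println(charToTokenType); // ASCII code -> token type
    }

    public Token nextToken() {
        while (true) {
            int c = input.LA(1);
            if (c == CharStream.EOF) {
                CommonToken eof = new CommonToken(Token.EOF);
                eof.setLine(input.getLine());
                eof.setCharPositionInLine(input.getCharPositionInLine());
                return eof;
            }
            Integer ttype = charToTokenType.get(c);
            if (ttype == null) { // char never mentioned by the grammar; skip it
                System.out.println("no token type for char '" + (char) c + "'");
                input.consume();
                continue;
            }
            CommonToken t =
                new CommonToken(input, ttype, Token.DEFAULT_CHANNEL,
                                input.index(), input.index());
            t.setLine(input.getLine());
            t.setCharPositionInLine(input.getCharPositionInLine());
            input.consume();
            return t;
        }
    }

    public String getSourceName() { return input.getSourceName(); }
}
```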
To try this out, the following test rig works. The only difference is that we are using a weird kind of lexer:
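A rig along these lines works (again a sketch; TParser is the parser ANTLR generates from T.g, and CharTokenSource is the token source sketched above):

```java
import org.antlr.runtime.*;

public class Test {
    public static void main(String[] args) throws Exception {
        CharStream chars = new ANTLRFileStream(args[0]);   // e.g. file "input"
        TokenSource lexer = new CharTokenSource(chars, TParser.tokenNames);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();                          // buffer everything (ANTLR 3.3+)
        System.out.println(tokens.getTokens()); // dump the char-level tokens
        TParser parser = new TParser(tokens);
        parser.prog();  // with the trace option on, this prints enter/exit events
    }
}
```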
For input:
return 23; a = b;
We get the following output (the hash table printed out first is the mapping from ASCII code to token type). We ignore any input character for which the grammar has no reference, such as spaces and newlines. Finally, I print the token list before parsing and then, since the parser was generated with the trace option, we see the rule entry and exit events during the parse:
$ java org.antlr.Tool T.g
$ javac *.java
$ java Test input
{48=4, 49=5, 50=6, 51=7, 52=8, 53=9, 54=10, 55=11, 56=12, 57=13, 59=14, 61=15, 97=16, 98=17, 99=18, 100=19, 101=20, 110=21, 114=22, 116=23, 117=24}
no token type for char ' '
no token type for char '
'
no token type for char ' '
no token type for char ' '
no token type for char '
'
[[@0,0:0='r',<22>,1:0], [@1,1:1='e',<20>,1:1], [@2,2:2='t',<23>,1:2], [@3,3:3='u',<24>,1:3], [@4,4:4='r',<22>,1:4], [@5,5:5='n',<21>,1:5], [@6,7:7='2',<6>,1:7], [@7,8:8='3',<7>,1:8], [@8,9:9=';',<14>,1:9], [@9,11:11='a',<16>,2:1], [@10,13:13='=',<15>,2:3], [@11,15:15='b',<17>,2:5], [@12,16:16=';',<14>,2:6], [@13,0:0='<no text>',<-1>,3:1]]
enter prog [@0,0:0='r',<22>,1:0]
enter stat [@0,0:0='r',<22>,1:0]
enter kreturn [@0,0:0='r',<22>,1:0]
exit kreturn [@6,7:7='2',<6>,1:7]
enter e [@6,7:7='2',<6>,1:7]
enter integer [@6,7:7='2',<6>,1:7]
enter digit [@6,7:7='2',<6>,1:7]
exit digit [@7,8:8='3',<7>,1:8]
enter digit [@7,8:8='3',<7>,1:8]
exit digit [@8,9:9=';',<14>,1:9]
exit integer [@8,9:9=';',<14>,1:9]
exit e [@8,9:9=';',<14>,1:9]
exit stat [@9,11:11='a',<16>,2:1]
enter stat [@9,11:11='a',<16>,2:1]
enter id [@9,11:11='a',<16>,2:1]
enter letter [@9,11:11='a',<16>,2:1]
exit letter [@10,13:13='=',<15>,2:3]
exit id [@10,13:13='=',<15>,2:3]
enter e [@11,15:15='b',<17>,2:5]
enter id [@11,15:15='b',<17>,2:5]
enter letter [@11,15:15='b',<17>,2:5]
exit letter [@12,16:16=';',<14>,2:6]
exit id [@12,16:16=';',<14>,2:6]
exit e [@12,16:16=';',<14>,2:6]
exit stat [@13,0:0='<no text>',<-1>,3:1]
exit prog [@14,0:0='<no text>',<-1>,3:2]
3 Comments
Apr 12, 2011
Peter Kooiman
I'm thinking, won't whitespace handling (or the lack thereof) bite you in the ass on this?
E.g., the input
"return er:"
when the intent was "return er;"
will happily parse as if you had written "returner:"
enter prog [@0,0:0='r',<7>,1:0]
enter stat [@0,0:0='r',<7>,1:0]
enter id [@0,0:0='r',<7>,1:0]
enter letter [@0,0:0='r',<7>,1:0]
exit letter [@1,1:1='e',<8>,1:1]
enter letter [@1,1:1='e',<8>,1:1]
exit letter [@2,2:2='t',<9>,1:2]
enter letter [@2,2:2='t',<9>,1:2]
exit letter [@3,3:3='u',<10>,1:3]
enter letter [@3,3:3='u',<10>,1:3]
exit letter [@4,4:4='r',<7>,1:4]
enter letter [@4,4:4='r',<7>,1:4]
exit letter [@5,5:5='n',<11>,1:5]
enter letter [@5,5:5='n',<11>,1:5]
exit letter [@6,7:7='e',<8>,1:7]
enter letter [@6,7:7='e',<8>,1:7]
exit letter [@7,8:8='r',<7>,1:8]
enter letter [@7,8:8='r',<7>,1:8]
exit letter [@8,9:9=':',<6>,1:9]
exit id [@8,9:9=':',<6>,1:9]
exit stat [@9,0:0='<no text>',<-1>,2:1]
exit prog [@10,0:0='<no text>',<-1>,2:2]
Not quite what was intended... Fixing that does not at first glance seem easy; you'd need to handle whitespace explicitly in the parser grammar, which is not nice.
Apr 12, 2011
Terence Parr
Most definitely a problem. Rats! solves this, so I'll look. Also note that "return = 4;" also works.
Ah. You must say "Space" everywhere in a PEG; e.g., see Rats! example:
http://www.cs.nyu.edu/rgrimm/xtc/rats-intro.html
The expanded Java 1.5 grammar mentions Space 89 times, according to grep.
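In ANTLR terms, the toy grammar above would end up looking something like this (just a sketch; the rule name sp is made up):

```
sp   : (' ' | '\t' | '\n')* ;  // optional whitespace; every reference below
                               // is one more chance to forget it

stat : kreturn sp e sp ';' sp
     | id sp '=' sp e sp ';' sp
     | id sp ';' sp
     ;

// ...and the keyword/identifier overlap (kreturn vs. id) still needs
// separate treatment.
```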
Apr 12, 2011
Peter Kooiman
Ouch. Having to deal with Spacing 89 times is, well, painful and, I would venture, error-prone. +1 for tokenizing.
I see Rats! eases the pain somewhat with the concept of a "void production" that eats, then ignores, Spacing (which includes comments).
If you were to amend the ANTLR sample, you would have to deal with whitespace noise in the matches for id and integer; there is no such thing as a void production in ANTLR.