Interfacing StAX to ANTLR
With version 1.6, Java supports the StAX XML API. Using this API you can pull events one at a time, much as an ANTLR parser pulls tokens from a lexer. To me this sounds like an invitation to interface this XML capability to ANTLR, and that is what I have done as a prototype. This short article reports how it worked out.
Why in the first place?
Like it or not, XML has its place in IT. Every Java configuration file I have seen in the last few years is either a simple name/value property file or XML. XML processing in Java has also improved over the years and has become simpler and simpler. With something like Spring, there is not even a need to read XML yourself any more. However, if you have to, with SAX and also StAX you still have to keep track of XML events and their context. For many tasks, for example, you have to record the current XML element and sometimes even the element that encloses it. This is necessary because you have to know which element a certain text (PCDATA) event belongs to, and it requires something like a stack. To be more precise, all XML documents can be described using S-grammars, which are a strict subset of LL(1) grammars. And, as you know, ANTLR supports a superset of LL(1). This means all valid XML documents can easily be parsed by ANTLR using its LL(*) algorithm and LL(*) grammars. The idea is to let StAX take the lexer part and to write an ANTLR3 parser for every XML format you want to read.
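To make the bookkeeping concrete, here is a minimal sketch of hand-written StAX code that keeps such a stack. It is only an illustration: the class name ManualStaxWalk, the method describe, and the output format are made up for this example, and only standard javax.xml.stream calls are used.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: shows the state you must keep when using StAX directly.
class ManualStaxWalk {

    /** Returns one "path = text" entry per non-blank text event. */
    static List<String> describe(String xml) {
        List<String> out = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Merge adjacent text chunks so each text run arrives as one event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    open.addLast(reader.getLocalName());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    open.removeLast();
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    String text = reader.getText().trim();
                    if (!text.isEmpty()) {
                        // The stack tells us which element this text belongs to.
                        out.add(String.join("/", open) + " = " + text);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
        return out;
    }
}
```

Calling ManualStaxWalk.describe on a small made-up document such as <book><title>ANTLR</title></book> yields the single entry "book/title = ANTLR". Every additional rule about where an element may appear means more hand-written state on top of this stack.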
In a nutshell
Imagine you have an XML document like this (taken from the StAX tutorial)
and you want to parse it and emit all the important information, for example like this:
Using SAX, and even StAX, you would at least have to remember the latest element name in order to assign subsequent character data to the right element. It would mean even more work if you had to trace the whole hierarchy an element sits in. And, you know, keeping state like this really results in ugly code. What about this instead:
I have to admit, it is not very obvious at first glance. However, if you take a second look, most of it might remind you of the good old DTD, plus some XML tags, plus some Java code. But this sort of ANTLR3 grammar can actually parse the above XML input and generate the output accordingly! And my first version of the glue code to interface StAX to ANTLR is only about 100 lines of code. Tiny!
If you are not impressed, that's OK. Keep using the DOM, where the parsing code for the above XML would be longer than the glue code I have talked about. Or keep using SAX, where the code would be even longer, plus you get a headache on top. StAX alone could easily cope with that example, but more deeply nested structures would cause quite some code bloat as well. Interestingly, the boilerplate code you would have to write using StAX is pretty much the same as the code ANTLR3 generates from the above grammar.
Finally, ahem, err, if you actually are impressed, I have to confess that the above grammar isn't exactly a working one, but it is pretty close. I have used some make-up to make it more attractive. Now that I have actually managed to attract you, let's go for the real stuff.
Translating StAX XML events to ANTLR tokens
First problem: ANTLR expects tokens with an integer token type. An ANTLR parser uses this type to identify what a token is. Usually an ANTLR-generated lexer takes care of this. As we replace such a lexer with our StAX input, we need to do some work here. The code that does this is the core of the glue code. It reads the token file (containing textual token name/type pairs) that ANTLR generates from a grammar along with the parser source code. Like this:
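As a rough illustration of this step, and not the glue code's actual implementation, the sketch below loads such name/type pairs into a map. It assumes that every useful line of the token file has the form NAME=integer, which is how ANTLR 3 writes its .tokens files; the class name TokenFileLoader and the method load are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader for a token file with one NAME=integer pair per line.
class TokenFileLoader {

    static Map<String, Integer> load(Reader in) {
        Map<String, Integer> types = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Split at the LAST '=' so a quoted literal containing '=' still works.
                int eq = line.lastIndexOf('=');
                if (eq <= 0) {
                    continue; // blank or malformed line
                }
                String name = line.substring(0, eq).trim();
                String type = line.substring(eq + 1).trim();
                try {
                    types.put(name, Integer.parseInt(type));
                } catch (NumberFormatException e) {
                    // Not a name/type pair; ignore the line.
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return types;
    }
}
```

Given the made-up input lines TITLE_START=4 and TITLE_END=5, the resulting map answers get("TITLE_END") with 5. A plain HashMap is enough here because the mapping is read once and then only queried.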
Now, when the parser asks for the next token using the method nextToken, the StaxTokenSource gets the next XML event from StAX and uses the mapping to infer the right token type (slightly simplified):
If the XML event is text, we pass the text on to the token. Finally, here is getANTLRType, which finds the token type for the XML event:
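The following is only a sketch of what such a lookup might look like, not the article's own getANTLRType. It uses the naming convention described in the grammar section of this article (element name in upper case plus "_START" or "_END"). Everything else is an assumption made for illustration: the class and method names, the token name PCDATA for text events, and the constants -1 and 0, which mirror ANTLR 3's EOF and invalid token types.

```java
import java.util.Locale;
import java.util.Map;
import javax.xml.stream.XMLStreamConstants;

// Hypothetical mapping from a StAX event to an ANTLR token type.
class StaxTokenTypes {
    static final int EOF_TYPE = -1;    // assumption: same value as ANTLR 3's Token.EOF
    static final int INVALID_TYPE = 0; // assumption: same value as ANTLR 3's invalid token type
    static final String TEXT_TOKEN = "PCDATA"; // assumed token name for text events

    /** Builds the token name for an event, or returns null if the event has none. */
    static String tokenName(int eventType, String localName) {
        switch (eventType) {
            case XMLStreamConstants.START_ELEMENT:
                return localName.toUpperCase(Locale.ROOT) + "_START";
            case XMLStreamConstants.END_ELEMENT:
                return localName.toUpperCase(Locale.ROOT) + "_END";
            case XMLStreamConstants.CHARACTERS:
                return TEXT_TOKEN;
            default:
                return null; // comments, processing instructions, and so on
        }
    }

    /** Looks up the integer type in the map loaded from the token file. */
    static int antlrType(Map<String, Integer> types, int eventType, String localName) {
        if (eventType == XMLStreamConstants.END_DOCUMENT) {
            return EOF_TYPE;
        }
        String name = tokenName(eventType, localName);
        Integer type = (name == null) ? null : types.get(name);
        return (type == null) ? INVALID_TYPE : type;
    }
}
```

In a real token source the event type and local name would come from XMLStreamReader.next() and getLocalName(). Passing them in as plain values keeps the lookup easy to test without any XML input.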
What would your grammar REALLY look like
This is the complete, real, no-omission, no-make-up grammar:
Besides minor differences from the grammar presented before, you can see that the way you express start and end tags is rather ugly. You take the name of the tag in upper case and add "_START" if it is a start tag or "_END" if it is an end tag. This is a limitation for which I have no solution right now. Maybe changes to ANTLR3 would be necessary to make the grammar more natural.
Putting it all together
Finally, you need some glue code to put the parts (XML input, token definition file, and parser) together. Here it is:
OK, we did not quite make it. The grammar looks a little bit ugly and we cannot process attributes yet. But that is something one can work on. Additionally, although the generated code is readable for people with some parser knowledge, it does not quite look like hand-written StAX code in good style. An obvious solution would be to simplify the ANTLR output templates, as we can be sure we only need to handle S-grammars here. Might be fun.
Anyway, I hope to have shown that with the combination of StAX and ANTLR, XML processing can be fast, memory-efficient, and fun.