The ANTLR 3 examples include 'island-grammar' which shows how to parse part of the input using an alternative grammar by invoking a new Lexer+Parser combination on the input when a certain token is recognized (specifically, when a start-of-comment marker is seen).
This demonstrates a nice, simple case, where the lexer can identify the embedded language on its own. More difficult to handle is the case where only the parser is able to see the point where the embedded language begins in the input.
Consider that in the 'island-grammar' example, the Lexer of the 'outer' grammar will spot the start of the section to be handled by the island grammar when it sees the token '/**' (the start of a Javadoc-style comment section). This token has no other meaning within the language being defined in simple.g.
If the token that marks the start of the island grammar within your input can also have other uses within your language, the basic 'island-grammar' technique can't be used.
For instance, look at an example of a regular expression literal, in a line such as 'r = / b; f = r/m;':
This assigns the regular expression ' b; f = r' (with the flag 'm') to the variable 'r'.
Compare that with a line of code such as 'r = a / b; f = r/m;':
Here, two statements appear on one line; the assignment of 'a / b' to the variable 'r' and the assignment of 'r/m' to the variable 'f'.
Clearly, the lexer for this language will not be able to determine, on seeing '/', whether it should produce a DIVIDE token or start a REGEX token, since everything that follows the '/' is the same in both of the above examples. Only the context of what came before the '/' allows correct recognition.
It may be possible to have your lexer track the last significant token emitted, and to process the input following '/' differently depending on whether the token preceding it was an IDENT or an ASSIGN, for instance. This page demonstrates another approach, which is to have the parser direct a temporary switch to an island grammar.
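To make the lexer-side alternative concrete, here is a minimal sketch of a hand-written tokenizer that remembers the last significant token and uses it to decide what '/' means. It is plain Java rather than ANTLR-generated code, and the class name SlashAwareLexer, its token types, and the rule "a '/' directly after an IDENT is a DIVIDE, anything else starts a REGEX" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hand-written lexer; illustrates context tracking only. */
class SlashAwareLexer {
    enum Type { IDENT, ASSIGN, DIVIDE, REGEX, SEMI }

    static final class Tok {
        final Type type;
        final String text;
        Tok(Type type, String text) { this.type = type; this.text = text; }
    }

    static List<Tok> tokenize(String in) {
        List<Tok> out = new ArrayList<>();
        Type last = null; // type of the last significant (non-whitespace) token
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            Type type;
            if (Character.isLetter(c)) {
                while (i < in.length() && Character.isLetterOrDigit(in.charAt(i))) i++;
                type = Type.IDENT;
            } else if (c == '=') {
                i++;
                type = Type.ASSIGN;
            } else if (c == ';') {
                i++;
                type = Type.SEMI;
            } else if (c == '/') {
                if (last == Type.IDENT) {
                    // An operand came before, so this is division (assumed rule).
                    i++;
                    type = Type.DIVIDE;
                } else {
                    // No operand before the '/', so treat it as a regex literal.
                    int close = in.indexOf('/', i + 1);
                    if (close < 0) throw new IllegalArgumentException("unterminated regex literal");
                    i = close + 1;
                    while (i < in.length() && Character.isLetter(in.charAt(i))) i++; // flags
                    type = Type.REGEX;
                }
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
            out.add(new Tok(type, in.substring(start, i)));
            last = type;
        }
        return out;
    }
}
```

Even in this toy form the lexer has to carry a piece of the grammar (which tokens may precede an operand), and that duplication is what handing the decision to the parser avoids.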
Watch that lookahead
The first problem with altering the way that input is processed from a parser action is that the standard ANTLR 3 TokenStream implementations snarf the entire input in one go, before the parser has consumed its first token, and then feed the parser with tokens one-by-one from an internal list.
Since the tokens of the island grammar are not compatible with tokens of the outer grammar, we need to avoid this behavior. Imagine the regular expression literal /"/, which consists of just a double-quote: if the outer grammar were to try to interpret that, it would bomb out, thinking it was looking at an unterminated string literal.
The metaas project includes an example implementation of TokenStream that lazily loads tokens from the underlying TokenSource (it also implements a doubly-linked list of Token objects, for other reasons):
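The lazy-loading idea can be sketched independently of ANTLR. The class below is not the metaas implementation; it is a hypothetical stand-in (LazyTokenBuffer, with a java.util.function.Supplier playing the role of the TokenSource) that pulls a token from its source only when the parser's lookahead actually asks for it. The method name LT mirrors the lookahead method of ANTLR's TokenStream, but nothing here implements that interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical lazy token buffer; a sketch of the idea, not the metaas class. */
class LazyTokenBuffer<T> {
    private final Supplier<T> source;            // stands in for TokenSource.nextToken()
    private final List<T> buffer = new ArrayList<>();
    private int p = 0;                           // index of the next token to consume
    private int pulled = 0;                      // how many tokens the source has produced

    LazyTokenBuffer(Supplier<T> source) {
        this.source = source;
    }

    /** Returns the k-th token of lookahead (k >= 1), reading only as far as needed. */
    T LT(int k) {
        while (buffer.size() < p + k) {
            buffer.add(source.get());
            pulled++;
        }
        return buffer.get(p + k - 1);
    }

    /** Moves past the current token. */
    void consume() {
        LT(1); // make sure the current token has been read
        p++;
    }

    /** Number of tokens requested from the source so far. */
    int pulledCount() {
        return pulled;
    }
}
```

With a buffer like this, the source has been asked only for the tokens up to the furthest lookahead when a parser action runs, so the characters after that point are still unread and can be handed to a different lexer. Any lookahead the parser has already performed past the switch point would still have been tokenized by the outer lexer, which is why the amount of lookahead matters.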
Building a parser to use this TokenStream will then look like,
We extend our grammar definition to deal with this kind of literal, and add the output of the island grammar to the AST being built,
Magic needs to happen in the handleRegexp() method, but it will need access to a lot of the lexer's internal state to work. Specifically, we'll have to provide:
- The CharStream implementation (ANTLRReaderStream), so that we can get at the as-yet unprocessed input
- The lexer, so that we can potentially re-initialize its input at exactly the point where the island grammar input finishes
These are not normally accessible from the parser, so we need to define a few things in the @parser::members section so that they can be supplied by the calling code:
Now we should have the pieces we need to implement the handleRegexp() method:
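The real method depends on the generated lexer and parser, so as a standalone illustration of its core step, here is a sketch that works on a plain String instead of a CharStream. Given the offset of the opening '/', it reads the regular expression literal and reports the offset at which the outer lexer would need to resume. The names RegexIsland, Result, and scan are hypothetical, and the hand-written scanning loop is only a stand-in for invoking a separate island Lexer+Parser.

```java
/** Hypothetical stand-in for the island-grammar step; uses no ANTLR API. */
class RegexIsland {
    static final class Result {
        final String pattern;  // text between the two '/' delimiters
        final String flags;    // letters directly after the closing '/'
        final int resumeAt;    // offset where the outer lexer should continue
        Result(String pattern, String flags, int resumeAt) {
            this.pattern = pattern;
            this.flags = flags;
            this.resumeAt = resumeAt;
        }
    }

    /** Scans a regex literal; input.charAt(slash) must be the opening '/'. */
    static Result scan(String input, int slash) {
        int n = input.length();
        int i = slash + 1;
        StringBuilder pattern = new StringBuilder();
        while (i < n && input.charAt(i) != '/') {
            if (input.charAt(i) == '\\' && i + 1 < n) {
                pattern.append(input.charAt(i++)); // keep the backslash
            }
            pattern.append(input.charAt(i++));     // the (possibly escaped) character
        }
        if (i >= n) {
            throw new IllegalArgumentException("unterminated regular expression literal");
        }
        i++; // skip the closing '/'
        int flagsStart = i;
        while (i < n && Character.isLetter(input.charAt(i))) i++;
        return new Result(pattern.toString(), input.substring(flagsStart, i), i);
    }
}
```

For the input 'r = / b; f = r/m;' with the '/' at offset 4, this yields the pattern ' b; f = r', the flag 'm', and a resume offset pointing at the final ';'. In the parser itself, the resume offset is what would be used to re-initialize the outer lexer's input once the island has been consumed.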
This does seem to be a pretty complex way of doing things, but it also seems to work. Repeated string-copying in this implementation probably also means that it isn't the speediest solution.