Another solution for Island Grammars Under Parser Control

Author: Break  

liujing.break@gmail.com

This page follows up on the article at

http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control

It addresses the same problem, so the original problem statement is copied here:

The Problem

The ANTLR 3 examples include 'island-grammar' which shows how to parse part of the input using an alternative grammar by invoking a new Lexer+Parser combination on the input when a certain token is recognized (specifically, when a start-of-comment marker is seen).

This demonstrates a nice, simple case, where the lexer can identify the embedded language on its own. More difficult to handle is the case where only the parser is able to see the point where the embedded language begins in the input.

Consider that in the 'island-grammar' example, the Lexer of the 'outer' grammar will spot the start of the section to be handled by island grammar when it sees the token '/**' (the start of a Javadoc-style comment section). This token has no other meaning within the language being defined in simple.g.

If the token that marks the start of the island grammar within your input can also have other uses within your language, the basic 'island-grammar' technique can't be used.
For instance, look at an example of a regular expression literal:

r = / b; f = r/m;
This assigns the regular expression ' b; f = r' (with the flag 'm') to the variable 'r'.

Compare that with the following line of code:

r = a / b; f = r/m;
Here, two statements appear on one line; the assignment of 'a / b' to the variable 'r' and the assignment of 'r/m' to the variable 'f'.

Clearly, the lexer for this language will not be able to determine on seeing '/' if this should represent a DIVIDE token, or the start of a REGEX token, since all that follows the '/' is the same in both of the above examples. It's only the context of what came before the '/' that allows correct recognition.
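
To make the ambiguity concrete, here is a minimal Java sketch. It is not part of the original article, and the class and method names (SlashContext, classifyFirstSlash) are hypothetical. It classifies the first '/' of a statement purely from the text that precedes it, which is exactly the information a lexer looking only at '/' and what follows does not use. The rule "an operand before the slash means division" is a deliberate simplification and would not cover every case in a real language.

```java
// Hypothetical illustration only: reads '/' as DIVIDE or REGEX from what came before it.
class SlashContext {
    /** Returns true when the text before '/' ends in an operand, so '/' must be DIVIDE. */
    static boolean previousIsOperand(String prev) {
        if (prev == null || prev.isEmpty()) return false;
        char last = prev.charAt(prev.length() - 1);
        return Character.isLetterOrDigit(last) || last == ')' || last == ']';
    }

    /** Classifies the first '/' in the source text as "DIVIDE" or "REGEX". */
    static String classifyFirstSlash(String source) {
        int slash = source.indexOf('/');
        if (slash < 0) throw new IllegalArgumentException("no '/' in input");
        // Only the left context is inspected; the text after '/' plays no part.
        String before = source.substring(0, slash).trim();
        return previousIsOperand(before) ? "DIVIDE" : "REGEX";
    }
}
```

Applied to the two example lines, classifyFirstSlash("r = / b; f = r/m;") yields "REGEX" and classifyFirstSlash("r = a / b; f = r/m;") yields "DIVIDE", even though the characters after the first '/' are identical. In a full grammar this left context is what the parser already knows, which is why the decision is better made under parser control.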

My solution

The solution uses gated semantic predicates.

First, a very simple sample grammar that reproduces the problem (something like ECMAScript with regular expression support):

Sample.g

Next is the island grammar for parsing regular expressions. This parser must return the last token of the valid content it matched.

Now we need to add a predicateRegex method to Sample.g above. It has four jobs:

  1. Provide a consumeReg() method. This method has to run even during backtracking, so we need to define @synpredgate to override the condition that gates actions.
  2. Call the island grammar parser.
  3. Skip the underlying characters so that the island grammar content is hidden from the main lexer and the main parser can ignore it completely.
  4. Memoize the island parser's result, because the predicate may be evaluated several times during backtracking and we don't want the island parser to run more than once.

JSRegexParser and JSRegexLexer, used above, are the parser and lexer for regular expressions; both are generated from JSRegex.g.
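
The steps above can be sketched without the ANTLR runtime as follows. This is only an illustration under stated assumptions, not the original predicateRegex code: the class RegexIsland, the method signatures, the islandRuns counter, and the toy scan method (which stands in for JSRegexParser) are all hypothetical. It shows a predicate that can be asked repeatedly, a memo keyed by start offset so the island scan runs once, and a consume step that moves the character cursor past the literal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the predicate logic; it does not use the ANTLR runtime.
class RegexIsland {
    private final String input;
    // Memo: start offset -> offset just past the literal, or -1 if none starts there.
    private final Map<Integer, Integer> memo = new HashMap<>();
    int islandRuns = 0; // counts real island scans, to make memoization visible
    int position = 0;   // stands in for the main char stream's current index

    RegexIsland(String input) { this.input = input; }

    /** Predicate: does a regex literal start at 'start'? Safe to call repeatedly. */
    boolean predicateRegex(int start) {
        return endOf(start) >= 0;
    }

    /** Consume: jump the cursor past the literal so the main lexer never sees it. */
    void consumeReg(int start) {
        int end = endOf(start);
        if (end < 0) throw new IllegalStateException("no regex literal at " + start);
        position = end;
    }

    /** Returns the memoized end offset, running the island scan only on a miss. */
    private int endOf(int start) {
        Integer cached = memo.get(start);
        if (cached != null) return cached;
        islandRuns++;
        int end = scan(start);
        memo.put(start, end);
        return end;
    }

    /** Toy island "parser": '/' body '/' flags, backslash escapes, single line. */
    private int scan(int start) {
        if (start >= input.length() || input.charAt(start) != '/') return -1;
        int i = start + 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\n') return -1;
            if (c == '\\') { i += 2; continue; } // skip the escaped character
            if (c == '/') {
                if (i == start + 1) return -1;   // empty body is not a literal
                i++;
                while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
                return i;                        // offset just past the flags
            }
            i++;
        }
        return -1;
    }
}
```

With the input "r = / b; f = r/m;", asking predicateRegex(4) twice runs the scan only once, and consumeReg(4) leaves position at 16, the offset of the final ';'. In the real grammar the same roles would be played by the gated predicate, the generated island parser, and the token stream's skip operation.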

RemovableTokenStream is a TokenStream extension I wrote to manipulate the underlying character stream.

The main parser must be run on a RemovableTokenStream instead of a CommonTokenStream.

RemovableTokenStream.java extends UnbufferedTokenStream. I recommend using UnbufferedTokenStream for the island lexer as well, since I haven't tested whether CommonTokenStream works with this solution.
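
The idea of skipping characters underneath an on-demand token stream can be illustrated with a small Java sketch. This is not RemovableTokenStream itself and does not use the ANTLR runtime; the class SkippableTokens and its methods are hypothetical. It assumes, as a reading of the design rather than something the article states, that lexing one token at a time matters because characters the parser has not yet asked about have not been tokenized, so they can still be jumped over.

```java
// Hypothetical miniature of an on-demand token stream with a skippable char cursor.
class SkippableTokens {
    private final String input;
    private int cursor = 0;

    SkippableTokens(String input) { this.input = input; }

    /** Lexes exactly one token from the cursor; returns null at end of input. */
    String nextToken() {
        while (cursor < input.length() && Character.isWhitespace(input.charAt(cursor))) {
            cursor++;
        }
        if (cursor >= input.length()) return null;
        int start = cursor;
        if (Character.isLetterOrDigit(input.charAt(cursor))) {
            while (cursor < input.length() && Character.isLetterOrDigit(input.charAt(cursor))) {
                cursor++;
            }
        } else {
            cursor++; // single-character operator or punctuation
        }
        return input.substring(start, cursor);
    }

    /** Index of the next unread character. */
    int index() { return cursor; }

    /** Hides input[cursor, charIndex) from the lexer by jumping over it. */
    void skipTo(int charIndex) {
        if (charIndex < cursor || charIndex > input.length()) {
            throw new IllegalArgumentException("bad skip target " + charIndex);
        }
        cursor = charIndex;
    }
}
```

For the input "r = / b; f = r/m;", reading two tokens gives "r" and "=", and after skipTo(16) the next token is ";": the regular expression text in between is never tokenized by the main lexer. A stream that has already lexed ahead would have produced tokens for that text, which is one plausible reason to prefer an unbuffered stream here.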


OK, done.

Now you can try parsing the sample content.
