Home | Download | ANTLRWorks | Wiki | About ANTLR | Feedback | Support | Bugs | v2


Latest version is 3.0.1
Download now! »

Download
» Home
» Download
» ANTLRWorks
» News
»Using ANTLR
» Documentation
» FAQ
» Articles
» Grammars
» File Sharing
» Runtime API
» Tech Support
» Bug Tracking
»About ANTLR
» What is ANTLR
» Why use ANTLR
» Showcase
» Testimonials
» Getting Started
» Software License
» ANTLR WebLogs
» ANTLR Workshops
»StringTemplate
»ANTLR v2
»Feedback
»Credits
»Contact


Support StringTemplate, ANTLR Project by making a donation! Terence often pays for things like the antlr.org server, conference travel, and this site design (that alone cost US$1000). Buy him a beer and pizza remotely ;)

Search



Aspects, Actions, Rewriting, Attributes

RSS Feed

June 25, 2007

Tree and token stream rewriting. Starting with simple token rewriting. I've thought about this and I think we can fold rewriting patterns into grammar composition and token stream stuff. Here's a simple rule that just replaces a token:

decl : 'int' ID ';' -> 'var' ID ';' ;

Presumably a variant of the TokenRewriteStream can insert tokens not just text. This would just replace the start..stop with the right hand side. If 'var' is a string, it would have to be tokenized. Hmm...only reason to go back to tokens is if you are going to do multiple parses. If going to text, just use output=template.

Ok, if going token to token, should really be able to use same lexer as we're going language L to L.

term	:	ID		-> INT["1"]
	|	ID '^' INT	-> INT ID '^' INT[ival($INT.text)-1]
	|	INT ID		-> INT
	|	c=INT ID '^' e=INT
	        -> INT[ival($c.text)*ival($e.text)] ID '^' INT[ival($e.text)-1]
	;

where ival(token) returns the integer value of the token.

Syntax-directed token rewrites

That is not as obvious as an exemplar-based rewrite where i,j,k,x,y,z,v are IDs and c,n,m are ints and s,t,u are strings (Idea I took from Andy Tripp):

term:
	"<x>" -> "1"
	"<x>^<n>" -> "<n><x>^" {ival($n.text)-1}
	"<c><x>^<n>" -> {ival($c.text)*ival($n.text)} "<x>^" {ival($n.text)-1}

The right hand side would be tokenized statically (given a pre-existing lexer) except the actions, which must be tokenized after evaluation:

term:
	"<x>" -> INT["1"]
	"<x>^<n>" -> $n $x '^' {ival($n.text)-1]}
	"<c><x>^<n>" -> {ival($c.text)*ival($n.text)} "<x>^" {ival($n.text)-1}

Andy has rules like:

v + x --> v.add(x)

But, for speed and sake of context, I like specifying the rule that matches the input. I also like delimiters as I might want to match white space at some point:

expr:
	"<x> + <n>" -> "<x>.add(<n>)"

We need the <...> because we can't correctly/generally identify x and n without them. Need to identify the elements on the left and tokenize the right if we are going to token stream. First break up into rule elements and literals:

x=ID '+' n=INT -> $x '.add(' $n ')'

then tokenize:

x=ID '+' n=INT -> $x '.' ID["add"] '(' $n ')'

where we assume that the ".add()" notation is also in the language we are translating.

If going to text we can leave as a template. In that case, the rewrite string is converted to have x, n as template attribute refs <x> and <n>:

x=ID '+' n=INT -> template(x={$x}, n={$n}) "<x>.add(<n>)".
StringTemplate thought. And, if we can do that, why can't we use that for output=template? The elements in <...> refer to labels. Can't do <ID> and such though can we? Sure. why not
ID '+' INT -> "<ID>.add(<INT>)"

What about property refs? Sure. That string is really:

ID '+' INT -> template(ID={$ID}, INT={$INT}) "<ID>.add(<INT>)"

where ID and INT can be template attributes like any other (just reusing the names).

Uh oh. We want to keep the same tokens if possible rather than retokenizing I'd say. So can't convert to string and then retokenize. Need to keep tokens in tact as somebody might annotate tokens with info during translation. We'll accept a string with holes but no ST extensions unless you jump into a {...} action and use %{...}.

Repeated elements

Take TXL's example of dot product: (1 2 3).(3 2 1).

// grammar rules
dot : vector '.' vector ;
vector : '(' INT+ ')' ;

// rewrite rules
dot:
'(' n+=ID+ ").(" m+=INT+ ')' -> {dotProduct($n, $m)}

That rewrite would get translated (double quote string tokenized) to

'(' n+=ID+ ')' '.' '(' m+=INT+ ')' -> {dotProduct($n, $m)}

where the {...} action would return a string that is tokenized at run-time. The $n and $m are List objects per usual ANTLR interpretation.

Ok, now, let's turn to rule references.

Rule references

stat:
	"if ( true ) <s1=stat> else <s2=stat>" -> "<s1>"
	"if ( true ) <stat>" -> "<stat>"

Also,

expr:
	"<multExpr> + 0" -> "<multExpr>" // or $multExpr
	"0 + <multExpr>" -> "<multExpr>"

The 0 will be tokenized as INT. That has to be converted to a predicated element. See next section.

More Andy rewrites:

strcasecmp(v1, v2) != 0 --> !v1.equalsIgnoreCase(v2)
strcasecmp(v1, v2) == 0 --> v1.equalsIgnoreCase(v2)
strcasecmp(v1, v2) --> v1.compareToIgnoreCase(v2)

Seems like I need to use <expr> not v1, v2:

"strcasecmp(<v1=expr>, <v2=expr>) != 0" --> "!<v1>.equalsIgnoreCase(<v2>)"
"strcasecmp(<v1=expr>, <v2=expr>) == 0" --> "<v1>.equalsIgnoreCase(<v2>)"
"strcasecmp(<v1=expr>, <v2=expr>)" --> "<v1>.equalsIgnoreCase(<v2>)"
function:
	"void main(int argc, char *argv[]) {" stat+ '}' ->
<<
public class MyClass {
    public static void main(String[] args) {
	<stat+>
    }
}
>>

Predicates

What about conditions patterns must satisfy? Should it be just a predicate on the front or preds on the individual elements? Need to constrain tokens:

"x + 0" -> "<x>"

becomes

x=ID '+' i=INT {$i.text.equals("0")}? -> $x

Have to allow predicates also on left:

atom:
	{inMethod}? ID -> "locals[" {$ID.localIndex} "]"
		    ID -> "global_<ID>"

Input traversal and multiple passes

How many times are the rules applied and in what order? First, the rewrites are only applied in the grammatical context specified by the rule. Then, the rules are ordered by precedence just like autobacktracking mode.

After each replacement, I assume we'd continue parsing after the rewritten input construct and look for the next match. Would we then just rewind the stream and do another pass until nothing changes? Perhaps, but it's pretty inefficient to pass over the entire input again just to look for a few patterns on expr. I think TXL will repeatedly apply the expr rewrite rules for a single expr until nothing changes and then continue. Maybe we could do this as an option? Or, for efficiency, we could map all locations in the input that start an expr and then do like an increment parse, jumping from expr to expr. The problem is that we'd lose context; i.e., which rule invoked that expr.

Another way would be to make the parse tree during a first pass and then alter the parse tree during each pass. We could then jump straight to the expr subtree root nodes and look upwards for the context.

Actually, repeatedly applying the expr rule set until nothing changes ala TXL is probably the right idea. No other rewrite rules can match the expr input so only a single overall pass is necessary. So there is local iteration, but each rule set represents a single pass. You could define multiple rule sets to define multiple passes; e.g., remove identify operations then define implicit scalars.

Avoid parse tree construction to save time/space.

When we have input f(f(x)) and we are translating f() to g(), when does the arg list get translated?

primary:
	"f(<expr>)" -> "g(<expr>)"

First we apply the rewrite to the outside and get g(f(x)). Then we try the rewrites again for primary. We can't jump over the args. OH! It's invoking rule expr so it will get back down to primary and do it recursively. Marvelous. This way it kind of manually says which rewrites to go do.

Pattern matching implementation

I think we can just list the patterns first in the indicated rule and turn on autobacktracking.

expr
options {backtrack=true;}
	:	x=ID '+' y=INT {is "0"}? -> $x
	|	y=INT {is "0"}? '+' x=ID -> $x
	|	normal-expr-alternatives
	;

How do we make it spin around expr though until it doesn't change? Perhaps a method that invokes the rule:

while ( changed ) {
	int m = input.mark();
	expr();
	input.rewind(m);
}

But, every reference to a rule with rewrites would have to call a meta-function like this instead of the rule directly. Hmm...does that mean the grammar must be regenerated every time I add a rewrite to a new rule? Is this related to grammar composition where I pull in components of a previous grammar?

Hidden channels

How do we handle comments in between tokens we rewrite? Ric Klaren had the idea to match off-channel tokens on the left so we can reference on the right.

// move comments in front of assignments to end
assign:
	.COMMENT ID '=' expr ';' -> ID '=' expr ';' COMMENT

where the dot in front means "hidden" like in UNIX filenames.

May 9, 2006

Gotta figure out how to deal with actions during backtracking. Started FAQ entry which explains the issue.

What if we could auto unroll any object/data-structured you identified somehow as transactional? You would give me an undo() method for each object, like a hashtable. I could record the arguments to pass to undo so it knows how to undo. This would not be perfect as the arguments could point at stuff that could change (like a stringbuffer), but would be a poor man's transactions.

We need to keep in mind that MOST of the time, this feature is unnecessary. It's only needed when you need both syn and sem preds and they interact.

Let's see. Here's a simple example to chew on:

{
Scope scope = null; // current scope
}

a    :    (block '.')=>block '.' // dummy rule to force backtrack of block
    |    block '!'
    ;

block
    :    {scope = new LocalScope(scope);} // push new scope
        '{' statement* '}'
        {scope = scope.parent;} // pop scope
    ;

statement
    :    "int" ID ';' {scope.define($ID.text, new VariableSymbol($ID.text));}
    |    "if" ...
    ;

First time through on input

{ int i; }.

ANTLR backtracks through block and statement doing no actions. Then upon success, rewinds, and does it again with feeling. Upon input

{ int i; }!

ANTLR backtracks through block and statement doing no actions. Upon failure, it rewinds and does 2nd alt, calling block with feeling.

This is the normal case. What if we needed a semantic predicate in statement that needed to know whether an ID was a var?

statement
    :    "int" ID ';' {scope.define($ID.text, new VariableSymbol($ID.text));}
    |    "if" ...
    |    {isVar(input.LT(1))}? ID ... // some alt predicated upon ID being a variable name
    ;

Here, you MUST have the define() function call executed or else the ...? sem pred will always fail.

Alright. We need to unroll the define() action for sure but we could avoid if by throwing out the entire scope. If we executed the push/pop scope actions, the definitions would actually disappear already and automatically! Wow, interesting!

Let's keep going with the undo() method though. I'd need to track the scope and ID.text and VariableSymbol objects so undo would know what to do. How can I know what to track? Impossible. Users will have to provide a specific undo action that I can execute:

statement
    :    "int" ID ';'
        {scope.define($ID.text, new VariableSymbol($ID.text));}
        @undo {scope.remove($ID.text);}
    |    "if" ...
    |    {isVar(input.LT(1))}? ID ... // some alt predicated upon ID being a variable name
    ;

That would unroll it no problem. Let's assume that plain ... actions always exec when an @undo is provided (heh, I like that).

Where would the undo action get executed though? What if the undo action references a local variable? It won't even compile if I move it out to the commit/rollback of the synpred up in a(). The undo cannot be done in the rule as statement will be called multiple times. It has t be at the transaction end (synpred fail/success) where I roll back the input. Oh! Also, if it references input.LT(1) that will be at some unknown location during the undo action's rollback! There would have to be lots of restrictions on the undo actions. For example, I can see already that $ID.text will be unavailable in the rollback action in a().

Ok, taking the cue from Jeff Barnes, perhaps the undo action actually creates an object that parameterizes the action and executes it during rollback in reverse order:

@undo(Scope s=scope, String id=$ID.text) {s.remove(id);}

Here it would evaluate scope and $ID.text in it's original context but execute the action during rollback. The object could be:

class Action324 extends UndoAction {
    Scope s;
    String id;
    public Action324(Scope s, String id) {...set fields...}
    public void undo() { s.remove(id); }
}
The action would be translated to:
undoActions.add(new Action324(scope,$ID.text));

Later we can walk undoActions and undo everything.

Hmm...would this work? Surely in this case. The key is passing all necessary data to the undo as evaluated in its original context.

January 6, 2006

I've been resisting the temptation to introduce a new symbol to handle templates. For example, to avoid adding a new special symbol, I introduced $templates::foo(args) as the template constructor to build foo, which translates to:

st = templateLib.getInstanceOf("foo");
st.setAttribute("arg1", ...);
...

With other template libs, you'd do $Java::method(...) etc... Now I'm finding real situations where ST integration can be improved. The ctor problem is reasonably solved, but setting attributes still sucks: st.setAttribute("arg1", e1); is repeated many times all over. A better notation would be $st.arg1 = e1; but that is highly ambiguous with the $x.y notation. Anyway, after looking at a number of real examples now, I find the urge to introduce a new symbol to help the human brain separate rule/scope attribute references from template syntactic sugar. I propose the following (developed in collaboration with Jean Bovet and influenced by Hartmut Kocher's suggestions):

  • %foo(...) ctor (even shorter than $templates::foo(...))
  • %(...) anonymous template from string expr
  • %x.y = z; set template attribute y of x (always set never get attr) to z [languages like python without ';' must still use the ';' which the code generator is free to remove during code gen]
  • %(expr).y = z; template attribute y of StringTemplate-typed expr to z

What about other template scopes?

  • %Java::method(...) scoped constructor
  • %CPP::method...)

Make sense? Objections? hard to tell until you see a real example ;) I see a lot of

a : ID -> {new StringTemplate($ID.text)} ;

which would be now

a : ID -> {%($ID.text)} ;

or even simpler:

a : ID -> %($ID.text) ;

I see lots of setAttribute stuff like:

a[int foo] : ID
	{
	StringTemplate x = $templates::foo();
	...
	x.setAttribute("arg1", $ID.text);
	x.setAttribute("arg2", $a.foo);
	$st = x;
	}
  ;

Now it would be:

a[int foo] : ID
	{
	StringTemplate x = %foo();
	...
	%x.arg1 = $ID.text;
	%x.arg2 = $foo;
	$st = x;
	}
  ;

So x is just a variable like you expect; to use shortcut access to ST stuff, you need the %. Note how the $ stuff is now clearly separated from the % template stuff.

December 23, 2005

On Dec 23, 2005, at 3:12 AM, Hartmut Kocher wrote:
I think there are some common scenarios:

1) Call different templates (even when I only return one template, I might use some templates during calculation). We could use a common syntax here like $ST["name"] to extract a template form the group. This would be a reserved word.

Yes, i have been working with this cool start-up company recently and I have been trying to use the template stuff. I have not needed multiple templates from a single rule, but let me add my own frustrations here and we can probably come up with a nice bit of extension. John Mitchell suggested this exact strategy...get something working (avoiding the complicated thing I was designing) and then try to build something with it! ;)

So, I do this a few times in a Java grammar:

StringTemplate fcallST = templateLib.getInstanceOf("fcall"); fcallST.setAttribute("ids", $type.st); // type is a rule reference above $anEnclosingTemplateVisibleHere.setAttribute("fcalls",fcallST);

Wouldn't it be nice to do as you say (a rewrite like constructor but in an action):

StringTemplate fcallST = $fcall(ids=$type.st);

and then we could have a syntax like pyStringTemplate to use array syntax instead of setAttribute:

$anEnclosingTemplateVisibleHere["fcalls"] = fcallST;

The problem of course is clear: The dollar syntax $x.y is way to simple to handle the above and it's ambiguous anyway. We'd need a keyword as you say I think, but that makes it messy.
Actually, given that I must be consistent across output options, how will we build trees inline?

AST t = ^($ID.text, $INT.text);

I suppose we could make ^ significant like we did with #(...) before. Hmm...What would a node constructor be?

^ID["blort]

I suppose that would be ok.

So, do we need a special symbol for inline stringtemplate stuff? Ick. @ is action, $ is attribute. Hm...keyword is probably better. Ack...can't just randomly look for a keyword...need a symbol. Hmm...well, what about, symbol plus braces:

StringTemplate fcallST = ${fcall(ids=$type.st)};
${anEnclosingTemplateVisibleHere["fcalls"] = fcallST};

I don't like overloading $ though but I hate to create another special symbol just for templates like:

StringTemplate fcallST = %{fcall(ids=$type.st)}
%{anEnclosingTemplateVisibleHere["fcalls"] = fcallST};

Is there some generic thing I could do with %...? Ick. Would they have to nest?

%{anEnclosingTemplateVisibleHere["fcalls"] = %{fcall(ids=$type.st)}};

Probably. To me that is not very readable to the casual observer where the explicit is very readable.

Perhaps your $templates::fcall(ids=$type.st) would work with "templates" being the keyword / built-in scope of templates. Actually that would work. :)

StringTemplate fcallST = $templates::fcall(ids=$type.st);
$anEnclosingTemplateVisibleHere.setAttribute("fcalls",fcallST);

It saves a bit of typing and is more clear. :)

Better yet: How about defining a string template group in the grammar:

group MyTemplates; // This only a defines a property in the parser to set a string template group. Later, we could allow to define the templates inline.

I think you want this to be dynamic so, as you request below, you can reset the template group for each pass using setTemplateLib().

Then one could call templates of the group using a syntax like MyTemplates::Name. There would be a default group, which is used if the group name is omitted.

We can just use the $templates::name(...) generic mechanism so we can dynamically change the actual group. Your way might be useful with the ST interface I plan on implementing. In this way, I could verify that your template names actually make sense. Heh, i like it!

-> xxx calls default group "xxx" and returns this as result

-> MyTemplates::xxx calls "xxx" in the the MyTemplates group and returns this as result.

Interesting. You want to be able to define multiple template libraries? Ok, so

$templates::fcall(...)
$MyTemplates::blort(...)

I dislike overriding the $scope::x syntax so much, but it seems like it would work. How would we specify the new template groups?

templates Java;
templates Debug;

then you could say $Java::fcall(...) and $Debug::dump(...). But, the lib names should be interfaces so you're not locked into a specific template group, right? This would generate methods like:

setJavaTemplateLib(...)

in the parser so you could set the actual lib.

How often would this multiple template lib thing be used do you think?

Within actions $MyTemplates::xxx would call the template.

Now, we could nest them:

-> ResultTemplate(MyTemplates::xxx($Name), MyTemplates::yyy($Id)), or whatever...

Yup.

Oh, right we'd need to allow

-> Java::fcall(...)

syntax too.

2) Build a collection of templates. These are useful to pass them to other templates...

Currently, only the result of a rule can be added to a list. This could be expanded to support multiple lists, one per template.

Just allow the same syntax to be used in front of each template call: myList+=MyTemplates::xxx adds the template to a list called myList.

You can pass in multiple templates to fill with templates you create in that rule. You can pass multiple return values back and just add to a list manually. How often do you need to return multiple templates though? When I need this I just access the outer template where they will go and then setAttribute for each new template. I've yet to need multiple return templates.

Maybe we should add the possibility to define lists in a scope:

scope global {
Templatecollection myList; // or something like that
}

Use it like this: global::myList+=MyTemplates::xxx(...).

We're getting closer and closer to the simple neutral action language I was thinking of adding later ;) Hm...this is getting a big much. I think:

global::myList.add($MyTemplates::xxx());

would be just as good ;)

Overall, I think, template manipulation within ANTLR should be expanded. Of

I plan on it...just trying to let actual need guide us. :)

course, I can code that now within actions but it would be far more user friendly if normal cases could be expressed in a standard and uniform way.

Yes, let's come up with a nice extension.

I have several 2.7.5 grammars in C# and Java, which only differs in the action code because of different method and class names. Using ANTLR 3 with templates I was hoping to move all differences to a set of templates. This only works if templates can be manipulated within ANTLR itself.

Can't you just a different template group with setTemplateLib?

BTW: I was thinking about using multiple passes using the same action code, but using different template groups for each pass...

Are you using setTemplateLib?

Let me add my own frustrations now.

This is a drag:

packageDefinition
	:	'package' identifier SEMI -> {$identifier.st}
	;

Could we do this?

packageDefinition
	:	'package' identifier SEMI -> identifier
	;

What if we always returned the template for a single element?

type
	:	identifier
	|	builtInType
	;

the template for identifier / builtInType would be autocopied to $type.st return value. I don't like this actually... it's not consistent with having multiple elements in a production. Perhaps

type
	:	identifier -> identifier
	|	builtInType
	;

is "better". Weird looking though.

Here is a case where I might have just plain foo instead of a.b.c. In either case I need to set the return type. I hate doing ( x -> template(...) ) syntax in the middle of a rule. I want that -> on the end! Here is what I do now:

identifier
	:	(i=IDENT -> qid(id={$i.text}))
		( DOT i=IDENT {$identifier.st.setAttribute("id",$i.text);} )*
	;

Oh, actually we can do this properly even now. Duh, just queue up the IDs.

identifier
	:	ids+=IDENT ( DOT ids+=IDENT )*
                -> qid(id={toStrings($ids)})
	;

November 23, 2005

I've changed my mind on this syntax too (yet more changes in my tests/examples). John Mitchell and others warned me that access to dynamically scoped stuff vs parameters etc... should be syntactically different. I agreed but was loath to use an extra valuable char like @, which I note is now very useful for actions.

Anyway, imagine how things currently work:

function returns [String funcName]
scope {
	String name;
}
	: "function" ID {$name=$ID.text; $funcName=$name;} args body ;

body : ... {System.out.println("fname="+$function.name);} ;

Since function calls body, body can access the dynamic scope of function. You must explicitly say was is visible with the scope notation. Setting the dynamic value $name looks exactly like the parameter $funcName. In the body rule, you must specifically reference the scope with the attribute as $function.name, but this still doesn't scream "the value is in another rule, dude!"

I am proposing to avoid use of a special '@' symbol or whatever in favor of the C++ style scope override "::". THat said, I think the distinction should be sort of "local" vs "global" so that you can say $name inside the rule that defines the scope, but you must use $scope::name in an invoked rule like body:

function returns [String funcName]
scope {
	String name;
}
	: "function" ID {$name=$ID.text; $funcName=$name;} args body ;

body : ... {System.out.println("fname="+$function::name);} ;

What about shared global scopes? I propose that we always have to use the fully-qualified scope in this case.

scope Symbols {
	List names;
}

classDef
scope Symbols;
init {
	$Symbols::names = new ArrayList();
}
	:	... {$Symbols::names.add("foo");} ;

body
scope Symbols;
init {
	$Symbols::names = new ArrayList();
}
	:	... {$Symbols::names.add("foo");} ;

Speaking of which, we need to be able to specify some init code in the global scope so we can avoid code duplication and do avoid forgetting. I shied away from it originally because I didn't know what the syntax would look like and whether it would be useful. This bit me hard yesterday when building an ANTLR+ST example. How about just adding the init action into the scope?

scope Symbols {
	List names;
	init {
		$Symbols::names = new ArrayList();
	}
}

Actually I'll be using probably @init ... (yet another tweak to the syntax...sorry! It's rapidly converging though I'm letting actual experience show me my faulty syntax).

November 18, 2005

Ok, talked to John Mitchell for like 2 hours by phone yesterday (well, actually today the 17th, but I wanted this entry to be separate so I bumped the day). He convinced me that the ANTLR+StringTemplate integration should be braindead simple, which means it's not as automated as I was going to make it. It's very clear now and experience will then guide me to make shortcuts later. :)

Ok, the idea is simple:

  • All rules return a template even if it's null because you didn't use ->
  • There are 5 possible right-hand-sides for template rewrites:
    -> template(arglist) "..."
    -> template(arglist) <<...>>
    -> namedTemplate(arglist)
    -> {free expression}
    -> // empty
    
    where template is a keyword and is analogous to lamba in Python.
  • The arglist is a set of template attribute assignments:
    attr=expr where expr is either a free form target language expression. If your action requires a comma you must wrap the expression in curlies. For example,
    ass : ID '=' expr -> assign(lhs=$ID.text, rhs=$expr.template);
    
    a : ID -> foo(arg={arbitraryFunc($ID.text,23)}) ;
    
    block : s=stat -> slist(stats=$s) ; // $s is stat_return
    
    block : (s+=stat)+ -> slist(stats=$s) ; // $s is List<stat_return> 
    
    block : s=stat -> template(stats=$s) "<stats.template>"
    
    block : (s+=stat)+ -> slist(stats=$s) ; "{ <stats:{s|<s.template>;}> }"
    
    /** the hardway--pass in a List<StringTemplate> */
    block 
    init {
       List x=new ArrayList();
    }
     : (s=stat {x.add($s.template);})+
       -> slist(stats=$x) // $x is List<StringTemplate>
     ;
    
    Passing list of rule templates from a rule invocation sequence will be common and you want to avoid having to say .template all the time in the template, a helper method would be very useful so you could write the rule like this:
    block : (s+=stat)+ -> slist(stats={toTemplates($s)}) ;
    
    where toTemplates() dups the incoming list and replaces the elements with the associated templates. For a list of Token objects, it would create a list of Strings. Perhaps we can have ANTLR generate this method in the output parser.
  • Also, you have the predicated rewrite list:
    a : A B -> {p1}? foo(a=A)
            -> {p2}? foo(a=B)
            -> // return nothing
    
  • You may refer to the current rule's return template with $rule.
  • Subrules set the overall rule's template just like with ASTs:
    a : ID ( '=' INT -> init(d=$ID.text,v=$INT.text)
           | -> plain(d=$ID.text)
           )
      ;
    

Actions

Currently, one cannot specify a set of instance vars/methods for the lexer in a combined grammar:

header {
package org.antlr.foo;
}
grammar T;
{int i;}
a : A ;
A : 'a' ;

The int i; action only goes in the parser. Further note that the header works for Java, but C++ with it's requirement of topologically sorted elements (ick), you need multiple headers. We solved this in v2 by having named headers like:

header "bob" {
filthy C++ stuff
}

header "frank" {
even more C++ cruft
}

Seems we need to generalize this for all languages and make a consistent means of setting actions in various places. ANTLR should not care where these go; the code generator either would use them or not. It would just convert the actions to template attributes, which the code gen was able to use or ignore. Perhaps the Target object should indicate the valid set of these; probably.

Anyway, how about this:

grammar T
options {language=Java;}

@header {
package org.antlr.foo;
}

@members {
int i;
}

@lexer_members {
// whatever goes to lexer
}

a : A ;
A : 'a' ;

The lexer_members name would actually be translated to members by ANTLR for the code generator when it handled the lexer grammar (antlr pulls apart the above combined grammar into two grammars and runs itself recursively on the lexer grammar). So, in a pure lexer grammar, you'd do:

lexer grammar L;
options {tokenVocab=T;}
@members {
// whatever goes in lexer's field/methods area
}
A : 'a';

What about rule init actions? Now we do:

rule
options {...}
init {...}
  : ...
  ;

Instead perhaps we do:

rule
options {...}
@{...}
  : ...
  ;

Where we avoid using a name with the @ as it's clear it can only be the init action. Actually, what if we add the notion of a "finally" action:

rule
options {...}
@init {...}
@finally {...}
  : ...
  ;

Yeah, let's make it @init. These actions are translated to attributes of the rule template in the code generator.

Sounds ok, but I don't like that the @ char conflicts with the StringTemplate usage for specifying regions. This brings us to the StringTemplate code gen integration.

Altering code generation

To alter the output code is easy with v3: just change the templates, but then you have to get ANTLR to see your new templates etc... I think John Mitchell suggested putting the templates directly in the grammar (probably at the bottom). This way the templates are convenient to override and they are carried along with the templates. How about the following notation:

grammar T;
a : A B ;

codegen {
/** Override how tokens are matched to print out tracing info */
tokenRef(token,label,elementIndex) ::= <<
System.out.println("about to match token <token>");
<super.tokenRef(...)>
>>

/** count how many elements we match */
@element.prematch() ::= "this.elementsMatched++;"
}

The @element.prematch() means to override the region defined in template element, which is defined as for Java:

/** Dump the elements one per line */
element() ::= <<
<@prematch()> <! leave a hole for people to stick debugging code etc... !>
<it.el><\n>
>>

Issue: the @ inside the codegen section doesn't mean action like it does outside per the above discussion...it means StringTemplate region.

Issue: The curlies on the outside could actually be hard to match. I'd have to call the StringTemplateGroup group file lexer to correctly match the templates and know when to stop parsing at the final right curlie. Hmm...

November 17, 2005

Ok, after yesterday's efforts, I think I have narrowed down my problems to:

  1. What does template(attr=$label) mean? Is it the Token object / return value structure or the text of a token / template returned by the rule? I'm thinking it's the text / template. If you want the raw object, use attr={$label}.
  2. In a parser rule that constructs AST with multiple-valued elements, you must say:
    a : ID+ -> ID+ ;
    
    with the rewrite rule indicating the cardinality via the + sign. With templates, right now you say:
    a : ID+ -> template(ids=ID) ;
    
    Should we make it
    a : ID+ -> template(ids=ID+) ;
    
    If so, do we need the star and question mark too?
  3. When you reference a label with multiple-values
    a : ids+=ID -> template(names=$ids)
    
    do you get the Token objects (or rule return value structures) or do you get the text/templates? Also, do we need $ids+ to indicate cardinality again? By default, the ids list is a List<Token>, so if we want it to mean text, I'll have to copy the list and make the elements the .getText() of the tokens.
  4. How can you set a template attribute multiple times? In an action, you can only do $names=ID.text, which is clearly a normal assignment. Shall I update the entire labeling mechanism to let you set previously defined attributes instead of creating a new one?
    a returns [List ids] : ($ids+=ID)+ ;
    
    and
    a returns [String id] : $id=ID ;
    // versus
    a returns [String id] : ID {$id=$ID.text} ;
    
    Yeah, but what about types? That will fail. id would need to be Token not String. What if the label is a template attribute and appears lexically after the usage?
    a : $ids+=ID -> template(ids);
    
    If $ids is a template argument, the += argument should collect text, but the normal case would be to collect Token objects. I don't like that context-sensitivity!
  5. All rules return a template, but is there a default action. So, currently
    a : ID ;
    
    returns nothing as does
    a : '(' args ')' ;
    
    There is just no way to know whether people want the hidden tokens and what the whitespace rules should be. Until I can figure that part out, I'll have to make that return nothing I guess.
  6. Template arguments implicitly define attributes that actions and label assignments can set. What is the type? Either List<String> or List<StringTemplate> or String, but you don't know which. Some places may say $label=ID and some others may say $label+=ID in different rules! Seems like I should just make all template attributes List<Object>. I wish I had some notation to say call template.setAttribute. What about $template[label]="foo" like PyStringTemplate? That still looks like a simple assignment I guess.
  7. If there is only one template argument such as -> type_user_object(ID) let them avoid the assignment? No because it implies def of ID as an attribute in the rule scope.

Ok, so it looks like I still am not even sure of the scope of my issues.

Defining template attributes

Heh, i know the list of referenced attributes. I should be able to do this:

a : ID '=' INT ';' -> "<$ID> := <$INT>;" ;

and

a : ID (',' ID)* -> "<$ID+>" ;

where I need $ID instead of ID since the templates are like actions--the ANTLR-related stuff has to stand out.

Why can't I refer to the grammar elements directly ???? rather than using template argument attributes as in

a : ID '=' INT ';' -> template(lhs=ID,rhs=INT) "<lhs> := <rhs>;"

Oh, for external templates, we'll need it anyway:

a : ID '=' INT ';' -> assignment(lhs=ID,rhs=INT);

Also, it gets messy when you add ST syntax:

a : ID (',' ID)* -> "<$ID+:{n|<n>.}>" ;

I don't like augmenting ST syntax. Hmm...Also what about:

a : ids+=ID ids+=ID -> "<$ids+>"

Perhaps I treat templates like actions and translate $ids+ to antlr_list_ids and $id to antlr_id or some such. In templates though the + is probably unnecessary as ST assumes 0, 1, or more elements by default.

Do I really want to encourage inline templates? Sure looks good for journal papers.

November 16, 2005

I completed my rewrite of the ANTLR+ST example and it highlights a few things:

  • 2 out of about 12 or so rules do not return a template, hence, the norm is to return a template; switch my default from previous blog entry
  • it's unclear how to set attributes of a template from another rule.
  • how do you return a template for a rule with multiple elements? Since it's text, don't you need the hidden channel tokens like whitespace and comments to go back out as well? If so, do you do all in front of the rule and between elements in the rule? What if you don't want them? It highlights the need for the cool idea from Ric Klaren: allow matching of hidden tokens on left hand side and then ref them in the rewrite rule.
  • if I refer to a multi-valued element in rule, the rewrite must use + or * to indicate it's multi-valued. With template args, I can do ids=ID even if ID has 10 values. Do I do ids=ID+? That's ok I suppose.
  • speaking of multiple values, how do you set a template attribute multiple times? If you use {$names=$ID.text\} in an action, how does that simple assignment do the usual StringTemplate thing and add it in? Could use += but it's in a raw action. Perhaps $names+=ID within the grammar itself? Looks ok, but is a major change. Further, do you get the text or the Token object or what when you do that operation? Sometimes you'll want one and sometimes the other.

Let's do one possible rewrite of cminus.g from Language Translation Using ANTLR and StringTemplate side by side:
ANTLR v2 ANTLR v3
program returns
  [StringTemplate code=template("program")]
  :   (declaration[code])+
  ;
Here we explicitly mention the template to create and what attribute arguments should be created implicitly for rule program:
program
  : declaration+ -> prog(globals,functions)
  ;
I want to be able to separate the variables and functions in the output even if they are mixed in the input. Template parameters without assignments are like implicit rule attributes; it's like doing:
program
scope {
  List globals,functions;
}
  : declation+ ...
  ;

That is, only if they have no assignment on them in the parameter list to the template!

declaration[StringTemplate code]
{StringTemplate f=null,v=null;}
  : v=variable
    {code.setAttribute("globals", v);}
  | f=function
    {code.setAttribute("functions", f);}
  ;
The following doesn't really make sense due to setting an attribute multiple times with an assignment in raw code doesn't track all variables etc...
// note: no return template wanted!
declaration
    :   v=variable {$program.globals=$v;}
    |   f=function {$program.functions=$f;}
    ;
What about using += syntax and within the grammar?
declaration
    :   $program.globals+=variable
    |   $program.functions+=function
    ;
It obscures the grammar unfortunately. Also, what does it add though? The template for variable? The entire return value struct? I don't want to pretend to know anything about attribute types (some languages aren't statically typed anyway).

What if we had some assignment notation after the -> so that we can get rid of the grammar obscurity and let's me avoid allowing arbitary attribute setting masquerading as a label assignment:

declaration
    :   variable -> $program.globals+=variable
    |   function -> $program.functions+=function
    ;

It would also make more sense if you had more things to set:

declarator
    :  ID '[' INT ']'
       -> $decl.var=ID, $decl.arraySize=INT
    ;

versus

declarator
    :  $decl.var=ID '[' $decl.arraySize=INT ']'
    ;

Hmm...you know that is the most natural. Seems that assignments after the -> rewrite are inconsistent. Inline, they don't look too bad I guess. Still the issue remains of what exactly is assigned to those attributes. Text, Token object?

What about

declarator
    :  ID '[' INT ']'
       {$decl.var=$ID.text; $decl.arraySize=$INT.text;}
    ;
I don't like the need for an action, but it is much easier to see the grammar.

variable returns [StringTemplate code=null]
{StringTemplate t=null,d=null;}
  : t=type d=declarator SEMI
    {
    if ( currentFunctionName==null ) {
      code = template("globalVariable");
    }
    else {
      code = template("variable");
    }
    code.setAttribute("type", t);
    code.setAttribute("name", d);
    }
  ;
A predicated rewrite works great here. Also note that type=type means set the template's type attribute to the template returned from calling rule type. Same for name=declarator. What if one of those rules were called more than 1 time though? Should I use name=declarator+?
variable
  : type declarator SEMI
    -> {$function.name==null}?
       globalVariable(type=type,name=declarator)
    -> variable(type=type,name=declarator)
  ;

Note the ref to $function.name, which refers to the name of an enclosing function if any. See attribute definition in the function rule below.

Ack, how do we know below in an invoked rule what $variable.type sets since both templates have a type attribute? Both? Won't know until we create the template. Perhaps the union of all template parameters gets created as a scope of normal rule attributes and then they are set regardless of which one eventually gets used. Here actually nobody sets $variable.name$ or whatever so it's cool--we use rule return templates.

declarator returns
  [StringTemplate code=null]
  :   id:ID {code=text(id.getText());}
  ;
I don't know what to generate by default for a production so you have to be explicit; I allow single element refs (just like AST rewrites):
declarator
    :   ID -> ID
    ;
function returns
  [StringTemplate code=template("function")]
{StringTemplate t=null, f=null;}
  : t=type id:ID
    {currentFunctionName=id.getText();}
    LPAREN
    (   f=formalParameter
      {code.setAttribute("args", f);}
      (   COMMA f=formalParameter
        {code.setAttribute("args", f);}
      )*
    )?
    RPAREN
    block[code]
    {
    code.setAttribute("type", t);
    code.setAttribute("name", id.getText());
    currentFunctionName=null;
    }
  ;
Here I'm using a dynamically-scoped variable, name, that is visible to anybody within a rule invoked from rule function so they can test if they are in a function (name!=null).
function
scope {
    String name;
}
scope slist;
    :   type $name=ID LPAREN 
        ( formalParameter
          ( COMMA formalParameter )*
        )?
        RPAREN
        block
        -> function(type=type, fname=ID,
                    locals=$slist.locals,
                    stats=$slist.stats,
                    args=formalParameter)
    ;
I'm very uncomfortable with the args=formalParameter assignment as formalParameter will very often be multi-valued. I'd prefer args=formalParameter*, but is that weird? Well, not if we assume that the rhs of an assignment can be a token, rule, label, or suffixed version with * or + or ?. Hmm...

Note the use of $slist.locals. The rule function shares a scope called slist with rule stat so that the block rule can set these attributes.

scope slist {
    List locals, stats;
}

These are then passed to template function as template attributes.

formalParameter returns
  [StringTemplate code=template("parameter")]
{StringTemplate t=null,d=null;}
  :   t=type d=declarator
    {
    code.setAttribute("type", t);
    code.setAttribute("name", d);
    }
  ;
formalParameter
    :   type declarator
        -> parameter(type=type,name=declarator)
/*
or:     $type=type $name=declarator
        -> parameter(type,name)  ?????
*/
    ;

Is that weird though to set template attributes before you see them lexically?

type
  returns [StringTemplate code=null]
  :   "int"   {code=template("type_int");}
  |   "char"  {code=template("type_char");}
  |   id:ID
    {
    code=template("type_user_object");
    code.setAttribute("name", id.getText());
    }
  ;
Wow, this one is a nice example; shows the terseness and lack of freeform actions:
type:   "int"  -> type_int()
    |   "char" -> type_char()
//or |  "char" -> "char"      // ref to token
//or |  "char" -> "character" // string
//or |  "char" -> template() "character" // same
    |   ID     -> type_user_object(name=ID)
    ;

The "char" string in the first "//or" production refers to the token but if the string on the rhs does not exist on the left, then it's just a string

block[StringTemplate st]
{StringTemplate s=null,v=null;}
  :   LCURLY
    ( options {warnWhenFollowAmbig=false;}
    : v=variable
      {st.setAttribute("locals",v);}
    )*
    ( s=stat
      {st.setAttribute("stats",s);}
    )*
    RCURLY
  ;
This rule doesn't return a template, it just sets locals and stats attributes of scope slist, which is a global shared named scope used by function and stat (for statement lists in curlies).
block
    :  LCURLY
       ( $slist.locals+=variable )*
       // makes list of templates not structs!
       ( $slist.stats+=stat )*
       RCURLY
    ;

This tracks the template results of variable and stat, but what if you wanted to push the entire return value struct to the template (which will include stuff like the return values and start/stop tokens etc..)? How to specify the two cases?

Perhaps to get return value structs, you need to use actions as they are the only things that can see them anyway. The following rewrite manually adds the entire return value struct to the $list.locals List object:

block
    :  LCURLY
       ( variable {$slist.locals.add($variable);} )*
       ( stat {$slist.stats.add($stat);})*
       RCURLY
    ;

Hmm...what about moving the assignments to the -> rewrite.

block
    :  LCURLY
         ( variable )* ( stat )*
       RCURLY
       -> $slist.locals=variable+, $slist.stats=stat+
    ;

I really like the clean look and now it's clear that ref to variable+ means all of the template results. still, if you want the return value structs, you can use a label and action:

block
    :  LCURLY
         ( v+=variable )* ( s=+stat )*
       RCURLY
       -> $slist.locals={$v}, $slist.stats={$s}
    ;

stat returns [StringTemplate code=null]
{StringTemplate e;}
  : code=forStat
  | e=expr SEMI
    {
    code=template("statement");
    code.setAttribute("expr", e);
    }
  | block[code=template("statementList")]
  | code=assignStat SEMI
  | SEMI
  ;
This rule shares slist scope so that rule block can set the values needed by the statementList template.
stat
scope slist;
    : forStat -> forStat
    | e=expr SEMI -> statement(expr=expr)
    | block -> statementList(locals=$slist.locals,
                             stats=$slist.stats)
    | assignStat SEMI -> assignStat
    | SEMI
    ;

I cannot use statementList(locals,stats) to define the attributes in lieu of scope slist because block sets attributes for both stat and function. block cannot use $rule.attribute notation as the rule is ambiguous. Must use a shared scope.

Does the isolated SEMI production return anything? Again, what about whitespace or comments with it? Perhaps all rules return a template, but what is the default action for a production? Probably to return a null template.

forStat returns
  [StringTemplate code=template("forLoop")]
{StringTemplate e1=null,e2=null,
 e3=null,s=null;}
  :   "for" LPAREN e1=assignStat
    SEMI e2=expr
    SEMI e3=assignStat
    RPAREN block[code]
    {
      code.setAttribute("e1", e1);
      code.setAttribute("e2", e2);
      code.setAttribute("e3", e3);
    }
  ;
This one is pretty clean except for e1=$1 assignment. Does this pass in the return value struct or just the template associated with assignStat invocation? To be consistent, it should be the resulting template, but how do you get the return value struct? Oh, e1={$e1} because an action's version of $e1 is different than the value in a rewrite! ha! Good.
forStat
scope slist;
    :   "for" LPAREN e1=assignStat
        SEMI e2=expr
        SEMI e3=assignStat
        RPAREN
        block
        -> forLoop(e1=$e1,e2=expr,e3=$e3)
    ;

Ah. Yes, if we want the return value struct, do this:

        -> forLoop(e1={$e1},e2={$expr},e3={$e3})

In an action, $e1 is the return value struct.

Using label assignments, we could also do this:

forStat
scope slist;
    :   "for" LPAREN $e1=assignStat
        SEMI $e2=expr
        SEMI $e3=assignStat
        RPAREN
        block
        -> forLoop(e1,e2,e3)
    ;

assignStat returns
  [StringTemplate code=null]
{StringTemplate a=null;}
  :   id:ID ASSIGN a=expr
    {
    code=template("assign");
    code.setAttribute("lhs",id.getText());
    code.setAttribute("rhs",a);
    }
  ;
assignStat
    :   ID ASSIGN a=expr
        -> assign(lhs=ID, rhs=expr)
    ;

or

assignStat
    :   $lhs=ID ASSIGN $rhs=expr
        -> assign(lhs,rhs)
    ;
expr returns [StringTemplate code=null]
  :   code=condExpr
  ;
expr:   condExpr -> condExpr
    ;
Again, this is shorthand for returning the template returned by condExpr.
condExpr returns
  [StringTemplate code=null]
{
StringTemplate a,b=null;
Token op=null;
}
  :   code=aexpr
    ( {op=LT(1);}
      (EQUALS | LT) b=aexpr
      {
      a = code;
      if ( op.getType()==EQUALS ) {
        code=template("equals");
      }
      else code=template("lessThan");
      code.setAttribute("left", a);
      code.setAttribute("right", b);
      }
    )?
  ;
Here is an example where a subrule sets the rule's template according to syntax:
condExpr
    :   a=aexpr
        (   (  EQUALS b=aexpr
               -> equals(left=$a,right=$b)
            |  LT b=aexpr
               -> lessThan(left=$a,right=$b)
            )
        |   -> $a // else just aexpr
        )
    ;
aexpr returns [StringTemplate code=null]
{StringTemplate a,b;}
  :   code=atom
    (   PLUS b=atom
      {
      a = code;
      code=template("add");
      code.setAttribute("left", a);
      code.setAttribute("right", b);
      }
    )*
  ;
In this example, we need to build a nested template just like we'd do for building a tree. First we set aexpr's template with (atom->atom) then refer to this previous value with $aexpr in the loop. $aexpr gets reset upon each iteration of the (...)* loop.
aexpr
    :   (atom->atom)
        ( PLUS b=atom -> add(left=$aexpr, $b) )*
    ;
atom returns [StringTemplate code=null]
  : id:ID
    {code=template("refVar");
    code.setAttribute("id",id.getText());}
  | i:INT
    {code=template("iconst");
    code.setAttribute("value",i.getText());}
  | LPAREN code=expr RPAREN
  ; 
atom
    : ID -> refVar(id=ID)
    | INT -> iconst(value=INT)
    | LPAREN expr RPAREN -> expr
    ; 
Here I'm assuming that the template dumping expressions puts a whole bunch of extra parentheses around expressions so I can't remove the explicit ones returning just expr for the 3rd production. If I wanted the parentheses, I'd have to figure out what this means:
LPAREN expr RPAREN -> LPAREN expr RPAREN
I don't know what to do about whitespace. Is there a space in between each? What if there were a comment in the original input? What do we do? I presume we'd need to do this now
LPAREN expr RPAREN -> parenthesizedExpr(e=expr)

which lets me put whatever whitespace I want. The problem is that comments will not be passed through!

November 15, 2005

More thoughts...trying to redo the article I did on ANTLR+ST using new notation. Some problems. Trying to fix.

We'll need some basic template construction for single elements, for example:

declarator
    :   ID -> ID
    ;

and

stat: ...
    | assignStat SEMI -> assignStat
    ;

November 14, 2005

Ok, I've done lots of thinking about adding template rewrite rules to ANTLR v3.

General points:

  • Use option output=template analogous to output=AST.
  • Templates can be in place or a reference to a template defined in a StringTemplateGroup somewhere.
  • Rewrites set a return template accessible by the invoking rule.
  • Alternatives without a template rewrite yield a rule with a null template result.
  • Template argument list is a list of assignments mapping antlr world to template attributes.
  • A new property, template, is available when doing template rewrites for rule labels; e.g., $r.template.

Rewrites

Normally you will want to separate the templates into a separate group so you can plug in another group of templates for a retargetable code generator:

m : ID INT -> decl(a=ID,b=INT)
  ;

where template decl might look like:

decl(a,b) ::= "<a> = <b>;"

which generates stuff like "x = 3;" and so on.

You can have multiple rewrites:

a : A -> foo(a=A)
  | B -> foo(a=B)
  | C  // blank template by default
  ;

You could specify that last alt with an empty rewrite also:

...
  | C ->
  ;

Predicated rewrites. Just like tree rewrites, you can predicate the templates:

a : A B -> {p1}? foo(a=A)
        -> {p2}? foo(a=B)
        -> // return nothing

Here's a real example:

variable
    :   type declarator SEMI
          -> {currentFunctionName==null}?
             globalVariable(type=type,name=declarator)
          -> variable(type=type,name=declarator)
    ;

Reference to templates in a StringTemplateGroup file. The code gen will add setter/getter methods for accessing a template lib pointer so that the translator is retargetable.

Inline templates

Sometimes you will just want to build something quickly and will want templates inline:

m : ID INT -> template(a=ID,b=INT) "<a> = <b>;"
  ;

where template(...) is like lamba in some languages; that is, you are saying that a piece of code is coming--here, a template is approaching. W/o it, the arg list (a=ID,b=INT) might look like subrule or something:

m : ID INT -> (a=ID,b=INT) "<a> = <b>;" // too confusing
  ;

I also contemplated:

m : ID INT -> "a=ID,b=INT | <a> = <b>;" // weird
  ;

which is sort of similar to the anonymous template stuff in StringTemplate.

Big templates. When the template gets big or multi-line, you can use the <<...>>:

m : ID INT
    -> template(a=ID,b=INT) <<
<a> = <b>;
>>
  ;

As you can see though, large templates should really be pulled out and put into a StringTemplateGroup file.

Template arguments

One of the trickiest parts of this is the bridge between ANTLR and StringTemplate. The argument list is an assignment where the left-hand-side (lhs) is a StringTemplate attribute and the right-hand-side (rhs) is some ANTLR value. You may reference:

  • tokens. a=ID or a=';' or a="int"
    x : ID -> foo(a=ID) ;
    y : ';' -> foo(a=';') ;
    z : "int" -> foo(a="int") ;
    
    where those literals are valid tokens in the grammar. All tokens get pushed in as Token objects. If your Token's toString() doesn't just print the text, you'll have to register a renderer object that does the right thing or modify toString() to do what you want.
  • strings. A string literal such as "hello" is considered just a string if it is not a defined grammar literal (i.e., it is not a token reference).
  • rules. Rule references get pushed in as the template returned from that rule:
    x : r    -> foo(a=r) ; // push in r's return template
    y : (r)+ -> foo(a=r) ; // push in all r return templates
    
    The rule return value struct is not pushed in.
  • labels.
    x : i=ID -> foo(a=$i) ; // push in Token object for ID
    y : r    -> foo(a=$r) ; // push in r's return value struct
    z : (i+=r)+ -> foo(a=$i) ; // push List of all r return value structs
    
  • actions. In some cases, you may want to push in an arbitrary attribute value into the template or you might want to access an attribute property:
    x : ID -> foo(a={$ID.text}) ;
    x : r -> foo(a={$r.text}) ;
    
    The action is any valid Java expression returning an Object. Actually, I think int and such will work as I have some helper methods that wrap the object for you.

Once inside a template, you are free to access properties of the incoming object whether it is a Token or some rule's return value structure.

a : b c ID -> template(x=$b,y=c,z=$ID) "<x.v> <y> <z.text>"
b returns [int v] : 'B' {$v=3;} ;
c returns [int w] : 'C' {$w=4;} -> template(c=C) "<c.text>" ;

Dumping rule a's template, you'd see: "3 C abc" if the ID is abc.

May 1, 2005

I'm neck deep in the action translation at the moment; mostly refactoring and adding unit tests after speaking with John Mitchell yesterday at length. After he smacked me around a little, I am convinced that I need to make some changes. One is that we will use $x to refer to the stuff you're used to like parameters, return values, token labels, and rule labels; @x.y will refer to the new dynamic attribute stuff that you've been hearing about. John and I agreed that a separate symbol is justified as the dynamic attributes are so different than the regular attributes.

Anyway, on to simple $x references. One of the things that drives me nuts in 2.x is that sometimes you can forget the #x or $x and have it still get through ANTLR to compile and run...but #x, for example, is different than x! Ack! This kind of bug has trapped me for hours sometimes.

My proposal is that all labels and simple attributes like parameters are generated with a prefix or something so that you cannot accidentally reference them in an action. This helps me track which return values you access, for example, which helps me generate better code. If nobody references a rule return value, I won't generate code for it. :)

So if you had:

a[String s, int x] returns [float y]
  : id=ID f=field {s, x, y, id, f.z;}
  ;
field returns [int z]
  : ...
  ;

you'd get compilation errors on s, x, y, id, and f. You need to do

a[String s, int x] returns [float y]
  : id=ID f=field {$s, $x, $y, $id, $f.z;}
  ;

Note that multiple return values require that I build a struct or simple class to hold the values so s and x do not even exist as simple variable references.

January 30, 2005

I finished off a first pass of the new attribute stuff in early January, but am just getting around to documenting it. Here is my semi-useful test case:

grammar t;

scope symbols {
    static int n;
    List names;
}

{
public void init() {
    @symbols.n = 0;
}
}

classDef
    scope symbols;
    init {
        @symbols.names = new ArrayList();
    }
    :   id=ID ( f=field {int x=@f.stop.getTokenIndex();} )* (m=methods)*
    ;

field returns [int i]
    :   decl ;

decl:   type="int" id=ID ';' {@symbols.names.add(@id.text);}
    ;

methods
    :   "method" id=ID slist {@symbols.names.add(@id.text);}
    ;

slist
    scope symbols;
    init {
        @symbols.names = new ArrayList();
    }
    :   '{' (decl)* stat '}'
    ;

stat:   ID '=' INT 
    |   "return" INT
    ;

INT :   ('0'..'9')+
    ;

ID  :   ('a'..'z')+
    ;

The rules classDef and slist share a single stack of symbols attribute scopes. Each entry into one of those methods pushes a new symbols object onto the same stack. Rules methods and decl may access the top of the attribute stack with @symbols.n or @symbols.names. Dynamic scoping. Yum!

The init{...} action is just a regular rule init action, but it is used to initialize the new scope. I didn't think a constructor-like thing associated with the scope{...} was warranted.

Scopes unify concepts

Everything is a scope now. The rule arguments and return values are all just scopes (note multiple return values now fall out naturally). The one difference is that rule arguments and return values are lexically scope. I don't want a future rule reference to see arguments of a previous call; those should be "private" to the rule.

Note the (useless) action

{int x=@f.stop.getTokenIndex();}

that asks the reference to rule field for it's stop token. Each rule reference can access start and stop tokens that indicate the first and last token matched within that rule. Nice, eh? The return value of field is:

protected static class scope_field_return {
    Token start=null, stop=null;
    int i;
};

Token Scopes

References to tokens are now seen as var=TOKEN and @var is a scope of attributes. Every label on a token has an implicit scope defined as:

scope TokenReferenceScope {
  String text;
  int line;
  int pos; // char position in line
  int channel; // token channel
  int index; // index of token in token stream
  int start; // start character in char stream
  int stop;
  AST tree; // reserved; might be "ast" later
}

So, properties of Token objects are also just another kind of "scope". :)

I am contemplating allowing @ID as a scope so @ID.text could be used without resorting to a label on the ID if it is unique in the production.

static attributes

As I mentioned previously, sometimes you want an attribute in a scope to only have one value. This means static in C++ and derivatives, but in our case it's a wee bit more complicated--we need to have one value shared only within the parser not across multiple instances of the parser object. Hence, we need static attributes to actually be implemented with an instance variable. Note that in a program with only one parser instance, static attributes map to static parser fields. Currently in Java, for

scope symbols {
    static int n;
    List names;
}

ANTLR generates:

// scope symbols
protected static class static_scope_symbols {
    int n;
}
static_scope_symbols static_symbols = new static_scope_symbols();
protected static class scope_symbols {
    List names;
}
protected Stack stack_symbols = new Stack();

Translation of @-references

In ANTLR 2.x, each language target implementor had to provide a grammar that would parse actions and translate $getText and other junk. In ANTLR 3.0, the code generator does all the parsing and merely asks the templates to translate individual references via a few templates. For example, here are some of the attribute translation templates for Java:

parameterAttributeRef(scope,attr) ::= "<attr.name>"

returnAttributeRef(scope,attr) ::= "retval.<attr.name>"

scopeRef(scope) ::= "((scope_<scope.name>)stack_<scope.name>.peek())"

/** Define an attribute scope */
attributeScope(scope) ::= <<
// scope <scope.name>
protected static class static_scope_<scope.name> {
    <scope.attributes:{<if(it.isStatic)><it.decl>;<endif>}; separator="\n">
}
static_scope_<scope.name> static_<scope.name> = new static_scope_<scope.name>();
protected static class scope_<scope.name> {
    <scope.attributes:{<if(!it.isStatic)><it.decl>;<endif>}; separator="\n">
}
protected Stack stack_<scope.name> = new Stack();<\n>
>> 

Accessing non-top-of-stack scopes

Sometimes you will want to walk up the stack of attribute scopes to see previous values. I think I will simply let the user ask for the underlying stack data structure so they can use target language constructs and methods to do whatever they want. I can imagine something like:

scope symbols {
  List names;
}
...
r : ... {Stack scopes = @symbols.scopes;} ... ;

or perhaps @symbols.stack is better.

December 31, 2004

Implicitly-named scope attributes

long stream of consciousness that ultimately convinced me of the right syntax/semantics, though I didn't summarize my findings in one spot.

The "explicitly-named-scope-only" spec is bothering me. In other words, sometimes you want to share attribute scopes like symbols, but other times you don't want to share at all like an attribute to hold "name of the current method I'm parsing".

methodDef
attributes [String name]
    :   "method" m=ID {@name=@m.text;} ...
    ;
...
atom:   v=ID  {System.out.println("found ref "+@v.text+" in "+@name);}
    |   INT
    ;

Now, name is a horribly unspecific identifier just like a global variable called name would be. This is the original reason for having named scopes. On the other hand, we could ref it as @methodDef.name and then it's actually a good name and follows the @scope.attribute syntax perfectly. Ok, so I'll allow you to define attributes within the named scope of a rule; makes good enough sense to me.

If you are in rule r, then you should be able to refer to r.attr with either @r.attr or @attr? I definitely do not want to inherit attribute names into invoked rules:

a attributes [int x] : ... b c {@x=3;} ... ; // @x ok here
b : ... {@x=3;} ... ;   // NO!
c : ... {@a.x=3;} ... ; // ok

It's too easy to name two attributes the same thing and then mistakenly get the wrong value 10 levels down in the call chain (for big grammars anyway). For example, is @name the class name or method name. Better to be specific. If you need ambiguity about where the symbol is defined such as symbols scope entered in classDef and slist, then you need to create a named scope. On the other hand @x within rule a is probably ok. Perhaps I should do an attribute lookup in the immediately surrounding scope (return values, parameters, implicitly-defined attributes) if you have just @attributeName. Handy when the rule name is long.

So, full syntax for rule definition vis-a-vis attributes:

rule[arguments]
  returns [return-values]
  attributes [attributes]
  scope scope-names
  : ...
  ;

Return values are like attributes visible to that rule and to the calling rule (but read only to the calling rule). Arguments are like attributes visible only to that rule. Attributes are visible to that rule and any rule invoked from that rule (read-write); they are exactly like local variables that are dynamically scoped and visible to any method that can see it's stack activation record.

Specifying type names across targets

Java and C++ differ for variable definitions in that Java easily separates type and name of the variable whereas C++ has the name in the declarator part as in char *p[3] (array of 3 char pointers). So, how do we keep a language independent syntax for argument definitions and so on? The answer is that ANTLR must ignore the definitions of arguments, return values, and attributes; it treats them as actions. The only restraint is that the type specifier cannot have a comma or semicolon in it because those will separate the definitions (and any square brackets must balance). How can you have a comma in a type? C++ allows char (*p)(int,float) for example.

So, you can have whatever funky stuff you want as long as ANTLR is able to pull the individual declarations out:

a[char *p[3], int x] : ... ; // C,C++

If your language can do without static typing like Python, then do this:

a[x,y] : ... {@x=@y} ... ; // Python

So, if you defined a named attribute scope such as this:

scope symbols {
  static int i;
  Symbol *names[];
}

you'd see the two variables i and names defined as-is in the generated C++ as part of the symbols scope definition. They will probably just end up fields in a class definition.

Heh, what is the type of a scope? For example, if you want to record a scope in an instance variable for use after parsing, you need to know the type. Given:

scope symbols {
  List names;
}

what is the type? Perhaps:

{
protected scope_symbols rememberMe;
}
...
classDef
  scope symbols;
  init {
    this.rememberMe = @symbols;
  }
  :  ...
  ;

The scope_symbols is target and code-generator specific, but so what. All the type names are target specific already.

Multiple return value syntax

Hmm...how about

a
{
    int x;
    String s;
}
    : [x,s]=b[34]
    ;

b[int i] returns [int duh, String blort]
    : B {@duh=@i+1; @blort="hi";}
    ;

It looks like parallel assignment. Makes more sense than x,s=b[34].

This brings up a lexical issue for ANTLR. When it sees [...] out of context (as the lexer has no context; only the parser has context), which of the following is it?

  1. the left-hand-side of an assignment (list of variables); e.g., [a,b]
  2. arguments (list of expressions); e.g., [32,"[tom,dick,harry;",x]
  3. parameter definitions (comma-separated list of definitions); e.g., [int x, int y]
  4. attributes (semicolon-separated list of definitions); e.g., [int x; String foo]

The expression list is the worst: could be anything in quotes for example. The comma and semicolon do not separate the expressions.

Ok, only solution is to totally ignore what is inside of [...] in all cases except that when the [...] is interpreted as a list of attributes. In this case, each target must indicate what the separator is between attribute definitions. For Java/C/C++ and most languages the comma would be replaced with a semicolon. Hmm...I don't like that asymmetry though with named scopes, which are already semicolon separated. Certainly it would be weird to have all semicolons:

a[int x; int y]                   // weird
  returns [String name; int q]
  attributes [int a; String b]
  : ...
  ;

What about attributes being semicolon-separated all the time:

scope symbols {
  List names;
}

classDef[int arg1, int arg2]
  returns [int foo]
  scope { // implicit one must be first
    int x;
    int y;
  }, symbols, anotherScope
  :  ...
  ;

Rather than attributes keyword, we'd use an unnamed scope (implicitly using the rule name as the scope name). Interesting. At least the semicolon thing would be consistently used only in curlies (like tokens section too, right?). You can have multiple scopes; in this case, rule classDef has an implicitly-named scope and it uses symbols scope.

Initialization of attributes

How do you set the initial values of attributes? You might want the attributes of scope symbols to be initialized differently depending on where the scope is used (classDef vs slist). Do we put in a constructor? Nope, because you might need to exec some code to determine the constructor arguments. Easier to just let the init action of a rule set the attributes:

classDef
  scope symbols
  init {
    @symbols.numNames = 0;
    if ( some-complex-expression) {
      @symbols.names = new List();
    }
    else {
      @symbols.names = new List();
    }
  }
    : "class" ID '{' (fieldDef)* (methodDef)* '}'
    ;

Better to have something simple, obvious, and flexible than something that looks good in the manual or in a journal paper. ;) You can define your own void init(scope_symbols s) method if you want to factor out the initialiation code. This solution is also much more natural for non-OO languages.

More Syntax Musings

Note that for the init-action, I have added init. I just noticed that this rule header is getting rather complicated. The {...} might look like an explicitly-defined scope without the init or a separator like ';' after scope. All is cool as long as the keywords are there.

Ooops. What about options? At the moment, they are just ID=value (comma-separated) before the ':' of a rule or subrule. Add options keyword back in?

classDef[int arg]
  returns [String name]
  scope {int name;}, symbols, another
  options greedy=true, ast=false
  init {
    @symbols.numNames = 0;
    if ( some-complex-expression) {
      @symbols.names = new List();
    }
    else {
      @symbols.names = new List();
    }
  }
  : ...
  ;

Still needed for subrules I suppose:

( greedy=true,ast=false init {i=3;...} : ',' ID )*

Boy that looks so weird now without semicolons between elements in the rule and subrule. Ok, perhaps add separators:

classDef[int arg]
  returns [String name]
  scope {int name; int n;}, symbols, another;
  options greedy=true, ast=false;
  init {
    @n = 0;
    @symbols.numNames = 0;
    if ( some-complex-expression) {
      @symbols.names = new List();
    }
    else {
      @symbols.names = new List();
    }
  }
  : ...
  ;

where I have not added semicolons after [...] or after {...}. For subrules:

( options greedy=true,ast=false; init {i=3;...} : ',' ID )*

What if we got rid of returns in favor of old PCCTS syntax:

rule[int x, int y] > [String returnValue]
    :   ...
    ;

Nah...returns is just too obvious to get rid of and > is nonobvious.

What if there are only options:

( options greedy=true,ast=false; : ',' ID )*

Then the semicolon is not a separator but a terminator and looks pretty superfluous. Ugh. If we made semicolon a separator, it might be better, but it would force some order on the rule header:

rule[args]
  returns [return-values]
  options or scopes separated by semicolons
  init {...}
  : ...
  ;

Actually specifying semicolons as separators is rather hard in a context-free grammar. Ack.

I might need the init last anyway so ANTLR has seen all definitions before it has to look up the @refs.

Attribute cardinality

Do we need cardinality modifiers to indicate how many values an attribute holds? I guess not when int and List make it pretty clear:

scope symbols {
  int numNames;
  List names;
}

numNames can only have one value. Setting it twice must reset it. Only in StringTemplate, where there are no attribute types, do you have issues with setting an attribute more than once.

Now, what about static vs dynamic?

The static keyword works, but that means I have to parse the damn {...}, which could be really wild for some languages. Heck, python would look like this probably:

scope symbols {
  static numNames=0
  names=[]
}

I guess if I require static as the first word of any definition within curlies, I could pull it out and then give it to the StringTemplate that generates scopes in the code generator. Uh oh, there is no separator for Python. How do I know where the start of the declarations are?

Ok, looks like any set of declarations will have to be semicolon terminated and then I'll send a list of attribute definitions to the code generator. I'll do the same for argument lists. I'll have the ANTLR lexer grab as a big chunk of text anything in {...} and [...] and pull it apart later when I know what is inside by context. Come to think of it, I definitely need a list of attribute definitions so I can pull static attributes out for use with C output.

Attribute operators

Do we want accessor methods for setting/getting attributes? You might want to override what setting an attribute means. For example, when you set a parent attribute that points to the previous scope you might want it to go to the parent and addChild or something. For now, I'll not add method syntax to the attributes.

Oh, what about accessing scopes other than the top of the stack? What about the "previous"? @symbols[-1] for previous? @symbols[i] is the ith element from 1..n for depth i? @symbols[n] would be the top of stack. Depth is better than size or else people will think 0..n-1 instead of 1..n. I guess we'd need @symbols.depth or @symbols.size as a built in attribute. 0..n-1 is more natural to programmers, but 1..n depth avoids a constant i-1 everywhere.

Tokens, labels, and attributes

It has always been a problem explaining the difference between ':' and '=' in ANTLR. The ':' is for labels and '=' is for return values. Let's just get rid of the ':' and use '='. For token references and rule references both:

a
  init {
    int x,y;
  }
  : id=ID x=b {y=@id.line+x;}
  ;
b returns [int foo]
  : {@foo=32;}
  ;

So @id is the named scope for the attributes of a token. Uh oh, how do you refer to the return value of a rule and the tree that it will create (or later perhaps the text it matched)? Ok, let's try again. Get rid of the parallel assignment concept and access the return values as attributes, which is more consistent.

a : id=ID x=b {y=@id.line+@x.foo;}
  ;
b returns [int foo, int bar]
  : {@foo=32;}
  ;

Heh, this is nice. You don't have to define the result of calling a method. That always drove me nuts in 2.x. To get an int return value I had to go to the init-action and define a temporary local. Ick.

Cool. We have syntax now for the tree created or the text matched by a rule: @x.tree and @x.text. Consistent with tokens then: @id.tree and @id.text. I like it.

A token has attributes as well; by default it has line, charPositionInLine, and text. Does a tree? Sure: token and then the children. Should we allow people to define this in the grammar like this:

/** Define the payload for a Token */
Token {
  List args; // for example, if you are lexing XML tags
}

/** Define the payload for an AST */
AST {
  AST def;
  List uses;
}

Not sure. It's pretty easy to let them define a new Token subclass, but then I guess I couldn't check references like @x.tree.def for validity at compile-time.

Just because they can define the attributes, doesn't mean they can define methods. I have not found myself defining anything other than accessor methods for Token so that's probably ok. AST on the other hand can have lots of added functionality.

What about heterogeneous trees? Seems that you don't want to define a bunch of different types in the grammar file; you'd have trouble with inheritance relationships and so on. The target language should really be the one to define different ASTs, but that makes them inconsistent with Token definitions.

For a first pass, let's leave out Token and AST definitions. The new Token definition has just about anything you'd want for now. Worry about creating application-specific tokens and trees later.

StringTemplate integration

What happens when we integrate ST whose attributes are totally untyped? Surely it's attributes will have to be visible so you can set them:

rule => { dept,names+ : <dept> First names: <names>}
  : {@names="Tom"; @names="Jerry"; @dept="sales";} ...
  ;

I have added Smalltalk-like argument syntax to define the attributes and then have cardinality operators. More often for large (retargetable) translators, you'd refer to the template by name rather than using an anonymous template:

rule => department
  : {
    @department.names="Tom";
    @department.names="Jerry"; // names now multi-valued
    @department.dept="sales";
    }
    ...
  ;

elsewhere, in a StringTemplateGroup file, you'd have department defined with parameters and cardinality:

department(dept,names+) : "<dept> First names: <names>"

Are the attributes of department visible in rules invoked by rule? They'd have to be, but would need to be fully qualified (just as they need to be fully qualified within rule). So I'd push a new StringTemplate object for each invocation of rule. Nice. Consistent. I like it.

Actions and code generation

In ANTLR 2.x there are things like $getText and #foo that need to be parsed out of actions and translated. Each code generation target must build one of these translators. I'm going to invert it for 3.0 so that I'll cut out the @scope.attribute references and then ask a template to translate it for whatever target and then stick the resulting template back into the action.

Attribute persistence

How to keep a tree or list/set of attribute scopes around after parsing? In other words, what if you want the tree of symbols scopes to hang around after you process the data?

The easiest thing to do for a symbol table "tree of scopes" is link up the scopes with a "parent" pointer and a list of children.

{
protected scope_symbols symbolTableRoot;
}

scope symbols {
  List names;
  List children;        // which scopes are children?
  scope_symbols parent; // who's your daddy? ;)
}
classDef
  scope symbols;
  init {
    symbolTableRoot = @symbols;  // no parent; we are the root
  }
  : ...
  ;
slist
  scope symbols;
  init {
    @symbols.parent = @symbols[-1];      // track your parent
    @symbols[-1].children.add(@symbols); // track children
  }
  : ...
  ;

After parsing, symbolTableRoot would point to the root of the symbol table.

If you are building a parse tree or an AST, then pointers to the scopes could automatically be kept for easy access later.

December 30, 2004

More thoughts on attributes now that I've gotten error recovery working well. Talked to John Mitchell on the phone to refresh my memory. He stressed named scopes and cardinality specification and making clear the static vs dynamic distinction. After the conversation and walking around, I think I have a decent first pass at what attributes should look like.

There are a number of issues:

  1. named attribute scope declaration syntax
  2. named scope definition syntax for use in rules
  3. actions that reference attributes and associated runtime scoping errors

Introduction

First the basic idea. Every time your parser enters a left curly such as in class definition or statement list in Java or C++ like programming language your symbol table must enter a new symbol scope. Rather than having a global symbol table that tracks scopes, you could define a local variable, "List names;", in rule classDef and slist to track the symbols in a class or statement list scope. Unfortunately, rules called from classDef and slist could not see that local because most programming languages (assume we're generating parsers in Java) have lexical scoping (that is, variables "die" at the right curly}). So you'd have to pass the list down as a parameter to every rule invoked below. Ugly. Wouldn't it be nice if the invoked rules could simply see the local variable? In our case, the fact that locals are allocated on the stack is the key not the fact that they are locally and lexically scoped (we want the opposite, right?).

Anyway, we want dynamically scoped variables; we'll call them attributes and have named scopes to reduce collision between attributes that have the same name. In the case of the list of names, you could define a scope such as:

scope symbols {
    List names;
}

and then have rules indicate that they are in that scope:

classDef
 scope symbols
    : "class" ID '{' (fieldDef)* (methodDef)* '}'
    ;
fieldDef
    : type v:ID {@symbols.names.add(v.getText());} ';' // define symbol
    ;
...
slist
 scope symbols
    : '{' (stat)+ '}'
    ;
...
atom: v:ID {if (@symbols.names.contains(v.getText())) {...}} // ref symbol
    | INT
    | ...
    ;

Consider what the parse tree might look like for input:

class T {        // enter named scope symbols0
  <fieldDef>
  method foo() { // enter named scope symbols1
    <s1>
    {            // enter named scope symbols2
      <s2>
    }
  }
}

where <...> indicates some appropriate input. Including named attribute symbols, here is what it might look like:

There are three scopes for this input and like magic the parser would create three scopes tracking them in scope-like fashion so that as you leave a rule, the top scope is popped off.

Accessing an attribute in a grammar action would access the top scope by default though we should allow for random access to the stack of scopes. The diagram shows three references to @symbols. The value depends on where in the grammar the @symbols reference is and how deeply nested the parser is. For example, the same grammar action executes @symbols in rule atom but for our input the action actually references different names lists.

The implementation for this scheme is pretty obvious:

static class scope_symbols {
    List names;
};

/** Track stack of scope_symbols */
Stack symbols = new Stack();
...
public void classDef() {
    symbols.push(new scope_symbols());
    match(CLASS);
    match(ID);
    ...
    match(RCURLY);
    symbols.pop();
}
...
public void atom() {
    if ( LA(1)==ID ) {
        match(ID);
        if (((scope_symbols)symbols.top()).names.contains(v.getText())) {...}
    }
    ...
}

Static vs Dynamic Attributes

Sometimes you do not want multiple copies of an attribute, but yet you want it encapsulated/scoped with other attributes. For example, imagine we want to track globally the number of symbols. You can add the variable to scope symbols, but indicate that it's static vs dynamic:

scope symbols {
    static int numNames = 0;
    List names;
}

This amounts to an instance variable really, but for targets like C you would need to encapsulate numNames in the struct holding attributes of symbols. So, there is only one copy of numNames no matter how many scopes are created. This is directly analogous to the difference between static class variables and nonstatic instance variables in object oriented languages.

In fieldDef then you could increment the count:

fieldDef
    : type v:ID {@symbols.names.add(v.getText()); @symbols.numNames++;} ';'
    ;

where "@symbols.numNames++" would translate to "((scope_symbols)symbols.top()).numNames++".

Parameters, Return Values and Attributes

Perhaps parameters and return values (yes, multiple return values) should be attributes within the scope named after the rule. Heh, I like it. So

foo[int y] returns [int x]
    : ... {@foo.x=@foo.y;} ...
    ;

Not bad actually. This would distinguish them nicely from local variables and token/rule labels:

foo[int y] returns [int x]
{
    int local=@foo.y;
    String s;
}
    : id:ID {s=id.getText();} ... {@foo.x=@foo.y; local++;} ...
    ;

Actually should labels on token references be "scopes" for the attributes (which are properties, in this case) of the Token? So the reference to id.getText() should be @id.text perhaps.

Think about later

  • how are attributes related to parameters, return values, rule/token labels.
  • cardinality of attributes; i.e., what happens when you set an attribute multiple times?
  • do attributes get automatically copied to ASTs and parse trees?
  • do tokens and ASTs have "attributes" in the same sense?
  • target language neutral syntax for accessing stack, adding values to multi-valued attributes etc...
  • target neutral syntax for specifying type of attributes. What about languages that can do without static typing like Python? What if we simply didn't type attributes? Hmm...makes primitives hard like numNames above as we'd have to cast a lot and create a new Integer each time. :(

October 27, 2004

Dynamic attribute scopes

Regarding attribute scopes (where you don't have to pass a value down as a parameter down the call chain) such as:

rule 
scope foo {
  String x;
}
  : ... a ... ;

a : @foo.x=ID ... ;

I ran this scoped attribute / dynamic scoping attribute stuff intended for 3.0 past an attribute grammar expert at dinner. She agreed it was interesting; she suggested that Uwe Kastens had already done something similar. I just found some of his lecture notes. He calls it "Dependency pattern INCLUDING".

"An attribute at the root of a subtree is used from within the subtree. Propagation through the context in between is omitted."

So that is what we are talking about with dynamic scoping 'cept that we use named scopes rather than rulename.attr. Thankfully the antlr2004 workshop discovered this need!

Rewrite rules

Many rewrite engines exist such as TXL, Stratego, DMS, etc... They are huge, however, and clearly not intended as little tools that get integrated into somebody else's application. Further, they are extremely complicated; the average programmer will surely be intimidated. Large declarative systems also pose problems. Anybody who has done a bit of prolog realizes that when a bunch of rules don't get the right result, it's pretty hard to figure out why not.

Seems to me that we should be able to produce a rewrite system that is more deterministic. It could be constructive/imperative such as aspects that get attached to a grammar or declarative with a small set of rules.

(I'm tossing in another reference to James Gosling's ACE syntax-based C preprocessor. It's pretty nice, if specific to C. It is a declarative style macro system).

Declarative

Yesterday I had a list of student usernames and had to generate perforce revision control client specs. First, I used awk to generate Java code:

users.put("bzhao", new Integer(9));

from data like this:

7 parrt
9 bzhao

Then I built a Java program that walked the users HashMap and used a StringTemplate to dump out all the specs:

Client: <user>.nfs
Owner:  <user>
Root:   /home/<user>/depot/main
View:
        //depot/cs601/group<n>/main/... //<user>.nfs/...

Now, awk is powerful enough to do this itself I guess with some work, but it seems that we should be able to do a simple declarative tool that handled it better. I'll use the task as an example. Perhaps, something like this:

replace <inputRecord> with <<
Client: <user>.nfs
Owner:  <user>
Root:   /home/<user>/depot/main
View:
        //depot/cs601/group<n>/main/... //<user>.nfs/...
>>

where <INT> and <ID> could be some predefined tokens and <inputRecord> is some rule in a grammar like:

inputRecord : n:INT user:ID ;

That would make the replace rule kind of an aspect attached to the inputRecord rule that would get executed whenever it matches a line of input.

You could inline the rule too:

replace "n:<INT> user:<ID>" with ...

where the tool would implicitly create an anonymous rule that matched INT followed by ID.

If we don't match replace patterns up with grammar rules, then an "engine" would have to search the entire tree looking for subtree matches. That would be pretty slow in the worst case, but for one-off transformations like the above, who cares? Still I like the "aspect" nature of this where you decorate a grammar in a separate spec.

Aspects

So what if we take a stock grammar for Java, for example, and then allow a rewrite spec to feed off of it like a series of aspects? For example, one could do this:

replace <statement>:"lhs:<ID> = rhs:<expr>;" with "<lhs> := <rhs>;"

I don't like all the labels in there but how else to reference what is matched for <ID> and <expr>? Experience with ANTLR 2.x suggests that allowing the user to reference <ID> when unique causes trouble...actually I think that was because we allowed ID not #ID to mean the tree and ID-as-tree conflicted with ID-as-token-type-constant. So perhaps when unique references:

replace <statement>:"<ID> = <expr> ;" with "<ID> := <expr>;"

The key here is that you get to "speak" in the source language directly rather than in the terms of parse trees or ASTs (you don't care anymore)!

Notice that you must specify what kind of construct you are matching with the "<statement>:" syntax; that's the aspect part where you are decorating a stock grammar w/o having to modify it directly. This could solve a lot of our "how do we reuse grammars?" problems.

Ok, so the with clause would actually be a StringTemplate so you could do cool formatting stuff. For example, if you had a grammar for a language that had the enum construct:

enum Days {Mon, Tue, Wed, Thu, Fri, Sat, Sun};

you could convert to Java like this:

replace <enum>:"enum name:<ID> { <exprList> };" with <<
<exprList:{public static final int <name>_<it>=<i>;}>
>>

where <it> is the standard StringTemplate iterated value as it walks the tokens matched by exprList; <i> is also a standard attribute that iterates 1..n. Ack! I just realized that the commas would still be in the token list matched for exprList. Do we want to allow properties of the rule to be the tokens with a particular type? It would look like:

replace <enum>:"enum name:<ID> { <exprList> };" with <<
<exprList.ID:{public static final int <name>_<it>=<i>;}>
>>

Hmm...not very general I guess. Another way is more explicit, but does not reuse the input grammar. Perhaps the <enum>: reference on the front just says apply this pattern when you see a node in the parse tree that is enum:

replace <enum>:"enum name:<ID> { <ID> <(',' ID)*> };" with <<
<exprList.ID:{public static final int <name>_<it>=<i>;}>
>>

where <(',' ID)*> is an input pattern conforming to an ANTLR grammar fragment. Wow...might make sense to a human, but not sure how we'd implement that.

Predicated rewrites

Sometimes you don't want all syntactic patterns to rewrite, just some depending on a predicate. For example, you might want to rewrite the Day enumeration one way and all others another way.

replace <enum>:"enum name:<ID> { <exprList> };" with
( {condition1}? "<exprList.ID:{public static final int <name>_<it>=<i>;}>"
| {condition2}? "<exprList.ID:{public static final short <name>_<it>=<i>;}>"
)

where the conditions would be arbitrary code written by the programmer just like in an ANTLR grammar. This is a big difference from "closed" systems like TXL that ask you to use their predicate mechanism; there's can probably be proved to do correct translations but ours would be easier for programmers and more flexible in the sense that a programmer could use his/her own data structures.

The other difference of this approach to TXL and other systems seems to be that I would make it very clear when something gets executed and would probably generate a new version of the underlying grammar that had these rewrite actions in them so you could learn exactly what was happening and you could still debug with your favorite debugger. I would like to resist a black box "engine" if we can.

It seems a crucial idea that you do rewrites in a series of stages, with a single stage doing a single well-defined task. Otherwise, you have the problem of a single rewrite generating output that another rewrite in the same stage should rewrite. This is like in the C preprocessor where the result of one macro must invoke anothe macro to get expanded to text. This leads to madness for complicated macros. Separating into separate stages is a good idea. The only trouble is that once you do a rewrite, your next stage's "input" no longer conforms to the original grammar. TXL and friends build combined grammars so that at any point you can get the parse tree for a constrct in the input or output language. Not sure what to do here.