Home | Download | ANTLRWorks | Wiki | About ANTLR | Feedback | Support | Bugs | v2


Latest version is 3.2
Download now! »

Download
» Home
» Download
» ANTLRWorks
» News
»Using ANTLR
» Documentation
» FAQ
» Articles
» Grammars
» File Sharing
» Runtime API
» Tech Support
» Bug Tracking
»About ANTLR
» What is ANTLR
» Why use ANTLR
» Showcase
» Testimonials
» Getting Started
» Software License
» ANTLR WebLogs
» ANTLR Workshops
»StringTemplate
»ANTLR v2
»Feedback
»Credits
»Contact


Support StringTemplate, ANTLR Project by making a donation! Terence often pays for things like the antlr.org server, conference travel, and this site design (that alone cost US$1000). Buy him a beer and pizza remotely ;)

Search



2003 Cabal Notes

Terence Parr, Loring Craymer, Monty Zukowski

Tree stuff

Build/specify (parse tree too?) syntax

Loring's tree construction syntax:

	#{ tree action }
	#{ #(a b c) } // tree constr action
	#[...] // node constructor (same as is now)
	#{ a^ } // make a new root called a to current tree
	#{ {pred1}? #(a b c) // apply only upon pred1
	 | {pred2}? #(b a c)
	 | #(c a b)
	 }
	#{ RETURN () } // How to wipe current tree:
	Everything done on-demand like current system
	a is label or #[...]

symlinks:

	#{ "A" } // leaves a symlink label
	#{ "A" = tree_expr } // resolve the symlink
	// can happen multiple times and in either order

Loring's concept is similar to the @-variables of sorcerer. Like a local variable, each invocation of a rule (or #{"A"} in his case) creates a new version of the variable. A reference to the label/symlink accesses the most recently defined variable. His syntax implies "leave a placeholder" not label a tree node. The implementation describes it best. Each label acts like a stack of values; push a new value upon label def. The labels are naturally globally visible. The #{"A"=tree_expr} really says to insert tree_expr where "A" is above. We used the example:

ifstat : IF^ #{"decl"} expr THEN! stat ;
expr   : d:e2 #{"decl"=d} (PLUS^ d1:e2 #{"decl"=d1})* ;

or to have a #(DECL d d1 d1 d1) created:

ifstat : IF^ #{#(#[DECL] #{"decl"})} expr THEN! stat ;
expr   : d:e2 #{"decl"=d} (PLUS^ d1:e2 #{"decl"=d1})* ;

Like a string template (doc with holes), this is a tree with holes you can fill in.

Loring says that once parsing is done, he backpatches all "holes" by recursively resolving. He just does bookkeeping during the parse rather doing on-demand linking up.

Rewrite syntax for tree construction:

list : x:a ( COMMA y:a )* -> x y ;

or

i : IF e:expr THEN s:stat (ELSE s2:stat)?
    -> #(IF e s s2)
  ;

s2 would either be null and leave a placeholder or just not appear. This stuff doesn't work well for expr...still would use annotation.

Labels vs direct reference to elements: Post to group to find out.

Parse trees: why not? Easy and could almost be a runtime option (wouldn't need to regenerate parser from grammar).

Monty talks about a JCL problem where you see statements but an EXEC statement starts a new subtree. A null or other EXEC ends current subtree and bumps you back up. Using Loring's notation of symlinks, you define an EXEC insertion point (symlink) only when you see EXEC. All other statements then add themselves to the current EXEC. Hmm...can't figure out what to do on NULL statement:

stmt: EXEC^ #{"exec"}
    | NULL // need to terminate the scope
    | s:STAT1 #{"exec"=s}
    | s2:STAT2 #{"exec"=s2}
    ;

Ok, now we're thinking "-> new tree" syntax is not worth adding.

Gen a tree grammar

To make things easier, "no antlr modifying code inside regular actions." He'll make this a module. :) To push forward changes, use

diff3 -m t0annotated t0 t1

where t0 is the first tree grammar generated, t1 is the generated result from modifying text parser grammar. t0annotated has the t0 with actions. Actually, monty is right: we can back out the actions and then we don't need the original t0. :)

Tree rewrite

Same as tree construction.

Labeled subrules:

term
    :   expr MULT^ e:(ID EXPONENT^ expr)
    ;
to get
#(MULT 3 #(EXPONENT ID 4))

In Loring's notation, we'd now do:

term
    :   expr MULT^ (id:ID EXPONENT e:expr)! #{#(ex id e)}
    ;

Hmm...notation is kinda nutty with the (...)! Ask the interest list.

tree -> text

Talking about AST to flat token list or to parse tree. Trying to incorporate StringTemplate. Perhaps string template is added to AST walk to use for rule returns.

Maybe the use of templates for the various components of your output language like method, decl, class def etc... and then you still use an ASt tree walk to drive the translation, but it uses the string templates to actually spit out stuff. That way you can change the templates w/o having to mess with the grammar.

Hmm...maybe generate parse tree from AST and then use regular Java grammar to "unparse" it back to text. Perhaps even generate a token stream from the AST and then have the parser walk that list dumping things in the right order.

search/apply to tree

Add tscanf("#(... %1 ...)", List-to-fill).

Make sure we have the findalltree() method there to look for all expr or something satisfying a predicate or something.

attribute syntax

We want all $foo and #foo stuff to be better hooked into code generator. Loring goes so far as to suggest we move the $foo.setText(...) outside the action. We need to allow IF statements around it however and we have no good answer for that. Ask the group.

We decided it was a good thing to define token and AST properties in ANTLR syntax so you can know when somebody refs/sets the properties and it's language independent. The code generator can dump the right thing.

token string text; int line; List args; // for html args for example

AST token originalToken;

xml read, type relationship

Code generator should be able to generate ASTs that satisfy DOM trees or whatever as long as it's compatible with the run-time they supply. So that walking routines and tree walkers etc... can use the same AST type.

XML reading/writing should be standard for serializing trees.

StringTemplate, code gen templates

Ok, concerning string template and trees for final code emission. Trying to pull data from tree into string template and have it run the show seemed wrong because you have to have a tree grammar fragment in order to compute the values:

#(DECL id:ID args:. slist:.) => "method <id> <args:arg()> { <slist:stat()> }"

They said well, just put the template into the parse tree. That smelled wrong. I think (and still do after this is all over that we might be able to get string template to run things). For now, we agreed some that associating a template with a tree grammar rule is good concrete approach. Each rule would push a new scope and create a new string template with for rule r there is a template r. It makes new variables available that even invoked rules could set. Labeling an item (atom or rule reference) sets the associated attribute for most recent hole on the scope stack. example:

----tree parser----
method : #("method" name:ID args slist) ;

args : #(ARGS (argid:ID)+) ;

slist  : #(SLIST (slist:stat)+) ;

stat   : binop:#(op:EQUALS lhs:ID rhs:INT)
       | postop:#(op:INC id:ID)
       | if:#(IF e:expr THEN slist)
       ;

----templates-----
method ::= "
method <name> (<argid>) {
	<slist>
}
"

if ::= "if (<e>) <slist>"

binop ::= "<lhs> <op> <rhs>"

postop::= "<id><op>"
-------------------

Rule args does not create a new template and, hence, does not create a new scope of symbols. Labels act assignments to the most recently defined hole of that name. A labeled tree indicates what template to create for that tree. Hmm..doesn't work if the tree is flat with no root...only can label first element.

ISSUES:

  1. how to duplicate so that you can take a method head and stick it in a .h file and .c file. Tried duplicating in the sense we have two stacks when you say methodHead needs the full function and just the head, but it makes most sense in the antlr world to have two passes: one for the .h and one for the .c. Changes to the original grammar can be pushed forward to the passes. :)
  2. scope overrides. I don't like the idea that you can call something simply <id> in a template. Too ambiguous...you might set the wrong value by mistake. What if we allowed method.name instead of name at the label/assignment?

Actually label does not look right. For example "slist:stat" is a little weird since it means put stat into the slist hole of a template. "slist:=stat" is better since it implies slist is assigned the value of stat invocation.

Another thought is to merge the string templates directly into the grammar that way you don't need to figure out when to create a template:

method : #("method" name:=ID args slist)
	=>
	"method <name> (<argid>) {
		<sl>
	}"
	;

args : #(ARGS (argid:=ID)+) ;

slist  : #(SLIST (sl:=stat)+) ;

stat   : binop:#(op:EQUALS lhs:=ID rhs:=INT) => "<lhs> <op> <rhs>"
       | postop:#(op:=INC id:=ID) => "<id><op>"
       | if:#(IF e:=expr THEN slist) => "if (<e>) <sl>"
       ;

What about precedence for expressions? You need parens around the subexprs sometimes if prec of current op is higher than subexpr op. Can simply ref <lparen> in the expression template and then use a generic setAttribute() method of the tree parser to set the value for the currently scoped attribute name (such as lparen).

What if we allowed a predicate to allow a choice of templates on the end:

rule : alt
	=> {pred1}? "template1"
	=> {pred2}? "template2"
     ;

Then values set with := assignments are queued up and then at end of scope when you must choose a template, it flushes the values from the stack before it pops the scope.

What about case of needing to create a template for implicitly-defined scalars. There is no template associated with the, say, "a=1" statement to stuff into decls somewhere. One solution is to put an anonymous template at the def site:

func : funcname:=ID LPAREN arglist:=args RPAREN slist 
       => "<funcname> (<arglist>) {
              <decls:"int <lhs>;">
              <slist>
          }"
     ;
...
assign : decls.lhs,lhs:=ID EQUALS e:=expr
       => "<lhs> = <e>;"

This is a double assignment (with scope override) but it doesn't say when you are done createing the anonymous template. Need to do this probably; move template to area where you know template will be complete:

assign : lhs:=ID EQUALS e:=expr
       => "<lhs> = <e>;"
       => decls:="int <lhs>;"

The "flush" operation pushes values into the multiple templates. The new scope for this assign rule is the set of all attributes referenced in all => templates. Multiple refs to <id> in same scope yield same value(s).

Can move these anonymous templates to a template grammar file and then refer to them by name so that you can factor these out for reuse within grammar or from within another grammar.

Concept: the rule yields a template (or more) and then you can do "attr:=ruleRef" to get the value above. The decls:="int <lhs>;" yield actually doesn't return anything--it sets an attribute in an existing template. A purely return value concept is totally totally wrong as it implies the order of output is the same or similar order as input tree. The tree walk just walks and sets values then the overall template is dumped at the end.

Using string templates that actually produce tree templates. For output template

"if (<e>) <slist>"

you want tree

 if
 |
<e> -- <slist>

If you have a parser that generates trees then can easily parse the template with holes to get the right tree:

ifstat : "if"^ LPAREN expr RPAREN slist ;

What if the template is ambiguous? Will have to wed the template to a particular parser rule. Might have to refactor your grammar if you want to leave a hole of a particular sequence of tokens. Normally though it'll be obvious stuff like expr or stat or decl or method. Still you might not specify enough text on the left edge for me to predict the correct alternative. Ah. If you insist that all holes correspond directly to a grammar (parser grammar) rule, then you'll never have an issue, but there are no decisions: the entire rule is replaced by rule_epsilon special token type.

Somehow we must have the ability to say what start rule to use (don't forget the EOF issue) in the parser corresponding to the template text. Each hole also must specify the rule it replaces. So you have <attributeName:ruleName>. Technically then you can't just specify a string template and then decide w/o modification to build trees. Though, you can go back the other way. If you specify a tree template it's easy to ignore the rulenames after the colons and then generate text.

So that is the consistency thingie for trees and text output using the same syntax and concepts. Queue up values and then fill string or tree template.

If you want to do replacements like "i=<expr>" you can simply record the token positions that begin/end expr rather than creating parse tree.

Code reuse (pushing forward changes); for whole grammars; Composition of grammars

The problem is that people need to add actions to stock grammars to make them do something useful. But, when the original say Java grammar is changed, you want all the grammar changes to push forward to everybody's specific grammar.

Want antlr-specific diff3 not regular diff3 'cause want it to merge at alternative level.

Forking new copy of grammar and push forward looks like this:

      g0
     / |
    g1 |
    |  |
    | g0'
    | /|
    |/ |
    g1' (created from diff3 g1 g0 g0')

This pushes changes from g0 to g0' into g1 to yield g1'

Seems like this really is a process or principle that people use in conjunction with their RCS as opposed to a specific tool we build (minus the antlrdiff3). Any tool would have to pretty much use an RCS to manage g0 and g0' and we'd have no idea how they have set up their depot so couldn't effectively build a tool to work in all environments. Just write a good paper on how to reuse grammars. Even if you just cp into a different dir and manually do the diff3, at least you can push forward changes.

Aspects. add tags that can be filled by separate aspect file. What if don't have tag in right spot? Hard to read and figure out what actions do in aggregate. hard to figure out where are errors are due to preprocessing. We'll leave this to users who want it; they can build the tools (Bogdan has one).

Composition.

Nice to allow in workbench (click on the tokens you want) to generate a lexer. I'm less sure it's useful for grammars. Hmm...might as well cut-n-paste or use RCS.

Be nice to have an idiom plugin or just a special expression idiom where you can specify the operators and precendence and atom rule and it generates the rules or possibly a precedence parser. Might not be good enough for the wackiness in Java, but would work for vast majority of simple languages. :)

Conclusion: we need not add anything to ANTLR grammars like inheritance, composition, etc... because it's not right and adds lots of complexity. Will have to have a workbench to supply ease-of-use stuff. Keep the tool core!

Lexer impl (DFA or top-down); insert multiple tokens ability like for python?

Special case the single char tokens like operators. Inline rules that are not invoked by other rules (which is usually all of them (top level token rules)). LL(1) rules go into switch, others go into if-then-else after switch. Then use syn preds to distinguish any others that are nondeterministic. Implies there will never be a warning from antlr that it cannot distinguish between tokens.

Idioms; common task amelioration

A poorman's version of emitting back out with no is to remember the char position for each token and then just sort and print as a backend. Does not allow reordering, but allows you to manipulate the nodes (i.e., the properties) of the tree and even augment for say instrumentation. If you wanted to change the char position, you could actually fake reordering. Then just walk and print "in order" and boom you have your text back out including the whitespace if you want on a hidden channel.

To do a simple source to source using same input/output language, should be easy to have it read in and spit out as an identify transform. Then can add ! and other string templates to insert in output stream to do augmentations like code instrumentation. This should be made easy. Feature is "string template" that is inserted into output stream at that point. Easy and consistent. :) Input token stream to output token stream...just insert a new set of tokens and can feed new parser. Or just dump text and feed into another lexer/parser pair. Hmm...to add "instrument("x")" to the stream, you'd need to convert this to a stream of tokens. But, you have a lexer since you're using one. Could use it to build the token stream from the text and then spit this token stream out. Could have an echoToken(Token t) method that either spits as text or token stream; you could decide. :)

See expression idiom from above.

How about an option or idiom that says "ignore whitespace". I always use the same thing anyway.

Testing

Levels (or what to test):

  1. Overall: given grammar and input, what is output (text, tree, token stream, ...)
  2. Code gen: given grammar what is code generated output
  3. Internal ANTLR unit testing: LL(k) analysis, utility routine testing, ANTLR syntax testing.
  4. Run-time library unit testing

Support:

  1. Comparison of output (text, tree, token stream, ...) routines; will have to be normalized
  2. Test harness to say "here is the grammar, here is the input, here is the expected output component"
  3. jUnit Test harness for basic unit testing.
  4. Test harness for checking analysis: dump all lookahead to text and then check for validity.

In general, comparison of text needs to be tolerant of whitespace and formatting. Worst case is checking the java code generated by the code generator. Perhaps normalize the txt by reading in with Java grammar and dumping or something.

How to run:

  1. Overall system or process for doing all the tests. Could do an html based thing in lieu of a gui.
  2. I like a simple command line "run these tests" kinda thing.

Workbench requirements and wishlist

  1. Syntax diagram and highlight path of lookahead or ambiguous paths
  2. Interpret or at least quickly test an input sequence against the grammar and/or show resulting tree
  3. Refactoring of grammars
  4. Repository support for composition or code reuse
  5. Visual transformations

None of this seems to need to be in eclipse.

ANTLR implementation details

Intermediate form that can be interpreted as well as used to generate code from? That form could be read from disk by any language then. E.g., python could read and then generate python. Most important output from antlr is the lookahead analysis. Good for testing etc...

LL(k), generalized lookahead; sem pred hoisting

Module boundaries

separate runtime/tool lib. Standard library spec for trees etc... to aid porting to new output languages

Build issues