Tree construction

Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 4.0

There are two mechanisms in v3 for building abstract syntax trees (ASTs): operators and rewrite rules.

Operators

Nodes created for unmodified tokens and trees for unmodified rule references are added to the current subtree as children.

Operator

Description

!

do not include node or subtree (if referencing a rule) in subtree

^

make node root of subtree created for entire enclosing rule even if nested in a subrule

No Format
additiveExpression
	:	multiplicativeExpression ('+'^ multiplicativeExpression)*
	;

That is the same as the following in rewrite notation from the following section:

No Format
additiveExpression
	:	(a=multiplicativeExpression->$a) // set result
                (    '+' b=multiplicativeExpression
                     -> ^('+' $additiveExpression $b) // use previous rule result
                )*
	;

Rewrite rules

The rewrite syntax is more powerful than the operators. It suffices for most common tree transformations.
While the parser grammar specifies how to recognize input, the rewrites are generational grammars, specifying how to generate output. ANTLR figures out how to map input to output grammar. To create an imaginary node, just mention it like the following example (UNIT is a node created from an imaginary token and is used to group the compilation unit chunks):

No Format
compilationUnit
    :   packageDefinition? importDefinition* typeDefinition+
        -> ^(UNIT packageDefinition? importDefinition* typeDefinition*)
    ;

ANTLR tracks all elements with the same name into a single implicit list:

No Format
formalArgs
	:	formalArg (',' formalArg)* -> formalArg+
	|
	;

If the same rule or token is mentioned twice you generally must label the elements to distinguish them. If you want to combine multiple elements into a single list, list labels are very handy (though in this case since they have the same name ANTLR will automatically combine them):

No Format
('implements' i+=typename (',' i+=typename)*)?

Here is the entire rule:

No Format
classDefinition[MantraAST mod]
	:	'class' cname=ID
		('extends' sup=typename)?
		('implements' i+=typename (',' i+=typename)*)?
		'{'
		(	variableDefinition
		|	methodDefinition
		|	ctorDefinition
		)*
		'}'
		-> ^('class' ID {$mod} ^('extends' $sup)? ^('implements' $i+)?
		     variableDefinition* ctorDefinition* methodDefinition*
		    )
	;

Note that using a simple action in a rewrite means evaluate the expression and use as a tree node or subtree. The mod argument is a set of modifiers passed in from an enclosing rule.

Deleting tokens or rules is easy: just don't mention them:

No Format
packageDefinition
	:	'package' classname ';' -> ^('package' classname)
	;

If you need to build different trees based upon semantic information, use a semantic predicate:

No Format
variableDefinition
	:	modifiers typename ID ('=' completeExpression)? ';'
		-> {inMethod}? ^(VARIABLE ID modifiers? typename completeExpression?)
		->             ^(FIELD ID modifiers? typename completeExpression?)
	;

where inMethod is set by the method rule.

Often you will need to build a tree node from an input token but with the token type changed:

No Format
compoundStatement
	:	lc='{' statement* '}' -> ^(SLIST[$lc] statement*)
	;

SLIST by itself is a new node based upon token type SLIST but it has no line/column information nor text. By using SLIST[$lc], all information except the token type is copied to the new node.

Using a rewrite rule at a non-extreme-right-edge-of-production location is ok, but it still always sets the overall subtree for the enclosing rule.

No Format
'if' '(' equalityExpression ')' s1=statement
( 'else' s2=statement -> ^('if' ^(EXPR equalityExpression) $s1 $s2)
|                     -> ^('if' ^(EXPR equalityExpression) $s1)
)

You may reference the previous subtree for the enclosing rule using $rulename syntax

No Format
postfixExpression
	:	(primary->primary) // set return tree
		(	lp='(' args=expressionList ')' -> ^(CALL $postfixExpression $args)
		|	lb='[' ie=expression ']'       -> ^(INDEX $postfixExpression $ie)
		|	dot='.' p=primary              -> ^(FIELDACCESS $postfixExpression $p)
		|	c=':' cl=closure[false]        -> ^(APPLY ^(EXPR $postfixExpression) $cl)
		)*
	;

Imaginary nodes

References to tokens with rewrite not found on left of -> are imaginary tokens.

No Format
d : type ID ';' -> ^(DECL type ID) ; // DECL is imaginary

or

No Format
call : lp='(' ID args ')' -> ^(CALL[$lp] ID args) ;

Here, the CALL node has its line/column info set with info from '(' token. CALL node is "derived" from the '('.

Even tokens referenced within alternative result in nodes disassociated with tokens from left of -> if you put arguments on the references:

No Format
a : INT -> INT["99"] ; // node created from adaptor.create(INT, "99")

Anchor
treegrammarAST

Tree construction during tree parsing

ANTLR 3.0.1 could not create trees during tree parsing. 3.1 introduces the ability to create a new AST from an incoming AST using rewrites rules:

  • Each rule returns a new tree.
  • An alternative without a rewrite duplicates the incoming tree.
  • The tree returned from the start rule is the new tree.
  • The new tree created with output=AST in a tree grammar is completely independent of the input tree as all nodes are duplicated (with and without rewrite -> operator).

The rewrites work just like they do for normal parsing:

No Format
a : INT ; // duplicate INT node and return
No Format
a : ID -> ; // delete ID node from tree
No Format
a : INT ID -> ID INT ; // reorder nodes
No Format
a : ^(ID INT) -> ^(INT ID) ; // flip order of nodes in tree
No Format
a : INT -> INT["99"] + // make new INT node
No Format
a : (^(ID INT))+ -> INT+ ID+ ; // break apart trees into sequences

Predicates can be used to choose between rewrites as well:

No Format
a : ^(ID INT) -> {some test}? ^(ID["ick"] INT)
              -> INT
  ;

Don't forget the wildcard (smile)

No Format
s : ^(ID c=.) -> $c ; // new tree is whatever matched wildcard

Polynomial differentiation example

For translations whose input and output languages are the same, it often makes sense to build a tree and them morph it towards the final output tree, which can then be converted to text. Polynomial differentiation is a great example of this. Recall that:

  • d/dx(n) = 0
  • d/dx(x) = 1
  • d/dx(nx) = n
  • d/dx(nx^m) = nmx^m-1
  • d/dx(foo + bar) = d/dx(foo) + d/dx(bar)

Ok, here's a parser that builds nice trees.

No Format
grammar Poly;
options {output=AST;}
tokens { MULT; } // imaginary token

poly: term ('+'^ term)*
    ;

term: INT ID  -> ^(MULT["*"] INT ID)
    | INT exp -> ^(MULT["*"] INT exp)
    | exp
    | INT
    | ID
    ;
exp : ID '^'^ INT
    ;

ID  : 'a'..'z'+ ;
INT : '0'..'9'+ ;
WS  : (' '|'\t'|'\r'|'\n')+ {skip();} ;

Then we differentiate:

No Format
tree grammar PolyDifferentiator;
options {
    tokenVocab=Poly;
    ASTLabelType=CommonTree;
    output=AST;
//  rewrite=true; // works either in rewrite or normal mode
}

poly:   ^('+' poly poly)
    |   ^(MULT INT ID)      -> INT
    |   ^(MULT c=INT ^('^' ID e=INT))
        {
        String c2 = String.valueOf($c.int*$e.int);
        String e2 = String.valueOf($e.int-1);
        }
                            -> ^(MULT["*"] INT[c2] ^('^' ID INT[e2]))
    |   ^('^' ID e=INT)
        {
        String c2 = String.valueOf($e.int);
        String e2 = String.valueOf($e.int-1);
        }
                            -> ^(MULT["*"] INT[c2] ^('^' ID INT[e2]))
    |   INT                 -> INT["0"]
    |   ID                  -> INT["1"]
    ;

then we simplify (a little anyway):

No Format
tree grammar Simplifier;
options {
    tokenVocab=Poly;
    ASTLabelType=CommonTree;
    output=AST;
    backtrack=true;
//  rewrite=true; // works either in rewrite or normal mode
}
/** Match some common patterns that we can reduce via identity
 *  definitions.  Since this is only run once, it will not be perfect.
 *  We'd need to run the tree into this until nothing
 *  changed to make it correct.
 */
poly:   ^('+' a=INT b=INT)  -> INT[String.valueOf($a.int+$b.int)]

    |   ^('+' ^('+' a=INT p=poly) b=INT)
                            -> ^('+' $p INT[String.valueOf($a.int+$b.int)])
    |   ^('+' ^('+' p=poly a=INT) b=INT)
                            -> ^('+' $p INT[String.valueOf($a.int+$b.int)])
    |   ^('+' p=poly q=poly)-> {$p.tree.toStringTree().equals("0")}? $q
                            -> {$q.tree.toStringTree().equals("0")}? $p
                            -> ^('+' $p $q)
    |   ^(MULT INT poly)    -> {$INT.int==1}? poly
                            -> ^(MULT INT poly)
    |   ^('^' ID e=INT)     -> {$e.int==1}? ID
                            -> {$e.int==0}? INT["1"]
                            -> ^('^' ID INT)
    |   INT
    |   ID
    ;

Finally we walk the tree to print it back out using simple templates:

No Format
tree grammar PolyPrinter;
options {
    tokenVocab=Poly;
    ASTLabelType=CommonTree;
    output=template;
}

poly:   ^('+'  a=poly b=poly)   -> template(a={$a.st},b={$b.st}) "<a>+<b>"
    |   ^(MULT a=poly b=poly)   -> template(a={$a.st},b={$b.st}) "<a><b>"
    |   ^('^'  a=poly b=poly)   -> template(a={$a.st},b={$b.st}) "<a>^<b>"
    |   INT                     -> {%{$INT.text}}
    |   ID                      -> {%{$ID.text}}
    ;

Here is a test rig:

Code Block
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;

public class Main {
    public static void main(String[] args) throws Exception {
        CharStream input = null;
        if ( args.length>0 ) {
            input = new ANTLRFileStream(args[0]);
        }
        else {
            input = new ANTLRInputStream(System.in);
        }

        // BUILD AST
        PolyLexer lex = new PolyLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lex);
        PolyParser parser = new PolyParser(tokens);
        PolyParser.poly_return r = parser.poly();
        System.out.println("tree="+((Tree)r.tree).toStringTree());

        // DIFFERENTIATE
        CommonTreeNodeStream nodes = new CommonTreeNodeStream((Tree)r.tree);
        nodes.setTokenStream(tokens);
        PolyDifferentiator differ = new PolyDifferentiator(nodes);
        PolyDifferentiator.poly_return r2 = differ.poly();
        System.out.println("d/dx="+((Tree)r2.tree).toStringTree());

        // SIMPLIFY / NORMALIZE
        nodes = new CommonTreeNodeStream((Tree)r2.tree);
        nodes.setTokenStream(tokens);
        Simplifier reducer = new Simplifier(nodes);
        Simplifier.poly_return r3 = reducer.poly();
        System.out.println("simplified="+((Tree)r3.tree).toStringTree());
        // CONVERT BACK TO POLYNOMIAL
        nodes = new CommonTreeNodeStream((Tree)r3.tree);
        nodes.setTokenStream(tokens);
        PolyPrinter printer = new PolyPrinter(nodes);
        PolyPrinter.poly_return r4 = printer.poly();
        System.out.println(r4.st.toString());
    }
}

Running the rig on "2x+3x^5" shows:

No Format
tree=(+ (* 2 x) (* 3 (^ x 5)))
d/dx=(+ 2 (* 15 (^ x 4)))
simplified=(+ 2 (* 15 (^ x 4)))
2+15x^4

Rewriting an existing AST

For efficiency, option rewrite=true does an in-line replacement for rewrite rules so you can avoid making a copy of an entire tree just to tweak a few nodes. For example, if you have a huge expression tree but only want to rewrite ^('+' INT INT) to be a single INT node, it's better not to duplicate the entire huge tree. The rewrite mode behaves exactly the same as nonrewrite mode except that rewrites stitch changes into the incoming tree. Nodes are not duplicated for rules w/o rewrites.

The result of a rule with a rewrite is the newly created tree. The result of a rule w/o a rewrite is simply the incoming tree. For chains of rule invocations as in the next example, ANTLR copies rewrites upwards so that the action in rule 's' prints out the tree created in b:

No Format
tree grammar TP;
options {output=AST; ASTLabelType=CommonTree; tokenVocab=T; rewrite=true;}
s : a {System.out.println($a.tree.toStringTree());} ;
a : b ;
b : ID INT -> INT ID ;

Anchor
heterogeneous

Heterogeneous tree nodes

By default, with output=AST, ANTLR creates trees of type CommonTree. To create different nodes depending on the incoming token type, you can override create(Token) and YourTreeClass.dupNode(Object) and errorNode() in a subclass of CommonTreeAdaptor or implement your own TreeAdaptor. Unfortunately, this only allows you to change the node type based upon the token type and not grammatical context. Sometimes you want to have ID become a VarNode and sometimes they MethodNode object. As of v3.1, you can use the node token option to indicate node type (in both parsers and tree parsers):

No Format
decl : 'int'<node=TypeNode> ID<node=VarNode> ';' ;

or equivalently

No Format
decl : 'int'<TypeNode> ID<VarNode> ';' ;

because node is assumed if there is only one option and it is not an option assignment. Token references with node options and invoke the following constructor during tree construction:

Code Block
public V(Token t); // NEED SPECIAL CTOR for ID<V> on left of ->

You can specify the node type on any token reference including liberals:

No Format
a : ID<V> ';'<V>

The "become root" operator ^ is used following the token options:

No Format
e : INT '+'<PlusNode>^ INT ;

Labels are available as usual; e.g., x=ID<V> and x+=ID<V>.

Heterogeneous tree nodes are labeled on the right-hand side of the -> rewrite operator as well:

No Format
decl : 'int' ID -> ^('int'<TypeNode> ID<VarNode>) ;

You can also specify arguments on node type constructors on the right of -> rewrite operator. For example, the following two token references:

No Format
ID<V>[42,19,30] ID<V>[$ID,99]

invoke the following two constructors of V:

Code Block
public V(int ttype, int x, int y, int z)
public V(int ttype, Token t, int x)

The TreeAdaptor is not called; instead for constructors are invoked directly. This is much more flexible because the list of arguments can change per type whereas the TreeAdaptor interface is fixed. Note that parameters are not allowed on token references to the left of ->:

No Format
a : ID<V>[23,21] ; // ILLEGAL

Use imaginary nodes as you normally would, but with the addition of the node type:

No Format
block : lc='{' stat+ '}' -> ^(BLOCK<StatementList>[$lc] stat+) ;

Here is a complete simple example:

No Format
grammar T;
options {output=AST;}
@members {
static class V extends CommonTree {
  public int x,y,z;
  public V(int ttype, int x, int y, int z) {
    this.x=x; this.y=y; this.z=z; token=new CommonToken(ttype,"");
  }
  public V(int ttype, Token t, int x) { token=t; this.x=x; }
  public String toString() {
    return (token!=null?token.getText():"")+"<V>;"+x+y+z;
  }
}
}
a : ID -> ID<V>[42,19,30] ID<V>[$ID,99] ;
ID : 'a'..'z'+ ;
WS : (' '|'\\n') {$channel=HIDDEN;} ;

Sometimes ANTLR must duplicate nodes to avoid cycles and to provide useful semantics. In the next example, the trees returned from rule type are type V. There is only one type specification in the input (e.g., "int a,b,c;") but multiple identifiers. to create multiple trees with 'int' at the root, that node must be duplicated by the rewrite rule ^(type ID)+.

No Format
grammar T;
options {output=AST;}
a : type ID (',' ID)* ';' -> ^(type ID)+;
type : 'int'<V> ;
ID : 'a'..'z'+ ;
INT : '0'..'9'+;
WS : (' '|'\\n') {$channel=HIDDEN;} ;

We want 3 trees, one for each identifier:

Code Block
(int<V> a) (int<V> b) (int<V> c)

We need to override dupNode() in our node class definition as well as two constructors:

Code Block
class V extends CommonTree {
    public V(Token t) { token=t;}                 // for 'int'<V>
    public V(V node) { super(node); }             // for dupNode
    public Tree dupNode() { return new V(this); } // for dup'ing type
    public String toString() { return token.getText()+"<V>";}
}

Here is a test rig:

Code Block
public class Test {
    public static void main(String[] args) throws Exception {
        CharStream input = new ANTLRFileStream(args[0]);
        TLexer lex = new TLexer(input);
        TokenRewriteStream tokens = new TokenRewriteStream(lex);
        TParser parser = new TParser(tokens);
        TParser.a_return r = parser.a();
        if ( r.tree!=null ) {
            System.out.println(((Tree)r.tree).toStringTree());
            ((CommonTree)r.tree).sanityCheckParentAndChildIndexes();
        }
    }
}

Anchor
errornodes

Using custom AST node types

Code Block
/** An adaptor that tells ANTLR to build CymbalAST nodes */
public static TreeAdaptor cymbalAdaptor = new CommonTreeAdaptor() {
    public Object create(Token token) {
        return new CymbalAST(token);
    }
    public Object dupNode(Object t) {
        if ( t==null ) {
            return null;
        }
        return create(((CymbalAST)t).token);
    }
    public Object errorNode(TokenStream input, Token start, Token stop,
                            RecognitionException e)
    {
        CymbalErrorNode t = new CymbalErrorNode(input, start, stop, e);
        return t;
    }
};

Here's a suitable error node:

Code Block
/** A node representing erroneous token range in token stream */
public class CymbolErrorNode extends CymbolAST {
        org.antlr.runtime.tree.CommonErrorNode delegate;

        public CymbolErrorNode(TokenStream input, Token start, Token stop,
                                           RecognitionException e)
        {
                delegate = new CommonErrorNode(input,start,stop,e);
        }

        public boolean isNil() { return delegate.isNil(); }

        public int getType() { return delegate.getType(); }

        public String getText() { return delegate.getText(); }
        public String toString() { return delegate.toString(); }
}

Error Node Insertion Upon Syntax Error

Prior to v3.1, ANTLR AST-building parsers did not alter the resulting AST upon syntax error. After v3.1 ANTLR adds an error node as created by TreeAdaptor.errorNode(...) to represent the missing nodes or confusing input sequences. The first token in the error sequence is the token at which the parser first detected an error. The last token in the sequence is the last token consumed during error recovery. ANTLR creates a CommonErrorNode by default, but you can obviously create your own tree adapter and override this.

Let me demonstrate the new mechanism by example. Referring to the attached SimpleC.g from the v3 examples, here is some good input:

Code Block
int foo() {
  for (i=0; i<3; i=i+1) {
    x=9;
  }
}

That input results in the following tree output:

No Format
tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= i 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

The grammar and tree construction for the FOR loop is as follows:

No Format
forStat
    :   'for' '(' start=assignStat ';' expr ';' next=assignStat ')' block
        -> ^('for' $start expr $next block)
    ;

Now, remove the first '(' of the or loop:

Code Block
int foo() {
  for i=0; i<3; i=i+1) {
    x=9;
  }
}

You will see that ANTLR detects an error, but magically inserts the missing token. In this case, the parser was not asked to insert the '(' into the tree so there is no evidence of the error in the output tree:

No Format
line 2:6 missing '(' at 'i'
tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= i 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

What about when you have a random extra token such as "22" before the '(':

Code Block
int foo() {
  for 22 (i=0; i<3; i=i+1) {
    x=9;
  }
}
No Format
line 2:6 extraneous input '22' expecting '('
tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= i 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

Again, ANTLR detects the error and is able to ignore the extraneous token to yield a valid tree.

If you forget a token that must go into the output tree, however, you will see an error node. Given a missing identifier at the start of the FOR loop:

Code Block
int foo() {
  for (=0; i<3; i=i+1) {
    x=9;
  }
}

The parser emits:

No Format
line 2:7 missing ID at '='
tree=(FUNC_DEF (FUNC_HDR int foo) (BLOCK (for (= <missing ID> 0) (< i 3) (= i (+ i 1)) (BLOCK (= x 9)))))

The toString() method of the error node yields "<missing ID>".

When the parser gets really confused, such as when it gets a NoViableAltException, you will see that it consumes a whole bunch of input and adds it to the tree as an error node (it indicates what tokens it consumes during resynchronization). Input:

No Format
);
int foo() {
  for (i=0; i<3; i=i+1) {
    x=9;
  }
}

yields:

No Format
line 1:0 required (...)+ loop did not match anything at input ')'
tree=<error: );
int foo() {
  for (i=0; i<3; i=i+1) {
    x=9;
  }
}>

Making custom error nodes

Just override errorNode() in TreeAdaptor. The default handling is as follows:

Code Block
public Object errorNode(TokenStream input, Token start, Token stop,
			RecognitionException e)
{
	CommonErrorNode t = new CommonErrorNode(input, start, stop, e);
	return t;
}

Make sure that your error node type is a subclass of your node type so that you do not get class cast exceptions.

See the next section for example of how to override.

Turning off error node construction

To turn this off, just override errorNode:

Code Block
class MyAdaptor extends CommonTreeAdaptor {
    public Object errorNode(TokenStream input, Token start, Token stop,
                            RecognitionException e)
    {
        return null;
    }
}

and then set

Code Block
parser.setTreeAdaptor(new MyAdaptor());