There are two mechanisms in v3 for building abstract syntax trees (ASTs): operators and rewrite rules.
Nodes created for unmodified tokens and trees for unmodified rule references are added to the current subtree as children.
do not include node or subtree (if referencing a rule) in subtree
make node root of subtree created for entire enclosing rule even if nested in a subrule
That is the same as the following in rewrite notation from the following section:
The rewrite syntax is more powerful than the operators. It suffices for most common tree transformations.
While the parser grammar specifies how to recognize input, the rewrites are generational grammars, specifying how to generate output. ANTLR figures out how to map input to output grammar. To create an imaginary node, just mention it like the following example (
UNIT is a node created from an imaginary token and is used to group the compilation unit chunks):
ANTLR tracks all elements with the same name into a single implicit list:
If the same rule or token is mentioned twice you generally must label the elements to distinguish them. If you want to combine multiple elements into a single list, list labels are very handy (though in this case since they have the same name ANTLR will automatically combine them):
Here is the entire rule:
Note that using a simple action in a rewrite means evaluate the expression and use as a tree node or subtree. The
mod argument is a set of modifiers passed in from an enclosing rule.
Deleting tokens or rules is easy: just don't mention them:
If you need to build different trees based upon semantic information, use a semantic predicate:
inMethod is set by the
Often you will need to build a tree node from an input token but with the token type changed:
SLIST by itself is a new node based upon token type
SLIST but it has no line/column information nor text. By using
SLIST[$lc], all information except the token type is copied to the new node.
Using a rewrite rule at a non-extreme-right-edge-of-production location is ok, but it still always sets the overall subtree for the enclosing rule.
You may reference the previous subtree for the enclosing rule using
References to tokens with rewrite not found on left of -> are imaginary tokens.
Here, the CALL node has its line/column info set with info from '(' token. CALL node is "derived" from the '('.
Even tokens referenced within alternative result in nodes disassociated with tokens from left of -> if you put arguments on the references:
Tree construction during tree parsing
ANTLR 3.0.1 could not create trees during tree parsing. 3.1 introduces the ability to create a new AST from an incoming AST using rewrites rules:
- Each rule returns a new tree.
- An alternative without a rewrite duplicates the incoming tree.
- The tree returned from the start rule is the new tree.
- The new tree created with output=AST in a tree grammar is completely independent of the input tree as all nodes are duplicated (with and without rewrite -> operator).
The rewrites work just like they do for normal parsing:
Predicates can be used to choose between rewrites as well:
Don't forget the wildcard
Polynomial differentiation example
For translations whose input and output languages are the same, it often makes sense to build a tree and them morph it towards the final output tree, which can then be converted to text. Polynomial differentiation is a great example of this. Recall that:
- d/dx(n) = 0
- d/dx(x) = 1
- d/dx(nx) = n
- d/dx(nx^m) = nmx^m-1
- d/dx(foo + bar) = d/dx(foo) + d/dx(bar)
Ok, here's a parser that builds nice trees.
Then we differentiate:
then we simplify (a little anyway):
Finally we walk the tree to print it back out using simple templates:
Here is a test rig:
Running the rig on "2x+3x^5" shows:
Rewriting an existing AST
For efficiency, option rewrite=true does an in-line replacement for rewrite rules so you can avoid making a copy of an entire tree just to tweak a few nodes. For example, if you have a huge expression tree but only want to rewrite ^('+' INT INT) to be a single INT node, it's better not to duplicate the entire huge tree. The rewrite mode behaves exactly the same as nonrewrite mode except that rewrites stitch changes into the incoming tree. Nodes are not duplicated for rules w/o rewrites.
The result of a rule with a rewrite is the newly created tree. The result of a rule w/o a rewrite is simply the incoming tree. For chains of rule invocations as in the next example, ANTLR copies rewrites upwards so that the action in rule 's' prints out the tree created in b:
Heterogeneous tree nodes
By default, with output=AST, ANTLR creates trees of type
CommonTree. To create different nodes depending on the incoming token type, you can override
errorNode() in a subclass of CommonTreeAdaptor or implement your own TreeAdaptor. Unfortunately, this only allows you to change the node type based upon the token type and not grammatical context. Sometimes you want to have ID become a VarNode and sometimes they MethodNode object. As of v3.1, you can use the
node token option to indicate node type (in both parsers and tree parsers):
node is assumed if there is only one option and it is not an option assignment. Token references with node options and invoke the following constructor during tree construction:
You can specify the node type on any token reference including liberals:
The "become root" operator ^ is used following the token options:
Labels are available as usual; e.g.,
Heterogeneous tree nodes are labeled on the right-hand side of the -> rewrite operator as well:
You can also specify arguments on node type constructors on the right of -> rewrite operator. For example, the following two token references:
invoke the following two constructors of
The TreeAdaptor is not called; instead for constructors are invoked directly. This is much more flexible because the list of arguments can change per type whereas the TreeAdaptor interface is fixed. Note that parameters are not allowed on token references to the left of ->:
Use imaginary nodes as you normally would, but with the addition of the node type:
Here is a complete simple example:
Sometimes ANTLR must duplicate nodes to avoid cycles and to provide useful semantics. In the next example, the trees returned from rule
type are type
V. There is only one type specification in the input (e.g., "int a,b,c;") but multiple identifiers. to create multiple trees with 'int' at the root, that node must be duplicated by the rewrite rule
We want 3 trees, one for each identifier:
We need to override dupNode() in our node class definition as well as two constructors:
Here is a test rig:
Using custom AST node types
Here's a suitable error node:
Error Node Insertion Upon Syntax Error
Prior to v3.1, ANTLR AST-building parsers did not alter the resulting AST upon syntax error. After v3.1 ANTLR adds an error node as created by TreeAdaptor.errorNode(...) to represent the missing nodes or confusing input sequences. The first token in the error sequence is the token at which the parser first detected an error. The last token in the sequence is the last token consumed during error recovery. ANTLR creates a CommonErrorNode by default, but you can obviously create your own tree adapter and override this.
Let me demonstrate the new mechanism by example. Referring to the attached SimpleC.g from the v3 examples, here is some good input:
That input results in the following tree output:
The grammar and tree construction for the FOR loop is as follows:
Now, remove the first '(' of the or loop:
You will see that ANTLR detects an error, but magically inserts the missing token. In this case, the parser was not asked to insert the '(' into the tree so there is no evidence of the error in the output tree:
What about when you have a random extra token such as "22" before the '(':
Again, ANTLR detects the error and is able to ignore the extraneous token to yield a valid tree.
If you forget a token that must go into the output tree, however, you will see an error node. Given a missing identifier at the start of the FOR loop:
The parser emits:
toString() method of the error node yields "<missing ID>".
When the parser gets really confused, such as when it gets a NoViableAltException, you will see that it consumes a whole bunch of input and adds it to the tree as an error node (it indicates what tokens it consumes during resynchronization). Input:
Making custom error nodes
Just override errorNode() in TreeAdaptor. The default handling is as follows:
Make sure that your error node type is a subclass of your node type so that you do not get class cast exceptions.
See the next section for example of how to override.
Turning off error node construction
To turn this off, just override errorNode:
and then set