In null vs missing vs empty vs nonexistent in ST v4 a few years ago, I tried to resolve in my head the difference between a missing attribute, a null value, an array with no elements, and a string with no characters. I don't think I got it completely thought through and ST v4 might have some weird inconsistencies. This page is an attempt to finally write down all the cases and resolve exactly how things should work.
I think the general rule should be: no complete <...> expression evaluation ever equals null. A lone x can have value null, but the resulting <x> evaluates to the empty string. A missing entry in a list like [a,,b] is the only way to create a null value in ST and I think this might have been a mistake on my part to allow missing elements.
Secondly, an undefined attribute x or attribute property x.y is the same as null. Note that undefined y gives a warning to the listener.
Single-valued attribute values
This table shows the result of evaluating the indicated expression given the value of x:
| expr | x undefined | x null | x="" | x=list len=0 |
|---|---|---|---|---|
| <x> | "" | "" | "" | "" |
| <x:t()> | "" | "" | "" | "" |
| <x; null="y"> | y | y | "" | "" |
| <x:t(); null="y"> | y | y | "" | "" |
| <if(x)>y<endif> | "" | "" | y | "" |
| <if(x)>y<else>z<endif> | z | z | y | z |
For that, I assume that x.y where x and/or y are undefined is also considered the same as plain x is undefined or null.
I noticed that people are trying to do filtering inside ST expressions, which is not my intention as I think it violates model view separation. All filtering and computation must be done in the model. The template is only to display the list not to compute the list. This brings us to what lists of values do.
Multi-valued attribute values
We must distinguish between a list, array, or other Iteratable passed in as an attribute and what we create using the list [...] operator in ST expressions. Let's start with attributes passed in. x is an attribute set to the value of the Java array construction specified by a surrounding Java application; a,b Java variables have string values "a", "b". t and u are templates that repeat their single argument.
| expr | x={} | x={a} | x={a,b} | x={null} | x={null,b} | x={a,null} | x={a,null,b} |
|---|---|---|---|---|---|---|---|
| <x> | "" | a | ab | "" | b | a | ab |
| <x; null="y"> | "" | a | ab | y | yb | ay | ayb |
| <x; separator=","> | "" | a | a,b | "" | b | a | a,b |
| <x; null="y", separator=","> | "" | a | a,b | y | y,b | a,y | a,y,b |
| <if(x)>y<endif> | "" | y | y | y | y | y | y |
| <x:{it | <it>}> | "" | a | ab | "" | b | a | ab |
| <x:{it | <it>}; null="y"> | "" | y | ab | y | yb | ay | ayb |
| <x:{it | <i>.<it>}> | "" | 1.a | 1.a2.b | "" | 1.b | 1.a | 1.a2.b |
| <x:{it | <i>.<it>}; null="y"> | "" | 1.a | 1.a2.b | y | y2.b | 1.ay | 1.ay3.b |
| <x:{it | x<if(!it)>y<endif>}; null="z"> | "" | x | xx | z | zx | xz | xzx |
| <x:t():u(); null={y}> | "" | a | ab | y | yb | ay | ayb |
Notice that a null value never gets passed as the iterated value. Subtemplates are not applied to empty values. The null option is evaluated for null elements.
Currently, null option values are not counted in the <i> iterated index variable. This would be a breaking change.
Do we need a way to say null inside of template expressions? One thought might be as the default value for a template parameter, but null is the default value for a named parameter. Parameter x has no value unless we set it.
List construction in expressions
Ok, now let's look at [...] list construction within ST expressions. [...] cats lists together or creates a list from single-valued elements. [a,b] is the same as creating {a,b} in Java. If we have lists A,B then [A,B] is a single list with all the elements combined from A and B. I currently allow a missing element to mean the same thing as null so [a,,b] is the same as creating {a,null,b} in Java.
[ ] is a non-null list with no elements. <[]> is "", <[]:t()> is "".
| expr | value | notes |
|---|---|---|
| <[ ]> | "" | list is empty, yields empty string |
| <[ ]; null="x"> | "" | list is empty; no null element => no x |
| <[[ ], [ ]]:{it | <if(it)>x<endif>}; separator=","> | "" | [[],[]] collapses to [ ] |
| <[ ]:t()> | "" | nothing to apply; template not evaluated |
| <[ ]:{it | <if(it)>x<endif>}> | "" | nothing to apply; template not evaluated |
Any missing map entry evaluates to null.
For dictionary d ::= [x:"x"], <d.(x):{it | <it>}> is x and <d.(y):{it | <it>}> is "".
So, to summarize, I think we're ok as-is. I propose that we change null values to count in <i> index expressions if there is a null option in a future version like 4.1 but let's not change it on a point release like 4.0.7.
Because most ANTLR users don't build compilers, I decided to focus on the other applications for ANTLR v4: parsing and extracting information and then translations. For compilers, we need to convert everything into operations and operands--that means ASTs are easier. For example, 3+4 should become a tree with + at the root and 3 and 4 as operands/children: (+ 3 4). The parse tree in contrast is probably (expr 3 + 4) where rule reference expr is the root node and the other elements are children. The parse tree is indeed less useful for compiler work.
On the other hand, parse trees are usually what we want for data extraction and translation tasks. Moreover, the parse trees can be automatically generated and ANTLR can generate listener and visitor tree walker mechanisms automatically as well. Users of ANTLR v4 don't have to learn AST construction syntax or AST tree grammar syntax and semantics.
ANTLR itself is a very complicated translator. It reads in grammar files and generates multiple Java files as output. The translation is complicated by the fact that the output looks very different from the input and the order of the output can be very different from the order the input. There can be nonlocal transformations. For example, ANTLR can decide it needs a local variable or parameter deep within a rule. Token reference x=T defines a label x that ANTLR would translate as something like:
That code goes somewhere inside the rule function for the surrounding rule. At the same time, ANTLR must inject a definition of x into the rule context object for the surrounding rule, which must be generated outside of the rule function. A single element, x=T, generates code in multiple locations. The order of computation of the output is different than the order of the input elements.
The point is that ANTLR does all this without gradually transforming a grammar AST into a Java AST. That would be extremely cumbersome from lots of personal experience. Instead, ANTLR walks the tree multiple times to extract information and also to annotate the nodes. Once all of the information it needs is available, it does another walk over the tree and builds up a model of the output. The output objects are built up as the information becomes available during the tree walk of the input tree. The result of model construction is a complete tree of output model objects, which must be then rendered to text via StringTemplate templates. The name of the object model class, such as RuleFunction, corresponds to a template with the same name that knows how to render that model object to text. I built an automatic walker that walks the model and builds up the giant output template. The final step is to ask StringTemplate to render that template the text, effectively generating code. It works great!
That does not say there is no use for tree to tree transformations. ANTLR does so itself. I believe that the time to use such transformation is when the transformation results in a tree representing the same language, such as C to C or ANTLR to ANTLR. For example, it's extremely appropriate to normalize or optimize input by rewriting the tree. x+0 can be converted to just x, which means converting AST (+ x 0) to x or parse tree (expr x + 0) to (expr x). ANTLR transforms left recursive grammars to non-left recursive grammars using tree rewriting, but it does so manually at the moment. (boo)
ANTLR does not have tree transformation support at this time and it will not look like it did in v3 when it arrives. Many powerful rewriting systems let you perform rewrites in the concrete syntax of the language you are translating. E.g., see ASF+SDF by example. I imagine we'll do something very similar and allow a series of declarative rules like:
Working with parse trees make such concrete transformation rules much easier to implement.
Sam Harwell and I are also talking about an xpath-like or other query mechanism to specify paths and nodes within parse trees. That way you could say: ``go get me all variable declarations'' and so on. There are lots of possibilities, given all of the research that has been done in this area.
Had a nice lunch with Mihai Surdeanu today at Stanford. Mihai does natural language processing research and has used ANTLR in the past to process English text. He asked for 2 things: tokens that can be in more than one token class (token type) at the same time and the ability to get all interpretations of ambiguous input. Sam Harwell is also interested in getting all interpretations.
Tokens in multiple classes
John Aycock and Nigel Horspool wrote this awesome paperback in 2001: Schrodinger’s token. The idea is to allow tokens to be both identifiers and keywords, for example. In a natural language setting, one might want VERB and NOUN or something more specific to a domain.
The simplest implementation is to split the integer token type in class Token into 2 regions: one for specific token types, and then the upper bits as a bit set to specify which class the token can be in. Of course that would limit us to a certain number of classes because there's a finite number of bits in an integer, but let's start with that. If we want to handle the keywords can be identifiers problem, we would create specific token types for the keywords and then use ID as a token class in the upper bits. That way we could say something like this in the grammar:
The lexical rules would look like this:
The code generated for the parser would look like this:
All of the magic happens in match(t). If the token type requested (the parameter t) is greater than the fixed token type region, then it should treat the token type as a bit mask to use against the incoming token. Otherwise, it just works as normal match(t), checking equality.
Actually, the ParserATNSimulator that does the prediction would also need to understand this splitting of the token type integers so that he could do appropriate comparisons like match().
We could imagine allocating the 1st 12 bits or so of a token type integer for the fixed token types and then we are free to use bits 13 above when we define tokens, as I've done with the new syntax above.
If we really needed many more classes, we could define our own Token implementation that recorded an arbitrary bit set. This would require overriding match() in Parser.java. The key part that ANTLR needs to do is to define the token names like IF, ID, and so on so that the grammar can reference them. Hmm..we also need to specify a new ATN simulator so that we could override the functionality there.
We could do all of this with a handbuilt lexer. Mihai was telling me that for his domain specific application, Law documents, the only fixed token is WORD. Everything else is a domain specific token class. A word might be important in one context, BARRED, but not in another.
One final thought: rather than having a fixed pivot point like the 12 bit, we could do this arbitrarily. If we specify 0x00010000 for a token class, then we are using bits 16 and above. If we use 0x00000100, then we are limiting fixed token types to the and the 1st 8 bits.
Oh crap. ANTLR uses sets sometimes for efficiency. For example if you use (A|B|C), it will generate code to match a set rather than a switch for the subrule. Set (ID|';') would try to add a big number for ID into an IntervalSet. That might cause trouble. Also, I currently generate arrays that assume contiguous token types up to a maximum. A token type of 0x10000000 would certainly generate a big wasteful array. Sounds to me like we shouldn't try to trick ANTLR you into doing this and we should formalize the notion of Shroedinger's Tokens. Perhaps we need to be able define token classes like we define tokens:
Actually, maybe we just add a Token implementation that has a separate field, tokenClasses, that we can check separately. match() would have to have some kind of signal however to check the extra field. Probably ANTLR would have to generate match(tokentype, tokenclass), but that's no big deal. It would require formal definitions like the classes section just shown. Hmm...well, this is a good start to remind me of my thoughts if I can get time to implement this.
Specifying an Ambiguity Resolver
This part seems easy. All I have to do is add a hook in ParserATNSimulator that let's people specify a new resolution mechanism. Right now, the code areas that detect ambiguities, all resolve things by calling getMinElement(). Instead of this I could simply make a a call to some kind of ParserAmbiguityHandler.getPredictedAlt(...) with lots of parameters that give enough information to make a decision. To get all possible interpretations, all we have to do is see the same input to the parser multiple times and have the ambiguity handler rotate through the possible alternatives for each ambiguous decision. When there are no more alternatives to choose from at the ambiguous points, we can stop.
Here is everything I passed to the parser listener when I see an ambiguity:
The ANTLR project is moving to github within a few days. Thanks to user anatol for setting up the ANTLR organization and pulling in the perforce (p4) repositories that we've been using. Everything is now set up for us to seamlessly start using git/github. The purpose of this blog post is to announce this move and to outline how I think workflow should go.
Repositories

Branches
I enjoyed reading A successful Git branching model and I think a lot of it makes sense for my small ANTLR team. Basically there are 2 primary branches:
- master. This is the pristine branch on which we cut all releases. If necessary, we can make hot fixes to this branch but it should always compile and run the tests. Releases are just tags on the master branch not branches, as we did in perforce.
- dev. This branch is sort of an integration branch where all developers merge their stuff. Once the dev branch is working, we can merge it into the master to head towards a release.
Graphically, things would look like this:

The feature branches are private branches (these do not get pushed to the github repo) where we fix bugs or create new features.
Workflow
A typical sequence would be:
- create branch X off of develop branch to do some work
- do some work, committing changes to that branch all we want
- when we are ready to merge into develop, we pull from develop in case there were any changes
- compile in test in X branch
- quickly merge back into develop
To cut a release, we merge develop into master in antlr/antlr4 from parrt/antlr4 or another forked repository using pull requests. When we are ready for the release, we set a tag on master. forks and clones pull to get up-to-date.
Contributions from others
Users that would like to send patches or other updates, should fork the repository, make the fix in a branch name of their choosing, and then send a pull request to antlr/*. We cannot accept anything that is not gone through our clickwrap license so for now people will have to go to that page manually and say something like "pull request xxx from user yyy" and hit submit. then I will be able to acceptable request. It is extremely important that we keep the license clean; e.g., this is why eclipse is able to include the ANTLR software.
Why are we leaving perforce?
I have enjoyed and really appreciated p4's the support of the ANTLR project for the last 10 years, but for the following reasons I'm moving to git/github (in no particular order):
- I don't want to manage a server anymore and make backups
- I don't want to have to create users to give people access to the repository; p4's open source license requires that I do not give out a single anonymous user for general access
- github's pull requests
- git's superlightweight branching that keeps the same working directory; this prevents me from having to create a new project in Intellij for each branch
- Intellij's support of perforce is nice but unbelievably slow, at least for me and the people I talk to
- Nice diff/UI at github
- Nice network effect at github and more people are familiar with git than p4
I'm leaving the p4 server running for a while until I can make sure that everything has been checked in.
Update Oct 2012: resolved with correct definition of when to terminate prediction lookahead. Turns out we don't need to push this extra stack element.
Over the Christmas holidays, I've been busy building example grammars for ANTLR v4. The thing I noticed immediately is that grammars just work. There are no error messages from ANTLR when generating code and all we can get are true ambiguity errors at runtime. E.g., if you can recognize T(i) as both a function call and a constructor call. The lexers work extremely well and I'm a happy guy. Nothing scares ANTLR v4, "it's pretty bad-ass" like the The Crazy Nastyass Honey Badger.
And then, building an R language grammar, I ran into an ambiguity warning and an improperly parsed R phrase. Since the runtime warnings had already found 2 other errors in my grammar, I was not worried. Then, I noticed that the input was not only unambiguous, so should not have given a message, it should have parsed differently. 1.5 days later, I finally have figured out exactly what's going on and exactly what's wrong and exactly how to fix it. The only issue is that the fix is going to add more overhead to parsing not only in memory but in speed. Well, maybe.
The flaw
Imagine we have R expression a(i)<-x that we want to parse starting at rule prog in the following grammar.
I got a mismatch token error at '++'. Somehow, ANTLR decided that it should choose alternative 1 in rule expr_or_assign. That was weird. There is no '++' any input and so ANTLR should not have chosen that alternative. If it's confused, it always consumes more input trying to make a decision. What the hell?
I turned on the diagnostic error strategy and it reported ambiguity warning upon "a(i)<-". What? After some playing around, I remembered that ANTLR delays reporting ambiguities now until a situation which the parser can only match ambiguous input. In ANTLR v3, we would have detected the ambiguity immediately at "a(i)", but as Sam Harwell pointed out, that unnecessarily cuts out some opportunities (it would fail over to backtracking unnecessarily occasionally). (That was flaw or lack of strength problem #1 found in v3's LL(*) algorithm). Once I realized that it was actually ambiguous on "a(i)", it made total sense. It's true, "a(i)" can be matched in 2 different ways by the parser.
Here's how. (Let's assume full context parsing here.) Starting in rule prog, we call expr_or_assign, which calls expr, which calls expr_primary. expr_primary can match "a(i)" completely using the second alternative. OR, you can match just the "a" using the third alternative. To do that, means it has to exit and return to expr. Since "(" not "<-" is the next symbol, it exits expr and returns to expr_or_assign. Since "++" it is also not the next token, we exit and return to the loop in prog. That immediately loops back and jumps into expr_or_assign again. This must enter expr, which immediately calls expr_primary. Finally, we match "(i)" to the first alternative. To finish things off, we return to expr, match "<-x" with the optional sub rule and exit back to expr_or_assign. Since there is no more input, we would assume that we had matched expr the of the second alternative not the first (EOF not "++" is the next symbol). Exiting expr_or_assign returns us to prog where we fall out of the loop because we have no more input.
Yes, the grammar is ambiguous because it can match the same input in more than one way. But, the decision in expr_or_assign is never ambiguous. It doesn't matter what we match in expression, even if there are multiple ways to match some input. If it's followed by "++", it should choose the first alternative of expr_or_assign. Period. Any other token following expression should force the parser to predict the second alternative. Whoops!
After much experimentation and creation of software to trace prediction through the augmented transition network (ATN) representation of grammar, I discovered the source of the problem. To highlight the issue, all we have to do is use the recursive equivalent of prog:
Or, for simplicity, the following manual repetition of expr_or_assign exhibits the same behavior for our input "a(i)<-x".
To understand the problem, note that ANTLR uses a specific definition of ambiguity:
If the parser can reach the same state with the same rule invocation stack (context) but starting from more than one alternative, the decision is ambiguous."
The problem is that ANTLR destroys context information when it converts the recursive looping rule into the equivalent EBNF expr_or_assign* loop. Imagine returning from rule expr_or_assign in this version of prog:
It will immediately enter expr_or_assign again, but from a different spot in prog so that the rule invocation stack will have a different return address. At the start of prog, the rule invocation state might be state p whereas the rule invocation state for the second invocation of expr_or_assign might be q. The stacks will be different down in expr_or_assign--one will have p and one will have q. That is context that he needs to correctly choose between alternatives in expr_or_assign, believe it or not.
We lose contact information with a expr_or_assign* loop. At the start of prog, we invoke expr_or_assign from state p, let's say, and if we return, the loop sends us right back to state p. That means that two invocations of expr_or_assign back-to-back look identical because they have the same stack! Ooops. Apparently, this is not really a problem very often. It only came up in this case because my grammar is ambiguous (derived from that infernal tendency of language designers to allow optional semicolons!).
To see how this works, since I know you are all on the edge of your seats, check out the following image and its green path. (Graphviz, has alternative to on top of alternative one for expr_or_assign...sorry about that.) That path is how ANTLR can get to state 17 by going through rule and matching "a(i)<-x". So far so good. We can identify the configuration of our ATN simulator as (17,1,[8]) because we called expr_or_assign from state 8 in prog. (I have left out that path to avoid even more lines on the graph.) If we find a configuration (17,2,[8]), indicating we have gotten to the exact same state with the same rule invocation stack but predicting a second alternative, we have an ambiguity.

Now, to see the ambiguity we need to follow the path matching "a(i)<-x" via alternative 2 in expr_or_assign (starting in state 19). In the following, I have created an Orange line emanating from state 19 leading into expr then expr_primary. Is one of my choices, I can pick the alternative that matches a single identifier via state 45. Then I reached the end of that rule and return to state 29 then 2 state 30 and state 5, the stopped state for expr. Because we invoked expr from state 19 in expr_or_assign, we return to the following state 22, not 17. We fall off the end of the rule and end up in state 11 of prog. We loop around back to the rule invocation state 8. We return to expr_or_assign and reach the decision point state 21 again. The crucial piece of information here is that the context is a stack of [8]. This is exactly the same context as when we entered expr_or_assign originally. In other words, we do not differentiate between the first and second invocations of a rule within a * loop. More below.
Remember that we are pursuing alternative 2 of expr_or_assign at the moment so any path we take from 21 is still considered part of the possibilities for our original prediction of decision 1 (d=1 in the graph). Choosing to pursue the first alternative, state 15, is what leads to trouble. The blue line shows how to match the remaining input, "(i)<-x".

It eventually returns to state 17, the return state from rule invocation state 15. The configuration of the ATN is (17,1,[8]), which is directly in conflict with (17,1,[8]) associated with the first alternative of expr_or_assign. ANTLR declares an ambiguity. (actually, ANTLR figures it out a little bit sooner at state 27.) The problem is that it's not ambiguity. The only way to get back into expr_or_assign from prog is after having matched an ID to expr_or_assign and looping back around. The context stack of just [8] is the same both times. It does not encode the fact that we have already called expr_or_assign once already--the input position is different. The first time, we are trying to match starting at "a" and the second time we are trying to match starting at "(".
Because only one rule, prog, calls expr_or_assign, Strong LL(*) and LL(*) are the same. There's no point in pursuing a full context parse. Strong LL(*) should be able to handle this without resorting to a full context parse. On the other hand, the situation is presumably very rare. I think it comes because of a fairly ambiguous grammar. We could probably let it fail with the regular prediction mechanism and then simply update the full context parse to properly deal with context for * loops.
The fix
A solution involves changing the context from a simple rule invocation state p to include the iteration number. That would mean tracking how many times each loop is gone around and for each context and so on. Making sure to use the right iteration number would be complicated. The better solution is simpler but a little more expensive. All we have to do is mimic what the tail recursive version does: push something onto the invocation stack for each iteration.
Of then we have to pop this pseudo-rule invocations off the stack. The ATN has 2 convenient states associated with loops: StarLoopbackState/PlusLoopbackState and LoopEndState. The ATNs built by ANTLR for loops look like this:

To simulate tail recursion, we do implicit call to the start of the loop at StarLoopbackState/PlusLoopbackState by pushing that state onto the stack as if we had called the start of the loop from that position. Now, as we step through, say, 'x'* we will get a larger and larger stack that must be unwound after the loop. The LoopEndState is a great place to do this. Instead of doing an actual return to the loopback state, LoopEndState pops all elements from the stack associated with the loopback state for that loop.
To illustrate, imagine input "x" and loop 'x'*. At entry, the stack is empty []. We enter the block, which is just 'x', and then reach the loopback state. We push that state, say, q to get [q] context at StarLoopbackState. We finds no more input and so we exit to the LoopEndState, which pops all q on the stack, leaving [].
Now, imagine input "xx" and loop 'x'*. As before, the stack starts out empty, [], and then we enter the block to match 'x'. At the loopback, the stack becomes [q]. We match another 'x' and hit the loopback. Push to get [q q] at the entry state. No more input, so we jump to the LoopEndState, which pops all q, leaving []. Etc.
The same trick works for + loops. Imagine input "xx" for loop 'x'+. We start out with an empty stack and match 'x' in the block. Looping back around to push q to get stack [q] at the block start. We match the second 'x' and hit the loopback. In this case, it's a decision state and those 2 exit to the LoopEndState. That state pops all q (just one this time).
This requires that I alter my closure routine so that it add this functionality to the loopback and loop and states. Should not be too bad but requires an extra parameter to express this "mode". The closure method has to work for both regular protection and full context production. Maybe I will do some cut-and-paste ![]()
Ok, the point of this huge exercise was simply to convince myself I understood the problem and the solution. I think I got. Parser prediction and grammar analysis are extremely complicated, which is why I'm still finding edge cases years after ANTLR v3 came out. The adaptive LL(*) in v4 is truly awesome, but even more complicated!
Wow. 15 lines of code in closure(). Took about an hour.
ramblings about design as stream of consciousness
I built a quick mockup with NetBeans just so that I have all of the Windows in front of me with a couple of faked images. The basic design is easy because NetBeans allows us to move windows around as we want. I happen to lay the Windows out like this:
![]()
(Sam already has the navigator and the editor Windows filled in, but I was too lazy to incorporate.)
Every window has content, publishes events, and listens for events. For example the navigator window, listens for publication events from the editor window saying "heh I have a grammar". It then displays the list of lexer and parser rules. It, in turn, publishes selection events that the other Windows can listen in on. The editor window moves the cursor position to the selected rule. The properties window might say something about that rule. The syntax diagram window shows the syntax diagram for the selected rule.
Keeping Windows in sync
Lots of the Windows published selection events: navigator window, editor window, syntax diagram window, AST window, and parse tree window. They also listen for the selection events so that all of those Windows can stay in sync. If you select a rule in the navigator, perhaps it selects all rule references in the parse tree.
There's a difference between selecting a rule in the navigator and selecting a specific rule or token reference in a tree. The tree views are representations of the tokens in the input window and how they are parsed. So the trees and the input windows need to be kept in sync and they also published selection events to highlight the appropriate elements in the grammar.
The only selection event published by the editor window is the current rule. I guess that means that as you move the cursor through the editor, the rule in the navigator should also be selected.
Debugger
There is no "debugger" per se. We simply run or interpret the grammar and get a parse tree and, if running with output=AST, we also get an AST. Once we have recorded the entire parse, we may want to walk back through it. Technically, there's no reason to actually step through anything because you can see the entire parse. That said, I think it would be useful to be able to single step by input token or even position within the syntax diagram. This information would be available via the parse log generated during interpretation or parser execution. I have no idea what that log will look like but it will clearly need to include:
- grammar (as a reference to the filename or the actual file contents)
- the input
- the parse tree. this would have parameters and local variable values if it's being run not interpreted
- the AST if this thing is being run not interpreted
By selecting elements in the AST or parse tree, the properties window can show information about that node. For example, the parse tree node property sheet which show whatever labels, parameters, etc... were computed. We could even show via reflection the elements of the AST nodes very much like a full Java debugger. At some point, we might actually use the debugger components of the NetBeans platform to make this more "live"
I do not intend to expose the underlying augmented transition network (ATN) to the user. Any kind of stepping should be done by walking the syntax diagram (which is really the dual of the ATN anyway).
Internal models
ANTLR v4 parses grammars into GrammarAST heterogeneous ASTs and then constructs ATNs from those ASTs and finally generates code from the ASTs. I can't get away from building some kind of internal data structure and we already have the GrammarAST so let's keep it. I also need the ATN so that I can generate it into the output (and also use it for the interpreter). Can't get away without that either. Okay so we are stuck with both the tree and an ATN model of the input.
What model or models do we use for the navigator tree and the syntax diagram. The Nodes API needs nodes objects for the navigation window. These will be wrappers on some other internal model object I guess.
The syntax diagram prototype that Colin is working on uses its own model of the grammar rule such as Rule, Alternative, Element, token ref, rule ref, loop, etc... That is probably fine for experimenting, but even as we moved to a prototype it seems like we should come up with an internal model needed by ANTLRWorks that we can share.
One option is to have Colin walk the AST manually or with a tree grammar to compute size and position of the syntax diagram subcomponents. I could also augment the GrammarAST class hierarchy to beogrammer friendly and include a visitor dispatch mechanism. The current hierarchy is actually almost like a simple input model like what Colin is using now:

It looks like I would have to add AST nodes for some of the other rule elements such as * and + loops, but that is no big deal. Then, instead of simple getChildren() I could add specific methods such as AltAST.getRuleElements() etc... This would make it much easier to use by other people.
What do the navigator nodes wrap? The top level nodes are just lexer and parser so they can wrap just a string and pointer to the underlying v4 Grammar object. The second level nodes are the rule names. Perhaps they should wrap the Rule objects created by v4. These Rule objects have a pointer to the AST which knows where it came from in the text of the grammar. That would allow us to navigate, moving the cursor in the editor window. Actually, perhaps the navigator should simply wrap the (not yet implemented) RuleAST nodes.
Yep, as I write this, it seems like the right answer is to beef up the AST class hierarchy so that it's more programmer friendly. If I did this again, I would go the parse tree but that would require some pretty significant changes through the code generator and all the other systems. Maybe I should simply have ANTLR created a parse tree as well as an AST using some kind of option. That parse tree would make it harder to walk, however, at least manually.
Uniquely identifying window elements
We need a way to uniquely identify the rule elements such as a token references and rule references within a rule. We also need to map that unique identifier to a node in the syntax diagram. Nodes in the parse tree and AST must map back to these unique identifiers so that we can highlight elements in the grammar. Basically we need to publish "location 46" and have at least three Windows know which element to highlight: the parse tree, the AST, and the syntax diagram. The navigator should also highlight the rule that contains location 46, whatever that element is.
We can't use the ATN state number because AW has no reason to create the ATN unless it's planning on doing interpretation of the grammar. It would probably create this on-demand. Probably would be better to use something like the unique identifier mechanism I had in ANTLR v3:
As the syntax diagram created nodes from AST nodes, it would record a reference back to the AST from which it was derived. Selecting this syntax diagram node later, would publish the unique identifier of the grammar AST node.
The interpreter feeds off of the ATN, whose states all know from which GrammarAST node they were derived (or could be made to track). As it created parse tree elements, it could track the appropriate unique identifier pulled from the GrammarAST nodes. Uh, wait. The ATN state Transition objects are the things that correspond to grammar elements. As it transitions, the interp would record the GrammarAST identifier.
If we actually execute a parser, we have a problem. At parse time, we always know where we are in the ATN, but we have no idea how that corresponds to elements in the original grammar. I guess I could create a mapping as part of the ATN I generate in the parser file. The mapping would convert ATN transitions to the unique identifiers in the GrammarAST nodes. They are like pointers that would transcend Java VM executions. As long as we have the same grammar, the GrammarAST unique identifiers would be the same. That assumes of course that we have a deterministic process for creating the unique identifiers, which we would by using the above counter.
Such a map would be extremely useful for syntax error reporting during development. Given a state within the ATNState, the parser could immediately identify where exactly in the grammar the input fail to match. We could even pop up ANTLRWorks and show you exactly where in the grammar the parse was.
As part of a parse log, we could keep the exact list of GrammarAST unique identifiers that we pass through. At the moment, I track the ATN state number, but instead I think I should track the transition objects' GrammarAST unique identifiers. This way, ANTLRWorks could single step through the syntax diagrams and grammar to show you the trace.
We also want to be able to click on an input symbol and have ANTLRWorks highlight where the grammar and syntax diagram that token was matched. I guess we need a list of GrammarAST unique identifiers that correspond to the input symbols. Four 100 input symbols, we would have a list of 100 unique identifiers. to single step through the input would then be easy. So, we would have both single step by input as well as single step through the grammar as a result of a parse. tasty.
Action items
I guess I just need to do this:
- Beef up GrammarAST, adding more types, getters
- Add unique ID to GrammarAST
- Add GrammarAST ptr in Transition objects
- Generate information in the generated code that maps ATN transitions to GrammarAST unique identifiers.
- Alter the trace ATN mechanism so that it tracks ATN transitions not the states
- Create a mechanism to record the unique identifier for each input symbol matched during a parse.
I just finished attending a three day workshop on developing standalone GUI applications with this awesome Java applications framework that you've never heard of. Actually, that's not true. You've heard of it but thought it was an IDE--NetBeans. Unfortunately, the amazing applications framework has been hitched to the NetBeans IDE wagon which, for better or worse, has much less market share than eclipse. (I know how NetBeans users feel because I use Intellij, both of which are swamped by eclipse users.)
I've built a number of applications with Swing, but Swing is just so primitive. Every GUI I build has to reimplement all of the usual application lifecycle and menu stuff. No one uses Swing partly for this reason. Unfortunately, I understand that Sun now Oracle uses the lack of users as a justification to dump it. If you're old enough to remember the birth of Java, you'll recall that all the excitement about it came down to one word: Applets. That is pretty funny because of course no one uses it for applets. Why not? AWT, the windowing library at the time, was terrible. Swing was meant to fix this and, years later after they fixed the speed issues, it's actually okay (though it still pretty low level). The problem was and is that it's not an application framework and we build applications not widgets.
Personally, I greatly prefer using a GUI app sitting on my machine rather than a web app (pretty funny since I'm teaching a graduate course in web systems and algorithms this semester). Anybody who's tried to use google's spreadsheet or document editor online knows what I'm talking about. Anybody try using a web-based development environment? Surely, were not going to dump our IDEs and start typing into web browsers. As Cay Horstmann's JavaOne blog points out
But my feeling is that there is a continued need for the desktop. Information consumers won't care, but information producers do. Someone needs to write code, write books, do art layout, run complex calculations. The desktop has been pretty well optimized for that. Java and JavaFX have the chance to dominate that market while everyone else chases consumer tablets.
There are lots of people still building GUIs, but most of these are internally used by corporations and government organizations that aren't necessarily going to blog about their competitive advantage (e.g., trading software) or give out their software. You can check out some sample applications built with the NetBeans Platform. I notice that lots of those organizations still building GUIs are flush with cash: banks, the military, the government, ...
I understand that all the excitement is about the web and mobile applications. I'm sure if I were 22 again, that's what I would be excited about. Building the very large jGuru.com server 10 years ago pretty much knocked the excitement out of it for me. Next.
Anyway, there also sometimes very good reasons not to build web-based applications. Aside from the difficulty of building rich clients beyond some gee-whiz JavaScript, the first thing I have to do with a web application is come up with a Web server somewhere (and pay for it plus bandwidth). This exposes me to security issues and I have to figure out how to get two computers to communicate with protocols instead of using method calls. It's just one more thing to break and I need to have an active network connection to use my application.
I'm really glad people are out there building cool mobile applications and web applications, but please don't deny desktop applications have their place and please make it easy to build GUIs myself. So here is what I want: Oracle should fork off the applications framework, clean it up, and ship it as a separate product. The NetBeans IDE is already just an application that sits on top of this framework. Come on, folks. Let's put some effort into this application platform, to be named later. ![]()
In closing, let me thank Geertjan Wielenga for giving the NetBeans workshop (an excellent instructor and nice guy) and Andreas Stefik for helping explain more about the NetBeans Platform via his SodBeans IDE for the visually impaired.
Sam Harwell, Colin Bean, Udo Borkowski, Shaoting Cai and I are using NetBeans Platform to build a new ANTLRWorks grammar IDE. More on that later. Here are a few snapshots of the editor for grammars and StringTemplates that Sam got working within a few days:


I have a prototype working for the automatic parse tree construction and automatic visitor generation. Imagine we have the following simple grammar:
The usual startup code looks like:
To make it create a parse tree, all we have to do is set a flag in the parser and then ask for the rule context object. The rule context object is actually a node in the parse tree, implementing interface ParseTree:
So, given the following input:
we would see something like the following:
It prints s as the root node and ifstat as the 1st child and then all tokens as children of ifstat node.
Ok, so what else can we do with the parse tree other than print it out? We want to listen in on the various enter and exit rule events, as well as potentially listening in on token visitation events. ANTLR generates 2 more files than it does now in v3: a listener and a blank listener that implements that interface:
The blank implementation is something convenient to override:
Note that if you want to track what happens as the visitor enters every single rule, you can override method enterEveryRule instead of putting code in every single enterRule variation. To visit the tree, just create a ParseTreeVisitor, pass in a listener, and tell the visitor to visit:
This seems like a pretty simple but powerful mechanism. It also is well within the comfort zone of the average programmer because they are providing callbacks instead of embedding actions within a grammar. The other benefit of this is that programmers can call the visitor multiple times without it reparsing the input.
It seems like a lot of extra code to generate in these extra 2 files, but I guess it can be totally ignored if you don't want it. It's great to have all of this infrastructure generated automatically.
Ok, been doing some thinking and playing around and also talking to Sam Harwell / Oliver Zeigermann.
The first modification I've made is to turn parse tree construction on or off with a simple Boolean, rather than having to regenerate the parser with -debug. Also, the parsers fire methods enterMethod/exitMethod with the rule index all the time now since it is so convenient to have these. No more needing -trace and regenerating to get debug output.
The parse tree is useful for debugging because it records a complete trace of how the input was matched by the grammar. Because it can be automatically generated, it is also a convenient way for newbies and experts alike to get something up and running quickly. It nicely isolates their actions from the grammar itself, which can then be reused for other projects without regenerating and recompiling it.
Basic parse tree listeners
ANTLR can generate a visitor that walks a parse tree, triggering methods of a listener. The most brain-dead listener looks like this:
With the enter/exit methods, programmers can specify what to do on the way down and on the way up the tree. By using a listener mechanism, I can hide all of the details of a visitor within the ANTLR runtime library. It's too easy to forget that visitor methods must check their children. In this way, listeners just manage the event, not the walking of the tree.
Parse tree implementation
The amazing thing is that I really didn't have to implement parse trees because we create RuleContext objects for every rule invocation. These objects contain the parameters, return values, labels, and any local variables you specify in the rule. These can act like a parse tree if we simply track them and add a list of children. We only need to create a leaf node pointing at token objects. To add properties to that particular node, just add a new local. For example, consider rule r:
we get the following context object:
Interface ParseTree defines two sub interfaces
The visitor
We want the visitor mechanism itself to be in the ANTLR library. We might have multiple strategies for walking the tree, distinct from firing events from the tree. The following recursive depth first implementation fires the basic events in ParseTreeListener and any rule specific events that the programmer has defined in their listener:
We want ANTLR to generate an interface like the following that has the rule specific events:
ANTLR always generate this interface in case it's needed
So, to specify what to do upon discovering any rule, programmers would override the visitor's enterRule() method. To respond only to the discovery of a specific rule, they would implement a method from the grammar specific interface. For example, imagine the following simple parser grammar:
grammar T;
s : i=ifstat ;
ifstat : 'if' '(' INT ')' ID '=' ID ';' ;
That would result in RuleContext objects with visitor double-dispatch methods in them (normally I don't like these, but they're automatically generated in this case):
Naturally, we can easily generate a blank listener implementation of TListener so programmers only have to implement the events they care about:
Then, to use the visitor on a particular parse tree root, we can do the following.
Listener event return values
Listeners can track whatever state information they want as fields of that listener object. But, what if they want to compute information and have that percolate back up to the root of the tree? Instead of having a return value from the various listener methods and have the visitor pass them back up, the listener methods can simply alter the parse tree nodes. Normally, this is a bad idea because side effects free code is much easier to maintain and read and so on. But, in this case, the nodes are actually rule context objects that store all of the variables and labels defined in rule. If your listener methods need to compute values used by listener methods triggered higher up in the tree, they can save the values in a field stored in the rule context object. Locals, parameters, return values, and labels and get stored in that context. The parse tree is like a freeze-dried version of the entire stack frame used during the parse. Storing return values in the nodes is simply freeze-drying all of the return values. More often than not, however, the result of the visitor is kept in the state of the listener object, which can easily be returned to the invoking code. It also has the benefit of not modifying the trees at all. That would be my recommendation. We can always simulate return values with a map: Map<RuleContext, Object>.
Looking back at my previous post on the visitors and trees, I noticed that I can't put listener events anywhere but the right-hand side of the outermost alternatives. This would be illegal:
It looks nice, but does not make any sense. The visitor only walks the tree after the entire parse. Triggering event inside a parsing loop has no meaning. To truly deal well with expression trees, I recommend using abstract syntax tree is not parse trees. I suppose the tail recursion could be used here to trigger an event upon each add operation.
Dealing with multiple alternatives
Rules with multiple alternatives should have a different event for each alternative, but using automatically generated names like alt1(), alt2(), etc. is a bad idea because a change in the grammar will silently break the listener. That means we need to do something like I was talking about previously and specify the name of the event in the listener. Something like this (NEW SYNTAX):
ANTLR can collect all of these names and trigger discover and finish methods for them. I don't think it's as useful to limit ourselves to just a single event that occurs after we've processed a rule's nodes. For example, what if you want to do something before the visitor reaches the index of the array reference? We need a listener method for the start and stop of the alternative. Notice that I changed the name from the previous incarnation from visitInt to anInt. This way, ANTLR can generate methods like enter_anInt(), etc. We say what the parser has matched, rather than the specific name of a listener or visitor. Having a discover and finish for alternatives with a single token reference doesn't make much sense but we could ignore one or the other method.
These names can also be used on single alternative rules as well to isolate name changes in the grammar from causing compilation errors in the listener. It would prevent runtime errors for ANTLR targets based upon dynamically typed languages.
This '#' allows AST rewrites with -> and visitors for parse trees.
Summarizing discussion from people on the interest list.
Editor
Likes
- the editor works pretty well to help with auto indenting etc to make things look pretty and provide easy to read formatting.
Dislikes
- editor is quirky
- forward and backward arrows don't always work
- undo is character by character
- a number of people pointed out the inefficient and sluggish error checking and syntax highlighting. there are little user benefits for key-stroke-by-keystroke checking while the user is typing, 2) the constant checking is unnecessary usually and really slows down typing performance. emphasis on fixing this feature. NetBeans, for example, only sends off a check event IF the user has stopped typing for 0.5 seconds (they call this the parsing and indexing loop).
- near the end of a line, the cursor seems to shift slightly so that it's not between characters.
- rule/token name completion should automatically select the top choice so you can just hit return
Desired features
- add ability to see matching closing/opening parenthesis/braces etc.
- the "find" functionality should work in the console and other Windows
- code completion for $x.attribute
- using reflection, if the target allows, to code completion on the field/methods of an object like "$user." for some label user.
- code complete the options
Visualizations
people like the visualizations: syntax diagrams (with ambiguous passed highlighting), AST, parse tree.
dislikes
- Resolving ambiguities in the syntax diagrams is not always obvious; tempting but doesn't seem to help unless you already know what the problem is
- Don't use A..B notation in trees or syntax diagrams; For parser it makes no sense.
Debugger
- A user reports that he prefers to debug and the like via Java code
- A user reports that he likes the debugger, but can't use it with his large project with all of his build scripts. perhaps this means we need to make the remote debug thing more obvious?
Interpreter
- for prototyping, quick check parsers working
- The "run" but should be more obvious.
- inconsistent that you type in the input on the left for the interpreter, but not for the debugger
- the interpreter should ignore the actions if there are any and do the best job you cannot input (e.g., it might not work if there predicates)
Exporting/generating
- export syntax highlighted html; example HTML from grammar. (If you click a defining rule name, you'll be taken to the ANTLRWorks graphical presentation of the rule)
- Export PDF of a selected rule and all invoke rules or all of the rules.
- Most of my students generate antlr code in the same folder as the grammar, so it might be nice if the output path was the empty string by default. Others may prefer it the way it is.
- Make it easier to generate -debug output with a menu option rather than having to change preferences.
Overall GUI
- A user reports: I think of the interpreter and debugger as modes. Modes don't belong on tabs.
- Would be nice to design this new version has a set of widgets that cooperate so that people can take pieces of it and inject into other development environments or whatever
- Allow the visualization window to "Pin" to the bottom, like most IDE's do, effectively allowing it to auto hide.
Misc
- console spews continuous error messages
- clear the console when you generate or run any other kind of analysis.
- More documentation; maybe that's a book thing
- Preferences can only be set once a grammar is open. Preferences should be able to be set regardless of having a grammar open.
- tool tips about what the various stuff is; e.g., what is an UP token?
- It would be nice if the preferences window followed the standard paradigm of an Apply, Ok, and Cancel buttons. If no settings need to be changed, the apply button is greyed out.
What I did not hear people talk about
I've only heard from a small number of people at this point, less than 24 hours after my initial post, but no one mentioned:
- Refactoring
- displaying decision DFA (well, one person mentioned to say that sometimes it gets stuck)
- only one person mentioned showing the generated code
- group/ungroup rules, rule collapsing
After a few weeks away from ANTLR v4 coding, I'm back to thinking about tree grammars and the automated generation of tree visitors. I recently replaced a number of tree grammars in ANTLR v4 itself with much simpler visitor implementations. Doesn't require a separate specification and is much easier to debug. I made an ubervisitor that actually matches patterns in the tree rather than nodes (using a single prototype tree grammar) and then calls listener functions. The ubervisitor has the combined pattern matching functionality necessary for the various listeners.
A tree grammar is nothing more than a formal specification of the tree structure created by parser grammar. From this tree grammar, I generate an external visitor (See Language implementation patterns). People always ask if there is an automated way to create tree grammar from the tree construction instructions sitting in the parser grammar. It's a pretty hard task given the complexity of the rewrite instructions and so on. I was thinking that I could do something that observed the trees you created and combined it with knowledge of the grammar to produce a tree grammar. Sounded pretty complicated. But, that got me thinking about using the structure implied by the grammar to automatically produce high-level visitors (i.e., visitors that don't simply tell you that they found an ID node, which is pretty worthless). What we want is to match phrases per what you had in the grammar. Oliver Zeigermann and I have been tossing some ideas around, which also led to my thoughts on this blog page.
Newbie mode
What if we had a newbie mode that literally did not expose an internal representation (tree or otherwise). Let's face it. most of the time all we want to do is parse something and then visit the elements multiple times. We'd reparse all of the time if it weren't so slow and it didn't sound so unsophisticated. Also, trees or some other internal data structure are really great for keeping little pieces of information. We like heterogeneous tree nodes because we can have different information per node. To replicate that in a visitor, we could simply deconstruct the nodes or whatever the internal representation is, and simply send that information back to the visitor as parameters of a function. No need to have any of this exposed to the user. Here is a simple expression grammar thingie:
We don't need to build a tree per se. What we need is to be able to visit the phrases and tokens multiple times, and efficiently. We also don't want to have to build a tree grammar, which requires tree construction by the user, and we don't want to have to build a visitor by hand.
So, perhaps we automatically generate visitor methods that mirror the grammar. This has the advantage that were not using the generic visitor that visits node by node. Here's what we could easily generate automatically from the above grammar:
The return values get passed as parameters to other visitor methods. For example, rule stat calls rule expr and saves the return value in label e. We pass that e value to the stat() visitor method. You can return nothing or a payload of interest, depending on the application of the visitation task. We could even let these functions set data "local" to each particular phrase matched by the parser. Subclasses of OllieListener could maintain state by using instance variables. If we want to persist the parameters and return values, this could easily be added to the functionality of the internal representation.
The point is that we don't have to specify tree structure. We don't have to build a tree grammar that calls our listener functions (or has actions directly inside). We don't need to create heterogeneous tree nodes; the parameters of the various visitor methods represent the elements of tree. Actually, as I say, we don't actually care what's on the inside. could be a list of tokens that we re-parse continuously.
The other thing to note here is that I don't call them visitStat, etc. these are actually listener methods triggered by a visitor. We don't want the user to have to make the right calls inside these methods to make sure that we walk the entire intermediate representation.
From looking at the work of my friends, Jurgen Vinju (ASF+SDF) and Eelco Visser (Stratego/XT), there are a couple of very interesting lessons:
- you don't always have to care what the internal structure of the input is
- when doing transformations, using concrete syntax like "x+x -> 2*x" is useful; probably implies a parse tree as the internal structure
- it's a great idea to separate how we walk the input from what we do while discovering nodes or phrases.
The negatives
I can imagine this being less efficient because I would likely build a parse tree instead of an AST. Parse trees have internal nodes that represent rule applications/invocations, but for many applications we care more about the relationship between the tokens than we do about which rules were used to match them. So, these internal nodes are a waste. What we really want to do is to create operator-operand trees such as (+ 3 4) where + is at the root and has 2 children, 3 and 4. Power users will still want the ability to create their own ASTs and tree grammars / tree filters for maximum efficiency and control.
The other problem is that every time you change the grammar, the visitor would change. Like automatically generated GUIs, we have to worry about code that users supplied that no longer apply or should go into a different method. So, if you rearrange alternatives one and two of atom above, all of a sudden the code would break since atom1 and atom2 would be flipped. Worse, it might sort of work. I guess one way to mitigate this is to make sure that you get a decent grammar 1st and then generate the visitor. Minor tweaks or reordering from that point shouldn't be a big deal. I can't think of a good automated way to reduce the sensitivity of the visitor to the original grammar since we are relying on the grammar for structural information. If we left special markers in comments, we could do like the GUI designer tools do and try to shuffle programmer supplied code around to the right spot.
For ANTLR v4, I had a single tree grammar that walked the nice orderly tree created from my parser grammar. That isolated all of my visitors from changes to the tree and changes from the original parser grammar. I don't see a way to isolate visitor code from grammars automatically.
If we want to do transformations, however, we can use concrete syntax which hides not only the internal structure of the tree but does not have any visitors that programmers can modify. All the functionality is extracted from the side of "this goes to that" transformations. Of course, this only works when you're doing translation. If you need to trigger calls from the configuration file into your massive server, there's no escaping the fact that we need to execute code snippets. Where to specify and how to specify those snippets is the key question.
Pattern-based visitors
Maybe we should borrow a page from the term rewriting guys and use patterns instead of special method names. For example, we could specify contexts (rules) and then patterns followed by actions to execute if those are seen:
This would work very well for operating on pieces of the input, such as all variable declarations. Of course, requires an extra step to convert this to Java code and without a plug-in an IDE must treat this as a text file.
One of the negatives I can see here is that there might be a lot of cut-and-paste from the grammar into these patterns. The pattern for a method declaration can be rather large. In contrast, a visitor would simply say visitMethodDecl() or whatever. Obviously, one could simply put a call to visitMethodDecl() inside the action, isolating the code from the patterns.
Contrast this pattern-based approach, with a tree filter grammar that triggers actions. I would have to specify how to build trees in the parser grammar but could get away with specifying only the patterns of interest in the tree filter like this:
The pattern-based approach seems easier and, because it uses concrete syntax, could be easier to use.
Java annotations approach
I wonder if there's a way to use annotations to specify patterns of interest to avoid having to run ANTLR or some other tool on those patterns or rules to get the visitor. (This won't help the other languages but Java is probably the largest market for ANTLR, so I will investigate this option).
From the collection of annotations, and optional patterns, we could run a static annotation analyzer to derive a visitor that would trigger these patterns. We could also do it at runtime so that there was no tool to run whatsoever to make this work. Running the static analyzer would simply make it go faster.
I can see that the names of the parameters could be an issue because changes to the labels in the grammar would force changes in the method signatures. I guess at least we would know about it because the annotations.
Mixed-mode rules and listener triggers
Finally, it's probably worth exploring a simpler and more direct approach: adding triggers to the original grammar but having ANTLR actually invoke them during its traversal of an internal data structure.
Here, I'm using the -> rewrite operator to indicate the name of a function to trigger. The parameters to the function are the labels if any on the elements to the left of the -> operator. To be more general, we can allow arbitrary actions to the right of the rewrite that would call the listener function after performing some operations and passing custom values to the listener.
ANTLR would generate a visitor as as well as the parser. The parser would have embedded actions to build a parse tree. The programmer could request multiple visits over that tree without having to re-parse the input. It only looks like we are triggering actions during the parser.
I'm not sure I like overloading/abusing the -> rewrite notation, but I guess it's okay.
This approach is nice, because you might have noticed that we don't call visitAdd unless we see an addition operator. Some of the previous versions call visitExpr even when all it does is match INT down in atom.
The user doesn't have to figure out a tree to build, ANTLR would simply build a parse tree. Because we are specifying the names of functions to call, the grammar and resulting visitor functions should be less tightly coupled than the visitor methods atom1, atom2, atom3 for the 3 alternatives of rule atom, for example.
Introduction
I'm abandoning this post mid-stream...seems that regular alternatives can match erroneous input just as easily as so-called error alternatives. Because of adaptive LL(*), it shouldn't affect production speed at all once it gets warmed up.
ANTLR has a built-in mechanism to detect, report, and recover from syntax errors. It seems to do a pretty good job. Certainly it's better than PEG, which can't detect errors until EOF. It does single token insertion and deletion automatically to deal with lots of common cases where people forget a symbol or have an extra one like "f(x;" and "f((x);".
The default mechanism recovers by scanning for a token that can follow the current rule or the invoking rule and so on. This work surprisingly well and is pulled almost directly from Niklaus Wirth's early Pascal parsing discussions in the 70s.
Yacc has error productions like this:
that mean consume tokens until you reach SEMI, replacing everything but the SEMI with the error nonterminal. The error rule acts like a non-greedy ".*" loop.
My intuition is that scanning with the power of the parser rather than ".*" will allow a great deal of control over resynchronization. At the very least, it it will provide a great deal of flexibility in terms of generating good messages. We can simply list the common mistakes and add actions to emit very specific error messages. With only ".*" to distinguish erroneous input, yacc it might have trouble distinguishing between different erroneous sequences (in order to generate specific error messages).
The error productions serve both to recognize the erroneous input and to resynchronize the parser. After specifying the grammatical structure of the erroneous input, the error production can include the grammatical structure needed to resynchronize. After detecting an incorrect variable declaration prefix, a programmer could specify how to consume all the way through the rest of the declaration, even if it's not terminated by a \n or ';'.
where '/' is a "bent" or "not quite right" alternative operator, to distinguish from the PEG ordered alternative operator. One could even imagine skipping over the entire method body if the header is messed up:
Here, we match any kind of header and then skipped past the entire body. There's no way a simple "scan until resynchronization set found" loop could scan past a nested construct like a method body.
Another important thing to mention is that the ".*" is not scanning until it sees the 1st token of rule body. It's a non-greedy loop that skips anything until a valid body is seen.
Proposed mechanism
To support their productions, rule alternative blocks will be extended to allow 0 or more error productions at the end. Rules would not support error productions. (I looked through a number of grammars and couldn't really find a situation in which the default error mechanism was seriously lacking.)
The normal mechanism, without error productions, traps mismatched token and no viable alternative errors within the productions of that rule. After reporting an error, it resynchronizes to follow set. (assuming we could not repair the error with single token insertion or deletion inline). If the user specified a catch expression, that code was executed in lieu of the standard mechanism. Any finally clause is executed after everything else, right before returning from the rule method. Rules look like this:
Here's a rule with a few error productions:
Boy, I'm sure having a hard time coming up with decent examples.
At long last, I'm back on the ANTLR v4 rebuild after 9 months hiatus to write an academic LL(*) paper with Kathleen Fisher and release StringTemplate v4. Woot!
Ok, so what does all that title nonsense have to do with ANTLR v4? Well, v4 will use all those things at some point, either in analysis or in the generated code. I'm proposing something a little different for v4: Along with a recursive-descent parser, I'm going to serialize the grammar down to a simple PDA (push-down automaton) represented by the bytecodes of an interpreter (Derived partially from Cox' description of Thompson's 1960s work). The interpreter will serve many useful purposes:
- a pure interpreted version of a lexer/parser/tree parser; related work: A Text Pattern-Matching Tool based on Parsing Expression Grammars, but I will need to extend for semantic predicates and actions
- the backtracking mechanism for parsers and tree parsers
- the mechanism for handling lexical constructs that a DFA cannot handle.
- improve error recovery
- support "code completion / intellisense" type functionality in development environments
From the bytecodes, we can build a data structure in memory that will be more convenient for a lot of purposes. The data structure will effectively be a recursive transition network (RTN) or, if there are predicates and actions, an augmented transition network (ATN).
Pure-interpreted parsing
Sometimes you need to parse some data but going through the whole ANTLR code generation and then compilation process is overkill. It would be nice to have a good interpreter that could parse input described by a grammar without having to deal with code generation. We can short-circuit the ANTLR tool at the point it's created an internal ATN but before code generation. Then, we can simply run an interpreter over the ATN.
If you don't mind a code generation phase but don't want a huge output file typical of recursive descent parsers, we could also generate the bytecode for a PDA into a Java file. The bytecodes are like a serialized version of the ATN. Even if you don't plan on interpreting the grammar, I think I will generate the bytecodes anyway for the recovery stuff I describe below. The bytecodes even for large grammar shouldn't be more than a few k or 10s of k bytes. Hmm... this does bring up the whole issue again in Java of generating static arrays (see Implementing parsers and state machines in Java).
Interpreted backtracking
Unlike most backtracking parsers, ANTLR v3 supports arbitrary actions, even those that cannot be undone. It does that by turning off the actions while backtracking. Upon success, it rewinds the input and then re-parses the input, this time with feeling to execute the actions. To pull this off, I need to generate an "if backtracking" conditional around every action, which causes a problem if you've defined a local variable in that action. All of a sudden that variable is invisible to the other actions. Also, at least the way I generate syntactic predicates in v3, backtracking results in code bloat.
If I have a separate mechanism to do backtracking, I don't need to guard all of the actions with a conditional and there's no code bloat. Syntactic predicates will launch an interpreter (as bytecodes or walking an ATN; doesn't matter) to see if a particular grammar fragment matches. Even if this turns out to be much lower than using generated recursive descent code fragments, ANTLR works very hard to remove unnecessary syntactic predicates. I doubt even a slow implementation would affect the overall parsing time significantly.
Handling context-free fragments in the lexer
ANTLR v3 lexers annoy me and probably you too. v4 lexers will be DFA-based and, theoretically, should be much faster as a result. Further, they will also behave much more as people expect in some edge cases. (I'm adding modes since they are really useful for parsing HTML and other embedded languages). The problem is I really want to support non-greedy operators, which are really hard to implement in a conventional DFA but trivial in an NFA/ATN. I also want to allow arbitrary actions interspersed among the rule elements. Oh, and recursion. What this comes down to is using the DFA as an optimization if you don't do anything wacky. If you come up with a rule that's recursive or has actions and non-right edges or non-greedy operators, ANTLR generates PDA bytecodes for those rules and, importantly, any rules that look like they could match the same thing. Upon entry to nextToken(), the lexer uses the current character to determine whether to jump to the PDA or the DFA. Oh, I almost forgot--Unicode characters can cause trouble in terms of table size for pure DFA. Range 'a'..'\u8000' for example, makes a transition table 0x8000 elements big for each state that could reference it. It makes more sense to deal with Unicode with a separate mechanism. I think JavaCC does this too.
Better error recovery
If we have a mapping from every location in the grammar to a position in an ATN, we should be able to do a pretty good job of error recovery. For example, we can supply error alternatives without bloating the generated code at all. Here is a contrived example with two regular alternatives and two error alternatives:
If there's a syntax error, we can suspend the parser and launch an interpreter that tries to match one of the error alternatives. If one succeeds, an action in that alternative will properly report an error and resynchronize. If nothing matches, we can do the standard recovery boogie.
The standard error recovery is to compute the set of possible following symbols, using the parsing stack as context to provide the exact follow rather than FOLLOW(the-rule). If we have the entire grammar represented in some way, ANTLR does not have to compute and then generate code to store bits sets for error recovery. I can just walk the ATN at runtime. It also doesn't have to pass these bits it's around as it does in v3. For rule reference r, currently I need to generate:
Likely I will need to keep some kind of pointer and stack of pointers into the ATN representation of the grammar in order to pull off this error recovery.
Code completion
Beyond error recovery, we can provide good information to the user about what the parser could've accepted at any particular error position.
After reading more about whitespace handling in scannerless parsing generators (e.g., GLR, PEG), it looks like you have to manually insert references to whitespace rules after every "token rule" and one at the beginning of the parse. So apparently, ANTLR is a scannerless parser generator if you simply use characters as tokens.
This page shows not only how to build a real scannerless parser in antlr but also shows how to build abstract syntax trees (i.e., not parse trees)!
Scannerless parser
Here is the grammar from before updated with whitespace handling and keyword checking in rule id.
Given input
ANTLRWorks reports the following parse tree:
![]()
The other problem is to avoid matching keywords as identifiers, which we can handle with a semantic predicate. Oddly enough, we can do it without a predicate that gets hoisted into the decision-making process. The predicate in rule id makes sure that we have not found a keyword. Given input:
we get a bad, but correct error message:
line 1:6 rule id failed predicate: {!keywords.contains($text)}?
Notice that, because the predicate appears after letter+, there is no way it gets hoisted into the prediction for rule stat. It turns out that syntax alone is enough to distinguish the alternatives because "return" it is always followed by an expression but not by '=' or ':'. The order of the alternatives in stat does not matter; LL(*) correctly handles the decision.
What if we have two syntactically identical alternatives where the left edges start with a keyword and an identifier?
ANTLR reports the syntactic ambiguity:
warning(200): T.g:21:5: Decision can match input such as "'r' 'e' 't' 'u' 'r' 'n' ' ' ';'"
using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
It resolves it correctly, however, because priority is given to the first alternative. If we swap the order of the alternatives, though, "return" would match as an identifier. We could fix this with a semantic predicate, but it would be awkward in this scannerless mode. The predicate would have to march ahead to match an identifier and then test it against the keyword list. The validating predicate we have above in rule id checks the text after we've matched the identifier.
To do this for real, I think I would make ANTLR deal explicitly with scannerless grammars so that it could automatically add "ws?" on the end of any token rule. we already have the notion of a token rule, anything starting with an uppercase letter. So if you define
Then ANTLR would automatically insert WS? on the end of that rule. You can also optimize any backtracking or memoization that had to happen, since it knows that the rule is really a token.
Oh, you want AST construction?
The usual ANTLR AST construction mechanism works too, but it's a little bit more work because the individual elements are characters not tokens. Here I have defined a few imaginary tokens for use as AST nodes. For input:
// start return 23; // foo foo = bar; // end
we want AST:

Here is the grammar with AST construction modifications:
PS: woot!
Scannerless parsing generators have an advantage over separate lexers and parsers: it's much easier to create Island grammars, combine components of grammars, and deal with context-sensitive lexical constructs. I still think I prefer tokenizing the input, but thought I would run an experiment to see what a scannerless ANTLR grammar would look like.
I started out with the grammar that contained an LL(*) but non-LL(k) rule (stat). Because we're looking at characters as tokens, referencing rule id on the left edge of the second two alternatives of stat represents an infinite, left prefix. id also conflicts with the keyword rule kreturn. Here's the grammar:
The DFA that predicts alternatives of rule stat looks complicated, but it's really just trying to see past the letters to the characters that follow. LL(*) handles this with no problem but LL(k), with its fixed k lookahead, couldn't predict alternatives. Here is the DFA:
![]()
The trick to making this work is to create a stream of tokens that are really characters. The problem is that, since were making a parser grammar not a lexer grammar, ANTLR thinks that the character 'a' is really some random token type instead of the ASCII code. Using the tokenNames array in the generated parser, the following class figures out what the right token type is for each input character.
To try this out, the following test rig works. The only difference is that we are using a weird kind of lexer:
For input:
return 23; a = b;
We get the following output (the hash table printed out first is the mapping of ASCII code to token type). we ignore any input character for which the grammar has no reference, such as newline whitespace. Finally, I print out the token list before parsing and then, generating with the trace option, we see the entry and exit rule events during the parse:
$ java org.antlr.Tool T.g
$ javac *.java
$ java Test input
{48=4, 49=5, 50=6, 51=7, 52=8, 53=9, 54=10, 55=11, 56=12, 57=13, 59=14, 61=15, 97=16, 98=17, 99=18, 100=19, 101=20, 110=21, 114=22, 116=23, 117=24}
no token type for char ' '
no token type for char '
'
no token type for char ' '
no token type for char ' '
no token type for char '
'
[[@0,0:0='r',<22>,1:0], [@1,1:1='e',<20>,1:1], [@2,2:2='t',<23>,1:2], [@3,3:3='u',<24>,1:3], [@4,4:4='r',<22>,1:4], [@5,5:5='n',<21>,1:5], [@6,7:7='2',<6>,1:7], [@7,8:8='3',<7>,1:8], [@8,9:9=';',<14>,1:9], [@9,11:11='a',<16>,2:1], [@10,13:13='=',<15>,2:3], [@11,15:15='b',<17>,2:5], [@12,16:16=';',<14>,2:6], [@13,0:0='<no text>',<-1>,3:1]]
enter prog [@0,0:0='r',<22>,1:0]
enter stat [@0,0:0='r',<22>,1:0]
enter kreturn [@0,0:0='r',<22>,1:0]
exit kreturn [@6,7:7='2',<6>,1:7]
enter e [@6,7:7='2',<6>,1:7]
enter integer [@6,7:7='2',<6>,1:7]
enter digit [@6,7:7='2',<6>,1:7]
exit digit [@7,8:8='3',<7>,1:8]
enter digit [@7,8:8='3',<7>,1:8]
exit digit [@8,9:9=';',<14>,1:9]
exit integer [@8,9:9=';',<14>,1:9]
exit e [@8,9:9=';',<14>,1:9]
exit stat [@9,11:11='a',<16>,2:1]
enter stat [@9,11:11='a',<16>,2:1]
enter id [@9,11:11='a',<16>,2:1]
enter letter [@9,11:11='a',<16>,2:1]
exit letter [@10,13:13='=',<15>,2:3]
exit id [@10,13:13='=',<15>,2:3]
enter e [@11,15:15='b',<17>,2:5]
enter id [@11,15:15='b',<17>,2:5]
enter letter [@11,15:15='b',<17>,2:5]
exit letter [@12,16:16=';',<14>,2:6]
exit id [@12,16:16=';',<14>,2:6]
exit e [@12,16:16=';',<14>,2:6]
exit stat [@13,0:0='<no text>',<-1>,3:1]
exit prog [@14,0:0='<no text>',<-1>,3:2]