Stuff I do not like in ANTLR 2
Updated March 10, 2004
In no particular order, here is a list of stuff that I don't like or
must be fixed for ANTLR 3; includes stuff about the site, support,
build, debugging. My comments are followed by the (in arrival order)
comments of others. This is not a complete list, just the stuff that
annoys the hell out of me. Please send (parrt at antlr.org) your
complaints. :)
Thanks to all for the encouragement and for caring enough to
complain! ;)
Terence Parr spake:
- no unit/functional tests
- the build process
- no separation of runtime vs tool
- no way to get the text or tokens matched for a parser rule.
- parse trees are sometimes useful. Automatic generation would be nice.
My new derivation sequence generation should be automatic as well.
- EOF must be handled properly in lexers (i.e., a virtual character)
- Lexers are pretty damn slow
- Labels are unique to a rule, which was done for exception handling.
It's annoying though because you can't do this (a:ID|a:TYPE) to have
'a' set to whatever was matched. Now you have to set two different
labels and then figure out which was set. Ick. Sriram Srinivasan
brought this to my attention again.
- Approx lookahead vs full LL(k) bites me about once a grammar.
- I hate having to left-factor my lexical rules (though I love the
LL(k) nature of them). Python would have been MUCH more difficult w/o
full LL(k) lexers I think.
- I hate not having a good way to resolve nondeterminisms.
- I want an easy way to start a new grammar (wizard or otherwise).
- I don't think I like the inheritance model for reusing grammars (not
to say you don't want inheritance; you want it at the implementation
level not the grammatical level).
- I hate how hard it is to build a code generator.
- I don't really like how expressions work. A precedence parser would
be much better. Hard to explain to people as it is and it's slow.
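As a rough illustration of what "a precedence parser" could mean for the last point, here is a minimal precedence-climbing sketch in Java. It is not ANTLR code and not a proposal for ANTLR 3's design; the class name PrecedenceClimbing, the operator table, and the integer-only arithmetic are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative precedence-climbing evaluator; all names here are hypothetical. */
final class PrecedenceClimbing {
    // Assumed operator table: a higher number binds tighter.
    private static final Map<Character, Integer> PRECEDENCE = new HashMap<>();
    static {
        PRECEDENCE.put('+', 1);
        PRECEDENCE.put('-', 1);
        PRECEDENCE.put('*', 2);
        PRECEDENCE.put('/', 2);
    }

    private final String input;
    private int pos = 0;

    private PrecedenceClimbing(String input) {
        this.input = input.replace(" ", "");
    }

    /** Evaluates an integer expression such as "1+2*3". */
    static int eval(String text) {
        PrecedenceClimbing p = new PrecedenceClimbing(text);
        int value = p.expression(1);
        if (p.pos != p.input.length()) {
            throw new IllegalArgumentException("unexpected input at offset " + p.pos);
        }
        return value;
    }

    // Consumes operators whose precedence is at least minPrec.
    private int expression(int minPrec) {
        int left = primary();
        while (pos < input.length()) {
            char op = input.charAt(pos);
            Integer prec = PRECEDENCE.get(op);
            if (prec == null || prec < minPrec) {
                break;
            }
            pos++;
            // prec + 1 makes every operator in the table left-associative.
            int right = expression(prec + 1);
            left = apply(op, left, right);
        }
        return left;
    }

    // A primary is a parenthesized expression or an unsigned integer.
    private int primary() {
        if (pos >= input.length()) {
            throw new IllegalArgumentException("unexpected end of input");
        }
        if (input.charAt(pos) == '(') {
            pos++;
            int value = expression(1);
            expect(')');
            return value;
        }
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected number at offset " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }

    private void expect(char c) {
        if (pos >= input.length() || input.charAt(pos) != c) {
            throw new IllegalArgumentException("expected '" + c + "' at offset " + pos);
        }
        pos++;
    }

    private static int apply(char op, int a, int b) {
        switch (op) {
            case '+': return a + b;
            case '-': return a - b;
            case '*': return a * b;
            default:  return a / b; // only '/' remains in the table
        }
    }
}
```

With this sketch, adding an operator means adding one table entry rather than one grammar rule per precedence level, which is the part that tends to be hard to explain in the rule-per-level style.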
John Mitchell
Lack of predicate hoisting.
Matthew Ford
His thoughts are long enough to warrant a separate page:
Tree building proposal
Monty Zukowski
I would like a better way to manage all the warnings. I don't particularly
like the warnWhenAmbig=false flag. For one it doesn't work for all the
cases I want it for.
I think a companion tool could be better at annotating error messages and
presenting new ones/hiding old ones, etc.
An ANTLR Cookbook would be quite handy too, and could be built on top of the
existing examples, and my C grammar :)
Of course there's a slew of tree things but Loring's got that covered from
the design cabal.
Allow syntactic predicate to break out of a loop.
Interactive parsers are not so fun with ANTLR 2.
I would like to be able to trace a Token all the way back to file offsets so
I could modify files in place, not have to regenerate them entirely. That's
not so easy with Unicode.
I also hate dealing with ambiguous keywords!
Chris Daly
- (I thought of this last but I'm adding it at the beginning because
it's so important to me!). Those of us wanting to use Antlr within a
corporate environment have to do something to make the lawyers happy.
I think the single most important thing you can do here is to have a
mechanism for registering contributors. Each contributor should be
reachable (email is fine but the more contact info you have the
better) and should have made some kind of affirmation that they agree
with the license (this affirmation could be an email that you print
and save, but a signed form faxed or snail-mailed would be even better).
I suggest looking at some of the bigger open source projects like
Mozilla or Eclipse or Apache to see what mechanisms and forms they use.
Lawyers would be even happier if the contributors all assign their
copyrights to one person (i.e. Terence) or entity (like U of S.F.) but
I don't think this is necessary as long as all of the contributors are
contactable (and there aren't so many that it becomes extremely
difficult to contact them all).
Beyond that, you mentioned before that you are considering BSD as the
license. That would work for me. GPL or LGPL would totally disallow
me from using Antlr. CPL (the license Eclipse uses) would be ideal
for me but BSD is very doable.
Here are the more technical suggestions:
- Ignore options that Antlr doesn't care about. Warning about
unknown options is ok, but don't just bail out. I have some cases
where I am parsing the .g files to generate some code that works with
the generated parser. So I want to be able to define my own options
that my tool looks at but Antlr ignores.
- never call System.exit()! Throw an exception instead.
- I second your #6, "Labels are unique to a Rule". I would also
recommend replacing the name:TOKEN syntax with name=TOKEN. Using the
colon in that context can cause some head-scratching bugs. Look at
the rule(s) below and see what can happen now when you forget the
semicolon at the end of a rule.
rule1 : a b
rule2 : X;
- When a parser refers to an undefined token complain about it. For
example the following test case compiles without error even though
token DASH is not defined anywhere:
class TestParser extends Parser;
sos : s o s;
s : DOT DOT DOT;
o : DASH DASH DASH;
class TestLexer extends Lexer;
DOT : '.';
TJP: it turns out this is often what you want, but not always. Ric
had this turned on for ANTLR and it made my java grammar complain like
mad. All of those actions in a lexer that say "$setType(FLOAT);" in
rules like NUMORINT will result in errors.
- I'd like to see a better system for managing parsers and lexers in
separate files. I don't have any specific ideas here, just the
feeling that the TokenTypes files introduce an unnecessary extra level
of confusion.
- I want to second what Monty said:
I would like to be able to trace a Token all the way back
to file offsets so I could modify files in place, not have
to regenerate them entirely. That's not so easy with Unicode.
I think Token offset is a more fundamental concept than line/column.
If the latter are available then the former should be too.
- Some way of using a literal without adding it to the literals
table. I always end up with rules enumerating the keywords that are
also legal identifiers, like:
id: ID | "foo" | "bar";
There could be some syntax like "foo"# that means don't (or do) add
this to the table.
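On the "never call System.exit()" suggestion above, one possible shape is sketched below in Java. ToolException, GrammarTool, process and run are hypothetical names invented for this example; nothing here is taken from the ANTLR sources. The point is only that the library part throws, and the caller decides what a failure means.

```java
/** Hypothetical exception type carrying the status a command line would report. */
final class ToolException extends RuntimeException {
    final int exitCode;

    ToolException(String message, int exitCode) {
        super(message);
        this.exitCode = exitCode;
    }
}

/** Hypothetical tool front end; stands in for whatever drives code generation. */
final class GrammarTool {
    /** Library entry point: reports failure by throwing, never by exiting the JVM. */
    static void process(String grammarFile) {
        if (grammarFile == null || !grammarFile.endsWith(".g")) {
            throw new ToolException("not a grammar file: " + grammarFile, 1);
        }
        // Real grammar processing would happen here.
    }

    /** Returns a status code so an embedding program keeps control of its own process. */
    static int run(String[] args) {
        try {
            for (String arg : args) {
                process(arg);
            }
            return 0;
        } catch (ToolException e) {
            System.err.println("error: " + e.getMessage());
            return e.exitCode;
        }
    }
}
```

A command-line main could pass the result of run to System.exit, while an IDE plugin or build tool embedding the same code would call process directly and catch the exception.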
Ric Klaren
- I seriously dislike having to dequote Java strings in the code generator, check them for sanity (multibyte sequences), and then repackage them again. I'd rather see the lexer supply int arrays or something similar so I don't need to worry about quoting other than when writing stuff out.
- Also, the lexer should check the character vocabulary for sanity (no values 0, 1, 2, -1), and all string/character literals should be checked before being passed to the code generators.
- Loring reminded me just now in a post: I want to be able to
reference EOF in the lexer as a 'normal' token. uponEOF is a kludge.
- ANTLR should stay a commandline tool (the core functionality for
grammar inheritance or whatever it should replace I don't care) if it
doesn't run from a Makefile I'm not interested ;)
- More documentation in the code.
- The lack of warnings/errors for incorrect use of options/constructs
- Not being able to specify a default template errorhandler for the
whole grammar.
Internal things:
- Internally I'd like a better interface between codegenerator and what
is now the action parser. (if that's still an issue with the new
codegen/syntax) The near heuristics now used to do the right thing to
translate a #treethingy into something sane is a horror.
- Codegen wise I'd like to know more things before I start generating a
piece of code, so I can cut down on declarations etc. Stuff that does
not need to be constructed does not cost cpu cycles.
- I dislike the 'all-over-the-place' system used for grammar/file
options. I'd prefer having the included codegenerators register
commandline options and handlers in the maintool (and get rid of all
the globals for them).
- Clear semantics for things like:
( { x < 4 }? myRule )*
( { x < 4 }? myRule )+
- Clear documented semantics for the scope of a variable defined in actions:
( { int somevar; }
)*
Or:
( { int somevar; } :
)*
Or dirty stuff like:
( { if( i > 10 ) break; } :
someRule
)*
( { if( i > 10 ) throw ....; } :
someRule
)*
Or a statement like 'You're on your own!'.
- Consistent importVocab/exportVocab behaviour when the lexer/parser are in
the same file and in the separate files cases.
- it would be nice if ANTLR would examine dependent grammars and remake them
if needed. Would probably require some extra options for specifying the
search path for grammars/vocabs. (we definitely need more control for
this as someone else pointed out as well)
- It would be nice if ANTLR3 was designed from the outset for heterogeneous
AST and Token support. In antlr2 this was not the case and it shows.
- I would love a shorthand syntax for:
( stuff ) ( DELIMITER stuff )*
Where delimiter is usually a single token or a set of tokens. The advantage
is that the action code for stuff can be the same. In the current
implementation you have to keep some near identical bits of code
synchronized with the occasional copy paste error resulting from that.
ell used a syntax like ( stuff || DELIMITER )* or something along those
lines.
- It would be nice if antlr would warn for common mistakes like
rule* instead of ( rule )*
Pete Forman
I still have a feature request. Might we have a "-i" command line
option to specify a directory other than the current one to locate the
import vocab file. This would complement "-o" for those such as
myself who keep source and generated/object files in separate
directories.
Anthony Youngman
I think that gcc allows you to specifically disable certain
warnings--along the lines of
warn everything except warning number 82
I suspect that when I get things working the way I want, my screen
will explode in ambiguity warnings! The problem is that there might be
one warning I really need to see, and I miss it in the screeds of
ambiguity crap I can't suppress.
I'd suggest that you allocate numbers to all your warnings (you
probably do that already :-) and then print your warnings via a call
to a central routine. Then have some way of telling that routine that,
if it's called for warning(s) x, it should return without doing
anything.
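The central routine described above could look roughly like this in Java. WarningReporter and the warning numbers used with it are made up for the example; ANTLR 2 does not number its warnings this way as far as this page says.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical central warning routine with per-number suppression. */
final class WarningReporter {
    private final Set<Integer> suppressed = new HashSet<>();
    private final List<String> emitted = new ArrayList<>();

    /** Marks a warning number as "return without doing anything". */
    void suppress(int warningNumber) {
        suppressed.add(warningNumber);
    }

    /** Every warning goes through here; returns true if it was actually printed. */
    boolean warn(int warningNumber, String message) {
        if (suppressed.contains(warningNumber)) {
            return false;
        }
        String line = "warning " + warningNumber + ": " + message;
        emitted.add(line);
        System.err.println(line);
        return true;
    }

    /** The warnings that were printed, in order, e.g. for a companion tool to inspect. */
    List<String> emitted() {
        return emitted;
    }
}
```

A command-line flag listing numbers to pass to suppress would then give the "everything except warning number 82" behaviour.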
Mike Tiller
I'd like the C++ runtime to use data structures that handle heterogeneous tree construction better. I gather the current approach is based on trying to mirror the Java side of things (at least that is what I recall Ric saying). I think a more C++ish design is necessary. I used heterogeneous trees in my project and I don't regret it (in the sense that I strongly prefer heterogeneous trees), but it sure was a pain to work through all the inheritance, reference counting, and type casting issues.
Robin Debreuil
One thing I find strange is using exceptions for flow control in the
generated code. I prefer exceptions to just be used for things you aren't
expecting in code, probably that is just a style thing though. Just this way
they seem to be much like gotos with global variables for state...
For the C# version, it would be really nice to use Enums for all the tokens,
or better yet, categories of them. It makes debugging much easier, and the
whole thing becomes a bit more 'solid'.
Above all though, it would be great to have more error information
available, both in the grammars and when running the generated code. The
program itself is designed to facilitate building that kind of thing into
languages, so it seems kind of like the cobbler's kids going without shoes.
Maybe restrict what is valid syntax in a grammar and catch more common
errors. For the generated code, maybe a debug version - where it can tell
you things like the statement that it couldn't get past etc. Maybe even
things like setting breakpoints on input files.. Probably that would be
hard, but with all the guessing levels, gotos, exceptions, etc, it can get
pretty hard to trace. I may be overlooking a few existing techniques here
though, I'm pretty new to it.
Brian Smith
All the runtime error messages that generated lexers/parsers produce should be
localizable and just generally easily customizable.
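One way to read this request: generated code refers to messages by key, and the text comes from a per-locale table. The Java sketch below uses java.text.MessageFormat for the substitution; ParserMessages, the key "unexpected.token", and both message texts are assumptions for illustration, and a real runtime would more likely load the patterns from resource files than hard-code them.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical message catalog for a generated parser. */
final class ParserMessages {
    private final Map<String, String> patterns = new HashMap<>();
    private final Locale locale;

    ParserMessages(Locale locale) {
        this.locale = locale;
        // Hard-coded here only to keep the sketch self-contained.
        if ("de".equals(locale.getLanguage())) {
            patterns.put("unexpected.token", "Zeile {0}: unerwartetes Token ''{1}''");
        } else {
            patterns.put("unexpected.token", "line {0}: unexpected token ''{1}''");
        }
    }

    /** Looks up the pattern for a key and fills in the arguments. */
    String format(String key, Object... args) {
        String pattern = patterns.get(key);
        if (pattern == null) {
            return key; // fall back to the raw key rather than failing
        }
        return new MessageFormat(pattern, locale).format(args);
    }
}
```

Customizing a message then means changing a table entry, not editing generated code.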
Steve Silber
Better msvc integration support. I know we all hate M$, and I'm no
exception. But for the love of god, there's a LOT of development
going on in MSVC these days, and ignoring the poor saps (like me) who
have to use it is not so nice. Basically, we need for ANTLR to be
truly platform agnostic, not just *nix agnostic.
To wit:
- NMAKE-compatible makefiles for the C++ libs. If we can cobble up
NMAKE makefiles, then we're cool with Ric--we're purely command-
line. I got no problem with that, since if you can make an NMAKE
file, you can make a GUI project for it.
- Has anyone mentioned smoke testing of builds yet? If that's
already happening, let's get MSVC as a test target for the C++ libs.
- A warning-free build for the C++ libs, for all major targets. A
Windows build is still littered with warnings all over the place.
- Let's get multiple error reporting formats. Make them command-line
selectable. MSVC mandates a specific format for error output from
external commands in order to integrate them fully into the
environment (eg. being able to double-click an ANTLR error in the
output window and it taking you to the offending line in your grammar
file).
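A selectable error format might be as small as the Java sketch below. ErrorFormatter, the Style names, and the error code "A001" in the usage are hypothetical. The MSVC-style shape shown, file(line) : error CODE: message, follows the general pattern Visual Studio recognizes for external tools, but the exact form the IDE accepts should be checked against Microsoft's documentation.

```java
/** Hypothetical formatter with a command-line selectable output style. */
final class ErrorFormatter {
    enum Style { GNU, MSVC }

    /** Renders one diagnostic in the requested style. */
    static String format(Style style, String file, int line, String code, String message) {
        switch (style) {
            case MSVC:
                // Assumed MSVC-style shape: file(line) : error CODE: message
                return file + "(" + line + ") : error " + code + ": " + message;
            case GNU:
            default:
                // Assumed GNU-style shape: file:line: error: message
                return file + ":" + line + ": error: " + message;
        }
    }
}
```

For example, ErrorFormatter.format(ErrorFormatter.Style.MSVC, "java.g", 42, "A001", "rule x undefined") yields "java.g(42) : error A001: rule x undefined".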
Sriram Srinivasan
- I love the idea of using a regex to express a variable amount of lookahead.
- For ANTLR3, use the built-in collections and java 1.5 generics
- Need support for associativity and precedence.
- I don't always understand ambiguity warnings. It would be nice if ANTLR could produce a counter example ("These are possible alternatives in the input which these productions can't disambiguate")
- Should be able to use the same label:
m:STATIC | m:PRIVATE | m:PUBLIC ...
Of course, one could do
{m = LA(1);} (STATIC | PRIVATE | PUBLIC ...)
but I should be able to do it either way
- Have ANTLR optionally call javac or jikes on the generated code
and then fix the LineNumberTable in the class file, so that the lines
correspond to the actions in the grammar file. That way, one can use the
- Native Integration with Idea would be sooo nice --- semantic
support (not just code coloring), outlining etc.
- Better support for keywords and identifiers. I should be able to
have standard tokens like "if", "then" and have my own identifiers. If
we can give built-in strings more priority, then we don't have to worry
about clashes between keywords and identifiers.
- Token.toString() can use reflection on the generated TokenTypes
interface to map its type to the corresponding name. This facility can
also be provided as a static method for custom Token classes that don't
inherit from CommonToken.
- Why do the generated methods not return the Token or the result?
(That is, why are they declared void?)
- Since the lexer's performance is usually more critical than the
parser, here are some observations:
- Lexer visits each character at least thrice:
if (LA(1) == '>' && LA(2) == '>') {
match(">>"); // It should be sufficient to say advance(2)
}
then later
return new Token (... new String(...)) // another copy.
- setText should intern the string. This improves parser
performance considerably because one can always do == instead of
String.equals.
I have attached a StringSet class that you may find of use. It works
on String, StringBuffer and char[] keys. The lexer can accumulate
characters in a char array, and StringSet.put() will return the
corresponding interned String. This way, you need to have only one
char[] array in a lexer and only produce new String objects if they
didn't exist before. This way, all tokens get very efficient interning
(much faster than String.intern) and you produce far fewer objects.
On that note, I also use a KeywordMap class that accepts char[] and
StringBuffer.
- Copying/buffering is necessary only if the source is a stream,
not if it is a CharSequence or an array etc.
- Most generated code can be optimized away.
Instead of the following code for the lexer production -- AND: "&"
public final void mAND(boolean _createToken)
        throws RecognitionException, CharStreamException,
               TokenStreamException {
    int _ttype; Token _token=null; int _begin=text.length();
    _ttype = AND;
    int _saveIndex;
    match('&');
    if ( _createToken && _token==null && _ttype!=Token.SKIP ) {
        _token = makeToken(_ttype);
        _token.setText(new String(text.getBuffer(), _begin,
                                  text.length()-_begin));
    }
    _returnToken = _token;
}
I think we can infer enough from the context of the production
to have:
public final void mAND(boolean _createToken)
        throws RecognitionException, CharStreamException,
               TokenStreamException {
    advance(1); // no boundary or error checking here
    _returnToken = makeToken(_ttype, "&");
}
Note that the string for the operator is automatically interned.
We need to check for Token.SKIP only if the action code has this
string.
Is _createToken really needed? A production either produces a token
or it doesn't, and this information is available at grammar compile
time.
- The barrier to entry for a newcomer is still high. I have some
experience with ANTLR, and I too would like a better way to start a new
grammar from scratch (compared to copying).
Perhaps, we can have a wizard that walks the user through a set of
questions and even determines lookahead automatically. The questions
could be like these
- "Which of the following languages does your input most resemble:
java, javascript, python, /etc/password, comma separated lists, etc?"
- "Here's a default list of tokens. Make appropriate changes."
- "Multi-line comment format"
- Single-line comment format
- Whitespace.
- The code for newlines etc. is automatically generated.
- Or perhaps, the wizard can be "trained" using bunches of sample
code
- Ideas for documents
- Have HOW-DO-I notes on various topics : associativity,
precedence, getting started etc.
- Have quizzes on the website for different tracks: "beginner,
intermediate, expert, Parr". Gives targets for learning ANTLR and
parsing concepts.
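The interning suggestion in the lexer performance notes above can be sketched as follows. This is not the StringSet class mentioned there, which is not reproduced on this page; CharArrayInterner and its layout are assumptions. It looks up a slice of the lexer's char[] buffer and allocates a String only the first time that text is seen.

```java
/** Hypothetical interner keyed directly on a char[] slice (open addressing). */
final class CharArrayInterner {
    private String[] table = new String[64]; // length stays a power of two
    private int size = 0;

    /** Returns the canonical String for buf[start .. start+len), allocating only when new. */
    String intern(char[] buf, int start, int len) {
        int mask = table.length - 1;
        int i = hash(buf, start, len) & mask;
        while (table[i] != null) {
            if (matches(table[i], buf, start, len)) {
                return table[i]; // already known: no new object
            }
            i = (i + 1) & mask; // linear probing
        }
        String s = new String(buf, start, len);
        table[i] = s;
        size++;
        if (size * 2 > table.length) {
            grow();
        }
        return s;
    }

    // Same formula as String.hashCode, so grow() can rehash with s.hashCode().
    private static int hash(char[] buf, int start, int len) {
        int h = 0;
        for (int k = 0; k < len; k++) {
            h = 31 * h + buf[start + k];
        }
        return h;
    }

    private static boolean matches(String s, char[] buf, int start, int len) {
        if (s.length() != len) {
            return false;
        }
        for (int k = 0; k < len; k++) {
            if (s.charAt(k) != buf[start + k]) {
                return false;
            }
        }
        return true;
    }

    // Doubles the table and reinserts every stored string.
    private void grow() {
        String[] old = table;
        table = new String[old.length * 2];
        int mask = table.length - 1;
        for (String s : old) {
            if (s == null) {
                continue;
            }
            int i = s.hashCode() & mask;
            while (table[i] != null) {
                i = (i + 1) & mask;
            }
            table[i] = s;
        }
    }
}
```

Strings returned by the same interner can be compared with ==. They are not in the JVM's String.intern pool, so comparing them with == against string literals would only work if the literals were first passed through the same interner.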
Tom Moog
TJP: Tom Moog is the super smart guy that has augmented and maintains PCCTS.
Thoughts on code generation and semantics