|
Home |
Download |
ANTLRWorks |
Wiki |
About ANTLR |
Feedback |
Support |
Bugs |
v2
|
|
|
Latest version is 3.0.1 Download now! » |
|
|
ANTLR 3.0 Error Reporting and Recovery
December 29, 2004Re: error recovery and semantic actions, trees Ok, so we handle errors within a rule for mismatched tokens now. This
introduces a new problem: actions may refer to labels on missing
tokens. For example,
What happens when you have input "3+;"? You will get an error saying that INT is missing, found SEMI. It will "insert" an imaginary INT token to continue parsing. There's the problem: it actually doesn't create a token--right will be null. I think I have a simple solution: have token labels be initialized to
an ERROR token so at least you won't get null pointer exceptions.
In this way, if an error occurs in any of the match() routines, an action may still refer to the token object, it will just get empty text and token type. Not optimal, but better than putting if(error) in front of every action. Heh, what happens if somebody wants to use their own Token object? Then ERROR will require an invalid downcast. The programmer would have to define their own ERROR object. Hmm...ok, worry about this later. Trees. So what happens to tree construction? I haven't thought about tree construction yet, but I need to make sure the error handling will work with tree construction. Normally, syntax errors prevent subsequent processing by a translator. Just like a compiler won't generate code if you have syntax errors. Erroneous input will lead to invalid tree structures leading to tree parsing errors. That said, sometimes you want the bad tree in order to display in a GUI what went wrong with some input, for example. Should an ERROR node be inserted for missing tokens? Probably. What about deleted tokens? Probably should leave in the tree, but marked in some way to indicate a syntax error. Come to think of it, we can actually create a valid tree for most if not all mismatched token situations. It's not possible to create a useful tree, however, when there is no viable alternative (e.g., you enter statement but the first token is a right curly). For input "3+;" I would like to see this tree:
assuming the semicolon is not to be included in the tree. For input
"3+)4", I'd like to see
because we can give the correct tree, but how to encode the error?
What about "()"? I guess you'd want a single ERROR node as parens are
not normally included in the tree. What about "[]"? Perhaps you want:
A simple mechanism might be: "upon error, added ERROR token to the
tree and start the error recovery mechanism avoiding tree additions
during error recovery." In that way, you'd get the real tokens in the
right spot I think. Imagine, we want brackets in the tree, then input
"[3+]" would build
Well, I hope. Whether you can parse the tree is another matter. Perhaps the tree parser would bail out of the current rule upon ERROR token or something. December 28, 2004For v3.0, I have dramatically improved the error recovery to something
similar to what Josef Grosch had in Cocktail. Was pretty easy.
Anyway, it now does single token insertion / deletion if necessary to
keep going. So
says "mismatched token 'then'; expecting ')'; inserting ')'" and keeps
going in the same production. :) If you have
it says "mismatched token 'blort'; expecting 'then'; deleting blort". It notices that "then" is next; so it can trivially delete the blort. :) Ah... All done in one simple routine. Best part is that it does not bail out of the ifStatement rule like it used to. PCCTS had something like this, but you had to manually tell it to do it. Further, better than PCCTS and ANTLR 2, I am using actual local following-token-set information rather than global FOLLOW information to resync. It seems WAY better so far in my testing of the java grammar--it knows precisely what can follow a rule reference rather than what could follow any reference to that rule. :) This is per Wirth circa 1970. Class Parser has a huge number of comments. June 28, 2004Had to drive 250 miles yesterday...lots of time to think about error reporting/recovery improvements. Enhancing Error ReportingOne of the annoying things about simple error messages such as
is that there are probably 200 references to token ID in your
grammar. Where in the grammar was the parser when it found the
mismatched token? Further, you have little contextual information
about where the parser was in the input stream. Line 20, column 12 is
useful, but you have to go into the file to see where that was. I
would love the following (particularly for debugging grammars):
where the @ indicates the position. Now you know precisely where in the grammar and where in the input stream the error occurred. Knowing the location in the grammar is particularly important for tree grammars as the relationship between input and text input and tree node can often be muddy especially for artificial AST nodes. For debugging, given valid input, the question you have is where in the grammar did it fail to match. It is at that point that you go fix your grammar usually. So, how do we implement this. Well, first we need to pass the input stream to the error reporting routine. If the stream is capable of looking backwards and forwards then context of the input can be printed out as the user wants. How do we get the grammar fragment information? One simple way is to
generate a string representation of the entire grammar and then have a
mapping from grammar position to char position within the string. The
error reporting routine could walk backwards and forwards from the
char position to compute the string representing just the surrounding
alternative. We could use an instance variable called gposition and
constantly keep it updated:
or update match() to include the position because that is the only
place where you can get an error:
I don't like burdening all calls to match() with an extra parameter,
but users of ANTLR can optimize by simply changing the code generator
to not include this position parameter and then provide a different
match() method. I just looked at the C output (non exception
handling version) to see how many parameters there were to the match
routine...lots! Check this out:
Rather than print the grammar out as a string for use by the error reporting, perhaps the entire tree structure (i.e., the processed grammar) should be made available to the running parser. The grammar position would map to an actual AST node. The cost would be a slightly larger string (flattened AST as string) compared to the original grammar, but would potentially provide more grammar information to a reporting facility. Hmm...actually, since the trees have down/right pointers (not up/left) it would be hard to walk backwards to print out a larger portion of the grammar. Ok, forget that. Arguably this grammar string error message would be more useful to the programmer than an end user of the tool surrounding the parser. Then again, at a high level showing the possible surrounding structure could be useful to the user if the grammar was clean enough. When confronted with a syntax error, users will often consult the manual to see the valid syntax. The realities of grammar development though means that many of the rules, introduced for coding convenience, just won't make any sense to the end-user. I think that the existing paraphrase option is probably good enough for end-users, though I still want this grammar print out option for development. The grammar string would be small, like 10k, as would the table, but they would really take up space at the bottom of the parser. The position to grammar char position mapping could be thousands of entries long. Hm...seems like this should be a debugging-only feature. How would this be turned on/off? I don't want option bloat. ParaphraseANTLR has a paraphase option so that you can indicate (only at the
rule level I think at the moment) what kind of thing you are trying to
match with a rule in English rather than using the rule name. ANTLR
can then generate something like:
where you have:
if an error occurs anywhere in rule ifExpr. I would add paraphrase for subrules so that any no viable alt exception could print something reasonable. From the user point of view, this is better than printing the name of the rule or the actual grammar position per the above. For debugging a grammar, however, I think I'd like to have the grammar print out. Paraphase has to be dynamic, with a stack of paraphrases. So that putting paraphase="expression" in rule expr will result in that paraphrase being used for any error within a rule invoked from expr. What about paraphrase at the alt level? Most of the time I don't make
separate rules for the various statements. Paraphrase at the rule or
subrule level is blunt: all errors would say "error in statement" not
"error in IF statement" or whatever. The syntax for options on an alt
could get really hairy. Oh I know, use a subrule around it:
Parse trees, derivationsA standard feature of ANTLR will be to create parse trees; they are marginally useful for translation work, but are of a great help when figuring out why your "x=4;" input assignment statement gets interpreted as a weird kind of declaration. The parse tree or (easier to print linearly) the derivation would reveal everything without having to trace the grammar via print statements like option traceParser or via a debugger. Upon recognition error, the parse tree or the final step in the
derivation sequence, derived from the tree, could be printed.
Actually, the current derivation shows sort of where you are in the
grammar and the unprocessed input at the same time. For example,
Hmm...perhaps a derivation is more useful when your input was succcesfully parsed, but improperly interpreted. Yes, that's it. I would rather see the exact syntax (i.e., the grammar) where an error occurred. Auto missing and extra token error correctionGood automatic error recovery is hard, but there are a few simple strategies that work very naturally. The first, described in my blog entry the other day, is to consume tokens until a token in the FOLLOW(that-rule) is found. This is pretty catastrophic failure, however, for a rule. Any
mistmatched token or no viable alt exception exits that rule. It is
often the case that the input is simply missing a token such as:
or have typed an extra token:
Sometimes you want the whole rule to fail (you want to simply say "bad expression" regardless of where in the deeply-nested expr-call-chain the parser found the error). Many times though you want the rule to fix the single token deletion/insertion and keep going in the same rule. PCCTS's "@" operator told PCCTS to handle the error locally like that. I am now thinking that there could an option indicating whether errors should be handled locally within the alternative (is it the same option as defaultErrorHandling or a new option that could be turned on/off independently?). By default, NoViableAlt errors would, as usual, throw exceptions and be caught within the rule. Mismatched token exceptions, however, would be handled at the "match site", perhaps within match() itself. For example, let's start with the default handling of errors in the
following rule:
If the lookahead was inconsistent with any of the alternatives for rule stat, then NoViableAltException would be caught within stat. An error would be reported and then the parser would consume tokens until a token in FOLLOW(stat) was found. If a match() invocation found a mismatch between the desired token and the current input symbol, it would look at what followed the token reference. If the current symbol is in the set of tokens that can follow the token reference, report a missing token and continue without consuming a token. What about extra tokens? Hmm...i wonder if it's really useful to delete a single extra token? Is that very common? I think that missing comma or semicolon is very common, but an extra ID or INT or whatever is not that common. Anyway, if the current symbol is not something that could appear after the token reference, then a single token is consumed. If this following symbol, now the current symbol, is not consistent with what follows, an error is reported and an exception is thrown. Hmm...i may not implement this part. I think I will start by throwing an exception unless I think it's a single symbol omission. Then the normal exception catching mechanism will deal with it. Another reason to have the grammar position: it can let us know what the FOLLOW(some-token) is for error recovery. If default error handling is turned off, then any error, mismatched token or no viable alt, is handled by throwing the exception and not catching it automatically in the rule. You have to manually put in exception handlers. Allow resync sets for rules? In other words, instead of relying on
the FOLLOW, perhaps you could specify the "stop set" as I think they
are called? (You could specify how the parser should resync even on
the alternative level maybe)
The stop set would be inherited by any rule in the rule call chain
below it. Means a stack of these has to be maintained? Sure...no
problem:
would generate:
But wait, errors should be trapped where the resync is defined. The set should indicate what tokens to look for before returning from that method, which implies default error catching would be turned off dynamically for any rule invoked from that rule. In the above, a problem in multExpr() should blast out of it and return to expr() where it would consume until a token was found in the resync set (or EOF) and then expr() would return. So the exception is generated in the nested rule, but trapped in expr. Exceptions in targets w/o native exceptionsHow to implement these exceptions in C or other languages without native exceptions? C has longjmp() but that is not a great solution. I wonder if error return values could handle this? I suppose with a goto after each rule reference checking it's error. Perhaps I'll define the default error handling as the required functionality for a target language. Being able to specify exception handlers would be target-dependent. I fired off a note to Tom Moog, maintainer of PCCTS, to see what his thoughts are for exception handling for C target. June 24, 2004ANTLR 2.x Error HandlingCurrent ANTLR 2.x has a very simple and straightforward error handling mechanism replicating the exception-based strategy pioneered by PCCTS. Upon MismatchedToken (or Char) or NoViableAlt, an exception is thrown by match() or the lookahead prediction, respectively. Catch{} blocks are added to each rule by default, whose action is to call reportError(exception) and then consume tokens until a token in the FOLLOW(that-rule) is found. After consuming, the rule with the error returns normally to the invoker, hoping that it has resynchronized properly. This simple mechanism works suprisingly well. For example, when an error occurs in a statement, it will consume tokens until it sees a semicolon. It consumes the semicolon and returns to the statementList rule, which looks for another statement. To have a single error terminate the entire parse, just turn off default error handling with an option. Exceptions must then be caught by the code that invokes the start rule of the recognizer. ANTLR 2.x can also attach exceptions to rule references to allow
errors occurring in some deeply nested rule invocation chain (such as
expr) to be caught where you have more contextual information such as
"bad IF expr" vs "bad WHILE expr". For example, given when an error
occurs during the recognition of the conditional of an if-statement, a
good way to recover would be to consume tokens until the then is found
on the input stream:
The parser would continue with the parse after the expr reference (because we attached the exception handler to the rule reference) and look for the then right away. PCCTS 1.x Error HandlingPCCTS by default did not do the automatic try/catch insertion.
Instead, you had to either manually add exception catches or you could
use a so-called global handler that was essentially a macro that got
inserted after each rule. For example:
This grammar fragment is functionally equivalent to:
PCCTS also had a NoSemViableAlt exception that was thrown when you had predicates in the decision that failed. It is an indication that, while one or more of the alternatives were syntactically ok, none was both semantically and syntactically valid. PCCTS also had an interesting mechanism for handling single symbol
insertion or multiple symbol deletion, which also worked remarkably
well: the @ operator. You could suffix any token reference with the
@ operator, which indicates that if that token is not seen on the
input stream, errors are to be handled immediately rather than
throwing a MismatchedToken exception. This was great for handling
missing then tokens:
Case 1: missing THEN. If THEN is not found, the parser reports "missing THEN", but keeps going if the next symbol is FIRST(stat). Case 2: delete random token(s). If THEN is not found, the parser tries to consume tokens until it sees FIRST(stat) and then continues by invoking rule stat. This all easily implemented by reporting the appropriate error and consuming until FIRST(stat) then continuing. When working at NeXT on the C/Obj-C/C++ grammar, we found that this automatic mechanism was very very close to a handbuilt parser in terms of intelligent error recovery. Proposed ANTLR 3.0 Error HandlingFor 3.0, my general summary would be: "I want it all (and more...)!". Clearly, I will keep the exception-based strategy, but I would like to provide much more information / capabilities for reporting and recovery. When generating code for non-OO languages, however, I'm not sure how this will all translate, but I remember PCCTS doing parsing exceptions for C as well. I would like to reduce the number of options for both simplicity and for code generation's sake, but I can't think of a good way to specify the default catch action. So, the code generator will automatically insert the try/catch structure around each rule with a default error action of "consume until FOLLOW(that rule) and return". The programmer can alter the code gen template to alter this default behavior if they want or simply put a manual catch after each rule. You will be able to specify exception handlers for any rule. I also
like being able to catch errors generated from a rule reference, but I
don't like having labels be unique to the rule so I can attach an
exception. Unique labels make actions that reference rule labels
harder to write. I also don't like the wacky random @ symbol PCCTS
used (I was out of characters and in a hurry). ;) So, labels will not
be unique and I must find a way to attach an exception to a rule
reference. Because this will not be that common, perhaps one puts the
exception inline:
where an exception could be attached to a block (...) to make it
more clear. Or perhaps we make labels on blocks unique and then can
move the exception to bottom of rule:
Nah...i don't like this unique label stuff, but then how do I trap those inline exceptions? We will need FIRST, FOLLOW sets available for reporting and recovery. The lexer will not do this recovery strategy. Upon error it will probably just bail out and attempt a new token. Information about the grammar not just the token names/types should be included in the output recognizer. For example, I would like to have set of what could have been matched next at any point for reporting purposes. For debugging or even for better error messages in general, I would like to spit out what position in the grammar corresponds to the current parser state when an error is detected. Currently error messages for tree grammars are horrible. We must be able to look back into source (line,col,text) to give a good error. I will be building derivations / parse trees for debugging and testing. These must be consistent up until error occurrence and we must be able to access the partial derivations / trees.
|
|||||||||||||||||||||||||||||||||||||||||||||||