Custom Syntax Error Recovery

Skip to end of metadata
Go to start of metadata

Custom Syntax Error Recovery

An important part of a robust and useful parser is the behavior it exhibits when the next token in its incoming token stream is not one that the grammar specifies should be there. This is generally known as a syntax error. Here we look at how ANTLR recovers from the various mismatch cases and what techniques you can use to override or influence that recovery. The examples shown use the Java language, but the same techniques should apply to all ANTLR targets.

ANTLR implements good error recovery mechanisms by default within the runtimes of all target languages but in some cases the way a grammar is structured impairs the ability of the algorithms to recover to exactly where you might expect or wish. Sometimes your grammar rules cause a loop in a parsing rule to be exited earlier than you would expect or want; sometimes when a certain construct is in error you want to skip everything up to the end of that construct instead of resuming at wherever ANTLR sees a token that looks like it is a valid recovery point. There are many reasons you may wish to influence or override the standard recovery behavior. However, before we can examine how to implement your own recovery, we need to know something about how ANTLR recovers from mismatch problems.

Recognition Exception

Once a mismatch is detected, the generated code causes the target language equivalent of the Java Runtime's RecognitionException to be thrown or otherwise handled. If you examine a method generated for a parser rule in the Java target, you will see that the rule logic is encapsulated within a try {} catch {} block, which catches RecognitionException:

Generated Exception Handler

You can see that the exception handler does two things, the first of which is to report the error via whatever mechanism is available. Error reporting is covered elsewhere in the Wiki but essentially, if the parser is not already in error recovery mode, then reportError() increments the count of errors it has seen and calls displayRecognitionError(), which is generally the method that one overrides to provide custom display/capturing of parser error messages.

The second thing that happens after reporting the error is that the recover() method is called - it is this method that attempts to resync the token input stream to some place that makes sense within the context of the current parser rule and its parsing point within that rule. In order to do this, it must compute a Follow Set for the current parsing point - note that I am trying to avoid technical definitions and jargon here in favor of understanding. A follow set is basically a list of all the tokens that would be valid if they were seen at the this point in the parse. The recover() method then basically consumes tokens from the input stream until it locates a token that exists within this set. It may consume no tokens if it feels that there is actually a token missing.

Implementing Your Own Exception Handler

So, one obvious way we could influence recovery is to override the default implementation of the recover() method by either adding your own implementation in @parser::members() or better yet, in your own superClass (see options {}). However, this is fine if the recovery should always take place in the same way regardless of what the parse point is, but if that was the case, you probably would not have found this article, as the default implementation of recover() probably does what you want anyway.

When you want to do something special for a particular rule, you implement your own handler for the RecognitionException, and recover in whatever manner seems appropriate. Generally, you will still call reportError(), unless you wish to suppress the error, but then you can use your own method/algorithm to recover the input point. To do this, you add a catch clause to a rule:

Custom Exception Handler

You will see the kind of things that can be done in recover() by examining the source code for the default implementation, or reading further in this article (recommended first).

Early Termination of Parsing Loops - Finer Grained Recovery

There are many situations where a parsing loop aborts prematurely using the standard implementations of recovery. An example helps to illustrate this and here we use a fairly typical rule that parses some kind of class definition (this is actually extracted from the JavaFX parser).

Grammar for Class Definition

This grammar is perfectly sound and will indeed parse a class definition correctly, assuming the subrules are correct. However it creates a recovery problem as it stands in that quite a large number of the possible syntax errors found within the rule classMember will cause the * loop to exit, issue an error about an expected RBRACE and return to the calling rule. Basically an error in the classMember will abandon the parse of the class definition.

Clearly, when we return from an attempted parse of a classMember, we need to ensure that we resync the input stream to either the start of a new classMember, or the closing RBRACE. Now, this example also serves to highlight the limitations of any recovery method - we are not going to parse the input stream to try and make sense of it, just try to discard things that cannot possibly make sense at this point in the parse. However, by doing this, we will at least keep the classMember* loop functioning, even if we issue more errors while we try to get back in sync. This should be our goal because when we issue errors, we wish them to be as local to the actual problem as possible. We can now issue errors that will say "While trying to parse class XYZ" rather than "While parsing a script", and within classMember we can apply similar techniques, so that we stay within the class member parsing loops as long as we can.

So, how can we implement such a resync? Well, we could of course write a custom method for each parse point, and call it in an action directly after the classMember rule. However, as well as being tedious, it is prone to error as the grammar develops because you must remember to manually update the set of tokens that are valid recovery points for any case. So, before attempting this, let us pause and examine what ANTLR can provide us automatically.

First and Follow sets

I have tried to keep this article as free from jargon as possible, but here we must use a little of it. As ANTLR parses your grammar it constructs an NFA for each block of alternatives in a rule or subrule. When this process is complete, ANTLR can then compute the First and Follow sets for rules.

The First set, as its name implies, is the computation of all the tokens that indicate a rule should be selected (or similar decisions), for instance: X ruleY? Z - the First set for ruleY is the set of tokens that show ruleY should be invoked. The First set is used by ANTLR for code generation and analysis and is not available to us when the parser is running. The Follow set is used for standard error recovery and as its name also implies, is the set of tokens that should follow on when a rule has finished parsing. At runtime, before any rule is invoked from any other rule, the pre-calculated follow set is made available on a stack, and we can use the top set (or indeed any number of stacked sets) for whatever we like.

In our example rule, we clearly want to know what the First set is for the rule classMember, as when we apply our resync, this is the set of tokens that we should resync to (as well as the RBRACE that closes the class definition). However, as just noted, we do not have access to the First set at runtime, so how can we get ANTLR to do all the hard work for us?

Hijacking Follow Sets as First Sets

In order to have ANTLR compute the required recovery set for us, we first note that the set we want is not just the First set of the rule classMember, but also must include the closing RBRACE. This in fact is the Follow Set for the rule classMember in this production, and hence we can hijack that set and use it as our resync set. But, we want all this to happen automatically, and it would be better if we could write one resync rule that we could use (judiciously) anywhere in our grammar. In fact, all we need do is utilize a simple trick, which is perfectly 'legal' and will do all the calculations for us.

The key to this trick is to create a parser rule that consists only of a single empty alt (epsilon set in jargon), then use action code within the rule to resync the input stream to the Follow Set that it finds on the top of the stack. ANTLR will then invoke this empty rule but will first place the Follow set for that rule on the recovery stack. Because the rule itself is empty, it will add no additional tokens to the set and we can infer that - for our example - the Follow Set it computes will be the combination of the First set for the classMember rule and the RBRACE that ends the class definition. It now only remains to sync the input stream to this set within our empty rule. Because our rule is generic, we can use it anywhere in our grammar that it is required, and ANTLR will supply us with correct set of tokens for the location it is used in our rules.

Here is what the empty rule looks like:

Empty Grammar Rule for Custom Error Recovery

We can now use this new rule in our class rule as follows;

Grammar for Class Definition

The Recovery Code

We have our grammar in order now, so it remains only to implement the syncToSet() method in our target language. The followSet is implemented in all language targets, so if you are not using Java, then you can still do this, but must adapt the code to the semantics of your target language. Here is what the code looks like:

Java Implementation of Custom Resync

Using this code, we can sync to either the follow set currently on the stack, or to a set of our own construction (should we have anything more special to do). It should be obvious that our method can do anything it likes of course and specialized recovery for a particular rule is entirely possible.

Other Recovery Mechanisms Within ANTLR Runtimes

There is one other aspect of recovery which you may need to customize, and that is what happens when a mismatch() occurs. You will see in the generated code that there are lots of calls to the match() method. Inspecting the default implementation (in the Java runtime) we find that the match method will call the method recoverFromMismatchedToken() and this in turn will try to use the current Follow set stack to determine if the reason we mismatched is that there was a spurious token in the input: X Y Z when we wanted just X Z, or a missing token: X Z when we wanted X Y Z. If ANTLR can determine, using the Follow sets, that by skipping a token, it would see valid syntax, then it will consume the spurious token, report the extra token, but will not raise a RecognitionException. Similarly, if ANTLR can see that there is exactly one token missing from the input stream, which if present, would make the syntax valid, then it will manufacture this missing token, report the error, but again will not raise the RecognitionException.

If you want behavior that is different to this, then you can override the match() method, or more likely, the recoverFromMismatchedToken() method. Perhaps you do not want the spurious/missing error detection? Or, as you will see from the default implementation, ANTLR will first see if it can fix things by ignoring a token, then go on to see if it can fix things by adding a token. However, there are some syntax errors that can be recovered using either method - perhaps you want to reverse the order that these strategies are tried?

Conclusion

The built-in recovery mechanisms of ANTLR are suitable for almost all general parsing problems; however we can clearly see that there are cases where you must take things in to your own hands. Don't be afraid to do this - generic mechanisms, by their very nature cannot cover all your specialized cases and all the language targets allow you to override the default behaviors.