Left-Recursion Removal - Print Edition

Skip to end of metadata
Go to start of metadata

Introduction 

Writing grammars, one will encounter left-recursive rules. As ANTLR is a recursive-descent parser, it cannot cope with them and thus they have to be removed. But what is a left-recursive rule exactly? It is a recursive rule which calls itself on the left-edge of a global alternative. An example would be:

ANTLRworks can resolve this problem and turns rule a into

But ANTLRworks can do this automatically only for the above kind of left-recursive rules, which exhibit the left recursion within the implementation of the recursive rule itself. I would call these single left-recursive (SLR) rules. The other kind are mutually left-recursive (MLR) rules where the left-recursion occurs over several rule invocations. The simplest example would be the following:

Rule a references rule b, while rule b references rules a. A more complex example would show such a 're-reference' taking place after many subsequent rule invocations. As stated above, the resolvement has to be done manually here.

As the strategy to solve this problem isn't that obvious, I decided to write a tutorial about two real grammar cases I have encountered while working on a C# grammar. It should be noted that this tutorial describes what I wished to have happened, as I visited many more dead-ends than noted. Furthermore, the strategy to resolve MLR rules is based on the description by Gavin Lambert.

First Example

Preparations

It is important to work with a copy of the rules, because changing the grammar into the normal form can become quite messy. Create a new grammar file and copy only the rules into the file which are reported in the error message. After that, create rules for missing rule references with ANTLRworks, so that the tool doesn't complain about them.

These rules may neither be empty nor be duplicates - I find the use of keywords here simple and efficient. It is advisable to move those rules to the end of the file so they don't get mixed up with the recursive ones. As every grammar with mutually left-recursive rules is kind of unique in its own sense, the following procedure is to be viewed as a heuristic instead of an algorithm.

After the execution of the first step I had the following C# grammar snippet:

After this all mutually left-recursive rules are checked to see if they are referenced in any other rules besides themselves in the original grammar. The five rules being referenced elsewhere are type, array_type, non_array_type, pointer_type and unmanaged_type. To simplify my future work I inline all unreferenced rules first and save the result:

Now I don't have to repeat as many steps. This would be a good point to make a copy of this grammar file because when a mistake can't be easily undone one can start easily anew. Saving intermediate copies of the grammar once you complete each important step, would be a good practice, too. It is important to note that inlining leads always to only one single survivor so the process has to be repeated for every required rule.

It is also possible that rules become single left-recursive and have to be changed by ANTLRworks first (unless that rule is supposed to be the last rule - then postponing this to the last possible step is a good idea). Rules may become ambiguous, too, so doing grammar checks after single left-recursion removal (SLRR) is recommended. Analyze the cause and remove it before proceeding further.

When inlining is done, ANTLRworks automatically adds extra parentheses around the inlined statement. Those parentheses may be superfluous and should be removed as soon as possible, since SLRR may not be possible otherwise. A related problem is that after an SLRR that the trailing prevents the naive deletion of the parentheses around the main body.

This can be solved with "in-factoring": "(a|b) c" => "ac|bc". Further analysis of the syntax diagram can point out other simplifications like "(ac|bc) (c)*" => "(a|b) c+". Don't do them aggressively until only the last rule is left - further inlining may solve asymmetries where not all alternatives share the exact same parts.

Rule Inlining type

Every rule besides type (and the ones in the OutOfSight group) is inlined and all parentheses have been removed. The result is:

Now SLRR leads to:

At this point, a grammar check would reveal an ambiguity, which can be easily resolved by the removal of the '+':

Rule Inlining array_type

This time we remove the left-recursion of the rule type, right at the beginning:

Now we inline type and simplify further. I inline unmanaged_type as next and in-factor "INTERR* STAR" to each subrule, so that I can remove the left-recursion in pointer_type. Because of the assymmetric "VOID STAR" subrule I inline pointer_type instead doing some optimization, except turning the trailing into "(INTERR | STAR)*". Then we arrive at

Taking the near duplicates into account, "INTERR* STAR (INTERR | STAR)" is transformable into "INTERR* STAR? (INTERR | STAR)" and this is "(INTERR | STAR)*". We then obtain

The last INTERR* can be safely deleted. Then after some rule-inling, SLRR and simplification we arrive at

Now repeat SLRR and simplify the result to

"((STAR | INTERR)* rank_specifier)" is similar to "(+STAR +| INTERR | rank_specifier+)+", if there is at least one rank_specifier. This can be checked with:

Rule Inlining unmanaged_type

We start here with the two saved unmanaged_type and array_type rules. We remove the left-recursion from array_type and simplify it to

Then we only need to inline it and simplify unmanaged_type to

Now SLRR and simplification steps have to be repeated:

Rule Inlining non_array_type

We repeat the known procedure, where non_array_type won't have its left-recursion removal until the last step. We then get

pointer_type is turned into

Inlining pointer_type leads with some simplification to

Following the next few steps as done when inlining array_type results in:

Rule Inlining pointer_type

Starting with the first intermediate result of non_array_type, we turn non_array_type into

After inlining and simplifying we arrive at

This can be turned into

Note that I moved the STAR to the end to allow additional transformations, but in this form there can be rank_specifier and INTERR between VOID and the first STAR. This is alleviated by

Second Example

The first example showed only a limited pool of possibilities, so I decided to use the second mutual left-recursive rules loop in the C# grammar as another example.

Preparations

After the execution of the first step, as explained in the first example, I had the following C# grammar snippet. I realized at this point that it would be a good idea to use more of the original rules instead stand-ins because I wanted to be sure that there is no hidden interaction between the two recursion loops.

After this all mutually left-recursive rules are checked if they are referenced in any other rule besides themselves in the original grammar. The original list consists of invocation_expression, primary_expression, primary_no_array_creation_expression, post_increment_expression, post_decrement_expression, element_access, member_access, pointer_element_access and pointer_member_access.

Out of those, only element_access, pointer_element_access and pointer_member_access aren't referenced elsewhere, so they are inlined before I save a copy. Next to the absence of the aforementioned rules primary_no_array_creation_expression looks like this. Its single left-recursion points it to be the first surviving rule.

Rule Inlining primary_no_array_creation_expression

Inlining primary_expression leads to in-factoring at every other rule. The inlining of the other rules is straight-forward and gives us

After SLRR we get

Since this doesn't show any prospect of an ability to simplify, so we go on with the next rule.

Rule Inlining primary_expression

Firstly, we can do SLRR and simplification with primary_no_array_creation_expression:

The rest of the rule inlining is simple. It leads to

This point seems to be problematic. primary_expression is actually SLR. But due to the parentheses ANTLR considers them MLR. In-factoring is the only and somewhat ugly solution to allow SLRR:

The only simplification which preserves the equivalence without checking for validity would be the out-factoring of "(OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)*", but as I use a second pass for my compiler a further simplification can be done.

The first move is to pretend the line "array_creation_expression (OPEN_BRACKET (expression_list | expression) CLOSE_BRACKET)*" and out-factor the inner trailing. Thus the outer trailing can be turn into:

For now we can leave this rule like this and move on.

Rule Inlining member_access

We inline primary_expression and in-factor all rules before we save a new copy of the grammar:

We postpone the SLRR on primary_no_array_creation_expression and its inlining until the last moment as this rule is referenced by other to-be-inlined rules. We inline now invocation_expression, post_increment_expression and post_decrement_expression and remove SLRR on primary_no_array_creation_expression to get:

Now after inlining member_access, it has to be in-factored to allow SLRR. Due to the size of the trailing it was out-factored into a separate rule temporarily:

As one can easily see, most alternatives of member_access include member_access_trailing. Looking more closely, the only three alternatives, which don't include the member_access_trailing, use only the zero-or-more-closure of member_access_trailing.

The simplification is complicated by the fact that there are four other array_creation_expression-s. Taking all five alternatives into account, we can actually combine all array_creation_expression-s into one. Using our "we check it in the next pass"-trick we can simplify member_access to:

The second pass has to check for the alternatives predefined_type and qualified_alias_member, that they are directly followed by a DOT. array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that DOT IDENTIFIER type_argument_list? is the last matched sequence.

Rule Inlining invocation_expression

We start by inlining member_access, post_increment_expression and post_decrement_expression. Then we do SLRR on primary_no_array_creation_expression:

Then we inline primary_no_array_creation_expression itself. Again we extract the trailing into a temporary subrule and in-factor its reference, before we do another SLRR.

The situation is analogous to member_expression above.

The second pass has to check that array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that OPEN_PARENS argument_list? CLOSE_PARENS is the last matched sequence.

Rule Inlining post_increment_expression

We have to repeat the same steps again as for invocation_expression to get:

The second pass has to check that array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that OP_INC is the last matched sequence.

Rule Inlining post_decrement_expression

We have to repeat the same steps again as for invocation_expression to get:

The second pass has to check that array_creation_expression may not be followed by OPEN_BRACKET. It has to be also checked that OP_DEC is the last matched sequence.

Rule Normalization

The next step would be ordinarily to look for extractable subrules but one additional step is helpful here. The rules often have a few alternatives whose rule structure is a special case of the rule structure of the other alternatives. The best is to make those special cases equal in rule structure to the alternatives to achieve a better subrule extraction.

Of course, the change of the grammar results in the recognition of a different language. To prevent this one can either use actions like shown with pointer_type to disallow invalid structures. Or one can use 2-pass compiler and check in the next pass specifically for correctness. This approach permits to turn the rules in the first pass even more uniformous, so maybe only one superset rule remains. An additional advantage are also the already mentioned improved error messages.

Labels: