Dashboard > USF Computer Science 652 - Programming Languages > CS652 Home > LL Parsing and recursive descent design pattern
  USF Computer Science 652 - Programming Languages Log In | Sign Up   View a printable version of the current page.  
  LL Parsing and recursive descent design pattern
Added by Terence Parr, last edited by Terence Parr on Feb 03, 2008  (view change)
Labels: 
(None)

Introduction

It's easy to generate L(G) from G:

  1. create the syntax diagram from G
  2. starting at the start symbol, explore every possible path while collecting a list of terminals
  3. continue until you "return" from the start symbol

This is not very practical because most grammars generate languages with an infinite number of sentences.

It is also not very interesting. What we really care about is determining whether a given sentence is in the language described by a grammar. Unfortunately, it is not obvious how to do this testing. Indeed, there are hundreds if not thousands of parsing strategies. The key difference between language generation and recognition is that when generating all of the sentences of a grammar you don't have to decide which branch to take. You just pick a path you have not visited yet. To recognize, you have exactly 1 sentence and you need to decide which path to take.

Recall the maze analogy where you have a passphrase that tells you precisely how to go from the and entrance to the exit. At each decision point you must decide which path to take. Most of the time this will be obvious by looking at the next symbol in the sentence, but sometimes the next symbol or couple of symbols are written on the floor of multiple paths.

Successfully navigating a maze is an apt analogy for language recognition, or parsing. From a grammar, we'll need to build some kind of recognition machine. The most obvious place to start is with the associated syntax diagram just like language generation. With the syntax diagram, it is easier to see the decision points.

LL Parsing

Let's start with something simple. Consider following partial grammar for a programming language:

file:	vardef
    |	classdef
    ;
vardef
    :   'int' ID ';'
    ;
    
classdef
    :   'class' ID '{' '}'
    ;

Now, how are we going to build a machine to recognize valid sentences? Let's start with the syntax diagram, which looks like:

file:

vardef:

classdef:

The most obvious thing to do is simply to try out the various alternatives to see if you can match the vocabulary symbols in the input sentence with elements along the various patterns in the syntax diagram. If you find a mismatch, rewind the input and try the next available path.

For example, try out input class T {} starting at the start symbol file. You are immediately presented with a "fork in the maze", a decision point. At this point, we are not applying any intelligence--we are just blindly choosing the first path and trying it out to see if we can reach the end of rule file without getting a mismatch. The first path tells you to call the syntax diagram submachine for rule vardef. The first symbol of the input class clearly does not match int. No problem, just fall back out of vardef and try the next alternative of file. In this case, you have chosen the correct path and the input will match classdef. Upon returning, you notice that there is no more input to consider and you are at the end of the start symbol. This is a success condition. Anything else is a failure condition. Technically speaking, the proper way to think about this is matching an imaginary token called EOF after the start symbol has finished:

s : file EOF ;

This backtracking approach is easy to understand because it is really just variation of a depth first search on the syntax diagram graph. The problem is that it is very inefficient. You will go down lots of paths that lead to dead ends. Some of these paths will the obviously incorrect just by looking at the current sentence symbol. The previous input started with class which is clearly not going to match a variable definition. Because there is a submachine invocation in between the decision point in file and the actual tokens in vardef and classdef, it is not obvious which token or tokens predict which path.

With some preparatory work, we create a prediction DFA that will tell us precisely which way to go in many cases without making unnecessary submachine invocations.

By creating a prediction DFA of this kind for each decision point, we can dramatically speed up the recognition process. This kind of recognizer is called LL(1). The classification means that our parser matches input from the left to right and does a leftmost derivation. That doesn't say much about the implementation, but the effect of doing the parsing in this way does give you a leftmost derivation. The "1" means use one symbol of lookahead. The lookahead is used within the DFA to pick edges.

So, this more efficient method has a two-level recognition strategy. The overall strategy is to pick a path at each decision point in the syntax diagram by invoking a small DFA whose accept states dictate which path to take. We say that the DFA predicts which path to take. It predicts rather than matches. We predict using lookahead symbols, in this case just one.

LL(1) doesn't always work because the same input symbol predicts two different alternatives. You could not produce a DFA because more than a single edge would be labeled with the same symbol (that would be an NFA though). For example, consider adding an optional public modifier to the grammar:

file:	modifier? vardef
    |	modifier? classdef
    ;
modifier : 'public' ;

file:
modifier:

Now, upon seeing public, the decision point in file will not know which path to take. Increasing the amount of lookahead to to symbols, however, neatly solve the problem. The DFA with 2 symbols of lookahead now has to traverse to edges before reaching an except state and hence use more information from the input sentence to determine which path to predict.

Because the modifier is optional, int and class are still part of the prediction DFA.

It is important to note that we have increased the strength of our prediction by adding lookahead. LL(1) failed and, without more lookahead, we would have to fall back on the brain-dead backtracking algorithm.

If we cannot produce a DFA for a decision point, we say that the decision is nondeterministic.

Grammars can get even more complicated however, leaving LL(k) parsers nondeterministic for any fixed k. For example, let's make the number of modifiers arbitrary:

file:	modifier* vardef
    |	modifier* classdef
    ;
modifier : 'public' | 'static' ;

file:
modifier:

How do we predict which alternative to match now? There is no fixed k that will see past the arbitrary number of modifiers. By the way, when the same thing begins multiple alternatives, we call it a common left prefix. The answer is to simply allow cycles in the DFA:

The set of input sequences used to predict an alternative path is also a language, a subset of the overall language generator by the grammar. This subset is called the lookahead language for a particular alternative. If the lookahead language is finite, then clearly it is regular and a DFA can match that lookahead language. Even when that languages infinite, you can still build a DFA sometimes as you just saw. Unfortunately this is not always the case, and sometimes you do need to backtrack in order to properly match a sentence against a grammar. ANTLR allows you to go beyond fixed k lookahead to use cyclic DFA such as these. I call it LL(*).

Building prediction DFA

How do you automatically build the lookahead DFA presented in the previous section? The full answer is extremely complicated, but in general the process involves converting the syntax diagram to an NFA and then doing a modified NFA to DFA conversion at each decision point. Here is the NFA created from the file syntax diagram:

This conversion relies on two critical and well-known finite state machine operations called closure and reach. The closure answers a question, "to what states can I reach by traversing epsilon edges?" The reach(\x) operation answers "to what states can I reach by traversing a single x edge?"

Recursive descent parsers

Syntax diagrams are great for humans to look at and are useful for building an interpreted version of a parser. The parsing algorithm is essentially a depth first search modified to match edges with input symbols. But what if we want to implement a parser in code? We could generate a state machine using a series of tables encoded in Java or C++, but those tables are not human readable. That means you also cannot debug your parser by stepping through with a standard debugger. When building a parser by hand, programmers build something called a recursive-descent parser_. The grammar becomes a set of potentially mutually recursive methods. The descent refers to the fact that we are building a top-down parser.

The rules for creating recursive descent parsers from grammars or syntax diagrams is straightforward:

  1. rule definitions become method definitions
  2. rule references become method invocations
  3. token references become match(token);
  4. decision points become if-then-else sequences; the expression for each alternative tests the current input against the lookahead language for that alternative:

Using these rules, let's translate the original file rule:

file:	vardef
    |	classdef
    ;
file() {
    if ( lookahead is 'int' ) { vardef(); }
    else if ( lookahead is 'class' ) { classdef(); }
    else System.out.println("error");
}

Rule vardef

vardef
    :   'int' ID ';'
    ;

would be translated as follows:

vardef() {
    match('int');
    match(ID);
    match(';');
}

where match() throws an exception if the current symbol of lookahead does not match the parameter.

ANTLR mimics what you would build by hand for the most part. It to generates recursive descent parsers with the added advantage that it automatically computes all of the lookahead prediction DFA for you.

Site powered by a free Open Source Project / Non-profit License (more) of Confluence - the Enterprise wiki.
Learn more or evaluate Confluence for your organisation.
Powered by Atlassian Confluence, the Enterprise Wiki. (Version: 2.5.1 Build:#806 May 06, 2007) - Bug/feature request - Contact Administrators