Home | Download | ANTLRWorks | Wiki | About ANTLR | Feedback | Support | Bugs | v2


Latest version is 3.0.1
Download now! »

Download
» Home
» Download
» ANTLRWorks
» News
»Using ANTLR
» Documentation
» FAQ
» Articles
» Grammars
» File Sharing
» Runtime API
» Tech Support
» Bug Tracking
»About ANTLR
» What is ANTLR
» Why use ANTLR
» Showcase
» Testimonials
» Getting Started
» Software License
» ANTLR WebLogs
» ANTLR Workshops
»StringTemplate
»ANTLR v2
»Feedback
»Credits
»Contact


Support StringTemplate, ANTLR Project by making a donation! Terence often pays for things like the antlr.org server, conference travel, and this site design (that alone cost US$1000). Buy him a beer and pizza remotely ;)

Search



ANTLR 3.0 Lookahead Analysis

RSS Feed

December 06, 2006

[I started putting this LL(*) description in the upcoming The Complete ANTLR reference guide book, but decided it was beyond the scope of the reference guide so here it is in the lookahead blog. Not sure that it is useful to industry or academia because it is neither practical, nor complete, nor rigorous. ;) Anyway, here you go.]

Notes on the LL(*) analysis algorithm

When designing the LL(*) algorithm, I wanted something that modified a well understood algorithm so that it would be more accessible to researchers and developers. Indeed, the algorithm in its general strategy is straightforward, but the details of getting this algorithm to be efficient and to catch analysis errors properly are fiendishly difficult; I've spent nearly 4 years getting it practical and correct. For example, my original algorithm had a lot of landmines in the sense that the algorithm would hit certain grammar rules and never terminate. I had to add a failsafe that terminated the NFA to DFA conversion for a decision if it exceeded a certain threshold in time. Clearly, this makes the academic in me queasy. After looking at a grammar with a lot of these landmines carefully, I arrived at an important conclusion. It turns out that the set of decisions for which ANTLR timed out was the same as the set of non-LL(*) decisions for which ANTLR would never be able to succeed anyway due to recursion occurring in more than one alternative in a single decision. Morever, the algorithm was able to detect these situation quickly and report the non-LL(*) decision rather than attempting the NFA to DFA conversion and then timing out.

I should mention that, after constructing the algorithm, I realized that I had invented the analog of Manual Bermudez's interesting work with arbitrary lookahead for LR-based parsing, LAR(m) where m corresponds roughly to my recursion stack threshold that guarantees termination. The difference is that Bermudez's m corresponds to the simple LR stack size whereas my threshold only counts recursion. Both restrict the stack to a finite size, but ANTLR's threshold allows it to roam farther ahead in order to resolve nondeterminisms. The stack threshold is not the amount of lookahead although it is related.

Here's the basic idea of how LL(*) works. First, ANTLR builds an NFA similar to a recursive transition network (RTN) from the grammar and then performs a modified classical NFA-to-DFA conversion using the ``subset construction algorithm''. The NFA are created are as follows.

Token:

Rule:

Decision:

(in ANTLRWorks, select ``Show NFA'' checkbox when viewing the syntax diagram for the rule to see the NFA ANTLR constructs). Recall that, in the classical algorithm, each DFA state encodes the set of NFA states that the NFA could be in after having seen a particular input sequence. For example, if input ``abb'' means that the NFA could be in one of two states then there will be a DFA state, reachable upon ``abb'' from the start state, that contains those two NFA states.

ANTLR's algorithm differs from the classical algorithm in that DFA states must encode more than just the set of NFA state numbers. If you applied the classical NFA to DFA conversion to a decision, the resulting DFA would simply answer yes or no as to whether an input matched the entire decision rather than yielding a predicted alternative number. The classical algorithm continues building DFA states while there are reachable NFA states. Our goal, on the other hand, is purely to distinguish between alternative productions rather than actually parse with the DFA. As such, the algorithm needs to distinguish between NFA states derived from different productions and DFA states are split according to the predicted alternative just like the LR(1) algorithm splits otherwise identical LR(0) states according to lookahead sets.

An LL(*) NFA configuration tracks the NFA state number, predicted alternative, and rule invocation stack needed to get to that NFA state: (s|alt|context). As the closure operation enters a rule, it pushes the NFA state that invoked that rule. As the closure operation reaches the accepts state for a rule, it pops the rule invocation NFA state from the top of the NFA configuration's stack and continues just like it had done a method call.

The algorithm stops working on a DFA state and considers it an accept state when that DFA state uniquely predicts an alternative or when it finds a nondeterministic DFA state. While there is at least one NFA configuration that predicts another alternative, the algorithm will pursue edges out of that state; this means it uses more and more lookahead in an attempt to distinguish between the alternatives.

A DFA state uniquely predicts an alternative if all NFA configurations have the same predicted alternative. A DFA state is nondeterministic if there are two NFA configurations (s|i|ctx) and (s|j|ctx) for the same NFA state s, different alts i,j and the same context (rule invocation stack).

Finally, the algorithm verifies that the DFA is reduced and all alternatives have an associated prediction stop state. This guarantees that there are no dangling states, meaning that ANTLR cannot figure out which alternative to predict upon a lookahead sequence, and guarantees that every alternative within a rule is reachable.

The algorithm has exponential complexity just like the classical NFA to DFA conversion algorithm, but that pathological scenario does not seem to come up in practice. Also like the classical algorithm, the LL(*) algorithm terminates because there is a finite amount of work. There is a finite number of NFA states, which leads to a finite number of DFA states. Because the algorithm works on DFA states there is a finite amount of work and it terminates. While the stack context for each NFA configuration within a DFA state can grow arbitrarily large in principle (making the number of NFA configurations infinite), a simple threshold causes the algorithm to abort analysis for that node and report a message to the user.

To demonstrate why the algorithm must track the rule invocation stack, consider the following simple LL(2) grammar that matchs {A, B, AA, AB}.

s : a A
  | a B
  ;
a : A
  |
  ;

The lookahead for the first alternative of rule s is {AA,A EOF} and {AB,B EOF} for the second alternative. When the lookahead computation encounters the invocation of rule a it asks for the lookahead from a within the context of its invocation so that B does not appear in the lookahead for alternative one at all. In other words, the lookahead for rule a must be {A, epsilon} not {A, FOLLOW(a)}={A,B}. Here is the NFA for s:

and here is the NFA for a:

The DFA describing the decision for rule s is:

Using the -Xdfaverbose option with -dfa shows the internal NFA configurations in each DFA state:

Examine state s1. Notice that all of the NFA configurations associated with alternative one ("alt1:...") have a context stack with state 4 on top and the NFA configurations associated with alternative two have state 8 on their context stack (``17|[4 $]'' means NFA state 17 with a context of NFA rule invocation state 4). State 4 is the rule a invocation state in the first alternative of rule s and state 4 is the invocation state in the second alternative. The algorithm uses the context stack to properly compute lookahead, but the resulting DFA naturally does not have a stack--it is not a pushdown automata.

When designing the LL(*) algorithm, I wanted something that modified a well understood algorithm so that it would be more accessible to researchers and developers. Indeed, the algorithm in its general strategy is straightforward, but the details of getting this algorithm to be efficient and to catch analysis errors properly are fiendishly difficult; I've spent nearly 4 years getting it practical and correct. For example, my original algorithm had a lot of landmines in the sense that the algorithm would hit certain grammar rules and never terminate. I had to add a fail safe that terminate the NFA to DFA conversion for a decision if it exceeded a certain threshold in time. Clearly, this makes the academic in me queasy. After looking at a grammar with a lot of these landmines carefully, I arrived at an important conclusion. It turns out that the set of decisions for which ANTLR timed out was the same as the set of non-LL(*) decisions for which ANTLR would never be able to succeed anyway due to recursion occurring in more than one alternative in a single decision. Morever, the algorithm was able to detect these situation quickly and report the non-LL(*) decision rather than attempting the NFA to DFA conversion and then timing out.

October 19, 2006

I had started a paper, but I'm not going to finish for a while. Here is some intro stuff in latex format (sorry). IT IS REALLY JUST MY NOTES and should be considered a draft and possibly inaccurate! See also slides from ANTLR2005 workshop.

<begin notes>

When building language applications, the stronger the parsing technology, the easier it is to build suitable and natural grammars for even small or uncomplicated languages. Moreover, some languages, such as C++ and Perl, seem to resist specification using many commonly-available and practical parser generators because of ambiguities and context-sensitive structures. The computing resources available on the desktop now permit a number of impressive parsing strategies that allow natural descriptions for even large and complicated languages.

Most recently Elkhound citeelkhound, a more efficient $GLR$ citeGLR implementation, and \em Rats! citerats, a parser generator for Parsing Expression Grammars (PEGs) citepeg, accept all context-free grammars (CFGs) and PEGs (ordered non-left-recursive CFGs (### not quite accurate), respectively. Roughly same power. There does not seem to be a strict strength ordering between CFGs and PEGs, however, because there exists a PEG for at least one non-context-free language: $a^nb^nc^n$ citepeg.

The extra power of $GLR$ and PEGs comes at the cost of time and space. Grimm's experiments citerats show SDF2 citesdf2 (traditional $GLR$) and Elkhound are roughly 7 times slower than a more traditional $LL(k)$ parser generator such as JavaCC citejavacc at parsing Java source code. \em Rats! is currently about 2 times slower than JavaCC. The extra machinery beyond the $LR$ and $LL$ foundations slows down parsing for even deterministic languages like Java.

$GLR$ is much more efficient than Earley's citeearley algorithm because it relies on traditional $LR$ parsing for all but the nondeterministic decisions. Similarly, \em Rats!, which always chooses paths by backtracking (squirreling away partial results like Earley), might be able to rely upon traditional $LL$ lookahead in some decisions to improve efficiency. One may conclude that increasing the power of the underlying $LR$ and $LL$ technology would then further increase efficiency of $GLR$ and PEG-based tools as long as the new technology itself did not incur a heavy cost.

The obvious place to look for increasing the power of $LL(k)$ and $LR(k)$ parsers is to increase the maximum lookahead depth, $k$. Increasing $k$ increases the number of grammars acceptable to $LL(k)$ and $LR(k)$ parser generators (though all $LR(k)$ grammars can be rewritten in $LR(1)$ form, they are usually huge and less readable). In the early 1990s, ANTLR citeantlr demonstrated that the $O(|T|^k)$ exponential worst-case for $k>1$ lookahead and token space $T$could usually be avoided by using \em linear approximate lookahead citeparr93, which compresses lookahead $k$-tuples into $k$ sets of tokens, $O(|T|times k)$, whenever possible and by modulating the lookahead depth as necessary from decision to decision. Experiments showed that the vast majority of decisions could use linear approximate lookahead, implying that it is often a token at a particular depth rather than a sequence that distinguishes alternatives. Gagnon reports that SableCC citesablecc now also implements linear approximate $LR(k)$ to good effect.

Because traditional $k$-token lookahead is finite, the lookahead language is regular and can be described with an acyclic DFA or trie. A cyclic DFA, in contrast, could describe the unbounded regular lookahead, significantly reducing the number of nondeterministic parser states (relative to finite lookahead). By simply allowing cycles and without introducing a new concept, recognition strength can be drastically improved. The extra heavyweight machinery used by $GLR$ at parse time to deal with finite-lookahead nondeterminisms can sometimes be condensed to a (stack-less) DFA matching a regular approximation of the underlying context-free lookahead. In effect, much of the parser simulation work can be done statically at parser generation time by computing lookahead DFA.

Bermudez's $LAR(m)$ citeBS90 computes such lookahead automata to resolve nondeterministic $LR(0)$ states (parameter $m$ is not the lookahead depth, but is a small constant needed by the algorithm as explained in section refalgorithm). $LAR(m)$ is similar to $GLR$ in that both engage a form of arbitrary lookahead when confronted with nondeterminisms. $LAR(m)$ engages a DFA checking the regular lookahead language and $GLR$ engages multiple, full $GLR$ parsers to explore all paths, in a sense, also examining the context-free lookahead language. A $GLR$ system based upon $LAR(m)$ would be exactly as powerful, but in principle would be much more efficient due to fewer $GLR$ states. Further, the exact same parse forest would result because all grammar ambiguities result in nondeterministic $LAR(m)$ states (just like $LR(k)$) for which $GLR$ would be engaged.

This paper introduces $LL(*)$, the dual of $LAR(m)$ and the core parsing engine of the new ANTLR v3. $LL(*)$ is a simple extension to traditional $LL(k)$ that neither constrains the maximum lookahead depth nor requires the programmer to specify a $k$ value \em a priori. $LL(*)$ represents a much larger class of grammars than $LL(k)$, even accepting some non-$LR(k)$ grammars as does $LAR(m)$. $LL(*)$ retains the simplicity and semantic flexibility of traditional $LL$ parsers while dramatically improving recognition strength. As with $LAR(m)$ and $GLR$, an $LL(*)$ base for PEGs would increase efficiency without loss of strength.

The $LL(*)$ lookahead language is a covering (### actually, LL(*) builds an exact DFA if the lookahead language is regular; if not regular, it's not LL(*). works by simulating finite stack with states which is possible unless recursive) regular approximation to the underlying context-free lookahead language. This loss of precision makes $LL(*)$ weaker than PEGs, but scanning ahead quickly with a DFA is much faster than backtracking with a full parser. $LL(*)$ is $O(n^2)$ because ..., but the max loookahead depth for a given parse is on average very small. Java is max 6 for my tests. No one claims $LL(6)$ is non-linear. Just because it's arb doesn't mean it's large.

To illustrate the fundamental difference between $LL(*)$ and $LL(k)$, consider the following ANTLR v3 grammar fragment that matches Java type definitions:

typeDef
   :   modifier* classDef
   |   modifier* interfaceDef
   ;

noindent Rule \tt typeDef is not $LL(k)$ for any finite $k$ because the common left-prefix, \tt modifier*, can be arbitrarily long. By inserting user actions at the left edges of the productions, rule \tt typeDef is also easily shown to be non-$LR(k)$ for any fixed $k$. \tt typeDef is, however, $LL(*)$ because the lookahead languages for both alternative productions are regular and disjoint. The following cyclic DFA uniquely predicts alternatives by scanning ahead until it sees either keyword ``\tt class'' (implying alternative 1) or ``\tt interface'' (implying alternative 2).

begincenter resizebox2in!\includegraphics{figures/typeDef-decision} endcenter

### note: no backtracking with parser; no stack needed. Analyzed statically to discover look language is regular--don't need stack to encode language. Reg look languages are disjoint and hence the alts are deterministic given arb reg look.

noindent As with linear approximate lookahead, the distinguishing lookahead feature is usually a key symbol at a particular depth rather than a sequence; here the common modifier prefix actually hinders prediction. Compressing a lookahead parser to a lookahead DFA is analogous to compressing full $LL(k)$ lookahead to linear approximate $LL(k)$.

$LL(*)$ also proves an excellent enhanced parsing strategy upon which backtracking may be grafted to achieve the recognition strength of PEGs without the cost of continuous backtracking and memoization. Moreover, dealing with arbitrary user actions written in the target language is much easier because backtracking is engaged so infrequently. This ability in turn allows semantic predicates to be functions of user actions so that, for example, symbol table information can easily direct the parse. A future paper will discuss how both semantic and syntactic predicates citePQ94 may be trivially folded into the $LL(*)$ analysis algorithm described herein to generate parsers that dynamically engage predicates and backtracking when necessary. As a pleasant side-effect, these augmented decisions are optimal in that the extra machinery is only engaged for those specific, individual, $LL(*)$ DFA paths that do not uniquely predict an alternative production. ANTLR v3 parsers combines the power of PEGs with remarkable efficiency and ANTLR's traditional semantic flexibility.

<end notes>

July 21, 2006

Been racking my brain trying to find a more graceful way of "failing" to analyze a decision with LL(*). For the Rats! Java grammar (converted to v3 format), ANTLR took 18s on my fast box and spewed all sorts of errors. Decisions that were too complex had to time out (in one second) before I figured out there was a problem. You would see stuff like:

java.g:467:23: ANTLR could not analyze this decision in rule
concreteDimensions; often this is because of recursive rule references
visible from the left edge of alternatives.  ANTLR will re-analyze the
decision with a fixed lookahead of k=1.  Consider using "options
{k=1;}" for that decision and possibly adding a syntactic predicate.

It would retry at k=1 and dump lots of errors at you. Ugh. LL(*) got a few decisions (3) that were not LL(k), but many others timed out. 18s was way too long also. So, then I implemented the auto-backtracking feature that made ANTLR hide all the errors and just backtrack at runtime instead of saying it couldn't figure out what to do at static grammar analysis time. Still had to time out before ANTLR could know to backtrack. :(

This also made the LL(*) algorithm look bad--it failed to terminate for many decisions in a big grammar.

I spent time looking carefully at the Rats! Java grammar. Looks like this: with just pure LL(*) and no backtracking on the Java 1.4 grammar ripped directly from Grimm's packrat parser, I get 12 decisions with trouble. I detect 4 of them as truly ambiguous and I tell you so. 8 decisions fail to terminate and I have to abort after my simplistic failsafe triggers. Eventually I realized that all 8 have nonregular lookahead languages because of recursion and are, hence, non-LL(*) by definition (DFAs have no stack so LL(*) can't predict these alts).

SO! It turns out that the set of decisions for which ANTLR times out is the same set (so far) of non-LL(*) decisions. Morever, I'm able to detect this situation quickly and report rather than timing out. Errors in this situation (3.0b3) look like:

java.g:468:23: [fatal] rule concreteDimensions has non-LL(*)
decision due to recursive rule invocations in alts 1,2.  Resolve
by left-factoring or using syntactic predicates with fixed k
lookahead or use backtrack=true option.

ANTLR terminates in 8 seconds now, shaving off 10 seconds. 8s is still a bit long, but the lexer analysis and code generation is taking about half of that.

Woohoo! Whew! Now, I think LL(*) is ready for prime time! You can type in any ol' crap and ANTLR will do something sane and reasonably quickly.

May 11, 2006

Robert Grimm, creator of Rats! PEG/packrat parser generator, was here speaking to my graduate class yesterday and it reminded me that Bryan Ford's PEGs (parser expression grammars) have not predicates also. They are really only useful in the lexer and seemingly only for single elements (all examples so far have been "not semicolon" or something similar). In ANTLR, you say ~';' so I don't think we need them. However, if we did, I believe a simple transformation will yield them:

a : alpha !(beta) ;

which means "match alpha if not followed by beta". We can transform to:

a : ( alpha ((beta)=>{false}? | ) => alpha ;

by using a semantic predicate. Heh heh heh... To get an "and predicate" at the end is easy too:

a : alpha &(beta) ;

(which means "match alpha only if followed by beta") converts to

a : (alpha beta)=> alpha;

Easy and optimized for the common case. :)

December 8, 2005

So, we have a syntax / notation problem for predicates now.

We have the following functionality for which to create syntax:

  1. disambiguating semantic predicate: hoisted into prediction code only when syntax alone is insufficient to determine uniquely which production will win. Works nicely for things like languages with keywords as variables and TYPE vs ID syntax issues.
  2. gated semantic predicate: turns off a production; the associated edges in the prediction DFA are literally turned off so the production can never be predicted. Works nicely for, say, turning off GCC extensions in your C grammar.
  3. syntactic predicate: backtracking. Only used when LL(*) lookahead is insufficient. Useful for dealing with truly ambiguous input sequences like "T(*a)" in C++ is both a decl and an expr.

The order of productions specifies the order of ambiguity resolution. So if alt i and alt j are predicted, alt i is chosen if i<j.

Originally, in PCCTS (ANTLR v1), we used:

  • (...)? syntactic predicate
  • {...}? hoisting disambiguating semantic predicate

The symmetry of the ? op was/is appealing.

Then, in ANTLR v2 we had:

  • (...)? optional subrule
  • (...)=> syntactic predicate
  • {...}? nonhoisting disambiguating semantic predicate

Now, we've added gated predicates and we have currently:

  • (...)? optional subrule
  • (...)=> syntactic predicate
  • {...}? hoisting disambiguating semantic predicate
  • {...}?=> gated semantic predicate

The problem here is that we have an inconsistency. Note that the syn pred and the disambig sem pred are only used when syntax is insufficient whereas gated sem preds are always used. The => operator means implies and, therefore, sort of means gate. The problem is that syn preds are not executed unless LL(*) fails. I sort of like the idea that => means forced evaluation or something similar. In that case, {...}?=> or {...}=> seems right, but (...)=> seems wrong. Optimally, I'd use

  • (...)? syntactic predicate
  • {...}? hoisting disambiguating semantic predicate
  • {...}=> gated semantic predicate

but, what about optional subrules then? decl? looks very nice as optional match decl. Some people use [...] or {...} to mean option, but we use square brackets for rule args and curlies for actions. In PCCTS, I used <<...>> for actions, but that was a drag. We can't use parentheses for rule arguments as that would be ambiguous. I wonder if an action as args would work: rule{args}. Nope that is ambig also. I wonder if using real EBNF would work: <rule(args)> or plain rule if no args? Heh, that might work!!! Uh oh, what about labels then? r=<rule(args)> might look weird. Let's see what happens with parens for arguments/return values and angle brackets around a rule ref with args:

rule:  ... o=<optionsSpec(32)> ... <ruleAction(r)> block ';'
    ;

optionsSpec(int max) returns (Map opts=new HashMap())
    :   OPTIONS (<option(opts)>)+
    ;

option(Map opts)
    :   ... {$opts.put(...);}
    ;

I don't like the returns syntax with a paren, but it might work. I note that JavaCC syntax uses <STATIC> notation for keywords, which I think is the opposite of old EBNF. Actually they require parens on all method calls, which makes them stand out. That is a pain, but very clear. Interesting. The parens are overloaded, but as long as they are required, there is not ambiguity between args and subrule.

If we use <rule(args)> notation then we can use [...] for optional and then we can do this:

  • [...] optional subrule
  • (...)? syntactic predicate
  • a : (alt)? | ... ; syntactic predicate wrapping whole production
  • {...}? hoisting disambiguating semantic predicate
  • {...}=> gated semantic predicate
  • rule rule reference
  • <rule(args)> rule reference with args

I don't like the difference in syntax for rule ref with and w/o args. :( Also, I used ID<AST=VarAST> in v2 to indicate the heterogeneous tree node to create for a particular reference. Perhaps angle brackets means I'm going to add options to something, which has a different little syntax on the inside like <ID(AST=VarAST)> and <rule(args)>. Heh, that could work. Add:

  • ID token reference
  • <ID(args)> token reference with options

Heh, that could even be a nice syntax to flip the type or text or line or something:

<LCURLY(type=BLOCK, text="BLOCK", line=n)>

Interesting.

Ok, people hate me messing with the syntax that much. ;) Let's try again

  • (...)? optional subrule
  • (...)? => syntactic predicate
  • {...}? hoisting disambiguating semantic predicate
  • {...}? => gated semantic predicate

Not bad I suppose. I still don't like the inconsistency with gated sem pred vs syn pred. Hmm...how about the parsimonious:

  • (...)? optional subrule
  • (...) => syntactic predicate
  • {...} => hoisting disambiguating semantic predicate
  • {...}? gated semantic predicate

December 4, 2005

Ok, took a few hours to get it right, but I have syntactic predicate memoization going. It only took an hour to get it going yesterday morning but then I got stuck trying to make an efficient cache. Worst case the cache must be O(|R|n) where |R| is the number of rules. Not sure where Ford's cache takes exponential space because once you have commit to an input position, you can throw away all cached info for that position. That must be the difference; his grammars are probably stronger as they never fully commit until end of file whereas I take a hybrid LL(*)-backtracking approach. I suspect the constant in front of my linear O(n) parsing will be much smaller than a pure packrat parser.

Anyway, here's the idea. Every time you finish a rule when guessing, just record the final token matched for that rule (-1 indicates failure) but associate it with the start token. You get a map like this:

r_memo[startTokenIndex] = stopTokenIndex;

The next time you enter that rule at startTokenIndex and guessing>0, check the cache. If there is an entry, seek to the stopTokenIndex+1 and immediately return w/o descending into the rule. The invoking rule will be none-the-wiser.

Action execution. Note that when you enter a rule and you're not guessing, you have to actually do the rule and all associated actions. To guess r you execute it with gates around the actions and then after you have decided to do r, enter again and execute. With a cache you will never be able to execute a rule more than twice for any input position, hence time O(2n), for input size n.

Here is an example:

grammar W;

s : (a ';')+ ;

a
options {
  k=1;
}
  : (b)=> b {System.out.println("alt 1");}
  | (b '.')=> b '.' {System.out.println("alt 2");}
  | c       {System.out.println("alt 3");}
  ;

b
@init {System.out.println("enter b");}
   : '(' 'x' ')' ;

c
@init {System.out.println("enter c");}
   : '(' c ')' | 'x' ;

WS : (' '|'\n')+ {channel=99;}
   ;

For input

(x) ; ((x)) ;

You get:

enter b
enter b
alt 1
enter b
enter c
enter c
enter c
alt 3

Review this carefully...here is what is actually going on with the cache:

begin backtracking synpred1 @0
enter b
storing b_memo[0] = 4
storing synpred1_fragment_memo[0] = 4
end backtracking synpred1: SUCCEEDED
enter b
alt 1
begin backtracking synpred1 @6
enter b
storing b_memo[6] = -1
storing synpred1_fragment_memo[6] = -1
end backtracking synpred1: FAILED
begin backtracking synpred2 @6
seen b before; skipping ahead to @-1
rule b will never succeed
storing synpred2_fragment_memo[6] = -1
end backtracking synpred2: FAILED
enter c
enter c
enter c
alt 3

Note that alt 1 of the loop in s starts the guessing by entering (b)=> which starts to parse b. Upon success, it records the stop token for b+1. It then enters b to do it "with feeling", executing the actions: printing "alt 1". Input "(x) ;" is done. Any cache for this input can be removed (though I do not remove anything for now).

Next, rule s tries to match (b)=> again for the first alt of the loop. It fails due to input "((" and records that for rule b and in fact the predicate itself. It then tries alt 2's pred (b '.')=> which tells it to match b, but it says it's seen it before and it will never succeed. It returns immediately w/o entering b. Next, it defaults to trying rule c w/o a predicate. It eventually executes the action to print "alt 3" after recursing twice.

I specifically built rule s so that the same rule will be called, b, during guessing to show that it's just as important to record failure as success so you don't keep trying a rule when you will it will fail.

Here is the entire rule b to show you how the cache works:

Map b_memo;
...
// w.g:14:1: b : "(" "x" ")" ;
public void b() throws RecognitionException {
    int ruleStartIndex = input.index();
    if ( backtracking>0 ) {
        if ( b_memo==null ) b_memo = new HashMap();
        Integer skipIndexI = (Integer)b_memo.get(new Integer(input.index()));
        if ( skipIndexI!=null ) {
          int skipIndex = skipIndexI.intValue();
          System.out.println("seen b before; skipping ahead to @"+skipIndex);
          if ( skipIndex>=0 ) {
             input.seek(skipIndex);
          }
          else {
            failed=true;
            System.out.println("rule b will never succeed");
          }
          return;
        }
    }
    System.out.println("enter b");
    try {
        // w.g:16:6: ( "(" "x" ")" )
        // w.g:16:6: "(" "x" ")"
        {
        match(input,7,FOLLOW_7_in_b94); if (failed) return;
        match(input,8,FOLLOW_8_in_b96); if (failed) return;
        match(input,9,FOLLOW_9_in_b98); if (failed) return;
        }
    }
    catch (RecognitionException re) {
        reportError(re);
        recover(input,re);
    }
    finally {

        if ( backtracking>0 ) {
          int skipIndex = input.index();
          if ( failed ) skipIndex = -1;
          System.out.println("storing b_memo["+ruleStartIndex+"]= "+skipIndex);
          b_memo.put(new Integer(ruleStartIndex), new Integer(skipIndex));
        }
    }

}
// $ANTLR end b

So, this memoization is easy to do except for the cache memory performance. I'll have to come up with a nice, fast cache later. The support code is a fairly significant chunk of code to insert into every rule as you can see. Perhaps this can be turned off. Ok, I just added a parameter to outputFile template called memoize. Here is the result of running the input through w/o the rule cache:

enter b
enter b
alt 1
enter b
enter b      An extra entry into rule b w/o memoization
enter c
enter c
enter c
alt 3

We only save one rule invocation with memoization. Without the memoization, however, the code space is much smaller. With the caching code, W.class is 7314 and 5492 without; a 25% increase.

The new cmd-line option is:

  -nomemo        when backtracking don't generate memoization code

December 2, 2005

I'm thinking about the backtracking mechanism now. v2 uses exceptions which everyone is hating because they can be really slow particularly when you invoke them exponentially many times.

Much of the time, you don't need backtracking in v3. For example, consider the usual expression rule chain:

e : e2 ('+' e2)* ;
e2: e3 ('*' e2)* ;
e3: INT ;

with a rule that has productions with the same left-prefix in common:

stat: e ';'
    | e '.'
    | e '!'
    ;

This is LL(*) believe it or not. Here's the DFA that predicts which alt of rule stat will succeed:

So, my dream of just feeding stuff to antlr and having it work is coming alive! The problem is that we still need a way to handle ambiguous grammars because some things can be matched two ways but you can order the productions such that it chooses the right path. We need the ordered productions and backtracking implied with syntactic predicates (the term I coined when I built them into PCCTS).

So, when you have a construct that is too hard for LL(*) to parse or you know it's ambiguous (also too hard for LL(*)) then you would manually specify the backtracking with a syntactic predicate. You get the power but don't have to resort to a full backtracking parser (very slow). Here's the classic C++ ambiguity problem solution in v2 ANTLR:

// C++ decl vs expr ambiguity
stat : (decl)=> decl
     | expr SEMICOLON
     | ...
     ;

Despite the power of LL(*), I'm finding that I still get a bunch of warnings I don't want and newbies will find perplexing. My main goal is for newbies to be able to just type stuff in and, if it's at all reasonable, it will sort of parse. :)

As I consider the addition of syn preds to v3, I have come to an interesting generalization. Syntactic predicates can be converted to functions that return true/false if that grammar fragment matches the current input. Then the analysis engine can be fooled into thinking they are semantic predicates like {synpred42()}?. The resulting DFA will split an ambiguous state based upon semantic information to decide what to do...it's just more expensive than a normal predicate. Anyway, if you have predicates, they can easily be added into the analysis and are evaluated only on those input streams that are non-LL(*).

So then I asked myself, why not just always backtrack if I find a problem? This would imply that all productions had an implied syntactic predicate on the left edge. Then I started looking at Bryan Ford's extension of my grammars to PEGs and his memoization that guarantees linear time (but exponential space worst case). Greg Benson here at USF reminded me that this memoization is similar to the incremental parsing strategy Prashant Deva has and the partial solution I came up with (Prashant and I will write up the stuff together). Then I realized just how freaking easy it would be add the memoization and then I realized, after reading Bob Foster's blog that I could toss out all memoized data before position i after I have commit to a parse at that point. The data requirements would be minimal and we'd get linear time. I need to look at Ford's work more carefully to figure out how close to full CFLs he gets with PEGs.

Anyway, here's the key: I could guarantee that no matter how much you backtrack, you would parse something at most twice, once to check and once to exec actions!

So, the question is, should I always automatically backtrack now and just avoid giving errors except for left-recursion? We'd have an LL(*) option to allow you to manually specify the backtracking. Or, perhaps we should start out having it plain LL(*) with manual backtracking and have a -newbie switch to make ANTLR shut up and just do its job. Hmm...start with a -newbie switch I guess.

Another question, do I implement Ford's "not syntactic predicate"? Seems useful and easy to implement. You could say things like match this as long as it's not followed by that.

(part II after feeding my face)

So what should backtracking look like (no memoization)? I am pretty sure now that exceptions are unnecessary. An instance variable called failed can be used instead of return values (sort of like UNIX errno). As long as that value is used or saved immediately following a method call, no stack is needed and we can avoid having to use the method return value mechanism. Then, just alter the parsing routines to "return" failed instead of throwing an exception:

public class Parser {
  protected boolean failed = false;

  public void match(IntStream input, int ttype, BitSet follow)
    throws MismatchedTokenException
  {
    if ( input.LA(1)==ttype ) {
        input.consume();
        errorRecovery = false;
        failed = false;		// new bit
        return;
    }
    if ( guessing>0 ) {		// new bits
        failed = true;
        return;
    }
    MismatchedTokenException mte =
        new MismatchedTokenException(ttype, input);
    recoverFromMismatchedToken(input, mte, ttype, follow);
    return;
  }
...
}  

In the generated code, you'd see things like:

// a : A b | B ;
a() {
  if ( LA(1)==A ) {
    match(A); if (failed) return;
    b(); if (failed) return;
  }
  else if ( LA(1)==B ) {
    match(A); if (failed) return;
  }
  else {
    if ( guessing>0 ) {failed=true; return;}
    throw new NoViableAltException();
  }
}

Clearly the cost is two new bytecode instructions to push failed onto the stack and then a branch instruction...for every token match and rule invocation. But, who cares...still readable. As usually, every action needs if (guessing>0)... around it.

The same syntactic predicate may need to be invoked in multiple places within a DFA so we need to move this to a separate method.

// assume b and c have at least one input sequence in common
a : (x SEMI)=> b
  | c
  ;

Step 1. I would need to convert (b)=> into

a : {synpred23()}? b
  | c
  ;

Step 2. Generate code for the synpred23 function as if it were a rule with a boolean return value:

boolean synpred23() {
  try {
    guessing++;
    input.mark();
    x(); if (failed) return false;
    match(SEMI); if (failed) return false;
    return true;
  }
  finally {
    input.rewind();
    guessing--;
  }
}

The only difference is the return false instead of plain return and the guessing stuff; then the final return true. Actually, make a wrapper function around a normal rule:

boolean synpred23() {
  try {
    guessing++;
    input.mark();
    fragment23();
    guessing--;
    return !failed;
  }
  finally {
    input.rewind();
    guessing--;
  }
}
boolean fragment23() { // normal looking rule now
  x(); if (failed) return;
  match(SEMI); if (failed) return;
  return true;
}

Yes, that is cleaner and if the fragment is just a rulename like x, then I can directly call that rule. :) That is all there is too it. Magic!

Memoization. So talking to Prashant about his incremental parsing work has led to this simple version of Ford's memoization for backtracking (i'll have to look at this implementation). Revisit the above expression example. Imagine parsing an e at token position i--you will descent into e2 then e3. To avoid ever having to do that again at position i, you just remember that you've already done it and, most importantly, how many tokens you consumed by the end of the rule the first time. In general you need a map:

memo[input position, rule] -> token position after matching rule

I'd probably do this as an array of dictionaries that answered the stop token for any rule that has parsed at that position successfully. Once I consume() past position i, I can set it's memo to null. :)

Map[] memo = new Map[input.length()];

Note to self, should we record a rule that has been attempted but that did NOT succeed to avoid unnecessary backtracking later?

Anyway, so, you'd see a new rule preamble that checks to see if we've seen this before.

e() {
  if ( guessing>0 ) {
    int stopIndex = memo[input.index()].get("e");
    if ( stopIndex>=0 ) {
      // we've seen rule e at this position before; skip ahead then return
      input.skipTo(stopIndex);
      return;
    }
  }
  // parse normally
}

The important point here is that calling rule e at the same input position immediately returns success without descending all the way down to e3!

Ok, I've got the dev environment up...go!

Easy...faked out ANTLR by adding a few lines of code and converting syn pred to sem pred. This is a good sign that ANTLR's code is well structured when I'm able to make such changes in 15 minutes.

parser grammar T;

a : (b)=> b
  | c
  ;

b : '(' b ')' | 'x' ;

c : '(' 'x' ')' ;

results in the following DFA for rule a prediction.

Notice how that it only backtracks on what it knows is a problem, (x), but not on the other two possible prefixes! :)

Ok, 15 minutes later and it's building the code for the predicate fragments:

// w.g:3:5: synpred1_fragment : b ;
public void synpred1_fragment() throws RecognitionException {
    try {
        // w.g:3:5: ( b )
        // w.g:3:6: b 
        {
        following.push(FOLLOW_b_in_synpred1_fragment13);
        b();
        following.pop();
        }
    }
    catch (RecognitionException re) {
        reportError(re);
        recover(input,re);
    }
    finally {
    }
}

Gawd, that was easy. Now to add failed bits and the guessing stuff and pred wrappers. I suppose I'll optimize later to remove that extra try/catch.

Ok, 20 minutes more and it's generating nice wrappers and all the if-failed stuff.

    boolean synpred1() {
        backtracking++;
        int start = input.mark();
        synpred1_fragment();
        input.rewind(start);
        backtracking--;
        return !failed;
    }

It compiles. I guess now I actually need to check the execution of mark/rewind. Oh, seems like I should make my example have:

b : '(' 'x' ')' ;
c : '(' c ')' | 'x' ;

So rule b is the special case of a single parens. Let's see if it matches.

Wow. 15 minutes later and the example works:

a : (b)=> b {System.out.println("alt 1");}
  | c       {System.out.println("alt 2");}
  ;
...

Upon "(x)" I get "alt 1" and with "((x))", I get "alt 2". All total, implementation sufficient to get first example working, 1.5 hours. :) Boy do I love this v3 code base!

My tendonitis is flaring up in my hands/forearms. Better quit for the night while I'm ahead! I'll do memoization over the weekend. Wow, this is great. Backtracking with no exceptions and no change at all to the analysis. That's nice.

December 1, 2005

hooray...solved the gated predicate thing. Imagine:

A : {p}? "a" ;
B : {!p}? ("a"|"b")+ ;

Imagine that input "aa" and p true should result in two A tokens rather than 1 B token. Currently, it matches a single B because it matches the longest string here. The DFA predictor makes it clear how the two rules are predicted. =>2 implies alt 2 (2nd token) is predicted.

Now imagine a new kind of gated predicate (as opposed to disambiguating) that literally turns the production on or off:

A : {p}?  => "a" ;
B : {!p}? => ("a"|"b")+ ;

The syntax came from a Loring Craymer idea. In this case, I take the DFA and I force evaluation of p or !p on any path that can match that input. Then at runtime if p goes false, the related arcs will be unavailable. Here's the DFA.

Notice that it's the same DFA minus the additional predicates on the arcs.

This took me quite a lot of thinking to get right. I considered and rejected lots of different alternatives.

One of the hard things was what it means when disambiguating and gated predicates are available in the same predictor DFA! It turns out that only gated predicates are collected from the normal predicate expressions. So if you have

a : {p}? A
  | b
  ;
b : {q}?=> c
  ;
c : {r}?=> C
  | {s}?   D
  ;

Then the DFA looks like:

Note that the p pred is never used as syntax alone is enough to predict that alternative. :) That is the key difference: disambiguating predicates are only evaluated when syntax fails and gated predicates turn off a production completely (by gating out edges in the prediction DFA). This works for cyclic DFA as well :)

I discovered a fabulous way to do syntactic predicates today. :) Carry them along like sem preds, treating the result of their evaluation like a predicate. The predicate will be just a method call that returns true/false if the related grammar fragment matches.

November 29, 2005

Gated productions via forced predicates

So, I'm playing around with real grammars (python and ruby) to find nasty parsing problems (boy did I find them). In python, there are only lexical problems really. newline is context-sensitive and somethings you want to ignore them and sometimes you don't. Smells like we need a predicate, but problems arise. For example, here is the comment rule:

COMMENT
@init {
    int startPos = getCharPositionInLine();
}
    :   "#" (~"\n")* { channel=99; }
        ( {startPos==0}? "\n" )+
    ;

You want it to scarf all trailing comments as they are not statement terminators; we don't want COMMENT to fall out, which would let NEWLINE have them (annoying the parser).

The loop:

( {startPos==0}? "\n" )+

looks like it would not match the subrule according to the predicate, but....it is not even part of the prediction! Remember predicates exist to disambiguate syntactic problems. In this case, the lexers thinks that as long as newlines are present, it will scarf them. No semantics required.

This is the problem that Oliver ran into for XML. He needs a predicate that "gates" in/out a production or loop. I know how to insert that (analysis would ignore; code gen would simply insert always), but what the hell would the syntax be to distinguish?

The parser seems to work naturally with disambiguating predicates but the lexer seems to want gating predicates. Interesting. Loring Craymer had an interesting suggestion at the workshop: use syn pred notation around it. (...?)=>

Perhaps a variation:

( {startPos==0}=> "\n" )+

where the action implies when it's ok to proceed.

These would NOT hoist as you'd be evaluating them out of context (which is why the {..}? variety are always evaluated after the lookahead test). These gate predicates would simply get shoved into the lookahead decision for that specific production.

The problem is that sometimes you want to distinguish between two lexical rules like Oliver did:

TAG_OPEN : { !tagMode } => '<' { tagMode = true; } ;
TAG_CLOSE : { tagMode } => '>' { tagMode = false; } ;

Perhaps the special case of gate predicates on the left edge of lexical rules get hoisted one level into the special Tokens rule that chooses among lexer rules.

Actually should it be

a : {p}? => foo
  | bar
  ;

I wonder. Well, it is more consistent and is really then a shorthand for ({p}?)=>

ok, a problem.

A : {p}? => "ab"
  | "ac"
  | "x"
  ;

It's clear what you mean: throw no viable alt exception if !p and ab is the lookahead. However, look at the predictor. The lookahead DFA is:

o-a->o-b->predict 1
|    |
|    o-c->predict 2
|
o-x->predict 3

You want the gating predicate to be evaluated in such a way that the first alternative is not even a possibility if p is false. Doesn't that mean we'd need a DFA specific to this situation that would carve out the "ab" lookahead leaving "ac" and "x"? I can't think of where to stick the predicate in the DFA predictor since the shared common prefix of "a" merges in the DFA. I can't just stick the predicate before the evaluation of the DFA either.

Re-examine the canonical problem:

lexer grammar T;
A : {p}? 'a' ;
B : {!p}? ('a'|'b')+ ;

Upon input "aa" and p is true, ANTLR does the "wrong" thing--it matches a single B token. Two 'a' char is sufficient for syntax alone to distinguish between the alts and ANTLR doesn't even eval the preds.

To get "aa" and p to yield two A tokens, ANTLR must literally turn off multiple branches in the predicting DFA. Any edge that leads to a B predictor state must be gated off or a new DFA must be created w/o the B alternative. Consider:

lexer grammar T;
A : {p}? => 'a' ;
B : {!p}? => ('a'|'b')+ ;

Now, input "aa" would for sure match AA. The DFA would have all predicates for each edge evaluated to shut off invalid paths. I looked at the analysis can stay the same...only code gen is different!

September 30, 2005

More on the predicate issue with lexers. Consider the following ambiguous grammar:

parser grammar T;
a : A
  | (A|B)+
  ;

ANTLR says:

ANTLR Parser Generator   Early Access Version 3.0ea6 (?, 2005) 1989-2005
t.g:2:5: Decision can match input such as "A" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

If you add predicates, all is well and ANTLR generates the following prediction DFA for rule 'a':

If you convert to a lexer grammar:

lexer grammar T;
A : {p1}? 'a'
  | {p2}? ('a'|'b')+
  ;

it still works. You get an analogous DFA.

When the two alternatives are two separate tokens, however, everything changes.

lexer grammar T;
A : {p1}? 'a' ;
B : {p2}? ('a'|'b')+ ;

Now ANTLR cannot detect the ambiguity soon enough in order to gate out one token or the other depending on p1 or p2. Yes, it sees that 'a' can predict both A and B, but it knows that EOT ("end of token") appears afterwards and it tries to use syntax (that is, more lookahead) before resorting to semantic predicates. You get this DFA:

This DFA implies that 'a' followed by the end of token (in this case anything other than another 'a' or 'b') is ambiguous but it has to see the EOT first. I actually don't generate the right code here, but that is a code gen bug not an analysis bug. EOT is like EOF--it's not really an input symbol, but it is a useful analysis tool. Without it I couldn't properly handle keyword versus ID (if A is a keyword, B looks like ID in this case).

What is needed is a "gate" that literally makes rule A or B invisible. For example, "aa" is a B according to the syntax and what people expect; "intint" should be an ID not two int tokens, right? So I'm afraid we need a new concept. With such a "gate", you could turn off rule B, making "aa" give two A matches depending on the predicate p2 being false. This is definitely not how I have defined predicates.

September 28, 2005

The .* idiom says match anything, but actually means "match anything until you see what follows it". In the lexer this definition works great:

SL_COMMENT : "//" .* '\n' ;

Unfortunately, at the moment it is by default greedy like all other loops and we need:

SL_COMMENT : "//" (options {greedy=false;}:.)* '\n' ;

I am changing the def of this in the lexer to mean nongreedy by default. Oh and (.)* is the same as .*.

As for the parser, what does .* mean? Well, how about this filtering example:

parser grammar t;

block : '{' .* '}' ;

It seems to me that it should work for nonnested blocks. With .* nongreedy by default now, it generates a proper loop that exits when it sees '}' followed by EOF. The problem comes when the .* loop is deeply nested in your grammar and it could look at many many rules and tokens looking for the sequence until EOF. In the lexer, it stops at end of token and doesn't look into other rules (unless the rule is a fragment). This parser version then is very inefficient. I wonder if in the parser shouldn't assume nongreedy and k=1 by default. Then the above grammar would stop when it saw '}'. Yes, I think that will do it. You can set k yourself if you want.

Ok, just fixed the code so that the following exits the loop upon R:

parser grammar t;
s : A block B+ EOF ;
block : L .* R ;

Plus loop on wildcard works same way.

September 25, 2005

Oliver Zeigermann writes again to say that his grammar:

lexer grammar ick;
{
    boolean tagMode = false;
}

TAG_OPEN : { !tagMode }? '<' { tagMode = true; } ;
TAG_CLOSE : { tagMode }? '>' { tagMode = false; } ;

PCDATA :   ({ !tagMode }? ~'<')+ { System.out.println("Text"); } ;
TAG_TEXT : { tagMode }? (~'>')+ { System.out.println("Tag-Text"); } ;

is not working properly. For input <tag>, upon seeing the '<', the DFA predicts TAG_TEXT even though the predicate tagMode is false.

To make it easier to understand, I have derived a simpler grammar that illustrates the same problem...it's the keyword vs ID ambiguity:

X : {!p}? 'x' ;
ID : {p}? 'a'..'z'+ ;

ANTLR generates the following DFA:

Notice that the predicates are evaluated after the syntactic context has been evaluated. This is a requirement for the meaning of semantic predicates. The problem is that it doesn't generate the right code:

int alt2=2;
int LA2_0 = input.LA(1);
if ( LA2_0=='x' ) {
    int LA2_1 = input.LA(2);
    if ( (LA2_1>='a' && LA2_1<='z') ) {
        alt2=2;
    }
    else error
}
else if ( (LA2_0>='a' && LA2_0<='w')||(LA2_0>='y' && LA2_0<='z') ) {
    alt2=2;
}
else error

Notice that after input 'x', the only valid input is a..z and for predicting ID.

To understand what's up, check out the same w/o predicates:

X : 'x' ;
ID: 'a'..'z'+ ;

The DFA predictor for implicit Tokens rule is:

All is well, ANTLR generates:

int alt2=2;
int LA2_0 = input.LA(1);
if ( LA2_0=='x' ) {
    int LA2_1 = input.LA(2);
    if ( (LA2_1>='a' && LA2_1<='z') ) {
        alt2=2;
    }
    else {
        alt2=1;
    }
}
else if ( (LA2_0>='a' && LA2_0<='w')||(LA2_0>='y' && LA2_0<='z') ) {
    alt2=2;
}
else error

The EOT (end of token) transition predicts anything other than a..z as it should. EOT actually predicts both X and ID, but it is quiet (picking first alt like flex and friends: X).

The predicates p and !p for X and ID don't change the fact that ANTLR knows how to resolve everything between keyword vs ID except one condition: 'x' followed by a non-ID letter. For example, "xy" and "xx" even are ID tokens syntactically--no need for predicates! The isolated 'x' or "x[" or whatever is ambiguous: X or ID. So, here a predicate can help as the DFA shows (the code gen has a bug as I indicated, but still the DFA is right). To distinguish between X and ID for input 'x' followed by ~'a'..'z' we can use the predicates. The LL(*) algorithm works properly--it forks on the EOT node to distinguish.

So (once I fix the code gen bug) the lexer will operate as I intended, though I admit the behavior is not natural in this case. You mean to say that !p predicts X even when there is no syntactic ambiguity. You want the lexer to match the shortest sequence due to the predicate. ANTLR's rule of using predicates only for the ambiguous subsequences uses !p to predict X only when an isolated 'x' appears. It tries to match the longest possible match.

I'm not sure how to resolve this. The general behavior of the predicates is proper, but this case is a fairly strong argument that it is not always natural. There is a conflict between longest match and predicate evaluation. I don't see what general principle would force evaluation of the predicates at state 1 instead of state 3.

For example, consider the addition of a '.' char so that the ID loop doesn't see the end of the token, removing the ambiguity between X and ID.

X : {!p}? 'x' ;
ID : {p}? 'a'..'z'+ '.' ;

In this case, 'x' uniquely matches X as it does not have a '.' following it. The DFA has no predicates at all even though the grammar has them:

Where would I force evaluation of predicates? If I see input 'x', simply looking one more character tells me if I should stop and choose X or continue and match an ID.

September 10, 2005

Oliver Zeigermann writes to the list:

Folks,

I have a problem with code looking like this:

lexer grammar XMLLexer;
{
    boolean tagMode = false;
}

TAG_START_OPEN : { !tagMode }? '<' { tagMode = true; } ;
TAG_END_OPEN : { !tagMode }? "</" { tagMode = true; } ;

I respond:

Yes, this is intended behavior though not desirable! I spent hours trying (twice!) to come up with a convenient and efficient means to passing the "this" pointer around in the DFAs so that they would not be static. I can't remember the issues now. Originally, it was because I had separated the cyclic DFAs into bytecodes rather than java source code.

No matter what, if you have a ref to a parameter in a predicate that will end up in a cyclic DFA, it will not be visible! There is absolutely no way around this. You can generate an arbitrary DFA in java code without using ptrs to objects. Many people have try to show me some try-finally stuff that will do arbitrary gotos but I always show a backwards jump that isn't possible.

I'm very open to suggestions, but at the moment I'll be forced to define the semantics of predicates to not allow parameters (even though they will work for non-cyclic DFAs). This is highly undesirable! Heh, wait! What if I computed the value first and then passed to the DFA!!!!??? THe value cannot change during the DFA execution so the value would be the same! So I would compute the predicate(s) first and then pass a vector of results to the DFA. This means that for cyclic DFAs, all predicates are evaluated (mostly unnecessarily) but they are not allowed to have side-effects so this is ok.

Interesting! The following rule needs a cyclic DFA to see past the DOT*:

a[int x] : {x!=0}? DOT* ID | {x>=0}? DOT* SEMI ;

I would generate a DFA that looked like this:

boolean[] pred34 = {x!=0, x>=0};
int alt34 = DFA34.predict(input, pred34);
switch (alt34) {
    ...
}

Holy crap! That might work. :)

Actually, don't forget though that hoisting makes parameters unusable also.

a[int x] : {x!=0}? ID ;

If {x!=0}? gets hoisted into another decision it won't compile. Officially then only values visible to the class are valid, though in most cases even parameters would work if I can figure out the pre-evaluation thing.

Oliver brings up another problem that actually highlights an important issue.

It seems like we need a way to force semantic pred inclusion, right? This seems like a general issue for when you want to, for example, turn off certain parts of a syntactically deterministic grammar. ANTLR currently would optimizes the predicate out!

November 24, 2004

I'm working on the last hard problem before I can sprint to the finish on basic lexing/parsing for 3.0: displaying useful information to the programmer about what is wrong with their grammars.

One of my goals for 3.0 is to make it much easier to use and much easier for newbies to start building languages. First, I'm dramatically improving the power of the underlying parsers so that programmers can type in any old (non left-recursive) grammar and with high probability get a parser that works, even if it's slowish. In other words, I'm trying to avoid generating grammar warnings. Second, I want to provide really useful nondeterminism warnings so it's easy to figure out what's wrong with the grammar.

To achieve my goal for useful messages, we must take a giant step back and totally reconsider what programmers actually want rather than simply dumping out what the analysis engine has conveniently at hand. This blog entry is an exploration of possible strategies.

Terminology

First terminology. A grammar is ambiguous if the same input sequence can be matched in more than one way by a grammar; i.e., there are at least two derivations. For parsers based upon LL and LR technology (and any other expecting a unique, unambiguous parse), all ambiguous grammars are nondeterministic, meaning that at at least one decision point during the parse, the parser will not be able to decide which path to take. For example, the following grammar is ambiguous and, hence, nondeterministic:

a : A
  | A
  ;

A grammar can easily be nondeterministic with regards to a particular parsing strategy also. For example, the following grammar is unambiguous, but non-LL(1) and, hence, nondeterministic vis-a-vis LL(1):

a : A B
  | A C
  ;

What ANTLR does now

Currently, nondeterminism messages are based upon the complete set of input sequences that are common to (i.e., can begin) pairs of alternatives. For example, the following grammar is non-LL(2):

a : type ID EQUALS expr
  | type ID SEMI
  ;

type : "int" | "float" ;
...

ANTLR reports:

tt.g:7: warning:nondeterminism between alts 1 and 2 of block upon
tt.g:7:     k==1:"int","float"
tt.g:7:     k==2:ID

For multiple alts with common lookahead sequences, you still get pairs of messages. If you add two other non-LL(2) alternatives:

a : type ID EQUALS expr
  | type ID SEMI
  | type ID PERIOD
  | type ID COLON
  ;

you get "n choose 2" or n(n-1)/2 pairs and, consequently, 6 messages.

tt.g:7: warning:nondeterminism between alts 1 and 2 of block upon
tt.g:7:     k==1:"int","float"
tt.g:7:     k==2:ID
tt.g:7: warning:nondeterminism between alts 1 and 3 of block upon
tt.g:7:     k==1:"int","float"
tt.g:7:     k==2:ID
tt.g:7: warning:nondeterminism between alts 1 and 4 of block upon
tt.g:7:     k==1:"int","float"
tt.g:7:     k==2:ID
tt.g:7: warning:nondeterminism between alts 2 and 3 of block upon
tt.g:7:     k==1:"int","float"
tt.g:7:     k==2:ID
tt.g:7: warning:nondeterminism between alts 2 and 4 of block upon
tt.g:7:     k==1:"int","float"
tt.g:7:     k==2:ID
tt.g:7: warning:nondeterminism between alts 3 and 4 of block upon
tt.g:7:     k==1:"int","float"
tt.g:7:     k==2:ID

(See below to see what ANTLR 3 shows, saying the same thing but grouping by conflicting input not pairs of alternatives).

Naturally as the size of those sets and k get bigger, the error messages become overwhelming and, therefore, useless. Full LL(k) can yield exponentially large sets, though ANTLR 2 only does linear approximate lookahead. PCCTS used to generate some massive messages and ANTLR 3.0 will do LL(k) and LL(*) (LL(star) for google's sake).

Besides getting huge messages, I found that I didn't want all possible conflicting input sequences. What I really wanted was a single input sequence to investigate and I had to just pick one from the displayed set. In other words, there may be 30 token sequences that are messed up, but I really need to pick one and then trace the grammar looking for the two paths that cause trouble. Odds are that all conflicting sequences derive from 1 or 2 paths emanating from the nondeterministic alternatives. Fixing one sequence then will probably solve the entire nondeterminism or at least make it much easier to find the next one, if two alternatives are nondeterministic for multiple reasons.

Even given a single troublesome input sequence, discovering manually how the grammar can match that sequence in multiple ways is often challenging for large grammars such as Java and C. Finding issues in a C++ grammar by hand is extremely difficult. Wouldn't it be great if a tool could simply show you in your grammar how an input sequence could be matched starting in some rule?

In summary, the current system displays really big overwhelming messages containing all conflicting input sequences, tells you which alternatives are nondeterministic, and does not help you find problems in the grammar.

What we want for ANTLR 3.0

When confronted with a nondeterminism, what I really want is to see a sample input sequence (and the shortest sequence) that can be matched by following more than a single grammar path and then I want to see those paths visually in a highlighted in a syntax diagram (or possibly in the text-based grammar).

So I want to invert the situation from ANTLR 2.x to 3.0: group problems by conflicting input sequence not conflicting alternatives. Really, who cares which alternatives are nondeterministic? You care more about which input sequences can be matched in multiple ways. I suppose you care about which alternatives are nondeterministic only if you have to manually trace the grammar looking for the nondeterministic paths.

Consider a variant of the previously mentioned grammar with 4 alternatives that generated 6 warning messages because there are 6 pairs of 4 alts:

a : type ID 
  | type ID
  | type ID
  | type ID
  ;
type : "int" | "float" ;

(I had to make the simple alternatives totally ambiguous otherwise the LL(*) parsing algorithm was able to handle it deterministically. ;))

A better error message would clearly just say that, for example, input sequence '"int" ID' can be matched by all four alts:

Decision can match input '"int" ID' using multiple alternatives: 1,2,3,4
As a result, alternative(s) 2, 3, 4 were disabled for that input
leaving alternative(s) 2, 3, 4 totally unreachable

I generated all but the sample input sequence with the current 3.0 prototype; note the "unreachable" messages are nice too ;) All input sequences were the same so ANTLR saw them as all merging to a single lookahead language for all alts, hence, a single message.

Combining overlapping lookahead sequences for all alternatives can result in many fewer messages per nondeterministic decision, but some grammars actually can cause many more messages. For example, the following grammar would need to generate two messages:

a : (A|B)
  | (A|B)
  ;

One message for input A and one for B. Interestingly, if you add a common token to the end of each alt, the input sequence merges (the final common token causes the lookahead DFAs for both alts to "pinch" back together into a single node):

a : (A|B) X
  | (A|B) X
  ;

Here you would get only a single message.

All sounds great, right? Well, there are some problems:

  1. As I just mentioned, there are potentially many more input sequences than alternative pairs so we might see many many warnings for a single decision in some cases.
  2. What do we generate in text mode (from the command line)? Sometimes you want a quick indication there is an nondeterminism and sometimes you want a full English description of what is wrong with a decision.

I think I'll start by just accepting that many messages may appear in the worst case and just see how common it is. In text mode, I think I'll pick a single input sequence (as short as possible) and say that the decision is nondeterministic on it for a set of alts. The unreachable messages will only be displayed in "verbose" mode. In text mode, obviously there would be no path highlighting in the grammar.

Jean Bovet will be building a GUI for grammar development. It is possible that you could have the command-line tool pop up a small window displaying a syntax diagram highlighted to show the nondeterministic paths. Regardless, the GUI environment would let you click on a "warning" icon in your grammar so you could see the syntax diagram.

Oh, the GUI will also allow interactive stuff like "show me how input xyz can be matched from this point in the grammar". Very handy. Of course, it will also fully interpret your grammar for rapid prototyping. :)

Warnings And Semantic Predicates

If in verbose mode, ANTLR should let you know which predicate expressions were used to resolve syntactic nondeterminisms. Morever, if some predicates were found ("hoisted"), but there were too few to cover all possible nondeterministic input sequences the programmer should know that as well. For example, rule decl in the following grammar is nondeterministic syntactically, but predicates p1 and p2 can be hoisted from rules array and function to resolve it:

decl: array ...
    | function ...
    ;

array : {p1}? ID ;
function : {p2}? ID ;

The programmer should be notified in verbose mode about the predicates being hoisted into rule decl.

When predicates insufficiently cover a set of nondeterministic input sequences, the programmer needs to know. In the following grammar, predicate p1 is hoisted into the decision for rule a:

a : b | B;
b : {p1}? B | B ;

ANTLR should generate (which it does already) something like:

rule a: predicate for alt 1 insufficient: p1

The problem is that B could be matched by rule b w/o a covering predicate, hence, not all B paths are resolved. Note that rule b has two B alts with a single predicate and is resolved properly. The nth alt is covered with !(union-of-predicates-from-other-alts).

Lookahead DFA displays

I can also show the entire lookahead DFA predicting the various alternatives of a decision. Further, I can show the subset of the DFA that matches all input for a particular alternative. Even more interesting is to show the subset of the lookahead DFA that matches input for a set of nondeterministic alternatives. For cyclic DFAs, this would be the only way to illustrate the nondeterministic lookahead language as it is not a finite set of input sequences. Sometimes you would want to see the complete nondeterministic input language to ensure it's really ok; that is, there are not sequences that should really be handled differently.

DFAs show the combined predicates, which is really useful.

June 9, 2004

I have the full LL(*) lookahead algorithm implemented with full semantic predicate hoisting; got unit tests too. Code is only about 12k lines so far. I have an interpreted version of a lexer and parser running for testing.

I am wondering about the LL(*) name. Sriram Srinivasan suggested it and it is a great name, but I'm worried that search engines won't find it as the * may screw it up. Let's see if google picks up LL(*) as a search term after this entry. :) How about LL-star or LL(star) too?

April 23, 2004

Been doing lots of reading (mostly relearning the LR junk)...

Ok, so I was wrong in prior section "Wrong: 3. When to terminate conversion". The algorithm will not terminate (provided by Sriram Srinivasan) on the following non-LL grammar (heck, is it even LR? The language may be nondeterministic inherently like as the palidrome grammar):

a : A a X
  | A a Y
  ;

The full call stack context is used to decide how to guide NFA->DFA closure operations when "falling off" the end of a rule's NFA. That is, it is used to restrict the DFA construction to include only valid lookahead sequences. This provides context-sensitive FOLLOW rather than global FOLLOW information. Deciding whether we've seen a DFA state before, however, should be purely based up on the set of NFA states contained in that state--not the unbounded context as I thought.

This is like the difference between LR and LALR. LR takes into consideration the lookahead (essentially its context) associated with each item (a grammar position) when determining if the state has been been seen. So, any state with item

a : alpha . beta, x

is considered different than a state with the same item but different lookahead component as in

a : alpha . beta, y

LALR, on the other hand, would merge these states and combine their lookahead to be x union y. My algorithm without using the call stack context is analogous to building an LR(0) machine.

Note that my termination problem is solved if I restrict how much context I'm willing to look at to a small constant, m, which is the approach taken by Bermudez & Schimpf in 1990: Practical Arbitrary Lookahead LR Parsing. They have LAR(m) parsers, which are LR(0) machines with DFAs emanating from inconsistent states to look ahead arbitrarily far in an attempt to resolve conflicts. m is about 5 or 6; the user does not specify either k (which is arbitrarily big) nor m.

Note that LR(k) terminates because the lookahead context is bounded to at most k symbols. There is simply a finite number of possible lookahead sequences of depth k and, hence, there are a finite number of states in the LR machine. Each state is a function of the grammar items and lookahead, both of which are finite; hence, a unique set of states is finite.

Bermudez's algorithm terminates be he limits the size of the context stack he is willing to consider (rather than k). k can still vary arbitrarily which is what you want from a grammar developer point of view. I'm uncomfortable with this m constant, however. They say that LR isn't that much more powerful than LALR and I don't feel like having to explain why I put this constant in there just to make it terminate. ;) Plus, NFA->DFA conversion is something that people seem to grok. My goal is to reduce all this parser mumbo jumbo to a problem I can pose as a modified DFA algorithm. :)

Anyway, my new version of the algorithm terminates for sure. The only extra work is a quick linear check at the end to ensure the DFA is reduced (i.e., that every state has a path to the stop state). If reduced, the decision is deterministic (i.e., I can predict which alt of a rule will succeed).

More later...i believe I have discovered a really interesting relationship between LALR machines and my grammar looahead NFA after reading David Galles' soon-to-appear compiler textbook and chitchatting for a while. Essentially, if you create one of my NFAs and do a DFA from the start state, I should give you a regular approximation of the LALR language. If you jump to a rule other than the start symbol, you are effectively chopping off the prior context ("lookback"); this naturally would emasculate an LR-based system into an LL-based system ;) My DFA conversion at a rule's left edge, say, A is like computing the viable prefix machine for sentential forms beginning with A's alternatives. Well sort of. More properly, it is maybe a regular approximation to the "continuation language" (which is generally full LL not regular) for all sentential forms starting at A's alternatives.

Oh, and final thought: drum roll...I think I am close to finally figuring out what the hell LALL(k) is!

April 1, 2004

Been working on the NFA->DFA conversion algorithm. Having lots of conversations with Sriram Srinivasan. An excellent parsing "sparring" partner! Anyway, I think I finally fully understand the problem of using arbitrary lookahead automatically for ANTLR 3 (i.e., no k specified and automatic fast backtracking if needed). There are three issues that force alteration to the standard NFA->DFA conversion algorithm (Thompson's subset construction), two of which allow us to terminate the algorithm early (1 and 3).

Background

The overall process looks like this:

  1. Read in a grammar
  2. Convert to an AST
  3. Walk the AST and build an NFA of the whole grammar
  4. For each decision point in the grammar, attempt to create a DFA from the associated NFA position.

For example, rule

a : A | B ;

is converted to

from which I derive the DFA predictor:

where i=>j means state i predicts alternative j.

When a grammar rule can invoke another rule, r, the NFA has an epsilon transition to that rule and an implied epsilon transition from the tail of r back to the state following the state that invoked r. For example,

a : b A
  | b B
  | C
  ;

b : X
  ;

Results in the following NFA for rule a:

and for b:

The lookahead DFA created for rule a is

1. Which alternative is predicted

Normal NFA to DFA conversion begins by creating the DFA state from the closure of the NFA start state. That will indeed yield a DFA that covers (that is, "is a superset of") the underlying LL grammar, but it does not tell you how which alternative is predicted. To track that, you must augment the DFA state to have not just a set of NFA states, but also the predicted alternative.

From the a:A|B; rule above, the DFA start state is not just

{7,8}

then (start from the decision point not the rule start state of 0)--the DFA start state is the set of tuples:

{7|1,8|2}

meaning that state 7 predicts alt 1 and state 8 predicts alt 2. As you find transitions from an NFA state, copy the alt when making the new DFA target state during subset construction.

2. Epsilon transitions leading to another rule

Rule references in the grammar result in epsilon-transitions to the start state for the NFA associated with that rule. Because the LL grammar has an implied stack it uses to return to just after the rule invocation site, the NFA conversion algorithm must track NFA context in order to yield valid LL lookahead sequences. That is, we cannot just jump to another rule, r, and then blindly include all states following references to r.

For example, in the following grammar rule b is referenced in two places, but naturally it can only be called from a single spot at once. The lookahead for rule a must reflect that:

a : b A // lookahead is {A,B} not {A,B,C}
  | C   // lookahead is {C}
  ;
b : B   // {B}
  |     // must use FOLLOW, lookahead is {A,C}
  ;
c : b C // {B,C}
  ;

To handle this, we must expand the NFA configuration we track during DFA construction. Our tuple becomes

(state|predicted-alt|rule-invocation-stack)

where an empty stack is written $.

Here is the NFA for rule a:

and for rule b:

The DFA predicting alternatives in rule a:

and the DFA predicting alts in rule b:

DFA construction proceeds by creating DFA start state for rule a's alternative block. Start with

0:{(13|1|$),(14|2|$)}
and then add the epsilon-closure:
0:{(13|1|$),(14|2|$),
   (6|1|$),(4|1|7$),(20|1|7$),(21|1|7$),(15|1|7$),(17|1|7$),
   (10|2|$)}

Etc...

3. When to terminate conversion

Terminate: when it is deterministic or known to be nondeterministic. Algorithm proceeds until it finds either situation. It cannot get into a state where it is continuing to look for new DFA nodes to add, but is not either deterministic or nondeterministic.

Deterministic when all alts associated with NFA states in a single DFA state are the same. It uniquely predicts that alt. That is, for all (i|alt|ctx_i), alt is the same.

Nondeterministic when you have the same NFA state merged into DFA state predicting >1 alternative and the context is identical. That implies that given the same input sequence, there is more than one path through the NFA to a single state that (crucially) predicts more than one alternative. In normal NFA conversion you only care about collapsing similar paths, here I must track which alternative is predicted. So if both (s|i|ctx) and (s|j|ctx) is in a single DFA state, the DFA is nondeterministic.

Termination guarantee. If non-recursive, must terminate due to finite set of states, edges. If recursive and deterministic, must terminate as we will eventually find the proper DFA that predicts alts and then terminate. If recursive and nondeterministic, we must terminate as we will always detect the nondeterminism. We will always find the nondeterminism because in the worst case, there is a "pinch" point at the end of a subrule decision where all paths merge to a single "stop state". If we merge this state into a DFA state predicting two or more alternatives, it is a nondeterminism because the context is the same: the empty stack. There must be at least one input sequence that leads the NFA to reach this stop state starting at the left edge of two different alternatives such as:

(A|A)

The NFA looks exactly like the first NFA above in this section for A|B. Note that both alternatives reach the same convergence state at the end of the subrule: state 6. In this case, a DFA state will contain both (6|1|$) and (6|2|$), which is a nondeterminism.

The nontermination issue becomes clear when you look at self-recursive, nondeterministic stuff like (thanks to Sriram for this example):

a : P a P
  | P
  ;

and here is the NFA:

If you don't check for nondeterminism, the algorithm loops forever looking for more and more input that will render the DFA deterministic.

The proof sketch for termination is that to loop forever the decision must be nondeterministic, but we can always detect nondeterminism because in the worst case the nondeterminism will always be detected at the convergence "stop state" for a decision alternative block. That is, the algorithm may never proceed past the convergence point since the context will be identical by definition ($) and the state will be identical as it is a single NFA state. The predicted alt upon reaching the terminus will be different or the algorithm would have already stopped processing that path as deterministic. Since the alt is different and the state + context is the same, the DFA must be nondeterministic.

March 12, 2004

No explicit determinism checks

Had a relevation today! The modified NFA->DFA conversion also implicitly handles determinism checking! Right now, I compute lookahead for each alt of a block such as in:

a : alt1
  | alt2
  | alt3
  | alt4
  ;

Then, I compare alti with altj pairwise for "n choose 2" (or n(n-1)/2) combinations. For n=4, that's 6 combinations; anyway O(n^2) and hence expensive. The LL-regular algorithm looks at the lookahead machine (GLA I call 'em in my thesis) for all alternatives simulatenously trying to decide if there is a DFA that uniquely predicts which alt to predict. If there is such a DFA, the prediction is inherently deterministic, hence, avoiding the determinism check that normally follows lookahead computation.

The only issue is that the DFA is infinite not finite in general. Figuring out when there is a deterministic finite DFA may be hard. Of course, this is only needed if I want to do fixed k lookahead in the normal way.

Optimizing Lookahead Tests

Parsers

I'm beginning to think that not only should lookahead computation be done simultaneously for all alternatives, so should the lookahead prediction at parse time. In other words, prediction is now for k>1:

rule() {
  if ( LOOK(alt1) ) { match alt1; }
  else if ( LOOK(alt2) ) { match alt2; }
  ...
  else if ( LOOK(altn) ) { match altn; }
}

where LOOK(alt) is of the form:

LA(1) in set1 && LA(2) in set2 && .. && LA(n) in setn
I previously exploited the repeated tests to reduce prediction complexity by paring down the lookahead set sizes as you moved down the alternative list. Now I'm suggesting that we simulate a DFA that resultin in an alternative number, which can be used to jump immediately to the right alternative.
rule() {
  simulate DFA to predict alt
  rewind the input pointer
  switch (alt) { // jump to the right alt
    case 1 : alt1; break;
    case 2 : alt2; break;
    ...
    case n : alt; break;
  }
}

The switch should be much more efficient than a linear search when k>1. (For k=1 we generate a switch now). The state machine will left-factor common prefixes and consequently only test common prefixes once. For example,

a : {foo;} A B
  | {bar;} A C
  ;

would result in:

a() {
  // simulate DFA
  if (LA(1)!=A) error;
  switch (LA(2)) {
    case B : alt=1; break;
    case C : alt=2; break;
  }
  // don't rewind here: we used fixed k lookahead in DFA
  // predict alt
  switch (alt) {
    case 1 : foo; match(A); match(B); break;
    case 2 : bar; match(A); match(C); break;
  }
}

where I have put the actions in there so you don't worry about asking me to remove the redundant match() calls (yet).

Like all DFAs, we would consider one input symbol and then move state, requiring k=1 lookahead only. That implies each input symbol is touched at most once during the DFA simulation. In contrast, think about n alternatives for, say, k=3. In the worst case, you do k*n lookahead comparisons. The DFA would do exactly k. Yes, it's insensitive to the number of alternatives! (I'm doing the happy dance).

What about when the DFA is non-finite? That is, it may look really far ahead (to the end of a function declaration for example). You cannot use the fixed int constant inside the LA(i) test like I did in the finite prediction DFA above. You would need to record that you were at input location p and then reset the input pointer after prediction before attempting to match the predicted alternative. No biggie, but we need to buffer n in this scheme instead of a fixed k token window currently used.

This large lookahead highlights an inefficiency issue with the current tool. The following code has a redundant test for input symbol A:

switch (LA(1)) {
  case A :
    match(A);
    match(B);
    break;
  ...
}

Surely we can remove the match(A), leaving a simple consume() call in its place:

switch (LA(1)) {
  case A :
    consume();
    match(B);
    break;
  ...
}

The code is no smaller, but we have avoided a method call and a lookahead test. I'm not overly concerned with this for the parser, but it can add up when it appears in the lexer. The problem may become significant even in the parser when you have non-finite lookahead.

Lexers

To expose the issue, consider integers vs floats in the lexer. Currently, this is illegal due to finite k lookahead:

INT  : ('0'..'9')+ ;

FLOAT: ('0'..'9')+ '.' ('0'..'9')* ; // simplified

The int prefix can be larger than k for any fixed k. Currently, one needs to left-factor manually into:

INT  : ('0'..'9')+ // assume INT until you see the '.'
       ( '.' ('0'..'9')* {$setType(FLOAT);} )?
     ;

This is a pain and not nearly as readable. On the other hand, LL(k) versus regex power comes in handy such as when you want to handle nested lexical constructs like nested comments. How can we get the best of both? Use a DFA in nextToken() to predict which rule will match. For our case, we'd build a DFA that matched until a '.' was seen or not:

nextToken() {
  Simulate DFA: ('0'..'9'{predict INT})+ ('.'{predict FLOAT})?
  rewind input
  switch (predictor) {
    case INT : mINT(); break;
    case FLOAT : mFLOAT(); break;
  }
}

mINT() {
  match ('0'..'9')+
}

mFLOAT() {
  match ('0'..'9')+ '.' ('0'..'9')*
}

Notice that after we've matched an entire integer, we must rewind the input and have method mINT() match it all again. This is totally a waste of time (unless there are actions in the (...)+ loop). If nextToken() has matched an entire token, it should just return the appropriate token avoiding matching the rule completely.

What about when nextToken() has predicted half a rule like with the float? After seeing the '.', then it has predicted rule FLOAT. Wouldn't it be great if we could jump into the middle of FLOAT like the following, thus, avoiding the redundant input comparisons:

mFLOAT_partial() {
  match '.' ('0'..'9')*
}

Is this combined DFA / LL matching system feasible in general? Not sure. Certainly actions on the left edge of a lexer rule prevents such redundancy avoidance as the actions are implied to execute before any char for that rule are matched. What about when a rule has more than one alternative? You'd have to have multiple entry points, effectively breaking up a rule with n alts into n rules, each uniquely predicted (I guess multiple alts notation is just syntactic sugar anyway).

My other concern is how readable the lexer will be if I've broken apart all the rules. Perhaps the debugging option would turn off optimizations.

Given my initial experiments with LL-regular lookahead computations, I think that the algorithm has a good chance to know the entry point into a rule, allowing us to avoid redundant character checks. Every state in the DFA is a merged NFA state (specifically the states in the lookahead NFA created for FLOAT, for example). At the very least, I will know which of several places I might enter a rule...that may give me enough information to split/duplicate pieces of a rule in order to start at non left edges.

What about recursion in the lexer? This would most definitely prevent jumping into the middle of a rule as you need to build up the recursion stack to match for nested parentheses etc...

No need to specify k

Does the creation of DFAs to predict alternatives effectively eliminate the need to specify a max finite k for lookahead? I.e., can I just build a DFA and not worry about whether it is even finite?!

Lookahead computation for k is exponential in k. Interestingly, my fixed k lookhahead computation may be more expensive than an infinite DFA predictor because the finite DFA cannot use loops to signify token sequences. For example, the following rule is LL(4) (and of course LL-regular):

a : A (B)+ E F // lookahead(4) is {ABEF,ABBE,ABBB}
  | A B B D    // lookahead(4) is {ABBD}
  ;

The lookahead for alt 1 is encoded in an acyclic DFA as

o-A->o-B->o-E->o-->o-F->o
          |
          o-B->o-E->o
               |
               o-B->o

The lookahead algorithm has to walk the (B)+ loop multiple times up to k-1 in this case rather than just recording "heh, I found a loop". The fixed lookahead computation is in fact an NFA simulation that asks "what are the k-sequences that are reachable from this state?"

A cyclic DFA does not have to flatten loops and therefore can encode the lookahead for alt 1 literally as itself: A (B)+ E. This implies that the lookahead computation is not dependent on k, but rather the complexity and number of states in the NFA from which it is derived! Wow! (I'm doing the happy dance again)

Predicate Hoisting

While I'm grabbing the lookahead via NFA->DFA conversion, can I also pick up semantic predicates along the way and just encode them in the lookahead DFA? The following grammar has a predicate that, if hoisted into rule member from ctor, would disambiguate the two alternatives (that are syntactically identical).

member : ctor
       | methodDefinition
       ;
ctor   : {isEnclosingClassName(LT(1).getText())}? methodDefinition
       ;

The NFA for member looks like (using my NFA language from class project):

# define 
.member -> .alt1 -> @ctor
@member -> .alt2 -> @methodDefinition
.ctor -> @methodDefinition
.methodDefinition -> ...

The predictor for the alts of member is nondeterministic as it is simply

o-ID->o predict alt 1
|
--ID->o predict alt 2

We cannot merge the ID transition because the NFA stop states have different alt predictions. What if we carried the predicate with us in the DFA conversion so that it looked like:

o-{isEnclosingClassName(LT(1).getText())}?-ID->o predict alt 1
|
--ID->o predict alt 2

where the expression predicates the traversal of that edge in the DFA. We can trivially carry this information along at the cost of an addition "predicate pointer" instance variable in the DFAEdge object.

With the finite lookahead scheme in PCCTS, which did hoisting, it could not see over a token to a predicate beyond...Only a hoisting distance of 0 was allowed (i.e., no tokens between you and the predicate). We may get hoisting distances greater than 0 for free as the predicates will be added into the prediction DFA and will prevent merging of states as in the above example. The difference is that the unmerged states will cause a nondeterminism: the semantic predicate will guide the DFA between the identically-labeled edges. Actually you only avoid the merge if the states will predict different alternatives. Hmmm...more thinking needed here.

Approximate Lookahead

Approximate lookahead exists to avoid the exponential complexity of k>1 lookahead computation. Is it still necessary when building DFAs? As I say, the DFA construction is not sensitive to k any more; it's sensitive to the grammar size. Perhaps I have spelled the death of approximate lookahead!!! Damn, my dissertation down the drain? ;) <snicker>.

Wow. I'd be happy to dump the approximation as it does not lend itself to my new mode of thinking regarding NFA->DFA conversion. Perhaps the approximation can be seen as a DFA compression. First I'd have to remove cycles from the DFA and then do the compression. Might be worth it and faster than doing the cycle removal from the NFA. Traversing all of those epsilon transfers to compute lookhahead is very expensive particulary for rule invocation chains as in expression evaluation.

Holy shit...did I just find a way to compress DFAs in a huge way? Wow.

Tree lookahead

Ok, talking to Loring Craymer on the phone while I'm watching the freaks in San Francisco come into the cafe I'm sitting in. He asked about k>1 tree lookahead. Why not? It comes for free if you insert some "imaginary" tokens into the token stream. The issue is that

// non-LL(1); nondeterministic on A
a : A B // A then B
  | #(A B) // A over B
  ;

are very different structurally. You are no longer matching a one-dimensional structure; you are matching a 2D tree. The first k=1 lookahead is easy: it's just the first symbol you see regardless of whether or not it's a root.

To match the structure too, just pretend there are tokens for the down and right pointers. The lookahead for the above is changed to:

// LL(3); must see past the navigation token to the second token
a : A B    // lookahead is A RIGHT B
  | #(A B) // lookahead is A DOWN B
  ;

Pretty cool, eh?

August 24, 2003

Got LL-regular algorithm working. Using non-finite state machine to predict alternatives so that you can write rules like this (for C language):

function : declaration | definition ;
declaration : functionhead SEMICOLON ;
definition : functionhead LCURLY functionbody RCURLY ;

So because functionhead (which includes an infinite number of arguments), an LL(k) parser for fixed k cannot see past the head to the semicolon or left curly. An LL-regular parser races ahead until it finds the semi or curly and then decides what to do. It's like a syntactic predicate limited to regular prediction languages. Oh this is so groovy.

I have my unit tests going now for NFA->DFA conversion. It is smarter / faster than a normal conversion because I'm only trying to predict which alternative will match--not build a machine to match them. This is the auto-leftfactoring thing needed for parsers and lexers alike. I love it because it stops computing DFA states when it knows for sure where it must end up (i.e., which alt will be predicted). That means (for lexer grammars), I can use the DFA conversion to handle arbitrarily complicated lexer rules, but once it knows which rule will win, it jumps into the middle of that rule and continues lexing using a top-down style parse. :) Or, I could just roll backwards and jump to a fully-fleshed out lexer rule parser. Might be hard to generate half of a rule, but we'll see.

Anyway, the basic technology (poorly written for this DFA conversion) for the analysis engine is now available the Sunday before school starts. Wooohooo! I'll get back to this soon i hope. I'd like to keep playing around with this prototype to see what an intermediate form needs to look like before trying to build something clean to release.

As far as I know, the engine only needs to be made efficient (no caching at the moment) and to improve the lookahead so I can generate LALL(k) not SLL(k). There are some details around like "is this a lexer or parser grammar" etc... that need to be specified so I build the right NFA (e.g., parser start rules get an EOF loop on the end where lexer rules do not). So, I'm almost at the point where I need to define what a "for real" grammar looks like.

All coming together nicely :)

August 23, 2003

Got basic NFA->DFA conversion going.

More complicated than I thought...gotta track alternatives and watch out for ambiguities like in this test case:

public void ambiguous() throws Exception {
	Grammar g = new Grammar(
			"class P extends Parser;\n"+
			"a : ('a'|'a') 'b';");
	/*
	(0)--Ep-->(7)--Ep-->(2)--a-->(3)--Ep-->(6)--Ep-->(9)--b-->(10)--Ep-->(1,end)
		   |                            ^
		  (8)--Ep-->(4)--a-->(5)--------|
		ambiguous DFA construction.  Trying to add NFA
                state 6|1 to a DFA state which
	        targets same NFA state, but predicting alternative 0
	*/
	...
}

Ugh...seems to work, but won't know until I can build a test rig. I wonder if the right test rig is simply the DFA interpreter so I can just pass in a bunch of strings and if they all match properly and some bad ones fail then all is cool. Hmm..not at rigorous as dumping the DFA and comparing. You know I bet a simple print to string from DFA state set compared to an expected string might just do it. :) You know something like:

{2|0, 4|1} --'a'--> {3|0,5|1,6|0,6|1}
...

Yeah, that would work :)

August 17, 2003

building testing rigs and altering software to have testing hooks is a freaking pain in the ass. ;) Finally have the decision test rig. Example:

    public void k2AllImaginarySequencesConflictWithRealSequences() throws Exception {
		Grammar g = new Grammar(
                "class P extends Parser;\n"+
                "a : (A B|C D)" +
                "  | (A D|C B)" +
                "  ;");

		Map decisions = LLkAnalyzer.deterministic(g, 2);
		Decision a = (Decision)decisions.get("3");
		String expecting =
				"n=2"+nl+
				"look1[1]={A, C} {D, B}"+nl+
				"look1[2]={A, C} {D, B}"+nl+
				"1x2: k=2 LL(k)"+nl+
				"  look[1]= ( A B ) ( C D )"+nl+
				"  look[2]= ( A D ) ( C B )"+nl;
		assertTrue(a.toString().equals(expecting));
    }

Ugh. Was hoping to get onto a really cool concept today, but didn't get to it. :(

Think infinite regular (as in DFA) lookahead versus backtracking. The DFA would be much faster and could be added automatically. It would be like k lookahead accept k would vary til eof if need be. It is weaker than backtracking, but faster and can be dealt with automatically :)

This will be part of the left factoring needed to separate lexer rules.

August 18, 2003

Ok, so given that I have an NFA from a lexer grammar, I figured I could left-factor stuff that is non-LL(k). Perhaps I can do it all down to LL(1) for speed and then jump into methods to finish off routines that have been properly predicted. Then I thought "heck, this is like a generic regular language predictor concept for an alternative." The natural extension to parser grammars jumped out at me. I have seen regular infinite predictors before...10 years ago. Was it culik? Hmm...more reference to LR-regular grammars:

LR-regular grammars -- an Extension of LR(k) Grammars. Journal of Computer and System Sciences 7 (1971)

actually I quote: CuC73: Karel Culik and Rina Cohen. LR-Regular Grammars---an Extension of LR(k) Grammars. Journal of Computer and System Sciences, 7:66--96, 1973.

http://compilers.iecc.com/comparch/article/99-10-180

I sent this to Chris Clark:

I have been working on the new analysis engine for ANTLR 3 :) and
"reinvented" using DFAs as infinite lookahead languages.  I remembered
LR-regular languages and then saw your post 4 years ago:

http://compilers.iecc.com/comparch/article/99-10-180

I like the LL-regular concept as it can automatically figure out a
sort of fast backtrack limited to regular language versus my syntactic
predicates which backtrack over full LL (and you specify manually).

Anyway, I think I have an extremely simple algorithm for computing the
DFAs for LL from what looks like an LR(0) item set machine, but is in
fact what I use to compute LL lookahead also.  I believe I can also
answer the question of whether or not there exists a separating DFA.
I'm pretty sure it means if you end up finishing the NFA to DFA
conversion and there are still "merged" DFA states (predicting more
than one alternative) then you don't have a DFA that will work.  OTOH,
perhaps this means that my algorithm can't find one but perhaps a
human could ;)

Anyway, this will work great for things like

d : declaration
  | definition
  ;

where you want to distinguish

int foo(...);

from

int foo(...) { ... }

from the left a la LL.  You don't know what you have until you see the
; or the {.  A simple scan ahead will find the token and you can
predict which alt to take in rule d.

From your post:
  I would be curious if anyone who has access to Visual
  Parse++ (which at one point claimed to be an LR-regular parser
  generator) can determine what it does?

Did you ever hear back about whether they do LR-regular grammars?

Best regards,
Terence

August 16, 2003:

Use a set for the last lookahead depth as it's always a leaf and makes it faster, more space efficient etc... Will this also somehow make the comparison faster? Perhaps optimize the k=2 case as it's most common (comparison routine could be special cased).

August 12, 2003:

I played "hookie" today from school and got very far into analysis stuff. I have it properly doing determinism checking and lookahead depth computations etc... for SLL(k) [LALL(k) ala PCCTS coming soon to a theatre near you] Well, i have only a few tests, but the infrastructure is there now. Interesting how testability forces you often into better encapsulation/compartmentalization. I have a Decision object that is returned from the determinism routine that records everything one would ever need to know about a list of alternatives (can be a good debugging aid for building grammars too). Tracks all the kinds of lookahead and will eventually have predicate hoisting info in there too. I also need to write the testing rig to compare decision objects. For now, check out a few dumps:

    public void k1TwoAlts() throws Exception {
        NFABuilder b = Main.NFA(
                "class P extends Parser;\n"+
                "a : A | B;");

        Node rule = (Node)b.ruleToStartNodeMap.get("a");
        Node blockAlt1 = rule.edge(0).target();
        Decision decision = LLkAnalyzer.deterministic(blockAlt1, 1);
		System.out.println(decision);
    }

yields

Decision with 2 alternatives:
* approximate lookahead
lookahead[1]={A}
lookahead[2]={B}
* approximate decisions
1x2: k=1
* k-sequence decisions

And

public void k2MixedApproximateAndSequence() throws Exception {
	NFABuilder b = Main.NFA(
			"class P extends Parser;\n"+
			"a : (A B|C D)" +
			"  | X Y" +
			"  | (A D|C B)" +
			"  ;");

	Node rule = (Node)b.ruleToStartNodeMap.get("a");
	Node blockAlt1 = rule.edge(0).target();
	Decision decision = LLkAnalyzer.deterministic(blockAlt1, 2);
	System.out.println(decision);
}

yields:

Decision with 3 alternatives:
* approximate lookahead
lookahead[1]={A, C} {D, B}
lookahead[2]={X}
lookahead[3]={A, C} {D, B}
* approximate decisions
1x2: k=1
2x3: k=1
* k-sequence decisions
1x3: k=0
 lookahead[1]= ( A B ) ( C D )
 lookahead[3]= ( A D ) ( C B )
 collisions=[A.D, C.B]

I call that all pretty damn cool! :)

An inefficiency exists when computing full k-sequence lookahead. If k=2 fails, you have to do k=3 but STARTING OVER. Instead, try to leave info in the nodes of LTree indicating where it left off in the NFA so you can simply fill it out further to go from k=2 to k=3!!! Cool.

Heh, can we reduce the size of the constraint set by intersecting approx look for all alts prior to alt i? If we predict alts in order provided, we are constantly reducing the number of input tuples.

a : A F         // {A}   {F}
  | (A B|C D)   // {A,C} {B,D}
  | (A D|Z E)   // {A,Z} {D,E}
  ;

1x2 == {A},nil
1x3 == {A},nil
2x3 == {A},{D}
1x2x3 == {A},nil

Hmm...not obvious that you can, but thesis says something about this during SLL(k) construction.

August 8, 2003:

Ok, so to aid you/me/everybody in dealing with nondeterminism warnings, we need a way to answer the question "huh? how can ID SEMI ID be matched at that point in the grammar? I can't find the path!!!"

I have a great way to deal with this. When I construct the NFA from the grammar, I add a String to each node. Then, I make a routine called node.PATH(lookahead) that finds how that lookahead weaves thru the NFA. All it does is "match" the lookahead by backtracking and spitting out the Strings as it passes each node. It will say things like:

Enter alt 1 of rule a match ID enter rule b enter alt 1 of rule b match SEMI ...

You get the picture. So, the algorithm doesn't ahve to think...just print :) Optimally we'd highlight a path in the syn diag visually, but that's easy too as I can just record the state numbers as I pass thru and then tell the visual sucker to highlight those nodes.

Note that this PATH is essentially the interpretered version of ANTLR. :) Woohoo!

August 7, 2003

I'm back in business. My LL(k) algorithm took me 30 minutes once I had verified that my NFA was constructed properly from the grammar. Sweet...this is just a prototype of course, but demonstrates to me that I'm on the right path. :) No matter how complicated the grammar is, it is represented as a set of nodes and edges with at most two edges leaving a node. Amazing. The algorithm is a very simple bounded (by k) recursive walk of the tree. Analogous to NFA-to-DFA conversion bounded by k. Heh, now it looks just like my thesis. :) Back to the future as they say.

basic grammar -> analysis NFA testing complete. Now just a simple (inefficient) recursive walk and I should be able to create real k>1 lookahead sets. Then to formalize into a spec and make it efficient.

August 5, 2003

I just read a bunch of my thesis again which led me to look in my "legacy" directory. Guess what!!!!

Sonuvabitch if I didn't find the "motherlode": real clean documented code that does all algorithms for lookahead I have ever worked on. This was what I used for my thesis, but it only works for BNF not EBNF grammars (though I did make an EBNF->BNF converter).

PCCTS caches the approximate lookahead as it is complete, but it computes full LL(k) stuff only when approx fails. Further, it computes full lookahead in a consstrained manner--it only looks at the possible collisions of two productions rather than computing full lookahead for both productions and then comparing. It therefore cannot cache, but still b