|
Home |
Download |
ANTLRWorks |
Wiki |
About ANTLR |
Feedback |
Support |
Bugs |
v2
|
|
|
Latest version is 3.0.1 Download now! » |
|
|
ANTLR 3.0 Lookahead Analysis
December 06, 2006[I started putting this LL(*) description in the upcoming The Complete ANTLR reference guide book, but decided it was beyond the scope of the reference guide so here it is in the lookahead blog. Not sure that it is useful to industry or academia because it is neither practical, nor complete, nor rigorous. ;) Anyway, here you go.] Notes on the LL(*) analysis algorithmWhen designing the LL(*) algorithm, I wanted something that modified a well understood algorithm so that it would be more accessible to researchers and developers. Indeed, the algorithm in its general strategy is straightforward, but the details of getting this algorithm to be efficient and to catch analysis errors properly are fiendishly difficult; I've spent nearly 4 years getting it practical and correct. For example, my original algorithm had a lot of landmines in the sense that the algorithm would hit certain grammar rules and never terminate. I had to add a failsafe that terminated the NFA to DFA conversion for a decision if it exceeded a certain threshold in time. Clearly, this makes the academic in me queasy. After looking at a grammar with a lot of these landmines carefully, I arrived at an important conclusion. It turns out that the set of decisions for which ANTLR timed out was the same as the set of non-LL(*) decisions for which ANTLR would never be able to succeed anyway due to recursion occurring in more than one alternative in a single decision. Morever, the algorithm was able to detect these situation quickly and report the non-LL(*) decision rather than attempting the NFA to DFA conversion and then timing out. I should mention that, after constructing the algorithm, I realized that I had invented the analog of Manual Bermudez's interesting work with arbitrary lookahead for LR-based parsing, LAR(m) where m corresponds roughly to my recursion stack threshold that guarantees termination. The difference is that Bermudez's m corresponds to the simple LR stack size whereas my threshold only counts recursion. Both restrict the stack to a finite size, but ANTLR's threshold allows it to roam farther ahead in order to resolve nondeterminisms. The stack threshold is not the amount of lookahead although it is related. Here's the basic idea of how LL(*) works. First, ANTLR builds an NFA similar to a recursive transition network (RTN) from the grammar and then performs a modified classical NFA-to-DFA conversion using the ``subset construction algorithm''. The NFA are created are as follows. Token: Rule: Decision: (in ANTLRWorks, select ``Show NFA'' checkbox when viewing the syntax diagram for the rule to see the NFA ANTLR constructs). Recall that, in the classical algorithm, each DFA state encodes the set of NFA states that the NFA could be in after having seen a particular input sequence. For example, if input ``abb'' means that the NFA could be in one of two states then there will be a DFA state, reachable upon ``abb'' from the start state, that contains those two NFA states. ANTLR's algorithm differs from the classical algorithm in that DFA states must encode more than just the set of NFA state numbers. If you applied the classical NFA to DFA conversion to a decision, the resulting DFA would simply answer yes or no as to whether an input matched the entire decision rather than yielding a predicted alternative number. The classical algorithm continues building DFA states while there are reachable NFA states. Our goal, on the other hand, is purely to distinguish between alternative productions rather than actually parse with the DFA. As such, the algorithm needs to distinguish between NFA states derived from different productions and DFA states are split according to the predicted alternative just like the LR(1) algorithm splits otherwise identical LR(0) states according to lookahead sets. An LL(*) NFA configuration tracks the NFA state number, predicted alternative, and rule invocation stack needed to get to that NFA state: (s|alt|context). As the closure operation enters a rule, it pushes the NFA state that invoked that rule. As the closure operation reaches the accepts state for a rule, it pops the rule invocation NFA state from the top of the NFA configuration's stack and continues just like it had done a method call. The algorithm stops working on a DFA state and considers it an accept state when that DFA state uniquely predicts an alternative or when it finds a nondeterministic DFA state. While there is at least one NFA configuration that predicts another alternative, the algorithm will pursue edges out of that state; this means it uses more and more lookahead in an attempt to distinguish between the alternatives. A DFA state uniquely predicts an alternative if all NFA configurations have the same predicted alternative. A DFA state is nondeterministic if there are two NFA configurations (s|i|ctx) and (s|j|ctx) for the same NFA state s, different alts i,j and the same context (rule invocation stack). Finally, the algorithm verifies that the DFA is reduced and all alternatives have an associated prediction stop state. This guarantees that there are no dangling states, meaning that ANTLR cannot figure out which alternative to predict upon a lookahead sequence, and guarantees that every alternative within a rule is reachable. The algorithm has exponential complexity just like the classical NFA to DFA conversion algorithm, but that pathological scenario does not seem to come up in practice. Also like the classical algorithm, the LL(*) algorithm terminates because there is a finite amount of work. There is a finite number of NFA states, which leads to a finite number of DFA states. Because the algorithm works on DFA states there is a finite amount of work and it terminates. While the stack context for each NFA configuration within a DFA state can grow arbitrarily large in principle (making the number of NFA configurations infinite), a simple threshold causes the algorithm to abort analysis for that node and report a message to the user. To demonstrate why the algorithm must track the rule invocation stack,
consider the following simple LL(2) grammar that matchs {A, B, AA, AB}.
The lookahead for the first alternative of rule s is {AA,A EOF} and {AB,B EOF} for the second alternative. When the lookahead computation encounters the invocation of rule a it asks for the lookahead from a within the context of its invocation so that B does not appear in the lookahead for alternative one at all. In other words, the lookahead for rule a must be {A, epsilon} not {A, FOLLOW(a)}={A,B}. Here is the NFA for s: and here is the NFA for a: The DFA describing the decision for rule s is: Using the -Xdfaverbose option with -dfa shows the internal NFA configurations in each DFA state: Examine state s1. Notice that all of the NFA configurations associated with alternative one ("alt1:...") have a context stack with state 4 on top and the NFA configurations associated with alternative two have state 8 on their context stack (``17|[4 $]'' means NFA state 17 with a context of NFA rule invocation state 4). State 4 is the rule a invocation state in the first alternative of rule s and state 4 is the invocation state in the second alternative. The algorithm uses the context stack to properly compute lookahead, but the resulting DFA naturally does not have a stack--it is not a pushdown automata. When designing the LL(*) algorithm, I wanted something that modified a well understood algorithm so that it would be more accessible to researchers and developers. Indeed, the algorithm in its general strategy is straightforward, but the details of getting this algorithm to be efficient and to catch analysis errors properly are fiendishly difficult; I've spent nearly 4 years getting it practical and correct. For example, my original algorithm had a lot of landmines in the sense that the algorithm would hit certain grammar rules and never terminate. I had to add a fail safe that terminate the NFA to DFA conversion for a decision if it exceeded a certain threshold in time. Clearly, this makes the academic in me queasy. After looking at a grammar with a lot of these landmines carefully, I arrived at an important conclusion. It turns out that the set of decisions for which ANTLR timed out was the same as the set of non-LL(*) decisions for which ANTLR would never be able to succeed anyway due to recursion occurring in more than one alternative in a single decision. Morever, the algorithm was able to detect these situation quickly and report the non-LL(*) decision rather than attempting the NFA to DFA conversion and then timing out. October 19, 2006I had started a paper, but I'm not going to finish for a while. Here is some intro stuff in latex format (sorry). IT IS REALLY JUST MY NOTES and should be considered a draft and possibly inaccurate! See also slides from ANTLR2005 workshop. <begin notes> When building language applications, the stronger the parsing technology, the easier it is to build suitable and natural grammars for even small or uncomplicated languages. Moreover, some languages, such as C++ and Perl, seem to resist specification using many commonly-available and practical parser generators because of ambiguities and context-sensitive structures. The computing resources available on the desktop now permit a number of impressive parsing strategies that allow natural descriptions for even large and complicated languages. Most recently Elkhound citeelkhound, a more efficient $GLR$ citeGLR implementation, and \em Rats! citerats, a parser generator for Parsing Expression Grammars (PEGs) citepeg, accept all context-free grammars (CFGs) and PEGs (ordered non-left-recursive CFGs (### not quite accurate), respectively. Roughly same power. There does not seem to be a strict strength ordering between CFGs and PEGs, however, because there exists a PEG for at least one non-context-free language: $a^nb^nc^n$ citepeg. The extra power of $GLR$ and PEGs comes at the cost of time and space. Grimm's experiments citerats show SDF2 citesdf2 (traditional $GLR$) and Elkhound are roughly 7 times slower than a more traditional $LL(k)$ parser generator such as JavaCC citejavacc at parsing Java source code. \em Rats! is currently about 2 times slower than JavaCC. The extra machinery beyond the $LR$ and $LL$ foundations slows down parsing for even deterministic languages like Java. $GLR$ is much more efficient than Earley's citeearley algorithm because it relies on traditional $LR$ parsing for all but the nondeterministic decisions. Similarly, \em Rats!, which always chooses paths by backtracking (squirreling away partial results like Earley), might be able to rely upon traditional $LL$ lookahead in some decisions to improve efficiency. One may conclude that increasing the power of the underlying $LR$ and $LL$ technology would then further increase efficiency of $GLR$ and PEG-based tools as long as the new technology itself did not incur a heavy cost. The obvious place to look for increasing the power of $LL(k)$ and $LR(k)$ parsers is to increase the maximum lookahead depth, $k$. Increasing $k$ increases the number of grammars acceptable to $LL(k)$ and $LR(k)$ parser generators (though all $LR(k)$ grammars can be rewritten in $LR(1)$ form, they are usually huge and less readable). In the early 1990s, ANTLR citeantlr demonstrated that the $O(|T|^k)$ exponential worst-case for $k>1$ lookahead and token space $T$could usually be avoided by using \em linear approximate lookahead citeparr93, which compresses lookahead $k$-tuples into $k$ sets of tokens, $O(|T|times k)$, whenever possible and by modulating the lookahead depth as necessary from decision to decision. Experiments showed that the vast majority of decisions could use linear approximate lookahead, implying that it is often a token at a particular depth rather than a sequence that distinguishes alternatives. Gagnon reports that SableCC citesablecc now also implements linear approximate $LR(k)$ to good effect. Because traditional $k$-token lookahead is finite, the lookahead language is regular and can be described with an acyclic DFA or trie. A cyclic DFA, in contrast, could describe the unbounded regular lookahead, significantly reducing the number of nondeterministic parser states (relative to finite lookahead). By simply allowing cycles and without introducing a new concept, recognition strength can be drastically improved. The extra heavyweight machinery used by $GLR$ at parse time to deal with finite-lookahead nondeterminisms can sometimes be condensed to a (stack-less) DFA matching a regular approximation of the underlying context-free lookahead. In effect, much of the parser simulation work can be done statically at parser generation time by computing lookahead DFA. Bermudez's $LAR(m)$ citeBS90 computes such lookahead automata to resolve nondeterministic $LR(0)$ states (parameter $m$ is not the lookahead depth, but is a small constant needed by the algorithm as explained in section refalgorithm). $LAR(m)$ is similar to $GLR$ in that both engage a form of arbitrary lookahead when confronted with nondeterminisms. $LAR(m)$ engages a DFA checking the regular lookahead language and $GLR$ engages multiple, full $GLR$ parsers to explore all paths, in a sense, also examining the context-free lookahead language. A $GLR$ system based upon $LAR(m)$ would be exactly as powerful, but in principle would be much more efficient due to fewer $GLR$ states. Further, the exact same parse forest would result because all grammar ambiguities result in nondeterministic $LAR(m)$ states (just like $LR(k)$) for which $GLR$ would be engaged. This paper introduces $LL(*)$, the dual of $LAR(m)$ and the core parsing engine of the new ANTLR v3. $LL(*)$ is a simple extension to traditional $LL(k)$ that neither constrains the maximum lookahead depth nor requires the programmer to specify a $k$ value \em a priori. $LL(*)$ represents a much larger class of grammars than $LL(k)$, even accepting some non-$LR(k)$ grammars as does $LAR(m)$. $LL(*)$ retains the simplicity and semantic flexibility of traditional $LL$ parsers while dramatically improving recognition strength. As with $LAR(m)$ and $GLR$, an $LL(*)$ base for PEGs would increase efficiency without loss of strength. The $LL(*)$ lookahead language is a covering (### actually, LL(*) builds an exact DFA if the lookahead language is regular; if not regular, it's not LL(*). works by simulating finite stack with states which is possible unless recursive) regular approximation to the underlying context-free lookahead language. This loss of precision makes $LL(*)$ weaker than PEGs, but scanning ahead quickly with a DFA is much faster than backtracking with a full parser. $LL(*)$ is $O(n^2)$ because ..., but the max loookahead depth for a given parse is on average very small. Java is max 6 for my tests. No one claims $LL(6)$ is non-linear. Just because it's arb doesn't mean it's large. To illustrate the fundamental difference between $LL(*)$ and $LL(k)$,
consider the following ANTLR v3 grammar fragment that matches Java
type definitions:
noindent Rule \tt typeDef is not $LL(k)$ for any finite $k$ because the common left-prefix, \tt modifier*, can be arbitrarily long. By inserting user actions at the left edges of the productions, rule \tt typeDef is also easily shown to be non-$LR(k)$ for any fixed $k$. \tt typeDef is, however, $LL(*)$ because the lookahead languages for both alternative productions are regular and disjoint. The following cyclic DFA uniquely predicts alternatives by scanning ahead until it sees either keyword ``\tt class'' (implying alternative 1) or ``\tt interface'' (implying alternative 2). begincenter resizebox2in!\includegraphics{figures/typeDef-decision} endcenter ### note: no backtracking with parser; no stack needed. Analyzed statically to discover look language is regular--don't need stack to encode language. Reg look languages are disjoint and hence the alts are deterministic given arb reg look. noindent As with linear approximate lookahead, the distinguishing lookahead feature is usually a key symbol at a particular depth rather than a sequence; here the common modifier prefix actually hinders prediction. Compressing a lookahead parser to a lookahead DFA is analogous to compressing full $LL(k)$ lookahead to linear approximate $LL(k)$. $LL(*)$ also proves an excellent enhanced parsing strategy upon which backtracking may be grafted to achieve the recognition strength of PEGs without the cost of continuous backtracking and memoization. Moreover, dealing with arbitrary user actions written in the target language is much easier because backtracking is engaged so infrequently. This ability in turn allows semantic predicates to be functions of user actions so that, for example, symbol table information can easily direct the parse. A future paper will discuss how both semantic and syntactic predicates citePQ94 may be trivially folded into the $LL(*)$ analysis algorithm described herein to generate parsers that dynamically engage predicates and backtracking when necessary. As a pleasant side-effect, these augmented decisions are optimal in that the extra machinery is only engaged for those specific, individual, $LL(*)$ DFA paths that do not uniquely predict an alternative production. ANTLR v3 parsers combines the power of PEGs with remarkable efficiency and ANTLR's traditional semantic flexibility. <end notes> July 21, 2006Been racking my brain trying to find a more graceful way of "failing"
to analyze a decision with LL(*). For the Rats! Java grammar
(converted to v3 format), ANTLR took 18s on my fast box and spewed all
sorts of errors. Decisions that were too complex had to time out (in
one second) before I figured out there was a problem. You would see
stuff like:
It would retry at k=1 and dump lots of errors at you. Ugh. LL(*) got a few decisions (3) that were not LL(k), but many others timed out. 18s was way too long also. So, then I implemented the auto-backtracking feature that made ANTLR hide all the errors and just backtrack at runtime instead of saying it couldn't figure out what to do at static grammar analysis time. Still had to time out before ANTLR could know to backtrack. :( This also made the LL(*) algorithm look bad--it failed to terminate for many decisions in a big grammar. I spent time looking carefully at the Rats! Java grammar. Looks like this: with just pure LL(*) and no backtracking on the Java 1.4 grammar ripped directly from Grimm's packrat parser, I get 12 decisions with trouble. I detect 4 of them as truly ambiguous and I tell you so. 8 decisions fail to terminate and I have to abort after my simplistic failsafe triggers. Eventually I realized that all 8 have nonregular lookahead languages because of recursion and are, hence, non-LL(*) by definition (DFAs have no stack so LL(*) can't predict these alts). SO! It turns out that the set of decisions for which ANTLR times out
is the same set (so far) of non-LL(*) decisions. Morever, I'm able
to detect this situation quickly and report rather than timing
out. Errors in this situation (3.0b3) look like:
ANTLR terminates in 8 seconds now, shaving off 10 seconds. 8s is still a bit long, but the lexer analysis and code generation is taking about half of that. Woohoo! Whew! Now, I think LL(*) is ready for prime time! You can type in any ol' crap and ANTLR will do something sane and reasonably quickly. May 11, 2006Robert Grimm, creator of Rats! PEG/packrat parser generator, was here
speaking to my graduate class yesterday and it reminded me that Bryan
Ford's PEGs (parser expression grammars) have not predicates also.
They are really only useful in the lexer and seemingly only for single
elements (all examples so far have been "not semicolon" or something
similar). In ANTLR, you say ~';' so I don't think we need them.
However, if we did, I believe a simple transformation will yield them:
which means "match alpha if not followed by beta". We can transform
to:
by using a semantic predicate. Heh heh heh... To get an "and
predicate" at the end is easy too:
(which means "match alpha only if followed by beta") converts to
Easy and optimized for the common case. :) December 8, 2005So, we have a syntax / notation problem for predicates now. We have the following functionality for which to create syntax:
The order of productions specifies the order of ambiguity resolution. So if alt i and alt j are predicted, alt i is chosen if i<j. Originally, in PCCTS (ANTLR v1), we used:
The symmetry of the ? op was/is appealing. Then, in ANTLR v2 we had:
Now, we've added gated predicates and we have currently:
The problem here is that we have an inconsistency. Note that the syn pred and the disambig sem pred are only used when syntax is insufficient whereas gated sem preds are always used. The => operator means implies and, therefore, sort of means gate. The problem is that syn preds are not executed unless LL(*) fails. I sort of like the idea that => means forced evaluation or something similar. In that case, {...}?=> or {...}=> seems right, but (...)=> seems wrong. Optimally, I'd use
but, what about optional subrules then? decl? looks very nice as
optional match decl. Some people use [...] or {...} to mean
option, but we use square brackets for rule args and curlies for
actions. In PCCTS, I used <<...>> for actions, but that was a drag.
We can't use parentheses for rule arguments as that would be
ambiguous. I wonder if an action as args would work: rule{args}.
Nope that is ambig also. I wonder if using real EBNF would work:
<rule(args)> or plain rule if no args? Heh, that might work!!!
Uh oh, what about labels then? r=<rule(args)> might look weird.
Let's see what happens with parens for arguments/return values and
angle brackets around a rule ref with args:
I don't like the returns syntax with a paren, but it might work. I note that JavaCC syntax uses <STATIC> notation for keywords, which I think is the opposite of old EBNF. Actually they require parens on all method calls, which makes them stand out. That is a pain, but very clear. Interesting. The parens are overloaded, but as long as they are required, there is not ambiguity between args and subrule. If we use <rule(args)> notation then we can use [...] for optional and then we can do this:
I don't like the difference in syntax for rule ref with and w/o args. :( Also, I used ID<AST=VarAST> in v2 to indicate the heterogeneous tree node to create for a particular reference. Perhaps angle brackets means I'm going to add options to something, which has a different little syntax on the inside like <ID(AST=VarAST)> and <rule(args)>. Heh, that could work. Add:
Heh, that could even be a nice syntax to flip the type or text or line
or something:
Interesting. Ok, people hate me messing with the syntax that much. ;) Let's try again
Not bad I suppose. I still don't like the inconsistency with gated sem pred vs syn pred. Hmm...how about the parsimonious:
December 4, 2005Ok, took a few hours to get it right, but I have syntactic predicate memoization going. It only took an hour to get it going yesterday morning but then I got stuck trying to make an efficient cache. Worst case the cache must be O(|R|n) where |R| is the number of rules. Not sure where Ford's cache takes exponential space because once you have commit to an input position, you can throw away all cached info for that position. That must be the difference; his grammars are probably stronger as they never fully commit until end of file whereas I take a hybrid LL(*)-backtracking approach. I suspect the constant in front of my linear O(n) parsing will be much smaller than a pure packrat parser. Anyway, here's the idea. Every time you finish a rule when guessing,
just record the final token matched for that rule (-1 indicates
failure) but associate it with the start token. You get a map like
this:
The next time you enter that rule at startTokenIndex and guessing>0, check the cache. If there is an entry, seek to the stopTokenIndex+1 and immediately return w/o descending into the rule. The invoking rule will be none-the-wiser. Action execution. Note that when you enter a rule and you're not guessing, you have to actually do the rule and all associated actions. To guess r you execute it with gates around the actions and then after you have decided to do r, enter again and execute. With a cache you will never be able to execute a rule more than twice for any input position, hence time O(2n), for input size n. Here is an example:
For input
You get:
Review this carefully...here is what is actually going on with the
cache:
Note that alt 1 of the loop in s starts the guessing by entering (b)=> which starts to parse b. Upon success, it records the stop token for b+1. It then enters b to do it "with feeling", executing the actions: printing "alt 1". Input "(x) ;" is done. Any cache for this input can be removed (though I do not remove anything for now). Next, rule s tries to match (b)=> again for the first alt of the loop. It fails due to input "((" and records that for rule b and in fact the predicate itself. It then tries alt 2's pred (b '.')=> which tells it to match b, but it says it's seen it before and it will never succeed. It returns immediately w/o entering b. Next, it defaults to trying rule c w/o a predicate. It eventually executes the action to print "alt 3" after recursing twice. I specifically built rule s so that the same rule will be called, b, during guessing to show that it's just as important to record failure as success so you don't keep trying a rule when you will it will fail. Here is the entire rule b to show you how the cache works:
So, this memoization is easy to do except for the cache memory
performance. I'll have to come up with a nice, fast cache later. The
support code is a fairly significant chunk of code to insert into
every rule as you can see. Perhaps this can be turned off. Ok, I
just added a parameter to outputFile template called memoize. Here is
the result of running the input through w/o the rule cache:
We only save one rule invocation with memoization. Without the memoization, however, the code space is much smaller. With the caching code, W.class is 7314 and 5492 without; a 25% increase. The new cmd-line option is:
December 2, 2005I'm thinking about the backtracking mechanism now. v2 uses exceptions which everyone is hating because they can be really slow particularly when you invoke them exponentially many times. Much of the time, you don't need backtracking in v3. For example,
consider the usual expression rule chain:
with a rule that has productions with the same left-prefix in common:
This is LL(*) believe it or not. Here's the DFA that predicts which alt of rule stat will succeed: So, my dream of just feeding stuff to antlr and having it work is coming alive! The problem is that we still need a way to handle ambiguous grammars because some things can be matched two ways but you can order the productions such that it chooses the right path. We need the ordered productions and backtracking implied with syntactic predicates (the term I coined when I built them into PCCTS). So, when you have a construct that is too hard for LL(*) to parse or
you know it's ambiguous (also too hard for LL(*)) then you would
manually specify the backtracking with a syntactic predicate. You get
the power but don't have to resort to a full backtracking parser (very
slow). Here's the classic C++ ambiguity problem solution in v2 ANTLR:
Despite the power of LL(*), I'm finding that I still get a bunch of warnings I don't want and newbies will find perplexing. My main goal is for newbies to be able to just type stuff in and, if it's at all reasonable, it will sort of parse. :) As I consider the addition of syn preds to v3, I have come to an interesting generalization. Syntactic predicates can be converted to functions that return true/false if that grammar fragment matches the current input. Then the analysis engine can be fooled into thinking they are semantic predicates like {synpred42()}?. The resulting DFA will split an ambiguous state based upon semantic information to decide what to do...it's just more expensive than a normal predicate. Anyway, if you have predicates, they can easily be added into the analysis and are evaluated only on those input streams that are non-LL(*). So then I asked myself, why not just always backtrack if I find a problem? This would imply that all productions had an implied syntactic predicate on the left edge. Then I started looking at Bryan Ford's extension of my grammars to PEGs and his memoization that guarantees linear time (but exponential space worst case). Greg Benson here at USF reminded me that this memoization is similar to the incremental parsing strategy Prashant Deva has and the partial solution I came up with (Prashant and I will write up the stuff together). Then I realized just how freaking easy it would be add the memoization and then I realized, after reading Bob Foster's blog that I could toss out all memoized data before position i after I have commit to a parse at that point. The data requirements would be minimal and we'd get linear time. I need to look at Ford's work more carefully to figure out how close to full CFLs he gets with PEGs. Anyway, here's the key: I could guarantee that no matter how much you backtrack, you would parse something at most twice, once to check and once to exec actions! So, the question is, should I always automatically backtrack now and just avoid giving errors except for left-recursion? We'd have an LL(*) option to allow you to manually specify the backtracking. Or, perhaps we should start out having it plain LL(*) with manual backtracking and have a -newbie switch to make ANTLR shut up and just do its job. Hmm...start with a -newbie switch I guess. Another question, do I implement Ford's "not syntactic predicate"? Seems useful and easy to implement. You could say things like match this as long as it's not followed by that. (part II after feeding my face) So what should backtracking look like (no memoization)? I am pretty
sure now that exceptions are unnecessary. An instance variable called
failed can be used instead of return values (sort of like UNIX
errno). As long as that value is used or saved immediately
following a method call, no stack is needed and we can avoid having to
use the method return value mechanism. Then, just alter the parsing
routines to "return" failed instead of throwing an exception:
In the generated code, you'd see things like:
Clearly the cost is two new bytecode instructions to push failed onto the stack and then a branch instruction...for every token match and rule invocation. But, who cares...still readable. As usually, every action needs if (guessing>0)... around it. The same syntactic predicate may need to be invoked in multiple places
within a DFA so we need to move this to a separate method.
Step 1. I would need to convert (b)=> into
Step 2. Generate code for the synpred23 function as if it were a rule
with a boolean return value:
The only difference is the return false instead of plain return
and the guessing stuff; then the final return true. Actually, make a
wrapper function around a normal rule:
Yes, that is cleaner and if the fragment is just a rulename like x, then I can directly call that rule. :) That is all there is too it. Magic! Memoization. So talking to Prashant about his incremental parsing
work has led to this simple version of Ford's memoization for
backtracking (i'll have to look at this implementation). Revisit the
above expression example. Imagine parsing an e at token position
i--you will descent into e2 then e3. To avoid ever having to do
that again at position i, you just remember that you've already done
it and, most importantly, how many tokens you consumed by the end of
the rule the first time. In general you need a map:
I'd probably do this as an array of dictionaries that answered the
stop token for any rule that has parsed at that position
successfully. Once I consume() past position i, I can set it's memo
to null. :)
Note to self, should we record a rule that has been attempted but that did NOT succeed to avoid unnecessary backtracking later? Anyway, so, you'd see a new rule preamble that checks to see if we've
seen this before.
The important point here is that calling rule e at the same input position immediately returns success without descending all the way down to e3! Ok, I've got the dev environment up...go! Easy...faked out ANTLR by adding a few lines of code and converting
syn pred to sem pred. This is a good sign that ANTLR's code is well
structured when I'm able to make such changes in 15 minutes.
results in the following DFA for rule a prediction. Notice how that it only backtracks on what it knows is a problem, (x), but not on the other two possible prefixes! :) Ok, 15 minutes later and it's building the code for the predicate
fragments:
Gawd, that was easy. Now to add failed bits and the guessing stuff and pred wrappers. I suppose I'll optimize later to remove that extra try/catch. Ok, 20 minutes more and it's generating nice wrappers and all the
if-failed stuff.
It compiles. I guess now I actually need to check the execution of mark/rewind. Oh, seems like I should make my example have:
So rule b is the special case of a single parens. Let's see if it matches. Wow. 15 minutes later and the example works:
Upon "(x)" I get "alt 1" and with "((x))", I get "alt 2". All total, implementation sufficient to get first example working, 1.5 hours. :) Boy do I love this v3 code base! My tendonitis is flaring up in my hands/forearms. Better quit for the night while I'm ahead! I'll do memoization over the weekend. Wow, this is great. Backtracking with no exceptions and no change at all to the analysis. That's nice. December 1, 2005hooray...solved the gated predicate thing. Imagine:
Imagine that input "aa" and p true should result in two A tokens rather than 1 B token. Currently, it matches a single B because it matches the longest string here. The DFA predictor makes it clear how the two rules are predicted. =>2 implies alt 2 (2nd token) is predicted. Now imagine a new kind of gated predicate (as opposed to
disambiguating) that literally turns the production on or off:
The syntax came from a Loring Craymer idea. In this case, I take the DFA and I force evaluation of p or !p on any path that can match that input. Then at runtime if p goes false, the related arcs will be unavailable. Here's the DFA. Notice that it's the same DFA minus the additional predicates on the arcs. This took me quite a lot of thinking to get right. I considered and rejected lots of different alternatives. One of the hard things was what it means when disambiguating and gated
predicates are available in the same predictor DFA! It turns out that
only gated predicates are collected from the normal predicate
expressions. So if you have
Then the DFA looks like: Note that the p pred is never used as syntax alone is enough to predict that alternative. :) That is the key difference: disambiguating predicates are only evaluated when syntax fails and gated predicates turn off a production completely (by gating out edges in the prediction DFA). This works for cyclic DFA as well :) I discovered a fabulous way to do syntactic predicates today. :) Carry them along like sem preds, treating the result of their evaluation like a predicate. The predicate will be just a method call that returns true/false if the related grammar fragment matches. November 29, 2005Gated productions via forced predicates So, I'm playing around with real grammars (python and ruby) to find
nasty parsing problems (boy did I find them). In python, there are
only lexical problems really. newline is context-sensitive and
somethings you want to ignore them and sometimes you don't. Smells
like we need a predicate, but problems arise. For example, here is
the comment rule:
You want it to scarf all trailing comments as they are not statement terminators; we don't want COMMENT to fall out, which would let NEWLINE have them (annoying the parser). The loop:
looks like it would not match the subrule according to the predicate, but....it is not even part of the prediction! Remember predicates exist to disambiguate syntactic problems. In this case, the lexers thinks that as long as newlines are present, it will scarf them. No semantics required. This is the problem that Oliver ran into for XML. He needs a predicate that "gates" in/out a production or loop. I know how to insert that (analysis would ignore; code gen would simply insert always), but what the hell would the syntax be to distinguish? The parser seems to work naturally with disambiguating predicates but the lexer seems to want gating predicates. Interesting. Loring Craymer had an interesting suggestion at the workshop: use syn pred notation around it. (...?)=> Perhaps a variation:
where the action implies when it's ok to proceed. These would NOT hoist as you'd be evaluating them out of context (which is why the {..}? variety are always evaluated after the lookahead test). These gate predicates would simply get shoved into the lookahead decision for that specific production. The problem is that sometimes you want to distinguish between two
lexical rules like Oliver did:
Perhaps the special case of gate predicates on the left edge of lexical rules get hoisted one level into the special Tokens rule that chooses among lexer rules. Actually should it be
I wonder. Well, it is more consistent and is really then a shorthand for ({p}?)=> ok, a problem.
It's clear what you mean: throw no viable alt exception if !p and ab
is the lookahead. However, look at the predictor. The lookahead DFA
is:
You want the gating predicate to be evaluated in such a way that the first alternative is not even a possibility if p is false. Doesn't that mean we'd need a DFA specific to this situation that would carve out the "ab" lookahead leaving "ac" and "x"? I can't think of where to stick the predicate in the DFA predictor since the shared common prefix of "a" merges in the DFA. I can't just stick the predicate before the evaluation of the DFA either. Re-examine the canonical problem:
Upon input "aa" and p is true, ANTLR does the "wrong" thing--it matches a single B token. Two 'a' char is sufficient for syntax alone to distinguish between the alts and ANTLR doesn't even eval the preds. To get "aa" and p to yield two A tokens, ANTLR must literally turn off
multiple branches in the predicting DFA. Any edge that leads to a B
predictor state must be gated off or a new DFA must be created w/o the
B alternative. Consider:
Now, input "aa" would for sure match AA. The DFA would have all predicates for each edge evaluated to shut off invalid paths. I looked at the analysis can stay the same...only code gen is different! September 30, 2005More on the predicate issue with lexers. Consider the following
ambiguous grammar:
ANTLR says:
If you add predicates, all is well and ANTLR generates the following prediction DFA for rule 'a': If you convert to a lexer grammar:
it still works. You get an analogous DFA. When the two alternatives are two separate tokens, however, everything
changes.
Now ANTLR cannot detect the ambiguity soon enough in order to gate out one token or the other depending on p1 or p2. Yes, it sees that 'a' can predict both A and B, but it knows that EOT ("end of token") appears afterwards and it tries to use syntax (that is, more lookahead) before resorting to semantic predicates. You get this DFA: This DFA implies that 'a' followed by the end of token (in this case anything other than another 'a' or 'b') is ambiguous but it has to see the EOT first. I actually don't generate the right code here, but that is a code gen bug not an analysis bug. EOT is like EOF--it's not really an input symbol, but it is a useful analysis tool. Without it I couldn't properly handle keyword versus ID (if A is a keyword, B looks like ID in this case). What is needed is a "gate" that literally makes rule A or B invisible. For example, "aa" is a B according to the syntax and what people expect; "intint" should be an ID not two int tokens, right? So I'm afraid we need a new concept. With such a "gate", you could turn off rule B, making "aa" give two A matches depending on the predicate p2 being false. This is definitely not how I have defined predicates. September 28, 2005The .* idiom says match anything, but actually means "match
anything until you see what follows it". In the lexer this
definition works great:
Unfortunately, at the moment it is by default greedy like all other
loops and we need:
I am changing the def of this in the lexer to mean nongreedy by default. Oh and (.)* is the same as .*. As for the parser, what does .* mean? Well, how about this
filtering example:
It seems to me that it should work for nonnested blocks. With .* nongreedy by default now, it generates a proper loop that exits when it sees '}' followed by EOF. The problem comes when the .* loop is deeply nested in your grammar and it could look at many many rules and tokens looking for the sequence until EOF. In the lexer, it stops at end of token and doesn't look into other rules (unless the rule is a fragment). This parser version then is very inefficient. I wonder if in the parser shouldn't assume nongreedy and k=1 by default. Then the above grammar would stop when it saw '}'. Yes, I think that will do it. You can set k yourself if you want. Ok, just fixed the code so that the following exits the loop upon R:
Plus loop on wildcard works same way. September 25, 2005Oliver Zeigermann writes again to say that his grammar:
is not working properly. For input <tag>, upon seeing the '<', the DFA predicts TAG_TEXT even though the predicate tagMode is false. To make it easier to understand, I have derived a simpler grammar that
illustrates the same problem...it's the keyword vs ID ambiguity:
ANTLR generates the following DFA: Notice that the predicates are evaluated after the syntactic context
has been evaluated. This is a requirement for the meaning of semantic
predicates. The problem is that it doesn't generate the right code:
Notice that after input 'x', the only valid input is a..z and for predicting ID. To understand what's up, check out the same w/o predicates:
The DFA predictor for implicit Tokens rule is: All is well, ANTLR generates:
The EOT (end of token) transition predicts anything other than a..z as it should. EOT actually predicts both X and ID, but it is quiet (picking first alt like flex and friends: X). The predicates p and !p for X and ID don't change the fact that ANTLR knows how to resolve everything between keyword vs ID except one condition: 'x' followed by a non-ID letter. For example, "xy" and "xx" even are ID tokens syntactically--no need for predicates! The isolated 'x' or "x[" or whatever is ambiguous: X or ID. So, here a predicate can help as the DFA shows (the code gen has a bug as I indicated, but still the DFA is right). To distinguish between X and ID for input 'x' followed by ~'a'..'z' we can use the predicates. The LL(*) algorithm works properly--it forks on the EOT node to distinguish. So (once I fix the code gen bug) the lexer will operate as I intended, though I admit the behavior is not natural in this case. You mean to say that !p predicts X even when there is no syntactic ambiguity. You want the lexer to match the shortest sequence due to the predicate. ANTLR's rule of using predicates only for the ambiguous subsequences uses !p to predict X only when an isolated 'x' appears. It tries to match the longest possible match. I'm not sure how to resolve this. The general behavior of the predicates is proper, but this case is a fairly strong argument that it is not always natural. There is a conflict between longest match and predicate evaluation. I don't see what general principle would force evaluation of the predicates at state 1 instead of state 3. For example, consider the addition of a '.' char so that the ID loop
doesn't see the end of the token, removing the ambiguity between X and
ID.
In this case, 'x' uniquely matches X as it does not have a '.' following it. The DFA has no predicates at all even though the grammar has them: Where would I force evaluation of predicates? If I see input 'x', simply looking one more character tells me if I should stop and choose X or continue and match an ID. September 10, 2005Oliver Zeigermann writes to the list:
I respond: Yes, this is intended behavior though not desirable! I spent hours trying (twice!) to come up with a convenient and efficient means to passing the "this" pointer around in the DFAs so that they would not be static. I can't remember the issues now. Originally, it was because I had separated the cyclic DFAs into bytecodes rather than java source code. No matter what, if you have a ref to a parameter in a predicate that will end up in a cyclic DFA, it will not be visible! There is absolutely no way around this. You can generate an arbitrary DFA in java code without using ptrs to objects. Many people have try to show me some try-finally stuff that will do arbitrary gotos but I always show a backwards jump that isn't possible. I'm very open to suggestions, but at the moment I'll be forced to define the semantics of predicates to not allow parameters (even though they will work for non-cyclic DFAs). This is highly undesirable! Heh, wait! What if I computed the value first and then passed to the DFA!!!!??? THe value cannot change during the DFA execution so the value would be the same! So I would compute the predicate(s) first and then pass a vector of results to the DFA. This means that for cyclic DFAs, all predicates are evaluated (mostly unnecessarily) but they are not allowed to have side-effects so this is ok. Interesting! The following rule needs a cyclic DFA to see past the DOT*:
I would generate a DFA that looked like this:
Holy crap! That might work. :) Actually, don't forget though that hoisting makes parameters unusable also.
If {x!=0}? gets hoisted into another decision it won't compile. Officially then only values visible to the class are valid, though in most cases even parameters would work if I can figure out the pre-evaluation thing. Oliver brings up another problem that actually highlights an important issue. It seems like we need a way to force semantic pred inclusion, right? This seems like a general issue for when you want to, for example, turn off certain parts of a syntactically deterministic grammar. ANTLR currently would optimizes the predicate out! November 24, 2004I'm working on the last hard problem before I can sprint to the finish on basic lexing/parsing for 3.0: displaying useful information to the programmer about what is wrong with their grammars. One of my goals for 3.0 is to make it much easier to use and much easier for newbies to start building languages. First, I'm dramatically improving the power of the underlying parsers so that programmers can type in any old (non left-recursive) grammar and with high probability get a parser that works, even if it's slowish. In other words, I'm trying to avoid generating grammar warnings. Second, I want to provide really useful nondeterminism warnings so it's easy to figure out what's wrong with the grammar. To achieve my goal for useful messages, we must take a giant step back and totally reconsider what programmers actually want rather than simply dumping out what the analysis engine has conveniently at hand. This blog entry is an exploration of possible strategies. TerminologyFirst terminology. A grammar is ambiguous if the same input sequence
can be matched in more than one way by a grammar; i.e., there are at
least two derivations. For parsers based upon LL and LR technology
(and any other expecting a unique, unambiguous parse), all ambiguous
grammars are nondeterministic, meaning that at at least one decision
point during the parse, the parser will not be able to decide which
path to take. For example, the following grammar is ambiguous and,
hence, nondeterministic:
A grammar can easily be nondeterministic with regards to a particular
parsing strategy also. For example, the following grammar is
unambiguous, but non-LL(1) and, hence, nondeterministic vis-a-vis
LL(1):
What ANTLR does nowCurrently, nondeterminism messages are based upon the complete set of
input sequences that are common to (i.e., can begin) pairs of
alternatives. For example, the following grammar is non-LL(2):
ANTLR reports:
For multiple alts with common lookahead sequences, you still get pairs
of messages. If you add two other non-LL(2) alternatives:
you get "n choose 2" or n(n-1)/2 pairs and, consequently, 6
messages.
(See below to see what ANTLR 3 shows, saying the same thing but grouping by conflicting input not pairs of alternatives). Naturally as the size of those sets and k get bigger, the error messages become overwhelming and, therefore, useless. Full LL(k) can yield exponentially large sets, though ANTLR 2 only does linear approximate lookahead. PCCTS used to generate some massive messages and ANTLR 3.0 will do LL(k) and LL(*) (LL(star) for google's sake). Besides getting huge messages, I found that I didn't want all possible conflicting input sequences. What I really wanted was a single input sequence to investigate and I had to just pick one from the displayed set. In other words, there may be 30 token sequences that are messed up, but I really need to pick one and then trace the grammar looking for the two paths that cause trouble. Odds are that all conflicting sequences derive from 1 or 2 paths emanating from the nondeterministic alternatives. Fixing one sequence then will probably solve the entire nondeterminism or at least make it much easier to find the next one, if two alternatives are nondeterministic for multiple reasons. Even given a single troublesome input sequence, discovering manually how the grammar can match that sequence in multiple ways is often challenging for large grammars such as Java and C. Finding issues in a C++ grammar by hand is extremely difficult. Wouldn't it be great if a tool could simply show you in your grammar how an input sequence could be matched starting in some rule? In summary, the current system displays really big overwhelming messages containing all conflicting input sequences, tells you which alternatives are nondeterministic, and does not help you find problems in the grammar. What we want for ANTLR 3.0When confronted with a nondeterminism, what I really want is to see a sample input sequence (and the shortest sequence) that can be matched by following more than a single grammar path and then I want to see those paths visually in a highlighted in a syntax diagram (or possibly in the text-based grammar). So I want to invert the situation from ANTLR 2.x to 3.0: group problems by conflicting input sequence not conflicting alternatives. Really, who cares which alternatives are nondeterministic? You care more about which input sequences can be matched in multiple ways. I suppose you care about which alternatives are nondeterministic only if you have to manually trace the grammar looking for the nondeterministic paths. Consider a variant of the previously mentioned grammar with 4
alternatives that generated 6 warning messages because there are 6
pairs of 4 alts:
(I had to make the simple alternatives totally ambiguous otherwise the LL(*) parsing algorithm was able to handle it deterministically. ;)) A better error message would clearly just say that, for example, input
sequence '"int" ID' can be matched by all four alts:
I generated all but the sample input sequence with the current 3.0 prototype; note the "unreachable" messages are nice too ;) All input sequences were the same so ANTLR saw them as all merging to a single lookahead language for all alts, hence, a single message. Combining overlapping lookahead sequences for all alternatives can
result in many fewer messages per nondeterministic decision, but some
grammars actually can cause many more messages. For example, the
following grammar would need to generate two messages:
One message for input A and one for B. Interestingly, if you add
a common token to the end of each alt, the input sequence merges (the
final common token causes the lookahead DFAs for both alts to "pinch"
back together into a single node):
Here you would get only a single message. All sounds great, right? Well, there are some problems:
I think I'll start by just accepting that many messages may appear in the worst case and just see how common it is. In text mode, I think I'll pick a single input sequence (as short as possible) and say that the decision is nondeterministic on it for a set of alts. The unreachable messages will only be displayed in "verbose" mode. In text mode, obviously there would be no path highlighting in the grammar. Jean Bovet will be building a GUI for grammar development. It is possible that you could have the command-line tool pop up a small window displaying a syntax diagram highlighted to show the nondeterministic paths. Regardless, the GUI environment would let you click on a "warning" icon in your grammar so you could see the syntax diagram. Oh, the GUI will also allow interactive stuff like "show me how input xyz can be matched from this point in the grammar". Very handy. Of course, it will also fully interpret your grammar for rapid prototyping. :) Warnings And Semantic PredicatesIf in verbose mode, ANTLR should let you know which predicate
expressions were used to resolve syntactic nondeterminisms. Morever,
if some predicates were found ("hoisted"), but there were too few to
cover all possible nondeterministic input sequences the programmer
should know that as well. For example, rule decl in the following
grammar is nondeterministic syntactically, but predicates p1 and
p2 can be hoisted from rules array and function to resolve it:
The programmer should be notified in verbose mode about the predicates being hoisted into rule decl. When predicates insufficiently cover a set of nondeterministic input
sequences, the programmer needs to know. In the following grammar,
predicate p1 is hoisted into the decision for rule a:
ANTLR should generate (which it does already) something like:
The problem is that B could be matched by rule b w/o a covering predicate, hence, not all B paths are resolved. Note that rule b has two B alts with a single predicate and is resolved properly. The nth alt is covered with !(union-of-predicates-from-other-alts). Lookahead DFA displaysI can also show the entire lookahead DFA predicting the various alternatives of a decision. Further, I can show the subset of the DFA that matches all input for a particular alternative. Even more interesting is to show the subset of the lookahead DFA that matches input for a set of nondeterministic alternatives. For cyclic DFAs, this would be the only way to illustrate the nondeterministic lookahead language as it is not a finite set of input sequences. Sometimes you would want to see the complete nondeterministic input language to ensure it's really ok; that is, there are not sequences that should really be handled differently. DFAs show the combined predicates, which is really useful. June 9, 2004I have the full LL(*) lookahead algorithm implemented with full semantic predicate hoisting; got unit tests too. Code is only about 12k lines so far. I have an interpreted version of a lexer and parser running for testing. I am wondering about the LL(*) name. Sriram Srinivasan suggested it and it is a great name, but I'm worried that search engines won't find it as the * may screw it up. Let's see if google picks up LL(*) as a search term after this entry. :) How about LL-star or LL(star) too? April 23, 2004Been doing lots of reading (mostly relearning the LR junk)... Ok, so I was wrong in prior section "Wrong: 3. When to terminate
conversion". The algorithm will not terminate (provided by Sriram
Srinivasan) on the following non-LL grammar (heck, is it even LR? The
language may be nondeterministic inherently like as the palidrome
grammar):
The full call stack context is used to decide how to guide NFA->DFA closure operations when "falling off" the end of a rule's NFA. That is, it is used to restrict the DFA construction to include only valid lookahead sequences. This provides context-sensitive FOLLOW rather than global FOLLOW information. Deciding whether we've seen a DFA state before, however, should be purely based up on the set of NFA states contained in that state--not the unbounded context as I thought. This is like the difference between LR and LALR. LR takes into consideration the lookahead (essentially its context) associated with each item (a grammar position) when determining if the state has been been seen. So, any state with item a : alpha . beta, x is considered different than a state with the same item but different lookahead component as in a : alpha . beta, y LALR, on the other hand, would merge these states and combine their lookahead to be x union y. My algorithm without using the call stack context is analogous to building an LR(0) machine. Note that my termination problem is solved if I restrict how much context I'm willing to look at to a small constant, m, which is the approach taken by Bermudez & Schimpf in 1990: Practical Arbitrary Lookahead LR Parsing. They have LAR(m) parsers, which are LR(0) machines with DFAs emanating from inconsistent states to look ahead arbitrarily far in an attempt to resolve conflicts. m is about 5 or 6; the user does not specify either k (which is arbitrarily big) nor m. Note that LR(k) terminates because the lookahead context is bounded to at most k symbols. There is simply a finite number of possible lookahead sequences of depth k and, hence, there are a finite number of states in the LR machine. Each state is a function of the grammar items and lookahead, both of which are finite; hence, a unique set of states is finite. Bermudez's algorithm terminates be he limits the size of the context stack he is willing to consider (rather than k). k can still vary arbitrarily which is what you want from a grammar developer point of view. I'm uncomfortable with this m constant, however. They say that LR isn't that much more powerful than LALR and I don't feel like having to explain why I put this constant in there just to make it terminate. ;) Plus, NFA->DFA conversion is something that people seem to grok. My goal is to reduce all this parser mumbo jumbo to a problem I can pose as a modified DFA algorithm. :) Anyway, my new version of the algorithm terminates for sure. The only extra work is a quick linear check at the end to ensure the DFA is reduced (i.e., that every state has a path to the stop state). If reduced, the decision is deterministic (i.e., I can predict which alt of a rule will succeed). More later...i believe I have discovered a really interesting relationship between LALR machines and my grammar looahead NFA after reading David Galles' soon-to-appear compiler textbook and chitchatting for a while. Essentially, if you create one of my NFAs and do a DFA from the start state, I should give you a regular approximation of the LALR language. If you jump to a rule other than the start symbol, you are effectively chopping off the prior context ("lookback"); this naturally would emasculate an LR-based system into an LL-based system ;) My DFA conversion at a rule's left edge, say, A is like computing the viable prefix machine for sentential forms beginning with A's alternatives. Well sort of. More properly, it is maybe a regular approximation to the "continuation language" (which is generally full LL not regular) for all sentential forms starting at A's alternatives. Oh, and final thought: drum roll...I think I am close to finally figuring out what the hell LALL(k) is! April 1, 2004Been working on the NFA->DFA conversion algorithm. Having lots of conversations with Sriram Srinivasan. An excellent parsing "sparring" partner! Anyway, I think I finally fully understand the problem of using arbitrary lookahead automatically for ANTLR 3 (i.e., no k specified and automatic fast backtracking if needed). There are three issues that force alteration to the standard NFA->DFA conversion algorithm (Thompson's subset construction), two of which allow us to terminate the algorithm early (1 and 3). BackgroundThe overall process looks like this:
For example, rule
is converted to from which I derive the DFA predictor: where i=>j means state i predicts alternative j. When a grammar rule can invoke another rule, r, the NFA has an epsilon
transition to that rule and an implied epsilon transition from the
tail of r back to the state following the state that invoked r. For
example,
Results in the following NFA for rule a: and for b: The lookahead DFA created for rule a is 1. Which alternative is predictedNormal NFA to DFA conversion begins by creating the DFA state from the closure of the NFA start state. That will indeed yield a DFA that covers (that is, "is a superset of") the underlying LL grammar, but it does not tell you how which alternative is predicted. To track that, you must augment the DFA state to have not just a set of NFA states, but also the predicted alternative. From the a:A|B; rule above, the DFA start state is not just
then (start from the decision point not the rule start state of
0)--the DFA start state is the set of tuples:
meaning that state 7 predicts alt 1 and state 8 predicts alt 2. As you find transitions from an NFA state, copy the alt when making the new DFA target state during subset construction. 2. Epsilon transitions leading to another ruleRule references in the grammar result in epsilon-transitions to the start state for the NFA associated with that rule. Because the LL grammar has an implied stack it uses to return to just after the rule invocation site, the NFA conversion algorithm must track NFA context in order to yield valid LL lookahead sequences. That is, we cannot just jump to another rule, r, and then blindly include all states following references to r. For example, in the following grammar rule b is referenced in two
places, but naturally it can only be called from a single spot at
once. The lookahead for rule a must reflect that:
To handle this, we must expand the NFA configuration we track during
DFA construction. Our tuple becomes
where an empty stack is written $. Here is the NFA for rule a: and for rule b: The DFA predicting alternatives in rule a: and the DFA predicting alts in rule b: DFA construction proceeds by creating DFA start state for rule a's
alternative block. Start with
Etc... 3. When to terminate conversionTerminate: when it is deterministic or known to be nondeterministic. Algorithm proceeds until it finds either situation. It cannot get into a state where it is continuing to look for new DFA nodes to add, but is not either deterministic or nondeterministic. Deterministic when all alts associated with NFA states in a single DFA state are the same. It uniquely predicts that alt. That is, for all (i|alt|ctx_i), alt is the same. Nondeterministic when you have the same NFA state merged into DFA state predicting >1 alternative and the context is identical. That implies that given the same input sequence, there is more than one path through the NFA to a single state that (crucially) predicts more than one alternative. In normal NFA conversion you only care about collapsing similar paths, here I must track which alternative is predicted. So if both (s|i|ctx) and (s|j|ctx) is in a single DFA state, the DFA is nondeterministic. Termination guarantee. If non-recursive, must terminate due to
finite set of states, edges. If recursive and deterministic, must
terminate as we will eventually find the proper DFA that predicts alts
and then terminate. If recursive and nondeterministic, we must
terminate as we will always detect the nondeterminism. We will always
find the nondeterminism because in the worst case, there is a "pinch"
point at the end of a subrule decision where all paths merge to a
single "stop state". If we merge this state into a DFA state
predicting two or more alternatives, it is a nondeterminism because
the context is the same: the empty stack. There must be at least one
input sequence that leads the NFA to reach this stop state starting at
the left edge of two different alternatives such as:
The NFA looks exactly like the first NFA above in this section for A|B. Note that both alternatives reach the same convergence state at the end of the subrule: state 6. In this case, a DFA state will contain both (6|1|$) and (6|2|$), which is a nondeterminism. The nontermination issue becomes clear when you look at
self-recursive, nondeterministic stuff like (thanks to Sriram for this
example):
and here is the NFA: If you don't check for nondeterminism, the algorithm loops forever looking for more and more input that will render the DFA deterministic. The proof sketch for termination is that to loop forever the decision must be nondeterministic, but we can always detect nondeterminism because in the worst case the nondeterminism will always be detected at the convergence "stop state" for a decision alternative block. That is, the algorithm may never proceed past the convergence point since the context will be identical by definition ($) and the state will be identical as it is a single NFA state. The predicted alt upon reaching the terminus will be different or the algorithm would have already stopped processing that path as deterministic. Since the alt is different and the state + context is the same, the DFA must be nondeterministic. March 12, 2004No explicit determinism checksHad a relevation today! The modified NFA->DFA conversion also
implicitly handles determinism checking! Right now, I compute
lookahead for each alt of a block such as in:
Then, I compare alti with altj pairwise for "n choose 2" (or n(n-1)/2) combinations. For n=4, that's 6 combinations; anyway O(n^2) and hence expensive. The LL-regular algorithm looks at the lookahead machine (GLA I call 'em in my thesis) for all alternatives simulatenously trying to decide if there is a DFA that uniquely predicts which alt to predict. If there is such a DFA, the prediction is inherently deterministic, hence, avoiding the determinism check that normally follows lookahead computation. The only issue is that the DFA is infinite not finite in general. Figuring out when there is a deterministic finite DFA may be hard. Of course, this is only needed if I want to do fixed k lookahead in the normal way. Optimizing Lookahead TestsParsersI'm beginning to think that not only should lookahead computation be
done simultaneously for all alternatives, so should the lookahead
prediction at parse time. In other words, prediction is now for k>1:
where LOOK(alt) is of the form:
The switch should be much more efficient than a linear search when
k>1. (For k=1 we generate a switch now). The state machine will
left-factor common prefixes and consequently only test common
prefixes once. For example,
would result in:
where I have put the actions in there so you don't worry about asking me to remove the redundant match() calls (yet). Like all DFAs, we would consider one input symbol and then move state, requiring k=1 lookahead only. That implies each input symbol is touched at most once during the DFA simulation. In contrast, think about n alternatives for, say, k=3. In the worst case, you do k*n lookahead comparisons. The DFA would do exactly k. Yes, it's insensitive to the number of alternatives! (I'm doing the happy dance). What about when the DFA is non-finite? That is, it may look really far ahead (to the end of a function declaration for example). You cannot use the fixed int constant inside the LA(i) test like I did in the finite prediction DFA above. You would need to record that you were at input location p and then reset the input pointer after prediction before attempting to match the predicted alternative. No biggie, but we need to buffer n in this scheme instead of a fixed k token window currently used. This large lookahead highlights an inefficiency issue with the current
tool. The following code has a redundant test for input symbol A:
Surely we can remove the match(A), leaving a simple consume() call
in its place:
The code is no smaller, but we have avoided a method call and a lookahead test. I'm not overly concerned with this for the parser, but it can add up when it appears in the lexer. The problem may become significant even in the parser when you have non-finite lookahead. LexersTo expose the issue, consider integers vs floats in the lexer.
Currently, this is illegal due to finite k lookahead:
The int prefix can be larger than k for any fixed k. Currently, one
needs to left-factor manually into:
This is a pain and not nearly as readable. On the other hand, LL(k)
versus regex power comes in handy such as when you want to handle
nested lexical constructs like nested comments. How can we get the
best of both? Use a DFA in nextToken() to predict which rule will
match. For our case, we'd build a DFA that matched until a '.' was
seen or not:
Notice that after we've matched an entire integer, we must rewind the input and have method mINT() match it all again. This is totally a waste of time (unless there are actions in the (...)+ loop). If nextToken() has matched an entire token, it should just return the appropriate token avoiding matching the rule completely. What about when nextToken() has predicted half a rule like with the
float? After seeing the '.', then it has predicted rule FLOAT.
Wouldn't it be great if we could jump into the middle of FLOAT like
the following, thus, avoiding the redundant input comparisons:
Is this combined DFA / LL matching system feasible in general? Not sure. Certainly actions on the left edge of a lexer rule prevents such redundancy avoidance as the actions are implied to execute before any char for that rule are matched. What about when a rule has more than one alternative? You'd have to have multiple entry points, effectively breaking up a rule with n alts into n rules, each uniquely predicted (I guess multiple alts notation is just syntactic sugar anyway). My other concern is how readable the lexer will be if I've broken apart all the rules. Perhaps the debugging option would turn off optimizations. Given my initial experiments with LL-regular lookahead computations, I think that the algorithm has a good chance to know the entry point into a rule, allowing us to avoid redundant character checks. Every state in the DFA is a merged NFA state (specifically the states in the lookahead NFA created for FLOAT, for example). At the very least, I will know which of several places I might enter a rule...that may give me enough information to split/duplicate pieces of a rule in order to start at non left edges. What about recursion in the lexer? This would most definitely prevent jumping into the middle of a rule as you need to build up the recursion stack to match for nested parentheses etc... No need to specify kDoes the creation of DFAs to predict alternatives effectively eliminate the need to specify a max finite k for lookahead? I.e., can I just build a DFA and not worry about whether it is even finite?! Lookahead computation for k is exponential in k. Interestingly, my
fixed k lookhahead computation may be more expensive than an infinite
DFA predictor because the finite DFA cannot use loops to signify token
sequences. For example, the following rule is LL(4) (and of course
LL-regular):
The lookahead for alt 1 is encoded in an acyclic DFA as
The lookahead algorithm has to walk the (B)+ loop multiple times up to k-1 in this case rather than just recording "heh, I found a loop". The fixed lookahead computation is in fact an NFA simulation that asks "what are the k-sequences that are reachable from this state?" A cyclic DFA does not have to flatten loops and therefore can encode the lookahead for alt 1 literally as itself: A (B)+ E. This implies that the lookahead computation is not dependent on k, but rather the complexity and number of states in the NFA from which it is derived! Wow! (I'm doing the happy dance again) Predicate HoistingWhile I'm grabbing the lookahead via NFA->DFA conversion, can I also
pick up semantic predicates along the way and just encode them in the
lookahead DFA? The following grammar has a predicate that, if hoisted
into rule member from ctor, would disambiguate the two
alternatives (that are syntactically identical).
The NFA for member looks like (using my NFA language from class
project):
The predictor for the alts of member is nondeterministic as it
is simply
We cannot merge the ID transition because the NFA stop states have
different alt predictions. What if we carried the predicate with us
in the DFA conversion so that it looked like:
where the expression predicates the traversal of that edge in the DFA. We can trivially carry this information along at the cost of an addition "predicate pointer" instance variable in the DFAEdge object. With the finite lookahead scheme in PCCTS, which did hoisting, it could not see over a token to a predicate beyond...Only a hoisting distance of 0 was allowed (i.e., no tokens between you and the predicate). We may get hoisting distances greater than 0 for free as the predicates will be added into the prediction DFA and will prevent merging of states as in the above example. The difference is that the unmerged states will cause a nondeterminism: the semantic predicate will guide the DFA between the identically-labeled edges. Actually you only avoid the merge if the states will predict different alternatives. Hmmm...more thinking needed here. Approximate LookaheadApproximate lookahead exists to avoid the exponential complexity of k>1 lookahead computation. Is it still necessary when building DFAs? As I say, the DFA construction is not sensitive to k any more; it's sensitive to the grammar size. Perhaps I have spelled the death of approximate lookahead!!! Damn, my dissertation down the drain? ;) <snicker>. Wow. I'd be happy to dump the approximation as it does not lend itself to my new mode of thinking regarding NFA->DFA conversion. Perhaps the approximation can be seen as a DFA compression. First I'd have to remove cycles from the DFA and then do the compression. Might be worth it and faster than doing the cycle removal from the NFA. Traversing all of those epsilon transfers to compute lookhahead is very expensive particulary for rule invocation chains as in expression evaluation. Holy shit...did I just find a way to compress DFAs in a huge way? Wow. Tree lookaheadOk, talking to Loring Craymer on the phone while I'm watching the
freaks in San Francisco come into the cafe I'm sitting in. He asked
about k>1 tree lookahead. Why not? It comes for free if you insert
some "imaginary" tokens into the token stream. The issue is that
are very different structurally. You are no longer matching a one-dimensional structure; you are matching a 2D tree. The first k=1 lookahead is easy: it's just the first symbol you see regardless of whether or not it's a root. To match the structure too, just pretend there are tokens for the down
and right pointers. The lookahead for the above is changed to:
Pretty cool, eh? August 24, 2003Got LL-regular algorithm working. Using non-finite state machine to
predict alternatives so that you can write rules like this (for C
language):
So because functionhead (which includes an infinite number of arguments), an LL(k) parser for fixed k cannot see past the head to the semicolon or left curly. An LL-regular parser races ahead until it finds the semi or curly and then decides what to do. It's like a syntactic predicate limited to regular prediction languages. Oh this is so groovy. I have my unit tests going now for NFA->DFA conversion. It is smarter / faster than a normal conversion because I'm only trying to predict which alternative will match--not build a machine to match them. This is the auto-leftfactoring thing needed for parsers and lexers alike. I love it because it stops computing DFA states when it knows for sure where it must end up (i.e., which alt will be predicted). That means (for lexer grammars), I can use the DFA conversion to handle arbitrarily complicated lexer rules, but once it knows which rule will win, it jumps into the middle of that rule and continues lexing using a top-down style parse. :) Or, I could just roll backwards and jump to a fully-fleshed out lexer rule parser. Might be hard to generate half of a rule, but we'll see. Anyway, the basic technology (poorly written for this DFA conversion) for the analysis engine is now available the Sunday before school starts. Wooohooo! I'll get back to this soon i hope. I'd like to keep playing around with this prototype to see what an intermediate form needs to look like before trying to build something clean to release. As far as I know, the engine only needs to be made efficient (no caching at the moment) and to improve the lookahead so I can generate LALL(k) not SLL(k). There are some details around like "is this a lexer or parser grammar" etc... that need to be specified so I build the right NFA (e.g., parser start rules get an EOF loop on the end where lexer rules do not). So, I'm almost at the point where I need to define what a "for real" grammar looks like. All coming together nicely :) August 23, 2003Got basic NFA->DFA conversion going. More complicated than I thought...gotta track alternatives and watch out for ambiguities like in this test case:
Ugh...seems to work, but won't know until I can build a test rig. I wonder if the right test rig is simply the DFA interpreter so I can just pass in a bunch of strings and if they all match properly and some bad ones fail then all is cool. Hmm..not at rigorous as dumping the DFA and comparing. You know I bet a simple print to string from DFA state set compared to an expected string might just do it. :) You know something like:
Yeah, that would work :) August 17, 2003building testing rigs and altering software to have testing hooks is a
freaking pain in the ass. ;) Finally have the decision test rig.
Example:
Ugh. Was hoping to get onto a really cool concept today, but didn't get to it. :( Think infinite regular (as in DFA) lookahead versus backtracking. The DFA would be much faster and could be added automatically. It would be like k lookahead accept k would vary til eof if need be. It is weaker than backtracking, but faster and can be dealt with automatically :) This will be part of the left factoring needed to separate lexer rules. August 18, 2003Ok, so given that I have an NFA from a lexer grammar, I figured I could left-factor stuff that is non-LL(k). Perhaps I can do it all down to LL(1) for speed and then jump into methods to finish off routines that have been properly predicted. Then I thought "heck, this is like a generic regular language predictor concept for an alternative." The natural extension to parser grammars jumped out at me. I have seen regular infinite predictors before...10 years ago. Was it culik? Hmm...more reference to LR-regular grammars: LR-regular grammars -- an Extension of LR(k) Grammars. Journal of Computer and System Sciences 7 (1971) actually I quote: CuC73: Karel Culik and Rina Cohen. LR-Regular Grammars---an Extension of LR(k) Grammars. Journal of Computer and System Sciences, 7:66--96, 1973. http://compilers.iecc.com/comparch/article/99-10-180 I sent this to Chris Clark:
August 16, 2003:Use a set for the last lookahead depth as it's always a leaf and makes it faster, more space efficient etc... Will this also somehow make the comparison faster? Perhaps optimize the k=2 case as it's most common (comparison routine could be special cased). August 12, 2003:I played "hookie" today from school and got very far into analysis
stuff. I have it properly doing determinism checking and lookahead
depth computations etc... for SLL(k) [LALL(k) ala PCCTS coming soon to
a theatre near you] Well, i have only a few tests, but the
infrastructure is there now. Interesting how testability forces you
often into better encapsulation/compartmentalization. I have a
Decision object that is returned from the determinism routine that
records everything one would ever need to know about a list of
alternatives (can be a good debugging aid for building grammars too).
Tracks all the kinds of lookahead and will eventually have predicate
hoisting info in there too. I also need to write the testing rig to
compare decision objects. For now, check out a few dumps:
yields
And
yields:
I call that all pretty damn cool! :) An inefficiency exists when computing full k-sequence lookahead. If k=2 fails, you have to do k=3 but STARTING OVER. Instead, try to leave info in the nodes of LTree indicating where it left off in the NFA so you can simply fill it out further to go from k=2 to k=3!!! Cool. Heh, can we reduce the size of the constraint set by intersecting approx look for all alts prior to alt i? If we predict alts in order provided, we are constantly reducing the number of input tuples.
Hmm...not obvious that you can, but thesis says something about this during SLL(k) construction. August 8, 2003:Ok, so to aid you/me/everybody in dealing with nondeterminism warnings, we need a way to answer the question "huh? how can ID SEMI ID be matched at that point in the grammar? I can't find the path!!!" I have a great way to deal with this. When I construct the NFA from the grammar, I add a String to each node. Then, I make a routine called node.PATH(lookahead) that finds how that lookahead weaves thru the NFA. All it does is "match" the lookahead by backtracking and spitting out the Strings as it passes each node. It will say things like: Enter alt 1 of rule a match ID enter rule b enter alt 1 of rule b match SEMI ... You get the picture. So, the algorithm doesn't ahve to think...just print :) Optimally we'd highlight a path in the syn diag visually, but that's easy too as I can just record the state numbers as I pass thru and then tell the visual sucker to highlight those nodes. Note that this PATH is essentially the interpretered version of ANTLR. :) Woohoo! August 7, 2003I'm back in business. My LL(k) algorithm took me 30 minutes once I had verified that my NFA was constructed properly from the grammar. Sweet...this is just a prototype of course, but demonstrates to me that I'm on the right path. :) No matter how complicated the grammar is, it is represented as a set of nodes and edges with at most two edges leaving a node. Amazing. The algorithm is a very simple bounded (by k) recursive walk of the tree. Analogous to NFA-to-DFA conversion bounded by k. Heh, now it looks just like my thesis. :) Back to the future as they say. basic grammar -> analysis NFA testing complete. Now just a simple (inefficient) recursive walk and I should be able to create real k>1 lookahead sets. Then to formalize into a spec and make it efficient. August 5, 2003I just read a bunch of my thesis again which led me to look in my "legacy" directory. Guess what!!!! Sonuvabitch if I didn't find the "motherlode": real clean documented code that does all algorithms for lookahead I have ever worked on. This was what I used for my thesis, but it only works for BNF not EBNF grammars (though I did make an EBNF->BNF converter). PCCTS caches the approximate lookahead as it is complete, but it computes full LL(k) stuff only when approx fails. Further, it computes full lookahead in a consstrained manner--it only looks at the possible collisions of two productions rather than computing full lookahead for both productions and then comparing. It therefore cannot cache, but still b | |||||||||||||||||||||||||||||||||||||||||||||||