|
Home |
Download |
ANTLRWorks |
Wiki |
About ANTLR |
Feedback |
Support |
Bugs |
v2
|
|
|
Latest version is 3.2 Download now! » |
|
|
2003 Cabal Notes Terence Parr, Loring Craymer, Monty Zukowski Tree stuffBuild/specify (parse tree too?) syntax Loring's tree construction syntax:
symlinks:
Loring's concept is similar to the @-variables of sorcerer. Like a
local variable, each invocation of a rule (or #{"A"} in his case)
creates a new version of the variable. A reference to the
label/symlink accesses the most recently defined variable. His syntax
implies "leave a placeholder" not label a tree node. The
implementation describes it best. Each label acts like a stack of
values; push a new value upon label def. The labels are naturally
globally visible. The #{"A"=tree_expr} really says to insert
tree_expr where "A" is above. We used the example:
or to have a #(DECL d d1 d1 d1) created:
Like a string template (doc with holes), this is a tree with holes you can fill in. Loring says that once parsing is done, he backpatches all "holes" by recursively resolving. He just does bookkeeping during the parse rather doing on-demand linking up. Rewrite syntax for tree construction:
or
s2 would either be null and leave a placeholder or just not appear. This stuff doesn't work well for expr...still would use annotation. Labels vs direct reference to elements: Post to group to find out. Parse trees: why not? Easy and could almost be a runtime option (wouldn't need to regenerate parser from grammar). Monty talks about a JCL problem where you see statements but an EXEC
statement starts a new subtree. A null or other EXEC ends current
subtree and bumps you back up. Using Loring's notation of symlinks,
you define an EXEC insertion point (symlink) only when you see EXEC.
All other statements then add themselves to the current EXEC.
Hmm...can't figure out what to do on NULL statement:
Ok, now we're thinking "-> new tree" syntax is not worth adding. Gen a tree grammar To make things easier, "no antlr modifying code inside regular actions." He'll make this a module. :) To push forward changes, use
where t0 is the first tree grammar generated, t1 is the generated result from modifying text parser grammar. t0annotated has the t0 with actions. Actually, monty is right: we can back out the actions and then we don't need the original t0. :) Tree rewrite Same as tree construction. Labeled subrules:
In Loring's notation, we'd now do:
Hmm...notation is kinda nutty with the (...)! Ask the interest list. tree -> text Talking about AST to flat token list or to parse tree. Trying to incorporate StringTemplate. Perhaps string template is added to AST walk to use for rule returns. Maybe the use of templates for the various components of your output language like method, decl, class def etc... and then you still use an ASt tree walk to drive the translation, but it uses the string templates to actually spit out stuff. That way you can change the templates w/o having to mess with the grammar. Hmm...maybe generate parse tree from AST and then use regular Java grammar to "unparse" it back to text. Perhaps even generate a token stream from the AST and then have the parser walk that list dumping things in the right order. search/apply to tree Add tscanf("#(... %1 ...)", List-to-fill). Make sure we have the findalltree() method there to look for all expr or something satisfying a predicate or something. attribute syntax We want all $foo and #foo stuff to be better hooked into code generator. Loring goes so far as to suggest we move the $foo.setText(...) outside the action. We need to allow IF statements around it however and we have no good answer for that. Ask the group. We decided it was a good thing to define token and AST properties in ANTLR syntax so you can know when somebody refs/sets the properties and it's language independent. The code generator can dump the right thing. token string text; int line; List args; // for html args for example AST token originalToken; xml read, type relationship Code generator should be able to generate ASTs that satisfy DOM trees or whatever as long as it's compatible with the run-time they supply. So that walking routines and tree walkers etc... can use the same AST type. XML reading/writing should be standard for serializing trees. StringTemplate, code gen templatesOk, concerning string template and trees for final code emission. Trying to pull data from tree into string template and have it run the show seemed wrong because you have to have a tree grammar fragment in order to compute the values:
They said well, just put the template into the parse tree. That
smelled wrong. I think (and still do after this is all over that we
might be able to get string template to run things). For now, we
agreed some that associating a template with a tree grammar rule is
good concrete approach. Each rule would push a new scope and create a
new string template with for rule r there is a template r. It makes
new variables available that even invoked rules could set. Labeling
an item (atom or rule reference) sets the associated attribute for
most recent hole on the scope stack. example:
Rule args does not create a new template and, hence, does not create a new scope of symbols. Labels act assignments to the most recently defined hole of that name. A labeled tree indicates what template to create for that tree. Hmm..doesn't work if the tree is flat with no root...only can label first element. ISSUES:
Actually label does not look right. For example "slist:stat" is a little weird since it means put stat into the slist hole of a template. "slist:=stat" is better since it implies slist is assigned the value of stat invocation. Another thought is to merge the string templates directly into the
grammar that way you don't need to figure out when to create a
template:
What about precedence for expressions? You need parens around the subexprs sometimes if prec of current op is higher than subexpr op. Can simply ref <lparen> in the expression template and then use a generic setAttribute() method of the tree parser to set the value for the currently scoped attribute name (such as lparen). What if we allowed a predicate to allow a choice of templates on the end:
Then values set with := assignments are queued up and then at end of scope when you must choose a template, it flushes the values from the stack before it pops the scope. What about case of needing to create a template for implicitly-defined
scalars. There is no template associated with the, say, "a=1"
statement to stuff into decls somewhere. One solution is to put an
anonymous template at the def site:
This is a double assignment (with scope override) but it doesn't say when you are done createing the anonymous template. Need to do this probably; move template to area where you know template will be complete:
The "flush" operation pushes values into the multiple templates. The new scope for this assign rule is the set of all attributes referenced in all => templates. Multiple refs to <id> in same scope yield same value(s). Can move these anonymous templates to a template grammar file and then refer to them by name so that you can factor these out for reuse within grammar or from within another grammar. Concept: the rule yields a template (or more) and then you can do "attr:=ruleRef" to get the value above. The decls:="int <lhs>;" yield actually doesn't return anything--it sets an attribute in an existing template. A purely return value concept is totally totally wrong as it implies the order of output is the same or similar order as input tree. The tree walk just walks and sets values then the overall template is dumped at the end. Using string templates that actually produce tree templates. For output template
you want tree
If you have a parser that generates trees then can easily parse the template with holes to get the right tree:
What if the template is ambiguous? Will have to wed the template to a particular parser rule. Might have to refactor your grammar if you want to leave a hole of a particular sequence of tokens. Normally though it'll be obvious stuff like expr or stat or decl or method. Still you might not specify enough text on the left edge for me to predict the correct alternative. Ah. If you insist that all holes correspond directly to a grammar (parser grammar) rule, then you'll never have an issue, but there are no decisions: the entire rule is replaced by rule_epsilon special token type. Somehow we must have the ability to say what start rule to use (don't forget the EOF issue) in the parser corresponding to the template text. Each hole also must specify the rule it replaces. So you have <attributeName:ruleName>. Technically then you can't just specify a string template and then decide w/o modification to build trees. Though, you can go back the other way. If you specify a tree template it's easy to ignore the rulenames after the colons and then generate text. So that is the consistency thingie for trees and text output using the same syntax and concepts. Queue up values and then fill string or tree template. If you want to do replacements like "i=<expr>" you can simply record the token positions that begin/end expr rather than creating parse tree. Code reuse (pushing forward changes); for whole grammars; Composition of grammarsThe problem is that people need to add actions to stock grammars to make them do something useful. But, when the original say Java grammar is changed, you want all the grammar changes to push forward to everybody's specific grammar. Want antlr-specific diff3 not regular diff3 'cause want it to merge at alternative level. Forking new copy of grammar and push forward looks like this:
This pushes changes from g0 to g0' into g1 to yield g1' Seems like this really is a process or principle that people use in conjunction with their RCS as opposed to a specific tool we build (minus the antlrdiff3). Any tool would have to pretty much use an RCS to manage g0 and g0' and we'd have no idea how they have set up their depot so couldn't effectively build a tool to work in all environments. Just write a good paper on how to reuse grammars. Even if you just cp into a different dir and manually do the diff3, at least you can push forward changes. Aspects. add tags that can be filled by separate aspect file. What if don't have tag in right spot? Hard to read and figure out what actions do in aggregate. hard to figure out where are errors are due to preprocessing. We'll leave this to users who want it; they can build the tools (Bogdan has one). Composition. Nice to allow in workbench (click on the tokens you want) to generate a lexer. I'm less sure it's useful for grammars. Hmm...might as well cut-n-paste or use RCS. Be nice to have an idiom plugin or just a special expression idiom where you can specify the operators and precendence and atom rule and it generates the rules or possibly a precedence parser. Might not be good enough for the wackiness in Java, but would work for vast majority of simple languages. :) Conclusion: we need not add anything to ANTLR grammars like inheritance, composition, etc... because it's not right and adds lots of complexity. Will have to have a workbench to supply ease-of-use stuff. Keep the tool core! Lexer impl (DFA or top-down); insert multiple tokens ability like for python?Special case the single char tokens like operators. Inline rules that are not invoked by other rules (which is usually all of them (top level token rules)). LL(1) rules go into switch, others go into if-then-else after switch. Then use syn preds to distinguish any others that are nondeterministic. Implies there will never be a warning from antlr that it cannot distinguish between tokens. Idioms; common task ameliorationA poorman's version of emitting back out with no is to remember the char position for each token and then just sort and print as a backend. Does not allow reordering, but allows you to manipulate the nodes (i.e., the properties) of the tree and even augment for say instrumentation. If you wanted to change the char position, you could actually fake reordering. Then just walk and print "in order" and boom you have your text back out including the whitespace if you want on a hidden channel. To do a simple source to source using same input/output language, should be easy to have it read in and spit out as an identify transform. Then can add ! and other string templates to insert in output stream to do augmentations like code instrumentation. This should be made easy. Feature is "string template" that is inserted into output stream at that point. Easy and consistent. :) Input token stream to output token stream...just insert a new set of tokens and can feed new parser. Or just dump text and feed into another lexer/parser pair. Hmm...to add "instrument("x")" to the stream, you'd need to convert this to a stream of tokens. But, you have a lexer since you're using one. Could use it to build the token stream from the text and then spit this token stream out. Could have an echoToken(Token t) method that either spits as text or token stream; you could decide. :) See expression idiom from above. How about an option or idiom that says "ignore whitespace". I always use the same thing anyway. TestingLevels (or what to test):
Support:
In general, comparison of text needs to be tolerant of whitespace and formatting. Worst case is checking the java code generated by the code generator. Perhaps normalize the txt by reading in with Java grammar and dumping or something. How to run:
Workbench requirements and wishlist
None of this seems to need to be in eclipse. ANTLR implementation detailsIntermediate form that can be interpreted as well as used to generate code from? That form could be read from disk by any language then. E.g., python could read and then generate python. Most important output from antlr is the lookahead analysis. Good for testing etc... LL(k), generalized lookahead; sem pred hoistingModule boundariesseparate runtime/tool lib. Standard library spec for trees etc... to aid porting to new output languages Build issues
|
|||||||||||||||||||||||||||||||||||||||||||||