Semantics and error checking
- warn about non-unique refs to elements from actions like $ID
- don't let modes refer to each others' token defs
- do more option (all levels) checking
- if a lexer rule references a lexer rule which doesn't exist, i get an NPE instead of a message telling me which rule was referenced from where
- chk for undef'd rules before allowing ATN build
- warn if $e used for rule ref in rule e. e : ... | '(' e ')' {print $e.v;} // doesn't translate to rule ref
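
A minimal sketch of the undefined-rule check mentioned above, run before any ATN construction. It is not ANTLR's implementation: the class name UndefinedRuleCheck and the map-of-references input are assumptions made for illustration. The point is only that each dangling reference gets a message naming both the referencing rule and the missing rule, instead of an NPE later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical pre-ATN check (not part of ANTLR): every referenced rule must be defined.
final class UndefinedRuleCheck {
    /**
     * @param refs maps each defined rule name to the rule names its body references
     * @return one message per reference to a rule that is not a key of refs
     */
    static List<String> findUndefined(Map<String, List<String>> refs) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, List<String>> rule : refs.entrySet()) {
            for (String target : rule.getValue()) {
                if (!refs.containsKey(target)) {
                    // Say which rule is missing and where it was referenced from.
                    errors.add("rule " + rule.getKey()
                            + " references undefined rule " + target);
                }
            }
        }
        return errors;
    }
}
```

If the returned list is non-empty, the tool would report the messages and skip the ATN build.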
parse trees
- XPath- or jQuery-like feature to find nodes
- tree pattern matching? e.g., find all (e 1 1) trees.
- in left recursive expression rules, allow op=('+' | '-' | '|' | '^')
- Allow rewrite by return value (enter or exit)?
- Create method to create new parse tree with concrete syntax.
return parse("stat", "while (i>3) {...}");
- Find a way to have some nodes not appear in parse tree or at least in listener. just skip creating new _localctx.
- Take this example:
e : ID | e '.' ID;
With the input "a.a.a" you get (select (select a a) a). the .stop value for the outer ctx is correct, but for the inner (select a a) it's null.
correction: i'm not sure it's correct for the outer due to a syntax error occurring in mine, but it's definitely null for the inner.
yep, _localctx.stop is set at the very end of the rule, but it needs to be set inside the postfix expr loop. right after "_prevctx = _localctx;" add "_prevctx.stop = _input.LT(-1);"
syntax
- returns[Map<String, Object> x]
- ID*[','] comma-separated list of ID
- tokens {} allowed in lexer?
- &foo syn preds like pegs; 0 width. where are they allowed?
Code generation
- move all ctx objects to bottom?
- be able to split big grammar into chunks using inheritance
- split serialized atn
Token type issues
- Sam: in a combined grammar, the following does not work:
analysis
- Sam's LL(1) optimization
- Prediction can return incorrect results when alts are partially predicated
- I said "LL(1) == SLL(1)"; Sam says this results in a faulty reportAmbiguity on the else in "{ if a then foo else bar }", when in fact it's only a conflict for "{ if a then if a then foo else bar }"
- bug in preds for full ctx parse. in ParserATNSimulator.predTransition, you assume that empty stack is in context. that doesn't work for full context parsing. i have a checkin on one of my branches that shows the fix. see: https://github.com/sharwell/antlr4/commit/99ce3cba5cfe2ecaa81244e59400fa28f2104be3
- predicated alt taken over naked alt causes trouble:
With the current rules, unit test LR expressions, the following test will fail in the LR tests:
- DFAs mostly have one edge. optimize to avoid array for this case
- hit a case in my parseratnsim where getAmbiguousAlts returns {2..3} but there are still configs which lead to alt 1. it's going immediately into conflict resolution and incorrectly choosing alt 2. you should be able to repro it with the following rule with the input "x.3" choosing alt 2:
i also have a config set containing both of the following configs: (397,2,[],up=7), (397,2,[],{46:0}?,up=7). the second config there can be pruned
- sam's pred test that fails. treat missing pred in config set for alt as true; say no preds then for that alt.
- remove add ERROR edge from execDFA, put in execATN when reach is null.
- testNotSetRuleRootInLoop. ~set in LL1Analyzer doesn't compute ~
- optimize LL(1) in adaptivePredict?
- Sam's idea: with a lexer or with -Xforceatn for a parser, you can modify the ATN when constructing the parser instead of using sempreds to address language differences or customizations (ST delimiters, enum as keyword in java, etc)
- Does DOT match EOF in (foo|.) EOF block?
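
One way the "DFAs mostly have one edge" item could be approached, sketched under assumptions: SketchDFAState is a hypothetical class, not ANTLR's DFAState, and symbols are assumed to be small non-negative ints. The state keeps a single edge inline and only allocates an edge array once a second distinct symbol shows up.

```java
// Hypothetical DFA state (not ANTLR's DFAState): one inline edge, lazy edge array.
final class SketchDFAState {
    private final int alphabetSize;
    private int singleSymbol = -1;        // symbol of the inline edge, -1 if none
    private SketchDFAState singleTarget;  // target of the inline edge
    private SketchDFAState[] edges;       // allocated only for a second distinct symbol

    SketchDFAState(int alphabetSize) {
        this.alphabetSize = alphabetSize;
    }

    void setEdge(int symbol, SketchDFAState target) {
        if (symbol < 0 || symbol >= alphabetSize) {
            throw new IllegalArgumentException("symbol out of range: " + symbol);
        }
        if (edges != null) {              // already in array mode
            edges[symbol] = target;
            return;
        }
        if (singleSymbol == -1 || singleSymbol == symbol) {
            singleSymbol = symbol;        // common case: stay in single-edge mode
            singleTarget = target;
            return;
        }
        // Second distinct symbol: migrate the inline edge into a full array.
        edges = new SketchDFAState[alphabetSize];
        edges[singleSymbol] = singleTarget;
        edges[symbol] = target;
        singleSymbol = -1;
        singleTarget = null;
    }

    SketchDFAState getEdge(int symbol) {
        if (edges != null) {
            return (symbol >= 0 && symbol < edges.length) ? edges[symbol] : null;
        }
        return symbol == singleSymbol ? singleTarget : null;
    }

    boolean usesArray() {
        return edges != null;
    }
}
```

Whether this pays off depends on how many states really stay at one edge; that would need measuring on real grammars.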
Errors
- add sync()-like functionality to prediction so that, even during prediction, we can do single token insertion or deletion.
Visitors/ event listeners
Options
- header{} goes into lexer and parser, same with members?
Runtime
- CommonToken.getText() points at the current input stream for the lexer, but if somebody resets it, it will point at the wrong stream. add a new pointer or replace the token source in the token object; Sam suggests sharing a 2 ptr object that has the lexer and the input stream pointer.
- Make an efficient token object
- Sam: ParserRuleContext.getStop() returns null if exception occurred - could still be valuable to know where parsing ended up though
- commontokenstream.reset can leave the stream on an off-channel token
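
A rough sketch of the shared "2 ptr object" idea from the CommonToken.getText() item above. All names here (SourcePair, SketchToken, SketchLexer) are hypothetical stand-ins, and a CharSequence stands in for the character stream. Each token holds the pair that was current when it was created, so resetting the lexer's input does not change what older tokens read.

```java
// Hypothetical immutable pair of (token source, input); shared by all tokens
// created while this input is current.
final class SourcePair {
    final Object tokenSource;   // stands in for the lexer
    final CharSequence input;   // stands in for the character stream

    SourcePair(Object tokenSource, CharSequence input) {
        this.tokenSource = tokenSource;
        this.input = input;
    }
}

final class SketchToken {
    private final SourcePair source;
    private final int start;    // inclusive char index
    private final int stop;     // inclusive char index

    SketchToken(SourcePair source, int start, int stop) {
        this.source = source;
        this.start = start;
        this.stop = stop;
    }

    // Reads from the input captured at creation time, not the lexer's current one.
    String getText() {
        return source.input.subSequence(start, stop + 1).toString();
    }
}

final class SketchLexer {
    private SourcePair source;

    SketchLexer(CharSequence input) {
        this.source = new SourcePair(this, input);
    }

    // Resetting the input swaps in a new pair; existing tokens keep the old one.
    void setInput(CharSequence input) {
        this.source = new SourcePair(this, input);
    }

    SketchToken token(int start, int stop) {
        return new SketchToken(source, start, stop);
    }
}
```

The cost is one extra object per input rather than per token, since every token from the same input shares the same pair.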
lexers
- generating unneeded action funcs in lexer
- predicates should be allowed in the lexer.
- should we allow same token name in multiple modes? seems useful.
- Using the first ANY_GENERAL rule, it consumes everything. Swapping for the second ANY_GENERAL rule, it works as intended.
- I don't understand why they are not doing the same thing?
- import mode pulls rules into another mode; shares common stuff like WS, ID, etc...
- FOO : X (Y {foo();} | Z {bar();}); silently drops the {foo();} after Y. what is allowed? it's fine to have FOO : X Y {foo();} | X Z {bar();};
- Using '\n' in a parser works but then using tokenVocab to pull into a parser grammar seems to gen a real newline in token string list in gen'd parser and maybe .tokens file.
Misc
- @ANTLR(...) to compile grammars in package
- @api to signify stuff to use from antlr api vs public; Sam points out that this really should be a Java interface.
- .g4?
- '\n' in parser goes to real newline in .tokens file.
- Consider this example:
In this example $expr should bind to the sub-expression in my opinion.
However, it does not. Since the rule is also named expr, $expr refers to
the rule context instead of the context of the sub-expression. I think
most of the time this is not what the user wants.
Raw Sam notes:
can you override rules in a mode?
A : 'x'; mode M; A : 'y';
omg lol
i guess i get what i asked for
i just saw the generated code from the lexer i'm experimenting with
I'm getting error 31 with the construct "x='.' {$x}"
*correction*
x=~'.' {$x}
"X : Y;" in lexer causes NPE in ParserATNFactory if Y doesn't exist
why not emit the serialized ATN as a file next to the .java file? embedded resources are standard practice in java
then the java file is even cleaner and you can't get loader errors
*compile errors
i'll toss in .atn as a potential file extension
tty tomorrow
for output=AST, an empty parser rule results in a compile error because _root0 is not defined.
actually that happens for any empty top-level alternative
so "x : y | ;" gives it
you should either make xContext final, or use a method Create_xContext to construct it
i think BlankGrammarListener should be GrammarListenerBase. among other things GrammarListener and GrammarListenerBase will appear side by side in autocomplete dropdowns
you'll get a small performance boost by allowing a null listener and including "if (listener != null)" in your enterRule and exitRule methods
no need to SuppressWarnings on the listener interface or blank implementation
You can also add @Override to the methods in the blank listener implementation and to the enterRule and exitRule methods in each rule context
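
a rough sketch of the null-listener guard, with made-up names (SketchListener, RecordingListener, SketchParser are not the generated classes). it only shows the "if (listener != null)" shape in enterRule and exitRule, plus @Override on the implementing methods:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical listener interface standing in for the generated one.
interface SketchListener {
    void enterEveryRule(String ruleName);
    void exitEveryRule(String ruleName);
}

// Small implementation used to observe the calls.
final class RecordingListener implements SketchListener {
    final List<String> log = new ArrayList<>();

    @Override
    public void enterEveryRule(String ruleName) {
        log.add("enter " + ruleName);
    }

    @Override
    public void exitEveryRule(String ruleName) {
        log.add("exit " + ruleName);
    }
}

final class SketchParser {
    private SketchListener listener;  // null means no listener is installed

    void setListener(SketchListener listener) {
        this.listener = listener;
    }

    void enterRule(String ruleName) {
        if (listener != null) {       // skip the interface call entirely when unset
            listener.enterEveryRule(ruleName);
        }
    }

    void exitRule(String ruleName) {
        if (listener != null) {
            listener.exitEveryRule(ruleName);
        }
    }
}
```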
something definitely needs to be done about the call to sync()
a minimum of a static LL(1) check for the set that keeps the recognizer in the loop
caching results won't be enough
obviously that's something that can come later
to make it clearer that 0 in enterRule(_localctx, 0) has particular meaning, you could instead generate it as enterRule(_localctx, RULE_x)
should consider making 'mode' and 'locals' context-sensitive keywords. already broke grammars for me and it's pretty easy to distinguish the usage when parsing.
should probably use a copy ctor instead of the copyFrom method
that's exactly what a copy ctor looks like
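
for reference, a sketch of what the copy ctor version might look like. SketchRuleContext and SketchExprContext are invented names and the field set is a guess, not the real context class. one side effect is that parent and invokingState can be final, which a copyFrom method can't offer:

```java
// Hypothetical context base class (not ANTLR's ParserRuleContext).
class SketchRuleContext {
    final SketchRuleContext parent;
    final int invokingState;
    Object start;   // stands in for the start token
    Object stop;    // stands in for the stop token

    SketchRuleContext(SketchRuleContext parent, int invokingState) {
        this.parent = parent;
        this.invokingState = invokingState;
    }

    // Copy constructor: replaces a separate copyFrom(other) call after construction.
    SketchRuleContext(SketchRuleContext other) {
        this.parent = other.parent;
        this.invokingState = other.invokingState;
        this.start = other.start;
        this.stop = other.stop;
    }
}

// A more specific context built directly from an existing one.
final class SketchExprContext extends SketchRuleContext {
    SketchExprContext(SketchRuleContext other) {
        super(other);
    }
}
```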