Lexer grammar for floating point, dot, range, time specs


How to lex numbers, dots/periods, and range operators all at the same time.

(... and catch errors with malformed literals.)

The question "How do I lex floating point numbers and single periods/dots when I also need a range operator made of two dots?" comes up so frequently that I decided to publish the lexical rules from the commercial-grade lexer in the JavaFX compiler, with the blessing of Sun Microsystems, as the compiler is open source. See JavaFX for more details of the JavaFX project.

In the source code below, you will see action callouts that produce error messages from your lexer. Programmers new to ANTLR generally ignore error handling, because it is easy to write a rule that matches well-formed input but not so easy to write one that catches badly formed constructs. Catching them is important for your users. Replace these JavaFX compiler specific calls with calls to your own error manager; the errors they report should be obvious. There are also callouts that check numeric ranges and log errors if numbers fall outside the defined ranges. Again, implement your own code here.

The very few action elements here are in Java, but are easily adapted to other targets.

Notes

Note the use of fragment rules with no body to define the token types that the main FLOATING_POINT_LITERAL rule sets.
You may feel that this looks like a complicated rule, but in fact it is very simple as all the possible paths through a literal definition are laid out and you can read it directly, without having to infer any decisions that ANTLR tried to make for you.
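
The attached grammar is not reproduced here, so the following is only a rough hand-written Java sketch of the same decision: after a run of digits, look one character past the '.' to decide whether it starts a fraction or a range operator, and flag a malformed exponent instead of silently failing. Everything in it is hypothetical (the class NumericLexSketch, the methods lex and la, the token names, and the choice to treat "1." as a float); it returns an ERROR token where a real lexer would call its error manager, and it ignores hex, octal, and time literals.

```java
import java.util.ArrayList;
import java.util.List;

class NumericLexSketch {

    /** Scans the input and returns tokens rendered as "TYPE:text". */
    static List<String> lex(String in) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (isDigit(c)) {
                i = skipDigits(in, i);
                // A '.' begins a fraction only when it is NOT followed by a second '.'.
                // This plays the role of peeking two characters ahead in the grammar.
                boolean fraction = la(in, i) == '.' && la(in, i + 1) != '.';
                if (fraction) {
                    i = skipDigits(in, i + 1); // consume '.' and any fraction digits
                }
                String type = fraction ? "FLOAT" : "INT";
                if (la(in, i) == 'e' || la(in, i) == 'E') {
                    i++;
                    if (la(in, i) == '+' || la(in, i) == '-') i++;
                    int expStart = i;
                    i = skipDigits(in, i);
                    // An exponent marker with no digits is a malformed literal.
                    type = (i == expStart) ? "ERROR" : "FLOAT";
                }
                out.add(type + ":" + in.substring(start, i));
            } else if (c == '.') {
                if (la(in, i + 1) == '.') {            // ".." range operator
                    i += 2;
                    out.add("RANGE:..");
                } else if (isDigit(la(in, i + 1))) {   // ".5" style float
                    i = skipDigits(in, i + 1);
                    out.add("FLOAT:" + in.substring(start, i));
                } else {                               // plain dot
                    i++;
                    out.add("DOT:.");
                }
            } else {
                i++;
                out.add("ERROR:" + c); // anything else is outside this sketch
            }
        }
        return out;
    }

    /** Returns the character at index i, or NUL when past the end of input. */
    static char la(String s, int i) { return i < s.length() ? s.charAt(i) : '\0'; }

    static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    /** Returns the index of the first non-digit at or after i. */
    static int skipDigits(String s, int i) {
        while (i < s.length() && isDigit(s.charAt(i))) i++;
        return i;
    }

    public static void main(String[] args) {
        // prints [INT:1, RANGE:.., INT:5, FLOAT:1.5, DOT:., ERROR:1e]
        System.out.println(lex("1..5 1.5 . 1e"));
    }
}
```

Every path through a literal is an explicit branch here, which is the property the note above describes for the grammar rule: nothing about the choice between a float, an integer followed by a range, and a lone dot is left to be inferred.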

numericlex.g
  1. Dec 03, 2008

    Tyro here: Can you also show the version without any "action semantics"? If I want to use this in a tree walker that will be hosted in, say, C#, won't I have to strip out all the Java?

    Learning this stuff, it is not always clear which statements in the grammar specs are used during construction of the lexer/parser algorithms and which would be executed at runtime in the host application.

    I know, with practice, it will become evident, but being able to compare pure and language-specific grammar specs would be very educational...

  2. Dec 04, 2008

    To be honest, part of the point was to illustrate what to do to raise errors from actions. Here are some guidelines though:

    • The action code is always within '{' '}';
    • References that start with a $ are language neutral so $type = XXX; works for all languages;
    • All the rest are just method calls, and the syntax is the same in C#;
    • The apparent misuse of setText here is because this lexer uses a super class that provides these methods;
    • Change the @init block to be compatible with your target language.

  3. Dec 04, 2008

    Forgive me, I am dimwitted: I see that the point of the example is to illustrate in situ error reporting for malformed inputs. However, is the example also the canonical recommended way to write a grammar for parsing numeric literals, regardless of whether one kindly detects malformed input?

    For example, programmatically peeking ahead via input.LA(2) rather than using a pattern matcher that implicitly looks at the character two characters ahead. Is that the preferred style?

    Also, is this compound case/switch rule the recommended approach versus a collection of "token-specific" rules? That is, isn't it more elegant, when possible, to write a separate rule for OCTAL_LITERAL, FLOATING_POINT_LITERAL, HEX_LITERAL, TIME_LITERAL, etc.? The approach shown seems very procedural compared with a declarative approach.

    (I realize that I can't have my Elegance Cake and eat it in the presence of ambiguous sentences...)

  4. Dec 28, 2011

    Hi everybody,

    Though I admire the ANTLR project a lot, am actually using it in some projects, and also *I love* the book, I'm surprised to realize that the bulk of the scanning presented in this article can be solved by mere GNU flex in 1/10 of the lines ... or much less.

    And it's taken me 3 days of intense work to realize this ...

    Or ... what am I missing ... ?

    My conclusion is:

    • If the job involves parsing complexity, then use ANTLR
    • If it's lexer complexity, use any tool strong in regex, such as flex or sed or Java Patterns

    Be sure: I would be glad to be informed that I'm supremely wrong in this regard, and why.

    Cheers

    -- Ariel Tejera
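
For comparison, the regex-based route raised in the last comment can be sketched with java.util.regex as below. This is an illustration under assumptions, not code from the JavaFX lexer or from the commenter: the class RegexLexSketch, the group names, and the exact literal syntax are all hypothetical. The negative lookahead `(?!\.)` is what keeps "1..5" from being read as the float "1." followed by ".5".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexLexSketch {

    // Alternatives are tried left to right, so their order encodes priority:
    // floats before integers, and ".." before a single ".".
    private static final Pattern TOKEN = Pattern.compile(
          "(?<FLOAT>\\d+\\.(?!\\.)\\d*(?:[eE][+-]?\\d+)?"   // 1.5, 1., 1.5e3 (but not 1..)
        +         "|\\.\\d+(?:[eE][+-]?\\d+)?"              // .5, .5e3
        +         "|\\d+[eE][+-]?\\d+)"                     // 1e5
        + "|(?<INT>\\d+)"
        + "|(?<RANGE>\\.\\.)"
        + "|(?<DOT>\\.)"
        + "|(?<WS>\\s+)");

    /** Scans the input and returns tokens rendered as "TYPE:text". */
    static List<String> lex(String in) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(in);
        int pos = 0;
        while (pos < in.length()) {
            m.region(pos, in.length());
            if (!m.lookingAt()) {
                // No alternative matched: report one bad character and move on.
                out.add("ERROR:" + in.charAt(pos));
                pos++;
                continue;
            }
            if (m.group("WS") == null) { // whitespace is skipped
                String type = m.group("FLOAT") != null ? "FLOAT"
                            : m.group("INT") != null ? "INT"
                            : m.group("RANGE") != null ? "RANGE"
                            : "DOT";
                out.add(type + ":" + m.group());
            }
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lex("1..5 1.5 . 1e"));
    }
}
```

In this sketch the well-formed cases fit in one pattern, but the malformed input "1e" comes out as an integer followed by a stray character rather than as a single bad literal. Getting a targeted error message would take extra alternatives that match the malformed forms explicitly, which is the part of the job the article's error callouts address.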