Lexer Rules

A lexer grammar is composed of lexer rules, optionally broken into multiple modes, as we saw in Issuing Context-Sensitive Tokens with Lexical Modes. Lexical modes allow us to split a single lexer grammar into multiple sublexers. The lexer can only return tokens matched by rules from the current mode.

Lexer rules specify token definitions and more or less follow the syntax of parser rules except that lexer rules cannot have arguments, return values, or local variables. Lexer rule names must begin with an uppercase letter, which distinguishes them from parser rule names:

 /** Optional document comment */
 TokenName : alternative1 | ... | alternativeN ;

You can also define rules that are not tokens but rather aid in the recognition of tokens. These fragment rules do not result in tokens visible to the parser:

 fragment HelperTokenRule : alternative1 | ... | alternativeN ;

For example, DIGIT is a pretty common fragment rule:

 INT : DIGIT+ ; // references the DIGIT helper rule
 fragment DIGIT : [0-9] ; // not a token by itself
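
As a slightly larger sketch of the same idea, the hypothetical grammar below shares DIGIT between two token rules. The grammar name Numbers and the rules FLOAT, EXPONENT, and WS are illustrative assumptions, not part of the original text. Only FLOAT, INT, and WS define token types; the fragment rules never reach the parser on their own.

```antlr
lexer grammar Numbers; // hypothetical grammar name

FLOAT : DIGIT+ '.' DIGIT* EXPONENT? ; // e.g. 3.14 or 2.5e-3
INT   : DIGIT+ ;                      // e.g. 42
WS    : [ \t\r\n]+ -> skip ;

fragment EXPONENT : [eE] [+\-]? DIGIT+ ; // helper only, not a token
fragment DIGIT    : [0-9] ;              // helper only, not a token
```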

Lexical Modes

Modes allow you to group lexical rules by context, such as inside and outside of XML tags. It’s like having multiple sublexers, one for each context. The lexer can only return tokens matched by entering a rule in the current mode. Lexers start out in the so-called default mode. All rules are considered to be within the default mode unless you specify a mode command. Modes are not allowed within combined grammars, just lexer grammars. (See grammar XMLLexer from Tokenizing XML.)

 rules in default mode
 ...
 mode MODE1;
 rules in MODE1
 ...
 mode MODEN;
 rules in MODEN
 ...
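
To make the skeleton above concrete, here is a minimal sketch of a two-mode lexer for tag-like markup. The grammar name TagLexer, the mode name INSIDE, and all rule names are assumptions chosen for illustration; this is not the XMLLexer grammar referred to above. It relies on the pushMode and popMode lexer commands to switch modes.

```antlr
lexer grammar TagLexer; // hypothetical grammar name

// Default mode: everything outside of a tag
OPEN : '<' -> pushMode(INSIDE) ; // entering a tag switches modes
TEXT : ~'<'+ ;                   // any run of characters up to the next '<'

mode INSIDE;                     // rules below apply only inside a tag
CLOSE  : '>' -> popMode ;        // leaving the tag returns to the previous mode
SLASH  : '/' ;
EQUALS : '=' ;
STRING : '"' ~'"'* '"' ;
NAME   : [a-zA-Z]+ ;
WS     : [ \t\r\n]+ -> skip ;
```

With this sketch, TEXT can only be produced outside the angle brackets, while NAME, EQUALS, and STRING can only be produced between < and >, because the lexer considers only the rules of the current mode.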

Lexer Rule Elements

Lexer rules allow two constructs that are unavailable to parser rules: the .. range operator and the character set notation enclosed in square brackets, [characters]. Don’t confuse character sets with arguments to parser rules. [characters] only means character set in a lexer. Here’s a summary of all lexer rule elements:

Syntax, followed by its description:

'literal'

Match that character or sequence of characters. E.g., 'while' or '='.

[char set]

Match one of the characters specified in the character set. Interpret x-y as the set of characters between x and y, inclusively. The following escaped characters are interpreted as single special characters: \n, \r, \b, \t, and \f. To get ], \, or - you must escape them with \. You can also use Unicode character specifications: \uXXXX. Here are a few examples:

 WS : [ \n\u000D] -> skip ; // same as [ \n\r]
 ID : [a-zA-Z] [a-zA-Z0-9]* ; // match usual identifier spec
 DASHBRACK : [\-\]]+ ; // match - or ] one or more times

'x'..'y'

Match any single character between range x and y, inclusively. E.g., 'a'..'z' is identical to [a-z].

T

Invoke lexer rule T; recursion is allowed in general, but not left recursion. T can be a regular token or fragment rule.

 
 ID : LETTER (LETTER|'0'..'9')* ;
 fragment LETTER : [a-zA-Z\u0080-\u00FF_] ;

.

The dot is a single-character wildcard that matches any single character. Example:

 
 ESC : '\\' . ; // match any escaped \x character

{«action»}

Lexer actions must appear at the end of the outermost alternative. If a lexer rule has more than one alternative, enclose them in parentheses and put the action afterwards:

 END : ('endif'|'end') {System.out.println("found an end");} ;

The action conforms to the syntax of the target language. ANTLR copies the action’s contents into the generated code verbatim; there is no translation of expressions like $x.y as there is in parser actions.

{«p»}?

Evaluate semantic predicate «p». If «p» evaluates to false at runtime, the surrounding rule becomes “invisible” (nonviable). Expression «p» conforms to the target language syntax. While semantic predicates can appear anywhere within a lexer rule, it is most efficient to have them at the end of the rule. The one caveat is that semantic predicates must precede lexer actions. See Predicates in Lexer Rules.

~x

Match any single character not in the set described by x. Set x can be a single character literal, a range, or a subrule set like ~('x'|'y'|'z') or ~[xyz]. Here is a rule that uses ~[\r\n]* to match any run of characters other than carriage returns and newlines:

 COMMENT : '#' ~[\r\n]* '\r'? '\n' -> skip ;

Just as with parser rules, lexer rules allow subrules in parentheses and the EBNF operators ?, *, and +. The COMMENT rule illustrates the * and ? operators. A common use of + is [0-9]+ to match integers. Lexer subrules can also use the nongreedy ? suffix on those EBNF operators.
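
As a small illustration of the nongreedy suffix, the sketch below matches a C-style block comment. The grammar and rule names are assumptions for illustration. With a greedy .* the loop would run through to the last */ in the input; with .*? it stops at the first one.

```antlr
lexer grammar Nongreedy; // hypothetical grammar name

BLOCK_COMMENT : '/*' .*? '*/' -> skip ; // nongreedy: ends at the first closing */
ID            : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS            : [ \t\r\n]+ -> skip ;
```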

Recursive Lexer Rules

ANTLR lexer rules can be recursive, unlike most lexical grammar tools. This comes in really handy when you want to match nested tokens like nested action blocks: {...{...}...}.

reference/Recur.g4
 lexer grammar Recur;
  
 ACTION : '{' ( ACTION | ~[{}] )* '}' ;
  
 WS : [ \r\t\n]+ -> skip ;

Redundant String Literals

Be careful that you don’t specify the same string literal on the right-hand side of multiple lexer rules. Such literals are ambiguous and could match multiple token types. ANTLR makes this literal unavailable to the parser. The same is true for rules across modes. For example, the following lexer grammar defines two tokens with the same character sequence:

reference/L.g4
 lexer grammar L;
 AND : '&' ;
 mode STR;
 MASK : '&' ;

A parser grammar cannot reference the literal '&', but it can reference the names of the tokens:

reference/P.g4
 parser grammar P;
 options { tokenVocab=L; }
 a : '&' // results in a tool error: no such token
  AND // no problem
  MASK // no problem
  ;

Here’s a build and test sequence:

=> $ antlr4 L.g4 # yields L.tokens file needed by tokenVocab option in P.g4
=> $ antlr4 P.g4
<= error(126): P.g4:3:4: cannot create implicit token for string literal '&'
  in non-combined grammar

Lexer Rule Actions

An ANTLR lexer creates a Token object after matching a lexical rule. Each request for a token starts in Lexer.nextToken, which calls emit once it has identified a token. The emit method collects information from the current state of the lexer to build the token. It accesses fields _type, _text, _channel, _tokenStartCharIndex, _tokenStartLine, and _tokenStartCharPositionInLine. You can set the state of these with the various setter methods such as setType. For example, the following rule turns enum into an identifier if enumIsKeyword is false.

 ENUM : 'enum' {if (!enumIsKeyword) setType(Identifier);} ;

ANTLR does no special $x attribute translations in lexer actions (unlike v3).

There can be at most a single action for a lexical rule, regardless of how many alternatives there are in that rule.
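
The ENUM rule above refers to a field enumIsKeyword and a token type Identifier without showing where they come from. One way they might be supplied, assuming the Java target, is sketched below. The grammar name, the @members declaration, and the Identifier and WS rules are assumptions added for illustration.

```antlr
lexer grammar EnumLexer; // hypothetical grammar name

@members {
    // Assumed flag (Java target); an application could set it before lexing.
    boolean enumIsKeyword = false;
}

// Listed before Identifier so that the input 'enum' matches this rule first;
// the action then retypes the token when enum is not treated as a keyword.
ENUM : 'enum' {if (!enumIsKeyword) setType(Identifier);} ;

Identifier : [a-zA-Z_] [a-zA-Z_0-9]* ;
WS         : [ \t\r\n]+ -> skip ;
```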

Lexer Commands

To avoid tying a grammar to a particular target language, ANTLR supports lexer commands. Unlike arbitrary embedded actions, these commands follow specific syntax and are limited to a few common commands. Lexer commands appear at the end of the outermost alternative of a lexer rule definition. Like arbitrary actions, there can only be one per token rule. A lexer command consists of the -> operator followed by one or more command names that can optionally take parameters:

 TokenName : «alternative» -> command-name
 TokenName : «alternative» -> command-name («identifier or integer»)

An alternative can have more than one command separated by commas. Here are the valid command names:
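
Without attempting a complete list, here is a hedged sketch of how several commands that ANTLR 4 provides (skip, channel, more, pushMode, and popMode) look in a grammar, including one alternative with two comma-separated commands. The grammar, mode, and rule names are assumptions for illustration.

```antlr
lexer grammar Commands; // hypothetical grammar name

LINE_COMMENT : '//' ~[\r\n]* -> channel(HIDDEN) ;      // keep the token, but off the default channel
WS           : [ \t\r\n]+    -> skip ;                 // discard whitespace entirely
LQUOTE       : '"'           -> more, pushMode(STR) ;  // two commands separated by a comma

mode STR;
STRING : '"' -> popMode ; // emitted token text covers the whole quoted string
TEXT   : .   -> more ;    // keep accumulating characters into the current token
```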
