A language tool begins processing by answering the question " is the input consistent with my language?" Somehow the programmer must formally specify the infinite set of sentences, normally with a grammar. The act of recognition involves answering whether or not a grammar can generate that sentence. This lecture provides a complete tour of the common grammar meta-languages in use today either in academic papers or practical tools used by developers. Further, it illustrates how grammars generate languages.
A language can be an infinitely large set of sentences, making it impossible to define languages by simply delineating the sentences. State machine diagrams are easy to understand due to their visual nature, but diagrams are tiresome to build for real languages and very hard for a computer to understand. Further, diagrams are almost free form in that two people could build two different diagrams for the same machine design. We need a better means of formally describing languages.
This lecture is not talking about parsing per se. We are talking about the specification of languages and the relationship between input sentences and grammars.
Introduction
From chapter 2 of the ANTLR book, we generated input
using the following syntax diagrams
statement: 
assignment: 
expr: 
We found that the tree structure of the statement mirrored the call graph

of the following code:
void statement() {
assignment();
System.out.println(";");
}
void assignment() {
System.out.println("x");
System.out.println("=");
expr();
}
void expr() {
System.out.println("0");
}
Another, textual and more formal, version of the syntax diagram is the following grammar notation:
Instead of using a tree structure to specify how a grammar generates an input sentence, we can use a derivation format:
Leftmost derivation because we replace rules from left to right.
What about ID and INT? It turns out that we can use grammar notation to specify what those look like also:
These rules are a special kind of rule called regular. The stuff on the right-hand side is called a regular expression.
Here is what the complete ANTLR grammar looks like in ANTLRWorks:

Formal grammars
A formal grammar, G, is a finite formal description that generates a language, L, over some vocabulary; i.e., G defines the set of valid sentences in L. L(G) is the language generated by G. A grammar is the 4-tuple:
G = (N,T,P,S)
where
N is the finite set of nonterminals (rule names for our purposes)
T is the finite set of terminals (token names)
P is the finite set of productions (rule right-hand-sides or alternatives)
S ∈ N is the start symbol (the goal; the highest level rule such as compilationUnit or dataFile ).
Typically you will see notation where
are strings of both terminals and nonterminals and w is a strings of terminals. That is,
.
Please note that there are an infinite number of grammars capable of generating any particular language. This becomes important later with restricted (i.e., weakened) grammars as you can sometimes rewrite a grammar to be satisfying stronger constraints while generating the same language.
CFG notation
The first meta-language used extensively and the meta-language preferred by computer language theorists is called CFG (context-free grammar) notation (there are a number of dialects). CFG specifications provide a list of rules with left- and right-hand sides separated by a right-arrow symbol. One of the rules is identified as the start rule or start symbol, implying that the overall structure of any sentence in the language is described by that rule. The left-hand side specifies the name of the substructure you are defining and the right hand side specifies the actual structure (sometimes called a production or alternative): a sequence of references to other rules and/or words in the vocabulary. The rules are of the form:
Rulename → finite sequence of rule and token references
For convenience, rule names (nonterminals) are always single uppercase letters and vocabulary words, terminals, are always single lowercase letters. CFG grammars are presumed to specify the syntactic structure not the lexical structure so terminals are token types really that represent a class of input symbols. For example, you could decree that i means "integer". Consequently, to specify that an expression can be an integer, you would write:
E → i
If a structure name can be one of several alternatives, repeat the structure name on the left-hand side. The follow grammar indicates that rule A can be either a or b.
A → a
A → b
The right-hand-side may be empty:
A →
Sometimes you will see empty productions written
A → 
Grammar classification
In the 1950's, Noam Chomsky
defined a classification or hierarchy of grammars
that neatly categorizes the difficulty of describing your language into four categories:
Type-3: Regular grammars; Generates the regular languages, which can be generated/recognized with simple finite state machines without stacks. Productions are of the form:
A → a
or
A → aB
where rule references such as B are only allowed at the extreme right edge of a production and a is a single character. You can have S → ε as long as S is not on the right side of a production.
We use regular expressions in practice because they are easier to specify but all regular expressions are convertible to this form.
All finite languages can be generated by a regular grammar.
Type-2: Context-free grammars. Generates the context-free languages. These languages can all be generated/recognized with pushdown automata. Productions are of the form:
A → α
where α ∈ (N ∪ T)* meaning it is a possibly-empty sequence of terminals and nonterminals.
Type-1: Context-sensitive grammars; grammars where you can restrict validity of applying a rule to a certain context. Generates the context-sensitive languages and has productions of the form:
αAβ → αγβ
where γ is non-empty, but you can have S → ε as long as S is not on the right side of a production.
Essentially α and β specify the context in which rule A is applicable.
In programming languages, you could define a grammar where variables could not be used except in a context where they had been previously defined. Rather than use context-sensitive grammars, we use symbol tables and semantic actions to check the validity of programs.
Type-0: unrestricted; all formal grammars. Generates all languages recognizable by a Turing machine. That is, you can write a program by hand to generate/recognize it even if it's hard to do. Grammars have no restrictions on their form.
Power: Type-3 ⊂ Type-2 ⊂ Type-1 ⊂ Type-0
Type 0 and type 1 languages are not used for computer languages because their power is not needed and no efficient means of generating parsers for them is known. Fortunately, for most things, a context-free grammar immediately presents itself or you can rephrase your problem such that a context-free grammar suffices. Easier for humans to understand type 2 and 3 grammars anyway. In the remaining discussion we will restrict ourselves to context-free grammars and sometimes regular expressions.
Grammars Generate Languages
Recall the maze metaphor. The maze represents the grammar for its language, with the maze entrance representing the start symbol, S. The language is the set of all passphrases collected from the floor along all paths leading to the maze exit. Taking a branch of the maze is like applying a rule in a grammar. Formally, we use notation L(G) to mean the language generated by grammar G.
Similarly, in language theory, L(G) is the set of all possible terminal strings w that you can derive by starting at S and repeatedly applying rules. Applying a rule means replacing a reference to the rule with one of its right-hand-sides. If you can convert α to β by applying rules, we say that α ⇒ β or,
α ⇒* β
where ⇒ means to derive in one step and ⇒* means to derive in zero-or-more steps. ⇒+ means to derive in one-or-more steps. Clearly α derives itself in 0 steps: α ⇒ α.
Consider an expression grammar:
E → E + E
E → E * E
E → i
Rule E can generate sentence i in one step and so we say
E ⇒ i
Rule E can also generate E + E:
E ⇒ E + E
and
E ⇒ E * E
in one step. In two steps, you can derive the following.
E ⇒ E + E ⇒ i + E
And then
E ⇒ E + E ⇒ i + E ⇒ i + E * E ⇒ i + i * i
A partially derived string is called a sentential form and contains both terminals and nonterminals. If
S ⇒* α
then α is a sentential form.
Finally, then, L(G) = {w ∈ T* | S ⇒* w}. That is, L(G) is the set of all sentences that can be reached by a derivation from the start symbol, S.
Is a sentence generated by a grammar?
Given a grammar, you can determine the set of all sentences. Can you go the other way? That is, given a sentence, can you determine if it is a member of L(G)? If you can find a derivation from S to your sentence, then that sentence is in L(G). For example, is abc generated by the following grammar?
A → Bc
B → ab
Yes, the derivation look like this
A ⇒ Bc ⇒ abc
then since
A ⇒* abc
abc ∈ L(G).
This can get very complicated for large grammars. We will address this shortly.
Leftmost and rightmost derivations
Sometimes you have a choice as to which nonterminal to replace in a sentential form during derivation. For example "jim jumps" is in the language generated by the following grammar, but there are multiple derivations because you can choose to replace N first or V first.
S → NV
N → jim
N → ashar
V → speaks
V → jumps
If you choose to replace the leftmost nonterminal, N, first you get:
S ⇒ NV ⇒ jim V ⇒ jim jumps
where sometimes you will see S ⇒lm NV, meaning "derive leftmost". In contrast, if you replace the rightmost nonterminal, V, first you get:
S ⇒ NV ⇒ N jumps ⇒ jim jumps
Important: Every sentence has both leftmost and rightmost derivations. But, while the order of replacement differs, the same rules are applied for each nonterminal! So in that sense the sentence structure is the same and, hence, has the same "meaning" (it's just examined in a different order). For example, N is not ever replaced with ashar instead of jim though N is replaced with jim first for leftmost and second for rightmost derivations.
Either leftmost or rightmost derivation is ok because you can test " jim jumps" for membership in L(G). As a Human this choice does not bother you, however, programs that test for membership (so-called parsers) tend not to be that smart and must choose one derivation or the other; e.g., LL parsers do leftmost derivations and LR parsers do rightmost derivations.
Derivation trees
A derivation tree is a two-dimensional tree that records the derivation from start symbol to sentence. If you are testing a sentence for membership in L(G) rather than generating a sentence, the record of the derivation is called a parse tree.
Interior nodes are nonterminals (rules) and leaves are terminals (tokens).
Derviation trees are insensitive to derivation order (leftmost or rightmost)--you get the same tree regardless. In fact you can think of them as expanding all nonterminals in parallel. The tree only changes if you change which rule you apply in a particular situation.
Begin construction by creating a root node labeled with start symbol, S. Then for each nonterminal leaf node, A, add a child to A for each symbol in α when applying rule A → α.
For example, consider the following simple grammar for assignments and an if-then statement where expressions are just integers.
S → if E then S
S → v = E ;
E → i
The leftmost derivation of "if 1 then x = 4;" is
S ⇒
if E then S ⇒
if 1 then S ⇒
if 1 then v = E ; ⇒
if 1 then v = 4 ;
and the rightmost derivation is:
S ⇒
if E then S ⇒
if E then v = E ; ⇒
if E then v *=4 *; ⇒
if 1 then v = 4 ;
The derivation tree looks like

Notice that the tree has the same form regardless of the order in which the branches are added.
Student question: what does the tree look like for "abbb" given the left-recursive grammar:
S → Sb
S → a
What would happen if you started with the sentence and worked backwards towards the S instead of starting with S?
Ambiguous grammars
As a Human performing a derivation to check w ∈ L(G), you are choosing which nonterminal in a sentential form to replace with which rule right hand side. For a valid sentence, there will normally be exactly one replacement choice that will lead to complete derivation (i.e., you reach a sentential form, w). What happens when you have a choice of rules to apply, both of which yield valid derivations? Your grammar is ambiguous.
Consider a simple expression grammar (the most commonly used example for demonstrating ambiguity):
E → E + E
E → E * E
E → i
Input "3+4*5" yields the following (leftmost) derivation:
E ⇒
E + E ⇒
3 + E ⇒
3 + E * E ⇒
3 + 4 * E ⇒
3 + 4 * 5
with accompanying derivation tree:

This is the usual interpretation of "3+4*5 ": do the "4*5" first.
Unfortunately, I could have chosen a different initial alternative production of E to start out the derivation (again with a leftmost derivation):
E ⇒
E * E ⇒
E + E * E ⇒
3 + E * E ⇒
3 + 4 * E ⇒
3 + 4 * 5

The first rule replacement in the derivation can go two ways legally and, hence, there are two valid derivations resulting in two derivation trees.
Any grammar that can generate the same sentence with more than a single derivation tree, is ambiguous.
There is no algorithm (it is undecidable) to check to see if a CFG is ambiguous. In contrast, for certain subsets of CFGs, you can show that they are unambiguous. Specifically, if you can generate a valid parser from the grammar (we'll see what valid means later).
Abstracting Structure From Exemplars
See if you can analyze a few English sentences to abstract their underlying structure. Proficiency at determining language structure from representative sentences must become second nature to you. Look at the following exemplar sentences that define a subset of English.
John gesticulates
John gesticulates vigorously
The dog ate steak
The dog ate ravenously
The sentences have meaning to you because your brain is subconsciously associating concepts with the words and structuring the words into phrases and groups of phrases, which conveys meaning. Try to do this by abstracting structure from these exemplars. There is always a person or thing (a subject) and a verb describing an action (a verb phrase) and sometimes an object that the subject acts upon. Replace the phrases in the sentences of similar structure with a symbol representing that structure. First, abstract all the subjects:
Subject gesticulates
Subject gesticulates vigorously
Subject ate steak
Subject ate ravenously
Next, abstract the verb phrases:
Subject VerbPhrase
Subject VerbPhrase
Subject VerbPhrase steak
Subject VerbPhrase
Finally, abstract the objects:
Subject VerbPhrase
Subject VerbPhrase
Subject VerbPhrase Object
Subject VerbPhrase
Removing duplicate abstract sentences leaves only two types of sentences and reveals the overall structure of a sentence:
Subject VerbPhrase
Subject VerbPhrase Object
Turning to the substructures, the subject phrases:
are further structured as:
The verb phrases:
gesticulates
gesticulates vigorously
ate
ate ravenously
are structured as:
You can summarize the finite language defined by the exemplars with the following set of syntax rules, or grammar.
- A sentence is a subject followed by a verb phrase optionally followed by an object.
- A subject is a noun optionally preceded by a determiner.
- A verb phrase is a verb optionally followed by an adverb.
- A noun is John or dog.
- A verb is gesticulates or ate.
- An adverb is vigorously or ravenously.
- An object is steak.
- A determiner is The.
The structure of these rules is right, given our intuitive understanding of English, but the rules are actually too loose because they generate more than the four sentences in our language. For example, the rules let you say, "dog gesticulates ravenously". You will encounter this situation frequently as you build grammars. There is usually a tradeoff between writing a natural (that is, readable) description and achieving strict conformance to the language.
In summary, producing a grammar means analyzing the structure of a language in a top down fashion. What name describes the overall structure of an input file? In Java you might call it compilationUnit . What are the next largest substructures and how are those structures broken down? In Java, the next substructure is probably classDefinition and interfaceDefinition , which are in turn broken down into field and method etc... Your goal is to break things down into finer and finer substructures until you arrive at a rule that has no substructures (it only contains tokens).
BNF Notation
Language theorists love CFG notation, but most language reference guides use BNF (Backus-Naur Form) notation, which is really just a more readable version of CFG notation. All rule names are surrounded by <...> and → is replaced with " ::= ". Also, alternative productions are separated by ' | ' rather than repeating the rule name on the left-hand side. BNF is more verbose, but has the advantage that you can write meaningful rule names and you are not constrained vis-a-vis capitalization. Rules are therefore of the form:
The eight rules for the subset of English described previously are written in full English, but you can say the same thing more tersely and precisely as follows using BNF notation:
YACC notation
YACC introduced a simpler derivative without <...> at the cost of using upper/lower case indicator for terminal/nonterminal. YACC notation is important as it was the de facto standard for around 20 years; YACC notation is the progenitor for ANTLR notation. It is a more terse version of BNF.
ANTLR Notation
ANTLR uses YACC notation with extended BNF (EBNF) constructs, borrowed from regular expression notation, for looping and expressing optionality. Further, ANTLR grammars may reference literals (such as keywords) directly as strings in the grammar. In ANTLR notation, you could specify the English subset as follows.
Like YACC, ANTLR also uses the convention that words starting with an uppercase letter are symbols in the vocabulary. For example, to make the English subset even more broad or loose than the original description written in English, you could substitute categories of words for the actual words. With the aid of the EBNF optional construct and given new vocabulary symbols (categories of words in this case) DETERMINER , NOUN , VERB , and ADVERB , the grammar becomes the following.
or
This has the effect of simplifying the grammar as well because you do not have to list all possible nouns and so on.
In a programming language context, you will see things like:
which means that an expression can be any identifier or integer; we lump all identifiers into a category of symbols called ID . Normally the term token type is used to refer to a category.
Let's do a derivation for sample input sentence:
It has the following derivation:
Demos
Start with Meet my little friends, ANTLR and StringTemplate