Introduction
Just as you must learn arithmetic before it is appropriate to use a calculator, you must learn to build language tools from scratch before using a tool such as ANTLR. Most importantly, before embarking on our intrepid journey through abstract language theory and language implementation problems, I want you to have concrete notions and practical experience to associate with the following terms and ideas:* xml* token* scanner* grammar* lookahead* parser* abstract syntax tree, internal representation* tree parser* sentences have nested structure* stacks required for sentence processing
Your goal is to build a series of programs in Java that read in an XML file doing progressively more complicated operations. The first program will read in XML and simply print out the list of tags you encountered. Next, you will ensure that the document is well-formed; that is, that the begin and end tags all match up properly. Then you will ensure that the XML follows a specific format and build a tree data structure. At that point, you will have an in-memory representation of the input file and will repeat the recognition tasks upon the tree--you will apply a grammatical structure to the tree rather than a stream of tokens. Overall the components of this project look like this data-flow diagram:
This project is designed to give you an idea of the mental gymnastics you will encounter in this class and gives you some practice writing code that properly manipulates and analyzes data. Once you recognize the various "software patterns", you will be able to tackle almost any recognition or translation task.
Your program will read in XML that looks like;
<program>
<assign><id>x</id><expr><int>6</int></expr></assign>
<for>
<id>i</id>
<low>1</low>
<hi>10</hi>
<slist>
<call>
<id>print</id>
<expr><add><id>x</id><id>i</id></add></expr>
</call>
</slist>
</for>
</program>
which might represent the following "program" in a fictitious language:
x = 6;
for i=1 to 10 {
print x+i;
}
This project may appear overwhelming at first, but the amount of code you have to write is minimal; I have provided code templates for you to fill in. Further, I have broken down the tasks into very small chunks demonstrating the concepts you must master.
Please begin this project by reading the first part of[Humans
should not have to grok XML
to learn about XML.
You will also find introduction in the ANTLR reference guide useful for this project and as a general aid to help you understand recognizer and translator construction.
Project details
Printing out XML tags
Read in an XML file from standard input and print a list of the tags to standard output. You will be given XML that has no tag arguments so it is pretty easy to distinguish a tag from embedded text "arguments". Please do not return text tokens that are purely whitespace. Your output given the above XML, should be similar to:
[<program>:1]
[<assign>:1]
[<id>:1]
[x:3]
[</id>:2]
[<expr>:1]
[<int>:1]
[6:3]
[</int>:2]
[</expr>:2]
[</assign>:2]
[<for>:1]
[<id>:1]
[i:3]
[</id>:2]
[<low>:1]
[1:3]
...
where Token.toString() prints tokens as [text:token-type] .
You will build two classes: Tags that contains your main() and TagScanner that implements the TokenStream interface:
public interface TokenStream {
public Token nextToken() throws IOException;
}
Printing out the tags means making a loop in main() that repeatedly calls Tags.nextToken() and then asking the Token objects to print themselves out. Note that the text between tags should be collected into tokens and also printed out.
Hint: your scanner, embodied by nextToken() , will look like an if that routes program flow either to something that matches a tag or matches text until the start of a tag:
Here is a TagScanner template:[TagScanner.java|code/templates/xml/TagScanner.java]and here is a suitable[Token.java|code/templates/xml/Token.java].
This entire scanner can be built in less than 100 lines of Java.
Lessons learned* you can break up text files into different types of chunks called tokens* sometimes you want to ignore certain token types* scanners should produce streams of token objects, isolating the implementation from the outside world* the basic structure of a scanner* that you need a character of lookahead to decide what token type your scanner will match on each nextToken() request* how to deal with the end of file condition* scanning a text file to pull out some information is surprisingly easy* a program that reads in another program is no big deal
Checking for well-formedness
Once you have a scanner that emits a stream of Token objects, you can check to ensure the XML input is well-formed. All you have to do is make sure that tags are properly nested such as:
<add><id>x</id><id>i</id></add>
rather than, for example:
<add><id>x</id><id>i</add></id>
Build a class with a main called WellFormed that prints "OK" if the document is well-formed else print an error describing what tag you were expecting and what tag you found such as:
WellFormed: expecting </id>, found </add>
WellFormed: expecting </add>, found </id>
You must also catch missing tags. For input:
you should generate (note the reverse order):
WellFormed: missing end tag for <id>
WellFormed: missing end tag for <add>
Hint: Your main will look very similar to the last one. Use a stack, pushing/popping tags; you can ignore the text tokens.
Lessons learned* divide and conquer strategy of a scanner and structure checker; i.e., don't mess with the scanner when checking structure; scanner only emits a stream of tokens* simple parsing and syntax-directed actions* manual error handling* you need a stack to enforce token dependencies (a certain end tag must follow a certain begin tag)
Applying grammatical structure
In the first part, you broke up the input into chunks called tokens then, in the second part, you checked to see that certain tokens match up (begin with end tags) like curlies in a programing language. While the stack does indeed also enforce a token tag ordering, it does not enforce any *particular*ordering such as that <program> tag must be first. In fact, you could simply count how many of each begin/end tags you encounter to handle token dependencies. Token order dependencies, on the other hand, definitely require a stack to solve; order dependencies typically arise from nested structures. This is pretty easy to see from the XML input:
<program>
<assign><id>x</id><expr><int>6</int></expr></assign>
...
Your brain automatically notices (aided by my use of whitespace) that assignment statements belong inside programs and that assignments consist of an identifier and an expression. The expression consists of a simple integer.
Functionality
In this third part, you will again use a stack, but instead of software stack, you will use the method call hardware stack to check for a particular structure. To make it easier, let's only check a subset of the XML structure. Here is the structure you must enforce:# a *program*is a list of expressions; this is the starting rule# an *expression*can be an expression atom or the addition to two expression atoms# an *expression atom*is either an integer or an identifier
Here is some sample input:
<program>
<expr><int>1</int></expr>
<expr><add><id>x</id><int>6</int></add></expr>
</program>
More formally, in ANTLR EBNF notation:
program : BEGIN_PROGRAM (expression)* END_PROGRAM
;
expression
: BEGIN_EXPR
( BEGIN_ADD expressionAtom expressionAtom END_ADD
| expressionAtom
)
END_EXPR
;
expressionAtom
: BEGIN_INT TEXT_TAG END_INT
| BEGIN_ID TEXT_TAG END_ID
;
You must build a program that verifies the input conforms to the grammar. Print "OK" if it conforms else print an error describing what tag you were expecting and what tag you found. For example, input
<program>
<expr><add><id>x</id><int>6</int></expr>
</program>
might result in:
Exception in thread "main" java.lang.Exception: token mismatch:
expecting </add>; found </expr>
at xml.CodeParser.match(CodeParser.java:29)
at xml.CodeParser.expression(CodeParser.java:50)
at xml.CodeParser.program(CodeParser.java:38)
at xml.CodeParser.main(CodeParser.java:14)
Throw Exception objects to indicate parsing errors.
Implementation
I am providing you with the infrastructure to implement this part because I want you to focus on translating a grammar rule into a parser method.
Scanner
First, you need to distinguish individual tags (such as <program> from <id> ). I have subclassed TagScanner to make a class called CodeScanner . You must do the following to make it work:# add a few extra token definitions to CodeScanner .# change your TagScanner.nextToken() from the previous task to make a call to method getTagType(...) that I have overridden in CodeScanner --it converts token "<id>" to CodeScanner.BEGIN_ID etc...
Here is the CodeScanner template:[CodeScanner.java|code/templates/xml/CodeScanner.java]
Parser
Your CodeScanner will produce a series of specific tokens with different token types for different XML tags. You must now create a parser that verifies the grammatical structure of the token stream. The basic "software patterns" are as follows:* there will be one method in the parser per rule in the grammar* you should always have a lookahead buffer (a single Token object) that knows what you are about to match. This is the only way to make a decision about what is coming down the stream* to "match" a token means that you are expecting _T_and you have found _T_in the lookahead buffer. Once you have matched a token, you must "move forward" by fetching the next lookahead token. See method CodeParser.match() * to match a substructure like expression*within *grammar*just have method grammar() call method expression() * to distinguish between a list of alternative structures within a rule (like *expression atom*from the addition of two *expression atom*s), test the lookahead buffer. If the lookahead is <add> you know you have detected the start of an addition substructure. If the lookahead is <int> or <id> you know you have found an *expression atom
Here is the CodeParser template:[CodeParser.java|code/templates/xml/CodeParser.java]
Lessons learned* you need a stack to solve token order dependencies* sentences are typically built up from nested substructures* methods are a great way to match a substructure in your input; the method call stack conveniently "rememembers" what the necessary token order is* how to build a recognizer for a simple grammar using a divide-and-conquer approach: a set of mutually-recursive methods embodying rules in your grammar
Building trees
In this part, you will build some abstract syntax trees (ASTs) by adding some actions to your parser. But first, you must learn a few things about trees in the parsing world.
ASTs
Sentences are by their very nature nested tree-like structures not clever linear sequences of tokens. Consequently, most often, translators internally represent data in the form of a tree. Trees not only record the input tokens, but also the structure applied to those tokens. For example, as Humans, we know that the flat input string "3+4" is a valid sentence, but a computer only sees three tokens: "3", "+", and "4". The string does does not specify what structure it satisfies. On the other hand, the tree structure:
encodes the fact that the "+" is an operator and the integers are operands. There is no point in parsing if your internal representation does not encode something about the sentence structure (structure implies meaning).
If all tree structures were binary (each node had only two children), then you could use a simple tree node that held token information and two child pointers. In reality, nodes may have an arbitrary list of children so the LISP child-sibling data structure is useful. It has "first child" and "next sibling" pointers--the above tree would look like this:
To retain the structure information but describe the tree linearly in text form, use
ASTs for this project
Your program must accept input of the form:
<program>
<expr><int>1</int></expr>
<expr><add><id>x</id><int>6</int></add></expr>
</program>
and build a tree in memory that looks like:
program
|
expr -- expr
| |
int add
| |
1 id -- int
| |
x 6
Copy your CodeParser to make CodeTree and then modify your parser methods to return an AST :
protected AST program() throws Exception {
AST root = null;
Token program = lookahead;
root = new AST(program);
match(CodeScanner.BEGIN_PROGRAM);
...
return root;
}
Here is the[AST.java|code/templates/xml/AST.java]support code you will need. You may find[What's
the difference between a parse tree and an abstract syntax tree?|http://www.jguru.com/faq/view.jsp?EID=814505]useful.
Print out the tree you get from CodeTree.program() using AST.toStringTree() . For example, the above input will yield:
( [<program>:10]
( [<expr>:16] ( [<int>:14] [1:3] ) )
( [<expr>:16]
( [<add>:18]
( [<id>:12] [x:3] )
( [<int>:14] [6:3] ) )
)
)
which is the LISP first-child-sibling notation for your tree.
The modifications to your parser to generate trees are tricky but only 20 or so lines of code.
Applying grammatical structure to trees
You can consider our original text-based XML input to be a "freeze-dried" or serialized version of an XML tree in memory. You have so far converted it to a token stream, checked that the sequence conforms to a few grammatical rules, and constructed an in-memory tree representation of the input. The point of this next part is that there are many ways to represent the same data, but the strategies and principles of structure still apply (otherwise your data would be just a bunch of random values). To demonstrate this, you will apply the same grammatical rules to the tree rather than the token stream. Your main program will now go through the entire prior AST construction sequence and then hand the resulting tree to the tree parser, effectively walking the input in two passes:
public static void main(String[] args) throws Exception {
CodeScanner scanner =
new CodeScanner(new InputStreamReader(System.in));
CodeTree parser = new CodeTree(scanner);
AST result = parser.program();
System.out.println("parsing: "+result.toStringTree());
CodeTreeParser treeParser = new CodeTreeParser();
treeParser.program(result);
}
To fill out the CodeTreeParser template I provide, you'll again just have to fill in the parsing methods. The differences from CodeParser are that each rule accepts a tree argument that is like an AST "cursor" or lookahead pointer representing a particular substructure to parse. There is no more global lookahead pointer. Also, because we have a two-dimensional structure to chew on rather than a one-dimensional token stream, method match() no longer knows how to consume a token/node. The parsing rules must explicitly indicate whether the cursor should move down or to the right. For example, we know that the BEGIN_PROGRAM node is a root node and everything is a child underneath so after matching it, you move the cursor down via getFirstChild() .
public void program(AST t) throws Exception {
match(t, CodeScanner.BEGIN_PROGRAM);
t = t.getFirstChild();
...
}
Another difference from CodeParser is that you don't have to look for any "end" tags because the tree structure clearly delineates the end of a structure by having no more children or siblings. Only a flat list needs to know when a nested structure is complete.
Here is the CodeParser template:[CodeTreeParser.java|code/templates/xml/CodeTreeParser.java]
To ensure your tree parser catches invalid trees, have your tree parser attempt to parse the tree constructed via:
AST result = new AST(new Token(CodeScanner.BEGIN_PROGRAM,"<program>"));
AST exprRoot = new AST(new Token(CodeScanner.BEGIN_EXPR,"<expr>"));
result.addChild(exprRoot);
exprRoot.addChild(new AST(new Token(CodeScanner.END_EXPR,"</expr>")));
which looks like:
Your tree parser should generate an error message essentially identical to the previous token-based parser:
Exception in thread "main" java.lang.Exception: Expecting expression;
found </expr>
at xml.CodeTreeParser.expression(CodeTreeParser.java:56)
at xml.CodeTreeParser.program(CodeTreeParser.java:34)
at xml.CodeTreeParser.main(CodeTreeParser.java:19)
The reason you cannot pass in the XML text as before is that your token stream based parser would balk while creating this invalid tree--the resulting exception would make your parser bail out before getting to the tree parser.
Lessons learned* you can apply grammatical structure to a tree just as easily as a token stream* tree parsers and token stream parsers look remarkably alike
Submission
You will submit a jar file called xml.jar containing source and *.class files into the submit directory:
/home/submit/cs652/userid
For example, I would submit my project as file:
/home/submit/cs652/parrt/xml.jar
It must include all classes needed to run your various project parts including files:* xml/AST(.java,.class)* xml/CodeParser(.java,.class)* xml/CodeScanner(.java,.class)* xml/CodeTree(.java,.class)* xml/CodeTreeParser(.java,.class)* xml/TagScanner(.java,.class)* xml/Tags(.java,.class)* xml/Token(.java,.class)* xml/TokenStream(.java,.class)* xml/WellFormed(.java,.class)
where the following classes must have main() methods:* xml.CodeParser * xml.CodeTree * xml.CodeTreeParser * xml.Tags * xml.WellFormed
Grading
I will run your project parts via
java -cp /home/submit/cs652/userid/xml.jar xml.class-name < an-input-xml-file
for each class that must have a main() method.
Your grade is from 0..50 points.