Lexer for XML

In this lab, you will read an XML file from standard input and print a list of its tokens (the tags and the text between them) to standard output. The XML you are given has no tag arguments (attributes), so it is easy to distinguish a tag from the text embedded between tags. Do not return text tokens that are purely whitespace.

Requirements

You will read in XML representing the books in the catalog:

Given the above XML, your output should be similar to:

[<catalog>:1]
[<book>:1]
[<id>:1]
[101:3]
[</id>:2]
[<genre>:1]
[Computer:3]
[</genre>:2]
[<author>:1]
[Jim Cortez:3]
[</author>:2]
[<title>:1]
[XML for dummies:3]
[</title>:2]
[<price>:1]
[44.95:3]
[</price>:2]
[<description>:1]
[An in-depth look at creating mashed potatoes
      with XML.:3]
[</description>:2]
[</book>:2]
[<book>:1]
[<id>:1]
[102:3]
[</id>:2]
[<author>:1]
[George Bush:3]
[</author>:2]
[<title>:1]
[I'm the decider:3]
[</title>:2]
[<genre>:1]
[Fantasy:3]
[</genre>:2]
[<price>:1]
[0.95:3]
[</price>:2]
[<description>:1]
[I like milk and cookies.:3]
[</description>:2]
[</book>:2]
[</catalog>:2]

where Token.toString() prints tokens as [text:token-type].
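
As a rough illustration of that format, here is a minimal sketch of a Token class whose toString() produces [text:token-type]. The constant names (OPEN_TAG, CLOSE_TAG, TEXT, EOF) and the EOF value of -1 are assumptions made for this sketch; only the numeric types 1, 2, and 3 are taken from the sample output above.

```java
// Sketch only: constant names and the EOF value are assumptions,
// chosen so that toString() matches the sample output.
class Token {
    static final int EOF = -1;       // hypothetical end-of-input marker
    static final int OPEN_TAG = 1;   // e.g. <book>
    static final int CLOSE_TAG = 2;  // e.g. </book>
    static final int TEXT = 3;       // text between tags

    final int type;
    final String text;

    Token(int type, String text) {
        this.type = type;
        this.text = text;
    }

    // Prints the token as [text:token-type].
    @Override
    public String toString() {
        return "[" + text + ":" + type + "]";
    }

    public static void main(String[] args) {
        System.out.println(new Token(TEXT, "101"));
    }
}
```

With this sketch, new Token(Token.TEXT, "101").toString() yields [101:3].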

You will build two classes: Tags, which contains your main(), and TagScanner, which implements the TokenStream interface:

Printing out the tags means writing a loop in main() that repeatedly calls nextToken() on your TagScanner and then asks each Token object to print itself. Note that the text between tags should be collected into tokens and also printed out. Here is the test rig:
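
The loop described above might be sketched as follows. This is not the official rig, only an illustration: the shape of TokenStream (a single nextToken() method), the Token.EOF marker, the dump() helper, and the canned stand-in stream are all assumptions. In your lab the stream would be your TagScanner reading System.in.

```java
import java.util.Arrays;
import java.util.Iterator;

// Minimal Token for this sketch; EOF = -1 is a hypothetical marker.
class Token {
    static final int EOF = -1;
    final int type;
    final String text;
    Token(int type, String text) { this.type = type; this.text = text; }
    @Override public String toString() { return "[" + text + ":" + type + "]"; }
}

// Assumed shape of the interface; the lab's own definition may differ.
interface TokenStream {
    Token nextToken();
}

class Tags {
    // Pull tokens until the assumed EOF token appears, one token per line.
    static String dump(TokenStream stream) {
        StringBuilder out = new StringBuilder();
        for (Token t = stream.nextToken(); t.type != Token.EOF; t = stream.nextToken()) {
            out.append(t).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the real scanner: replays three canned tokens, then EOF.
        Iterator<Token> canned = Arrays.asList(
                new Token(1, "<id>"), new Token(3, "101"), new Token(2, "</id>")).iterator();
        TokenStream stream = () -> canned.hasNext() ? canned.next() : new Token(Token.EOF, "<EOF>");
        System.out.print(dump(stream));
    }
}
```

The point of the design is that main() only sees the TokenStream interface, so it does not care how the scanner finds its tokens.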

Note: The program should read from standard input, not from a file argument. On Unix, CTRL-D sends the end-of-file signal to standard input; on Windows this is CTRL-Z.

Here is a Token template:

Hint: your scanner, embodied by nextToken(), will look like an if statement that routes program flow either to code that matches a tag or to code that matches text up to the start of the next tag.
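
One way that hint could play out is sketched below, under several assumptions: the scanner reads from a java.io.Reader, keeps a single lookahead character in a field named c, signals end of input with a hypothetical Token.EOF type of -1, and wraps IOException in UncheckedIOException so that nextToken() needs no throws clause. It is an illustration of the structure, not the reference solution, and it does not declare the TokenStream interface because that definition is not reproduced here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Minimal Token for this sketch; type codes 1-3 follow the sample output.
class Token {
    static final int EOF = -1, OPEN_TAG = 1, CLOSE_TAG = 2, TEXT = 3;
    final int type;
    final String text;
    Token(int type, String text) { this.type = type; this.text = text; }
    @Override public String toString() { return "[" + text + ":" + type + "]"; }
}

class TagScanner {
    private final Reader in;
    private int c; // one character of lookahead; -1 means end of file

    TagScanner(Reader in) {
        this.in = in;
        consume(); // prime the lookahead
    }

    private void consume() {
        try { c = in.read(); } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    Token nextToken() {
        while (c != -1) {
            if (c == '<') return tag();              // lookahead says: a tag starts here
            Token t = text();                        // otherwise: text up to the next '<'
            if (!t.text.trim().isEmpty()) return t;  // drop whitespace-only text
        }
        return new Token(Token.EOF, "<EOF>");
    }

    // Match '<' ... '>' and classify by whether the name starts with '/'.
    private Token tag() {
        StringBuilder buf = new StringBuilder();
        while (c != -1 && c != '>') { buf.append((char) c); consume(); }
        if (c == '>') { buf.append('>'); consume(); }
        boolean closing = buf.length() > 1 && buf.charAt(1) == '/';
        return new Token(closing ? Token.CLOSE_TAG : Token.OPEN_TAG, buf.toString());
    }

    // Collect characters until the start of the next tag or end of file.
    private Token text() {
        StringBuilder buf = new StringBuilder();
        while (c != -1 && c != '<') { buf.append((char) c); consume(); }
        return new Token(Token.TEXT, buf.toString());
    }

    public static void main(String[] args) {
        // Demo on a fixed string; the lab itself reads System.in instead.
        TagScanner s = new TagScanner(new StringReader("<id>101</id>"));
        for (Token t = s.nextToken(); t.type != Token.EOF; t = s.nextToken()) {
            System.out.println(t);
        }
    }
}
```

Because the decision in nextToken() depends only on the current lookahead character, the scanner never needs to back up or re-read input.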

This entire scanner can be built in fewer than 100 lines of Java.

Lessons learned

  • you can break up text files into different types of chunks called tokens
  • sometimes you want to ignore certain token types
  • scanners should produce streams of token objects, isolating the implementation from the outside world
  • the basic structure of a scanner
  • why you need one character of lookahead to decide which token type your scanner will match on each nextToken() request
  • how to deal with the end of file condition
  • scanning a text file to pull out some information is surprisingly easy

Submission

You will create a jar file called xml.jar containing source and *.class files and place it in your build directory:

https://www/svn/userid/cs652/xml/build/xml.jar

I will run your code by executing the following:

$ java -cp "xml.jar:$CLASSPATH" Tags < test.xml

You can use the svn account for development of the software too if you would like, but I will only be looking at your jar file in the build directory.

For more information, see svn in CS601. Naturally you will have to substitute cs652 for cs601.
