In this lab, you will read an XML file from standard input and print a list of its tags to standard output. The XML you are given has no tag attributes, so it is easy to distinguish a tag from the text embedded between tags. Do not return text tokens that are purely whitespace.
Requirements
You will read in XML representing the books in the catalog:
<catalog>
  <book>
    <id>101</id>
    <genre>Computer</genre>
    <author>Jim Cortez</author>
    <title>XML for dummies</title>
    <price>44.95</price>
    <description>An in-depth look at creating mashed potatoes
with XML.</description>
  </book>
  <book>
    <id>102</id>
    <author>George Bush</author>
    <title>I'm the decider</title>
    <genre>Fantasy</genre>
    <price>0.95</price>
    <description>I like milk and cookies.</description>
  </book>
</catalog>
Given the above XML, your output should be similar to:
[<catalog>:1]
[<book>:1]
[<id>:1]
[101:3]
[</id>:2]
[<genre>:1]
[Computer:3]
[</genre>:2]
[<author>:1]
[Jim Cortez:3]
[</author>:2]
[<title>:1]
[XML for dummies:3]
[</title>:2]
[<price>:1]
[44.95:3]
[</price>:2]
[<description>:1]
[An in-depth look at creating mashed potatoes
with XML.:3]
[</description>:2]
[</book>:2]
[<book>:1]
[<id>:1]
[102:3]
[</id>:2]
[<author>:1]
[George Bush:3]
[</author>:2]
[<title>:1]
[I'm the decider:3]
[</title>:2]
[<genre>:1]
[Fantasy:3]
[</genre>:2]
[<price>:1]
[0.95:3]
[</price>:2]
[<description>:1]
[I like milk and cookies.:3]
[</description>:2]
[</book>:2]
[</catalog>:2]
where Token.toString() prints tokens as [text:token-type].
You will build two classes: Tags that contains your main() and TagScanner that implements the TokenStream interface:
import java.io.IOException;

public interface TokenStream {
    public Token nextToken() throws IOException;
}
Printing the tags means writing a loop in main() that repeatedly calls nextToken() on a TagScanner and asks each returned Token to print itself. Note that the text between tags should also be collected into tokens and printed. Here is the test rig:
import java.io.*;

class Tags {
    public static void main(String[] args) throws IOException {
        TagScanner scanner = new TagScanner(new InputStreamReader(System.in));
        Token t = scanner.nextToken();
        while ( t.getType()!=Token.EOF_TYPE ) {
            System.out.println(t);
            t = scanner.nextToken();
        }
    }
}
Here is a TagScanner and Token template:
import java.io.*;

class TagScanner implements TokenStream {
    public static final int BEGIN_TAG_TYPE = 1;
    public static final int END_TAG_TYPE = 2;
    public static final int TEXT_TYPE = 3;

    protected Reader reader = null;

    /** Lookahead char */
    protected char c;

    /** Text of currently matched token */
    protected StringBuffer text = new StringBuffer(100);

    public TagScanner(Reader reader) throws IOException {
        this.reader = reader;
        nextChar();
    }

    protected void nextChar() throws IOException {
        // read() returns -1 at end of file; the cast turns that into (char)-1
        c = (char)reader.read();
    }

    public Token nextToken() throws IOException {
        int type = Token.INVALID_TYPE;
        text.setLength(0); // start each token with an empty buffer
        if ( start of a tag ) {
            // scarf until end of tag
            // type is either BEGIN_TAG_TYPE or END_TAG_TYPE
        }
        else if ( end of file ) {
            type = Token.EOF_TYPE;
            text.append("end-of-file");
        }
        else {
            // scarf until start of a tag
            type = TEXT_TYPE;
        }
        if ( just whitespace ) {
            // ignore and get another token
        }
        return new Token(type, text.toString());
    }
}
public class Token {
    public static final int INVALID_TYPE = 0;
    public static final int EOF_TYPE = -1;

    protected String text;
    protected int type;

    public Token(int type, String text) {
        this.type = type;
        this.text = text;
    }

    public String getText() { return text; }

    public int getType() { return type; }

    public String toString() { return "["+text+":"+type+"]"; }
}
Hint: your scanner, embodied by nextToken(), will look like an if that routes program flow either to something that matches a tag or matches text until the start of a tag.
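The routing described in the hint can be sketched as follows. This is only one possible shape, not the reference solution. To stay self-contained, the sketch makes a few assumptions that differ from the template: the class name TagScannerSketch and the scan() helper are hypothetical, the lookahead is held in an int rather than a char so that the -1 returned by read() is unambiguous, and tokens are returned as already formatted strings instead of Token objects.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical name; a sketch of one possible scanner, not the reference solution. */
class TagScannerSketch {
    static final int EOF_TYPE = -1;
    static final int BEGIN_TAG_TYPE = 1;
    static final int END_TAG_TYPE = 2;
    static final int TEXT_TYPE = 3;

    private final Reader reader;
    /** Lookahead, kept as int (an assumption) so read()'s -1 stays distinct from real chars. */
    private int c;

    TagScannerSketch(Reader reader) throws IOException {
        this.reader = reader;
        c = reader.read();
    }

    /** Returns the next token rendered as [text:type]; whitespace-only text is skipped. */
    String nextToken() throws IOException {
        while (true) {
            StringBuilder text = new StringBuilder();
            if (c == -1) {                       // end of file
                return "[end-of-file:" + EOF_TYPE + "]";
            }
            if (c == '<') {                      // start of a tag: scarf through '>'
                while (c != -1 && c != '>') {
                    text.append((char) c);
                    c = reader.read();
                }
                if (c == '>') {
                    text.append('>');
                    c = reader.read();
                }
                boolean isEnd = text.length() > 1 && text.charAt(1) == '/';
                return "[" + text + ":" + (isEnd ? END_TAG_TYPE : BEGIN_TAG_TYPE) + "]";
            }
            while (c != -1 && c != '<') {        // text: scarf until the next tag
                text.append((char) c);
                c = reader.read();
            }
            if (!text.toString().trim().isEmpty()) {
                return "[" + text + ":" + TEXT_TYPE + "]";
            }
            // only whitespace was collected: loop to fetch another token
        }
    }

    /** Collects every token before end-of-file; a convenience for trying the sketch. */
    static List<String> scan(String xml) throws IOException {
        TagScannerSketch scanner = new TagScannerSketch(new StringReader(xml));
        List<String> tokens = new ArrayList<>();
        String eof = "[end-of-file:" + EOF_TYPE + "]";
        for (String t = scanner.nextToken(); !t.equals(eof); t = scanner.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        for (String t : scan("<book> <id>101</id>\n</book>")) {
            System.out.println(t);
        }
    }
}
```

Running main() prints [<book>:1], [<id>:1], [101:3], [</id>:2] and [</book>:2], one per line. The whitespace between tags is collected like any other text and then discarded, so the caller never sees it.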
This entire scanner can be built in fewer than 100 lines of Java.
Lessons learned
- you can break up text files into different types of chunks called tokens
- sometimes you want to ignore certain token types
- scanners should produce streams of token objects, isolating the implementation from the outside world
- the basic structure of a scanner
- you need one character of lookahead to decide which token type your scanner will match on each nextToken() request
- how to deal with the end of file condition
- scanning a text file to pull out some information is surprisingly easy
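The points about lookahead and end of file can be illustrated with a small sketch. It assumes the template's approach of casting the result of read() to char, in which case the -1 returned at end of file becomes the char value '\uFFFF'. The names LookaheadDemo, EOF_CHAR, and classify are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical demo class: shows how one lookahead char selects the token type. */
class LookaheadDemo {
    /** The char that results from casting read()'s end-of-file value, -1. */
    static final char EOF_CHAR = (char) -1; // same value as '\uFFFF'

    /** Reads one lookahead char and reports which kind of token would come next. */
    static String classify(Reader reader) throws IOException {
        char c = (char) reader.read();
        if (c == EOF_CHAR) return "EOF";
        if (c == '<') return "TAG";
        return "TEXT";
    }

    public static void main(String[] args) throws IOException {
        System.out.println(classify(new StringReader("<id>")));
        System.out.println(classify(new StringReader("101")));
        System.out.println(classify(new StringReader("")));
    }
}
```

This prints TAG, TEXT, and EOF. The cast loses the distinction between end of file and a real U+FFFF in the input; that code point is a Unicode noncharacter, so the ambiguity is unlikely to matter for this lab, but keeping the lookahead in an int would avoid it entirely.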