Dashboard > USF Computer Science 652 - Programming Languages > ... > Projects > Parsing Graphviz DOT files the hard way
  USF Computer Science 652 - Programming Languages Log In | Sign Up   View a printable version of the current page.  
  Parsing Graphviz DOT files the hard way
Added by Terence Parr, last edited by Terence Parr on Jan 28, 2008  (view change)
Labels: 
(None)

Introduction

In this project, you will learn to parse DOT files and also to build trees representing that input. You will build everything by hand, without the use of language tools such as ANTLR, so that you can learn about the infrastructure required to build a language application. The project may seem large at first, but I have provided a framework for you to fill in; there is more thinking than coding. You have to see the pattern, but once you get it, you can solve the problem easily.

Let's begin with a series of small DOT files and the resulting visual representation as built by DOT/Graphviz.

digraph tree {
0 -> "1" [label = "A"];
0 -> "2" [label = "B"];
}
digraph tree {
rankdir=LR;
0 -> "1" [label = "A"];
0 -> "2" [label = "B"];
}
digraph trees {
  subgraph t {
    0 -> "1" [label = "A"];
    0 -> "2" [label = "B"];
  }
  subgraph u {
    Animal -> Cat [label = "feline"];
    Animal -> Dot [label = "canine"];
  }
}

Tasks

You have three subprojects you need to build:

  1. a scanner or lexical analyzer for DOT that breaks up the input stream of characters into vocabulary symbols called tokens: DOTScanner.java
  2. a DOT parser that recognizes the grammatical structure, producing errors upon syntactically invalid input: DOTParser.java
  3. an extension of the DOT parser that builds an abstract syntax tree (AST) and returns it to the calling program: DOTTree.java.

In the attachments to this page, you'll find AST.java, TokenStream.java, and Token.java support code as well as the framework for each of the above mentioned files.

To test each of the tasks, notice that there is a main() in each of them so you can simply run Java on that class. For example here is the meat from DOTScanner's main:

DOTScanner scanner = new DOTScanner(new InputStreamReader(System.in));
Token t = scanner.nextToken();
while ( t.type!=Token.EOF_TYPE ) {
	System.out.println(t);
	t = scanner.nextToken();
}

When run on a.dot above, you should get the following output:

$ java DOTScanner < a.dot
[digraph:11]
[NFA:1]
[{:4]
[0:2]
[->:9]
["1":3]
[[:7]
[label:1]
[=:10]
["A":3]
[]:8]
[;:6]
[0:2]
[->:9]
["2":3]
[[:7]
[label:1]
[=:10]
["B":3]
[]:8]
[;:6]
[}:5]
$

Running the DOTParser should simply print OK if there are were no syntax errors:

$ java DOTParser < a.dot
OK
$

On the other hand, if you type in some erroneous input, you should see an exception:

$ java DOTParser
I like peanut butter
<CTRL-D>
Exception in thread "main" java.lang.Exception: token mismatch: expecting token type 11; found I
        at DOTParser.match(DOTParser.java:40)
        at DOTParser.digraph(DOTParser.java:48)
        at DOTParser.main(DOTParser.java:128)
$

Note: CTRL-D sends the end of file signal to the standard input; on a PC this is CTRL-Z.

Your the final task involves building ASTs for the input files as shown in the section below "Building Abstract Syntax Trees":

java DOTTree < a.dot
(digraph NFA (CLUSTER (-> 0 "1" (PROPS (= label "A"))) (-> 0 "2" (PROPS (= label "B")))))
$

As you modify the framework I provide, you should spend time thinking about how it is constructed. The code represents a quintessential recursive descent parser and handbuilt lexer. The code is very clean and easy to understand. From this, you should try to identify the relationship between the grammar provided in this assignment and code the you build. For example, you will note that for every rule in the grammar there is a method in the parser.

Simplified DOT syntax

digraph
cluster
subgraph
property
edge
nodelabel
value
optionlist

Simplified DOT grammar

digraph
	:	'digraph' ID cluster 
	;
	
cluster
	:	'{' ( property ';' | edge ';' | subgraph )* '}'
	;

subgraph
	:	'subgraph' ID cluster
	;
	
property
	:	ID '=' value
	;
	
edge:	nodelabel '->' nodelabel optionlist?
	;
	
nodelabel
	:	value
	;

value
	:	ID
	|	STRING
	|	INT
	;

optionlist
	:	'[' property (',' property)* ']'
	;

ID	:	('a'..'z'|'A'..'Z')+
	;

INT	:	'0'..'9'+
	;
	
STRING
	:	'"' .* '"'
	;

WS	:	(' '|'\t'|'\r'|'\n')+ {skip();}
	;

Parse tree

This section shows the relationship between an input file and how the DOT grammar interprets it. The interior nodes of a parse tree are rule names and the leaves are input symbols.

digraph tree {
0 -> "1" [label = "A"];
0 -> "2" [label = "B"];
}

Building Abstract Syntax Trees

Building ASTs is a matter of having each rule return an appropriate structure for the tokens matched and the rules referenced within. The AST returned from the starting role is the completed history. (Ignore the nil nodes on the ASTs displayed below).

AST toStringTree()
thumbnail!
digraph t { }
(digraph NFA
  (CLUSTER
    (-> 0 "1" (PROPS (= label "A")))
    (-> 0 "2" (PROPS (= label "B")))
  )
)
(digraph NFA (CLUSTER
 (= rankdir LR)
 (-> 0 "1" (PROPS (= label "A")))
 (-> 0 "2" (PROPS (= label "B")))))
(digraph trees (CLUSTER
  (subgraph t (CLUSTER 
    (-> 0 "1" (PROPS (= label "A")))
    (-> 0 "2" (PROPS (= label "B")))))
  (subgraph u (CLUSTER
    (-> Animal Cat (PROPS (= label "feline")))
    (-> Animal Dot (PROPS (= label "canine")))))))

There are 3 basic operations you will need when building a tree:

  1. creating a root node from a token (the token is considered the payload)
    AST root = new AST(tokens[p]);
  2. adding a child created from the token
    root.addChild(tokens[p]); // add current token tokens[p] as child
    match(DOTScanner.ID);
  3. adding a subtree return from a rule reference
    AST t = value();
    root.addChild(t);

Then, each method in your parser returns root.

The only unusual thing is the use of two imaginary tokens. An imaginary token is one that does not correspond to any real input symbol. In this case I create two imaginary tokens, CLUSTER and PROPS, to act as roots for a collection of nodes. CLUSTER is a better node name than '{', so I use that. I use the PROPS imaginary token as a root to hold the list of properties matched as part of an option list. Imaginary nodes are created like this:

AST root = new AST(new Token(PROPS,"PROPS"));

There are samples in the framework I provide.

Submission

You will create a jar file called dot.jar containing source and *.class files and place in your lib directory:

https://www/svn/userid/cs652/proj1/trunk/lib

You can use the svn account for development of the software to if you would like, but I will only be looking at your dot.jar file in the lib directory.

For more information, see svn in CS601. Naturally you will have to substitute cs652 for cs601.

Grading

You will be graded quickly in class interactively. We will move the class to the CS labs before the end of class in order to do the grading. Generally the grading will occur on the day the project is due.

Site powered by a free Open Source Project / Non-profit License (more) of Confluence - the Enterprise wiki.
Learn more or evaluate Confluence for your organisation.
Powered by Atlassian Confluence, the Enterprise Wiki. (Version: 2.5.1 Build:#806 May 06, 2007) - Bug/feature request - Contact Administrators