Dashboard > ANTLR 3 > ... > Using ANTLR with XML > Parsing XML
  ANTLR 3 Log In | Sign Up   View a printable version of the current page.  
  Parsing XML
Added by Oliver Zeigermann, last edited by Terence Parr on Jan 24, 2008  (view change)
Labels: 
(None)

Parsing XML

This is a tutorial about parsing XML using ANTLR 3. If you do not know what ANTLR 3 is, have a look at this introduction. XML has been chosen as an example as it should be known to wide range of developers. Additionally, it isn't all that easy to parse, but then also not too difficult. 

Caution: This is an example given for educational reasons. In the magnitude of all cases, you'll be off far better by using an XML parser such as Xerces, expat etc. or an XML binding tool such as XML beans. If you are not sure whether you should write a parser of your own, have a look at section 5.

Parsing XML using traditional parsing techniques is tricky, most of all because XML is a sick language. Depending on the context different characters are either keywords or simple text. Given an XML fragment like:

<tag attribute="value">

The '=' has a special meaning as well as the '>' while this is of course not the case in plain text. Additionally, newlines and other white spaces inside tags have to be ignored, but are part of the text outside of them.

For further investigations we restrict ourselves to a simple subset of XML. It will have the same complexity as XML seen from the perspective of parsing, but will make the examples much simpler. It is restricted to tags and parsable text (PCDATA) - no XML declaration, not doctype, no DTD or XSD, no processing instructions, no CDATA sections, no namespaces. We require that the reader has at least basic knowledge of XML. For a short introduction, you should have a look at this XML in 10 points article. And here is something more complete: W3C XML Recommendation.

 Note: Full sources are available as an attachment to this page.

Overview 

This tutorial is devided into multiple parts. Each part explains about the tasks of one of ANTLR'3 parser types. There is an optional excursion that tries to shed some light onto how ANTLR 3's parsing strategy works. If you are not interested in theory you can safely omit it.

Parser Type Overview 

Type
Task / Function
Input Structure
Output Structure
Lexer
  • condense multiple characters into tokens
  • check input for lexcial correctness
  • strip off unwanted white space
Character Stream (one dimension)
Token Stream (one dimension)
Parser
  • mimic inherent token structure in rule structure
  • check input for syntactical (structural) correctness
  • optionally execute actions
  • optionally generate an AST (stripping off unwanted tokens / adding condensing tokens)
Token Stream (one dimension)
optional AST (two dimensions)
Tree Parser
  • traverse an AST
  • optionally execute actions
  • optionally generate an AST (stripping off unwanted tokens / adding condensing tokens)
AST (two dimensions)
optional AST (two dimensions)

Sections

Site powered by a free Open Source Project / Non-profit License (more) of Confluence - the Enterprise wiki.
Learn more or evaluate Confluence for your organisation.
Powered by Atlassian Confluence, the Enterprise Wiki. (Version: 2.5.1 Build:#806 May 06, 2007) - Bug/feature request - Contact Administrators