by Jimmy Zhang, founder, Ximpleware. Intel Corp.
Behind closed doors or in design meetings, how many times have you heard people complain about the performance of XML? You are not alone. In fact, performance, along with the verbosity of XML, motivated the World Wide Web Consortium's (W3C) decision to charter its XML Binary Characterization Working Group last year.
XML is designed to be the data exchange format of the Internet, and verbosity is a cost that comes with XML's many core benefits, such as being semi-structured, openly interoperable, and human-readable. Fortunately, both bandwidth and storage are getting cheaper and more abundant by the day, so what was considered problematic this time last year may be a less significant issue today, and may not be a problem at all next year.
Compared with verbosity, the performance of XML is a very different issue. Consider this: XML doesn't run, since it doesn't have legs or wheels; XML doesn't really execute either, because it is not an ".exe" file. In practice, software developers always choose a processing model to read, write, or update XML data. So strictly speaking, performance is not a problem of XML per se, but a problem of XML processing models. It is therefore necessary to understand some of the technical issues with the current XML processing models, namely the Document Object Model (DOM) and the Simple API for XML (SAX).
Technical analysis of DOM and SAX
DOM is an in-memory, tree-based XML processing API designed to be both platform-neutral and language-neutral. Using DOM, a developer can create an XML document, navigate its structure, and add, modify, or delete its elements. A very important concept in DOM is its node interface model: every data object in DOM's hierarchical representation implements the Node interface. Because DOM loads everything into memory and provides a hierarchical view of XML data, developers often find it an easy and natural way to work with XML. Its weakness is that building a DOM tree is usually quite slow and consumes a lot of memory (typically 5 to 10 times the size of the XML document).
The root cause of DOM's performance problem turns out to be precisely the weak point of object-oriented programming (OOP). Creating a DOM tree involves dynamically allocating a large number of objects, which is quite expensive in most modern OOP languages. Even worse, these objects must be garbage collected when they go out of scope. Allocating many objects also leads to significant memory bloat once the per-object memory overhead is added up.
SAX, on the other hand, was invented to circumvent DOM's wasteful memory usage. To accomplish this, SAX exposes a low-level tokenizer directly to the calling application. By itself, a SAX parser doesn't build any in-memory representation of the XML document, so its memory usage is low and doesn't grow with the size of the document. Unfortunately, SAX is difficult to use. Because it builds no tree, developers often have to adapt their applications to an event-driven style to accommodate the parsing routine, which burdens them with code that is verbose and hard to maintain.
For applications that require repeated access to the data elements in XML, SAX also introduces a dilemma: either code the application to scan the XML document multiple times, or build in-memory structures yourself. Although the raw performance of SAX is very good, scanning the document multiple times significantly reduces its performance edge over DOM. If you take the other option, you will find yourself building a DOM anyway. So why not just use a DOM in the first place?
To summarize, the performance of an XML processing model should not come at the expense of usability. In a random-access API such as DOM, both the slow parsing and the memory overhead result from allocating many objects.
Why is it important to understand the problems of DOM and SAX? Because problem solving is often about finding alternative ways to accomplish the same task. As you will see below, many of the steps and design decisions that make up traditional XML processing techniques have simple, though sometimes non-obvious, alternatives that can be combined to achieve the same goal with better results.
Understand VTD-XML
A quick introduction
VTD-XML is a new, open source, Java-based XML processing API that overcomes many problems of the current XML processing models. The project is currently hosted on SourceForge, and a demo is available to acquaint you with the basic concept. It is probably not a stretch to claim that VTD-XML is designed from the ground up, as it introduces a number of optimization techniques starting from the very first step: tokenization.
Offset and length-based "non-extractive" tokenization
Traditionally, as the first step, the XML parser takes the input XML document apart into many tokens containing the relevant text data. The parser then either passes those tokens to the user as SAX events, or builds an in-memory, object-based hierarchical data structure from them. After tokenization, the parser usually discards the input source document. These tokens, also known as strings, are typically null-terminated arrays of characters allocated during parsing. This style of tokenization is called "extractive" tokenization, as the parser actually extracts the text content from the input document into dynamically created strings.
However, there is a simple alternative way to tokenize: describe each token using only its starting offset and its length. This approach requires that the original source document be kept intact in memory and undecoded. Essentially, the parser treats the XML document as a large token bucket, and creates a map detailing the locations of the tokens in the XML. This style of tokenization is "non-extractive", as the parser leaves the token content as-is in the source document.
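To make the idea of a token map concrete, here is a minimal sketch in C. It is an illustration only, not VTD-XML's implementation: the `token_ref` type and the `map_start_tags` function are hypothetical names, and the scanner is deliberately simplified (it records only start-tag names and ignores comments, CDATA sections, and processing instructions).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token descriptor: it records where the token lives in the
   source buffer instead of holding a copy of its characters. */
typedef struct {
    size_t offset;  /* index of the token's first byte in the document */
    size_t length;  /* number of bytes in the token */
} token_ref;

/* Returns 1 if c ends an element name inside a start tag. */
static int ends_name(char c)
{
    return c == '>' || c == '/' || c == ' ' ||
           c == '\t' || c == '\n' || c == '\r';
}

/* Records the name of every start tag in doc as an offset/length pair.
   Simplification: comments, CDATA sections, and processing instructions
   are not handled. Returns the number of entries written to out. */
size_t map_start_tags(const char *doc, token_ref *out, size_t max_tokens)
{
    assert(doc != NULL && out != NULL);
    size_t len = strlen(doc);
    size_t n = 0;

    for (size_t i = 0; i + 1 < len && n < max_tokens; i++) {
        if (doc[i] != '<')
            continue;
        char next = doc[i + 1];
        if (next == '/' || next == '!' || next == '?')
            continue;  /* end tag, declaration, or PI: not a start tag */

        size_t start = i + 1;
        size_t end = start;
        while (end < len && !ends_name(doc[end]))
            end++;

        if (end > start) {
            /* No characters are copied; only the location is stored. */
            out[n].offset = start;
            out[n].length = end - start;
            n++;
        }
    }
    return n;
}
```

Calling `map_start_tags` on a document held in memory fills the caller's array with locations only; the document itself is never modified or duplicated.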
The best way to illustrate how this "non-extractive" style of tokenization works is to compare it to traditional "extractive" tokens in some common usage scenarios:
- String comparison: With extractive tokenization, developers use some flavor of C's "strcmp" function (in <string.h>) to compare an "extractive" token against a known string. With "non-extractive" tokenization, developers simply use C's "strncmp" function (also in <string.h>), starting at the token's offset and bounded by the token's length.
- String to numerical data conversion: Other frequently used functions, such as "atoi" and "atof", can be revised to work with non-extractive tokens. One possible change is to the function signature. For example, "atoi" takes a character string as its input. To make a non-extractive equivalent, simply create a "new-atoi" that accepts three arguments: the source document (a pointer to char), the offset (of type int), and the length (of type int). The difference in implementation is mostly in dealing with the new string/token representation (e.g. the end of the string is no longer marked by \0).
- Trim: Removing the leading and trailing white space of a "non-extractive" token only requires changing the values of the offset and length. This is usually simpler than with the extractive style of tokenization, which often involves creating new strings.
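The three scenarios above can be sketched in C as follows. This is one possible rendering under stated assumptions, not code from VTD-XML: the names `token_equals`, `new_atoi`, and `token_trim` are hypothetical, the document is assumed to be a single-byte-per-character buffer, and `new_atoi` does not guard against integer overflow.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if the token doc[offset .. offset+length) equals known. */
int token_equals(const char *doc, int offset, int length, const char *known)
{
    assert(doc != NULL && known != NULL && offset >= 0 && length >= 0);
    /* strncmp alone would match the token "pri" against the known string
       "price", so the lengths are compared as well. */
    return strlen(known) == (size_t)length &&
           strncmp(doc + offset, known, (size_t)length) == 0;
}

/* Non-extractive counterpart of atoi: the token ends at offset + length
   rather than at a '\0'. Overflow is not handled in this sketch. */
int new_atoi(const char *doc, int offset, int length)
{
    assert(doc != NULL && offset >= 0 && length >= 0);
    int i = offset;
    int end = offset + length;
    int sign = 1;
    int value = 0;

    while (i < end && isspace((unsigned char)doc[i]))
        i++;
    if (i < end && (doc[i] == '+' || doc[i] == '-')) {
        if (doc[i] == '-')
            sign = -1;
        i++;
    }
    while (i < end && isdigit((unsigned char)doc[i])) {
        value = value * 10 + (doc[i] - '0');
        i++;
    }
    return sign * value;
}

/* Trims white space by adjusting offset and length in place; the document
   is untouched and no new string is created. */
void token_trim(const char *doc, int *offset, int *length)
{
    assert(doc != NULL && offset != NULL && length != NULL);
    while (*length > 0 && isspace((unsigned char)doc[*offset])) {
        (*offset)++;
        (*length)--;
    }
    while (*length > 0 && isspace((unsigned char)doc[*offset + *length - 1]))
        (*length)--;
}
```

For the document `<price>  -42 </price>`, the element name is the token at offset 1 with length 5, and the text content is the token at offset 7 with length 6. Passing the latter to `new_atoi` converts it without copying, and `token_trim` narrows it to the three characters of `-42`.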
Use integers, not objects
Once it is clear that offset/length-based tokenization works, the next question becomes, "Do tokens have to be objects?" The most basic function of an object is to bind multiple member variables/fields together into a single entity. In object-oriented languages, the notion of an object also includes support for access control and member methods. However, recalling our earlier discussion of the issues with DOM, allocating a lot of objects is bad for both performance and memory usage. So it would be nice to avoid objects as well.
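One way to bind an offset and a length together without an object is to pack both into a single integer. The sketch below assumes a hypothetical layout (offset in the low 32 bits, length in the high 32 bits) chosen purely for illustration; it is not presented as VTD-XML's actual record format, and the names `packed_token`, `pack_token`, and `table_put` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: low 32 bits = offset, high 32 bits = length. */
typedef uint64_t packed_token;

/* Binds offset and length into one integer value. */
packed_token pack_token(uint32_t offset, uint32_t length)
{
    return ((uint64_t)length << 32) | (uint64_t)offset;
}

/* Extracts the offset from the low 32 bits. */
uint32_t token_offset(packed_token t)
{
    return (uint32_t)(t & 0xFFFFFFFFu);
}

/* Extracts the length from the high 32 bits. */
uint32_t token_length(packed_token t)
{
    return (uint32_t)(t >> 32);
}

/* Stores a token in a preallocated table: the whole token map is one
   array, so there is no per-token allocation. */
void table_put(packed_token *table, size_t capacity, size_t index,
               uint32_t offset, uint32_t length)
{
    assert(table != NULL && index < capacity);
    table[index] = pack_token(offset, length);
}
```

Because every token is a plain 8-byte integer, a table of tokens is a single contiguous array that can be allocated once and released once, which sidesteps the per-object overhead discussed above.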