Improve XML processing with VTD-XML
-page-->

Luckily, there exists a simple alternative to objects that can bind together multiple variables into a single entity by using integers. In fact, the physical representation of objects and integers are quite similar: They both are small memory blocks filled with bits. As to the binding part, one can actually split all the bits into different groups to represent different variables. For example, a 32-bit register can bind together four 8-bit integers, two 16-bit integers, or even 32 single-bit Booleans. Notice that this particular use of integers is quite common in the X86 architecture. For example, various segment registers and processor state registers in the Intel® Pentium® processors are 64-bit in size and in which a number of sub-variables are defined.

VTD-XML internally stores and represents tokenized form of XML based on a "non-extractive" binary encoding specification called Virtual Token Descriptor (VTD). VTD records are 64-bit integers that encode starting offsets, lengths, types and nesting depths of tokens in XML. At the bit layout level, a VTD record is defined as follows:

  • Starting offset: 30-bits (b29 ~ b0)
  • Length: 20-bits (b51 ~ b32)
  • For some token types:
    • Prefix length: 9-bits (b51~ b43)
    • Qname length: 11-bits (b42 ~ b 32)
  • Nesting Depth: 8-bits (b59~b52). Maximum value is 2^8-2 = 254
  • Token type: 4-bits (b63~b60)
  • Reserved bit: 2-bits (b31: b30) are reserved for a tri-state variable marking namespaces.
  • Unit: Because the processing model doesn't decode XML, the unit for offset and length are in raw characters of the transformation format. For UTF-8 and ISO-8859, length and offset are in bytes. They are in 16-bit words for UTF-16. Figure 1 below depicts the bit-level layout of a VTD record.

element directory-based hierarchical representation
Figure 1. Element Directory-Based Hierarchical Representation

In DOM, everything (whether it is an element, text, CDATA, or attribute) is a node, and the in-memory representation consists of a large number of node objects stitched together using pointers. By treating everything as an instance of a node object incurs penalties on both the performance and memory usage. This is a big reason why DOM is so slow. Consider this: Neither a text node nor an attribute node has any children. So including those in the hierarchy is not only performance adverse, but also overdone.

To provide random access, VTD-XML adopts an element-based directorial approach by using location caches. The basic idea of location caches is similar to the index section of a book, except that location caches are nested directories. Location caches only treat elements (effectively the VTD index value for starting tokens) as the constituent of hierarchy. A location cache entry is a 64-bit integer containing two 32-bit fields: One for global VTD index; the other for relative index of the next level location caches. The project web site of VTD-XML provides an in-depth description of location caches along with the behavior description of VTD Navigator.

How VTD-XML solves XML's performance Issue
High performance in software

When implemented in Java, VTD-XML achieves the performance level of SAX with NULL content handler, while consuming memory typically between 1.3 and 1.5 times the size of an XML document, which is about 1/3 to 1/5 of a DOM tree.

Why can VTD-XML achieve both high performance and low memory usage? The key is reducing object creation. In many virtual machine (VM) based OOP languages, per-object allocation incurs a small amount of memory overhead. VTD records are immune to the overhead because they are integers, not objects. In addition, because VTD records are constant in length, they can be stored in large memory blocks, which are more efficient to allocate and GC. For example, by allocating a large array for 4096 VTD records, one incurs the per-array overhead (16 bytes in JDK 1.4) only once across 4096 records, thus reducing per-record overhead to be very little. As to the performance, using less memory means less cost in both object allocation and garbage collection. In fact, VTD-XML's high performance is a by-product of its efficient memory usage.

Custom hardware implementation
As more Web Service applications exchange XML messages over the network, it is increasingly important for network appliances to process, filter, and secure those messages at very high speed. A closely related topic is porting XML processing on a chip. It is worth mentioning that VTD-XML is designed as a processing model capable of custom hardware implementation that can achieve Gigabit performance levels. While an in-depth explanation is outside the scope of this article, the basic idea is this: VTD turns XML processing into a problem very close to DES encryption, which is well-known to be able to achieve high performance when implemented in custom hardware. The article, XML on a Chip, discusses this topic in detail.

Binary-enhanced XML
Because both VTD and location caches are based on 64-bit integers, VTD-XML's internal representation of parsed XML is inherently persistent, and can be saved on disk or sent across the network to eliminate the overhead of parsing completely. In other words, VTD, along with location caches and the XML document itself, provides a simple and purposeful way to index XML. More importantly, indexing XML this way incurs no loss of benefits inherent to XML, such as human readability, as VTD-XML maintains XML intact in memory.

Other benefits of VTD-XML
Incremental update

Consider modifying the text content of the following XML file.

<color> red </color>

Using DOM, it would require at least the following three steps: build the DOM tree, navigate to and then update the text node, write the updated structure back into XML. So no matter how trivial the modification is, there is a round trip penalty of parsing the document and writing it back out. What if this is only a snippet buried within a big document? Wouldn�t it be nice to be able to surgically remove then insert the update "in-place?"


Subscribers who liked this article also read:
Multi-Core: Intel's new processor architecture explained
by Andrew Binstock, principal analyst, Pacific Data Works LLC. L...
Web services extend high-performance computing grid capabilities
by Matt Gillespie, technical author. Intel Corp. Grid computing bas...

If you're interested in this topic, these articles may be helpful:

Sorting effectively in C#
by Larry Mak This tutorial is generally about sorting in Java&mda...
A framework-based approach to real-time development with UML
by Ran Rinat, I-Logix Inc. The emergence of the UML as an industry ...
XML serialization in C#
by Andrew Ma Object serialization is an important topic which is...
The Ajax transport method: There's more to Ajax than XMLHttp
Discover three Ajax data transport mechanisms (XMLHttp, script tags, a...
Determine the correct XML parser type for a Java application
by Padma Apparao, senior performance architect, Software Solutions Gro...

Related Jobs:

Software, Test and Hardware Design Engineer - WA - Seattle - PHYTEC America LLC
Software Design (40%): Software development for PHYTEC Single Board Co...
Lead Software Development Engineer #138713 - WA - Redmond - Microsoft Corporation
Webdata XML, the team that brought you the core XML components relied ...
Technical Editor #132558 - WA - Redmond - Microsoft Corporation
The Exchange User Experience team is looking for an experienced techni...
Web Application Developer #976 - CA - North Hollywood - RCG Information Technology
Description: All applicants must have a minimum of 3 years IT Industry...
Senior Software Engineer (Web or Windows) #3025086 - CA - Dublin - Ajilon
Description : MULTIPLE OPENINGS. SR. SOFTWARE ENGINEER - ON WEB OR WI...
Senior Software Engineer - Offline/Backend Data Applications #947F7B63F3D43C28 - CA - Oakland - Ask Jeeves,Inc.
Title Senior Software Engineer - Offline/Backend Data Applications C...
Technical Writer - CA - Brisbane - WILY Technology, Inc.
Technical Writer Wily Technology is seeking a motivated Technical Wri...
SAS Data Analyst/Progrmmer - Lead #613 - NJ - Princeton - RCG Information Technology
Location: Princeton,NJ Description: All applicants must have a mi...
Quality Engineer #05-515-000827 - GA - Atlanta - EarthLink, Inc.
Quality Engineer 05-515-000827 posted 12/22/05 Requirements...
Web Application Developer #953 - CA - Hollywood - RCG Information Technology
Description: All applicants must have a minimum of 3 years IT Indus...