Tag Archives: xml

Data representation

Data can be represented as text for humans or in binary form for machines. Here I will focus on the text representation.

For applications, we commonly use XML because:

  1. Its self-documenting format describes structure and field names as well as specific values, and it is easily read by both humans and machines.
  2. It is platform-independent, so it is relatively immune to changes in technology and facilitates data exchange across heterogeneous systems.
  3. It supports Unicode.
  4. It can represent common computer science data structures: records, lists and trees.
  5. It allows validation using schema languages such as DTD and XSD. XSDs are far more powerful than DTDs in describing XML languages. They use a rich datatyping system, allow for more detailed constraints on an XML document’s logical structure, and must be processed in a more robust validation framework.
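
Point 5 can be tried out with the validation API that ships with the JDK (javax.xml.validation). The following is only a minimal sketch: the person schema, the class name XsdValidationSketch and the sample documents are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class XsdValidationSketch {
    // Hypothetical schema: a <person> with a string name and an integer age.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"person\">"
      + "<xs:complexType><xs:sequence>"
      + "<xs:element name=\"name\" type=\"xs:string\"/>"
      + "<xs:element name=\"age\" type=\"xs:int\"/>"
      + "</xs:sequence></xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    /** Returns true if the XML text is well-formed and valid against XSD. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no custom ErrorHandler set, validation errors are thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // malformed or invalid document
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The datatype system catches the non-numeric age in the second document.
        System.out.println(isValid("<person><name>Ann</name><age>30</age></person>"));     // true
        System.out.println(isValid("<person><name>Ann</name><age>thirty</age></person>")); // false
    }
}
```

A DTD could require that age is present, but it could not express that its content must be an integer. That is the kind of constraint the richer XSD datatyping adds.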

With all of these advantages, XML quickly became the standard for data exchange, especially in the web services world. However, XML also has disadvantages: it is verbose, and its hierarchical model is limited in comparison to an object-oriented graph.

Other options:

  1. XML vs JSON - JSON is now more attractive than XML for the kind of data interchange that powers Web-based mashups, gadgets and widgets. Why? Look at the articles below:
    • Fixing AJAX: XMLHttpRequest Considered Harmful – You don’t see many AJAX examples that access third-party web services like Amazon, Yahoo and Google. That is because all the newest web browsers impose a significant security restriction on the use of XMLHttpRequest: you aren’t allowed to make an XMLHttpRequest to any server except the one your web page came from. If you attempt to do so, XMLHttpRequest will either fail or pop up warnings, depending on the browser you are using… Solution: Application Proxy, Apache Proxy or the Script Tag Hack (On-demand Javascript).
    • JSON vs XML: Browser Security Model – This article comments on the solutions proposed above. It argues that the Script Tag approach is better than a proxy.
    • JSON and Yahoo!’s Javascript API – This article gives an example of how to use the Script Tag approach to communicate with the Yahoo web service API and bypass the XMLHttpRequest restriction. The way to bypass the XHR restriction is to not use XHR at all: the cross-site requests are made by adding script tags to a document’s HEAD with DOM methods (i.e. .appendChild())
    • Is JSON better than XML (a good objective review)
    • In conclusion, JSON lets you use the Script Tag approach to bypass the XHR security restriction because JSON itself is part of JavaScript. That is what makes JSON popular.
  2. YAML as an alternative for data serialization
  3. Java Serialization converts objects to a binary representation (and brings versioning headaches). XStream is a simple library to serialize objects to XML and back again.
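
To illustrate option 3, here is a minimal sketch of built-in Java serialization. The Point class and the class name JavaSerializationSketch are invented for illustration. The serialVersionUID field is where the versioning headache shows up: if the class later changes and the stamp stored in old data no longer matches, reading that data fails with an InvalidClassException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class JavaSerializationSketch {
    // Hypothetical value class used only for this illustration.
    static final class Point implements Serializable {
        // Version stamp written into the stream and checked on read.
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    /** Object -> binary representation. */
    static byte[] toBytes(Point p) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(p);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    /** Binary representation -> object. */
    static Point fromBytes(byte[] bytes) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Point) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new Point(3, 4));
        Point copy = fromBytes(bytes);
        System.out.println(bytes.length + " bytes, x=" + copy.x + ", y=" + copy.y);
    }
}
```

Unlike XML, the resulting bytes are not meant to be read or edited by a human, and they are tied to the Java class definition.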

For machines, data is represented in binary format:

  1. The art of assembly language (a free book that you can read online)

Evolution of XML parsing technologies

Introduction

A few years ago there were two main XML parsing technologies: SAX and DOM.

  1. SAX is event-driven: events are fired and forgotten as the XML is parsed. Advantages: it doesn’t need to hold the whole XML document in memory, and you don’t need to wait until the whole document has been parsed before the first event is emitted. Disadvantages: it uses a push API that keeps control during parsing, so clients cannot control the parsing, and it is not a good fit for XML manipulation.
  2. DOM converts the XML into an object tree in memory before manipulation. Advantages: the XML is easier to manipulate. Disadvantages: it uses a lot of memory, which makes it a poor choice for documents larger than a few MB or for memory-constrained environments such as J2ME.
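
The contrast between the two can be sketched with the parsers bundled in the JDK. The class name SaxVersusDomSketch, the method names and the sample order document are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxVersusDomSketch {
    /** SAX (push): the parser calls the handler; we only note names as they stream by. */
    static List<String> elementNamesWithSax(String xml) {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                names.add(qName); // fired by the parser; we cannot ask it to pause
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException(e);
        }
        return names;
    }

    /** DOM: the whole tree is built first, then it can be navigated and modified. */
    static int countAfterAppending(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // In-memory edit: add one more element under the root.
            doc.getDocumentElement().appendChild(doc.createElement(tag));
            return doc.getElementsByTagName(tag).getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<order><item>pen</item><item>ink</item></order>";
        System.out.println(elementNamesWithSax(xml));
        System.out.println(countAfterAppending(xml, "item"));
    }
}
```

The SAX method never holds more than the current event, but it has nothing to edit. The DOM method can append a node only because the full tree is already in memory.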

A pull API is a more comfortable alternative for streaming XML processing. It is based on the familiar iterator design pattern rather than the observer design pattern. In a pull API, the client program asks the parser for the next piece of information rather than the parser telling the client program when the next datum is available. In a pull API the client program drives the parser; in a push API the parser drives the client. That led to the invention of StAX.
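
A minimal sketch of the pull style, using the StAX cursor API from the JDK (javax.xml.stream). The class name StaxPullSketch, the method firstText and the sample document are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxPullSketch {
    /**
     * Returns the text of the first element named wanted, or null if there is none.
     * Assumes the wanted element contains text only (no child elements).
     */
    static String firstText(String xml, String wanted) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next(); // the client asks for the next event
                    if (event == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals(wanted)) {
                        // Returning here simply stops the parse; nobody calls us back.
                        return reader.getElementText();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<order><item>pen</item><item>ink</item></order>";
        System.out.println(firstText(xml, "item")); // pen
    }
}
```

The while loop is the iterator pattern mentioned above: the client decides when to call next() and when to stop, which a SAX handler cannot do.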

In this article, I will introduce a new object model from Axis2 named AXIOM, which uses StAX underneath for XML parsing. With it, XML parsing uses less memory and gives the client better control.

Evolution of Axis

One of the first-generation SOAP engines, Apache SOAP, used a DOM-based object model internally to represent the XML document, which forced the entire XML object model to be built at once. The second-generation Apache Axis shifted to SAX to avoid keeping the complete information in memory. SAX, however, has a major constraint: it is built around a "push" technique, and once the parsing of the XML document starts it cannot be stopped. To get over this hurdle, Apache Axis had to record SAX events. So, effectively, the XML message still had to be kept in memory in the form of SAX events, which made Apache Axis yet another memory-intensive programming model.

Axis2 avoids keeping the complete SOAP message in memory by introducing a new object model for representing the SOAP message: AXIOM. AXIOM takes a dramatically new approach. Although AXIOM has an "external" resemblance to DOM, the difference is that it generates objects only when required. This "on-demand building" feature gives AXIOM the edge needed to overcome the memory barrier that early SOAP engines failed to pass.

An interesting feature of AXIOM is that it is based on Pull parsing. It is capable of generating pull events from the Object Model that is built. Further, if the Object Model happens to be half built, AXIOM is capable of shifting to the underlying pull parser to generate pull events directly from the stream. The heart of AXIOM is the XML Pull parser since it is the only parsing model that supports the pausing of the parsing process. AXIOM uses the Streaming API for XML (StAX), making it easy to manipulate and utilizing only a fraction of the memory used by a conventional object model. Combined with the speed of the streaming pull parser, AXIOM pushes Axis2 leaps ahead of its predecessors in terms of efficiency and speed.
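
The idea of on-demand building on top of a pull parser can be sketched in a few lines. This is a hypothetical illustration only: it does not use the AXIOM API or show how AXIOM is implemented, and the class name DeferredChildrenSketch and its methods are invented. It materializes the names of the root element's children only as far as the caller asks for them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical illustration of deferred building; not the AXIOM API. */
final class DeferredChildrenSketch {
    private final XMLStreamReader reader;
    private final List<String> built = new ArrayList<>(); // child names built so far
    private int depth = 0;        // current element nesting depth in the stream
    private boolean done = false; // true once the stream is exhausted

    DeferredChildrenSketch(String xml) {
        try {
            reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Name of the index-th child of the root, pulling more events only if needed. */
    String childName(int index) {
        while (built.size() <= index && !done) {
            pullOneEvent();
        }
        return index < built.size() ? built.get(index) : null;
    }

    /** How many children have been materialized so far. */
    int builtCount() {
        return built.size();
    }

    private void pullOneEvent() {
        try {
            if (!reader.hasNext()) {
                done = true;
                return;
            }
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
                if (depth == 2) { // a direct child of the root element
                    built.add(reader.getLocalName());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        DeferredChildrenSketch doc =
            new DeferredChildrenSketch("<env><header/><body><op/></body></env>");
        System.out.println(doc.childName(0) + ", built so far: " + doc.builtCount());
    }
}
```

Asking for the first child pulls events only up to that child, so the rest of the message stays unparsed in the stream. This works only because a pull parser can be paused between events.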

Apart from the new parser, Axis2 also has other new features. They are:

  1. Pluggable Data Binding – you can pick and choose among JAXB, Castor and XMLBeans for XML-to-Java conversion.
  2. Improved Support for Message-style interaction (RPC vs Message-based)
  3. Improved handlers

This article focuses on parsing technology, so I will not discuss the new Axis2 features in detail. If you want to find out more, read this.

 

Reference

An Introduction to StAX

Fast and lightweight object model for XML

 

 
