2009-09-29

XML processing in Java

One of the things most Java developers tackle on a daily basis is dealing with XML. Even though XML takes a lot of criticism and newer formats like YAML are emerging and becoming more popular, you cannot avoid XML: it's too widespread and it's used everywhere. It's the main format for interchanging data between systems and even between people. There are plenty of fat books that show how to use various XML APIs and libraries to handle the beast with all its standards and extensions. There are many solid tools that have been continuously developed for years by large communities (Xalan, Xerces, JDOM, DOM4J).

And yet XML processing in Java is still a major pain in the ass.
I see two reasons for that:
  1. XML is too bloated as a format.
  2. Java libraries that deal with XML are bloated. That's natural, because they simply try to implement the specifications.

Let's say you have a Java application which receives some data in the form of simple XML:

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <entry id="1">entry number one</entry>
  <entry id="2">entry number two</entry>
</data>
Your application has this class:

public class Entry {
    private int id;
    private String content;
    //the usual setters and getters here
}
If you wanted to parse this XML into Entry objects in Java, you would usually do something like this:

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

//...inside some method:
try {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    //load the whole file into a DOM tree
    Document doc = db.parse(new File("data.xml"));
    NodeList nl = doc.getElementsByTagName("entry");
    for (int i = 0; i < nl.getLength(); i++) {
        Entry entry = new Entry();
        entry.setId(Integer.parseInt(nl.item(i).getAttributes()
                .getNamedItem("id").getNodeValue()));
        entry.setContent(nl.item(i).getTextContent());
        System.out.println(entry);
        //do real stuff
    }
} catch (final Exception e) {
    System.out.println("Failed parsing: " + e);
    //do real handling
}
Expected output:

Entry:{id: 1; content: entry number one}
Entry:{id: 2; content: entry number two}
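For anyone who wants to run the DOM snippet above without a data.xml file on disk, here is a minimal self-contained sketch. It reads the same XML from a string instead of a file. The class name DomEntryDemo, the parseEntries helper and the toString() format are my assumptions (the toString() is simply written to match the expected output), and it assumes Java 7 or newer for StandardCharsets and the diamond operator.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class, a sketch rather than the original application code.
public class DomEntryDemo {

    // Same fields as the Entry class in the post; toString() is an assumption
    // chosen so that the printed lines match the expected output.
    public static class Entry {
        private int id;
        private String content;

        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public String getContent() { return content; }
        public void setContent(String content) { this.content = content; }

        @Override
        public String toString() {
            return "Entry:{id: " + id + "; content: " + content + "}";
        }
    }

    // Parses every <entry> element of the given XML text into an Entry.
    public static List<Entry> parseEntries(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = db.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nl = doc.getElementsByTagName("entry");
            List<Entry> entries = new ArrayList<>();
            for (int i = 0; i < nl.getLength(); i++) {
                // getElementsByTagName only returns elements, so the cast is safe
                Element element = (Element) nl.item(i);
                Entry entry = new Entry();
                entry.setId(Integer.parseInt(element.getAttribute("id")));
                entry.setContent(element.getTextContent());
                entries.add(entry);
            }
            return entries;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers get one failure type
            throw new IllegalArgumentException("Failed parsing: " + e, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<data>"
                + "<entry id=\"1\">entry number one</entry>"
                + "<entry id=\"2\">entry number two</entry>"
                + "</data>";
        for (Entry entry : parseEntries(xml)) {
            System.out.println(entry);
        }
    }
}
```

Casting to Element and calling getAttribute("id") is a slightly shorter route to the same value as getAttributes().getNamedItem("id").getNodeValue().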
In Java 6 DocumentBuilderFactory.newInstance() will usually return an instance of this implementation: com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl 
This is Xerces embedded into the JRE. What's wrong with that? First, it's a huge library with a big memory footprint. It will also be outdated compared to what you can get from the official homepage, so if you want the latest version with all the bug fixes, you have to add another megabyte of jars to your project, set a system property (javax.xml.parsers.DocumentBuilderFactory) to change the default implementation, and hope your code still works. Then you have to know DOM. You have to use an ugly indexed for loop to iterate over the results instead of doing it right (for (Node n : doc.getElementsByTagName("entry")) { ... }), because NodeList does not implement Iterable.
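One common workaround for the missing for-each support is a thin read-only List view over the NodeList. The sketch below is only an illustration: NodeLists, asList and texts are hypothetical names of mine, not part of the JDK or of any library mentioned here.

```java
import java.io.StringReader;
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the JDK or of any library named in the post.
public class NodeLists {

    // Read-only List view over a NodeList, so it can be used in a for-each loop.
    public static List<Node> asList(final NodeList nodes) {
        return new AbstractList<Node>() {
            @Override
            public Node get(int index) {
                Node node = nodes.item(index); // item() returns null when out of range
                if (node == null) {
                    throw new IndexOutOfBoundsException("Index: " + index);
                }
                return node;
            }

            @Override
            public int size() {
                return nodes.getLength();
            }
        };
    }

    // Collects the text content of every element with the given tag name.
    public static List<String> texts(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> result = new ArrayList<String>();
            for (Node n : asList(doc.getElementsByTagName(tag))) {
                result.add(n.getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers get one failure type
            throw new IllegalArgumentException("Failed parsing: " + e, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<data><entry id=\"1\">entry number one</entry>"
                + "<entry id=\"2\">entry number two</entry></data>";
        for (String text : texts(xml, "entry")) {
            System.out.println(text);
        }
    }
}
```

It works, but it is exactly the kind of boilerplate you should not have to write just to loop over a few tags.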
Even though Java aims to be loosely coupled, so that you can code against the API and switch implementations, keep in mind that APIs change over time and implementations behave differently. I have seen legacy code with sick things like DocumentBuilderFactoryImpl factory = (DocumentBuilderFactoryImpl) DocumentBuilderFactory.newInstance();. I have seen Axis fail to parse complex SOAP messages after a switch to a different, newer JDK. I have seen third-party software vendors start cursing when you change your web service implementation and your WSDL is generated with minor cosmetic differences (e.g. xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" instead of the previous xmlns:s="http://schemas.xmlsoap.org/wsdl/soap/"). In all these cases the APIs and implementations failed to do what they were meant to do. Of course, everything can be fixed, but it takes time and nerves, and both are precious.
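When you suspect that a different parser got picked up after a JDK or classpath change, a quick check like the following can save some guessing. It is only a sketch: WhichParser and factoryClassName are names I made up, and the output depends on your JDK and on what is on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical diagnostic class, not taken from any library.
public class WhichParser {

    // Returns the class name of the factory that the JAXP lookup selects.
    public static String factoryClassName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // Prints the system property override, or "null" when none is set
        System.out.println("override: "
                + System.getProperty("javax.xml.parsers.DocumentBuilderFactory"));
        System.out.println("factory:  " + factoryClassName());
    }
}
```

With a standalone Xerces jar on the classpath, running it with -Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl should switch the reported factory to that class.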
XML processing in Java is terrible, and the worst part is having to go through all this just to parse a simple piece of data. Why couldn't it be as simple as this:

for (XmlSlicer piece : XmlSlicer.cut(data).getTags("entry")) {
    //each piece is: <entry id="...">...</entry>
    Entry entry = new Entry();
    entry.setId(Integer.parseInt(piece.getTagAttribute("entry", "id")));
    entry.setContent(piece.get("entry").toString());
    System.out.println(entry);
    //do real stuff
}
After getting fed up with Java's great XML APIs and libraries, I made a small tool for simple daily work with XML files.
The code above works with XML Zen - a small, lightweight XML processing library that supports ~1% of what other XML processing libraries can do. However, this 1% of functionality is what you use 90% of the time. There are no big APIs, just simple, logic-driven, object-oriented processing of XML strings. And it's just a little over 10 KB.
You can add the XML Zen dependency with Maven; just set up the dev.java.net repo first:

<repositories>
    <repository>
        <id>maven2-repository.dev.java.net</id>
        <name>Java.net Repository for Maven</name>
        <url>http://download.java.net/maven/2</url>
    </repository>
    <!-- other repositories -->
</repositories>
Then the dependency:

<dependency>
    <groupId>com.googlecode.xmlzen</groupId>
    <artifactId>xmlzen</artifactId>
    <version>0.1.1</version>
</dependency>
That's it, you are ready to go. And if you need to build XML and XML Zen is not enough for your needs, check out this great project: http://java.ociweb.com/mark/programming/WAX.html.

4 comments:

  1. Have you tried http://www.xom.nu/ ?
    Also, I think you would like the StAX standard parsers.

    Writing your own XML parser is a terrible case of NIH.

  2. @pukomuko
    I haven't tried xom.nu, I'll take a look. StAX is not bad, but it works on the iterator principle. Why iterate over everything when you can just grab what you need? I'm not competing with Xerces and the like, I just made something for my own pleasure that I can use for non-enterprise home projects, like parsing a game configuration... :)

  3. XML processing doesn't have to be painful, have you heard of VTD-XML?

  4. YAML? Can you validate YAML against anything?
