Blag

He's not dead, he's resting

Why must all XML APIs suck?

Given the following:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd">
<pkgmetadata>
    <herd>blah</herd>
    <herd>foo</herd>
    <maintainer>
        <email>foo@bar</email>
        <name>Foo Bar</name>
    </maintainer>
    <maintainer>
        <email>bar@baz</email>
    </maintainer>
    <use>
        <flag name="foo">Adds support for foo. Needs <pkg>cat/fooplugin</pkg> to be useful.</flag>
        <flag name="bar">Adds support for bar.</flag>
    </use>
    <longdescription><![CDATA[
        A giant space monkey has eaten my shorts.
        ]]></longdescription>
    <longdescription lang="fr"><![CDATA[
        Un singe géant de l'espace a mangé mes shorts.
        ]]></longdescription>
</pkgmetadata>

I want the following:

  • A set of strings called herds, containing blah and foo.
  • A set of pair(string, string) called maintainers, containing ("foo@bar", "Foo Bar") and ("bar@baz", "").
  • A map from string to string called use, containing ("foo" => "Adds support for foo. Needs cat/fooplugin to be useful.") and ("bar" => "Adds support for bar.").
  • A string called longdescription containing "A giant space monkey has eaten my shorts.".

What’s the least painful way of doing it? Why can’t there be a solution concise enough to fit into a comment? Why must XML blow so many goats?

14 responses to “Why must all XML APIs suck?”

  1. zong_sharo November 4, 2008 at 11:38 pm

    Why must XML blow so many goats?
    dunno, xml euphoria – maybe.

    pypi, cpan, hackage – none of them need this crap. Praise the DSLs.

  2. Ciaran McCreesh November 4, 2008 at 11:45 pm

    Whilst DSLs make it easier to extract the result into the DSL, they make it harder to transfer the results from the DSL into the main program. I’ve yet to see a DSL where the benefit outweighs the cost.

  3. Hypnos November 5, 2008 at 2:16 pm

    (Ack, screwed up the markup — a little ironic. Take two …)

    Two possible strategies:

    * XPath: Run a query for nodes in your XML document, and get back data (usually from an iterator) that you package yourself. XQilla, an open source implementation, is in the Gentoo main repo.

    * XML data binding: Run a preprocessor on a schema/DTD to get a class with autogenerated parser, getter and setter methods with meaningful names. Then you can instantiate this class with every XML document you read.

    Good luck!
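For what it's worth, the XPath strategy can be sketched in a few lines. This is a minimal sketch in Python rather than against libxml2 or XQilla, assuming the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and does not fetch the external DTD. The names `METADATA`, `normalise` and `parse_metadata` are made up for illustration, and picking the `longdescription` without a `lang` attribute is an assumption about what the post wants.

```python
import xml.etree.ElementTree as ET

# The sample document from the post, unchanged.
METADATA = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pkgmetadata SYSTEM "http://www.gentoo.org/dtd/metadata.dtd">
<pkgmetadata>
    <herd>blah</herd>
    <herd>foo</herd>
    <maintainer>
        <email>foo@bar</email>
        <name>Foo Bar</name>
    </maintainer>
    <maintainer>
        <email>bar@baz</email>
    </maintainer>
    <use>
        <flag name="foo">Adds support for foo. Needs <pkg>cat/fooplugin</pkg> to be useful.</flag>
        <flag name="bar">Adds support for bar.</flag>
    </use>
    <longdescription><![CDATA[
        A giant space monkey has eaten my shorts.
        ]]></longdescription>
    <longdescription lang="fr"><![CDATA[
        Un singe géant de l'espace a mangé mes shorts.
        ]]></longdescription>
</pkgmetadata>
"""


def normalise(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def parse_metadata(xml_text):
    """Return (herds, maintainers, use, longdescription) for one document."""
    # Encode first so the parser honours the encoding declaration itself.
    root = ET.fromstring(xml_text.encode("utf-8"))

    herds = {(h.text or "").strip() for h in root.findall("herd")}

    # A missing <name> becomes "" via findtext's default.
    maintainers = {
        (m.findtext("email", default="").strip(),
         m.findtext("name", default="").strip())
        for m in root.findall("maintainer")
    }

    # itertext() flattens nested elements such as <pkg> into plain text.
    use = {
        f.get("name"): normalise("".join(f.itertext()))
        for f in root.findall("use/flag")
    }

    # Assumption: the wanted description is the one with no lang attribute.
    longdescription = ""
    for d in root.findall("longdescription"):
        if d.get("lang") is None:
            longdescription = normalise("".join(d.itertext()))
            break

    return herds, maintainers, use, longdescription


herds, maintainers, use, longdescription = parse_metadata(METADATA)
```

The flattening of `<pkg>` inside `<flag>` is the fiddly part: `.text` alone stops at the first child element, so the sketch joins `itertext()` instead.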

  4. Ciaran McCreesh November 5, 2008 at 2:28 pm

    XPath’s what I’m going to end up using. The problem is, libxml2 is a horrible API, and it takes way too much work to get things done. I’ve not played with XQilla, primarily because it’s not GPL2 compatible, but if it ends up looking better I might have to work out a way around that.

    And so far as I can see, XML data bindings tend to assume schema, but all I have is a DTD…

  5. thewtex November 5, 2008 at 2:38 pm

    here is a link to a dtd2xsd converter
    http://w3.org/2000/04/schema_hack/dtd2xsd.pl

    here is a link to an XML data binding tool. only played around with a little bit, but it is very cool.
    http://www.codesynthesis.com/products/xsd/

  6. John Snelson November 13, 2008 at 2:37 pm

    XQilla uses the Apache Licence v2, which is GPL compatible AFAIK. Let me know how you get on if you decide to use it.

  7. Ciaran McCreesh November 13, 2008 at 2:48 pm

    The Apache v2 licence isn’t GPL2 compatible. That isn’t really a problem though, since I highly doubt any of the Paludis copyright holders would mind adding an optional linking exception… The big problem with XQilla is that it doesn’t use namespaces. It dumps a load of commonly named functions (and silly things like X) into the global namespace, which is extremely bad form. The not so big problem with XQilla is that it tries very hard to fetch the DTD, which puts it outside acceptable performance boundaries for what we need.

    I’ve gone with libxml2 and XPath. Whilst it’s way more painful than it should be, it does the job.

  8. John Snelson November 14, 2008 at 12:09 pm

    It’s simple enough to implement an XMLEntityResolver to control how XQilla resolves its URIs.

    I’m not a fan of XQilla not being in a namespace – but that’s how I inherited the code, and it’s just waiting for someone willing to add the namespace declarations :-S. I’m not sure my perl munging skills are up to the task…

  9. Ciaran McCreesh November 15, 2008 at 6:16 pm

    The thing is… I don’t want the DTD resolved at all. It’s irrelevant, and resolving it would mean I’d have to do some messy on-disk cache for it.

    As for namespaces… You can’t add namespaces to a package without a massive ABI break, and any package that does massive ABI breaks without some kind of slotting is one we can’t use.

  10. John Snelson November 26, 2008 at 10:22 am

    You can resolve the DTD to an empty document, which is the same as not resolving it. Of course if you do that you risk getting incomplete or invalid documents, since the document could rely on the entity declarations and default attribute values in the DTD.

    XQilla maintains ABI across patch releases – the same policy that Xerces-C has. Maintaining ABI is very hard and often not worth the trouble – which is why a large number of libraries don’t do it.

  11. nunojob December 7, 2008 at 6:16 am

    Perl + regular expressions
    easy, fast :)

  12. El Guapo December 21, 2008 at 6:01 pm

    Of course regular expressions can parse XML. Well, not by themselves, but with say, Perl, you could do it. Not that you would want to, though!

    Anyway, I found your blog by typing “XML data binding sucks”. This tells you that I share your frustrations and as I need this for Java, I actually have a wealth of stuff to choose from. They all still have their drawbacks though. So much overhead to read a configuration file, albeit a rather complex one.

    All I need is an object to hold some Maps and Lists that relate certain items together. I’m considering switching the XML out for JSON and trying that. The available libraries appear much simpler.

    • Ciaran McCreesh December 21, 2008 at 6:50 pm

      Can’t! Regular expressions can’t parse a^n b^n. Even with the fancy Perlish extensions any kind of non-trivial nesting is too complicated, and that’s before you start considering messy things like cdata. The supporting code ends up being even larger than using one of those horrible libraries.

      And JSON isn’t a solution. JSON is just a different flavour of doing it wrong.
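The nesting point is easy to see with a toy case. This is a small sketch, assuming Python's `re` module rather than Perl and plain patterns without any recursive extensions. The element name `d` and the helper names are hypothetical. A lazy pattern closes on the first end tag it meets, a greedy one swallows sibling elements, and a real parser keeps the depth straight.

```python
import re
import xml.etree.ElementTree as ET

NESTED = "<d>outer <d>inner</d> tail</d>"
SIBLINGS = "<r><d>a</d><d>b</d></r>"


def lazy_extract(text):
    """Lazy match: stops at the first </d>, even if it closes an inner element."""
    m = re.search(r"<d>(.*?)</d>", text, re.S)
    return m.group(1) if m else None


def greedy_extract(text):
    """Greedy match: runs on to the last </d>, merging separate elements."""
    m = re.search(r"<d>(.*)</d>", text, re.S)
    return m.group(1) if m else None


def parser_extract(text):
    """A parser tracks nesting, so the outer element's text is complete."""
    return "".join(ET.fromstring(text).itertext())


lazy_result = lazy_extract(NESTED)        # cut off inside the outer element
greedy_result = greedy_extract(SIBLINGS)  # spans two sibling elements
parsed_result = parser_extract(NESTED)    # full text of the outer element
```

Neither pattern counts opening tags against closing ones, which is the a^n b^n problem in miniature.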
