WordPress Importers: Getting Our House in Order

The previous post talked about the broad problems we need to tackle to bring our importers up to speed, making them available for everyone to use.

In this post, I’m going to focus on what we could do with the existing technology, in order to give us the best possible framework going forward.

A Reliable Base

Importers are an interesting technical problem. Much like you’d expect from any backup/restore code, importers need to be extremely reliable. They need to comfortable handle all sorts of unusual data, and they need to keep it all safe. Particularly considering their age, the WordPress Importers do a remarkably good job of handling most content you can throw at it.

However, modern development practices have evolved and improved since the importers were first written, and we should certainly be making use of such practices, when they fit with our requirements.

For building reliable software that we expect to largely run by itself, a variety of comprehensive automated testing is critical. This ensures we can confidently take on the broader issues, safe in the knowledge that we have a reliable base to work from.

Testing must be the first item on this list. A variety of automated testing gives us confidence that changes are safe, and that the code can continue to be maintained in the future.

Data formats must be well defined. While this is useful for ensuring data can be handled in a predictable fashion, it’s also a very clear demonstration of our commitment to data freedom.

APIs for creating or extending importers should be straightforward for hooking into.

Performance Isn’t an Optional Extra

With sites constantly growing in size (and with the export files potentially gaining a heap of extra data), we need to care about the performance of the importers.

Luckily, there’s already been some substantial work done on this front:

There are other groups in the WordPress world who’ve made performance improvements in their own tools: gathering all of that experience is a relatively quick way to bring in production-tested improvements.

The WXR Format

It’s worth talking about the WXR format itself, and determining whether it’s the best option for handling exports into the future. XML-based formats are largely viewed as a relic of days gone past, so (if we were to completely ignore backwards compatibility for a moment) is there a modern data format that would work better?

The short answer… kind of. 🙂

XML is actually well suited to this use case, and (particularly when looking at performance improvements) is the only data format for which PHP comes with a built-in streaming parser.

That said, WXR is basically an extension of the RSS format: as we add more data to the file that clearly doesn’t belong in RSS, there is likely an argument for defining an entirely WordPress-focused schema.

Alternative Formats

It’s important to consider what the priorities are for our export format, which will help guide any decision we make. So, I’d like to suggest the following priorities (in approximate priority order):

  • PHP Support: The format should be natively supported in PHP, thought it is still workable if we need to ship an additional library.
  • Performant: Particularly when looking at very large exports, it should be processed as quickly as possible, using minimal RAM.
  • Supports Binary Files: The first comments on my previous post asked about media support, we clearly should be treating it as a first-class citizen.
  • Standards Based: Is the format based on a documented standard? (Another way to ask this: are there multiple different implementations of the format? Do those implementations all function the same?
  • Backward Compatible: Can the format be used by existing tools with no changes, or minimal changes?
  • Self Descriptive: Does the format include information about what data you’re currently looking at, or do you need to refer to a schema?
  • Human Readable: Can the file be opened and read in a text editor?

Given these priorities, what are some options?

WXR (XML-based)

Either the RSS-based schema that we already use, or a custom-defined XML schema, the arguments for this format are pretty well known.

One argument that hasn’t been well covered is how there’s a definite trade-off when it comes to supporting binary files. Currently, the importer tries to scrape the media file from the original source, which is not particularly reliable. So, if we were to look at including media files in the WXR file, the best option for storing them is to base64 encode them. Unfortunately, that would have a serious effect on performance, as well as readability: adding huge base64 strings would make even the smallest exports impossible to read.

Either way, this option would be mostly backwards compatible, though some tools may require a bit of reworking if we were to substantial change the schema.

WXR (ZIP-based)

To address the issues with media files, an alternative option might be to follow the path that Microsoft Word and OpenOffice use: put the text content in an XML file, put the binary content into folders, and compress the whole thing.

This addresses the performance and binary support problems, but is initially worse for readability: if you don’t know that it’s a ZIP file, you can’t read it in a text editor. Once you unzip it, however, it does become quite readable, and has the same level of backwards compatibility as the XML-based format.

JSON

JSON could work as a replacement for XML in both of the above formats, with one additional caveat: there is no streaming JSON parser built in to PHP. There are 3rd party libraries available, but given the documented differences between JSON parsers, I would be wary about using one library to produce the JSON, and another to parse it.

This format largely wouldn’t be backwards compatible, though tools which rely on the export file being plain text (eg, command line tools to do broad search-and-replaces on the file) can be modified relatively easily.

There are additional subjective arguments (both for and against) the readability of JSON vs XML, but I’m not sure there’s anything to them beyond personal preference.

SQLite

The SQLite team wrote an interesting (indirect) argument on this topic: OpenOffice uses a ZIP-based format for storing documents, the SQLite team argued that there would be benefits (particularly around performance and reliability) for OpenOffice to switch to SQLite.

They key issues that I see are:

  • SQLite is included in PHP, but not enabled by default on Windows.
  • While the SQLite team have a strong commitment to providing long-term support, SQLite is not a standard, and the only implementation is the one provided by the SQLite team.
  • This option is not backwards compatible at all.

FlatBuffers

FlatBuffers is an interesting comparison, since it’s a data format focussed entirely on speed. The down side of this focus is that it requires a defined schema to read the data. Much like SQLite, the only standard for FlatBuffers is the implementation. Unlike SQLite, FlatBuffers has made no commitments to providing long-term support.

WXR (XML-based)WXR (ZIP-based)JSONSQLiteFlatBuffers
Works in PHP?⚠️⚠️⚠️
Performant?⚠️⚠️
Supports Binary Files?⚠️⚠️
Standards Based?⚠️ / ❌
Backwards Compatible?⚠️⚠️
Self Descriptive?
Readable?⚠️ / ❌

As with any decision, this is a matter of trade-offs. I’m certainly interested in hearing additional perspectives on these options, or thoughts on options that I haven’t considered.

Regardless of which particular format we choose for storing WordPress exports, every format should have (or in the case of FlatBuffers, requires) a schema. We can talk about schemata without going into implementation details, so I’ll be writing about that in the next post.

This post is part of a series, talking about the WordPress Importers, their history, where they are now, and where they could go in the future.