In Part 1 of this series, I described some of the motivation, and the components being used to build a new blog for myself. In this (lengthy) post, I’ll address the solution I used to move my content archives from WordPress to the new app.
Installing new blog software is generally easy, but if you have legacy content that you need to preserve, the ability to move content between systems becomes of utmost importance. Fortunately, it’s quite common for popular software to provide import/export features. Having good tools to migrate content reduces switching costs, making it easy to try new software without fear of content lock-in. Unfortunately, with a home-grown blog platform, these tools need to be written from scratch.
For my soon-to-be-launched Django-based blog, importing content from my WordPress installation was an early priority — there’s only so much testing you can do with lorem ipsum posts. In tackling this content migration, I considered the following four options:
1) Support the legacy (WordPress) database schema directly
2) Export and import at the database level
3) Write an adapter layer between the WordPress models and the new models
4) Export into a neutral format (XML) and write an importer for that format
Regardless of the approach taken, I also added one important requirement: The import solution had to be so easy (and easily repeatable) that I would never hesitate to make a change to the database models when needed. Naturally, it’s nice to freeze the model once you have a stable release, but during development, even the database model should be open to agile iteration. I’ve worked on systems where every model change meant writing accompanying SQL scripts to alter the tables, and while effective, it wastes time, and I wanted the option to simply export, wipe the database clean, and re-import whenever needed. (And preferably by simply running a single script.)
I finally settled on option #4, to export into a neutral format (XML), and write an importer for that format; however, I did briefly consider each of the above options:
1) Supporting the legacy (WordPress) database schema sounds nice on the surface. This would allow the two systems to share the same database (thus eliminating the need to migrate content at all), while making it extremely easy to run the systems side-by-side (perhaps even balancing traffic between the two to test the deployment). The downside, though, is that the custom application would need to maintain the data relationships that WordPress was relying on. It’s certainly doable, but on further investigation, I found that I didn’t actually like everything about the WordPress schema. There was a bit too much de-normalized data that I didn’t want to keep around.
2) Exporting and Importing at the database level would essentially involve a mysqldump, some sed/grep/perl magic, and a SQL import into a new database. This would get the job done, but could very well lead to endless hours of tweaking regex patterns; and the end result would basically be throw-away code.
3) Writing an adapter layer was actually the most tempting at first. I knew that Django contained a tool for generating model definitions based on an existing database schema. If this worked for the WordPress database, then all I would need to do is write a thin layer to fetch content from one model and stick it into another. Sure enough, the `inspectdb` tool did do a good job, and I got so far as having routines for pulling posts and comments before realizing that this also wasn’t as reusable a solution as I wanted. Complicating matters was the need to do all this magic in a single database, since the Multiple Database Support branch of Django is still in development/testing.
With the above options scratched off the list, I went in search of a means to export directly from WordPress into a neutral format. With a little googling, I found some posts about an export/import feature that might be “in development” in the WordPress tree, but I found no documentation on the feature. Fortunately, a few more searches turned up the “WordPress XML Export” plugin, which sounded like an effort to backport the exporting feature to early versions of WordPress. After first installing the XML Export plugin, I found that it didn’t actually work with the version of WordPress on my server, but a quick look through the source code revealed a hardcoded version check that was easy enough to modify. With that change made, the plugin has run like a champ ever since.
The XML Export plugin outputs the full contents of a WordPress blog into a WXR file (WordPress eXtended RSS), which is an RSS 2.0 file, extended with a WordPress export namespace so that it can include extra metadata and comments.
With the content archives now in a massive RSS file, the next task was to write an importer. To parse the XML, I decided to use ElementTree for its simplicity in getting the job done. Pulling the file into ElementTree is a one-liner (when wordpress_xml_file is a file object):
tree = ET.parse(wordpress_xml_file)
The entries can be easily iterated:
for item in tree.findall("channel/item"):
Extracting the basic elements, which I stuck into a dictionary, was also straightforward:
results['link'] = item.find("link").text
results['pubDate'] = item.find("pubDate").text
results['summary'] = item.find("description").text
results['body'] = item.find("{http://purl.org/rss/1.0/modules/content/}encoded").text
results['post_date'] = item.find("{http://wordpress.org/export/1.0/}post_date").text
results['post_date_gmt'] = item.find("{http://wordpress.org/export/1.0/}post_date_gmt").text
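Put together, the steps so far might look something like the sketch below. It assumes Python 3 and the standard-library `xml.etree.ElementTree`. The sample document, the `text_of` helper, and the `parse_items` function are made up for illustration and are not the importer's actual code.

```python
import io
import xml.etree.ElementTree as ET

# Namespace prefixes in ElementTree's "{uri}tag" notation, as used in the post.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"
WP_NS = "{http://wordpress.org/export/1.0/}"

# A tiny, made-up WXR document used only for illustration.
SAMPLE_WXR = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <item>
      <title>Hello</title>
      <link>http://example.com/hello/</link>
      <pubDate>Sat, 17 Feb 2007 08:00:00 +0000</pubDate>
      <description>Short summary</description>
      <content:encoded><![CDATA[<p>Full body</p>]]></content:encoded>
      <wp:post_date>2007-02-17 03:00:00</wp:post_date>
      <wp:post_date_gmt>2007-02-17 08:00:00</wp:post_date_gmt>
    </item>
  </channel>
</rss>
"""


def text_of(item, path):
    """Return the text of the first matching child, or None if it is absent."""
    node = item.find(path)
    return node.text if node is not None else None


def parse_items(wordpress_xml_file):
    """Parse a WXR file object and return one dictionary per <item>."""
    tree = ET.parse(wordpress_xml_file)
    posts = []
    for item in tree.findall("channel/item"):
        posts.append({
            "link": text_of(item, "link"),
            "pubDate": text_of(item, "pubDate"),
            "summary": text_of(item, "description"),
            "body": text_of(item, CONTENT_NS + "encoded"),
            "post_date": text_of(item, WP_NS + "post_date"),
            "post_date_gmt": text_of(item, WP_NS + "post_date_gmt"),
        })
    return posts


# Feed the sample through a bytes buffer, which behaves like an opened file.
posts = parse_items(io.BytesIO(SAMPLE_WXR.encode("utf-8")))
```

The `text_of` helper is there because `item.find(...)` returns `None` when an element is missing, and calling `.text` on that would raise an `AttributeError` partway through an import.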
Extracting the Categories/Tags was only slightly more work:
results['categories'] = []
categories = item.findall("category")
for c in categories:
    results['categories'].append(c.text)
Pulling the comments was the only messy part of the process. The list of comments is easy enough to fetch…
comments = item.findall("{http://wordpress.org/export/1.0/}comment")
…but extracting the actual comment text is a little more work because some comments may contain child nodes. For example, a comment containing a hyperlink, bold tag, or any other HTML will be truncated if you simply use the `.text` attribute. To crawl the comment text and child tags, I used the `getiterator()` method, while concatenating `.text` attributes to assemble the full comment text. While doing this, I also decided to filter out any HTML tags from the comments, which made the process fairly simple:
tmp_comment_list = []
comment_tag = comment.find("{http://wordpress.org/export/1.0/}comment_content")
for comment_tag_child in comment_tag.getiterator():
    tmp_comment_text = comment_tag_child.text
    if tmp_comment_text:
        tmp_comment_list.append(tmp_comment_text)
the_comment['body'] = ' '.join(tmp_comment_list)
results['comments'].append(the_comment)
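For anyone trying this on a current Python, here is a hedged variation on the same idea. `getiterator()` was removed from ElementTree in Python 3.9, so this sketch uses `itertext()` instead, which also picks up the text that follows a child tag (its `.tail`), something that joining only `.text` values would miss. The sample XML and the `comment_text` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

WP_NS = "{http://wordpress.org/export/1.0/}"

# A made-up item with one comment that contains inline HTML child nodes.
SAMPLE_ITEM = """<item xmlns:wp="http://wordpress.org/export/1.0/">
  <wp:comment>
    <wp:comment_content>See <a href="http://example.com/">this link</a> for <b>more</b> details.</wp:comment_content>
  </wp:comment>
</item>"""


def comment_text(comment):
    """Return the comment body as plain text with the tags dropped."""
    content = comment.find(WP_NS + "comment_content")
    if content is None:
        return ""
    # itertext() yields every text fragment in document order,
    # including the text that trails each child element.
    raw = "".join(content.itertext())
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(raw.split())


item = ET.fromstring(SAMPLE_ITEM)
comments = [{"body": comment_text(c)} for c in item.findall(WP_NS + "comment")]
```

With the sample above, the single comment body comes out as `See this link for more details.`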
Writing an importer for the WXR/RSS 2.0 format not only solves the problem at hand, but also lays the groundwork for a reusable RSS importer. IMO, this potential reuse adds value to the solution (as opposed to one-off SQL munging or custom adapter layers), which makes it worth any additional work that might have gone into it. With a little refactoring, the same system could also be extended to support the Movable Type Import Format, making the software very easy to set up and evaluate.
In Part 3, I’ll skip some of the development details and jump into the server issues, with a focus on why the new blog hasn’t launched yet. The answer lies heavily in the challenge of running a Python-based application server in shared hosting environments. The common lack of mod_python, the RAM hit, etc., all add to the complexity in adopting Django.
February 17th, 2007 at 8:23 am
Awesome post, I’m working on this very same thing at the moment and your insights were helpful. I was having a hard time finding a decent export solution. You might want to include the full link to the WXR export plugin, instead of the Google search. That link being here.
February 17th, 2007 at 1:45 pm
Hi Jesse, thanks for the comment. I updated the URL. (And thanks for the tip on your blog about Wagamama coming to Boston. I’m in Boston more often than London lately.)
February 28th, 2007 at 11:04 am
One additional note, in case anyone else reads this and has a similar problem.
The export plugin *might* generate some malformed XML, like unescaped or illegal characters. For example, I often use —, an em-dash. ElementTree seemed fine with it, but working with that data subsequently caused Python to throw a UnicodeEncodeError. E.g., a simple `print results['body']` raises the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 499: ordinal not in range(128)
I’m no XML expert. In fact, I know very little about character sets in general. So perhaps I need to change Python’s character set or something. I didn’t bother. Instead, I manually edited the RSS file and removed the offending characters.
February 28th, 2007 at 11:39 am
Good point Jesse, I had a number of problems with the XML that came out of the plugin as well (and I also found out along the way that my MySQL tables were storing 'latin-1' instead of 'utf8'… fun fun). I ended up adding some entity encoding/translation to swap various entities and Unicode characters into something more ASCII-friendly. I hate doing this (I really wish it were easier to run everything as Unicode), but that’s just where we’re at with our tools right now I guess.
The following are the characters I swapped:
“–” becomes “–”
“—” becomes “–”
“‘” becomes “‘”
“’” becomes “‘”
“‚” becomes “‘”
““” becomes ‘”‘
“”” becomes ‘”‘
“„” becomes ‘”‘
“…” becomes “…”
“″” becomes ‘”‘
“\u2019″ becomes “‘”
“\u2029″ becomes “?”
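A swap like this can be sketched with a translation table in Python 3. The mapping below is an assumption that approximates the list above, not the exact table from the importer, and the names `ASCII_REPLACEMENTS` and `to_ascii_friendly` are made up for this example.

```python
# Hypothetical mapping from typographic characters to ASCII stand-ins.
ASCII_REPLACEMENTS = {
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201a": "'",    # low single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u201e": '"',    # low double quote
    "\u2026": "...",  # ellipsis
    "\u2033": '"',    # double prime
    "\u2029": "?",    # paragraph separator
}

# str.maketrans accepts single-character keys mapped to replacement strings.
TRANSLATION_TABLE = str.maketrans(ASCII_REPLACEMENTS)


def to_ascii_friendly(text):
    """Replace the typographic characters listed above with ASCII stand-ins."""
    return text.translate(TRANSLATION_TABLE)


cleaned = to_ascii_friendly("It\u2019s \u201cfine\u201d \u2014 really\u2026")
```

Characters that are not in the table pass through untouched, so this only makes the text ASCII-friendly for the punctuation it knows about.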
July 11th, 2008 at 4:11 am
I tried doing this, but ran into serious issues because the WXR file produced was not well-formed XML.
Other people have had similar issues: http://lucumr.pocoo.org/cogitations/2008/02/18/how-not-to-do-xml/
July 11th, 2008 at 8:15 am
Thanks for the link Martey… and the point can’t be overstressed: The WXR that WordPress produces is an invalid mess! A strict parser will fail. If you want to work with it, you need a liberal, non-validating parser (and even still, you might need to touch up potential character encoding issues before you start.)