I picked up “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” during a recent O’Reilly ebook sale, hoping that there might be a few relevant nuggets for the work I’m doing with Bittwist. Unfortunately, the book reads as a collection of unrelated, (mostly) data-themed articles that lack depth and fail to create cohesion.

“Bad Data Handbook” brings together 19 authors to tell their favorite “bad data” stories. With such diversity, the book helps broaden the definition of bad data, citing examples of encoding problems, data bias, formatting inconsistencies, etc. However, this isn’t the hands-on technical book that we usually get from O’Reilly — think of it more like bad data war stories told at the pub. Instead of leaving with a new set of tools and skills, you’re more likely to simply share a few, “oh yeah, I’ve dealt with that” moments.

The topic is nicely timed with the growing interest in big data, but the book falls short. Verdict? Skip this one.

Quick Summary: Facebookbot can be sporadically greedy, but will back off temporarily when sent an HTTP 429 (Too Many Requests) response.

You may know the scenario all too well: your website is running great; your real-time analytics show a nice (but not unusual) flow of visitors; and then someone calls out, “Hey, is anyone else seeing timeout errors on the site?” Crap. You flip over to your infrastructure monitoring tools. CPU utilization is fine; the cache servers are up; the load balancers are… dropping servers. At work, we started experiencing this pattern from time to time, and the post-mortem log analysis always showed the same thing: an unusual spike in requests from bots (on the order of 10-20x our normal traffic loads).

Traffic from bots is a mixed blessing: they’re “God’s little spiders”, without which your site will be unfindable. But entertaining these visitors comes with an infrastructure cost. At a minimum, you’re paying for the extra bandwidth they eat; worst case, you’re paying for extra hardware (or a more complex architecture) to keep up with the traffic. For a small business, it’s hard to justify buying extra hardware just so the bots can crawl your archives faster; but you can’t afford site outages either. What do you do? This is traffic you’d like to keep; there’s just too much of it.

 

Our first step was to move from post-mortem to pre-cog. I wanted to see these scenarios playing out in real-time so that we could experiment with different solutions. To do this, we wrote a bot-detecting middleware layer for our application servers (something easily accomplished with Django’s middleware hooks.) Once we could identify the traffic, we used statsd and graphite to collect data and visualize bot activity. We now had a way of observing human-traffic patterns in comparison to the bots — and the resulting information was eye-opening.
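For illustration, here is a minimal sketch of what the detection half of such a middleware might look like. It is not our production code: the signature list, the BotCounter class, and the metric names are hypothetical, and the `incr` callable simply stands in for whatever metrics client you use (a statsd counter, for example).

```python
import re
from collections import Counter

# Hypothetical signature list: substrings these crawlers are known to send in
# their User-Agent headers. Extend it from your own access logs.
BOT_SIGNATURES = [
    ("googlebot", re.compile(r"googlebot", re.I)),
    ("facebookbot", re.compile(r"facebookexternalhit|facebot", re.I)),
    ("other", re.compile(r"bot|crawler|spider|slurp", re.I)),
]


def classify_bot(user_agent):
    """Return a short bot name for a User-Agent string, or None for humans."""
    for name, pattern in BOT_SIGNATURES:
        if pattern.search(user_agent or ""):
            return name
    return None


class BotCounter:
    """Counts requests per bot; `incr` stands in for a metrics client."""

    def __init__(self, incr):
        self.incr = incr

    def process_request(self, headers):
        bot = classify_bot(headers.get("User-Agent"))
        # One counter per bot name, plus one for everything else.
        self.incr(("requests.bot.%s" % bot) if bot else "requests.human")
        return bot


# Usage: count into a local Counter instead of a real metrics backend.
counts = Counter()
counter = BotCounter(lambda key: counts.update([key]))
counter.process_request({"User-Agent": "facebookexternalhit/1.1"})
```

Order matters in the signature list: the specific crawlers are checked before the catch-all pattern, so Googlebot is never lumped in with "other".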

 

Let’s start with a view of normal, site-wide traffic:

[Graph: normal site traffic in Graphite]

In the graph above, the top, purple line plots “total, successful server requests” (e.g., 200s, for the web folks). Below that we see Googlebot in blue, Facebookbot in green, and all other known bots in red. This isn’t what I’d call a high-traffic site, so you’ll notice that bots make up almost half of the server requests. [This is, by the way, one of the dangers of only using tools like Chartbeat to gauge traffic -- you can gauge content impressions, but you're not seeing the full server load.]

Now let’s look at some interesting behavior:

[Graph: bad bot traffic in Graphite]

In this graph, we have the same color-coding: purple plots valid, HTTP 200 responses; blue plots Googlebot; green plots Facebookbot; and red is all other bots. During the few minutes represented in the far right-hand side of the graph, you might have called the website “sluggish”. The bot behavior during this time is rather interesting: even though the site is struggling to keep up with the increased requests from Facebookbot, the bot continues hammering the site. It’s like a kid repeatedly hitting “reload” in their web browser when they see an HTTP 500 error response. On the other hand, Googlebot notices the site problems and backs down. Here’s a wider view of the same data that shows how Googlebot slowly ramps back up after the incident:

[Graph: bad bot traffic in Graphite, wider view]

Very well done Google engineers! Thank you for that considerate bot behavior.

 

With our suspicions confirmed, it was time to act. We could identify the traffic at the application layer, so we knew that we could respond to bots differently if needed. We added a throttling mechanism using memcache to count requests per minute, per bot, per server. [By counting requests/minute at the server-level instead of site-wide, we didn't have to worry about clock-sync; and with round-robin load balancing, we get a "good enough" estimate of traffic. By including the bot-name in the cache-key, we can count each bot separately.]

On each bot-request, the middleware checks the counter. If it has exceeded its threshold, the process is killed, and an HTTP 429 (Too Many Requests) response is returned instead. Let’s see how they respond:
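As a rough sketch of that counting logic (not the code we run), the example below keeps one counter per server, per bot, per minute. LocalCache is a made-up, in-memory stand-in for a memcache client, and the key format, function names, and limit are all assumptions for illustration. With a real memcache you would also give each key an expiry of a little over a minute so old windows clean themselves up.

```python
import time


class LocalCache:
    """In-memory stand-in for a memcache client (assumption, for illustration)."""

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        # Like memcache `add`: only sets the key if it does not exist yet.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def incr(self, key):
        self._data[key] += 1
        return self._data[key]


def over_limit(cache, bot, server, limit, now=None):
    """Count this request and report whether the bot exceeded its per-minute limit."""
    minute = int((time.time() if now is None else now) // 60)
    # Server and bot name are part of the key, so each pair is counted separately.
    key = "throttle:%s:%s:%d" % (server, bot, minute)
    cache.add(key, 0)  # start this minute's counter if it does not exist
    return cache.incr(key) > limit


def status_for(cache, bot, server, limit, now=None):
    """Return the HTTP status the middleware would send for a bot request."""
    return 429 if over_limit(cache, bot, server, limit, now) else 200


# Usage: with a limit of 3 requests per minute, the 4th and 5th are refused.
cache = LocalCache()
statuses = [status_for(cache, "googlebot", "web1", 3, now=120.0) for _ in range(5)]
```

Because the minute number is baked into the key, a new minute starts a fresh counter without any explicit reset.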

[Graph: bot traffic response to HTTP 429]

Here we see the total count of HTTP 200 responses in green; Googlebot in red; and Facebookbot in purple. At annotation ‘A’, we see Googlebot making a high number of requests. Annotation ‘B’ shows our server responding with an HTTP 429. After the HTTP 429 response, Googlebot backs down, waits, and then resumes at its normal rate. Very nice!

In the same chart (above), we also see Facebookbot making a high number of requests at annotation ‘C’. Annotation ‘D’ shows our HTTP 429 response. Following the response, Facebookbot switches to a spike-pause-repeat pattern. It’s odd, but at least it’s an acknowledgement, and the pauses are long enough for the web-servers to breathe and handle the load.

 

While each bot may react differently, we’ve learned that the big offenders do pay attention to HTTP 429 responses, and will back down. With a little threshold tuning, this simple rate-limiting solution allows the bots to keep crawling content (as long as they’re not too greedy) without impacting site responsiveness for paying customers (and without spending more money on servers.) That’s a win-win in my book.

If the recently hyped Google Goggles (AKA Project Glass) aren’t vaporware, they represent a classic move from the Apple playbook:

Invent the products that replace your best-sellers

I’m thinking of this in terms of Glass replacing Android handsets. Apple doesn’t wait for a competitor to offer something better than their products; they control the killing-off of their best-sellers with each next-generation release. While some companies delay innovation in an attempt to extract every ounce of profit (and increasing margins) out of their products, Apple doesn’t leave their timelines, forecasting, and supply chain in the hands of its competitors.

Don’t compete head-on (Android v. iOS)

Android is making great progress, but it’s a head-on fight, and advancements are always compared to the iPhone. Sure, there’s money to be made with Android handsets, but it’s a “me too” game.

Change the conversation (iPad vs. everything else)

Tablet computing is hot right now; but so far, trying to capture a piece of the iPad market (AKA the tablet market) is a losing proposition. Apple owns this segment. Instead of playing catch-up, change the conversation. Invent a new platform, and own its market segment instead.

 

Of course, as nice as all of this sounds, daringfireball has the right interpretation: This vaporware R&D fluff looks more like Microsoft or Nokia than Apple. Apple wouldn’t say “we’re exploring ideas for a future product that will change the world.” Apple would just do it.

My earlier work with Social Book Club, and current work with Kirkus Reviews, has me spending a fair amount of time exploring and developing recommendation systems. There are a variety of good books and papers on the subject, but I recently finished reading “Mining of Massive Datasets” (a free ebook that accompanies a Stanford CS course on Data Mining), and it was a surprisingly good read.

The book covers a number of topics that come up frequently in data mining: reworking algorithms into a map-reduce paradigm, finding similar items, mining streams of data, finding frequent items, clustering, and recommending items. Unlike many texts on the subject, you won’t find source-code in this book; but rather, extensive explanations of multiple techniques and algorithms to address each topic. This lends itself to a better understanding of the theory, so that you understand the trade-offs you might be making when implementing your own systems.

There are easier texts to get through, but if you’re getting started with recommendation or data-mining systems, and haven’t read this book, I’d encourage you to do so.

We’ve seen various attempts at using JavaScript on the server over the last decade. Mozilla’s Rhino (Java) engine fueled most of it. However, with the release of Google’s V8 (C++) engine (and the networking performance example set by Node.js), the conversation is gaining traction.

The motivation for a 100% JavaScript stack, per conversations at Texas JavaScript Conference (TXJS) last weekend, is the desire to use a single programming language when developing web applications, rather than the mix of technologies we use today. It’s not so much that JavaScript is the best language for application development (contrary to the JS fanboys), but since it’s what we’re stuck with on the client-side, it’s worth considering on the server-side. With a single language, business logic can be reused on the client and the server (think form validation), and you avoid bugs caused by frequent language switching (e.g., using or forgetting semicolons, putting commas after the last item in an array, using the wrong comment delimiter, etc.)

The wrinkle in the 100% JavaScript argument is whether JavaScript is actually the language you want to write your back-end in. The language lacks package management standards (though CommonJS is working to change that); it lacks the standard libraries and tools that the incumbents offer (i.e., no batteries included); many people who use it don’t actually know the language very well; and it suffers from the multitude of bad examples and advice freely available online.

There have been some interesting Node-based applications developed already (e.g., Hummingbird), and the JavaScript on App Engine efforts (e.g., AppEngineJS) will be interesting to watch as well. (I expect both to foster more mature development patterns for large applications written in JavaScript.) However, in the near term, the 100% JavaScript stack will likely remain as niche as the Erlang, Haskell, Lisp, etc. web frameworks (as interesting as they may be.)

The question for you (Mr./Mrs. web developer/web-savvy business person), is whether JavaScript on the back-end offers a competitive advantage. Can you execute on an idea faster/better/cheaper than your competition because of your technology stack?


I finished reading “Coders at Work” last night. In it, author Peter Seibel interviews 15 legendary programmers, discussing how they got started with computers, how they learned to program, how they read and debug code, etc. The interviews cover a wide range of opinions and approaches, and offer a fascinating look at “computer science” history.

The format of the book is a little unusual, in that it’s entirely interview transcripts. No analysis. No author-interpretation. Just recorded conversations. At first it’s a little surprising that one can publish a book like this; but then you get into the content and it’s wonderfully engaging. Analysis and interpretation would just get in the way of letting these folks talk. Reading direct quotes makes the content all the more exciting.

The book isn’t for everyone (obviously), but I rather enjoyed it. There are some great stories about the history of our profession, and many topics raised that inspired additional research. (I went out and found a number of research papers referenced in the interviews, and bookmarked a lot of content for further exploration.) There’s also a fair amount on the history of different programming languages, and I have a fascination with programming languages, so it was a great fit.

A few take-away themes and ideas:

  • While programming was no easy task in the early days, at least it was possible to fully-understand the hardware and all the software running it (as opposed to modern computers.) The modern computing environment presents very different challenges to present-day programmers, especially those new to the field.
  • Even some of the best use print statements.
  • Passion and enthusiasm separate good programmers from great ones.
  • In academia, you have time to think about the “best” solution, without the deadlines imposed on commercial developers.
  • There’s certainly a component of “doing great work” that requires being in the right place at the right time — sometimes it’s just a matter of getting staffed on the right project.
  • There’s some negativity towards C/C++ in here, mostly due to its negative impact on compiler and high-level language development. (i.e., one school of thought is that you give people a high-level language and make the compiler smart. The other is that you give people a low-level language and let them do the work. Unfortunately, humans aren’t so good at hand-writing code optimized for concurrency, but once you have a language that lets them try, it’s hard to fund compiler research.)

Here are a few of the quotes I highlighted while reading:

“One of the most important things for having a successful project is having people that have enough experience that they build the right thing. And barring that, if it’s something that you haven’t built before, that you don’t know how to do, then the next best thing you can do is to be flexible enough that if you build the wrong thing you can adjust.” — Peter Norvig

“…there are user-interface things where you just don’t know until you build it. You think this interaction will be great but then you show it to the user and half the users just can’t get it.” — Peter Norvig

“I get so much of a thrill bringing things to life that it doesn’t even matter if it’s wrong at first. The point is, that as soon as it comes to life it starts telling you what it is.” — Dan Ingalls

“…a complex algorithm requires complex code. And I’d much rather have a simple algorithm and simple code…” — Ken Thompson

“If you can really work hard and get some little piece of a big program to run twice as fast, then you could have gotten the whole program to run twice as fast if you had just waited a year or two.” — Ken Thompson

“if they’d have asked, ‘How did you fix the bug?’ my answer would have been, ‘I couldn’t understand the code well enough to figure out what it was doing, so I rewrote it.’” — Bernie Cosell

“You have to supplement what your job is asking you to do. If your job requires that you do a Tcl thing, just learning enough Tcl to build the interface for the job is barely adequate. The right thing is, that weekend start hacking up some Tcl things so that by Monday morning you’re pretty well versed in the mechanics of it.” — Bernie Cosell

“…computer-program source code is for people, not for computers. Computers don’t care.” — Bernie Cosell

“if you rewrite a hundred lines of code, you may well have fixed the one bug and introduced six new ones.” — Bernie Cosell

“I had two convictions, which actually served me well: that programs ought to make sense and there are very, very few inherently hard problems. Anything that looks really hard or tricky is probably more the product of the programmer not fully understanding what they needed to do” — Bernie Cosell

“You never, ever fix the bug in the place where you find it. My rule is, ‘If you knew then what you know now about the fact that this piece of code is broken, how would you have organized this piece of the routine?’” — Bernie Cosell

“Part of what I call the artistry of the computer program is how easy it is for future people to be able to change it without breaking it.” — Bernie Cosell



I just finished reading “Even Faster Web Sites: Performance Best Practices for Web Developers”, by Steve Souders. It’s technical, and definitely for a limited audience, but it’s certainly relevant for web developers trying to squeeze a few extra milliseconds out of page render times with older browsers. (Yes, many of the techniques are just as applicable for modern browsers, but the performance competition between Firefox, Safari, and Chrome has the latest builds addressing, and solving, some of the common bottlenecks.)

What I liked best about the book were the tests and test results. Souders runs each browser through numerous test scenarios to demonstrate the (sometimes huge) impacts that small authoring decisions can make. (e.g., the surprising relationship between CSS files and inline JavaScript.) Souders also provides implementation details and decision trees for choosing and implementing as much asynchronous loading as possible.

All in all, it was a nice exploration of how different browser implementations approach page loading and painting, and how to exploit this knowledge for speed.

Summary:

  • Targeted at developers wanting to learn Django by building example applications rather than (or in addition to) reading the docs and man pages
  • The reader builds three working applications by following along
  • The examples are based on up-to-date Django features (i.e., a 2008 build)
  • Lessons focused on using Django (not on Django’s inner workings)
  • Doesn’t waste time explaining Python and HTML (nor does it dive deep explaining the how/why of what you’re doing in the examples)
  • Introduces the reader to powerful Django features — covering a wide range of capability
  • Examples focus on designing for code reuse (and leading by example, by integrating with existing reusable apps and Python libraries)
  • Offers an alternative approach to learning, focused on relevant, practical examples

Background:

Practical Django Projects (Apress book description) was written by James Bennett, release manager and contributor to the Django Web Framework. It was published by Apress in 2008. This was Bennett’s first book.

Full disclosure: I was provided with a free, review-copy of the book by Apress.

The Book:

Practical Django Projects introduces the reader to the Django Web Framework by example. It takes the reader step-by-step through three example projects: a basic CMS, a blog application (called Coltrane, which powers the author’s personal blog), and a code-sharing/snippets site (called Cab, which powers http://www.djangosnippets.org/.) The examples cover real-world problems (and integration tasks) that developers are likely to be interested in, and leave the reader with three working Django applications.

The lessons are spread across eleven chapters:

  1. Welcome to Django — a wonderfully short introduction that wastes no space explaining prerequisites (it assumes the reader knows Python)
  2. Your First Django Site: A Simple CMS — an introduction to the Django Admin and Flatpages
  3. Customizing the Simple CMS — customizing the Admin interface (adding TinyMCE) and developing a simple, reusable search feature
  4. A Django-Powered Weblog — defining the basic models, and using django-tagging and Generic Views
  5. Expanding the Weblog — adding del.icio.us-synced links, and custom categories
  6. Templates for the Weblog — more extensive use of Generic Views, template inheritance, and custom template tags
  7. Finishing the Weblog — using django.contrib.comments and model signals to develop a moderation system with email notification and Akismet integration; using django.contrib.syndication to add RSS/Atom feeds
  8. A Social Code-Sharing Site — building the initial models, integrating with the pygments syntax highlighter, and writing custom model managers
  9. Form Processing in the Code-Sharing Application — great examples of using newforms (much better than The Definitive Guide to Django’s chapter on form processing)
  10. Finishing the Code-Sharing Application — more custom template tags, this time used with bookmarking and rating features
  11. Writing Reusable Django Applications — a summary of Bennett’s philosophy on decoupling application features into reusable components (with references to the UNIX saying, “do one thing, and do it well”)

The examples focus on building applications the “Django way”, meaning that they heavily leverage Django features such as Generic Views, custom template tags, and the django.contrib package. Each section starts by outlining the features to be developed, then walks the reader through model definitions, URLs, template design, and the request-handler (view) code.

While working through the three example applications, Bennett teaches the reader how to decouple applications from projects, how to think about (and look for) opportunities for code reuse, and how to integrate with other reusable Django applications. The lessons aren’t so much “how does Django work”, but rather “how do you, as a developer, structure your projects to get the most out of the framework.” Depending on your level of comfort using Django and Python, the lessons will either be a breeze, or ridiculously confusing. (i.e., there’s a lot of magic going on in the examples, and the book assumes that either you get it, you’re comfortable not knowing, or that you’ll figure out the finer bits when you need them.)

The Core Message

Ultimately, the book isn’t so much about learning Django, as it is about learning how to use Django properly (where properly is defined as the way in which the Django developers use Django.) From this perspective, it’s quite successful. The reader is shown a number of patterns and concepts that can be applied to any Django project.

Bennett wraps up the book with a chapter on design philosophy, but I think the overall lesson of the book is best summarized on page 124, with the following quote:

…this is the hallmark of a well-built Django application. Installing it shouldn’t involve any more work than the following:

  1. Add it to INSTALLED_APPS and run syncdb.
  2. Add a new URL pattern to route to its default URLConf.
  3. Set up any needed templates.

This is the zen of pluggable Django applications. It’s the path Bennett wants to help you start down. The value of going down this path will depend on how often you’ll use Django in the future.

Conclusion:

Overall, I think the book will be more valuable for someone just getting started with Django than for someone who’s been hacking lower-level with the framework for a while. It’s a developer-focused, quick-start, “get you on the right foot” kind of book that I certainly would have appreciated more a few years ago. The big question then, is whether this book is for you. The answer depends on a couple of things, with the most important being how you like to learn. Do you prefer learning by example, or learning by reading the docs and building things on your own? If you prefer to have an expert guide you step-by-step, then this book is for you. You’ll still need to poke around in the Django documentation to really grok how it all works, but this book will get you up to speed quickly.

If you’ve read the docs, done the online tutorials, and are still interested in picking up some best-practices on decoupling your code from your specific application (i.e., learning how Django supports code reuse), then this may still be a book for you. If you know you’ll be building a large application, the lessons in the book might help prevent you from writing a single, monolithic application, or at least give you some insight into how to organize and package your code. Down the road you’ll thank yourself.

For me personally, I was actually looking forward to this book before it came out. I think the Django docs online (as great as they are) can sometimes lack in providing best practices. However, I’ve also been using the framework professionally for a number of years (to deploy personal, start-up, and enterprise-class web applications), and I’ve previously built and deployed a pluggable, multi-site, Django-based blog engine (with del.icio.us and Akismet integration, flexible moderation rules, etc.), so the idea of using a blog engine as the core example in the book was a bit disappointing. That said, I did enjoy seeing another developer’s approach on solving the same problem, and I picked up a few nice tips around some of the more recent Django features.

If you’re looking to build a reusable code library (and you should be, if you’re going to build more than one Django project) and ensure that you’re using Django efficiently, this book will help point you down the right path and have you thinking about decoupling your architecture from the start.

Posted in Uncategorized.

Over the holidays we had an accidental deletion of every image on one of our phones (a Nokia N90, Symbian OS device.) Mild panic was quickly replaced with a gentle pondering on the difference between what a normal person would do in this situation vs. what a geek would do. The geek process goes something like this:

Step 1: Get the memory card out of the phone as quickly as possible

Either shut the phone down and pull the card, or use the super-secret combo hidden within the profile-switching shortcut to have the phone un-mount the card.

Step 2: Obtain a USB memory card reader

I’ve needed a reason to buy one of these for a long time. Good thing I had a gift card left from the holidays. I went with a Dynex gazillion-to-one card reader, not for its technical superiority, but because it was the only thing the shop nearby had.

Step 3: Stick the memory card into the reader, and plug the reader into your Linux box

Mine happens to run Ubuntu at the moment, but the results will likely be similar on other distros.

Step 4: sudo apt-get install testdisk

TestDisk “was primarily designed to help recover lost data storage partitions…” and includes a utility called “PhotoRec”, which is what you want.

Step 5: Run photorec

PhotoRec is a data recovery tool designed specifically for recovering files from digital camera media. It supports a number of file-system formats, including the FAT format that Symbian OS uses on its memory cards. PhotoRec is a text-based, terminal application, but it does the job perfectly.

Select the mounted memory card from the list of drives (which should be easy to spot given how small memory cards are relative to modern hard drives), and send it scanning. PhotoRec can be told to look for specific file types (you want JPGs, in this case), but by default it will look for just about any media file format that you’re likely to have on your phone. Files will be recovered and written to a local directory.

Step 6: Sigh in relief when you see your beloved cat pictures returned to you

PhotoRec isn’t going to restore the images to the memory card’s file system such that the phone can see them again, but you’ll have the pictures on your Linux box now, and can copy them back over if you choose to. The naming scheme will be different, but that’s an acceptable compromise.
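If you want tidier names before copying the pictures back, a small script can gather them up. The sketch below assumes PhotoRec's usual output layout of recup_dir.1, recup_dir.2, and so on under the directory you chose; the function name and the photo_0001.jpg naming scheme are made up for illustration.

```python
import shutil
from pathlib import Path


def collect_jpegs(recovery_root, dest):
    """Copy every JPEG under recup_dir.* into dest with sequential names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Match .jpg/.jpeg in any letter case; skip everything else PhotoRec found.
    found = sorted(
        p for p in Path(recovery_root).glob("recup_dir.*/*")
        if p.suffix.lower() in (".jpg", ".jpeg")
    )
    copied = []
    for index, src in enumerate(found, start=1):
        target = dest / ("photo_%04d.jpg" % index)
        shutil.copy2(src, target)  # copy2 also preserves file timestamps
        copied.append(target)
    return copied
```

The originals are copied rather than moved, so the recovery directories stay untouched until you're sure nothing is missing.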


I have an odd fascination with visual programming languages, and while I’ve gotten so far as sketching out some UI concepts and object models for a text-processing focused, web-mashing, visual programming environment, I’m a long way from having anything that works. Imagine my surprise, then, when David Ascher dropped a link to the Lily project on his blog today. Holy cow this is sweet. Think PD or Max/MSP written in JavaScript, running in a browser, with modules for popular Web APIs and JavaScript frameworks (e.g., “Amazon, Flickr, Wikipedia, Yahoo; UI modules that wrap widgets from YUI, Scriptaculous, JQuery, Google Maps….”)

Check out one of the demos here:

(Via: Lily: JavaScript, visual programming, fun.)


The Maemo team has been quietly rocking Nokia’s world for some time now. They’re off in the background building (almost pocketable) mobile computers; fine-tuning touch interfaces and small-screen UIs; becoming experts in embedded linux; and bringing top-notch open source software and modern development tools to this unique mobile platform. For years, the Nokia tablets have sat on the side-lines as niche devices for hackers; but lately, the team has been changing the game.

The Nokia 770 and N800 have always faced an uphill battle with market adoption given their lack of GSM/CDMA support. “Is it a phone?” is one of the first questions people ask when they see me using one of these devices. Saying “No, it’s a web tablet” only brings a look of confusion. Thankfully, the latest software releases, wider market recognition of UMPCs, and the iPhone release have had a huge impact on the perception of the N800.

The Internet Tablet OS 2007 edition 4.2007.26-8 upgrade (released earlier this month) brought Skype support to the N800. While perhaps playing second fiddle to a Flash upgrade that makes YouTube work better, adding Skype greatly improves the likelihood of using the N800 as a portable VoIP device. However, even more significant is the recent Internet Communications Software Update for N800. This update adds SIP support to the N800 for VoIP calls, a feature that turned my N800 into my new desk phone at work.

At Optaros, we use Asterisk to run our phone infrastructure. There are the occasional physical SIP phones in conference rooms, but in general, we use soft-phones running on our laptops to make and receive calls. The downside here is portability. Even using WIFI, a laptop doesn’t make the best cordless phone. But an N800 does. The N800 is actually quite nice as a cordless phone; and with WIFI available in the office, at home, and at nearly every business in Austin, my phone extension can now be routed to my Nokia device and be available almost everywhere.

It may take a while for the market to notice this, but Nokia is quietly taking the top spot in mobile Linux and VoIP hardware know-how. The Nokia Linux tablets aren’t quite ready for the general consumer (in terms of usability), and the marketing messages aren’t there yet either; but the R&D is, and the technology will be ready to drop in and rock the mobile-phone world as soon as the strategy dictates.


A while back, Ubuntu announced a mobile and embedded edition of its popular Linux distribution. The buzz was around the possibility of Ubuntu Mobile showing up on future UMPCs. The news caught my eye, but didn’t really get my attention until the plans for Ubuntu 7.10 (Gutsy Gibbon) were announced:

“Ubuntu 7.10 will be the first Ubuntu release to offer a complete mobile and embedded edition built with the Hildon user interface components” (developed by Nokia for the Maemo platform.)

Now that’s interesting. Could it be that we’ll see Ubuntu Mobile booting on Nokia N800s? It’s certainly a possibility, and one that could bring a larger breadth of software to Nokia’s mobile Linux tablets.

However, as interesting as it may be if Nokia adopts Ubuntu, the possibilities for wider Hildon support didn’t hit me until my drive home today. It was one of those obvious moments. I had been using my Nokia N800 while walking to my car, so the touch- and small-screen friendly UI was fresh in my mind. Then I started thinking about my Car PC. It uses a 7″ touch screen and runs Ubuntu (a full distribution, with a UI designed for full-size monitors.) Running Gnome on my cheap, in-car 7″ monitor makes for a pretty lousy experience. Text is hard to read, and everything is too small to click on. However, if this news is right, Ubuntu 7.10 will change all of that. I’ll be able to run Hildon on my Car PC! That’s killer. Imagine having Canola running in-car, sitting on 100GB of multimedia…

Posted in Uncategorized.

Django “lorem ipsum” generator (and a new contrib.webdesign module)

The Django Web Framework project just added a new contrib.webdesign module with an amazingly simple, but incredibly handy first feature: a lorem ipsum generator. The idea is that a project’s base templates can include generated lorem ipsum for testing layout and page flow, but inheriting templates can override the generated text once real content is available.

The lorem tag is used like this (via the contrib.webdesign docs):

  • {% lorem %} will output the common “lorem ipsum” paragraph.
  • {% lorem 3 p %} will output the common “lorem ipsum” paragraph and two random paragraphs each wrapped in HTML <p> tags.
  • {% lorem 2 w random %} will output two random Latin words.

In practice, you might do this:

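templates/template.html:


{% load webdesign %}
<html>
  <head>
    {# Block names must be unique within a template, so the <title> gets its own block. #}
    <title>{% block page_title %}{% lorem 5 w %}{% endblock %}</title>
  </head>
  <body>
    <div class="article">
      <div class="article_title">{% block article_title %}{% lorem 5 w %}{% endblock %}</div>
      <div class="article_body">{% block article_body %}{% lorem 4 p %}{% endblock %}</div>
    </div>
  </body>
</html>

And then inherit when you’re ready:

templates/article.html:


{% extends "template.html" %}

{# Keep the conditional inside each block; block.super falls back to the parent's lorem ipsum. #}
{% block page_title %}{% if article %}{{ article.title }}{% else %}{{ block.super }}{% endif %}{% endblock %}
{% block article_title %}{% if article %}{{ article.title }}{% else %}{{ block.super }}{% endif %}{% endblock %}
{% block article_body %}{% if article %}{{ article.body }}{% else %}{{ block.super }}{% endif %}{% endblock %}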

Previously, I just pasted lorem ipsum text directly into the main template (wrapped in block tags for overriding), but this new tag will let you skip the copy/paste routine. Very nice!

Posted in Uncategorized.

Even though most articles indicate that Ubuntu Edgy should have automatically patched itself with updated timezone files, my laptop (and apparently a few others) didn’t get the update. With some googling, I found plenty of suggestions (including “sorry, mine worked”, and “just manually set your clock”), but none got to the core issue, which is that the timezone files themselves were wrong.

No doubt, by now, you know whether your machine updated correctly; but if it didn’t, you can verify your timezone files with this:

`zdump -v /etc/localtime | grep 2007`

If you see “Apr 1” in there, the machine has old files (as mine did.)

The solution (for me), was to manually rebuild the timezone files (since the system thought it was fully patched.) Step 1: Go here: http://packages.ubuntu.com/edgy/libs/tzdata and download the latest file (for me, it was http://archive.ubuntu.com/ubuntu/pool/main/t/tzdata/tzdata_2006m.orig.tar.gz.)

Put the file somewhere (like /tmp/), ‘cd’ there, and un-tar it all. ‘cd’ into the uncompressed files until you find a file called ‘northamerica’. Now compile the timezone file like this:

`sudo zic northamerica`

Remove your previous file:

`sudo rm /etc/localtime`

And sym-link to the new one:

`sudo ln -sf /usr/share/zoneinfo/CST6CDT /etc/localtime` (substituting your own timezone for CST6CDT.)

Now verify with:

`zdump -v /etc/localtime | grep 2007`

It should now read “Mar 11” and “Nov 4” instead of “Apr 1” and “Oct 28”, and the machine should fix its clock shortly (it just took a few minutes for mine to correct itself.)
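
If you’d like to double-check which dates you should be looking for, here’s a rough sketch in Python. It only does the calendar arithmetic for the US rules (second Sunday in March and first Sunday in November from 2007 on; first Sunday in April and last Sunday in October before that). It doesn’t read your timezone files at all, and the helper names `nth_sunday` and `last_sunday` are just ones I made up for the example.

```python
import datetime


def nth_sunday(year, month, n):
    """Return the date of the n-th Sunday (1-based) in the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday(): Monday is 0 and Sunday is 6.
    days_until_sunday = (6 - first.weekday()) % 7
    return first + datetime.timedelta(days=days_until_sunday + 7 * (n - 1))


def last_sunday(year, month):
    """Return the date of the last Sunday in the given month."""
    if month == 12:
        next_month = datetime.date(year + 1, 1, 1)
    else:
        next_month = datetime.date(year, month + 1, 1)
    last_day = next_month - datetime.timedelta(days=1)
    # Walk back from the last day of the month to the nearest Sunday.
    return last_day - datetime.timedelta(days=(last_day.weekday() + 1) % 7)


# Rules in effect from 2007: second Sunday in March, first Sunday in November.
patched = (nth_sunday(2007, 3, 2), nth_sunday(2007, 11, 1))
# Rules in effect through 2006: first Sunday in April, last Sunday in October.
stale = (nth_sunday(2007, 4, 1), last_sunday(2007, 10))

print("patched files should show:", patched[0].isoformat(), patched[1].isoformat())
print("stale files will show:    ", stale[0].isoformat(), stale[1].isoformat())
```

The two pairs line up with the dates to look for in the `zdump` output.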

I have no idea what the long-term effects may be of having manually fixed this (as in, what happens when I update to Feisty Fawn), but for now, all is good with the system clock.

Posted in Uncategorized.

My previous post, “Passing JSON via the X-JSON HTTP header with Django and Prototype“, contained an example on writing custom HTTP headers from a Django-based web application. Continuing with that theme, here’s another header trick that I use in one of my apps to force the browser’s “Save As…” dialog box when viewing a particular URL.

The feature that I wanted was the ability to generate an XML file based on an HTTP GET request, but to have the browser open a “Save As…” dialog instead of attempting to render it (as would normally happen with XML in a modern browser.) The solution is to exploit the fact that web browsers don’t try to render MIME types they don’t recognize. A sample implementation (written in Python for the Django Web Framework) follows:

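from datetime import datetime

from django.http import HttpResponse


def save_as_xml(request):
    # Today's date becomes part of the suggested filename.
    current_time = datetime.now()

    response = HttpResponse('PUT THE XML HERE')
    # A made-up MIME type keeps the browser from rendering the body.
    response['Content-Type'] = 'application/x-generated-xml-backup'
    response['Content-disposition'] = 'Attachment; filename=export.%s.xml' % (current_time.strftime("%Y-%m-%d"))

    return response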

Setting the Content-Type header to a made-up type ensures that the browser will not attempt to render the file. The Content-disposition header provides the mechanism for suggesting the filename of the content to be saved on the viewer’s system. In this case, I’m using the standard `datetime` module to insert the date into the suggested filename.
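
The trick isn’t Django-specific. As a rough sketch, here’s the same pair of headers in a bare WSGI app using only the standard library. The function name `save_as_xml_app` and the placeholder `<export/>` body are made up for the example; the headers are the ones described above.

```python
import datetime


def save_as_xml_app(environ, start_response):
    """Minimal WSGI app that offers its XML body as a download."""
    body = b"<export/>"  # placeholder payload; the real XML would go here
    filename = "export.%s.xml" % datetime.date.today().strftime("%Y-%m-%d")
    start_response("200 OK", [
        # Made-up type the browser has no renderer for.
        ("Content-Type", "application/x-generated-xml-backup"),
        # Asks the browser to save the body and suggests a filename.
        ("Content-Disposition", 'attachment; filename="%s"' % filename),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, you can hand the app to `wsgiref.simple_server.make_server` and hit it with a browser.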

Posted in Uncategorized.

One of the demo sites I was working on this week needed to pass a small amount of JSON back with its page results. There are a few ways to do this (and I’d suggest this post, “Loading Content with JSON” as a starting point if you’re looking for ideas), but for simplicity, I decided to take advantage of the automatic X-JSON HTTP Header parsing feature in Prototype 1.5.0. (The Ajax.Request docs address this capability.)

The sample code below demonstrates the use of the X-JSON header with a simple “sticky notes” web app. On the client-side, the JavaScript is quite simple. The second argument to the onSuccess callback handler will be automatically initialized using the data in the X-JSON header:

function display_note(id) {
    new Ajax.Request('/api/note/' + id + '/', {
        method: 'get',
        // `results` is populated by Prototype from the X-JSON header.
        onSuccess: function(transport, results) {
            alert("Note(" + results['id'] + ") `" + results['title'] + "`: " + results['body']);
        }
    });
}

To handle this request, I’m using Django on the server with the following URL pattern:

(r'^api/note/(?P<id>\d+)/$', 'views.get_note')
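
If named groups in URL patterns are new to you, here’s a small sketch using plain `re`. It assumes the group is named `id` (to match the `id` argument that the view receives) and that the path is matched without its leading slash; `match_note_id` is a made-up helper, not part of Django.

```python
import re

# A named group (?P<id>...) captures the digits under the name "id".
NOTE_URL = re.compile(r'^api/note/(?P<id>\d+)/$')


def match_note_id(path):
    """Return the captured id as a string, or None if the path doesn't match."""
    match = NOTE_URL.match(path)
    return match.group('id') if match else None


print(match_note_id('api/note/42/'))
print(match_note_id('api/note/abc/'))
```

Note that the captured value is a string, so the view gets `'42'` rather than the integer `42`.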

The `get_note` method implementation looks like this: [NOTE: For production use, you'll want some exception handling, but I removed the error handling to simplify the example.]

import cjson
from django.http import HttpResponse
# Note is this app's model class; import it from wherever your models live.

def get_note(request, id):
    # Fetch the Note from the DB:
    note = Note.objects.get(pk=id)
    # Create the response object (with some dummy text for now):
    response = HttpResponse('Check the X-JSON header.')
    # Manually set the X-JSON header using the JSON generated from the Note record:
    response['X-JSON'] = cjson.encode(note.__dict__)
    # Return the response object:
    return response

If you’d like to use this technique on your own sites, there are a couple of points to remember:

  1. You can’t return an empty HTTP Response regardless of there being an X-JSON header. If the response is empty, the browser will hang waiting for content to arrive.
  2. The X-JSON header should only be used for small payloads. Don’t stuff more than 8 KB in your headers. If you’re sending more than that, move the JSON to the body of the response.
  3. The cjson and simplejson encoders don’t handle Django DateTime fields. For objects with DateTime fields, write an alternate method for converting the object into a dictionary before passing it to the json encoder.
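
As a rough illustration of points 2 and 3, here’s one way the conversion could look. This sketch swaps in the standard library’s `json` module instead of cjson or simplejson, and the names `to_jsonable`, `x_json_value`, and `MAX_HEADER_BYTES` are made up for the example. The sample `note` dictionary stands in for a model’s `__dict__`.

```python
import datetime
import json

MAX_HEADER_BYTES = 8 * 1024  # the rough 8 KB ceiling from point 2


def to_jsonable(obj_dict):
    """Copy a dict, converting date/datetime values to ISO 8601 strings."""
    clean = {}
    for key, value in obj_dict.items():
        if key.startswith("_"):
            continue  # skip private/internal attributes
        if isinstance(value, (datetime.datetime, datetime.date)):
            value = value.isoformat()
        clean[key] = value
    return clean


def x_json_value(obj_dict):
    """Return JSON for the X-JSON header, or None if it is too big for a header."""
    encoded = json.dumps(to_jsonable(obj_dict))
    if len(encoded.encode("utf-8")) > MAX_HEADER_BYTES:
        return None  # the caller should put the JSON in the response body instead
    return encoded


note = {
    "id": 7,
    "title": "Milk",
    "body": "Buy milk",
    "created": datetime.datetime(2007, 2, 1, 9, 30),
}
print(x_json_value(note))
```

Returning `None` for oversized payloads is just one option; the point is to make the decision between header and body in one place.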

[Update: 2009/04/10]
This post got some flak from a firewall vendor for demonstrating a technique of pushing JSON out that will likely bypass their security checks. Since I haven’t used their products in a long time, I’m not too bothered by this, but I do want to point out that you likely don’t want to use HTTP Headers to pass this kind of data in a production site anyway. You’ll find out quickly that there are character limits to how much you can put in the headers, and before long, the distinction between what data goes in the header and what goes in the body will blur. Once that happens you’ve got a big mess on your hands. Better would be to avoid this pattern altogether. My post here simply demonstrates how to use the technique, should you be interested in doing so.

Posted in Uncategorized.

In Part 1 of this series, I described some of the motivation, and the components being used to build a new blog for myself. In this (lengthy) post, I’ll address the solution I used to move my content archives from WordPress to the new app.

Installing new blog software is generally easy, but if you have legacy content that you need to preserve, the ability to move content between systems becomes of utmost importance. Fortunately, it’s quite common for popular software to provide import/export features; having good tools to migrate content reduces switching costs, making it easy to try new software without fear of content lock-in. Unfortunately, with a home-grown blog platform, these tools need to be written from scratch.

For my soon-to-be-launched Django-based blog, importing content from my WordPress installation was an early priority — there’s only so much testing you can do with lorem ipsum posts. In tackling this content migration, I considered the following four options:

  1. Support the legacy database schema.
  2. Export and import at the database level (i.e., SQL dump, some text file munching, and SQL imports.)
  3. Write an adapter layer to pull from the existing database and insert into the new database.
  4. Export the content into a neutral format, and import from that format.

Regardless of the approach taken, I also added one important requirement: The import solution had to be so easy (and easily repeatable) that I would never hesitate to make a change to the database models when needed. Naturally, it’s nice to freeze the model once you have a stable release, but during development, even the database model should be open to agile iteration. I’ve worked on systems where every model change meant writing accompanying SQL scripts to alter the tables, and while effective, it wastes time, and I wanted the option to simply export, wipe the database clean, and re-import whenever needed. (And preferably by simply running a single script.)

I finally settled on option #4, to export into a neutral format (XML), and write an importer for that format; however, I did briefly consider each of the other options:

1) Supporting the legacy (WordPress) database schema sounds nice on the surface. This would allow the two systems to share the same database (thus eliminating the need to migrate content at all), while making it extremely easy to run the systems side-by-side (perhaps even balancing traffic between the two to test the deployment.) The downside, though, is that the custom application would need to maintain the data relationships that WordPress was relying on. It’s certainly doable, but on further investigation, I found that I didn’t actually like everything about the WordPress schema; there was a bit too much de-normalized data that I didn’t want to keep around.

2) Exporting and Importing at the database level would essentially involve a mysqldump, some sed/grep/perl magic, and a SQL import into a new database. This would get the job done, but could very well lead to endless hours of tweaking regex patterns; and the end result would basically be throw-away code.

3) Writing an adapter layer was actually the most tempting at first. I knew that Django contained a tool for generating model definitions based on an existing database schema. If this worked for the WordPress database, then all I would need to do is write a thin layer to fetch content from one model and stick it into another. Sure enough, the `inspectdb` tool did do a good job, and I got so far as having routines for pulling posts and comments before realizing that this also wasn’t as reusable a solution as I wanted. Complicating matters was the need to do all this magic in a single database, since the Multiple Database Support branch of Django is still in development/testing.

With the above options scratched off the list, I went in search of a means to export directly from WordPress into a neutral format. With a little googling, I found some posts about an export/import feature that might be “in development” in the WordPress tree, but I found no documentation on the feature. Fortunately, a few more searches turned up the “WordPress XML Export” plugin, which sounded like an effort to backport the exporting feature to early versions of WordPress. After first installing the XML Export plugin, I found that it didn’t actually work with the version of WordPress on my server, but a quick look through the source code revealed a hardcoded version check that was easy enough to modify. With that change made, the plugin has run like a champ ever since.

The XML Export plugin outputs the full contents of a WordPress blog into a WXR file (WordPress eXtended RSS), which is an RSS 2.0 file, extended with a WordPress export namespace so that it can include extra metadata and comments.

With the content archives now in a massive RSS file, the next task was to write an importer. To parse the XML, I decided to use ElementTree for its simplicity in getting the job done. Pulling the file into ElementTree is a one-liner (when wordpress_xml_file is a File object):

tree = ET.parse(wordpress_xml_file)

The entries can be easily iterated:

for item in tree.findall("channel/item"):

Extracting the basic elements (which I stuck into a dictionary) was also straightforward:


results['link'] = item.find("link").text
results['pubDate'] = item.find("pubDate").text
results['summary'] = item.find("description").text
results['body'] = item.find("{http://purl.org/rss/1.0/modules/content/}encoded").text
results['post_date'] = item.find("{http://wordpress.org/export/1.0/}post_date").text
results['post_date_gmt'] = item.find("{http://wordpress.org/export/1.0/}post_date_gmt").text

Extracting the Categories/Tags was only slightly more work:


results['categories'] = []

categories = item.findall("category")

for c in categories:
    results['categories'].append(c.text)
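
To see these pieces working together, here’s a self-contained sketch that runs against a tiny, made-up WXR fragment (`SAMPLE_WXR`). The function name `parse_items` is mine, and I’ve used `findtext()` rather than `find(...).text` so that a missing element comes back as `None` instead of raising an error. Treat it as an outline rather than the actual importer.

```python
import xml.etree.ElementTree as ET

CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"
WP_NS = "{http://wordpress.org/export/1.0/}"

# A made-up, minimal export with a single post.
SAMPLE_WXR = """<rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wp="http://wordpress.org/export/1.0/">
  <channel>
    <item>
      <title>Hello</title>
      <link>http://example.com/hello/</link>
      <pubDate>Mon, 01 Jan 2007 12:00:00 +0000</pubDate>
      <description>Short summary</description>
      <category>Python</category>
      <category>Django</category>
      <content:encoded><![CDATA[<p>Full body</p>]]></content:encoded>
      <wp:post_date>2007-01-01 06:00:00</wp:post_date>
    </item>
  </channel>
</rss>"""


def parse_items(root):
    """Return one dictionary per <item> found under <channel>."""
    posts = []
    for item in root.findall("channel/item"):
        posts.append({
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
            "summary": item.findtext("description"),
            # Namespaced tags are addressed as "{namespace-uri}localname".
            "body": item.findtext(CONTENT_NS + "encoded"),
            "post_date": item.findtext(WP_NS + "post_date"),
            "categories": [c.text for c in item.findall("category")],
        })
    return posts


posts = parse_items(ET.fromstring(SAMPLE_WXR))
print(posts[0]["link"], posts[0]["categories"])
```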

Pulling the comments was the only messy part of the process. The list of comments is easy enough to fetch…

comments = item.findall("{http://wordpress.org/export/1.0/}comment")

…but extracting the actual comment text is a little more work because some comments may contain child nodes. For example, a comment containing a hyperlink, bold tag, or any other HTML will be truncated if you simply use the `.text` attribute. To crawl the comment text and child tags, I used the `getiterator()` method, concatenating `.text` attributes to assemble the full comment text. While doing this, I also decided to filter out any HTML tags from the comments, which made the process fairly simple:


tmp_comment_list = []

comment_tag = comment.find("{http://wordpress.org/export/1.0/}comment_content")

for comment_tag_child in comment_tag.getiterator():
    tmp_comment_text = comment_tag_child.text
    if tmp_comment_text: tmp_comment_list.append(tmp_comment_text)

the_comment['body'] = ' '.join(tmp_comment_list)

results['comments'].append(the_comment)
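
One caveat with collecting only `.text`: any words that follow a child’s closing tag live in that child’s `.tail`, so they can get dropped. Also, `getiterator()` was removed in Python 3.9. As a hedged alternative, here’s a sketch using `itertext()`, which yields both text and tail strings in document order. The sample comment and the `comment_text` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

WP_NS = "{http://wordpress.org/export/1.0/}"

# A made-up comment whose content contains inline HTML elements.
SAMPLE_COMMENT = """<wp:comment xmlns:wp="http://wordpress.org/export/1.0/">
  <wp:comment_content>Nice post, see <a href="http://example.com/">this link</a> for <b>more</b> details.</wp:comment_content>
</wp:comment>"""


def comment_text(comment):
    """Return the comment body as plain text with the inline tags dropped."""
    content = comment.find(WP_NS + "comment_content")
    if content is None:
        return ""
    # itertext() walks the element and its descendants, yielding each
    # .text and .tail string, so nothing after a closing tag is lost.
    raw = "".join(content.itertext())
    # Collapse runs of whitespace into single spaces.
    return " ".join(raw.split())


print(comment_text(ET.fromstring(SAMPLE_COMMENT)))
```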

Writing an importer for the WXR/RSS 2.0 format not only solves the problem at hand, but also lays the groundwork for a reusable RSS importer. IMO, this potential reuse adds value to the solution (as opposed to one-off SQL munching or custom adapter layers), which makes it worth any additional work that might have gone into it. With a little refactoring, the same system could also be extended to support the Movable Type Import Format, making the software very easy to set up and evaluate.

In Part 3, I’ll skip some of the development details and jump into the server issues, with a focus on why the new blog hasn’t launched yet. The answer lies heavily in the challenge of running a Python-based application server in shared hosting environments. The common lack of mod_python, the RAM hit, etc., all add to the complexity in adopting Django.

Posted in Uncategorized.

It’s always a strange feeling when my main Linux box is stable and running great, so I figured it was time to update from Ubuntu Dapper to Edgy (which is still pre-release.) I took the simple approach of swapping “dapper” with “edgy” in my “/etc/apt/sources.list” file, and using aptitude to handle the dist-upgrade. The process took some time (there was a lot to download), but after a reboot and another pass through dist-upgrade (using the Update Manager), everything came together just fine. (FYI, this is on a dual-AMD64 (Opteron) machine with an NVIDIA card.)

First impression: WOW, fonts look MUCH nicer! Particularly in Firefox (a 2.0 RC build), which now seems to actually be using anti-aliased fonts. The improvement in text display is a welcome change, since this machine has always lagged behind my OS X boxes in terms of on-screen readability.

So far everything seems to be working quite well. Sound is working a little more reliably (though XMMS isn’t happy about it), but to be fair, I’m running audio through an M-Audio Audiophile USB box, which is probably a bit far from the norm. Also, I had an issue with Lighttpd while updating, but it was easily resolved with some manual intervention.

For more information on Edgy and how to update, check out the following:

Posted in Uncategorized.