I picked up “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” during a recent O’Reilly ebook sale, hoping that there might be a few relevant nuggets for the work I’m doing with Bittwist. Unfortunately, the book reads as a collection of unrelated, (mostly) data-themed articles that lack depth and fail to create cohesion.

“Bad Data Handbook” brings together 19 authors to tell their favorite “bad data” stories. With such diversity, the book helps broaden the definition of bad data, citing examples of encoding problems, data bias, formatting inconsistencies, etc. However, this isn’t the hands-on technical book that we usually get from O’Reilly — think of it more like bad data war stories told at the pub. Instead of leaving with a new set of tools and skills, you’re more likely to simply share a few, “oh yeah, I’ve dealt with that” moments.

The topic is nicely timed with the growing interest in big data, but the book falls short. Verdict? Skip this one.

Quick Summary: Facebookbot can be sporadically greedy, but will back-off temporarily when sent an HTTP 429 (Too Many Requests) response.

You may know the scenario all too well: Your website is running great; Your real-time analytics show a nice (but not unusual) flow of visitors; and then someone calls out, “Hey, is anyone else seeing timeout errors on the site?” Crap. You flip over to your infrastructure monitoring tools. CPU utilization is fine; the cache servers are up; the load balancers are… dropping servers. At work, we started experiencing this pattern from time-to-time, and the post-mortem, log-analysis always showed the same thing: An unusual spike in requests from bots (on the order of 10-20x our normal traffic loads.)

Traffic from bots is a mixed blessing — they’re “God’s little spiders”, without which your site will be unfindable. But entertaining these visitors comes with an infrastructure cost. At a minimum, you’re paying for the extra bandwidth they eat; and worst-case, extra hardware (or a more complex architecture) to keep-up with the traffic. For a small business, it’s hard to justify buying extra hardware just so the bots can crawl your archives faster; but you can’t afford site outages either. What do you do? This is traffic you’d like to keep — there’s just too much of it.

 

Our first step was to move from post-mortem to pre-cog. I wanted to see these scenarios playing out in real-time so that we could experiment with different solutions. To do this, we wrote a bot-detecting middleware layer for our application servers (something easily accomplished with Django’s middleware hooks.) Once we could identify the traffic, we used statsd and graphite to collect data and visualize bot activity. We now had a way of observing human-traffic patterns in comparison to the bots — and the resulting information was eye-opening.

 

Let’s start with a view of normal, site-wide traffic:

normal site traffic in graphite

In the graph above, the top, purple line plots “total, successful server requests” (e.g, 200’s, for the web folks.) Below that we see Googlebot in blue, Facebookbot in green, and all other known-bots in red. This isn’t what I’d call a high-traffic site, so you’ll notice that bots make up almost half of the server requests. [This is, by the way, one of the dangers of only using tools like Chartbeat to gauge traffic -- you can gauge content impressions, but you're not seeing the full server load.]

Now let’s look at some interesting behavior:

bad bot traffic in graphite

In this graph, we have the same color-coding: Purple plots valid, HTTP 200 responses; Blue plots Googlebot; Green plots Facebookbot; and red is all other bots. During the few minutes represented in the far, right-hand side of the graph, you might have called the website “sluggish”. The bot-behavior during this time is rather interesting: Even though the site is struggling to keep up with the increased requests from Facebookbot, the bot continues hammering the site. It’s like a kid repeatedly hitting “reload” in their web browser when they see a HTTP 500 error message response. On the other hand, Googlebot notices the site problems and backs-down. Here’s a wider view of the same data that shows how Googlebot slowly ramps back up after the incident:

bad bot traffic in graphite

Very well done Google engineers! Thank you for that considerate bot behavior.

 

With our suspicions confirmed, it was time to act. We could identify the traffic at the application layer, so we knew that we could respond to bots differently if needed. We added a throttling mechanism using memcache to count requests per minute, per bot, per server. [By counting requests/minute at the server-level instead of site-wide, we didn't have to worry about clock-sync; and with round-robin load balancing, we get a "good enough" estimate of traffic. By including the bot-name in the cache-key, we can count each bot separately.]

On each bot-request, the middleware checks the counter. If it has exceeded its threshold, the process is killed, and an HTTP 429 (Too Many Requests) response is returned instead. Let’s see how they respond:

Bot traffic response to HTTP 429

Here we see the total count of HTTP 200 responses in green; Googlebot in red; and Facebookbot in purple. At annotation ‘A’, we see Googlebot making a high number of requests. Annotation ‘B’ shows our server responding with an HTTP 429. After the HTTP 429 response, Googlebot backs down, waits, and the resumes at it’s normal rate. Very nice!

In the same chart (above), we also see Facebookbot making a high number of requests at annotation ‘C’. Annotation ‘D’ shows our HTTP 429 response. Following the response, Facebookbot switches to a spike-pause-repeat pattern. It’s odd, but at least it’s an acknowledgement, and the pauses are long enough for the web-servers to breathe and handle the load.

 

While each bot may react differently, we’ve learned that the big offenders do pay attention to HTTP 429 responses, and will back down. With a little threshold tuning, this simple rate-limiting solution allows the bots to keep crawling content (as long as they’re not too greedy) without impacting site responsiveness for paying customers (and without spending more money on servers.) That’s a win-win in my book.

If the recently hyped Google Goggles (AKA Project Glass) aren’t vaporware, they represent a classic move from the Apple playbook:

Invent the products that replace your best-sellers

I’m thinking of this in terms of Glass replacing Android handsets. Apple doesn’t wait for a competitor to offer something better than their products — they control the killing-off of their best-sellers with each next-generation release. While some companies delay innovation in attempt to extract every ounce of profit (and increasing margins) out of their products, Apple doesn’t leave their timelines, forecasting, and supply-chain in the hands on it’s competitors.

Don’t compete head-on (Android v. iOS)

Android is making great progress, but it’s an head-on fight; and advancements are always compared to the iPhone. Sure, there’s money to be made with Android handsets — but it’s a “me too” game.

Change the conversation (iPad vs. everything else)

Tablet computing is hot right now; but so far, trying to capture a piece of the iPad market (AKA the Tablet market), is a losing proposition. Apple owns this segment. Instead of playing catch-up, change the conversion. Invent a new platform, and own it’s market segment instead.

 

Of course, as nice as all of this sounds, daringfireball has the right interpretation: This vaporware R&D fluff looks more like Microsoft or Nokia than Apple. Apple wouldn’t say “we’re exploring ideas for a future product that will change the world.” Apple would just do it.

My earlier work with Social Book Club, and current work with Kirkus Reviews, has me spending a fair amount of time exploring and developing recommendation systems. There are a variety of good books and papers on the subject, but I recently finished reading “Mining of Massive Datasets” (a free ebook that accompanies a Stanford CS course on Data Mining), and it was a surprisingly good read.

The book covers a number of topics that come up frequently in data mining: reworking algorithms into a map-reduce paradigm, finding similar items, mining streams of data, finding frequent items, clustering, and recommending items. Unlike many texts on the subject, you won’t find source-code in this book; but rather, extensive explanations of multiple techniques and algorithms to address each topic. This lends itself to a better understanding of the theory, so that you understand the trade-offs you might be making when implementing your own systems.

There are easier texts to get through, but if you’re getting started with recommendation or data-mining systems, and haven’t read this book, I’d encourage you to do so.

We’ve seen various attempts at using JavaScript on the server over the last decade. Mozilla’s Rhino (Java) engine fueled most of it. However, with the release of Google’s V8 (C++) engine (and the networking performance example set by Node.js), the conversation is gaining traction.

The motivation for a 100% JavaScript stack, per conversations at Texas JavaScript Conference (TXJS) last weekend, is the desire to use a single programming language when developing web applications, rather than the mix of technologies we use today. It’s not so much that JavaScript is the best language for application development (contrary to the JS fanboys), but since it’s what we’re stuck with on the client-side, it’s worth considering on the server-side. With a single language, business logic can be reused on the client and the server (think form validation), and you avoid bugs caused by frequent language switching (i.e., using, or forgetting semi-colons, putting commas after the last item in an array, using the wrong comment delimiter, etc.)

The wrinkle in the 100% JavaScript argument, is whether JavaScript is actually the language you want to write your back-end in. The language lacks package management standards (though CommonJS is working to change that); It lacks the standard libraries and tools that the incumbents offer (i.e., no batteries included); Maybe people who use it don’t actually know the language very well; And it suffers from the multitude of bad examples and advice freely available online.

There have been some interesting Node-based applications developed already (i.e., Hummingbird), and the JavaScript on App Engine efforts (i.e., AppEngineJS) will be interesting to watch as well. (I expect both to foster more mature development patterns for large applications written in JavaScript.) However, in the near term, the 100% JavaScript stack will likely remain as niche as the Erlang, Haskel, Lisp, etc. web frameworks (as interesting as they may be.)

The question for you (Mr./Mrs. web developer/web-savvy business person), is whether JavaScript on the back-end offers a competitive advantage. Can you execute on an idea faster/better/cheaper than your competition because of your technology stack?

Coders at Work book cover

I finished reading “Coders at Work last night. In it, author Peter Seibel interviews 15 legendary programmers, discussing how they got started with computers, how they learned to program, how they read and debug code, etc. The interviews cover a wide range of opinions and approaches, and offers a fascinating look at “computer science” history.

The format of the book is a little unusual, in that it’s entirely interview transcripts. No analysis. No author-interpretation. Just recorded conversations. At first it’s a little surprising that one can publish a book like this; But then you get into the content and it’s wonderfully engaging. Analysis and interpretation would just get in the way of letting these folks talk. Reading direct quotes makes the content all the more exciting.

The book isn’t for everyone (obviously), but I rather enjoyed it. There’s some great stories about the history of our profession, and many topics raised that inspired additional research. (I went out and found a number of research papers referenced in the interviews, and bookmarked a lot of content for further exploration.) There’s also a fair amount on the history of different programming languages, and I have a fascination with programming languages, so it was a great fit.

A few take-away themes and ideas:

  • While programming was no easy task in the early days, at least it was possible to fully-understand the hardware and all the software running it (as opposed to modern computers.) The modern computing environment presents very different challenges to present-day programmers, especially those new to the field.
  • Even some of best use print statements.
  • Passion and enthusiasm separate good programmers from great ones.
  • In academia, you have time to think about the “best” solution, without the deadlines imposed on commercial developers.
  • There’s certainly a component of “doing great work” that requires being in the right place at the right time — sometimes it’s just a matter of getting staffed on the right project.
  • There’s some negativity towards C/C++ in here, mostly due to it’s negative impact on compiler and high-level language development. (i.e., one school of thought is that you give people a high-level language and make the compiler smart. The other is that you give people a low-level language and let them do the work. Unfortunately, humans aren’t so good at hand-writing code optimized for concurrency, but once you have a language that let’s them try, it’s hard to fund compiler research.)

Here’s a few of the quotes I highlighted while reading:

“One of the most important things for having a successful project is having people that have enough experience that they build the right thing. And barring that, if it’s something that you haven’t built before, that you don’t know how to do, then the next best thing you can do is to be flexible enough that if you build the wrong thing you can adjust.” — Peter Norvig

“…there are user-interface things where you just don’t know until you build it. You think this interaction will be great but then you show it to the user and half the users just can’t get it.” — Peter Norvig

“I get so much of a thrill bringing things to life that it doesn’t even matter if it’s wrong at first. The point is, that as soon as it comes to life it starts telling you what it is.” — Dan Ingalls

“…a complex algorithm requires complex code. And I’d much rather have a simple algorithm and simple code…” — Ken Thompson

“If you can really work hard and get some little piece of a big program to run twice as fast, then you could have gotten the whole program to run twice as fast if you had just waited a year or two.” — Ken Thompson

“if they’d have asked, ‘How did you fix the bug?’ my answer would have been, ‘I couldn’t understand the code well enough to figure out what it was doing, so I rewrote it.'” — Bernie Cosell

“You have to supplement what your job is asking you to do. If your job requires that you do a Tcl thing, just learning enough Tcl to build the interface for the job is barely adequate. The right thing is, that weekend start hacking up some Tcl things so that by Monday morning you’re pretty well versed in the mechanics of it.” — Bernie Cosell

“…computer-program source code is for people, not for computers. Computers don’t care.” — Bernie Cosell

“if you rewrite a hundred lines of code, you may well have fixed the one bug and introduced six new ones.” — Bernie Cosell

“I had two convictions, which actually served me well: that programs ought to make sense and there are very, very few inherently hard problems. Anything that looks really hard or tricky is probably more the product of the programmer not fully understanding what they needed to do” — Bernie Cosell

“You never, ever fix the bug in the place where you find it. My rule is, ‘If you knew then what you know now about the fact that this piece of code is broken, how would you have organized this piece of the routine?'” — Bernie Cosell

“Part of what I call the artistry of the computer program is how easy it is for future people to be able to change it without breaking it.” — Bernie Cosell