All posts by erik

Book: “Technical Blogging: Turn Your Expertise into a Remarkable Online Presence”

Technical Blogging cover
Like the book in my previous review, I bought “Technical Blogging: Turn Your Expertise into a Remarkable Online Presence”, by Antonio Cangiano, during a winter e-book sale. Unlike the book in my previous review, I enjoyed this one.

While I’ve been blogging for over a decade, it’s never been a professional thing — I tend to write posts that are mostly reminders to my future self. I’ve thought about taking blogging more seriously (and have started, then abandoned, a few blogs that didn’t work out), but this book gave me new focus and motivation.

“Technical Blogging” is organized in five main sections: Planning (i.e., defining your topic and audience, setting goals, etc.); Building (which covers WordPress customizations and content strategies); Promoting (i.e., marketing and analytics); Benefiting from your blog (i.e., monetization and other perks); and finally, Scaling (i.e., increasing your reach, hiring writers, etc.). Throughout each section, Cangiano’s style is to offer very specific steps for each topic, along with an open-book view of his blogs’ traffic and revenue numbers. It’s opinionated, and reads like a blogging consultant telling you what to do. That style doesn’t always work for me, but in this case, I appreciated the frankness.

As a testament to his work, the book inspired me to finally start posting to a domain I’ve been sitting on for a while. I’m optimistic that I can keep this one going.

Book: “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work”

I picked up “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” during a recent O’Reilly ebook sale, hoping that there might be a few relevant nuggets for the work I’m doing with Bittwist. Unfortunately, the book reads as a collection of unrelated, mostly data-themed articles that lack depth and fail to cohere.

“Bad Data Handbook” brings together 19 authors to tell their favorite “bad data” stories. With such diversity, the book helps broaden the definition of bad data, citing examples of encoding problems, data bias, formatting inconsistencies, etc. However, this isn’t the hands-on technical book that we usually get from O’Reilly — think of it more like bad data war stories told at the pub. Instead of leaving with a new set of tools and skills, you’re more likely to simply share a few “oh yeah, I’ve dealt with that” moments.

The topic is nicely timed with the growing interest in big data, but the book falls short. Verdict? Skip this one.

Facebook-bot behaving badly

Quick Summary: Facebookbot can be sporadically greedy, but will back off temporarily when sent an HTTP 429 (Too Many Requests) response.

You may know the scenario all too well: Your website is running great; your real-time analytics show a nice (but not unusual) flow of visitors; and then someone calls out, “Hey, is anyone else seeing timeout errors on the site?” Crap. You flip over to your infrastructure monitoring tools. CPU utilization is fine; the cache servers are up; the load balancers are… dropping servers. At work, we started experiencing this pattern from time to time, and the post-mortem log analysis always showed the same thing: an unusual spike in requests from bots (on the order of 10-20x our normal traffic loads.)

Traffic from bots is a mixed blessing — they’re “God’s little spiders”, without which your site would be unfindable. But entertaining these visitors comes with an infrastructure cost. At a minimum, you’re paying for the extra bandwidth they eat; and worst-case, extra hardware (or a more complex architecture) to keep up with the traffic. For a small business, it’s hard to justify buying extra hardware just so the bots can crawl your archives faster; but you can’t afford site outages either. What do you do? This is traffic you’d like to keep — there’s just too much of it.


Our first step was to move from post-mortem to pre-cog. I wanted to see these scenarios playing out in real-time so that we could experiment with different solutions. To do this, we wrote a bot-detecting middleware layer for our application servers (something easily accomplished with Django’s middleware hooks.) Once we could identify the traffic, we used statsd and graphite to collect data and visualize bot activity. We now had a way of observing human-traffic patterns in comparison to the bots — and the resulting information was eye-opening.
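As a rough sketch of the identification step (the bot names and signature patterns here are illustrative, not our production list), classifying a request boils down to matching its User-Agent header against known crawler signatures:

```python
import re

# Hypothetical signature list -- a real deployment tracks many more bots.
KNOWN_BOT_PATTERNS = {
    'googlebot': re.compile(r'Googlebot', re.IGNORECASE),
    'facebookbot': re.compile(r'facebookexternalhit', re.IGNORECASE),
    'bingbot': re.compile(r'bingbot', re.IGNORECASE),
}

def identify_bot(user_agent):
    """Return a short bot name for a User-Agent string, or None for
    (presumed) human traffic."""
    for name, pattern in KNOWN_BOT_PATTERNS.items():
        if pattern.search(user_agent or ''):
            return name
    return None
```

In a Django middleware, a check like this would run on each incoming request, with the result used to increment a per-bot statsd counter for graphing.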


Let’s start with a view of normal, site-wide traffic:

normal site traffic in graphite

In the graph above, the top, purple line plots “total, successful server requests” (e.g., HTTP 200s, for the web folks.) Below that we see Googlebot in blue, Facebookbot in green, and all other known bots in red. This isn’t what I’d call a high-traffic site, so you’ll notice that bots make up almost half of the server requests. [This is, by the way, one of the dangers of only using tools like Chartbeat to gauge traffic — you can gauge content impressions, but you’re not seeing the full server load.]

Now let’s look at some interesting behavior:

bad bot traffic in graphite

In this graph, we have the same color-coding: Purple plots valid, HTTP 200 responses; blue plots Googlebot; green plots Facebookbot; and red is all other bots. During the few minutes represented on the far right-hand side of the graph, you might have called the website “sluggish”. The bot behavior during this time is rather interesting: Even though the site is struggling to keep up with the increased requests from Facebookbot, the bot continues hammering the site. It’s like a kid repeatedly hitting “reload” in their web browser when they see an HTTP 500 error message. On the other hand, Googlebot notices the site problems and backs down. Here’s a wider view of the same data that shows how Googlebot slowly ramps back up after the incident:

bad bot traffic in graphite

Very well done, Google engineers! Thank you for that considerate bot behavior.


With our suspicions confirmed, it was time to act. We could identify the traffic at the application layer, so we knew that we could respond to bots differently if needed. We added a throttling mechanism using memcache to count requests per minute, per bot, per server. [By counting requests/minute at the server level instead of site-wide, we didn’t have to worry about clock-sync; and with round-robin load balancing, we got a “good enough” estimate of traffic. By including the bot name in the cache key, we could count each bot separately.]

On each bot-request, the middleware checks the counter. If it has exceeded its threshold, the process is killed, and an HTTP 429 (Too Many Requests) response is returned instead. Let’s see how they respond:
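A minimal sketch of that counting-and-throttling logic, using a plain dict in place of memcache (the key format and threshold here are illustrative, not our production values):

```python
import time

REQUESTS_PER_MINUTE_LIMIT = 300  # illustrative threshold; tuned per bot in practice

def throttle_key(bot_name, now=None):
    """Build a per-bot, per-minute counter key. Each server counts only its
    own slice of the round-robin traffic, so no cross-server clock sync is
    needed; including the bot name counts each bot separately."""
    minute = int((now if now is not None else time.time()) // 60)
    return 'botcount:{bot}:{minute}'.format(bot=bot_name, minute=minute)

def should_throttle(cache, bot_name, limit=REQUESTS_PER_MINUTE_LIMIT, now=None):
    """Count this request and return True when the bot has exceeded its
    budget for the current minute (i.e., time to answer with HTTP 429)."""
    key = throttle_key(bot_name, now=now)
    cache[key] = cache.get(key, 0) + 1  # with memcache: add() the key, then incr()
    return cache[key] > limit
```

When `should_throttle` returns True, the middleware would short-circuit the request and return the 429 response instead of rendering the page.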

Bot traffic response to HTTP 429

Here we see the total count of HTTP 200 responses in green; Googlebot in red; and Facebookbot in purple. At annotation ‘A’, we see Googlebot making a high number of requests. Annotation ‘B’ shows our server responding with an HTTP 429. After the HTTP 429 response, Googlebot backs down, waits, and then resumes at its normal rate. Very nice!

In the same chart (above), we also see Facebookbot making a high number of requests at annotation ‘C’. Annotation ‘D’ shows our HTTP 429 response. Following the response, Facebookbot switches to a spike-pause-repeat pattern. It’s odd, but at least it’s an acknowledgement, and the pauses are long enough for the web-servers to breathe and handle the load.


While each bot may react differently, we’ve learned that the big offenders do pay attention to HTTP 429 responses, and will back down. With a little threshold tuning, this simple rate-limiting solution allows the bots to keep crawling content (as long as they’re not too greedy) without impacting site responsiveness for paying customers (and without spending more money on servers.) That’s a win-win in my book.

Google Glass as an Apple strategy

If the recently hyped Google Goggles (AKA Project Glass) aren’t vaporware, they represent a classic move from the Apple playbook:

Invent the products that replace your best-sellers

I’m thinking of this in terms of Glass replacing Android handsets. Apple doesn’t wait for a competitor to offer something better than their products — they control the killing-off of their best-sellers with each next-generation release. While some companies delay innovation in an attempt to extract every ounce of profit (and increasing margin) out of their products, Apple doesn’t leave its timelines, forecasting, and supply chain in the hands of its competitors.

Don’t compete head-on (Android v. iOS)

Android is making great progress, but it’s a head-on fight, and advancements are always compared to the iPhone. Sure, there’s money to be made with Android handsets — but it’s a “me too” game.

Change the conversation (iPad vs. everything else)

Tablet computing is hot right now; but so far, trying to capture a piece of the iPad market (AKA the tablet market) is a losing proposition. Apple owns this segment. Instead of playing catch-up, change the conversation. Invent a new platform, and own its market segment instead.


Of course, as nice as all of this sounds, daringfireball has the right interpretation: This vaporware R&D fluff looks more like Microsoft or Nokia than Apple. Apple wouldn’t say “we’re exploring ideas for a future product that will change the world.” Apple would just do it.

Writing a simple memoization decorator in Python

I was generating some reports recently that involved accessing expensive object methods whose results were known not to change on subsequent calls. Instead of using local variables, I sketched out this quick memoization decorator to save method responses as attributes on the object (using a leading ‘_’ followed by the method name as the attribute name):

from functools import wraps

def cache_method_results(fn):
    @wraps(fn)  # preserve the wrapped method's name and docstring
    def _view(self, *args, **kwargs):
        var_name = '_{n}'.format(n=fn.__name__)

        if var_name in self.__dict__:  # Return the copy we have
            return self.__dict__[var_name]

        else:  # Run the function and save its result
            self.__dict__[var_name] = fn(self, *args, **kwargs)
            return self.__dict__[var_name]

    return _view

You might use it like this:

class Foo(object):
    @cache_method_results
    def some_expensive_operation(self):
        ...calculate something big and unchanging...
        return results

f = Foo()
print(f.some_expensive_operation())  # This first call will run the calculation
print(f.some_expensive_operation())  # but this one will use the cached result instead

It’s not rocket science, but these little tricks add to the fun of using Python.

“Mining of Massive Datasets”

My earlier work with Social Book Club, and current work with Kirkus Reviews, have me spending a fair amount of time exploring and developing recommendation systems. There are a variety of good books and papers on the subject, but I recently finished reading “Mining of Massive Datasets” (a free ebook that accompanies a Stanford CS course on Data Mining), and it was a surprisingly good read.

The book covers a number of topics that come up frequently in data mining: reworking algorithms into a map-reduce paradigm, finding similar items, mining streams of data, finding frequent items, clustering, and recommending items. Unlike many texts on the subject, you won’t find source code in this book; but rather, extensive explanations of multiple techniques and algorithms to address each topic. This lends itself to a better understanding of the theory, so that you know the trade-offs you might be making when implementing your own systems.

There are easier texts to get through, but if you’re getting started with recommendation or data-mining systems, and haven’t read this book, I’d encourage you to do so.