Facebook-bot behaving badly

Quick Summary: Facebookbot can be sporadically greedy, but it will back off temporarily when sent an HTTP 429 (Too Many Requests) response.

You may know the scenario all too well: Your website is running great; your real-time analytics show a nice (but not unusual) flow of visitors; and then someone calls out, “Hey, is anyone else seeing timeout errors on the site?” Crap. You flip over to your infrastructure monitoring tools. CPU utilization is fine; the cache servers are up; the load balancers are… dropping servers. At work, we started experiencing this pattern from time to time, and the post-mortem log analysis always showed the same thing: an unusual spike in requests from bots (on the order of 10–20x our normal traffic load).

Traffic from bots is a mixed blessing — they’re “God’s little spiders”, without which your site would be unfindable. But entertaining these visitors comes with an infrastructure cost. At a minimum, you’re paying for the extra bandwidth they eat; and worst-case, extra hardware (or a more complex architecture) to keep up with the traffic. For a small business, it’s hard to justify buying extra hardware just so the bots can crawl your archives faster; but you can’t afford site outages either. What do you do? This is traffic you’d like to keep — there’s just too much of it.

Our first step was to move from post-mortem to pre-cog. I wanted to see these scenarios playing out in real-time so that we could experiment with different solutions. To do this, we wrote a bot-detecting middleware layer for our application servers (something easily accomplished with Django’s middleware hooks). Once we could identify the traffic, we used statsd and graphite to collect data and visualize bot activity. We now had a way of observing human-traffic patterns in comparison to the bots — and the resulting information was eye-opening.
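The identification step is the easy part. Here’s a minimal sketch of what it might look like; the bot list and regexes below are illustrative assumptions (the real list was longer and grew as new crawlers showed up in the logs), and in our setup the check lived in a Django middleware, which also bumped a statsd counter for each match:

```python
import re

# Illustrative User-Agent fragments; a production list would be
# longer and updated as new crawlers appear in the logs.
KNOWN_BOTS = {
    'googlebot': re.compile(r'Googlebot', re.I),
    'facebookbot': re.compile(r'facebookexternalhit', re.I),
    'bingbot': re.compile(r'bingbot', re.I),
}

def identify_bot(user_agent):
    """Return a bot name for the given User-Agent string, or None for humans."""
    for name, pattern in KNOWN_BOTS.items():
        if pattern.search(user_agent or ''):
            return name
    return None
```

With the bot name in hand, tagging the request and emitting a per-bot metric to statsd is a one-liner in the middleware.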

Let’s start with a view of normal, site-wide traffic:

normal site traffic in graphite

In the graph above, the top, purple line plots “total, successful server requests” (HTTP 200s, for the web folks). Below that we see Googlebot in blue, Facebookbot in green, and all other known bots in red. This isn’t what I’d call a high-traffic site, so you’ll notice that bots make up almost half of the server requests. [This is, by the way, one of the dangers of only using tools like Chartbeat to gauge traffic — you can gauge content impressions, but you’re not seeing the full server load.]

Now let’s look at some interesting behavior:

bad bot traffic in graphite

In this graph, we have the same color-coding: purple plots valid HTTP 200 responses; blue plots Googlebot; green plots Facebookbot; and red is all other bots. During the few minutes represented at the far right-hand side of the graph, you might have called the website “sluggish”. The bot behavior during this time is rather interesting: even though the site is struggling to keep up with the increased requests from Facebookbot, the bot continues hammering the site. It’s like a kid repeatedly hitting “reload” in their web browser when they see an HTTP 500 error message. On the other hand, Googlebot notices the site problems and backs down. Here’s a wider view of the same data that shows how Googlebot slowly ramps back up after the incident:

bad bot traffic in graphite

Very well done, Google engineers! Thank you for that considerate bot behavior.

With our suspicions confirmed, it was time to act. We could identify the traffic at the application layer, so we knew that we could respond to bots differently if needed. We added a throttling mechanism using memcache to count requests per minute, per bot, per server. [By counting requests per minute at the server level instead of site-wide, we didn’t have to worry about clock sync; and with round-robin load balancing, we get a “good enough” estimate of traffic. By including the bot name in the cache key, we can count each bot separately.]

On each bot-request, the middleware checks the counter. If it has exceeded its threshold, the process is killed, and an HTTP 429 (Too Many Requests) response is returned instead. Let’s see how they respond:
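Here’s roughly what that check looks like. This is a sketch, not our production code: a plain dict stands in for memcache, and the threshold is an illustrative assumption you’d tune for your own hardware. Keying on the bot name plus the current minute gives per-bot, per-minute buckets that go stale on their own:

```python
import time

REQUESTS_PER_MINUTE = 300  # illustrative threshold; tune for your servers

_cache = {}  # stand-in for memcache in this sketch

def bot_is_throttled(bot_name, limit=REQUESTS_PER_MINUTE):
    """Count this request and report whether the bot is over its budget."""
    # Bucketing the key by the current minute means old counters are
    # simply never read again (memcache would also expire them), and
    # no cross-server clock sync is required.
    key = 'bot-hits:{n}:{m}'.format(n=bot_name, m=int(time.time() // 60))
    _cache[key] = _cache.get(key, 0) + 1
    return _cache[key] > limit
```

When this returns True, the middleware short-circuits the request and returns the HTTP 429 response instead of invoking the view.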

Bot traffic response to HTTP 429

Here we see the total count of HTTP 200 responses in green; Googlebot in red; and Facebookbot in purple. At annotation ‘A’, we see Googlebot making a high number of requests. Annotation ‘B’ shows our server responding with an HTTP 429. After the HTTP 429 response, Googlebot backs down, waits, and then resumes at its normal rate. Very nice!

In the same chart (above), we also see Facebookbot making a high number of requests at annotation ‘C’. Annotation ‘D’ shows our HTTP 429 response. Following the response, Facebookbot switches to a spike-pause-repeat pattern. It’s odd, but at least it’s an acknowledgement, and the pauses are long enough for the web servers to breathe and handle the load.

While each bot may react differently, we’ve learned that the big offenders do pay attention to HTTP 429 responses, and will back down. With a little threshold tuning, this simple rate-limiting solution allows the bots to keep crawling content (as long as they’re not too greedy) without impacting site responsiveness for paying customers (and without spending more money on servers). That’s a win-win in my book.

Writing a simple memoization decorator in Python

I was generating some reports recently that involved accessing expensive object methods whose results were known not to change on subsequent calls; however, instead of using local variables, I sketched out this quick memoization decorator to save method responses as variables on the object (using a leading ‘_’ followed by the method name as the variable name):

import functools

def cache_method_results(fn):
    @functools.wraps(fn)  # preserve the wrapped method's name and docstring
    def _wrapper(self, *args, **kwargs):
        var_name = '_{n}'.format(n=fn.__name__)

        if var_name not in self.__dict__:  # First call: run the method and save its result
            self.__dict__[var_name] = fn(self, *args, **kwargs)

        return self.__dict__[var_name]  # Return the saved copy

    return _wrapper

You might use it like this:

class Foo(object):
    @cache_method_results
    def some_expensive_operation(self):
        # ...calculate something big and unchanging...
        return results


f = Foo()
print(f.some_expensive_operation())  # This first call will run the calculation
...
print(f.some_expensive_operation())  # but this one will use the cached result instead
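As an aside, if you’re on a newer Python (3.8+), the standard library’s functools.cached_property does much the same thing, stashing the computed value in the instance’s __dict__ on first access. The class and method names below are just for illustration:

```python
import functools

class Report(object):
    @functools.cached_property
    def expensive_total(self):
        # Computed once; the result lands in self.__dict__ under the
        # attribute name, much like the decorator above.
        return sum(range(1000000))

r = Report()
r.expensive_total  # first access runs the method
r.expensive_total  # later accesses read the stored value
```

Note that cached_property turns the method into an attribute (no call parentheses), which reads nicely when the value really is a fixed property of the object.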

It’s not rocket science, but these little tricks add to the fun of using Python.

“Mining of Massive Datasets”

My earlier work with Social Book Club, and my current work with Kirkus Reviews, have me spending a fair amount of time exploring and developing recommendation systems. There are a variety of good books and papers on the subject, but I recently finished reading “Mining of Massive Datasets” (a free ebook that accompanies a Stanford CS course on data mining), and it was a surprisingly good read.

The book covers a number of topics that come up frequently in data mining: reworking algorithms into a map-reduce paradigm, finding similar items, mining streams of data, finding frequent items, clustering, and recommending items. Unlike many texts on the subject, you won’t find source code in this book, but rather extensive explanations of multiple techniques and algorithms for each topic. This lends itself to a better understanding of the theory, so that you can weigh the trade-offs you’re making when implementing your own systems.

There are easier texts to get through, but if you’re getting started with recommendation or data-mining systems, and haven’t read this book, I’d encourage you to do so.

Listening to customers

Back when I was in Product Management, I used surveys to gather feedback from beta testers. Given how valuable (and appreciated) the feedback could be, I now make a point to participate in surveys when asked. Unfortunately, even something as simple as a survey doesn’t always go as planned. Here’s what I was greeted with yesterday during an attempt to provide feedback:

[Screenshot: Screen shot 2010-08-24 at 9.16.04 AM.png]

Pretty awesome, huh?

I had better luck loading the page today; however, after spending a few minutes filling out a survey, guess which button didn’t work?

[Screenshot: Screen shot 2010-08-25 at 10.58.30 AM.png]

I generally expect only a very small percentage of customers to fill out surveys, so the reliability of the survey service is of utmost importance — if you actually want to listen. In this case, I hope that web metrics are being used to track how many customers started the survey vs. how many completed it. [NOTE: If you’re designing surveys, tracking abandonment points during the survey process can also give you an idea of whether your surveys are too long, or asking the wrong questions.]