5 Jul 2008, 1:44 p.m.

Fighting Spam and Digitising Books with reCAPTCHA

When I added a comment form to this blog, I wondered how long it would be before I started getting comment spam. Then I wondered if I was flattering myself to think that spam bots would even be interested in my site.

So it's with mixed emotions that I have to admit that spam comments are currently outnumbering genuine ones by about 10:1.

The time has come to add a CAPTCHA to the comment form.

The Wikipedia article describes the CAPTCHA concept adequately, so I'll merely summarise: a CAPTCHA is a simple test to check that the poster of a comment is human. I show you a picture of some wonky-looking text, and you type the words you see into the box provided.

Fig 1: Some wonky-looking text

If you correctly identify the words, I'll assume that you're a real person, and not an evil bot. And your comment will get posted. Simple as that.

reCAPTCHA

I had been meaning to have a play with reCAPTCHA since it caught my eye a few months back. It's a great idea: a totally free CAPTCHA tool, developed by Carnegie Mellon University, that anyone can use on their website.

What makes reCAPTCHA special is that while you're reading that wonky text and entering the words in the box, you're playing your part in a global effort to digitise pre-computer-era books, by deciphering the words that OCR software struggles with. There's a more detailed overview of the project here.

It's kind of a cool idea, so I'm going to co-opt reCAPTCHA to help me fend off those evil spammers. I won't be alone: reCAPTCHA counts sites as large as Facebook, Twitter and StumbleUpon among its users [1].

Implementation

The first step in using reCAPTCHA is to drop in at the reCAPTCHA site and get yourself an account. Of course, you'll have to fill in a reCAPTCHA to do this!

As part of the signup process, you'll be prompted to request a key for your first domain (each key is restricted for use on only one domain, apparently for security reasons). In fact, you receive both a public and a private key, and we'll see how to use those shortly. The whole process takes about two minutes.

Once you're signed up, you're free to start implementing reCAPTCHA. For us PHP users, this is delightfully simple, as the reCAPTCHA guys have thoughtfully knocked up a small library to wrap their API. You can download the library from the project's Google Code pages.

Simply download the code, and unzip it somewhere sane and accessible on the webserver. I'll refer to the installation directory as /path/to/recaptcha for the purposes of this post.

To begin using reCAPTCHA, we'll start by adding some HTML to the comment form in order to display the reCAPTCHA challenge box. The library generates all the HTML we need:
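
A minimal sketch, assuming the library's main file is recaptchalib.php and that it provides a recaptcha_get_html() function taking your public key. The key value and the surrounding form markup here are placeholders for illustration:

```php
<?php
// Load the reCAPTCHA PHP library from wherever it was unzipped.
// (File name assumed to be recaptchalib.php.)
require_once('/path/to/recaptcha/recaptchalib.php');

// Placeholder: substitute the public key issued for your domain.
$publickey = 'your-public-key';
?>
<form method="post" action="">
  <!-- Hypothetical comment field; your own form fields go here -->
  <textarea name="comment"></textarea>

  <?php
  // Emit the HTML for the reCAPTCHA challenge box inside the form.
  echo recaptcha_get_html($publickey);
  ?>

  <input type="submit" value="Post comment" />
</form>
```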

It really is that simple, and the reCAPTCHA challenge box shows up as if by magic. With its default theme, it looks like so:

Fig 2: The default reCAPTCHA challenge box

Drop that HTML into the appropriate place in whichever form you want to protect from spam. Once the form is submitted, you can check the validity of the submission as follows:

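<?php
require_once('/path/to/recaptcha/recaptchalib.php');

// Placeholder: use the private key issued for your domain
$privatekey = 'your-private-key';

// Ask the reCAPTCHA server to check what the user typed
// (the two POST fields are the ones the challenge box submits)
$resp = recaptcha_check_answer($privatekey,
                               $_SERVER['REMOTE_ADDR'],
                               $_POST['recaptcha_challenge_field'],
                               $_POST['recaptcha_response_field']);

if ($resp->is_valid) {
    // assume the user is human
    // so post the comment
} else {
    // CAPTCHA was not entered correctly
    // so redisplay the form
}
?>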

Job done, basically. You can theme the actual reCAPTCHA box - to an extent - quite easily, which is nice as the default beige and maroon jarred a little with my fetching grey and turquoise getup. To do that, add a small snippet of JavaScript to the form page:

<script type="text/javascript">
var RecaptchaOptions = { theme: 'white' };
</script>

There is also a "custom" theme which gives you a lot more control over the look and feel, but for the time being I stuck with "white". The whistles and bells can wait!

That's really all there is to it. If you like, you can see the finished comment form, replete with reCAPTCHA, for this post. Time will tell what effect this has on the amount of spam I receive.

Footnotes

[1] http://news.bbc.co.uk/1/hi/technology/7023627.stm

Posted by Simon at 01:53:00 PM