Spammers have already created a wasteland out of e-mail networks, and I am damned if I am going to let them ruin what has basically become my number one source of brain food - blogs.

Spammers are using comment submission forms and trackbacks to insert unwelcome content into comments. What's worse, these parasitic posts end up getting indexed by search engines like Google, kind of like scars (scroll down to the bottom) that hang around long after the offending comment has been removed.

The problem is that a lot of the bloggers I subscribe to are starting to talk about turning off comments. I now believe that the MSDN bloggers can't receive comments on posts older than thirty days.

This weekend, between sleeping, folding clothes with my wife and taking my daughter swimming, I have spent some serious hours looking at ways that comment spam can be tackled. One mechanism that has been discussed for e-mail is Hashcash, which I found whilst searching for literature related to Microsoft Research's Penny Black project.

Hashcash is a simple mechanism sometimes used in e-mail to force the sender to perform some moderately expensive computation that produces a "stamp", which can be sent along with the message and cheaply verified by the receiver. A stamp looks something like this:

0:030829:foo123456789:lnymsmzsbksvkavrzltdcr/+

Hashcash actually has two stamp formats: version 0, and version 1, which has just been released. I focused on implementing some code that would produce version 0 stamps because it was easier, but I think version 1 would be better moving forward.

The version 0 stamp is broken down into four parts, each separated by a ":" character.

  1. Stamp format identifier.
  2. Date time in yymmdd format.
  3. Resource identifier.
  4. Random characters.
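
To make the four-part layout concrete, here is a minimal Python sketch that splits a version 0 stamp into its fields. The names `StampV0` and `parse_stamp_v0` are made up for illustration and are not part of any Hashcash library; the sketch also assumes the resource itself contains no ":" characters.

```python
from typing import NamedTuple


class StampV0(NamedTuple):
    """Hypothetical container for the four parts of a version 0 stamp."""
    version: str   # stamp format identifier, "0" for version 0
    date: str      # date in yymmdd format
    resource: str  # resource identifier
    rand: str      # random characters


def parse_stamp_v0(stamp: str) -> StampV0:
    # Assumption: the resource contains no ":" so a plain split yields 4 parts.
    parts = stamp.split(":")
    if len(parts) != 4:
        raise ValueError("expected four ':'-separated parts")
    version, date, resource, rand = parts
    if version != "0":
        raise ValueError("not a version 0 stamp")
    if len(date) != 6 or not date.isdigit():
        raise ValueError("date must be six digits (yymmdd)")
    return StampV0(version, date, resource, rand)


parsed = parse_stamp_v0("0:030829:foo123456789:lnymsmzsbksvkavrzltdcr/+")
print(parsed.date, parsed.resource)
```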

The idea is that the sending software produces a "candidate" stamp where the first three parts remain the same and the random part changes. A SHA1 hash is then performed on the candidate stamp. The hash is then analysed to see how many of the leading bits are zeros. The more zeros, the higher the value of the stamp.
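
The minting loop described above can be sketched in a few lines of Python. This is not my .NET code, just an illustration: the function names, the alphabet used for the random part and its length are all assumptions, and the example asks for only 12 zero bits so that it finishes quickly.

```python
import hashlib
import random
import string

# Assumption: any printable characters without ":" would do for the random part.
ALPHABET = string.ascii_letters + string.digits + "+/"


def leading_zero_bits(digest: bytes) -> int:
    """Count how many bits at the start of the digest are zero."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # Zeros at the top of the first non-zero byte, then stop.
        bits += 8 - byte.bit_length()
        break
    return bits


def mint_v0(resource: str, date: str, bits: int, rand_len: int = 16):
    """Try candidate stamps until one hashes to at least `bits` leading zeros."""
    prefix = f"0:{date}:{resource}:"  # the first three parts never change
    attempts = 0
    while True:
        attempts += 1
        rand = "".join(random.choice(ALPHABET) for _ in range(rand_len))
        candidate = prefix + rand
        digest = hashlib.sha1(candidate.encode("ascii")).digest()
        if leading_zero_bits(digest) >= bits:
            return candidate, attempts


stamp, attempts = mint_v0("foo123456789", "030829", bits=12)
print(stamp, "found after", attempts, "attempts")
```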

When an e-mail is sent, the recipient expects to get a stamp of a certain value, and stamps are easy to verify (by performing the SHA1 hash on the stamp and counting the leading zero bits).
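
Verification is a single hash, which is why it is so cheap for the receiver. A rough Python sketch, assuming version 0 stamps and hypothetical function names (`stamp_value`, `is_acceptable`), might look like this:

```python
import hashlib


def stamp_value(stamp: str) -> int:
    """Value of a stamp: the number of leading zero bits in its SHA1 hash."""
    digest = hashlib.sha1(stamp.encode("utf-8")).digest()
    # Treat the 160-bit digest as one big integer; the bits it does not
    # need at the top are exactly the leading zeros.
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def is_acceptable(stamp: str, expected_resource: str, required_bits: int) -> bool:
    """One hash plus a couple of string checks; no searching involved."""
    parts = stamp.split(":")
    if len(parts) != 4 or parts[0] != "0":
        return False
    if parts[2] != expected_resource:
        # A stamp minted for someone else is worth nothing here.
        return False
    return stamp_value(stamp) >= required_bits


example = "0:030829:foo123456789:lnymsmzsbksvkavrzltdcr/+"
print(stamp_value(example), is_acceptable(example, "foo123456789", 20))
```

A real verifier would presumably also want to check the date part so that old stamps cannot be replayed forever; that is left out here to keep the sketch short.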

Because not every random string will produce a hash with the right number of leading zeros, minting needs to be attempted multiple times (the code I have running in the background here is producing a stamp of sufficient value in somewhere between 3,000 and 400,000 hashing iterations).

So how does this help? Well, the Hashcash FAQ lays it out better than I do, but basically spammers use armies of drone mailers. This is profitable for them because it is very cheap to send an e-mail, but forcing the mailer to "mint" a stamp slows down the rate, especially since each recipient needs to have a stamp produced just for them. The idea is to destroy the economic model that spammers use.

How does this relate to blogs? Well, I think that we could use a similar mechanism to protect our comments from spammers who would use form posters and trackbacking drones to pollute them. The mechanism would be implemented in two independent pieces.

Hashcash for Comments

Now this is the thing that I haven't tested, but in theory it will work. We implement a hashcash minting algorithm in JavaScript that can be embedded in the comment submission forms. When the submit button is clicked the algorithm kicks off and inserts the stamp into a hidden field that can then be read by the server-side code.

The prerequisite for getting this working is a SHA1 implementation in JavaScript; I've found one here. I haven't tested it, but it is a starting point for someone who wants to tackle this side of things.

Details like the resource name and stamp value (from the v1 format of Hashcash) would be provided by the server when it renders the page that contains the JavaScript implementation of the Hashcash algorithm.
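
The server side of this exchange could look roughly like the Python sketch below (Python only because it is compact; the browser half would still be JavaScript). Everything named here is an assumption for illustration: the `hashcash` hidden field name, the `comment-<post id>` resource naming, the 20-bit requirement and the version 0 stamp layout.

```python
import hashlib
import time

REQUIRED_BITS = 20  # assumption: tune to what browsers can mint in reasonable time


def challenge_for(post_id: str) -> dict:
    """Values the server would embed in the page alongside the minting script."""
    return {
        "resource": f"comment-{post_id}",                   # hypothetical naming
        "date": time.strftime("%y%m%d", time.gmtime()),     # yymmdd
        "bits": REQUIRED_BITS,
    }


def accept_comment(form: dict, challenge: dict) -> bool:
    """Check the stamp the script placed in the hidden field."""
    stamp = form.get("hashcash", "")  # hypothetical hidden field name
    parts = stamp.split(":")
    if len(parts) != 4 or parts[0] != "0":
        return False
    if parts[1] != challenge["date"] or parts[2] != challenge["resource"]:
        return False
    digest = hashlib.sha1(stamp.encode("utf-8")).digest()
    zeros = 160 - int.from_bytes(digest, "big").bit_length()
    return zeros >= challenge["bits"]


challenge = challenge_for("42")
print(challenge, accept_comment({"hashcash": ""}, challenge))
```

Comparing the date for exact equality means a form rendered just before midnight would be rejected just after it; a real implementation would want some tolerance there.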

Hashcash for TrackBacks

OK, I won't pretend to be an expert on how trackbacks work, but my idea is to place some additional content in the RDF element contained within blog pages to describe what kind of stamp format is required and its value (just in case we want to change this later on). Here is a sample of the modified RDF.

Now, let's go through a scenario. When I make a post using BlogJet (sub in your favorite poster here) to .Text (sub in your favorite blog host here) it goes off to do the trackbacks. Before it can do a trackback it needs to pull down the content of the referenced URL and search for the RDF element to get the trackback URL.

At this point it also looks for the hashcash information and computes a stamp up to the required value, passing in the URL (URL encoded) as the resource name. I'm not sure yet whether the resource name should be the trackback URL or the originally referenced URL; my current thinking is that it should be the trackback URL (including the unique query string).

.Text would then attempt to make the trackback, but in addition to the normal arguments in the HTTP payload it would also include the stamp value. The recipient of the trackback would then quickly verify the stamp by performing a SHA1 hash on it and ensuring it had the required value.
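
Here is a rough Python sketch of both ends of that exchange, under several assumptions: the stamp travels in a form field called `hashcash` (a name I have made up, it is not part of the TrackBack spec), the resource is the URL-encoded trackback URL, and a simple counter stands in for the random part of the stamp to keep the code short.

```python
import hashlib
import itertools
import time
from urllib.parse import parse_qs, quote, urlencode


def zero_bits(stamp: str) -> int:
    """Leading zero bits in the SHA1 hash of the stamp."""
    digest = hashlib.sha1(stamp.encode("utf-8")).digest()
    return 160 - int.from_bytes(digest, "big").bit_length()


def mint(resource: str, bits: int) -> str:
    """Sender side: search for a version 0 stamp of the required value."""
    date = time.strftime("%y%m%d", time.gmtime())
    for counter in itertools.count():
        # Swapped-in technique: a hex counter instead of random characters.
        stamp = f"0:{date}:{resource}:{counter:x}"
        if zero_bits(stamp) >= bits:
            return stamp


def trackback_body(trackback_url: str, post_url: str, bits: int) -> str:
    """Build the form-encoded payload: the usual `url` argument plus the stamp."""
    resource = quote(trackback_url, safe="")  # URL encode, so no ":" remains
    return urlencode({"url": post_url, "hashcash": mint(resource, bits)})


def verify_trackback(body: str, trackback_url: str, bits: int) -> bool:
    """Recipient side: one hash, plus a check that the stamp was minted for us."""
    stamp = parse_qs(body).get("hashcash", [""])[0]
    parts = stamp.split(":")
    if len(parts) != 4 or parts[2] != quote(trackback_url, safe=""):
        return False
    return zero_bits(stamp) >= bits


target = "http://example.com/trackback.aspx?id=123"  # placeholder URL
body = trackback_body(target, "http://example.org/my-post", bits=10)
print(body, verify_trackback(body, target, 10))
```

Because the query string is part of the resource, a stamp minted for one post's trackback URL is useless against any other post.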

Here is a link to the REALLY REALLY POORLY PERFORMING CODE that I wrote whilst getting to know Hashcash. I haven't really factored the code well and there are no comments; in fact, you can tell it was a quickie by the namespace. But I thought that the .NETers amongst you would probably find the .NET class library more hospitable than some of the C, Java, Perl, and Python that I have read over the last few days.

Update: If you are .NET savvy and took a look at my implementation, I suggest that you flip over to using the SHA1Managed algorithm; it seems faster than the SHA1CryptoServiceProvider, presumably because there is no managed/unmanaged transition. I'm running the B1 bits of Whidbey at the moment - does anyone have a profiler that will work with this setup?

Known Issues

I know there are issues with the proposal; that's why I am going to call this a starting point.

The first issue that I can see is that as the required number of zeros in the stamp climbs, so does the cost, and I can see a point in the very near future where that would be too expensive (in fact the code presented above requires 24 zeros; that's pretty costly in my slow implementation).
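
To put rough numbers on that: if SHA1 output behaves like random bits, each candidate has a 1 in 2^n chance of starting with n zero bits, so the average number of attempts doubles with every extra bit. The small Python sketch below works that through; the hashing rate is a made-up figure for illustration, not a measurement of my code.

```python
def expected_attempts(bits: int) -> int:
    """Average number of candidates to try for `bits` leading zero bits."""
    return 2 ** bits


def estimated_seconds(bits: int, hashes_per_second: float) -> float:
    """Average minting time at a given hashing rate."""
    return expected_attempts(bits) / hashes_per_second


ASSUMED_RATE = 100_000  # hypothetical hashes per second, not a benchmark

for bits in (16, 20, 24):
    print(bits, "bits:", expected_attempts(bits), "attempts on average,",
          round(estimated_seconds(bits, ASSUMED_RATE), 1), "seconds")
```

These are averages only; any individual stamp can take far fewer or far more attempts.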

Microsoft Research listed some alternatives with their Penny Black research (not necessarily the original source), including memory-bound functions which rely on CPU cache misses to prove the effort, which is more egalitarian than pure CPU horsepower.

Similarly, if we look at huge blog hosts like http://weblogs.asp.net then the burden of producing stamps on behalf of its users would cause flat heads on the boxes hosting the system. In this instance I recommend that the posting applications supply the hosting system with a set of stamps which match the URLs referenced in the post. Web-based posters could use the same JavaScript approach used for comments.

There will also be an adoption issue, although given the rate at which technical blogging folk adopt new things like this it'll probably be more successful than e-mail. Hopefully this post serves as a rallying point for the likes of BlogJet, NewsGator, Scott Watermasysk and everyone else who contributes bits to the blogosphere.

Comments, suggestions, implementations? Leave comments.

Resources for Implementors

  • http://www.hashcash.org
  • http://www.hashcash.org/faq/
  • http://www.hashcash.org/tool/
  • http://www.hashcash.org/dev/