On the 11th of April 2004 the Australian Spam Act (2003) became law. The law has provisions to impose fines of up to $1.1 million per day.

 The Spam Act prohibits the sending of unsolicited commercial electronic messages that have an Australian link. This means that commercial spam, sent by mobile phone as well as by e-mail, is not permitted to originate from Australia and is not allowed to be sent to Australian addresses, whatever their point of origin.

Enforcement of this law was always going to be a problem, since it’s hard to prosecute people sending generic spam messages for things like pharmaceuticals when the message originated in another country. Prosecuting the senders of spam that originated in Australia (assuming it wasn’t a drone mailer) is a little bit easier, since Australian law enforcement officials would have access to the culprit.

The Australian Communications Authority is charged with the enforcement of the Australian anti-spam law and has had a moderate amount of success, apparently being able to get Australia OFF the list of top ten countries from which spam originates (according to SPAMHAUS). Unfortunately, ACA investigators have been unable to keep up with the flow of SPAM being reported manually by Australian citizens, let alone the volumes being forwarded into their system by automated spam honey pots (I love Wikipedia).

In December, to help tackle this problem, the ACA went out to tender for expressions of interest for the provision of a Spam Forensics System. On the 28th of January that tender closed, and it will be interesting to see who put in a bid to provide the system.

The tender document (PDF) details that the ACA expects to spend no more than $300,000.00 on the system, which would need to initially handle 250,000 e-mail submissions per day from 50,000 individual users and be theoretically capable of handling five million messages per day. If the average size of a spam message was five kilobytes, then that five million message ceiling would mean a daily throughput of around twenty-five gigabytes of data, not counting the overhead of the SMTP protocol itself or other protocols used for submission.
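The sizing above is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: the five kilobyte average is my own guess from the paragraph above, I'm using decimal units, and the function names are made up for illustration.

```python
# Back-of-the-envelope sizing for the tender figures.
# Assumption: 5 KB average message size, decimal units (1 GB = 1e9 bytes).
AVG_MESSAGE_BYTES = 5 * 1000
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_gb(messages_per_day: int, avg_bytes: int = AVG_MESSAGE_BYTES) -> float:
    """Raw message data per day in gigabytes, ignoring protocol overhead."""
    return messages_per_day * avg_bytes / 1e9


def mean_rate_per_second(messages_per_day: int) -> float:
    """Average arrival rate if submissions were spread evenly over the day."""
    return messages_per_day / SECONDS_PER_DAY


for label, per_day in [("initial load", 250_000), ("design ceiling", 5_000_000)]:
    print(f"{label}: {daily_volume_gb(per_day):.2f} GB/day, "
          f"{mean_rate_per_second(per_day):.1f} messages/s on average")
```

On those assumptions the initial load is a little over one gigabyte a day and the ceiling is twenty-five. Real traffic would be bursty, so the average rate is a floor for capacity planning rather than a target.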

Speaking of submission, part of the requirements for the EOI was to provide an easy-to-use mechanism for Australian citizens to submit candidate SPAM messages, one that could work on multiple platforms and optionally capture additional information such as why they think a seemingly legitimate e-mail is indeed a SPAM message.

As messages are processed by the system, it would need to automatically categorize them to assist in targeting investigation. For example, messages that don’t originate in Australia and don’t specifically link to an Australian resource would have a lower priority because the chance of a prosecution would be much lower. The flip-side is that messages which contain a new phishing scam affecting Australian businesses like banks would need to automatically be given a higher priority because of the potential for real damage rather than just annoyance. And finally, blatantly illegal activity like the sale of illicit drugs or child pornography would need to be automatically forwarded to the police for investigation.
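Those three rules are simple enough to sketch in code. The flags and priority names below are hypothetical and are my reading of the categories above, not anything taken from the tender document. Working out the flags in the first place is the hard part, and it isn't shown here.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    LOW = 1               # little chance of a prosecution
    NORMAL = 2            # has some Australian link
    HIGH = 3              # potential for real damage
    REFER_TO_POLICE = 4   # hand over rather than investigate in-house


@dataclass
class Findings:
    # Hypothetical flags, assumed to be set by earlier analysis steps.
    originates_in_australia: bool = False
    links_to_australian_resource: bool = False
    phishing_australian_business: bool = False
    blatantly_illegal: bool = False


def triage(f: Findings) -> Priority:
    """Apply the rules in order of severity; the first match wins."""
    if f.blatantly_illegal:
        return Priority.REFER_TO_POLICE
    if f.phishing_australian_business:
        return Priority.HIGH
    if not f.originates_in_australia and not f.links_to_australian_resource:
        return Priority.LOW
    return Priority.NORMAL


print(triage(Findings(phishing_australian_business=True)).name)
print(triage(Findings()).name)
```

Checking the most serious categories first means a phishing message sent from overseas still gets a high priority even though it would otherwise fall into the low bucket.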

In short, the system would need to handle tremendous volumes of data in an efficient and timely manner and could easily become a critical piece of Australian communications infrastructure. If only I had found out about the tender earlier, it would have been something I would have loved to tackle from an architectural point of view.

Since I didn’t have the time to produce something by the deadline, I thought I might post a few thoughts on how I might design the system if it were to be built from scratch.

Overall Architecture

First off, I envisage a system that can accept input from multiple different sources – individual e-mails with spam as attachments, web-service calls containing a verbatim copy of the SPAM message (including headers) and bulk transfers via protocols like FTP.
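As a rough sketch of what that could look like, every channel could be normalised into the same record before anything else happens. The `Submission` type and both function names are hypothetical. The only real API used is Python's standard `email` package, which can pull a spam message back out of a report where it was forwarded as an attachment.

```python
from dataclasses import dataclass
from email import message_from_bytes, policy
from email.message import EmailMessage


@dataclass
class Submission:
    channel: str            # "email", "web-service" or "bulk"
    raw_message: bytes      # verbatim spam, headers included
    reporter_note: str = "" # optional comment from the person reporting it


def from_verbatim(channel: str, raw_message: bytes, note: str = "") -> Submission:
    """Web-service and bulk channels already carry the verbatim message."""
    return Submission(channel, raw_message, note)


def from_email_report(report_bytes: bytes) -> list[Submission]:
    """Extract every message/rfc822 attachment from a forwarded report."""
    report = message_from_bytes(report_bytes, policy=policy.default)
    body = report.get_body(preferencelist=("plain",))
    note = body.get_content().strip() if body is not None else ""
    submissions = []
    pending = [report]
    while pending:
        part = pending.pop(0)
        if part.get_content_type() == "message/rfc822":
            # Do not descend further: the attachment is kept whole.
            attached = part.get_payload(0)
            submissions.append(Submission("email", attached.as_bytes(), note))
        elif part.is_multipart():
            pending.extend(part.get_payload())
    return submissions


# Build a sample report: a note from the reporter plus the spam as an attachment.
spam = EmailMessage()
spam["Subject"] = "Cheap pills"
spam.set_content("Buy now")

report = EmailMessage()
report["Subject"] = "Spam report"
report.set_content("I never signed up for this.")
report.add_attachment(spam)

for s in from_email_report(report.as_bytes()):
    print(s.channel, len(s.raw_message), repr(s.reporter_note))
```

Once everything is a `Submission`, the rest of the system doesn't need to care how a message arrived.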

[Figure: Spam Input Diagram]

Receiving the message is just the beginning. From here it would be processed asynchronously through a pipeline of analysis routines. These routines would be pluggable, so as the requirements of the system change, so could the pipeline, to deal with the information more efficiently. The analysis routines would attach findings to the message (whilst keeping the message intact) and these findings could be used for routing.
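To make the pluggable idea a bit more concrete, here is a minimal sketch. All of the names are hypothetical and the two analysers are deliberately crude. It also runs synchronously to keep the example short, whereas a real system would presumably put a queue between the stages.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Envelope:
    raw_message: bytes                                   # never modified by analysers
    findings: dict[str, dict] = field(default_factory=dict)


# An analyser reads the envelope and returns its findings as a dict.
Analyser = Callable[[Envelope], dict]


class Pipeline:
    def __init__(self) -> None:
        self._stages: list[tuple[str, Analyser]] = []

    def register(self, name: str, analyser: Analyser) -> None:
        """Plug a new analysis routine onto the end of the pipeline."""
        self._stages.append((name, analyser))

    def run(self, envelope: Envelope) -> Envelope:
        """Run each stage in order, filing its findings under the stage name."""
        for name, analyser in self._stages:
            envelope.findings[name] = analyser(envelope)
        return envelope


def size_analyser(env: Envelope) -> dict:
    return {"bytes": len(env.raw_message)}


def au_mention_analyser(env: Envelope) -> dict:
    # Crude stand-in for real link analysis: looks for ".au" anywhere in the text.
    return {"mentions_dot_au": b".au" in env.raw_message.lower()}


pipeline = Pipeline()
pipeline.register("size", size_analyser)
pipeline.register("au_mention", au_mention_analyser)

result = pipeline.run(Envelope(b"Subject: hi\r\n\r\nVisit http://shop.example.com.au/"))
print(result.findings)
```

Later stages can read what earlier stages found, which is what lets the findings drive routing.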

As an example, analysis and routing could be used hand in hand to drastically reduce the amount of computation required, by automatically filtering out SPAM which the ACA couldn’t deal with effectively because it originated from another country and didn’t seem to have any Australian links.

[Figure: Spam Analysis]

The origin analysis component could rely on external services such as Geobytes to determine the source of messages. This pattern could be repeated inside the software until all messages ended up in queues that investigators had some hope of actively tackling. If multiple countries used the same system (or a compatible one) then they could forward leads to each other, including any findings that may have already been computed.
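A rough sketch of origin analysis feeding a routing decision might look like the following. The queue names are made up, and `lookup_country` is a placeholder for whatever geolocation service ends up being used; I haven't modelled any real service's API. Bear in mind too that Received headers can be forged, so the earliest hop is a hint rather than proof.

```python
import re
from email import message_from_bytes
from typing import Callable, Optional

# Relay addresses usually appear in square brackets inside Received headers.
BRACKETED_IPV4 = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def earliest_relay_ip(raw_message: bytes) -> Optional[str]:
    """Return the IP from the oldest Received header that contains one."""
    msg = message_from_bytes(raw_message)
    received = msg.get_all("Received") or []
    # Each hop prepends its header, so the last one listed is the earliest.
    for header in reversed(received):
        match = BRACKETED_IPV4.search(str(header))
        if match:
            return match.group(1)
    return None


def route(raw_message: bytes,
          lookup_country: Callable[[str], Optional[str]],
          has_australian_link: bool) -> str:
    """Pick a queue name (hypothetical) from origin and link findings."""
    ip = earliest_relay_ip(raw_message)
    country = lookup_country(ip) if ip else None
    if country == "AU":
        return "investigate-domestic"
    if has_australian_link:
        return "investigate-australian-link"
    return "forward-to-other-jurisdiction"


sample = (
    b"Received: from mail.example.net ([203.0.113.9]) by mx.example.org\r\n"
    b"Received: from sender.example.com ([198.51.100.7]) by mail.example.net\r\n"
    b"Subject: hi\r\n\r\nbody\r\n"
)

# Stand-in lookup table using documentation-only address ranges.
fake_lookup = {"198.51.100.7": "AU"}.get
print(earliest_relay_ip(sample), route(sample, fake_lookup, has_australian_link=False))
```

Passing the lookup in as a function keeps the routing logic testable without a network call, and makes it easy to swap one geolocation provider for another.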

I’ve got more to write on this topic but I’ll hang up the keyboard for now. As I said – it would have been an interesting project to work on.