Wednesday, September 20, 2006

Fighting Spam for the Common User

I have a Hotmail account that I use primarily as a spam dump, i.e. the address I give to sites that require you to “Sign up for a free account”. It sees less use now that I have found Bugmenot (http://www.bugmenot.com), which provides a forum for sharing usernames and passwords for sites requiring free registration. Still, one way or another, my Hotmail account receives about 15 messages a day, over 90% of them useless spam. Logging into the web interface and manually deleting 200 messages at a time quickly became tedious, so I started looking for a better way.

Solution - Version 1

The mail server that I worked on last school year included in its feature set a spam and virus scanner. The server used a full install of Qmailrocks (www.qmailrocks.org), or at least as much as I got working in the time that I had. QMR includes qmail-scanner, a Perl script that uses spamassassin and clamav (with others available) to filter, mark and delete unwanted messages. I decided that since I had a domain and a mailserver, I might be able to do something with my Hotmail-accumulated cruft.

The first hurdle was getting the mail on the machine. Hotmail has ceased to use a straight POP3 interface, so this was not a place to use Fetchmail. Instead, I found a screen-scraper program (which actually downloads each web page and extracts the message) called Gotmail. Gotmail has an option to forward mail to a different address, so I conveniently forwarded it to an address that I created expressly for the purpose. After figuring out that qmail-scanner usually doesn't run SpamAssassin on local mail (and how to enable this option), I ran into another problem.

It turns out that because qmail-scanner (qms) is written in Perl, it is abysmally slow when dealing with mail in bulk (I was processing 200 messages in a batch). Concurrency is limited to one copy at a time, and each instance has to start a new Perl interpreter. Nicht gut.

Before I found that, I tried just setting the default delivery for the appropriate user (I was using vpopmail and virtual domains) to pipe the mail through spamc. Note I said “spamc”, not spamassassin. Spamc talks to spamd, the daemonized version of spamassassin, which reduces load, as the rules need to be loaded only once. Doing it this way also raised my concurrency to whatever was set in the spamc config file. I started with the concurrency set to 4 or 5. The first time I triggered my new script, I thought I had crashed my server. Dual Pentium 2s at 333 MHz and 384 MB RAM weren't enough to handle that many processes at once, and the server stopped responding to all input. The other problem with this approach was that mail was marked, but not deleted. I still had to manually go into the folder in Pine and delete a lot of messages. Better, but not perfect. Qms did delete high-scoring messages, but it would have required a lot of tweaking to discard all that I wanted.

Solution - Version 2

I think that I have now found an almost-ideal solution. Since I moved into my new apartment, Xavier (my server) has lain dormant for lack of a good home. I found it of vital importance to have unfiltered Internet access, something that CSM does not provide to residents in my area. Also, it uses quite a bit of electricity and is quite loud (old 120 mm fans). The above factors have conspired against using it here.

I also knew the headaches of setting up a full QMR setup, and because Rogue is my primary desktop machine, I didn't want to load it down with the bulky installation. All I really needed was spamassassin, not qms or even qmail.

I broke the problem down into three blocks. The first was getting the mail onto the machine. The second hurdle was filtering it properly through spamassassin. Lastly, I didn't want to hand-separate spam from good email.

The first block was accomplished once again through gotmail. The contents of my .gotmailrc are:

# See the manpage for more options!

username=[username]
password=[password]
domain=hotmail.com

save-to-login
only-new
delete
summary

folder-dir=/home/kenton
folders=Inbox
retry-limit=3


Read the gotmail man page to learn the nitty-gritty details of each option, but thankfully they've kept it pretty close to plain English. All these options are available on the command line, but an “rc” file keeps things cleaner and also keeps my username and password out of the process list and shell history.

This file will retrieve all new mail in the “Inbox” folder, delete the copy on the server, and leave it in an mbox-format file in my home directory called “missionpilot9”.
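Since the rc file holds the password in plain text, it is probably worth making it readable only by its owner. A minimal sketch, assuming the file lives at ~/.gotmailrc as described above; the function name lock_down_rc is made up for illustration and is not part of gotmail:

```shell
#!/bin/sh
# lock_down_rc is a hypothetical helper, not part of gotmail.
# It restricts a credentials file to owner read/write only (mode 600).
# $1 = path to the rc file (defaults to ~/.gotmailrc, an assumed location).
lock_down_rc() {
    rcfile=${1:-"$HOME/.gotmailrc"}
    if [ ! -f "$rcfile" ]; then
        echo "no such file: $rcfile" >&2
        return 1
    fi
    chmod 600 "$rcfile"
}
```

Calling lock_down_rc with no argument targets the default path; passing a path lets you try it on a scratch file first.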

The next hurdle is to filter it through spamassassin. I chose not to use spamc/spamd for this task because I didn't need the concurrency (speed is not the point here; efficiency is) and didn't want to set up the appropriate startup/run scripts for spamd.

The way spamassassin normally works is to read a message from standard input, scan it, then put it back out on standard output. We don't have individual messages, we have a whole clump of them in one file. Things are never easy, right? Well, if we add the --mbox option to the spamassassin command line and use some creative redirection, we can do what we need to. This is bash scripting at its best!
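The redirection described above can be sketched roughly as follows. The spamassassin call is the same one that appears in the full script later in this post; the wrapper function scan_mbox and its argument names are only illustrative.

```shell
#!/bin/sh
# scan_mbox is an illustrative wrapper; only the spamassassin call
# inside it comes from the script in this post.
# $1 = mbox file to read, $2 = file to receive the tagged messages.
scan_mbox() {
    src=$1
    dst=$2
    [ -r "$src" ] || { echo "cannot read $src" >&2; return 1; }
    # --mbox tells spamassassin that stdin holds many messages in mbox
    # format rather than a single message.
    spamassassin --mbox < "$src" > "$dst"
}
```

For example, scan_mbox ~/missionpilot9 ~/MP9temp reads the file gotmail wrote and leaves the tagged copy in a second mbox file.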

Now we have a big glob of SA-tagged messages in another huge mbox file. (361 messages took up 5.1 MB without SA markup, and 5.5 MB with it). Filtering can now be done on a MUA (Mail User Agent) level, such as Thunderbird, Outlook Express (ick, shudder), or Pine. In fact, this was what I tried first. Thunderbird has options to trust spamassassin-added headers, and is perfectly content to filter on X-Spam-Status: (Yes|No). Thunderbird is a huge monolithic application, though, and there must be a better way. Pine! Yes, pine can filter the same way, but that still involves work. There is a world-class best way.

Maildrop is a program very similar to Procmail that can sort mail based on user-definable “recipes”. It is written by Sam Varshavchik (who wrote the Courier suite of mail programs). Maildrop is integrated into Courier, but since my computer doesn't have Courier, I downloaded and installed maildrop on its own.

I don't claim to be able to write maildrop recipes, but I can read and copy them. I have put a few links in the bibliography at the end with more info. Here is what I use:

#~/.mailfilter
if ( /^X-Spam-Status: Yes/ )
{
to "$HOME/mail/Junk"
}

to "$HOME/mail/cleanmp9"

Very simply, this says: if you find a header that says “X-Spam-Status: Yes”, put this message in the Junk folder. Otherwise, deliver it to a “clean” folder.

Maildrop is quite efficient. I test-ran it on a batch of 360 messages. It took 13 minutes to download the messages, 48 minutes to scan them with SpamAssassin, then under a minute to deliver them with Maildrop.

Lastly, I open either Pine or Thunderbird to make sure that the mail was sorted correctly. With that ensured, only two tasks remain.
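A quick count can back up the eyeball check. The sketch below is only an assumption about how one might do it: mbox_summary is a made-up helper that counts the messages in an mbox file and how many of them carry an “X-Spam-Status: Yes” header.

```shell
#!/bin/sh
# mbox_summary is a hypothetical helper for a quick sanity check.
# It prints "<messages> <flagged>" for one mbox file, where <flagged>
# counts messages whose headers contain "X-Spam-Status: Yes".
mbox_summary() {
    awk '
        /^From / { total++; in_headers = 1; next }   # mbox message separator
        in_headers && /^$/ { in_headers = 0; next }  # blank line ends headers
        in_headers && /^X-Spam-Status: Yes/ { flagged++ }
        END { printf "%d %d\n", total, flagged }
    ' "$1"
}
```

If the recipe did its job, mbox_summary ~/mail/Junk should print two equal numbers, and mbox_summary ~/mail/cleanmp9 should print zero as the second number.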

Sometimes I get spam that doesn't trigger the DCC/Razor/Pyzor corpus checks. It's certainly spam, so we should report it. Reporting it also trains the local Bayes database, which refines future scans.

To report it:

spamassassin -r --mbox < ~/mail/Junk 2> ~/SA-Report.txt

This redirects the Junk folder into spamassassin, with the -r making it report the spam. The “2>” part redirects standard error to a file that I can read at my convenience. If you just want to put it into the Bayes database, you can run the following commands.

sa-learn --spam --mbox --progress [folder]
# where [folder] is the folder of icky messages
sa-learn --ham --mbox --progress [folder]
# don't forget to train sa-learn on ham, too!

To save tedium, I combined almost all of these steps into a single shell script. This way, I can just run hotmail.sh, put it in the background, and come back later to retrieved and filtered mail.


#!/bin/sh
# hotmail.sh
#
# Simple script to battle the spammers
#

# First, let's make sure we're in the correct directory, and remove the UTF-8 language tag
# for better SA performance

cd "$HOME" || exit 1
LANG=en_US; export LANG

# The point is to pull down my Hotmail to a local file first, then clean it
# The relevant settings are in my ~/.gotmailrc, for security reasons

time gotmail

# Now that we have it, we should filter it through SpamAssassin
# We take it from the file where Gotmail dumped it, then run it through SA
# to a file where maildrop can find it.

time spamassassin --mbox < ~/missionpilot9 >> ~/MP9temp

# Lastly, use reformail to feed the individual messages to maildrop
# so that we can sort good from bad and ugly.

time reformail -s maildrop < ~/MP9temp

# We should report all the nasty spam to the proper authorities

# spamassassin -r --mbox < ~/mail/Junk 2> ~/SA-Report.txt

# Clean up any mess that we made
rm missionpilot9
rm MP9temp

#EOF
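One thing the script above does not do is check whether each stage succeeded before removing the temporary files. Because the gotmail configuration also deletes the server copy, a failed scan could mean lost mail. Here is a rough sketch of one way to guard against that. The names process_mail, fetch_cmd, scan_cmd and deliver_cmd are placeholders invented for this example.

```shell
#!/bin/sh
# Hypothetical sketch: fetch_cmd, scan_cmd and deliver_cmd are placeholder
# names standing in for the gotmail, spamassassin and reformail/maildrop
# stages. Define them as functions before calling process_mail.
#
# Example wiring, using the commands from hotmail.sh:
#   fetch_cmd()   { gotmail; }
#   scan_cmd()    { spamassassin --mbox; }
#   deliver_cmd() { reformail -s maildrop; }
#   process_mail ~/missionpilot9 ~/MP9temp
process_mail() {
    raw=$1      # mbox written by the download stage
    tagged=$2   # mbox written by the scanning stage

    fetch_cmd                        || return 1
    scan_cmd < "$raw" > "$tagged"    || return 2
    deliver_cmd < "$tagged"          || return 3

    # Reached only if every stage above succeeded.
    rm -f "$raw" "$tagged"
}
```

The return code tells you which stage failed, and both files are left in place so the run can be repeated by hand.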

And there you have it. The good mail is in one folder, the spam in another. All is right with the world. Or at least mostly.

Future goals: Gotmail has built-in hooks for spamassassin. It would be good to set it up to delete the really high scoring email (say, over 10 points) without even downloading it. It would save bandwidth, hard drive space, and hopefully time. Until then, though, this works fine.
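To get a feel for what a 10-point cutoff would catch, here is a small sketch that counts messages above a threshold in an mbox file that SpamAssassin has already tagged. It works on mail that is already downloaded, so it saves no bandwidth, and it has nothing to do with Gotmail's own hooks. The helper name count_over_threshold is invented, and the code assumes the header carries the score as “score=N”; if your SpamAssassin version writes the header differently, the pattern would need adjusting.

```shell
#!/bin/sh
# count_over_threshold is a hypothetical helper. It assumes a header of
# the form "X-Spam-Status: Yes, score=12.3 ..." (an assumption about the
# SpamAssassin version in use).
# $1 = mbox file, $2 = score threshold (defaults to 10).
count_over_threshold() {
    awk -v limit="${2:-10}" '
        /^From / { in_headers = 1; next }            # mbox message separator
        in_headers && /^$/ { in_headers = 0; next }  # blank line ends headers
        in_headers && /^X-Spam-Status:/ {
            # Pull out the number that follows "score=".
            if (match($0, /score=-?[0-9]+(\.[0-9]+)?/)) {
                score = substr($0, RSTART + 6, RLENGTH - 6) + 0
                if (score > limit + 0) over++
            }
        }
        END { printf "%d\n", over }
    ' "$1"
}
```

Running count_over_threshold ~/MP9temp prints how many tagged messages scored above 10; a second argument changes the cutoff.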

Resource list:

Gotmail

http://www.sourceforge.net/gotmail

The man page is pretty good; you shouldn't need anything else. Copy the sample rc file, edit it to your liking, then you're off and running.

SpamAssassin

http://spamassassin.apache.org/ The main page
http://www.rulesemporium.com Add some custom SA rules
http://www.freespamfilter.org/FC4.html#_Toc110999208 How to install SA with all the goodies, including:

http://www.rhyolite.com/anti-spam/dcc/ DCC
http://www.sourceforge.net/pyzor Pyzor
http://www.sourceforge.net/razor Razor

Maildrop
http://www.courier-mta.org/maildrop/ The main maildrop page
http://www.firstpr.com.au/web-mail/Postfix-SA-Anomy-Maildrop/ Step by step to a well-equipped mail server
http://www.ufsdump.org/papers/courier-imap-fc3.html Courier and accessories on Fedora

2 Comments:

Anonymous said...

Yeah, right, you're not computer science! I rely entirely on my school e-mail address to get rid of spam. It puts it all into a "junk e-mail" message, which I can review and delete. I only get two or three other junk messages, which are easy to identify and delete. Your way is probably better, but waaaay over my head!

September 21, 2006 at 1:16 PM  
The student of life said...

It is now marked as a "geek post" (9/21)...although I knew it would be of limited interest anyway.

This way is very overkill for the normal user. The school filters a lot of spam out of my normal email, then Thunderbird (my mail client) usually snags the rest. This is just a great way to deal with big batches of nasty email. Besides, it's also elegant.

September 21, 2006 at 11:13 PM  
