Wednesday, September 20, 2006

Fighting Spam for the Common User

I have a Hotmail account that I use primarily as a spam dump (i.e., for the sites that require you to “Sign up for a free account”). It sees less use now that I have found Bugmenot (http://www.bugmenot.com), which provides a forum for sharing usernames and passwords for sites requiring free registration. Even so, my Hotmail account receives about 15 messages a day, over 90% of them useless spam. Logging into the web interface and manually deleting 200 messages at a time quickly became tedious, so I started looking for a better way.

Solution - Version 1

The mail server that I worked on last school year included in its feature set a spam and virus scanner. The server used a full install of Qmailrocks (www.qmailrocks.org), or at least as much as I got working in the time that I had. QMR includes qmail-scanner, a Perl script that uses spamassassin and clamav (with others available) to filter, mark and delete unwanted messages. I decided that since I had a domain and a mailserver, I might be able to do something with my Hotmail-accumulated cruft.

The first hurdle was getting the mail on the machine. Hotmail has ceased to use a straight POP3 interface, so this was not a place to use Fetchmail. Instead, I found a screen-scraper program (which actually downloads each web page and extracts the message) called Gotmail. Gotmail has an option to forward mail to a different address, so I conveniently forwarded it to an address that I created expressly for the purpose. After figuring out that qmail-scanner usually doesn't run SpamAssassin on local mail (and how to enable this option), I ran into another problem.

It turns out that because qms (qmail-scanner) is written in Perl, it is abysmally slow when dealing with mail in bulk (I was processing 200 messages in a batch). Concurrency is limited to one copy at a time, and each instance has to start a new Perl interpreter. Nicht gut.

Before I found that, I tried just setting the default delivery for the appropriate user (I was using vpopmail and virtual domains) to pipe the mail through spamc. Note I said “spamc”, not spamassassin. Spamc talks to spamd, the daemonized version of spamassassin, which reduces load, as the rules must only be loaded once. Doing it this way also raised my concurrency to whatever was set in the spamc config file. I started with the concurrency set to 4 or 5. The first time I triggered my new script, I thought I had crashed my server. Dual Pentium 2s at 333 MHz and 384 MB RAM weren't enough to handle that many processes at once, and the server stopped responding to all input. The other problem with this approach was that mail was marked, but not deleted. I still had to manually go into the folder in Pine and delete a lot of messages. Better, but not perfect. Using qms deleted high-scoring messages, but would require a lot of tweaking to discard all that I wanted.

Solution - Version 2

I think that I have now found an almost-ideal solution. Since I moved into my new apartment, Xavier (my server) has lain dormant for lack of a good home. I found it of vital importance to have unfiltered Internet access, something that CSM does not provide to residents in my area. The server also uses quite a bit of electricity and is quite loud (old 120 mm fans). These factors have conspired against using it here.

I also knew the headaches of setting up a full QMR setup, and because Rogue is my primary desktop machine, I didn't want to load it down with the bulky installation. All I really needed was spamassassin, not qms or even qmail.

I broke the problem down into three blocks. The first was getting the mail onto the machine. The second hurdle was filtering it properly through spamassassin. Lastly, I didn't want to hand-separate spam from good email.

The first block was accomplished once again through gotmail. The contents of my .gotmailrc are:

# See the manpage for more options!

username=[username]
password=[password]
domain=hotmail.com

save-to-login
only-new
delete
summary

folder-dir=/home/kenton
folders=Inbox
retry-limit=3


Read the gotmail man page to learn the nitty-gritty details of each option, but thankfully they've kept it pretty close to plain English. All these options are available on the command line, but an “rc” file keeps it cleaner, and also keeps my username and password out of the process list and shell history.

This file will retrieve all new mail in the “Inbox” folder, delete the copy on the server, and leave it in an mbox-format file in my home directory called “missionpilot9”.
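Since the rc file holds the password in plain text, it seems worth making sure only its owner can read it. Here is a minimal sketch; protect_rc is a hypothetical helper name of mine, and the default path assumes the file lives at ~/.gotmailrc as described above.

```shell
#!/bin/sh
# protect_rc is a hypothetical helper, not part of gotmail.
# It restricts the rc file to read/write for the owner only (mode 600).
protect_rc() {
    rc_file=${1:-$HOME/.gotmailrc}
    chmod 600 "$rc_file"
}

# Example (assumes the file exists):
#   protect_rc "$HOME/.gotmailrc"
```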

The next hurdle is to filter it through spamassassin. I chose not to use spamc/spamd for this task because I didn't need the concurrency (speed is not the point here; efficiency is) and didn't want to set up the appropriate startup/run scripts for spamd.

The way spamassassin normally works is to read a message from standard input, scan it, then put it back out on standard output. We don't have individual messages, we have a whole clump of them in one file. Things are never easy, right? Well, if we add the --mbox option to the spamassassin command line and use some creative redirection, we can do what we need to. This is bash scripting at its best!
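The redirection described above can be sketched as a small function. It mirrors the spamassassin line in the full script further down; tag_mbox is a hypothetical name of mine, and this sketch overwrites the output file rather than appending to it, which is an assumption about what is wanted.

```shell
#!/bin/sh
# tag_mbox is a hypothetical helper name used only for illustration.
# $1 = mbox file as downloaded, $2 = file that receives the tagged copy.
tag_mbox() {
    in_mbox=$1
    out_mbox=$2
    # --mbox tells spamassassin that standard input holds a whole mbox
    # of messages instead of a single message.
    spamassassin --mbox < "$in_mbox" > "$out_mbox"
}

# Example, using the file names from this post:
#   tag_mbox "$HOME/missionpilot9" "$HOME/MP9temp"
```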

Now we have a big glob of SA-tagged messages in another huge mbox file. (361 messages took up 5.1 MB without SA markup, and 5.5 MB with it). Filtering can now be done on a MUA (Mail User Agent) level, such as Thunderbird, Outlook Express (ick, shudder), or Pine. In fact, this was what I tried first. Thunderbird has options to trust spamassassin-added headers, and is perfectly content to filter on X-Spam-Status: (Yes|No). Thunderbird is a huge monolithic application, though, and there must be a better way. Pine! Yes, pine can filter the same way, but that still involves work. There is a world-class best way.

Maildrop is a program very similar to Procmail that can sort mail based on user-definable “recipes”. It was written by Sam Varshavchik (who wrote the Courier suite of mail programs). Maildrop is integrated into Courier, but since my computer doesn't have Courier, I downloaded and installed maildrop on its own.

I don't claim to be able to write maildrop recipes, but I can read and copy them. I have put a few links in the bibliography at the end with more info. Here is what I use:

#~/.mailfilter
if ( /^X-Spam-Status: Yes/ )
{
    to "$HOME/mail/Junk"
}

to "$HOME/mail/cleanmp9"

Very simply, this says “If you find a header that says ‘X-Spam-Status: Yes’, put this message in the junk folder. Otherwise, deliver it to a ‘clean’ folder.”
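Before handing the tagged file to maildrop, a rough count of what the recipe will match can be a handy sanity check. This is only a sketch using grep; count_tagged is a hypothetical name, and because grep looks at every line it would also count a message body line that happens to start with the same text.

```shell
#!/bin/sh
# count_tagged is a hypothetical helper, not part of SpamAssassin or maildrop.
# It counts the Yes/No X-Spam-Status headers in an SA-tagged mbox file.
count_tagged() {
    mbox=$1
    # grep -c exits non-zero when the count is 0, so "|| true" keeps going
    spam=$(grep -c '^X-Spam-Status: Yes' "$mbox" || true)
    ham=$(grep -c '^X-Spam-Status: No' "$mbox" || true)
    echo "spam=$spam ham=$ham"
}

# Example:
#   count_tagged "$HOME/MP9temp"
```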

Maildrop is quite efficient. I test-ran it on a batch of 360 messages. It took 13 minutes to download the messages, 48 minutes to scan them with SpamAssassin, then under a minute to deliver them with Maildrop.
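To put those timings in per-message terms, here is a bit of quick arithmetic. per_message_seconds is a hypothetical helper; the inputs are the 13 and 48 minutes and the 360 messages quoted above. It works out to roughly 2.2 seconds per message to download and 8.0 seconds per message to scan.

```shell
#!/bin/sh
# per_message_seconds is a hypothetical helper for illustration.
# $1 = total minutes, $2 = number of messages; prints seconds per message.
per_message_seconds() {
    minutes=$1
    messages=$2
    awk -v m="$minutes" -v n="$messages" 'BEGIN { printf "%.1f\n", m * 60 / n }'
}

per_message_seconds 13 360   # download step
per_message_seconds 48 360   # SpamAssassin scan
```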

Lastly, I open either Pine or Thunderbird to make sure that the mail was sorted correctly. With that ensured, I am down to the last two tasks.

Sometimes I get spam that doesn't trigger the DCC/Razor/Pyzor corpus checks. It's certainly spam, so we should report it. Reporting it also teaches the local Bayes database to refine further searching.

To report it:

spamassassin -r --mbox < ~/mail/Junk 2> ~/SA-Report.txt

This redirects the Junk folder into spamassassin, with the -r making it report the spam. The “2>” part redirects standard error to a file that I can read at my convenience. If you just want to put it into the Bayes database, you can run the following commands.

sa-learn --spam --mbox --progress [folder]
# where [folder] is the folder of icky messages
sa-learn --ham --mbox --progress [folder]
# don't forget to train sa-learn on ham, too!

To save tedium, I combined almost all of these steps into a single shell script. This way, I can just run hotmail.sh, put it in the background, and come back later to retrieved and filtered mail.


#!/bin/sh
# hotmail.sh
#
# Simple script to battle the spammers
#

# First, let's make sure we're in the correct directory, and remove the UTF-8 language tag
# for better SA performance

cd "$HOME" || exit 1
LANG=en_US; export LANG

# The point is to pull down my Hotmail to a local file first, then clean it
# The relevant settings are in my ~/.gotmailrc, for security reasons

time gotmail

# Now that we have it, we should filter it through SpamAssassin
# We take it from the file where Gotmail dumped it, then run it through SA
# to a file where maildrop can find it.

time spamassassin --mbox < ~/missionpilot9 >> ~/MP9temp

# Lastly, use reformail to feed the individual messages to maildrop
# so that we can sort good from bad and ugly.

time reformail -s maildrop < ~/MP9temp

# We should report all the nasty spam to the proper authorities

# spamassassin -r --mbox < ~/mail/Junk 2> ~/SA-Report.txt

# Clean up any mess that we made
rm missionpilot9
rm MP9temp

#EOF

And there you have it. The good mail is in one folder, the spam in another. All is right with the world. Or at least mostly.
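The “put it in the background” step might look something like the sketch below. run_in_background and the log file name are hypothetical choices of mine, not part of the script above; nohup is there so the job keeps going if the terminal is closed.

```shell
#!/bin/sh
# run_in_background is a hypothetical helper for illustration.
# $1 = script to run, $2 = log file for its output (stdout and stderr).
run_in_background() {
    nohup sh "$1" > "$2" 2>&1 &
    bg_pid=$!   # remember the PID so we can check on it or wait for it later
}

# Example (the log path is an assumption):
#   run_in_background "$HOME/hotmail.sh" "$HOME/hotmail.log"
#   wait "$bg_pid"    # optional: block until it finishes
```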

Future goals: Gotmail has built-in hooks for spamassassin. It would be good to set it up to delete the really high scoring email (say, over 10 points) without even downloading it. It would save bandwidth, hard drive space, and hopefully time. Until then, though, this works fine.

Resource list:

Gotmail

http://www.sourceforge.net/gotmail

The man page is pretty good; you shouldn't need anything else. Copy the sample rc file, edit it to your liking, and you're off and running.

SpamAssassin

http://spamassassin.apache.org/ The main page
http://www.rulesemporium.com Add some custom SA rules
http://www.freespamfilter.org/FC4.html#_Toc110999208 How to install SA with all the goodies, including:

http://www.rhyolite.com/anti-spam/dcc/ DCC
http://www.sourceforge.net/pyzor Pyzor
http://www.sourceforge.net/razor Razor

Maildrop
http://www.courier-mta.org/maildrop/ The main maildrop page
http://www.firstpr.com.au/web-mail/Postfix-SA-Anomy-Maildrop/ Step by step to a well-equipped mail server
http://www.ufsdump.org/papers/courier-imap-fc3.html Courier and accessories on Fedora

Tuesday, September 19, 2006

Watching a sunrise is good for the soul!

Saturday morning I got to watch the sun rise. Although it was from my desk chair in my room (which really faces the wrong way to watch the sunrise), I still marveled at the way the sky morphed from indigo to charcoal grey to blue, then to "really bright" as the sunlight started coming in my window. Robert Sheffey is said to have said "Boy, bring me my sheepskin. I see the sunlight comin' through the trees and it makes me want to pray." Well, I see the sunlight streaming in my window, and it makes me want to write.

The last week has been very intense. It's been a lot of fun, but also a lot of work. College students have a stereotype of only seeing the sun rise by staying up all night. What would you think if I told you that I got to see the sun rise by getting up early?

A common thread that I've been noticing this semester is encouragement to live more aggressively and truly attempt something so big that I can't do it without God's help. Allow me to give you an example. Last Wednesday (the 13th), I worked on an Intermediate Mechanics problem until 1:00 in the morning. I got to bed about 2:00 a.m., and got about 5 hours of sleep before my 8:00 Mechanics of Materials class. Hooray for being rock-stubborn and getting things done...

The next night, I also did something wild and crazy. The local Campus Crusade has a tradition they call the "Cru Challenge". Every year, ideally after the first freshman chemistry test, all the willing volunteers carpool to a preselected location and hike a fourteener. I had wanted to participate this year, but I knew that I had an 8:00 class the next day and an afternoon flight at 3:00, neither of which contributed much to the cause. However, I got talked into it, so as I told my recent friend Chris Carey, "this will be my first fourteener. On 5 hours of sleep. At night. With Cru. And my first class is 8:00 the next morning."

About 700 vertical feet short of the summit of Mt. Bierstadt (at about 1:20 in the morning), it started snowing and the decision was made to turn back. I was content not to summit, because I quickly found that I was running out of endurance. The walk down was cold, and I was thankful for every item of clothing that I brought.

I made it back to my house at 5:30 after 5 hours of hard hiking and 1 1/2 hours of riding in the car. I slept until 7:30, and was only 7 minutes late for my 8:00 class. I got home at about 3:30 that afternoon, and went to bed at 4:30. I slept solidly for 11 hours, and was wide awake at 3:30 AM Saturday.

What does one do at 3:30 in the morning with a full night's sleep? I'll tell you. One catches up on leftover mess from the preceding week. One also watches the sunrise. I'm here to tell you that it is true what the Bible says. Psalm 30:5 (KJV) says "weeping may endure for a night, but joy cometh in the morning." Also in Lamentations 3:22-23, we learn that His mercies are "new every morning." That alone is enough reason to enjoy a sunrise.

That's my story for now, everybody. The moral? Sunrises are cool, and God is good, as always. God Bless, and I'll write more soon!

Sunday, September 03, 2006

In statu pupillari

The title of this blog is “In Statu Pupillari”. But, pray tell, why would I have picked such an obtuse name? Why, I'm glad you asked.

In Statu Pupillari - the Latin

The phrase is Latin, and it means “In the status of a student”. I am a student of life. One of my acquaintances refers to a thing he calls “Beginner's Mind”. Beginner's Mind is always having an open mind - always looking to improve yourself. I've been around a lot of people who have chosen to stop learning. It's actually quite sad. “On the day you stop learning, you die.” I never want to stop learning.

Why Latin? First, because it just sounds neat. Second, because in a former age, Latin was the language of the learned. It was also the language of religious oppression (the Catholic church), but I guess you have to take the good with the bad. English is a rich language, and we owe older languages a lot, including Latin.

In Statu Pupillari - The Implications

I could just say “Yep, I'm a college student” rather than some high-falutin' sounding Latin phrase, but I have a couple reasons why not.

The first reason is similar to the reasoning above. I don't just learn at college. I learn from all that God brings into my life. Sometimes I don't learn as much as I should, but that's because I'm still a fallen person.

I realize as I get older how much “college student” isn't really a compliment. When I was younger, reckless and stupid things (usually in Fort Collins) could be dismissed by attributing them to “college kids.” They were the ones who did things outside the rules: sometimes ingenious but more often just stupid.

You used to be able to be proud to be called a college student. In the heyday of schools like Cambridge and Oxford, it was an honor to go to a university. Read about people like Erdos, Nash, and others. That was when people _worked_ in college. Parties were not about “getting wasted” and fulfilling the lusts of the flesh (or the lust of the eyes, or the pride of life). At least that's the way that history has been handed down to me. I found some interesting reading at http://paul.mertion.ox.ac.uk/education/cambridge.html . Some of the rules seem silly now, but it appears to me that they were enacted to protect the honour of the students and of the university. Would that my school today had such a high sense of honour! There is a great quote that I read somewhere (sadly, without a source): “Glory may be won, but honor must never be lost.” Or, in the words of Professor Buckland, “If you can't make a mark, at least try not to leave a stain.”

Long story to say - I'm not just a “college student”; I'm a professional student. I'm not here to spend my time forgetting about Mines and classes; I'm here to enrich my mind and soul.

In Statu Pupillari - The Blog

Back to the blog, I promise. This blog is to document my journeys through life. Sometimes it will be for myself (the idea of standing stones comes to mind), sometimes it will be to share. I foresee at least two main threads. One will be the uber-geeky posts on “Wait till you see what I did with my super bash-shell Super Linux-BSDish Version 3.14159 scripting skills”. The other (which I trust will be much more thought-provoking) will have a philosophical slant - “Kenton's reactions and cogitations on life as he perceives it.” I'm looking for a way to differentiate these, but haven't figured one out yet.

Feel free to leave comments or email with any thoughts...welcome to my next adventure!