image spam spam spam spam spam
Sorry about that… Most of the spam I’ve been getting over the past few weeks consists of a bunch of random text (specially designed to look like a legitimate email) plus a .gif attachment with text advertising stocks, casinos, Viagra, Xenical and dog knows what. I was tired of watching the Thunderbird Bayesian filters being defeated by these bastards, so today I devoted an hour or two to building some sort of anti-spam infrastructure everyone in the family could use, regardless of what email software they use (Outlook, Outlook Express, mobile phones). They’re always complaining about the spam they get.
A side note: “SPAM” got its name not directly from the gruesome canned meat but from a Monty Python sketch (which you can check out on YouTube). They were looking for a term that described something annoying and repetitive. If SPAM had been “invented” in 2000-something, they would probably have called it “crazy frog”.
Instead of creating an anti-spam gateway, I run a bunch of scripts which periodically check everyone’s mailboxes, pipe the messages through SpamAssassin and delete those which look like spam. The original script was found here, but I had to make a few changes to store the mail passwords in a safe place and to save the spam messages (just in case a false positive occurs).
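To give an idea of what such a sweep looks like, here is a minimal Python sketch. It is not the script mentioned above. The function names and the "Spam" folder are made up for illustration, and it assumes the `spamassassin` command is on the PATH and that `conn` is an already logged-in `imaplib` connection.

```python
import subprocess


def is_spam(raw_message: bytes) -> bool:
    """Pipe one raw message through the spamassassin command.

    With -e (--exit-code) spamassassin exits non-zero when it judges
    the message to be spam. Note that a crash would also look like spam here.
    """
    result = subprocess.run(
        ["spamassassin", "-e"],
        input=raw_message,
        stdout=subprocess.DEVNULL,
    )
    return result.returncode != 0


def sweep_inbox(conn, classify=is_spam, spam_folder="Spam"):
    """Move every INBOX message that `classify` flags into spam_folder.

    `conn` is a logged-in imaplib.IMAP4 (or compatible) object.
    Returns the number of messages moved.
    """
    conn.select("INBOX")
    _, data = conn.search(None, "ALL")
    moved = 0
    for num in data[0].split():
        _, parts = conn.fetch(num, "(RFC822)")
        raw = parts[0][1]
        if classify(raw):
            # Save a copy before deleting, in case of a false positive.
            conn.append(spam_folder, None, None, raw)
            conn.store(num, "+FLAGS", "\\Deleted")
            moved += 1
    conn.expunge()
    return moved
```

Run from cron, this would be called once per mailbox, for example with `imaplib.IMAP4_SSL(host)` followed by `login(user, password)`, where host and credentials come from wherever the passwords are kept safe.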
Step two is a way to teach SpamAssassin a few extra tricks. I had a 10+ MB junk mail folder, ran SpamAssassin over it and kept the remains. These remains, along with whatever spam still gets through once in a while, will go to an IMAP folder called LearnAsSpam. I’m planning to use this procedure to fetch the mail from the IMAP folder, pipe it to sa-learn and then delete it. As simple as that.
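The fetch, learn, delete loop could look roughly like the Python sketch below. This is not the linked procedure. The function names are hypothetical, and it assumes `sa-learn` is on the PATH and reads a single message from standard input when no file is given.

```python
import subprocess


def sa_learn_spam(raw_message: bytes) -> None:
    """Feed one raw message to sa-learn as spam via stdin."""
    subprocess.run(
        ["sa-learn", "--spam"],
        input=raw_message,
        stdout=subprocess.DEVNULL,
        check=True,  # raise if sa-learn reports an error
    )


def drain_learn_folder(conn, folder="LearnAsSpam", learn=sa_learn_spam):
    """Train on every message in `folder`, then delete it.

    `conn` is a logged-in imaplib.IMAP4 (or compatible) object.
    Returns the number of messages learned.
    """
    conn.select(folder)
    _, data = conn.search(None, "ALL")
    learned = 0
    for num in data[0].split():
        _, parts = conn.fetch(num, "(RFC822)")
        # If learning fails this raises before the message is flagged.
        learn(parts[0][1])
        conn.store(num, "+FLAGS", "\\Deleted")
        learned += 1
    conn.expunge()
    return learned
```

Anyone in the family can then just drag missed spam into LearnAsSpam from whatever mail client they use, and the next run picks it up.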
Now, the icing on the cake. OCR (Optical Character Recognition) is run over the image attachment, and the script looks for keywords that anti-spam software commonly detects in plain text but that fool it when rendered in a graphic file. Next thing we know, spammers will be using captchas…
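As a rough illustration of the idea (not the actual script), the Python sketch below pulls GIF attachments out of a message, converts them with netpbm’s `giftopnm`, reads them with `gocr` and counts keyword hits. The keyword list and function names are invented for the example, and chaining the two tools through stdin is an assumption about how they are wired together.

```python
import email
import re
import subprocess

# Illustrative keyword list, not the list used by the real script.
SPAM_WORDS = ("viagra", "xenical", "casino", "stock")


def gif_attachments(raw_message: bytes):
    """Yield the decoded bytes of every image/gif part in the message."""
    msg = email.message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_content_type() == "image/gif":
            payload = part.get_payload(decode=True)
            if payload:
                yield payload


def ocr_gif(gif_bytes: bytes) -> str:
    """Convert a GIF to PNM with giftopnm, then read the text with gocr."""
    pnm = subprocess.run(
        ["giftopnm"], input=gif_bytes, stdout=subprocess.PIPE, check=True
    ).stdout
    text = subprocess.run(
        ["gocr", "-i", "-"], input=pnm, stdout=subprocess.PIPE, check=True
    ).stdout
    return text.decode("utf-8", errors="replace")


def spam_word_hits(text: str, words=SPAM_WORDS) -> int:
    """Count how many distinct keywords occur in the OCR output.

    Everything except letters is dropped first, so spaced-out or
    broken-up spellings such as "V I A G R A" still match.
    """
    squashed = re.sub(r"[^a-z]", "", text.lower())
    return sum(1 for word in words if word in squashed)
```

Squashing the text like this is deliberately crude: it tolerates OCR noise, but it can also match a keyword across two unrelated words, so the hit count is better treated as one score among several than as a verdict on its own.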
This is the procedure documentation along with the scripts (a few extra Ubuntu packages are required, such as netpbm, imagemagick and gocr).
Results?
Content analysis details:   (6.0 points, 4.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.8 EXTRA_MPART_TYPE       Header has extraneous Content-type:...type= entry
 2.0 DATE_IN_FUTURE_03_06   Date: is 3 to 6 hours after Received: date
 0.2 HTML_MESSAGE           BODY: HTML included in message
 3.0 OCR                    BODY: Check if text in attached images contains spam words
:) I’ll see if I can get a few hours of sleep now, enjoy the weekend relatively spam-free and then head back to take a peek at FuzzyOCR (step by step installation and configuration guide). Looks promising.