I have implemented Bayesian spam filtering for this weblog. I have no idea if it is working. Well, I think it’s working. I’ve trained it with all the comments in the database. They’re all considered not spam, since I always delete spam when I see it. Then I had to go back into some archives and find spam comments that had been posted, to teach it what spam tastes like. I could only find 19 such spams, which I have a feeling isn’t quite enough.
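For the curious, here is a rough sketch of how a single word could be scored after that kind of training. It follows the formula published in Paul Graham's "A Plan for Spam" and is not necessarily what this weblog's code does. The function name `token_spam_prob` and its parameters are made up for illustration, and it assumes both comment counts are greater than zero.

```php
<?php
// Hypothetical helper, following the per-token formula in "A Plan for Spam".
// $good / $bad:   times the token appeared in non-spam / spam comments.
// $ngood / $nbad: number of non-spam / spam comments trained on (assumed > 0).
function token_spam_prob(int $good, int $bad, int $ngood, int $nbad): ?float
{
    $g = 2 * $good;                 // good counts are doubled to bias against false positives
    if ($g + $bad < 5) {
        return null;                // seen too rarely to say anything useful
    }
    $gr = min(1.0, $g / $ngood);    // how common the token is in good comments
    $br = min(1.0, $bad / $nbad);   // how common the token is in spam
    // Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, $br / ($gr + $br)));
}
```

With only 19 spams in the training set, a word needs to show up five times in them (and never in a good comment) before it counts at all, at which point it is pinned at 0.99.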
When asked to score the good comments in the database, it is currently giving scores like .00000000000000000000001% chance of spam. And when I ask it to score some of the spams I’ve used for training, it says 100%. So it is not obviously completely broken. But I’m not sure what it’ll do when it sees stuff it’s never seen before. I can’t wait to find out… so Allez Spam!
I haven’t yet thrown the switch that consigns comments that rate highly on the spam-o-meter to oblivion, because I’m not too confident in the system. For now, it’s just working behind the scenes, rating and learning.
I first tried to use an existing free implementation of Bayesian filtering in PHP by Loic d’Anterroches. I couldn’t quite get it to work, and it was a little too general, so instead I rolled my own.
I based it on these two articles by Paul Graham, and partially on the Bayesian spam filter for MT. Translating Paul’s oatmeal and fingernail clippings into PHP was... entertaining:
(let ((prod (apply #'* probs)))
  (/ prod (+ prod (apply #'* (mapcar #'(lambda (x) (- 1 x)) probs)))))
becomes
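// Both running products must start at 1.
$ptop = 1.0;
$pbot = 1.0;
for ($i = 0; $i < $nprobs; $i++)
{
    $ptop *= $probs[$i];        // product of the per-token spam probabilities
    $pbot *= (1 - $probs[$i]);  // product of their complements
}
return $ptop / ($ptop + $pbot);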
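Wrapped up as a standalone function, the same calculation might look like the sketch below. The name `combine_probs` is hypothetical, and the fallback to 0.5 when both products are zero is an assumption added here, not something taken from Paul's article.

```php
<?php
// Hypothetical helper: combine per-token spam probabilities into one score.
function combine_probs(array $probs): float
{
    $ptop = 1.0;   // product of p
    $pbot = 1.0;   // product of (1 - p)
    foreach ($probs as $p) {
        $ptop *= $p;
        $pbot *= (1.0 - $p);
    }
    $denom = $ptop + $pbot;
    // Assumption: if both products are zero, report "no idea" (0.5)
    // instead of dividing by zero.
    return $denom > 0.0 ? $ptop / $denom : 0.5;
}
```

Feed it fifteen tokens that each score 0.01, as in `combine_probs(array_fill(0, 15, 0.01))`, and the result is vanishingly small, which is why scores on known-good comments can come out with a long string of zeros.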
I’ve got it integrated into b2, but I’m not going to “officially” release the code until I know that it’s working and useful for comment spam. Of course, those of you who just can’t wait, want to see my horrible slapped-together code, and know how to use my view-source feature, be my guest. If it does prove to work, I’ll release it as a b2 hack, and hopefully the Wordpress guys will like it. (I really have to upgrade to Wordpress one of these days…)
(By the way: While I was working on this, I did introduce a bug that completely broke comments for the past day or so. If you tried to post during that time, sorry! It’s fixed now.)