One of the worst things in blogging (aside from the constant, unrelenting abuse and hate mail you subject yourself to) is dealing with SPAM. Here you are just talking openly about whatever and someone tries to make money off of it. Low…
I’ve gone out of my way to kill the spammers. First, I only allow comments by registered users, and registered users receive their password only at the email address they provide at registration. This nearly eliminates comment spam; I may have had one junk message in the past year.
Second, I absolutely neutered the trackback SPAM. These are the fake comments you see on Vladville that say “Mike said this: ” followed by my post contents. What I have essentially done to neutralize these monkeys is remove the hyperlinks to their URLs, so even though they SPAM me, it gets them nowhere and just sends more links to my blog.
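For anyone who wants to try the same trick, here is a minimal sketch of the link stripping in Python. It is not the code running on this blog: the function name `unlink` is made up for illustration, and it assumes the trackback excerpt arrives as a small HTML string. A regex is good enough for simple anchors like these, but it is not a real HTML parser.

```python
import re

# Matches <a ...>inner text</a>, case-insensitively and across line breaks.
ANCHOR = re.compile(r"(?is)<a\b[^>]*>(.*?)</a>")


def unlink(trackback_html: str) -> str:
    """Replace every anchor with just its inner text, so no link survives."""
    return ANCHOR.sub(r"\1", trackback_html)


if __name__ == "__main__":
    sample = '<a href="http://spam.example/">Mike</a> said this: some quoted text'
    print(unlink(sample))
```

The spammer's name still shows up, but with the anchor gone there is no link left for them to benefit from.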
I kind of make a living killing spammers, so this goes a little beyond mere annoyance; it’s outright emasculating. So I’m sitting here in the atomic tangerine lab trying to come up with some replicable pattern that I can use. Most trackback anti-SPAM plugins rely on curl to check whether the offending web page has a direct link to my blog post. For pretty much every spammer, that seems to be the case, so that check alone doesn’t catch them. So here is about the only thing I have come up with so far:
All trackback SPAM has the full post URL in it. The page also quotes, partially, my blog post and attributes it to someone else. I am intercepting the URL, downloading it with curl, stripping out all the HTML and running preg_match between the two posts.
Because all HTML and punctuation are ignored, it should be pretty easy to find a run of matching text at least 100 characters long.
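Here is a rough sketch of that comparison, written in Python rather than the PHP and preg_match I am actually using. The function names are invented for the example, the fetching step (curl) is left out so that both pages are passed in as strings, and the longest-common-substring search via `difflib` is my stand-in for the pattern match, not a description of what any plugin does.

```python
import html
import re
from difflib import SequenceMatcher

MIN_SHARED_CHARS = 100  # the "at least 100 characters" threshold


def normalize(markup: str) -> str:
    """Drop tags, entities, punctuation and case so only the words remain."""
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", markup)
    text = re.sub(r"(?s)<[^>]+>", " ", text)      # remaining tags
    text = html.unescape(text)                     # &amp; -> & and so on
    text = re.sub(r"[^\w\s]", "", text.lower())   # punctuation
    return " ".join(text.split())                  # collapse whitespace


def longest_shared_run(post_html: str, page_html: str) -> int:
    """Length of the longest run of text that both pages have in common."""
    a, b = normalize(post_html), normalize(page_html)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def looks_like_scraper(post_html: str, page_html: str,
                       threshold: int = MIN_SHARED_CHARS) -> bool:
    """True when the linking page copies a long stretch of the post."""
    return longest_shared_run(post_html, page_html) >= threshold
```

The character-by-character search is slow on very long pages, so a real version would probably cap how much of each page it compares.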
For the most part, nobody quotes paragraphs and paragraphs of text in a blog post; they merely link to the article and offer their point of view on it. Let’s see how it goes. Right now I am just logging the matches and not discarding them automatically.
Off topic… I tried this too:
Most trackback spam happens within minutes of the post going live. It is almost safe to say that nobody would have read, thought about and produced a post referencing me within, let’s say, 30 minutes of my blog post. Anything that fast either has no life at all or is a spambot.
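The timing check is simple enough to sketch in a few lines of Python. The function name and the idea of passing in both timestamps are assumptions made for the example; the 30 minute window is the guess from above, not a measured number.

```python
from datetime import datetime, timedelta

SUSPICIOUS_WINDOW = timedelta(minutes=30)  # guess from the post, tune to taste


def arrived_too_fast(published_at: datetime, trackback_at: datetime,
                     window: timedelta = SUSPICIOUS_WINDOW) -> bool:
    """True when the trackback shows up inside the window after publishing."""
    elapsed = trackback_at - published_at
    return timedelta(0) <= elapsed < window


if __name__ == "__main__":
    published = datetime(2008, 1, 1, 12, 0)
    print(arrived_too_fast(published, published + timedelta(minutes=5)))
    print(arrived_too_fast(published, published + timedelta(hours=2)))
```

On its own this would also flag a very quick human, so it makes more sense as one signal to log next to the text match than as an automatic discard.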
If you have a better idea, I’m all ears…