Limiting Trackback Spam Further
Those trackback spammers are getting smarter: I've had two get past my filters in the past two days.
I wrote before about my attempts to limit trackback spam. My method is to visit the trackback URL and make sure it links back to me before letting it onto the site. To keep my bandwidth down, I also blacklist a site after X failed tries.
I always look at trackbacks that get past my filters, so I was annoyed to see spam and was interested to see how it beat my system.
You can pretend to be a web browser using telnet, so I did that to see how the spammer's site behaves.
telnet c13183.traffdodkok.info 80
Trying 66.232.122.14...
Connected to c13183.traffdodkok.info.
Escape character is '^]'.
GET /1369483/ HTTP/1.1
Host: c13183.traffdodkok.info

HTTP/1.1 200 OK
Date: Fri, 08 Jun 2007 07:41:22 GMT
Server: Apache/2.0.59
Vary: Host
Content-Length: 242
Content-Type: text/html; charset=UTF-8

hey! your Link a here : <a href="http://www.robertprice.co.uk/robblog/archive/2005/8/Trying_To_Limit_Trackback_Spam.shtml
">Blog</a><br/>Given from:<br/>http://www.robertprice.co.uk/robblog/archive/2005/8/Trying_To_Limit_Trackback_Spam.shtml
Connection closed by foreign host.
Hmmm, that looks fine, no obvious spam there. When you visit with a real web browser, however, a busty amateur called Dawn is waiting to greet you.
I tried again, but this time adding in a fake User-Agent string. This is an additional header a browser sends to identify itself to a web server. In this case, I decided to be Internet Explorer 6.
telnet c13183.traffdodkok.info 80
Trying 66.232.122.14...
Connected to c13183.traffdodkok.info.
Escape character is '^]'.
GET /1369483/ HTTP/1.1
Host: c13183.traffdodkok.info
Accept: */*
Accept-Language: en-us
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

HTTP/1.1 302 Moved Temporarily
Date: Fri, 08 Jun 2007 07:43:56 GMT
Server: Apache/2.0.59
Vary: Host
Location: http://trafflol.info/dawn-busty-amateur.html
Content-Length: 242
Content-Type: text/html; charset=UTF-8

hey! your Link a here : <a href="http://www.robertprice.co.uk/robblog/archive/2005/8/Trying_To_Limit_Trackback_Spam.shtml
">Blog</a><br/>Given from:<br/>http://www.robertprice.co.uk/robblog/archive/2005/8/Trying_To_Limit_Trackback_Spam.shtml
Connection closed by foreign host.
So there it is! When I pretend to be Internet Explorer, the spammer's web server responds with an HTTP 302 status and a Location header, which tell the browser to redirect away from the page it has just been served and go and see Dawn instead.
Notice how the response body still contains the link back to my site, so my detection script would be fooled. Also, the spammer was probably being cheeky in targeting an anti-trackback-spam page. :-)
The way to spot this spam is to check the HTTP status and the Location header. Our validation code needs to follow each redirect until it reaches the real URL a web browser would see, and then check the contents of that page.
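As a rough sketch of what following redirects by hand involves, the loop below keeps requesting the Location target until it gets a non-redirect response. The resolve_redirects name, the $fetch callback and the hop limit of 5 are all assumptions made for illustration. $fetch stands in for whatever actually makes the HTTP request, and the sketch assumes every Location value is an absolute URL.

```perl
use strict;
use warnings;

# Follow redirects by hand until a non-redirect response is reached.
# $fetch is a hypothetical callback: given a URL, it returns a hash ref
# with the keys status, location (may be undef) and body.
sub resolve_redirects {
    my ($url, $fetch, $max_hops) = @_;
    $max_hops = 5 unless defined $max_hops;    # arbitrary limit to stop loops

    for (0 .. $max_hops) {
        my $res    = $fetch->($url);
        my $status = $res->{status};

        # 301, 302, 303, 307 and 308 point elsewhere via the Location header
        if ($status =~ /^30[12378]$/ && defined $res->{location}) {
            $url = $res->{location};           # assumes an absolute URL
            next;
        }

        # Not a redirect: this is the page a browser would end up on
        return { url => $url, status => $status, body => $res->{body} };
    }

    return;    # too many redirects, the caller can treat this as spam
}
```

The caller would then run the "does it link back to me" check against the body of the final response rather than the first one, and treat an undefined result as spam.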
Thankfully, Perl gives us an easy way: the LWP::UserAgent module behaves like a real browser and follows redirects for us behind the scenes.
Code to handle this would look something like this (assume $url is the URL of the page to check)...
use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new;
## pretend to be a more capable browser
$ua->agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
my $req = HTTP::Request->new(GET => $url);
$req->header('Accept' => 'text/html');
$req->header('Accept-Language' => 'en-us');
## LWP follows any redirects, so $res is the final page a browser would see
my $res = $ua->request($req);
if ($res->is_success) {
    my $page = $res->content;
    if ($page =~ /robertprice\.co\.uk/) {
        ## assume valid page as it mentions my site
    } else {
        ## assume spam
    }
} else {
    ## request failed or never reached a real page, assume spam
}
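Since the spammer's server only redirects when it sees a browser-like User-Agent, another check worth considering is to request the page twice, once without a User-Agent and once with one, and compare the results. The sketch below covers only the comparison step. The looks_cloaked name and the hash shape (status and location keys) are assumptions for illustration, not anything LWP provides.

```perl
use strict;
use warnings;

# Decide whether a server looks like it is cloaking, given two responses to
# the same URL: one fetched with no User-Agent header and one fetched with a
# browser-like User-Agent. Each response is a hash ref with the keys status
# and location; this shape is an assumption for the sketch.
sub looks_cloaked {
    my ($plain, $browser) = @_;

    # Different status codes, e.g. 200 for a script but 302 for a browser
    return 1 if $plain->{status} != $browser->{status};

    # Same status, but a different redirect target
    my $plain_loc   = defined $plain->{location}   ? $plain->{location}   : '';
    my $browser_loc = defined $browser->{location} ? $browser->{location} : '';
    return 1 if $plain_loc ne $browser_loc;

    return 0;
}

# Values taken from the two telnet sessions above
my $plain   = { status => 200 };
my $browser = { status => 302, location => 'http://trafflol.info/dawn-busty-amateur.html' };
print looks_cloaked($plain, $browser) ? "cloaked\n" : "consistent\n";
```

With the values from the two telnet sessions this prints "cloaked". A mismatch is only a hint, as some legitimate sites also treat unknown clients differently, so it is probably best used alongside the content check rather than instead of it.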
