UTF-8 Aware Cron Scripts

I’ve recently been having a spot of bother with UTF-8 data in a Perl script on an old Linux box.

Specifically, I have been importing data from a RESTful service that includes the name Michael Bublé. That accented e at the end of Michael’s name has been problematic.

When I run my code from the command line, it imports correctly into my system. However, when run as a cron job, it imports as Michael BublÃ©. The é is a multibyte character, but the script was reading it as separate single-byte characters and getting into a muddle.
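To make the byte versus character difference concrete, here is a minimal sketch using the core Encode module. The variable names are just for illustration, and the é is written as its two raw UTF-8 bytes so the snippet doesn't depend on the encoding of the file it is saved in.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The name as raw UTF-8 bytes: the e-acute is the two bytes 0xC3 0xA9
my $bytes = "Michael Bubl\xC3\xA9";

# Decoding turns those two bytes into a single Perl character
my $chars = decode('UTF-8', $bytes);

print "bytes: ", length($bytes), "\n";   # 14
print "chars: ", length($chars), "\n";   # 13
```

Treating the undecoded bytes as characters is what produces the two-character muddle in place of a single é.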

At first I assumed the service I was consuming had changed the encoding, but running via the command line showed no problems. The problem was down to a difference between the command line and cron environments.

Checking the locale using the locale command I got this on the command line…

… but when running that command as a cron job and piping the results to a file in /tmp, I got the following…

Cron jobs were being executed in an environment that wasn’t UTF-8 aware. The solution was to set LANG in the /etc/environment file like this…

… then restart the cron daemon using

Now my scripts import multibyte UTF-8 data correctly whether they are run on the command line or as a cron job.

The /etc/environment file is used to set variables that specify the basic environment for all processes, so it should be the best place to set the LANG variable.

Parsing JSON Boolean Values With Perl

A project I’ve been working on recently has meant me using both the JSON and JSON::Streaming::Reader Perl modules.

The JSON::Streaming::Reader is great for processing large files, and produces results that are compatible with the JSON module. Compatible… well almost. They both handle booleans in a slightly different way, and that caused me a few problems as I was mainly testing against the JSON module.

In JSON::Streaming::Reader boolean values become references to either 1 or 0.

The JSON module returns a JSON::Boolean object (or JSON::XS::Boolean if using XS) that is either JSON::true or JSON::false. These are overloaded to act as 1 or 0.

If you look at a Data::Dumper output, here is how they compare for a “true” value.

JSON::Streaming::Reader

JSON

For “false” the 1’s become 0’s.

When using the JSON module, because the JSON::Boolean object is overloaded, you can test for truthfulness by simply doing…

However, using JSON::Streaming::Reader this won’t work, as the reference will always evaluate to true. In this case the value must be dereferenced before testing.

The good news is that this same block of code will work when also using the JSON module, so in future, when testing decoded boolean values from JSON data, always dereference first!
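Here is a rough sketch of the dereference-first test. I'm assuming JSON::PP (which ships with Perl) as a stand-in for the JSON module, and I'm faking the JSON::Streaming::Reader side with plain references to 1 and 0, as described above, so the snippet runs without that module installed.

```perl
use strict;
use warnings;
use JSON::PP;   # core module, standing in for the JSON module here

# Booleans decoded by JSON::PP are overloaded objects built on scalar refs
my $decoded = decode_json('{"on":true,"off":false}');

# Hypothetical stand-ins for JSON::Streaming::Reader's references to 1 and 0
my %streamed = (on => \1, off => \0);

for my $data ($decoded, \%streamed) {
    for my $key (qw(on off)) {
        my $value = $data->{$key};
        # Dereference first, then test: this works for both kinds of value
        print "$key is ", (${$value} ? "true" : "false"), "\n";
    }
}
```

Testing `$value` directly would report the plain reference to 0 as true, which is exactly the trap described above.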

Using Data::Dumper With Log4perl

I’m currently working on some Perl code for a change, and as part of a refactor I’ve added log4perl in to support my debugging.

I tend to overuse Data::Dumper as it’s so useful for seeing what is going on inside an object, and I had been passing the results directly to log4perl. This is actually very inefficient when I change my logging level to turn off the dumps, as Dumper still runs but the results are just ignored.

Thankfully log4perl supports filters, so I can use this to defer the Dumper operation until it actually needs to be output.

In the above example, I’m getting the singleton instance of log4perl for my current class, then calling trace, and saying I’d like the $data passed through Data::Dumper::Dumper. The filter only runs if I have trace level logging enabled; if not, the call is silently ignored and Dumper is never executed, saving me valuable resources.
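The description above can be sketched roughly as follows. This is my own minimal version rather than the original listing: it assumes Log::Log4perl is installed, uses easy_init at INFO level so that trace is switched off, and wraps Dumper in a hypothetical counting sub purely to show when the filter actually runs.

```perl
use strict;
use warnings;
use Data::Dumper;
use Log::Log4perl qw(:easy);

# Log at INFO, so trace messages are switched off
Log::Log4perl->easy_init($INFO);
my $logger = Log::Log4perl->get_logger(__PACKAGE__);

my $data = { artist => 'Example', tracks => [ 1, 2, 3 ] };

# Hypothetical wrapper that counts how often the dump is really performed
my $dumper_calls    = 0;
my $counting_dumper = sub { $dumper_calls++; return Dumper(@_) };

# Eager: Dumper runs here even though the trace message is thrown away
$logger->trace(Dumper($data));

# Deferred: the filter is only called if the message will be logged
$logger->trace({ filter => $counting_dumper, value => $data });
$logger->info({ filter => $counting_dumper, value => $data });

print "filter ran $dumper_calls time(s)\n";
```

In real code the filter would simply be `\&Data::Dumper::Dumper`; the counter is only there to make the saving visible.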

Using Config Files In Catalyst

I’ve been doing quite a bit of work with the Catalyst MVC framework lately.

Moving from development to live, the config system has really shown its usefulness. Instead of hardcoding values, we can simply move them out to the site’s config file and access these values programmatically. It means we can run the same codebase on multiple machines and just tweak the config to pick up things like different database connection strings, cache settings etc.

I like to have my config file in Config::General format as it’s similar to Apache’s config files, but Catalyst can also handle config files in INI, JSON, Perl, XML or YAML, so you can use whatever you are most comfortable with.

Let’s have a look at a few examples. We’ll assume the Catalyst application is called MyApp. This means we’ll have a Perl module in our lib directory called MyApp.pm, and a config file called myapp.conf in the root directory.

MyApp.pm contains all the default values, but you can override these with myapp.conf. myapp.conf always takes priority.

Let’s create a simple string to say where our application is running. In MyApp.pm, in the __PACKAGE__->config we add an entry to the hash like this…


servername => 'dev',

Now when we run our application, we can access this value using the following code…


$c->config->{servername};

This will return “dev”. We can override this in the conf file, so in myapp.conf we can add…


servername production

Now when we run our application code, we’ll see “production” instead of “dev”.
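The "conf file wins" behaviour can be pictured with two plain hashes. This is only a simplified sketch of the idea and not how Catalyst merges config internally; the cache_ttl key and the %from_conf hash are made up for the illustration.

```perl
use strict;
use warnings;

# Defaults, as they might appear in __PACKAGE__->config in MyApp.pm
my %defaults = (servername => 'dev', cache_ttl => 60);

# Hypothetical result of reading myapp.conf
my %from_conf = (servername => 'production');

# Later keys win, so anything set in the conf file replaces the default
my %config = (%defaults, %from_conf);

print "$config{servername}\n";   # production
print "$config{cache_ttl}\n";    # 60
```

Anything the conf file doesn't mention keeps its default from MyApp.pm.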

A more practical use of the config file is to move out the database connection details. Let’s assume we have a simple MySQL based model in our lib/Model directory called Model::MyApp that handles our database work. We can override database connection details stored here in myapp.conf using something like this…

<"Model::MyApp">
    connect_info dbi:mysql:myapp
    connect_info www-rw
    connect_info password
    <connect_info>
        AutoCommit 1
    </connect_info>
</"Model::MyApp">

Now when we run the application, the connection details we entered in our conf file will be used instead. This is very useful as it means we don’t have to alter our codebase when we move the application to different servers that may have different databases. It’s also good for keeping dev, staging and live separate, as all that’s needed is a change to one config file.
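As far as I understand it, the repeated connect_info lines in that block are collected into a list, with the nested connect_info section becoming a hash at the end. The sketch below writes out the Perl structure I would expect the model to receive; treat the exact shape as an assumption rather than a guarantee.

```perl
use strict;
use warnings;

# Assumed Perl equivalent of the Model::MyApp section in myapp.conf
my %model_config = (
    connect_info => [
        'dbi:mysql:myapp',      # DSN
        'www-rw',               # username
        'password',             # password
        { AutoCommit => 1 },    # attributes from the nested block
    ],
);

# These are the same four arguments a DBI connect call takes
my ($dsn, $user, $pass, $attrs) = @{ $model_config{connect_info} };
print "DSN: $dsn, user: $user, AutoCommit: $attrs->{AutoCommit}\n";
```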

For more information on how Catalyst can use a config file, have a look at
Catalyst::Plugin::ConfigLoader.

Redirects In Catalyst

I’ve been using Catalyst quite a bit recently. For those of you who don’t know what Catalyst is, the easiest buzzword compliant comparison is Ruby On Rails for Perl.

Normally in Catalyst, if you want to break the flow from a current method, you use the forward method of the Catalyst context object ($c) to internally redirect to a new handler.

Imagine I have a handler responding to requests at /test.

If I want my request to be dealt with by the handler at /robspage I can issue a forward request in the handler at /test like this…

$c->forward('/robspage');
## code will continue here.

Once the code at /robspage has run, control returns to the calling handler.

This isn’t always what is needed. If I don’t want control to return to the calling handler, I need to use the detach method instead.

$c->detach('/robspage');
## code will not continue here.

This is great, but the URL in the browser will not change and the user will not know they are actually seeing /robspage instead of /test. Sometimes this is the behaviour we want; however, in this case I want the user to know a redirect has happened and for this to be reflected in their browser.

To achieve this, we have to use the redirect method of the response object.

$c->res->redirect($c->uri_for('/robspage'));
$c->detach();

Note that I have added a $c->detach(); call after the redirect as I don’t want the processing chain to continue.

CGI.pm Doesn’t Delete CGItemp Files Automatically

More late night server maintenance…

I noticed the /tmp drive on my server was very full, so had a quick peek. It was full of large CGItemp files.

It turns out the files I’m accepting for upload using CGI.pm aren’t deleted automatically from the temporary directory as the documentation suggests.

When you’re dealing with a lot of large uploads, this can be a problem.
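One way to keep on top of this is a small clean-up routine run from cron. The following is only a sketch: the remove_stale_cgitemp name is made up, and it assumes the leftovers all match CGItemp* directly inside the given directory and that anything older than the cut-off is safe to delete.

```perl
use strict;
use warnings;

# Remove CGItemp* files in $dir that were last modified more than
# $max_age_days ago. Returns the list of files that were removed.
sub remove_stale_cgitemp {
    my ($dir, $max_age_days) = @_;
    my @removed;
    for my $path (glob("$dir/CGItemp*")) {
        # -M gives the file's age in days, relative to script start time
        next unless -f $path && -M $path > $max_age_days;
        if (unlink $path) {
            push @removed, $path;
        } else {
            warn "Could not remove $path: $!\n";
        }
    }
    return @removed;
}

# Example: clear out anything older than a day
# my @gone = remove_stale_cgitemp('/tmp', 1);
```

The age check matters, as files belonging to uploads still in progress need to be left alone.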

Uploading Files To A LAMP Server Using .NET

We had to set up and run a webcam for the Q Awards red carpet earlier this week.

The original plan was to have an old Axis ethernet webcam, connected to a wifi network using a Dlink bridge. After a nightmare of not being able to get the Dlink to work, we gave up and went for a USB webcam connected to a laptop approach.

I had to write some software to handle the capture of the image and the upload to the webserver. Because we wanted to do a few custom things on the way we weren’t able to use out of the box software.

I’ll post on how to use a webcam with .NET another time. What I wanted to document was how to upload a file using .NET to a LAMP server.

It turns out to be easier than I thought in .NET. One line of code can achieve this using the My.Computer.Network.UploadFile method.

For example, to upload test.txt to http://www.robertprice.co.uk/asp_upload.pl we can do the following (in Visual Basic)…

My.Computer.Network.UploadFile("C:\test.txt","http://www.robertprice.co.uk/asp_upload.pl")

Now we need some code at the other end to store this upload. As I was building a webcam application, the uploaded file, an image in this case, was always overwritten.

The .NET upload uses a bog standard HTTP upload as described in RFC 1867.

If we use Perl to read this file, we can use the standard CGI.pm module to do most of the hard work for us.

The uploaded file is passed as a parameter called file, so we can create a CGI object and get a reference to this using the param method. Once we have this, we can treat it as a filehandle and read the data off it. Then all we have to do is to write it out to a file.

The following code achieves this.

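#!/usr/bin/perl -w
use strict;
use CGI;
my $CGI = new CGI;
## the upload arrives as a parameter called "file", which doubles as a filehandle
my $file = $CGI->param('file');
open(FILE, ">/path/to/my/file/test.txt") or die $!;
## the upload may be an image, so write it out as binary data
binmode(FILE);
while(<$file>) {
    print FILE $_;
}
close(FILE);
print "Content-type: text/html\n\n";
print "OK\n";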

Limiting Trackback Spam Further

Those trackback spammers are getting smarter. I’ve had two get past my filters in the past two days.

I wrote before about my attempts at trying to limit trackback spam. My method is to visit the trackback URL and make sure it links back to me before letting it onto the site. I also blacklist sites after X tries to keep my bandwidth down.

I always look at trackbacks that get past my filters, so I was annoyed to see spam and was interested to see how it beat my system.

You can pretend to be a web browser using telnet, so I did that to see how the spammer’s site behaves.

telnet c13183.traffdodkok.info 80
Trying 66.232.122.14...
Connected to c13183.traffdodkok.info.
Escape character is '^]'.
GET /1369483/ HTTP/1.1
Host: c13183.traffdodkok.info

HTTP/1.1 200 OK
Date: Fri, 08 Jun 2007 07:41:22 GMT
Server: Apache/2.0.59
Vary: Host
Content-Length: 242
Content-Type: text/html; charset=UTF-8

hey! your Link a here : <a href="http://www.robertprice.co.uk/robblog/archive/2005/8/Trying_To_Limit_Trackback_Spam.shtml
">Blog</a><br/>Given from:<br/>http://www.robertprice.co.uk/robblog/archive/2005/8/Trying_To_Limit_Trackback_Spam.shtml
Connection closed by foreign host.

Hummm, that looks fine, no obvious spam there. When you visit with a real web browser however, a busty amateur called Dawn is waiting to greet you.

I tried again, but this time adding in a fake user-agent string. This is an additional header a browser sends to identify itself to a web server. In this case, I decided to be Internet Explorer 6.

telnet c13183.traffdodkok.info 80
Trying 66.232.122.14...
Connected to c13183.traffdodkok.info.
Escape character is '^]'.
GET /1369483/ HTTP/1.1
Host: c13183.traffdodkok.info
Accept: */*
Accept-Language: en-us
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

HTTP/1.1 302 Moved Temporarily
Date: Fri, 08 Jun 2007 07:43:56 GMT
Server: Apache/2.0.59
Vary: Host
Location: http://trafflol.info/dawn-busty-amateur.html
Content-Length: 242
Content-Type: text/html; charset=UTF-8

hey! your Link a here : <a href="http://www.robertprice.co.uk/robblog/archive/2005/8/Trying_To_Limit_Trackback_Spam.shtml
">Blog</a><br/>Given from:<br/>http://www.robertprice.co.uk/robblog/archive/2005/8/Trying_To_Limit_Trackback_Spam.shtml
Connection closed by foreign host.

So there it is! When I pretend to be Internet Explorer, the spammer’s web server issues an HTTP 302 header that tells the browser to redirect away from the page it’s served, and to go and see Dawn instead.

Notice how it keeps the content linking back to my site in the response, so my detection script would be fooled. Also, the spammer was probably being cheeky in targeting an anti trackback spam page. :-)

The way to spot this spam is to check for the HTTP status and the location header. We’d need to make our validation code follow each redirection location until it reached the real URL a web browser would see and check the contents of that page.

Thankfully Perl gives us an easy way. We can use the LWP::UserAgent module, which can pretend to be a real browser and handles all the redirects behind the scenes.

Code to handle this would look something like this (assume $url is the URL of the page to check)…

use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new;
## pretend to be a more capable browser
$ua->agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
my $req = HTTP::Request->new(GET => $url);
$req->header('Accept' => 'text/html');
$req->header('Accept-Language' => 'en-us');
## LWP follows any redirects, so $res is the page a browser would end up on
my $res = $ua->request($req);
if ($res->is_success) {
    my $page = $res->content;
    if ($page =~ /robertprice\.co\.uk/) {
        ## assume valid page as it mentions my site
    } else {
        ## assume spam
    }
} else {
    ## assume spam
}
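If you also want to know whether a redirect happened at all, the response object keeps the chain of earlier responses. Here is a small sketch using HTTP::Response's redirects method. The redirect_chain helper and the spam.example URL are made up, and the responses are built by hand so the snippet runs without touching the network.

```perl
use strict;
use warnings;
use HTTP::Response;

# Return the Location header of every redirect that led to $res, oldest first
sub redirect_chain {
    my ($res) = @_;
    return map { scalar $_->header('Location') } $res->redirects;
}

# Hand-built responses standing in for what a real fetch would return
my $redirect = HTTP::Response->new(302, 'Moved Temporarily',
    [ Location => 'http://spam.example/landing.html' ]);
my $final = HTTP::Response->new(200, 'OK', [], 'landing page');
$final->previous($redirect);

my @chain = redirect_chain($final);
print "Redirected via: @chain\n" if @chain;
```

A trackback URL that bounces a browser somewhere else is suspicious in itself, whatever the final page contains.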

Using Perl To Send Heartbeat To A Python Server

I’ve been brushing up on my networking and Python skills lately.

One example in the Python Cookbook is recipe 13.11 Detecting Inactive Computers. Here there are two scripts, one that sends a heartbeat, and another that listens for heartbeats. The idea is a simple UDP packet containing the short string “PyHB” is sent every 5 seconds or so by a client. A server then listens for these datagrams, stores the addresses of the clients which it receives messages from, and if it fails to receive a message in a certain time period it flags this up on the console.

I thought I’d just check how easy it would be to create a Perl client to talk to the Python backend. As it’s all through sockets it should be fairly easy.

To start with we define a few constants that cover the address of the server we want to connect to, the port to send the datagram to, how often we want to send it and if we want debugging information displayed or not.

use constant SERVER_IP => '127.0.0.1';
use constant SERVER_PORT => 43278;
use constant BEAT_PERIOD => 5;
use constant DEBUG => 1;

Now we need to create the socket. In this case we’ll create a new IO::Socket::INET object.

my $hbsocket = IO::Socket::INET->new(Proto => 'udp',
                                     PeerPort => SERVER_PORT,
                                     PeerAddr => SERVER_IP)
    or die "Creating socket: $!\n";

Next we need to send our message, the string “PyHB” over the socket.

$hbsocket->send('PyHB') or die "send: $!";

Finally we need to wrap this up in a loop and sleep BEAT_PERIOD seconds before repeating our message.

Here’s the final script.

#!/usr/bin/perl -w
use strict;
use IO::Socket;
use constant SERVER_IP => '127.0.0.1';
use constant SERVER_PORT => 43278;
use constant BEAT_PERIOD => 5;
use constant DEBUG => 1;
print "Sending heartbeat to IP " . SERVER_IP . " , port " . SERVER_PORT . "\n";
print "press Ctrl-C to stop\n";
while (1) {
    ## open a UDP socket to the server for this beat
    my $hbsocket = IO::Socket::INET->new(Proto => 'udp',
                                         PeerPort => SERVER_PORT,
                                         PeerAddr => SERVER_IP)
        or die "Creating socket: $!\n";
    $hbsocket->send('PyHB') or die "send: $!";
    print "Time: " . localtime(time) . "\n" if (DEBUG);
    sleep(BEAT_PERIOD);
}

Just set the value of DEBUG to 0 if you don’t want to see the time each time we send a message to the server.

To test this you’ll have to be running the server from recipe 13.11 in the Python Cookbook. You can download a zip file of the examples in the Python Cookbook from the O’Reilly website to test this.
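If you just want to confirm the datagram goes out as expected before involving the Python server, a quick loopback check in Perl is enough. This is only a sketch of my own and not a replacement for the Cookbook server: it binds a listener to a port picked by the operating system, sends a single “PyHB” to it, and reads it back.

```perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Listener bound to a loopback port chosen by the operating system
my $listener = IO::Socket::INET->new(Proto     => 'udp',
                                     LocalAddr => '127.0.0.1',
                                     LocalPort => 0)
    or die "Creating listener: $!\n";

# Sender aimed at whichever port the listener was given
my $sender = IO::Socket::INET->new(Proto    => 'udp',
                                   PeerAddr => '127.0.0.1',
                                   PeerPort => $listener->sockport)
    or die "Creating sender: $!\n";

$sender->send('PyHB') or die "send: $!";

# Wait up to two seconds rather than blocking forever on a lost datagram
die "No heartbeat received\n"
    unless IO::Select->new($listener)->can_read(2);

my $message;
defined($listener->recv($message, 16)) or die "recv: $!";
print "Received '$message' on port ", $listener->sockport, "\n";
```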