UTF-8 Aware Cron Scripts

I’ve recently been having a spot of bother with UTF-8 data in a Perl script on an old Linux box.

Specifically, I have been importing data from a RESTful service that includes the name Michael Bublé. That accented e at the end of Michael’s name has been problematic.

When I run my code from the command line, it imports correctly into my system. However, when run as a cron job, the name comes through with the é corrupted. In UTF-8 the é is a multibyte character, but the script was trying to read it as separate characters and getting into a muddle.
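To make the multibyte point concrete, here is a small sketch that dumps the UTF-8 encoding of é. The function names are made up for illustration, and it assumes only a POSIX shell with od, wc and tr available.

```shell
#!/bin/sh
# é (U+00E9) encoded as UTF-8 is the two bytes 0xC3 0xA9.
# Octal escapes keep the script itself plain ASCII.

# Print the bytes of é as lowercase hex with no separators.
e_acute_hex() {
    printf '\303\251' | od -An -tx1 | tr -d ' \n'
}

# Print how many bytes that single character occupies.
e_acute_byte_count() {
    printf '\303\251' | wc -c | tr -d ' '
}

e_acute_hex; echo        # c3a9
e_acute_byte_count       # 2
```

Anything that treats those two bytes as two single-byte characters will show two wrong characters in place of one é.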

At first I assumed the service I was consuming had changed the encoding, but running via the command line showed no problems. The problem was down to a difference between the command line and cron environments.

Checking the locale using the locale command I got this on the command line…

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

… but when running that command as a cron job and redirecting the results to a file in /tmp, I got the following…

LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
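The difference can be approximated without waiting for cron by running a command with an emptied environment. This is only a rough sketch: run_bare is a made-up helper, the PATH value is an assumption, and a real cron daemon sets a few variables of its own, so the output may not match exactly.

```shell
#!/bin/sh
# Run a command with a nearly empty environment, roughly like a cron job.
# Only PATH is passed through (assumed value) so the command can be found.
run_bare() {
    env -i PATH=/usr/bin:/bin "$@"
}

# LANG is not set inside the stripped environment.
run_bare sh -c 'echo "LANG=${LANG:-}"'

# On a system with the locale command, compare the two outputs:
#   locale
#   run_bare locale
#
# To capture the real thing from cron, a crontab line along these lines
# would do (the output file name is hypothetical):
#   * * * * * locale > /tmp/cron-locale.txt 2>&1
```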

Cron jobs were being executed in an environment that wasn’t UTF-8 aware. The solution was to set LANG in the /etc/environment file like this…

LANG=en_US.UTF-8

… then restart the cron daemon using

/etc/rc.d/init.d/crond restart

Now my scripts import multibyte UTF-8 data correctly whether they are run on the command line or as a cron job.

The /etc/environment file is used to set variables that specify the basic environment for all processes, so it should be the best place to set the LANG variable.
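As a belt-and-braces measure, a job could also check its own locale before doing any work. The sketch below is an assumption of mine rather than part of the fix above, and both function names are hypothetical. It relies on the standard precedence in which LC_ALL overrides LC_CTYPE, which in turn overrides LANG.

```shell
#!/bin/sh
# Work out which locale setting governs character handling.
# An empty variable counts the same as an unset one.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-POSIX}}}"
}

# Succeed if the given locale name mentions a UTF-8 codeset.
is_utf8_locale() {
    case "$1" in
        *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
        *) return 1 ;;
    esac
}

# Warn on stderr (which cron normally mails to the job owner)
# when the job is not running under a UTF-8 locale.
if ! is_utf8_locale "$(effective_ctype)"; then
    echo "warning: locale is $(effective_ctype), not UTF-8" >&2
fi
```

A warning like this would have pointed at the cron environment straight away rather than at the remote service.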

Utf8 And MySQL Command Line

I’ve been having to use a lot of utf8 recently, and being old school I still use the command line a lot.

One project has been importing a lot of international data into a MySQL database.

The database is in utf8, but when I used the command line, non-latin1 data was coming back corrupted.

It turned out that the mysql command line client doesn’t automatically detect the character set, so it was printing the data as if it were latin1. There is an option called --default-character-set that can be set to utf8. Once set, utf8 data is correctly displayed on my terminal.

mysql --default-character-set=utf8 testdatabase

This is also useful when piping in utf8 data.
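To avoid retyping the option, it can be wrapped in a small shell function. This is a sketch only: mysql_utf8 and the MYSQL_BIN override are names I have made up, and dump.sql is a hypothetical file.

```shell
#!/bin/sh
# Call the mysql client with the utf8 character set always selected.
# MYSQL_BIN is a hypothetical override for the client binary,
# handy for dry runs; it defaults to plain "mysql".
mysql_utf8() {
    "${MYSQL_BIN:-mysql}" --default-character-set=utf8 "$@"
}

# Interactive session:
#   mysql_utf8 testdatabase
#
# Piping in utf8 data from a file:
#   mysql_utf8 testdatabase < dump.sql
```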