UTF-8 Aware Cron Scripts

I’ve recently been having a spot of bother with UTF-8 data in a Perl script on an old linux box.

Specifically, I have been importing data from a RESTful service that includes the name Michael Bublé. That accented e at the end of Michael’s name has been problematic.

When I run my code from the command line, it imports correctly into my system, however, when run as a cron job, it imports as Michael Bublé. The é is a multibyte character, but the script was trying to read it seperate characters and getting into a muddle.

At first I assumed the service I was consuming had change the encoding, but running via the command line showed no problems. The problem was down a difference between the command line and cron environments.

Checking the locale using the locale command I got this on the command line…

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

… but when running that command as a cron job and piping the results to a file in /tmp, I got the following…

LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=

Cron jobs were being executed that weren’t UTF-8 aware. The solution was to set the LANG in the /etc/environment file like this…

LANG=en_US.UTF-8

… then restart the cron daemon using

/etc/rc.d/init.d/crond restart

Now my scripts can successfully import multibyte UTF-8 data correctly when run on the command line or as a cron job.

The /etc/environment file is used to set variables that specify the basic environment for all processes so should be the best place to set the lANG variable.