I’ve recently been having a spot of bother with UTF-8 data in a Perl script on an old Linux box.
Specifically, I have been importing data from a RESTful service that includes the name Michael Bublé. That accented e at the end of Michael’s name has been problematic.
When I run my code from the command line, it imports correctly into my system; however, when run as a cron job, it imports as Michael BublÃ©. The é is a multibyte character, but the script was trying to read it as separate characters and getting into a muddle.
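A minimal Perl sketch of what was happening (the byte string and the Latin-1 assumption are illustrative, not taken from the actual import script):

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

my $bytes = "Michael Bubl\xC3\xA9";        # raw UTF-8 bytes for "Michael Bublé"
print decode('UTF-8', $bytes), "\n";       # Michael Bublé  - the two bytes become one character
print decode('ISO-8859-1', $bytes), "\n";  # Michael BublÃ© - each byte treated as a separate character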
At first I assumed the service I was consuming had changed the encoding, but running via the command line showed no problems. The problem was down to a difference between the command line and cron environments.
Checking the locale using the locale command, I got this on the command line…
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
… but when running that command as a cron job and piping the results to a file in /tmp, I got the following…
LANG=
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
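(To reproduce that check, a throwaway crontab entry along these lines is enough; the schedule and filename here are just examples…

* * * * * /usr/bin/locale > /tmp/cron-locale.txt 2>&1

…then look at the file after a minute or so.)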
Cron jobs were being executed in an environment that wasn’t UTF-8 aware. The solution was to set LANG in the /etc/environment file like this…
LANG=en_US.UTF-8
… then restart the cron daemon using
/etc/rc.d/init.d/crond restart
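After the restart, re-running the same throwaway crontab check (again, the filename is just an example) should show the en_US.UTF-8 values instead of POSIX…

* * * * * /usr/bin/locale > /tmp/cron-locale-after.txt 2>&1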
Now my scripts import multibyte UTF-8 data correctly, whether run from the command line or as a cron job.
The /etc/environment file is used to set variables that specify the basic environment for all processes, so it should be the best place to set the LANG variable.