Thursday, October 20, 2005

Unicode conversion in Perl

Since Perl 5.6, Perl has stored character strings internally as UTF-8 (much like Java, which also uses Unicode for its strings). To convert a string from one encoding to another, we can think of it the same way as we usually do in Java (if you happen to have some experience with Java character set conversion, you will know what I mean): first decode the bytes of the source encoding into a Perl character string, then encode that string into the target encoding. The Encode module, bundled with Perl since 5.8, provides both steps. For example, to convert a Big5 string to UTF-16:


use Encode qw/encode decode/;

my $big5str      = get_big5_string();               # raw Big5 bytes
my $now_in_utf8  = decode("big5", $big5str);        # Perl character string
my $now_in_utf16 = encode("UTF-16", $now_in_utf8);  # UTF-16 bytes, BOM first
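
To see what each step produces, here is a minimal self-contained sketch. It assumes the two characters 中文, written out as their Big5 bytes in place of get_big5_string(); the sample text and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw/encode decode/;

# Assumed sample input: the Big5 bytes for the two characters U+4E2D U+6587.
my $big5str = "\xA4\xA4\xA4\xE5";

# decode() turns bytes into a Perl character string (2 characters here).
my $chars = decode("big5", $big5str);

# encode() turns the character string back into bytes in the target encoding.
my $utf16 = encode("UTF-16", $chars);

printf "Big5 bytes:  %d\n", length $big5str;       # 4 bytes
printf "Characters:  %d\n", length $chars;         # 2 characters
printf "UTF-16 hex:  %s\n", unpack("H*", $utf16);  # feff4e2d6587 (BOM, then two code units)
```

The character count drops from four bytes to two characters after decode, and the UTF-16 output starts with the two BOM bytes FE FF.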



As a bonus, if you want to return an Excel spreadsheet containing Chinese characters from a CGI script, you can do this:


#!/usr/bin/perl
use Spreadsheet::WriteExcel;
use Encode qw/encode decode/;

my $big5str = get_big5_string();
my $now_in_utf8 = decode("big5", $big5str);
my $now_in_utf16 = encode("UTF-16", $now_in_utf8);

my $filename = "excel_file.xls";
print "Content-type: application/vnd.ms-excel\n";
print "Content-Disposition: attachment; filename=$filename\n";
print "\n";

# The workbook is binary data, so stop STDOUT from translating newlines.
binmode(STDOUT);

my $workbook = Spreadsheet::WriteExcel->new("-");   # "-" writes to STDOUT
my $worksheet = $workbook->add_worksheet();

$worksheet->write(1, 1, "Non-unicode characters");
$worksheet->write_unicode(1, 2, $now_in_utf16);     # expects UTF-16BE bytes
$workbook->close();


In Microsoft Excel versions before 2000, you will see a square box at the start of each cell. This is the BOM (byte order mark) character that encode writes at the front of the UTF-16 output. Excel 2003 and OpenOffice recognise this character correctly. To get rid of the square box, you can just remove the BOM:


my $now_in_utf8 = decode("big5", $big5str);
my $now_in_utf16 = encode("UTF-16", $now_in_utf8);
$now_in_utf16 =~ s/^\x{fe}\x{ff}//;
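
As a quick check of that substitution, the sketch below strips the BOM and compares the result with Encode's UTF-16BE encoding, which writes the same big-endian bytes without a BOM. The sample character and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw/encode/;

# Assumed sample: one character, U+4E2D, as a Perl character string.
my $chars = "\x{4E2D}";

my $with_bom = encode("UTF-16", $chars);        # bytes FE FF 4E 2D
(my $stripped = $with_bom) =~ s/^\xFE\xFF//;    # drop a leading big-endian BOM

my $be = encode("UTF-16BE", $chars);            # bytes 4E 2D, no BOM

print unpack("H*", $with_bom), "\n";            # feff4e2d
print unpack("H*", $stripped), "\n";            # 4e2d
print $stripped eq $be ? "same bytes\n" : "different bytes\n";
```

Anchoring the pattern with ^ matters here: it only removes FE FF at the very start of the byte string and leaves any later occurrence of those bytes alone.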

1 comment:

Anonymous said...

You can avoid the BOM altogether by specifying the endianness of the UTF-16 encoding as UTF-16BE:

my $now_in_utf16 = encode("UTF-16BE", $now_in_utf8);

There is also a Big5 example in the Spreadsheet::WriteExcel distro.

John.