Friday, April 27, 2012

Workaround for conversion of Unicode Vietnamese to ANSI

A few months ago, we had an interesting project which involved Vietnamese. As part of the project, we hit a minor snag.

It looks like the first support for Vietnamese appeared in the 1990s, which means the ANSI codepage was built in a hurry, when Unicode was either in the works or already out. The tonal nature of the language demands complex combinations of diacritics, not just regular grave accents, umlauts and such. Simply put, there were not enough slots for the new characters, so the creators of the Vietnamese codepage stuffed them in wherever and however they could.

Today very few people use ANSI. However, it is still needed for several reasons: legacy being one (people still work with mainframes, you know), and compliance another. Of course, there is the tried and true function WideCharToMultiByte, which works like a Swiss chronometer. That is, for most languages, except Vietnamese. There are posts by Microsoft folks stating that "Vietnamese is a complex language on Windows" (duh!), but not really telling how to fix it. I asked around, and no one replied, as expected (I love how Stack Overflow people react when they can't answer the question :-) ).

After scrutinising the result, I saw what the problem was. It seems that the decomposing routine used during the conversion is unable to handle some combinations of Unicode characters. Manually decomposing some characters into their equivalents worked for me.
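
To give a rough idea of what "manually decomposing" can look like, here is a minimal sketch in Python. It is not the Clarion or C++ code linked below, and it does not call the Windows conversion function at all. It uses Python's built-in cp1258 codec as a stand-in, and it assumes that the Vietnamese codepage keeps the five tone marks (grave, acute, tilde, hook above, dot below) as separate combining characters while the vowels with circumflex, breve or horn stay precomposed. The names TONE_MARKS, predecompose and to_cp1258 are made up for this illustration.

```python
import unicodedata

# Assumption: these five tone marks are the combining characters that the
# Vietnamese codepage (Windows-1258) stores as separate bytes.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}


def predecompose(text):
    """Split off tone marks from characters the codepage cannot hold whole."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        try:
            # Characters that already have a slot in the codepage stay as-is.
            ch.encode("cp1258")
            out.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        # Fully decompose, pull the tone marks out, then recompose what is
        # left so that the circumflex, breve or horn stays on the vowel.
        parts = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in parts if c not in TONE_MARKS)
        tones = "".join(c for c in parts if c in TONE_MARKS)
        out.append(unicodedata.normalize("NFC", base) + tones)
    return "".join(out)


def to_cp1258(text):
    """Encode text to the Vietnamese codepage after the manual split."""
    # Anything that still has no slot raises UnicodeEncodeError here.
    return predecompose(text).encode("cp1258")


if __name__ == "__main__":
    print(to_cp1258("Ti\u1ebfng Vi\u1ec7t"))
```

The idea is that a letter carrying two diacritics is turned into a precomposed vowel followed by one combining tone mark, which is a shape the codepage can store. On Windows, the same pre-processing would be applied to the wide string before it is handed to the conversion function.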

I use an esoteric language called Clarion to design our tools and some components, so my original code is in Clarion. A few days ago, Mark Jacobs from Critical Research contacted me requesting help with the same issue, and kindly converted my Clarion source code to C++, which is more familiar to the rest of the world. Thanks, Mark!

Get Clarion source code here and Mark's C++ here.