Software Development in Foreign Languages
The importance of being able to develop software for foreign languages is definitely growing. I have heard numerous times in the office about the importance of supporting other languages (over and above English, of course). This does not refer to just any language but to what we call the “double-byte” languages. These are languages whose character sets are far larger than the Latin alphabet and therefore need more than one byte per character. Examples of these languages are Chinese, Japanese and Korean.
Fortunately, people got together and decided to standardize the representation of the character sets of the different languages around the world. This effort was called Unicode.
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.
The current problem is that Unicode, if encoded directly, makes each character more than 8 bits (1 byte) long. Normally, you will have 16-bit (2-byte) or 32-bit (4-byte) characters instead of the usual 8-bit (1-byte) ones. These encodings are called UCS-2 (commonly supported in MS Windows) and UCS-4 (commonly supported in Unix/Linux) respectively. This causes a number of problems when programming and when exchanging text between different systems, particularly Unix and Linux systems. It also means doing awkward character manipulation to support these characters while programming. So, I started learning how to use ICU …
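To make the size difference concrete, here is a minimal Python sketch. It rests on some assumptions: Python has no codec named UCS-2 or UCS-4, so I use "utf-16-le" and "utf-32-le" as stand-ins (they produce the same bytes for the characters shown here), and the helper name encoded_sizes is made up for illustration.

```python
def encoded_sizes(text):
    """Return {encoding name: byte count} for one string.

    "utf-16-le" and "utf-32-le" are stand-ins for UCS-2 and UCS-4;
    for characters in the Basic Multilingual Plane the bytes match.
    """
    encodings = ("utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


# A plain ASCII letter doubles or quadruples in size.
print(encoded_sizes("A"))  # {'utf-16-le': 2, 'utf-32-le': 4}

# The extra bytes are zeros, which byte-oriented code (for example
# anything that treats a zero byte as end-of-string) tends to trip over.
print("A".encode("utf-16-le"))
```

The zero bytes in the second result are probably one reason these wide encodings are awkward to pass through tools that expect ordinary 8-bit text.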
Fortunately, most systems (including my current Linux system) already support a Unicode encoding called UTF-8.
UTF-8 stands for Unicode Transformation Format-8. It is a lossless, octet-based (8-bit) encoding of Unicode characters. UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets, where the number of octets depends on the integer value assigned to the Unicode character. It is an efficient encoding of Unicode documents that use mostly US-ASCII characters because it represents each character in the range U+0000 through U+007F as a single octet. UTF-8 is the default encoding for XML.
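The "1 to 4 octets depending on the integer value" rule can be sketched in a few lines of Python. This is only an illustration of the bit layout, not how any real library does it; the function name utf8_encode is my own, and in practice you would simply call str.encode("utf-8").

```python
def utf8_encode(code_point):
    """Encode one Unicode code point (an int) as UTF-8 bytes."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point <= 0x7F:        # 1 octet: 0xxxxxxx (plain US-ASCII)
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the sketch against Python's built-in encoder.
for ch in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(utf8_encode(ord(ch)))} octet(s)")
```

Note that a US-ASCII character comes out as the very same single byte it always was, which is why existing ASCII files are already valid UTF-8.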
Fortunately, by using UTF-8, I don’t have to do anything special with my programs to support Unicode, and with it foreign character sets. Our favorite Chinese, Japanese and Korean characters typically need three (3) bytes each to encode in UTF-8. Ouch! That is the price to pay for internationalization.
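A quick Python check of the three-byte cost, using one sample character per language (the characters are picked arbitrarily by me, and the helper name utf8_length is made up for this sketch):

```python
def utf8_length(ch):
    """Number of bytes a character occupies once encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# One arbitrary sample character per language, written as escapes.
samples = {
    "Chinese": "\u4e2d",
    "Japanese": "\u65e5",
    "Korean": "\ud55c",
}

for language, ch in samples.items():
    encoded = ch.encode("utf-8")
    # Every byte of a multi-byte sequence has its high bit set, so none
    # of them can be mistaken for an ASCII character such as '/' or '\0'.
    assert all(byte >= 0x80 for byte in encoded)
    print(language, ch, utf8_length(ch), "bytes:", encoded.hex(" "))

print("ASCII 'A':", utf8_length("A"), "byte")
```

The high-bit property checked in the loop is a big part of why byte-oriented programs can pass UTF-8 through untouched. Anything that counts characters rather than bytes still has to be aware of the difference.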
Oh, by the way, UTF-8 was invented by Ken Thompson and Rob Pike while eating in a New Jersey diner.
