 

Background

Most legacy computing environments have used the ASCII character encoding, developed by the ANSI standards body, to store and manipulate character strings inside software applications. ASCII encoding was convenient for programmers because each ASCII character could be stored as a byte. The initial version of ASCII used only 7 of the 8 bits available in a byte, which meant that software applications could use only 128 different characters. This version of ASCII could not represent accented European characters and was completely inadequate for Asian characters. Using the eighth bit to extend the total range to 256 characters added support for most European characters. Today, the term ASCII commonly refers to either the original 7-bit encoding or one of its 8-bit extensions.
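As a quick illustration of these bit and byte ranges, the short Python sketch below (Python is used here only for illustration and is not part of the environments described above) shows that an ASCII character fits in a single byte, and that 7 and 8 bits allow 128 and 256 distinct character codes:

    # The ASCII code for 'A' is 65, which fits in 7 bits, so one byte per character suffices.
    code = ord("A")
    print(code, bin(code))              # 65 0b1000001 (7 significant bits)
    print(len("A".encode("ascii")))     # 1 -- one byte per ASCII character
    print(2 ** 7, 2 ** 8)               # 128 256 -- the 7-bit and 8-bit character ranges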
As the need increased for applications with additional international support, ANSI again increased the functionality of ASCII by developing an extension to accommodate multilingual software. The extension, known as the Double-Byte Character Set or DBCS, allowed existing applications to function without change, but provided for the use of additional characters, including complex Asian characters. With DBCS, each character maps to either one byte (for example, American ASCII characters) or two bytes (for example, Asian characters). The DBCS environment also introduced the concept of an operating system code page, which identifies how characters are encoded into byte sequences in a particular computing environment. DBCS encoding provides a cross-platform mechanism for building multilingual applications; however, variable-width character codes complicate string handling, because the length of a string in bytes no longer matches its length in characters, so the approach is not ideal.
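To make the one-byte versus two-byte mapping concrete, the following Python sketch encodes a mixed string with a double-byte code page; the choice of the Shift-JIS code page is an assumption made purely for illustration:

    # Mixed ASCII and Japanese text encoded with a double-byte code page (Shift-JIS).
    text = "ABCあい"                    # three ASCII letters and two Japanese characters
    encoded = text.encode("shift_jis")
    print(len(text), len(encoded))      # 5 7 -- each ASCII letter takes 1 byte,
                                        #        each Japanese character takes 2 bytes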
Many developers felt that there was a better way to solve the problem. A group of leading software companies joined forces to form the Unicode Consortium. Together, they produced Unicode, a new solution for building worldwide applications. Unicode was originally designed as a uniform, fixed-width, two-byte encoding that could represent all modern scripts without the use of code pages. The Unicode Consortium has continued to add new characters, and the current number of supported characters is over 95,200.
Although it seemed to be the perfect solution for building multilingual applications, Unicode started off with a significant drawback: it would have to be retrofitted into existing computing environments. To use the new paradigm, all applications would have to change. This was clearly unacceptable, so several standards-based transformation formats were designed to convert fixed-width, two-byte Unicode values into character encodings better suited to existing environments, including, among others, UTF-8, UCS-2, and UTF-16.
UTF-8 is a standard method for transforming Unicode values into byte sequences that remain transparent to ASCII: every ASCII character keeps its single-byte code, while all other characters are encoded as multi-byte sequences. UTF-8 is endorsed by the Unicode Consortium as a standard mechanism for transforming Unicode values and is popular for use with HTML, XML, and similar protocols. UTF-8 is, however, currently used primarily on AIX, HP-UX, Solaris, and Linux.
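The following Python sketch illustrates that ASCII transparency; the sample characters are chosen only for illustration:

    # UTF-8 leaves ASCII bytes unchanged and uses longer sequences for other characters.
    print("A".encode("utf-8"))      # b'A'              -- 1 byte, identical to ASCII
    print("é".encode("utf-8"))      # b'\xc3\xa9'       -- 2 bytes
    print("語".encode("utf-8"))     # b'\xe8\xaa\x9e'   -- 3 bytes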
UCS-2 is a fixed-width, two-byte encoding and is the method used for transforming Unicode values into byte sequences on Microsoft Windows 95, Windows 98, Windows Me, and Windows NT. Because every character occupies exactly two bytes, UCS-2 can represent only the first 65,536 Unicode code points.
UTF-16 is a superset of UCS-2; it adds support for supplementary characters, which are encoded as surrogate pairs of two 16-bit code units. UTF-16 is the standard encoding for Windows 2000, Windows XP, Windows Server 2003, Windows Server 2008, Windows Vista, Windows 7, Windows Server 2012, and Windows 8. Microsoft recommends using UTF-16 for new applications.
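The following Python sketch contrasts the two encodings; the sample characters are chosen only for illustration. A character inside the Basic Multilingual Plane occupies one 16-bit code unit (the same value in UCS-2 and UTF-16), while a supplementary character requires a UTF-16 surrogate pair:

    # UTF-16 (little-endian) encoding of a BMP character and a supplementary character.
    bmp = "語"       # U+8A9E, inside the Basic Multilingual Plane
    supp = "𝄞"       # U+1D11E (musical symbol G clef), outside the BMP
    print(bmp.encode("utf-16-le"))     # b'\x9e\x8a'      -- one 16-bit code unit
    print(supp.encode("utf-16-le"))    # b'4\xd8\x1e\xdd' -- the surrogate pair 0xD834, 0xDD1E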