Character Encoding

Unicode and Other Character Sets

The default character set for XML, XHTML, and HTML 4.0 documents is Unicode (http://www.w3.org/International/O-unicode.html), a standard defined, oddly enough, by the Unicode Consortium (www.unicode.org). Unicode is a comprehensive character set that provides a unique number for every character, "no matter what the platform, no matter what the program, no matter what the language." Unicode is thus the closest thing we have to a universal alphabet, although it is not an alphabet but a numeric mapping scheme.
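To make the "unique number for every character" idea concrete, here is a minimal sketch in Python (our choice for illustration; nothing in the Unicode standard calls for it). The helper name `code_points` is hypothetical. It relies only on the built-in `ord()` function, which returns the Unicode number, or code point, of a single character.

```python
def code_points(text):
    """Return the Unicode code point (an integer) of each character in text."""
    return [ord(ch) for ch in text]


# Characters from four different scripts, written as escapes so the
# example does not depend on how this file itself is encoded.
samples = ["A", "\u00e9", "\u03a9", "\u4e2d"]  # A, e-acute, Greek Omega, a CJK ideograph

for ch in samples:
    # Print each character next to its number in the usual U+XXXX notation.
    print(f"{ch!r} -> U+{ord(ch):04X}")
```

Whatever the platform or program, each of these characters keeps the same number. Only the way that number is stored as bytes (the encoding) varies.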

Even though Unicode is the default character set for web documents, developers are free to choose other character sets that might be better suited to their needs. For instance, American and Western European websites often use ISO-8859-1 (Latin-1) encoding. You might be asking yourself what Latin-1 encoding means, or where it comes from. Okay, to be honest, you're not asking yourself any such thing, but we needed a transition, and that was the best we could do on short notice.

What Is ISO 8859?

ISO 8859 is a series of standardized multilingual single-byte (8-bit) coded graphic character sets for writing in alphabetic languages. The first of these character sets, ISO-8859-1 (also called Latin-1), covers Western European characters, and its 256 values correspond one-to-one to the first 256 Unicode code points. Other ISO 8859 character sets include Latin-2 (Central and Eastern European), Turkish, Greek, Hebrew, and Nordic, among others.
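A short Python sketch can show what "single-byte" means in practice. It assumes Python's standard `iso-8859-1` and `iso-8859-2` codecs, and the helper name `latin1_bytes` is hypothetical. It illustrates two points: in Latin-1 each character occupies exactly one byte whose value equals the character's Unicode number, and the same byte can stand for a different character in another ISO 8859 set.

```python
def latin1_bytes(text):
    """Encode text as ISO-8859-1 and return the byte values as a list of ints."""
    return list(text.encode("iso-8859-1"))


word = "caf\u00e9"  # "cafe" with an acute accent on the e

# One byte per character, and each byte equals the Unicode code point.
byte_values = latin1_bytes(word)
matches_unicode = byte_values == [ord(ch) for ch in word]

# The single byte 0xA3 means different things in different ISO 8859 sets.
as_latin1 = b"\xa3".decode("iso-8859-1")  # pound sign, U+00A3
as_latin2 = b"\xa3".decode("iso-8859-2")  # L with stroke, U+0141

print(byte_values, matches_unicode, as_latin1, as_latin2)
```

The second half of the sketch is why a browser has to be told which character set a page uses: the bytes alone do not say.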

The ISO 8859 standard was created in the mid-1980s by the European Computer Manufacturers Association (ECMA) and endorsed by the International Organization for Standardization (ISO). Now you know.

Mapping Your Character Set to Unicode

Regardless of which character set you've chosen, to map it to the Unicode standard, you must declare your character encoding, as discussed in the second rule of XHTML presented earlier. (You see, there was a point to all this.) Sites can declare their character encoding in any of three ways: