Way 2 Web

Web development tips


 
Character Encodings

Weird characters?

Ever notice characters displayed incorrectly on a web site? Those tell-tale little squares or question marks that look like Twilight Zone escapees?

This phenomenon demonstrates the importance of selecting character encodings for your web pages.

Character sets

First of all, let's distinguish between sets and encodings.

A coded character set is a set of symbols, each of which has a unique numerical ID called a code point.

For example, the ASCII character set contains 128 code points, Latin 1 (or ISO-8859-1) has 256, and Unicode's Universal Character Set (UCS), the most expansive character set, has over 1.1 million code points.

Broadly speaking, the smaller character sets are subsets of the larger character sets. Thus, the characters depicted by ASCII code points (English letters, numbers and punctuation marks) appear in Latin 1 and UCS as well, with the same code points.
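
To make the subset relationship concrete, here is a minimal Python sketch. The helper name smallest_set is made up for this illustration; it relies only on the fact that ASCII occupies code points 0-127 and Latin 1 occupies 0-255 of the UCS.

```python
def smallest_set(ch: str) -> str:
    """Name the smallest of the three character sets that contains ch."""
    code_point = ord(ch)  # ord() returns the character's UCS code point
    if code_point < 128:
        return "ASCII"
    if code_point < 256:
        return "Latin 1"
    return "UCS only"

# "A", e-acute and the euro sign, written as escapes to keep this file ASCII
samples = ["A", "\u00e9", "\u20ac"]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {smallest_set(ch)}")
```

The letter A falls in all three sets, e-acute only in Latin 1 and UCS, and the euro sign only in UCS.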

Whatever encoding it is saved in, every HTML document uses UCS as its character set.

However, some older browsers and devices do not support the full range of UCS characters. They typically display unsupported characters as the "weird characters" mentioned above, or do not display them at all.

Character encoding

The character encoding of a page, on the other hand, determines how each UCS character is represented as bytes, and in particular how many bytes are used per character.

7-bit ASCII encoding uses 7 bits per character. UTF-8 uses one to four bytes, UTF-16 uses two or four bytes, and UTF-32 always uses four bytes.
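
The byte counts can be checked directly. The following Python sketch is only an illustration (encoded_sizes is a hypothetical helper); it uses the little-endian codec variants so that no byte order mark is added to the count.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes ch occupies in each Unicode encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" codecs do not prepend a byte order mark
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

sizes_ascii = encoded_sizes("A")           # U+0041, an ASCII letter
sizes_latin = encoded_sizes("\u00e9")      # U+00E9, e-acute
sizes_euro = encoded_sizes("\u20ac")       # U+20AC, euro sign
sizes_astral = encoded_sizes("\U0001F600") # U+1F600, outside the 16-bit range

for name, sizes in [("A", sizes_ascii), ("e-acute", sizes_latin),
                    ("euro", sizes_euro), ("U+1F600", sizes_astral)]:
    print(name, sizes)
```

Only UTF-32 has a fixed width; UTF-8 and UTF-16 grow as the code point gets larger.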

If a given character exists only in UCS, but 7-bit ASCII encoding is used, the character cannot be adequately represented since it requires more than ASCII's seven bits. (7-bit ASCII is considered the "lowest common denominator" encoding and is the default for email messages.)
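
As a rough demonstration of that limit, this Python sketch tries to encode a word containing one non-ASCII letter. The sample word is an arbitrary choice for the example.

```python
text = "caf\u00e9"  # "cafe" with an e-acute (U+00E9), which is outside ASCII

try:
    text.encode("ascii")
    ascii_ok = True
    failed_at = None
except UnicodeEncodeError as err:
    ascii_ok = False
    failed_at = err.start  # index of the first character that did not fit

# The same text encodes without trouble in UTF-8
utf8_bytes = text.encode("utf-8")
print(ascii_ok, failed_at, utf8_bytes)
```

The ASCII attempt fails at the fourth character, while UTF-8 represents it with two bytes.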

The default encoding in many browsers, at least for Western locales, is Latin 1 (in practice usually treated as Windows-1252).

Therefore, if your site has content in languages besides English, use UTF-8 as your encoding.

UTF-8 uses a single byte for each ASCII character and two, three, or four bytes for every other character, ensuring that the entire UCS is supported.
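
The "weird characters" usually appear when bytes written in one encoding are read in another. This Python sketch imitates a browser that falls back to Latin 1 for a page that was actually saved as UTF-8; it is a simplified model, not a description of any particular browser.

```python
original = "caf\u00e9"                 # "cafe" with an e-acute

page_bytes = original.encode("utf-8")  # what the server sends

# A reader that wrongly assumes Latin 1 turns each byte into one character
misread = page_bytes.decode("latin-1")

# A reader that knows the real encoding recovers the text
correct = page_bytes.decode("utf-8")

print(len(original), len(misread), correct == original)
```

The two UTF-8 bytes of the accented letter come out as two separate Latin 1 characters (U+00C3 and U+00A9), so the misread text is one character longer than the original.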

Input encoding

Thus far, we have covered the encoding of pages sitting on the server.

But the topic is also relevant to another form of client-server communications: user input, such as via an online form.

Luckily, by default the browser submits form data in the same encoding as the page itself. A page encoded in UTF-8 will submit data in UTF-8.
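
To illustrate why the server needs to know which encoding was used, here is a small Python sketch that builds the same form field as a browser might for a UTF-8 page and for a Latin 1 page. The field name city and its value are invented for the example.

```python
from urllib.parse import urlencode, parse_qs

form = {"city": "Z\u00fcrich"}  # "Zurich" with u-umlaut (U+00FC)

# The same field, percent-encoded from two different page encodings
as_utf8 = urlencode(form, encoding="utf-8")
as_latin1 = urlencode(form, encoding="latin-1")

# Server side: decoding with the matching encoding recovers the value
decoded = parse_qs(as_utf8, encoding="utf-8")["city"][0]

# Decoding UTF-8 data as Latin 1 yields garbled text instead
garbled = parse_qs(as_utf8, encoding="latin-1")["city"][0]

print(as_utf8)
print(as_latin1)
```

The submitted bytes differ between the two encodings, so a server that guesses wrong ends up with garbled input.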

In practice

After you choose an encoding for your pages, you need to:

  • Save the pages in that encoding. This is a matter of configuring your editor. MS Notepad, for example, allows you to select the encoding when saving a document.
  • Tell the browser what encoding to expect, using a <META> tag.
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
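
If your pages are generated by a script rather than saved from an editor, the same two steps apply. The following Python sketch assumes a hypothetical file name, index.html, written to a temporary directory; it writes the bytes explicitly as UTF-8 and checks that the declaration and the saved bytes agree.

```python
import tempfile
from pathlib import Path

page = (
    "<html><head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>\n'
    "<title>Caf\u00e9</title>\n"
    "</head><body>Caf\u00e9</body></html>\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.html"        # hypothetical file name
    path.write_bytes(page.encode("utf-8"))  # step 1: save in the encoding
    raw = path.read_bytes()

declared_utf8 = b"charset=UTF-8" in raw     # step 2: the META declaration
saved_as_utf8 = raw.decode("utf-8") == page # the bytes really are UTF-8

print(declared_utf8, saved_as_utf8)
```

Declaring one encoding while saving in another is a common cause of garbled pages, which is why both checks are worth making.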

In most cases, you'll want to use UTF-8 encoding. If your site involves a large number of Chinese characters, UTF-16 may be a more efficient solution, since most Chinese characters take three bytes in UTF-8 but only two in UTF-16.
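
The size trade-off can be estimated before you commit to an encoding. This Python sketch compares four Chinese characters and a short piece of markup; both samples are arbitrary, and real pages will mix the two in varying proportions.

```python
chinese = "\u4e2d\u6587\u7f51\u9875"  # four Chinese characters
markup = "<p></p>"                    # seven ASCII characters

# Chinese text: three bytes per character in UTF-8, two in UTF-16
chinese_utf8 = len(chinese.encode("utf-8"))
chinese_utf16 = len(chinese.encode("utf-16-le"))

# ASCII markup: one byte per character in UTF-8, two in UTF-16
markup_utf8 = len(markup.encode("utf-8"))
markup_utf16 = len(markup.encode("utf-16-le"))

print(chinese_utf8, chinese_utf16, markup_utf8, markup_utf16)
```

Since HTML tags and attributes are ASCII, UTF-16 only comes out ahead when the Chinese text clearly outweighs the markup.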

Always specify the encoding for your pages. This helps ensure the pages display correctly on all platforms.

If you really have to...

What if your pages use a non-Unicode encoding (such as Latin 1) for some pressing reason, and you still want to use a character outside of that encoding? (One such reason: you're using PHP, which in many versions does not handle encoding correctly in its string functions.)

You can always use an HTML character entity, one of the familiar codes that begin with an ampersand (&) and end with a semicolon, such as &euro; or &#8364; for the euro sign.
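
As one possible way to produce such references automatically, Python's built-in xmlcharrefreplace error handler replaces every character that does not fit the target encoding with a numeric reference. The sample text here is invented for the illustration.

```python
import html

euro = "\u20ac"              # euro sign, not part of Latin 1
text = "Price: 5 " + euro

# Characters outside Latin 1 become numeric character references
latin1_safe = text.encode("latin-1", errors="xmlcharrefreplace")

# A browser resolves both the numeric and the named form to the same character
from_numeric = html.unescape("&#8364;")
from_named = html.unescape("&euro;")

print(latin1_safe)
```

The resulting bytes are plain Latin 1, yet the page still displays the euro sign.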

However, note that if out-of-range characters are submitted as input, browsers behave differently: some convert the character to the appropriate character entity and submit that, others submit the correct character code point, and yet others replace the character with one that is within range.

As with many workarounds, this has its problems.

References