
14.3 A Brief History of Bits

Let's shift gears a little and discuss additional issues to consider when dealing with non-Western European languages. Once upon a time, not so long ago, bits were very expensive. Hard disks for storing bits, memory for loading bits, communication equipment for sending bits over the wire: all the resources needed to handle bits were costly. To save on these expensive resources, characters were initially represented by only seven bits. This was enough to represent all letters in the English alphabet, the digits 0 through 9, punctuation characters, and some control characters. That was all that was really needed in the early days of computing, because most computers were kept busy doing number crunching.

But as computers were given new tasks, often dealing with human-readable text, seven bits didn't cut it. Adding one bit made it possible to represent all letters used in the Western European languages, but there are other languages besides the Western European languages, even though companies based in English-speaking countries often seem to ignore them. Eight bits is not enough to represent all characters used around the world. This problem was partly solved by defining a number of standards for how eight bits should be used to represent different character subsets. Each standard in the ISO-8859 series defines what is called a charset: a mapping between a byte (eight bits) and a character. For instance, ISO-8859-1, also known as Latin-1, defines the subset used for Western European languages, such as English, French, Italian, Spanish, German, and Swedish. This is the default charset for HTTP. Other standards in the same series are ISO-8859-2, covering Central and Eastern European languages such as Hungarian, Polish, and Romanian, and ISO-8859-5, with Cyrillic letters used in Russian, Bulgarian, and Macedonian. You can find information about the charsets in the ISO-8859 series at http://czyborra.com/charsets/iso8859.html.
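To make the idea of a charset as a byte-to-character mapping concrete, here is a minimal Java sketch. The class and method names (ByteMappingSketch, decodeLatin1, encodeAscii) are made up for this illustration; only the standard java.nio.charset.StandardCharsets constants are real API. It decodes one byte as ISO-8859-1 and then shows what happens when the resulting character is squeezed back into 7-bit US-ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ByteMappingSketch {

    // Decode a single byte as ISO-8859-1 (Latin-1), where each byte value
    // maps to exactly one character.
    static String decodeLatin1(int byteValue) {
        return new String(new byte[] { (byte) byteValue }, StandardCharsets.ISO_8859_1);
    }

    // Encode text as 7-bit US-ASCII. String.getBytes replaces characters
    // the charset cannot represent with its replacement byte, '?'.
    static byte[] encodeAscii(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String eAcute = decodeLatin1(0xE9);          // e with acute accent, U+00E9
        System.out.println((int) eAcute.charAt(0));  // 233
        System.out.println(encodeAscii(eAcute)[0]);  // 63, the code for '?'
    }
}
```

The byte value 0xE9 only means "e with an acute accent" because ISO-8859-1 says so; a 7-bit charset has no code for that character at all, which is why it degrades to a question mark.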

Languages such as Chinese and Japanese contain thousands of characters, but with eight bits you can represent only 256. A number of multibyte charsets have therefore been defined to handle these languages, such as Big5 for Chinese, Shift_JIS for Japanese, and EUC-KR for Korean.

As you can imagine, all these different standards make it hard to exchange information encoded in different ways. To simplify life, the Unicode standard was defined by the Unicode Consortium, which was founded in 1991 by companies such as Apple, IBM, Microsoft, Novell, Sun, and Xerox. Version 3.0 of Unicode defines unique codes for 49,194 characters, covering most of the world's languages, and each of these codes fits in 2 bytes (16 bits). Java uses Unicode for its internal representation of characters, and Unicode is also supported by many other technologies, such as XML and LDAP. Support for Unicode is included in all modern browsers, such as Netscape and Internet Explorer, Version 4 and later. If you'd like to learn more about Unicode, visit http://www.unicode.org/.

What does all this mean to you as a web application developer? Well, since ISO-8859-1 is the default charset for HTTP, you don't have to worry about this at all when you work with Western European languages. But if you would like to provide content in another language, such as Japanese or Russian, you need to tell the browser which charset you're using so it can interpret and render the characters correctly. In addition, the browser must be configured with a font that can display the characters. You can find information about fonts for Netscape at http://home.netscape.com/eng/intl/ and for Internet Explorer at http://www.microsoft.com/ie/intlhome.htm.

JSP is Java, so the web container uses Unicode internally, but the JSP page is typically stored using another encoding, and the response may need to be sent to the browser using yet another encoding. There are two page directive attributes that can specify these charsets. The pageEncoding attribute specifies the charset for the bytes in the JSP page itself, so the container can translate them to Unicode when it reads the file. The contentType attribute can contain a charset in addition to the MIME type, as shown in Figure 14-4. This charset tells the container to convert the Unicode characters used internally to the specified charset encoding when the response is sent to the browser. It is also used to set the charset parameter in the Content-Type header to tell the browser how to interpret the response. If pageEncoding is not specified, the charset specified by the contentType attribute is used to interpret the JSP page bytes as well. Conversely, if pageEncoding is specified but contentType contains no charset, the pageEncoding value is used for the response too. If no charset is specified at all, ISO-8859-1 is used for both the page and the response.[1]
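The fallback rules in the preceding paragraph can be sketched in a few lines of Java. This is not container code: CharsetResolutionSketch and its resolve method are hypothetical names invented for this illustration, and the sketch models only the rules as stated here, with null standing in for an attribute value that wasn't specified.

```java
// Hypothetical helper class, for illustration only; a real container
// implements these rules internally.
final class CharsetResolutionSketch {

    // Used for both the page and the response when nothing is specified.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Returns { pageCharset, responseCharset }.
    // Pass null for an attribute value that isn't specified.
    static String[] resolve(String pageEncoding, String contentTypeCharset) {
        // Page bytes: pageEncoding wins, then the contentType charset.
        String page = pageEncoding != null ? pageEncoding
                : contentTypeCharset != null ? contentTypeCharset
                : DEFAULT_CHARSET;
        // Response bytes: the contentType charset wins, then pageEncoding.
        String response = contentTypeCharset != null ? contentTypeCharset
                : pageEncoding != null ? pageEncoding
                : DEFAULT_CHARSET;
        return new String[] { page, response };
    }

    public static void main(String[] args) {
        String[] both = resolve("Shift_JIS", "UTF-8");
        System.out.println(both[0] + " / " + both[1]);   // Shift_JIS / UTF-8

        String[] none = resolve(null, null);
        System.out.println(none[0] + " / " + none[1]);   // ISO-8859-1 / ISO-8859-1
    }
}
```

Specifying both values, as in the first call, is the only combination where the file and the response end up with different charsets.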

[1] For a JSP Document (a JSP page in XML format, described in Chapter 17), UTF-8 or UTF-16 is the default, as determined by the XML parser.

Enough theory. Figure 14-4 shows a simple JSP page that sends the text "Hello World" in Japanese to the browser. The Japanese characters are copied with permission from Jason Hunter's Java Servlet Programming (O'Reilly).

Figure 14-4. Japanese JSP page (japanese.jsp)

To create a file with Japanese or other non-Western European characters, you obviously need a text editor that can handle multibyte characters. The JSP page in Figure 14-4 was created with WordPad on a Windows NT system, using a Japanese font called MS Gothic, and saved as a file encoded with the Shift_JIS charset. Shift_JIS is therefore the charset specified by the pageEncoding attribute, so the container knows how to read the file. Another charset, UTF-8, is specified for the response through the charset value in the contentType attribute. UTF-8 is an efficient charset that encodes each 16-bit Unicode character as one, two, or three bytes, as needed (characters beyond the 16-bit range take four bytes), and it is supported by all modern browsers (e.g., Netscape and Internet Explorer, Version 4 or later). It can be used for any language, assuming the browser has access to a font with the symbols for the language's characters.
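If you want to see the variable-length nature of UTF-8 for yourself, the following small Java sketch counts the bytes needed for three sample characters. The class and method names (Utf8LengthSketch, utf8Length) are invented for this example, and the sample characters are my own picks rather than anything taken from Figure 14-4. Unicode escapes are used so the source file itself stays plain ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8LengthSketch {

    // Number of bytes needed to encode the text as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));       // 1: plain ASCII letter
        System.out.println(utf8Length("\u00E9"));  // 2: e with acute accent
        System.out.println(utf8Length("\u3053"));  // 3: a Japanese hiragana character
    }
}
```

Text that is mostly ASCII, such as HTML markup, therefore costs one byte per character in UTF-8, while the Japanese characters between the tags cost three bytes each.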

Note that the page directive that defines the charset for the file must appear as early as possible in the JSP page, before any characters that can only be interpreted when the charset is known. I recommend you insert it as the first line in the file to avoid problems.
