Linguistic Diversity, Computers and Unicode

Jack Cain, Senior Consultant, Trylus Computing, Toronto, Canada

Paper presented at
Networking the Pacific: An International Forum
A conference presented by the
British Columbia Library Association
May 5-6, 1995
Victoria, British Columbia, Canada

Introduction

The region of the Pacific Rim is an excellent example of the linguistic diversity which is challenging the computer industry and those who use its products. In this paper, I would like to explore some of the reasons why, in spite of great advancements in computer technology, we are still far from a world in which information can be freely and easily exchanged regardless of the language or script in which the data is recorded.

I would like to highlight this problem with a brief anecdote which arose when I was attending the recent Association for Asian Studies Annual Conference in Washington, D.C. A scholar of Sanskrit was describing to me her frustration at delivering the manuscript of her book to her university press for publication. The manuscript was of course delivered as a computer file, since it had been written on a computer. But when she came to read the proofs which came back to her, she found, with some dismay, that all accented letters had been converted to unreadable garbage. (Accented letters had been used in recording the romanized forms of Sanskrit words in her book; romanized forms had been used to avoid the computer problems of dealing with the original Sanskrit script, Devanagari.) Each accented letter had to be corrected by hand and rekeyed at the university press. Why did this happen? Perhaps many of those present have had a similar experience in transferring or moving computer files which contain data other than English. The root of this problem can be traced to the very beginning of computing technology in the first half of this century, and it has yet to find a universally implemented solution.

The Typewriter

You will recognize this overhead (Figure 1) as a slightly different version of the usual computer keyboard which is now so much a part of the lives of many of us. I will return to the differences in this particular keyboard later, but the point I wish to make here is that it is very similar to the keyboard of its predecessor, the typewriter. And there lies the root of our problem.

As the first computer keyboards were designed, they, like the horseless carriage, imitated their forerunners. And perhaps this would not have been such a bad thing except for the fact that the contents of this keyboard then determined, presumably by default, the content of the coding systems used to record data in computer memory.

ASCII and its Step-Children

As many of you know, ASCII (American Standard Code for Information Interchange) is the name of the coding system which is used to store in computer memory most of the data now held in North American computers. The next overhead (Figure 2) shows the familiar ASCII table as it appears in the USMARC Specifications manual. It is just the left half of this chart which shows the ASCII table. Perhaps for persons with backgrounds such as mine, growing up in an almost entirely English world, this chart seems no more unusual than the familiar keyboard. But for many others the startling fact about the ASCII chart is that it makes provision only for English data and not for data in languages which, like English, may use Latin letters but which might have additional requirements such as accents or special letters. (English of course uses the letter "W" which is not a Latin letter but most of us have forgotten about these details since they are so familiar.)

On the right side of this chart, you will see the attempt made by the library community in North America to extend the ASCII chart to accommodate other languages (although still using only Latin or modified Latin letters and not accommodating scripts other than Latin). This was a sensible effort made early on at the beginning of the MARC format in 1968. And it is wonderful that this standard is widely used by library automation vendors in North America. It is because of this standard that the sharing of bibliographic data among libraries in North America has been made possible. But is this standard well known in the computer industry in general? Not at all. Try going into your local computer store and asking for word processing software or spreadsheet software or, heaven forbid, a keyboard that follows this standard. No one will know what you are talking about. This standard is virtually unknown outside the library community. An even worse problem is that this standard has never been accepted internationally. Figure 3 shows a similar chart illustrating one of the standards published by ISO (the International Organization for Standardization) in Geneva. Notice that the right side of the chart is very different both in content and arrangement. (Those who are very observant will see that even the left side of this chart is slightly different. This difference is because what has been called "ASCII" in the USMARC chart is actually a modified version of ASCII.)

It would not be so tragic if what we were dealing with were only an abstract standard. But what these varying standards mean is that databases of information are being built which are not mutually intelligible, since the same character is encoded with different codes in computer memory. In the two charts we just examined, you will also find that chart 1 contains some characters which are missing from chart 2 and vice versa; this means that any such character encoded on a system using chart 1 will be totally undisplayable on a system using chart 2.
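
To make this concrete, here is a small sketch in the Python language of the "missing box" problem. (Python does not include the library community's extended ASCII tables, so two other conflicting charts, Windows code page 1252 and ISO 8859-1, stand in here for chart 1 and chart 2.)

```python
# A character which has a "box" in one chart but not in the other
# simply cannot be represented on the second system.
text = "Œuvre"  # the OE ligature, as in the French word for "work"

print(text.encode("cp1252").hex(" "))  # 8c 75 76 72 65 -- box 8C holds it
try:
    text.encode("latin-1")             # ISO 8859-1 has no box for it
except UnicodeEncodeError as err:
    print("no box available:", err)
```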

It is just this problem of conflicting standards which caused the problem for my acquaintance in Washington who was publishing a work on Sanskrit. As a footnote, it is perhaps worth mentioning that the conflicting standards used by the Sanskrit scholar and by her university press were neither of the two mentioned here but completely different ones.

Non-Roman Data

Perhaps another point needs to be made about the charts we examined above. These charts show the assignment of data elements in each byte of data stored. Each byte stored has 8 bits which can be either "off" or "on". If you work out the mathematics you will see that this allows for 256 combinations, and the charts we just saw have in fact 256 "boxes" available.

What is perhaps not easily appreciated about the mathematics is that these 256 "boxes" are the only ones available. All data must fit into them.
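
For those who wish to check the arithmetic, a one-line sketch in Python:

```python
# Each of the 8 bits doubles the number of distinct patterns a byte can hold.
print(2 ** 8)   # 256 -- the total number of "boxes" available
```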

Therefore, consider what we must do to encode a script different from Roman, such as the Cyrillic script used by Russian (a Pacific Rim language). We must redefine some of the 256 "boxes" to stand for Cyrillic letters; and in doing so we will be creating a standard which is in conflict with the previous two charts we have examined. It is just this proliferation of charts (called "code pages" in the computer industry) which is the root problem in the incompatibility of data which is stored in different languages.
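
A short sketch, again in Python, shows the conflict directly: the very same byte stored in memory is a different letter depending on which code page the receiving system assumes. (ISO 8859-1 and the Windows Cyrillic code page 1251 are used here as convenient examples.)

```python
# One byte value, two incompatible interpretations.
b = bytes([0xE6])
print(b.decode("latin-1"))  # 'æ' under the Western European chart
print(b.decode("cp1251"))   # 'ж' under the Russian (Cyrillic) chart
```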

Code Pages and the Internet

The system of different code pages for different languages and scripts is still the prevalent system in the computer world today. This system works well if the users of the data created are using only one code page. However, for projects which require many code pages to be used at once, enormous complexity is introduced. It is also true that data created today for one purpose may very well be required to be moved to a new environment or used in a new way. It is this movement of data across platforms and environments which makes the system of code pages so cumbersome. For example, the new environment into which the data moves may not support the code page in which the data was created. When this happens the data becomes unreadable and meaningless garbage. Given the movement of data we now see happening on a daily basis on the Internet, this problem takes on alarming proportions.
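
The Sanskrit scholar's experience can be reproduced in a few lines. (The two encodings below are modern stand-ins chosen for illustration; her actual conflicting standards, as noted above, were different ones.)

```python
# Text written under one code page and read under another: the accented
# letters of romanized Sanskrit become "unreadable garbage".
original = "Śāstra"
stored = original.encode("utf-8")     # written on the author's system
garbled = stored.decode("latin-1")    # read assuming a different chart
print(repr(garbled))                  # 'Å\x9aÄ\x81stra' -- garbage
```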

Standards for Asian Data

When we come to examine the situation with standards in Asia we find a sadly similar picture -- no unified standard for the recording of data but a multiplicity of standards.

For those of you who have had some introduction to languages such as Chinese and Japanese (or for those of you who actually know these languages well), it is immediately clear that 256 "boxes" is not enough. How to get around this mathematical limit?

Figure 4 is an attempt to illustrate as simply as possible the basics of encoding and how a language such as Japanese can be "fitted in".

This figure shows how the letter "A" is encoded in a string of 8 bits or "on/off" switches. In order to accommodate languages such as Japanese, the "trick" of using two bytes at once is used. Thus each character of the script is assigned two bytes or 16 bits of information. The mathematics (256 times 256) now provides for 65,536 possibilities.
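
A small Python sketch of the same idea (Shift-JIS, a common JIS-based encoding, is assumed for the Japanese example):

```python
# One byte of 8 on/off switches holds the letter "A"; a Japanese
# character is given two bytes, or 16 switches, at once.
print(format(ord("A"), "08b"))         # 01000001
for byte in "日".encode("shift_jis"):  # first character of "Nippon"
    print(format(byte, "08b"))         # 10010011 then 11111010
print(256 * 256)                       # 65536 two-byte combinations
```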

However, we are still using the same code space; these are still the same "boxes", since as noted above there is in fact only one code space so far as the computer is concerned. Thus the first character here (of the word Nippon, meaning Japan) actually occupies the same boxes as the letter "F" and the "|" (vertical bar) in ASCII. And this is exactly what you will see on the computer screen if you take this character of Japanese data from a Japanese computer and try to view it on a plain, ordinary North American personal computer. (I say plain and ordinary because it is now possible to make the North American computer display the correct Japanese character if it is provided with the appropriate software.)
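
This collision, too, can be demonstrated. The sketch below uses Python's ISO 2022 form of the JIS encoding; stripping the surrounding escape sequences leaves the raw two-byte JIS codes, which an ASCII system reads as ordinary letters.

```python
jis = "日本".encode("iso2022_jp")  # "Nippon", the name of Japan
print(jis)                         # b'\x1b$BF|K\\\x1b(B'
raw = jis[3:-3]                    # drop the escape sequences
print(raw.decode("ascii"))         # F|K\  -- what an ASCII screen shows
```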

Figure 5 shows this same character, the first character of the word meaning Japan, as it is encoded in a variety of national computer standards. JIS is the standard used by most personal computers in Japan. GB2312 is the standard used by most computers in the People's Republic of China. KSC 5601 is the standard from South Korea. Big 5 and CCCII are standards from Taiwan. And EACC is the standard which was derived from CCCII and is used in the United States for Asian data. Note that the last two examples use 3 bytes per character, not two. The mathematics of three bytes (256 times 256 times 256) gives over sixteen million possibilities. But using 3 bytes instead of 2 drives most computer people crazy, since 2 is a much more "convenient" number. The fact that two is twice one and half of four makes a difference, as we shall see later with Unicode.
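
Python ships tables for four of the standards just named (CCCII and EACC are not built in), so the divergence is easy to see: each standard assigns the same character an entirely different pair of bytes.

```python
# The first character of "Nippon" under four national standards.
for codec in ("shift_jis", "gb2312", "euc_kr", "big5"):
    print(codec, "日".encode(codec).hex(" "))
# e.g. shift_jis gives 93 fa, gb2312 gives c8 d5, big5 gives a4 e9:
# no two standards agree.
```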

Writing with Chinese Characters

The system of writing developed in China is very mysterious to many people so, at the risk of boring those who already know all about Chinese characters, I thought that it would be useful to say a few words about this topic since this writing system or script is an essential part of the data being stored in databases around the Pacific Rim today.

I have often heard the sentiment expressed by English speakers, when faced with the complexity of Chinese characters, that perhaps they will be abandoned for a simpler system. Although there was an initial hurdle for computers to adapt to this system, much data is now being stored which includes Chinese characters. In fact, the use of such characters in the Japanese language is reported to be increasing, since word processing software on personal computers in Japan makes it easy to find and include Chinese characters in written documents, whereas when documents are written by hand it may be too difficult to remember how to write the character, so, as is permitted in Japanese but not in Chinese, a phonetic character may be used instead. (It is a commonly noted fact that Chinese characters are much harder to remember how to write correctly by hand than to recognize when they appear on a printed page.)

I would like to use Figure 6 to explain a little about Chinese characters. This series of 4 symbols, which I clipped from a paper somewhere, is saying "I love Japan". The last two characters are Chinese characters, the same ones used in Figure 4 above. These 4 symbols are an interesting mixture of scripts and languages; the kind of mixture which in fact seems to be happening more and more often. I am using this example partly because of the use of the heart symbol to stand for "love". This is one of the rare examples from our Western culture which helps explain how Chinese characters work. We use the shape which has come to stand for the meaning "heart" or "love" in place of our usual written word for "heart" or "love". In using this heart symbol, we are doing something quite similar to what happens when Chinese characters are used. Note that if I used the heart symbol in a French context, I would pronounce it "aime", not "love"; there is nothing in the symbol itself to tell me how to pronounce it in words; I just have to know that in my head by convention.

Now, let us look at the last two symbols, the Chinese characters which in this context mean "Japan". I choose these characters because it is still possible to see in them the original symbols. The first one, which was originally a circle with a dot but has been stylized to make it easier to write with a brush, is the symbol for "sun"; the second is the symbol for "tree", with a short bar across its trunk to make it mean "roots" instead of "tree". Together they mean the "roots of the sun", or where the sun rises, that is, the East, which is the location of Japan as seen from China, where these characters were invented. These two characters are exactly the characters which are still used today in China and in Japan to stand for the name "Japan". In China, however, they represent the word "Rìběn" (approximately "ruhbun"), which is how the Chinese word for Japan is pronounced in the Chinese language (note that many sounds in Chinese cannot be represented at all accurately by Latin letters), whereas in Japan they are used to stand for the word "Nippon", which is how the Japanese word for Japan is pronounced in the Japanese language. Notice how similar this situation is to the fact that the heart symbol is pronounced "love" or "aime" depending on whether it is used in an English or a French context. I think that this example also shows a little of why Chinese characters seem stubbornly to persist and are not dying out. They are very convenient, compact graphic symbols which are independent of pronunciation; they can therefore be used in any language or dialect to convey meaning. Perhaps we will start seeing them more in English, as in this particular example.

The Evolution of Chinese Characters

Figure 7 shows some Chinese characters from the year 571 BCE along with, in the upper right, a transcription into modern characters such as those one might see today in recently printed publications from, for example, Taiwan. I believe that all the characters in this modern transcription (but of course none of the ancient inscriptional forms) are present in the Taiwanese "Big 5" computer standard. Taiwan still uses what are sometimes called in English "traditional characters". Although the printed form of Chinese characters remained very stable for nearly 2000 years, in the 20th century both Mainland China and Japan have pursued various programs of character simplification and standardization. Figure 8 shows the word "library" written (reading horizontally left to right), on the first line, in traditional characters; on the second line, in its modern Japanese form, in which the first character has been simplified; and on the third line, in its modern Mainland Chinese form, in which all three characters have been simplified. Figure 9 shows another list of characters comparing these three forms. Note that in the last three examples in Figure 9 the Japanese and Mainland Chinese simplified forms are the same, but that in the first five examples they are not the same and range from being somewhat similar to being completely different.

There has always been for me a certain element of irony in these programs of simplification. While it is true that the simplifications may have helped more people to learn and retain these languages more easily, for the librarian or historian or archivist the older forms cannot be ignored. For them, the simplification is actually a complication, making it necessary to learn two (or more) systems instead of the formerly single standard system of characters used in China for many centuries. And similarly, for the requirements of computer processing, all three of these standards (traditional forms, Japanese forms and Mainland Chinese forms) must be taken into account in any system that wishes to deal with all of the Pacific Rim. Not only must it be possible to allow for their processing, it must be possible to convert back and forth between them.

I would now like to leave this discussion and examine the Unicode standard and the promise it brings of resolving some of the issues discussed above.

What is the Unicode Standard?

In recognition of the basic linguistic diversity facing the computer industry, a number of interested individuals from companies and institutions requiring international information systems formed an informal working group, called the Unicode project, in early 1989. Major contributors to this early development included Apple, GO, IBM, Metaphor, Microsoft, NeXT, The Research Libraries Group, Sun Microsystems, and Xerox. Aldus, Lotus and Novell subsequently also took active roles in this early development. The term "Unicode" was first used by Joe Becker of Xerox in 1987, and comes from the phrase "unique, universal and uniform character encoding". "The Unicode standard evolved from the industry's need for a 16-bit version of ASCII," said Joe Becker.

In Asia, the emerging standard was reviewed by an ad hoc committee called the "Joint Research Group" (JRG) which included representatives from Hong Kong, Japan, South Korea, Singapore, Mainland China, and Taiwan. In addition, a review was conducted by a group of linguistic experts assembled at the University of Toronto, a group of which I had the honour of being a member.

In 1991, the informal working group became "Unicode, Inc.", an incorporated consortium, which then published (through Addison-Wesley) the actual printed standard. It included over 27,000 characters, representing all the characters in the major publishing languages of the modern world (plus a great number of obscure and not so major characters).

In a very significant subsequent development, the Unicode standard was reviewed and then published as a compatible ISO standard, number 10646. The number 10646 was presumably chosen to reflect the hopes for this standard, since the ISO version of the ASCII standard is number 646.

General Characteristics of the Unicode Standard

In the Unicode standard every character, including the characters from "ASCII", is represented by two bytes (16 bits) of information. In order to be able to encode as many characters as possible, the standard uses codes such as "00" (zero, zero) which formerly were reserved for control code purposes and were not to be used as data; this practice may cause some software to malfunction and give a few hiccups to telecommunications.
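
A sketch of this point, using the 16-bit (UTF-16) form of Unicode as provided in modern Python:

```python
# Even a plain English word carries "00" bytes in a 16-bit encoding;
# older software that treats 00 as "end of data" will misbehave.
print("Japan".encode("utf-16-be").hex(" "))
# 00 4a 00 61 00 70 00 61 00 6e
```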

The standard has been published as a two-volume set. The second volume contains all the Chinese characters which have been accepted in the standard, and the first volume contains everything else. In the case of the Chinese characters, what has been created is called a "unified set" of characters. This means that for characters such as the two used in the name of Japan which we saw above, characters used in all three of the languages which currently use Chinese characters (that is, Chinese, Japanese and Korean), only one Unicode code will be assigned to each such character. One of the most difficult and controversial jobs in the creation of this standard was the problem of when to merge and create only one Unicode code and when to keep two (or more) separate codes for what were considered to be distinct characters.
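
The unification is visible in any Unicode-capable system today; for example, in Python the two characters of the name of Japan each report a single code, regardless of whether the surrounding text is Chinese, Japanese or Korean.

```python
# One unified code per character, shared by all three languages.
for ch in "日本":
    print(ch, hex(ord(ch)))   # 日 0x65e5, 本 0x672c
```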

The first volume of the set contains all the alphabetic and syllabic scripts such as Arabic, Thai, Cyrillic and so on as well as a very comprehensive set of symbols and special signs. The latter have been grouped in a very logical way and in addition an index is provided to help find them by their English names.

Who is Using the Unicode Standard Now?

When the standard was first published in 1991, its promoters confidently predicted that the market would be quickly flooded with Unicode-based products (both hardware and software). Unfortunately, for the most part, as of today, the world is still waiting for this to happen.

In making the above confident prediction, the Unicode authors were no doubt considering only the problems the standard would solve, without considering the hard realities of the investment in existing standards such as ASCII and JIS and the general reluctance of the computer industry (like any other industry) to make changes which affect the very foundations on which the industry is built. (Just as no one has succeeded in changing the very non-ergonomic layout of letters on our familiar ASCII keyboard, a layout which was initially designed to prevent typewriter keys from colliding with each other!)

There are a number of very practical obstacles which stand in the way of the launching of products based on the Unicode standard. First it must be recognized that the Unicode standard is just that: an encoding standard for computers. What else is required? Well, many basic things. For example, fonts need to be developed for all characters in the standard. Much work has already been done by organizations such as Adobe, but, to my knowledge, comprehensive Unicode fonts are still not available.

Related to fonts is also the question of keyboards. If I may return to Figure 1 for a moment, I would like to explain that although this keyboard includes the Latin letters and symbols from ASCII, and is therefore very familiar, it also includes two other sets of symbols which are provided to allow for the typing of Chinese data. In this particular case the keyboard was manufactured for use in Taiwan. As such it is not suited to typing data in Mainland China or Japan. What faces the potential developer of products based on the Unicode standard, then, is the necessity of providing for data entry in a variety of languages and local conditions, all with their unique requirements. Being here in Canada, I cannot help reminding us that there is such a thing as a French Canadian keyboard layout, which is different from the French keyboard layout used in France. Just to provide all the possible keyboard layouts required for a full implementation of the Unicode standard is in itself an extremely major undertaking. And who would prefer the Unicode-based product to their current parochial products? Only those with truly international requirements, such as libraries and other information specialists. Perhaps we will have to wait for a critical mass of demand before such developments will be complete.

In answering the question of who is using the Unicode standard now, I would like to mention one very promising example. It is a product from the Institute of Systems Science, National University of Singapore, called Multilingual Application Support Service ("MASS"). This product is based on the Unicode standard, and although it does not handle all Unicode characters, it does handle the Arabic, Chinese, Cyrillic, Greek, Japanese, Korean, Tamil and Thai scripts as well as the accented Latin letters required in French, German, Italian, Spanish and Vietnamese. In an interesting development, this product has been selected by the National Library of Australia in its implementation of a national "CJK" (Chinese, Japanese, Korean) network for the country. This network is in the process of being implemented now and will bear watching as a model for future developments. (For more information see the MASS homepage on the World Wide Web: http://www.iss.nus.sg/RND/MLP/Projects/MASS/MASS.html)

Conclusion

I hope that this brief paper has contributed to clarifying some of the problems associated with the exchange of multilingual computerized data. I would be most interested in continuing to exchange ideas and information on this topic with others who are also engaged in it.

p.s. Please note the following conventions:

* The word "Unicode" should be written in upper and lower case, not all upper case ("UNICODE").
* Since "Unicode" is a trademark of the Unicode Consortium, the correct usage is as an adjective (e.g., "the Unicode standard" rather than "Unicode").


Copyright: British Columbia Library Association 1995
