Cambodia Takes on US Software Giants in Battle for Khmer Computer Script

It started soon after the first e-mail zipped out of Cambodia, long before Internet cafes were established and overseas Internet phone calls were possible. A government minister asked the man who established the first e-mail system in Cambodia if he could send an e-mail in Khmer to his wife, who didn’t speak English.

He was told he couldn’t.

That’s still the answer, more or less, seven years after e-mail came to Cambodia and five years after the arrival of the Internet. The people who designed the early computers didn’t speak Khmer, so the computers don’t speak it either.

The same could be said of Arabic, Mongolian, Pali and any of dozens of other languages that couldn’t appear on a computer screen designed by Microsoft or Apple without heavy tinkering by a programmer.

A few of those programmer types have come and gone in Cambodia, and today there are ways to trick a computer into displaying and printing Khmer. A handful of people have developed Khmer font programs, which are nothing more than pictures, resembling Khmer characters, which appear when various keys are struck.

It works fine for writing a note or printing a newspaper, but the font programs fall short when it comes to e-mail. The messages are often misread by other computers. For example, one computer’s picture of the Khmer word for “fish” is another computer’s %6h!.

That’s changing. The world’s biggest computer companies have created a consortium to formulate a code capable of rendering all of the world’s languages. Called Unicode, it is a numbered list with spots reserved for every character ever used in a written language. The list has room for nearly one million characters.
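The “numbered list” can be inspected from most modern programming languages. The short sketch below uses Python and its standard `unicodedata` module (the choice of language and the helper name `describe` are assumptions made for illustration, not anything the consortium prescribes) to look up the number and official name assigned to a Latin letter and to the first Khmer consonant.

```python
import unicodedata

def describe(ch):
    """Return the Unicode number (code point) and official name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe("A"))       # a Latin capital letter
print(describe("\u1780"))  # the first consonant in the Khmer block
```

Because every computer that follows the list agrees on these numbers, a message made of them should mean the same thing on the receiving end, whatever font is used to draw it.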

The list will eventually be installed in every computer in the world, making it possible to send messages across time zones and cultures, with the language intact.

The development of Unicode has taken place in meetings throughout the world for more than a decade. Those meetings cover a rich marriage of language and computer science, of world cultures and the silicon chip, of how a computer program should be written to accommodate the many ways that we as a planet say hello.

Most of the world’s major languages have been completed, including thousands of characters in Japanese, Korean and Chinese scripts. But not all languages have been incorporated smoothly into Unicode. And then there’s Cambodia.

Next week, a group of Cambodians will travel to Dublin, Ireland, to meet with the experts who are designing Unicode: representatives from Microsoft, IBM and Apple are scheduled to attend.

The Cambodian delegation will tell the world’s computer companies that the consortium has made mistakes converting the Khmer language, and that the computer companies should stop the global implementation of a program that could shortly be installed in millions of computers around the world.

According to the Cambodian delegation, the consortium’s code for the Khmer language includes a letter that does not exist in Khmer, and does not include two vowels. They say the consortium’s version of Khmer also fails to provide individual marks for so-called “subscript characters” which sometimes appear in conjunction with the major characters in the alphabet.

The Cambodian delegation contends the consortium accepted the Khmer code in late 1999 without any agreement from the Cambodian government, a violation of the consortium’s own rules.

The Cambodians say they do not know what will happen in Dublin, but they hope someone will listen.

Why is it important? Listen to a scenario offered by Norbert Klein, the founder of Open Forum and the man who first brought e-mail to Cambodia.

“We could [ignore this], but probably some guys in the big universities in the West would use the Unicode system [for Khmer] in their research,” Klein said. Over time, the Unicode version of Khmer would become the international standard used on the Internet.

Cambodia could develop its own digital version of what it considers to be pure Khmer script, but it would be incompatible with any document or web page created elsewhere in the world.

“Eventually, we would get locked out of the World Wide Web,” Klein said.

“This is a very sensitive project, please handle the information very carefully,” begins one of several letters sent to the Cambodia Daily from experts who either support or oppose the current Unicode version of Khmer.

The Cambodian delegation believes it is the nation’s right to preserve its language as it sees fit.

“We are surprised and concerned that this process was conducted without any official involvement or endorsement by the authorities of Cambodia,” said Leewood Phu, secretary general of the National Information Communications Technology Development Authority, the Cambodian government’s official body for dealing with a Khmer language standard.

Unicode supporters say it cannot be changed because it has been sanctioned by the International Organization for Standardization, a Geneva-based body that approves world standards for everything from gasoline to wing nuts.

They say changing Unicode would create problems because thousands of computers have already been sold with the Unicode standard embedded in them. And they fear if Cambodia succeeds in changing Unicode, other countries may want to change their codes.

It turns out 1999 was a bad time for the Unicode consortium to accept a Khmer code. Soon after it was officially drafted, the Cambodian government sponsored a forum on the Khmer language. A new dictionary was being considered at the time, and scholars were locked in a sometimes impassioned debate over the history of the Khmer language, which has been manipulated by invaders and was partially lost during the Khmer Rouge regime.

The new national dictionary, which will contain up to 34,000 words according to the National Language Institute, must bridge the gap between the traditional Khmer collected by the Buddhist monk Chuon Nath in his 1915 dictionary and the modern “Khmerization” writing that was born in the late 1960s and is common among young people today.

To teach a computer the Khmer language, one must understand the intricacies of Khmer script, its shapes, and especially its peculiar way of arranging subscript characters around major characters to form a word, sometimes to the left, sometimes to the right, sometimes above, sometimes below.

That same person must also know English, the language of computer programming, and they must know how to write software. They must also have time and money to devote to what is essentially an esoteric project.

Enter Maurice Bauhahn, who by e-mail comes across as an intelligent eccentric who takes an interest in Bible translations, the Khmer language and the many variations of programs available for Apple computers.

An e-mail inquiry was quickly returned, complete with long answers, e-mail addresses of relevant experts and web site addresses of academic papers (often written by Bauhahn himself).

“I early saw the need for a standardized encoding of the Khmer script,” Bauhahn wrote.

By early he means 1981, when he first arrived in Cambodia carrying an Osborne computer. He took on the task of what became a 1,600-page Khmer-French-Vietnamese-English medical dictionary.

He soon learned important lessons about the marriage of ancient script and modern science. He hired a team of four people to type in the 1915 Chuon Nath dictionary, using a font program that broke the Khmer language down line-by-line into individual marks called glyphs. It was capable of making most of the words, but not all.

“Eventually I had to create a second Khmer font to add glyphs that could not fit into the first font,” he wrote. Bauhahn believes Khmer is among the most complex languages in the world, and he found it difficult to devise a computer program that would faithfully capture the language without distorting it or violating its rules.

Bauhahn was still in Cambodia when a 1993 conference was held on standardization of a Khmer script, chaired by Peter Lofting, who today works for Apple Computer.

Was the Cambodian government involved in the development of Khmer Unicode? Lofting says yes, offering a list of people who were at the 1993 meeting: Om Yienteng, today a close adviser to Hun Sen but at the time the director of a computer center; Ouk Chhieng, director of the University of Phnom Penh; Sorasak Pan, undersecretary of state for the Council of Ministers; and Khieu Kanharith, secretary of state for the Ministry of Information.

What came out of the conference was the idea that an encoding of Khmer subscripts should work the same way as in other Indic languages, with the help of a character from the Indic languages that indicates a subscript is being formed. Unicode calls it the Virama character.

That was Lofting’s idea, Bauhahn later said. Using the Virama model, a person types a “Virama” key and then types the character that is to be made a subscript. The end result is a Khmer character with a subscript positioned in the correct place. The Virama character does not actually appear on the finished form.
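The mechanism can be sketched in a few lines of Python. In the published Unicode tables the Khmer sign that plays the Virama role sits at U+17D2 and is named KHMER SIGN COENG; the helper `with_subscript` is a hypothetical name used only for this illustration. The example builds the usual spelling of the Khmer word for “fish” and lists the numbers actually stored.

```python
import unicodedata

COENG = "\u17D2"  # the Khmer block's Virama-like sign (KHMER SIGN COENG)

def with_subscript(base, sub):
    """Encode `base` carrying `sub` as a subscript: base, sign, then consonant."""
    return base + COENG + sub

# TA with a subscript RO, followed by the vowel sign II: the word for "fish".
fish = with_subscript("\u178F", "\u179A") + "\u17B8"

for ch in fish:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Four numbers are stored, but a font that supports the model draws only three shapes: the sign itself is never shown, and serves only to tell the computer to draw the following consonant in its subscript form.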

“The Virama model is one way to model the connections in a script,” wrote Lofting. “There are others… Each method has consequences for use and processing of the data and often trade-offs must be made between one design choice and another.”

This would cause endless debate when a Cambodian government panel formed in 2000 determined that the Virama-like character does not officially exist in Khmer.

After the 1993 conference, Bauhahn, then working for UNICEF, gathered a group of Khmer linguists appointed by the government to devise a list of characters to encode in Unicode.

By now Bauhahn had spent nearly a decade studying the relationships between Khmer script and computer programming. After experimenting with a system that used an individual code for each Khmer subscript character, he decided a Virama model would work best.

The linguists drafted what Bauhahn considers the original Khmer Unicode document for the encoding system now the subject of debate.

Soon there was an on-line mailing list for people to discuss the Khmer Unicode draft. Its aspects, including the Virama character, were debated, sometimes hotly, according to Bauhahn. But not one Cambodian name appeared on the mailing list.

Bauhahn was puzzled by this, and would later ask why no one from the Cambodian government who objected to the Virama model joined the debate on the mailing list.

By 1997, Norbert Klein’s name was the only one on the mailing list from Cambodia. Khmer Unicode moved toward completion and the people involved had no reason to suspect anything was wrong.

“The Khmer encoding was properly devised in consultation with Cambodian government-appointed linguists,” said Michael Everson, the Irish representative to the Unicode consortium. “It went through the ISO balloting process. Most importantly, it works. There is no string of Khmer characters, whether for Khmer or for Sanskrit and Pali, that cannot be represented with the current encoding.”

As Unicode moved ahead, more people were using Khmer font programs and crude solutions in order to type Khmer on computers. Starting with one of the earliest font programs, developed by Dr Gerard Diffloth of Cornell University in the US in the early 1980s, the Khmer language has been recreated for computers no less than 23 times.

The font programs allow a rudimentary typing of Khmer text onto a computer. A few of the fonts allow for a word break to be inserted into the text to make it easier to break words at the end of a line of text, since the Cambodian language does not require spaces between words.
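One common way to mark such a break is an invisible character. The Python sketch below uses Unicode’s ZERO WIDTH SPACE (U+200B) for the job; whether any particular Khmer font program worked this way is an assumption, and the function names are hypothetical.

```python
ZWSP = "\u200B"  # ZERO WIDTH SPACE: takes up no room, but marks a legal break point

def mark_word_breaks(words):
    """Join words with invisible break marks instead of visible spaces."""
    return ZWSP.join(words)

def split_at_breaks(text):
    """Recover the individual words by cutting at the invisible marks."""
    return text.split(ZWSP)

# Two Khmer words, written as code points: "Khmer" and "fish".
words = ["\u1781\u17D2\u1798\u17C2\u179A", "\u178F\u17D2\u179A\u17B8"]
text = mark_word_breaks(words)
print(len(split_at_breaks(text)), "words,", text.count(ZWSP), "hidden break")
```

On screen the text looks unbroken, but a program can still find where one word ends and the next begins, which is what line wrapping needs.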

But the font programs cannot spell-check or do any of the more powerful computer searches. As a result, the computer is little more than an electronic typewriter.

The combinations necessary to type a single character are time consuming and complex. Double, triple and even quadruple key combinations are necessary to call up a Khmer letter.

An innovative solution was unveiled in 1999 when the Japanese government-funded Khmer Philology Project introduced its Intelligent Khmer Writing System, which eliminated those multiple key combinations.

But it was a different system than Unicode.

The first official objection to the Unicode consortium’s version of Khmer was a letter written in 2001 by Leewood Phu.

“Due to the fact that the appropriate Cambodian authorities are still in the process of finalizing their position and coming up with an endorsed national standard, we request you to halt all activities of disseminating any unendorsed code tables, and to advise all relevant bodies, particularly those who may be implementing such code, of our official position as expressed above,” he wrote.

Phu said recently he hasn’t received a reply from Thomas Frost, an AT&T employee in America who is also chairman of an important Unicode consortium committee.

Leewood Phu said Burma and Laos would both file for Unicode changes if Cambodia was allowed to make changes.

A Cambodian delegation attended a Unicode consortium meeting held in Singapore last October. Organized as the newly-formed Committee for Standardization of Khmer Characters in Computers, and accredited by the Industrial Standards Bureau of Cambodia, the group delivered its own code table for the Khmer language.

The “Cambodian Standard Coded Character Set” is based on the 1915 Chuon Nath dictionary. The code assigns individual spaces on the Unicode list for each subscript found in Khmer, making the Virama sign unnecessary.

It also added 32 unique symbols for lunar dates and 10 specialized characters used by fortune tellers.

The Khmer team was told at a subsequent Unicode consortium meeting in California that they would be allowed to add the lunar calendar characters and the fortune-teller characters, but that nothing else would be changed. The two missing vowels were taken care of by other characters in the code table, the group was told.

The Cambodian panel was frustrated because they believed they had devised a superior system. They found a sympathetic ear in Paul Nelson, a Microsoft programmer who works with the Unicode consortium.

“There must also be an acceptance of Unicode by the Cambodian government,” he wrote to the Daily. “Microsoft has encountered many problems in the past when we implemented our own standards. This is a problem we are seeking not to repeat… This is a very important thing for the Cambodian government to do as it will have a huge impact on computing with Khmer for years to come.”

But Nelson ultimately sides with the Unicode consortium, which is significant because he is the man at Microsoft who will one day embed the Khmer language into the operating systems of the world’s computers. Once the Khmer code is distributed, it will be nearly impossible to change it.

Nelson says the Cambodian delegation is correct when they say the Virama model is not linguistically correct. But he says the end result is a perfectly normal looking Khmer letter.

Nelson will soon travel to Cambodia to meet face-to-face with the Cambodian delegation. He hopes to test the existing Khmer Unicode table and talk about the best ways to embed it in computers.

Peter Lofting, the man who chaired the 1993 conference on Unicode, describes objections to the Virama model this way:

“If you had a mechanical typewriter for Khmer, and somewhere deep in the manufacturer’s technical specification you found that some of the accent keys were actually labeled with Indic names because the machine was based on an earlier (and successful) Indian model, would anyone really care? Does the idea that a certain key is or is not ‘really’ Khmer in its origin have a bearing on what a Khmer user can type on that typewriter? The relevant question is does complete and accurate Khmer text come out of the system?”

“If the Unicode model is not broken, then it will be hard to persuade computer makers to fix it,” said Lofting, who works for the Fonts Group at Apple. He said that Apple has no plans at present to add Khmer to the Mac operating system.

Michael Everson, the Irish representative to the Unicode consortium and the co-author of a paper with Bauhahn defending the Virama model, said it works well with the early Sanskrit and Pali texts written in both early and modern Khmer script.

“The Virama model is more flexible,” he wrote, adding that it will be easier for computer companies to implement because they will have experience with it in other Brahmic scripts.

Everson says the governments of Sri Lanka and Burma were also troubled by the Virama model, but eventually agreed to use it because it supported Sanskrit and Pali languages, and was far easier for the computer industry to implement.

Others, notably those behind the development of the Intelligent Writing System, say a change is necessary.

“I think that the total revision should be allowed in this case,” wrote Harada Shiro, a project researcher at the University of Tokyo and one of the people who worked on the Intelligent Writing System, “because the process of incorporating Khmer code to the ‘international standard’ had several critical irregularities and the absence of due process made the code illegitimate from the beginning. The most serious irregularity is that the code was established without Cambodian participation or endorsement.”

The Khmer Philology Project’s strong stand against the existing Khmer encoding formula, and its obvious self-interest in promoting its own Intelligent Writing System, have drawn criticism. Some people believe the project is trying to manipulate the Cambodian government into supporting its product.

But the project has one strong point in its favor: the Cambodian government never officially approved the script developed by the Unicode consortium.

For Maurice Bauhahn, the long debate has become personal.

“It has really saddened me that such animosity has grown up over Khmer Unicode and antagonism to me personally as I have tried to help Cambodia,” he wrote. “And I certainly do not want to throw fuel on the fire! I want to see Khmer Unicode implemented and used!”

Norbert Klein, who will travel with the Cambodian delegation to Ireland next week, doesn’t know what to expect.

“We are somewhat stuck at this point,” he said. “And if you ask me what is going to happen I cannot tell you. It is a complete deadlock.”
