Foreigners seeking to learn the Khmer language have long been restricted by a dearth of resources beyond beginner’s textbooks, often relying on conversation and the methods of individual teachers to advance their vocabulary and comprehension.
Aiming to put a small dent in the situation, U.S. citizen Matt Fisher, 25, last month launched the Kheng Khmer-English Audio Dictionary, a free online resource that provides translations and on-demand voice recordings of about 3,000 words in Khmer.
The words come from a frequency list, apparently the first of its kind, compiled by Mr. Fisher by scraping online news sources and texts to identify the most commonly used words in the written language.
“The initial motivation came from using other dictionary websites,” said Mr. Fisher, an independent software developer and former linguistics student who has studied Khmer for 10 months.
“A search for a common word would return 10, 20, or even 100s of results and I had no way of knowing which words were commonly used and which were obsolete or obscure,” he said by email.
“In written English [about] 135 words comprise half of any given piece of text, and I found nearly the same to be true of Khmer (131 by my count), so knowing just those words goes a long way in understanding the language,” he explained.
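A coverage figure like the one Mr. Fisher cites can be worked out from any list of word tokens: rank the distinct words by frequency, then count how many are needed before their combined occurrences reach half of the text. The sketch below is only an illustration of that arithmetic, not Mr. Fisher's code. The function name `words_for_coverage` and the toy sentence are assumptions made for the example.

```python
from collections import Counter


def words_for_coverage(tokens, target=0.5):
    """Return how many of the most frequent distinct words are needed
    for their combined occurrences to reach `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0

    covered = 0
    # most_common() yields (word, count) pairs from most to least frequent.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)


# Toy example: "the" (3 of 8 tokens) plus "and" (2 of 8) pass the halfway mark.
sample = "the cat and the dog and the bird".split()
print(words_for_coverage(sample))  # 2
```

Run over a large corpus that has already been split into words, the same count would give a number comparable to the 135 and 131 quoted above.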
The website offers a search function that accepts words in English or Khmer and returns audio recordings voiced by Mr. Fisher’s girlfriend, Sinett Sun.
Augmenting the value of the audio, the website also features a unique “segmentation tool.” The tool takes in blocks of text in Khmer and breaks them up into the individual words, offering the audio for each word as well as a written translation in English.
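Khmer is written without spaces between words, so a tool of this kind has to decide where each word begins and ends. The article does not say how the Kheng tool does this. One simple and widely used approach, shown here purely as an assumption, is greedy longest-match lookup against a word list. The `segment` function and the four-entry `GLOSSES` table are hypothetical stand-ins, not the site's data or code.

```python
def segment(text, lexicon):
    """Split unspaced text into words by greedy longest match.

    At each position, take the longest substring found in `lexicon`;
    if nothing matches, emit a single character and move on.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon mapping Khmer words to English glosses.
GLOSSES = {
    "ខ្ញុំ": "I",
    "ស្រឡាញ់": "love",
    "ភាសា": "language",
    "ខ្មែរ": "Khmer",
}

sentence = "".join(GLOSSES)  # the four words run together, as in written Khmer
for word in segment(sentence, GLOSSES):
    print(word, GLOSSES.get(word, "?"))
```

Greedy matching can pick the wrong split when one dictionary word is a prefix of another, so a production tool would likely need something more careful.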
Building the website began with compiling a large quantity of text and then analyzing its content.
“I started by building a corpus, which is just a large collection of text,” he said. “I ‘scraped’ Web content from Khmer news sites, blogs, government sites, etc. with an open-source tool.”
“In all, the corpus contains [about] 300 million characters, so quite large, but it is not carefully balanced between topics in the way that an academic corpus would be.”
Soeung Phos, the coordinator of the Khmer language program for foreigners at the Royal University of Phnom Penh, said the website’s frequency list would be useful for students learning Khmer despite its constrained focus on more literary language.
“When I checked the website Kheng.info I see only very advanced language…[but] before nobody had written out the 1,000 words,” he said.
Mr. Fisher said he was aware of the existence of only one other Khmer-language frequency list, which was compiled by the religiously affiliated Society for Better Books in Cambodia. That list was built, he said, from a corpus of biblical texts and accordingly has its own drawbacks for those wishing to learn vernacular Khmer.
“A good half of the most frequent words were transliterations of biblical names/places like ‘Jesus,’ ‘Moses,’ ‘Eden’ etc.,” he said.