Codebreakers
With brute force and statistics, Google is decoding the language barrier
By Paul Voosen
Staff Writer, The Prague Post
May 28th, 2008 issue
Algorithm by algorithm, Google is bringing us back to the days of the Tower of Babel.
This month, the Internet giant extended support to the Czech language on its online translation system, located at Google.cz/translate. The site allows users to roughly translate whole pages of Czech into English and some 20 other languages, including Chinese and Arabic. It is the first such resource freely available for Czech Web content.
The translations provided by the service are far from perfect. They are often clunky and ill-formed, requiring the reader to make minor leaps of logic to reconstruct a sentence. For example, the sentence “I am going shopping” translates from Czech to English as “I am going to buy.”
While it is unable to replace a professional translator (breathe a sigh of relief, Radio Free Europe employees), Google’s service does represent the state of the art in automatic translation systems, according to Jan Hajič, a computer scientist and director of Charles University’s Institute of Formal and Applied Linguistics.
Hajič’s team put the Google service through its paces and found that “their results are better than anything else, including our own translation prototypes, though not by a wide margin when compared to our best systems,” he said.
Most notable about Google’s service, and about the prototypes being built by researchers like Hajič across the world, is that the underlying software has no ability to understand language in the traditional sense. Typically, computer translations have relied on layer upon layer of complex grammatical rules to break sentences down into tenses, cases and phrasal adjectives.
Google instead relies on statistical machine translation, a process that has been steadily revolutionizing the field.
Statistical translation has its roots in the code-breaking systems developed during World War II, treating languages as puzzles that can be broken by brute processing strength and lots of data.
To build its translation models, Google feeds its computers with billions of words of text from two different sources: monolingual examples of the language, drawn from sources ranging from novels to news servers; and exact bilingual translations, such as official documents from the European Union or the United Nations.
After collecting this mountain of data, Google then applies statistical learning techniques, those confidential in-house algorithms, to build a translation model. With enough refinement and data, the computer begins to understand that “house white” in one language likely refers to the White House in English, for example.
The massive reserves of texts Google has indexed are a huge advantage, Hajič said. The company frequently finishes first in evaluations of translation systems.
“It is clear that having such a large amount of data available, having it available shortly after publication and in virtually unlimited quantities has given Google’s statistical programs and algorithms a visible advantage,” Hajič said.
The fewer people who speak a language, the harder it is to build statistical models, since bilingual texts will be rarer. Because of this, and despite the seemingly greater distance between the two languages, Chinese-to-English translation is easier to generate than English-to-Czech.
For many languages, “quality translations are not readily available, typically due to copyright problems for books, manuals, etc.,” Hajič said. “Even news feeds are not routinely translated. Only excerpts and amalgamations are typically published.”
Market domination
The most significant aspect of Google’s undertaking, which it says is part of its mission “to organize the world’s information,” is that it solicits feedback from bilingual users, said Andy Way, a computer scientist at Dublin City University and the editor of the journal Machine Translation.
“Allowing users to be in control by suggesting improvements to the output translations is an important signal to translators that they haven’t been forgotten by us developers,” he said.
“From a research point of view, however, the Google way of just adding more and more data cannot be the way ahead, unless there are to be just one or two [machine translation] providers worldwide,” Way added.
In addition to Google, several other tech companies, including Microsoft and IBM, have made significant research investments into statistical translation, though without the same success or profile.
These companies have “certainly made machine translation more visible,” said Philipp Köhn, a computer scientist at the University of Edinburgh. However, the commercial side of selling machine translation is a tough business, he added.
In fact, in the United States, most funding for such research has been driven by the counterterrorism effort, Way said. In linguistically diverse Europe, the European Union has backed research, including a recent grant worth 17 million euros ($26.7 million/426 million Kč) to Irish scientists working in the field.
While researchers have now largely tackled translation that conveys the gist of a text, a huge amount of work needs to be done before this moves to a higher level or approaches the dreamed-of goal of automated speech translation. The future implications are significant, though, Hajič said.
“Machine translation, even though the market might not seem huge compared to other products and commodities like cars or oil, might contribute substantially to the economic and perhaps also the political well-being [of the world],” he said.
Progress is never guaranteed, however. Take another Babel: the Babel fish, a creature created by the science-fiction author Douglas Adams that, when it slithered into the ear, allowed aliens to communicate with one another.
“The poor Babel fish, by effectively removing all barriers to communication between different cultures and races,” Adams wrote, “has caused more and bloodier wars than anything else in the history of creation.”
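
As a rough illustration of the statistical approach described above, the Python sketch below learns word translations from nothing but paired sentences. It uses IBM Model 1, a textbook word-alignment method from the statistical translation literature. It is not Google’s confidential system, and it does not handle word reordering of the “house white” kind. The three Czech-English sentence pairs and the names `PAIRS`, `train_model1` and `best_translation` are invented for this example.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus (Czech, English), made up for illustration.
# Real systems train on billions of words of bilingual text.
PAIRS = [
    ("bílý dům", "white house"),
    ("velký dům", "big house"),
    ("bílý pes", "white dog"),
]


def train_model1(pairs, iterations=10):
    """Estimate t(english_word | czech_word) with IBM Model 1's EM loop."""
    corpus = [(cz.split(), en.split()) for cz, en in pairs]
    en_vocab = {e for _, en in corpus for e in en}
    # Start with no knowledge: every English word is equally likely.
    t = defaultdict(lambda: 1.0 / len(en_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (english, czech) links
        total = defaultdict(float)  # expected count of links per Czech word
        for cz, en in corpus:
            for e in en:
                # Share each English word among the Czech words in its
                # sentence, in proportion to the current probabilities.
                norm = sum(t[(e, c)] for c in cz)
                for c in cz:
                    frac = t[(e, c)] / norm
                    count[(e, c)] += frac
                    total[c] += frac
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(
            float, {(e, c): count[(e, c)] / total[c] for (e, c) in count}
        )
    return t


def best_translation(t, czech_word):
    """Return the English word with the highest probability for czech_word."""
    candidates = {e: p for (e, c), p in t.items() if c == czech_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    model = train_model1(PAIRS)
    for word in ["bílý", "dům", "velký", "pes"]:
        print(word, "->", best_translation(model, word))
```

No dictionary or grammar rule is supplied. The program only notices that “dům” keeps appearing alongside “house” and “bílý” alongside “white,” and on this toy corpus it pairs the four Czech words with white, house, big and dog. With fewer sentence pairs the counts are too thin to separate the candidates, which is the data problem Hajič describes for less widely spoken languages.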