Language Weaver: fast in translation
How one firm quickly translates reams of data.
By Gloria Goodale | Staff Writer for The Christian Science Monitor / October 1, 2008 edition
Los Angeles
If you want to text message your Spanish-speaking neighbor, but don’t know how to say “Please turn down the radio” in that language, you could find a quick translation online at any number of websites. But, if you are, say, a large semiconductor company with customers around the globe, you are in a pickle if all your support data is written only in English.
Enter Language Weaver, a Los Angeles-based firm on the cutting edge of a rapidly growing field known as machine translation (MT). The firm took one chipmaker’s extensive database and translated it overnight into Spanish, the No. 1 tongue in demand by that company’s customers. This task, says the company’s CEO Mark Tapling, would have taken weeks to accomplish not too long ago. Instead, its software made short work of a gargantuan task.
The $100 million MT industry has the potential to grow to more than 50 times that size, some analysts estimate. “Language Weaver is a leader in this field,” says Don DePalma, chief research officer with Common Sense Advisory Inc., who specializes in the somewhat arcane world of computerized translation services.
This may seem like a yawn-producing competition among geeks, one that transpires beyond the purview of most people’s concerns. But in fact, say industry watchers, making swift, high-volume, global communication possible is quickly moving up the to-do list of those who conduct international business deals. For instance, what happens to a nuclear power firm doing business in remote parts of India with no ability to hand over documents in the proper local dialect?
“The ability to translate lots of information quickly is becoming one of the important concerns of a global economy,” says Mark Przybocki, computer technologist and MT team coordinator with the National Institute of Standards and Technology, in Gaithersburg, Md. “Especially when you consider the huge amounts of information accumulating on the Internet…. Effective machine translation is becoming more important every day.”
Just what constitutes “effective” MT is a source of lively debate among a small but growing number of linguists, mathematicians, and computer specialists who dominate the field. Since the 1980s, the MT field has consisted of three approaches: rules-based, in which programmers entered up to 20,000 grammatical rules to direct the translation; example-based, in which discrete examples serve as guides; and statistical, in which “smart” computer algorithms “learn” from previous translations and develop their own guidelines.
The first two approaches were dominant until the turn of the century because the statistical method required so much data from which to “learn,” as well as massive amounts of processing power to search and cull its protocols, and enough memory to retain the information. But the statistical approach became more viable as computing power began to accelerate and memory capacity grew more affordable.
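To make the statistical idea concrete, here is a minimal sketch in Python of how a program can "learn" word translations from previously translated sentences. It uses IBM Model 1, a textbook technique, and is not a description of Language Weaver's software. The two-sentence English-Spanish corpus and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus of previously translated sentence pairs.
PARALLEL = [
    ("green house".split(), "casa verde".split()),
    ("the house".split(), "la casa".split()),
]


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f | e), the probability that source word e translates
    to target word f, with expectation-maximization (no NULL word)."""
    src_vocab = {e for src, _ in pairs for e in src}
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Start with no knowledge: every pairing is equally likely.
    t = {(f, e): 1.0 / len(tgt_vocab) for e in src_vocab for f in tgt_vocab}
    for _ in range(iterations):
        count = defaultdict(float)  # expected pairings of f with e
        total = defaultdict(float)  # expected pairings involving e
        for src, tgt in pairs:
            for f in tgt:
                # Share the credit for f among the source words in this
                # sentence, in proportion to the current estimates.
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # Re-estimate the probabilities from the expected counts.
        for (f, e) in t:
            t[(f, e)] = count[(f, e)] / total[e]
    return t


def best_translation(t, e):
    """Return the target word with the highest probability for e."""
    candidates = [f for (f, e2) in t if e2 == e]
    return max(candidates, key=lambda f: t[(f, e)])


if __name__ == "__main__":
    table = train_ibm_model1(PARALLEL)
    for word in ("green", "house", "the"):
        print(word, "->", best_translation(table, word))
```

No grammar rule or dictionary entry is supplied. Because "house" and "casa" appear together in both sentence pairs, the program gradually concludes that they belong together, which in turn frees "verde" to pair with "green" and "la" with "the". Real systems apply the same principle to millions of sentences, which is why the approach needed so much data, processing power, and memory.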
Language Weaver grew out of what Kevin Knight, one of the company’s cofounders, calls a “watershed workshop” in 1999. His team discovered that the translation protocols developed for one language could move seamlessly to another without having to start over from scratch with each new tongue. The group’s work enabled it to nab all-important research funds, and within two years, the commercial venture began. Today, Mr. Knight sits in front of his computer looking at a translation program for Chinese that is capable of processing some 100 million directives.
This would not be cutting-edge technology, however, without some disputes. Chief technology officer and cofounder Daniel Marcu has T-shirts to prove it. One reads, “I lost the syntax bet,” another says, “I won”; he alternates them depending on how the arguments go. This refers to a wager between his team and a former colleague who now runs the free translation service at Google. Mr. Marcu has maintained that the system will still need grammatical rules no matter how much a statistical system is able to learn from previous translations, while the other side believes that statistics alone will provide all the necessary guidance.
Friendly wagers aside, Marcu says that in the end, it won’t matter. “There is so much information on the Internet … that these systems will absorb grammatical rules without pausing to articulate them.”
The biggest challenge MT may face is human expectation. “People think machines should be able to act like the computer on the bridge of the Star Trek’s Enterprise, or C3PO. That would be nice,” says Mr. DePalma, “but while everyone would like that fabled Babel fish in the ear [the universal translator from the sci-fi classic, “The Hitchhiker’s Guide to the Galaxy”], we are still a ways off from that.”
Comments
2. Janine | 10.02.08
MT can handle large volumes very quickly; however, a human translator still needs to edit the text to keep the translation true to the original text’s intent.
3. Clint | 10.02.08
I think the most important thing to remember is that MT is a great way to translate large volumes of text (especially scientific or technological subject matter) and there are many companies that are doing some great research and making great progress.
There have been multiple solutions to try and solve the MT issue, but I agree with the quote, “There is so much information on the Internet … that these systems will absorb grammatical rules without pausing to articulate them.” There is so much information out there that computers will be able to account for those rules without having to have them specifically encoded.
Also, translators need to understand that MT is not going to take away their freelance translation jobs. Every time I talk about MT with a fellow translator, they get defensive and say that human translators are so much better and MT “is a joke.” Well, when you have a huge database of information and want it translated quickly, hiring a translator is not the right approach, and MT has been, and will continue to be, a great approach for many companies looking for a solution to huge translation projects.
4. Daniel | 10.04.08
Thanks to Kirti Vashee for some information about machine translation (MT). However, working as a translator and interpreter myself, I find MT more like mechanical translation, and its funny errors are obvious. Some errors may be a headache to fix, but I agree that MT is necessary for large projects. It would be ridiculous not to use MT, but a real, human translator must check the result. We may not be able to see the futuristic sci-fi MT yet.
MT has two particular advantages over humans. The first is that we can load dictionaries into memory, which is better and faster than memorization by the human brain. The second is the number of languages that can be used. A human being may know many languages, but it takes many years of practice in each language to reach the near-native level needed for translation work. A machine can switch to another language with just a click or so.
I do not use MT frequently. I use it only when there are some new words or, in some cases, to get a hint that may be helpful. So I know very little about MT. I am still curious about the style of the text to be translated. I don’t know if MT can recognize style, because that is something the human brain still seems to do more quickly than MT. Simple examples of style are humor, scientific text, and prose. I believe that MT still has a long way to go.
5. Kirti Vashee | 10.06.08
I agree with Daniel: MT on its own is not going to get us all the way to where most of us want to go.
The most recent NIST MT competition results show that MSR, Google, IBM and BBN produced the best (as in highest quality measured by BLEU) generic Arabic and Chinese systems. But they were all pretty close, and none produced really huge improvements over last year’s results. Technology initiatives like syntax and hybrid approaches make small differences, but something else is needed to really accelerate the rate of improvement. The quality that these generic systems produce today is not likely to get many major enterprises to step up and pay big money for the right of use, even though it is good enough to attract millions of Internet users, who will use it as long as it is free.
We have also learnt that focusing on a domain (especially a technical domain) makes better systems, and raises the accuracy of raw MT to a level that is much higher than what we see in these NIST competition focused systems. Microsoft has shown that its raw MT translations of knowledge base content are much higher in quality than those of the generic systems used at NIST and Google. This KB content is heavily used by millions of users in their global customer base. Microsoft has disclosed that the satisfaction levels of customers who use raw domain focused MT output can sometimes actually exceed the satisfaction levels of people using the same material in the English source. To my mind this is the most successful use of MT in the world today.
At Asia Online we are seeing that technical domain focused SMT systems we have built with clean data can produce some pretty compelling raw MT output. We expect that this will be a growth area for MT technology providers in the short term as technology focused enterprises make more and more content available in multilingual formats using MT as an accelerator.
However, it is also clear that none of the MT technology out there today can really replace human beings. Language is too complex, and too filled with variations and exceptions to be completely reduced to algorithmic resolution. I think it is becoming clear that it is important to engage human beings to come and help raise the quality of the raw MT to a level that it becomes more usable, useful and compelling. With SMT, this continuous human feedback can help to drive the quality and capability of these systems to a level that we can start to approach human draft quality.
MT coupled with large scale human feedback can enable systems to improve at a rate that we have not seen yet. Since MT can produce large amounts of content filled with linguistic errors, it is possible to clean this up if a crowd of capable/competent humans can be motivated to help. The popular term for this phenomenon is “crowdsourcing”. We have seen this at work on a small scale at Facebook already, and at Asia Online we are embarking on a 3 million page translation of the English Wikipedia into Thai initially, then into several other Asian languages. This approach will be used to translate tens of millions of pages and gradually raise this content to human quality levels with assistance from the broad student community that would find the content most useful.
MT together with web based massive online collaboration is emerging as a model that can take on huge translation tasks, and we now see several initiatives around the world beginning to explore this model. What is special about these efforts is that what we are seeing is actually a social phenomenon coming to a focus around a collaborative technological platform involving machine translation.
Alain Desilets of the NRC of Canada recently said, “Two technologies which will drastically change the way we translate content: massive online collaboration a la Wikipedia, and Machine Translation. Shared language data repositories are central to both the collaborative and MT innovations. A year ago, I would have said that MT was still too imperfect to impact the translation industry in any significant way. But recently, progress has been incredibly rapid, even more rapid than its most optimistic proponents ever dreamt of.”
http://www.wiki-translation.com/tiki-index.php?page=Processes+and+tools+for+massively+collaborative+translation
Brian McConnell of the Worldwide Lexicon Blog makes a prediction in an interesting article on this site:
“The language barrier, as we know it, will be gone by 2010. Computer scientists have been chasing a Holy Grail of machine intelligence for decades, but the breakthrough that will eliminate the language barrier is social, not technical. Language, like music or art, demands people to comprehend it.”
He goes on to say,
“The language barrier will be broken down in a series of simple steps. The first phase of this transition will be driven by publishers with large or highly motivated audiences. These early adopters will recognize the value of making their content visible in many languages, and their readers will be happy to contribute. Each website will develop its own translation community from its audience. At this stage of the transition, the system will be driven by a few publishers, and probably a few thousand dedicated translators.
As these projects grow, and as multilingual publishing tools become more sophisticated, aggregators will emerge. These sites will create large translation communities that decide what to translate based on their interests, whether or not a particular publisher is aware of this activity. Roaming mobs of amateur translators will translate whatever they think is interesting. Commercial services that complement volunteer based systems will also appear.”
The full article can be seen at http://blog.dermundo.com/original/2356.html
This, too, is perhaps a little too optimistic, but it is a new trend where we see momentum building, one that seems much more likely to change the world of translation and more likely to be the way that MT finally breaks on through to the other side.
6. Leila Dagher | 10.06.08
Dear Sirs,
I am looking for a job in translation from English to Arabic and vice versa. I have been doing this work for more than twenty years and I wish to find a career with you. I have a B.A. degree from the American University of Beirut and worked as Administrative Manager in a Lebanese firm, where a big part of my work was in translation. Since 1999 I have been dedicating my time to this kind of work (I also give special lessons to the 5th and 6th grades in all subjects). Another poet and writer and I are now working on a book of Buddha philosophy; we will be through in two to three weeks, and I will be free then.
I hope to have the honor of working also with you. Thanks.
1. Kirti Vashee | 10.02.08
This is interesting, and I am sure your readers would also be interested to know that there are several other companies also involved with Statistical Machine Translation and doing exciting things with this technology.
Possibly the most successful use of this technology in the world today is the Microsoft Knowledge Base, which uses a largely SMT-based engine and successfully services hundreds of millions of user requests for technical information. This content is only available because of MT. Interestingly, many of the users of these raw computer translations are saying they are more satisfied with this MT knowledge content than users who use the same content in its source language: English. So yes, the state of the art has come a long way from the eMpTy promises of old.
Asia Online is another company that is undertaking perhaps the biggest translation project in the world today using SMT. They are translating millions of pages of knowledge content with “human assisted MT” into several South Asian languages to reduce the “information poverty” in countries like Thailand, Indonesia and Malaysia. You can already see the initial results of Wikipedia and other content at http://www.asionline.net today. Millions of pages of MT content will be released over the coming months. The MT technology platform they have built allows the automated translation corpus to improve on a daily basis with constant feedback from expert linguists and the user community (mostly students) at large in Thailand.
This same MT technology platform is also available to Fortune 1000 companies to enable them to share more knowledge and business information content with their global customers in local languages. So companies like Sun and HP could make huge amounts of knowledge base and user generated content available to their customers. This platform would allow the users to improve the quality of the raw machine translation each time they looked at the content and continue to improve the quality in real time. This is described in more detail in the links below:
http://news.tourthailand.org/business-news/online-proofreaders-sought.html
http://www.bangkokpost.com/100908_Database/10Sep2008_data62.php
Another company also using SMT to make masses of new content available is Alfabetic (a recent TechCrunch 50 company). They translate high-value blog content, again using SMT together with paid human translators to ensure the quality is close to perfect. They are currently translating TechCrunch content into several languages around an advertising-based business model.
Given that the SMT technology is now available in open source through the MOSES project, we can expect to see many more start-ups that combine social networking concepts with MT to enable all kinds of information to become multilingual.
dotSub is another interesting startup that allows the community to easily add subtitles to videos, using MT and other translation automation tools.
Given the rate at which information is growing on the web, this technology will certainly attract new talent that will put together compelling solutions for the many new market opportunities that will emerge.
Many observers estimate that the bulk of the second billion internet users will come from Asia which currently only has 12% internet penetration. The importance of Asian languages will grow, and companies that can learn to communicate with the rapidly growing middle class in China, India and SE Asia will probably stand to gain and prosper.
Anyway, it seems the best is yet to come. After more than 50 years of underwhelming success, MT may finally be ready to deliver.