MORE ON MACHINE TRANSLATION.

Even though I’m deeply skeptical of the idea that automatic translation will ever be more than barely adequate (which is often good enough, as I insisted here), I continue to be interested in discussions of the topic, and Konstantin Kakaes has one at Slate called “Why Computers Still Can’t Translate Languages Automatically.” I like the fact that he emphasizes the difficulties without pooh-poohing the whole idea; in his conclusion, he writes:

Automatic semantic tagging is obviously hard. You have to deal with things like imprecise quantifier scope. Take the sentence “Every man admires some woman.” Now, this has two meanings. The first is that there exists a single woman who is admired by every man. […] The second is that all men admire at least one woman. But how do you say this in Arabic? Ideally, you aim for a phrase that has the same levels of ambiguity. The point of the semantic approach is that rather than attempt to go straight from English to Arabic (or whatever your target language might be), you attempt to encode the ambiguity itself first. Then, the broader context might help your algorithm choose how to render the phrase in the target language.
A team at the University of Colorado, funded by DARPA, has built an open-source semantic tagger called ClearTK. They mention difficulties like dealing with the sentence: “The coach for Manchester United states that his team will win.” In that example, “United States” doesn’t mean what it usually does. Getting a program to recognize this and similar quirks of language is tricky.
The difficulty of knowing if a translation is good is not just a technical one: It’s fundamental. The only durable way to judge the faith of a translation is to decide if meaning was conveyed. If you have an algorithm that can make that judgment, you’ve solved a very hard problem indeed.
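
To make the two readings of "Every man admires some woman" concrete, here is a minimal sketch that checks each one against a tiny made-up world. The names, the `admires` relation, and the function names are all hypothetical illustrations, not anything taken from the Slate article or from ClearTK.

```python
# Toy world (hypothetical): two men, two women, and who admires whom.
men = {"Al", "Bo"}
women = {"Cy", "Di"}
admires = {("Al", "Cy"), ("Bo", "Di")}  # each man admires a different woman


def every_man_has_some_woman(men, women, admires):
    """Reading 2: for every man there is at least one woman he admires."""
    return all(any((m, w) in admires for w in women) for m in men)


def one_woman_for_every_man(men, women, admires):
    """Reading 1: there is a single woman whom every man admires."""
    return any(all((m, w) in admires for m in men) for w in women)


print(every_man_has_some_woman(men, women, admires))  # True
print(one_woman_for_every_man(men, women, admires))   # False
```

The same English sentence is true of this world under one reading and false under the other, which is why a translator (human or otherwise) has to either pick a reading or find a target phrase that is ambiguous in the same way.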

Comments

Bathrobe says

May 13, 2012 at 8:29 pm

Why do articles on language and translation discuss low level difficulties like ‘I love you’, posit the use of ‘semantic tagging’ to solve them, and then throw in examples of complex phenomena like quantifiers and specificness (not sure of the technical term here)? The use of ‘some’, ‘any’, etc. is one of the classic areas of difficulty in English, and I’m not sure that ‘semantic tagging’ is going to solve the problem.
The classic example of this sort of thing is (I’ve raised this one previously) “John wants to marry a Norwegian”, which could be referring to a specific Norwegian (say, Eva) that John is planning to marry, or could be just a desire on John’s part to find a Norwegian wife. I can’t see how ‘semantic tagging’ is going to solve this. I’ve also given the example in the past of ‘All that glitters is not gold’, which would be hard for a computer to parse since it doesn’t follow modern English usage anyway.
There are a lot more low-level difficulties to be solved before wandering into that minefield. For example, Google Translate (the only one I’m really familiar with) often still can’t tell the difference between ‘and’ connecting two nouns and ‘and’ connecting two sentences.
bulbul says

May 13, 2012 at 8:30 pm

ClearTK actually does a bunch of NLP stuff, including POS tagging, stemming and named entity recognition. It’s the latter that would help with the Manchester United example: it would tag Manchester United and not United States as the, well, entity referred to.
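As a rough illustration of why entity tagging helps here, the sketch below uses a greedy longest-match lookup against a tiny hand-made name list. This is not how ClearTK works (real named entity recognizers are typically statistical); the `GAZETTEER` table and `tag_entities` function are assumptions invented for the example.

```python
# Hypothetical mini-gazetteer; keys are lowercased token sequences.
GAZETTEER = {
    ("manchester", "united"): "ORG",
    ("united", "states"): "LOC",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def tag_entities(tokens):
    """Scan left to right, taking the longest gazetteer match at each position."""
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                spans.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n  # consume the matched tokens so they cannot be reused
                break
        else:
            i += 1  # no entity starts here; move on
    return spans


sentence = "The coach for Manchester United states that his team will win"
print(tag_entities(sentence.split()))
# [('Manchester United', 'ORG')]
```

Once "United" has been claimed by "Manchester United", it is no longer available to pair with "states", so the spurious "United States" never gets tagged.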
Ø says

May 13, 2012 at 8:56 pm

Not all that glitters is gold.
All that glitters is not gold.
All is not gold that glitters.
Bathrobe says

May 13, 2012 at 10:59 pm

Named entity recognition is stuff that Google Translate already does fairly well. This is the easy stuff (and linguistically kind of boring) — give it a big enough corpus and it can recognise entities. Great.
I think my beef with the article is that it seems, like most articles of this nature, to have been written by someone who doesn’t know much about translation, getting trivial problems mixed up with big ones, and mixing together problems that have quite different causes. It’s the usual thing where some journalist oohs and aahs over the fact that a translation program might get “Manchester United states” wrong on the same level as its failure to disambiguate “Every man admires some woman”. Reading this interview is like peering through the fog to try and figure out what Dorr might have actually said. It sounds like he said some interesting things, but this journalist puts it through the usual “throw it all in together, add a few plausible sounding generalities and linking sentences, and mix well” process that results in the abysmal level of linguistic reporting that LH has complained of in the past.
As for the final conclusion: “The difficulty of knowing if a translation is good is not just a technical one: It’s fundamental. The only durable way to judge the faith of a translation is to decide if meaning was conveyed. If you have an algorithm that can make that judgment, you’ve solved a very hard problem indeed.” Well, if you could solve that problem, it’s probably safe to say you’ve solved the problem of translation itself.
John Cowan says

May 14, 2012 at 12:23 am

And then there’s what Bilbo said: All that is gold does not glitter.
bulbul says

May 14, 2012 at 4:18 am

Bathrobe,
Named entity recognition is stuff that Google Translate already does fairly well.
I know, I just pointed it out because, well, the author didn’t seem to know there was such a thing.
I fully agree with your assessment: the author doesn’t know much about NLP or translation, either. Kudos to him for getting SMT and some other details right, but the whole semantic thing is muddled beyond all recognition.
Polish interpreter says

May 14, 2012 at 3:21 pm

Computers will never be able to fully comprehend language. Language is about emotions and feelings, not just raw facts, and it’s alive; it evolves with people.
Trond 延元 says

May 14, 2012 at 5:37 pm

Heh, this almost fits: In another turn of the usual meandering over at Crown’s, I tried to figure out the meaning of the Japanese era name 延元 (Engen). Combining Google Translate from Japanese (“Total Yuan”) and Chinese (“Extension element”) with the Wikipedia article on Japanese eras, I’ve landed on something like “The name of it all” or maybe “The complete era”. Could anyone set me straight?
Ray Girvan says

May 14, 2012 at 7:48 pm

Sorry for coming out with a cliché, but the algorithm will be great when it can tell the difference between “Time flies like an arrow” and “Fruit flies like a banana”.
(For the record, I will not endorse the myth that it was coined by Groucho Marx. It’s a great example of the problems of parsing, but dates from 1960s computer science articles).
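For anyone curious what the parsing problem looks like mechanically, here is a small sketch that counts parse trees with a CKY-style chart. The grammar and lexicon are a toy invented for this example (they ignore the imperative reading, among much else), so treat the numbers as a property of the toy, not of English.

```python
from collections import defaultdict

# Hypothetical toy lexicon: each word lists the categories it may take.
LEXICON = {
    "time": {"N", "NP"}, "fruit": {"N", "NP"},
    "flies": {"N", "V"}, "like": {"V", "P"},
    "an": {"Det"}, "a": {"Det"},
    "arrow": {"N"}, "banana": {"N"},
}
# Hypothetical binary rules: (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "N", "N"),   # noun compound, e.g. "fruit flies"
    ("VP", "V", "NP"),
    ("VP", "V", "PP"),
    ("PP", "P", "NP"),
]


def count_parses(sentence):
    """Return how many distinct S trees the toy grammar assigns."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(start, end)][category] = number of trees over words[start:end]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for tag in LEXICON[word]:
            chart[i, i + 1][tag] += 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in RULES:
                    chart[start, end][parent] += (
                        chart[start, mid][left] * chart[mid, end][right]
                    )
    return chart[0, n]["S"]


print(count_parses("Time flies like an arrow"))   # 2
print(count_parses("Fruit flies like a banana"))  # 2
```

Both sentences get the same two structures from the grammar: "flies" as the verb with "like" as a preposition, or "X flies" as a compound noun with "like" as the verb. Nothing in the syntax says which one is sensible; that takes knowing something about time, fruit, and insects.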
Matt says

May 14, 2012 at 8:41 pm

Sure! You can’t actually figure out the “meaning” of most Japanese era names (although it is meaningful that they usually have vaguely positive connotations, and don’t mean things like “smelly contraction”), because they are usually references to Chinese literature.
My big and, alas, seldom-consulted encyclopedia of Japanese era and Emperor names is in storage, but according to the Nihon Kokugo Daijiten (roughly equivalent in market positioning if not quality to the OED) and Wikipedia, 延元 is a reference to this quotation from the Book of Liang:
聖徳所被、上自蒼蒼、下延元元
The proper translation of which I will leave to Bathrobe or some other actual knower of Chinese, but which I roughly understand as being about virtue reaching from the (blue-blue) sky to the roots of (the earth) and/or the people of the land (another meaning for 元元, at least in Japanese-Chinese). So 延元 would probably have been understood to mean “[virtue that] reaches the very root/all the people”.
(Note that this was an era name from the South/North Courts period, so the urge to make claims of universality in an era name was no doubt even stronger than usual.)
languagehat says

May 15, 2012 at 9:57 am

I asked about the meaning of an era name here; fortunately at that time Matt had access to his copy of Yoneda Yusuke’s dictionary of emperors and era names (歴代天皇・年号時点), “which I KNEW would come in handy one day,” and was able to give a comprehensive answer. You’ve got to keep these books ready at hand; you never know when you’re going to need them!
Bathrobe says

May 15, 2012 at 10:03 am

Bathrobe or some other actual knower of Chinese
Actually, I’m totally hopeless at dead languages. I’ve always liked my languages to be alive and spoken, otherwise it becomes very difficult to pick them up.
Matt says

May 15, 2012 at 8:19 pm

Ha, my old comment is almost identical structurally to my new one, right down to shirking responsibility for actual translation. I guess I really am an automaton.
Also, that should have been 事典 /jiten/ “encyclopedia”, not 時点 /jiten/ “point in time”. Embarrassing kana-conversion fail.
Bathrobe says

May 15, 2012 at 10:04 pm

Embarrassing kana-conversion fail
But all too common. The same thing happens in Chinese, where you have to be constantly checking that you’ve got the right characters, and every so often, of course, an error slips through. This happens a lot with native speakers, too.
Trond 延元 says

May 16, 2012 at 7:04 am

Thanks. How would my “name” come out in different forms of Chinese?
Trond 延元 says

May 16, 2012 at 7:08 am

… and in kun reading?
Bathrobe says

May 16, 2012 at 9:31 am

Your newly adopted name is 延元?
Trond 延元 says

May 16, 2012 at 10:04 am

Adopted for now. The (probably boring) history is that AJP some time ago gave me a link to the WP entry for the Southern German town of Engen. Trying to retrieve it I found Engen (延元). Am I completely wrong (again) in assuming that 延元 reads en-gen in the on reading?
Bathrobe says

May 16, 2012 at 10:44 am

Try this:
http://cn.voicedic.com/
For kun-yomi, well, Nobe-moto is the first reading that comes to mind, but honestly, things like 延元 weren’t meant to be read in kun-yomi.
Jan Johan "totally basic (noble motto)" Engen says

May 16, 2012 at 11:09 am

Thanks!