Translation Apps Are Getting Better.

This BBC News story by Emma Woollacott starts with some glitches that are old hat and have been covered here and/or at the Log, but goes on to some interesting material:

“Rather than writing handcrafted rules to translate between languages, modern translation systems approach translation as a problem of learning the transformation of text between languages from existing human translations and leveraging recent advances in applied statistics and machine learning,” explains Xuedong Huang, technical fellow, speech and language, at Microsoft Research. […]

But a new project from Mr Lample and a team of other researchers at Facebook and the Sorbonne University in Paris may represent a way round this problem [of “low-resource languages for which the amount of parallel sentences is very small”]. They are using source texts of just a few hundred thousand sentences in each language, but no directly translated sentences at all.

Essentially, the team’s system looks at the patterns in which words are used. For example, the English words “cat” and “furry” tend to appear in a similar relationship as “gato” and “peludo” in Spanish. The system learns these so-called word embeddings, allowing it to infer a “fairly accurate” bilingual dictionary. It then applies the same back-and-forth techniques as we’ve seen with Microsoft Translator to come up with its final translation – and not a biblical reference in sight.
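The word-embedding alignment described above can be made concrete. The sketch below (Python, with invented two-dimensional toy vectors and an assumed two-pair seed dictionary; the actual Facebook system bootstraps the alignment with no seed pairs at all) aligns two separately built embedding spaces with an orthogonal Procrustes map and then reads off a bilingual dictionary by nearest-neighbour search:

```python
import numpy as np

# Toy monolingual "embeddings". In the real systems these have hundreds of
# dimensions and are learned by word2vec/fastText from each language's text
# alone; the values below are invented for illustration.
en = {"cat": [0.9, 0.1], "furry": [0.8, 0.3], "bus": [0.1, 0.9]}
es = {"gato": [0.1, 0.9], "peludo": [0.3, 0.8], "autobus": [0.9, 0.1]}

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

en_words, es_words = list(en), list(es)
en_vecs = np.array([unit(v) for v in en.values()])
es_vecs = np.array([unit(v) for v in es.values()])

# Orthogonal Procrustes: find the rotation/reflection W minimizing
# ||XW - Y|| over a seed dictionary of anchor pairs. (The Facebook work
# bootstraps these anchors with no supervision; we simply assume two.)
seed = [("cat", "gato"), ("bus", "autobus")]
X = np.array([en_vecs[en_words.index(a)] for a, _ in seed])
Y = np.array([es_vecs[es_words.index(b)] for _, b in seed])
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

def translate(word):
    """Nearest Spanish neighbour (by cosine) of the mapped English vector."""
    mapped = en_vecs[en_words.index(word)] @ W
    return es_words[int(np.argmax(es_vecs @ mapped))]

print(translate("furry"))   # prints: peludo -- induced, not in the seed
```

Here "furry" → "peludo" falls out even though that pair was never in the seed; the published systems refine the map iteratively, feeding the induced dictionary back in as a larger seed.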

Thanks, Trevor!

Comments

  1. It’s always difficult to try to understand technical concepts as they are presented in the popular press.

    On the face of it, this translation approach appears doomed, because it assumes there is a one-to-one correspondence between words in two different languages. In a simplistic way this is somewhat true. You can say “What’s the French for umbrella” and get the answer “parapluie”. But ask “What’s the French for food?”

    Chinese food = cuisine chinoise
    dog food = nourriture pour chien
    health food = alimentation biologique, alimentation bio

    Or searching on this same topic I turned up “Alimentation du chien: nourriture maison ou nourriture industrielle?” GT gives “Dog food: homemade food or industrial food?”, which is a reasonable translation. But it’s going to be challenging to get there by matching “maison” to “house” or “home”. And that’s a really simple example.

    Taking the example in the article, as a long-time cat owner I almost never use the word “furry” in relation to cats. But there are a lot of idiomatic expressions like “it’s raining cats and dogs”, “A cat can look at a king”, “He’s a real cool cat”. I assume there are in Spanish too, but not the same ones.

    And French and English and Spanish are related languages. A lot of the grammar and sentence structure is pretty similar.

    In the article they talk about how to translate languages for which there isn’t a big parallel database, like Sinhala to Pashto, by using this structural approach. But it would seem that two languages for which there isn’t a big parallel database would tend to be very different structurally also. For example, in Vietnamese everything you say to another person has to indicate the relationship, like they are your big sister, uncle, grandma or whatever. If you were translating a novel from English to Vietnamese, the translator would have to supply all of this, because it doesn’t exist in English. A good translator needs to understand the cultural context of both cultures.

    Maybe they are not talking about good translations. Maybe they are talking about “Excuse me, where can I get a bus to the airport?” That could be useful. But also your theoretical Sinhala- and Pashto-speakers might communicate using their shared rudimentary English.

    I suppose something useful could come of this, but I’m not getting anything much from the article.

  2. Maybe they are not talking about good translations. Maybe they are talking about “Excuse me, where can I get a bus to the airport?”

    That was my assumption. But your points are, of course, well taken.

  3. marie-lucie says:

    maidhc: But ask “What’s the French for food?”

    Yes, the ubiquitous English word “food” does not have a single French equivalent!

    Chinese food = cuisine chinoise

    “I like Chinese food” would be J’aime la cuisine chinoise in the context of what is available in a restaurant, or of Chinese cooking rather than its range of foods.

    dog food = nourriture pour chien

    OK in general, but I would prefer aliment[s] pour chiens, also as a generic term. “A food” would be un aliment. But neither la nourriture nor un aliment is quite an everyday word, especially aliment, which you would mostly see written on the package. I am not sure how I would ask “Where is the dog/cat food?” (at home), since my family did not have pets and neither do my sisters or their children.

    health food = alimentation biologique, alimentation bio

    Again you might see these words on packages or indicating the aisle in a store. L’alimentation refers to the total types, amounts, and other general aspects of food consumption, as in discussing diets for persons with specific conditions, for instance Les diabétiques doivent faire très attention à leur alimentation ‘Diabetics should pay particular attention to their food choices’.

    So how would you recommend the food in a particular restaurant? – On y mange très bien!

  4. David Marjanović says:

    health food = alimentation biologique, alimentation bio

    Not quite. Bio (all over Europe) means “organic”.

  5. marie-lucie says:

    David M: You are right, but I was concentrating on words for “food”.

  6. David Marjanović says:

    I know, I was replying to maidhc’s original comment.

  7. In presenting machine translation for the layman, the article manages to mash together several decades of differing approaches into one confusing sentence.

    It mentions the method of ‘writing handcrafted rules to translate between languages’. That is a very old approach to automatic translation (maybe 40-50 years old, maybe older) based on the use of linguistic rules to generate sentences. It didn’t work — a rather damning commentary on linguistics as a whole. It leads one to suspect that not even people use pure ‘linguistic rules’ (especially Chomskyan rules, if I may cast a gratuitous aspersion) to generate sentences in their heads.

    Then there is the second approach, ‘learning the transformation of text between languages from existing human translations and leveraging recent advances in applied statistics’. This apparently yields acceptable results, especially in clearly defined fields. But it’s not particularly new. I met a translator in Japan in the early 1990s who was using a system of this type to translate legal documents. The computer-generated translation was so good that he simply went through and tweaked the text for obvious mistakes.

    The third approach is ‘machine learning’, or ‘deep neural learning’, about which there has been much brouhaha in the last couple of years.

    As for the problems that maidhc mentions, I recently fed Google Translate a passage from Vietnamese to English which used the word em ‘little sister’, as a pronoun. Em can mean ‘I’ if it is self-referential or ‘you’ if it is second person. In the passage in question, em happened to mean ‘you’. GT translated it as ‘I’ at five places and ‘you’ at 18. That’s not bad, I guess, but in some places it switched from ‘you’ to ‘I’ in adjacent sentences.

    I don’t know if Vietnamese at Google Translate still uses its old method (which I think was based on matching up commonly occurring equivalences found in thousands of texts) or the new ‘deep neural learning’ approach. But the inability to get em right in running text is a bit troubling.

  8. I appear myself to have mashed ‘handcrafted rules to translate between languages’ and ‘the use of linguistic rules to generate sentences’ together.

    At one time there was the idea that sentences could be generated by rules, and that machine translation would involve parsing (breaking down) sentences in one language and reconstructing them in the other based on that language’s grammar. I’m not sure whether this would have been equivalent to ‘handcrafted rules to translate between languages’, which might have been a simpler process altogether.

  9. I checked Google Translate and Microsoft Translate, from English to various languages. Both can translate “hairy cat” fine, but are helpless at “hairier cat”.

  10. This is getting into hairy territory, but Google Translate translated “Your cat is hairier than mine” accurately into a selection of languages (but Vietnamese failed).

    It didn’t do quite as well on “Yours is a hairier cat than mine”. This came out in Japanese as “あなたのものは私よりも毛深い猫です。” (Your thing is more a hairy cat than me.) Vietnamese still used the English word ‘hairier’.

  11. Google Translate translations from Spanish mix personal pronouns all the time.

    Because Spanish usually omits them – the reader is just supposed to remember whether the character is male or female.

    That’s the biggest problem in translating Spanish fiction – otherwise it’s nearly perfect.

    Unfortunately it’s rather a big deal – we can’t refer to a female protagonist as male, so an editor must carefully fix all the pronouns manually.

  12. Yes, GT translates each sentence in isolation. It has no way to remember a pronoun reference from one sentence to another. Similarly, when you copy-and-paste text into its window, you need to be careful to remove any carriage returns inside sentences, because it assumes a carriage return ends a sentence.
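That carriage-return pitfall is easy to work around before pasting. A rough sketch (Python; the punctuation heuristic is my own and will be fooled by abbreviations such as “Mr.”) that joins hard-wrapped lines so only sentence-final breaks survive:

```python
import re

def unwrap(text):
    """Join hard-wrapped lines so that a line break survives only after
    sentence-final punctuation; mid-sentence breaks become spaces. A rough
    heuristic: abbreviations such as "Mr." will fool it."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if out and not re.search(r'[.!?]["\')\]]?$', out[-1]):
            out[-1] = out[-1] + " " + line
        else:
            out.append(line)
    return "\n".join(out)

wrapped = "GT assumes a carriage return\nends a sentence. It does not\nalways."
print(unwrap(wrapped))
# prints: GT assumes a carriage return ends a sentence. It does not always.
```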

    Tangentially, I just noticed that Microsoft Translate offers Yucatec Maya, which I’m pretty sure is the first language of the Americas to be machine-translated. (Google has Hawaiian, but that’s only politically American.)

  13. Stu Clayton says:

    Yes, GT translates each sentence in isolation.

    That doesn’t surprise me, although I have never spent an instant of mental energy on the matter. It’s clearly based on the dogma that a sentence “expresses a complete idea”, as Mrs X taught us in school. Nothing wrong with that, so far as it goes – it simply doesn’t go anywhere near far enough.

    Speech/text is not a sequence of mutually independent “complete ideas”. It is a sequence of initially complete ideas that retreat into incompletion as more words tumble out. Sentences refer backwards to each other’s innards, and forwards to new sentences to be expected. Until the sequence is finished – when the speaker shuts his trap, or more generally is hit by a train – the sentences uttered so far are only parts of what is not yet a completed idea – the overall sense of what has been said (if anything).

  14. Stu Clayton says:

    Here’s an example of a sentence referring forward to new sentences to be expected:

    [gibberish] Let me explain what I mean. [more gibberish]

  15. the dogma that a sentence “expresses a complete idea”, as Mrs X taught us in school. Nothing wrong with that, so far as it goes

    No, it’s bullshit. No one has yet been able to define what a “complete idea” is. For instance:

    That’s not a sentence because it’s not a complete idea.

    That’s not a sentence. That’s because it’s not a complete idea.

    Why is the first a sentence and the second two sentences? Because the writer decided to split the second into two “complete ideas”? Come off it! The only difference is a syntactic one. “Ideas” are irrelevant.

    How about:

    The first example is one sentence because it expresses a complete idea, and the second is two sentences because it expresses two complete ideas.

    The first example is one sentence because it expresses a complete idea. The second is two sentences because it expresses two complete ideas.

    Even Mrs X would have trouble justifying this tiny difference.

    But your larger point is right. It’s impossible to translate sentences in isolation because of anaphora and other issues.

    Incidentally, I find it laughable that Google Translate isn’t supposed to be used to translate individual words (because it translates ‘in context’) but it doesn’t translate sentences in context.

  16. Stu Clayton says:

    No, it’s bullshit.

    I agree. I was trying to be conciliatory, but clearly I have not had enough practice.

    But your larger point is right. It’s impossible to translate sentences in isolation because of anaphora and other issues.

    Ekshually I was thinking only of “pronoun trouble”.

  17. Tangentially, I just noticed that Microsoft Translate offers Yucatec Maya, which I’m pretty sure is the first language of the Americas to be machine-translated.

    Interesting.

    A couple of years ago I met someone who was trying to learn Nahuatl. He was learning from his uncle, who was an L1 Nahuatl speaker and an L2 Spanish speaker, while this person was an L1 English speaker and an L2 Spanish speaker.

    Unfortunately I lost touch with him; about a year later I heard that UCLA was offering Nahuatl courses. He didn’t live in LA, though, so they might not have been much help to him.

    Nahuatl has over a million native speakers. Shouldn’t languages like this be showing up in the translation engines?

    My uninformed understanding of Native American languages is that they are A WHOLE LOT different structurally from other language families. So it should be a challenge to our algorithm designers.

    In any case, we should be seeing a lot more indigenous languages, not just from North America but from all over the world. There are numerous examples of indigenous languages being under threat, and having some sort of machine translation available could be very advantageous.

    I raised this point with a Google representative at a public presentation a couple of years ago, but I got a kind of namby-pamby bromide response.

  18. David Marjanović says:

    Nahuatl has over a million native speakers.

    Don’t ask me if the many “dialects” are mutually comprehensible, though. They look quite different to me.

  19. I actually tried using Microsoft Translator for translating texts in Yucatec Maya.

    Utterly unusable, I am afraid.

    Can’t use it for translating into English even the simplest sentences.

    I don’t even wish to contemplate what gibberish it makes of English sentences in Yucatec “translation”.

    Even Hmong in GT translation made some sense; this one doesn’t.

    Not even a tiny bit.

  20. Stu Clayton says:

    Bathrobe: Incidentally, I find it laughable that Google Translate isn’t supposed to be used to translate individual words (because it translates ‘in context’) but it doesn’t translate sentences in context.

    It’s rather difficult to analyze what “context” means in connection with what people consult in their heads when understanding. I think that in certain languages “pronoun trouble” across sentence boundaries is one thing that hinders good machine translation.

    In those languages, for instance German, pronouns refer sometimes to things mentioned in a previous sentence, and sometimes to the words in a previous sentence used to mention those things. This is the case when the pronouns have secondary sexual characteristics as do the German “er, sie, es, dessen, deren …” When there is no female in sight, “sie” will refer to the closest previous word of “feminine gender” in the same, or a previous sentence.

    These secondary characteristics function as semantic helpmeets, allowing you to rearrange sentence portions for clarity (or aesthetic reasons) while leaving the back-references (pronouns) intact. In other words they provide a certain relief from syntax. If you don’t know how this works – as machine translators apparently often don’t – then you can’t figure out what the hell is going on.
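Stu’s rule of thumb – sie picks up the closest preceding word of feminine gender – is simple enough to write down. A toy sketch (Python; the tiny gender lexicon and the function names are invented for illustration, and a real resolver would need a full morphological dictionary of German):

```python
# Toy gender lexicon; a real resolver would need a full German morphological
# dictionary. The entries themselves are correct, but there are only four.
GENDER = {"Brücke": "f", "Tisch": "m", "Buch": "n", "Lampe": "f"}
PRONOUN_GENDER = {"sie": "f", "er": "m", "es": "n"}

def antecedent(words, i):
    """Resolve the pronoun at index i to the closest preceding word of
    matching grammatical gender (Stu's heuristic), or None."""
    want = PRONOUN_GENDER.get(words[i].lower())
    for j in range(i - 1, -1, -1):
        if GENDER.get(words[j]) == want:
            return words[j]
    return None

# "Die Lampe steht auf dem Tisch. Sie ist neu." ('The lamp stands on the
# table. It [fem.] is new.') -- sie skips the masculine Tisch and lands on
# the feminine Lampe, across the sentence boundary.
words = "Die Lampe steht auf dem Tisch . Sie ist neu .".split()
print(antecedent(words, words.index("Sie")))   # prints: Lampe
```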

  21. I studied Nahuatl informally under Fritz Schwaller in Bloomington ca. 1972, and I didn’t find it particularly exotic. But then, in my twenties I could digest a polysynthetic SOV language as easily as I could anchovies.

  22. @Stu: It’s clearly based on the dogma that a sentence “expresses a complete idea”

    I think you vastly overestimate the amount of theory behind neural machine translation. GT certainly doesn’t attempt to identify a “complete idea” (ever heard the phrase “Every time I fire a linguist…”?), and it *does* translate sentence fragments. They chop the input into sentences because they have to limit the problem somehow, and a sentence is the easiest way to get a subset of words that are probably related more to words within the same subset than outside it. Obviously, that’s a leaky boundary.

    Pronoun trouble is just a symptom of the fact that the machine is trained entirely by relating texts to other texts, not to the outside world. That’s still what the Facebook research in the BBC story is doing, but they’re trying (if I understand correctly) to make use of a wider pool of texts, not just pairs of translations.

  23. Stu Clayton says:

    @ktschwarz: I think you vastly overestimate the amount of theory behind neural machine translation.

    It was not my intention to claim that the dogma “a sentence expresses a complete idea” is an explicit part of any theory. The notion is an unreflected background obstacle épistémologique (‘epistemological obstacle’), as Bachelard put it, that impedes the assessment and clarification of what you’re doing. “Because they have to limit the problem somehow” is true, but not any old how, not after all this time.

    There’s another pseudodoxa lurking in “the machine is trained entirely by relating texts to other texts, not to the outside world”. What is this “outside” world ? Speech and text are part of the world, just as apples and oranges are. Is it any easier to model an orange than model a pronoun ?

  24. Stu Clayton says:

    In constructing an adequate model of the world (or of part of it), there is an assumption that you are working with an adequate model of modelling. That may or may not be the case, but it’s worth looking into. Of course such self-referentiality reduces many people to tears, but that’s life.

  25. Stu Clayton says:

    Just off the top of my head, here’s a different idea of how to approach the translation of a text. Actually it’s not that much different from what is done now, just an extension of it.

    1. Parse/model/whatever each sentence, to obtain some kind of network (model) relating words to words, explicitly representing in some way the uncertainties/fluffiness

    2. You now have a set of provisional sentence networks. They are ordered by sentence order.

    3. Investigate the possible relationships between contiguous networks, using known features such as pronouns in an attempt to reduce the fluffiness within each sentence network. When it is found that the overall fluffiness increases, reject the text as being Heideggerian.

    4. You now have a network of networks. Take it from there.
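For what it’s worth, steps 1–3 can at least be given a data structure. A toy sketch (Python; every name and number here is invented, “fluffiness” is reduced to a single scalar, and the antecedent search in step 3 is a crude stand-in for a real one):

```python
from dataclasses import dataclass, field
from typing import Optional

PRONOUNS = {"it", "she", "he", "they"}

@dataclass
class Node:
    word: str
    fluffiness: float = 0.0          # uncertainty about this word's reading
    antecedent: Optional[str] = None

@dataclass
class SentenceNet:                   # step 1: one provisional net per sentence
    nodes: list = field(default_factory=list)

def build(sentence):
    # Pronouns start maximally fluffy; other words are assumed nearly settled.
    return SentenceNet([Node(w, 1.0 if w.lower() in PRONOUNS else 0.1)
                        for w in sentence.split()])

def link(prev, cur):
    """Step 3: tie each pronoun in `cur` to the last non-pronoun word of
    `prev` (a crude stand-in for real antecedent search), reducing its
    fluffiness."""
    candidates = [n.word for n in prev.nodes if n.word.lower() not in PRONOUNS]
    for n in cur.nodes:
        if n.word.lower() in PRONOUNS and candidates:
            n.antecedent = candidates[-1]
            n.fluffiness = 0.3

# Step 2: the nets are ordered by sentence order; then link contiguous pairs.
nets = [build(s) for s in ["Susan dropped the plate", "It shattered loudly"]]
for prev, cur in zip(nets, nets[1:]):
    link(prev, cur)

it = nets[1].nodes[0]
print(it.word, "->", it.antecedent)   # prints: It -> plate
```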

  26. Re Yucatec Maya: D’oh! It’s *not* the only American language on the menu — there’s also Querétaro Otomi, another Mexican language. Why did Microsoft roll out these languages before Nahuatl, let alone Quechua or Guarani? Because machine translation isn’t based on number of speakers, it’s based on text, specifically: parallel translated texts in a sufficiently uniform dialect in standard orthography. The bigger languages aren’t standardized enough, and the Maya and Otomi dialects are the ones where (according to Microsoft’s blog) they found “community partners”, a university and a cultural institute, who must have been the ones to identify appropriate parallel texts.

    @Bathrobe, thanks for checking on the quality of Microsoft’s Maya. (Google’s Hawaiian is also embarrassingly bad.) I wonder if the corpus was Maya-Spanish or Maya-English? Microsoft doesn’t say. Do you know of any distinctions that might test this, the way you can use pronoun number/gender/politeness distinctions to prove that Google Translate always goes through English as the hub?

  27. That was SFReader who checked Maya.

    I guess I do overestimate the amount of theory behind neural machine translation.

    What Stu is calling ‘pronouns’ is usually subsumed under anaphora in linguistics. Wikipedia (simplified):

    “In linguistics, anaphora is the use of an expression whose interpretation depends upon another expression in context (its antecedent or postcedent)… For example, in the sentence Sally arrived, but nobody saw her, the pronoun her is an anaphor, referring back to the antecedent Sally. In the sentence Before her arrival, nobody saw Sally, the pronoun her refers forward to the postcedent Sally (an anaphor in the broader […] sense). Usually, an anaphoric expression is a proform or some other kind of deictic (contextually-dependent) expression.

    “Anaphora is an important concept for different reasons and on different levels: first, anaphora indicates how discourse is constructed and maintained; second, anaphora binds different syntactical elements together at the level of the sentence; third, anaphora presents a challenge to natural language processing in computational linguistics, since the identification of the reference can be difficult; and fourth, anaphora tells some things about how language is understood and processed, which is relevant to fields of linguistics interested in cognitive psychology.”

    I’m not sure what it means by “binds different syntactical elements together at the level of the sentence”, since anaphora clearly extends across sentence boundaries. Indeed, one example of anaphor given in the article goes across sentence boundaries:

    “Susan dropped the plate. It shattered loudly. (The pronoun it is an anaphor; it points to the left toward its antecedent the plate.)”

  28. Stu Clayton says:

    Bathrobe, thanks for the consciousness-raising word “anaphora”. I wanted to stick with pronouns because everybody knows what they are, but you’re right, I should boldly switch to “anaphors” in the present circle.

    I like “points left”, it reminds me of a train running “to Cincinnati and points west”.
