James Fallows has a very interesting piece in the Atlantic about Google’s announcement that they’re phasing out their Translate API:
Here is the part of the explanation that, for me, had the marvelous quality of being obvious — once it’s pointed out — and interesting too:
The intriguing problem is the way that over-use of automatic translation can make it harder for automatic translation ever to improve, and may even be making it worse. As people in the business understand, computerized translation relies heavily on sheer statistical correlation. You take a huge chunk of text in one language; you compare it with a counterpart text in a different language; and you see which words and phrases match up. … Crucially, this process depends on “big data” for its improvement. The more Rosetta stone-like side-by-side passages the system can compare, the more refined and reliable the correlations will become.
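To make "sheer statistical correlation" concrete, here is a minimal sketch of the idea, not a description of how Google Translate actually works. It counts how often words turn up together in aligned sentence pairs and scores each pairing with a Dice coefficient. The four-sentence corpus and the function names are invented for illustration.

```python
from collections import Counter
from itertools import product

# Toy English/French parallel corpus, invented for this example.
PAIRS = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
    ("the blue flower", "la fleur bleue"),
]

def dice_scores(pairs):
    """Score every (source word, target word) pair by co-occurrence.

    Dice = 2 * joint / (source count + target count), counted per sentence pair.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }

def best_match(word, scores):
    """Return the target word with the highest score for a source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)

scores = dice_scores(PAIRS)
print(best_match("house", scores))
print(best_match("blue", scores))
```

With only four sentence pairs the counts already single out the right partner for each word. The same logic cuts the other way: every bad sentence pair added to the corpus shifts the counts toward a wrong match.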
But the data is being corrupted by the rapidly increasing volume of machine-translated material:
The more of this auto-translated material floods onto the world’s websites, the smaller the proportion of good translations the computers can learn from. In engineering terms, the signal-to-noise ratio is getting worse. It’s getting worse faster in part because of the popularity of Google’s Translate API, which allows spam-bloggers and SEO operations to slap up the auto-translated material in large quantities. … [This story] reveals a problem I hadn’t thought of — and illustrates one more under-anticipated turn in the evolution of the info age. The very tools that were supposed to melt away language barriers may, because of the realities of human nature (ie, blog spam) and the intricacies of language, actually be re-erecting some of those barriers. For the foreseeable future, it’s still worth learning other languages.
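The signal-to-noise argument can be put as a toy simulation. This is a sketch under loudly stated assumptions: that training text is scraped from the open web, that human pages have quality 1.0, and that machine output is a fixed fraction (`fidelity`) as good as the corpus it was trained on. All the numbers are made up, and nothing here reflects Google's real pipeline.

```python
def simulate(rounds, human_pages=1000.0, human_per_round=50.0,
             machine_per_round=400.0, fidelity=0.8):
    """Return a list of (good_share, corpus_quality), one entry per round.

    Hypothetical model: human pages have quality 1.0; machine pages
    published in a round have quality fidelity * corpus quality at the
    time they were produced.
    """
    machine_pages = 0.0
    machine_quality_mass = 0.0  # summed quality of all machine pages
    history = []
    for _ in range(rounds):
        total = human_pages + machine_pages
        corpus_quality = (human_pages + machine_quality_mass) / total
        history.append((human_pages / total, corpus_quality))
        # A system trained on this corpus now publishes new pages,
        # which become part of the next round's training data.
        machine_pages += machine_per_round
        machine_quality_mass += machine_per_round * fidelity * corpus_quality
        human_pages += human_per_round
    return history

for share, quality in simulate(5):
    print(f"good share {share:.2f}, corpus quality {quality:.2f}")
```

Under these assumptions both the share of human translations and the average corpus quality fall every round. The decline stops only if machine pages are kept out of the training data.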
For a detailed analysis of the situation, go here. I should add that this does not affect Google Translate, and a good thing too, because I use it constantly.
Thanks for the comment at the end – Google Translate is great, both for gisting something and for native speaker pronunciation. Also, for amusing errors like this one, snapped from Google Translate’s own sample page:
http://maxqnzs.com/Englisch.jpg
Ah, positive feedback loops. Fun fun fun.
Since I first read this article I’ve been wondering if it isn’t possible for Google to “watermark” its translations in some way so that it detects them as its own translations? Not sure how this would work, but I can imagine a sequence of spaces and particular use of punctuation might make it possible to mark longer texts?
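A rough sketch of what a spaces-based watermark could look like, purely as an illustration of the idea in this comment. It hides a repeating bit pattern in the choice between an ordinary space and a no-break space (U+00A0). The signature and function names are hypothetical, and nothing suggests Google does anything like this.

```python
NBSP = "\u00a0"
SIGNATURE = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical 8-bit marker

def embed(text, signature=SIGNATURE):
    """Replace ordinary spaces with NBSP wherever the repeating signature has a 1."""
    out, i = [], 0
    for ch in text:
        if ch == " ":
            out.append(NBSP if signature[i % len(signature)] else " ")
            i += 1
        else:
            out.append(ch)
    return "".join(out)

def detect(text, signature=SIGNATURE):
    """True if the sequence of space types in the text follows the signature."""
    gaps = [1 if ch == NBSP else 0 for ch in text if ch in (" ", NBSP)]
    if len(gaps) < len(signature):
        return False  # too short to carry one full copy of the marker
    expected = [signature[i % len(signature)] for i in range(len(gaps))]
    return gaps == expected

sample = "the quick brown fox jumps over the lazy dog again today"
marked = embed(sample)
print(detect(marked), detect(sample))
```

The obvious weakness is that any step that normalizes whitespace, or a spammer who edits a few words, wipes the mark out, so a scheme like this would only catch text that was copied verbatim.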
This brings up a related problem: even human translations can be poor if not atrocious, *often in very different ways* (the translator may have imperfect knowledge of the source or the target language, may translate too literally, may misunderstand various aspects of the original…): I’m no computer specialist, but I simply don’t see how the quantity of data can compensate for the heterogeneity of said data.
positive feedback loops
I love it. Self-referentiality 1:0 Google engineers.
Since I first read this article I’ve been wondering if it isn’t possible for Google to “watermark” its translations in some way so that it detects them as its own translations?
And thus systematically ignore its own role in what it is trying to do ? That would be an epicyclic tweak that doesn’t address the fundamental misconceptions of the undertaking. A bandaid on a positivist pustule.
I wish everyone who hasn’t already done so would run down to the drugstore and buy a copy of Morin’s La nature de la nature. But it ain’t gonna happen.
Etienne: even human translations can be poor if not atrocious, *often in very different ways* … I simply don’t see how the quantity of data can compensate for the heterogeneity of said data.
It can’t – unless the percentage of misleading/wrong translations is small. What justifies assuming that this is the case ? Nothing – it was just assumed, apparently. And of course it was inexcusably dumb to overlook the fact that increasing amounts of imperfect Google output imply increasing amounts of imperfect Google input to itself, explicitly invalidating the assumption that large amounts of translation are statistically reliable.
To put it another way, you can’t learn from yourself if you’re dumb. You have to be smart already in order to get smarter. Smart is a non-starter.
This preponderance of machine translations has reduced the usefulness of Googling as a tool in translating English into Irish. Previously if you wanted to find the official translation of some legal phrase which wasn’t in the usual sources, you could google the English phrase plus some Irish words and generally find some official circular or similar which would give you the answer. Now this approach just brings up pages of gobbledegook.
Is the “native speaker pronunciation” in Language Tools reliable though? For the time being I’d rather stay with rhinospike (or similar)…
My first translating job many years ago was translating mathematical abstracts, mostly from Russian. It often seemed to me that this would have been a perfect candidate for machine translation, which barely existed then. The terminology was standardized and the style was absolute boilerplate. The main pitfall for a human translator who was not also a mathematician (most of us weren’t) was that the same term might translate differently depending on the field of math. In other words, context. But that information would have been easy to feed a program. I don’t know how much machine translation is used for such things now; I would guess a lot.
More recently I did some commercial translation for the first time in years. I have to say that Google made the whole thing much easier. It had nothing to do with automated translation; it was strictly a matter of how ordinary Google searches, with the mass of material now available on the web, made it possible to zero in on terminology that you would never find in any dictionary. These were everyday commercial and specialized terms that would be completely familiar to people in the respective countries, but that evolve much too fast for traditional print dictionaries, and are only sporadically found in online ones. Often you can find a good definition of the term in some online article in the source language; otherwise you have to do a bit of detective work, and of course you have to avoid not only automated Google translations but things like parallel product pages written in English by non-native speakers. On balance, though, I’d say it’s not only a tremendous time saver, but makes a marked improvement in the quality of the translation, at least when it’s being done by someone who is not absolutely, perfectly bilingual.
I suspect that, in addition to the general ‘feedback amplifies noise’ problem, there’s a specific problem: the dominance of English. Specifically, the ‘data’ for languages-other-than-English is overwhelmed by bad translation from English into ‘local’ languages – this polluted local data then degrades translation both to and from English.
As an Xoogler, I can definitely say that Google Translate (and the Translation API, etc.) are founded on high-quality bilingual corpora that were translated in the first place by humans and are chosen by humans. For example, the original French-English corpus was mostly from Canadian statutes and other government documents. There is no question of Google’s output corrupting its input directly.
I believe the explanation is much simpler: that Google means what it says, and that the cost of running the API is simply too high. Google is chronically short of machine resources, especially network bandwidth and CPU time. This may seem hard to believe when they have jillions of machines in oodles of data centers, but it’s true; they have a massive system for tracking and budgeting production resources. An API that’s wildly overused, brings in no money, and mostly seems to benefit spammers is a natural candidate for the chopping block. As someone who has spent time inside (in both senses; I did not enjoy my stint at Google), it’s plain to me that there is less here than meets the eye.
That said, Google’s default posture of secrecy in my opinion simply encourages this sort of far-fetched Kremlinology, and I wish they would cut it out. Doubtless it has often benefited Google to be the target of itchy curiosity, but doubtless it has often hurt them too (h/t Izaak Walton).
I use Google Translate all the time, but it really annoys me that it doesn’t work well with basic Spanish issues like concordance between adjective and noun or subject and verb, or the difference between SER and ESTAR. These are all things that Word, for instance, can correct. Do they work so differently?
John Cowan: There is no question of Google’s output corrupting its input directly. … An API that’s wildly overused, brings in no money, and mostly seems to benefit spammers is a natural candidate for the chopping block.
Nobody claimed that the output corrupted the input “directly”. Google’s business model itself – how it generates revenue – is indirect. The most familiar Google service, the browser-accessible search facilities, costs nothing, no more than the Translate API does. Google resembles a traditional newspaper publisher more than a hotdog stand. At the stand, only the mustard is free. The newspaper publisher, in contrast, gives you editorial hotdogs for free, in the hope that you will buy their advertisers’ mustard.
Google’s stated mission is “to organize the world’s information and make it universally accessible and useful”. No Kremlinology is necessary to figure out that there is more than meets the eye here, because you can’t make a living from organizing for zilch. Another quote from the eMpTy Pages site on translation technology and related topics linked by Hat and Fallows, the author of the Atlantic article:
“Wildly overused” and “brings in no money”, as you characterized the Translate API, means that advertisers, doing a cost-benefit analysis, are not convinced that the quality of sites formulated in googledegook will attract enough customers to justify the advertising outlay. It is clear to me, at any rate, that the issue is marketing quality control. Good copy, translated or not, makes sense and makes you want more. Quality in the sense of earning the Grand Prix du roman de l’Académie Française is not what this is about.
Leonard, thank you very much for mentioning rhinospike! I’d not heard of it before, and it sounds like a very useful service. As for the authenticity of Google Translate’s native speaker function, it seems OK in Hindi. I only use it for isolated words, especially if I come across something excessively polysyllabic and filled with conjuncts (damn Sanskrit!) so it’s convenient to get at least an idea of how it sounds straight away, to either confirm my own efforts were near the mark or not.
I believe the explanation is much simpler: that Google means what it says, and that the cost of running the API is simply too high.
I’d be inclined to agree, except … Well, hypothetically, there is a different company, hypothetically running a similar API and someone from that company hypothetically told somebody (definitely not bulbul) what the hypothetical costs involved in both developing the MT engine and running the API were. They amount to peanuts for both this hypothetical company and, one would assume, Google, so it’s not a question of money, either in terms of budget or in terms of revenue lost.
Google is chronically short of machine resources, especially network bandwidth and CPU time.
Or, as a Googler friend of mine put it: you know what we call 200 TB of free space over here? Critical low disk space.
So I think John’s right, it’s all about resources.
I can definitely say that Google Translate (and the Translation API, etc.) are founded on high-quality bilingual corpora that were translated in the first place by humans and are chosen by humans.
That’s the theory/plan. But as I’m sure you know, theory is not always that easy to put into practice. I’ve observed that for a number of languages, the Google MT output is garbled in a way which indicates that the corpora used were far from high quality.
Etienne,
I’m no computer specialist, but I simply don’t see how the quantity of data can compensate for the heterogeneity of said data.
Who says it does?
Stu,
Nothing – it was just assumed, apparently.
No. As one of the first steps in the development of a particular MT engine/language pair, the input is checked by humans who are translators/linguists themselves. I know, I’ve done a lot of these checks myself.
Plus you don’t just get bilingual corpora from garbage bins and gutters. You get them from professional translators and agencies who have their own quality control process. I’m not saying it’s all perfect, but mostly it works.
John Cowan: please do put up a warning before you make certain statements; if I had been drinking coffee when I read your remark (“high-quality bilingual corpora…for example, the original French-English corpus was mostly from Canadian statutes and other government documents”) I would have destroyed the screen by spitting out hot coffee at it: as it was I laughed so hard that oxygen intake was an issue.
That is to say, as a francophone in Canada I can assure you and others that Canadian (federal) translations from English to French range in quality from barely acceptable to “somebody got paid to produce this?”: even when the French is grammatically correct (it happens, on occasion) the phraseology, semantics and the more subtle aspects of syntax are so bizarre that it strikes any francophone as being, at best, “off”.
I am but one person, of course, but as I recall, Marie-Lucie had once made similar comments on this blog on the French of translated Federal documents.
definitely not: Plus you don’t just get bilingual corpora from garbage bins and gutters. You get them from professional translators and agencies who have their own quality control process. I’m not saying it’s all perfect, but mostly it works.
Mostly ? As I’ve said before, MT does seem to serve people well who are satisfied with a Me-Dick-You-Jane level of intelligibility, and don’t read much more than the backs of Cheerio packages when they’re not watching TV. Here is a simple everyday Grumbly sentence in Cologne expressing dissatisfaction with MT translations of technical documentation into English:
Here is what Google Translate makes of that:
I rest my case while I go to the bag.
Now that I’m back, I find that when I change the word order of the German to make it more English-like, the translation into English is less distorted:
becomes
It might be easier to translate if I used English directly, and dispensed with German altogether, what say you ?
To my cantankerously homonymous fellow commenter – your experiment with rewording the German (which was so easy to read that even I had no trouble) to be more Englisch seemed sadly apt. I say sadly because it seemed to provide another example of what Hat wrote about in “English bones under the skin”.
I should explain that Es geht mir einfach auf den Sack means “it just chaps my ass”. I have allowed idiomatic use to take precedence here over anatomical precision.
it seemed to provide another example of what Hat wrote about in “English bones under the skin”.
Sharp cookie, Homonym ! I too was thinking of that.
Gosh, it’s the Proud Pieriansipist Himself, I now see !
As I’ve said before, MT does seem to serve people well who are satisfied with a Me-Dick-You-Jane level of intelligibility
And what does that have to do with what I was talking about, which was the quality of the bilingual corpora used in MT training? The corpus is but a first step.
The quality of MT output depends largely on the subject matter and language pair. German is notoriously difficult to deal with, what with the word order and compounds.
Like I keep telling everyone at the day job, MT ain’t no fucking magic solution. It’s a tool to be used wisely depending on the circumstances.
“the ‘data’ for languages-other-than-English is overwhelmed by bad translation from English into ‘local’ languages”
Perhaps that depends on how much like English the bad translations are. Grumbly Stu’s experiment with the German suggests that machine translation likes anglicised ‘other’ languages. Most Hindi films are first written in English, and there has been quite a bit written about how badly they are then translated into Hindi, with constructions and phrasings that aren’t natural at all. That very awkwardness and Englishness of the translations might mean that pasting them into a machine translator would give a better English result, as it did with Stu’s Turkish taxi driver.
German is notoriously difficult to deal with, what with the word order and compounds.
That’s odd, there are over 80 million people who find it a breeze. There are at least that many Turkish speakers.
MT has not noticeably improved since the early ’70s when I first encountered it, despite all the hype. The same is true of morals. Hope springs eternal, while the multitudes limp along.
MT has turned out to be a fundamentally misguided undertaking, like psychoanalysis. It probably has had useful side effects, though – like the astronaut insulation foil that came out of the race to space. And it still provides lots of people with jobs, so it’s not entirely a Bad Thing.
My transposition of phrases in the German sentence didn’t actually make it that much more “English-like”. The English skeleton “discussion about … with …” is just as natural as “discussion with … about …”. Similarly, the German “Diskussion mit … über …” is just as natural as “Diskussion über … mit …“.
In view of that, it is strange that the one German phrase order led to a more-or-less intelligible (and correct !) English rendition, while the other order led to a plain mistake (it was not “recent immigrants” that were being discussed, but navigation routes).
These kinds of examples have been discussed for more than 40 years now. MT is not getting anywhere, but it is as firmly ensconced as the Catholic church.
That’s odd, there are over 80 million people who find it a breeze. There are at least that many Turkish speakers.
Now you’re just being difficult. a) The context we’re employing makes it crystal clear what I meant. b) What does the number of speakers have to do with MT?
MT has not noticeably improved since the early ’70s when I first encountered it, despite all the hype.
Bullshit. Back it up with some hard data, then we’ll talk.
MT has turned out to be a fundamentally misguided undertaking
I have no idea what you mean by that. “Flawed” does not equal “fundamentally misguided.” I have been able to get tons of information to which otherwise I would have had no access, or which would have been so tedious to acquire I wouldn’t have bothered, by pasting chunks of foreign text into Google Translate. If all you’re interested in is translations that match those of highly paid human professionals, then yeah, you’re out of luck, but that seems to be a very silly way to look at it.
Stu: Nobody claimed that the output corrupted the input “directly”
As far as I can tell, that’s exactly what Fallows is suggesting in the original article. He’s perhaps not as clear as he could be, but “The more of this auto-translated material floods onto the world’s websites, the smaller the proportion of good translations the computers can learn from” sounds to me as if he thinks Google Translate gets its corpora from stuff it’s pulled off webpages, a certain (rising) proportion of which will be its own output.
Grumbly: You’re quite right about Google’s business model, though what you don’t know is that Google’s decisions about its free services are rarely made with reference to that business model. I like to compare most of Google to a non-profit that has a money pit in the basement. When they need more money, instead of writing a grant to some government or foundation, they just go down to the basement, dip a bucket into the money pit, and fish out enough money to go on with. What is more, something may be “clear to you” and nevertheless actually, y’know, false.
Etienne: I am of course absolutely unfit to judge the quality of government French in Canada or any other kind of French, and I know you are right with respect to a great many of the various pieces of paper the federal government emits for public consumption. But I have been told, at least, that the translations of statutes and secondary legislation are of high quality. They are publicly available here, so you can judge for yourself if you feel like it — I’d like to know what you think.
Stuart: Long time no comment! Welcome back.
Tim May: Exactly so.
I don’t understand how Google Translate works with minority languages – i.e. those with few/no corpora and little reliable online text. Maybe it just doesn’t work? GT in Yiddish is awful. Is GT in other minority languages better?
bulbul: What does the number of speakers have to do with MT?
You wrote: “German is notoriously difficult to deal with, what with the word order and compounds.” Any human activity is difficult that is successfully learned and practiced by relatively few people, although many more have tried to acquire the competence. This applies to physical and cognitive activities – deep-sea diving, and high-energy physics. Speaking and understanding German is not such an activity, because 80 million people do it. It is not difficult.
Why might German appear to be difficult from the perspective of an MT worker ? It can’t be because it resembles high-energy physics, or because it cannot easily be translated into the language of deep-sea diving. It may be because the MT worker has fundamental misconceptions about what is involved in language practice (not just German) and communication in general. These misconceptions include the belief that computers, with their ability to eat and regurgitate large amounts of “data”, will one day be able to do what bilingual humans do. It could not be known in advance that this was a misguided belief, but by now that’s a reasonable assessment of the situation.
Primitive machines such as computers and see-saws can do things that humans can’t. But humans (and animals in general) are themselves machines, of an entirely different, much more interesting type. They can do things that computers have failed at, for instance creating intelligible translations of texts not written in Basic German. For several hundred years – until about 60 years ago – human machines were thought of as simply interlocked compositions of smaller, primitive machines. Smaller, primitive machines are indeed involved, but not simply in the mode of composition with feedback loops.
The brain, for instance, bears no resemblance to any computer, nor even a computational cloud – because the brain has no CPUs. Primitive machines cannot repair or reproduce themselves. They have very special, limited uses.
The smart money nowadays – the money I would have, if I had any – is on things like biomimicry and bioengineering. Not trying to build Frankensteins or Transformers to do what we do already, but to learn from, collaborate with and reuse what is being done out there, and that we cannot do by ourselves.
Every time I have applied state-of-the-commercially-available-art MT to texts, I have gotten garbage. Yet Hat claims to have gotten tons of useful results, and bulbul asks where my hard data is. It seems that my snarky comparison of MT with the Catholic church was closer to the truth than I thought. These two gentlemen are Bearing Witness to their beliefs, whereas I myself get along quite well without those beliefs, and am not interested in proof-of-concept squabbles about MT.
Of course I too am Bearing Witness, but unlike the two gentlemen, I say: “Don’t look into your progress-hungry heart, but consult your experience instead, and decide whether MT has lived up to its promises”.
I just used GT to transform the second paragraph in my last comment into German. Can anyone understand it ? Imagine having to read this kind of thing all the time (e.g. MT-processed Microsoft security advisories). O ye of too much faith !:
David M. should be here in this hour of tribulation.
These two gentlemen are Bearing Witness to their beliefs, whereas I myself get along quite well without those beliefs
So let me get this straight. You’re claiming, on the basis of some metaphysical source unknown to me, that I do not in fact get useful results from Google Translate? Am I imagining it, dreaming it up perhaps? Those Wikipedia pages I have improved using sources in languages I don’t read are a fantasy? It seems to me you’re the one with an idee fixe you refuse to reconsider. You have a belief that MT is NG, and you won’t hear anything different.
You’re claiming, on the basis of some metaphysical source unknown to me, that I do not in fact get useful results from Google Translate?
No. I do believe that you find the results useful – after all, you say you do. It’s just odd that I myself have never been able to get satisfactory results – and I mean even barely intelligible results, without any literary quality.
I gave fresh examples above of the kind of output I get. No one else has given counterexamples. Everyone has acted as if they exist but didn’t need to be brought in evidence. John Cowan spoke of Canadian government documents by hearsay, but was taken up sharply by Etienne.
Opinions about GT appear to be an entirely subjective matter, based on faith or the lack of it. From my point of view, you might as well be issuing testimonials as to the positive power of prayer. I’m sure it works for you, but it doesn’t for me. I wonder if I’m going to end up in hell.
Those Wikipedia pages I have improved using sources in languages I don’t read are a fantasy?
How about a concrete, specific example or two ? I’ve given mine.
I don’t want to insist on any metaphysical position as regards MT or GT. All I’m going on is that these technologies have never met my minimal expectations, not even approximately.
It’s just odd that I myself have never been able to get satisfactory results – and I mean even barely intelligible results, without any literary quality.
Perhaps you are looking for something that MT doesn’t and may never provide, whereas other people are looking for something that MT does provide. There’s nothing odd about that. It’s fine if you want to report your personal experiences, but your experiences are not arguments, and people care about your dashed expectations only up to a point.
Perhaps you are looking for something that MT doesn’t and may never provide, whereas other people are looking for something that MT does provide.
That seems to be it. By the way, I believe that the same thing can be true of prayer.
My expectations are unimportant. I just wonder what it means that so many people apparently believe in the usefulness of MT, and yet are unwilling or unable to give examples.
My suspicion is that what MT provides people with is of the Me-Dick-You-Jane variety, which can only be understood with a lot of contextual clues not in the text results – exactly the kind of thing that I get when I use GT. That is fine for some purposes. But why pretend that MT provides more than that ?
Also, I suspect that most people only play around with MT occasionally, whereas in the IT industry I am overwhelmed by MT output. The unintelligibility of this stuff is a professional problem.
Has a consensus formed that MT is comparable with prayer ?
Why might German appear to be difficult from the perspective of an MT worker ?
The ridiculous term ‘MT worker’ aside, I gave you two reasons: 1. The weird things German does to its word-order. 2. Compounds.
Or in other words: for the same reasons it appears difficult to your average second language learner.
Speaking and understanding German is not such an activity, because 80 million people do it. It is not difficult.
Then how come you don’t speak Russian? Or Chinese? If so many people can do it, it can’t be that difficult…
MT has not noticeably improved since the early ’70s
Bullshit on a stick, if you pardon my Elamite. There’s tons of research into how the shift from rule-based to statistical MT has improved the quality of output.
These two gentlemen are Bearing Witness
I certainly don’t. Having been in charge of all things MT at the day job for almost a year now and an ‘MT worker’ for almost five years, I have more than a passing familiarity with the subject – both theoretically and practically – and I’m aware of what the technology can* and what it cannot do. You, on the other hand, are ready to write off an entire field of human knowledge you have no idea of based on your limited experience.
* My favorite example: research into how MT can improve translators’ productivity.
e.g. MT-processed Microsoft security advisories
But those are published in English. Why would *you* read a translation, MTd or not?
yet are unwilling or unable to give examples.
I can give you plenty, like the 370 pages of contract amendments and technical specs I had MTd (ES>EN) for another department last Thursday.
Or the notes by a Romanian reviewer that a colleague had GTd this morning. Or the *** project last week where I used GT and Microsoft Translator to make sense of a Chinese text where the reviewers removed all the line breaks and formatting and I had to make sure the text was entered into an InDesign document properly.
Shall I go on?
Then how come you don’t speak Russian? Or Chinese? If so many people can do it, it can’t be that difficult…
You’re right, it shouldn’t be that difficult, neither more nor less than German is. I don’t know why you single out German as particularly weird.
The problems adults have with learning a second or third language are to a large extent due to age-related neurological barriers, apparently – children don’t have the same problems. The problems are also due partly to lack of interest, partly to inadequate teaching methods etc. MT has none of these problems, and yet performs hardly better than a person with a computer and a couple of dictionaries and grammars on the hard disk – only faster.
e.g. MT-processed Microsoft security advisories
But those are published in English. Why would *you* read a translation, MTd or not?
I have a German Windows 7 Prof on my notebook. I always get linked first to MTd German sites. Since these usually take more time to figure out than I want to take, it is more efficient to chase down the English originals.
Shall I go on?
Do you have an example of DE<->EN ? That’s the only combination I am competent to evaluate.
370 pages of contract amendments and technical specs I had MTd (ES>EN) for another department last Thursday.
I’m sure you had to do some editing on the EN, and suspect that “contract amendments and technical specs” are a subject you are familiar with. That’s certainly a situation in which MT can be very useful.
But how would you like being MTd cold with something in a garbled form on a subject about which you knew little, and had no contextual clues as to what is going on ?
I don’t know why you single out German as particularly weird.
Zum Bleistift wegen der Worstellung die sich in gewissen Aspekten von der der anderen Europäischer Sprachen unterscheidet.
See that thing I did there with the verb? You don’t really do that in Spanish or Slovak or Maltese. That’s one of the things that are difficult to deal with when building an MT engine.
yet performs hardly better than a person with a computer and a couple of dictionaries and grammars on the hard disk – only faster.
Assuming this is true in its entirety – which it bloody well isn’t – yes, that’s it. Except here the difference is between several months (person with a computer and a couple of dictionaries and grammars) vs. several seconds.
Zum Bleistift
Isn’t that a cute idiom ?
See that thing I did there with the verb? You don’t really do that in Spanish or Slovak or Maltese. That’s one of the things that are difficult to deal with when building an MT engine.
There must be something wrong with the MT syntax models. That kind of Wortstellung is an automatic, systematic thing in German. Most speech is automatic anyway. Maybe it’s time for a recuperative dose of rule-based modelling, and not so much statistics.
I’m sure you had to do some editing on the EN
Yes. The title was misspelled, so the MT output was kinda garbled. Had to fix it manually.
But how would you like being MTd cold with something in a garbled form on a subject about which you knew little, and had no contextual clues as to what is going on ?
And that’s what I mean by ‘understanding what it cannot do’ and ‘to be used wisely’. If you end up in the situation you described, low-quality MT output is the least of your problems. Either you chose to get there, which means you’re terminally stupid, or you were forced into it, which means you’re being fucked.
or you were forced into it, which means you’re being fucked.
This is the case, and the burden of my song.
That kind of Wortstellung is an automatic, systematic thing in German.
Automatic, sure. Systematic, that’s another question entirely. Does it really behave that systematically across all types of syntactic structures?
BTW, GT gets this one right, at least on my example sentence (corrected misspellings, removed the cute phrase):
With Systran rule-based engine (ca. 1970s technology), you get this:
Seems like an improvement, wouldn’t you say?
Maybe it’s time for a recuperative dose of rule-based modelling
Now you’re talking. Some companies have seen the light (Apptek, Systran) and have begun to offer hybrid solutions.
Does it really behave that systematically across all types of syntactic structures?
YES ! (well, with a few little exceptions, nothing to write home about …) I personally feel that the position of the participle is at the very bottom of the worry list. Near the top is the string-of-particles phenomenon that is one of the reasons I originally wanted to learn German, to understand it: Wenn ich doch immerhin überhaupt mal wieder damit rechnen muß, daß … (this is colloquial rather than formal)
Some companies have seen the light (Apptek, Systran) and have begun to offer hybrid solutions.
I sense a warm, comfortable feeling of reconciliation coming on …
I agree with bulbul and Mark Twain that the problem is that the German language is just fucked. If I didn’t blame Wagner for Hitler, I’d blame the German language, especially the noun declensions and that thing it does with verbs.
Adam Gopnik had a rambling piece earlier this Spring, ostensibly reviewing some AI books in the wake of Watson on Jeopardy! In passing, he mentioned that his mother, now known for FOXP2, had worked on early efforts to automate Hansard translation. See this paper for how things were in ’69.
What a terrible thing to say about your own mother.
MMcM: here is a novel thought about translation from that ’69 document that I have put in bold:
Well, somebody had to be bold enough to say it !
To put it another way, you can’t learn from yourself if you’re dumb.
“…our minds lie in us like the fish in the pond of a man who cannot fish”
Ted Hughes
http://www.youtube.com/watch?v=TnRebDgZOsA
Grumbly, why wouldn’t you believe, given that MT is as bad as you say it is, that it’s even worse at some languages than others?
But this question is rhetorical, for watching you in action here convinces me of what I have long suspected: you are a troll. To be sure, you are far more intelligent and knowledgeable than most trolls, but it’s the motivation that makes a troll, and your primary intent (as judged by objective criteria) is not that of participating in a community, but of provoking readers into an emotional response. I suppose all that conversation — or “conversation” — we had a while back about identity (yes, identity) should have been conclusive. For good and ill, though, I’m a trusting sort of person, “always assuming the best of people and always having the best of intentions, but he still doesn’t always think all of the consequences through” (Christophe Grandsire).
Anyway, even if Hat doesn’t ban you (and on balance I hope he does), I at least will be seeing no more of you. Vaya con Dios.
That’s a pity, John: you are far more intelligent and knowledgeable than most. And melodramatic too, it appears. Are you charging me with responsibility for this outburst ?
We’ve hardly exchanged a word in comment threads since that “conversation” months ago. I’ve been tussling here primarily with Hat and bulbul about MT. I see no cause for this sudden attack on my person.
I would have thought that if you didn’t like what you read here, you could simply not read it and not join in the comments. But unlike you, I’m not going to speculate on motives, impute malicious intent or call names. Already in that previous “conversation” it became clear that you are fond of the frown superb and the waggling finger – that is one of the big differences between us.
Grumbly is just frustrated with the continuing inability of MT to come up with what he wants and he’s being bloody-minded about the way he expresses it.
In fact, MT is much worse with some languages than with others. With Chinese and Japanese it’s pretty bad. But it’s still useful in its own way and Grumbly is just being difficult when he denies it. I use it all the time in translation. Sometimes it’s a tossup whether it’s easier to just wade in and translate an article yourself or to give it to Google Translate and have to minutely edit it later. The editing is tedious and can take a lot of time. But on balance it is usually quicker to use Google Translate.
I think I would agree (without giving examples or proof) that MT has got better over time. Under the old rule-based approach, it was unrealistic to expect anything more than Dick-loves-Jane. But the corpus-based approach produces intelligible, often quite good text in a surprising array of situations. What’s more, MT is often good (but not infallible) at finding and using the right technical terms and finding how people’s names are written in Chinese. That can save a lot of tedious legwork.
But MT still has massive drawbacks, like failing to get figures right or to maintain intelligibility where linguistic structures get complicated (which means that negative sentences suddenly turn into affirmatives). But even with that, it is still useful for getting the drift of what is written in a totally unfamiliar language, as Hat has found out. In other words, despite its huge faults, it’s still a useful tool. I think it would be useful if Grumbly admitted that.
Grumbly,
Though maybe it’s wrong to try to divine motives, I will say that you regularly make a point of being provocative. That could be called seeking to provoke an emotional reaction, or it could be called seeking to provoke a good robust argument. Where’s the line? Anyway, perhaps not every attempt to provoke an emotional reaction ought to be called trolling. Furthermore, to label you as a troll goes a bit beyond calling some of your recent behavior “trolling”. I was taken aback by John’s comment.
On the other hand, your behavior in this thread has seemed extreme even for you–as if the drive to provoke got in the way of actually seeking to understand what others were thinking. (But there I go almost imputing motives. Does that mean that I’m being deliberately provocative myself?)
I don’t remember the “conversation” in question, and I don’t want to. But are you sure that you are so far from being some kind of big finger-waggler yourself? And so what if you are?
I would be flabbergasted, and disappointed, if Hat saw this as grounds for serious disciplinary action. But, jeez, can’t you lighten up a little?
(“one pl” is “questionable content”. I had to use an HTML trick to write “one place”.)
The Right Honorable Mr. G. Stu felt so strongly about this that he actually e-mailed me. Well then.
I’m so going to steal that.
First of all, a public service announcement: “simple everyday Grumbly sentence” is not sarcasm. Sentences of this length occur all the time in spoken German – especially in spoken German, because when you talk, you can’t always plan an entire sentence ahead, so you don’t necessarily know where you’re going to end, know what I mean? (And see what I did there? I lengthened this sentence four times while writing it.)
Second, I wouldn’t have made the last comma, and machine translation always falls flat on its nose and splatters on the pavement when the commas aren’t precisely where it expects them. The comma probably made Google interpret the stuff after it as a relative clause, so it stumbled over the lack of a verb and screwed up.
Well, I am glad I speak it natively so I never had to learn it as a foreign language. Three separate declensions for the same adjectives, and each consists only of distributing the same three or four endings in different but almost random ways?!? That’s so extreme, Mark Twain didn’t even notice it.
On the other hand, knowing French and Latin but never having learned more than tourist phrases of Italian, I find it considerably easier to read this blog in the original than when I click on the Google Translate button. It’s still an effort, so I rarely read that blog at all (even though I think I should read pretty much every post), but Google Translate results in simply incomprehensible sentences several times per post.
Whatever the EU uses for many of the German pages at http://europa.eu is good.
Except that the Catholic church asks for hard data even less than it used to. For instance, only two miracles are now required for canonization instead of the traditional four, and the advocatus diaboli has been abolished or nearly so.
Why can[3pl] to be German[adj, nom/acc.sg.f or nom/acc.pl; refers to a noun that is not fucking there] seem[inf] from the perspective of a MT worker[nom.sg, does not agree with eines, “of a”, gen.sg.m] [note that “worker” just floats around here in isolation; there ought to be at least a hyphen connecting it to its prefix MT; this kind of thing sometimes trips me up while reading] difficult? It cannot be, because it resembles high-energy physics, or because it [“can” missing] not easily be translated into the language of deep-sea diving [which is given a wrong article and lacks the required ending – it’s treated as plural]. It can be, because the MT employees [again lack of connection between the words, but at least the grammar works this time]
Oh. Now it breaks down completely.
At the end of that sentence, there’s the word beteiligt, “lets participate”. But does it belong to the clause that begins with “because”, or does it belong to the embedded relative clause that somehow never ends?
Naturally, I tried it both ways. Both ways lead very quickly to “colorless green the ideas are sleepeth fury”. I am not exaggerating.
I give up. This has been my attempt to retranslate that innocuous little paragraph without machine help and without looking at the original. I have translated German word order as English word order, so all the peculiarities you’ll notice (for instance in the first sentence) sound just as bizarre in German as in English. I also haven’t used homonyms that would be wrong from context, context that computers don’t always understand; for instance “difficult” at the end (!?!) of the first sentence could just as well have been “heavy” – in fact, you can probably find prescriptivists who will claim to only use schwer for “heavy”, while “difficult” would be schwierig.
Yeah, OK. I’ll try the last sentence:
It could not be in advance, that this was an erroneous belief be known, but meanwhile, that a reasonable assessment of the situation.
That a reasonable assessment of the situation what? Where is the verb?
Four mistakes in that sentence:
1) Wortstellung. Obvious typo, never mind.
2) Missing comma right after. That’s a mere orthographic convention (clauses must be separated by commas); it doesn’t impede understanding, as the English convention of never putting a comma in such places shows, and as tons of native speakers who haven’t understood the rule probably also show*. Whether a comma there correctly reflects my intonation (as it otherwise does very faithfully) depends on how fast I speak.
3) Europäischer is an adjective that is not part of a proper name, so it should be in lowercase. Mere orthography, never mind.
4) The three declensions for the same adjectives. GOTCHA! “Europäischer” should be europäischen. The trouble here is that what you wrote isn’t merely wrong, it means something else. Instead of “of the other European languages”, you wrote “of the others of European languages”. It took me two seconds to figure out that this, in context, doesn’t make sense and can’t have been what you meant.
* I say “probably” because not all of them can read even their own writings for understanding. But I digress.
I don’t believe in neurological barriers of this sort.
What is definitely going on is that children learn lots and lots of phrases by rote and interpret rules into them later. (Sometimes they fail completely to notice some of the rules with fewer applications and instead keep believing that the cases to which they apply are irregular special cases; this is how rules are lost during language change.) Adults with any education will already expect that a foreign language has rules and will consciously learn as many of them as soon as possible; adults without any education… hm… would probably still try to interpret rules into everything much sooner and faster than children.
Also, children have little else to do. Adults typically don’t. Adults with way too much time on their hands are capable of amazing feats of memory – less than 100 years ago, there was somebody in the Congo who knew the entire Bible by heart.
Exactly. In a dependent clause*, the finite verb goes at the end, period. Most of everything else can be reshuffled for emphasis or for fun (not as much as in Classical Latin, but still), but the finite verb can only go in one place. Recognize a dependent clause ==> move the finite verb to the end.
(Likewise, in independent clauses the finite verb can only appear in the second position, and in questions it can only be number 1, if you somehow count wh- words as 0. Even in most dialects, this marking of clause types by the position of the verb hasn’t been negotiable for at least 900 years now.)
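As a rough illustration of the three placement rules just described, here is a minimal Python sketch. The function name, the clause-type labels and the idea of passing the other constituents as a ready-made list are all assumptions made for this example; it is a toy, not how any real MT engine handles German word order.

```python
# Toy sketch of the verb-placement rules described above.
# Function name and clause-type labels are illustrative assumptions.

def place_finite_verb(verb, others, clause_type):
    """Return the words of a clause with the finite verb in its fixed slot.

    others: the non-verb constituents, in the order the speaker has chosen
            (that order is free, for emphasis or for fun).
    clause_type: "independent", "question" or "dependent".
    """
    if clause_type == "independent":
        # Exactly one constituent may stand in front of the verb.
        return others[:1] + [verb] + others[1:]
    if clause_type == "question":
        # Yes/no question: the verb comes first.
        return [verb] + others
    if clause_type == "dependent":
        # The finite verb goes at the end, period.
        return others + [verb]
    raise ValueError("unknown clause type: " + clause_type)


print(" ".join(place_finite_verb("tut", ["absolut", "es", "das"], "independent")))
print(" ".join(place_finite_verb("tut", ["es", "das", "absolut"], "question")))
print(" ".join(place_finite_verb("tut", ["es", "das", "absolut"], "dependent")))
```

The sketch only fixes the verb slot; recognizing which kind of clause you are in, and deciding the order of everything else, is left entirely to the caller.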
And this is a case in point for how children vs. adults learn languages. You see, I didn’t notice on my own the fact that dependent clauses have verb-last word order. I read about it only a few years ago. In my head, relative clauses trigger verb-last word order (verb-second being the default, in contrast to historically-oriented professional linguists claiming that verb-last should be considered the default), so do many but not all conjunctions (I know by rote which ones do and which don’t – it just so happens to turn out that the ones that don’t are used to chain independent clauses, while the ones that do head dependent ones*), and, well, in constructions without a finite verb, the verb goes at the end, too.
Automatic, sure. Systematic, that’s another question entirely. Does it really behave that systematically across all types of syntactic structures?
Absolut tut es das.
Verb in second position. Adverb pulled to first position for emphasis. Subject used to be in front of the verb, but there is only one position in front of the verb, so it had to move all the way to third position. Subject distinguished from object by word order by weak default… look how weak it is:
Das tut es absolut.
Adverb pulled to last position for not quite as emotional emphasis. Object moved to front because something has to be in first position, and the verb cannot. Verb in second position. Verb always in second position.
Es tut das absolut.
…Strangely unidiomatic. But why? Why shouldn’t I be able to put the subject in the first position and achieve good old SVO word order!?! [5 minutes later] …Oh. This only applies to es, which looks too much like a dummy subject in the first position; er and sie are perfectly cromulent there. So, never mind. 🙂 So, SVO, verb in second position, verb always in second position.
Es tut absolut das.
BZZZT! Means something else, because an adverb right in front of a (pro)noun modifies that word and not the verb. Grammatically correct, colorless green ideas still sleep furiously, verb still in second position, verb always in second position.
Tut es das absolut?
You see what I did there. – Subject in front of object, adverb somewhere where it’s not directly in front of either so it can still modify the verb.
[…] es das absolut tut.
You know what goes in the ellipsis: a conjunction, or a relative pronoun optionally preceded by a preposition, itself preceded by an independent clause. (Alternatively, the independent can go at the other end or surround the dependent clause.) – Subject in front of object, adverb somewhere where it’s not directly in front of either so it can still modify the verb, but not in the last position, because that’s where the verb goes.
It absolutely does.
Subject in front of verb. Adverb directly in front of verb. Object optional and omitted here.
Absolutely, it does.
Adverb ripped out of the sentence and let loose somewhere in front of the sentence as a free-floating, dangling interjection for emphasis. Subject still in front of verb. Subject always in front of verb.
Like what?
…Oh yeah. Colloquially, in the above example sentence, the object is actually optional if it goes in the first position; if you take das tut es absolut and omit the object, you suddenly have the verb in what looks like the first position.
If I do after all still need to expect at all that…
🙂
Really? I have never felt such a provocation. What clearly is going on is that Stu shares his own emotional reactions with us, and he seems more extroverted than most here.
…but I missed the identity discussion.
David, you have made my day. At this point the occurrence of the word cromulent in your delicious long post is my favorite thing about it, with the possible exception of the final comment on Stu’s motives, but that may be only because I will have to reread it more slowly to understand the rest.
a mere orthographic convention (clauses must be separated by commas); it doesn’t impede understanding, as the English convention of never putting a comma in such places shows,
Well, I dunno, I think it impeded my understanding a little.
BZZZT!
Is that the TV presenter in The Fifth Element ? <giggles>
What clearly is going on is that Stu shares his own emotional reactions with us, and he seems more extroverted than most here.
That’s exactly right, David. You even figured that out before I did.
Obvious typo, never mind.
Mere orthography, never mind.
And yet, you had to mention it. Please don’t forget to sign my report card, so that I can show it to my parents and be appropriately punished.
“Europäischer” should be europäischen
That was a typo/muscle memory, but, as a wise man once said, never mind.
it means something else.
And how does it do that? I thought I was merely substituting the definite article for the noun, as I’ve seen many native speakers do:
Wortstellung der anderen europäischen Sprachen > die (N) der anderen europäischen Sprachen > der (G) der anderen europäischen Sprachen.
Just FYI, I agree with Mr. Cowan’s* analysis. I’m not sure about the course of action he has chosen, but his description of what’s going on is spot on.
* To whom and to Mr. Emerson best wishes on their name day.
empty: But are you sure that you are so far from being some kind of big finger-waggler yourself?
Sure I’m sure. I am often abrasive and cantankerous – not for me the tut-tutting of an old woman. I don’t waggle my finger, I jab with it. And I don’t frown on what displeases me, I spit bile at it.
It’s called method acting.
Forgot to mention this… morals have, on the global average, most obviously improved since the early 70s, and they continue to do so. Totalitarianism is now more widely considered immoral; Illinois has abolished the death penalty, even if it took much longer than Albania; discrimination is considered acceptable for fewer and fewer things, with, say, gay marriage now being legal in several countries and several US states; and this year, democratic governments are stopping their traditional practice of supporting dictatorships that use violence against their own masses. I could go on, but I don’t have all night 🙂
For contrast.
I see.
The typo does that by substituting a genitive ending for another genitive ending, the one that is used without an article for the one that goes with the definite article.
der anderen europäischen Sprachen
“of the other European languages”
one single genitive phrase
der: definite article, genitive plural
anderen europäischen: pronoun and adjective, genitive plural with definite article
Sprachen: noun, genitive plural
(…and all of them feminine of course)
der anderen europäischer Sprachen
“of the others of European languages”
two nested genitive phrases, one with a definite article, one without any article
{der anderen {europäischer Sprachen}}
{of the others {of European languages}}
der: definite article, genitive plural
anderen: pronoun, genitive plural with definite article
europäischer: adjective, genitive plural without article; feminine like all other words in this phrase, so it happens to fit
Sprachen: noun, genitive plural
(Uh, yeah. The indefinite article lacks plural forms, so my claim of three declensions isn’t accurate for the plural. There are just two for each adjective and pronoun.)
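The contrast between the two phrases can be put as a tiny rule. This is a minimal Python sketch covering only the genitive plural case discussed here; the function names are made up for the example, and it makes no attempt at the rest of the adjective declension.

```python
# Toy sketch: adjective ending in the genitive plural only.
# Function names are illustrative assumptions.

def genitive_plural_adjective(stem, has_definite_article):
    """Pick the genitive plural ending for an adjective stem."""
    # After the definite article "der" the ending is -en;
    # with no article at all the ending is -er.
    return stem + ("en" if has_definite_article else "er")


def genitive_plural_phrase(stem, noun, has_definite_article):
    """Build a one-adjective genitive plural phrase."""
    adjective = genitive_plural_adjective(stem, has_definite_article)
    words = (["der"] if has_definite_article else []) + [adjective, noun]
    return " ".join(words)


print(genitive_plural_phrase("europäisch", "Sprachen", True))
print(genitive_plural_phrase("europäisch", "Sprachen", False))
```

So an -er ending after "der anderen" signals that a new, article-less genitive phrase has begun, which is why the typo reads as two nested phrases.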
…
http://en.wikipedia.org/wiki/Method_acting
Are you seriously trying to say you don’t actually have all these emotions, you’re faking them in order to make them natural for yourself, in order to learn them?
Surely not in order to convince yourself? Because that would be evil.
And so, to bed.
Jeeze, can’t I leave you people alone for five minutes without all hell breaking loose? Grumbly isn’t a troll (though he can be annoying), and I’m not banning anybody. Now everybody behave and study David’s explication of the splendeurs et misères de la langue allemande. There will be a quiz.
I’ll start by putting 5 contrition units in the kitty. Who’s next ?
With Chinese and Japanese it’s pretty bad.
With Japanese it’s spectacularly bad, with Turkish and other OV languages too.
Another thing puzzles me that I haven’t mentioned yet, the fact that GT does so badly with two completely non-exotic, European languages: Spanish (as Julia brought up) and German. I mean “non-exotic” from the point of view of the Amero-Eurocentric boffins who were the driving force behind MT for a long time, and maybe still are.
I would have expected that the structures of Big Western languages were so well-known and thoroughly researched that MT could by now put on a good showing with them, at least. But this is not the case, as I have demonstrated here for German, and as Julia claims for Spanish.
I can think of a number of disobliging explanations for this, but I want to cool it until I hear whether others think this is a fair appraisal of the situation.
I need to be more careful, and write only “GT” when I mean GT, instead of slipping into the “MT” generalization.
I don’t ever find Grumbly annoying.
I don’t really agree with Grumbly, but I think it’s worth noting that there are a lot of things Google Translate is reliably bad at, and nothing that it’s reliably good at. There are things that you can try to use it for, but nothing that you can rely on it for. Even in the best case, when the stars align correctly and it generates a perfectly intelligible, grammatically correct sentence, it will sometimes, frighteningly, be factually mistaken; for example, a current Haaretz article quotes Hezbollah’s Secretary-General as saying that Syria is the only one standing against Israel and the U.S., but Google Translate thinks it’s the only one standing with Israel and the U.S. (That’s Hebrew-to-English, of course. English-to-Hebrew, it never generates a grammatically correct sentence to begin with. That falls in the “reliably bad” category, whereas Hebrew-to-English is very hit-or-miss: not reliably terrible, but not reliably adequate.)
Wow, that’s quite an endorsement to say that you use Google Translate constantly. I was wondering what people think of it.
Google Translate has become TERRIBLE for French. It does not comprehend phrases – only individual words – and I have found it getting worse and worse over time. Or even within the same day: it is as if it were using a different underlying engine from hour to hour to do the translation. It was actually better a few years ago at working out how the use of certain words in phrases changed the translation of words than it is now. Each time I use it now (mainly to check verb tenses, because many French verb tenses “sound” the same when pronounced and only the spelling reveals the tense … and I learned by hearing it, so often I misspell it) I realise I cannot even use it. It is pathetic, especially at knowing the difference between using a word like “le” or “la” for it versus he or she, and at possessive terms; even a common phrase like “chez Jules” is not translated properly today. There used to be a way to correct translations online and even submit a new sense for your translation, but today it isn’t there (other days it is – that seems to “come and go”), and then the “help us make google translate better” link gives you a list of phrases and their supposed translations and asks you to confirm one word in the whole translation. That is also terribly designed, again prompting you to choose only the corresponding single word, when often the translation they have is wrong or translated too “freely” rather than a bit more “literally” to help automated translation, so you cannot do a one-to-one word correspondence. All you can do is choose to “skip” it rather than have some choice saying “no single word corresponds” or something more reasonable. Also, I have been prompted on multiple occasions to find the “word” that corresponds to an apostrophe in a French sentence in this part of Google. Clearly they have gotten something wrong now, because it is going downhill rather than becoming more accurate!