MIT Technology Review has a brief but intriguing article called “How Google Converted Language Translation Into a Problem of Vector Space Mathematics.” If I could only have read it (or rather, the paper it’s based on) when I was a math major, forty-plus years ago!
The new trick is to represent an entire language using the relationship between its words. The set of all the relationships, the so-called “language space”, can be thought of as a set of vectors that each point from one word to another. And in recent years, linguists have discovered that it is possible to handle these vectors mathematically. For example, the operation ‘king’ – ‘man’ + ‘woman’ results in a vector that is similar to ‘queen’.
It turns out that different languages share many similarities in this vector space. That means the process of converting one language into another is equivalent to finding the transformation that converts one vector space into the other.
This turns the problem of translation from one of linguistics into one of mathematics. […]
The method can be used to extend and refine existing dictionaries, and even to spot mistakes in them. Indeed, the Google team do exactly that with an English-Czech dictionary, finding numerous mistakes.
That would have been right up my alley. Alas, having forgotten all the math I once knew, I can only gape and wonder if it’s all it’s cracked up to be. (Thanks, Nick!)
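For anyone who wants to see the excerpt's 'king' – 'man' + 'woman' arithmetic in action, here is a toy sketch with invented four-dimensional vectors; nothing here comes from Google's actual model, which learns its vectors from huge text corpora:

```python
# Toy illustration of the 'king' - 'man' + 'woman' ~ 'queen' arithmetic.
# These four-dimensional vectors are invented for the sketch; real word
# embeddings are learned from large corpora and have hundreds of dimensions.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "queen": np.array([0.9, 0.1, 0.8, 0.7]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
    "cat":   np.array([0.2, 0.3, 0.3, 0.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = vectors["king"] - vectors["man"] + vectors["woman"]

# Rank the toy vocabulary by similarity to the computed vector.
for word, vec in sorted(vectors.items(), key=lambda kv: -cosine(target, kv[1])):
    print(f"{word:6s} {cosine(target, vec):.3f}")
```

With real embeddings, the same nearest-neighbour lookup is what produces 'queen' as the answer.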
Well, having read the article I’m none the wiser. I’d like to see how it works in detail. Their examples seem to suggest some kind of analysis into semantic primitives. Indeed, the article seems to suggest that they can decipher completely unknown languages merely by analysing their ‘language space’. So I guess that the decipherment of unknown scripts and languages is just around the corner.
The article seems amazingly simplified when it says that ‘It relies on the notion that every language must describe a similar set of ideas, so the words that do this must also be similar. For example, most languages will have words for common animals such as cat, dog, cow and so on’. This is fine as a larger generalisation about nouns, but rather naive when it comes to more subtle distinctions and verbs. And translation also has to take into account structures, which trip Google Translate up awfully for languages like Japanese.
Given what number-crunching is able to do, I certainly wouldn’t dismiss this kind of thing, but I really would like to know how this method works. Do they really just crunch thousands of documents and discover that ‘that word’ is ‘2’, ‘that one’ is ‘3’, ‘that one’ means ‘cow’, and so on?
It’s a neural net method– they use large datasets to ‘train’ a sort-of context-recognizer. The context-recognizer picks up local semantic correlations in a sort-of geometrical space. The article is accurate as far as it goes, and it’s interesting that a sort-of geometrical ‘closeness’ turns out to correspond to a sort-of semantic context.
But I don’t see a statement of how many dimensions this ‘context space’ has– and that makes a difference– since, e.g., ‘closeness’ in three dimensions has quite a different flavor from ‘closeness’ in 100 dimensions. But it is interesting and possibly useful.
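To make the difference concrete (a toy numerical sketch, nothing taken from the paper): random directions are almost never close in high dimensions, so an observed ‘closeness’ carries much more information in 100 dimensions than in 3.

```python
# Average |cosine| between random unit vectors shrinks as dimension grows,
# so 'close' means something much stronger in high-dimensional spaces.
import numpy as np

rng = np.random.default_rng(0)

for dim in (3, 100):
    x = rng.standard_normal((2000, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)      # 2000 random unit vectors
    cos = np.abs((x[:1000] * x[1000:]).sum(axis=1))    # |cosine| of 1000 random pairs
    print(f"dim={dim:3d}  mean |cosine| of random pairs: {cos.mean():.3f}")
```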
The broader point is that the strategy that has been successful over the years is combining a lot of 70% to 80% accurate methods to get overall accuracy in the 90’s, and this looks like one more arrow in the methodological quiver.
“The article seems amazingly simplified when it says that ‘It relies on the notion that every language must describe a similar set of ideas, so the words that do this must also be similar. For example, most languages will have words for common animals such as cat, dog, cow and so on’. This is fine as a larger generalisation about nouns, but rather naive when it comes to more subtle distinctions and verbs.”
It doesn’t even work with nouns. The gaps between scientific and common names and terminology for animals and plants are one obvious example of how these don’t have to be arranged in anything resembling a standard pattern. Kinship terms are another obvious example.
Perhaps languages do have to describe a similar set of facts, but they absolutely do not have to have the same ideas about them, they don’t have to categorize them in any strictly standard way, and when we do find similarities, they are often either banal or random.
This all sounds a lot like what Scott DeLancey called the objectivist fallacy as it applies to analyzing syntactic processes. Language communities just don’t agree to say all the same things, even about essentially identical situations.
Eppur si muove, apparently. If they’re using it to correct dictionaries, it’s not all fallacy.
But can it do sounds?
http://www.huffingtonpost.com/2013/09/28/proto-indo-european-language-ancestors_n_4005545.html
Does it work for bad language?
http://www.lrb.co.uk/v35/n18/colin-burrow/frogs-knickers
I like how if you look at the (unlabeled) axes on the English numbers and Spanish numbers graphs, the scales don’t match up.
Essentially all that computers know how to do is linear algebra. The art of numerical analysis consists of taking some (possibly nonlinear) problem that you want to solve and somehow approximating it by a finite-dimensional linear problem that you can throw at a computer. So, find a new way to project a problem onto some linear space, and voilà: a new algorithm!
Click through to the paper itself if you want to get into the nitty-gritty.
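A trivial example of that move, just to make it concrete (nothing specific to this paper’s method): approximating a nonlinear curve with a fixed set of basis functions turns into an ordinary least-squares problem a computer can solve directly.

```python
# Approximate a nonlinear function by the best combination of a few fixed
# basis functions; finding the coefficients is plain linear algebra.
import numpy as np

x = np.linspace(0, np.pi, 50)
y = np.sin(x)                                   # the 'nonlinear problem'

# Linear space spanned by 1, x, x^2, x^3.
A = np.vander(x, 4, increasing=True)
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

print("coefficients:", np.round(coeffs, 3))
print("max error:", np.abs(A @ coeffs - y).max())
```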
I guess what they do is establish correspondence between languages by first choosing pairs of words with (almost) the same meaning and then looking at how new words appear alongside these reference words in both languages and finding the closest matches. For example, if mouse = souris, cat = chat and dog = chien, then you can go and see how “dog chases cat”, “cat caught [a] mouse”, etc. are distributed in English, find similar distributions involving “animal V animal” in French, and learn which French verbs correspond to which English ones.
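Something like this crude sketch, say – toy corpora, hypothetical word lists, and emphatically not the paper’s actual method:

```python
# Describe each unknown word by how often it co-occurs with the seed words,
# then compare the resulting profiles across the two languages.
from collections import Counter

seeds_en = ["cat", "dog", "mouse"]
seeds_fr = ["chat", "chien", "souris"]   # aligned with seeds_en

en_corpus = [
    "the dog chases the cat", "the cat caught a mouse",
    "a dog barks at the cat", "the cat eats the mouse",
]
fr_corpus = [
    "le chien chasse le chat", "le chat attrape une souris",
    "un chien aboie contre le chat", "le chat mange la souris",
]

def profile(word, corpus, seeds):
    """Count how often `word` shares a sentence with each seed word."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            for seed in seeds:
                counts[seed] += tokens.count(seed)
    return [counts[s] for s in seeds]

# 'chases'/'chasse' and 'eats'/'mange' end up with matching profiles.
print("chases:", profile("chases", en_corpus, seeds_en))
print("chasse:", profile("chasse", fr_corpus, seeds_fr))
print("eats:  ", profile("eats", en_corpus, seeds_en))
print("mange: ", profile("mange", fr_corpus, seeds_fr))
```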
I think you need to read it all in the context of automated translation. No one is claiming that this is a detailed insight into languages or translation generally – it’s just a general tool that may improve automated translation. No one’s claiming the method corrects traditional dictionaries built with much human effort – the point is that the linear algebra approach is (sometimes) able to correct the very computer-generated (Google Translate) dictionaries the transformation was built on.
The general idea that words could be represented as points in vector spaces in a way which allows a linear transformation to serve as a word-for-word translation technique makes some sort of sense in terms of semantic primitives, as long as we acknowledge that we’re only aiming for a fairly crude result. The point here (the focus of an earlier paper rather than this one) is that it seems to “work” when the particular spaces used are based on the contexts that each word appears in. To very roughly extend the example: ‘queen’ and ‘king’ might both be found near things like ‘reign’, while in other ways their contexts are more like those of ‘woman’ (‘she’, etc.) and ‘man’ respectively.
So, they put a whole lot of source-language words in one such space, a whole lot of target-language words in a (lower-dimensional) space, get a ‘dictionary’ of word pairs (in this case from Google Translate), find a linear transformation to represent it, and see what comes out. Nothing spectacular in the sense of ‘now we can translate everything’, but something that can apparently be used to improve existing approaches.
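The mechanics of that last step can be sketched in a few lines. The embeddings below are randomly generated stand-ins, produced from a linear map, so the sketch only shows how the fit and nearest-neighbour lookup work, not that real languages actually line up this way:

```python
# Fit a matrix W on 'dictionary' pairs by least squares, then 'translate' a
# held-out word by mapping its vector and taking the nearest target vector.
import numpy as np

rng = np.random.default_rng(1)

d_src, d_tgt, n = 6, 4, 40                    # toy dimensions and vocabulary size
true_map = rng.standard_normal((d_src, d_tgt))

X = rng.standard_normal((n, d_src))                        # source-language vectors
Z = X @ true_map + 0.05 * rng.standard_normal((n, d_tgt))  # their 'translations'

# Fit W on the first 30 dictionary pairs only: X[:30] @ W should be close to Z[:30].
W, *_ = np.linalg.lstsq(X[:30], Z[:30], rcond=None)

# For each held-out source word, map it with W and pick the nearest target vector.
correct = 0
for i in range(30, n):
    mapped = X[i] @ W
    nearest = np.argmin(np.linalg.norm(Z - mapped, axis=1))
    correct += (nearest == i)
print(f"held-out words translated correctly: {correct}/10")
```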
Sure, it’s not going to cope with all the differences between languages. The paper isn’t dealing with translating sentences at all, so there’s no dealing with structure. I guess it’s conceivable that this approach could do something without an existing dictionary, but that’s not what’s dealt with here. And I can’t understand why George Grady thinks the scales on their graphs are relevant at all.
@Jonathan D: Thanks for clarifying that. The author of the MIT Technology Review article clearly believed that they are talking about improving on “traditional dictionaries built with much human effort”, but said author was simply mistaken, and you are clearly correct. From the paper itself:
> To obtain dictionaries between languages, we used the most frequent words from the monolingual source datasets, and translated these words using on-line Google Translate (GT).
What a difference that makes!
MattF: It’s a neural net method– they use large datasets to ‘train’ a sort-of context-recognizer. The context-recognizer picks up local semantic correlations in a sort-of geometrical space. The article is accurate as far as it goes, and it’s interesting that a sort-of geometrical ‘closeness’ turns out to correspond to a sort-of semantic context.
The summary doesn’t mention neural networks. The “original paper” linked by Ben Zimmer does. Having skimmed it, I get the impression that the whole business has a simpler description in terms of basic topology. I’ll now mention a few points in support of this claim.
First of all, the “sort-of geometrical space” is defined by means of “local semantic correlations” for a specific language, so it’s no surprise that local semantic correlations can be found in it. The “vector space” for a specific language is a finite set of words from that language on which a “semantic topology” is imposed, and the “dimension” of the vector space is the number of words.
The “correlations” of a word are the various semantic contexts empirically determined for that word. This space is just a set of words with their various “semantic vicinities” taken as a subbase. A word w2 that occurs in some “semantic context” of w1 is defined to be “in the vicinity” of w1.
That’s a global view of things. But the authors are not interested in the global topology of any one of these language-specific “vector spaces”, it seems. The “vector spaces” they are interested in are just specific “local semantic correlations” around specific words. The original paper says that the vicinities are defined (measured) by “Skip-gram or CBOW (continuous-bag-of-words) models”. I like the expression “continuous bag of words”, but it’s hardly a new technique. Something similar is used by spell checkers that suggest corrections – the corrections are “near” the misspelled word in a vicinity of “similarly spelled” legitimate words. “Similarly spelled” refers to some algorithm defining a metric.
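For concreteness, the skip-gram/CBOW training step can be reproduced with a reimplementation such as gensim’s. This is a sketch assuming gensim (version 4) is installed; the toy corpus is far too small to yield meaningful vectors and only shows the mechanics:

```python
# Train word vectors from sentence contexts with gensim's Word2Vec.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "the", "dog"],
    ["the", "woman", "walks", "the", "dog"],
]

# sg=1 selects skip-gram; sg=0 would select CBOW (continuous bag of words).
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["king"].shape)               # a 20-dimensional vector for 'king'
print(model.wv.most_similar("king", topn=3))
```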
Stu, I’m not sure you’ve got the right topology there, but in any case, if you’re only looking at topology, then you’re missing the part that they’re claiming is new. Their claim is that there is enough semantic structure in the context-defined space that a linear transformation based on a relatively small number of word pairs may inform (single word/phrase) translation more generally. They’re definitely using the linear structure, not just a metric.