NGRAM.

December 17, 2010 by languagehat 52 Comments

An n-gram is “a subsequence of n items from a given sequence.” Google has come up with what it calls an Ngram Viewer that allows you to compare the frequencies of words in printed books over any span of time since the invention of printing. (It’s case-sensitive, so you can discover, as Shaun Nichols did, that people stopped capitalizing “Socialism” around 1945.) You can read about it in this SciAm article by Katherine Harmon:

The researchers behind the Books Ngram Viewer admit it will not likely replace tried-and-true techniques of close reading…. Despite the program’s capacity to churn out neatly organized analytics at the click of a button (labeled, cheekily, “search lots of books”), Aiden maintains that “we certainly don’t view this tool as an answer machine.” But certainly the program can work as a question generator.

For example, the evolution of the frequency of “evolution” … reveals some unexpected nuances. It was on a general upswing until the mid-1920s, then declined gradually until around 1945 (from about .0035 percent of words in the measured data that year to about .0025 percent). Why the dip—and is it significant? The researchers were unsure and offer this as an example of a lead in for further research, Michel notes.

The Books Ngram Viewer also can shed some light on the popularity of various people, revealing, for instance, a marked dearth of references to Jewish artist Marc Chagall in books published in Nazi Germany, suggesting widespread censorship, the researchers concluded in their paper. (For those more keen on following scientists, the frequency of “Albert Einstein” mentions surpasses those of “Charles Darwin” in the late 1960s, but both enjoy a rise in popularity from about 1975 to 2005, according to a recent search—and the researchers found that Freud ranks higher over time than Einstein or Darwin.)

The first thing I did with it was to check linguistics versus philology; the graphs cross just before I was born.

Addendum. See Geoff Nunberg’s post at the Log for more detail and some interesting commentary.
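
To make the quoted definition concrete, here is a minimal Python sketch of pulling n-grams out of a sequence and counting them case-sensitively. The function name `ngrams`, the sample sentence, and the whitespace tokenization are all illustrative assumptions; nothing here is taken from Google's own implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n items from tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical sample text; splitting on whitespace stands in for real tokenization.
text = "Socialism and socialism are counted apart because matching is case-sensitive"
tokens = text.split()

# 1-grams are single words; counting them without lowercasing keeps
# "Socialism" and "socialism" as separate entries.
unigram_counts = Counter(ngrams(tokens, 1))

# 2-grams are adjacent word pairs such as ("Socialism", "and").
bigrams = ngrams(tokens, 2)
```

Dividing each count by the total number of n-grams for a given year would give the kind of relative frequency the graphs plot, under the same assumptions.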

Comments

Alon says

December 17, 2010 at 9:12 am

First thing I did was check whether the notoriously unreliable metadata in Google Books had been fixed for this purpose, with less than ideal results.
This search yields quite a few results for peronismo published before 1943, the birth date of said movement. I haven’t checked the source books, but there can be little doubt they’re mistagged. There are also mentions of franquismo in the 19th century, and Clinton appears to have been popular well before he was born…
John Cowan says

December 17, 2010 at 10:11 am

You need to look at the actual data (when and where available) to distinguish mistagging from mis-OCRing. For example, the Diccionario de la lengua castellana is correctly dated 1851, but the text actually has “(cronismo”, not “peronismo”, from the wrapping of “anacronismo”. This sort of error is going to be unavoidable even in the most carefully tagged book corpus. The early Bill Clinton hits do seem to be mistagging, though.
Five million books is a lot of books.
AJP Crown says

December 17, 2010 at 11:11 am

I was interested in John’s comment at L.Log that you can trace the prominence of certain years, and I find that “1917” has been subsequently mentioned more in Russian books than “1918” has, as you would expect, except throughout WW2; just prior to Stalin’s death and right after it, and during the early ‘seventies they were equal. I don’t know quite what it means, but I’ll think of something. I’ll certainly be able to waste hours with this.
Alon says

December 17, 2010 at 11:13 am

Point taken, John, although there seem to be quite a few cases of actual mistagging.
AJP Crown says

December 17, 2010 at 11:15 am

Here’s my graph. Hooray for the humanities, whatever they are.
JR says

December 17, 2010 at 12:40 pm

If I did it right, looks like since 1850, “Estados Unidos” has always been more common than “los Estados Unidos”. And around 1960 they really began to diverge.
YM says

December 17, 2010 at 1:38 pm

Racial epithets for some reason all peak around 1940.
YM says

December 17, 2010 at 1:42 pm

Oo, this is fun. ‘color’ and ‘colour’ are pretty much mirror images of each other, until about 1960. Is this real?
Matt McIrvin says

December 17, 2010 at 2:19 pm

Speaking of physics: The first words I tried were “proton” and “neutron”. Once the neutron was discovered, they rose in stately synchrony with “proton” just a little bit higher, pretty much what you’d expect from fundamental physics (and chemistry) literature.
Then: the atom bomb, and suddenly “neutron” shot way up and remained higher until the end of the Cold War, when things returned to the way they were before 1945, “neutron” just a little bit below “proton”… only this time both terms were declining in frequency.
Ø says

December 17, 2010 at 3:36 pm

turtle doesn’t get ahead of hare until around 1980
Geraint Jennings says

December 17, 2010 at 4:19 pm

There’s some rather annoying hyphen interpretation going on, but it seems that it provides a rough answer to a question I’ve been unable to provide a satisfactory response to for years. The Channel Islands are les Îles de la Manche in French in the Channel Islands, but modern usage in France calls them les îles Anglo-Normandes in French (it’s slightly more complicated than that, as “Channel Islands” and “Îles de la Manche” can be taken to include Chausey, which is under French sovereignty, whereas “îles Anglo-Normandes” explicitly excludes Chausey). It’s been quite clear that “Îles de la Manche” predominates in 19th century texts both in insular and continental sources, but the question has been: when did “îles Anglo-Normandes” become the dominant form in metropolitan French? A quick comparison in the French corpus indicates that it starts to overtake the alternatives (both Îles de la Manche and the more literary Archipel Normand) in the early 1920s and predominates from the 1960s onwards. Which seems credible. All in all, a tool to be used with caution, but a useful tool nonetheless.
Paul Ogden says

December 17, 2010 at 5:51 pm

“The (literary) critic is in the position of a mathematician who has to deal with numbers so large that it would keep him scribbling digits until the next ice age even to write them out in their conventional form as integers. Critic and mathematician alike will have somehow to invent a less cumbersome notation.”
— Northrop Frye, in his “Polemical Introduction” to “Anatomy of Criticism”, Princeton University Press, 1957.
Dr. Weevil says

December 17, 2010 at 6:24 pm

I don’t know why Alon (1st comment) is so surprised to find prenatal mentions of “Bill Clinton” in books. Neither name is at all rare, and there is no reason for the combination to be particularly rare. There must have been dozens or hundreds of “Bill Clinton”s in the history of the world, and it’s not surprising that at least one of them was occasionally mentioned in books before the 42nd president. A few bumps followed by a sudden rise around 1980 and a steep upward slope in the 1990s is just what I would expect of such a name. On the other hand, the 13th president was very likely the first person in the history of the world to bear the name “Millard Fillmore”, so I would expect prenatal mentions of that name to be zero.
thegrowlingwolf says

December 17, 2010 at 10:51 pm

What a fun little Google toy…I’ve, like the imp I am, been feeding it the F word vs. Mother…very interesting…or even something simple like Heaven,Earth…good fun…is Google gathering information off this? Good old Google. How ’bout checking Search engine,Google.
Orwell,
Ur fiend,
thegrowlingwolf
MMcM says

December 18, 2010 at 12:36 am

first person in the history of the world to bear the name “Millard Fillmore”
Phoebe Millard married Nathaniel Fillmore in 1796. So her name matched for this purpose for four years before their eldest son was born.
Grumbly Stu says

December 18, 2010 at 3:46 am

The WiPe on Millard Fillmore has a useful example of how to add your 2 cents’ worth to any WiPe article. Just figure out an arbitrary relationship between a detail in the article and anything else that strikes your fancy, then comment on that relationship in a parenthesis:

Fillmore was born in a log cabin in Moravia, Cayuga County, in the Finger Lakes region of New York State, to Nathaniel Fillmore and Phoebe Millard, as the second of nine children and the eldest son. (As this was three weeks after George Washington’s death, Fillmore was the first U.S. President born after the death of a former president.)
Grumbly Stu says

December 18, 2010 at 4:19 am

I’ve now deleted the parenthetical remark, giving as change summary: “removed parlor-game irrelevancy”. If someone restores it, that will bolster my claim that someone’s 2 cents’ worth is at issue here.
AJP Crown says

December 18, 2010 at 6:07 am

I plotted Beatles vs Rolling Stones (not even close), and found a blip for the Beatles around 1900. It’s a typo for “beagles”; so much for distinguishing between upper- & lower-case letters, let alone different words. For two minutes, I thought I could write a paper.
AJP Crown says

December 18, 2010 at 6:36 am

“hat” has been going up ever since this blog started. It should reach “language” in about thirty years — well before your 100th birthday, Language.
iching says

December 18, 2010 at 6:54 am

@YM: I agree, this is fun! But I tried colour vs color too and I don’t understand your comment. My result shows colour beating color by 10:1 in 1800, declining to 1:1 by about 1890, then starting to favour color over colour, reaching about 3:1 in 2000. But doesn’t this just reflect the number of books published in AmE versus BrE? Similar result for defence versus defense, except the lines cross about 1920 and remain about 2:1 in favour of defense from the mid-1940s to 2000.
Grumbly Stu says

December 18, 2010 at 7:22 am

O Crownicler of our deceptive times, that’s a clever exposé. You don’t mention that this “reaching” will (by extrapolation) take place not merely because the frequency of “hat” is going up, but also because the frequency of “language” is going down. And yet the graph clearly shows that that is what will be happening. Something important is involved here that I’d never thought of before.
When we speak of a thing “doing” something – in this case, a “hat” token “increasing” in frequency, in order to attain the frequency of a “language” token – we are imputing agency to that thing, whether it is a person or not. Your example makes it clear that an imputation of agency requires an imputation that “all other things remain equal” – these are two sides of the same coin. To put it another way: when one imagines that everything is changing, then there is nothing that can be distinguished in a meaningful way as an “agent”.
In the present case, one could also say that the “hat” token is a patient. It continues as before, while waiting for the “language” token to decrease in frequency.
In other words, an agent needs patients conceptually, just as a doctor needs them financially. However, the more agents you try to imagine, the less plausible it is that any of them can actually accomplish anything. This is more similar to a situation where too many cooks spoil the broth.
Grumbly Stu says

December 18, 2010 at 8:18 am

I should perhaps explain that I am recovering from a bad cold, while reading Luhmann and also the very peculiar novel Die Blendung by Canetti.
This has been translated as “Auto da Fe”. Hmmm… According to the “From the back cover” section for one edition at amazon: “Auto da Fe was first published in Germany in 1935 as Die Blendung (The Building or Bedazzlement)”. I daresay somebody screwed up with “The Building” – if anything, that should be “The Blinding”. An anachronistically evocative translation of the title would be “Blinded by Science”.
language hat says

December 18, 2010 at 9:08 am

“hat” has been going up ever since this blog started.
And “language” (as Grumbly points out) has been plummeting since the mid-’90s—I wonder why?
language hat says

December 18, 2010 at 9:10 am

And now it dawns on me: the prescriptivists were right all along! All these superficially harmless misusages and abusages, which I’ve been thoughtlessly defending… they’re killing language!
dearieme says

December 18, 2010 at 10:17 am

Freud a “scientist”? Good grief. Only in the Global Warming sense, Shirley?
Grumbly Stu says

December 18, 2010 at 10:23 am

they’re killing language!
No no, Hat, they’re just killing “language”. There is still much to be lost.
MMcM says

December 18, 2010 at 10:35 am

I plotted Beatles vs Rolling Stones
Or Dylan vs Donovan, which would be hard to explain if it were really about either of them before maybe the last decade.
a blip for the Beatles around 1900
There are googles before the number, as onomatopoeia.
A J P Crown says

December 18, 2010 at 10:59 am

As individuals, Dylan beats them all, of course. Keith Richards crawls along the bottom, only crossing Ringo Starr.
A J P Crown says

December 18, 2010 at 11:05 am

as onomatopoeia
Is “beat” itself onomatopoeic in origin?
John Cowan says

December 18, 2010 at 10:22 pm

Nobody knows, Crown. It’s one of those words confined to the Germanic family.
A J P Crown says

December 19, 2010 at 6:59 am

Thanks.
Ø says

December 19, 2010 at 8:09 am

What about the Spanish (?) word bate (two syllables) in the children’s song Bate bate chocolate? Is it a Spanish word? Does it mean “beat”? Is it of (English or other) Germanic origin?
language hat says

December 19, 2010 at 8:40 am

Spanish batir is from Latin battuere, which the OED says is “perhaps of Celtic origin.”
Zackary Sholem Berger says

December 20, 2010 at 11:26 am

Looks like “blue” was more popular once than “green.”
I can’t imagine why, though.
Greg says

December 21, 2010 at 12:02 am

I found it interesting to compare different states. See California vs. Michigan. The graph may mirror the rise and fall of the respective state economies.
Kerry NZ says

December 21, 2010 at 3:13 am

Comparing different German economists/sociologists in German books was interesting – go the Simmel!
Kerry NZ says

December 21, 2010 at 3:16 am

Tonnies does much better in English or Spanish language books than German or French language compared to Sombart and Schmoller. Simmel rules in all languages.
Thanks for the pointer – this tool is so cool
bruessel says

December 21, 2010 at 4:31 am

I’m not quite sure how this works, but wouldn’t Simmel also include Johannes Mario Simmel, the writer of popular fiction?
marie-lucie says

December 22, 2010 at 12:14 pm

The Channel Islands
I first heard about “les îles anglo-normandes” as a child, when my family spent a month or so in one of the beach towns near Granville on “La Manche”, from which one could take ferries to the islands. Just now (staying with my family) I asked my sisters, who recently went together to “Guernesey”, what the group is called, and they replied in chorus “les îles anglo-normandes”. Perhaps some people use “les îles de la Manche”, but I am not familiar with this use, a direct translation from English, nor with “l’archipel de la Manche”, which would suggest to me a larger number of islands.
Ø says

January 1, 2011 at 10:01 pm

How did we ever get by without Ngram Viewer? Today I used the place-name “Djakarta” in a game, and when my son doubted the spelling with “D” I was able, not only to show him that it’s a real spelling, but to muster hard evidence, in graphic form, that for about thirty years in the 20th century it was more common than “Jakarta”.
iching says

January 4, 2011 at 8:29 am

I have been trying to think of a reason for the decline in the sum of the frequencies of the “won’t” and “will not” tokens in the Ngram viewer English corpus. As Rick S said on a January 4 Language Log post here, they appear to have declined from the early 1800s to 2000 by 40% (.020% to .012%).
Can anyone think of a plausible explanation? Or if it’s not real is it some artefact of Google Books or the n-gram methodology?
Including “shall not” and “shan’t” doesn’t make much of a difference since “shall not” has declined from .008% to less than .002%, and “shan’t” has always been at least an order of magnitude rarer than that. I even discovered that “willn’t” was also a form in the 1800s, but too rare to have a bearing on the issue.
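
For anyone checking the arithmetic, here is a small Python sketch using only the rounded figures quoted above. The helper name `relative_decline` is made up for illustration, and because “shall not” is given as “less than .002%”, its end value is treated as an upper bound.

```python
def relative_decline(start, end):
    """Fraction of the starting frequency that has been lost by the end."""
    return (start - end) / start

# Frequencies in percent of all words, as quoted (early 1800s -> 2000).
will_start, will_end = 0.020, 0.012      # "won't" + "will not"
shall_start, shall_end = 0.008, 0.002    # "shall not"; end value is an upper bound

will_drop = relative_decline(will_start, will_end)      # about 0.40
shall_drop = relative_decline(shall_start, shall_end)   # at least about 0.75

# Adding "shall not" to the total does not cancel the decline:
# the combined series still loses about half of its starting frequency.
combined_drop = relative_decline(will_start + shall_start, will_end + shall_end)
```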
One explanation could be that there are other words or constructions that have come into use to partially replace “won’t” and “will not”. I tried “isn’t going to”, “aren’t going to” and “ain’t gonna” but these cannot be the solution either, since they are also relatively uncommon compared to “won’t” and “will not”. The trend is rather interesting though, all fairly similar, with a rise through the 20th century culminating in a peak around 1942, followed by a decline ending around 1962, then a rise to an all time record in 2000 (.00006%, .00004% and .00003% respectively).
One thing about the Ngram viewer is clear though. The linear scale on the vertical axis is a dog to work with when plotting more than one token. If the scale was logarithmic then parallel lines would really mean a similar trend, since the ratios of the frequencies would be constant. On a linear scale the fact that parallel lines/curves have equal percentage point differences is pretty meaningless and useless.
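
A minimal Python sketch of that point, with frequencies invented purely for illustration: one token is always exactly ten times as common as the other, so the two share the same trend. The gap between them shrinks on a linear scale but stays fixed once logarithms are taken.

```python
import math

# Hypothetical frequencies (percent of all words) at three dates.
common = [0.020, 0.016, 0.012]
rare = [0.0020, 0.0016, 0.0012]   # always one tenth of `common`

# Vertical distance between the two curves on a linear axis:
# it shrinks over time, so the curves do not look parallel.
linear_gaps = [c - r for c, r in zip(common, rare)]

# Vertical distance on a log10 axis: log(c) - log(r) = log(c / r),
# which is constant whenever the ratio c / r is constant.
log_gaps = [math.log10(c) - math.log10(r) for c, r in zip(common, rare)]
```

With these numbers every entry of `log_gaps` is 1 (one order of magnitude) up to floating-point rounding, while `linear_gaps` falls from 0.018 to about 0.011.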
Ø says

January 4, 2011 at 9:16 am

“will” seems to have declined just as much as “will not” and “won’t”
iching says

January 4, 2011 at 9:34 am

“will” seems to have declined just as much as “will not” and “won’t”
True, but this is complicated by the various noun definitions of “will”. Perhaps we write less about philosophy and “free will” or about legal documents (“last will and testament”) than in the past. “Won’t” and “will not” seem comparatively straightforward to analyse. Part-of-speech tags like those COCA has would be useful, but they are not available for Ngram.
Trond Engen says

January 4, 2011 at 10:39 am

I was going to suggest the rise of going to, but apparently not.
Ø says

January 4, 2011 at 10:45 am

“I will”, “you will”, and “he will” have declined a lot.
iching says

January 5, 2011 at 4:36 am

I am flummoxed and stymied by the “will not”+”won’t” conundrum (the decline in the frequencies from 1800 to the present day as shown by the Google Books Ngram Viewer).
I just checked the COHA corpus and the same pattern is evident there too. And as Ø points out, “I will” has also declined. This is also confirmed by COHA. And “I’ll” and “I shall” also show big declines. The frequency of “I will”+”I’ll”+”I shall” halved from 1810 to 1910 and halved again to 2000. The frequencies of “I am going to”, “I’m gonna” and “I intend to” are 1-2 orders of magnitude smaller, so offer no help in solving the puzzle.
If no-one can provide a plausible explanation for this weird phenomenon, I am going to stop doing fun searches on COCA/COHA/Ngram, for fear there is some kind of technical bug involved and I am just wasting my time.
AJP Crown (Mrs) says

January 5, 2011 at 5:03 am

Please don’t stop doing fun searches. I’m sure someone can think of a reason. Is there a deadline?
iching says

January 5, 2011 at 5:27 am

Thanks for the encouragement, Crown. No, no deadline. Actually, I now realise I somewhat enjoy the feeling of bemusement — cryptic crosswords, trying to learn a new language, obscure poetry, philosophy texts that are way over my head…
Trond Engen says

January 5, 2011 at 6:27 am

It’s not fun searches you should stop doing because of a bug like that, it’s serious searches.
AJP Crown says

January 5, 2011 at 9:50 am

It’s no fun if you know they’re worthless. I’m not sure they’re reliable enough to count seriously as evidence of anything. Most of the time there are too many variables.
languagehat says

March 23, 2023 at 11:10 am

I just checked to see if the links in the post worked and to my surprise they didn’t — the SciAm one was fine, but the Ngram Viewer is no longer at http://ngrams.googlelabs.com/, it’s at https://books.google.com/ngrams/. Which is fine, but you’d think Google of all places would have provided a redirect. As it was, I had to change the URLs for both that first link and the “linguistics versus philology” one. Bah!
Brett says

March 23, 2023 at 3:17 pm

John Cowan on the reclassification of the Ngram viewer as part of Google Books
