How Not to Use Ngrams.

Ted Underwood has a good piece, “How not to do things with words,” on his blog The Stone and the Shell (“Using large digital libraries to advance literary history”):

In recent weeks, journals published two papers purporting to draw broad cultural inferences from Google’s ngram corpus. […]

I’m writing this post because systems of academic review and communication are failing us in cases like this, and we need to step up our game. Tools like Google’s ngram viewer have created new opportunities, but also new methodological pitfalls. Humanists are aware of those pitfalls, but I think we need to work a bit harder to get the word out to journalists, and to disciplines like psychology.

The basic methodological problem in both articles is that researchers have used present-day patterns of association to define a wordlist that they then take as an index of the fortunes of some concept (morality, individualism, etc) over historical time. […]

The fallacy involved here has little to do with hot-button issues of quantification. A basic premise of historicism is that human experience gets divided up in different ways in different eras. […]

The authors of both articles are dimly aware of this problem, but they imagine that it’s something they can dismiss if they’re just conscientious and careful to choose a good list of words. I don’t blame them; they’re not coming from historical disciplines. But one of the things you learn by working in a historical discipline is that our perspective is often limited by history in ways we are unable to anticipate. So if you want to understand what morality meant in 1900, you have to work to reconstruct that concept; it is not going to be intuitively accessible to you, and it cannot be crowdsourced.

There’s much more at the link, and attention must be paid.


  1. Ken Miner says:

    Indeed, attention must be paid. I have found that Ngram viewer results change over time, presumably because more and more books are added to the database. For example, a comparison of ‘Plato’ vs. ‘Aristotle’ that I did some ten years ago and blogged about showed wonderfully how ‘Plato’ ascended as ‘Aristotle’ declined, and (I thought) supported the frequent notion that our age is comparatively Platonic. However, the same comparison done Dec 6, 2014 showed the two names jostling each other since 1800, with ‘Aristotle’ thereafter beating out ‘Plato’. I’m afraid to try it again – guess I’ve learned my lesson.

