How Not to Use Ngrams.

A good piece by Ted Underwood from his blog The Stone and the Shell (“Using large digital libraries to advance literary history”), How not to do things with words:

In recent weeks, journals published two papers purporting to draw broad cultural inferences from Google’s ngram corpus. […]

I’m writing this post because systems of academic review and communication are failing us in cases like this, and we need to step up our game. Tools like Google’s ngram viewer have created new opportunities, but also new methodological pitfalls. Humanists are aware of those pitfalls, but I think we need to work a bit harder to get the word out to journalists, and to disciplines like psychology.

The basic methodological problem in both articles is that researchers have used present-day patterns of association to define a wordlist that they then take as an index of the fortunes of some concept (morality, individualism, etc) over historical time. […]

The fallacy involved here has little to do with hot-button issues of quantification. A basic premise of historicism is that human experience gets divided up in different ways in different eras. […]

The authors of both articles are dimly aware of this problem, but they imagine that it’s something they can dismiss if they’re just conscientious and careful to choose a good list of words. I don’t blame them; they’re not coming from historical disciplines. But one of the things you learn by working in a historical discipline is that our perspective is often limited by history in ways we are unable to anticipate. So if you want to understand what morality meant in 1900, you have to work to reconstruct that concept; it is not going to be intuitively accessible to you, and it cannot be crowdsourced.

There’s much more at the link, and attention must be paid.


  1. Indeed attention must be paid. I have found that Ngram Viewer results change over time, presumably because more and more books are added to the database. For example, a comparison of ‘Plato’ vs. ‘Aristotle’ that I did some ten years ago, and blogged about, showed wonderfully how ‘Plato’ ascended as ‘Aristotle’ declined, and (I thought) supported the frequent notion that our age is comparatively Platonic. However, the same comparison done on Dec 6, 2014 showed the two names jostling each other from 1800 on, with ‘Aristotle’ thereafter beating out ‘Plato’. I’m afraid to try it again – guess I’ve learned my lesson.

  2. I’m pleased to see that The Stone and the Shell is still going strong in 2023; the latest post is on large language models in education:

    Our vision of those challenges is confined right now by a discourse that treats models as paper-writing machines. But that’s hardly the limit of their capacity. For instance, models can read. So a lawyer in 2033 may be asked to “use a model to do a quick scan of new case law in these thirty jurisdictions and report back tomorrow on the implications for our project.” But then, come to think of it, a report is bulky and static. So you know what, “don’t write a report. What I really need is a model that’s prepared to provide an overview and then answer questions on this topic as they emerge in meetings over the next week.”

    A decade from now, in short, we will probably be using AI not just to gather material and analyze it, but to communicate interactively with customers and colleagues. All the forms of critical thinking we currently teach will still have value in that world. It will still be necessary to ask questions about social context, about hidden assumptions, and about the uncertainty surrounding any estimate. But our students won’t be prepared to address those questions unless they also know enough about machine learning to reason about a model’s assumptions and uncertainty. At higher levels of responsibility, this will require more than being a clever prompter and savvy consumer. White-collar professionals are likely to be fine-tuning their own models; they will need to choose a base model, assess training strategies, and decide whether their models are over-confident or over-cautious.

    The core problem here is that we can’t fully outsource thinking itself.

    Another chance to celebrate a blogger who hasn’t given up!

  3. David Eddyshaw says

    Probably replaced by a Large Language Model …
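A side note on the shifting Ngram comparisons described in the first comment above: the Ngram Viewer plots relative frequency, i.e. a term’s occurrences divided by the total words published in that year’s slice of the corpus. So when a new batch of scanned books is added, both the numerator and the denominator change, and two curves can swap places even though nothing about 1900 itself changed. A minimal sketch of that arithmetic, using invented counts (not real Google Books data):

```python
# Illustrative only: all counts below are invented, not real Google Books data.
# The Ngram Viewer plots relative frequency: mentions / total words in that year.

def rel_freq(mentions, total_words):
    """Relative frequency of a term in one year's corpus slice."""
    return mentions / total_words

# Hypothetical snapshot of the corpus for the year 1900.
plato, aristotle, total = 1200, 1100, 50_000_000
assert rel_freq(plato, total) > rel_freq(aristotle, total)  # 'Plato' leads

# Later, newly scanned books are added to the same year. Suppose they
# skew philosophical and mention Aristotle more heavily than Plato.
new_plato, new_aristotle, new_total = 100, 400, 5_000_000

plato_now = rel_freq(plato + new_plato, total + new_total)
aristotle_now = rel_freq(aristotle + new_aristotle, total + new_total)
assert aristotle_now > plato_now  # the ordering has flipped
```

Nothing here depends on which books were added being “wrong” – the curves are ratios over a moving corpus, which is why a comparison run a decade apart can tell a different story.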
