A new Digital Scholarship in the Humanities article by Eetu Mäkelä, James Misson, Devani Singh, and Mikko Tolonen (open access) examines Early English Books Online (EEBO):
Abstract
Digital archives that cover extended historical periods can create a misleading impression of comprehensiveness while in truth providing access to only a part of what survives. While completeness may be a tall order, researchers at least require that digital archives be representative, that is, have the same distribution of items as whatever they are used as proxies for. If even this representativeness does not hold, any conclusions we draw from the archives may be biased. In this article, we analyse in depth an interlinked set of archives which are widely used but which have also had their comprehensiveness questioned: the images of Early English Books Online (EEBO), and the texts of its hand-transcribed subset, EEBO-TCP. Together, they represent the most comprehensive digital archives of printed early modern British documents. Applying statistical analysis, we compare the contents of these archives to the English Short Title Catalogue (ESTC), a comprehensive record of surviving books and pamphlets in major libraries. Specifically, we demonstrate the relative coverage of EEBO and EEBO-TCP along six key dimensions—publication types (i.e. books/pamphlets), temporal coverage, geographic location, language, topics, and authors—and discuss the implications of the imbalances identified using research examples from historical linguistics and book history. Our study finds EEBO to be surprisingly comprehensive in its coverage and finds EEBO-TCP—while not comprehensive—to be still broadly representative of what it models. However, both of these findings come with important caveats, which highlight the care with which researchers should approach all digital archives.
1. Introduction
The purpose of this article is 2-fold. First, we aim to show, with major datasets often used for digital scholarship, that the collection history and composition of datasets matter, and cannot be ignored when doing research without jeopardizing the validity of results. Second, by demonstrating this principle in a descriptive manner across various dimensions of interest (including temporal, geographical, and linguistic coverage), we also wish to offer a solution: a series of practical guides for users of these datasets, with which they can make informed decisions about which imbalances they need to account for, and how. While this paper’s analyses of composition and its consequences will benefit users of the datasets of Early English Books Online (EEBO n.d.) and EEBO-TCP (n.d.) specifically, our guides offer a template which is readily usable for other collections, as evidenced by our sister publication on Eighteenth Century Collections Online (Tolonen, Mäkelä, and Lahti 2022).
It looks like a valuable read for anyone who uses those archives. Thanks, Leslie!
I was expecting something like the Black Book of Carmarthen, or at least, the Black Rabbit of Inlé, here. Shameless bait-and-switch headline!
TCP is “Text Creation Partnership”, bee tee dub.
Shameless bait-and-switch headline!
Yes, the hinted-at revelations are pretty weak sauce. Release the kraken!
Perhaps it would be useful to maintain a distinction between “archive” or “library” and “corpus,” with the latter striving toward being representative.
“Striving” is important, because a truly representative sample of such material is rarely achievable. EEBO, for instance, covers only print sources. That leaves out oral and manuscript evidence, as well as genres of print material that have not been preserved. (Other archives, such as Eighteenth Century Collections Online, do have some manuscript coverage, but it is spotty.)
The best such collections can probably achieve is to be open and upfront about the types of material they contain and the known gaps in their coverage.
You’re right, that’s a useful distinction.