Can LLMs Transcribe Historic Documents?

A post by Ben Brumfield says:

Recently, both OpenAI and Google released new multi-modal large language models, which were immediately touted for their ability to transcribe documents. Also last week, I transcribed this document from the Library of Virginia’s collection from the Virginia Revolutionary Conventions. […] How do historians discussing this issue find these documents? Traditionally, they had to be transcribed by humans, which is how I stumbled on this document. Traditional HTR tools like Transkribus’s English Eagle transformer model–and by “traditional”, I mean transformer technology from 18 months ago–produce output like this. It’s not great – the strikethroughs cause some recognition problems, and the insertions really scramble the reading order of the text. […] With the release of ChatGPT-4o, we can attempt HTR via a large language model instead of a transformer. I uploaded the document and asked ChatGPT to transcribe it, using TEI to represent the strike-throughs and insertions. […] The Transkribus output is obviously raw, and in need of correction. It looks tentative when you read it in isolation. The ChatGPT output looks much more plausible, and–in my opinion–that plausibility is treacherous.

Interesting stuff, and the images at the link explain his point. Thanks, Leslie!
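
For readers who haven’t met TEI: strike-throughs are conventionally encoded with <del> and interlinear insertions with <add>. A minimal sketch of what that markup looks like, wrapped in Python for convenience; the manuscript wording here is invented, not Brumfield’s actual output:

    # A generic illustration of TEI conventions for manuscript revisions:
    # <del> marks a strike-through, <add> an interlinear insertion.
    # The line of text itself is made up for this example.
    tei_line = (
        '<line>the said <del rend="strikethrough">Convention</del> '
        '<add place="above">Committee of Safety</add> do resolve</line>'
    )
    print(tei_line)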

Comments

  1. David Eddyshaw says

    Automated Plagiarism Engines produce bullshit. At least, it would be bullshit if they actually possessed any “intelligence” at all. Human bullshit is produced by human beings with no concern for truth or falsehood; APEs have no actual contact with the world in which things can be true or false at all. Human beings are cruel if they are indifferent to human suffering; bombs are “cruel” only as a metaphor.

    The more “advanced” APEs are, the more plausible the bullshit appears to a human being. But the nature of the product itself does not change.

  2. So you’re saying large language models should not be used to transcribe documents? Is it a moral crime, or just a philosophical one?

  3. David Marjanović says

    I, for one, am saying they shouldn’t be used for that – at least as long as they’re unable to prefer “I don’t know” over a plausible-looking 60% match that is not marked as such. In other words,

    that plausibility is treacherous.

  4. David Eddyshaw says

    Exactly.

    Neither a machine nor a human being that cannot say (and mean) “I don’t know” can or should be trusted – at all.

  5. Here to add a grain of salt about the piece’s accuracy and understanding of the subject matter: large language models are transformers.

  6. J.W. Brewer says

    Just eyeballing the two different outputs, it looks like the one on the left was trying to make best-statistical-guesses on a letter by letter basis, leading to lots of strings that are not actual English words but often transparent misspellings of the same, whereas the one on the right is trying to make best-statistical-guesses of correctly-spelled English words, although not necessarily ones that would be syntactically or semantically cromulent in context. So the guessing/pattern-matching is occurring at different levels of generality. One can imagine pluses and minuses of both approaches, but they will predictably make different sorts of errors. The failure to indicate “here’s something faint that I really can’t make out” so you know you need to go back and fill in a blank via a different approach might seem like an applied instance of the failure to say “I don’t know,” but it’s distinct enough that it ought to be separately fixable.
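
    One way to see the contrast in miniature: snap a letter-by-letter guess to the nearest dictionary word and the misspelling disappears, whether or not the word was right. A toy Python sketch; the guess, the word list, and the matcher are all invented stand-ins, not how either system actually works:

        import difflib

        # A letter-by-letter recognizer often emits a transparent non-word...
        per_letter_guess = "Comrnittee"

        # ...while snapping to a word list guarantees correct spelling,
        # though not necessarily the word that was actually on the page.
        dictionary = ["Committee", "Commitment", "Commotion", "Commonwealth"]

        word_level = difflib.get_close_matches(per_letter_guess, dictionary, n=1)
        print(word_level)  # ['Committee'] -- plausible, hence treacherous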

  7. I don’t think anyone would advocate for humans chilling with a drink and letting LLMs do all the work for them unsupervised, but surely it’s worth trying to improve their output. Humans take a long time to do things, and yes, they do them better, but when you’ve got mountains of stuff to go through (like the debris from the 2009 Cologne archives disaster, which I understand they’re still working on) mechanical aids can be a great help.

  8. Stu Clayton says

    large language models are transformers.

    The 2017 Google research paper that, say the adepts, put LLMs into overdrive: Attention Is All You Need. This business is beginning to make a little more sense to me:

    #
    In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
    #

    It seems that what they call an “attention mechanism” works by feeding on itself (output -> input -> output …). Or rather parts of itself feeding on parts of itself (“multi-head attention”), so all parts chow down in parallel. In the old days we called something like that feedback loops, or not? But maybe this is not what is meant. More research is in progress. (There’s a sketch of the mechanism at the end of this comment.)

    In the paper the expression “English constituency parsing” appears. I found a nice, short explanation: Constituency vs dependency parsing. I see hackles rising in the audience, but hey, this stuff is not poisoned merely because Chomsky licked it. It’s just a thing you can use to plagiarize and transform.

    #
    The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
    #

    “Stacked self-attention” ! I love it. Edward Casaubon was stacked like a brick shithouse.
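
    For what it’s worth, within a single layer the “attention mechanism” is not a feedback loop: every token scores every other token in one parallel matrix product, and “multi-head” just means several such maps computed side by side and concatenated. The loop-like part comes at generation time, when each output token is fed back in as input for the next step. A minimal numpy sketch of scaled dot-product self-attention, with toy sizes and random stand-in weights, nothing from the paper itself:

        import numpy as np

        rng = np.random.default_rng(0)
        seq_len, d_model = 4, 8                      # toy sizes, not the paper's

        X = rng.standard_normal((seq_len, d_model))  # one embedding per token

        # Learned projections in a real model; random stand-ins here.
        W_q = rng.standard_normal((d_model, d_model))
        W_k = rng.standard_normal((d_model, d_model))
        W_v = rng.standard_normal((d_model, d_model))

        Q, K, V = X @ W_q, X @ W_k, X @ W_v

        # Every token attends to every other token at once: one matrix
        # product, no output-to-input loop within the layer.
        scores = Q @ K.T / np.sqrt(d_model)              # (seq_len, seq_len)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax

        out = weights @ V                                # (seq_len, d_model)
        print(out.shape)                                 # (4, 8)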

  9. Stu Clayton says

    but when you’ve got mountains of stuff to go through (like the debris from the 2009 Cologne archives disaster, which I understand they’re still working on) mechanical aids can be a great help.

    So wow, very yes.

  10. Stu Clayton says

    Microsoft’s new Windows 11 Recall [uses AI] is a privacy nightmare

    Linux users can go back to sleep. For now.

    On first reading the headline, I thought that Windows 11 was being recalled to fix the airbags, or retrofit the blue screens with unbreakable glass.

  11. David Marjanović says

    in b4 Total Recall

  12. David Eddyshaw says

    Linux users can go back to sleep

    Linux is, of course, not particularly secure intrinsically. Our protection as desktop users is just that there are too few of us to bother cracking, and anyway, we’re all Commies with no money.

    I recall a nice study that tried to run some classic Windows viruses etc under WINE. It was disappointing. WINE needs a lot more work before it can provide the full Windows experience.

  13. David Eddyshaw says

    (We do know enough not to believe that Microsoft has only our best interests at heart, though.)

  14. “We Can Bullshit It for You Wholesale”

  15. Automatic transcription of handwritten documents can be used in two ways, exactly like OCR of printed documents.

    First, it can be used to get quick and dirty transcriptions, full of errors, but searchable. When looking for, say, a name or a rare word in GBooks or a newspaper archive, you might miss some occurrences because of OCR errors, but without the OCR you wouldn’t find any at all, unless you had a thousand people to hand-transcribe them over the next hundred years. (A sketch of how fuzzy matching copes with such errors follows at the end of this comment.)

    Secondly, machine transcription can ideally be a first step before manual transcription. If the OCR errors are not too numerous, it might take less time to correct them than to type everything from scratch (or dictate and correct speech recognition errors).

    The two systems compared have issues. Transkribus has supposedly worked well for some purposes, even hard tasks like transcribing large collections written in Nastaliq. The default model tested here is not too great for this particular sample. I checked it out with the example image (you can go to transkribus.ai and upload images to test the models). It does a poor job of segmenting the lines, often splitting a line in two and deciding that its end comes first. The character recognition isn’t all that, either. I don’t know if further training the default model would help, or if Transkribus itself is to blame.

    The ChatGPT results suggest a more accurate handwriting recognizer integrated with a language model: basically, autocorrect. It may be just fine for people taking notes on paper in business meetings. If you are hand-correcting a machine transcription, though, the forced correct spelling might be a disadvantage. An incorrect but properly spelled word catches the eye less than a misspelled one, and might be more easily missed.
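
    The “full of errors, but searchable” point can be stretched further with fuzzy matching: a plain substring search misses an OCR-mangled name, but an edit-distance comparison still finds it. A toy Python sketch; the “OCR output”, query, and threshold are invented, and real systems use proper fuzzy indexes rather than a linear scan:

        import difflib

        ocr_text = "the Convcntion resolvcd that Lord Dunrnore be condemned"
        query = "Dunmore"

        # An exact search fails ("Dunmore" not in ocr_text), but a
        # similarity ratio over each word still turns up the mangled hit.
        hits = [
            word for word in ocr_text.split()
            if difflib.SequenceMatcher(None, query.lower(), word.lower()).ratio() > 0.75
        ]
        print(hits)  # ['Dunrnore']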

  16. First, it can be used to get quick and dirty transcriptions, full of errors, but searchable. […] Secondly, machine transcription can ideally be a first step before manual transcription.

    Yes, that’s my take on it. Sure, it’s crappy, but it has its uses.

  17. @hat
    I don’t think anyone would advocate for humans chilling with a drink and letting LLMs do all the work for them unsupervised
    Au contraire, I would love to do that, as long as I could take full credit for the good results and blame someone else (“must’ve been the intern!”) for any errors. And I can well imagine that this kind of setup is already quite common.

  18. @David Marjanović

    > they shouldn’t be used […] as long as they’re unable to prefer “I don’t know” over a plausible-looking 60% match that is not marked as such

    I’d wish for an OCR or LLM model that is able to mark up its output, at varying levels of granularity, for reliability (something that, if it can be done, will of course itself have error bars). But at least highlighting the parts of a document that were more of a wild guess would be helpful; a sketch of what that might look like follows below.
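
    TEI even has a hook for this: the <unclear> element takes a cert attribute. Given per-token confidences (assumed here to come out of the recognizer, which is the hard part), emitting the markup is nearly mechanical. A Python sketch with invented confidence bands:

        # Render (word, confidence) pairs as TEI, flagging shaky words with
        # <unclear cert="...">. The 0.9/0.6 bands are arbitrary choices.
        def to_tei(tokens):
            out = []
            for word, conf in tokens:
                if conf >= 0.9:
                    out.append(word)
                elif conf >= 0.6:
                    out.append(f'<unclear cert="medium">{word}</unclear>')
                else:
                    out.append(f'<unclear cert="low">{word}</unclear>')
            return " ".join(out)

        print(to_tei([("Convention", 0.97), ("resolved", 0.72), ("Dunmore", 0.41)]))
        # Convention <unclear cert="medium">resolved</unclear>
        #   <unclear cert="low">Dunmore</unclear>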

  19. There are a huge number of cuneiform inscriptions and only a handful of people who know how to read them. A lot of them are pretty boring, like who paid their taxes with how many cows. But if you aggregate the boring stuff, it could actually give interesting information about the economic system. Plus it would be good to find whatever really interesting stuff may be lurking in there.

    I’m only using cuneiform as an example; there are a number of other cases where there is a lot of source material and not enough people to process it thoroughly. Even 19th-century handwritten documents are no longer readable by an increasing number of people.

    This could well be an area where LLMs could handle a lot of the preliminary processing.

  20. Exactly. Let’s not throw out Ea-nāṣir’s copper ingots along with the bathwater!

  21. David Marjanović says

    Fair enough.

    (There could well be several unknown languages in the British Museum alone. Better than the Holy Grail, really.)

  22. The older US censuses were all digitized by hand, I think. Very tedious, but so useful that people were found to do it.
    I try to envision those who set the type for Migne’s Patrologia by hand, presumably working from his handwriting, but when I do, my stomach gets an uncomfortable feeling.

  23. ə de vivre says

    Getting AI to read cuneiform from a clay tablet is a tall order given the physical irregularities of the tablets and the variation in sign forms. I don’t know of anyone seriously undertaking such a project right now, but there are attempts to use AI to linguistically annotate already transliterated texts. One issue for annotating Sumerian, though, is that there isn’t yet consensus about some pretty basic aspects of Sumerian grammar, so even human-produced linguistic annotation has to either be strategically vague or take positions that appear solid when they show up in machine-readable data, but are really just a guess that isn’t obviously false.

    There’s an interesting corpus study, the Sumerian Network Project, that uses documents from a single late-3rd-millennium archive: it looks at 15,000-ish transliterated administrative texts from a warehouse and distribution centre. They use unsupervised classification techniques to sort the corpus into text groups. This is kind of neat because it can help form hypotheses about genres and relationships in administrative texts, but I’m not sure that the results are that meaningful in and of themselves. Basically, they’re saying that you *can* group the texts into N groups, but those groups don’t necessarily have any meaning either for the texts’ creators or for modern researchers.

    The more interesting result to me is that they map out relationships between named entities (e.g., people and the named buildings they work in). This data about networks of interaction could probably be used to find out some interesting things about how the Ur III state operated.
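
    Once the (person, place) pairs are extracted, the network side is the sort of thing a few lines of graph code can start on. A toy Python sketch with invented names, not data from the project:

        import networkx as nx

        # Co-occurrence pairs of people and the named buildings they appear
        # with in texts; the examples are invented.
        mentions = [
            ("Ur-Lisi", "granary"), ("Ur-Lisi", "brewery"),
            ("Lugal-kugzu", "granary"), ("Lugal-kugzu", "weaving-house"),
            ("Nin-melam", "weaving-house"),
        ]

        G = nx.Graph()
        G.add_edges_from(mentions)

        # The best-connected nodes hint at hubs in the administrative network.
        print(sorted(nx.degree_centrality(G).items(), key=lambda kv: -kv[1])[:3])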

  24. ə de vivre says

    there are a number of other cases where there is a lot of source material and not enough people to process it thoroughly.

    This is pretty much every archival institution in the world. I work at an archive where we’d love to have an AI tool to process still and moving image documents and give us keywords about locations and people in them. Otherwise, they’ll just sit undescribed until the day that Canada decides to massively fund its cultural heritage (i.e., never) or vinegar syndrome overtakes them. Unfortunately, the technology isn’t there yet for fully automatic AI tools. I notice that the writers of the original article work for a company that does crowd-sourced transcription, which is probably the best tool that archival institutions have right now, but not everyone is able to mobilize an interested public (or pay people to be interested).

  25. People have been working on automatic transcription of cuneiform from photographs. See e.g. the latter sections of this paper and later papers which cite it.

  26. Allan from Iowa says

    At a previous job we used commercially available OCR software that gave a confidence level for each character, ranging from 1 to 10 if I recall correctly. This was with printed texts in a limited range of fonts. We aggregated the scores to word and sentence level and used a lot of domain-specific heuristics to decide which bits to send to human review.

    This was about 15 years ago and I don’t know if things have improved since then.
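
    A toy Python sketch of the aggregation step as described, assuming per-character confidences on a 1-10 scale; the min-over-word rule and the threshold are invented stand-ins for the domain-specific heuristics:

        # Flag words whose weakest character falls below a threshold, using
        # per-character confidences on a 1-10 scale as described above. The
        # min-over-word rule and the threshold of 6 are invented here.
        def words_needing_review(chars, scores, threshold=6):
            words, word, word_scores = [], [], []
            for ch, sc in zip(chars, scores):
                if ch == " ":
                    if word:
                        words.append(("".join(word), min(word_scores)))
                    word, word_scores = [], []
                else:
                    word.append(ch)
                    word_scores.append(sc)
            if word:
                words.append(("".join(word), min(word_scores)))
            return [w for w, s in words if s < threshold]

        # A shaky "a" in "sat" routes that word to human review.
        print(words_needing_review("cat sat", [9, 9, 8, 10, 9, 3, 8]))  # ['sat']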
