Jigsaw Segmentation and the Vatican Archives.

When I saw the title of Sam Kean’s Atlantic article “Artificial Intelligence Is Cracking Open the Vatican’s Secret Archives” I groaned inwardly, assuming it was the usual excessive hype for what was probably a banal story. But no, it’s really something (though I still jib at the term “artificial intelligence”). It starts out with the fact that the Vatican Secret Archives is “one of the grandest historical collections in the world,” with 53 linear miles of shelving, but also “one of the most useless”:

Of those 53 miles, just a few millimeters’ worth of pages have been scanned and made available online. Even fewer pages have been transcribed into computer text and made searchable. If you want to peruse anything else, you have to apply for special access, schlep all the way to Rome, and go through every page by hand.

But a new project could change all that. Known as In Codice Ratio, it uses a combination of artificial intelligence and optical-character-recognition (OCR) software to scour these neglected texts and make their transcripts available for the very first time. If successful, the technology could also open up untold numbers of other documents at historical archives around the world.

Kean describes the difficulty of using OCR on handwritten text, then says:

In Codice Ratio sidesteps these problems through a new approach to handwritten OCR. The four main scientists behind the project—Paolo Merialdo, Donatella Firmani, and Elena Nieddu at Roma Tre University, and Marco Maiorino at the VSA—skirt Sayre’s paradox with an innovation called jigsaw segmentation. This process, as the team recently outlined in a paper, breaks words down not into letters but something closer to individual pen strokes. The OCR does this by dividing each word into a series of vertical and horizontal bands and looking for local minimums—the thinner portions, where there’s less ink (or really, fewer pixels). The software then carves the letters at these joints. The end result is a series of jigsaw pieces.
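For the curious, here is a rough sketch of the "look for local minimums and cut there" idea as Kean describes it. It is not the In Codice Ratio code: it assumes a word image already reduced to 0/1 pixels (1 meaning ink), it only looks at vertical bands (columns) rather than both directions, and the function names are my own invention for illustration.

```python
from typing import List

Image = List[List[int]]  # rows of 0/1 pixels; 1 means ink (an assumption of this sketch)


def column_ink(image: Image) -> List[int]:
    """Count the ink pixels in each column of the image."""
    return [sum(column) for column in zip(*image)]


def local_minima(profile: List[int]) -> List[int]:
    """Return interior indices whose value is strictly below both neighbours.

    Flat-bottomed dips are ignored here to keep the sketch short.
    """
    return [
        i
        for i in range(1, len(profile) - 1)
        if profile[i] < profile[i - 1] and profile[i] < profile[i + 1]
    ]


def split_at(image: Image, cuts: List[int]) -> List[Image]:
    """Slice the image into vertical pieces at the given column indices."""
    width = len(image[0])
    bounds = [0] + cuts + [width]
    return [[row[a:b] for row in image] for a, b in zip(bounds, bounds[1:])]


# A toy "word": two blobs of ink joined by a thin gap.
toy_word = [
    [0, 1, 1, 0, 0, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0, 0],
]

profile = column_ink(toy_word)
cuts = local_minima(profile)
pieces = split_at(toy_word, cuts)
print(profile, cuts, [len(piece[0]) for piece in pieces])
```

On real manuscript pages the ink profile would be much noisier, so some smoothing or a minimum distance between cuts would presumably be needed; the sketch leaves all of that out.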

The details are fascinating (the process involved help from high schoolers), but I’ll let you discover them at the link; the possibilities are exhilarating:

Like all artificial intelligence, the software will improve over time, as it digests more text. Even more exciting, the general strategy of In Codice Ratio—jigsaw segmentation, plus crowdsourced training of the software—could easily be adapted to read texts in other languages. This could potentially do for handwritten documents what Google Books did for printed matter: open up letters, journals, diaries, and other papers to researchers around the world, making it far easier to both read these documents and search for relevant material.

Thanks, jack!

Comments

Stu Clayton says

May 9, 2018 at 11:06 pm

This could potentially do for handwritten documents what Google Books did for printed matter

I hope not. 99% of google-scanned books that I have encountered when searching for something are character salads tossed in an OCR shredder.
DF says

May 10, 2018 at 12:29 am

That is not the case at all and you should revisit your assumptions. Google’s OCR is amazing, even for older books with long S’s and other typographical oddities. That’s not to say they don’t have other problems with copyright and actually showing you the books you want to find (and not hiding them with some predictive search algorithm). But lack of quality OCR is not one of them.
geekosaur says

May 10, 2018 at 12:30 am

The jigsaw business is more or less intended to prevent shredded character salad, though.
Stu Clayton says

May 10, 2018 at 1:05 am

I have no cause to revisit my assumptions, because I’m not making assumptions. I report my experience. That’s why I wrote “books that I have encountered”. To deny me my experience would involve not different assumptions, but impertinence.

Others may have other experiences. I did not generalize to all of Google Books.

I can imagine a simple explanation for this – different people look for different words from different semantic fields, and end up with different books that have been scanned. For example, I often have searched for German words, and far too often got mis-hits in scan-garbled Frakturschrift. The kind of things I search for tend to turn up in 18C and 19C books.

As a result, when a hit link begins with “books.google.de” and I see an 18C date in the summary, I now usually don’t bother to follow it.
Stu Clayton says

May 10, 2018 at 1:28 am

By contrast, here’s a claim from the article that is based on revisitable assumptions:

# Like all artificial intelligence, the software will improve over time #

I like a good laugh early in the morning.
DF says

May 10, 2018 at 1:33 am

Exactly, it’s missing the point that Google Books only got so good because they crowdsourced their character recognition to tens of millions of people through reCaptchas, not some magical self-improving AI.
Stu Clayton says

May 10, 2018 at 1:39 am

Hey, that’s pretty cute, I didn’t know that! I had imagined that those weird letters were intended only to stymie sneaky software character recognition. It had not occurred to me that you could kill two birds with this one stone.
maidhc says

May 10, 2018 at 1:57 am

If they just scanned the text and made it available, it would be a help. Even if the OCR wasn’t perfect, it could be something an area specialist could start with and correct.

It could possibly improve because I imagine the same scribes were writing for years. You might even be able to recognize a particular person’s handwriting and train the model on that specific person.

However there are a lot of potential problems too. Old parchment books tend to have weird marks on them, or even holes in the parchment, the ink eats through the page to the other side, and things like that.

But talking about crowdsourcing may be a little optimistic, because the size of the crowd that can read medieval Latin handwriting has got to be pretty small. I guess you could train SCA enthusiasts, re-enactors and people like that, to increase the size of the pool.

Complaining about Google not doing a good job on Frakturschrift is misleading because the OCR is not trained on Frakturschrift.
juha says

May 10, 2018 at 2:10 am

they crowdsourced their character recognition to tens of millions of people through reCaptchas

And now they have switched to image recognition:
select the buses/cars/show windows, etc
Athel Cornish-Bowden says

May 10, 2018 at 2:44 am

What about just a few millimeters’ worth of pages have been scanned and made available online? That seems not only to be a small amount but an incredibly small amount. The pages I have scanned myself amount at least to a few centimetres, and maybe to a few decimetres.

As the author uses “miles” and “millimeters” in the same sentence I wonder if they realize how small a millimetre is.
Stu Clayton says

May 10, 2018 at 3:09 am

select the buses/cars/show windows

The images can reasonably be assumed to be ones that Google software is not (yet) able to recognize. The competition can analyze them to estimate how far ahead or behind their software is as compared to Google’s, and concentrate on improving their software on precisely those images.

This is a standard procedure in all areas of human endeavor. In “markets”, for instance, players don’t just observe prices and respond. They also observe the responses of other players observing prices.
Stu Clayton says

May 10, 2018 at 3:31 am

And they observe themselves being observed.
languagehat says

May 10, 2018 at 8:53 am

But talking about crowdsourcing may be a little optimistic, because the size of the crowd that can read medieval Latin handwriting has got to be pretty small.

Did you miss the part about the high school students? Anyone can match symbols.
John Cowan says

May 10, 2018 at 9:14 am

I agree that millimeters seems weird, but these are priceless manuscripts and would have to be scanned with extreme caution.
languagehat says

May 10, 2018 at 9:22 am

I wondered about the millimeters too; if it were a language issue I probably would have bestirred myself to write the author and ask, but I didn’t care enough. If anyone does write him, let us know what you hear back!
Stephen Goranson says

May 10, 2018 at 10:57 am

A somewhat related effort (unless I misunderstand it) is underway with Qumran manuscripts, available here:
https://lirias.kuleuven.be/bitstream/123456789/576008/1/ICPRAM_2017_128.pdf
M.A. Dhali et al., “A Digital Palaeographic Approach towards Writer Identification in the Dead Sea Scrolls.”
J.W. Brewer says

May 10, 2018 at 11:34 am

Once the technology gets good enough to decode handwritten medieval MSS, maybe it will then be ready to take a stab at books printed in Fraktur with better results than hitherto obtained?
Michael Eochaidh says

May 10, 2018 at 12:10 pm

The project to crowd-source transcription of (at least some of) the Oxyrhynchus Collection at Oxford has come up here before.

http://languagehat.com/varia-3/
languagehat says

May 10, 2018 at 1:31 pm

Well remembered! And from there, I note: “Knowledge of Greek not a prerequisite.”
maidhc says

May 10, 2018 at 10:54 pm

Looking at medieval handwriting, it’s a challenge to figure out where the letters begin and end.
Peter Erwin says

May 11, 2018 at 11:18 am

Re the “millimeters” question: I found the same basic claim in one of the papers by the actual (Italian) researchers: “Indeed, the World Wide Web only contains a small part of the traditional archives. (It is evocative to think that it may only contain a few millimeters out of the 85km of linear shelves in the Vatican Secret Archives.)”. So I don’t think it’s a translation or confusion-about-the-metric-system issue.

http://ceur-ws.org/Vol-2037/paper_11.pdf
Stu Clayton says

May 11, 2018 at 11:38 am

“Evocative to think that it may” – whatever that was intended to mean, I understand it as meaning “I have no idea”. A less florid style ought to be more informative.
Trond Engen says

May 11, 2018 at 7:15 pm

Maybe, just maybe, something has been scanned without reaching the World Wide Web.
Rick says

May 12, 2018 at 11:08 am

Regarding “And now they have switched to image recognition:
select the buses/cars/show windows, etc”

This is to help train the machine learning algorithms that control self-driving cars. They are crowdsourcing the ability to drive a car. (They also utilize online video gaming for this, as you would expect.)