AI Model for Ancient Papyri.

June 3, 2026 by languagehat 33 Comments

As anyone who has been following LH for any length of time will be aware, I am no fan of “AI,” but this seems like a situation in which large language models could be of great use; the Austrian Academy of Sciences reports:

The Austrian Academy of Sciences (OeAW) is collaborating with Mistral AI and Sail Reply, a Reply Group Company, on the development of a Large Language Model (LLM) for Ancient Greek: Apollo, named after the Greek god of light and patron of the arts and sciences, will propel research on ancient Greek texts. The model supports advanced searching and automatic text restoration in hundreds of thousands of undeciphered papyri and inscriptions, making it possible to accurately capture content in a matter of hours rather than years. The OeAW and its partners are doing pioneering work, as LLMs have not yet been developed for a historical language evolving over many centuries or the reconstruction of heavily damaged ancient texts.

On behalf of the OeAW, the project is led by Anna Dolganov, an ancient historian and papyrologist at the Austrian Archaeological Institute of the OeAW, who provides field-specific guidance, oversees the integration of relevant sources, and guarantees scientific quality. Through her expertise, Dolganov ensures that historical contextualization and methodological standards are upheld. […]

Anna Dolganov: “Our project with Mistral AI and Sail Reply is building the world’s first advanced multimodal Large Language Model for an ancient language, trained on the largest digital corpus of historical Greek to date. This AI system can be developed in many directions for a wide range of research tasks, from reconstructing fragmentary inscriptions and papyri to conducting semantic and thematic searches across the entire Greek textual tradition to deciphering handwritten texts. For example: there are one million Greek papyri worldwide that have never been read, tens of thousands of which are held by the Papyrus Collection of the Austrian National Library. Such treasures of historical knowledge are our target. This LLM marks the beginning of an exciting journey in the study of antiquity.”

I didn’t realize there were so many unread papyri — if this works as advertised, it could be a boon. Thanks, Martin!

Comments

Stephen Goranson says

June 4, 2026 at 4:27 am

In editions of fragmentary papyri, it is important to distinguish what can be read and what is restored by conjecture. I hope the AI publications, labeled as such, will be clear about which is which. If so, it will be helpful.
AntC says

June 4, 2026 at 5:27 am

Anna Dolganov, an ancient historian and papyrologist …

She doesn’t look ancient to me, judging by her biog photo. B.A. in Classical Philology, Harvard University (magna cum laude, 2005)

How did these tens of thousands of papyri end up in Austria?

@SG even if the AI gives no better than a fair guess, that’ll at least make a searchable resource for humans to go and eyeball a specific text amongst the tens of thousands.
Hans says

June 4, 2026 at 6:43 am

How did these tens of thousands of papyri end up in Austria?
Probably excavated and brought home by Austrian (or maybe even Austro-Hungarian) archaeologists? Or bought from local antique traders for Austrian museums? Both practices became frowned upon only at some point in the second half of the last century.
PlasticPaddy says

June 4, 2026 at 8:14 am

https://de.wikipedia.org/wiki/Papyrussammlung_und_Papyrusmuseum_Wien
—
Die Sammlung verdankt ihr Entstehen in erster Linie dem Professor für Geschichte des Orients an der Universität Wien, Josef Karabacek. Mit Hilfe des Teppich- und Kunsthändlers Theodor Graf konnte dieser in den Jahren 1881 und 1882 die rund 10.000 Papyri des 1. Fayyumer Funds nach Wien bringen…. Bis 1899 erweiterte Erzherzog Rainer die Sammlung laufend durch neue Ankäufe und machte sie im selben Jahr seinem Onkel Kaiser Franz Joseph I. zum Geschenk, der sie als Spezialsammlung in die k.k. Hofbibliothek (die heutige Österreichische Nationalbibliothek) eingliederte. Sie erhielt neue Räumlichkeiten am Josefsplatz. Zusätzliche Ankäufe von hauptsächlich Ostraka erfolgten 1899 und 1911.
—
To paraphrase, an Austrian professor of Oriental History at Vienna University brought 10000 Papyri to Vienna in 1881-82. New purchases were added to the collection until 1899 by Archduke Rainer [he had bought the collection in 1883], who then gave it to his uncle the Emperor, who had it housed as a special collection in the imperial court library (now called Austrian National Library). Additional purchases (mostly of ostraka) were made in 1899 and 1911.

The corresponding English Wiki article is shorter and says less about when and how the papyri were acquired.
cuchuflete says

June 4, 2026 at 8:14 am

How did these tens of thousands of papyri end up in Austria?
Probably excavated and brought home by Austrian (or maybe even Austro-Hungarian) archaeologists? Or bought from local antique traders for Austrian museums? Both practices became frowned upon only at some point in the second half of the last century.

or… as Mr Kipling might have had it, “Take up the brownish man’s property.”
David Marjanović says

June 4, 2026 at 9:34 am

an Austrian professor of Oriental History at Vienna University brought 10000 Papyri to Vienna in 1881-82

“with the help of” a “carpet and art dealer”. Yay, free market.
Rodger C says

June 4, 2026 at 10:39 am

In editions of fragmentary papyri, it is important to distinguish what can be read and what is restored by conjecture. I hope the AI publications, labeled as such, will be clear about which is which.

Hopefully the AI will automatically distinguish them via the existing notation.
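
For readers unfamiliar with that notation: papyrological editions conventionally follow the Leiden system, in which square brackets enclose letters that are lost and have been restored by the editor, and a dot under a letter marks a reading that is visible but doubtful. Below is a minimal sketch of how a machine-produced reading could carry the same distinction, assuming each letter comes with a status label. The `Letter` type, the `render` function, and the three status names are hypothetical, invented for illustration; nothing here describes how the OeAW model actually works.

```python
from dataclasses import dataclass

UNDERDOT = "\u0323"  # Unicode combining dot below, placed after the letter it marks


@dataclass
class Letter:
    char: str
    status: str  # hypothetical labels: "clear", "doubtful", or "restored"


def render(letters):
    """Render letters Leiden-style: [..] around restored runs, underdot on doubtful letters."""
    out = []
    in_bracket = False
    for letter in letters:
        restored = letter.status == "restored"
        # Open or close a bracket whenever we enter or leave a restored run.
        if restored and not in_bracket:
            out.append("[")
            in_bracket = True
        elif not restored and in_bracket:
            out.append("]")
            in_bracket = False
        out.append(letter.char)
        if letter.status == "doubtful":
            out.append(UNDERDOT)
    if in_bracket:
        out.append("]")
    return "".join(out)


# Illustrative input: the second letter is doubtful, the last two are restored.
sample = [
    Letter("λ", "clear"),
    Letter("ο", "doubtful"),
    Letter("γ", "clear"),
    Letter("ο", "restored"),
    Letter("ς", "restored"),
]
print(render(sample))
```

The point of the sketch is only that the distinction has to exist in the data before it can appear in the edition: if the software emits plain text with no per-letter status, no notation can recover it afterwards.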
Stu Clayton says

June 4, 2026 at 10:54 am

Hopefully the AI will automatically distinguish them [what can be read and what is restored by conjecture] via the existing notation.

Hopefully the AI will automatically and honestly say whether it is doing that. In my limited experience, however, AI will fawn and attempt to ingratiate itself, or simply lie, when it encounters ambiguity.

The fawning and mendacity are already in the source materials, of course, so no intelligence or creativity is needed to bring them to light.
Jerry Friedman says

June 4, 2026 at 11:28 am

Maybe they’ll have trained it better than that.

Do LLMs distinguish between certainty and conjecture? That may have to come in when the human is in the loop, as AntC suggested.
Stephen Goranson says

June 4, 2026 at 1:19 pm

As already suggested, we await whether a specific AI edition will include indication or discussion of whether a letter is only partially visible, and tendencies if known of that scribe, and whether a proposed reconstruction relies on, say, a formulaic legal phrase known from the same time and location, and paleographic dating, and accounting for known provenance and/or provenience, and so on. An experiment.
Yuval says

June 5, 2026 at 10:41 am

Here’s hoping that you feel comfortable in doing away with the disclaimer in the future. Knee-jerk Luddism (or appeasement of such types) is just as harmful as knee-jerk Kool-Aid drinking, and it doesn’t take much to examine individual uses of LLMs or other such technologies on a case-by-case basis with a level head.
Stephen Goranson says

June 5, 2026 at 1:06 pm

If it were the case, Yuval, that you were addressing me, about which I may be mistaken, I would have a twofold response:
Show me an AI-assisted example
and
Why hope to do “away with the disclaimer”?
rozele says

June 5, 2026 at 3:27 pm

Maybe they’ll have trained it better than that.

Do LLMs distinguish between certainty and conjecture?

they not only do not, they cannot.

“training” just means feeding in more human-coded material for pattern extrapolation, nothing more – and often less, when in practice it means feeding in LLM-coded material instead.

in theory, an LLM could label a mark as (say) “conjecturally a mem”*. but it will only apply that label to marks that resemble the ones that are labeled that way in its feedstock. so the possibilities abound for both false positives (a mark that’s clearly an unusually compressed ayin being labeled as a mem, say) and false negatives (a definite mem with an unusually long rightmost stroke being labeled as a nun with a squiggly tail, say). and novel marks (i don’t think i’ve ever read a yiddish letter without finding at least one) will be labeled by their similarity to already-encoded ones, which is laughable even just thinking about the langer mems i have encountered, let alone the langer feys or even lameds.

and that’s not even getting into, to take an example from a family letter i’m translating for a friend, the fact that for a particular writer the same two short vertical marks can be an alef, a tsvey-vovn, a vov-yud pair, a yud-vov pair, or a tsvey-yudn. for that writer, there’s also the added twist that she uses those marks in places that in almost any standardized system would call for a single vov, a melupm-vov, a single yud, a yud and a shtumer alef, a vov and a shtumer alef, or a pasekh-tsvey-yudn. which also highlights the limited utility of using longer sequences of marks to help: her “di” (article), “du” (pronoun), “tsi” (“whether”, “or”), and “tsu” (preposition) are often realized with the same marks (and by extrapolation, perhaps also “tsi” (verb stem)). it has been a fun** ride; i do not think it is an automatable one.

there’s a reason why the largest recent manuscript-deciphering project in the yiddish world (the first stage of the KMDMP), which is also probably the most important yiddish digital humanities project around, relied entirely on multilingual groups of human decipherers.

and all of this is with a living language written in a known and living script by comfortably literate writers, using high-quality scans of original sheets that are in quite good condition.

.
* i’m using this as my example because the first lesson of deciphering yiddish handwriting is “if it’s just a blob, it’s a mem” – and you quickly learn that there is a nearly infinite variety of blobs that are mems, that lots of yiddish manuscripts have blobs that aren’t mems (or even letters at all), and that lots of other kinds of enigmatic marks can be mems but can also be other letters.

** in part because it’s teaching me how thoroughly it’s possible for a person’s writing to conceal their spoken lect – the ambiguities i’ve mentioned (plus never using nekudes at all) mean that the key vowel differences aren’t legible, or aren’t readily identifiable if they are indicated in some way.
languagehat says

June 5, 2026 at 5:18 pm

If it were the case, Yuval, that you were addressing me

No, Yuval was addressing me (see the beginning of my post). I am unrepentant.
Brett says

June 5, 2026 at 6:58 pm

@rozele: Machine learning algorithms that are not language models can actually be designed to be much better at quantifying uncertainty about their responses to inputs than the LLMs. So I don’t think that identification of letter forms is necessarily going to be plagued by the kind of overconfidence we see in ChatGPT.
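
To make that concrete, here is a minimal sketch of one common pattern: a letter classifier that turns its raw scores into probabilities and declines to answer when no candidate is probable enough. The scores, the 0.8 threshold, and the function names are all hypothetical, and softmax probabilities are not automatically well calibrated, so this shows the shape of the idea rather than a reliable method.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}  # subtract max for stability
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def read_letter(scores, min_prob=0.8):
    """Return (letter, probability), or (None, probability) when abstaining."""
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    if probs[best] >= min_prob:
        return best, probs[best]
    return None, probs[best]  # too uncertain: leave the letter for a human reader


# Hypothetical scores for one clear mark and one ambiguous blob.
clear_mark = {"σ": 5.0, "ο": 0.0, "ρ": 0.0}
ambiguous_mark = {"σ": 1.0, "ο": 1.0, "ρ": 0.9}

print(read_letter(clear_mark))
print(read_letter(ambiguous_mark))
```

An abstention here is the machine equivalent of an editor printing a dot for an illegible letter instead of guessing.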
rozele says

June 5, 2026 at 7:56 pm

i think “overconfidence” unjustifiably attributes agency to something that would accurately be called “inaccuracy” (overconfidence is what the hucksters want us to have about the inaccurate results), but i do take your point about the difference in flavors of software!
Jerry Friedman says

June 5, 2026 at 11:01 pm

On the subjects of the LLM explaining its reasoning and stating uncertainties, does anyone know whether this is a place where the LLMs known as reasoning models can help?

@rozele: AIUI, LLM chatbots are sycophantic—sorry, they produce sycophantic texts—because their trainers and users give good ratings to such texts. I’m suggesting that if you start your model with clean silicon and give the trainers careful instructions, maybe you can avoid that.

I don’t think we even know whether the OeAW’s LLM will be able to chat. Maybe it will just produce ancient Greek texts, with or without paleographic symbols and other annotations, as Stephen Goranson and Rodger C may be expecting.

But I trust that any “editions” it produces will be accompanied by the digital images they’re based on and information about provenience (a word I just learned), so users can decide for themselves whether any letters are only partially readable, etc.
Nat Shockley says

June 5, 2026 at 11:56 pm

But I trust that any “editions” it produces will be accompanied by the digital images they’re based on and information about provenience (a word I just learned), so users can decide for themselves whether any letters are only partially readable, etc.

Exactly. Anytime the machine output produces something interesting, the first step will be to look at the original document to verify that it does indeed say what the machine claims that it says.

The real risk from the inaccuracy of the machine is not that it will produce interesting content which is inaccurate, but rather that it will fail to correctly recognize interesting content, so the researchers will never find it. But they’re not finding it now, because they don’t have the time to look for it. So the new technology should definitely provide a massive net benefit.
languagehat says

June 6, 2026 at 6:33 am

But they’re not finding it now, because they don’t have the time to look for it. So the new technology should definitely provide a massive net benefit.

That’s my take. It’s not a matter of trusting the AI (which no one should ever do), it’s a matter of letting it do some gruntwork to free up human labor. And I certainly hope nobody involved expects the software to “chat.”
Jerry Friedman says

June 6, 2026 at 10:36 am

Why wouldn’t it be able to chat? I’d think users would like to be able to get answers to questions such as “Why did you conjecture this instead of that?” and “Was this formulaic legal phrase [as Stephen Goranson mentioned] used at this time and place?” as long as the answers actually reflected the software’s data and process.
rozele says

June 6, 2026 at 1:10 pm

get answers to questions such as “Why did you conjecture this instead of that?”

that is exactly the thing that LLMs (et al) cannot do. when those kinds of questions are the input, the output is – like all of their output – based on what is statistically likely to follow such an input in their feedstock – it doesn’t have a damn thing to do with “why”s. this software doesn’t “answer questions” for any sense of the phrase beyond “produces an output when you give it an input” (which is true of haruspicy, too – and any reasonably competent haruspex can explain why when the input is a knife, the output will be intestines).

the problem with the second question should be obvious from the (remarkably consistent) results when u.s. government lawyers ask LLMs to provide them with precedents to include in briefs.

Maybe it will just produce ancient Greek texts, with or without paleographic symbols and other annotations

i am extremely skeptical about the human cross-checking of such results, even in the very narrow best case you’re describing. we know a lot about how deeply humans’ assessments of ambiguous images are shaped by their expectations for what they’ll see. and it’s hard to imagine a more compelling form of expectation than the combination of promised infallibility and elliptically-stated “if you want funding, you’d better show us you’re using the software our major donors have bet billions on” threats.

(i’m basing that last point partly on recent conversations with a generally very sharp and critical retired city lawyer who cannot seem to take in that LLMs are pattern-extrapolators, not search engines, despite reading the coverage of fake case citations, and a renowned philosopher of consciousness (the family friend who’s the reason i got to play with ELIZA in the early/mid-1980s) who has been using high-end LLMs to summarize papers rather than reading them.)
languagehat says

June 6, 2026 at 1:51 pm

God, that’s depressing.
Jerry Friedman says

June 6, 2026 at 3:57 pm

@rozele: that is exactly the thing that LLMs (et al) cannot do. when those kinds of questions are the input, the output is – like all of their output – based on what is statistically likely to follow such an input in their feedstock – it doesn’t have a damn thing to do with “why”s.

Please see the “reasoning models” I mentioned above. That’s what they’re doing, but they’re applying it step-by-step to complex problems, and they report the intermediate steps.

And it seems to me to be a reasonable way for software to work. Suppose there’s a mark on a papyrus, and it’s statistically likely in edited Greek texts that similar marks are interpreted as sigmas, omicrons, or somewhat defaced rhos. (I’m just guessing about what letters are likely to be confused. People can substitute others if I guessed wrong.) Suppose further that given the preceding and following letters, an omicron gives a word and neither a rho nor a sigma does. It will be statistically likely that the training data will contain that sequence of letters with the omicron, so the software will come up with that word.
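
The scenario in the previous paragraph can be sketched in a few lines: each candidate letter gets a visual probability, each resulting string gets a frequency from a word list, and the two are multiplied. Everything below is an assumption made for illustration (the probabilities, the two-word lexicon and its counts, the `restore` function), and it is a simple hand-built combination, not a description of what an LLM does internally or of the OeAW system.

```python
def restore(pattern, visual_probs, lexicon_counts):
    """Fill the '?' slot in pattern, scoring each candidate by
    visual probability times relative frequency in the lexicon."""
    total = sum(lexicon_counts.values())
    scored = {}
    for letter, p_visual in visual_probs.items():
        word = pattern.replace("?", letter, 1)
        p_word = lexicon_counts.get(word, 0) / total  # 0 if the string is not a known word
        scored[word] = p_visual * p_word
    best = max(scored, key=scored.get)
    return best, scored


# Hypothetical inputs: the mark looks most like a sigma, but only omicron yields a word.
visual_probs = {"σ": 0.40, "ο": 0.35, "ρ": 0.25}
lexicon_counts = {"λογος": 120, "νομος": 80}  # toy word list with made-up counts

best, scored = restore("λ?γος", visual_probs, lexicon_counts)
print(best, scored)
```

Note what the toy version hides: if none of the candidates produces a known word, every score is zero and the "best" answer is arbitrary, which is exactly the case where a human editor would print dots rather than a guess.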

This is a narrow domain, and the trainers and users will be specialists to some extent. Not, of course, infallible. I think there’s reason to hope for more useful results than the general-purpose chatbots give.

i am extremely skeptical about the human cross-checking of such results, even in the very narrow best case you’re describing.

Many of us seem to be imagining different things. What you call the “very narrow best case” is what I’d expect to be the typical case. What other cases do you expect?

I trust no one will claim the OeAW’s LLM will be infallible. I too find your stories of credulous people who should know better depressing. But soon enough, they’ll learn painful lessons. And other people will start from a very skeptical position.
Hans says

June 7, 2026 at 3:45 am

@rozele:
You and your lawyer friend (or the LLMs used) may be a bit behind the curve. The LLMs that we currently use at my work provide sources for their assertions, which are actually existing (not “hallucinations”) and checkable. Checking that this is the case is of course a step that you need to do, but these LLMs now are actually useful for finding real information.
Carlos says

June 7, 2026 at 5:48 am

Just as a small trial, here’s Claude on this handwritten Yiddish document, in response to this prompt: “Can you transcribe this manuscript and indicate points of uncertainty and degrees of uncertainty in the transcription?”. Apologies in advance if formatting/characters go wrong. I have no Yiddish and no idea if this performance is accurate and/or is the kind of thing that would be useful to scholars in the field. I’d be interested to hear from those who are able to assess it.

This is a Yiddish document written in Hebrew script (cursive Yiddish hand), dated 1899, page 495. The script is a typical Eastern European cursive Yiddish of the late 19th century — likely a legal record, communal register, or business document given the numerical entries at the bottom.

Header line (right to left, as is standard):
1899 · 25 · חשון [?] · נאוועמבער [?] · [place name?]
The numerals 1899 and 25 are in Western Arabic numerals written left-to-right. The Hebrew-script words read right to left and appear to give a Hebrew calendar month (חשון, Cheshvan) and possibly a transliterated month name. Moderate confidence on the month names — the cursive is compressed.

Body text — diplomatic transcription (right to left):

הס לו מרכבי ינפ נדהגלטאן גוין לפטניונה
טנאי ון צאלין אלגנק נחזיה בוד לחגין
הס חניך בסטאלאסני כוים בין ונים כין פוין, [?]נילא
דחל דין כין פסט קידלאסן פונכס בין לכדן טלמינסיר
מנסלס נחסאן פסאלחס כדוס ונים [?]מחי נכשניני דכילן
נחיר פניינ[?] גוין דחועולסי גוין נכפוס [אחרי?]
הס חני פסטאלאסאן גוינם לן מדאלין נחרנפסן מנסלן
מל פ דין מאני פאין בן 29 לאין פיכין מון דחל
דחין נחסין פיר בפאדחין בחין בין ניין,לחיר נחין 50 91 [?]פלאזיר
מלנס נחיסן פיר פאדחן פיד בינחפד הנחרפין 305 פנלאזיר
100 · בן 22 פן · [?]נחפדולן טנלאין
900 · פן 25 · פן · [?]דחיר פריל · [?]אימקדמה
10 · [?]קחהל פרכס

Uncertainty notes:

Overall caveat — high: Cursive Yiddish of this period is highly individual, with many ligatures, abbreviations, and regional spelling conventions. Without a native reader familiar with this specific hand, confident transcription of the body text is not possible. What I can offer is a best-effort diplomatic rendering.

Script: The hand uses standard Ashkenazic cursive letter forms, but several letters are ambiguous — particularly ב/כ, ד/ר, and ו/י in this hand. This is a fundamental source of uncertainty throughout.

Header: The date 1899, 25 with a Hebrew month is the most legible part. חשון (Cheshvan, the 8th Hebrew month, falling Oct–Nov) is a reasonable reading consistent with November. High confidence on the year and day number; moderate on the month.

Document type: The numerical entries at the bottom (100, 900, 10; and figures like 50, 91, 305 in the body) strongly suggest a financial or legal record — possibly a community ledger, court record, or property transaction. The repeated structural phrases suggest formulaic legal or administrative language.

Place name in header: The rightmost word(s) of the header line may be a place name or institutional name. Low confidence — too compressed to read reliably.

Body lines generally: Individual word readings are low to moderate confidence throughout. The text would benefit from examination by a specialist in late 19th-century Eastern European Yiddish documents — institutions like the YIVO Archive or the National Yiddish Book Center have paleographers familiar with this material.

Numerical entries (last four lines): The figures 22, 25, 29, 50, 91, 100, 305, 900, 10 are clearly legible. The words accompanying them (likely denoting currency, quantities, or names) are moderate confidence at best.

Recommendation: For a document of this type and period, a specialist transcription service such as those offered by YIVO or Translated.com’s Yiddish division would be the appropriate route for a fully reliable transcription.
Jerry Friedman says

June 7, 2026 at 9:43 am

Interesting. It can certainly talk plausibly about uncertainties. Funny that it transcribes “November” but doesn’t “realize” it [edit: till later in the answer].

What might it mean by “diplomatic”?

I can barely recognize a single letter of the manuscript, but based on my less than rudimentary Yiddish, I have the gravest possible doubts about Claude’s transcription after the date. We’ll see what rozele and anyone else who knows Yiddish says*. However, I’m prepared to whole-heartedly support the recommendation to find a human reader.

*ETA: Should that be “say”? Usually I’m big on plural verbs with compound subjects, but the “knows” may be throwing me off. I think I should change it to “Looking forward to comments by rozele and anyone else who knows Yiddish.”
Carlos says

June 7, 2026 at 12:13 pm

Two more examples, with different degrees of uncertainty, from Claude again. First this.

This is a medieval Latin psalter — Psalms 112–114 (Vulgate numbering) in Gothic textura script. Here is the transcription with uncertainty notes:

[Decorated initial L] aydatę pueri d̃m̃ : laudatę
nomen domini
Sit nomen domini benedictum :
ex hoc nunc & usq; in sęlin.
A solis ortu usq; ad occasum : lau-
dabile nomen domini.
E xcelsus sup omnes gentes d̃m̃ :
& sup celos gloria eius.
Qui sicut dominus deus nr qui
in altos habitat : & humilia res-
picit in celo & in terra.
Suscitans a terra inopem : & de
stercore erigens pauperem.
Ut collocet eum cum principib⁹ :
cum principibus popli sui.
Qui habitare facit sterilem in
domo : matrem filioru letantem.
In exitu isr̃l de egypto : domus
iacob de popło barbaro.
Facta est iudea scificatio eius :

Uncertainty notes:

“aydatę” (line 1): The decorated L initial begins the word Laudate; the L is the illuminated letter, so this reads Laudate. High confidence.
“d̃m̃” (lines 1, 4): Abbreviated dominum or domini with a tittle (overbar). Confident; standard medieval abbreviation.
“usq; in sęlin” (line 4): The final word looks like sęclin or sęculum abbreviated — likely sęculum (= saeculum, “for ever”). The Vulgate reads in saeculum. Moderate confidence on the exact abbreviation form; it could be sęclm̃ rather than sęlin.
“nr” (line 5): Abbreviated noster — standard contraction. High confidence.
“principib⁹” (line 9): The superscript 9-like mark is the standard abbreviation for -us or -bus, giving principibus. High confidence.
“popło” (line 12): The letter between pop and o is slightly ambiguous — could be l with a stroke or an abbreviated form; Vulgate reads populo. Moderate confidence.
“scificatio” (last line): This is almost certainly sanctificatio with a contraction (sc̃ificatio — the sancti- compressed). The Vulgate reads sanctificatio eius. The abbreviation mark may not be visible at this resolution. Moderate-high confidence.
The faded mirror-text visible on the right margin is ink bleed-through from the verso; it does not affect the transcription.
The decorated initials S, A, E, Q, S, U, Q, I, F in the left margin are versicle initials and are not part of the text body proper.

And this:

This is Old English in an insular/Anglo-Saxon minuscule hand. Several letters are clipped at the edges. Here is the transcription:

Line 1: …o · þam þſiode þꝑud · þehe a[…]
Line 2: …ulope · rcop þam þſilogan þ[…]
Line 3: …tolſane · helle hſtaþꝑ · hſanot[…]
Line 4: …ꝑ · ꝑꝑæcna biden · oðop oþſan[…]
Line 5: …gurta þſuþouꝛ · þahe hit gſ[…]
Line 6: …te · berþild · þuꝑle gſinnod · gſ[…]
Line 7: …yle · nece ꝑnſade lege · heht h[…]
Line 8: …noꝛ · þlaxan þꝛce bnogun · hæ[…]

Uncertainty notes:

General caveat — high: This is a fragmentary crop with all lines cut at both left and right edges, which makes contextual disambiguation very difficult. Without surrounding text, word boundaries and word identification are substantially harder.
Script identification: The hand is insular minuscule, consistent with an Anglo-Saxon manuscript, probably 10th–11th century. The ſ (long s), þ (thorn), ð (eth), and characteristic letterforms confirm Old English.
“þſiode” / “þſilogan” / “þſuþouꝛ” (lines 1, 2, 5): The ſ after þ is puzzling — in OE þ and ſ don’t normally cluster this way. These may instead be þ followed by a different letter, or the first letter may be p (which can look very similar to þ in this hand). Low–moderate confidence on these readings; the distinction between p, þ, and ƿ (wynn) is critical and difficult at this resolution.
“rc̈op” / “rcop” (line 2): Could be ræop, rcop, or a word beginning clipped at left. Low confidence.
“hſtaþꝑ · hſanot” (line 3): The hſ cluster is suspicious for the same reason as above — possibly hſ = hs with long s, or the first letter is misread. Low confidence.
“ꝑꝑæcna biden” (line 4): The ꝑ abbreviation typically expands to per- or pro- in Latin contexts, but in OE manuscripts it can stand for þær- or similar. “biden” could be bidan (to wait/remain) — plausible OE vocabulary. Moderate confidence on biden/bidan.
“oðop oþſan” (line 4): oðop could be oððe (or) + oþ (until/up to). Moderate confidence.
“berþild” (line 6): This looks like a personal name — Berþild or Beorþild — which would suit a charter or hagiographic context. Moderate confidence.
“þuꝑle gſinnod” (line 6): þuꝑle is uncertain; gſinnod likely ends a word clipped at right. Low confidence.
“þlaxan þꝛce bnogun” (line 8): These are the most uncertain readings in the fragment. þlaxan and bnogun don’t match obvious OE lexical items — likely the result of edge-clipping distorting word beginnings. Low confidence.

Overall assessment: The text is genuine Old English prose (not verse — no alliterative structure visible), possibly from a charter, saints’ life, or homily. A confident full transcription would require the complete manuscript page.
David Marjanović says

June 7, 2026 at 12:22 pm

“Diplomatic transcription” is a technical term; I don’t know what it means, though.

Please see the “reasoning models” I mentioned above. That’s what they’re doing, but they’re applying it step-by-step to complex problems, and they report the intermediate steps.

Or do they make them up afterwards, basing the choice of words on what’s likely to be in such a report?
Y says

June 7, 2026 at 12:32 pm

The header line of the Yiddish ms. begins with טאראטא ‘Toro[n]to’, if that helps, for a start.

The AI produced somewhere between 99% and 100% gibberish, now New and Improved with gibberish error bars, by request.

A diplomatic transcription means one that aims to reproduce every aspect of the written manuscript, for example line breaks, cross-outs, insertions, etc. It’s a variable concept, depending on how closely the transcription follows the manuscript.
Stephen Rowland says

June 7, 2026 at 6:50 pm

A (non-diplomatic) transcription of the Old English above:

and þam werode wrað þe he æ[r wurðode
wlite and w]uldre. Sceop þam werlogan
w[ræclicne ham weorce] to leane,
helleheafas, heard[e niðas.
Heht þæt witehu]s wræcna bidan,
deop, drea[ma leas, drihten ure,]
gasta weardas, þa he hit ge[are wiste,
synnih]te beseald, susle geinnod,
g[eondfolen fyre and færc]yle,
rece and reade lege. Heht [þa geond þæt rædlease] hof
weaxan witebrogan. H[æfdon hie wrohtgeteme

Full text here, lines 35-45.
Jerry Friedman says

June 8, 2026 at 3:45 pm

Thank you, Hans, for the report of experience, and Y, for information including “diplomatic”.
Hans says

June 12, 2026 at 4:26 am

Today’s SMBC. It all makes sense now!
bulbul says

June 19, 2026 at 1:48 pm

What I want to know is where they got “600 million words from historical Greek texts.” There is a similar project at Princeton called Logion (https://huggingface.co/princeton-logion/logion-bert-base) which only aims to provide a tool that guesses the next word. That one was trained on some 50 million words, which is supposed to be the entirety of TLG.
