Extracting Books from LLMs.

March 27, 2026 by languagehat 29 Comments

The arXiv paper Extracting books from production language models by Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, and Percy Liang is alarming but not in the least surprising. The abstract:

Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model’s weights during training, and whether those memorized data can be extracted in the model’s outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure […]. With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer’s Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.

Écrasez l’infâme ! And if you’re tired of thinking about the evils of LLMs, I bring you news of An Old Welsh Reader, edited by Simon Rodway:

This reader contains edited texts, with English translations, of all the independent texts extant in manuscripts of the ninth, tenth, and eleventh centuries, with a selection of twelfth-century texts. They are accompanied by extensive notes and glossaries, along with an introduction which considers the prehistory of Welsh and its relationship with other Celtic languages. The volume also contains a comprehensive list of the sources of Old Welsh and an outline grammar: the first specifically dedicated to Old Welsh to appear in English. Appendices contain editions of one of the very few ancient Celtic texts from Britain, the Bath pendant, and the only sizeable text in another early medieval Brittonic language, the Old Cornish portion of the Leiden leechbook.

Now that’s my idea of a good time.

Comments

David Eddyshaw says

March 27, 2026 at 4:07 pm

I bring you news of An Old Welsh Reader

Brill! Unfortunately, it doesn’t look like it’ll be out in time for my birthday …

[As for the rest: LLMs are powered by truly massive shameless theft, and are marketed by systematic deception; and their proprietors actively support fascism both financially and by undermining the foundations of democracy. Apart from that, I have no quarrel with them.]
J.W. Brewer says

March 27, 2026 at 4:19 pm

Who could forget that great pulp mystery _Perry Mason and the Case of the Leiden Leechbook_?
languagehat says

March 27, 2026 at 4:30 pm

Truly, the Leiden Leechbook is a great name.
David Eddyshaw says

March 27, 2026 at 5:07 pm

Hey, there’s nothing funny about leechcraft!

I’ve always supposed that the not-very-cuddly animals get their name from us pillar-of-the-community types, but Wiktionary seems not to think so:

https://en.wiktionary.org/wiki/leech

Medical leeches were still in occasional use in the memory of some of my ophthalmology teachers, and apparently still are, even now, just about:

https://en.wikipedia.org/wiki/Hirudo_medicinalis

Sadly, I have no experience of such uses myself.
languagehat says

March 27, 2026 at 5:31 pm

OED (1901 entry) sez: “Commonly regarded as a transferred use of leech n.¹ [‘physician’]; this is plausible, but the forms Old English lyce, early Middle English liche, Middle Dutch lieke, suggest that the word was originally distinct, but assimilated to lǽce leech n.¹ through popular etymology.”
David Marjanović says

March 27, 2026 at 6:07 pm

…but Wiktionary says: “From Middle English leche (“blood-sucking worm”), from Old English lǣċe (“blood-sucking worm”), akin to Middle Dutch lāke (“blood-sucking worm”; > modern Dutch laak).” And indeed laak means, among unrelated things, “(dated) leech”, synonym bloedzuiger, i.e. bloodsucker.

At the same time, the vowel fits the physician-and-surgeon word: Gothic lekeis, borrowed into Slavic as лѣкарь with a separately borrowed suffix.
David Eddyshaw says

March 27, 2026 at 6:19 pm

OED is suggesting that the unpleasant animal actually had a different vowel originally, I think, and has been assimilated to the “physician” word in form by folk-association of the meaning. But that implies that this assimilation eventually happened in Dutch as well as in English, which is possible, I suppose, though it gives one pause.
languagehat says

March 27, 2026 at 6:24 pm

Well, the Dutch and the English didn’t exactly live on opposite sides of the earth; I can imagine the “leech = doctor” idea passing from one to the other, since the words were so similar.
David Marjanović says

March 27, 2026 at 6:27 pm

Unfortunately German As I Know It doesn’t help: “bloodsucking leech” is Blutegel (Egel having a venerable and completely unrelated IE pedigree), “physician or surgeon” is Arzt < none less than archiater (umlaut systematically removed from the singular; plural Ärzte).
Bathrobe says

March 27, 2026 at 7:45 pm

I’ve asked AI about translations of the first paragraph of Death in Venice. They invariably produce garbled versions that make anything they say completely unreliable. They read like some kind of cobbled together version that follows the sense but not the actual wording. When challenged they admit that they don’t actually know the real version. I suspect that’s because they are explicitly not allowed to quote the actual wording, but it’s disconcerting that they serve up their cobbled-together version as though it’s the real thing. Definitely credibility issues here.
PlasticPaddy says

March 27, 2026 at 7:50 pm

Just to clarify, I believe the Dutch word means, and has only ever meant, “bloodsucker”. Similar words in other Germanic varieties mean (or meant, if they are obsolete) either “doctor” or “bloodsucker”, but not both, except for Old English (and by inheritance Modern English). For example, Gothic exhibits the “doctor” word. Even if the “doctor” word is older, the “bloodsucker” word could have been originally separate. See
https://etymologiebank.nl/trefwoord/laak2
David Eddyshaw says

March 27, 2026 at 8:06 pm

“physician or surgeon” is Arzt < none less than archiater

English “doctor” and French docteur (rather than médecin) have spread widely in West Africa, but undergone some odd sea-changes in the process of being passed from language to language.

Hausa has ended up with likita somehow; Kusaal has du’ata, which I think is based on Mampruli dogta, as being the form you’d have expected if the words had actually been cognates (or something.)

Gulimancema has lotoli, or logitoli for those with ample time on their hands.

Mooré has lògtórè, which doesn’t look too odd apart from the initial l; but it has also decided on the plural logtoɛɛmba, which must be by analogy with mórè “Muslim”, plural moɛɛmba because why not?

Dunno what the deal is with all these initial l’s. All these languages do have an initial /d/, though Mooré often rhotacises it.
maidhc says

March 28, 2026 at 12:22 am

In Swedish “doctor” is läkare and there’s a verb läka “to heal”.

“Leech” is blodigel, similar to German.
Lameen says

March 28, 2026 at 3:18 am

Ah, so Finnish lääkäri actually is cognate with “leech”.

Wiktionary has exceptionally good coverage of Finnish etymology, and I have to say I was not prepared for how much of it is just layer upon layer of Germanic loanwords.
Trond Engen says

March 28, 2026 at 8:05 am

Norw. Bm. lege m. “doctor”, lege v. “heal (medically)” < Da., Nyn. (off.) lækjar m. “doctor”, lækje v. “heal (med.)”, also i.a. lækjedom “healing”. “Leech” is (blod)igle.

Slavic лѣкарь looks rather like a borrowing from North Germanic.
Nelson Goering says

March 28, 2026 at 9:58 am

“plural Ärzte”

I’ve heard that the German punk band Die Ärzte chose their name because they thought there weren’t enough bands in the Ä section in (I assume at the time) record shops.
J.W. Brewer says

March 28, 2026 at 10:28 am

@Nelson Goering: The usual story (although it has been denied by members of the band,* and Wikipedia calls it “the legend”) about the name of the San Francisco rock group the Beau Brummels is that they chose their name (in 1964, to be clear) so that their records would be located immediately next to those of the Beatles in stores using alphabetical order for their inventory, which most did.

*I don’t think they deny that they chose the name to give their American group a vaguely British-sounding aura at a time when the so-called British Invasion was transforming the U.S. music biz. See also the suspiciously-British overtones of the name of their contemporaries the Sir Douglas Quintet, who hailed from Austin, Tex.
J.W. Brewer says

March 28, 2026 at 10:37 am

Going back to the OP, I’m not sure why this is particularly shocking. Plenty of “original” non-fiction books written by human authors on previously well-trodden subjects are the result of the author’s “research” having consisted of having read 6 or 10 previous works on the same subject and then freely drawn on their content in a way that does sufficient mixing-and-matching of sources and rephrasing of wording to avoid actual copyright infringement. If you somehow figured out how to override/disable the “safety measures” that keep those writers and their publishers from being sued for copyright infringement they would probably give you recognizable long block quotes from recognizable underlying sources. They’ve been trained not to do that (or not to do it so unsubtly as to get caught), but that’s not because of some ineffable spiritual essence of human authorship.
languagehat says

March 28, 2026 at 10:56 am

If you think automated programs ripping off authors and human authors borrowing from other human authors, with greater or lesser degrees of honesty, are pretty much the same thing, we’ll have to agree to disagree.
J.W. Brewer says

March 28, 2026 at 11:00 am

I don’t understand where or how you are drawing a line between “ripping off” and “borrowing … with greater or lesser degrees of honesty.” Other than of course the supposed proverb that goes (wordings vary) something like “Immature artists borrow; mature artists steal.”
Athel Cornish-Bowden says

March 28, 2026 at 11:06 am

Recently I’ve been receiving suggestions that I register for payment under the Anthropic LLM settlement. I’ve been in two minds whether to do that, but I have to make up my mind very soon because the deadline is Monday. If they want my bank details, then forget it, but otherwise, why not. Only one of my books is concerned.
David Eddyshaw says

March 28, 2026 at 11:19 am

My wife is engaged in this currently, as one of the heirs of her father, who was quite widely published in the US.

Our erstwhile across-the-road-neighbour too (who now lives in France.)

Anything that costs Anthropic money is a public service, though in a better world the entire modus operandi of LLM “training” would of course be illegal (as opposed to merely grossly unethical.)
ulr says

March 28, 2026 at 11:25 am

there weren’t enough bands in the Ä section in (I assume at the time) record shops.

I doubt there ever was a record shop with an Ä section; anything starting with Ä would be put into the A section.
Craig says

March 28, 2026 at 11:52 am

@Athel Cornish-Bowden, any US settlements I’ve been party to in recent years—and lately it’s been about one per year and all small amounts—have disbursed funds in paper checks. I have no idea how they handle non-US claimants.
Brett says

March 28, 2026 at 12:05 pm

For settlements, I usually get a virtual debit card.
David Marjanović says

March 28, 2026 at 3:06 pm

Slavic лѣкарь looks rather like a borrowing from North Germanic.

Yes, but it’s not by any means limited to East Slavic.

Plenty of “original” non-fiction books written by human authors on previously well-trodden subjects are the result of the author’s “research” having consisted of having read 6 or 10 previous works on the same subject and then

citing them, making clear which information was taken from which source… unless of course the publisher insisted on having neither foot- nor endnotes.

anything starting with Ä would be put into the A section

Maybe very large stores had separate Sch and St sections too?
J.W. Brewer says

March 28, 2026 at 3:29 pm

@David M.: You’re joking, right? I’m not talking about nerdy niche academic books with copious citations and notes and total sales in the high three figures – I’m talking about the sort of normal generic mass-market schlock written by hacks that dominates the publishing industry.
David Marjanović says

March 28, 2026 at 3:39 pm

Wiktionary has exceptionally good coverage of Finnish etymology, and I have to say I was not prepared for how much of it is just layer upon layer of Germanic loanwords.

Enough to make clear that the Germanic merger of Pre-Gmc *ā into *ō resulted in *ā first, which then turned into *ō – from Germanic data alone, plus Caesar even, you could bbbbarely guess at it. Also, it followed the merger of *o into *a.
V says

March 28, 2026 at 3:50 pm

ѣ would not prompt Ä in an LLM, probably. There’s not enough correlation, even though they sound the same.
