The arXiv paper Extracting books from production language models by Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, and Percy Liang is alarming but not in the least surprising. The abstract:
Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model’s weights during training, and whether those memorized data can be extracted in the model’s outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure […]. With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer’s Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
Écrasez l’infâme ! And if you’re tired of thinking about the evils of LLMs, I bring you news of An Old Welsh Reader, edited by Simon Rodway:
This reader contains edited texts, with English translations, of all the independent texts extant in manuscripts of the ninth, tenth, and eleventh centuries, with a selection of twelfth-century texts. They are accompanied by extensive notes and glossaries, along with an introduction which considers the prehistory of Welsh and its relationship with other Celtic languages. The volume also contains a comprehensive list of the sources of Old Welsh and an outline grammar: the first specifically dedicated to Old Welsh to appear in English. Appendices contain editions of one of the very few ancient Celtic texts from Britain, the Bath pendant, and the only sizeable text in another early medieval Brittonic language, the Old Cornish portion of the Leiden leechbook.
Now that’s my idea of a good time.
I bring you news of An Old Welsh Reader
Brill! Unfortunately, it doesn’t look like it’ll be out in time for my birthday …
[As for the rest: LLMs are powered by truly massive shameless theft, and are marketed by systematic deception; and their proprietors actively support fascism both financially and by undermining the foundations of democracy. Apart from that, I have no quarrel with them.]
Who could forget that great pulp mystery _Perry Mason and the Case of the Leiden Leechbook_?
Truly, the Leiden Leechbook is a great name.
Hey, there’s nothing funny about leechcraft!
I’ve always supposed that the not-very-cuddly animals get their name from us pillar-of-the-community types, but Wiktionary seems not to think so:
https://en.wiktionary.org/wiki/leech
Medical leeches were still in occasional use in the memory of some of my ophthalmology teachers, and apparently still are, even now, just about:
https://en.wikipedia.org/wiki/Hirudo_medicinalis
Sadly, I have no experience of such uses myself.
OED (1901 entry) sez: “Commonly regarded as a transferred use of leech n.¹ [‘physician’]; this is plausible, but the forms Old English lyce, early Middle English liche, Middle Dutch lieke, suggest that the word was originally distinct, but assimilated to lǽce leech n.¹ through popular etymology.”
…but Wiktionary says: “From Middle English leche (“blood-sucking worm”), from Old English lǣċe (“blood-sucking worm”), akin to Middle Dutch lāke (“blood-sucking worm”; > modern Dutch laak).” And indeed laak means, among unrelated things, “(dated) leech”, synonym bloedzuiger, i.e. bloodsucker.
At the same time, the vowel fits the physician-and-surgeon word: Gothic lekeis, borrowed into Slavic as лѣкарь with a separately borrowed suffix.
OED is suggesting that the unpleasant animal actually had a different vowel originally, I think, and has been assimilated to the “physician” word in form by folk-association of the meaning. But that implies that this assimilation eventually happened in Dutch as well as in English, which is possible, I suppose, though it gives one pause.
Well, the Dutch and the English didn’t exactly live on opposite sides of the earth; I can imagine the “leech = doctor” idea passing from one to the other, since the words were so similar.
Unfortunately German As I Know It doesn’t help: “bloodsucking leech” is Blutegel (Egel having a venerable and completely unrelated IE pedigree), “physician or surgeon” is Arzt < none less than archiater (umlaut systematically removed from the singular; plural Ärzte).
I’ve asked AI about translations of the first paragraph of Death in Venice. They invariably produce garbled versions that make anything they say completely unreliable. They read like some kind of cobbled together version that follows the sense but not the actual wording. When challenged they admit that they don’t actually know the real version. I suspect that’s because they are explicitly not allowed to quote the actual wording, but it’s disconcerting that they serve up their cobbled-together version as though it’s the real thing. Definitely credibility issues here.
Just to clarify, I believe the Dutch word means, and has only ever meant, “bloodsucker”. Similar words in other Germanic varieties mean (or meant, if they are obsolete) either “doctor” or “bloodsucker”, but not both, except for Old English (and by inheritance Modern English). For example, Gothic exhibits the “doctor” word. Even if the “doctor” word is older, the “bloodsucker” word could have been originally separate. See
https://etymologiebank.nl/trefwoord/laak2
“physician or surgeon” is Arzt < none less than archiater
English “doctor” and French docteur (rather than médecin) have spread widely in West Africa, but undergone some odd sea-changes in the process of being passed from language to language.
Hausa has ended up with likita somehow; Kusaal has du’ata, which I think is based on Mampruli dogta, as being the form you’d have expected if the words had actually been cognates (or something.)
Gulimancema has lotoli, or logitoli for those with ample time on their hands.
Mooré has lògtórè, which doesn’t look too odd apart from the initial l; but it has also decided on the plural logtoɛɛmba, which must be by analogy with mórè “Muslim”, plural moɛɛmba because why not?
Dunno what the deal is with all these initial l’s. All these languages do have an initial /d/, though Mooré often rhotacises it.
In Swedish “doctor” is läkare and there’s a verb läka “to heal”.
“Leech” is blodigel, similar to German.
Ah, so Finnish lääkäri actually is cognate with “leech”.
Wiktionary has exceptionally good coverage of Finnish etymology, and I have to say I was not prepared for how much of it is just layer upon layer of Germanic loanwords.
Norw. Bm. lege m. “doctor”, lege v. “heal (medically)” < Da., Nyn. (off.) lækjar m. “doctor”, lækje v. “heal (med.)”, also i.a. lækjedom “healing”. “Leech” is (blod)igle.
Slavic лѣкарь looks rather like a borrowing from North Germanic.
“plural Ärzte”
I’ve heard that the German punk band Die Ärzte chose their name because they thought there weren’t enough bands in the Ä section in (I assume at the time) record shops.
@Nelson Goering: The usual story (although it has been denied by members of the band,* and wikipedia calls it “the legend”) about the name of the San Francisco rock group the Beau Brummels is that they chose their name (in 1964, to be clear) so that their records would be located immediately next to those of the Beatles in stores using alphabetical order for their inventory, which most did.
*I don’t think they deny that they chose the name to give their American group a vaguely British-sounding aura at a time when the so-called British Invasion was transforming the U.S. music biz. See also the suspiciously-British overtones of the name of their contemporaries the Sir Douglas Quintet, who hailed from Austin, Tex.
Going back to the OP, I’m not sure why this is particularly shocking. Plenty of “original” non-fiction books written by human authors on previously well-trodden subjects are the result of the author’s “research” having consisted of having read 6 or 10 previous works on the same subject and then freely drawn on their content in a way that does sufficient mixing-and-matching of sources and rephrasing of wording to avoid actual copyright infringement. If you somehow figured out how to override/disable the “safety measures” that keep those writers and their publishers from being sued for copyright infringement they would probably give you recognizable long block quotes from recognizable underlying sources. They’ve been trained not to do that (or not to do it so unsubtly as to get caught), but that’s not because of some ineffable spiritual essence of human authorship.
If you think automated programs ripping off authors and human authors borrowing from other human authors, with greater or lesser degrees of honesty, are pretty much the same thing, we’ll have to agree to disagree.
I don’t understand where or how you are drawing a line between “ripping off” and “borrowing … with greater or lesser degrees of honesty.” Other than of course the supposed proverb that goes (wordings vary) something like “Immature artists borrow; mature artists steal.”
Recently I’ve been receiving suggestions that I register for payment under the Anthropic LLM settlement. I’ve been in two minds whether to do that, but I have to make up my mind very soon because the deadline is Monday. If they want my bank details, then forget it, but otherwise, why not. Only one of my books is concerned.
My wife is engaged in this currently, as one of the heirs of her father, who was quite widely published in the US.
Our erstwhile across-the-road-neighbour too (who now lives in France.)
Anything that costs Anthropic money is a public service, though in a better world the entire modus operandi of LLM “training” would of course be illegal (as opposed to merely grossly unethical.)
I doubt there ever was a record shop with an Ä section; anything starting with Ä would be put into the A section.
@Athel Cornish-Bowden, any US settlements I’ve been party to in recent years—and lately it’s been about one per year and all small amounts—have disbursed funds in paper checks. I have no idea how they handle non-US claimants.
For settlements, I usually get a virtual debit card.
Yes, but it’s not by any means limited to East Slavic.
citing them, making clear which information was taken from which source… unless of course the publisher insisted on having neither foot- nor endnotes.
Maybe very large stores had separate Sch and St sections too?
@David M.: You’re joking, right? I’m not talking about nerdy niche academic books with copious citations and notes and total sales in the high three figures – I’m talking about the sort of normal generic mass-market schlock written by hacks that dominates the publishing industry.
Enough to make clear that the Germanic merger of Pre-Gmc *ā into *ō resulted in *ā first, which then turned into *ō – from Germanic data alone, plus Caesar even, you could bbbbarely guess at it. Also, it followed the merger of *o into *a.
ѣ would not prompt Ä in an LLM, probably. There’s not enough correlation, even though they sound the same.