Extracting Books from LLMs.

The arXiv paper _Extracting books from production language models_ by Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, and Percy Liang is alarming but not in the least surprising. The abstract:

Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model’s weights during training, and whether those memorized data can be extracted in the model’s outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure […]. With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer’s Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.

Écrasez l’infâme ! And if you’re tired of thinking about the evils of LLMs, I bring you news of An Old Welsh Reader, edited by Simon Rodway:

This reader contains edited texts, with English translations, of all the independent texts extant in manuscripts of the ninth, tenth, and eleventh centuries, with a selection of twelfth-century texts. They are accompanied by extensive notes and glossaries, along with an introduction which considers the prehistory of Welsh and its relationship with other Celtic languages. The volume also contains a comprehensive list of the sources of Old Welsh and an outline grammar: the first specifically dedicated to Old Welsh to appear in English. Appendices contain editions of one of the very few ancient Celtic texts from Britain, the Bath pendant, and the only sizeable text in another early medieval Brittonic language, the Old Cornish portion of the Leiden leechbook.

Now that’s my idea of a good time.

Comments

  1. David Eddyshaw says

    I bring you news of An Old Welsh Reader

    Brill! Unfortunately, it doesn’t look like it’ll be out in time for my birthday …

    [As for the rest: LLMs are powered by truly massive shameless theft, and are marketed by systematic deception; and their proprietors actively support fascism both financially and by undermining the foundations of democracy. Apart from that, I have no quarrel with them.]

  2. J.W. Brewer says

    Who could forget that great pulp mystery _Perry Mason and the Case of the Leiden Leechbook_?

  3. Truly, the Leiden Leechbook is a great name.

  4. David Eddyshaw says

    Hey, there’s nothing funny about leechcraft!

    I’ve always supposed that the not-very-cuddly animals get their name from us pillar-of-the-community types, but Wiktionary seems not to think so:

    https://en.wiktionary.org/wiki/leech

    Medical leeches were still in occasional use in the memory of some of my ophthalmology teachers, and apparently still are, even now, just about:

    https://en.wikipedia.org/wiki/Hirudo_medicinalis

    Sadly, I have no experience of such uses myself.

  5. OED (1901 entry) sez: “Commonly regarded as a transferred use of leech n.¹ [‘physician’]; this is plausible, but the forms Old English lyce, early Middle English liche, Middle Dutch lieke, suggest that the word was originally distinct, but assimilated to lǽce leech n.¹ through popular etymology.”

  6. David Marjanović says

    …but Wiktionary says: “From Middle English leche (“blood-sucking worm”), from Old English lǣċe (“blood-sucking worm”), akin to Middle Dutch lāke (“blood-sucking worm”; > modern Dutch laak).” And indeed laak means, among unrelated things, “(dated) leech”, synonym bloedzuiger, i.e. bloodsucker.

    At the same time, the vowel fits the physician-and-surgeon word: Gothic lekeis, borrowed into Slavic as лѣкарь with a separately borrowed suffix.
