John Cowan wrote me as follows:
I found the following sentence in the Kindle edition of a story by Josephine Tey: “To my unbounded relief, however, Lizbeth lapsed suddenly from the borders of hysteria to her normal fwent calm.”
“Fwent”?
Googling the first part of the sentence shows simply “her normal calm” in other editions, so “fwent” is probably not a typo for some other word (and if it were, what word could it be?) Googling for “fwent calm” shows no other instances. Bizarre.
Any guesses what went wrong? My only guess is that it is some kind of markup (not HTML) that infiltrated the text, like the 1805 KJV Bible edition where “to remain” (apparently being used in place of “stet”) wound up being printed in Gal 4:29, making it read “But as then he that was born after the flesh persecuted him that was born after the Spirit to remain, even so it is now.” It *almost* makes sense.
I for one am baffled, and I’m curious what the assembled Hatters make of it.
It is surely a typo. F is next to G, and Gwent is a Welsh preserved county (presumably well known for the stoic calmness of its inhabitants).
One of my friends once used the word “parostroj” (Czech for steam engine) when writing his diploma thesis to mark sections he intended to revisit later. (The thesis was about astrophysics, not steam engines.) Quite naturally one of the steam engines evaded final editing and survived till the submitted version. The committee members thought it was there to test whether someone actually reads the thesis at all.
covfefe. Even a typo can get you into the headlines these days.
But it’s clearly not a typo, since the intended sentence has “Lizbeth lapsed suddenly from the borders of hysteria to her normal calm,” with nothing that could be twisted into “fwent.” If someone were literally typing in the text for this edition, one might guess their fingers slipped and produced a little gibberish without their noticing, but surely that’s not how things happen these days.
It’s lenition with epenthesis.
Okay, I’ve given you a hint. You guys should be able to take it from there…
More seriously, I wonder whether it’s the kind of typo that occurs when someone is typing elsewhere and then bumps the cursor, doesn’t recognize what happened and goes back to where they meant to be–“I fucking went… (whoops, what happened?) Anyway, I went to three stores and couldn’t find the right model.” A common laptop error for those without the good sense to disable touchpad tapping.
It’s like a bit of a chapter heading got in when a page break in the original was removed, or the stray words they used to print at the bottom of the page to match them up.
I’ve done a bit of DP Canada proofreading, and the oddest things do get into the scans, although they try to make sure they all come out again.
Maybe the editor meant to hit CTRL+F followed by “went”, but missed CTRL?
It’s like a bit of a chapter heading got in when a page break in the original was removed, or the stray words they used to print at the bottom of the page to match them up.
I’ve done a bit of DP Canada proofreading, and the oddest things do get into the scans, although they try to make sure they all come out again.
Thanks, I guess that’s the best we’re going to do unless someone involved with the production of this edition drops by and says “Shit, I can’t believe that stayed in there! Here’s how it happened…”
Perhaps hat is a little old to put on his dancin’ shoes for this, but I present you with “La Fwent,” an actual musical composition from circa 2003. (This is the original mix; there are remixes out there.) https://www.youtube.com/watch?v=pL63HkDn9KU
Good lord! That raises two questions: was Pedro Delgardo involved in the production of this book, and where did *he* get the name from? (And yes, I’m too old for that techno stuff.)
FWIW one generally-quite-reliable music-reference website reports that “Pedro Delgardo” is a stage name for Pete Gawtry, who is said to be a “Techno/Electro/Electronica DJ & producer from Leeds, UK.” “Gawtry” isn’t quite an anagram for “Gwent” but it’s in the neighborhood?
“Gawtry of Gwent” sounds like a medieval romance.
One day I saw upon the stair
an errant fwent that wasn’t there.
It fwasn’t there again today.
Oh how I wish it fwent away.
The ostrich and the merkin
soared up high in fwenglish phoome.
We fwent astray again, Deare Fwends,
Parr for the coarse?
Or a modern historical novel.
I present you with “La Fwent,” an actual musical composition …
Name clearly a tribute to Discos Fuentes, which “has often been described as Colombia’s version of ‘Motown’, …”, for which I’d happily put on my dancing shoes/Buena Vista Social Club vibe.
BTW how long am I supposed to put up with that “La Fwent”‘s intro before it actually starts?
That long introduction is meant to bring you back from the borders of hysteria to a place where you can hear the subtler variations.
I think maybe the practical answer to AntC’s “how long” question is that with a sufficient dosage of C11H15NO2 [sorry for lack of subscript numbers] it just wouldn’t bother him?
Maybe the editor meant to hit CTRL+F followed by “went”, but missed CTRL?
This seems to me like the best suggestion yet. And it’s true that books are rarely typed in by hand (other than by the original author) any more, but the final stage of OCR correction still has to be done by hand.
an errant fwent that wasn’t there
It was perhaps chasing a rapidly vanishing fnord. In any case, a lovely verse. The original of Wendy from Peter Pan seems to have gotten her nickname from fwiendie < friend+ie.
C11H15NO2
C₁₁H₁₅NO₂, using Unicode subscript digits, typed as AltGr+q followed by the digit on the Moby Latin keyboard and its UK relatives. AltGr+q in general provides smart quotes, em and en dashes, and other such punctuational rarities; the superscript digits are just AltGr+digit.
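For anyone without that keyboard layout, the same substitution can be done in a few lines of Python. This is only a minimal sketch: the function name `subscript_formula` is made up for illustration, and it blindly converts every ASCII digit, which is fine for a simple formula like this one but would not handle charges or leading coefficients.

```python
# Translation table from ASCII digits to the Unicode subscript digits
# U+2080 (subscript zero) through U+2089 (subscript nine).
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def subscript_formula(formula: str) -> str:
    """Return the formula with every ASCII digit replaced by its subscript form.

    Hypothetical helper for illustration; letters and other characters
    are passed through unchanged.
    """
    return formula.translate(SUBSCRIPTS)


print(subscript_formula("C11H15NO2"))  # prints C₁₁H₁₅NO₂
```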
My first thought is that it’s a glitch produced by OCR software. If the original is scanned at an angle, or has a crease in the page, or has columnated text or some other unusual formatting (this last one being unlikely in the case of a novel), then a word can easily jump to the wrong place in the OCR-produced text, especially using cheap software.
The misplaced word in this case would also have been misread by the software. Could there be a nearby passage missing the word “fluent”, for example?
John Cowan: …the final stage of OCR correction still has to be done…
I’m going to stop you right there. Lots of commercially produced e-books have clearly never had a human read-over between the OCR and publication.* Works by major authors actually get the OCR mistakes corrected, but I would not put Josephine Tey in that category.
* I recently paid a couple of dollars for a digital copy of Kothar, Barbarian Swordsman by Gardner Fox—supposedly one of the best pastiches of Robert E. Howard’s Conan stories. (Fox makes no attempt to hide the fact that Kothar is a reskinned version of Conan. Plenty of the proper names are clearly chosen to sound similar to Howard’s. Moreover, decades later, when the Conan stories were adapted into a comic book series by Marvel, Fox—who was a famous comic writer himself, creator of The Flash, Hawkman, and Doctor Fate—allowed his Kothar stories to be adapted as Conan stories when Marvel needed material for additional issues.) There are glaring OCR errors on practically every page, and at one crucial point several whole lines are missing. No human ever checked the text before it was made available for sale.
I suspect the various Gutenberg projects are much better checked than some commercial versions of out-of-copyright texts – it takes a lot of volunteer hours, and you’re not going to pay for the work if you’re going to end up selling it for 49p the lot. Although some of them just use the Gutenberg text, of course!
This text, though commercial, is fairly clean — I would notice.
@Jen in Edinburgh: Yes, thanks to Distributed Proofreaders (which I used to be pretty involved with too, specializing in technical works), the texts at the Project Gutenberg sites tend to be very clean. I don’t know why anyone releasing a (superfluous) e-book edition of an out-of-print work would use any other version, except perhaps due to a misunderstanding of copyright laws. It’s e-books for things that are still in copyright (like Kothar, Barbarian Swordsman, since Fox only died in 1986) but which are not expected to sell many copies that really tend to be the pits. The quality of the scan from which the text was extracted, and the specific software used, can make for big differences in the qualities of the e-book products.
It’s a wonder to me how OCR got stuck where it has been for years now. An undistracted human reading printed text of reasonable quality will have a Zero Point Zero error rate, including identifying different scripts. The best commercial OCR programs will boast something like maybe one error per page. I imagine that the market for OCR these days is not scholars, but people digitizing office materials and legal evidence that nobody will read anyway.
[fnord]
The best commercial OCR (for legal contracts, e.g., which lawyers certainly do read) is improving steadily. An example is ABBYY. I haven’t tested any of the OCR readers that claim to be AI-based, but I expect that if they aren’t that great now, they certainly will be.
@rozele:
There’s a typo in your comment.
I have used ABBYY quite a lot. It’s the best, and it’s so-so. And it’s a pain to get it from so-so to a little better than so-so.
I was just trying out the entirely non-commercial ocrmypdf on my non-searchable pdf of Lukas Neukom’s grammar of Nateni. It managed the French text pretty well – enough to make that part reliably searchable, anyhow – though (forgivably) it seems to have given up altogether on the diacritic-heavy Nateni.
Y: It’s certainly much better in some languages than others. What languages were you using? And had you paid the extra $$$$ to unlock the higher levels?
DE: OCRing depends entirely on having a predictive model of the language being OCRed (and the same is true for speech recognition). Without that, the results are not even so-so.
@DE: but you can see it!
JC: I used ABBYY 14, I think the $$$ version (i.e. $130 ten years ago, not quite $$$$), which lets you train it if it has problems with a particular font. I have used it mostly for linguistic texts written in European languages about obscure languages, some using a variety of ad-hoc diacritics. For English/French/German it’s “acceptable”, meaning that if you search the text, you are likely to find what you are looking for. For anything else it’s hopeless, unless, for each book, you spend many hours fighting the adversarial training user interface; if you do, it gets closer to “acceptable”.
A human, for comparison, could transcribe any of these texts, diacritics and all, with 100.0% accuracy, without any language model.
with 100.0% accuracy
I wouldn’t put 100.0% on anything human; typos exist. Though I guess 99.97% (corresponding to roughly one typo per 1-2 pages, depending on font size) is quite achievable with some care, and that would round to 100.0%.
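The back-of-envelope arithmetic behind that figure can be sketched in Python. It assumes that accuracy is counted per character and that a page holds somewhere between 1,800 and 3,000 characters; both page sizes, and the function name `chars_per_typo`, are assumptions for illustration, not anything measured.

```python
def chars_per_typo(accuracy_percent: float) -> float:
    """Average number of characters typed per error at a given per-character accuracy."""
    error_percent = 100.0 - accuracy_percent
    return 100.0 / error_percent


# Assumed page sizes in characters (dense vs. sparse typesetting); purely illustrative.
ASSUMED_CHARS_PER_PAGE = (3000, 1800)

typo_interval = chars_per_typo(99.97)  # roughly one typo every 3,333 characters
for page_size in ASSUMED_CHARS_PER_PAGE:
    pages = typo_interval / page_size
    print(f"{page_size} chars/page: about one typo every {pages:.1f} pages")
```

With those assumed page sizes the interval comes out between one and two pages, which matches the estimate above.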
A more important consideration is that in human-transcribed texts the typos are usually less frequent in the weird bits (e.g. foreign text with diacritics), because those are more carefully looked over. Old-style OCR would have a lot less of an idea of what to do with that sort of thing; modern “AI”-based OCR would probably be prone to straight-up inventing stuff that vaguely looks like it could be there.
which lets you train it if it has problems with a particular font
I think this concept goes back to at least the 1990s. (It wasn’t perfect back then either.)
It occurs to me that the title of this post is (most likely accidentally) parallel to the full title of Bram Stoker’s novel, which is Dracula: A Mystery.
Quite the opposite. German and French references in scientific papers in English are almost invariably misspelled unless of course enough of the authors speak one of these natively.
I want to add a caveat to my comment about how Project Gutenberg generally has good quality texts. This is not so much true for the books that were posted in the early days of the project, before a proofreading system was established. Unfortunately, this includes a lot of the most interesting works. I have recently been reading the Memoirs of General W. T. Sherman (document number 4361 on the site), which is riddled with errors. Most of the problems are obvious OCR mistakes, but there are formatting and other kinds of errors as well. Moreover, while most of the mistakes are easy to spot and mentally correct, not all of them are. I am genuinely unsure, from what I have been reading, whether there was a brigadier general named “Smart,” or whether that is just an occasional error for “Stuart.”
It’s Stuart. Search in the archive.org version (or in GBooks, which had digitized it).
It’s a 12-inch single and as such gets in full flow immediately — or about as full as it gets, anyway (techno really isn’t known for its fullness). But they do pretty often also build the song up somewhat gradually to help with live mixing, e.g. here subtle bass (re)drop around 1:20, hihats drop at 1:40.