Josh Sucher writes:
Last week, my brother and I took in a screening of the 1976 classic Network that just happened to be captioned. As a result, it really struck me how impressive the vocabulary in that movie is. Immane! Oraculate! Auspicatory! So many of what my dad used to call 50¢ words.
So I went home and spent a few hours making this, a list of words found in the dialogue of Network, ranked by their estimated frequency in the English language. I used a Python library called wordfreq (which, sadly, was deprecated last fall, a decision its creator partially attributed to the prevalence of AI slop making it impossible to analyze human word usage after 2022).
I decided to add definitions to my list of esoteric Network words, which turned out to be an interesting challenge. Rare words are… rare! Every dictionary API has some different subset of them. It took a few to flesh out the list.
The wordfreq data was so compelling that I decided to keep pulling the thread on this, and after a few late nights I am very happy to share Lettervoxd. Lettervoxd is a tool that extracts esoteric words from about 25,000 movies from the past century. It lists (nearly) every one-in-a-billion word that can be found in the giant corpus of subtitles I downloaded from Open Subtitles.
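For readers curious what "ranked by estimated frequency" might look like in practice, here is a minimal sketch, not Josh's actual code. It assumes a frequency lookup that returns a word's share of running English text (0.0 if the word is unknown); the `rare_words` helper, the `TOY_FREQ` table, and all the numbers in it are made up for illustration.

```python
import re
from typing import Callable

def rare_words(
    text: str,
    freq: Callable[[str], float],
    threshold: float = 1e-9,
) -> list[tuple[str, float]]:
    """Return (word, frequency) pairs at or below `threshold`, rarest first.

    Words the lookup has never seen (frequency 0.0) are dropped, on the
    assumption that they are mostly names or transcription errors.
    """
    # Crude tokenizer: lowercase runs of ASCII letters only.
    words = set(re.findall(r"[a-z]+", text.lower()))
    scored = [(w, freq(w)) for w in words]
    rare = [(w, f) for w, f in scored if 0.0 < f <= threshold]
    # Sort by frequency, then alphabetically so ties are stable.
    return sorted(rare, key=lambda pair: (pair[1], pair[0]))

# Hypothetical frequencies, invented for this example only.
TOY_FREQ = {
    "the": 5e-2,
    "network": 1e-4,
    "oraculate": 8e-10,
    "immane": 5e-10,
    "auspicatory": 9e-10,
}

sample = "The network! Immane! Oraculate! Auspicatory! Zzyzx."
result = rare_words(sample, lambda w: TOY_FREQ.get(w, 0.0))
print([w for w, _ in result])
```

With wordfreq installed, something like `lambda w: word_frequency(w, "en")` could presumably stand in for the toy lookup. Dropping zero-frequency words is a design choice: it loses genuinely unattested rarities, but it also filters out some of the misspellings that subtitle files are full of.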
More details, as well as links and images, at Josh’s page. When you go to the Lettervoxd site, click on a word to see the movies it’s been used in. What a great thing to create!
Just poking around the site, as one does, I noticed the word “paddywhack” which it says means “threshed unmilled rice”. You may also recognize it from “knick knack paddy whack, give a dog a bone”… for which the internet produces a lot of commentary and speculation, none of which refers to the milling of rice.
I used a Python library called wordfreq (which, sadly, was deprecated last fall, a decision its creator partially attributed to the prevalence of AI slop making it impossible to analyze human word usage after 2022).
Have things already come to this pass? This year I notice weird locutions and “typos” turning up more frequently on Spiegel and Politico. The lyrics from Musixmatch on Spotify are full of mistakes like “your” for “you’re” and “beaconing” for “beckoning”.
HI is starting to imitate AI. Life follows Art!! It has ever been thus, ¿no?
Coincidences happen. Hickory appears in a string of seven nonsense syllables in a children’s rhyme that was first written down before the American tree became known in English.
Clearly Josh did not come across ADHD or OCD in the wordlist. Those concepts (and maybe Josh himself) did not exist in 1976.
Oof.
I like the idea a lot, but, as with Google ngrams, transcription errors spoil the fun. When I saw abdominous ‘paunchy’ I got excited; but the supposed source is Se7en, “the transverse abdominous muscles.” I suppose you could use the word this way but more likely someone didn’t know how to spell abdominis.
I noticed the word “paddywhack” which it says means “threshed unmilled rice”.
Odd. Not in the OED, which has (entry revised 2005):
Aha, that will be the OED’s paddy 1.a. “Now frequently in form padi. Rough or unhusked rice (Oryza sativa), either as a growing crop or when harvested but not yet threshed.” No whack in sight.
… which has a history that can be traced far back: from AHD, “Malay padi, rice plant, rice in the field, unhusked rice, from Proto-Malayo-Polynesian *pajay, from Proto-Austronesian, rice plant.” Other descendants from this root can be seen at Wiktionary.
I don’t know if the OED — which didn’t go any farther than “Malay padi … Compare Javanese pari” — could have looked up that etymology when they revised paddy in 2005; Wiktionary cites The Austronesian Comparative Dictionary from “(2010–)”. But it’s too bad that they entered palay, “Philippine English. Rice that has not been husked” in 2018 without inquiring into its origins any further than “Tagalog”; in fact the Tagalog and Malay words are cognate, and I’d think they could have found that out at the time.
the prevalence of AI slop making it impossible to analyze human word usage after 2022
An editing colleague and I have independently noted, especially in academic English, “advancement” where the noun “advance” would have been expected, blurring a traditional distinction: “the advancement of learning” versus “recent advances in crystallography”. I attribute this shift to the known surge in recourse to AI, and predict that ngram evidence (up to 2022, so far) will show this even more strongly as later years are added to the database.
Lettervoxd “pharyngeals” disappointed me