Graphing the Distribution of English Letters.

David Taylor at prooffreader.com (“data hacking and sundry curiosities”) posted back in 2014 a very nice graph titled “Distribution of English letters toward beginning, middle and end of words.” He adds:

I’ve had many “oh, yeah” moments looking over the graphs. For example, words almost never begin with “x”, but it’s quite common as the second letter. There’s a little hump near the beginning of “u” that’s caused by its proximity to “q”, which is most common at the beginning of a word. When you remove “q” from the dataset, the hump disappears. “F” occurs toward the extremes, especially in prepositions (“for”, “from”, “of”, “off”) but rarely just before the middle.

A final thought: the most common word in the English language is “the”, which makes up about 6% of most corpuses (sorry, corpora). But according to these graphs, the most representative word is “toe”.

What fun!
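
For anyone who wants to poke at the idea, here is a minimal Python sketch of one way to tabulate where each letter tends to fall within a word. It is not Taylor's code or his exact method: the function name `position_histograms`, the number of bins, and the treatment of one-letter words are assumptions made for illustration, and each word in the input list counts once rather than being weighted by its frequency in a corpus.

```python
from collections import defaultdict


def position_histograms(words, bins=5):
    """Map each letter to its share of occurrences in each relative-position bin.

    Bin 0 is the start of the word and bin (bins - 1) is the end.
    Each letter's row is normalized by that letter's own total, so rare
    and common letters are shown on the same 0-to-1 scale.
    """
    counts = defaultdict(lambda: [0] * bins)
    for word in words:
        word = word.lower()
        n = len(word)
        for i, ch in enumerate(word):
            if not ch.isalpha():
                continue
            if n == 1:
                # Assumption: a one-letter word counts as "start".
                b = 0
            else:
                # First letter -> bin 0, last letter -> last bin.
                b = min(i * bins // (n - 1), bins - 1)
            counts[ch][b] += 1
    return {ch: [c / sum(row) for c in row] for ch, row in counts.items()}


# Tiny hand-picked sample, only to show the call; a real run needs a word list.
sample = ["the", "toe", "quiz", "of", "off", "for", "from"]
for letter, shares in sorted(position_histograms(sample, bins=3).items()):
    print(letter, [round(s, 2) for s in shares])
```

Normalizing per letter means the rows say nothing about how frequent one letter is relative to another, only where each letter tends to sit.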

Comments

  1. marie-lucie says:

    I am a fan of cryptograms (essentially letter-replacement puzzles, mostly in sentences) and in trying to solve them you learn to become very conscious of such distributions. Solving crossword puzzles involves distributions too, but as most of them deal with single words rather than sentences, very common function words like the or an, which affect the overall distribution of the letters that make them up, rarely occur in those puzzles.

  2. f,s are more common initially and finally, whereas v,z are more common medially. Huh.

    I could ad hoc explain it as devoicing or as voicing assimilation, but I don’t know if that is the right explanation.

  3. It would be interesting to repeat the experiment looking at pharmaceutical brand names. My guess is that many of the graphs would come out looking like the mirror image of these ones.

  4. Ian Myles Slater says:

    “f,s are more common initially and finally, whereas v,z are more common medially”

    This immediately reminded me of the rules for pronouncing Old English, but I couldn’t remember the details. However, I have “Bright’s Old English Grammar and Reader” (third edition, corrected printing, 1971) at hand, which explains the situation regarding f, s, and thorn/edh having dual values:

    “they represent voiced sounds when they occur singly (not doubled) between voiced sounds (except when the first is part of a prefix, e.g., the f in gefeoh remains [f]). Elsewhere they remain voiceless sounds.”

    Which I found fairly opaque when I first encountered it — and still do. However, the editors provided instances reflected in modern English spelling.

    For example, for voiced f [v] it provides ofer, efne, haerfest (over, even, harvest), and for voiceless, feld, aefter, hof (field, after, hoof). (Yes, the “hof” should have a long mark.)

    The f/v spelling switch, when it came, did not extend throughout, as witness of/off (which still reflects the doubling rule).

  5. Which I found fairly opaque when I first encountered it — and still do.

    It seems parallel to what happened in Romance, where (generally) intervocalic /s/ has become /z/, while intervocalic /ss/ has remained /ss/ or /s/. Basically the geminates are “stronger” and more resistant to voicing.

  6. @m-l, me too! And I agree the word distributions in crossword puzzles or cryptograms don’t correspond to usual corpora.

    I see my rules of thumb are validated: the word-final letters are more likely to be ‘e’ or ‘s’.

    I’d like to see a similar graphing of the positions of doubled letters: ‘ee’, ‘oo’, ‘ss’, ‘ff’, ‘zz’, …

    @Y: as Taylor points out, he hasn’t scaled the Y (ahem) axis for overall relative frequency of each letter. I guess ‘v’, ‘z’ are overall less frequent than ‘f’ or especially ‘s’. I don’t find his averages so puzzling.

  7. Geminates and consonant clusters tend to resist lenition quite often. For voicing, Uralic, Dravidian and new Indo-Iranian are other good examples (per some descriptions, Tamil retains an allophonic singleton voicing rule to this day), but also begadkefat spirantization in Hebrew and Aramaic leaves geminates unaffected.

  8. Bathrobe says:

    Looking at his post on The trendiest words in American English for each decade of 19th & 20th c. (determined by a chemistry/astronomy technique):

    Also present [in the 19th century] are deliberately misspelled words like “uv” for “of” and “ter” for “to” (like “Ah oughts ter uv dun somethin”).

    Given that American English tends to be rhotic, I’m curious why “ter” would have been so common in American contexts. This kind of spelling was common in Australian writing at the turn of the 20th century (e.g., C. J. Dennis) and fitted the non-rhotic nature of Australian English. Is it possible that spellings like “ter” were a phenomenon all over the English-speaking world?

  9. I looked at COHA for the 1880s, the dominant decade for ter. It appears in dialogue spoken by African Americans and rural white Southerners, both non-rhotic groups. Here’s a paragraph from the 1883 novel The Red Acorn; the speaker is a white Southerner:

    “Cuss-an’-burn the blasted ole smooth-bore,” said Fortner, contemptuously. “Don’t waste no tear on that ole kick-out-behind. We’ll go ‘long ‘tween Wildcat an’ the Ford, an’ pick up a wagon-load uv ez good shooters ez thet clumsy chunk o’ pot-metal wuz. Shake yourself together. We’ve on’y got a mile or so ter go now.”

    Note that wuz represents /wʌz/ at a time when this was a non-standard pronunciation in American English; it is standard today. In Huckleberry Finn it is the same two groups (including Finn-as-narrator) whose speech Twain writes with wuz, whereas he uses was for the higher-class speakers, presumably representing /wɑz/ or /wɒz/. The 1890 Century Dictionary gives /woz/ as the only pronunciation, where /o/ represents the LOT vowel.

  10. Presumably Br’er (Rabbit) was intended to be pronounced “bruh” (even though people now rhyme it with “rare”)?

  11. I’ve always assumed so.

  12. I never realized that!

  13. Eli Nelson says:

    Actual phonetic rhotacization of some word-final schwas also occurs in “some Southern dialects, such as Appalachia and the Ozarks” according to “R-Dissimilation in English”, by Nancy Hall (2007, p. 30).

    It seems somewhat unlikely to me that rhotic speakers would use “ter” as a written representation of the sound “tuh”. Of course, it could be the case that non-rhoticity was so widespread at this time that it was considered standard, as in modern British English (where “ter” is I guess used as eye-dialect and to show that it is not being pronounced “too”).

  14. I think ter develops in two stages. In the first, it is a rhotic representation of intrusive /r/ as used by non-rhotics (but not AAVE speakers), where ought to have is pronounced /ɔtərʌv/ and written “ought ter uv”. In the second, ter comes to be a conventional fixed spelling for non-rhotic dialects even in places where linking /r/ does not appear, like ter go here. I haven’t done the work to track this down, to be sure.

  15. The use of intrusive r in to, you, etc. is less common than intrusive r in general. It’s passably common in the dialects of England, from what I gather, but doesn’t show up in modern RP – nor can I recall hearing it here in the Northeastern US. (Myself, I used intrusive r through adolescence but never had it in those.)

  16. J.W. Brewer says:

    I’m not sure about some of the fine details of the methodology here, but the pictures are pretty to look at and sometimes suggestive of meaningful patterns one might not have otherwise noticed. I recently saw (don’t know where but I think not here?) an interesting link which has a somewhat different approach to presenting data on the same underlying phenomenon. http://norvig.com/mayzner.html (scroll down until you get to “Letter Counts by Position Within Word”).

  17. “In the second, ter comes to be a conventional fixed spelling for non-rhotic dialects even in places where linking /r/ does not appear, like ter go here.”

    It may be not just conventionalised spelling, but a misunderstanding of the dialect. Rhotic speakers imitating non-rhotic speech often add erroneous /r/’s due to misanalysing linking /r/; i.e., hearing /ði aɪˈdɪər ɪz/ with rhotic ears, they conclude that idea is realised as /aɪˈdɪər/ in general. As a Brit living in the States for some years, I heard such imitations many times…

  18. An excellent point.

  19. Rhotic idear seems to be a special case. Back in 1972, I often heard firmly rhotic Presidential candidate George McGovern from firmly rhotic South Dakota talk about “new idears”, and he certainly was not imitating or mocking anyone. Somehow this form had gotten lexicalized for him.

  20. I’ve heard non-prevocalic idear from rhotic New Englanders too.

  21. Rodger C says:

    In the South, ideal is often used to mean idea. There seems to be a tendency to feel that idea is somehow incomplete. Has it been remodeled after dear and deal respectively?
