The Small World of English.

This site looks interesting, but the details are above my pay grade:

Building a word game forced us to solve a measurement problem: how do you rank 40+ ways to associate any given word down to exactly 17 playable choices? We discovered that combining human-curated thesauri, book cataloging systems, and carefully constrained LLM queries creates a navigable network where 76% of random word pairs connect in ≤7 hops—but only when you deprecate superconnectors and balance multiple ranking signals. The resulting network of 1.5 million English terms reveals that nearly any two common words connect in 6-7 hops through chains of meaningful associations. The mean path length of 6.43 hops held true across a million random word pairs—shorter than we’d guessed, and remarkably stable.

This is consistent with the small-world structure and near-universal connectivity seen in lexical network research on smaller datasets. The network’s structure makes intuitive semantic navigation possible—players can feel their way through meaningful transitions: a crown’s gemstones lead to emerald’s foliage and finally to a forest canopy, or a flame becomes an ember, then a glowing memory, a mental recall, and finally the action to cancel.

English exhibits network effects remarkably similar to social networks—nearly any random pair of words can reach each other in just a few hops through chains of meaningful associations. This “small world” phenomenon was first measured in word co-occurrence networks, and persists even after we deprioritize superconnector words that might otherwise dominate many paths. To probe this, we randomly sampled 1 million word pairs (4 days processing on 32 cores), to get a strong statistical sampling of the connected core of English.
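For readers who want to see what "sampling word pairs and counting hops" amounts to, here is a minimal sketch, not the site's actual pipeline. It assumes the associations form a directed graph stored as a plain dictionary; the names `hops`, `sample_pairs`, and `TOY` are hypothetical, and the toy edges are taken only from the crown-to-canopy chain quoted above.

```python
import random
from collections import deque


def hops(graph, start, goal):
    """Breadth-first search: fewest hops from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, dist = queue.popleft()
        for nxt in graph.get(word, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def sample_pairs(graph, n_pairs, max_hops=7, seed=0):
    """Sample ordered word pairs; return (mean hops over connected pairs,
    fraction of all sampled pairs connected within max_hops)."""
    rng = random.Random(seed)
    words = sorted(graph)
    lengths = []
    for _ in range(n_pairs):
        a, b = rng.sample(words, 2)  # two distinct words, order matters
        d = hops(graph, a, b)
        if d is not None:
            lengths.append(d)
    within = sum(1 for d in lengths if d <= max_hops)
    mean = sum(lengths) / len(lengths) if lengths else float("nan")
    return mean, within / n_pairs


# Hypothetical toy graph built from the quoted example chain.
TOY = {
    "crown": ["gemstone"],
    "gemstone": ["emerald"],
    "emerald": ["foliage"],
    "foliage": ["canopy"],
    "canopy": [],
}

mean_hops, share_within = sample_pairs(TOY, n_pairs=100)
```

Because the edges are directed, "crown" reaches "canopy" in four hops while "canopy" cannot reach "crown" at all, which is the same asymmetry behind the site's reachable/unreachable split.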

There’s much more at the link, including many charts and examples; there’s a section “Understanding Our Biases,” which is a good thing, and at the end there’s a “Making the Game” link which gives the background. (Via chavenet’s MeFi post.)

Comments

  1. David Eddyshaw says

    English exhibits network effects remarkably similar to social networks—nearly any random pair of words can reach each other in just a few hops through chains of meaningful associations

    How remarkable (or not) this is depends not on the number of hops but on how many “meaningful associations” each word typically has.

    You can make the number of hops as small as you like by increasing the number of associations.

    a crown’s gemstones lead to emerald’s foliage and finally to a forest canopy, or a flame becomes an ember, then a glowing memory, a mental recall, and finally the action to cancel.

    The examples suggest a degree of semantic latitude worthy of an Afro-Asiatic etymological dictionary.

    “6-7” actually strikes me as surprisingly high. Seems their “AI” is not very good at this …
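A rough back-of-the-envelope check on this point, under a loudly stated assumption: if the network behaved like a random graph in which each of N words has about k outgoing associations, the typical path length would be on the order of ln N / ln k. Real lexical networks are not random graphs, and the function name `random_graph_hops` is hypothetical, so this is only a sketch of the scaling argument, using the 1.5 million terms and top-40 lists mentioned in the post.

```python
import math


def random_graph_hops(n_words, associations_per_word):
    """Heuristic path-length scale for a random graph: ln(N) / ln(k)."""
    return math.log(n_words) / math.log(associations_per_word)


# Figures quoted in the post: about 1.5 million terms, 17 or 40 associations each.
for k in (17, 40, 400):
    print(k, round(random_graph_hops(1_500_000, k), 2))
```

More associations per word always shrinks the estimate, which is the sense in which the hop count alone says little without the number of associations.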

  2. jack morava says

    This is quite new to me but I have to differ, and suspect that the 6/7 steps bound/estimate is what makes LLMs work as well as they do. A 6/7 length path is a point in a 6 or 7-dimensional space, and (as Barbie and David Hilbert say), that kind of navigation is hard.

    -> The examples suggest a degree of semantic latitude worthy of an Afro-Asiatic etymological dictionary.

    I should belt up and look at the paper but IIUC that would be the point. This reminds me of an all-time fave

    https://www.ucpress.edu/books/the-tibeto-burman-reproductive-system/paper

  3. David Eddyshaw says

    Your understanding of connectivity undoubtedly surpasses mine.

  4. J.W. Brewer says

    I just want to note that the book Jack Morava recommends is remarkably cheaply-priced for a scholarly work on a niche topic published by a university press. Good for the press in question! Unless they afforded it by sacking all their copy editors, of course.

  5. David Marjanović says

    emerald’s foliage
    a glowing memory

    ~:-| I guess I fail the Turing test tonight.

  6. David Eddyshaw says

    Soon, only a human being will have the necessary creative spark to fail a Turing test.

  7. Jonathan D says

    Embers as memories is in Merriam Webster, no blaming the LLM for that.

  8. i always forget that matisoff is a specialist in an area entirely unrelated to the work i know of him from (a peril of being one’s own key informant on a project, i suppose). i hope his writing is just as flavorful on matters tibeto-burman!

  9. English exhibits network effects remarkably similar to social networks …

    Are these social networks the ‘Six Degrees of Separation’ nonsense?

    … 95% of the letters sent out had failed to reach the target.

    Not only did they fail to get there in six steps, they failed to get there at all.

    I think that’s what the ‘Small World’ site is conceding:

    Our analysis revealed a fundamental division in the network:

    Reachable terms (56.8%): 870,522 words that appear in the top-40 associations of at least one other word
    Unreachable terms (43.2%): 662,903 words that never appear in any other word’s top-40 list
    The unreachable terms include rare compounds (“stewing in one’s own grease”), technical terminology (“thermodispersion”), proper nouns (“Besisahar”), and alternative capitalizations. While these terms can point to other words, no words point back to them strongly enough to rank in any top-40 list. This doesn’t affect puzzles—which start from common words …

    IOW if a word has a good income and lives in a nice middle-class suburb with a decent delicatessen, it’ll meet the “≤7 hops” to other bourgeois words. If it lives in the Rural South or inner-city slum where it can’t afford to go out to a coffee shop/wine bar every week, not so much.

  10. @AntC: I don’t think that 95% failure rate can be right. Certainly, some letters failed to reach the target in Medford, Massachusetts, but Milgram actually took that into account. The “six degrees of separation” thing gets misapplied all the time. It’s not the maximum number of steps between two people in America (or the world). It’s the average length of the letter chains that actually found their way to the target (implying that the average number of steps possible between two Americans is actually smaller, since the participants in the experiment obviously had to guess who to send the letter to next). And, what’s relevant here, is that because there seemed to be a uniform probability of a recipient just throwing a letter away rather than passing it on, longer chains were more likely to fail than shorter ones. So, while the average completed chain length was about six, if all chains were completed, the average would actually have been about seven.
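The attrition argument can be illustrated with a small calculation. This is a sketch under invented assumptions, not Milgram's data: the true chain lengths are taken to be spread evenly from 4 to 10 steps (mean 7), and every hop independently has a 25% chance of the letter being thrown away. The name `completed_mean` and both numbers are hypothetical.

```python
def completed_mean(length_probs, drop):
    """Mean length among chains that complete, when each hop fails with probability `drop`."""
    # A chain of length L survives with probability (1 - drop) ** L.
    weights = {L: p * (1 - drop) ** L for L, p in length_probs.items()}
    total = sum(weights.values())
    return sum(L * w for L, w in weights.items()) / total


# Hypothetical distribution of true chain lengths: uniform over 4..10 steps.
true_lengths = {L: 1 / 7 for L in range(4, 11)}
true_mean = sum(L * p for L, p in true_lengths.items())
observed_mean = completed_mean(true_lengths, drop=0.25)
```

With these made-up inputs the completed chains average a little under six steps even though the true mean is seven, because long chains are the ones most likely to be lost along the way.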

  11. J.W. Brewer says

    Wait, so does the claim “nearly any random pair of words can reach each other in just a few hops” exclude by fiat all pairs including one or two of the 43.2% of the total universe of words deemed “unreachable terms”? That’s a peculiar usage of “nearly any,” since less than a third (0.568 squared) of random pairs drawn from the larger set will meet that criterion.
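The arithmetic can be checked from the counts AntC quoted above. This sketch assumes both words of a pair are drawn independently and uniformly from the full list of terms, which is an assumption about the sampling and not something the site states; the variable names are hypothetical.

```python
# Counts quoted from the site in the comment above.
reachable = 870_522
unreachable = 662_903
total = reachable + unreachable

reachable_share = reachable / total        # share of terms that are reachable
both_reachable = reachable_share ** 2      # chance that both words of a random pair are reachable

print(f"{reachable_share:.3f} {both_reachable:.3f}")
```

Squaring the share gives a figure just under one third, matching the comment.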

  12. IOW if a word has a good income and lives in a nice middle-class suburb with a decent delicatessen, it’ll meet the “≤7 hops” to other bourgeois words. If it lives in the Rural South or inner-city slum where it can’t afford to go out to a coffee shop/wine bar every week, not so much.

    To take the human side of that analogy, I’d bet that what makes the difference for most Americans is not one’s routine in-person social life but higher education and the military, and these days, whatever connections on social networks would be considered meaningful.

    Incidentally, I give the authors points for starting with “small world” instead of “degrees of separation”, and for not picking a “superconnector” and finding the distance of every word from it so the closer ones can brag.

  13. jack morava says

    One way to interpret this model is as an image of molecular/solid state physics, with words as molecules in some physical medium, each with its own chemical affinities. In general a molecule at point x will interact with its neighbor nearby at point y which may interact with a neighbor at point z nearby, u.z.w. until you’re far enough away for the propagated ripple to have died out. If the medium isn’t too turbulent or unsteady, this expected/average time or number of steps may be a useful statistical variable.

    It can happen in (models for) solid state physics (in particular, the Landau-Ginzburg model for superconductivity) that this ‘coherence length’ can go to zero or infinity, which is interpreted as a phase change. It may make sense to interpret the ‘navigable network where 76% of random word pairs connect in ≤7 hops’ as the condensed phase (e.g. like mayonnaise, with droplets of unconsolidated oil & water mixed in, to account for the missing 24 or so percent).

    [I’ve spent considerable time this morning googling for references for this but there is such an immense literature on this topic in solid state/soft matter physics as to make it very hard to find sensible nontechnical accounts of what is effectively a poetic image. I learned this from a talk by Edward Witten, explaining why water often becomes opalescent with tiny bubbles just before the phase transition to boiling.]

  14. David Marjanović says

    Soon, only a human being will have the necessary creative spark to fail a Turing test.

    I mean, I’m not creative enough to endow an emerald with however metaphorical foliage, or even to describe a memory as “glowing” which people have evidently done before.

    at point z nearby, u.z.w.

    Heh.

  15. PlasticPaddy says

    @dm
    I am not sure the step from memory to glowing is a direct one, rather memory => remnant => ash/ember => glowing ash/ember.

  16. jack morava says

    @ JW , does

    https://escholarship.org/content/qt3c40r8jv/qt3c40r8jv.pdf

    work for you (or anybody)?

    @ rozele, my wife was in a class he taught at Columbia when we met & since then …
    I have old notes of his on Bondage and Dominance Theory

  17. @J. W. B.: Wait, so does the claim “nearly any random pair of words can reach each other in just a few hops” exclude by fiat all pairs including one or two of the 43.2% of the total universe of words deemed “unreachable terms”? That’s a peculiar usage of “nearly any,” since less than a third (0.568 squared) of random pairs drawn from the larger set will meet that criterion.

    It’s a good question, but for one thing, “word” isn’t coterminous with “term” here, because “term” includes such locutions as “stewing in one’s own grease”.

    Also the sense of “any” may be more like “If you pick any two words at random from the corpus, in nearly every case, you can get from one to the other in a few hops,” since the unreachable words are very rare. If that’s what the authors mean, I might still call it peculiar, but less so.

    @DM: I assume the connection between “emerald” and “foliage” is “emerald foliage” and “emerald-green foliage”.

  18. @jack: See also percolation theory, about which I know nothing.

  19. Rodger C says

    “stewing in one’s own grease”

    Who makes a stew using grease? I think it’s “stewing in one’s own juices.”

  20. Stu Clayton says

    Who makes a stew using their own juices ?

  21. Trond Engen says

    A Stu? Surely somebody did.

  22. jack morava says

    @ Jerry Friedman,

    I agree, percolation theory is interesting, but I really do know nothing about it.

    It’s a side issue but questions about what’s known vs what’s known to be known bother me…
