Making Sense of Urban Dictionary.

I presume we’ve all used Urban Dictionary from time to time and been both enlightened (so that’s how the kids are talking today!) and amused (beer: “Possibly the best thing ever to be invented ever. I MEAN IT.”). I always vaguely wondered how useful it was from a scientific point of view, and now I have Dong Nguyen, Barbara McGillivray, and Taha Yasseri’s paper “Emo, Love, and God: Making Sense of Urban Dictionary, a Crowd-Sourced Online Dictionary” to tell me. Here’s the abstract:

The Internet facilitates large-scale collaborative projects. The emergence of Web 2.0 platforms, where producers and consumers of content unify, has drastically changed the information market. On the one hand, the promise of the “wisdom of the crowd” has inspired successful projects such as Wikipedia, which has become the primary source of crowd-based information in many languages. On the other hand, the decentralized and often un-monitored environment of such projects may make them susceptible to systematic malfunction and misbehavior. In this work, we focus on Urban Dictionary, a crowd-sourced online dictionary. We combine computational methods with qualitative annotation and shed light on the overall features of Urban Dictionary in terms of growth, coverage and types of content. We measure a high presence of opinion-focused entries, as opposed to the meaning-focused entries that we expect from traditional dictionaries. Furthermore, Urban Dictionary covers many informal, unfamiliar words as well as proper nouns. There is also a high presence of offensive content, but highly offensive content tends to receive lower scores through the voting system. Our study highlights that Urban Dictionary has a higher content heterogeneity than found in traditional dictionaries, which poses challenges in terms of processing but also offers opportunities to analyze and track language innovation.

There’s a discussion of the article at “The Anatomy of the Urban Dictionary,” by Emerging Technology from the arXiv (do their friends call them Em or ET?):

The team also compare the lexical coverage of Urban Dictionary and Wiktionary. It turns out that the overlap is surprisingly small—72 percent of the words on Urban Dictionary are not recorded on Wiktionary.

However, the team note that many words on Urban Dictionary are relevant to only a small subset of users. Many are nicknames or proper names such as Dan Taylor, defined as “A very wonderful man that cooks the best beef stew in the whole wide world.” These usually have only one meaning. […]

The work provides a unique window into a website that has come to play an important role in popular culture. That should set the scene for other studies. In particular, an interesting question is whether online dictionaries not only record linguistic change but actually drive it, as some linguists suggest.

Via MetaFilter.


  1. To continue a pattern of leaving a suggestion for a new post at the end of a recent but likely defunct thread (so as not to hijack the new one):

    I thought these were interesting assertions about the development of the Anatolian Indo-European languages, arguing, in keeping with the direction pointed to by some recent genetic evidence, that these are younger rather than older branches of IE. I wondered what more skilled linguists might make of it:

    >There is pretty good linguistic evidence to suggest that the relative contributions of language contact and drift to language change is much more heavily weighted towards language contact than is widely assumed (I’ll save the detailed evidence of that for another day.)

    >Also, other Indo-European languages had substrate languages which were all much more similar to each other as a result of their common origins in First Farmer languages which themselves have common origins, and if there is a shared change in a group of languages due to parallel interactions with similar substrates, this makes the direction and nature of language change non-random and makes these languages look younger than they really are.


  2. David Marjanović says

    “I’ll save the detailed evidence of that for another day”, says the author. Well then. We’ll have to wait.

    I agree, though, that the cited recent papers strongly suggest that the speakers of the Anatolian languages lacked Yamnaya ancestry and came not from the west but from the east.

  3. A couple of the data points involve analysis of what is flawed in the work of New Zealander Quentin Atkinson (who, among other things, has developed language origin branching trees that are much older than known calibration points, in part by failing to consider the impact of language contact) and the remarkably conservative state of the Icelandic language relative to Old Norse.

  4. David Marjanović says

    >in part by failing to consider the impact of language contact

    How so?

    What they did, however, was to code the presence/absence of each root as a separate character, instead of coding each root with the same meaning as a different state of a single character. That made all branches far too long.
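    The difference between the two coding schemes described above can be shown with a toy sketch. It is not the published analysis: the language names, root labels, and helper functions are all invented for illustration, and it only counts raw character differences rather than fitting any phylogenetic model.

```python
# Hypothetical data: for one meaning, the root each language uses.
# All names here are made up for illustration.
roots_for_meaning = {"lang_a": "root1", "lang_b": "root1", "lang_c": "root2"}


def multistate_coding(data):
    """One character per meaning; each root is a state of that character."""
    return dict(data)


def binary_coding(data):
    """One presence/absence character per root."""
    all_roots = sorted(set(data.values()))
    return {
        lang: [1 if root == r else 0 for r in all_roots]
        for lang, root in data.items()
    }


def differences_multistate(coded, a, b):
    """A root replacement counts as a single state change."""
    return 0 if coded[a] == coded[b] else 1


def differences_binary(coded, a, b):
    """A root replacement counts once per binary character that differs."""
    return sum(x != y for x, y in zip(coded[a], coded[b]))


multi = multistate_coding(roots_for_meaning)
binary = binary_coding(roots_for_meaning)

print(differences_multistate(multi, "lang_a", "lang_c"))
print(differences_binary(binary, "lang_a", "lang_c"))
```

    In this toy case a single replacement of one root by another registers as one difference under multistate coding but as two under binary coding (one root lost, one gained). How much that inflates branch lengths in a real analysis depends on the model used.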
