Fragments, Ellipsis, and Sluicing.

“A lot of talk is fragments—it’s the kind of thing we understand reflexively as human beings, but it’s much harder for machines,” notes Jim McCloskey, professor of linguistics at UC Santa Cruz. “Linguistic theory teaches us what kind of structures there are in our mind, but how to make sense of these fragments is also a nuanced engineering problem.”

This problem appeals both to researchers like McCloskey, who has dedicated his work to understanding language, and to Silicon Valley tech companies seeking to build mobile devices (phones, tablets, and more) that can understand and decode the subtleties of human language.

And in the search for solutions, UC Santa Cruz students helping with this research have found that, after graduating, they can apply their knowledge and research skills as analytical linguists at tech companies big and small. […]

McCloskey notes that speakers and writers often leave out informationally redundant grammatical material—such as when the verb “call” is omitted in “Jay Z called, but Beyoncé didn’t.” This process, known as ellipsis, is widespread across the languages of the world, and is particularly common in informal language and dialogue.

Among the many varieties of ellipsis is “sluicing,” where what is omitted is not a verb, but an entire sentence. For example, a speaker may leave out the understood sentence “he called” after “why” in a sentence like: “He called, but I don’t know why [he called].”

Ellipsis creates challenging scientific and engineering problems. Research over the past 50 years has shown that the principles permitting ellipsis draw on many different types of information (grammatical structure, context, real-world knowledge), but the precise mix of these factors, and how they interact, is still an open question.

Progress to date has been held back by the lack of one crucial resource: databases that are large enough to validate theories and rich enough to form the basis for machine learning.

At UC Santa Cruz, McCloskey is collaborating with faculty and students in the language sciences to develop that resource: a richly annotated database of naturally occurring ellipsis. It will be freely available to researchers around the globe who are trying to understand what ellipsis implies about the nature of human language.

