AI + Language Learning = Whee!

Carolyn Y. Johnson reports for the Washington Post (February 2, 2024) on helping AI to pick up basic elements of language:

For a year and a half, a baby named Sam wore a headcam in weekly sessions that captured his world: a spoon zooming toward his mouth, a caregiver squealing “Whee!” as he whizzed down an orange slide or a cat grooming itself. Now, scientists have fed those sights and sounds to a relatively simple AI program to probe one of the most profound questions in cognitive science: How do children learn language?

In a paper published Thursday in the journal Science, researchers at New York University report that AI, given just a tiny fraction of the fragmented experiences of one child, can begin to discern order in the pixels, learning that there is something called a crib, stairs or a puzzle and matching those words correctly with their images. […]

Linguists, philosophers, cognitive scientists and — increasingly — AI developers have all been puzzling over how humans learn language.

For years, scientists have been trying to understand how children’s minds take shape through carefully controlled experiments. Many involve toys or puppets that allow researchers to probe when various cognitive skills come online. They’ve shown that 16-month-old babies can deploy statistical reasoning to determine whether a noisemaker is broken, and that babies as young as 5 months know that an object still exists even when they can’t see it, a key developmental milestone called object permanence.

In addition, some individual babies have been closely followed over time. Deb Roy, a scientist at the Massachusetts Institute of Technology, set up overhead cameras in all the rooms of his house in 2005 and recorded his son’s linguistic development, providing a massive trove of data that chronicled the acquisition and evolution of words. That work suggested it was not how many times a word was repeated that predicted whether Roy’s son learned it early, but whether it was uttered in an unusual spot in the house, a surprising time or in a distinctive linguistic context.

The innovative use of headcams has given researchers an even more intimate view of early childhood. Since 2013, several families have contributed to the SAYCam database, a collection of audiovisual recordings from individual babies and toddlers over a crucial period of cognitive development, between 6 and 32 months. Families of the babies, who are identified only by first name, put cameras mounted on headbands on their children for about two hours a week.

Scientists can apply for access to the data, which provides a unique window into each child’s world over time and is intended to be a resource for researchers across a variety of fields. Sam, whose identity is private, is now 11 years old. But the recordings of his early life in Australia provided Lake and his colleagues with 600,000 video frames paired with 37,500 transcribed words of training data for their AI project.

They trained their relatively simple neural network on data captured when Sam was between the ages of 6 months and 2 years. The AI, they found, learned to match basic nouns and images with similar accuracy to AI trained on 400 million images with captions from the web. The results wade into, but don’t solve, a long-running debate in science about the basic cognitive skills humans need built into their brains to learn language.

There are various theories of how humans learn language. High-profile linguist Noam Chomsky proposed the idea of a built-in, innate language ability. Other experts think we need social or inductive reasoning skills for language to emerge. The new study suggests that some language learning can occur in the absence of specialized cognitive machinery. Relatively simple associative learning — see ball, hear “ball” — can teach an AI to make matches when it comes to simple nouns and images. “There’s not anything inbuilt into the network giving the model clues about language or how language ought to be structured,” said study co-author Wai Keen Vong, a research scientist at NYU.

The researchers don’t have comparable data on how a 2-year-old would perform on the tasks the AI faced, but they said that the AI’s abilities fall short of those of a small child. For instance, they could track where the AI was focusing when prompted with various words and found that, while it was spot-on for some words such as “car” or “ball,” it was looking in the wrong area when prompted with “cat.” […]

The AI picked up its vocabulary of objects from being exposed to 1 percent of Sam’s waking hours — 61 hours of footage accumulated over a year and a half. What intrigued outside scientists about the study was both how far the AI got based on that, and how far it still had to go to recapitulate human learning. “It’s really important and new to be applying these methods to this kind of data source, which is the data from a single child’s experience, both visual and auditory,” said Joshua Tenenbaum, a computational cognitive scientist at MIT who was not involved in the work. […]

Michael Tomasello, a developmental and comparative psychologist at Duke University, said that the AI model might reflect how a dog or a parrot can learn words. Experiments show that some dogs can learn more than 100 words for common objects or stuffed animals. But, he pointed out, it remains unclear how this AI could take sensory input and glean verbs, prepositions or social expressions. “It could learn that a recurrent visual pattern is ‘doll’. But how does it learn that that very same object is also a ‘toy’? How does it learn ‘this’ or ‘that’ or ‘it’ or ‘thing’?” Tomasello wrote in an email.

The AI model trained on the child’s experience, he noted, was able to identify things that can be seen, and that’s just a small part of the language that children hear and learn. He proposed an alternative model, where instead of simply associating images with sounds, an AI would need to make inferences about the intention of communication to learn language.

It’s interesting stuff, and seems like a case where AI (as I suppose we must call it) might actually be useful.
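
For readers who want a concrete picture of the "see ball, hear 'ball'" associative learning described in the quoted article, here is a minimal toy sketch. It is not the NYU team's neural network, which learned from raw video frames. It only counts co-occurrences between heard words and symbolic labels that stand in for what is in view. The episode data, the uppercase object labels, and the names `train`, `association`, and `best_referent` are all hypothetical, invented for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data: each episode pairs the objects in view (uppercase
# labels standing in for images) with the utterance heard at that moment.
EPISODES = [
    ({"BALL", "CRIB"}, "look at the ball"),
    ({"BALL", "CAT"}, "the ball rolls"),
    ({"CAT", "STAIRS"}, "the cat is on the stairs"),
    ({"CRIB", "CAT"}, "the cat is in the crib"),
    ({"STAIRS", "BALL"}, "ball on the stairs"),
    ({"CRIB"}, "time for the crib"),
]


def train(episodes):
    """Count how often each word, each object, and each pair occurs."""
    word_counts, object_counts, pair_counts = Counter(), Counter(), Counter()
    for objects, utterance in episodes:
        words = set(utterance.split())  # count each word once per episode
        word_counts.update(words)
        object_counts.update(objects)
        pair_counts.update(product(words, objects))
    return word_counts, object_counts, pair_counts


def association(word, obj, counts):
    """Score a word-object pair as P(obj | word) * P(word | obj)."""
    word_counts, object_counts, pair_counts = counts
    joint = pair_counts[(word, obj)]
    if joint == 0:
        return 0.0
    return joint * joint / (word_counts[word] * object_counts[obj])


def best_referent(word, counts):
    """Return the object most strongly associated with the word."""
    _, object_counts, _ = counts
    return max(object_counts, key=lambda obj: association(word, obj, counts))


COUNTS = train(EPISODES)

for word in ("ball", "cat", "crib", "stairs"):
    print(word, "->", best_referent(word, COUNTS))
```

On this toy data the concrete nouns each end up tied most strongly to their own object, while a word like "the" co-occurs with everything and gets no clear winner, which is a small-scale version of the limitation Tomasello raises about words that do not name a visible thing.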


  1. David Marjanović says

    Progress on the way to measuring how poor the stimulus really is.

  2. “Associative learning” is associated with…. behaviourism! À la Pavlov. And maybe Skinner? I don’t hold a brief for behaviourism, but Chomsky’s extremes seem to be an extreme reaction to Skinnerian ideas.

    I don’t particularly mind the general idea of a LAD. What I find objectionable is Chomsky’s identification of his extreme formalisms with the human ability to language. X-bar was a “neat” formalism to force all tree structures into a consistent format, adopted on the basis of Chomsky’s belief that they were “right” and later unceremoniously dumped.

  3. Chomsky’s proposals were definitely partially an overreaction against the Skinnerian orthodoxy of the time. (See, for example, my comments here.)

  4. I have my doubts. “Beware the Jabberwock, my son / The jaws that bite, the claws that catch! / Beware the WEIRDo-bird, and shun / The frumious Bandersnatch!”
