Automatic Speech Recognition for Low-Resource Languages.

Maeve Reilly writes about an interesting initiative:

For those who speak English, or another language that is prevalent in First World nations, Siri or other voice recognition programs do a pretty good job of providing the information wanted. However, for people who speak a “low-resource” language—one of more than 99 percent of the world’s languages—automatic speech recognition (ASR) programs aren’t much help. Preethi Jyothi, a Beckman Postdoctoral Fellow, is working towards creating technology that can help with the development of ASR software for any language spoken anywhere in the world.

“One problem with automatic speech recognition today is that it is available for only a small subset of languages in the world,” said Jyothi. “Something that we’ve been really interested in is how we can port these technologies to all languages. That would be the Holy Grail.”

Low-resource languages are languages or dialects that don’t have resources to build the technologies that can enable ASR, explained Jyothi. Most of the world’s languages, including Malayalam, Jyothi’s native south Indian language, do not have good ASR software today. Part of the reason for this is that the developers do not have access to large amounts of transcriptions of speech—a key ingredient for building ASR software.

She and Mark Hasegawa-Johnson are trying something called “probabilistic transcription” which involves native English speakers transcribing languages they don’t know using nonsense syllables (the current project focuses on Arabic, Cantonese, Dutch, Hungarian, Mandarin, Swahili, and Urdu). It sounds weird, and I don’t get quite how it’s supposed to work, but I wish them every success. (Thanks, Andy!)
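
One way to picture the “probabilistic” part is the toy sketch below. It is a guess at the general idea and not the researchers' actual method. If several listeners each write down nonsense syllables for the same recording, their disagreements can be kept as a probability table for each syllable slot instead of forcing a single answer. The function name `slot_distributions`, the sample syllables, and the assumption that the transcripts are already lined up slot by slot are all hypothetical.

```python
from collections import Counter


def slot_distributions(aligned_transcripts):
    """Turn several aligned transcripts into one probability table per slot.

    Assumption (not from the article): every transcript has already been
    aligned so that position i in each list refers to the same stretch of audio.
    """
    if not aligned_transcripts:
        raise ValueError("need at least one transcript")
    n_slots = len(aligned_transcripts[0])
    if any(len(t) != n_slots for t in aligned_transcripts):
        raise ValueError("transcripts must all have the same number of slots")

    distributions = []
    # zip(*...) groups together what every listener wrote for the same slot.
    for slot in zip(*aligned_transcripts):
        counts = Counter(slot)
        total = sum(counts.values())
        # Relative frequency of each syllable heard in this slot.
        distributions.append({syl: n / total for syl, n in counts.items()})
    return distributions


# Hypothetical data: three listeners transcribe the same two-syllable word.
transcripts = [
    ["ma", "la"],
    ["ma", "ra"],
    ["na", "la"],
]
dists = slot_distributions(transcripts)
for i, dist in enumerate(dists):
    print(i, dist)
```

Here two of the three listeners heard “ma” in the first slot, so that slot gets roughly two-thirds “ma” and one-third “na”. A real system would presumably also have to model which sounds English speakers systematically mishear, which this sketch ignores.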

Comments

John Roth says

January 7, 2017 at 11:53 pm

According to the fount of all knowledge, ASR is a fancy name for speech-to-text, so they’re apparently trying to build a speech model that can recognize the phonemes of a language in the face of natural between-speaker variation, without having to train the recognizer. That’s what systems like Siri and Dragon Dictate do. Their probabilistic model seems to be an effort at labeling the phonemes to give the training AI a head start.
languagehat says

January 8, 2017 at 8:42 am

Ah, thanks.
John Cowan says

January 8, 2017 at 9:26 pm

Basically. But speech recognizers don’t identify individual phonemes, they identify strings of them, syllabic or larger.
Yuval says

January 9, 2017 at 12:36 am

But isn’t a lot of ASR about orthography? Are we assuming they have phonetic dictionaries for all the languages?
Lamia says

January 11, 2017 at 2:02 am

Thank you for sharing this.
However, I am a person of the old school who prefers direct communication and personal contact with people when it comes to conversation. I am not a machine-dependent person and do prefer direct contact of the eyes, hands and emotions (body language)! It is warmer, friendlier and more appreciative to communicate human to human, so we are able to share the smile of knowing and not knowing what the other speaker says.
Good day my human friends!
tangent says

January 11, 2017 at 7:11 am

“which involves native English speakers transcribing languages they don’t know using nonsense syllables”

I’ll have to check this out, but it does sound unusable on the face of it for languages that make phonemic distinctions that English doesn’t: “dark” versus non-dark /l/, say. Or tonal languages.
David Marjanović says

January 11, 2017 at 5:01 pm

body language

Overrated. I don’t speak it.
John Cowan says

January 12, 2017 at 7:15 pm

Unless you have paralyzed your face with Botox, you can hardly avoid speaking it.
David Marjanović says

January 13, 2017 at 7:33 am

I do smile on occasion, if that’s what you mean. But that’s pretty much it; I’ve even been called pokerface. You certainly can’t tell my emotions from the way I sit or stand, or from how much eye contact I make.
SFReader says

January 14, 2017 at 2:51 am

I think body language should not be confused with face language…