From the Google AI Blog comes news of a striking development in translation, posted by Ye Jia and Ron Weiss:
In “Direct speech-to-speech translation with a sequence-to-sequence model”, we propose an experimental new system that is based on a single attentive sequence-to-sequence model for direct speech-to-speech translation without relying on intermediate text representation. Dubbed Translatotron, this system avoids dividing the task into separate stages, providing a few advantages over cascaded systems, including faster inference speed, naturally avoiding compounding errors between recognition and translation, making it straightforward to retain the voice of the original speaker after translation, and better handling of words that do not need to be translated (e.g., names and proper nouns). […]
Translatotron is based on a sequence-to-sequence network which takes source spectrograms as input and generates spectrograms of the translated content in the target language. It also makes use of two other separately trained components: a neural vocoder that converts output spectrograms to time-domain waveforms, and, optionally, a speaker encoder that can be used to maintain the character of the source speaker’s voice in the synthesized translated speech. During training, the sequence-to-sequence model uses a multitask objective to predict source and target transcripts at the same time as generating target spectrograms. However, no transcripts or other intermediate text representations are used during inference. […]
By incorporating a speaker encoder network, Translatotron is also able to retain the original speaker’s vocal characteristics in the translated speech, which makes the translated speech sound more natural and less jarring. This feature leverages previous Google research on speaker verification and speaker adaptation for TTS. The speaker encoder is pretrained on the speaker verification task, learning to encode speaker characteristics from a short example utterance. Conditioning the spectrogram decoder on this encoding makes it possible to synthesize speech with similar speaker characteristics, even though the content is in a different language. […]
To the best of our knowledge, Translatotron is the first end-to-end model that can directly translate speech from one language into speech in another language. It is also able to retain the source speaker’s voice in the translated speech. We hope that this work can serve as a starting point for future research on end-to-end speech-to-speech translation systems.
Impressive, if it works as advertised; the audio samples they provide are short but sound good.
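For those who want the pipeline in the excerpt made a bit more concrete, here is a minimal sketch of the data flow as I understand it: source spectrogram in, an optional speaker embedding extracted, translated spectrogram out, and a waveform from the vocoder. The function names, dimensions, and stub bodies below are my own placeholders, not anything from Google’s actual code.

```python
import numpy as np

# Illustrative dimensions only (not from the paper).
N_MELS = 80        # mel-spectrogram channels
D_SPEAKER = 256    # size of the speaker embedding

def speaker_encoder(reference_spectrogram):
    """Optional component, pretrained on speaker verification: maps a short
    reference utterance to a fixed-size embedding of the speaker's voice.
    (Stub: mean-pools over time as a stand-in for the real network.)"""
    pooled = reference_spectrogram.mean(axis=0)       # (N_MELS,)
    return np.resize(pooled, D_SPEAKER)               # (D_SPEAKER,)

def seq2seq_translate(source_spectrogram, speaker_embedding=None):
    """The core attentive sequence-to-sequence model: source-language
    spectrogram in, target-language spectrogram out.  During training it
    also has auxiliary decoders predicting source and target transcripts
    (the multitask objective); those are unused at inference.
    Conditioning the decoder on the speaker embedding is what lets the
    output keep the source speaker's vocal characteristics.
    (Stub: returns a silent spectrogram of plausible length.)"""
    target_len = int(source_spectrogram.shape[0] * 1.1)
    return np.zeros((target_len, N_MELS))

def neural_vocoder(spectrogram, hop_length=200):
    """Separately trained vocoder: converts the predicted spectrogram
    into a time-domain waveform.  (Stub: returns silence.)"""
    return np.zeros(spectrogram.shape[0] * hop_length)

# Inference path: speech to speech, no text anywhere in between.
source = np.random.rand(500, N_MELS)     # ~5 s of source speech (fake data)
voice = speaker_encoder(source)          # optional: retain the original voice
translated = seq2seq_translate(source, voice)
waveform = neural_vocoder(translated)
print(waveform.shape)                    # e.g. (110000,)
```

The interesting part, of course, is the sequence-to-sequence model itself, about which this skeleton says nothing; the point is just that text never appears anywhere in the chain.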
vocoder
The word sounded familiar somehow; I even knew what it stands for and vaguely understood its function, but couldn’t recall how I knew it.
Then it hit me.
“First Circle” by Alexander Solzhenitsyn, of course, and it’s even kind of central to the plot.
It is also able to retain the source speaker’s voice in the translated speech.
Except in the sample they give, a man’s voice becomes a woman’s voice….
In my experience text-to-text translation isn’t all that impressive, so I don’t think I’ll hold my breath on this. (Of course, I speak so peculiarly that speech-to-text within one language doesn’t work for me, so what do I know.)
Um, I’m not surprised if speech-to-text is difficult in Danish… it must be pretty hard in French, too…
The word sounded familiar somehow; I even knew what it stands for and vaguely understood its function, but couldn’t recall how I knew it.
In my case, I think I had only encountered that word in science fiction (and interpreted it from the context as referring to some kind of text/speech conversion device) – most likely in The Moon Is a Harsh Mistress by Robert Heinlein, but possibly somewhere else entirely.
Whatever it was, though, it didn’t involve Solzhenitsyn – in fact, I’m not sure whether I’ve even read that particular story.
Found that quote:
When work on the secret telephone reached a stage at which alternative experimental programs were possible, Roitman conscripted everyone he could get for the Acoustics Laboratory to work on the “vocoder.” (The name was derived from the English “voice coder.” An attempt was made to substitute a Russian name meaning “artificial speech apparatus,” but it had not caught on.)
and the second quote is even funnier:
“You are Engineer . . . er . . . ” Abakumov consulted his scrap of paper.
“Er, yes,” said Valentin absently. “Pryanchikov.”
“You’re a senior engineer in the group working on the, er . . . ” He glanced at his notes again. “. . . artificial speech device?”
“What? What d’you mean, artificial speech device?” Pryanchikov waved a hand dismissively. “Nobody at our place calls it that. They changed its name during the campaign against kowtowing to foreign science. It’s a vocoder. Voice coder. Or scrambler.”
Wiki says the vocoder was patented in the US in 1939 and used during WWII.
So the Soviets were about a decade behind when the effort described in the novel to copy it in secret GULAG laboratories got under way.
I wonder what that does to the accent in the translated voice.
Vocoders used to be used a lot in pop, in ELO’s “Mr Blue Sky” for instance. That’s where I first heard the word. Now they’re only used for a retro-futuristic feel (e.g. Daft Punk).
The blog post linked to includes a speaker-voice Translatotron example further down the page. Pretty impressive, to my ears.
Neil Young used a vocoder for the “robot voice” effect on the album Trans. A lot of people hated that album, including his record company and myself. I kinda like it now, though.
Probably nothing. Accent and voice are quite independent.
I was referring to my English pronunciation. I haven’t tried Danish.
David L, the only sample where they’ve tried preserving the original voice is the one labeled “Translatotron translation (original speaker’s voice)”, and the output does sound somewhat like the (male) input.
Reminds me of the great scene from The First Circle where the prisoner-scientist pretends to read a printed voiceprint for the clueless Soviet secret police chiefs:
It took Roitman forty-five seconds to lead Selivanovsky to them, but with the zeks’ unique quick-wittedness, they had realized at once that Rubin would have to demonstrate his skill in reading from voiceprints…
They understood each other at a glance.
“If you do it, and you can choose the sentence,” Rubin whispered, “say, ‘Voiceprints enable the deaf to use the telephone.’”
…
“Right, now Lev Grigorievich will demonstrate his skill. One of the speakers, say, Gleb Vikentich, will go into the soundproof box and read a sentence into the microphone, the machine will record it, and Lev Grigorievich will try to decipher it.”
…
The apparatus began humming…
The whole lab stopped pretending to work and watched in suspense…. Rubin alone remained seated, giving them glimpses of his bald spot. Taking pity on his impatient audience, he made no attempt to hide his hieratic ritual but quickly marked off sections of the still-damp tape with the usual blunt copying pencil.
“You see, certain sounds can be deciphered without the least difficulty, the accented or sonorous vowels, for example. In the second word the r sound is distinctly visible twice. In the first word the accented sound of ee and in front of it a soft v—for there can’t be a hard sound there. Before that is the formant a, but we mustn’t forget that in the first, the secondary accented syllable o is also pronounced like a. But the vowel oo or u retains its individuality even when it’s far from the accent—right here it has the characteristic low-frequency streak. The third sound of the first word is unquestionably u. And after it follows a palatal explosive consonant, most likely k—and so we have ukov’ or ukavi. And here is a hard v—it is clearly distinguished from the soft v, for it has no streak higher than 2,300 cycles. Vukovi—and then there is a resounding hard stop and at the very end an attenuated vowel, and these together I can interpret as dy. So we get vukovidy—and we have to guess at the first sound, which is smeared. I could take it for an s if it weren’t that the sense tells me it’s a z. And so the first word is”—and Rubin pronounced the word for “voiceprints”—“zvukovidy.” He continued: “Now, in the second word, as I said, there are two r sounds and, apparently, the regular verb ayet, but since it is in the plural it is evidently ayut. Evidently razryvayut or razreshayut, and I’ll find out which in a moment. Antonina Valeryanovna, was it you who took the magnifying glass? Could I please have it a moment?”
The magnifying glass was quite unnecessary, as the apparatus made bold, broad marks. It was an old con’s spoof, and Nerzhin laughed quietly to himself, absently smoothing his already smoothed hair. Rubin gave him a fleeting glance and took the proffered magnifying glass. The general tension was growing, all the more so because nobody knew whether Rubin was guessing correctly so far. Selivanovsky was profoundly impressed.
“Amazing,” he whispered. “Simply amazing.”
Meanwhile, Rubin had deciphered the word “deaf” and moved on. Roitman was radiant.
“The final word, ‘telephone,’ we come across so frequently that I’m used to it and can recognize it immediately. And that’s the whole thing.”
“Astonishing!” Selivanovsky said yet again.
Can it translate prosody? Say, a question from Russian, without a final pitch rise, into English, which uses one? Or emphasis in the English source leading to a word-order choice in Russian, or to syntactic topicalization in a destination language that has topics? That kind of non-written content is where the point of this becomes clear to me.
What actually amazes me most is the idea that we can now effectively apply different “sounds like this person talking” masks. I’ll have to try these out!