Anatoly recently posted about the Acapela Text to Speech Demo, saying he was struck by how well the Russian voice (Алена) rendered the text he entered. I tried it with both Russian and English and was similarly impressed. So I ask the same question he did: is this a particularly good, cutting-edge site, or is this pretty standard for the current technology? If the latter, it’s come a long way since I last noticed it.


  1. I recently did a project which involved training a TTS engine for Slovak, and aside from a few glitches, I was equally impressed. So to answer your question: yes, the technology has come a long way and what you hear is pretty much standard. Not for all languages, though.
    On a related note: recently, there have been a few news reports in these parts about a speech-to-text engine the Academy of Sciences has developed and reportedly the software is to replace court reporters next year. Now that’s something I’d like to see for myself.

  2. Meh.
    I tried the English “I prefer proffered preferences” in several voices. The British ones sounded more natural than the American ones, but all were unmistakably artificial.
    Oddly, the American voices had difficulty with the first cluster of ‘preferences’, where the p comes out almost like a separate syllable.

  3. The comments at Avva indicate that it’s not any one particular “breakthrough” that makes the voices lifelike, but rather a slow incremental process of refining and polishing and hand-tuning.

  4. Very impressive. The last time I listened to text-to-speech – not long ago – it still was at the stage where I lost track if I wasn’t simultaneously reading the text. The English voices are interesting: “Graham” and “Lucy” are classic RP, to the point of sounding mannered and prissy; “Peter” and “Rachel” are more representative of modern RP, and “Peter” even has a slight edge of Estuary at times.

  5. That you’re testing their TTS with tonguetwisters tells them that their TTS is terrific.
    One of my cousins was working on this kind of thing 25 or even 30 years ago, and at that time he and a lot of others were discouraged at the difficulties involved. Ten years or so later he was more upbeat. Last time I saw him (about 8-10 years ago) he’d founded a company whose product may play some part in this product. As Wimbrel says, it apparently was an accumulation of specific problems solvable one at a time rather than some large metaphysical problem.

  6. That’s amazing, like something from science fiction. Helen and Peter confessed to a lot of bad stuff, something to do with Waldseemüller’s friend Matthias Ringmann, though perhaps I’m putting words in their mouth.

  7. If you’re using a Mac, you can use the terminal command ‘say’ to hear what current run-of-the-mill text-to-speech is like. I tried
    >say "now is the time for all good men to come to the aid of their party-warty."
    and the last word sounded rather artificial, but otherwise it sounded pretty good.

  8. michael farris says

    I tried cutting and pasting random bits in random languages (Norwegian, Spanish, Indonesian) and having random voices from random languages speak them.
    A lot of fun.

  9. michael farris: A lot of fun
    Yes. I admit I’ve been idly amusing myself by getting the ultra-RP speakers to say hardboiled quotes from Lock, Stock and Two Smoking Barrels – and the Russian “Alyona” to say “You will daiee, Mister Bond. But not until I have extracted the maximum of pleasure from you”.

  10. The British synthesized voices in the demo cannot pronounce the hard “g” in the word “bagels” — and, strangely enough, the American synthesized voice, in the phrase, “Bagels, bagels, bagels! Is that all you ever bake?”, got one of the “bagels” slightly wrong.

  11. How strange! The English ones all say “bidgels”, but the US one clearly says bay-. I wonder why.

  12. But change it to “Beagles, beagles, beagles! Is that all you ever bake?” and it has no problem. If you try to con it with “Beagles, beagles, bagels!”, it sticks to its guns.
