Google's talking AI is indistinguishable from humans


I for one welcome our new Tacotron overlords.


“That girl did a video about Star Wars lipstick.”
“She earned a doctorate in sociology at Columbia University.”
“The buses aren’t the PROBLEM, they actually provide a SOLUTION.”


The giveaway between voice recordings and AI-generated speech is that humans will often breathe into the mic, even for a split second. If the developers start emulating small breath sounds, the AI voice will be even more difficult to distinguish.


“The entree consists of boiled dog”


Tacotron is based on Google’s WaveNet algo, and one of WaveNet’s first accomplishments was making all the gross little breathing, lip-smacking, tongue-flapping and sniffing noises. It’s also good at generating convincing piano music.


First, the AI is 2 - the pace is too uniform.
Second, the AI is 1 - the real voice slows around difficult consonant combinations.
Third, the AI is 2 - it lacks the depth of tonal variation of the other.
Fourth, the AI is 2 - some mis-emphasis on syllables.


They’re not fooling me yet.


Maybe we’re all just confused and humans are just starting to sound more like talking AIs.


The bus one might be about the Google buses rather than public transport, though.


Perhaps not until a touch of “human” is introduced, such as stammering, or what you hear ends up f—ing you over.



Still discernible subtle differences but pretty good!


The real question about incorporating AI into speech synthesis isn’t so much how much more realistic you can make it sound, but how many more places speech synthesis can be used once AI is involved.

Take, for example, the rise of speech synths alongside singing synths like Vocaloid in Japan. Programs like Voiceroid for speech and CeVIO for both speech and singing have been growing in popularity. They are starting to get heavily used in “Let’s Play” gameplay videos and for reading news articles, with some lighter use in skits, short anime clips, small films, etc.

But those programs’ main problem is that they really can’t be used in real time, which limits where and how they can be used. You can narrate prerecorded gameplay footage with the software, but it’s very difficult to play a game and type words into it at the same time. There has been some experimentation with incorporating machine learning and AI to make voice recognition work with these programs, making it possible to do both at the same time (

There should be some concern about AI and voice synthesis, but mostly because scammers could target elderly people much more easily over the phone. They could target a significantly larger number of them, and the increased realism of synthetic voices is only going to make that problem worse. It could end up being almost as easy as sending out common spam.


