From Facebook to Amazon to Microsoft to Apple, big tech companies are racing to improve speech-synthesis systems. And the systems, which have become increasingly realistic-sounding, have a lot of potential benefits—from serving the blind, to helping people who are illiterate access information online, to assisting the elderly.

TechRepublic spoke to Alan Black, a Carnegie Mellon University (CMU) researcher who studies speech-synthesis systems. Black explained what's new in speech synthesis, the current challenges, and what we should be concerned about.

What is the current technology behind speech synthesis?

The technology has moved from sounding very robotic to what's called unit-selection synthesis. That's where large databases of natural speech are recorded, and then subword units, or phonetic units, are selected and joined so that the result sounds like natural speech. The best unit-selection synthesizers are pretty much indistinguishable from natural speech. The output sounds pre-recorded, even though it's saying things the original speaker never recorded.
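
To make the idea concrete, here is a minimal Python sketch of unit selection over a toy database. The database entries, cost functions, and greedy search are illustrative assumptions for this example only; real synthesizers search huge databases with dynamic programming over much richer acoustic features such as pitch, duration, and spectral context.

```python
def target_cost(unit, spec):
    """How well a recorded unit matches the phone we want to say."""
    return 0.0 if unit["phone"] == spec else 1.0

def join_cost(prev_unit, unit):
    """How smoothly two recorded units concatenate (here: pitch mismatch at the join)."""
    if prev_unit is None:
        return 0.0
    return abs(prev_unit["end_pitch"] - unit["start_pitch"])

def select_units(phone_sequence, database):
    """Greedy selection: for each target phone, pick the candidate unit that
    minimizes target cost plus join cost with the previously chosen unit.
    (Real systems run a Viterbi-style search over all candidates instead.)"""
    chosen, prev = [], None
    for spec in phone_sequence:
        candidates = [u for u in database if u["phone"] == spec] or database
        best = min(candidates, key=lambda u: target_cost(u, spec) + join_cost(prev, u))
        chosen.append(best)
        prev = best
    return chosen

# Toy database: each entry is a snippet of recorded speech labeled by phone.
database = [
    {"phone": "HH", "start_pitch": 110, "end_pitch": 115, "wav": "hh_001.wav"},
    {"phone": "AH", "start_pitch": 116, "end_pitch": 120, "wav": "ah_042.wav"},
    {"phone": "AH", "start_pitch": 140, "end_pitch": 150, "wav": "ah_107.wav"},
    {"phone": "L",  "start_pitch": 121, "end_pitch": 118, "wav": "l_013.wav"},
    {"phone": "OW", "start_pitch": 117, "end_pitch": 105, "wav": "ow_030.wav"},
]

units = select_units(["HH", "AH", "L", "OW"], database)  # phones for "hello"
print([u["wav"] for u in units])
```

The point of the two costs is the trade-off Black describes: pick units that say the right sound (target cost) while also joining smoothly with their neighbors (join cost), which is why a bad mismatch is so noticeable when it happens.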

At CMU, we're looking at how to make conversations more natural. Most of the systems just echo you: you speak to the system, and it speaks back. That's not how humans speak. Humans laugh, they talk back; that's what makes human speech friendly. None of the systems actually do that yet. A system that did would be quicker and friendlier, and it would be able to build a relationship with the human. People would like that. From a research point of view, we're trying to get the system to say "uh-huh" at the right time, to indicate that the other person can start speaking, so you get a more fluid conversation going.
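
As a rough sketch of what "saying uh-huh at the right time" involves, the fragment below decides whether to backchannel based on a pause after speech that ended with falling pitch. The event format, the 0.6-second threshold, and the pitch cue are assumptions made for the example, not the CMU system; real dialogue systems combine prosody, syntax, and learned models.

```python
def should_backchannel(events, pause_threshold=0.6):
    """Backchannel when the user has gone quiet for long enough and the
    preceding speech ended with falling pitch (a rough end-of-phrase cue)."""
    if len(events) < 2:
        return False
    speech, silence = events[-2], events[-1]
    return (silence["type"] == "silence"
            and silence["duration"] >= pause_threshold
            and speech["type"] == "speech"
            and speech["pitch_slope"] < 0)

# Example stream: the user finishes a phrase with falling pitch, then pauses.
events = [
    {"type": "speech", "duration": 2.1, "pitch_slope": -0.4},
    {"type": "silence", "duration": 0.8},
]
if should_backchannel(events):
    print("uh-huh")  # a short acknowledgment keeps the exchange fluid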

Unit selection started as a research area with a group in Japan in the mid-90s, where we built databases in Japanese and then in other languages. Most of the high-quality systems today use unit selection. Mostly it sounds natural; when it doesn't, you get a mismatch, and people hate it when it goes wrong. It's hard to stop it from ever going wrong, and designing the right database is quite hard. But it's been moving toward naturalness. When people are interacting with some systems, they can forget it's a machine and think it's a person.

Can you share an example?

At Carnegie Mellon, we were running a system giving bus information to people in Pittsburgh. We would tell them about the next bus. Most of the time, it sounded like a human telling you about the next bus. A number of people would forget it was a machine. They would speak to it more casually and naturally, and that would make the speech recognition worse because the speaker was being casual, and then the system would fail. You had to be careful about being too human-like.

What about Siri, which doesn't sound as realistic?

Siri is deliberately not very friendly so that people will be careful when they speak to it. It's not a friend; it's a professional assistant. They want you to speak clearly to it, like to an assistant rather than a best bud. That's the issue we're running into: because we have high-quality synthesis that sounds very natural, it confuses the user. When it fails to perform properly, people get more critical of it.

What are the biggest challenges for researchers?

We get undergrads to test the systems at CMU. The big commercial systems have hundreds of thousands of users, which allows them to do experiments we couldn't do. One of the issues we have in the research field is getting a user population we can experiment with. We sometimes work with Apple and Amazon to do that, but the companies can't let us take over their systems. If you saw the recent thing with Microsoft, the bot started saying horrible, racist things because people broke it. I wondered why they didn't know that was going to happen.

How do speech-recognition systems learn?

A company like Apple most likely records everything that happens and keeps logs of what's going on, which allows them to improve the system. One big company said that one of the most interesting things to do is go through the logs to find the most common things people ask that the system can't deal with. There are ways of going through the records to see what was recognized. People are asking how to play pieces of music; we never thought of that. That's the kind of thing you get once you have a big enough system. People are mining the logs for information about what's missing.
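
As an illustration of that kind of log mining, here is a minimal Python sketch that counts the most frequent requests a system failed to handle. The log format and field names are assumptions for the example, not any company's actual pipeline.

```python
from collections import Counter

def most_common_unhandled(logs, top_n=5):
    """Count the requests the system could not handle and return the most
    frequent ones, so developers can see which capability to add next."""
    misses = Counter(
        record["text"].lower().strip()
        for record in logs
        if not record["handled"]
    )
    return misses.most_common(top_n)

# Toy log: the system repeatedly fails on music-playback requests.
logs = [
    {"text": "Play some jazz", "handled": False},
    {"text": "What's the weather", "handled": True},
    {"text": "Play some jazz", "handled": False},
    {"text": "Next bus to downtown", "handled": True},
]
print(most_common_unhandled(logs))  # [('play some jazz', 2)]
```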

You say that systems like Siri deliberately keep a distance between machine and human. Will that change?

Oh yes. The companies who are building the systems want the systems to be an integral part of how humans communicate with their devices. They want to build long-term relationships so that people rely on the systems. They want efficiency, so instead of speaking big long sentences, you can give a grunt in the morning and it knows you want the weather and the news. You don’t have to have your coffee first. It can say, “Good morning, Alan! The weather in Pittsburgh is 47 degrees and here’s the news around the world.” The companies would like to do that so you become more engaged with your personal assistant.

Do you have any concerns about the systems getting too human-like?

We begin to trust them. We begin to believe what they say. We often forget that they're controlled by companies that are trying to make money. It's the job of Amazon to tell us what to buy. If you have a personal assistant that starts making recommendations about what to buy, you listen. That could be a problem. It could recommend who to vote for. That goes from being useful to influencing. People need to remember that the machines are controlled by companies. If it's there all the time, you need to remember it's their bot, not your bot.

 

[Source: TechRepublic]