Kane: Then, how does the model actually differentiate between the speech and the background noise?
Andrew: What we do is we train our acoustic models with the noise. We take a recording of someone saying, “The quick brown fox jumped over the lazy dog,” and mix in all of this background noise.
Noise like unwanted background speech is particularly difficult because you’re trying to teach this algorithm to recognize and identify human speech. At the same time, you’re trying to tell it that there’s just one specific person you want it to listen to, and that it should ignore all those other voices.
So, we have to teach it that you can have one voice and then several voices underneath it that you want to completely ignore. We train it with all kinds of unwanted background noise that can be anything, such as dogs barking. We have to cut through that noise by including it in our models.
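As a rough sketch of that kind of training-data augmentation (the function name and the target-SNR approach here are illustrative, not SoundHound's actual pipeline), you can mix a clean utterance with a background-noise recording at a chosen signal-to-noise ratio:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a clean utterance with background noise at a target SNR in dB."""
    # Loop the noise so it covers the whole utterance, then trim it.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so speech power / noise power hits the target ratio.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Running the same clean sentence through this with dog barks, café chatter, or engine recordings at several SNRs gives the model examples of the target voice under many noise conditions.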
As a developer, you can switch between different acoustic models on the fly. So if you have a device that could be both near-field and far-field (near-field is where you’re very close to the microphone, and far-field can be several meters away), depending on the microphone that picks up the voice, you can send it to a different endpoint. You can switch acoustic models.
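From the developer's side, that switching can be as simple as routing audio to a different endpoint depending on which microphone picked up the voice. A minimal sketch, with hypothetical endpoint URLs and a made-up distance threshold (the real Houndify API differs):

```python
# Hypothetical endpoints for illustration only; not Houndify's actual API.
ACOUSTIC_MODEL_ENDPOINTS = {
    "near_field": "https://api.example.com/asr/near-field",
    "far_field": "https://api.example.com/asr/far-field",
}

def pick_endpoint(estimated_distance_m: float) -> str:
    """Route audio to a near-field or far-field acoustic model based on
    how far the speaker is from the microphone that captured the voice."""
    profile = "near_field" if estimated_distance_m < 1.0 else "far_field"
    return ACOUSTIC_MODEL_ENDPOINTS[profile]
```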
If we see that the acoustic model isn’t performing in an optimal way in a new use case, then it could be worth collecting data from the actual environment. But typically, we hope we would have already collected a lot of data in those environments. For cars specifically, we add things like indicator noise, AC fans, and engine noise at different speeds.
With a car, you’re shifting gears and going from 30 miles an hour to 50 or 60, and those frequencies are changing constantly. What we do then is collect data at different speeds, so you have all the engine noise for that model of car trained at different speeds. We also use other noise like the window wipers and in different driving conditions, such as rain.
The Lombard Effect is an important part of this. If you mix a clean recording of someone with background noise, all you have is someone speaking in a quiet environment with background noise. Whereas in real life, when someone is in that environment, they will modify how they speak.
A good example of that is if you enter a restaurant at 5 pm, and it’s really quiet, your voice will be much lower. You can even whisper and have a conversation. But as people come in, the noise gets louder, and you will then start getting a slightly higher-pitched voice, and the frequencies of your voice will change.
We have to consider that because when you’re at a much higher speed in the car and the windows are open, the frequencies that you emit from your voice will also change. It’s not just the amplitude or that you’re speaking louder. The actual frequency range changes as well.
Kane: So noise cancellation headphones work by inverting the signal, so the wave goes in the opposite way and cancels out. Is that right?
Andrew: There is a frequency range. Human voices fall within a certain range of frequencies. Background noise, such as fans, AC, and the rumbling of a plane, sits at frequencies that don’t overlap with the human voice.
Essentially, you can throw those out. As you said, the headphones send the signal back inverted to cancel it out. It’s a lot easier to identify those frequencies and remove them without interfering with the voice signal than it is with other sources of sound.
Other sounds are non-stationary: they come in and out and cover some of the same frequencies as speech. Noise-cancellation headphones cut out the steady background noise, but if you have a kid screaming next to you or a baby crying, those kinds of noises can’t be canceled out because it’s much more difficult to do.
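The inversion Kane describes can be shown in a couple of lines. This is a toy example with a steady 100 Hz hum standing in for fan or engine noise:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 8000, endpoint=False)
hum = np.sin(2 * np.pi * 100 * t)  # steady, predictable low-frequency noise
anti = -hum                        # phase-inverted copy played by the headphones
residual = hum + anti              # the two waves cancel to silence
```

This only works because the hum is steady and predictable; a crying baby changes too fast for the headphones to generate an accurate inverted copy in time.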
Kane: Do you have to deal with a mask effect where some of the background noise is actually at the same frequency as the speaker?
Andrew: Yeah, essentially, those types of noises interfere with the speech recognition engine because there’s no easy way of canceling and removing them. It’s like reCaptcha on websites. There is a sequence of letters that you’re supposed to type in, but there are additional squiggly lines that have been added to make it purposely difficult. So, that’s the situation that we’re in.
We try to teach the acoustic model to recognize the words regardless of how much noise or the type of noise there is in the environment.
Kane: You mentioned that this is something that you can do on the fly, switching different acoustic models?
Andrew: Switching on the fly is generally for far-field and near-field because they are very different. There are challenges with near-field that you don’t get with far-field. We mentioned convolutional noise; that’s typically the noise that you’ll get with far-field.
For example, I’m speaking right now, and my voice is going directly into the microphone. So the sound waves are going in, but they’re also hitting the wall and coming back at the same time. So convolution is essentially like a reverb on the voice coming from the front.
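Convolutional noise of that kind can be simulated by convolving the dry recording with a room impulse response. Here is a minimal sketch with a toy two-tap impulse response, a direct path plus one wall reflection (real room responses have thousands of reflections):

```python
import numpy as np

def add_reverb(speech, rir):
    """Convolve a dry recording with a room impulse response."""
    return np.convolve(speech, rir)[: len(speech)]

sr = 16000
rir = np.zeros(sr // 10)        # 100 ms impulse response
rir[0] = 1.0                    # direct path into the microphone
rir[int(0.020 * sr)] = 0.4      # wall reflection, 20 ms later and quieter
```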
It’s generally a separate model when it comes to near-field. The issue that you get there is distortion. The reason that’s difficult is that you could distort everything and teach the acoustic model to recognize the distorted voice. But when just one phoneme has been distorted and the rest is normal, that’s where it gets difficult.
That’s what happens with near-field audio because you get too close to the microphone. So that’s why we have separate models. When it comes to all this background noise, we train the model on all of it, so it’s going to be able to perform at 65 mph, 30 mph, and in a parking garage. All of that noise has been included in the acoustic model.
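The near-field distortion described here is essentially clipping: the waveform exceeds what the microphone can capture and gets flattened at the peaks. A minimal illustration:

```python
import numpy as np

def hard_clip(x, limit=1.0):
    """Flatten any samples beyond the microphone's range, as happens
    when the speaker is too close to the mic."""
    return np.clip(x, -limit, limit)
```

Quiet samples pass through untouched while only the loudest peaks are flattened, which is why only some phonemes end up distorted while the rest of the utterance sounds normal.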
Kane: Is it possible to differentiate from different speakers, especially if there are other people talking in the background?
Andrew: There are different ways of doing that. One is an anonymous way of doing it based on spatial parameters. Whoever triggers it first will get all of the attention from the microphone array. So the microphone will then lock onto wherever that person is, and everyone else is ignored. This way is useful for multi-zone use cases.
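One common way a microphone array "locks on" is delay-and-sum beamforming: delay each microphone's signal so the chosen direction lines up, then sum, so the locked-on speaker adds coherently while off-axis voices partially cancel. A simplified integer-delay sketch (real arrays use fractional delays and adaptive steering; the details here are illustrative, not SoundHound's implementation):

```python
import numpy as np

def delay_and_sum(mic_signals, delays):
    """Align each mic by its per-direction delay (in samples), then average.
    The locked-on speaker adds in phase; other directions don't."""
    n = min(len(s) - d for s, d in zip(mic_signals, delays))
    aligned = [s[d : d + n] for s, d in zip(mic_signals, delays)]
    return np.mean(aligned, axis=0)
```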
There are other ways of specifically identifying a person, which generally require enrolling the users. If we’re thinking in terms of members of a family, you can ask each member of the family to say the wake word a few times, so that when that person says the wake word, you actually identify and recognize them. The response can then be appropriate for whoever is actually talking to the voice assistant.
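One common way to implement that enrollment is with speaker embeddings: average a few wake-word embeddings into a voiceprint per user, then match new utterances by cosine similarity. This sketch assumes you already have a model that maps audio to embedding vectors; the function names and threshold are illustrative:

```python
import numpy as np

def enroll(wake_word_embeddings):
    """Average a few wake-word embeddings into one voiceprint per user."""
    return np.mean(wake_word_embeddings, axis=0)

def identify(embedding, voiceprints, threshold=0.8):
    """Return the enrolled user most similar to this utterance, or None."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    best = max(voiceprints, key=lambda name: cosine(embedding, voiceprints[name]))
    return best if cosine(embedding, voiceprints[best]) >= threshold else None
```

Returning `None` below the threshold lets the assistant fall back to an anonymous response for guests who never enrolled.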
Kane: Voice is obviously leaving the home, just as it has spread everywhere on mobile. More brands are going to be looking at putting voice solutions in place in stores as they start to open up, in quick-service restaurants, and in all kinds of other environments.
Looking for more information on voice AI technology in every environment? Go to www.houndify.com or visit our blog for more information.
If you missed the podcast and video live, or if you want to see it again or share it with a colleague, you can view the interview in its entirety here.
Be sure to join Kane Simms and Darin Clark of SoundHound Inc. in October as they discuss wake word detection and conversational AI. Subscribe to the VUX World newsletter to keep tabs on what’s coming next.