I've been living with the strange concept of a 'wake word' for a few years now. It's a means of switching context: after I say this special word or phrase, the device can interpret everything I say afterwards as special, like a command. My Echos, Google Homes and Siri all do that, sometimes well, sometimes not. And if there are multiple devices, what happens if several 'wake'?
It seems that some devices, in certain places, do it better or worse than others, leading to misinterpretation. Sometimes this is annoying, even dangerous. I have set timers and, when I didn't carefully wait for a confirmation, discovered they were not set. It's about the acoustics and expectations, I understand, like when you are talking to people. This technical article shows there is a lot going on with the wake word now:
Using Wake Word Acoustics to Filter Out Background Speech Improves Speech Recognition by 15%, by Xing Fan, Amazon Alexa
One of the ways that we’re always trying to improve Alexa’s performance is by teaching her to ignore speech that isn’t intended for her.
At this year’s International Conference on Acoustics, Speech, and Signal Processing, my colleagues and I will present a new technique for doing this, which could complement the techniques that Alexa already uses.
We assume that the speaker who activates an Alexa-enabled device by uttering its “wake word” — usually “Alexa” — is the one Alexa should be listening to. Essentially, our technique takes an acoustic snapshot of the wake word and compares subsequent speech to it. Speech whose acoustics match those of the wake word is judged to be intended for Alexa, and all other speech is treated as background noise.
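To make the "acoustic snapshot" idea concrete, here is a minimal sketch of my own, not the method from the article. It assumes each audio frame has already been turned into a small feature vector, averages the wake-word frames into one anchor vector, and keeps only the later frames whose cosine similarity to that anchor passes a threshold. The function names, the toy two-dimensional features and the 0.8 threshold are all hypothetical; the real system learns this matching inside the speech recognizer rather than applying a fixed rule.

```python
import math

def mean_vector(frames):
    """Average a list of equal-length feature vectors into one vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mask_background(wake_frames, speech_frames, threshold=0.8):
    """Return one boolean per speech frame: True if it resembles the wake word.

    The threshold is an illustrative assumption, not a value from the article.
    """
    anchor = mean_vector(wake_frames)  # the "acoustic snapshot"
    return [cosine_similarity(anchor, f) >= threshold for f in speech_frames]

# Toy data: frames 0 and 2 resemble the wake-word speaker, frame 1 does not.
wake = [[1.0, 0.1], [0.9, 0.2]]
speech = [[1.0, 0.2], [0.1, 1.0], [0.8, 0.1]]
print(mask_background(wake, speech))  # [True, False, True]
```

Frames marked False would be treated as background and ignored in this toy version.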
Rather than training a separate neural network to make this discrimination, we integrate our wake-word-matching mechanism into a standard automatic-speech-recognition system. We then train the system as a whole to recognize only the speech of the wake word utterer. In tests, this approach reduced speech recognition errors by 15%.
We implemented our technique using two different neural-network architectures. Both were variations of a sequence-to-sequence encoder-decoder network with an attention mechanism. A sequence-to-sequence network is one that processes an input sequence — here, a series of “frames”, or millisecond-scale snapshots of an audio signal — in order and produces a corresponding output sequence — here, phonetic renderings of speech sounds. ... "
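For readers unfamiliar with attention, the following is a rough sketch of a single attention step over encoded frames, written by me for illustration. The excerpt does not say which form of attention the authors used, so plain dot-product scoring is an assumption here, as are the function names and the toy vectors. The decoder's current state (the query) is scored against every encoded frame, the scores are turned into weights that sum to one, and the weighted average of the frames becomes the context used to produce the next output symbol.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """One dot-product attention step (assumed form, for illustration only).

    Returns the attention weights and the weighted-average context vector.
    """
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy example: the query points toward the first encoded frame,
# so that frame should receive most of the attention weight.
states = [[5.0, 0.0], [0.0, 5.0]]
weights, context = attend([1.0, 0.0], states)
print(weights, context)
```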