Please conduct research on best practices for utilizing speech recognition technologies.

Below is the list of requested questions, each followed by an overview of the best practices as they relate to speech recognition technologies. Note that this technology is growing and changing rapidly, so it is worth staying in the loop with technology blogs to learn about new techniques.

How should I record the audio?

When recording your audio, use “a sampling rate of 16,000 Hz or higher”, as “lower sampling rates may reduce accuracy”. It is also recommended to use a lossless codec for your recordings; FLAC and LINEAR16 in particular are reliable.
Additionally, it is recommended that “the ideal hardware setup for any speech recognition application should do minimal digital signal processing on those fixed embedded chips. It should take in raw waveforms and build complicated deep learning algorithms on top that are both trainable and flexible”.
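
To illustrate the recording settings above, here is a minimal sketch that captures a short mono clip at 16,000 Hz and stores it losslessly as FLAC. The sounddevice and soundfile libraries, the five-second duration, and the file name test_clip.flac are my own choices for the example, not requirements from the sources.

# Sketch: record a 16 kHz mono clip and store it losslessly as FLAC.
# Library choice (sounddevice + soundfile) is an assumption, not from the sources.
import sounddevice as sd
import soundfile as sf

SAMPLE_RATE = 16_000   # 16,000 Hz or higher, per the recommendation above
DURATION_S = 5         # length of the test clip, in seconds

# Record from the default input device as 16-bit PCM (i.e., LINEAR16).
audio = sd.rec(int(DURATION_S * SAMPLE_RATE),
               samplerate=SAMPLE_RATE,
               channels=1,
               dtype="int16")
sd.wait()  # block until the recording is finished

# FLAC is lossless, so no signal information is discarded before recognition.
sf.write("test_clip.flac", audio, SAMPLE_RATE)

The same clip can then be reused for the noise and sanity checks sketched later in this report.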

If you don't want to go the advanced-hardware route, smartphones can be utilized for recording as well. "Speech APIs perform speech recognition by communicating with Apple's servers or using an on-device speech recognizer". It's important to note that Apple's speech recognition is not available for all languages. Since the app might "connect to the servers to perform recognition...you must get the user's explicit permission before you initiate speech recognition", in order to protect and respect users' privacy.

Do I need multiple microphones?

Multiple microphones do not appear to be necessary for speech recognition, as microphone technology has advanced considerably, and even our phones have multiple microphones built in. Below is an example of microphone technology (from Vesper) that is being utilized to improve speech recognition performance:
“They are changing the way microphones are designed by exploiting the physical property of piezoelectric material. As a result, their microphone technology doesn’t suffer from dust and environmental degradation like traditional mics do, so their microphones are much higher quality and therefore deliver higher fidelity results, which serve as the signal transmitted for further processing”.

What if the environment is noisy?

Extreme background noise may reduce the accuracy of your recordings. For example, with the background noise of a car being parked, the word/command recognition rate is roughly 90%; with the windows fully open it drops to about 25%; and with an additional speaker in the background and the windows open, the word/command recognition rate falls to 0%.
However, new technologies are starting to emerge, such as Human-to-Machine Communication (HMC) optical sensing, which “facilitates a more natural, personalized, accurate, and secure voice-control experience…by gather[ing] additional data generated during speech as facial skin vibrates around the mouth, lips, cheeks, and throat”.

Additionally, while a recognizer may be designed “to ignore background voices and noise”, it is best to place the microphone as close to you as possible, “particularly when background noise is present”.
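
A simple way to judge whether background noise is likely to be a problem is to compare the level of a noise-only stretch of your recording against the level of the speech itself. The sketch below estimates that signal-to-noise ratio with NumPy; the assumption that the first second of test_clip.flac contains only background noise, and the 20 dB threshold, are illustrative choices of mine rather than figures from the sources.

# Sketch: rough signal-to-noise estimate for a recording, assuming the first
# second is background noise only. File name and threshold are illustrative.
import numpy as np
import soundfile as sf

audio, rate = sf.read("test_clip.flac")

noise = audio[:rate]     # assumed noise-only lead-in (first second)
speech = audio[rate:]    # assumed to contain the spoken content

def rms(x):
    """Root-mean-square level of a signal."""
    return np.sqrt(np.mean(np.square(x, dtype=np.float64)))

snr_db = 20 * np.log10(rms(speech) / rms(noise))
print(f"Estimated SNR: {snr_db:.1f} dB")

if snr_db < 20:  # illustrative threshold, not from the sources
    print("High background noise: move the microphone closer or find a quieter space.")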

What if the users use domain-specific language, like medical language?

If domain-specific language is involved, the most useful approaches include lexicon adaptation, cloud-based ASR adaptation, and user-intention adaptation. A known limitation of cloud-based speech recognition “is the drop in performance observed in application domains that introduce specific vocabulary (for example, personal names) and language use patterns”. This can be mitigated, however, if the cloud recognition system is “supplemented with specialized recognizers that are directly adapted to the domain/speaker”. Additionally, by mining web resources beforehand with tools such as word2vec, “potentially useful new words can be learned beforehand, such that when mentioned in the user’s utterances later, these words are no longer new words to the agent”. In this way, the system extends its current lexicon with the unknown words of the specific domain before they come up in the user’s speech.
Further, it is important to note that the technology needed for the medical and legal fields may vary depending on the specifics of each field.
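
As a concrete (and hypothetical) illustration of cloud-based ASR adaptation, many cloud providers let you pass domain-specific vocabulary as recognition hints. The sketch below uses Google Cloud Speech-to-Text's speech adaptation to bias recognition toward a few medical terms; the choice of provider, the phrase list, and the boost value are assumptions of mine, since the approach above describes cloud adaptation only in general terms.

# Sketch: cloud-based ASR adaptation via phrase hints (Google Cloud Speech-to-Text v1).
# Provider, phrase list, boost value, and file name are illustrative assumptions.
from google.cloud import speech

client = speech.SpeechClient()

with open("test_clip.flac", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,
    language_code="en-US",
    # Lexicon adaptation: bias the recognizer toward domain-specific terms.
    speech_contexts=[
        speech.SpeechContext(
            phrases=["metformin", "tachycardia", "myocardial infarction"],
            boost=10.0,
        )
    ],
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)

In practice, the phrase list would come from the domain lexicon built beforehand (for example, from the word2vec-style mining described above).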

How can I improve the accuracy of the system?

Various steps can be taken to help improve your accuracy, such as testing a demo recording before you begin, limiting background noise, and setting your recorder 6 to 12 inches away from you. Further, “find a quiet space that’ll stay that way”, and consider upgrading your microphone (if you’re relying on a smartphone microphone). Drinking plenty of water before and during your recordings will also help the sound quality of your voice.
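
In line with the advice to test a demo recording first, the sketch below runs a quick sanity check on a short clip: it verifies the sample rate, reports the duration, and warns if the signal is clipping. The file name and the warning thresholds are illustrative assumptions of mine, not figures from the sources.

# Sketch: quick sanity check of a demo recording before a long session.
# File name and warning thresholds are illustrative assumptions.
import numpy as np
import soundfile as sf

audio, rate = sf.read("test_clip.flac")  # soundfile returns floats in [-1.0, 1.0]

duration_s = len(audio) / rate
peak = float(np.max(np.abs(audio)))

print(f"Sample rate: {rate} Hz (16,000 Hz or higher is recommended)")
print(f"Duration:    {duration_s:.1f} s")
print(f"Peak level:  {peak:.2f}")

if rate < 16_000:
    print("Warning: sampling rate below 16 kHz may reduce accuracy.")
if peak >= 0.99:
    print("Warning: the recording is clipping; lower the gain or move the recorder back toward 6 to 12 inches.")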

SUMMARY

By reducing background noise, ensuring your microphone is of high quality, and researching domain-specific vocabulary before you record, you can improve your success with speech recognition technologies. Further, cloud-based systems supplemented with specialized recognizers will deliver the best recognition results.
Sources