I would like a competitive analysis of speech recognition services, such as Google Cloud, Alexa Voice Service, Nuance and IBM Watson
The Google Cloud, Amazon Alexa, Nuance, and IBM Watson speech recognition APIs all support Japanese. All except Nuance support multi-speaker identification, and all except Alexa support streaming recognition. We address each of your success criteria in the findings below.
The Google speech recognition API supports Japanese, multi-speaker identification, and streaming recognition.
The Google Cloud speech API supports two lengths of audio depending on the mode selected. Synchronous recognition requests, whereby results are returned after all audio has been processed, are limited to 60 seconds. Asynchronous recognition requests, whereby the user can periodically poll for recognition results, support a maximum audio duration of 180 minutes.
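As a rough illustration of how the two limits above might drive a client-side decision, the sketch below picks a mode from the clip duration. The function name and the "split" fallback are hypothetical and are not part of Google's API; only the 60-second and 180-minute figures come from the text above.

```python
# Limits taken from the paragraph above; everything else is illustrative.
SYNC_LIMIT_SECONDS = 60           # synchronous requests: up to 60 seconds
ASYNC_LIMIT_SECONDS = 180 * 60    # asynchronous requests: up to 180 minutes


def choose_recognition_mode(duration_seconds: float) -> str:
    """Return which mode a clip of the given duration would fit into.

    Hypothetical helper: returns "synchronous", "asynchronous", or
    "split" when the clip exceeds both limits and would need to be
    divided before submission.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds <= SYNC_LIMIT_SECONDS:
        return "synchronous"
    if duration_seconds <= ASYNC_LIMIT_SECONDS:
        return "asynchronous"
    return "split"
```

For example, a 45-second voicemail would fit a synchronous request, while a 90-minute meeting recording would need the asynchronous mode.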
Google reports a 4.9% error rate for its technology. We found no evidence that the Google Cloud speech API supports a dictionary customization feature.
The Google API has both a free option and premium pricing options:
- The Silver package starts at $150 per month
- The Gold package starts at $400 per month
- The Platinum package starts at $15,000 per month
The Amazon Alexa API supports Japanese and multi-speaker identification. We found no direct evidence that it supports streaming recognition.
The maximum total combined length of audio supported by Alexa "cannot be more than ninety (90) seconds."
Amazon does not release Alexa's error or accuracy rates. There is no evidence that Amazon Lex supports a dictionary customization feature.
Amazon Lex can be tried for free: for the first year, users can process up to 10,000 text requests and 5,000 speech requests per month at no charge. Beyond this threshold, users are charged based on their usage at rates of $0.004 per voice request and $0.0075 per text request.
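To make the Lex pricing above concrete, here is a minimal cost sketch. It assumes that only requests beyond the monthly free allowance are billed and that the allowance applies only during the first year; the function name and parameters are hypothetical, and the rates are the ones quoted above.

```python
from decimal import Decimal

# Figures quoted in the paragraph above.
FREE_TEXT_REQUESTS = 10_000
FREE_SPEECH_REQUESTS = 5_000
TEXT_RATE = Decimal("0.0075")    # dollars per text request
SPEECH_RATE = Decimal("0.004")   # dollars per voice request


def lex_monthly_cost(text_requests: int, speech_requests: int,
                     in_free_year: bool = True) -> Decimal:
    """Estimate one month's Lex charge in dollars.

    Assumption: the free allowance is simply subtracted from usage
    during the first year, and nothing is free afterwards.
    """
    free_text = FREE_TEXT_REQUESTS if in_free_year else 0
    free_speech = FREE_SPEECH_REQUESTS if in_free_year else 0
    billable_text = max(0, text_requests - free_text)
    billable_speech = max(0, speech_requests - free_speech)
    return billable_text * TEXT_RATE + billable_speech * SPEECH_RATE
```

Under these assumptions, a month with 12,000 text requests and 6,000 speech requests in the first year would bill 2,000 text requests and 1,000 speech requests.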
The Nuance Dragon API supports Japanese and streaming recognition. It does not support multi-speaker identification.
Nuance does not publish a maximum audio length for its API. However, because it is a transcription service, we infer that it allows users to transcribe as much audio as desired within a single session.
Nuance reports a 99% accuracy rate (implying a 1% error rate) and the API allows users to customize vocabulary and commands. Nuance does not release pricing data.
IBM does not specify a maximum audio duration for Watson, but it does cap audio size at 100 megabytes.
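Because the Watson limit is expressed in size rather than time, the duration it allows depends on the audio encoding. The sketch below converts a byte limit into seconds under loudly stated assumptions: uncompressed PCM audio, decimal megabytes (1 MB = 1,000,000 bytes), and no container or header overhead. None of these assumptions come from IBM, and compressed formats would allow considerably longer audio.

```python
def max_audio_seconds(limit_bytes: int, sample_rate_hz: int,
                      bytes_per_sample: int, channels: int = 1) -> float:
    """Seconds of uncompressed PCM audio that fit within limit_bytes.

    Hypothetical helper; ignores file headers and any compression.
    """
    bytes_per_second = sample_rate_hz * bytes_per_sample * channels
    return limit_bytes / bytes_per_second


# Example: 100 MB of 16 kHz, 16-bit (2-byte), mono PCM audio.
seconds = max_audio_seconds(100_000_000, 16_000, 2)
print(f"{seconds / 60:.1f} minutes")
```

Under these assumptions the limit works out to a little over 52 minutes of audio.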
In March 2017, IBM reported a breakthrough word error rate of 5.5% for the Watson technology. Watson also allows users to customize its dictionary with new vocabulary.
IBM Watson's pricing structure is based on usage:
- Usage < 250K minutes a month: $0.02 per audio minute transmitted
- Usage > 250K to 500K minutes a month: $0.015 per audio minute transmitted
- Usage > 500K to 1MM minutes a month: $0.0125 per audio minute transmitted
- Usage > 1MM minutes a month: $0.01 per audio minute transmitted
The service also carries an add-on price of $0.03 per minute of audio transmitted. There is no charge for creating or hosting custom models.
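The tiered Watson rates above can be turned into a monthly estimate. The sketch below assumes graduated pricing, meaning each rate applies only to the minutes that fall inside its tier; the source does not say whether IBM bills this way or applies a single rate to all minutes, so treat this as one possible reading. The function name and the `add_on_minutes` parameter are hypothetical.

```python
from decimal import Decimal

# (upper bound of tier in minutes, dollars per minute); None = no upper bound.
TIERS = [
    (250_000, Decimal("0.02")),
    (500_000, Decimal("0.015")),
    (1_000_000, Decimal("0.0125")),
    (None, Decimal("0.01")),
]
ADD_ON_RATE = Decimal("0.03")  # add-on price per minute quoted above


def watson_monthly_cost(minutes: int, add_on_minutes: int = 0) -> Decimal:
    """Estimate a month's charge, assuming graduated (marginal) tiers."""
    if minutes < 0 or add_on_minutes < 0:
        raise ValueError("minutes must be non-negative")
    cost = Decimal("0")
    lower = 0
    for upper, rate in TIERS:
        if minutes <= lower:
            break
        # Minutes that fall inside this tier only.
        top = minutes if upper is None else min(minutes, upper)
        cost += (top - lower) * rate
        if upper is None:
            break
        lower = upper
    return cost + add_on_minutes * ADD_ON_RATE
```

With this reading, 300,000 minutes in a month would be billed as 250,000 minutes at $0.02 plus 50,000 minutes at $0.015. If IBM instead applies one rate to the whole volume, the figure would differ.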
The Google Cloud, Amazon Alexa, Nuance Dragon, and IBM Watson speech recognition APIs all support Japanese. All except Nuance support multi-speaker identification. All except Alexa support streaming recognition. We have also identified additional features and pricing levels for each API, as outlined above.
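For readers who want to filter the four services by requirement, the summary above can be captured as a small lookup table. This is only a sketch of the findings in this report: the key names are hypothetical, and the `False` for Alexa streaming reflects that we found no direct evidence of support, not a confirmed absence.

```python
# Feature matrix transcribed from the summary above.
FEATURES = {
    "Google Cloud": {"japanese": True, "multi_speaker": True, "streaming": True},
    # streaming=False here means "no direct evidence found".
    "Amazon Alexa": {"japanese": True, "multi_speaker": True, "streaming": False},
    "Nuance Dragon": {"japanese": True, "multi_speaker": False, "streaming": True},
    "IBM Watson": {"japanese": True, "multi_speaker": True, "streaming": True},
}


def services_with(*required: str) -> list[str]:
    """Return the services that support every feature named in `required`."""
    return [
        name
        for name, supported in FEATURES.items()
        if all(supported[feature] for feature in required)
    ]


print(services_with("multi_speaker", "streaming"))
```

Requiring both multi-speaker identification and streaming narrows the list to Google Cloud and IBM Watson.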