Learn about troubleshooting steps that you might find helpful if you run into problems using Speech-to-Text.
Cannot authenticate to Speech-to-Text
You might receive an error message indicating that your "Application Default Credentials" are unavailable or you might be wondering how to get an API key to use when calling Speech-to-Text.
Speech-to-Text uses Application Default Credentials (ADC) for authentication.
The credentials for ADC must be available within the context that you call the Speech-to-Text API. For example, if you set up ADC in your terminal but run your code in the debugger of your IDE, the execution context of your code might not have access to the credentials. In that case, your request to Speech-to-Text might fail.
To learn how to provide credentials to ADC, see Set up Application Default Credentials.
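If you're not sure whether your execution context can see the credentials, a quick check before calling the API can help. The following is a minimal sketch, assuming the Python google-auth package is installed:

import google.auth
from google.auth.exceptions import DefaultCredentialsError

try:
    # google.auth.default() resolves the same Application Default Credentials
    # that the Speech-to-Text client libraries use.
    credentials, project_id = google.auth.default()
    print(f"ADC found for project: {project_id}")
except DefaultCredentialsError as error:
    print(f"ADC not available in this context: {error}")

Run this check from the same context (terminal, IDE debugger, container, and so on) in which your Speech-to-Text code runs.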
Speech-to-Text returns an empty response
There are multiple reasons why Speech-to-Text might return an empty response. The source of the problem can be the RecognitionConfig or the audio itself.
Troubleshoot RecognitionConfig
The RecognitionConfig object (or StreamingRecognitionConfig) is part of a Speech-to-Text recognition request. There are two main categories of fields that must be set in order to correctly perform a transcription:
- Audio configuration.
- Model and language.
One of the most common causes of empty responses (for example, you receive an empty {} JSON response) is providing incorrect information about the audio metadata. If the audio configuration fields are not set correctly, transcription will most likely fail and the recognition model will return empty results.
Audio configuration contains the metadata of the provided audio. You can obtain the metadata for your audio file using the ffprobe command, which is part of FFmpeg.
The following example demonstrates using ffprobe to get the metadata for https://storage.googleapis.com/cloud-samples-tests/speech/commercial_mono.wav.
$ ffprobe commercial_mono.wav
[...]
Input #0, wav, from 'commercial_mono.wav':
Duration: 00:00:35.75, bitrate: 128 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 8000 Hz, 1 channels, s16, 128 kb/s
From the output of the command above, we can see that the file has:
- sample_rate_hertz: 8000
- channels: 1
- encoding: LINEAR16 (s16)
You can use this information in your RecognitionConfig.
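For illustration, a RecognitionConfig built from these values with the google-cloud-speech Python client might look like the following sketch (the language code is an assumption, not part of the file metadata):

from google.cloud import speech

# Metadata taken from the ffprobe output above: 8000 Hz, 1 channel,
# 16-bit signed little-endian PCM (LINEAR16).
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    audio_channel_count=1,
    language_code="en-US",  # assumption; set this to your audio's language
)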
An empty response can also be caused by problems with the audio encoding itself. Here are some other tools and things to try:
Play the file and listen to the output. Is the audio clear and the speech intelligible?
To play files, you can use the SoX (Sound eXchange) play command. A few examples based on different audio encodings are shown below.
FLAC files include a header that indicates the sample rate, encoding type, and number of channels, and can be played as follows:
play audio.flac
LINEAR16 files do not include a header. To play them, you must specify the sample rate, encoding type, and number of channels. The LINEAR16 encoding must be 16-bit, signed-integer, little-endian.
play --channels=1 --bits=16 --rate=16000 --encoding=signed-integer \
  --endian=little audio.raw
MULAW files also do not include a header and often use a lower sample rate.
play --channels=1 --rate=8000 --encoding=u-law audio.raw
Check that the audio encoding of your data matches the parameters you sent in RecognitionConfig. For example, if your request specified "encoding":"FLAC" and "sampleRateHertz":16000, the audio data parameters listed by the SoX play command should match these parameters, as follows:
play audio.flac
should list:
Encoding: FLAC
Channels: 1 @ 16-bit
Sampleratehertz: 16000Hz
If the SoX listing shows a Sampleratehertz other than 16000Hz, change the "sampleRateHertz" in your RecognitionConfig to match. If the Encoding is not FLAC or Channels is not 1 @ 16-bit, you cannot use this file directly, and will need to convert it to a compatible encoding (see next step).
If your audio file is not in FLAC encoding, try converting it to FLAC using SoX, and repeat the steps above to play the file and verify the encoding, sampleRateHertz, and channels. Here are some examples that convert various audio file formats to FLAC encoding.
sox audio.wav --channels=1 --bits=16 audio.flac
sox audio.ogg --channels=1 --bits=16 audio.flac
sox audio.au --channels=1 --bits=16 audio.flac
sox audio.aiff --channels=1 --bits=16 audio.flac
To convert a raw file to FLAC, you need to know the audio encoding of the file. For example, to convert stereo 16-bit signed little-endian at 16000Hz to FLAC:
sox --channels=2 --bits=16 --rate=16000 --encoding=signed-integer \
  --endian=little audio.raw --channels=1 --bits=16 audio.flac
Run the Quickstart example or one of the Sample Applications with the supplied sample audio file. Once the example is running successfully, replace the sample audio file with your audio file.
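For reference, a minimal recognition request with the google-cloud-speech Python client might look like the sketch below. It uses the sample file from the ffprobe example; the gs:// path is assumed to mirror the HTTPS URL shown earlier.

from google.cloud import speech

client = speech.SpeechClient()

# Sample file from the ffprobe example: 8000 Hz, mono, LINEAR16.
audio = speech.RecognitionAudio(
    uri="gs://cloud-samples-tests/speech/commercial_mono.wav"
)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    audio_channel_count=1,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)

Once this works, point the uri (or local audio content) at your own file and update the configuration to match its metadata.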
Model and language configuration
Model selection is very important for obtaining high-quality transcription results. Speech-to-Text provides multiple models that have been tuned for different use cases, and you must choose the model that most closely matches your audio.
For example, some models (such as latest_short and command_and_search) are short-form models, which means they are better suited to short audio clips and prompts. These models are likely to return results as soon as they detect a period of silence. Long-form models (such as latest_long, phone_call, video, and default), on the other hand, are better suited to longer audio and are not as sensitive to interpreting silence as the end of the audio.
If your recognition ends too abruptly or doesn't return quickly, you might want to check and experiment with other models to see if you can get better transcription quality. You can experiment with multiple models using the Speech UI.
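As a starting point for such experiments, the sketch below (assuming the google-cloud-speech Python client) runs the same audio through a short-form and a long-form model so you can compare the transcripts; the file and language code are examples only.

from google.cloud import speech

client = speech.SpeechClient()
audio = speech.RecognitionAudio(
    uri="gs://cloud-samples-tests/speech/commercial_mono.wav"
)

for model in ("latest_short", "latest_long"):
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=8000,
        language_code="en-US",
        model=model,
    )
    response = client.recognize(config=config, audio=audio)
    transcript = " ".join(
        result.alternatives[0].transcript for result in response.results
    )
    print(f"{model}: {transcript}")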
Timeout errors
These issues are, for the most part, caused by misconfiguration or misuse of Speech-to-Text.
LongRunningRecognize or BatchRecognize
Issue: You're receiving TimeoutError: Operation did not complete within the designated timeout.
Solution: You can send the transcript to a Cloud Storage bucket or extend the timeout in the request.
This issue occurs when the LongRunningRecognize or BatchRecognize request doesn't complete within the specified timeout. It is not an error that indicates a failure in speech transcription; it means that the transcription results are not yet ready to be retrieved.
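With the google-cloud-speech Python client, one way to extend the timeout is to pass a larger value when polling the long-running operation. The sketch below assumes a hypothetical Cloud Storage path and example configuration values.

from google.cloud import speech

client = speech.SpeechClient()

# Hypothetical bucket and file; replace with your own.
audio = speech.RecognitionAudio(uri="gs://YOUR_BUCKET/long_audio.wav")
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

operation = client.long_running_recognize(config=config, audio=audio)

# result() raises a timeout error if the operation isn't done in time;
# pass a timeout (in seconds) long enough for your audio to be processed.
response = operation.result(timeout=3600)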
StreamingRecognize
Issue: You're receiving Timeout Error: Long duration elapsed without audio. Audio should be sent close to real time.
Solution: Decrease the time between the audio chunks that you send. If Speech-to-Text doesn't receive a new audio chunk every few seconds, it closes the connection and triggers this error.
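The sketch below shows one way to pace a stream close to real time with the google-cloud-speech Python client; the file name, sample rate, and chunk duration are example values.

import time
from google.cloud import speech

client = speech.SpeechClient()

recognition_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)
streaming_config = speech.StreamingRecognitionConfig(config=recognition_config)

CHUNK_SECONDS = 0.1  # send a small chunk roughly every 100 ms
CHUNK_BYTES = int(16000 * 2 * CHUNK_SECONDS)  # 16 kHz, 16-bit mono

def request_generator(path):
    with open(path, "rb") as audio_file:
        while chunk := audio_file.read(CHUNK_BYTES):
            yield speech.StreamingRecognizeRequest(audio_content=chunk)
            time.sleep(CHUNK_SECONDS)  # keep the stream close to real time

responses = client.streaming_recognize(
    streaming_config, request_generator("audio.raw")  # hypothetical file
)
for response in responses:
    for result in response.results:
        print(result.alternatives[0].transcript)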
StreamingRecognize 409 aborted
Issue: You're receiving the 409 Max duration of 5 minutes reached for stream error.
Solution: You're reaching the streaming recognition limit of 5 minutes of audio. When you're getting close to this limit, close the stream and open a new one.
Low transcript quality
Automatic Speech Recognition (ASR) supports a wide variety of use cases. Most quality issues can be addressed by trying different API options. To improve recognition accuracy, follow the guidelines in Best Practices.
Short utterances aren't recognized
Issue: Short end-user utterances like "Yes", "No", and "Next" don't get captured by the API and are missing from the transcript.
Solution: Take the following steps.
Test the same request with different models.
Add speech adaptation and boost missing words.
If you're using streaming input, try setting single_utterance=true.
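For the streaming case, single_utterance is set on the StreamingRecognitionConfig. The following is a minimal sketch with the google-cloud-speech Python client; the encoding, sample rate, and model are example values.

from google.cloud import speech

recognition_config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="command_and_search",  # example short-form model
)
streaming_config = speech.StreamingRecognitionConfig(
    config=recognition_config,
    # Return a result as soon as the service detects that the single
    # utterance has ended, which helps capture short commands.
    single_utterance=True,
)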
Consistently unrecognized word or phrase
Issue: Certain words or phrases are consistently misrecognized, such as "a" being recognized as "8".
Solution: Take the following steps.
Test the same request with different models.
Add speech adaptation and boost missing words. You can use class tokens to boost whole sets of words like digit sequences or addresses. Check available class tokens.
Try increasing max_alternatives. Then check the SpeechRecognitionResult alternatives and choose the first one that matches the format you want.
Formatting can be challenging for ASR. Speech adaptation can often help produce the required format, but post-processing might still be necessary.
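As an illustration of these options, the sketch below (assuming the google-cloud-speech Python client) adds a speech adaptation phrase list with a class token for digit sequences and requests several alternatives; the phrases, boost value, and class token are example values.

from google.cloud import speech

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    # Speech adaptation: boost specific phrases and a class token.
    speech_contexts=[
        speech.SpeechContext(
            phrases=["agent", "$OOV_CLASS_DIGIT_SEQUENCE"],
            boost=15.0,
        )
    ],
    # Ask for several hypotheses and pick the one in the format you need.
    max_alternatives=5,
)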
Mixed or multi-language inputs
Issue: Audio contains speech in multiple languages, such as a conversation between an English speaker and a Spanish speaker, resulting in an incorrect transcription.
Solution: This feature isn't supported. Speech-to-Text can transcribe only one language per request.
Permission denied
Issue: You're receiving the following error.
Permission denied to access GCS object BUCKET-PATH. Source error: PROJECT-ID@gcp-sa-speech.iam.gserviceaccount.com does not have storage.buckets.get access to the Google Cloud Storage bucket. Permission 'storage.buckets.get' denied on resource (or it may not exist).
Solution: Grant PROJECT-ID@gcp-sa-speech.iam.gserviceaccount.com permission to access the file in the BUCKET-PATH bucket.
Invalid argument
Issue: You're receiving the following error.
{
  "error": {
    "code": 400,
    "message": "Request contains an invalid argument.",
    "status": "INVALID_ARGUMENT"
  }
}
Solution: Check the arguments against the API documentation and validate that they're correct. Make sure the selected endpoint matches the location in the request or resource.
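If you use a regional or multi-region location in your request or resource path, the client must also target the matching endpoint. The following is a minimal sketch with the google-cloud-speech Python client; the endpoint name is an example and should be taken from the Speech-to-Text endpoints documentation.

from google.api_core.client_options import ClientOptions
from google.cloud import speech

# Example: a client that targets the "eu" multi-region endpoint so that
# requests referencing an "eu" location are not rejected as invalid.
client = speech.SpeechClient(
    client_options=ClientOptions(api_endpoint="eu-speech.googleapis.com")
)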
Resource exhausted
Issue: You're receiving the following error.
RESOURCE_EXHAUSTED: Resource has been exhausted (e.g. check quota)
Solution: Check how to request a quota increase.
Streaming chunk too large
Issue: You're receiving the following error.
INVALID_ARGUMENT: Request audio can be a maximum of 10485760 bytes. [type.googleapis.com/util.MessageSetPayload='[google.rpc.error_details_ext] { message: "Request audio can be a maximum of 10485760 bytes." }']
Solution: Decrease the size of the audio chunks that you send. We recommend sending chunks of 100 ms of audio for the best latency and to avoid reaching the audio limit.
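For example, 100 ms of LINEAR16 audio is only a few kilobytes, far below the limit in the error above. A small sketch of the arithmetic (the sample rate and channel count are example values):

# Bytes in one 100 ms chunk of 16-bit (2-byte) mono LINEAR16 audio.
SAMPLE_RATE_HERTZ = 16000   # example value; use your audio's sample rate
BYTES_PER_SAMPLE = 2        # LINEAR16 is 16-bit
CHANNELS = 1
CHUNK_MS = 100

chunk_bytes = SAMPLE_RATE_HERTZ * BYTES_PER_SAMPLE * CHANNELS * CHUNK_MS // 1000
print(chunk_bytes)  # 3200 bytes per 100 ms chunk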
Data logging
Issue: Speech-to-Text doesn't provide any Cloud Logging.
Solution: Because Speech-to-Text has data logging disabled by default, customers need to enable it at the project level.