Speech-to-Text is one of the three Vertex AI pre-trained APIs on Google Distributed Cloud (GDC) air-gapped. The Speech-to-Text service recognizes speech in audio files and transcribes audio into text. Speech-to-Text meets data residency and compliance requirements.
The following table describes the key capabilities of Speech-to-Text:
| Key capability | Description |
|---|---|
| Transcription | Apply advanced deep learning neural network algorithms for automatic speech recognition. |
| Models | Deploy recognition models that are less than 1 GB in size and consume minimal resources. |
| API compatible | Use the Speech-to-Text API and its client libraries to send audio and receive a text transcription from the Speech-to-Text service. |
Supported audio encodings for Speech-to-Text
The Speech-to-Text API supports a number of different encodings. The following table lists supported audio codecs:
| Codec | Name | Lossless | Usage notes |
|---|---|---|---|
| `FLAC` | Free Lossless Audio Codec | Yes | 16-bit or 24-bit required for streams |
| `LINEAR16` | Linear PCM | Yes | 16-bit linear pulse-code modulation (PCM) encoding. The header must contain the sample rate. |
| `MULAW` | μ-law | No | 8-bit PCM encoding |
| `OGG_OPUS` | Opus encoded audio frames in an Ogg container | No | Sample rate must be one of 8000 Hz, 12000 Hz, 16000 Hz, 24000 Hz, or 48000 Hz |
`FLAC` is both an audio codec and an audio file format. To transcribe audio files using `FLAC` encoding, you must provide them in the `.FLAC` file format, which includes a header containing metadata.

Speech-to-Text supports `WAV` files with `LINEAR16` or `MULAW` encoded audio.

For more information on Speech-to-Text audio codecs, consult the `AudioEncoding` reference documentation.

If you have a choice when encoding the source material, use a lossless encoding such as `FLAC` or `LINEAR16` for better speech recognition.
Speech-to-Text features
Speech-to-Text on Distributed Cloud has the following three methods to perform speech recognition:
- Synchronous recognition: sends audio data to the Speech-to-Text API, performs recognition on that data, and returns results after audio processing. Synchronous recognition requests are limited to one minute or less of audio data.
- Asynchronous recognition: sends audio data to the Speech-to-Text API and initiates a long-running operation. Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 480 minutes.
- Streaming recognition: performs recognition on audio data provided within a bidirectional stream. Streaming requests are designed for real-time recognition purposes, such as capturing live audio from a microphone. Streaming recognition provides interim results while audio is being captured, allowing results to appear, for example, while a user is still speaking.
Requests contain configuration parameters and audio data. The following sections describe these recognition requests, the responses they generate, and how to handle those responses in more detail.
Synchronous requests and responses
A Speech-to-Text synchronous recognition request is the simplest method for performing recognition on speech audio data. Speech-to-Text can process up to one minute of speech audio data sent in a synchronous request. After Speech-to-Text processes and recognizes all of the audio, it returns a response.
Speech-to-Text must return a response before processing the next request. Speech-to-Text typically processes audio faster than real-time, on average processing 30 seconds of audio in 15 seconds. In cases of poor audio quality, your recognition request can take significantly longer.
Speech recognition requests
A synchronous Speech-to-Text API request comprises a speech recognition configuration and audio data. The following example shows a request:
```json
{
  "config": {
    "encoding": "LINEAR16",
    "sample_rate_hertz": 16000,
    "language_code": "en-US"
  },
  "audio": {
    "content": "ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv..."
  }
}
```
All Speech-to-Text synchronous recognition requests must include a speech recognition `config` field of type `RecognitionConfig`. A `RecognitionConfig` object contains the following required sub-fields:

- `encoding`: specifies the encoding scheme of the supplied audio. This field is of type `AudioEncoding`. If you have a choice in a codec, prefer a lossless encoding such as `FLAC` or `LINEAR16` for best performance. For a list of supported audio encoding formats, see Supported audio encodings for Speech-to-Text. The `encoding` field is optional for `FLAC` and `WAV` files, which include the encoding in the file header.
- `sample_rate_hertz`: specifies the sample rate of the supplied audio in Hertz. For more information on sample rates, see Sample rates. The `sample_rate_hertz` field is optional for `FLAC` and `WAV` files, which include the sample rate in the file header.
- `language_code`: contains the language and region to use for speech recognition of the supplied audio. The language code must be a BCP-47 identifier. Language codes consist of primary language tags and secondary region subtags to indicate dialects. In the example, `en` is for English, and `US` is for the United States. For a list of supported languages, see Supported languages.

For more information and a description of optional sub-fields you can include in the `config` field, see `RecognitionConfig`.

Provide audio to Speech-to-Text through the `audio` parameter of type `RecognitionAudio`. The `audio` field contains the following sub-field:

- `content`: contains the audio to evaluate, embedded within the request. The audio data bytes are encoded using a pure binary representation. JSON representations use Base64. See Embedded audio content for more information. Audio passed directly within this field is limited to one minute in duration.
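The following sketch shows how such a synchronous request might be assembled and sent with the Python client library. It assumes the standard `google-cloud-speech` package; the endpoint and credential configuration for your Distributed Cloud project is deployment-specific and only hinted at in the comments.

```python
# Minimal sketch, assuming the standard google-cloud-speech Python client.
# Endpoint and credential configuration for a Distributed Cloud project is
# deployment-specific and omitted here.
from google.cloud import speech


def transcribe_sync(path: str) -> None:
    client = speech.SpeechClient()  # add client options/credentials as your environment requires

    # Read raw audio bytes; the client library handles the binary/Base64 encoding.
    with open(path, "rb") as audio_file:
        content = audio_file.read()

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(content=content)

    # Synchronous call: blocks until the audio (one minute or less) is processed.
    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        # The first alternative is the most likely transcription.
        print(result.alternatives[0].transcript)
```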
Sample rates
You specify the sample rate of your audio in the `sample_rate_hertz` field of the request configuration, and it must match the sample rate of the associated audio content. Speech-to-Text supports sample rates between 8000 Hz and 48000 Hz. You can specify the sample rate for a `FLAC` or `WAV` file in the file header instead of using the `sample_rate_hertz` field. However, the `sample_rate_hertz` field is required for all other audio formats.
If you have a choice when encoding the source material, capture audio using a sample rate of 16000 Hz. Lower sample rates might impair speech recognition accuracy, and higher sample rates have no appreciable effect on speech recognition quality.
However, if you recorded your audio data at a sample rate other than 16000 Hz, don't resample your audio to 16000 Hz. Most legacy telephony audio, for example, uses sample rates of 8000 Hz, which might give less accurate results. If you must use such audio, provide it to the Speech-to-Text API at its original sample rate.
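If you're unsure of a WAV file's sample rate, you can read it from the file header before filling in the request configuration. A minimal sketch using Python's built-in `wave` module (the file path is a placeholder):

```python
# Minimal sketch: read the sample rate from a WAV file header so that
# sample_rate_hertz (if you set it) matches the audio content.
# "audio.wav" is a placeholder path.
import wave

with wave.open("audio.wav", "rb") as wav_file:
    sample_rate = wav_file.getframerate()

print(f"Sample rate: {sample_rate} Hz")
```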
Languages
The recognition engine of Speech-to-Text supports a variety of languages and dialects. You specify your audio's language (and national or regional dialect) within the `language_code` field of the request configuration using a BCP-47 identifier.
The Supported languages page has a complete list of supported languages for each feature.
Model selection
When you send an audio transcription request to Speech-to-Text, you can process your audio files using a machine learning model trained to recognize speech audio from that particular source type.
To specify a model for speech recognition, include the `model` field in the `RecognitionConfig` object for your request, specifying the model you want to use.
Speech-to-Text on Distributed Cloud supports the following two models:

- `default`: transcribe audio that doesn't fit one of the specific audio models, such as long-form audio.
- `chirp`: transcribe multilingual audio when requiring higher accuracy. Chirp performs automatic speech recognition in many languages, even if those languages are low-resource languages that don't have a lot of labeled data available for training.
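For illustration, a minimal sketch that selects a model when building the configuration, assuming the standard `google-cloud-speech` Python client (the other field values are placeholders):

```python
from google.cloud import speech

# Select the Chirp model by name; use "default" for general-purpose audio.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    model="chirp",
)
```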
Embedded audio content
Embedded audio is included in the speech recognition request when passing a `content` parameter within the request's `audio` field. For embedded audio provided as content within a REST request, that audio must be compatible with JSON serialization.

You can send data directly in the `content` field for synchronous recognition only if your audio data is a maximum of 60 seconds and 10 MB. Any audio data in the `content` field must be in Base64 format.

When constructing a request using a client library, you write out this binary or Base64-encoded data directly within the `content` field.
Most development environments include a `base64` utility that encodes binary data into ASCII text. Additionally, Python has built-in mechanisms for Base64-encoding content. The following examples show how to encode a file:
Linux
Encode the file using the `base64` command line tool. Prevent line-wrapping by using the `-w 0` flag:

```sh
base64 INPUT_FILE -w 0 > OUTPUT_FILE
```
Python
In Python, Base64 encode audio files as follows:
```python
# Import the base64 encoding library.
import base64


# Pass the audio data to an encoding function.
def encode_audio(audio):
    audio_content = audio.read()
    return base64.b64encode(audio_content)
```
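If you assemble the REST request body yourself instead of using a client library, note that `base64.b64encode` returns bytes, so decode the result to a string before placing it in the `content` field. A minimal sketch using only the standard library (the file path and configuration values are placeholders):

```python
# Minimal sketch: embed Base64-encoded audio in a JSON request body.
# "audio.wav" is a placeholder path; adjust the config to match your audio.
import base64
import json

with open("audio.wav", "rb") as audio_file:
    encoded = base64.b64encode(audio_file.read()).decode("utf-8")

request_body = json.dumps({
    "config": {
        "encoding": "LINEAR16",
        "sample_rate_hertz": 16000,
        "language_code": "en-US",
    },
    "audio": {"content": encoded},
})
```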
Speech recognition responses
A synchronous Speech-to-Text API response might take some time to return results. Once processed, the API returns a response as in the following example:
```json
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "words": [
            { "word": "how" },
            { "word": "old" },
            { "word": "is" },
            { "word": "the" },
            { "word": "Brooklyn" },
            { "word": "Bridge" }
          ]
        }
      ]
    }
  ]
}
```
All Speech-to-Text API synchronous recognition responses include speech recognition results of type `RecognizeResponse`. A `RecognizeResponse` object contains the following fields:

- `results`: contains the list of results of type `SpeechRecognitionResult`, where each result corresponds to a segment of audio. Each result consists of one or more of the following sub-fields:
  - `alternatives`: contains a list of possible transcriptions of type `SpeechRecognitionAlternative`. The first alternative in the response is always the most likely. Each alternative consists of the following sub-fields:
    - `transcript`: contains the transcribed text. When provided with sequential alternatives, you can concatenate these transcriptions together.
    - `words`: contains a list of word-specific information for each recognized word.

For more information, see `RecognizeResponse`.
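As a sketch of how you might walk this structure with the Python client library (assuming `google-cloud-speech`, where `response` is the `RecognizeResponse` returned by `client.recognize`):

```python
# Minimal sketch: walk a RecognizeResponse, concatenating the top
# transcription of each result and listing the recognized words.
def print_transcription(response) -> None:
    full_transcript = ""
    for result in response.results:
        # The first alternative is always the most likely transcription.
        top_alternative = result.alternatives[0]
        full_transcript += top_alternative.transcript
        for word_info in top_alternative.words:
            print("word:", word_info.word)
    print("transcript:", full_transcript)
```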
Asynchronous requests and responses
An asynchronous Speech-to-Text API request is identical in form to a synchronous request. However, instead of returning a response, the asynchronous request initiates a long-running operation and returns this operation immediately. You can use asynchronous speech recognition with audio of any length up to 480 minutes.
The following is an example of an operation response:
```json
{
  "name": "OPERATION_NAME",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech_v1p1beta1.LongRunningRecognizeMetadata",
    "progressPercent": 34,
    "startTime": "2016-08-30T23:26:29.579144Z",
    "lastUpdateTime": "2016-08-30T23:26:29.826903Z"
  }
}
```
Note that results are not yet present. Speech-to-Text continues to process the audio and uses this operation to store the results. Results appear in the `response` field of the operation returned when the `LongRunningRecognize` request is complete.
The following is an example of a full response after completion of the request:
```json
{
  "name": "1268386125834704889",
  "metadata": {
    "lastUpdateTime": "2016-08-31T00:16:32.169Z",
    "@type": "type.googleapis.com/google.cloud.speech_v1p1beta1.LongRunningRecognizeMetadata",
    "startTime": "2016-08-31T00:16:29.539820Z",
    "progressPercent": 100
  },
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech_v1p1beta1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "how old is the Brooklyn Bridge",
            "words": [
              { "word": "how" },
              { "word": "old" },
              { "word": "is" },
              { "word": "the" },
              { "word": "Brooklyn" },
              { "word": "Bridge" }
            ]
          }
        ]
      }
    ]
  },
  "done": true
}
```
Note that `done` is set to `true` and that the operation's `response` contains a set of results of type `SpeechRecognitionResult`, the same type returned by a synchronous recognition request.
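As a hedged sketch of the client-side flow, assuming the standard `google-cloud-speech` Python client (the storage URI and timeout are placeholders; endpoint and credential setup for your Distributed Cloud project is not shown):

```python
# Minimal sketch of an asynchronous request, assuming the standard
# google-cloud-speech Python client. STORAGE_URI is a placeholder for a
# location your Speech-to-Text deployment can read long audio from.
from google.cloud import speech


def transcribe_async() -> None:
    client = speech.SpeechClient()  # endpoint/credential setup is deployment-specific

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    audio = speech.RecognitionAudio(uri="STORAGE_URI")

    # Returns a long-running operation immediately; results arrive later.
    operation = client.long_running_recognize(config=config, audio=audio)

    # Block until the operation completes (or poll operation.done() yourself).
    response = operation.result(timeout=3600)
    for result in response.results:
        print(result.alternatives[0].transcript)
```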
Streaming requests and responses
A streaming Speech-to-Text API recognition call is designed for real-time capture and recognition of audio within a bidirectional stream. Your application can send audio on the request stream and receive real-time interim and final recognition results on the response stream. Interim results represent the current recognition result for a section of audio, while the final recognition result represents the last, best guess for that section of audio.
Streaming recognition requests
Unlike synchronous and asynchronous calls, in which you send both the configuration and audio within a single request, calling the streaming Speech-to-Text API requires sending multiple requests. The first `StreamingRecognizeRequest` must contain a configuration of type `StreamingRecognitionConfig`.

A `StreamingRecognitionConfig` consists of the `config` field, which contains configuration information for the audio of type `RecognitionConfig` and is the same as the one shown within synchronous and asynchronous requests.
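A minimal sketch of the request side, assuming the standard `google-cloud-speech` Python client and its `streaming_recognize` helper, which sends the configuration request before forwarding the audio-only requests (`audio_chunks` is a placeholder for your audio source, such as microphone capture):

```python
# Minimal sketch of the request side of a streaming call, assuming the
# standard google-cloud-speech Python client. audio_chunks is a placeholder
# for an iterator of raw audio byte chunks (for example, from a microphone).
from google.cloud import speech


def stream_requests(audio_chunks):
    client = speech.SpeechClient()  # endpoint/credential setup is deployment-specific

    streaming_config = speech.StreamingRecognitionConfig(
        config=speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=16000,
            language_code="en-US",
        ),
        interim_results=True,
    )

    # After the configuration, every request carries only audio content.
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk) for chunk in audio_chunks
    )
    return client.streaming_recognize(config=streaming_config, requests=requests)
```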
Streaming recognition responses
Streaming speech recognition results return a series of responses of type `StreamingRecognizeResponse`. Such a response consists of the following fields:

- `speech_event_type`: contains events of type `SpeechEventType`. The value of these events indicates when a single utterance has been completed. The speech events serve as markers within your stream's response.
- `results`: contains the list of results, which might be either interim or final results, of type `StreamingRecognitionResult`. The `results` list includes the following sub-fields:
  - `alternatives`: contains a list of alternative transcriptions.
  - `is_final`: indicates whether the results obtained within this list entry are interim or final.
  - `result_end_time`: indicates the time offset of the end of this result relative to the beginning of the audio.
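Continuing the sketch above, consuming the response stream might look like the following (the `handle_responses` helper is hypothetical; `responses` is the iterator returned by the streaming call):

```python
# Minimal sketch: consume streaming responses, distinguishing interim from
# final results. `responses` is the iterator returned by streaming_recognize.
def handle_responses(responses):
    for response in responses:
        for result in response.results:
            if not result.alternatives:
                continue
            transcript = result.alternatives[0].transcript
            if result.is_final:
                print("final:", transcript)
            else:
                print("interim:", transcript)
```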
Chirp: Universal speech model
Chirp is the next generation of Speech-to-Text models on Google Distributed Cloud (GDC) air-gapped. Representing a version of a Universal Speech Model, Chirp has over 2B parameters and can transcribe many languages in a single model.
You can transcribe audio files in other supported languages by enabling the Chirp component.
Chirp achieves state-of-the-art Word Error Rate (WER) on various public test sets and languages, offering multi-language support on Distributed Cloud. It uses a universal encoder that trains models with a different architecture than current speech models, using data in many other languages. The model is then fine-tuned to offer transcription for specific languages. A single model unifies data from multiple languages. However, users still specify the language in which the model must recognize speech.
Chirp processes speech in much larger chunks than other models. Results are only available after an entire utterance has finished, which means Chirp might not be suitable for true, real-time use.
The model identifier for Chirp is `chirp`. Therefore, you can set the value `chirp` in the `model` field of the request's `RecognitionConfig` object.
Available API methods
Chirp supports both the `Recognize` and `StreamingRecognize` Speech-to-Text API methods.
The two methods differ in that `StreamingRecognize` only returns results after each utterance. For this reason, it has a latency on the order of seconds rather than milliseconds after speech starts, compared to the `Recognize` method. However, `StreamingRecognize` has very low latency after an utterance finishes, for example, in a sentence followed by a pause.