The Speech-to-Text service of Vertex AI on Google Distributed Cloud (GDC) air-gapped recognizes speech from audio files. Speech-to-Text converts the detected audio into text transcriptions using its pre-trained API.
Speech-to-Text includes Chirp, an advanced speech model trained on millions of hours of audio data and billions of text sentences. This model contrasts with conventional speech recognition techniques, which focus on large amounts of language-specific supervised data. Chirp's approach gives users improved recognition and transcription for more spoken languages and accents.
This page shows you how to transcribe audio files into text using the Speech-to-Text API on Distributed Cloud.
Before you begin
Before you can start using the Speech-to-Text API, you must have a project with the Speech-to-Text API enabled and have the appropriate credentials. You can also install client libraries to help you make calls to the API. For more information, see Set up a speech recognition project.
Transcribe audio with the default model
Speech-to-Text performs speech recognition on audio that you send directly as content in the API request. The system returns the resulting transcribed text in the API response.
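Because the audio travels inside the request body, it must be Base64-encoded first. The following is a minimal sketch of preparing that value; the byte payload is a stand-in to keep the sketch self-contained, and in practice you would read the bytes from your own audio file:

```python
import base64

# In practice, read the raw bytes from your audio file, for example:
#   with open("audio.flac", "rb") as f:
#       audio_bytes = f.read()
# A short stand-in payload (the start of a FLAC header) is used here.
audio_bytes = b"fLaC\x00\x00\x00\x22"

# Encode the bytes; this string is what replaces BASE64_ENCODED_AUDIO.
content = base64.standard_b64encode(audio_bytes).decode("utf-8")

# The transcription script decodes the string back before attaching it
# to RecognitionAudio, so the round trip must be lossless.
assert base64.standard_b64decode(content) == audio_bytes
```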
You must provide a RecognitionConfig configuration object when making a speech recognition request. This object tells the API how to process your audio data and what kind of output you expect. If a model is not explicitly specified in this configuration object, Speech-to-Text selects a default model.
For more information, see the Speech API documentation.
The following example transcribes speech from an audio file using the default Speech-to-Text model:
Python
Follow these steps to use the Speech-to-Text service from a Python script to transcribe speech from an audio file:
Install the latest version of the Speech-to-Text client library.
Add the following code to the Python script you created:
import base64
from google.cloud import speech_v1p1beta1
import google.auth
from google.auth.transport import requests
from google.api_core.client_options import ClientOptions

audience = "https://ENDPOINT:443"
api_endpoint = "ENDPOINT:443"

def get_client(creds):
    """Return a Speech client that targets the organization's endpoint."""
    opts = ClientOptions(api_endpoint=api_endpoint)
    return speech_v1p1beta1.SpeechClient(credentials=creds, client_options=opts)

def main():
    """Obtain and refresh credentials for the air-gapped endpoint."""
    creds = None
    try:
        creds, project_id = google.auth.default()
        creds = creds.with_gdch_audience(audience)
        req = requests.Request()
        creds.refresh(req)
        print("Got token: ")
        print(creds.token)
    except Exception as e:
        print("Caught exception " + str(e))
        raise e
    return creds

def speech_func(creds):
    """Send a speech recognition request and print the response."""
    tc = get_client(creds)

    # Specify the audio to transcribe.
    content = "BASE64_ENCODED_AUDIO"
    audio = speech_v1p1beta1.RecognitionAudio()
    audio.content = base64.standard_b64decode(content)

    # Describe how the API should process the audio data.
    config = speech_v1p1beta1.RecognitionConfig()
    config.encoding = speech_v1p1beta1.RecognitionConfig.AudioEncoding.ENCODING
    config.sample_rate_hertz = RATE_HERTZ
    config.language_code = "LANGUAGE_CODE"
    config.audio_channel_count = CHANNEL_COUNT

    metadata = [("x-goog-user-project", "projects/PROJECT_ID")]
    resp = tc.recognize(config=config, audio=audio, metadata=metadata)
    print(resp)

if __name__ == "__main__":
    creds = main()
    speech_func(creds)
Replace the following:
- ENDPOINT: the Speech-to-Text endpoint that you use for your organization. For more information, view service status and endpoints.
- PROJECT_ID: your project ID.
- BASE64_ENCODED_AUDIO: the audio data bytes encoded in a Base64 representation. This string begins with characters that look similar to ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv. For more information, see RecognitionAudio.
- ENCODING: the encoding of the audio data sent in the request, such as LINEAR16. For more information, see AudioEncoding.
- RATE_HERTZ: sample rate in Hertz of the audio data sent in the request, such as 16000. For more information, see RecognitionConfig.
- LANGUAGE_CODE: the language of the supplied audio as a BCP-47 language tag. See the list of supported languages and their respective language codes.
- CHANNEL_COUNT: the number of channels in the input audio data, such as 1. For more information, see RecognitionConfig.
Save the Python script.
Run the Python script to transcribe audio:
python SCRIPT_NAME
Replace SCRIPT_NAME with the name you gave to your Python script, for example, speech.py.
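The script above prints the whole recognition response. If you only need the transcribed text, you can pull the top alternative out of each result instead; the helper name below is illustrative, not part of the client library:

```python
def extract_transcripts(response):
    """Return the top transcription alternative for each result in a
    recognize response."""
    return [result.alternatives[0].transcript for result in response.results]

# Usage with the response returned by tc.recognize(...):
#   for transcript in extract_transcripts(resp):
#       print("Transcript: {}".format(transcript))
```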
Transcribe audio with Chirp
Similar to the Speech-to-Text default model, you must provide a RecognitionConfig configuration object when making a speech recognition request. To use Chirp, you must explicitly specify this model in the configuration object by setting the value chirp in the model field.
The following example transcribes speech from an audio file using the Chirp model:
Python
Follow these steps to use Chirp from a Python script to transcribe speech from an audio file:
Install the latest version of the Speech-to-Text client library.
Add the following code to the Python script you created:
import base64

# Import the client library.
from google.cloud import speech_v1p1beta1
from google.cloud.speech_v1p1beta1.services.speech import client
import google.auth
from google.auth.transport import requests
from google.api_core.client_options import ClientOptions

audience = "https://ENDPOINT:443"
api_endpoint = "ENDPOINT:443"

def get_client(creds):
    """Return a Speech client that targets the organization's endpoint."""
    opts = ClientOptions(api_endpoint=api_endpoint)
    return client.SpeechClient(credentials=creds, client_options=opts)

# Obtain and refresh credentials for the air-gapped endpoint.
creds, project_id = google.auth.default()
creds = creds.with_gdch_audience(audience)
creds.refresh(requests.Request())

# Specify the audio to transcribe.
tc = get_client(creds)
content = "BASE64_ENCODED_AUDIO"
audio = speech_v1p1beta1.RecognitionAudio()
audio.content = base64.standard_b64decode(content)

# Select the Chirp model by setting the model field explicitly.
config = speech_v1p1beta1.RecognitionConfig(
    encoding=speech_v1p1beta1.RecognitionConfig.AudioEncoding.ENCODING,
    sample_rate_hertz=RATE_HERTZ,
    audio_channel_count=CHANNEL_COUNT,
    language_code="LANGUAGE_CODE",
    model="chirp",
)

# Detect speech in the audio file.
metadata = (("x-goog-user-project", "projects/PROJECT_ID"),)
response = tc.recognize(config=config, audio=audio, metadata=metadata)

for result in response.results:
    print("Transcript: {}".format(result.alternatives[0].transcript))
Replace the following:
- ENDPOINT: the Speech-to-Text endpoint that you use for your organization. For more information, view service status and endpoints.
- BASE64_ENCODED_AUDIO: the audio data bytes encoded in a Base64 representation. This string begins with characters that look similar to ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv. For more information, see RecognitionAudio.
- ENCODING: the encoding of the audio data sent in the request, such as LINEAR16. For more information, see AudioEncoding.
- RATE_HERTZ: sample rate in Hertz of the audio data sent in the request, such as 16000. For more information, see RecognitionConfig.
- CHANNEL_COUNT: the number of channels in the input audio data, such as 1. For more information, see RecognitionConfig.
- LANGUAGE_CODE: the language of the supplied audio as a BCP-47 language tag. See the list of supported languages and their respective language codes.
- PROJECT_ID: your project ID.
Save the Python script.
Run the Python script to transcribe audio:
python SCRIPT_NAME
Replace SCRIPT_NAME with the name you gave to your Python script, for example, speech.py.