Transcribe audio

The Speech-to-Text service of Vertex AI on Google Distributed Cloud (GDC) air-gapped recognizes speech from audio files. Speech-to-Text converts the detected audio into text transcriptions using its pre-trained models.

Speech-to-Text includes Chirp, an advanced speech model trained on millions of hours of audio data and billions of text sentences. This model contrasts with conventional speech recognition techniques, which rely on large amounts of language-specific supervised data, and gives users improved recognition and transcription across spoken languages and accents.

This page shows you how to transcribe audio files into text using the Speech-to-Text API on Distributed Cloud.

Before you begin

Before you can start using the Speech-to-Text API, you must have a project with the Speech-to-Text API enabled and have the appropriate credentials. You can also install client libraries to help you make calls to the API. For more information, see Set up a speech recognition project.

Transcribe audio with the default model

Speech-to-Text performs speech recognition. You send the audio from which you want to recognize speech directly as content in the API request, and the system returns the resulting transcribed text in the API response.

You must provide a RecognitionConfig configuration object when making a speech recognition request. This object tells the API how to process your audio data and what kind of output you expect. If a model is not explicitly specified in this configuration object, Speech-to-Text selects a default model.

For more information, see the Speech API documentation.
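For illustration, the following minimal sketch shows a configuration that leaves the model field unset, so Speech-to-Text falls back to the default model. The field values are examples only:

    from google.cloud import speech_v1p1beta1

    # No model field is set, so Speech-to-Text selects a default model.
    config = speech_v1p1beta1.RecognitionConfig(
        encoding=speech_v1p1beta1.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )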

The following example transcribes speech from an audio file using the default Speech-to-Text model:

Python

Follow these steps to use the Speech-to-Text service from a Python script to transcribe speech from an audio file:

  1. Install the latest version of the Speech-to-Text client library.

  2. Set the required environment variables in a Python script.

  3. Authenticate your API request.

  4. Add the following code to the Python script you created:

    import base64
    
    from google.cloud import speech_v1p1beta1
    import google.auth
    from google.auth.transport import requests
    from google.api_core.client_options import ClientOptions
    
    audience = "https://ENDPOINT:443"
    api_endpoint = "ENDPOINT:443"
    
    def get_client(creds):
      # Create a Speech-to-Text client that targets your organization's endpoint.
      opts = ClientOptions(api_endpoint=api_endpoint)
      return speech_v1p1beta1.SpeechClient(credentials=creds, client_options=opts)
    
    def main():
      # Obtain default credentials and refresh them for the GDC audience.
      creds = None
      try:
        creds, project_id = google.auth.default()
        creds = creds.with_gdch_audience(audience)
        req = requests.Request()
        creds.refresh(req)
        print("Got token: ")
        print(creds.token)
      except Exception as e:
        print("Caught exception: " + str(e))
        raise e
      return creds
    
    def speech_func(creds):
      tc = get_client(creds)
    
      content = "BASE64_ENCODED_AUDIO"
    
      # Build the request: the audio to transcribe and how to process it.
      audio = speech_v1p1beta1.RecognitionAudio()
      audio.content = base64.standard_b64decode(content)
      config = speech_v1p1beta1.RecognitionConfig()
      config.encoding = speech_v1p1beta1.RecognitionConfig.AudioEncoding.ENCODING
      config.sample_rate_hertz = RATE_HERTZ
      config.language_code = "LANGUAGE_CODE"
      config.audio_channel_count = CHANNEL_COUNT
    
      # The config doesn't set a model, so Speech-to-Text uses the default model.
      metadata = [("x-goog-user-project", "projects/PROJECT_ID")]
      resp = tc.recognize(config=config, audio=audio, metadata=metadata)
      print(resp)
    
    if __name__ == "__main__":
      creds = main()
      speech_func(creds)
    

    Replace the following:

    • ENDPOINT: the Speech-to-Text endpoint that you use for your organization. For more information, view service status and endpoints.
    • PROJECT_ID: your project ID.
    • BASE64_ENCODED_AUDIO: the audio data bytes encoded in a Base64 representation. This string begins with characters that look similar to ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv. For more information, see RecognitionAudio. For one way to produce this string from a local audio file, see the sketch after this list.
    • ENCODING: the encoding of the audio data sent in the request, such as LINEAR16. For more information, see AudioEncoding.
    • RATE_HERTZ: sample rate in Hertz of the audio data sent in the request, such as 16000. For more information, see RecognitionConfig.
    • LANGUAGE_CODE: the language of the supplied audio as a BCP-47 language tag. See the list of supported languages and their respective language codes.
    • CHANNEL_COUNT: the number of channels in the input audio data, such as 1. For more information, see RecognitionConfig.
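
    If you need to generate the Base64-encoded string from a local audio file, the following minimal sketch shows one way to do it. The file name audio.flac is a placeholder for your own file:

    import base64

    # Read a local audio file and encode its bytes as Base64 text.
    # "audio.flac" is a placeholder; replace it with your own file.
    with open("audio.flac", "rb") as f:
      encoded = base64.standard_b64encode(f.read()).decode("utf-8")

    # Paste this value in place of BASE64_ENCODED_AUDIO.
    print(encoded)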
  5. Save the Python script.

  6. Run the Python script to transcribe audio:

    python SCRIPT_NAME
    

    Replace SCRIPT_NAME with the name you gave to your Python script, for example, speech.py.
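
The script prints the entire API response. If you only want the transcribed text, you can iterate over the results instead, as the Chirp example in the next section does. A minimal sketch, assuming the resp variable from the script above:

    # Print only the top transcription alternative for each result.
    for result in resp.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))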

Transcribe audio with Chirp

As with the Speech-to-Text default model, you must provide a RecognitionConfig configuration object when making a speech recognition request. To use Chirp, you must explicitly specify this model by setting the model field to the value chirp in this configuration object.

The following example transcribes speech from an audio file using the Chirp model:

Python

Follow these steps to use Chirp from a Python script to transcribe speech from an audio file:

  1. Install the latest version of the Speech-to-Text client library.

  2. Set the required environment variables in a Python script.

  3. Authenticate your API request.

  4. Add the following code to the Python script you created:

    import base64
    
    # Import the client library.
    from google.cloud import speech_v1p1beta1
    import google.auth
    from google.auth.transport import requests
    from google.api_core.client_options import ClientOptions
    
    audience = "https://ENDPOINT:443"
    api_endpoint = "ENDPOINT:443"
    
    def get_client(creds):
      opts = ClientOptions(api_endpoint=api_endpoint)
      return speech_v1p1beta1.SpeechClient(credentials=creds, client_options=opts)
    
    # Obtain default credentials and refresh them for the GDC audience.
    creds, project_id = google.auth.default()
    creds = creds.with_gdch_audience(audience)
    creds.refresh(requests.Request())
    
    # Specify the audio to transcribe.
    tc = get_client(creds)
    content = "BASE64_ENCODED_AUDIO"
    
    audio = speech_v1p1beta1.RecognitionAudio()
    audio.content = base64.standard_b64decode(content)
    
    # Explicitly select Chirp by setting the model field to "chirp".
    config = speech_v1p1beta1.RecognitionConfig(
        encoding=speech_v1p1beta1.RecognitionConfig.AudioEncoding.ENCODING,
        sample_rate_hertz=RATE_HERTZ,
        audio_channel_count=CHANNEL_COUNT,
        language_code="LANGUAGE_CODE",
        model="chirp"
    )
    
    # Detect speech in the audio file and print each transcript.
    metadata = (("x-goog-user-project", "projects/PROJECT_ID"),)
    response = tc.recognize(config=config, audio=audio, metadata=metadata)
    
    for result in response.results:
        print("Transcript: {}".format(result.alternatives[0].transcript))

    Replace the following:

    • ENDPOINT: the Speech-to-Text endpoint that you use for your organization. For more information, view service status and endpoints.
    • BASE64_ENCODED_AUDIO: the audio data bytes encoded in a Base64 representation. This string begins with characters that look similar to ZkxhQwAAACIQABAAAAUJABtAA+gA8AB+W8FZndQvQAyjv. For more information, see RecognitionAudio.
    • ENCODING: the encoding of the audio data sent in the request, such as LINEAR16. For more information, see AudioEncoding.
    • RATE_HERTZ: sample rate in Hertz of the audio data sent in the request, such as 16000. For more information, see RecognitionConfig.
    • CHANNEL_COUNT: the number of channels in the input audio data, such as 1. For more information, see RecognitionConfig.
    • LANGUAGE_CODE: the language of the supplied audio as a BCP-47 language tag. See the list of supported languages and their respective language codes.
    • PROJECT_ID: your project ID.
  5. Save the Python script.

  6. Run the Python script to transcribe audio:

    python SCRIPT_NAME
    

    Replace SCRIPT_NAME with the name you gave to your Python script, for example, speech.py.