Audio understanding (speech only)

You can add audio to Gemini requests to perform tasks that involve understanding the contents of the included audio. This page shows you how to add audio to your requests to Gemini in Vertex AI by using the Google Cloud console and the Vertex AI API.

Supported models

The following table lists the models that support audio understanding:

Model: Gemini 1.5 Flash (Go to the Gemini 1.5 Flash model card)

Audio modality details:

  • Maximum audio length per prompt: ~8.4 hours or up to 1 million tokens.
  • Speech can be understood for audio summarization, transcription, and translation.

Model: Gemini 1.5 Pro (Go to the Gemini 1.5 Pro model card)

Audio modality details:

  • Maximum audio length per prompt: ~8.4 hours or up to 1 million tokens.
  • Speech can be understood for audio summarization, transcription, and translation.
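The two limits listed above can be related with simple arithmetic. The following Python sketch derives an approximate tokens-per-second rate from those figures (1 million tokens over roughly 8.4 hours) and uses it to estimate whether a clip of a given length fits in one prompt. The rate is inferred from the rounded numbers above and is not a documented constant; the function names are illustrative.

```python
# Figures taken from the model details above; both are approximate.
MAX_AUDIO_TOKENS = 1_000_000
MAX_AUDIO_HOURS = 8.4

# Derived rate: an assumption based on the rounded figures, not an official constant.
TOKENS_PER_SECOND = MAX_AUDIO_TOKENS / (MAX_AUDIO_HOURS * 3600)


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate the token count for an audio clip of the given duration."""
    return round(duration_seconds * TOKENS_PER_SECOND)


def fits_in_one_prompt(duration_seconds: float) -> bool:
    """Return True if the estimated token count stays within the per-prompt limit."""
    return estimate_audio_tokens(duration_seconds) <= MAX_AUDIO_TOKENS


one_hour = 60 * 60
print(f"Roughly {TOKENS_PER_SECOND:.1f} tokens per second of audio")
print(f"1-hour clip: about {estimate_audio_tokens(one_hour)} tokens")
print(f"9-hour clip fits in one prompt: {fits_in_one_prompt(9 * one_hour)}")
```

Treat the result as a rough planning aid only; the actual token count for a file is determined by the service.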

For a list of languages supported by Gemini models, see the model information for Google models. To learn more about how to design multimodal prompts, see Design multimodal prompts. If you're looking for a way to use Gemini directly from your mobile and web apps, see the Google AI SDKs for Android, Swift, and web.

Add audio to a request

You can add audio files in your requests to Gemini.

Single audio

The following shows you how to use an audio file to summarize a podcast.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

Streaming and non-streaming responses

You can choose whether the model generates a streaming response or a non-streaming response. With a streaming response, you receive each chunk of the response as soon as its output tokens are generated. With a non-streaming response, you receive the complete response only after all of the output tokens are generated.

For a streaming response, use the stream parameter in generate_content.

  response = model.generate_content(contents=[...], stream=True)
  

For a non-streaming response, remove the parameter, or set the parameter to False.
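To make the difference concrete, the sketch below consumes a streaming-style response chunk by chunk and assembles the full text. It assumes that each streamed chunk exposes a text attribute, as the response object in the sample code does. The stand-in chunks are hypothetical so that the snippet runs without calling the API.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print each chunk's text as it arrives and return the concatenated text."""
    pieces = []
    for chunk in chunks:
        print(chunk.text, end="", flush=True)
        pieces.append(chunk.text)
    print()
    return "".join(pieces)


# Hypothetical stand-ins for the chunks that
# model.generate_content(contents, stream=True) would yield.
fake_chunks = [
    SimpleNamespace(text="Chapter 1: "),
    SimpleNamespace(text="Introduction"),
]
full_text = collect_stream(fake_chunks)
```

With the SDK, you would pass the iterable returned by model.generate_content(contents, stream=True) instead of fake_chunks.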

Sample code


  import vertexai
  from vertexai.generative_models import GenerativeModel, Part

  # TODO(developer): Replace with your Google Cloud project ID.
  project_id = "PROJECT_ID"

  vertexai.init(project=project_id, location="us-central1")

  model = GenerativeModel(model_name="gemini-1.5-flash-001")

  prompt = """
  Please provide a summary for the audio.
  Provide chapter titles, be concise and short, no need to provide chapter summaries.
  Do not make up any information that is not part of the audio and do not be verbose.
"""

  audio_file_uri = "gs://cloud-samples-data/generative-ai/audio/pixel.mp3"
  audio_file = Part.from_uri(audio_file_uri, mime_type="audio/mpeg")

  contents = [audio_file, prompt]

  response = model.generate_content(contents)
  print(response.text)

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart. For more information, see the Vertex AI Java SDK for Gemini reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Streaming and non-streaming responses

You can choose whether the model generates a streaming response or a non-streaming response. With a streaming response, you receive each chunk of the response as soon as its output tokens are generated. With a non-streaming response, you receive the complete response only after all of the output tokens are generated.

For a streaming response, use the generateContentStream method.

  public ResponseStream generateContentStream(Content content)
  

For a non-streaming response, use the generateContent method.

  public GenerateContentResponse generateContent(Content content)
  

Sample code

import com.google.cloud.vertexai.VertexAI;
import com.google.cloud.vertexai.api.GenerateContentResponse;
import com.google.cloud.vertexai.generativeai.ContentMaker;
import com.google.cloud.vertexai.generativeai.GenerativeModel;
import com.google.cloud.vertexai.generativeai.PartMaker;
import com.google.cloud.vertexai.generativeai.ResponseHandler;
import java.io.IOException;

public class AudioInputSummarization {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-google-cloud-project-id";
    String location = "us-central1";
    String modelName = "gemini-1.5-flash-001";

    summarizeAudio(projectId, location, modelName);
  }

  // Analyzes the given audio input.
  public static String summarizeAudio(String projectId, String location, String modelName)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs
    // to be created once, and can be reused for multiple requests.
    try (VertexAI vertexAI = new VertexAI(projectId, location)) {
      String audioUri = "gs://cloud-samples-data/generative-ai/audio/pixel.mp3";

      GenerativeModel model = new GenerativeModel(modelName, vertexAI);
      GenerateContentResponse response = model.generateContent(
          ContentMaker.fromMultiModalData(
              "Please provide a summary for the audio.\n"
                  + "Provide chapter titles with timestamps, be concise and short, "
                  + "no need to provide chapter summaries.\n"
                  + "Do not make up any information that is not part of the audio "
                  + "and do not be verbose.",
              PartMaker.fromMimeTypeAndData("audio/mp3", audioUri)
          ));

      String output = ResponseHandler.getText(response);
      System.out.println(output);

      return output;
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Generative AI quickstart using the Node.js SDK. For more information, see the Node.js SDK for Gemini reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Streaming and non-streaming responses

You can choose whether the model generates a streaming response or a non-streaming response. With a streaming response, you receive each chunk of the response as soon as its output tokens are generated. With a non-streaming response, you receive the complete response only after all of the output tokens are generated.

For a streaming response, use the generateContentStream method.

  const streamingResp = await generativeModel.generateContentStream(request);
  

For a non-streaming response, use the generateContent method.

  const resp = await generativeModel.generateContent(request);
  

Sample code

const {VertexAI} = require('@google-cloud/vertexai');

/**
 * TODO(developer): Update these variables before running the sample.
 */
async function summarize_audio(projectId = 'PROJECT_ID') {
  const vertexAI = new VertexAI({project: projectId, location: 'us-central1'});

  const generativeModel = vertexAI.getGenerativeModel({
    model: 'gemini-1.5-flash-001',
  });

  const filePart = {
    file_data: {
      file_uri: 'gs://cloud-samples-data/generative-ai/audio/pixel.mp3',
      mime_type: 'audio/mpeg',
    },
  };
  const textPart = {
    text: `
    Please provide a summary for the audio.
    Provide chapter titles with timestamps, be concise and short, no need to provide chapter summaries.
    Do not make up any information that is not part of the audio and do not be verbose.`,
  };

  const request = {
    contents: [{role: 'user', parts: [filePart, textPart]}],
  };

  const resp = await generativeModel.generateContent(request);
  const contentResponse = await resp.response;
  console.log(JSON.stringify(contentResponse));
}

Go

Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart. For more information, see the Vertex AI Go SDK for Gemini reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Streaming and non-streaming responses

You can choose whether the model generates a streaming response or a non-streaming response. With a streaming response, you receive each chunk of the response as soon as its output tokens are generated. With a non-streaming response, you receive the complete response only after all of the output tokens are generated.

For a streaming response, use the GenerateContentStream method.

  iter := model.GenerateContentStream(ctx, genai.Text("Tell me a story about a lumberjack and his giant ox. Keep it very short."))
  

For a non-streaming response, use the GenerateContent method.

  resp, err := model.GenerateContent(ctx, genai.Text("What is the average size of a swallow?"))
  

Sample code

import (
	"context"
	"errors"
	"fmt"
	"io"
	"mime"
	"path/filepath"

	"cloud.google.com/go/vertexai/genai"
)

// audioPrompt is a sample prompt type consisting of one audio asset, and a text question.
type audioPrompt struct {
	// audio is a Google Cloud Storage path starting with "gs://"
	audio string
	// question asked to the model
	question string
}

// summarizeAudio shows how to send an audio asset and a text question to a model, writing the response to the
// provided io.Writer.
func summarizeAudio(w io.Writer, prompt audioPrompt, projectID, location, modelName string) error {
	// prompt := audioPrompt{
	// 	audio: "gs://cloud-samples-data/generative-ai/audio/pixel.mp3",
	// 	question: `
	// 		Please provide a summary for the audio.
	// 		Provide chapter titles with timestamps, be concise and short, no need to provide chapter summaries.
	// 		Do not make up any information that is not part of the audio and do not be verbose.
	// 	`,
	// }
	// location := "us-central1"
	// modelName := "gemini-1.5-flash-001"
	ctx := context.Background()

	client, err := genai.NewClient(ctx, projectID, location)
	if err != nil {
		return fmt.Errorf("unable to create client: %w", err)
	}
	defer client.Close()

	model := client.GenerativeModel(modelName)
	model.SetTemperature(0.4)

	// Given an audio file URL, prepare audio file as genai.Part
	part := genai.FileData{
		MIMEType: mime.TypeByExtension(filepath.Ext(prompt.audio)),
		FileURI:  prompt.audio,
	}

	res, err := model.GenerateContent(ctx, part, genai.Text(prompt.question))
	if err != nil {
		return fmt.Errorf("unable to generate contents: %w", err)
	}

	if len(res.Candidates) == 0 ||
		len(res.Candidates[0].Content.Parts) == 0 {
		return errors.New("empty response from model")
	}

	fmt.Fprintf(w, "generated summary:\n%s\n", res.Candidates[0].Content.Parts[0])
	return nil
}

C#

Before trying this sample, follow the C# setup instructions in the Vertex AI quickstart. For more information, see the Vertex AI C# reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Streaming and non-streaming responses

You can choose whether the model generates a streaming response or a non-streaming response. With a streaming response, you receive each chunk of the response as soon as its output tokens are generated. With a non-streaming response, you receive the complete response only after all of the output tokens are generated.

For a streaming response, use the StreamGenerateContent method.

  public virtual PredictionServiceClient.StreamGenerateContentStream StreamGenerateContent(GenerateContentRequest request)
  

For a non-streaming response, use the GenerateContentAsync method.

  public virtual Task<GenerateContentResponse> GenerateContentAsync(GenerateContentRequest request)
  

For more information on how the server can stream responses, see Streaming RPCs.

Sample code


using Google.Cloud.AIPlatform.V1;
using System;
using System.Threading.Tasks;

public class AudioInputSummarization
{
    public async Task<string> SummarizeAudio(
        string projectId = "your-project-id",
        string location = "us-central1",
        string publisher = "google",
        string model = "gemini-1.5-flash-001")
    {
        var predictionServiceClient = new PredictionServiceClientBuilder
        {
            Endpoint = $"{location}-aiplatform.googleapis.com"
        }.Build();

        string prompt = @"Please provide a summary for the audio.
Provide chapter titles with timestamps, be concise and short, no need to provide chapter summaries.
Do not make up any information that is not part of the audio and do not be verbose.";

        var generateContentRequest = new GenerateContentRequest
        {
            Model = $"projects/{projectId}/locations/{location}/publishers/{publisher}/models/{model}",
            Contents =
            {
                new Content
                {
                    Role = "USER",
                    Parts =
                    {
                        new Part { Text = prompt },
                        new Part { FileData = new() { MimeType = "audio/mp3", FileUri = "gs://cloud-samples-data/generative-ai/audio/pixel.mp3" } }
                    }
                }
            }
        };

        GenerateContentResponse response = await predictionServiceClient.GenerateContentAsync(generateContentRequest);

        string responseText = response.Candidates[0].Content.Parts[0].Text;
        Console.WriteLine(responseText);

        return responseText;
    }
}

REST

After you set up your environment, you can use REST to test a prompt that includes audio. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • LOCATION: The region to process the request. Enter a supported region. For the full list of supported regions, see Available locations.

    A partial list of available regions:

    • us-central1
    • us-west4
    • northamerica-northeast1
    • us-east4
    • us-west1
    • asia-northeast3
    • asia-southeast1
    • asia-northeast1
  • PROJECT_ID: Your project ID.
  • FILE_URI: The Cloud Storage URI of the file to include in the prompt. The bucket object must either be publicly readable or reside in the same Google Cloud project that's sending the request. You must also specify the media type (mimeType) of the file.

    If you don't have an audio file in Cloud Storage, then you can use the following publicly available file: gs://cloud-samples-data/generative-ai/audio/pixel.mp3 with a MIME type of audio/mp3. To listen to this audio, open the sample MP3 file.

  • MIME_TYPE: The media type of the file specified in the data or fileUri fields. Acceptable values include the following:

    • application/pdf
    • audio/mpeg
    • audio/mp3
    • audio/wav
    • image/png
    • image/jpeg
    • text/plain
    • video/mov
    • video/mpeg
    • video/mp4
    • video/mpg
    • video/avi
    • video/wmv
    • video/mpegps
    • video/flv
  • TEXT: The text instructions to include in the prompt. For example, Please provide a summary for the audio. Provide chapter titles, be concise and short, no need to provide chapter summaries. Do not make up any information that is not part of the audio and do not be verbose.
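The placeholders above can also be filled in programmatically. The following Python sketch builds a request body and endpoint URL in the same shape as the curl and PowerShell commands on this page. The helper names are illustrative, and the extension-to-MIME mapping is an assumption that covers only two of the audio types listed above.

```python
import json
import posixpath

# Assumed mapping from file extension to MIME type, limited to audio types from the list above.
AUDIO_MIME_TYPES = {".mp3": "audio/mpeg", ".wav": "audio/wav"}


def build_request_body(file_uri: str, text: str) -> dict:
    """Build a request body with one audio part (FILE_URI, MIME_TYPE) and one text part (TEXT)."""
    if not file_uri.startswith("gs://"):
        raise ValueError("FILE_URI must be a Cloud Storage URI that starts with gs://")
    extension = posixpath.splitext(file_uri)[1].lower()
    if extension not in AUDIO_MIME_TYPES:
        raise ValueError(f"No audio MIME type known for extension {extension!r}")
    return {
        "contents": {
            "role": "USER",
            "parts": [
                {"fileData": {"fileUri": file_uri, "mimeType": AUDIO_MIME_TYPES[extension]}},
                {"text": text},
            ],
        }
    }


def build_endpoint_url(
    location: str,
    project_id: str,
    model: str = "gemini-1.5-flash",
    method: str = "generateContent",
) -> str:
    """Fill LOCATION and PROJECT_ID into the publisher model endpoint URL."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location}/publishers/google/models/{model}:{method}"
    )


body = build_request_body(
    "gs://cloud-samples-data/generative-ai/audio/pixel.mp3",
    "Please provide a summary for the audio.",
)
print(json.dumps(body, indent=2))
print(build_endpoint_url("us-central1", "PROJECT_ID"))
```

Writing the printed JSON to request.json gives a file that the commands in the next steps can send as-is.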

To send your request, choose one of these options:

curl

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

cat > request.json << 'EOF'
{
  "contents": {
    "role": "USER",
    "parts": [
      {
        "fileData": {
          "fileUri": "FILE_URI",
          "mimeType": "MIME_TYPE"
        }
      },
      {
        "text": "TEXT"
      }
    ]
  }
}
EOF

Then execute the following command to send your REST request:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/gemini-1.5-flash:generateContent"

PowerShell

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
  "contents": {
    "role": "USER",
    "parts": [
      {
        "fileData": {
          "fileUri": "FILE_URI",
          "mimeType": "MIME_TYPE"
        }
      },
      {
        "text": "TEXT"
      }
    ]
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/gemini-1.5-flash:generateContent" | Select-Object -Expand Content

If the request succeeds, you receive a JSON response that contains the model's output.

Note the following in the URL for this sample:
  • Use the generateContent method to request that the response is returned after it's fully generated. To reduce the perception of latency to a human audience, stream the response as it's being generated by using the streamGenerateContent method.
  • The multimodal model ID is located at the end of the URL before the method (for example, gemini-1.5-flash or gemini-1.0-pro-vision). This sample may support other models as well.
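Once the response arrives, the generated text still has to be pulled out of the JSON. The sketch below assumes that the response uses the camelCase counterparts of the fields that the C# sample reads (Candidates[0].Content.Parts[0].Text). The sample payload is hypothetical and heavily trimmed; a real response carries additional fields.

```python
import json


def extract_text(response: dict) -> str:
    """Concatenate the text parts of the first candidate in a generateContent response."""
    candidates = response.get("candidates", [])
    if not candidates:
        raise ValueError("response contains no candidates")
    parts = candidates[0].get("content", {}).get("parts", [])
    return "".join(part.get("text", "") for part in parts)


# Hypothetical, trimmed response used only to exercise the function.
sample_response = json.loads(
    '{"candidates": [{"content": {"role": "model", '
    '"parts": [{"text": "Chapter 1: "}, {"text": "Introduction"}]}}]}'
)
summary = extract_text(sample_response)
print(summary)
```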

Console

To send a multimodal prompt by using the Google Cloud console, do the following:

  1. In the Vertex AI section of the Google Cloud console, go to the Vertex AI Studio page.

    Go to Vertex AI Studio

  2. Under Prompt design (single turn), click Open.
  3. Configure the model and parameters:

    • Model: Select a model.
    • Region: Select the region that you want to use.
    • Temperature: Use the slider or textbox to enter a value for temperature.

      The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.

      If the model returns a response that's too generic, too short, or the model gives a fallback response, try increasing the temperature.

    • Token limit: Use the slider or textbox to enter a value for the max output limit.

      Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

      Specify a lower value for shorter responses and a higher value for potentially longer responses.

    • Add stop sequence (optional): Enter a stop sequence, which is a series of characters (including spaces) that stops response generation if the model encounters it. The sequence is not included as part of the response. You can add up to five stop sequences.
  4. Optional: To configure advanced parameters, click Advanced and configure as follows:

    • Top-K: Use the slider or textbox to enter a value for top-K (not supported for Gemini 1.5).

      Top-K changes how the model selects tokens for output. A top-K of 1 means the next selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding), while a top-K of 3 means that the next token is selected from among the three most probable tokens by using temperature.

      For each token selection step, the top-K tokens with the highest probabilities are sampled. Then tokens are further filtered based on top-P with the final token selected using temperature sampling.

      Specify a lower value for less random responses and a higher value for more random responses.

    • Top-P: Use the slider or textbox to enter a value for top-P. Tokens are selected from most probable to the least until the sum of their probabilities equals the value of top-P. For the least variable results, set top-P to 0.
    • Enable Grounding: Grounding isn't supported for multimodal prompts.
  5. To upload media, such as MP3 and WAV files, do the following:
    1. Click Insert Media, and select a source. If you choose Google Drive as your source, you must choose an account and give consent to Vertex AI Studio to access your account the first time you select this option. You can upload multiple files that have a total size of up to 10 MB. A single file can't exceed 7 MB.
    2. Click the file that you want to add.
    3. Click Select. The file thumbnail displays in the Prompt pane.
  6. Enter your text prompt in the Prompt pane.
  7. Click Submit, and the response is generated.
  8. Optional: To save your prompt to My prompts, click Save.
  9. Optional: To get the Python code or a curl command for your prompt, click Get code.
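The temperature, top-K, and top-P descriptions in the steps above can be illustrated on a toy distribution. The following Python sketch applies them in the order described (top-K, then top-P, then temperature sampling). It illustrates the idea only and is not Vertex AI's internal implementation. The way temperature is applied here (raising each surviving probability to the power 1/temperature) is an assumption, and the function names and token scores are made up.

```python
import math
import random


def candidate_tokens(logits, top_k=None, top_p=1.0):
    """Return (token, probability) pairs that survive top-K and then top-P filtering."""
    # Softmax over the raw scores, shifted by the maximum for numerical stability.
    peak = max(logits.values())
    weights = {token: math.exp(score - peak) for token, score in logits.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((token, weight / total) for token, weight in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Top-K: keep only the K most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-P: keep tokens, most probable first, until their probabilities sum to top_p.
    kept, cumulative = [], 0.0
    for token, probability in ranked:
        kept.append((token, probability))
        cumulative += probability
        if cumulative >= top_p:
            break
    return kept


def sample_token(logits, temperature=1.0, top_k=None, top_p=1.0, rng=random):
    """Pick one token from the filtered candidates by using temperature sampling."""
    kept = candidate_tokens(logits, top_k=top_k, top_p=top_p)
    if temperature == 0 or len(kept) == 1:
        # Temperature 0 always selects the highest-probability token.
        return kept[0][0]
    tokens = [token for token, _ in kept]
    # Lower temperatures sharpen the distribution; higher temperatures flatten it.
    weights = [probability ** (1.0 / temperature) for _, probability in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = {"the": 2.0, "a": 1.0, "this": 0.0}  # made-up scores
print(candidate_tokens(toy_logits, top_k=2))
print(sample_token(toy_logits, temperature=0))
```

In this sketch, a top-K of 1, a top-P of 0, and a temperature of 0 all reduce to picking the single most probable token, which matches the "least variable" settings described above.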

Audio transcription

The following shows you how to use an audio file to transcribe an interview.

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Vertex AI SDK for Python API reference documentation.

Streaming and non-streaming responses

You can choose whether the model generates a streaming response or a non-streaming response. With a streaming response, you receive each chunk of the response as soon as its output tokens are generated. With a non-streaming response, you receive the complete response only after all of the output tokens are generated.

For a streaming response, use the stream parameter in generate_content.

  response = model.generate_content(contents=[...], stream=True)
  

For a non-streaming response, remove the parameter, or set the parameter to False.

Sample code


  import vertexai
  from vertexai.generative_models import GenerativeModel, Part

  # TODO(developer): Replace with your Google Cloud project ID.
  project_id = "PROJECT_ID"

  vertexai.init(project=project_id, location="us-central1")

  model = GenerativeModel(model_name="gemini-1.5-flash-001")

  prompt = """
  Can you transcribe this interview, in the format of timecode, speaker, caption.
  Use speaker A, speaker B, etc. to identify speakers.
"""

  audio_file_uri = "gs://cloud-samples-data/generative-ai/audio/pixel.mp3"
  audio_file = Part.from_uri(audio_file_uri, mime_type="audio/mpeg")

  contents = [audio_file, prompt]

  response = model.generate_content(contents)
  print(response.text)
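The prompt asks for a "timecode, speaker, caption" layout, so the returned text can often be split into structured records. The sketch below assumes lines shaped like "[00:05] Speaker A: Hello"; this shape is a guess, because the model's exact output format isn't guaranteed. Lines that don't match are skipped. The names Caption and parse_transcript are illustrative.

```python
import re
from typing import List, NamedTuple


class Caption(NamedTuple):
    timecode: str
    speaker: str
    text: str


# Assumed line shape: "[00:05] Speaker A: Hello" (brackets optional, MM:SS or HH:MM:SS).
LINE_PATTERN = re.compile(
    r"^\[?(?P<timecode>\d{1,2}:\d{2}(?::\d{2})?)\]?\s*[-,]?\s*"
    r"(?P<speaker>Speaker [A-Z])\s*[:,-]\s*(?P<text>.+)$",
    re.IGNORECASE,
)


def parse_transcript(transcript: str) -> List[Caption]:
    """Split model output into Caption records, ignoring lines that don't match."""
    captions = []
    for line in transcript.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            captions.append(
                Caption(match["timecode"], match["speaker"], match["text"].strip())
            )
    return captions


# Hypothetical model output used only to exercise the parser.
example_output = """[00:00] Speaker A: Welcome to the show.
[00:05] Speaker B: Thanks for having me.
(music plays)"""

for caption in parse_transcript(example_output):
    print(caption.timecode, caption.speaker, caption.text, sep=" | ")
```

If the model returns a different layout, adjust LINE_PATTERN or ask for a stricter format in the prompt.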

Java

Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart. For more information, see the Vertex AI Java SDK for Gemini reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Streaming and non-streaming responses

You can choose whether the model generates a streaming response or a non-streaming response. With a streaming response, you receive each chunk of the response as soon as its output tokens are generated. With a non-streaming response, you receive the complete response only after all of the output tokens are generated.

For a streaming response, use the generateContentStream method.

  public ResponseStream generateContentStream(Content content)
  

For a non-streaming response, use the generateContent method.

  public GenerateContentResponse generateContent(Content content)
  

Sample code

import com.google.cloud.vertexai.VertexAI;
import com.google.cloud.vertexai.api.GenerateContentResponse;
import com.google.cloud.vertexai.generativeai.ContentMaker;
import com.google.cloud.vertexai.generativeai.GenerativeModel;
import com.google.cloud.vertexai.generativeai.PartMaker;
import com.google.cloud.vertexai.generativeai.ResponseHandler;
import java.io.IOException;

public class AudioInputTranscription {

  public static void main(String[] args) throws IOException {
    // TODO(developer): Replace these variables before running the sample.
    String projectId = "your-google-cloud-project-id";
    String location = "us-central1";
    String modelName = "gemini-1.5-flash-001";

    transcribeAudio(projectId, location, modelName);
  }

  // Analyzes the given audio input.
  public static String transcribeAudio(String projectId, String location, String modelName)
      throws IOException {
    // Initialize client that will be used to send requests. This client only needs
    // to be created once, and can be reused for multiple requests.
    try (VertexAI vertexAI = new VertexAI(projectId, location)) {
      String audioUri = "gs://cloud-samples-data/generative-ai/audio/pixel.mp3";

      GenerativeModel model = new GenerativeModel(modelName, vertexAI);
      GenerateContentResponse response = model.generateContent(
          ContentMaker.fromMultiModalData(
              "Can you transcribe this interview, in the format of timecode, speaker, caption.\n"
                  + "Use speaker A, speaker B, etc. to identify speakers.",
              PartMaker.fromMimeTypeAndData("audio/mp3", audioUri)
          ));

      String output = ResponseHandler.getText(response);
      System.out.println(output);

      return output;
    }
  }
}

Node.js

Before trying this sample, follow the Node.js setup instructions in the Generative AI quickstart using the Node.js SDK. For more information, see the Node.js SDK for Gemini reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Streaming and non-streaming responses

You can choose whether the model generates a streaming response or a non-streaming response. With a streaming response, you receive each chunk of the response as soon as its output tokens are generated. With a non-streaming response, you receive the complete response only after all of the output tokens are generated.

For a streaming response, use the generateContentStream method.

  const streamingResp = await generativeModel.generateContentStream(request);
  

For a non-streaming response, use the generateContent method.

  const resp = await generativeModel.generateContent(request);
  

Sample code

const {VertexAI} = require('@google-cloud/vertexai');

/**
 * TODO(developer): Update these variables before running the sample.
 */
async function transcript_audio(projectId = 'PROJECT_ID') {
  const vertexAI = new VertexAI({project: projectId, location: 'us-central1'});

  const generativeModel = vertexAI.getGenerativeModel({
    model: 'gemini-1.5-flash-001',
  });

  const filePart = {
    file_data: {
      file_uri: 'gs://cloud-samples-data/generative-ai/audio/pixel.mp3',
      mime_type: 'audio/mpeg',
    },
  };
  const textPart = {
    text: `
    Can you transcribe this interview, in the format of timecode, speaker, caption?
    Use speaker A, speaker B, etc. to identify speakers.`,
  };

  const request = {
    contents: [{role: 'user', parts: [filePart, textPart]}],
  };

  const resp = await generativeModel.generateContent(request);
  const contentResponse = await resp.response;
  console.log(JSON.stringify(contentResponse));
}

Go

Before trying this sample, follow the Go setup instructions in the Vertex AI quickstart. For more information, see the Vertex AI Go SDK for Gemini reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Streaming and non-streaming responses

You can choose whether the model generates a streaming response or a non-streaming response. With a streaming response, you receive each chunk of the response as soon as its output tokens are generated. With a non-streaming response, you receive the complete response only after all of the output tokens are generated.

For a streaming response, use the GenerateContentStream method.

  iter := model.GenerateContentStream(ctx, genai.Text("Tell me a story about a lumberjack and his giant ox. Keep it very short."))
  

For a non-streaming response, use the GenerateContent method.

  resp, err := model.GenerateContent(ctx, genai.Text("What is the average size of a swallow?"))
  

Sample code

import (
	"context"
	"errors"
	"fmt"
	"io"
	"mime"
	"path/filepath"

	"cloud.google.com/go/vertexai/genai"
)

// audioPrompt is a sample prompt type consisting of one audio asset, and a text question.
type audioPrompt struct {
	// audio is a Google Cloud Storage path starting with "gs://"
	audio string
	// question asked to the model
	question string
}

// transcribeAudio generates a response into w, based upon the prompt
// and audio provided.
// audio is a Google Cloud Storage path starting with "gs://"
func transcribeAudio(w io.Writer, prompt audioPrompt, projectID, location, modelName string) error {
	// prompt := audioPrompt{
	// 	audio: "gs://cloud-samples-data/generative-ai/audio/pixel.mp3",
	// 	question: `
	// 		Can you transcribe this interview, in the format of timecode, speaker, caption.
	// 		Use speaker A, speaker B, etc. to identify speakers.
	// 	`,
	// }
	// location := "us-central1"
	// modelName := "gemini-1.5-flash-001"
	ctx := context.Background()

	client, err := genai.NewClient(ctx, projectID, location)
	if err != nil {
		return fmt.Errorf("unable to create client: %w", err)
	}
	defer client.Close()

	model := client.GenerativeModel(modelName)

	// Optional: set an explicit temperature
	model.SetTemperature(0.4)

	// Given an audio file URI, prepare the audio file as a genai.Part
	part := genai.FileData{
		MIMEType: mime.TypeByExtension(filepath.Ext(prompt.audio)),
		FileURI:  prompt.audio,
	}

	res, err := model.GenerateContent(ctx, part, genai.Text(prompt.question))
	if err != nil {
		return fmt.Errorf("unable to generate contents: %w", err)
	}

	if len(res.Candidates) == 0 ||
		len(res.Candidates[0].Content.Parts) == 0 {
		return errors.New("empty response from model")
	}

	fmt.Fprintf(w, "generated transcript:\n%s\n", res.Candidates[0].Content.Parts[0])
	return nil
}

C#

Before trying this sample, follow the C# setup instructions in the Vertex AI quickstart. For more information, see the Vertex AI C# reference documentation.

To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

Streaming and non-streaming responses

You can choose whether the model generates a streaming response or a non-streaming response. With a streaming response, you receive each chunk of the response as soon as its output tokens are generated. With a non-streaming response, you receive the complete response only after all of the output tokens are generated.

For a streaming response, use the StreamGenerateContent method.

  public virtual PredictionServiceClient.StreamGenerateContentStream StreamGenerateContent(GenerateContentRequest request)
  

For a non-streaming response, use the GenerateContentAsync method.

  public virtual Task<GenerateContentResponse> GenerateContentAsync(GenerateContentRequest request)
  

For more information on how the server can stream responses, see Streaming RPCs.

Sample code


using Google.Cloud.AIPlatform.V1;
using System;
using System.Threading.Tasks;

public class AudioInputTranscription
{
    public async Task<string> TranscribeAudio(
        string projectId = "your-project-id",
        string location = "us-central1",
        string publisher = "google",
        string model = "gemini-1.5-flash-001")
    {

        var predictionServiceClient = new PredictionServiceClientBuilder
        {
            Endpoint = $"{location}-aiplatform.googleapis.com"
        }.Build();

        string prompt = @"Can you transcribe this interview, in the format of timecode, speaker, caption.
Use speaker A, speaker B, etc. to identify speakers.";

        var generateContentRequest = new GenerateContentRequest
        {
            Model = $"projects/{projectId}/locations/{location}/publishers/{publisher}/models/{model}",
            Contents =
            {
                new Content
                {
                    Role = "USER",
                    Parts =
                    {
                        new Part { Text = prompt },
                        new Part { FileData = new() { MimeType = "audio/mp3", FileUri = "gs://cloud-samples-data/generative-ai/audio/pixel.mp3" } }
                    }
                }
            }
        };

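        GenerateContentResponse response = await predictionServiceClient.GenerateContentAsync(generateContentRequest);

        // Fail with a clear error instead of an index-out-of-range exception
        // when the model returns no candidates or no content parts.
        if (response.Candidates.Count == 0 || response.Candidates[0].Content == null
            || response.Candidates[0].Content.Parts.Count == 0)
        {
            throw new InvalidOperationException("Empty response from model.");
        }

        string responseText = response.Candidates[0].Content.Parts[0].Text;
        Console.WriteLine(responseText);

        return responseText;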
    }
}

REST

After you set up your environment, you can use REST to test a multimodal prompt that includes audio. The following sample sends a request to the publisher model endpoint.

Before using any of the request data, make the following replacements:

  • LOCATION: The region to process the request. Enter a supported region. For the full list of supported regions, see Available locations.

    Partial list of available regions:

    • us-central1
    • us-west4
    • northamerica-northeast1
    • us-east4
    • us-west1
    • asia-northeast3
    • asia-southeast1
    • asia-northeast1
  • PROJECT_ID: Your project ID.
  • FILE_URI: The Cloud Storage URI of the file to include in the prompt. The bucket object must either be publicly readable or reside in the same Google Cloud project that's sending the request. You must also specify the media type (mimeType) of the file.

    If you don't have an audio file in Cloud Storage, then you can use the following publicly available file: gs://cloud-samples-data/generative-ai/audio/pixel.mp3 with a MIME type of audio/mp3. To listen to this audio, open the sample MP3 file.

  • MIME_TYPE: The media type of the file specified in the data or fileUri fields. Acceptable values include the following:

    • application/pdf
    • audio/mpeg
    • audio/mp3
    • audio/wav
    • image/png
    • image/jpeg
    • text/plain
    • video/mov
    • video/mpeg
    • video/mp4
    • video/mpg
    • video/avi
    • video/wmv
    • video/mpegps
    • video/flv
  • TEXT: The text instructions to include in the prompt. For example, Can you transcribe this interview, in the format of timecode, speaker, caption. Use speaker A, speaker B, etc. to identify speakers.

To send your request, choose one of these options:

curl

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

cat > request.json << 'EOF'
{
  "contents": {
    "role": "USER",
    "parts": [
      {
        "fileData": {
          "fileUri": "FILE_URI",
          "mimeType": "MIME_TYPE"
        }
      },
      {
        "text": "TEXT"
      }
    ]
  }
}
EOF

Then execute the following command to send your REST request:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/gemini-1.5-flash:generateContent"

PowerShell

Save the request body in a file named request.json. Run the following command in the terminal to create or overwrite this file in the current directory:

@'
{
  "contents": {
    "role": "USER",
    "parts": [
      {
        "fileData": {
          "fileUri": "FILE_URI",
          "mimeType": "MIME_TYPE"
        }
      },
      {
        "text": "TEXT"
      }
    ]
  }
}
'@  | Out-File -FilePath request.json -Encoding utf8

Then execute the following command to send your REST request:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/google/models/gemini-1.5-flash:generateContent" | Select-Object -Expand Content

If the request succeeds, you receive a JSON response that contains the generated text.

Note the following in the URL for this sample:
  • Use the generateContent method to request that the response is returned after it's fully generated. To reduce the perception of latency to a human audience, stream the response as it's being generated by using the streamGenerateContent method.
  • The multimodal model ID is located at the end of the URL before the method (for example, gemini-1.5-flash or gemini-1.0-pro-vision). This sample may support other models as well.
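
As a minimal sketch, the endpoint URL and request body used in the curl and PowerShell examples can also be assembled in Python with only the standard library. The helper name build_generate_content_request and its parameters are hypothetical, introduced here for illustration. The sketch only builds the URL and JSON payload; it doesn't send the request or handle authentication.

```python
import json
from typing import Tuple


def build_generate_content_request(
    project_id: str,
    location: str,
    file_uri: str,
    mime_type: str,
    text: str,
    model_id: str = "gemini-1.5-flash",
    stream: bool = False,
) -> Tuple[str, str]:
    """Return (url, json_body) for a generateContent or streamGenerateContent call."""
    # The method name at the end of the URL selects non-streaming or streaming output.
    method = "streamGenerateContent" if stream else "generateContent"
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location}/publishers/google/models/{model_id}:{method}"
    )
    # Same shape as request.json: one user turn with the audio file and the text prompt.
    body = {
        "contents": {
            "role": "USER",
            "parts": [
                {"fileData": {"fileUri": file_uri, "mimeType": mime_type}},
                {"text": text},
            ],
        }
    }
    return url, json.dumps(body, indent=2)


url, body = build_generate_content_request(
    project_id="my-project",  # placeholder, replace with your project ID
    location="us-central1",
    file_uri="gs://cloud-samples-data/generative-ai/audio/pixel.mp3",
    mime_type="audio/mp3",
    text="Summarize this audio.",
)
print(url)
print(body)
```

Passing stream=True changes only the method name at the end of the URL, which mirrors the note about generateContent and streamGenerateContent.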

Console

To send a multimodal prompt by using the Google Cloud console, do the following:

  1. In the Vertex AI section of the Google Cloud console, go to the Vertex AI Studio page.

    Go to Vertex AI Studio

  2. Under Prompt design (single turn), click Open.
  3. Configure the model and parameters:

    • Model: Select a model.
    • Region: Select the region that you want to use.
    • Temperature: Use the slider or textbox to enter a value for temperature.

      The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.

      If the model returns a response that's too generic, too short, or the model gives a fallback response, try increasing the temperature.

    • Token limit: Use the slider or textbox to enter a value for the max output limit.

      Maximum number of tokens that can be generated in the response. A token is approximately four characters. 100 tokens correspond to roughly 60-80 words.

      Specify a lower value for shorter responses and a higher value for potentially longer responses.

    • Add stop sequence (optional): Enter a stop sequence, which is a series of characters (including spaces) that stops response generation if the model encounters it. The sequence is not included as part of the response. You can add up to five stop sequences.
  4. Optional: To configure advanced parameters, click Advanced and configure as follows:

    • Top-K: Use the slider or textbox to enter a value for top-K (not supported for Gemini 1.5).

      Top-K changes how the model selects tokens for output. A top-K of 1 means the next selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding), while a top-K of 3 means that the next token is selected from among the three most probable tokens by using temperature.

      For each token selection step, the top-K tokens with the highest probabilities are sampled. Then tokens are further filtered based on top-P with the final token selected using temperature sampling.

      Specify a lower value for less random responses and a higher value for more random responses.

    • Top-P: Use the slider or textbox to enter a value for top-P. Tokens are selected from most probable to the least until the sum of their probabilities equals the value of top-P. For the least variable results, set top-P to 0.
    • Enable Grounding: Grounding isn't supported for multimodal prompts.
  5. To upload media, such as MP3 and WAV files, do the following:
    1. Click Insert Media, and select a source. If you choose Google Drive as your source, you must choose an account and give consent to Vertex AI Studio to access your account the first time you select this option. You can upload multiple images that have a total size of up to 10 MB. A single file can't exceed 7 MB.
    2. Click the file that you want to add.
    3. Click Select. The file thumbnail displays in the Prompt pane.
  6. Enter your text prompt in the Prompt pane.
  7. Click Submit, and the response is generated.
  8. Optional: To save your prompt to My prompts, click Save.
  9. Optional: To get the Python code or a curl command for your prompt, click Get code.
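
The token-limit guidance in the steps above (a token is approximately four characters, and 100 tokens correspond to roughly 60-80 words) can be turned into a quick estimate when you choose a value. The following is a rough sketch only: the function names are hypothetical, and real token counts depend on the model's tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (about four characters per token)."""
    return round(len(text) / chars_per_token)


def estimate_word_range(token_limit: int) -> tuple:
    """Approximate (min, max) word count for a token limit, at 60-80 words per 100 tokens."""
    return (token_limit * 60 // 100, token_limit * 80 // 100)


sample = "x" * 400  # stand-in for 400 characters of text
print(estimate_tokens(sample))
print(estimate_word_range(256))
```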

Set model parameters

The following model parameters can be set on multimodal models:

Top-P

Top-P changes how the model selects tokens for output. Tokens are selected from the most (see top-K) to least probable until the sum of their probabilities equals the top-P value. For example, if tokens A, B, and C have a probability of 0.3, 0.2, and 0.1 and the top-P value is 0.5, then the model will select either A or B as the next token by using temperature and excludes C as a candidate.

Specify a lower value for less random responses and a higher value for more random responses.
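
The A, B, and C example above can be reproduced with a short sketch. This illustrates the general top-P (nucleus) filtering idea under the assumption that tokens are accumulated from most to least probable. It isn't a description of how Gemini implements sampling, and the function name top_p_candidates is hypothetical.

```python
def top_p_candidates(probs: dict, top_p: float) -> list:
    """Return the tokens kept by top-P filtering, most probable first."""
    kept = []
    cumulative = 0.0
    # Walk the tokens from most to least probable.
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(token)
        cumulative += p
        # Stop once the kept tokens cover the top-P mass.
        # The small epsilon absorbs floating-point rounding.
        if cumulative >= top_p - 1e-9:
            break
    return kept


probs = {"A": 0.3, "B": 0.2, "C": 0.1}
print(top_p_candidates(probs, top_p=0.5))  # ['A', 'B']
```

The final token would then be drawn from the kept candidates by using temperature sampling. With top_p=0, only the single most probable token remains.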

Temperature

The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Temperature controls the degree of randomness in token selection. Lower temperatures are good for prompts that require a less open-ended or creative response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 means that the highest probability tokens are always selected. In this case, responses for a given prompt are mostly deterministic, but a small amount of variation is still possible.

If the model returns a response that's too generic, too short, or the model gives a fallback response, try increasing the temperature.
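
To make the effect of temperature concrete, the following sketch applies the standard softmax-with-temperature technique to a few made-up scores. The logits are hypothetical, and this is a generic illustration rather than Gemini's actual sampling code.

```python
import math


def apply_temperature(logits: list, temperature: float) -> list:
    """Convert logits to probabilities, scaled by temperature."""
    if temperature == 0:
        # Temperature 0: all probability goes to the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, and higher temperatures spread it more evenly across candidates.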

Valid parameter values

Parameter     Gemini 1.5 Pro           Gemini 1.5 Flash
Top-P         0 - 1.0 (default 0.95)   0 - 1.0 (default 0.95)
Temperature   0 - 2.0 (default 1.0)    0 - 2.0 (default 1.0)
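
A client can check these ranges before it sends a request. The sketch below hard-codes the ranges and defaults listed above. The names PARAMETER_RANGES and validate_parameters are hypothetical and aren't part of any SDK.

```python
# Ranges copied from the values listed above: (minimum, maximum, default).
PARAMETER_RANGES = {
    "top_p": (0.0, 1.0, 0.95),
    "temperature": (0.0, 2.0, 1.0),
}


def validate_parameters(**params: float) -> dict:
    """Fill in defaults and raise ValueError for unknown or out-of-range values."""
    unknown = set(params) - set(PARAMETER_RANGES)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    result = {}
    for name, (low, high, default) in PARAMETER_RANGES.items():
        value = params.get(name, default)
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        result[name] = value
    return result


print(validate_parameters(temperature=0.2))
```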

Audio requirements

Gemini 1.5 Flash and Gemini 1.5 Pro support the following audio MIME types.

Format - MIME type (supported by Gemini 1.5 Flash and Gemini 1.5 Pro)
AAC - audio/aac
FLAC - audio/flac
MP3 - audio/mp3
MPA - audio/m4a
MPEG - audio/mpeg
MPGA - audio/mpga
MP4 - audio/mp4
OPUS - audio/opus
PCM - audio/pcm
WAV - audio/wav
WEBM - audio/webm
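
Before you send a request, it can help to confirm that the mimeType value is one of the types listed above. The following is a minimal sketch. The set is copied from this list, and the function name check_audio_mime_type is hypothetical.

```python
# MIME types listed above for Gemini 1.5 Flash and Gemini 1.5 Pro.
SUPPORTED_AUDIO_MIME_TYPES = frozenset({
    "audio/aac", "audio/flac", "audio/mp3", "audio/m4a", "audio/mpeg",
    "audio/mpga", "audio/mp4", "audio/opus", "audio/pcm", "audio/wav",
    "audio/webm",
})


def check_audio_mime_type(mime_type: str) -> str:
    """Return the normalized MIME type, or raise ValueError if it isn't listed."""
    normalized = mime_type.strip().lower()
    if normalized not in SUPPORTED_AUDIO_MIME_TYPES:
        raise ValueError(f"unsupported audio MIME type: {mime_type!r}")
    return normalized


print(check_audio_mime_type("audio/MP3"))
```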

Limitations

While Gemini multimodal models are powerful in many multimodal use cases, it's important to understand the limitations of the models:

  • Non-speech sound recognition: The models that support audio might make mistakes recognizing sound that's not speech.
  • Audio-only timestamps: The models that support audio can't accurately generate timestamps for requests with audio files. This includes segmentation and temporal localization timestamps. Timestamps can be generated accurately for input that includes a video that contains audio.
  • Transcription punctuation: Transcriptions returned by Gemini 1.5 Flash might not include punctuation.

What's next