Use reservations with training

To ensure that VM resources are available when your custom training jobs need them, you can use Compute Engine reservations. Reservations provide a high level of assurance in obtaining capacity for Compute Engine resources. To learn more, see Reservations of Compute Engine zonal resources.

Overview

Your Vertex AI custom training and prediction jobs can consume Compute Engine reservations. Your reservation must specify an A2 or A3 machine type. If resources from those reservations are eligible for any committed use discounts (CUDs), then, when your VMs consume those reservations, you get those resources at the discounted prices. See CUDs for your reserved resources.

Limitations and requirements

Consider the following limitations and requirements when using Compute Engine reservations with Vertex AI:

  • Vertex AI can only consume reservations with the following machine series:

    • A2
    • A3
  • Using Compute Engine reservations with Vertex AI is only supported for custom training and prediction.
  • Ensure that sufficient quota is available for your Vertex AI jobs. See Additional quota requirements for shared reservations.
  • To support regular updates of your Vertex AI deployments, we recommend increasing your VM count by at least 1 VM for each concurrent deployment.
  • Ensure that your organization policy constraints allow shared reservations. See Allow and restrict projects from creating and modifying shared reservations.
  • To use a reservation, its VM instance properties must exactly match those of your Vertex AI workload. For example, if a Vertex AI workload uses the a2-megagpu-16g machine type, then the reservation must also specify a2-megagpu-16g. See Requirements.
  • The following services and capabilities aren't supported when using Compute Engine reservations with Vertex AI training:

  • Your custom training job must use a custom service account. See Use a custom service account.

Billing

When you use Compute Engine reservations, you're billed for the following:

  • Compute Engine pricing for the Compute Engine resources, including any applicable committed use discounts (CUDs). See Compute Engine pricing.
  • Vertex AI custom training management fees in addition to your infrastructure usage. See Custom-trained models pricing.

Before you begin

Allow a reservation to be consumed

Before consuming a reservation of A2 or A3 VMs, you must set its sharing policy to allow Vertex AI to consume the reservation. To do so, use one of the following methods:

Allow consumption while creating a reservation

While creating a single-project or shared reservation of A2 or A3 VMs, you can allow Vertex AI to consume the reservation as follows:

  • If you're using the Google Cloud console, then, in the Google Cloud services section, select Share reservation.
  • If you're using the Google Cloud CLI, then include the --reservation-sharing-policy flag set to ALLOW_ALL.
  • If you're using the REST API, then include the serviceShareType field set to ALLOW_ALL.
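
For the REST API option, the following Python sketch shows one way the sharing setting might be added to a reservation request body. The helper name with_vertex_ai_sharing and the example reservation values are hypothetical, and nesting serviceShareType under a reservationSharingPolicy object is an assumption about the Compute Engine API that should be checked against its reference before use.

```python
def with_vertex_ai_sharing(reservation_body):
    """Return a copy of a reservation request body that allows Vertex AI to consume it.

    Assumption: serviceShareType is nested under "reservationSharingPolicy".
    """
    updated = dict(reservation_body)
    updated["reservationSharingPolicy"] = {"serviceShareType": "ALLOW_ALL"}
    return updated


# Hypothetical reservation body; only the sharing policy comes from this page.
example_reservation = {
    "name": "example-reservation",
    "specificReservation": {
        "count": 2,
        "instanceProperties": {"machineType": "a2-megagpu-16g"},
    },
}
shared_reservation = with_vertex_ai_sharing(example_reservation)
```

The helper returns a new top-level dictionary, so the original body is left without the sharing policy.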

Allow consumption in an existing reservation

To allow Vertex AI to consume an existing reservation of A2 or A3 VMs, see Modify the sharing policy of a reservation.

Create a custom training job with a reservation

You can create a custom training job that consumes a Compute Engine reservation by using the REST API.

REST

Before using any of the request data, make the following replacements:

  • LOCATION: The region where the container or Python package will be run.
  • PROJECT_ID: Your project ID.
  • JOB_NAME: Required. A display name for the CustomJob.
  • Define the custom training job:
    • MACHINE_TYPE: The type of the machine. Refer to available machine types for training.
    • RESERVATION_AFFINITY_TYPE: Must be ANY, SPECIFIC_RESERVATION, or NONE.

      • ANY means that the VMs of your customJob can automatically consume any reservation with matching properties.
      • SPECIFIC_RESERVATION means that the VMs of your customJob can consume only a reservation that the VMs specifically target by name.
      • NONE means that the VMs of your customJob can't consume any reservation. Specifying NONE has the same effect as omitting a reservation affinity specification.
    • RESERVATION_NAME: The name of your reservation.
    • DISK_TYPE: Optional. The type of the boot disk to use for the job, either pd-standard (default) or pd-ssd. Learn more about disk types.
    • DISK_SIZE: Optional. The size in GB of the boot disk to use for the job. The default value is 100.
    • REPLICA_COUNT: The number of worker replicas to use. In most cases, set this to 1 for your first worker pool.
    • If your training application runs in a custom container, specify the following:
      • CUSTOM_CONTAINER_IMAGE_URI: The URI of a container image in Artifact Registry or Docker Hub that is to be run on each worker replica.
      • CUSTOM_CONTAINER_COMMAND: Optional. The command to be invoked when the container is started. This command overrides the container's default entrypoint.
      • CUSTOM_CONTAINER_ARGS: Optional. The arguments to be passed when starting the container.
    • If your training application is a Python package that runs in a prebuilt container, specify the following:
      • EXECUTOR_IMAGE_URI: The URI of the container image that runs the provided code. Refer to the available prebuilt containers for training.
      • PYTHON_PACKAGE_URIS: Comma-separated list of Cloud Storage URIs specifying the Python package files which are the training program and its dependent packages. The maximum number of package URIs is 100.
      • PYTHON_MODULE: The Python module name to run after installing the packages.
      • PYTHON_PACKAGE_ARGS: Optional. Command-line arguments to be passed to the Python module.
    • TIMEOUT: Optional. The maximum running time for the job.
  • Specify the LABEL_NAME and LABEL_VALUE for any labels that you want to apply to this custom job.
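
The reservation affinity values described above can be assembled programmatically. The following is a minimal Python sketch, not an official client: the helper name build_reservation_affinity is hypothetical, and omitting key and values for ANY and NONE is an assumption, because the request template on this page lists those fields without stating when they are required.

```python
VALID_AFFINITY_TYPES = {"ANY", "SPECIFIC_RESERVATION", "NONE"}
RESERVATION_KEY = "compute.googleapis.com/reservation-name"


def build_reservation_affinity(affinity_type, project_id=None, reservation_name=None):
    """Build the reservationAffinity object for a machineSpec."""
    if affinity_type not in VALID_AFFINITY_TYPES:
        raise ValueError(f"Unsupported reservation affinity type: {affinity_type!r}")

    affinity = {"reservationAffinityType": affinity_type}
    if affinity_type == "SPECIFIC_RESERVATION":
        # A specific reservation is targeted by name, so both parts are needed.
        if not (project_id and reservation_name):
            raise ValueError("SPECIFIC_RESERVATION needs a project ID and a reservation name.")
        affinity["key"] = RESERVATION_KEY
        affinity["values"] = [f"projects/{project_id}/reservations/{reservation_name}"]
    return affinity
```

For example, build_reservation_affinity("SPECIFIC_RESERVATION", "example-project", "example-reservation") produces the key and values entries in the same form as the request body on this page.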

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/customJobs

Request JSON body:

{
  "displayName": "JOB_NAME",
  "jobSpec": {
    "workerPoolSpecs": [
      {
        "machineSpec": {
          "machineType": "MACHINE_TYPE",
          "reservationAffinity": {
            "reservationAffinityType": "RESERVATION_AFFINITY_TYPE",
            "key": "compute.googleapis.com/reservation-name",
            "values": [
              "projects/PROJECT_ID/reservations/RESERVATION_NAME"
            ]
          }
        },
        "replicaCount": REPLICA_COUNT,
        "diskSpec": {
          "bootDiskType": DISK_TYPE,
          "bootDiskSizeGb": DISK_SIZE
        },

        // Union field task can be only one of the following:
        "containerSpec": {
          "imageUri": CUSTOM_CONTAINER_IMAGE_URI,
          "command": [
            CUSTOM_CONTAINER_COMMAND
          ],
          "args": [
            CUSTOM_CONTAINER_ARGS
          ]
        },
        "pythonPackageSpec": {
          "executorImageUri": EXECUTOR_IMAGE_URI,
          "packageUris": [
            PYTHON_PACKAGE_URIS
          ],
          "pythonModule": PYTHON_MODULE,
          "args": [
            PYTHON_PACKAGE_ARGS
          ]
        }
        // End of list of possible types for union field task.
      }
      // Specify one workerPoolSpec for single replica training, or multiple workerPoolSpecs
      // for distributed training.
    ],
    "scheduling": {
      "timeout": TIMEOUT
    }
  },
  "labels": {
    "LABEL_NAME_1": LABEL_VALUE_1,
    "LABEL_NAME_2": LABEL_VALUE_2
  }
}
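
Because the template above contains comments and lists both options of the task union field, it isn't valid JSON until you fill it in and keep only one of containerSpec or pythonPackageSpec. The following Python sketch shows one way to generate a valid body. The helper names, project, reservation, image URI, and label are hypothetical placeholders, not values from the API.

```python
import json


def build_worker_pool_spec(machine_spec, replica_count,
                           container_spec=None, python_package_spec=None):
    """Return one workerPoolSpecs entry with exactly one task field set."""
    if (container_spec is None) == (python_package_spec is None):
        raise ValueError("Set exactly one of container_spec or python_package_spec.")
    spec = {"machineSpec": machine_spec, "replicaCount": replica_count}
    if container_spec is not None:
        spec["containerSpec"] = container_spec
    else:
        spec["pythonPackageSpec"] = python_package_spec
    return spec


def build_custom_job_body(display_name, worker_pool_specs, labels=None):
    """Return the request body; optional sections are omitted when not provided."""
    body = {"displayName": display_name,
            "jobSpec": {"workerPoolSpecs": list(worker_pool_specs)}}
    if labels:
        body["labels"] = dict(labels)
    return body


# Illustrative values only; replace them with your own.
pool = build_worker_pool_spec(
    machine_spec={
        "machineType": "a2-megagpu-16g",
        "reservationAffinity": {
            "reservationAffinityType": "SPECIFIC_RESERVATION",
            "key": "compute.googleapis.com/reservation-name",
            "values": ["projects/example-project/reservations/example-reservation"],
        },
    },
    replica_count=1,
    container_spec={"imageUri": "us-docker.pkg.dev/example-project/example-repo/trainer:latest"},
)
request_json = json.dumps(
    build_custom_job_body("example-job", [pool], {"team": "research"}), indent=2)
```

Writing request_json to a file named request.json gives a body that the commands on this page can send.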

To send your request, choose one of these options:

curl

Save the request body in a file named request.json, and execute the following command:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/customJobs"

PowerShell

Save the request body in a file named request.json, and execute the following command:

$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }

Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/customJobs" | Select-Object -Expand Content

The response contains information about the job's specifications as well as the ID of the custom job that was created.
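
As an alternative to curl and PowerShell, the same request could be prepared in Python using only the standard library. This is a sketch under assumptions: the function name build_create_job_request is hypothetical, and the access token is expected to come from gcloud auth print-access-token, as in the commands above. The function only constructs the request and doesn't send it.

```python
import json
import urllib.request


def build_create_job_request(location, project_id, access_token, body):
    """Build (but don't send) the POST request that creates a custom job."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project_id}/locations/{location}/customJobs"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and read the response body.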

What's next