Difference between online and batch predictions
Online predictions are synchronous requests made to a model endpoint. Use online predictions when you are making requests in response to application input or in situations that require timely inference.
Batch predictions are asynchronous requests. You request batch predictions directly from the model resource without needing to deploy the model to an endpoint. For image data, use batch predictions when you don't require an immediate response and want to process accumulated data by using a single request.
Get online predictions
Deploy a model to an endpoint
You must deploy a model to an endpoint before that model can be used to serve online predictions. Deploying a model associates physical resources with the model so it can serve online predictions with low latency.
You can deploy more than one model to an endpoint, and you can deploy a model to more than one endpoint. For more information about options and use cases for deploying models, see About deploying models.
Use one of the following methods to deploy a model:
Google Cloud console
In the Google Cloud console, in the Vertex AI section, go to the Models page.
Click the name of the model you want to deploy to open its Model description page.
In the Version ID column, click the ID of the version of the model you want to deploy
Click Deploy & Test.
If your model is already deployed to any endpoints, they are listed in the Deploy your model section.
Click Deploy to endpoint.
To deploy your model to a new endpoint, click
Create new endpoint and enter a name for the new endpoint. To deploy your model to an existing endpoint, click Add to existing endpoint and select the endpoint Endpoint name.You can add more than one model to an endpoint, and you can add a model to multiple endpoints. Learn more.
If you deploy to a new endpoint, choose how your endpoint can be accessed:
Click Standard for the endpoint endpoint to be available for prediction through a REST API.
Click Private for the endpoint to use a private connection.
If you deploy to an existing endpoint that has one or more models deployed to it, update the Traffic split percentage for the model you are deploying and the already deployed models so their percentages add up to 100%.
Select AutoML Image and configure as follows:
If you're deploying your model to a new endpoint, accept 100 for the Traffic split. Otherwise, adjust the traffic split values for all models on the endpoint so they add up to 100.
Enter the Number of compute nodes you want to provide for your model.
This is the number of nodes available to this model at all times. You are charged for the nodes, even without prediction traffic. See the pricing page.
Learn how to change the default settings for prediction logging.
Classification models only (optional): In the Explainability options section, select Vertex Explainable AI. Accept existing visualization settings or choose new values and click Done.
Enable feature attributions for this model to enableDeploying AutoML image classification models with Vertex Explainable AI configured and performing predictions with explanations is optional. Enabling Vertex Explainable AI at deployment time incurs additional costs based on the deployed node count and deployment time. See Pricing for more information.
Click Done for your model, and when all the Traffic split percentages are correct, click Continue.
The region where your model deploys is displayed. This must be the region where you created your model.
Click Deploy to deploy your model to the endpoint.
API
When you deploy a model using the Vertex AI API, you complete the following steps:
- Create an endpoint if needed.
- Get the endpoint ID.
- Deploy the model to the endpoint.
Create an endpoint
If you are deploying a model to an existing endpoint, you can skip this step.
gcloud
The following example uses the gcloud ai endpoints create
command:
gcloud ai endpoints create \
--region=LOCATION \
--display-name=ENDPOINT_NAME
Replace the following:
- LOCATION_ID: The region where you are using Vertex AI.
- ENDPOINT_NAME: The display name for the endpoint.
The Google Cloud CLI tool might take a few seconds to create the endpoint.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: Your region.
- PROJECT_ID: Your project ID.
- ENDPOINT_NAME: The display name for the endpoint.
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints
Request JSON body:
{ "display_name": "ENDPOINT_NAME" }
To send your request, expand one of these options:
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.aiplatform.v1.CreateEndpointOperationMetadata", "genericMetadata": { "createTime": "2020-11-05T17:45:42.812656Z", "updateTime": "2020-11-05T17:45:42.812656Z" } } }
"done": true
.
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
Retrieve the endpoint ID
You need the endpoint ID to deploy the model.
gcloud
The following example uses the gcloud ai endpoints list
command:
gcloud ai endpoints list \
--region=LOCATION \
--filter=display_name=ENDPOINT_NAME
Replace the following:
- LOCATION_ID: The region where you are using Vertex AI.
- ENDPOINT_NAME: The display name for the endpoint.
Note the number that appears in the ENDPOINT_ID
column. Use this ID in the
following step.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_NAME: The display name for the endpoint.
HTTP method and URL:
GET https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints?filter=display_name=ENDPOINT_NAME
To send your request, expand one of these options:
You should receive a JSON response similar to the following:
{ "endpoints": [ { "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/endpoints/ENDPOINT_ID", "displayName": "ENDPOINT_NAME", "etag": "AMEw9yPz5pf4PwBHbRWOGh0PcAxUdjbdX2Jm3QO_amguy3DbZGP5Oi_YUKRywIE-BtLx", "createTime": "2020-04-17T18:31:11.585169Z", "updateTime": "2020-04-17T18:35:08.568959Z" } ] }
Deploy the model
Select the tab below for your language or environment:
gcloud
The following examples use the gcloud ai endpoints deploy-model
command.
The following example deploys a Model
to an Endpoint
without splitting
traffic between multiple DeployedModel
resources:
Before using any of the command data below, make the following replacements:
- ENDPOINT_ID: The ID for the endpoint.
- LOCATION_ID: The region where you are using Vertex AI.
- MODEL_ID: The ID for the model to be deployed.
-
DEPLOYED_MODEL_NAME: A name for the
DeployedModel
. You can use the display name of theModel
for theDeployedModel
as well. - MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes.
-
MAX_REPLICA_COUNT: The maximum number of nodes for this deployment.
The node count can be increased or decreased as required by the prediction load,
up to this number of nodes and never fewer than the minimum number of nodes.
If you omit the
--max-replica-count
flag, then maximum number of nodes is set to the value of--min-replica-count
.
Execute the gcloud ai endpoints deploy-model command:
Linux, macOS, or Cloud Shell
gcloud ai endpoints deploy-model ENDPOINT_ID\ --region=LOCATION_ID \ --model=MODEL_ID \ --display-name=DEPLOYED_MODEL_NAME \ --min-replica-count=MIN_REPLICA_COUNT \ --max-replica-count=MAX_REPLICA_COUNT \ --traffic-split=0=100
Windows (PowerShell)
gcloud ai endpoints deploy-model ENDPOINT_ID` --region=LOCATION_ID ` --model=MODEL_ID ` --display-name=DEPLOYED_MODEL_NAME ` --min-replica-count=MIN_REPLICA_COUNT ` --max-replica-count=MAX_REPLICA_COUNT ` --traffic-split=0=100
Windows (cmd.exe)
gcloud ai endpoints deploy-model ENDPOINT_ID^ --region=LOCATION_ID ^ --model=MODEL_ID ^ --display-name=DEPLOYED_MODEL_NAME ^ --min-replica-count=MIN_REPLICA_COUNT ^ --max-replica-count=MAX_REPLICA_COUNT ^ --traffic-split=0=100
Splitting traffic
The --traffic-split=0=100
flag in the preceding examples sends 100% of prediction
traffic that the Endpoint
receives to the new DeployedModel
, which is
represented by the temporary ID 0
. If your Endpoint
already has other
DeployedModel
resources, then you can split traffic between the new
DeployedModel
and the old ones.
For example, to send 20% of traffic to the new DeployedModel
and 80% to an older one,
run the following command.
Before using any of the command data below, make the following replacements:
- OLD_DEPLOYED_MODEL_ID: the ID of the existing
DeployedModel
.
Execute the gcloud ai endpoints deploy-model command:
Linux, macOS, or Cloud Shell
gcloud ai endpoints deploy-model ENDPOINT_ID\ --region=LOCATION_ID \ --model=MODEL_ID \ --display-name=DEPLOYED_MODEL_NAME \ --min-replica-count=MIN_REPLICA_COUNT \ --max-replica-count=MAX_REPLICA_COUNT \ --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
Windows (PowerShell)
gcloud ai endpoints deploy-model ENDPOINT_ID` --region=LOCATION_ID ` --model=MODEL_ID ` --display-name=DEPLOYED_MODEL_NAME \ --min-replica-count=MIN_REPLICA_COUNT ` --max-replica-count=MAX_REPLICA_COUNT ` --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
Windows (cmd.exe)
gcloud ai endpoints deploy-model ENDPOINT_ID^ --region=LOCATION_ID ^ --model=MODEL_ID ^ --display-name=DEPLOYED_MODEL_NAME \ --min-replica-count=MIN_REPLICA_COUNT ^ --max-replica-count=MAX_REPLICA_COUNT ^ --traffic-split=0=20,OLD_DEPLOYED_MODEL_ID=80
REST
Deploy the model.
Before using any of the request data, make the following replacements:
- LOCATION_ID: The region where you are using Vertex AI.
- PROJECT_ID: Your project ID.
- ENDPOINT_ID: The ID for the endpoint.
- MODEL_ID: The ID for the model to be deployed.
-
DEPLOYED_MODEL_NAME: A name for the
DeployedModel
. You can use the display name of theModel
for theDeployedModel
as well. - MIN_REPLICA_COUNT: The minimum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to the maximum number of nodes and never fewer than this number of nodes.
- MAX_REPLICA_COUNT: The maximum number of nodes for this deployment. The node count can be increased or decreased as required by the prediction load, up to this number of nodes and never fewer than the minimum number of nodes.
- TRAFFIC_SPLIT_THIS_MODEL: The percentage of the prediction traffic to this endpoint to be routed to the model being deployed with this operation. Defaults to 100. All traffic percentages must add up to 100. Learn more about traffic splits.
- DEPLOYED_MODEL_ID_N: Optional. If other models are deployed to this endpoint, you must update their traffic split percentages so that all percentages add up to 100.
- TRAFFIC_SPLIT_MODEL_N: The traffic split percentage value for the deployed model id key.
- PROJECT_NUMBER: Your project's automatically generated project number
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID:deployModel
Request JSON body:
{ "deployedModel": { "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID", "displayName": "DEPLOYED_MODEL_NAME", "automaticResources": { "minReplicaCount": MIN_REPLICA_COUNT, "maxReplicaCount": MAX_REPLICA_COUNT } }, "trafficSplit": { "0": TRAFFIC_SPLIT_THIS_MODEL, "DEPLOYED_MODEL_ID_1": TRAFFIC_SPLIT_MODEL_1, "DEPLOYED_MODEL_ID_2": TRAFFIC_SPLIT_MODEL_2 }, }
To send your request, expand one of these options:
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_ID/locations/LOCATION_ID/endpoints/ENDPOINT_ID/operations/OPERATION_ID", "metadata": { "@type": "type.googleapis.com/google.cloud.aiplatform.v1.DeployModelOperationMetadata", "genericMetadata": { "createTime": "2020-10-19T17:53:16.502088Z", "updateTime": "2020-10-19T17:53:16.502088Z" } } }
Java
Before trying this sample, follow the Java setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Java API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
Before trying this sample, follow the Node.js setup instructions in the Vertex AI quickstart using client libraries. For more information, see the Vertex AI Node.js API reference documentation.
To authenticate to Vertex AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
Learn how to change the default settings for prediction logging.
Get operation status
Some requests start long-running operations that require time to complete. These requests return an operation name, which you can use to view the operation's status or cancel the operation. Vertex AI provides helper methods to make calls against long-running operations. For more information, see Working with long-running operations.
Make an online prediction using your deployed model
To make an online prediction, submit one or more test items to a model for analysis, and the model returns results that are based on your model's objective. For more information about prediction results, see the Interpret results page.
Console
Use the Google Cloud console to request an online prediction. Your model must be deployed to an endpoint.
In the Google Cloud console, in the Vertex AI section, go to the Models page.
From the list of models, click the name of the model to request predictions from.
Select the Deploy & test tab.
Under the Test your model section, add test items to request a prediction.
AutoML models for image objectives require you to upload an image to request a prediction.
For information about local feature importance, see Get explanations.
After the prediction is complete, Vertex AI returns the results in the console.
API
Use the Vertex AI API to request an online prediction. Your model must be deployed to an endpoint.
Get batch predictions
To make a batch prediction request, you specify an input source and an output format where Vertex AI stores predictions results. Batch predictions for the AutoML image model type require an input JSON Lines file and the name of a Cloud Storage bucket to store the output.
Input data requirements
The input for batch requests specifies the items to send to your model for prediction. For image classification models, you can use a JSON Lines file to specify a list of images to make predictions about and then store the JSON Lines file in a Cloud Storage bucket. The following sample shows a single line in an input JSON Lines file:
{"content": "gs://sourcebucket/datasets/images/source_image.jpg", "mimeType": "image/jpeg"}
Request a batch prediction
For batch prediction requests, you can use the Google Cloud console or the Vertex AI API. Depending on the number of input items that you've submitted, a batch prediction task can take some time to complete.
Google Cloud console
Use the Google Cloud console to request a batch prediction.
In the Google Cloud console, in the Vertex AI section, go to the Batch predictions page.
Click Create to open the New batch prediction window and complete the following steps:
- Enter a name for the batch prediction.
- For Model name, select the name of the model to use for this batch prediction.
- For Source path, specify the Cloud Storage location where your JSON Lines input file is located.
- For the Destination path, specify a Cloud Storage location where the batch prediction results are stored. The Output format is determined by your model's objective. AutoML models for image objectives output JSON Lines files.
API
Use the Vertex AI API to send batch prediction requests.
REST
Before using any of the request data, make the following replacements:
- LOCATION_ID: Region where Model is stored and batch prediction job is executed. For
example,
us-central1
. - PROJECT_ID: Your project ID
- BATCH_JOB_NAME: Display name for the batch job
- MODEL_ID: The ID for the model to use for making predictions
- THRESHOLD_VALUE (optional): Vertex AI returns only
predictions that have confidence scores with at least this value. The
default is
0.0
. - MAX_PREDICTIONS (optional): Vertex AI returns up to this
many predictions starting with the predictions that have highest confidence
scores. The default is
10
. - URI: Cloud Storage URI where your input JSON Lines file is located.
- BUCKET: Your Cloud Storage bucket
- PROJECT_NUMBER: Your project's automatically generated project number
HTTP method and URL:
POST https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/batchPredictionJobs
Request JSON body:
{ "displayName": "BATCH_JOB_NAME", "model": "projects/PROJECT/locations/LOCATION/models/MODEL_ID", "modelParameters": { "confidenceThreshold": THRESHOLD_VALUE, "maxPredictions": MAX_PREDICTIONS }, "inputConfig": { "instancesFormat": "jsonl", "gcsSource": { "uris": ["URI"], }, }, "outputConfig": { "predictionsFormat": "jsonl", "gcsDestination": { "outputUriPrefix": "OUTPUT_BUCKET", }, }, }
To send your request, choose one of these options:
curl
Save the request body in a file named request.json
,
and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/batchPredictionJobs"
PowerShell
Save the request body in a file named request.json
,
and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION_ID-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION_ID/batchPredictionJobs" | Select-Object -Expand Content
You should receive a JSON response similar to the following:
{ "name": "projects/PROJECT_NUMBER/locations/LOCATION_ID/batchPredictionJobs/BATCH_JOB_ID", "displayName": "BATCH_JOB_NAME", "model": "projects/PROJECT_ID/locations/LOCATION_ID/models/MODEL_ID", "inputConfig": { "instancesFormat": "jsonl", "gcsSource": { "uris": [ "CONTENT" ] } }, "outputConfig": { "predictionsFormat": "jsonl", "gcsDestination": { "outputUriPrefix": "BUCKET" } }, "state": "JOB_STATE_PENDING", "createTime": "2020-05-30T02:58:44.341643Z", "updateTime": "2020-05-30T02:58:44.341643Z", "modelDisplayName": "MODEL_NAME", "modelObjective": "MODEL_OBJECTIVE" }
You can poll for the status of the batch job using
the BATCH_JOB_ID until the job state
is
JOB_STATE_SUCCEEDED
.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
Retrieve batch prediction results
Vertex AI sends batch prediction output to your specified destination.
When a batch prediction task is complete, the output of the prediction is stored in the Cloud Storage bucket that you specified in your request.
Example batch prediction results
The following an example batch prediction results from an image classification model.
{ "instance": {"content": "gs://bucket/image.jpg", "mimeType": "image/jpeg"}, "prediction": { "ids": [1, 2], "displayNames": ["cat", "dog"], "confidences": [0.7, 0.5] } }