Vertex AI offers online predictions on Google Distributed Cloud (GDC) air-gapped through the Online Prediction API. A prediction is the output of a trained machine-learning model. Specifically, online predictions are synchronous requests made to your model endpoint.
Online Prediction lets you upload, deploy, serve, and make requests using your own prediction models on a set of supported containers. Use Online Prediction when making requests in response to application input or in situations requiring timely inference.
You can use the Online Prediction API by applying Kubernetes custom resources to the dedicated prediction cluster that your Infrastructure Operator (IO) creates for you.
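Since the source doesn't show the exact custom resource schema, the following is only an illustrative sketch: the `apiVersion`, `kind`, and all field names below are assumptions, not the documented GDC API. It shows the general pattern of declaring a deployed model as a Kubernetes custom resource and applying it to the prediction cluster.

```yaml
# Hypothetical manifest -- apiVersion, kind, and fields are illustrative
# assumptions; consult the GDC Online Prediction API reference for the
# actual custom resource schema.
apiVersion: prediction.example.gdc.goog/v1
kind: DeployedModel
metadata:
  name: my-model-deployment
spec:
  # Path to the exported model artifacts
  artifactLocation: gs://my-bucket/models/my-model
  # One of the supported container images listed below
  containerImage: tf2-cpu.2-14
```

You would apply a manifest like this with `kubectl apply -f deployed-model.yaml`, using the kubeconfig of the dedicated prediction cluster.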
Before getting online predictions, you must export model artifacts and deploy the model to an endpoint. This action associates compute resources with the model to serve online predictions with low latency.
Then, you can get online predictions from a custom-trained model by formatting and sending a request.
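As a minimal sketch of formatting such a request: the endpoint URL below is a placeholder assumption, but the JSON body with an `instances` array (one entry per input to score) follows the standard Vertex AI online prediction request format. Sending the request would additionally require authentication appropriate to your environment.

```python
import json

# Placeholder endpoint URL -- substitute the URL of your deployed
# endpoint (an assumption for illustration, not from the source).
ENDPOINT_URL = "https://PREDICTION_ENDPOINT/v1/model/predict"

# An online prediction request body carries an "instances" array,
# with one JSON object per input to score.
request_body = {
    "instances": [
        {"feature_a": 1.0, "feature_b": "red"},
        {"feature_a": 2.5, "feature_b": "blue"},
    ]
}

# Serialize the body to JSON for the HTTP POST request.
payload = json.dumps(request_body)
print(payload)
```

The synchronous response arrives in the same connection, which is what makes this path suitable for timely, per-request inference.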
## Available container images
The following table contains the list of supported containers for Online Prediction in Distributed Cloud:
| ML framework | Version | Supported accelerators | Supported images |
|---|---|---|---|
| TensorFlow | 2.14 | CPU | tf2-cpu.2-14 |
| TensorFlow | 2.14 | GPU | tf2-gpu.2-14 |
| PyTorch | 2.1 | CPU | pytorch-cpu.2-1 |
| PyTorch | 2.1 | GPU | pytorch-gpu.2-1 |