This page describes GPUs in Google Kubernetes Engine (GKE), including use cases, supported features and GPU types, and the differences between Autopilot and Standard modes. For instructions on how to attach and use GPUs in your workloads, refer to Deploy GPU workloads on Autopilot or Run GPUs on Standard node pools.
GPU availability in GKE
In GKE Autopilot, you request GPU hardware by specifying GPU resources in your workloads. In GKE Standard, you attach GPU hardware to nodes in your clusters, and then allocate GPU resources to containerized workloads running on those nodes. You can use these accelerators to perform resource-intensive tasks, such as the following:
- Machine learning (ML) inference and training
- Large-scale data processing
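For example, a minimal Pod manifest that requests a single GPU might look like the following sketch. The `cloud.google.com/gke-accelerator` node selector is how Autopilot workloads specify the GPU type; the GPU model and container image shown here are illustrative placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  nodeSelector:
    # Selects the GPU type; the model here is a placeholder.
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1  # number of GPUs for this container
```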
The GPU hardware that's available for use in GKE is a subset of the Compute Engine GPUs for compute workloads. GKE offers some GPU-specific features, such as time-sharing and multi-instance GPUs, that can improve the efficiency with which your workloads use the GPU resources on your nodes.
The specific hardware that's available depends on the Compute Engine region or zone of your cluster. For specific availability, refer to GPU regions and zones.
GPU quota
Your GPU quota is the maximum number of GPUs that can run in your Google Cloud project. To use GPUs in your GKE clusters, your project must have enough GPU quota.
Your GPU quota should be at least equal to the total number of GPUs you intend to run in your cluster. If you enable cluster autoscaling, you should request GPU quota at least equivalent to your cluster's maximum number of nodes multiplied by the number of GPUs per node.
For example, if you expect to use three nodes with two GPUs each, your project requires a GPU quota of at least six.
To request additional GPU quota, follow the instructions in Requesting a higher quota limit, using `gpus` as the metric.
GPU support in Autopilot and Standard
GPUs are available in Autopilot and Standard clusters. The following table describes the differences between Autopilot and Standard GPU support:
Description | Autopilot | Standard |
---|---|---|
GPU hardware availability | | All GPU types that are supported by Compute Engine |
Selecting a GPU | You request a GPU quantity and type in your workload specification. By default, Autopilot installs the default driver for that GKE version and manages your nodes. To select a specific driver version in Autopilot, see NVIDIA drivers selection for Autopilot GPU Pods. | For instructions, refer to Run GPUs on Standard node pools. |
Improve GPU utilization | | |
Security | | |
Pricing | Autopilot GPU Pod pricing | Compute Engine GPU pricing |
In Autopilot, GKE manages driver installation, node scaling, Pod isolation, and node provisioning. We recommend choosing a cluster mode for your GPUs based on the flexibility and level of control you want over your nodes, as follows:
- If you want to focus on deploying your GPU-based workloads without needing to manage the nodes, and if the available GPU types suit your needs, use Autopilot.
- If you prefer to manage your nodes, scaling, isolation, and underlying machines yourself, use Standard.
GPU features in GKE
GKE provides additional features that you can use to optimize the resource usage of your GPU workloads, so that you aren't wasting GPU resources on your nodes. By default, Kubernetes only supports assigning GPUs as whole units to containers, even if a container only needs a fraction of the available GPU, or if the container doesn't always use the resources.
The following features are available in GKE to reduce the amount of underutilized GPU resources:
GPU feature | Description |
---|---|
Multi-instance GPUs | Available on: Autopilot and Standard. Split a single GPU into up to seven hardware-separated instances that can be assigned as individual GPUs to containers on a node. Each assigned container gets the resources available to that instance. |
Time-sharing GPUs | Available on: Autopilot and Standard. Present a single GPU as multiple units to multiple containers on a node. The GPU driver context-switches and allocates the full GPU resources to each assigned container as needed over time. |
NVIDIA MPS | Available on: Standard. Share a single physical NVIDIA GPU across multiple containers. NVIDIA MPS is an alternative, binary-compatible implementation of the CUDA API designed to transparently enable co-operative multi-process CUDA applications to run concurrently on a single GPU device. |
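To make these options concrete, the following sketch shows how a Pod might target a multi-instance GPU partition through GKE node labels, with the time-sharing alternative shown in comments. The specific label values (GPU model, partition size, client count) are illustrative and depend on how the node pool or Autopilot workload is configured.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-example
spec:
  nodeSelector:
    cloud.google.com/gke-accelerator: nvidia-tesla-a100  # placeholder GPU model
    # Multi-instance GPUs: target a specific hardware partition size.
    cloud.google.com/gke-gpu-partition-size: 1g.5gb
    # Time-sharing alternative (use instead of the partition-size label):
    # cloud.google.com/gke-gpu-sharing-strategy: time-sharing
    # cloud.google.com/gke-max-shared-clients-per-gpu: "2"
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1  # one partition (or one time-shared unit)
```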
About the NVIDIA CUDA-X libraries
In Autopilot clusters, GKE manages the driver version selection and installation.
CUDA is NVIDIA's parallel computing platform and programming model for GPUs. To run CUDA applications, the container image that you use must include the NVIDIA CUDA-X libraries. To add the NVIDIA CUDA-X libraries, use any of the following methods:
- Recommended: Use an image with the NVIDIA CUDA-X libraries pre-installed. For example, you can use Deep Learning Containers. These containers pre-install the key data science frameworks, the NVIDIA CUDA-X libraries, and tools. Alternatively, the NVIDIA CUDA image contains only the NVIDIA CUDA-X libraries.
- Build and use your own image. In this case, include the following values in the `LD_LIBRARY_PATH` environment variable in your container specification:
  - `/usr/local/cuda-CUDA_VERSION/lib64`: the location of the NVIDIA CUDA-X libraries on the node. Replace `CUDA_VERSION` with the CUDA-X image version that you used. Some versions also contain debug utilities in `/usr/local/nvidia/bin`. For details, see the NVIDIA CUDA image on DockerHub.
  - `/usr/local/nvidia/lib64`: the location of the NVIDIA device drivers.
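For illustration, a container spec that sets `LD_LIBRARY_PATH` to those paths might look like the following sketch; the image name is hypothetical, and `cuda-12.2` stands in for whichever CUDA-X version your image uses.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  containers:
  - name: cuda-app
    image: us-docker.pkg.dev/my-project/my-repo/my-cuda-app:latest  # hypothetical image
    env:
    - name: LD_LIBRARY_PATH
      # CUDA-X libraries first, then the NVIDIA device driver libraries.
      value: /usr/local/cuda-12.2/lib64:/usr/local/nvidia/lib64
    resources:
      limits:
        nvidia.com/gpu: 1
```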
To check the minimum GPU driver version required for your version of CUDA, see CUDA Toolkit and Compatible Driver Versions. Ensure that the GKE patch version running on your nodes includes a GPU driver version that's compatible with your chosen CUDA version. For a list of GPU driver versions associated with each GKE version, refer to the corresponding Container-Optimized OS page linked in the GKE current versions table.
Monitor GPU nodes
If your GKE cluster has system metrics enabled, then the following metrics are available in Cloud Monitoring to monitor your GPU workload performance:
- Duty Cycle (`container/accelerator/duty_cycle`): Percentage of time over the past sample period (10 seconds) during which the accelerator was actively processing. Between 1 and 100.
- Memory Usage (`container/accelerator/memory_used`): Amount of accelerator memory allocated in bytes.
- Memory Capacity (`container/accelerator/memory_total`): Total accelerator memory in bytes.
You can use predefined dashboards to monitor your clusters with GPU nodes. For more information, see View observability metrics. For general information about monitoring your clusters and their resources, refer to Observability for GKE.
View usage metrics for workloads
You can view your workload GPU usage metrics from the Workloads dashboard in the Google Cloud console.
To view your workload GPU usage, perform the following steps:
1. Go to the Workloads page in the Google Cloud console.
2. Select a workload.
The Workloads dashboard displays charts for GPU memory usage and capacity, and GPU duty cycle.
View NVIDIA Data Center GPU Manager (DCGM) metrics
You can collect and visualize NVIDIA DCGM metrics by using Google Cloud Managed Service for Prometheus. For Standard clusters, you must install the NVIDIA drivers. For Autopilot clusters, GKE installs the drivers.
For instructions on how to deploy DCGM and the Prometheus DCGM exporter, see NVIDIA Data Center GPU Manager (DCGM) in the Google Cloud Observability documentation.
Handle disruption due to node maintenance
The GKE nodes that host the GPUs are subject to maintenance
events or other disruptions that might cause node shutdown. You can reduce
disruption to workloads running in GKE clusters with the control
plane running version 1.29.1-gke.1425000 and later. GKE alerts
the nodes of an imminent shutdown by sending a SIGTERM
signal to the node up
to 60 minutes before evictions.
You can configure GKE to terminate your workloads gracefully. In your Pod manifest, set the `spec.terminationGracePeriodSeconds` field to a value up to a maximum of 3600 seconds (one hour). GKE makes a best effort to terminate these Pods gracefully and to execute the termination action that you define, for example, saving a training state. GKE respects any configuration of up to 60 minutes for the `PodDisruptionBudget` or `terminationGracePeriodSeconds` settings.
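As a sketch of that configuration, the following manifest requests the maximum grace period and defines a termination action; the image and checkpoint script are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  # Maximum grace period GKE honors for GPU node disruptions (one hour).
  terminationGracePeriodSeconds: 3600
  containers:
  - name: trainer
    image: us-docker.pkg.dev/my-project/my-repo/trainer:latest  # hypothetical image
    lifecycle:
      preStop:
        exec:
          # Hypothetical script that saves a training checkpoint before shutdown.
          command: ["/bin/sh", "-c", "/app/save-checkpoint.sh"]
    resources:
      limits:
        nvidia.com/gpu: 1
```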
To learn more, see Configure GPU node graceful termination.
What's next
- Learn how to select GPUs in Autopilot Pods.
- Learn how to run GPUs in Standard node pools.
- Learn about the default, minimum, and maximum resource requests for Autopilot.