Troubleshoot GPUs in GKE


This page shows you how to resolve issues related to GPUs in Google Kubernetes Engine (GKE).

If you need additional assistance, reach out to Cloud Customer Care.

GPU driver installation

This section provides troubleshooting information for automatic NVIDIA device driver installation in GKE.

Driver installation fails in Ubuntu nodes

If you use Ubuntu nodes that have attached L4 or H100 GPUs, the default GPU driver that GKE installs might be earlier than the minimum version that those GPUs require. As a result, the GPU device plugin Pod remains stuck in the Pending state, and your GPU workloads on those nodes might experience issues.
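
To confirm the symptom, you can check the status of the GPU device plugin Pods. The namespace and label selector in the following command are assumptions based on the GKE-managed device plugin DaemonSet; adjust them if your cluster uses different values:

kubectl get pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin -o wide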

To resolve this issue, manually install driver version 535 or later by running the following command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded-R535.yaml
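
After the installer DaemonSet finishes, the GPU device plugin Pod should leave the Pending state and the nodes should report allocatable GPUs. As a rough check, assuming the GPUs are exposed as the nvidia.com/gpu resource, run the following command:

kubectl describe nodes | grep "nvidia.com/gpu"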

GPU device plugins fail with CrashLoopBackOff errors

The following issue occurs if you used the manual driver installation method in your node pool prior to January 25, 2023, and later upgraded your node pool to a GKE version that supports automatic driver installation. Both installation workloads exist at the same time and try to install conflicting driver versions on your nodes.

The GPU device plugin init container fails with the Init:CrashLoopBackOff status. The logs for the container are similar to the following:

failed to verify installation: failed to verify GPU driver installation: exit status 18
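
To find the failing Pod and read its init container logs yourself, you can run commands similar to the following. The namespace, label selector, and init container name are assumptions based on the GKE-managed device plugin; adjust them to match what kubectl describe pod reports for your cluster:

kubectl get pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin
kubectl logs POD_NAME -n kube-system -c nvidia-driver-installer

Replace POD_NAME with the name of a Pod from the first command's output.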

To resolve this issue, try the following methods:

  • Remove the manual driver installation DaemonSet from your cluster. This deletes the conflicting installation workload and lets GKE automatically install a driver on your nodes.

    kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
    
  • Re-apply the manual driver installation DaemonSet manifest to your cluster. On January 25, 2023, we updated the manifest to ignore nodes that use automatic driver installation.

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
    
  • Disable automatic driver installation for your node pool. The existing driver installation DaemonSet should work as expected after the update operation completes.

    gcloud container node-pools update POOL_NAME \
        --accelerator=type=GPU_TYPE,count=GPU_COUNT,gpu-driver-version=disabled \
        --cluster=CLUSTER_NAME \
        --location=LOCATION
    

    Replace the following:

    • POOL_NAME: the name of the node pool.
    • GPU_TYPE: the GPU type that the node pool already uses.
    • GPU_COUNT: the number of GPUs that are already attached to the node pool.
    • CLUSTER_NAME: the name of the GKE cluster that contains the node pool.
    • LOCATION: the Compute Engine location of the cluster.
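
Whichever method you choose, you can confirm that the conflict is resolved by checking that only one NVIDIA driver installer DaemonSet remains and that the device plugin Pods become ready. The label selector in the second command is an assumption; adjust it to match your cluster:

kubectl get daemonsets -n kube-system | grep nvidia
kubectl get pods -n kube-system -l k8s-app=nvidia-gpu-device-plugin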

What's next

If you need additional assistance, reach out to Cloud Customer Care.