Troubleshooting Autopilot clusters

This page shows you how to resolve issues with Google Kubernetes Engine (GKE) Autopilot clusters.

If you need additional assistance, reach out to Cloud Customer Care.

Cluster issues

Cannot create a cluster: 0 nodes registered

The following issue occurs when you try to create an Autopilot cluster with an IAM service account that's disabled or doesn't have the required permissions. Cluster creation fails with the following error message:

All cluster resources were brought up, but: only 0 nodes out of 2 have registered.

To resolve the issue, do the following:

Check whether the default Compute Engine service account or the custom IAM service account that you want to use is disabled:
```
gcloud iam service-accounts describe SERVICE_ACCOUNT
```
Replace SERVICE_ACCOUNT with the email address of the service account, such as my-iam-account@my-first-project.iam.gserviceaccount.com.

If the service account is disabled, the output is similar to the following:
```
disabled: true
displayName: my-service-account
email: my-service-account@my-project.iam.gserviceaccount.com
...
```

If the service account is disabled, enable it:

gcloud iam service-accounts enable SERVICE_ACCOUNT

If the service account is enabled and the error persists, grant the service account the minimum permissions required for GKE:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:SERVICE_ACCOUNT" \
    --role roles/container.nodeServiceAccount
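
The check-then-enable steps above can be combined in a small script. The following is a minimal sketch, not an official tool: the function names sa_is_disabled and enable_sa_if_disabled are hypothetical, and the check assumes that the describe output contains a top-level disabled: true line, as in the sample output shown earlier.

```shell
# Hypothetical helper: succeeds (exit 0) when the describe output on stdin
# contains a top-level "disabled: true" field.
sa_is_disabled() {
  grep -Eq '^disabled:[[:space:]]*true[[:space:]]*$'
}

# Hypothetical helper: enable the account only if it is reported as disabled.
# $1 = service account email (SERVICE_ACCOUNT in the steps above).
enable_sa_if_disabled() {
  if gcloud iam service-accounts describe "$1" | sa_is_disabled; then
    gcloud iam service-accounts enable "$1"
  else
    echo "Service account $1 is not disabled; check its IAM roles instead."
  fi
}

# Example call (placeholder email):
#   enable_sa_if_disabled my-service-account@my-project.iam.gserviceaccount.com
```

If the account is already enabled, the script only prints a reminder; granting the roles/container.nodeServiceAccount role remains a separate step.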

Namespace stuck in the Terminating state when cluster has 0 nodes

The following issue occurs when you delete a namespace in a cluster after the cluster scales down to zero nodes. The metrics-server component can't accept the namespace deletion request because the component has zero replicas.

To diagnose this issue, run the following command:

kubectl describe ns/NAMESPACE_NAME

Replace NAMESPACE_NAME with the name of the namespace.

The output is similar to the following:

Discovery failed for some groups, 1 failing: unable to retrieve the complete
list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to
handle the request

To resolve this issue, scale any workload up to trigger GKE to create a new node. When the node is ready, the namespace deletion request automatically completes. After GKE deletes the namespace, scale the workload back down.
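
As an illustration of that sequence, the sketch below wraps kubectl scale in a hypothetical helper named scale_workload. It assumes the workload is a Deployment that lives in a namespace other than the one being deleted; the Deployment name and namespace are placeholders.

```shell
# Hypothetical helper: scale a Deployment to a given replica count.
# $1 = Deployment name, $2 = its namespace, $3 = replica count.
scale_workload() {
  kubectl scale deployment "$1" --namespace "$2" --replicas="$3"
}

# Example sequence (names are placeholders):
#   scale_workload example-deployment default 1   # triggers node creation
#   kubectl get namespace NAMESPACE_NAME          # repeat until it reports NotFound
#   scale_workload example-deployment default 0   # scale back down afterwards
```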

Scaling issues

Node scale up failed: Pod is at risk of not being scheduled

The following issue occurs when serial port logging is disabled in your Google Cloud project. GKE Autopilot clusters require serial port logging to effectively debug node issues. If serial port logging is disabled, Autopilot can't provision nodes to run your workloads.

The error message in your Kubernetes event log is similar to the following:

LAST SEEN   TYPE      REASON          OBJECT                          MESSAGE
12s         Warning   FailedScaleUp   pod/pod-test-5b97f7c978-h9lvl   Node scale up in zones associated with this pod failed: Internal error. Pod is at risk of not being scheduled

Serial port logging might be disabled at the organization level through an organization policy that enforces the compute.disableSerialPortLogging constraint. Serial port logging could also be disabled at the project or virtual machine (VM) instance level.

To resolve this issue, do the following:

Ask your Google Cloud organization policy administrator to remove the compute.disableSerialPortLogging constraint in the project with your Autopilot cluster.
If you don't have an organization policy that enforces this constraint, try to enable serial port logging in your project metadata. This action requires the compute.projects.setCommonInstanceMetadata IAM permission.
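
For the second option, a minimal sketch of setting the project metadata with the gcloud CLI follows. The wrapper name enable_project_serial_port_logging is hypothetical, and the sketch assumes that no organization policy blocks the change and that your account has the permission named above.

```shell
# Hypothetical helper: set the project-wide metadata key that enables
# serial port logging. $1 = project ID.
enable_project_serial_port_logging() {
  gcloud compute project-info add-metadata \
      --project "$1" \
      --metadata serial-port-logging-enable=true
}

# Example call (placeholder project ID):
#   enable_project_serial_port_logging my-project
```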

Node scale up failed: GCE out of resources

The following issue occurs when your workloads request more resources than are available to use in that Compute Engine region or zone. Your Pods might remain in the Pending state.

Check your Pod events:

kubectl events --for='pod/POD_NAME' --types=Warning

Replace POD_NAME with the name of the pending Pod, for example example-pod.

The output is similar to the following:

LAST SEEN         TYPE            REASON                  OBJECT                   Message
19m               Warning         FailedScheduling        pod/example-pod          gke.io/optimize-utilization-scheduler  0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
14m               Warning         FailedScheduling        pod/example-pod          gke.io/optimize-utilization-scheduler  0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
12m (x2 over 18m) Warning         FailedScaleUp           cluster-autoscaler       Node scale up in zones us-central1-f associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.
34s (x3 over 17m) Warning         FailedScaleUp           cluster-autoscaler       Node scale up in zones us-central1-b associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.
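
To see at a glance which zones reported the error, you can filter the event output. The following sketch is illustrative only: the function name out_of_resources_zones is hypothetical, and it assumes each message names its zone as a single token in the form shown in the sample output above.

```shell
# Hypothetical helper: read event lines on stdin and print the distinct
# zones whose scale-up failed with "GCE out of resources".
out_of_resources_zones() {
  grep 'GCE out of resources' \
    | sed -n 's/.*Node scale up in zones \([^ ]*\) associated.*/\1/p' \
    | sort -u
}

# Example call (POD_NAME is a placeholder):
#   kubectl events --for='pod/POD_NAME' --types=Warning | out_of_resources_zones
```

Zones that appear in this list are candidates to avoid when you choose a different location for the Pod.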

To resolve this issue, try the following:

Deploy the Pod in a different region or zone. If your Pod has a zonal restriction such as a topology selector, remove the restriction if you can. For instructions, see Place GKE Pods in specific zones.
Create a cluster in a different region and retry the deployment.
Try using a different compute class. Compute classes that are backed by smaller Compute Engine machine types are more likely to have available resources. For example, the default machine type for Autopilot has the highest availability. For a list of compute classes and the corresponding machine types, see When to use specific compute classes.
If you run GPU workloads, the requested GPU might not be available in your node location. Try deploying your workload in a different location or requesting a different type of GPU.

To avoid scale-up issues caused by resource availability in the future, consider the following approaches:

Use Kubernetes PriorityClasses to consistently provision extra compute capacity in your cluster. For details, see Provision extra compute capacity for rapid Pod scaling.
Use Compute Engine capacity reservations with the Performance or the Accelerator compute classes. For details, see Consume reserved zonal resources.

Nodes fail to scale up: Pod zonal resources exceeded

The following issue occurs when Autopilot doesn't provision new nodes for a Pod in a specific zone because a new node would violate resource limits.

The error message in your logs is similar to the following:

    "napFailureReasons": [
            {
              "messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
              ...

This error refers to a noScaleUp event, where node auto-provisioning did not provision any node group for the Pod in the zone.

If you encounter this error, confirm the following:

Your Pods have sufficient memory and CPU.
The Pod IP address CIDR range is large enough to support your anticipated maximum cluster size.

Workload issues

Workloads stuck with ephemeral storage error

GKE won't create Pods if their ephemeral storage requests exceed the Autopilot maximum, which is 10 GiB in GKE version 1.28.6-gke.1317000 and later.

To diagnose this issue, describe the workload controller, like the Deployment or the Job:

kubectl describe CONTROLLER_TYPE/CONTROLLER_NAME

Replace the following:

CONTROLLER_TYPE: the type of workload controller, like replicaset or daemonset. For a list of controller types, see Workload management.
CONTROLLER_NAME: the name of the stuck workload.

If the Pod is not created because of the ephemeral storage request exceeding the maximum, the output is similar to the following:

# lines omitted for clarity

Events:

{"[denied by autogke-pod-limit-constraints]":["Max ephemeral-storage requested by init containers for workload '' is higher than the Autopilot maximum of '10Gi'.","Total ephemeral-storage requested by containers for workload '' is higher than the Autopilot maximum of '10Gi'."]}

To resolve this issue, update your ephemeral storage requests so that the total ephemeral storage requested by workload containers and by containers that webhooks inject is less than or equal to the allowed maximum. For more information about the maximum, see Resource requests in Autopilot.
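
A quick way to sanity-check a planned configuration is to add up the requests yourself. The sketch below is a simplified illustration, not how GKE evaluates the constraint: the function names are hypothetical, only whole-number Gi and Mi quantities are handled, and the 10Gi limit is the value quoted in the error message above.

```shell
# Convert a Kubernetes quantity with a Gi or Mi suffix to MiB.
# Only whole numbers with these two suffixes are handled in this sketch.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Sum the container requests passed as arguments, print the total, and
# succeed only if the total is within 10Gi (10240Mi).
within_ephemeral_limit() {
  total=0
  for q in "$@"; do
    mib=$(to_mib "$q") || return 2
    total=$(( total + mib ))
  done
  echo "total=${total}Mi"
  [ "$total" -le 10240 ]
}

# Example: an app container plus a webhook-injected sidecar.
#   within_ephemeral_limit 4Gi 512Mi && echo "within the limit"
```

Remember to include containers that webhooks inject, because they count toward the same total.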

Pods stuck in Pending state

A Pod might get stuck in the Pending state and remain unscheduled if you select a specific node for the Pod, but the sum of the resource requests in the Pod and in the DaemonSets that must run on that node exceeds the node's maximum allocatable capacity.

To avoid this issue, evaluate the sizes of your deployed workloads to ensure that they're within the supported maximum resource requests for Autopilot.

You can also try scheduling your DaemonSets before you schedule your regular workload Pods.

Consistently unreliable workload performance on a specific node

In GKE version 1.24 and later, if your workloads on a specific node consistently experience disruptions, crashes, or similar unreliable behavior, you can tell GKE about the problematic node by draining it, which also cordons the node, using the following command:

kubectl drain NODE_NAME --ignore-daemonsets

Replace NODE_NAME with the name of the problematic node. You can find the node name by running kubectl get nodes.

GKE does the following:

Evicts existing workloads from the node and stops scheduling workloads on that node.
Automatically recreates any evicted workloads that are managed by a controller, such as a Deployment or a StatefulSet, on other nodes.
Terminates any workloads that remain on the node and repairs or recreates the node over time.
If you use Autopilot, GKE shuts down and replaces the node immediately and ignores any configured PodDisruptionBudgets.
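
Before you drain a node, it can help to see which Pods will be evicted. The following sketch uses a field selector on spec.nodeName; the wrapper name pods_on_node is hypothetical.

```shell
# Hypothetical helper: list every Pod currently scheduled on a node.
# $1 = node name (NODE_NAME in the command above).
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# Example call (placeholder node name):
#   pods_on_node NODE_NAME
```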

Pods take longer than expected to schedule on empty clusters

This event occurs when you deploy a workload to an Autopilot cluster that has no other workloads. Autopilot clusters start with zero usable nodes and scale to zero nodes if the cluster is empty to avoid having unutilized compute resources in the cluster. Deploying a workload in a cluster that has zero nodes triggers a scale-up event.

If you experience this, Autopilot is functioning as intended, and no action is necessary. Your workload will deploy as expected after the new nodes boot up.

Check whether your Pods are waiting for new nodes:

Describe your pending Pod:
```
kubectl describe pod POD_NAME
```
Replace POD_NAME with the name of your pending Pod.

Check the Events section of the output. If the Pod is waiting for new nodes, the output is similar to the following:

Events:
  Type     Reason            Age   From                                   Message
  ----     ------            ----  ----                                   -------
  Warning  FailedScheduling  11s   gke.io/optimize-utilization-scheduler  no nodes available to schedule pods
  Normal   TriggeredScaleUp  4s    cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/example-project/zones/example-zone/instanceGroups/gk3-example-cluster-pool-2-9293c6db-grp 0->1 (max: 1000)} {https://www.googleapis.com/compute/v1/projects/example-project/zones/example-zone/instanceGroups/gk3-example-cluster-pool-2-d99371e7-grp 0->1 (max: 1000)}]

The TriggeredScaleUp event shows that your cluster is scaling up from zero nodes to as many nodes as are required to run your deployed workload.
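
The instance group URLs in that event are long, so a small filter can make the scale-up targets easier to read. This is a sketch only: the function name scale_up_groups is hypothetical, and it assumes the event message keeps the format shown in the sample output above.

```shell
# Hypothetical helper: read "kubectl describe pod" output on stdin and print
# each instance group named in a TriggeredScaleUp event with its node count
# change, for example "gk3-example-cluster-pool-2-9293c6db-grp 0->1".
scale_up_groups() {
  grep 'TriggeredScaleUp' \
    | grep -o 'instanceGroups/[^ ]* [0-9][0-9]*->[0-9][0-9]*' \
    | sed 's|^instanceGroups/||'
}

# Example call (POD_NAME is a placeholder):
#   kubectl describe pod POD_NAME | scale_up_groups
```

If the filter prints nothing, the Pod has not triggered a scale-up, and the delay probably has a different cause.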

Permission error when running tcpdump from a Pod in GKE Autopilot

Access to the underlying nodes is prohibited in a GKE Autopilot cluster. To capture packets, you must therefore run the tcpdump utility from within a Pod and then copy the capture file by using the kubectl cp command. If you run tcpdump from within a Pod in a GKE Autopilot cluster, you might see the following error:

    tcpdump: eth0: You don't have permission to perform this capture on that device
    (socket: Operation not permitted)

This happens because GKE Autopilot, by default, applies a security context to all Pods that drops the NET_RAW capability to mitigate potential vulnerabilities. For example:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app: tcpdump
  name: tcpdump
spec:
  containers:
  - image: nginx
    name: nginx
    resources:
      limits:
        cpu: 500m
        ephemeral-storage: 1Gi
        memory: 2Gi
      requests:
        cpu: 500m
        ephemeral-storage: 1Gi
        memory: 2Gi
    securityContext:
      capabilities:
        drop:
        - NET_RAW

As a solution, if your workload requires the NET_RAW capability, you can re-enable it:

Add the NET_RAW capability to the securityContext section of your Pod's YAML specification:
```
securityContext:
  capabilities:
    add:
    - NET_RAW
```

Run tcpdump from within a Pod:

tcpdump port 53 -w packetcap.pcap
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes

Use the kubectl cp command to copy the capture file to your local machine for further analysis:
```
kubectl cp POD_NAME:/PATH_TO_FILE/FILE_NAME /PATH_TO_FILE/FILE_NAME
```
Alternatively, use kubectl exec to run tcpdump and redirect the captured packets directly to a file on your local machine:
```
kubectl exec -it POD_NAME -- bash -c "tcpdump port 53 -w -" > packet-new.pcap
```
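
To show where the securityContext change fits, the sketch below writes a complete manifest based on the example Pod shown earlier in this section, with NET_RAW added instead of dropped. The function name write_tcpdump_pod_manifest and the output filename are hypothetical. The nginx image is kept from the earlier example, so this sketch assumes that tcpdump is installed in the container separately.

```shell
# Hypothetical helper: write a Pod manifest that adds the NET_RAW capability.
# $1 = path of the manifest file to create.
write_tcpdump_pod_manifest() {
  cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: tcpdump
  name: tcpdump
spec:
  containers:
  - image: nginx
    name: nginx
    resources:
      requests:
        cpu: 500m
        ephemeral-storage: 1Gi
        memory: 2Gi
    securityContext:
      capabilities:
        add:
        - NET_RAW
EOF
}

# Example usage (filename is a placeholder):
#   write_tcpdump_pod_manifest tcpdump-pod.yaml
#   kubectl apply -f tcpdump-pod.yaml
```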

What's next

If you need additional assistance, reach out to Cloud Customer Care.