This page shows you how to resolve issues with Google Kubernetes Engine (GKE) Autopilot clusters.
If you need additional assistance, reach out to Cloud Customer Care.
Cluster issues
Cannot create a cluster: 0 nodes registered
The following issue occurs when you try to create an Autopilot cluster with an IAM service account that's disabled or doesn't have the required permissions. Cluster creation fails with the following error message:
All cluster resources were brought up, but: only 0 nodes out of 2 have registered.
To resolve the issue, do the following:
Check whether the default Compute Engine service account or the custom IAM service account that you want to use is disabled:
gcloud iam service-accounts describe SERVICE_ACCOUNT
Replace SERVICE_ACCOUNT with the service account email address, such as my-iam-account@my-first-project.iam.gserviceaccount.com.
If the service account is disabled, the output is similar to the following:
disabled: true
displayName: my-service-account
email: my-service-account@my-project.iam.gserviceaccount.com
...
If the service account is disabled, enable it:
gcloud iam service-accounts enable SERVICE_ACCOUNT
If the service account is enabled and the error persists, grant the service account the minimum permissions required for GKE:
gcloud projects add-iam-policy-binding PROJECT_ID \
--member "serviceAccount:SERVICE_ACCOUNT" \
--role roles/container.nodeServiceAccount
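The decision flow in these steps can be sketched in Python. This is a minimal sketch, not part of any Google tooling: `next_step` is a hypothetical helper that reads the text printed by `gcloud iam service-accounts describe` and assumes that a disabled account is reported with a `disabled: true` line, as in the sample output.

```python
def next_step(describe_output):
    """Suggest the next remediation step from `describe` output text.

    Returns "enable" when the output reports `disabled: true`,
    otherwise "grant-role" (the account is enabled, so check its roles).
    """
    for line in describe_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "disabled" and value.strip().lower().startswith("true"):
            return "enable"
    return "grant-role"


# Sample text modeled on the output shown in this section.
sample = """disabled: true
displayName: my-service-account
email: my-service-account@my-project.iam.gserviceaccount.com
"""
print(next_step(sample))  # enable
```

A result of `enable` corresponds to the `gcloud iam service-accounts enable` step, and `grant-role` corresponds to the `add-iam-policy-binding` step.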
Namespace stuck in the Terminating state when cluster has 0 nodes
The following issue occurs when you delete a namespace in a cluster after the cluster scales down to zero nodes. The metrics-server component can't accept the namespace deletion request because the component has zero replicas.
To diagnose this issue, run the following command:
kubectl describe ns/NAMESPACE_NAME
Replace NAMESPACE_NAME with the name of the namespace.
The output is similar to the following:
Discovery failed for some groups, 1 failing: unable to retrieve the complete
list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to
handle the request
To resolve this issue, scale any workload up to trigger GKE to create a new node. When the node is ready, the namespace deletion request automatically completes. After GKE deletes the namespace, scale the workload back down.
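The scale-up, wait, and scale-down sequence can be sketched as a list of `kubectl` invocations. This is an illustrative sketch under assumptions: `recovery_commands` is a hypothetical helper, the workload and namespace names are placeholders, and the replica counts and timeout are arbitrary defaults that you would adjust for your workload.

```python
def recovery_commands(workload, workload_namespace, stuck_namespace,
                      scaled_up=1, scaled_down=0, timeout="600s"):
    """Build the kubectl calls: scale up, wait for the deletion, scale down."""
    def scale(replicas):
        return ["kubectl", "scale", workload,
                "--namespace", workload_namespace,
                "--replicas={}".format(replicas)]

    return [
        scale(scaled_up),
        # Blocks until the namespace is gone or the timeout expires.
        ["kubectl", "wait", "--for=delete",
         "namespace/{}".format(stuck_namespace),
         "--timeout={}".format(timeout)],
        scale(scaled_down),
    ]


# Placeholder names; replace them with your own workload and namespace.
for command in recovery_commands("deployment/example-app", "default", "example-ns"):
    print(" ".join(command))
```

The function only builds the argument lists. To run them, you could pass each list to `subprocess.run(command, check=True)` from a machine where `kubectl` is configured for the cluster.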
Scaling issues
Node scale up failed: Pod is at risk of not being scheduled
The following issue occurs when serial port logging is disabled in your Google Cloud project. GKE Autopilot clusters require serial port logging to effectively debug node issues. If serial port logging is disabled, Autopilot can't provision nodes to run your workloads.
The error message in your Kubernetes event log is similar to the following:
LAST SEEN TYPE REASON OBJECT MESSAGE
12s Warning FailedScaleUp pod/pod-test-5b97f7c978-h9lvl Node scale up in zones associated with this pod failed: Internal error. Pod is at risk of not being scheduled
Serial port logging might be disabled at the organization level through an organization policy that enforces the compute.disableSerialPortLogging constraint. Serial port logging could also be disabled at the project or virtual machine (VM) instance level.
To resolve this issue, do the following:
- Ask your Google Cloud organization policy administrator to remove the compute.disableSerialPortLogging constraint in the project with your Autopilot cluster.
- If you don't have an organization policy that enforces this constraint, try to enable serial port logging in your project metadata. This action requires the compute.projects.setCommonInstanceMetadata IAM permission.
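To see what the project metadata currently says before you change it, you can inspect the JSON printed by `gcloud compute project-info describe --format=json`. The following is a minimal sketch: `serial_port_logging_setting` is a hypothetical helper, and it assumes that the setting is stored under the Compute Engine metadata key `serial-port-logging-enable` inside `commonInstanceMetadata.items`.

```python
import json


def serial_port_logging_setting(project_info):
    """Return the project-level serial-port-logging-enable value, or None if unset."""
    metadata = project_info.get("commonInstanceMetadata") or {}
    for item in metadata.get("items") or []:
        if item.get("key") == "serial-port-logging-enable":
            return item.get("value")
    return None


# Hypothetical excerpt of the project-info JSON, for illustration only.
sample = json.loads("""
{"commonInstanceMetadata": {"items": [
    {"key": "serial-port-logging-enable", "value": "false"}
]}}
""")
print(serial_port_logging_setting(sample))  # false
```

A result of `None` means that the key isn't set at the project level. The setting might still be controlled by an organization policy or by instance-level metadata.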
Node scale up failed: GCE out of resources
The following issue occurs when your workloads request more resources than are available to use in that Compute Engine region or zone. Your Pods might remain in the Pending state.
Check your Pod events:
kubectl events --for='pod/POD_NAME' --types=Warning
Replace POD_NAME with the name of the pending Pod, such as example-pod.
The output is similar to the following:
LAST SEEN TYPE REASON OBJECT Message
19m Warning FailedScheduling pod/example-pod gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
14m Warning FailedScheduling pod/example-pod gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
12m (x2 over 18m) Warning FailedScaleUp cluster-autoscaler Node scale up in zones us-central1-f associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.
34s (x3 over 17m) Warning FailedScaleUp cluster-autoscaler Node scale up in zones us-central1-b associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.
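When the events list is long, it can help to pull out only the zones that reported resource exhaustion. The following is a rough sketch, assuming the message wording shown in the sample output. `exhausted_zones` is a hypothetical helper, and the handling of several zones in one message (split on commas or spaces) is an assumption.

```python
import re

_PATTERN = re.compile(
    r"Node scale up in zones (.+?) associated with this pod failed: "
    r"GCE out of resources"
)


def exhausted_zones(events_text):
    """Return the sorted zones whose scale-up failed with 'GCE out of resources'."""
    zones = set()
    for match in _PATTERN.finditer(events_text):
        # Assumption: multiple zones are separated by commas or whitespace.
        zones.update(z for z in re.split(r"[,\s]+", match.group(1)) if z)
    return sorted(zones)


sample = (
    "12m (x2 over 18m) Warning FailedScaleUp cluster-autoscaler Node scale up in zones "
    "us-central1-f associated with this pod failed: GCE out of resources. "
    "Pod is at risk of not being scheduled.\n"
    "34s (x3 over 17m) Warning FailedScaleUp cluster-autoscaler Node scale up in zones "
    "us-central1-b associated with this pod failed: GCE out of resources. "
    "Pod is at risk of not being scheduled.\n"
)
print(exhausted_zones(sample))  # ['us-central1-b', 'us-central1-f']
```

The resulting list shows which zones to avoid when you choose a different location for the Pod.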
To resolve this issue, try the following:
- Deploy the Pod in a different region or zone. If your Pod has a zonal restriction such as a topology selector, remove the restriction if you can. For instructions, see Place GKE Pods in specific zones.
- Create a cluster in a different region and retry the deployment.
- Try using a different compute class. Compute classes that are backed by smaller Compute Engine machine types are more likely to have available resources. For example, the default machine type for Autopilot has the highest availability. For a list of compute classes and the corresponding machine types, see When to use specific compute classes.
- If you run GPU workloads, the requested GPU might not be available in your node location. Try deploying your workload in a different location or requesting a different type of GPU.
To avoid scale-up issues caused by resource availability in the future, consider the following approaches:
- Use Kubernetes PriorityClasses to consistently provision extra compute capacity in your cluster. For details, see Provision extra compute capacity for rapid Pod scaling.
- Use Compute Engine capacity reservations with the Performance or the Accelerator compute classes. For details, see Consume reserved zonal resources.
Nodes fail to scale up: Pod zonal resources exceeded
The following issue occurs when Autopilot doesn't provision new nodes for a Pod in a specific zone because a new node would violate resource limits.
The error message in your logs is similar to the following:
"napFailureReasons": [
  {
    "messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
    ...
This error refers to a noScaleUp event, where node auto-provisioning did not provision any node group for the Pod in the zone.
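If you export log entries as JSON, a small helper can list every failure reason without depending on the exact nesting of the entry. This is a minimal sketch: `nap_failure_message_ids` is a hypothetical helper, and the structure of the sample entry around `napFailureReasons` is invented for illustration.

```python
def nap_failure_message_ids(entry):
    """Recursively collect messageId values under any napFailureReasons key."""
    found = []
    if isinstance(entry, dict):
        for key, value in entry.items():
            if key == "napFailureReasons" and isinstance(value, list):
                found.extend(
                    reason["messageId"]
                    for reason in value
                    if isinstance(reason, dict) and "messageId" in reason
                )
            else:
                found.extend(nap_failure_message_ids(value))
    elif isinstance(entry, list):
        for item in entry:
            found.extend(nap_failure_message_ids(item))
    return found


# Hypothetical nesting; only the napFailureReasons fragment comes from the log.
sample = {"payload": {"groups": [{"napFailureReasons": [
    {"messageId": "no.scale.up.nap.pod.zonal.resources.exceeded"}
]}]}}
print(nap_failure_message_ids(sample))
# ['no.scale.up.nap.pod.zonal.resources.exceeded']
```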
If you encounter this error, confirm the following:
- Your Pods have sufficient memory and CPU.
- The Pod IP address CIDR range is large enough to support your anticipated maximum cluster size.
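The CIDR check is simple arithmetic: each node reserves a fixed-size slice of the Pod range, so the range size caps the node count. The following sketch assumes a per-node slice of /26 (64 addresses), which is an assumption for illustration and might not match your cluster. `max_nodes_for_pod_range` is a hypothetical helper.

```python
import ipaddress


def max_nodes_for_pod_range(pod_cidr, per_node_prefix=26):
    """Count how many per-node Pod ranges fit in the cluster Pod CIDR.

    per_node_prefix=26 is an assumption (64 addresses per node); check the
    value that your cluster actually uses before relying on the result.
    """
    network = ipaddress.ip_network(pod_cidr)
    if network.prefixlen > per_node_prefix:
        return 0
    return 2 ** (per_node_prefix - network.prefixlen)


print(max_nodes_for_pod_range("10.0.0.0/17"))  # 512
```

Under that assumption, a /17 Pod range supports at most 2^(26-17) = 512 nodes, so a cluster that needs more nodes than that would need a larger range.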
Workload issues
Workloads stuck with ephemeral storage error
GKE won't create Pods if your Pod ephemeral storage requests exceed the Autopilot maximum of 10GiB in GKE version 1.28.6-gke.1317000 and later.
To diagnose this issue, describe the workload controller, like the Deployment or the Job:
kubectl describe CONTROLLER_TYPE/CONTROLLER_NAME
Replace the following:
- CONTROLLER_TYPE: the type of workload controller, like replicaset or daemonset. For a list of controller types, see Workload management.
- CONTROLLER_NAME: the name of the stuck workload.
If the Pod is not created because of the ephemeral storage request exceeding the maximum, the output is similar to the following:
# lines omitted for clarity
Events:
{"[denied by autogke-pod-limit-constraints]":["Max ephemeral-storage requested by init containers for workload '' is higher than the Autopilot maximum of '10Gi'.","Total ephemeral-storage requested by containers for workload '' is higher than the Autopilot maximum of '10Gi'."]}
To resolve this issue, update your ephemeral storage requests so that the total ephemeral storage requested by workload containers and by containers that webhooks inject is less than or equal to the allowed maximum. For more information about the maximum, see Resource requests in Autopilot.
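To check a Pod spec against the limit before you deploy it, you can add up the requests yourself. The following is a minimal sketch that mirrors the two checks named in the error message (the largest init container request and the total across regular containers). The helper names are hypothetical, the quantity parser handles only common suffixes, and the 10Gi value is taken from the error message.

```python
_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}
AUTOPILOT_MAX = 10 * 2**30  # 10Gi, the maximum quoted in the error message


def to_bytes(quantity):
    """Convert a quantity such as '1Gi' or '500M' to bytes (common suffixes only)."""
    for suffix, factor in _FACTORS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(float(quantity))


def _requests(containers):
    """Return the ephemeral-storage request of each container, in bytes."""
    return [
        to_bytes(c.get("resources", {}).get("requests", {}).get("ephemeral-storage", "0"))
        for c in containers
    ]


def storage_violations(pod_spec):
    """Report which of the two checks from the error message would fail."""
    violations = []
    init = _requests(pod_spec.get("initContainers", []))
    if init and max(init) > AUTOPILOT_MAX:
        violations.append("init containers")
    if sum(_requests(pod_spec.get("containers", []))) > AUTOPILOT_MAX:
        violations.append("containers")
    return violations


# Hypothetical Pod spec: 6Gi + 5Gi exceeds the 10Gi total.
spec = {"containers": [
    {"name": "app", "resources": {"requests": {"ephemeral-storage": "6Gi"}}},
    {"name": "sidecar", "resources": {"requests": {"ephemeral-storage": "5Gi"}}},
]}
print(storage_violations(spec))  # ['containers']
```

Remember that containers injected by webhooks also count toward the total, so include them in the spec that you check.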
Pods stuck in Pending state
A Pod might get stuck in the Pending status if you select a specific node for your Pod to use, but the sum of resource requests in the Pod and in DaemonSets that must run on the node exceeds the maximum allocatable capacity of the node. In that case, the Pod remains unscheduled.
To avoid this issue, evaluate the sizes of your deployed workloads to ensure that they're within the supported maximum resource requests for Autopilot.
You can also try scheduling your DaemonSets before you schedule your regular workload Pods.
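The capacity comparison described in this section is a sum and a comparison per resource. The following sketch illustrates it with made-up numbers. `shortfalls` is a hypothetical helper, and the request and capacity values are invented, with CPU in millicores and memory in MiB.

```python
def shortfalls(pod_requests, daemonset_requests, allocatable):
    """Return {resource: amount over capacity} for a Pod plus per-node DaemonSet Pods."""
    over = {}
    for resource, capacity in allocatable.items():
        needed = pod_requests.get(resource, 0) + sum(
            ds.get(resource, 0) for ds in daemonset_requests
        )
        if needed > capacity:
            over[resource] = needed - capacity
    return over


# Hypothetical numbers: CPU in millicores, memory in MiB.
pod = {"cpu": 3500, "memory": 8192}
daemonsets = [{"cpu": 300, "memory": 256}, {"cpu": 200, "memory": 512}]
node = {"cpu": 3920, "memory": 12288}
print(shortfalls(pod, daemonsets, node))  # {'cpu': 80}
```

In this example the Pod alone fits on the node, but the Pod plus the two DaemonSet Pods needs 4000 millicores, which is 80 more than the node can allocate.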
Consistently unreliable workload performance on a specific node
In GKE version 1.24 and later, if your workloads on a specific node consistently experience disruptions, crashes, or similar unreliable behavior, you can tell GKE about the problematic node by draining it, which cordons the node and evicts its Pods. Use the following command:
kubectl drain NODE_NAME --ignore-daemonsets
Replace NODE_NAME with the name of the problematic node.
You can find the node name by running kubectl get nodes.
GKE does the following:
- Evicts existing workloads from the node and stops scheduling workloads on that node.
- Automatically recreates any evicted workloads that are managed by a controller, such as a Deployment or a StatefulSet, on other nodes.
- Terminates any workloads that remain on the node and repairs or recreates the node over time.
- If you use Autopilot, GKE shuts down and replaces the node immediately and ignores any configured PodDisruptionBudgets.
Pods take longer than expected to schedule on empty clusters
This event occurs when you deploy a workload to an Autopilot cluster that has no other workloads. Autopilot clusters start with zero usable nodes and scale to zero nodes if the cluster is empty to avoid having unutilized compute resources in the cluster. Deploying a workload in a cluster that has zero nodes triggers a scale-up event.
If you experience this, Autopilot is functioning as intended, and no action is necessary. Your workload will deploy as expected after the new nodes boot up.
Check whether your Pods are waiting for new nodes:
Describe your pending Pod:
kubectl describe pod POD_NAME
Replace POD_NAME with the name of your pending Pod.
Check the Events section of the output. If the Pod is waiting for new nodes, the output is similar to the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 11s gke.io/optimize-utilization-scheduler no nodes available to schedule pods
Normal TriggeredScaleUp 4s cluster-autoscaler pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/example-project/zones/example-zone/instanceGroups/gk3-example-cluster-pool-2-9293c6db-grp 0->1 (max: 1000)} {https://www.googleapis.com/compute/v1/projects/example-project/zones/example-zone/instanceGroups/gk3-example-cluster-pool-2-d99371e7-grp 0->1 (max: 1000)}]
The TriggeredScaleUp event shows that your cluster is scaling up from zero nodes to as many nodes as are required to run your deployed workload.
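The scale-up message packs the instance group, its current size, its target size, and its maximum into one line. The following sketch extracts those fields, assuming the message format shown in the sample output. `scale_up_targets` is a hypothetical helper.

```python
import re

_SCALE_UP = re.compile(r"instanceGroups/(\S+) (\d+)->(\d+) \(max: (\d+)\)")


def scale_up_targets(describe_output):
    """Return (instance group, current size, target size, max size) tuples."""
    return [
        (group, int(current), int(target), int(maximum))
        for group, current, target, maximum in _SCALE_UP.findall(describe_output)
    ]


message = (
    "pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/"
    "example-project/zones/example-zone/instanceGroups/"
    "gk3-example-cluster-pool-2-9293c6db-grp 0->1 (max: 1000)}]"
)
print(scale_up_targets(message))
# [('gk3-example-cluster-pool-2-9293c6db-grp', 0, 1, 1000)]
```

A current size of 0 with a higher target size matches the scale-from-zero case that this section describes.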
Error related to permission when trying to run tcpdump from a Pod in GKE Autopilot
Access to underlying nodes is prohibited in a GKE Autopilot cluster, so you must run the tcpdump utility from within a Pod and then copy the capture file to your local machine by using the kubectl cp command.
If you run the tcpdump utility from within a Pod in a GKE Autopilot cluster, you might see the following error:
tcpdump: eth0: You don't have permission to perform this capture on that device
(socket: Operation not permitted)
This happens because GKE Autopilot, by default, applies a security context to all Pods that drops the NET_RAW capability to mitigate potential vulnerabilities. For example:
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: tcpdump
  name: tcpdump
spec:
  containers:
  - image: nginx
    name: nginx
    resources:
      limits:
        cpu: 500m
        ephemeral-storage: 1Gi
        memory: 2Gi
      requests:
        cpu: 500m
        ephemeral-storage: 1Gi
        memory: 2Gi
    securityContext:
      capabilities:
        drop:
        - NET_RAW
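To confirm how a running Pod treats NET_RAW, you can inspect its manifest, for example the JSON from `kubectl get pod POD_NAME -o json`. The following is a minimal sketch: `net_raw_status` is a hypothetical helper, and the sample dictionary reproduces only the relevant part of the manifest above.

```python
def net_raw_status(pod):
    """Map each container name to 'added', 'dropped', or 'unspecified' for NET_RAW."""
    status = {}
    for container in pod.get("spec", {}).get("containers", []):
        caps = (container.get("securityContext") or {}).get("capabilities") or {}
        if "NET_RAW" in (caps.get("add") or []):
            status[container["name"]] = "added"
        elif "NET_RAW" in (caps.get("drop") or []):
            status[container["name"]] = "dropped"
        else:
            status[container["name"]] = "unspecified"
    return status


# Relevant part of the example manifest, loaded as a dictionary.
pod = {"spec": {"containers": [{
    "name": "nginx",
    "securityContext": {"capabilities": {"drop": ["NET_RAW"]}},
}]}}
print(net_raw_status(pod))  # {'nginx': 'dropped'}
```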
If your workload requires the NET_RAW capability, you can re-enable it:
Add the NET_RAW capability to the securityContext section of your Pod's YAML specification:
securityContext:
  capabilities:
    add:
    - NET_RAW
Run tcpdump from within a Pod:
tcpdump port 53 -w packetcap.pcap
tcpdump: listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
Use the kubectl cp command to copy the capture file to your local machine for further analysis:
kubectl cp POD_NAME:/PATH_TO_FILE/FILE_NAME /PATH_TO_FILE/FILE_NAME
You can also use kubectl exec to run the tcpdump command to perform network packet capture and redirect the output to a local file:
kubectl exec -it POD_NAME -- bash -c "tcpdump port 53 -w -" > packet-new.pcap