This page shows you how to resolve issues with Google Kubernetes Engine (GKE) Autopilot clusters.
If you need additional assistance, reach out to Cloud Customer Care.
Cluster issues
Cannot create a cluster: 0 nodes registered
The following issue occurs when you try to create an Autopilot cluster with an IAM service account that's disabled or doesn't have the required permissions. Cluster creation fails with the following error message:
All cluster resources were brought up, but: only 0 nodes out of 2 have registered.
To resolve the issue, do the following:
Check whether the default Compute Engine service account or the custom IAM service account that you want to use is disabled:
gcloud iam service-accounts describe SERVICE_ACCOUNT
Replace SERVICE_ACCOUNT with the service account email address, such as my-iam-account@my-first-project.iam.gserviceaccount.com.
If the service account is disabled, the output is similar to the following:
disabled: true
displayName: my-service-account
email: my-service-account@my-project.iam.gserviceaccount.com
...
If the service account is disabled, enable it:
gcloud iam service-accounts enable SERVICE_ACCOUNT
If the service account is enabled and the error persists, grant the service account the minimum permissions required for GKE:
gcloud projects add-iam-policy-binding PROJECT_ID \
--member "serviceAccount:SERVICE_ACCOUNT" \
--role roles/container.nodeServiceAccount
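The check-then-enable steps above can be combined into one helper. The following is a minimal sketch, not an official procedure: ensure_service_account_enabled is a hypothetical function name, and the sketch assumes that gcloud prints True for the disabled field when the account is disabled and prints nothing when the field is unset.

```shell
# Sketch: enable a service account only if it is currently disabled.
# ensure_service_account_enabled is a hypothetical helper, not a gcloud command.
ensure_service_account_enabled() {
  local sa="$1"
  local disabled
  # value(disabled) prints only that field; assumed empty when the field is unset.
  disabled="$(gcloud iam service-accounts describe "$sa" --format='value(disabled)')"
  case "$disabled" in
    [Tt]rue)
      gcloud iam service-accounts enable "$sa"
      ;;
    *)
      echo "Service account $sa is not disabled."
      ;;
  esac
}
```

Call the helper with the service account email address as its only argument. If the account is already enabled and the error persists, continue with the permission grant shown above.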
Namespace stuck in the Terminating state when cluster has 0 nodes
The following issue occurs when you delete a namespace in a cluster after the
cluster scales down to zero nodes. The metrics-server
component can't accept
the namespace deletion request because the component has zero replicas.
To diagnose this issue, run the following command:
kubectl describe ns/NAMESPACE_NAME
Replace NAMESPACE_NAME
with the name of the namespace.
The output is the following:
Discovery failed for some groups, 1 failing: unable to retrieve the complete
list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to
handle the request
To resolve this issue, scale any workload up to trigger GKE to create a new node. When the node is ready, the namespace deletion request automatically completes. After GKE deletes the namespace, scale the workload back down.
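As a rough illustration of this workaround, the following sketch scales a Deployment up, waits for the stuck namespace to disappear, and then scales the Deployment back down. The function name unblock_namespace_deletion, the 600-second timeout, and the assumption that the Deployment normally runs with zero replicas are all illustrative choices, not GKE requirements.

```shell
# Sketch: trigger a node scale-up so that a pending namespace deletion can finish.
# Arguments (all hypothetical names supplied by the caller):
#   $1 = namespace stuck in Terminating
#   $2 = Deployment to scale up temporarily
#   $3 = namespace that contains the Deployment
unblock_namespace_deletion() {
  local stuck_ns="$1" deploy="$2" deploy_ns="$3"
  # Scaling up a workload prompts GKE to create a new node.
  kubectl scale deployment "$deploy" --namespace "$deploy_ns" --replicas=1 &&
    # Block until the namespace object is gone, or give up after 10 minutes.
    kubectl wait --for=delete "namespace/$stuck_ns" --timeout=600s &&
    # Assumption: the workload originally had zero replicas.
    kubectl scale deployment "$deploy" --namespace "$deploy_ns" --replicas=0
}
```

If your workload normally runs with a different replica count, adjust the final scale command to restore that count instead.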
Scaling issues
Node scale up failed: Pod is at risk of not being scheduled
The following issue occurs when serial port logging is disabled in your Google Cloud project. GKE Autopilot clusters require serial port logging to effectively debug node issues. If serial port logging is disabled, Autopilot can't provision nodes to run your workloads.
The error message in your Kubernetes event log is similar to the following:
LAST SEEN TYPE REASON OBJECT MESSAGE
12s Warning FailedScaleUp pod/pod-test-5b97f7c978-h9lvl Node scale up in zones associated with this pod failed: Internal error. Pod is at risk of not being scheduled
Serial port logging might be disabled at the organization level through an
organization policy that enforces the compute.disableSerialPortLogging
constraint. Serial port logging could also be disabled at the project or virtual
machine (VM) instance level.
To resolve this issue, do the following:
- Ask your Google Cloud organization policy administrator to remove the compute.disableSerialPortLogging constraint in the project with your Autopilot cluster.
- If you don't have an organization policy that enforces this constraint, try to enable serial port logging in your project metadata. This action requires the compute.projects.setCommonInstanceMetadata IAM permission.
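The following sketch shows one way that these two checks might look with the gcloud CLI. It assumes that the serial-port-logging-enable metadata key controls serial port logging for the project and that you have permission to read the effective organization policy. The function name enable_serial_port_logging is hypothetical.

```shell
# Sketch: inspect the effective org policy, then enable serial port logging
# in project-wide metadata. enable_serial_port_logging is a hypothetical helper.
enable_serial_port_logging() {
  local project="$1"
  # Show whether compute.disableSerialPortLogging is enforced for this project.
  gcloud resource-manager org-policies describe compute.disableSerialPortLogging \
    --project="$project" --effective
  # Assumption: this metadata key turns serial port logging on for the project.
  gcloud compute project-info add-metadata \
    --project="$project" --metadata=serial-port-logging-enable=true
}
```

If the constraint is enforced, the metadata update is unlikely to help until your administrator removes the constraint.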
Node scale up failed: GCE out of resources
The following issue occurs when your workloads request more resources than are
available to use in that Compute Engine region or zone. Your Pods might remain
in the Pending
state.
Check your Pod events:
kubectl events --for='RESOURCE_NAME' --types=Warning
Replace RESOURCE_NAME with the name of the pending Kubernetes resource, for example pod/example-pod.
The output is similar to the following:
LAST SEEN TYPE REASON OBJECT Message
19m Warning FailedScheduling pod/example-pod gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
14m Warning FailedScheduling pod/example-pod gke.io/optimize-utilization-scheduler 0/2 nodes are available: 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling.
12m (x2 over 18m) Warning FailedScaleUp cluster-autoscaler Node scale up in zones us-central1-f associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.
34s (x3 over 17m) Warning FailedScaleUp cluster-autoscaler Node scale up in zones us-central1-b associated with this pod failed: GCE out of resources. Pod is at risk of not being scheduled.
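When many events accumulate, it can help to summarize which zones reported the error. The following sketch is a small text filter. The function name stocked_out_zones is hypothetical, and the pattern assumes that the event message wording matches the example output above.

```shell
# Sketch: read event lines on stdin and print each zone list that reported
# "GCE out of resources", one per line, de-duplicated.
# stocked_out_zones is a hypothetical helper; it relies on the message wording.
stocked_out_zones() {
  grep 'GCE out of resources' |
    sed -n 's/.*Node scale up in zones \(.*\) associated with this pod failed.*/\1/p' |
    sort -u
}
```

To use the filter, pipe the Warning events for your pending resource into stocked_out_zones. The resulting list shows which zones to avoid when you choose a different location.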
To resolve this issue, try the following:
- Deploy the Pod in a different region or zone. If your Pod has a zonal restriction such as a topology selector, remove the restriction if you can. For instructions, see Place GKE Pods in specific zones.
- Create a cluster in a different region and retry the deployment.
- Try using a different compute class. Compute classes that are backed by smaller Compute Engine machine types are more likely to have available resources. For example, the default machine type for Autopilot has the highest availability. For a list of compute classes and the corresponding machine types, see When to use specific compute classes.
- If you run GPU workloads, the requested GPU might not be available in your node location. Try deploying your workload in a different location or requesting a different type of GPU.
To avoid scale-up issues caused by resource availability in the future, consider the following approaches:
- Use Kubernetes PriorityClasses to consistently provision extra compute capacity in your cluster. For details, see Provision extra compute capacity for rapid Pod scaling.
- Use Compute Engine capacity reservations with the Performance or the Accelerator compute classes. For details, see Consume reserved zonal resources.
Nodes fail to scale up: Pod zonal resources exceeded
The following issue occurs when Autopilot doesn't provision new nodes for a Pod in a specific zone because a new node would violate resource limits.
The error message in your logs is similar to the following:
"napFailureReasons": [
{
"messageId": "no.scale.up.nap.pod.zonal.resources.exceeded",
...
This error refers to a noScaleUp
event, where node auto-provisioning did not provision any node group for the Pod in the zone.
If you encounter this error, confirm the following:
- Your Pods have sufficient memory and CPU.
- The Pod IP address CIDR range is large enough to support your anticipated maximum cluster size.
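To sanity-check the second item, you can estimate how many nodes a Pod IP address range supports. The sketch below assumes that each node receives its own fixed-size slice of the Pod range and that you know both prefix lengths. The /26 per-node prefix in the usage note is an illustrative assumption, so check your cluster's actual values. The function name max_nodes_for_pod_range is hypothetical.

```shell
# Sketch: estimate how many nodes fit in a Pod CIDR range.
#   $1 = prefix length of the whole Pod range (for example, 17 for a /17)
#   $2 = prefix length of the slice given to each node (assumption: fixed size)
# max_nodes_for_pod_range is a hypothetical helper.
max_nodes_for_pod_range() {
  local range_prefix="$1" node_prefix="$2"
  # A per-node slice larger than the whole range cannot fit even once.
  if [ "$node_prefix" -lt "$range_prefix" ]; then
    echo 0
    return
  fi
  # Each extra prefix bit doubles the number of slices in the range.
  echo $(( 1 << ($node_prefix - $range_prefix) ))
}
```

For example, max_nodes_for_pod_range 17 26 prints 512, which means a /17 Pod range holds 512 per-node /26 slices. If your anticipated maximum cluster size is larger than the result, the range is too small.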
Workload issues
Pods stuck in Pending state
A Pod might get stuck in the Pending status if you select a specific node for your Pod to use, but the sum of resource requests in the Pod and in DaemonSets that must run on the node exceeds the maximum allocatable capacity of the node. In that case, the Pod can't fit on the node and remains unscheduled.
To avoid this issue, evaluate the sizes of your deployed workloads to ensure that they're within the supported maximum resource requests for Autopilot.
You can also try scheduling your DaemonSets before you schedule your regular workload Pods.
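The capacity check described in this section is simple arithmetic. The following sketch compares CPU requests in millicores against a node's allocatable CPU. The function name fits_on_node and the sample numbers are hypothetical. You supply the values yourself, for example from your Pod and DaemonSet manifests and from the node's reported allocatable resources.

```shell
# Sketch: check whether a Pod plus the node's DaemonSets fit in allocatable CPU.
# All values are in millicores. fits_on_node is a hypothetical helper.
#   $1 = node allocatable CPU
#   $2 = CPU requested by the Pod
#   $3 = total CPU requested by DaemonSets that must run on the node
fits_on_node() {
  local allocatable_m="$1" pod_m="$2" daemonsets_m="$3"
  local total_m=$(( $pod_m + $daemonsets_m ))
  if [ "$total_m" -le "$allocatable_m" ]; then
    echo "fits with $(( $allocatable_m - $total_m ))m to spare"
  else
    echo "exceeds allocatable by $(( $total_m - $allocatable_m ))m"
  fi
}
```

For example, fits_on_node 3920 3500 500 prints "exceeds allocatable by 80m", which corresponds to a Pod that would stay Pending. The same comparison applies to memory if you convert all values to a common unit first.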
Pods stuck during termination or creation
A known issue causes Pods to occasionally become stuck in one of the following states:
- Terminating
- CreateContainerError
This issue has a small chance of occurring when you use burstable Pods in GKE environments that meet all of the following conditions:
- Your node GKE version is 1.29.2-gke.1060000 or later
- Your Pod uses one of the following compute classes:
  - The default general-purpose compute class
  - The Balanced compute class
  - The Scale-Out compute class
To mitigate this issue, we temporarily disabled bursting in GKE Autopilot clusters that were created or upgraded to version 1.29.2-gke.1060000 and later on or after April 24, 2024. Clusters that enabled bursting prior to April 24, 2024 continue to support bursting.
If your Pods are already stuck in the Terminating status or the CreateContainerError status, do the following:
Describe the stuck Pod:
kubectl describe pod POD_NAME
Replace POD_NAME with the name of the stuck Pod.
If the Pod is stuck because of this issue, the Events field in the output won't display an event that explains the Terminating or CreateContainerError state, like in the following example output:
# Fields omitted for readability
Containers:
  startup-script:
    State:          Waiting
      Reason:       CreateContainerError
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Sun, 14 Apr 2024 20:04:08 +0000
      Finished:     Sun, 14 Apr 2024 20:04:17 +0000
    Ready:          False
# Fields omitted for readability
Events:
  Type    Reason   Age                      From     Message
  ----    ------   ----                     ----     -------
  Normal  Pulling  2m49s (x12236 over 46h)  kubelet  Pulling image "gcr.io/google-containers/startup-script:v1"
Drain the affected nodes using the steps in the Consistently unreliable workload performance on a specific node section.
To request an exemption so that you can use bursting in affected GKE versions, or to disable bursting in a cluster that still supports bursting, contact Cloud Customer Care.
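To find candidate Pods across the cluster before you describe them one by one, you could filter the Pod list by status. The following sketch assumes that the STATUS column is the fourth column when you list Pods in all namespaces without headers. The function name list_stuck_pods is hypothetical.

```shell
# Sketch: list Pods whose STATUS column is Terminating or CreateContainerError.
# Output format: NAMESPACE/NAME STATUS
# list_stuck_pods is a hypothetical helper.
list_stuck_pods() {
  kubectl get pods --all-namespaces --no-headers |
    awk '$4 == "Terminating" || $4 == "CreateContainerError" { print $1 "/" $2, $4 }'
}
```

A Pod that appears in this list isn't necessarily affected by the known issue. Describe each Pod and check its events as explained above to confirm.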
Consistently unreliable workload performance on a specific node
In GKE version 1.24 and later, if your workloads on a specific node consistently experience disruptions, crashes, or similar unreliable behavior, you can tell GKE about the problematic node by draining it, which also cordons the node. Use the following command:
kubectl drain NODE_NAME --ignore-daemonsets
Replace NODE_NAME
with the name of the problematic node.
You can find the node name by running kubectl get nodes.
GKE does the following:
- Evicts existing workloads from the node and stops scheduling workloads on that node.
- Automatically recreates any evicted workloads that are managed by a controller, such as a Deployment or a StatefulSet, on other nodes.
- Terminates any workloads that remain on the node and repairs or recreates the node over time.
- If you use Autopilot, GKE shuts down and replaces the node immediately and ignores any configured PodDisruptionBudgets.
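If you know the affected Pod but not the node that it runs on, you can look up the node from the Pod spec and then drain it. The following is a minimal sketch. The function name drain_node_of_pod is hypothetical, and the namespace defaults to default when you omit the second argument.

```shell
# Sketch: drain the node that hosts a given Pod.
#   $1 = Pod name
#   $2 = Pod namespace (optional, defaults to "default")
# drain_node_of_pod is a hypothetical helper.
drain_node_of_pod() {
  local pod="$1" ns="${2:-default}"
  local node
  # .spec.nodeName is empty until the Pod is assigned to a node.
  node="$(kubectl get pod "$pod" --namespace "$ns" -o jsonpath='{.spec.nodeName}')"
  if [ -z "$node" ]; then
    echo "Pod $ns/$pod is not assigned to a node." >&2
    return 1
  fi
  kubectl drain "$node" --ignore-daemonsets
}
```

Draining evicts every workload on that node, not only the Pod that you named, so confirm that the node is the problem before you run the helper.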
Pods take longer than expected to schedule on empty clusters
This event occurs when you deploy a workload to an Autopilot cluster that has no other workloads. Autopilot clusters start with zero usable nodes and scale to zero nodes if the cluster is empty to avoid having unutilized compute resources in the cluster. Deploying a workload in a cluster that has zero nodes triggers a scale-up event.
If you experience this, Autopilot is functioning as intended, and no action is necessary. Your workload will deploy as expected after the new nodes boot up.
Check whether your Pods are waiting for new nodes:
Describe your pending Pod:
kubectl describe pod POD_NAME
Replace POD_NAME with the name of your pending Pod.
Check the Events section of the output. If the Pod is waiting for new nodes, the output is similar to the following:
Events:
  Type     Reason            Age   From                                   Message
  ----     ------            ----  ----                                   -------
  Warning  FailedScheduling  11s   gke.io/optimize-utilization-scheduler  no nodes available to schedule pods
  Normal   TriggeredScaleUp  4s    cluster-autoscaler                     pod triggered scale-up: [{https://www.googleapis.com/compute/v1/projects/example-project/zones/example-zone/instanceGroups/gk3-example-cluster-pool-2-9293c6db-grp 0->1 (max: 1000)} {https://www.googleapis.com/compute/v1/projects/example-project/zones/example-zone/instanceGroups/gk3-example-cluster-pool-2-d99371e7-grp 0->1 (max: 1000)}]
The TriggeredScaleUp event shows that your cluster is scaling up from zero nodes to as many nodes as are required to run your deployed workload.
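As an alternative to reading the full Pod description, you could query for the scale-up event directly. The following sketch uses an event field selector. The function name pod_triggered_scale_up and the two status strings that it prints are hypothetical.

```shell
# Sketch: report whether a Pod has a TriggeredScaleUp event.
#   $1 = Pod name
#   $2 = Pod namespace (optional, defaults to "default")
# pod_triggered_scale_up is a hypothetical helper.
pod_triggered_scale_up() {
  local pod="$1" ns="${2:-default}"
  local events
  # Select only events for this Pod with the TriggeredScaleUp reason.
  events="$(kubectl get events --namespace "$ns" --no-headers \
    --field-selector "involvedObject.name=$pod,reason=TriggeredScaleUp" 2>/dev/null)"
  if [ -n "$events" ]; then
    echo "waiting for new nodes"
  else
    echo "no scale-up event found"
  fi
}
```

Kubernetes retains events for a limited time, so an empty result for an older Pod doesn't prove that no scale-up occurred.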