This page shows you how to resolve issues with the Kubernetes API server (`kube-apiserver`) for Google Distributed Cloud.
Webhook timeouts and failed webhook calls
These errors might be seen in a few different ways. If you experience any of the following symptoms, it's possible that webhook calls are failing:
Connection refused: If `kube-apiserver` can't connect to the webhook, the following error is reported in the logs:

```
failed calling webhook "server.system.private.gdc.goog": failed to call webhook: Post "https://root-admin-webhook.gpc-system.svc:443/mutate-system-private-gdc-goog-v1alpha1-server?timeout=10s": dial tcp 10.202.1.18:443: connect: connection refused
```
Context deadline exceeded: You might also see the following error reported in the logs:

```
failed calling webhook "namespaces.hnc.x-k8s.io": failed to call webhook: Post "https://hnc-webhook-service.hnc-system.svc:443/validate-v1-namespace?timeout=10s": context deadline exceeded
```
If you think that you are experiencing webhook timeouts or failed webhook calls, use one of the following methods to confirm the issue:
Check the API server log for network issues:
- Check the log for network-related errors like `TLS handshake error`.
- Check whether the IP address and port match what the API server is configured to respond on.
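If the log points at a refused connection like the earlier example, you can check whether the webhook's backing Service has healthy endpoints. This is a sketch; the Service name and namespace (`root-admin-webhook`, `gpc-system`) are taken from the example error message and might differ in your cluster:

```shell
# Confirm that the Service backing the failing webhook exists.
kubectl get svc root-admin-webhook -n gpc-system

# A Service with no ready endpoints explains "connection refused".
kubectl get endpoints root-admin-webhook -n gpc-system
```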
Monitor webhook latency with the following steps:
In the console, go to the Cloud Monitoring page.
Select Metrics explorer.
Select the `apiserver_admission_webhook_admission_duration_seconds` metric.
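Because this metric is a histogram, you can chart a latency percentile per webhook. A sketch in PromQL, assuming the `_bucket` series is exposed under the same name in your monitoring backend:

```promql
# 99th percentile admission latency per webhook over the last 5 minutes
histogram_quantile(0.99,
  sum by (name, le) (
    rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])
  )
)
```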
To resolve this issue, review the following suggestions:
Additional firewall rules might be required for the webhook. For more information, see how to add firewall rules for specific use cases.
If the webhook requires more time to complete, you can configure a custom timeout value. Webhook latency adds to API request latency, so the webhook should execute as quickly as possible.
If the webhook error blocks cluster availability, or the webhook is harmless to remove and removing it mitigates the situation, check if it's possible to temporarily set the `failurePolicy` to `Ignore` or remove the offending webhook.
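Both the timeout and the failure policy are set in the webhook configuration object. The following fragment is a sketch; the configuration and webhook names are hypothetical:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-webhook            # hypothetical name
webhooks:
- name: example.webhook.gdc.goog   # hypothetical webhook name
  timeoutSeconds: 30               # custom timeout; allowed range is 1-30 seconds, default is 10
  failurePolicy: Ignore            # temporary mitigation: allow requests when the webhook fails
  # clientConfig, rules, and other required fields omitted
```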
API server dial failure or latency
This error might be seen in a few different ways:
External name resolution errors: An external client might return errors that contain `lookup` in the message, such as:

```
dial tcp: lookup kubernetes.example.com on 127.0.0.1:53: no such host
```
This error doesn't apply to a client running within the cluster, because the Kubernetes Service IP is injected and no name resolution is required.
Network errors: The client might print a generic network error when trying to dial the API server, like the following examples:

```
dial tcp 10.96.0.1:443: connect: no route to host
dial tcp 10.96.0.1:443: connect: connection refused
dial tcp 10.96.0.1:443: connect: i/o timeout
```
High latency connecting to API server: The connection to the API server might be successful, but the requests time out on the client side. In this scenario, the client usually prints error messages containing `context deadline exceeded`.
If the connection to the API server fails completely, try the connection within the same environment where the client reports the error. Kubernetes ephemeral containers can be used to inject a debugging container into the existing namespaces as follows:
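For example, `kubectl debug` can attach an ephemeral container with networking tools to the Pod where the failure occurs. The Pod name and image here are placeholders:

```shell
# Attach an ephemeral debug container to the affected Pod and open a shell.
kubectl debug -it failing-client-pod --image=curlimages/curl -- sh

# From inside the debug container, retry the connection to the API server.
curl -vk https://10.96.0.1:443/healthz
```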
From where the problematic client runs, use `kubectl` to perform a request with high verbosity. For example, a `GET` request to `/healthz` usually requires no authentication:

```shell
kubectl get -v999 --raw /healthz
```
If the request fails or `kubectl` is unavailable, you can obtain the URL from the output and manually perform the request with `curl`. For example, if the service host obtained from the previous output was `https://192.0.2.1:36917/`, you can send a similar request as follows:

```shell
# Replace "--cacert /path/to/ca.pem" with "--insecure" if you are accessing
# a local cluster and you trust that the connection can't be tampered with.
# The output is always "ok" and thus contains no sensitive information.
curl -v --cacert /path/to/ca.pem https://192.0.2.1:36917/healthz
```
The output from this command usually indicates the root cause of a failed connection.
If the connection is successful but is slow or times out, it indicates an overloaded API server. To confirm, in the console look at the `API Server Request Rate` and request latency metrics in `Cloud Kubernetes > Anthos > Cluster > K8s Control Plane`.
To resolve these connection failures or latency problems, review the following remediation options:
If a network error occurs within the cluster, there might be a problem with the Container Network Interface (CNI) plugin. This problem is usually transient and resolves itself after a Pod is recreated or rescheduled.
If the network error is from outside the cluster, check if the client is properly configured to access the cluster, or generate the client configuration again. If the connection goes through a proxy or gateway, check if another connection that goes through the same mechanism works.
If the API server is overloaded, it usually means that many clients access the API server at the same time. A single client can't overload an API server due to throttling and the Priority and Fairness feature. Review the workload for the following areas:
- Workloads that operate at the Pod level: it's more common to mistakenly create and forget Pods than higher-level resources.
- Replica counts that were adjusted through an erroneous calculation.
- A webhook that loops requests back to itself or amplifies the load by creating more requests than it handles.
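On Kubernetes v1.20 and later, the API Priority and Fairness debug endpoints can help identify which clients are consuming API server capacity. A sketch, assuming you have permission on these non-resource URLs:

```shell
# Show priority levels and their current concurrency usage.
kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels

# Show queued requests per flow to identify noisy clients.
kubectl get --raw /debug/api_priority_and_fairness/dump_queues
```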