This page shows you how to resolve issues with syncing configs to your cluster.
Troubleshoot KNV 2009 errors
KNV2009 errors indicate that Config Sync failed to sync some configs to the cluster. The following sections explain some of the most common causes and how to resolve them.
Operation on certain resources is forbidden
Because RepoSync objects must be explicitly granted RBAC permissions, they might be missing the permissions needed to apply resources.
You can verify that the permissions are missing by getting the RepoSync resource status:
kubectl get reposync repo-sync -n NAMESPACE -o yaml
Replace NAMESPACE with the namespace that you created your namespace repository in.
You can also use the nomos status command.
If you see a message similar to the following in the status, it means that the reconciler in NAMESPACE lacks the permission needed to apply the resource:
KNV2009: deployments.apps "nginx-deployment" is forbidden: User "system:serviceaccount:config-management-system:ns-reconciler-default" cannot get resource "deployments" in API group "apps" in the namespace "default"
To fix this issue, you need to declare a RoleBinding configuration that grants the service account for the reconciler permission to manage the failed resource in that namespace. Details on how to add a RoleBinding are included in Configure syncing from multiple repositories.
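For example, a RoleBinding similar to the following would grant the reconciler service account from the preceding error message permission to manage resources in the default namespace. This is a sketch: the binding name is illustrative and the built-in edit ClusterRole is used only as an example, so adjust the role, service account name, and namespace to match your environment.
# Sketch: grant the ns-reconciler-default service account permission to
# manage resources in the default namespace. Adjust the names and the
# referenced role to match your RepoSync and the resources it applies.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ns-reconciler-default-edit   # illustrative name
  namespace: default
subjects:
- kind: ServiceAccount
  name: ns-reconciler-default
  namespace: config-management-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit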
This issue can also affect RootSync objects if you've used spec.override.roleRefs to change the roles granted to the RootSync object. If you haven't set this field, RootSync objects are granted the cluster-admin role by default.
ResourceGroup object exceeds the etcd object size limit
If you receive the following error when a reconciler tries to apply configurations to the cluster, the ResourceGroup object exceeds the etcd object size limit:
KNV2009: too many declared resources causing ResourceGroup.kpt.dev, config-management-system/root-sync failed to be applied: task failed (action: "Inventory", name: "inventory-add-0"): Request entity too large: limit is 3145728. To fix, split the resources into multiple repositories.
We recommend that you split your Git repository into multiple repositories. If you're not able to break up the Git repository, because the object is already too big and changes are not being persisted, you can mitigate the issue by configuring the RootSync or RepoSync object to temporarily disable writing object status to the ResourceGroup. To do this, set the .spec.override.statusMode field of the RootSync or RepoSync object to disabled. Config Sync then stops updating the managed resource status in the ResourceGroup object, which reduces the size of the ResourceGroup object. However, you cannot view the status for managed resources from either nomos status or gcloud alpha anthos config sync.
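For example, assuming the default RootSync object named root-sync, a patch similar to the following temporarily disables status writes. If you manage the RootSync or RepoSync object declaratively, edit its manifest in your source of truth instead.
kubectl patch rootsync root-sync -n config-management-system --type merge -p '{"spec":{"override":{"statusMode":"disabled"}}}'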
If you don't see any error from the RootSync or RepoSync object, then the objects from your source of truth have been synced to the cluster. To check if the ResourceGroup resource exceeds the etcd object size limit, check both the ResourceGroup resource status and the log of the ResourceGroup controller:
- Check the ResourceGroup status:
  - To check the RootSync object, run the following command:
    kubectl get resourcegroup root-sync -n config-management-system
  - To check the RepoSync object, run the following command:
    kubectl get resourcegroup repo-sync -n NAMESPACE
    Replace NAMESPACE with the namespace that you created your namespace repository in.
  The output is similar to the following example:
  NAME        RECONCILING   STALLED   AGE
  root-sync   True          False     35m
  If the value in the RECONCILING column is True, it means that the ResourceGroup resource is still reconciling.
- Check the logs for the ResourceGroup controller:
  kubectl logs deployment/resource-group-controller-manager -c manager -n resource-group-system
  If you see an error similar to the following example in the output, the ResourceGroup resource is too large and exceeds the etcd object size limit:
  "error":"etcdserver: request is too large"
To prevent the ResourceGroup from getting too large, reduce the number of resources in your Git repository. You can split one root repository into multiple root repositories.
Dependency apply reconcile timeout
If you were syncing objects with dependencies, you might receive an error similar to the following example when the reconciler tries to apply objects with the config.kubernetes.io/depends-on annotation to the cluster:
KNV2009: skipped apply of Pod, bookstore/pod4: dependency apply reconcile timeout: bookstore_pod3__Pod For more information, see https://g.co/cloud/acm-errors#knv2009
This error means the dependency object did not reconcile within the default reconcile timeout of five minutes. Config Sync cannot apply the dependent object because, with the config.kubernetes.io/depends-on annotation, Config Sync only applies objects after their dependencies have reconciled. You can override the default reconcile timeout by setting spec.override.reconcileTimeout to a longer duration.
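For example, the following sketch raises the reconcile timeout to 10 minutes on the default RootSync object. Only the override field is shown; keep your existing spec settings, such as spec.git, unchanged.
# Sketch: only the relevant override is shown; the 10-minute value is
# illustrative. The same field is available on RepoSync objects.
apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  override:
    reconcileTimeout: 10m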
It's also possible that the dependency might reconcile after the initial sync attempt has completed. In this case, the dependency should be detected as reconciled on the next sync retry attempt, unblocking the apply of any dependents. When this happens, the error may be reported briefly and then removed. Lengthening the reconcile timeout might help avoid the error being reported intermittently.
Inventory info is nil
If you receive the following error when the reconciler tries to apply configurations to the cluster, it's likely that your inventory has no resources or the manifest has an unmanaged annotation:
KNV2009: inventory info is nil\n\nFor more information, see https://g.co/cloud/acm-errors#knv2009
To resolve this issue, try the following steps:
- Avoid setting up syncs where all resources have the configmanagement.gke.io/managed: disabled annotation, by ensuring that at least one resource is managed by Config Sync.
- Add the configmanagement.gke.io/managed: disabled annotation only after completing an initial sync of the resource without this annotation, as shown in the example that follows this list.
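For example, the following manifest shows where the annotation goes once the resource has already completed an initial sync without it. The ConfigMap name and data are illustrative.
# Illustrative only: add the annotation after the resource has completed an
# initial sync without it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-config        # hypothetical name
  namespace: default
  annotations:
    configmanagement.gke.io/managed: disabled
data:
  key: value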
Multiple inventory object templates
If you receive the following error when the reconciler tries to apply configurations to the cluster, it is likely that you have an inventory config generated by kpt in the source of truth, for example a Git repository:
KNV2009: Package has multiple inventory object templates. The package should have one and only one inventory object template. For more information, see https://g.co/cloud/acm-errors#knv2009
The issue happens because Config Sync manages its own inventory config. To resolve this issue, delete the inventory config in your source of truth.
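If you're not sure where that inventory config lives, you can search a local checkout of the repository for ResourceGroup manifests or Kptfile inventory sections, for example:
# Find ResourceGroup manifests generated by kpt (this assumes a typical
# repository layout).
grep -rl "kind: ResourceGroup" .
# Find inventory sections declared in Kptfiles.
grep -rl "inventory:" . --include=Kptfile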
Cannot make changes to immutable fields
You can't change any immutable field in a config by changing the value in the source of truth. Attempting such a change causes an error similar to the following:
KNV2009: failed to apply RESOURCE: admission webhook "deny-immutable-field-updates.cnrm.cloud.google.com" denied the request: cannot make changes to immutable field(s):
If you need to update an immutable field, manually delete the object in the cluster. Config Sync can then re-create the object with the new field value.
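For example, replacing the placeholders with the kind, name, and namespace of the object from the error message:
kubectl delete RESOURCE_TYPE RESOURCE_NAME -n NAMESPACE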
API discovery failed
If you see an error message similar to the following, you might be experiencing an API discovery error:
KNV2002: API discovery failed: APIServer error: unable to retrieve the complete list of server APIs: external.metrics.k8s.io/v1beta1: received empty response for: external.metrics.k8s.io/v1beta1
Config Sync uses Kubernetes API discovery to look up which resources are supported by the cluster. This lets Config Sync validate the resource types specified in your source and watch those resources for changes in the cluster.
Prior to Kubernetes version 1.28, any time any APIService backend was unhealthy or returned an empty list result, API Discovery would fail, causing Config Sync and multiple other Kubernetes components to error. Many common APIService backends are not highly available, so this can happen relatively frequently, just by updating the backend or having it be rescheduled onto another node.
Examples of APIService backends with a single replica include metrics-server and custom-metrics-stackdriver-adapter. Some APIService backends always return empty list results, like custom-metrics-stackdriver-adapter. Another common cause of API discovery failure is unhealthy webhooks.
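To identify which backend is failing, you can list the APIService objects and check their availability:
# List aggregated APIServices; unhealthy backends report False in the
# AVAILABLE column, with a reason such as MissingEndpoints or
# FailedDiscoveryCheck.
kubectl get apiservices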
In Kubernetes version 1.28 and later, with the Aggregated Discovery feature enabled, an unhealthy APIService backend no longer causes unhandled errors. Instead, the resource group handled by that APIService is shown as having no resources. This lets syncing continue, as long as the unhealthy resource isn't specified in your source.
Delayed self-healing
Self-healing watches managed resources, detects drift from the source of truth, and reverts that drift.
Self-healing is paused while syncing is being attempted. This behavior means that self-healing might be delayed, especially if there are sync errors preventing the reconciler from completing. To re-enable self-healing, fix all reported sync errors.
High number of Kubernetes API requests
Config Sync uses a multi-instancing strategy to scale and isolate tenants and fault domains. Because of this, each RootSync and RepoSync gets its own reconciler instance. In addition to syncing every time changes are made to the source, each reconciler instance also syncs periodically as part of its self-healing behavior, to revert any changes missed by the active drift remediation. When you add RootSync or RepoSync objects, this causes a linear increase in the number of API requests made by the reconcilers syncing resources to Kubernetes. So if you have many RootSync and RepoSync objects, this can sometimes cause significant traffic load on the Kubernetes API.
To perform syncing, Config Sync uses Server-Side Apply. This replaces the normal GET and PATCH request flow with a single PATCH request, reducing the total number of API calls, but increasing the number of PATCH calls. This ensures that the changes made are correct, even when the resource group version in source doesn't match the default resource group version on the cluster. However, you may see PATCH requests in the audit log, even when there hasn't been any change to the source or drift from the state that you want. This is normal, but can be surprising.
When syncing fails, it's retried until it succeeds. However, if fixing the error requires human intervention, Config Sync might keep erroring and retrying for a while, increasing the number of requests made to the Kubernetes API. The retries back off exponentially, but if many RootSync or RepoSync objects are failing to sync simultaneously, this can cause significant traffic load on the Kubernetes API.
To mitigate these issues, try one of the following options:
- Fix configuration errors quickly, so they don't pile up.
- Combine multiple RootSync or RepoSync objects, to reduce the number of reconcilers making Kubernetes API requests.
KubeVirt uninstall blocked by finalizers
KubeVirt is a Kubernetes package that uses multiple finalizers, requiring precise deletion ordering to facilitate cleanup. If the KubeVirt objects are deleted in the wrong order, deletion of other KubeVirt objects might stall or stop responding indefinitely.
If you tried to uninstall KubeVirt and it became blocked, follow the instructions for manually deleting KubeVirt.
To mitigate this issue in the future, declare dependencies between resource objects to ensure they are deleted in reverse dependency order.
Object deletion blocked by finalizers
Kubernetes finalizers are metadata entries that tell Kubernetes not to allow an object to be removed until after a specific controller has performed cleanup. This can cause syncing or reconciling to fail, if the conditions for cleanup are not satisfied or the controller that performs the cleanup for that resource is unhealthy or has been deleted.
To mitigate this problem, identify which resource is still finalizing and which controller should be performing the cleanup.
If the controller is unhealthy, fixing the root cause should allow the resource cleanup to complete, unblocking removal of the object.
If the controller is healthy, the controller should have applied a status condition to the object being deleted to explain why cleanup has stalled. If not, check the controller logs for indications of the root cause.
Often, having an object stuck deleting is an indication that objects were deleted in the wrong order. To prevent this kind of problem in the future, declare dependencies between resource objects, to ensure they are deleted in reverse dependency order.
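For example, reusing the Pod names from the earlier KNV2009 example, the following sketch declares that pod4 depends on pod3, so Config Sync applies pod4 after pod3 and deletes it before pod3. The names and image are illustrative.
# Illustrative only: pod4 is applied after, and deleted before, its
# dependency pod3 in the bookstore namespace.
apiVersion: v1
kind: Pod
metadata:
  name: pod4
  namespace: bookstore
  annotations:
    config.kubernetes.io/depends-on: /namespaces/bookstore/Pod/pod3
spec:
  containers:
  - name: app
    image: nginx:1.25    # hypothetical image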
ResourceGroup fields keep changing
When syncing is attempted, the inventory is updated to change resource status to pending. When syncing fails, the inventory is updated to change resource status to failed. When syncing is retried after failure, this pattern repeats, causing periodic updates to the inventory. This causes the ResourceGroup resourceVersion to increase with each update and the syncing status to flip back and forth. This is normal, but can be surprising.
Sync failure can be caused by a number of issues. One of the most common is insufficient permissions to manage the resources specified in the source. To fix this error, add the appropriate RoleBindings or ClusterRoleBindings to grant the RepoSync or RootSync reconciler permission to manage the resources that are failing to sync.
Server-side apply doesn't remove or revert fields not specified in the source
Config Sync uses Server-Side Apply to apply manifests from the source to Kubernetes. This is required to allow other controllers to manage metadata and spec fields. One example of this is the Horizontal Pod Autoscaler, which updates the number of replicas in a Deployment. Because of this, Config Sync only manages fields specified in the source manifest. This has the side effect that when adopting existing resource objects, any fields unspecified in the source aren't changed, which can sometimes cause the merged configuration to be invalid or incorrect.
To avoid this problem when adopting a resource, specify exactly the same fields in the source when initially adopting, and then change the fields in the source after syncing. That way, Config Sync correctly removes the fields it previously applied and replaces them with the new fields from the source. Another way to avoid this problem is to delete the resource from the cluster first and allow Config Sync to apply the new version.
What's next
- If you're still experiencing issues, check to see if your problem is a known issue.