Problem
In the Cloud console, Google Kubernetes Engine nodes are shown as out of resources: memory usage or CPU usage is at 100%, and pods are being killed seemingly at random with the following event:
Warning: OOMKilling Memory cgroup out of memory
Environment
- Google Kubernetes Engine
Solution
Option 1
- Kill workloads.
You can delete problematic workloads manually, or you can scale the cluster up so that there is enough capacity for the remaining workloads.
Option 2
- Set memory or CPU limits (see the example below).
Although Kubernetes (including Google Kubernetes Engine) is a scheduling and orchestration system, workloads can still exceed the available resources (such as memory or CPU) on the worker nodes. When this happens with memory, pods are killed; when it happens with CPU, the node's performance suffers.
Setting limits protects the node from a pod consuming more memory or CPU than intended.
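As a sketch of Option 2, the manifest below sets memory and CPU requests and limits on a single container. The pod name my-app, the nginx image, and the numeric values are placeholders for illustration only; tune them to the actual workload.

apiVersion: v1
kind: Pod
metadata:
  name: my-app              # illustrative name
spec:
  containers:
  - name: my-app
    image: nginx            # illustrative image
    resources:
      requests:             # the scheduler places the pod based on these values
        memory: "256Mi"
        cpu: "250m"
      limits:               # hard caps enforced on the node
        memory: "512Mi"     # going over the memory limit gets the container OOM-killed
        cpu: "500m"         # going over the CPU limit gets the container throttled

Keeping requests equal to (or close to) the limits means the node is not overcommitted, so the scheduler's view of the node matches what the pods can actually consume.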
Cause
Kubernetes schedules around requests, not limits. So if limits are higher than requests (sometimes called overcommit), or there are no limits at all, workloads can exceed the resources of the node (such as memory or CPU). When this happens with CPU, the node slows down until services begin to fail and the node eventually fails its health checks.
When this happens with memory, the oom_killer is invoked and kills processes, including containers, in an attempt to preserve node health.
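To illustrate overcommit with assumed, illustrative values: in the resources block below, the scheduler accounts only for the 128Mi/100m requests when placing the pod, yet the container is allowed to grow to 2Gi of memory and 2 CPUs, so a handful of such pods on one node can together exceed the node's capacity.

resources:
  requests:
    memory: "128Mi"   # the scheduler reserves only this much per pod
    cpu: "100m"
  limits:
    memory: "2Gi"     # but each pod may actually grow to 2Gi
    cpu: "2"          # and burst up to 2 full CPUs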
See https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/ for more information on managing container resources.
The out-of-resource eviction behavior can also be adjusted; see https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/.
The oom_killer behavior is explained at https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior.