Ensure that the Auto-Repair feature is enabled for all your GKE cluster nodes to help keep them healthy. Google Kubernetes Engine (GKE) uses each node's health status to determine whether the node needs to be repaired, and triggers a repair action if a node reports an unhealthy status on consecutive checks over a given time threshold. A node is reported as unhealthy when:
- A cluster node broadcasts a "NotReady" status on consecutive checks over the given time threshold.
- A cluster node does not broadcast any status at all over the given time threshold.
- A cluster node's boot disk is out of disk space for an extended period of time.
Auto-Repair helps you keep the nodes in your GKE cluster in a healthy, running state. When the feature is enabled, GKE periodically checks the health state of each node in your cluster. If a node fails consecutive health checks over a given time threshold, GKE initiates a repair process for that node.
Audit
To determine if your Google Kubernetes Engine (GKE) clusters are using auto-repairing nodes, perform the following operations:
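The audit can be sketched with the gcloud CLI commands listed under References. This is a minimal sketch under stated assumptions, not the official procedure: the function names `auto_repair_status` and `audit_cluster` and their arguments are hypothetical, the gcloud CLI is assumed to be installed and authenticated, and the `--location` flag is assumed to be available (older gcloud releases may require `--zone` or `--region` instead). It also assumes the node pool resource reports the setting as `management.autoRepair`.

```shell
# Sketch: report whether Auto-Repair is enabled for the node pools of a cluster.
# Function names and arguments are illustrative, not part of gcloud.

# Print "enabled" or "disabled" for a single node pool.
# Usage: auto_repair_status POOL CLUSTER LOCATION PROJECT
auto_repair_status() {
  local pool="$1" cluster="$2" location="$3" project="$4" value
  # stdin is redirected so the call cannot consume input meant for a caller's loop.
  value="$(gcloud container node-pools describe "$pool" \
    --cluster="$cluster" --location="$location" --project="$project" \
    --format="value(management.autoRepair)" </dev/null)"
  case "$value" in
    True|true) echo "enabled" ;;
    *)         echo "disabled" ;;  # an empty or missing value is treated as disabled
  esac
}

# Print one "POOL<TAB>STATUS" line per node pool in the cluster.
# Usage: audit_cluster CLUSTER LOCATION PROJECT
audit_cluster() {
  local cluster="$1" location="$2" project="$3" pool
  gcloud container node-pools list \
    --cluster="$cluster" --location="$location" --project="$project" \
    --format="value(name)" |
  while IFS= read -r pool; do
    [ -n "$pool" ] || continue
    printf '%s\t%s\n' "$pool" \
      "$(auto_repair_status "$pool" "$cluster" "$location" "$project")"
  done
}
```

For example, `audit_cluster my-cluster us-central1-a my-project` (placeholder names) would print one line per node pool. Any pool reported as `disabled` does not have Auto-Repair enabled. To cover every project and cluster, the same check could be repeated over the output of `gcloud projects list` and `gcloud container clusters list`.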
Remediation / Resolution
To enable the Auto-Repair feature for your Google Kubernetes Engine (GKE) cluster nodes, perform the following operations:
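The remediation can be sketched with `gcloud container node-pools update` and its `--enable-autorepair` flag. This is a minimal sketch, not the official procedure: the function names `enable_auto_repair` and `enable_auto_repair_for_cluster` and their arguments are hypothetical, the gcloud CLI is assumed to be installed and authenticated with permission to update node pools, and the `--location` flag is assumed to be available (older gcloud releases may require `--zone` or `--region` instead).

```shell
# Sketch: enable Auto-Repair on GKE node pools.
# Function names and arguments are illustrative, not part of gcloud.

# Enable Auto-Repair on a single node pool.
# Usage: enable_auto_repair POOL CLUSTER LOCATION PROJECT
enable_auto_repair() {
  local pool="$1" cluster="$2" location="$3" project="$4"
  gcloud container node-pools update "$pool" \
    --cluster="$cluster" --location="$location" --project="$project" \
    --enable-autorepair
}

# Enable Auto-Repair on every node pool of a cluster, one pool at a time.
# Usage: enable_auto_repair_for_cluster CLUSTER LOCATION PROJECT
enable_auto_repair_for_cluster() {
  local cluster="$1" location="$2" project="$3" pool
  # Node pool names contain no whitespace, so word splitting is safe here.
  for pool in $(gcloud container node-pools list \
      --cluster="$cluster" --location="$location" --project="$project" \
      --format="value(name)"); do
    enable_auto_repair "$pool" "$cluster" "$location" "$project" || return 1
  done
}
```

For example, `enable_auto_repair default-pool my-cluster us-central1-a my-project` (placeholder names) updates one node pool. The per-cluster helper simply loops over the pools, because the setting is applied to each node pool individually.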
Note: Auto-Repair can be enabled on a per-node pool basis only.
References
- Google Cloud Platform (GCP) Documentation
- Google Kubernetes Engine
- Auto-repairing nodes
- GCP Command Line Interface (CLI) Documentation
- gcloud projects list
- gcloud container clusters list
- gcloud container node-pools list
- gcloud container node-pools describe
- gcloud container node-pools update