
Enable Auto-Repair for GKE Cluster Nodes

Trend Cloud One™ – Conformity is a continuous assurance tool that provides peace of mind for your cloud infrastructure, delivering over 1000 automated best practice checks.

Risk Level: Medium (should be achieved)

Ensure that the Auto-Repair feature is enabled for all your GKE cluster nodes to help keep them healthy. Google Kubernetes Engine (GKE) uses each node's health status to determine whether the node needs to be repaired. GKE triggers a repair action if a node reports an unhealthy status on consecutive checks over a given time threshold. A node is considered unhealthy when:

  • A cluster node broadcasts a "NotReady" status on consecutive checks over the given time threshold.
  • A cluster node does not broadcast any status at all over the given time threshold.
  • A cluster node's boot disk is out of disk space for an extended period of time.

Reliability

Auto-Repair helps you keep the nodes in your GKE cluster in a healthy, running state. When the feature is enabled, Google Kubernetes Engine makes periodic checks on the health state of each node in your cluster. If a node fails consecutive health checks over a given time threshold, GKE service initiates a repair process for that cluster node.
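To get a rough view of the node status that these health checks are based on, the sketch below lists the nodes that Kubernetes currently reports as anything other than Ready. It is an illustration only, not GKE's internal repair logic. It assumes kubectl is installed and already pointed at the cluster (for example via gcloud container clusters get-credentials), and list_unhealthy_nodes is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: list_unhealthy_nodes is a hypothetical helper name.
# Assumes kubectl is configured for the cluster you want to inspect.

# Print the name of every node whose STATUS column does not start with
# "Ready" (for example "NotReady" or "NotReady,SchedulingDisabled").
list_unhealthy_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}
```

Calling list_unhealthy_nodes prints one node name per line. Empty output means that every node currently reports a Ready status.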


Audit

To determine if your Google Kubernetes Engine (GKE) clusters are using auto-repairing nodes, perform the following operations:

Using GCP Console

01 Sign in to the Google Cloud Management Console.

02 Select the GCP project that you want to examine from the console top navigation bar.

03 Navigate to Google Kubernetes Engine (GKE) console at https://console.cloud.google.com/kubernetes.

04 In the main navigation panel, under Kubernetes Engine, select Clusters to access the list with the GKE clusters provisioned within the selected project.

05 Click on the name (link) of the GKE cluster that you want to examine.

06 Select the NODES tab to access the node pools created for the selected cluster.

07 Click on the name (link) of the cluster node pool that you want to examine.

08 In the Management section, check the Auto-repair feature status. If Auto-repair is set to Disabled, the Auto-Repair feature is not enabled for the nodes running within the selected Google Kubernetes Engine (GKE) cluster node pool.

09 Repeat steps no. 7 and 8 for each node pool provisioned for the selected GKE cluster.

10 Repeat steps no. 5 – 9 for each GKE cluster created for the selected GCP project.

11 Repeat steps no. 2 – 10 for each project deployed within your Google Cloud account.

Using GCP CLI

01 Run projects list command (Windows/macOS/Linux) with custom query filters to list the ID of each project available in your Google Cloud account:

gcloud projects list --format="table(projectId)"

02 The command output should return the requested GCP project ID(s):

PROJECT_ID
cc-bigdata-project-123123
cc-iot-app-project-112233

03 Run container clusters list command (Windows/macOS/Linux) using the ID of the GCP project that you want to examine as the identifier parameter and custom output formatting to list the name and the location of each GKE cluster created for the selected project:

gcloud container clusters list --project cc-bigdata-project-123123 --format="(NAME,LOCATION)"

04 The command output should return the requested cluster names and their locations:

NAME                     LOCATION
cc-gke-frontend-cluster  us-central1
cc-gke-backend-cluster   us-central1

05 Run container node-pools list command (Windows/macOS/Linux) using the name of the GKE cluster that you want to examine as the identifier parameter, to list the name of each node pool provisioned for the selected cluster:

gcloud container node-pools list --cluster=cc-gke-frontend-cluster --region=us-central1 --format="(NAME)"

06 The command output should return the requested cluster node pool name(s):

NAME
cc-gke-frontend-pool-001
cc-gke-frontend-pool-002
cc-gke-frontend-pool-003

07 Run container node-pools describe command (Windows/macOS/Linux) using the name of the cluster node pool that you want to examine as the identifier parameter and custom output filtering to describe the Auto-Repair feature status for the selected node pool:

gcloud container node-pools describe cc-gke-frontend-pool-001 --cluster=cc-gke-frontend-cluster --region=us-central1 --format="yaml(management.autoRepair)"

08 The command output should return the requested feature status:

management: {}

If the container node-pools describe command output returns null or an empty object (i.e. {}) for the management configuration attribute, as shown in the output example above, the Auto-Repair feature is not enabled for the nodes provisioned within the selected Google Kubernetes Engine (GKE) cluster node pool. When the feature is enabled, the output lists autoRepair: true under management.

09 Repeat steps no. 7 and 8 for each node pool provisioned for the selected GKE cluster.

10 Repeat steps no. 5 – 9 for each GKE cluster created for the selected GCP project.

11 Repeat steps no. 3 – 10 for each GCP project deployed in your Google Cloud account.
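The CLI audit steps above can be chained into a single script. The following is a minimal sketch rather than an official tool. It assumes a Bash-compatible shell, an installed and authenticated gcloud, and a gcloud release that supports the --location flag (on older releases, --region or --zone would be needed instead). The function names is_autorepair_enabled, audit_autorepair and audit_all_projects are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: the function names below are hypothetical helpers.
# Assumes gcloud is installed, authenticated, and supports --location.

# Succeeds (exit 0) only when the node pool reports management.autoRepair
# as true. Any other result, including empty output or a failed gcloud
# call, is treated as "not enabled".
is_autorepair_enabled() {
  local project="$1" location="$2" cluster="$3" pool="$4" value
  value="$(gcloud container node-pools describe "$pool" \
    --project="$project" --cluster="$cluster" --location="$location" \
    --format="value(management.autoRepair)" </dev/null)"
  [ "$(printf '%s' "$value" | tr '[:upper:]' '[:lower:]')" = "true" ]
}

# Prints one NON-COMPLIANT line for each node pool in the given project
# that does not have Auto-Repair enabled.
audit_autorepair() {
  local project="$1" cluster location pool
  gcloud container clusters list --project="$project" \
    --format="value(name,location)" |
  while read -r cluster location; do
    [ -n "$cluster" ] || continue
    gcloud container node-pools list --project="$project" \
      --cluster="$cluster" --location="$location" \
      --format="value(name)" </dev/null |
    while read -r pool; do
      [ -n "$pool" ] || continue
      if ! is_autorepair_enabled "$project" "$location" "$cluster" "$pool"; then
        printf 'NON-COMPLIANT %s %s %s %s\n' \
          "$project" "$location" "$cluster" "$pool"
      fi
    done
  done
}

# Runs the audit for every project returned by "gcloud projects list".
audit_all_projects() {
  local project
  gcloud projects list --format="value(projectId)" |
  while read -r project; do
    [ -n "$project" ] || continue
    audit_autorepair "$project" </dev/null
  done
}
```

For example, audit_autorepair cc-bigdata-project-123123 checks a single project, while audit_all_projects covers every project visible to the authenticated account. Each output line gives the project, location, cluster and node pool that still need Auto-Repair enabled.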

Remediation / Resolution

To enable the Auto-Repair feature for your Google Kubernetes Engine (GKE) cluster nodes, perform the following operations:

Note: Auto-repair can be enabled on a per-node pool basis only.

Using GCP Console

01 Sign in to the Google Cloud Management Console.

02 Select the GCP project that you want to examine from the console top navigation bar.

03 Navigate to Google Kubernetes Engine (GKE) console at https://console.cloud.google.com/kubernetes.

04 In the main navigation panel, under Kubernetes Engine, select Clusters to access the list with the GKE clusters provisioned within the selected project.

05 Click on the name (link) of the GKE cluster that you want to access.

06 Select the NODES tab to access the node pools created for the selected cluster.

07 Click on the name (link) of the cluster node pool that you want to reconfigure.

08 Choose EDIT from the console top menu to modify the configuration settings available for the selected node pool.

09 On the Edit node pool configuration page, perform the following actions:

  1. In the Management section, select the Enable auto-repair checkbox to enable the Auto-Repair feature for the selected GKE cluster node pool.
  2. Choose SAVE to apply the changes.

10 Repeat steps no. 7 – 9 to enable Auto-Repair for other node pools provisioned for the selected GKE cluster.

11 Repeat steps no. 5 – 10 for each GKE cluster created for the selected GCP project.

12 Repeat steps no. 2 – 11 for each project deployed within your Google Cloud account.

Using GCP CLI

01 Run container node-pools update command (Windows/macOS/Linux) using the name of the GKE cluster node pool that you want to reconfigure as the identifier parameter, to enable the Auto-Repair feature for the selected node pool:

gcloud container node-pools update cc-gke-frontend-pool-001 --cluster=cc-gke-frontend-cluster --region=us-central1 --enable-autorepair

02 The command output should return the URL of the reconfigured GKE cluster node pool:

Updating node pool cc-gke-frontend-pool-001...done.
Updated [https://container.googleapis.com/v1/projects/cc-bigdata-project-123123/zones/us-central1/clusters/cc-gke-frontend-cluster/nodePools/cc-gke-frontend-pool-001].

03 Repeat steps no. 1 and 2 to enable Auto-Repair for other node pools created for the selected GKE cluster.

04 Repeat steps no. 1 – 3 for each GKE cluster available for the selected GCP project.

05 Repeat steps no. 1 – 4 for each project deployed within your Google Cloud account.
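Because Auto-Repair is set per node pool, the update command above has to be run once for each pool. The sketch below loops over all node pools of one cluster and applies the update to each in turn. It assumes a Bash-compatible shell, an authenticated gcloud with support for the --location flag, and sufficient permissions to update node pools. The function name enable_autorepair_for_cluster is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: enable_autorepair_for_cluster is a hypothetical helper name.
# Assumes gcloud is installed, authenticated, and supports --location.

# Enable Auto-Repair on every node pool of a single GKE cluster,
# one pool at a time.
enable_autorepair_for_cluster() {
  local project="$1" location="$2" cluster="$3" pool
  gcloud container node-pools list --project="$project" \
    --cluster="$cluster" --location="$location" \
    --format="value(name)" |
  while read -r pool; do
    [ -n "$pool" ] || continue
    echo "Enabling Auto-Repair on node pool $pool"
    gcloud container node-pools update "$pool" \
      --project="$project" --cluster="$cluster" --location="$location" \
      --enable-autorepair </dev/null
  done
}
```

A call such as enable_autorepair_for_cluster cc-bigdata-project-123123 us-central1 cc-gke-frontend-cluster would update the example node pools used in this guide. The updates run sequentially because each one waits for gcloud to finish before the next pool is processed.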

Publication date May 10, 2021