Publicly Accessible Dataproc Clusters

Risk Level: High (not acceptable risk)

Security

When external IP addresses are assigned to Dataproc clusters, the cluster instances are exposed directly to the Internet. This increases the attack surface and risks accidental data exposure if firewall rules are misconfigured. By using private, internal IP addresses only for your Dataproc instances, you limit access and force traffic through secure channels.
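
As a quick spot check, you can also look directly at the Compute Engine instances behind a cluster to see whether any of them hold an external (NAT) IP address. A minimal sketch, assuming the goog-dataproc-cluster-name label that Dataproc normally applies to cluster VMs, and using an illustrative cluster name:

# Sketch: show any external (NAT) IPs attached to the VMs of a Dataproc
# cluster; an empty IP column suggests the instance is internal-only.
# The cluster name below is illustrative.
gcloud compute instances list \
  --filter="labels.goog-dataproc-cluster-name=tm-prod-dataproc-cluster" \
  --format="table(name, networkInterfaces[0].accessConfigs[0].natIP)"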


Audit

To determine if your Google Cloud Dataproc cluster instances are accessible from the Internet, perform the following operations:

Using GCP Console

01 Sign in to Google Cloud Management Console.

02 Select the Google Cloud Platform (GCP) project that you want to access from the console top navigation bar.

03 Navigate to Dataproc console available at https://console.cloud.google.com/dataproc.

04 In the navigation panel, select Clusters to access the list of the Dataproc clusters deployed for the selected project.

05 Click on the name (link) of the Dataproc cluster that you want to examine.

06 Select the CONFIGURATION tab and check the Internal IP only configuration attribute value. If the Internal IP only value is set to No, the cluster instances use public IP addresses instead of internal IP addresses; therefore, the selected Google Cloud Dataproc cluster is considered publicly accessible.

07 Repeat steps no. 5 and 6 for each Dataproc cluster provisioned for the selected GCP project.

08 Repeat steps no. 2 – 7 for each project deployed within your Google Cloud account.

Using GCP CLI

01 Run projects list command (Windows/macOS/Linux) using custom query filters to list the IDs of all the Google Cloud Platform (GCP) projects available in your cloud account:

gcloud projects list \
  --format="table(projectId)"

02 The command output should return the requested GCP project identifiers:

PROJECT_ID
cc-bigdata-project-123123
cc-web-app-project-112233

03 Run dataproc clusters list command (Windows/macOS/Linux) using custom query filters to list the name of each Dataproc cluster provisioned in the selected region for the selected Google Cloud project:

gcloud dataproc clusters list \
  --project cc-bigdata-project-123123 \
  --region=us-central1 \
  --format="(NAME)"

04 The command output should return the requested Dataproc cluster names:

NAME
tm-prod-dataproc-cluster 
tm-dataproc-test-cluster
tm-dataproc-hda1-cluster

05 Run dataproc clusters describe command (Windows/macOS/Linux) using the name of the Google Cloud Dataproc cluster that you want to examine as the identifier parameter and custom query filters to determine if the instances running within the selected cluster are using public or private (internal) IP addresses:

gcloud dataproc clusters describe tm-prod-dataproc-cluster \
  --region=us-central1 \
  --format=json | jq '.config.gceClusterConfig.internalIpOnly'

06 The command output should return true if the cluster instances are using internal IPs and false otherwise:

false

If the dataproc clusters describe command output returns false, the cluster instances use public IP addresses instead of internal IP addresses; therefore, the selected Google Cloud Dataproc cluster is considered publicly accessible.
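
To surface every cluster in the region that is not internal-IP-only in one step, you can also filter the JSON list output with jq. A minimal sketch (the != true test deliberately also catches clusters where the internalIpOnly attribute is absent):

gcloud dataproc clusters list \
  --project cc-bigdata-project-123123 \
  --region=us-central1 \
  --format=json | jq -r '.[] | select(.config.gceClusterConfig.internalIpOnly != true) | .clusterName'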

07 Repeat steps no. 5 and 6 for each Dataproc cluster created within the selected project.

08 Repeat steps no. 3 – 7 for each GCP project deployed in your Google Cloud account.
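
To automate steps no. 1 – 8 for an entire Google Cloud account, the commands above can be combined in a small shell loop. A minimal sketch, assuming gcloud and jq are installed and that REGION is adjusted to the region(s) where your clusters run:

#!/bin/bash
# Sketch: flag Dataproc clusters that are not internal-IP-only, across all
# projects visible to the active gcloud credentials.
REGION="us-central1"  # adjust to the region(s) you use

for PROJECT in $(gcloud projects list --format="value(projectId)"); do
  gcloud dataproc clusters list \
    --project "$PROJECT" \
    --region "$REGION" \
    --format=json 2>/dev/null |
    jq -r --arg p "$PROJECT" \
      '.[] | select(.config.gceClusterConfig.internalIpOnly != true) |
       "\($p): \(.clusterName) is publicly accessible"'
done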

Remediation / Resolution

To ensure that your Dataproc cluster instances are not accessible from the Internet, you must re-create your Dataproc clusters with internal IP addresses only. To redeploy your clusters, perform the following operations:

Using GCP Console

01 Sign in to Google Cloud Management Console.

02 Select the Google Cloud Platform (GCP) project that you want to access from the console top navigation bar.

03 Navigate to Dataproc console available at https://console.cloud.google.com/dataproc.

04 In the navigation panel, select Clusters to access the list of the Dataproc clusters deployed for the selected project.

05 Click on the name (link) of the Dataproc cluster that you want to re-create and collect all the configuration information available for the selected resource.

06 Go back to the Clusters console and choose CREATE CLUSTER from the console top menu to initiate the Dataproc cluster setup process. When prompted to select the infrastructure service, choose Cluster on Compute Engine.

07 On the Create a Dataproc cluster on Compute Engine page, perform the following actions:

  1. For Set up cluster, provide a unique identifier for the new cluster in the Cluster name box and use the information collected at step no. 5 to configure the cluster settings such as cluster type and location, image type and version, autoscaling policy, network configuration, and cluster components.
  2. For Configure nodes (optional), select the appropriate hardware configurations, including the machine family, series, machine type, GPU type (if applicable), and primary disk size and type for both master and worker nodes (must match the hardware configuration used by the source cluster).
  3. For Customize cluster (optional), select the Configure all instances to have only internal IP addresses checkbox to assign private, internal IP addresses to all your Dataproc cluster instances. With internal IPs, your Dataproc cluster will be isolated from the public Internet and its instances will communicate over a private IP subnetwork (the cluster will not assign public IP addresses). Configure additional settings including labels, metadata, properties, and scheduled deletion settings.
  4. For Manage security (optional), configure the cluster security settings, including encryption and project access.
  5. Choose CREATE to launch your new Google Cloud Dataproc cluster.

08 If required, migrate the source cluster data to the newly created cluster.

09 Update your application to reference the new Dataproc cluster.

10 Once the new cluster is operating successfully, you can remove the source cluster in order to stop adding charges to your Google Cloud bill. Click on the name (link) of the cluster that you want to remove and choose DELETE from the console top menu.

11 In the confirmation box, choose DELETE to confirm the cluster deletion.

12 Repeat steps no. 5 – 11 for each Dataproc cluster that you want to redeploy, available within the selected GCP project.

13 Repeat steps no. 2 – 12 for each GCP project available in your Google Cloud account.

Using GCP CLI

01 Run dataproc clusters describe command (Windows/macOS/Linux) using the name of the Google Cloud Dataproc cluster that you want to re-create as the identifier parameter, to describe the configuration information available for the selected cluster:

gcloud dataproc clusters describe tm-prod-dataproc-cluster \
  --region=us-central1 \
  --format=json

02 The command output should return the requested configuration information:

{
  "clusterName": "tm-prod-dataproc-cluster",
  "config": {
    "configBucket": "dataproc-staging-us-central1-123456789012-abcdabcd",
    "masterConfig": {
      "diskConfig": {
        "bootDiskSizeGb": 500,
        "bootDiskType": "pd-standard"
      },
      "machineTypeUri": "https://www.googleapis.com/compute/v1/projects/cc-bigdata-project-123123/zones/us-central1-a/machineTypes/n1-standard-4",
      "minCpuPlatform": "AUTOMATIC"
    },

    ...

    "tempBucket": "dataproc-temp-us-central1-6123456789012-abcdabcd"
  },
  "projectId": "cc-bigdata-project-123123",
  "status": {
    "state": "RUNNING",
    "stateStartTime": "2024-03-04T08:20:00.000Z"
  },
  "statusHistory": [
    {
      "state": "CREATING",
      "stateStartTime": "2024-03-04T08:20:00.000Z"
    }
  ]
}
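
Alternatively, the cluster configuration can often be captured as a reusable file with the clusters export command and replayed later with clusters import. A sketch, assuming these commands are available in your gcloud version:

gcloud dataproc clusters export tm-prod-dataproc-cluster \
  --region=us-central1 \
  --destination=tm-prod-dataproc-cluster.yaml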

03 Run dataproc clusters create command (Windows/macOS/Linux) with the information returned at the previous step as the configuration data for the new cluster, to create a new Google Cloud Dataproc cluster. Use the --no-address parameter together with the --network flag to create a Dataproc cluster that uses a subnetwork with the same name as the specified network in the region where the cluster is created. With --no-address, the cluster will assign private, internal IP addresses to all its instances:

gcloud dataproc clusters create tm-new-dataproc-cluster \
  --project=cc-bigdata-project-123123 \
  --region=us-central1 \
  --single-node \
  --master-machine-type=n1-standard-4 \
  --master-boot-disk-size=500GB \
  --master-boot-disk-type=pd-standard \
  --network default \
  --no-address

04 The command output should return the information (region and URL) available for the new Dataproc cluster:

Waiting for cluster creation operation...done.
Created [https://dataproc.googleapis.com/v1/projects/cc-bigdata-project-123123/regions/us-central1/clusters/tm-new-dataproc-cluster] Cluster placed in zone [us-central1-c].
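
Because an internal-IP-only cluster has no public route by default, the subnetwork it uses generally needs Private Google Access enabled so that the instances can still reach Google APIs such as Cloud Storage. A sketch of enabling it on the default subnetwork, followed by a re-check of the new cluster (names as used above):

# Enable Private Google Access on the subnetwork used by the cluster.
gcloud compute networks subnets update default \
  --region=us-central1 \
  --enable-private-ip-google-access

# Verify that the new cluster is internal-IP-only; the command should
# now return true.
gcloud dataproc clusters describe tm-new-dataproc-cluster \
  --region=us-central1 \
  --format=json | jq '.config.gceClusterConfig.internalIpOnly'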

05 If required, migrate the source cluster data to the newly created (target) cluster.
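
One common way to move HDFS data between clusters is to run a Hadoop DistCp job on the new cluster, typically with a Cloud Storage bucket as the destination or staging area. A minimal sketch with illustrative bucket paths:

gcloud dataproc jobs submit hadoop \
  --cluster=tm-new-dataproc-cluster \
  --region=us-central1 \
  --class=org.apache.hadoop.tools.DistCp \
  -- gs://tm-source-bucket/data gs://tm-destination-bucket/data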

06 Update your application to reference the new Google Cloud Dataproc cluster.

07 Once the new cluster is operating successfully, you can remove the source cluster in order to stop adding charges to your Google Cloud bill. Run dataproc clusters delete command (Windows/macOS/Linux) using the name of the resource that you want to remove as the identifier parameter, to delete the specified Dataproc cluster:

gcloud dataproc clusters delete tm-prod-dataproc-cluster --region=us-central1

08 Type Y and press Enter to confirm the resource removal. All the cluster disks will be permanently deleted; therefore, make sure that your data has been successfully exported to the new cluster before removal:

The cluster 'tm-prod-dataproc-cluster' and all attached disks will be deleted.
Do you want to continue (Y/n)? Y

09 The output should return the dataproc clusters delete command request status:

Waiting for cluster deletion operation...done.
Deleted [https://dataproc.googleapis.com/v1/projects/cc-bigdata-project-123123/regions/us-central1/clusters/tm-prod-dataproc-cluster].
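
For scripted cleanups, the global --quiet flag suppresses the confirmation prompt shown at step no. 8; use it only after confirming that the data migration is complete:

gcloud dataproc clusters delete tm-prod-dataproc-cluster \
  --region=us-central1 \
  --quiet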

10 Repeat steps no. 1 – 9 for each Dataproc cluster that you want to re-create, available in the selected GCP project.

11 Repeat steps no. 1 – 10 for each GCP project deployed in your Google Cloud account.

Publication date May 3, 2022