When you use Dataproc, cluster and job data is stored on persistent disks associated with the Compute Engine VMs in your cluster and in a Cloud Storage staging bucket. This persistent disk and bucket data is encrypted using a Google-generated data encryption key (DEK) and key encryption key (KEK).
The CMEK feature lets you create, use, and revoke the key encryption key (KEK). Google still controls the data encryption key (DEK). For more information on Google data encryption keys, see Encryption at Rest.
Use CMEK with cluster data
You can use customer-managed encryption keys (CMEK) to encrypt the following cluster data:
- Data on the persistent disks attached to VMs in your Dataproc cluster
- Job argument data submitted to your cluster, such as a query string submitted with a Spark SQL job
- Cluster metadata, job driver output, and other data written to a Dataproc staging bucket that you create
Follow these steps to use CMEK to encrypt cluster data:

- Create one or more keys using the Cloud Key Management Service.
  The resource name, also called the resource ID, of a key, which you use in the next steps, is constructed as follows:

  projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
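As a sketch, the key ring and key for this step might be created with the following gcloud commands; `KEY_RING_NAME` and `KEY_NAME` are placeholders you choose:

```shell
# Create a key ring in the region where your Dataproc clusters run.
gcloud kms keyrings create KEY_RING_NAME \
    --location=REGION

# Create a symmetric encryption key in that key ring.
gcloud kms keys create KEY_NAME \
    --location=REGION \
    --keyring=KEY_RING_NAME \
    --purpose=encryption
```

The key's resource ID, in the format shown above, is printed by `gcloud kms keys list --location=REGION --keyring=KEY_RING_NAME`.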
Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the following service accounts:

- Follow item #5 in Compute Engine→Protecting Resources with Cloud KMS Keys→Before you begin to assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Compute Engine service agent service account.
- Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Cloud Storage service agent service account.
- Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Dataproc service agent service account. You can use the Google Cloud CLI to assign the role:

  ```
  gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
      --member serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com \
      --role roles/cloudkms.cryptoKeyEncrypterDecrypter
  ```
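The Cloud Storage binding can be sketched the same way. The Cloud Storage service agent normally follows the `service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com` pattern; verify the exact address for your project (for example, with `gcloud storage service-agent`) before granting the role:

```shell
# Grant the Cloud Storage service agent use of the key.
# The member address below is the usual pattern; confirm it for your project.
gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
    --member serviceAccount:service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter
```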
Replace the following:

- KMS_PROJECT_ID: the ID of your Google Cloud project that runs Cloud KMS. This project can also be the project that runs Dataproc resources.
- PROJECT_NUMBER: the project number (not the project ID) of your Google Cloud project that runs Dataproc resources.

Enable the Cloud KMS API on the project that runs Dataproc resources.

If the Dataproc Service Agent role is not attached to the Dataproc Service Agent service account, add the serviceusage.services.use permission to the custom role attached to that service account. If the Dataproc Service Agent role is attached to the Dataproc Service Agent service account, you can skip this step.
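The API enablement and custom-role steps above can be sketched as follows; `dataprocCustomRole` is a hypothetical role ID standing in for whatever custom role is attached to your Dataproc Service Agent service account:

```shell
# Enable the Cloud KMS API on the project that runs Dataproc resources.
gcloud services enable cloudkms.googleapis.com \
    --project=PROJECT_ID

# Only if a custom role (not the Dataproc Service Agent role) is attached to
# the Dataproc Service Agent service account: add the required permission.
# "dataprocCustomRole" is a hypothetical role ID.
gcloud iam roles update dataprocCustomRole \
    --project=PROJECT_ID \
    --add-permissions=serviceusage.services.use
```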
Pass the resource ID of your key to the Google Cloud CLI or the Dataproc API to use with cluster data encryption.
gcloud CLI
- To encrypt cluster persistent disk data using your key, pass the resource ID of your key to the `--gce-pd-kms-key` flag when you create the cluster:

  ```
  gcloud dataproc clusters create CLUSTER_NAME \
      --region=REGION \
      --gce-pd-kms-key='projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME' \
      other arguments ...
  ```

  You can verify the key setting with the `gcloud` command-line tool:

  ```
  gcloud dataproc clusters describe CLUSTER_NAME \
      --region=REGION
  ```

  Command output snippet:

  ```
  ...
  configBucket: dataproc- ...
  encryptionConfig:
    gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
  ...
  ```
- To encrypt cluster persistent disk data and job argument data using your key, pass the resource ID of the key to the `--kms-key` flag when you create the cluster. See Cluster.EncryptionConfig.kmsKey for a list of job types and arguments that are encrypted with the `--kms-key` flag.

  ```
  gcloud dataproc clusters create CLUSTER_NAME \
      --region=REGION \
      --kms-key='projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME' \
      other arguments ...
  ```

  You can verify key settings with the gcloud CLI `dataproc clusters describe` command. The key resource ID is set on `gcePdKmsKeyName` and `kmsKey` to use your key with the encryption of cluster persistent disk and job argument data:

  ```
  gcloud dataproc clusters describe CLUSTER_NAME \
      --region=REGION
  ```

  Command output snippet:

  ```
  ...
  configBucket: dataproc- ...
  encryptionConfig:
    gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
  ...
  ```
- To encrypt cluster metadata, job driver output, and other data written to your Dataproc staging bucket in Cloud Storage:
  - Create your own bucket with CMEK. When adding the key to the bucket, use a key that you created in Step 1.
  - Pass the bucket name to the `--bucket` flag when you create the cluster:

  ```
  gcloud dataproc clusters create CLUSTER_NAME \
      --region=REGION \
      --bucket=CMEK_BUCKET_NAME \
      other arguments ...
  ```
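The bucket-creation part of this step can be sketched with `gcloud storage buckets create` and its `--default-encryption-key` flag, which sets the bucket's default CMEK key; the key here is the one from Step 1:

```shell
# Create a staging bucket whose default encryption key is your CMEK key.
gcloud storage buckets create gs://CMEK_BUCKET_NAME \
    --location=REGION \
    --default-encryption-key=projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
```

Remember that the Cloud Storage service agent must already hold the CryptoKey Encrypter/Decrypter role on the key, or the bucket creation fails.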
You can also pass CMEK-enabled buckets to the `gcloud dataproc jobs submit` command if your job takes bucket arguments, as shown in the following `cmek-bucket` example:
```
gcloud dataproc jobs submit pyspark gs://cmek-bucket/wordcount.py \
    --region=region \
    --cluster=cluster-name \
    -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts
```
REST API
- To encrypt cluster VM persistent disk data using your key, include the `ClusterConfig.EncryptionConfig.gcePdKmsKeyName` field as part of a `cluster.create` request.

  You can verify the key setting with the gcloud CLI `dataproc clusters describe` command:

  ```
  gcloud dataproc clusters describe CLUSTER_NAME \
      --region=REGION
  ```

  Command output snippet:

  ```
  ...
  configBucket: dataproc- ...
  encryptionConfig:
    gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
  ...
  ```
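As a sketch, a `clusters.create` request carrying this field might look as follows. The JSON body shows only the encryption-related fields; a real request also needs the rest of your cluster configuration:

```shell
# Minimal illustrative request body; only encryptionConfig is shown.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
      "clusterName": "CLUSTER_NAME",
      "config": {
        "encryptionConfig": {
          "gcePdKmsKeyName": "projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME"
        }
      }
    }' \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters"
```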
- To encrypt cluster VM persistent disk data and job argument data using your key, include the `Cluster.EncryptionConfig.kmsKey` field as part of a `cluster.create` request. See Cluster.EncryptionConfig.kmsKey for a list of job types and arguments that are encrypted with the `kmsKey` field.

  You can verify key settings with the gcloud CLI `dataproc clusters describe` command. The key resource ID is set on `gcePdKmsKeyName` and `kmsKey` to use your key with the encryption of cluster persistent disk and job argument data:

  ```
  gcloud dataproc clusters describe CLUSTER_NAME \
      --region=REGION
  ```

  Command output snippet:

  ```
  ...
  configBucket: dataproc- ...
  encryptionConfig:
    gcePdKmsKeyName: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
    kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
  ```
- To encrypt cluster metadata, job driver output, and other data written to your Dataproc staging bucket in Cloud Storage:
  - Create your own bucket with CMEK. When adding the key to the bucket, use a key that you created in Step 1.
  - Pass the bucket name to the `ClusterConfig.configBucket` field as part of a `cluster.create` request.

  The equivalent gcloud CLI command is:

  ```
  gcloud dataproc clusters create CLUSTER_NAME \
      --region=REGION \
      --bucket=CMEK_BUCKET_NAME \
      other arguments ...
  ```
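A `clusters.create` request that sets the staging bucket can be sketched as follows; only the `configBucket` field is shown, and a real request also needs the rest of your cluster configuration:

```shell
# Illustrative request body; only configBucket is shown.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
      "clusterName": "CLUSTER_NAME",
      "config": {
        "configBucket": "CMEK_BUCKET_NAME"
      }
    }' \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/clusters"
```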
You can also pass CMEK-enabled buckets to the `gcloud dataproc jobs submit` command if your job takes bucket arguments, as shown in the following `cmek-bucket` example:
```
gcloud dataproc jobs submit pyspark gs://cmek-bucket/wordcount.py \
    --region=region \
    --cluster=cluster-name \
    -- gs://cmek-bucket/shakespeare.txt gs://cmek-bucket/counts
```
Use CMEK with workflow template data
Dataproc workflow template job argument data, such as the query string of a Spark SQL job, can be encrypted using CMEK. Follow steps 1, 2, and 3 in this section to use CMEK with your Dataproc workflow template. See WorkflowTemplate.EncryptionConfig.kmsKey for a list of workflow template job types and arguments that are encrypted using CMEK when this feature is enabled.
- Create a key using the Cloud Key Management Service (Cloud KMS).
  The resource name of the key, which you use in the next steps, is constructed as follows:

  projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name
To enable the Dataproc service accounts to use your key:

- Assign the Cloud KMS CryptoKey Encrypter/Decrypter role to the Dataproc Service Agent service account. You can use the gcloud CLI to assign the role:

  ```
  gcloud projects add-iam-policy-binding KMS_PROJECT_ID \
      --member serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com \
      --role roles/cloudkms.cryptoKeyEncrypterDecrypter
  ```
Replace the following:

- KMS_PROJECT_ID: the ID of your Google Cloud project that runs Cloud KMS. This project can also be the project that runs Dataproc resources.
- PROJECT_NUMBER: the project number (not the project ID) of your Google Cloud project that runs Dataproc resources.

Enable the Cloud KMS API on the project that runs Dataproc resources.

If the Dataproc Service Agent role is not attached to the Dataproc Service Agent service account, add the serviceusage.services.use permission to the custom role attached to that service account. If the Dataproc Service Agent role is attached to the Dataproc Service Agent service account, you can skip this step.
You can use the Google Cloud CLI or the Dataproc API to set the key you created in Step 1 on a workflow. Once the key is set on a workflow, all the workflow job arguments and queries are encrypted using the key for any of the job types and arguments listed in WorkflowTemplate.EncryptionConfig.kmsKey.
gcloud CLI
Pass the resource ID of your key to the `--kms-key` flag when you create the workflow template with the `gcloud dataproc workflow-templates create` command.

Example:

```
gcloud dataproc workflow-templates create my-template-name \
    --region=region \
    --kms-key='projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name' \
    other arguments ...
```

You can verify the key setting with the `gcloud` command-line tool:

```
gcloud dataproc workflow-templates describe TEMPLATE_NAME \
    --region=REGION
```

Command output snippet:

```
...
id: my-template-name
encryptionConfig:
  kmsKey: projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME
...
```
REST API
Use WorkflowTemplate.EncryptionConfig.kmsKey as part of a workflowTemplates.create request.
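As a sketch, a `workflowTemplates.create` request with the key set might look as follows. Only the `id` and `encryptionConfig` fields are shown; a real request also requires the template's `placement` and `jobs` fields:

```shell
# Illustrative request body; placement and jobs are omitted for brevity.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
      "id": "my-template-name",
      "encryptionConfig": {
        "kmsKey": "projects/PROJECT_ID/locations/REGION/keyRings/KEY_RING_NAME/cryptoKeys/KEY_NAME"
      }
    }' \
    "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/regions/REGION/workflowTemplates"
```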
You can verify the key setting by issuing a workflowTemplates.get request. The returned JSON lists the `kmsKey`:

```
...
"id": "my-template-name",
"encryptionConfig": {
  "kmsKey": "projects/project-id/locations/region/keyRings/key-ring-name/cryptoKeys/key-name"
},
...
```
Cloud External Key Manager
Cloud External Key Manager (Cloud EKM) lets you protect Dataproc data using keys managed by a supported external key management partner. The steps you follow to use Cloud EKM in Dataproc are the same as those you use to set up CMEK keys, with the following difference: your key points to a URI for the externally managed key (see Cloud EKM Overview).
Cloud EKM errors
When you use Cloud EKM, an attempt to create a cluster can fail due to errors associated with inputs, Cloud EKM, the external key management partner system, or communications between EKM and the external system. If you use the REST API or the Google Cloud console, errors are logged in Logging. You can examine the failed cluster's errors from the View Log tab.