When creating a Dataproc cluster, you can specify initialization actions in executables or scripts that Dataproc will run on all nodes in your Dataproc cluster immediately after the cluster is set up. Initialization actions often set up job dependencies, such as installing Python packages, so that jobs can be submitted to the cluster without having to install dependencies when the jobs are run.
You can find sample initialization action scripts at the following locations:
- GitHub repository
- Cloud Storage, in the regional gs://goog-dataproc-initialization-actions-REGION public buckets

Note: Google does not support these samples.
Important considerations and guidelines
Don't create production clusters that reference initialization actions located in the gs://goog-dataproc-initialization-actions-REGION public buckets. These scripts are provided as reference implementations. They are synchronized with ongoing GitHub repository changes, and updates to these scripts can break your cluster creation. Instead, copy the initialization action from the public bucket into a versioned Cloud Storage bucket folder, as shown in the following example:

    REGION=COMPUTE_REGION
    gcloud storage cp gs://goog-dataproc-initialization-actions-${REGION}/cloud-sql-proxy/cloud-sql-proxy.sh \
        gs://my-bucket/cloud-sql-proxy/v1.0/cloud-sql-proxy.sh

Then, create the cluster by referencing the copy in Cloud Storage:

    gcloud dataproc clusters create CLUSTER_NAME \
        --region=${REGION} \
        --initialization-actions=gs://my-bucket/cloud-sql-proxy/v1.0/cloud-sql-proxy.sh \
        ...other flags...
Initialization actions are executed on each node in series during cluster creation. They are also executed on each added node when scaling or autoscaling clusters up.
When you update initialization actions—for example, when you sync your Cloud Storage initialization actions to changes made to public bucket or GitHub repository initialization actions—create a new (preferably version-named) folder to receive the updated initialization actions. If, instead, you update the initialization action in place, new nodes, such as those added by the autoscaler, will run the updated-in-place initialization action, not the prior-version initialization action that ran on existing nodes. Such initialization action differences can result in inconsistent or broken cluster nodes.
Initialization actions run as the root user. You do not need to use sudo.
Use absolute paths in initialization actions.
Use a shebang line in initialization actions to indicate how the script should be interpreted (such as #!/bin/bash or #!/usr/bin/python).
If an initialization action terminates with a non-zero exit code, the cluster create operation will report an "ERROR" status. To debug the initialization action, use SSH to connect to the cluster's VM instances, and then examine the logs. After fixing the initialization action problem, you can delete, then re-create the cluster.
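The guidelines above (a shebang line, absolute paths, and a non-zero exit code on failure) can be combined into a small script skeleton. The following is a minimal sketch rather than an official template: the log and run_step helper names and the package in the usage comment are hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# Sketch of an initialization action skeleton (helper names are hypothetical).
# Strict mode: stop on errors, on unset variables, and on failed pipeline
# stages, so that any failure surfaces as a non-zero exit code.
set -euo pipefail

# Print a timestamped message to standard output.
log() {
  echo "[init-action] $(date -u +%Y-%m-%dT%H:%M:%SZ) $*"
}

# Run one named step and return that step's own exit code on failure.
run_step() {
  local name="$1"
  shift
  log "starting: ${name}"
  if "$@"; then
    log "finished: ${name}"
  else
    local rc=$?
    log "FAILED (exit ${rc}): ${name}" >&2
    return "${rc}"
  fi
}

# Example usage inside an initialization action (absolute path, no sudo):
#   run_step "install pip" /usr/bin/apt-get install -y python3-pip
```

Because of the strict mode setting, a failing run_step call at the top level of the script ends the script with the step's exit code, which in turn causes the cluster create operation to report an error.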
If you create a Dataproc cluster with internal IP addresses only, attempts to access github.com over the internet in an initialization action will fail unless you have configured routes to direct the traffic through Cloud NAT or a Cloud VPN. Without access to the internet, you can enable Private Google Access and place job dependencies in Cloud Storage; cluster nodes can download the dependencies from Cloud Storage from internal IPs.
You can use Dataproc custom images instead of initialization actions to set up job dependencies.
Initialization processing:
- Pre-2.0 image clusters:
  - Master: To allow initialization actions that run on masters to write files to HDFS, master node initialization actions do not start until HDFS is writeable (until HDFS has exited safemode and at least two HDFS DataNodes have joined).
  - Worker: If you set the dataproc:dataproc.worker.custom.init.actions.mode cluster property to RUN_BEFORE_SERVICES, each worker runs its initialization actions before it starts its HDFS datanode and YARN nodemanager daemons. Since Dataproc does not run master initialization actions until HDFS is writeable, which requires 2 HDFS datanode daemons to be running, setting this property may increase cluster creation time.
- 2.0+ image clusters:
  - Master: Master node initialization actions may run before HDFS is writeable. If you run initialization actions that stage files in HDFS or depend on the availability of HDFS-dependent services, such as Ranger, set the dataproc.master.custom.init.actions.mode cluster property to RUN_AFTER_SERVICES. Note: Because this property setting can increase cluster creation time (see the explanation of the cluster creation delay for pre-2.0 image cluster workers), use it only when necessary. As a general practice, rely on the default RUN_BEFORE_SERVICES setting for this property.
  - Worker: The dataproc:dataproc.worker.custom.init.actions.mode cluster property is set to RUN_BEFORE_SERVICES and cannot be passed to the cluster when the cluster is created (you cannot change the property setting). Each worker runs its initialization actions before it starts its HDFS datanode and YARN nodemanager daemons. Since Dataproc does not wait for HDFS to be writeable before running master initialization actions, master and worker initialization actions run in parallel.
Recommendations:
- Use metadata to determine a node's role to conditionally execute an initialization action on nodes (see Using cluster metadata).
- Fork a copy of an initialization action to a Cloud Storage bucket for stability (see How initialization actions are used).
- Add retries when you download from the internet to help stabilize the initialization action.
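The retry recommendation above can be sketched as a small helper that re-runs a command with a doubling delay between attempts. This is one possible approach, not a Dataproc-provided utility: the retry function name, the attempt count, the delay values, and the URL in the usage comment are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after each failed attempt.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s: $*" >&2
    sleep "${delay}"
    attempt=$(( attempt + 1 ))
    delay=$(( delay * 2 ))
  done
}

# Example usage in an initialization action (placeholder URL and paths):
#   retry 5 2 /usr/bin/curl -fsSL -o /tmp/dependency.tar.gz \
#       https://example.com/dependency.tar.gz
```

Returning a non-zero status after the final attempt lets the calling script decide whether to fail the initialization action or continue without the download.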
Using initialization actions
Cluster initialization actions can be specified regardless of how you create a cluster:
- Through the Google Cloud console
- Using the gcloud CLI
- Programmatically with the Dataproc clusters.create API (see NodeInitializationAction)
gcloud command
When creating a cluster with the gcloud dataproc clusters create command, specify one or more comma-separated Cloud Storage locations (URIs) of the initialization executables or scripts with the --initialization-actions flag. Note: Multiple consecutive "/"s in a Cloud Storage location URI after the initial "gs://", such as "gs://bucket/my//object//name", are not supported. Run gcloud dataproc clusters create --help for command information.

    gcloud dataproc clusters create cluster-name \
        --region=${REGION} \
        --initialization-actions=Cloud Storage URI(s) (gs://bucket/...) \
        --initialization-action-timeout=timeout-value (default=10m) \
        ... other flags ...
- Use the --initialization-action-timeout flag to specify a timeout period for the initialization action. The default timeout value is 10 minutes. If the initialization executable or script has not completed by the end of the timeout period, Dataproc cancels the initialization action.
- Use the dataproc:dataproc.worker.custom.init.actions.mode cluster property to run the initialization action on primary workers before the node manager and datanode daemons are started.
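To illustrate how the timeout flag and the worker mode property might be combined, the sketch below assembles a cluster creation command and prints it for review instead of running it. The build_create_command function name and every argument value (cluster name, region, bucket path, 20m timeout) are placeholders. It also assumes the property is passed with the gcloud --properties flag, and it applies only to pre-2.0 image clusters, since the property cannot be set on 2.0+ images.

```shell
#!/bin/bash
# Sketch: assemble (but do not run) a cluster creation command.
# The function name and all argument values are hypothetical placeholders.
build_create_command() {
  local cluster="$1" region="$2" init_uri="$3" timeout="$4"
  # The worker mode property below is only accepted on pre-2.0 image clusters.
  local cmd=(
    gcloud dataproc clusters create "${cluster}"
    "--region=${region}"
    "--initialization-actions=${init_uri}"
    "--initialization-action-timeout=${timeout}"
    "--properties=dataproc:dataproc.worker.custom.init.actions.mode=RUN_BEFORE_SERVICES"
  )
  # Print the assembled command as a single line for review.
  echo "${cmd[*]}"
}

build_create_command my-cluster us-central1 gs://my-bucket/init/v1.0/setup.sh 20m
```

Printing the command first makes it easy to check the flags before creating a cluster.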
REST API
Specify one or more scripts or executables in a ClusterConfig.initializationActions array as part of a clusters.create API request.
Example
    POST /v1/projects/my-project-id/regions/us-central1/clusters/
    {
      "projectId": "my-project-id",
      "clusterName": "example-cluster",
      "config": {
        "configBucket": "",
        "gceClusterConfig": {
          "subnetworkUri": "default",
          "zoneUri": "us-central1-b"
        },
        "masterConfig": {
          "numInstances": 1,
          "machineTypeUri": "n1-standard-4",
          "diskConfig": {
            "bootDiskSizeGb": 500,
            "numLocalSsds": 0
          }
        },
        "workerConfig": {
          "numInstances": 2,
          "machineTypeUri": "n1-standard-4",
          "diskConfig": {
            "bootDiskSizeGb": 500,
            "numLocalSsds": 0
          }
        },
        "initializationActions": [
          {
            "executableFile": "gs://cloud-example-bucket/my-init-action.sh"
          }
        ]
      }
    }
Passing arguments to initialization actions
Dataproc sets special metadata values for the instances that run in your clusters. You can set your own custom metadata as a way to pass arguments to initialization actions.
    gcloud dataproc clusters create cluster-name \
        --region=${REGION} \
        --initialization-actions=Cloud Storage URI(s) (gs://bucket/...) \
        --metadata=name1=value1,name2=value2... \
        ... other flags ...
Metadata values can be read within initialization actions as follows:
var1=$(/usr/share/google/get_metadata_value attributes/name1)
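When a metadata key is optional, a script may want a fallback value. The sketch below wraps the metadata lookup shown above in a helper that returns a default when the key is missing or empty. The get_metadata_or_default function name and the METADATA_HELPER override variable are hypothetical additions for illustration and off-cluster testing. The sketch also assumes that the lookup command either fails or prints nothing when the key is absent.

```shell
#!/bin/bash
# Hypothetical helper: print the value of a custom metadata key, or a default
# when the lookup fails or returns an empty string.
get_metadata_or_default() {
  local key="$1" default="$2" value
  # METADATA_HELPER is an illustrative override for testing off-cluster;
  # on a Dataproc node the standard helper path is used.
  local helper="${METADATA_HELPER:-/usr/share/google/get_metadata_value}"
  if value="$("${helper}" "attributes/${key}" 2>/dev/null)" && [[ -n "${value}" ]]; then
    echo "${value}"
  else
    echo "${default}"
  fi
}

# Read name1, falling back to a placeholder default when it was not set
# with --metadata at cluster creation.
var1="$(get_metadata_or_default name1 default-value)"
echo "name1=${var1}"
```

Keeping the default in the script means the same initialization action can be reused by clusters that do not pass the metadata key.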
Node selection
If you want to limit initialization actions to master, driver or worker nodes, you can add simple node-selection logic to your executable or script.
    ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
    if [[ "${ROLE}" == 'Master' ]]; then
      ... master specific actions ...
    elif [[ "${ROLE}" == 'Driver' ]]; then
      ... driver specific actions ...
    else
      ... worker specific actions ...
    fi
Staging binaries
A common cluster initialization scenario is the staging of job binaries on a cluster to eliminate the need to stage the binaries each time a job is submitted. For example, assume that the following initialization script is stored in gs://my-bucket/download-job-jar.sh, a Cloud Storage bucket location:
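    #!/bin/bash
    # Determine this node's role from the Dataproc-provided metadata.
    ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role)
    if [[ "${ROLE}" == 'Master' ]]; then
      # Stage the job jar on the master node only (absolute destination path).
      gcloud storage cp gs://my-bucket/jobs/sessionalize-logs-1.0.jar /home/username
    fi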
The location of this script can be passed to the gcloud dataproc clusters create command:

    gcloud dataproc clusters create my-dataproc-cluster \
        --region=${REGION} \
        --initialization-actions=gs://my-bucket/download-job-jar.sh
Dataproc will run this script on all nodes, and, as a consequence of the script's node-selection logic, will download the jar to the master node. Submitted jobs can then use the pre-staged jar:
    gcloud dataproc jobs submit hadoop \
        --cluster=my-dataproc-cluster \
        --region=${REGION} \
        --jar=file:///home/username/sessionalize-logs-1.0.jar
Initialization actions samples
Frequently used and other sample initialization action scripts are located in gs://goog-dataproc-initialization-actions-<REGION>, the regional public Cloud Storage buckets, and in a GitHub repository. To contribute a script, review the CONTRIBUTING.md document, and then file a pull request.
Logging
Output from the execution of each initialization action is logged for each instance in /var/log/dataproc-initialization-script-X.log, where X is the zero-based index of each successive initialization action script. For example, if your cluster has two initialization actions, the outputs will be logged in /var/log/dataproc-initialization-script-0.log and /var/log/dataproc-initialization-script-1.log.
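After connecting to a node over SSH, the per-script log files described above can be reviewed in one pass. The following is a minimal sketch: the show_init_logs function name and its optional directory and line-count arguments are hypothetical conveniences, with the directory defaulting to /var/log.

```shell
#!/bin/bash
# Hypothetical helper: print the last lines of every initialization action log
# in a directory (defaults: /var/log, 20 lines per log).
show_init_logs() {
  local log_dir="${1:-/var/log}" lines="${2:-20}" log found=0
  for log in "${log_dir}"/dataproc-initialization-script-*.log; do
    # Skip the unexpanded pattern when no log files match.
    [[ -e "${log}" ]] || continue
    found=1
    echo "==> ${log} <=="
    tail -n "${lines}" "${log}"
  done
  if (( found == 0 )); then
    echo "no initialization action logs found in ${log_dir}" >&2
    return 1
  fi
}

# Example usage on a cluster node:
#   show_init_logs /var/log 50
```

The end of each log is usually the most useful place to start when an initialization action exited with a non-zero code.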
What's Next
Explore GitHub initialization actions.