You set up and run a workflow by:
- Creating a workflow template
- Configuring a managed (ephemeral) cluster or selecting an existing cluster
- Adding jobs
- Instantiating the template to run the workflow
Creating a template
gcloud CLI
Run the following command to create a Dataproc workflow template resource.
gcloud dataproc workflow-templates create TEMPLATE_ID \
    --region=REGION
Notes:
- REGION: Specify the region where your template will run.
- TEMPLATE_ID: Provide an ID for your template, such as "workflow-template-1".
- CMEK encryption: You can add the --kms-key flag to use CMEK encryption on workflow template job arguments.
REST API
Submit a WorkflowTemplate as part of a workflowTemplates.create request. You can add the WorkflowTemplate.EncryptionConfig.kmsKey field to use CMEK encryption on workflow template job arguments.
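As a rough illustration of the request described above, the following Python sketch assembles a URL and a minimal JSON body for workflowTemplates.create. The project ID "my-project", the region, and the helper name create_request are assumptions made for this example. The body carries only the template ID and the optional encryptionConfig.kmsKey field; placement and jobs are omitted here, and authentication and the HTTP call itself are not shown.

```python
import json

API_ROOT = "https://dataproc.googleapis.com/v1"


def create_request(project_id, region, template_id, kms_key=None):
    """Return (url, body) for a workflowTemplates.create call (hypothetical helper)."""
    url = f"{API_ROOT}/projects/{project_id}/regions/{region}/workflowTemplates"
    body = {"id": template_id}
    if kms_key:
        # Optional CMEK key used to encrypt workflow template job arguments.
        body["encryptionConfig"] = {"kmsKey": kms_key}
    return url, body


# "my-project" and "us-central1" are placeholder values.
url, body = create_request("my-project", "us-central1", "workflow-template-1")
print("POST", url)
print(json.dumps(body, indent=2))
```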
Console
You can view existing workflow templates and instantiated workflows from the Dataproc Workflows page in Google Cloud console.
Configuring or selecting a cluster
Dataproc can run your workflow on a new "managed" cluster that it creates for the workflow, or on an existing cluster.
Existing cluster: See Using cluster selectors with workflows to select an existing cluster for your workflow.
Managed cluster: You must configure a managed cluster for your workflow. Dataproc will create this new cluster to run workflow jobs, then delete the cluster at the end of the workflow.
You can configure a managed cluster for your workflow using the gcloud command-line tool or the Dataproc API.
gcloud command
Use flags inherited from gcloud dataproc clusters create to configure the managed cluster, such as the number of workers and the master and worker machine types. Dataproc will add a suffix to the cluster name to ensure uniqueness. You can use the --service-account flag to specify a VM service account for the managed cluster.
gcloud dataproc workflow-templates set-managed-cluster TEMPLATE_ID \
    --region=REGION \
    --master-machine-type=MACHINE_TYPE \
    --worker-machine-type=MACHINE_TYPE \
    --num-workers=NUMBER \
    --cluster-name=CLUSTER_NAME \
    --service-account=SERVICE_ACCOUNT
REST API
See WorkflowTemplatePlacement.ManagedCluster, which you can provide as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
You can use the GceClusterConfig.serviceAccount field to specify a VM service account for the managed cluster.
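For the REST route, the sketch below shows one way the placement portion of a WorkflowTemplate might be assembled for a managed cluster. The helper name managed_cluster_placement and all sample values (cluster name, machine types, service account) are assumptions for illustration; only a handful of cluster settings are included.

```python
import json


def managed_cluster_placement(cluster_name, master_type, worker_type,
                              num_workers, service_account=None):
    """Build a WorkflowTemplatePlacement dict with a managedCluster (hypothetical helper)."""
    config = {
        "masterConfig": {"machineTypeUri": master_type},
        "workerConfig": {
            "machineTypeUri": worker_type,
            "numInstances": num_workers,
        },
    }
    if service_account:
        # VM service account for the managed cluster.
        config["gceClusterConfig"] = {"serviceAccount": service_account}
    return {"managedCluster": {"clusterName": cluster_name, "config": config}}


# All values below are placeholders.
placement = managed_cluster_placement(
    cluster_name="my-managed-cluster",
    master_type="n1-standard-4",
    worker_type="n1-standard-4",
    num_workers=2,
    service_account="workflow-sa@my-project.iam.gserviceaccount.com",
)
print(json.dumps(placement, indent=2))
```

The resulting dict would sit under the placement key of the template submitted with workflowTemplates.create or workflowTemplates.update.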
Adding jobs to a template
All jobs run concurrently unless you specify one or more job dependencies. A
job's dependencies are expressed as a list of other jobs that must finish
successfully before the dependent job can start. You must provide a step-id
for each job. The ID must be unique within the workflow, but does not need to be
unique globally.
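To make the dependency rule concrete, the following Python sketch groups jobs into "waves" that could start together, using the standard library's graphlib module (Python 3.9 or later). It is a simplified model for illustration, not how Dataproc schedules jobs internally: a real job can start as soon as its own prerequisites finish rather than waiting for a whole wave. The step IDs "foo", "bar", and "baz" mirror the gcloud examples in this section; "qux" is a hypothetical extra job with no dependencies.

```python
from graphlib import TopologicalSorter

# Map each step ID to the step IDs that must finish successfully first.
dependencies = {
    "foo": [],
    "qux": [],               # hypothetical job with no dependencies
    "bar": ["foo"],
    "baz": ["foo", "bar"],
}


def execution_waves(deps):
    """Return lists of step IDs; jobs in the same list have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Here "foo" and "qux" can run concurrently, "bar" waits for "foo", and "baz" waits for both "foo" and "bar".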
gcloud command
Use the job type and flags inherited from gcloud dataproc jobs submit to define
the job to add to the template. You can optionally use the --start-after flag,
set to the step ID of one or more other workflow jobs, to have the job start
after those jobs complete.
Examples:
Add Hadoop job "foo" to the "my-workflow" template.
gcloud dataproc workflow-templates add-job hadoop \
    --region=REGION \
    --step-id=foo \
    --workflow-template=my-workflow \
    -- space separated job args
Add job "bar" to the "my-workflow" template, which will be run after workflow job "foo" has completed successfully.
gcloud dataproc workflow-templates add-job JOB_TYPE \
    --region=REGION \
    --step-id=bar \
    --start-after=foo \
    --workflow-template=my-workflow \
    -- space separated job args
Add another job "baz" to the "my-workflow" template to be run after the successful completion of both "foo" and "bar" jobs.
gcloud dataproc workflow-templates add-job JOB_TYPE \
    --region=REGION \
    --step-id=baz \
    --start-after=foo,bar \
    --workflow-template=my-workflow \
    -- space separated job args
REST API
See WorkflowTemplate.OrderedJob. This field is provided as part of a completed WorkflowTemplate submitted with a workflowTemplates.create or workflowTemplates.update request.
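The sketch below shows what a list of OrderedJob entries might look like as Python dicts, along with a small local check for the two rules stated in this section: step IDs must be unique within the workflow, and each prerequisite should name another job in the same workflow. The validate_jobs helper, the bucket path, and the job arguments are hypothetical; the service performs its own validation, which this check does not replace.

```python
def validate_jobs(jobs):
    """Return a list of problems found in OrderedJob-style dicts (hypothetical helper)."""
    problems = []
    seen = set()
    for job in jobs:
        step_id = job.get("stepId")
        if not step_id:
            problems.append("job is missing stepId")
        elif step_id in seen:
            problems.append(f"duplicate stepId: {step_id}")
        else:
            seen.add(step_id)
    for job in jobs:
        for dep in job.get("prerequisiteStepIds", []):
            if dep not in seen:
                problems.append(f"{job.get('stepId')} depends on unknown step: {dep}")
    return problems


# The jar location and arguments are placeholders.
hadoop_job = {"mainJarFileUri": "gs://my-bucket/my-job.jar", "args": ["arg1", "arg2"]}
jobs = [
    {"stepId": "foo", "hadoopJob": hadoop_job},
    {"stepId": "bar", "hadoopJob": hadoop_job, "prerequisiteStepIds": ["foo"]},
    {"stepId": "baz", "hadoopJob": hadoop_job, "prerequisiteStepIds": ["foo", "bar"]},
]
problems = validate_jobs(jobs)
print(problems or "no problems found")
```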
Running a workflow
Instantiating a workflow template runs the workflow defined by the template. A template can be instantiated multiple times, so you can run the same workflow repeatedly.
gcloud command
gcloud dataproc workflow-templates instantiate TEMPLATE_ID \
    --region=REGION
The command returns an operation ID, which you can use to track workflow status.
Example command and output:
gcloud beta dataproc workflow-templates instantiate my-template-id \
    --region=us-central1
...
WorkflowTemplate [my-template-id] RUNNING
...
Created cluster: my-template-id-rg544az7mpbfa.
Job ID teragen-rg544az7mpbfa RUNNING
Job ID teragen-rg544az7mpbfa COMPLETED
Job ID terasort-rg544az7mpbfa RUNNING
Job ID terasort-rg544az7mpbfa COMPLETED
Job ID teravalidate-rg544az7mpbfa RUNNING
Job ID teravalidate-rg544az7mpbfa COMPLETED
...
Deleted cluster: my-template-id-rg544az7mpbfa.
WorkflowTemplate [my-template-id] DONE
REST API
See workflowTemplates.instantiate.
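As a sketch of the REST side of instantiation, the Python below builds the workflowTemplates.instantiate URL and extracts an operation ID from an operation resource name. The project ID and the sample operation name are made up for this example, and the name layout (projects/.../regions/.../operations/ID) is an assumption to verify against a real response. Authentication and the POST request itself are not shown.

```python
API_ROOT = "https://dataproc.googleapis.com/v1"


def instantiate_url(project_id, region, template_id):
    """Return the URL to POST to in order to instantiate a template."""
    name = f"projects/{project_id}/regions/{region}/workflowTemplates/{template_id}"
    return f"{API_ROOT}/{name}:instantiate"


def operation_id(operation_name):
    """Return the last path segment of an operation resource name."""
    return operation_name.rsplit("/", 1)[-1]


# Placeholder values; the operation name is invented for illustration.
url = instantiate_url("my-project", "us-central1", "my-template-id")
sample_name = "projects/my-project/regions/us-central1/operations/example-operation-id"
print("POST", url)
print("operation ID:", operation_id(sample_name))
```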
Workflow job failures
A failure in any job in a workflow will cause the workflow to fail. Dataproc will seek to mitigate the effect of failures by causing all concurrently executing jobs to fail and preventing subsequent jobs from starting.
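The behavior described above can be modeled in a few lines of Python. This is an illustrative model only, not Dataproc's implementation: the state names ("DONE", "RUNNING", "PENDING", "FAILED", "NOT_STARTED") and the apply_failure helper are invented for the example, and "qux" is a hypothetical job running alongside "bar".

```python
def apply_failure(states, failed_step):
    """Return the job states after failed_step fails, per the rule described above."""
    outcome = {}
    for step, state in states.items():
        if step == failed_step or state == "RUNNING":
            # The failing job, plus every job executing concurrently with it.
            outcome[step] = "FAILED"
        elif state == "PENDING":
            # Jobs that have not started yet are prevented from starting.
            outcome[step] = "NOT_STARTED"
        else:
            # Jobs that already finished keep their state.
            outcome[step] = state
    return outcome


states = {"foo": "DONE", "bar": "RUNNING", "qux": "RUNNING", "baz": "PENDING"}
outcome = apply_failure(states, "bar")
for step, state in outcome.items():
    print(step, state)
```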
Monitoring and listing a workflow
gcloud command
To monitor a workflow:
gcloud dataproc operations describe OPERATION_ID \
    --region=REGION
Note: The OPERATION_ID is returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Running a workflow).
To list workflow status:
gcloud dataproc operations list \
    --region=REGION \
    --filter="labels.goog-dataproc-operation-type=WORKFLOW AND status.state=RUNNING"
REST API
To monitor a workflow, use the Dataproc operations.get API.
To list running workflows, use the Dataproc operations.list API with a label filter.
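A minimal sketch of the label filter for operations.list follows, reusing the filter expression from the gcloud example in this section. The helper name list_workflows_url and the project ID are assumptions, and the URL layout should be checked against the operations.list reference. The filter is URL-encoded with the standard library; authentication and the GET request are not shown.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ROOT = "https://dataproc.googleapis.com/v1"


def list_workflows_url(project_id, region, state="RUNNING"):
    """Return an operations.list URL filtered to workflow operations in a given state."""
    workflow_filter = (
        f"labels.goog-dataproc-operation-type=WORKFLOW AND status.state={state}"
    )
    base = f"{API_ROOT}/projects/{project_id}/regions/{region}/operations"
    return f"{base}?{urlencode({'filter': workflow_filter})}"


# "my-project" and "us-central1" are placeholder values.
url = list_workflows_url("my-project", "us-central1")
print("GET", url)

# Decode the query string again to confirm the filter survived encoding.
decoded_filter = parse_qs(urlsplit(url).query)["filter"][0]
print(decoded_filter)
```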
Terminating a workflow
You can end a workflow using the Google Cloud CLI or by calling the Dataproc API.
gcloud command
gcloud dataproc operations cancel OPERATION_ID \
    --region=REGION
Note: The OPERATION_ID is returned when you instantiate the workflow with gcloud dataproc workflow-templates instantiate (see Running a workflow).
REST API
See the operations.cancel API.
Updating a workflow template
Updates do not affect running workflows. The new template version will only apply to new workflows.
gcloud command
Workflow templates can be updated by issuing new gcloud dataproc workflow-templates
commands that reference an existing workflow template ID, such as the set-managed-cluster and add-job commands shown earlier on this page.
REST API
To make an update to a template with the REST API:
- Call workflowTemplates.get, which returns the current template with the version field filled in with the current server version.
- Make updates to the fetched template.
- Call workflowTemplates.update with the updated template.
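The three steps above form a read-modify-write cycle keyed on the version field. The Python sketch below imitates that cycle against an in-memory stand-in so the flow can be followed without calling the service. FakeTemplateStore is hypothetical, and its version numbering and error type are invented for illustration; it assumes that an update carrying a stale version is rejected, and the real API's responses will differ in detail.

```python
import copy


class FakeTemplateStore:
    """In-memory stand-in for the template service (hypothetical, for illustration)."""

    def __init__(self, template):
        self._template = dict(copy.deepcopy(template), version=1)

    def get(self):
        # Like workflowTemplates.get: returns the template with its current version.
        return copy.deepcopy(self._template)

    def update(self, template):
        # Like workflowTemplates.update: accept only the current version.
        if template.get("version") != self._template["version"]:
            raise ValueError("version mismatch: fetch the template again")
        stored = copy.deepcopy(template)
        stored["version"] += 1
        self._template = stored
        return copy.deepcopy(stored)


store = FakeTemplateStore({"id": "my-workflow", "jobs": []})

template = store.get()                       # 1. fetch the current template
template["jobs"].append({"stepId": "foo"})   # 2. modify the fetched copy
updated = store.update(template)             # 3. send the updated template back

# Reusing the now-stale copy is rejected in this model.
try:
    store.update(template)
    stale_rejected = False
except ValueError:
    stale_rejected = True

print(updated["version"], stale_rejected)
```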
Deleting a workflow template
gcloud command
gcloud dataproc workflow-templates delete TEMPLATE_ID \
    --region=REGION
REST API
See workflowTemplates.delete.