Deploy and run pipelines

This page describes the basics of deploying and running pipelines in Cloud Data Fusion.

Deploy pipelines

After you finish designing and debugging a data pipeline and are satisfied with the data you see in Preview, you're ready to deploy the pipeline.

When you deploy the pipeline, the Cloud Data Fusion Studio creates the workflow and corresponding Apache Spark jobs in the background.

Run pipelines

After you deploy a pipeline, you can run it in the following ways (for a programmatic alternative, see the sketch after this list):

  • To run a pipeline on demand, open a deployed pipeline and click Run.
  • To schedule the pipeline to run at a certain time, open a deployed pipeline and click Schedule.
  • To trigger the pipeline when another pipeline completes, open a deployed pipeline and click Incoming triggers.

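You can also start a deployed batch pipeline outside the Studio through the instance's CDAP REST API. The following is a minimal sketch, not the only supported method: it assumes a batch pipeline deployed in the default namespace (batch pipelines expose a workflow named DataPipelineWorkflow), and PIPELINE_NAME is a placeholder for your pipeline's name.

# Look up the instance's API endpoint.
export CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe INSTANCE_NAME \
    --location=REGION_NAME \
    --format="value(apiEndpoint)")

# Start the pipeline's workflow on demand.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "${CDAP_ENDPOINT}/v3/namespaces/default/apps/PIPELINE_NAME/workflows/DataPipelineWorkflow/start"
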
The Pipeline Studio saves a pipeline's history each time it runs. You can toggle between different runtime versions of the pipeline.

If the pipeline has macros, set the runtime arguments for each macro. You can also review and change the pipeline configurations before running the deployed pipeline. You can see the status change during the phases of the pipeline run, such as Provisioning, Starting, Running, and Succeeded. You can also stop the pipeline at any time.
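
If you start the run through the REST API instead of the Studio, you can pass runtime arguments for macros as a JSON map in the body of the start request. A minimal sketch, reusing the CDAP_ENDPOINT and PIPELINE_NAME placeholders from the earlier example; the macro name input.path and its value are hypothetical:

# Each key in the JSON body sets the runtime argument (macro) with that name.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "${CDAP_ENDPOINT}/v3/namespaces/default/apps/PIPELINE_NAME/workflows/DataPipelineWorkflow/start" \
    -d '{ "input.path": "gs://my-bucket/raw/" }'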

If you enable instrumentation, you can explore the metrics generated by the pipeline by clicking Properties on any node in your pipeline, such as a source, transformation, or sink.
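
Those metrics can also be queried through the CDAP metrics endpoint. The sketch below is an assumption-heavy illustration: it reuses the placeholders from the earlier examples, and the metric name user.STAGE_NAME.records.out stands in for whatever metric a given node actually emits, which depends on the plugin.

# Query the records-out metric for one pipeline stage over the last hour.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "${CDAP_ENDPOINT}/v3/metrics/query?tag=namespace:default&tag=app:PIPELINE_NAME&metric=user.STAGE_NAME.records.out&start=now-1h&end=now"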

For more information about the pipeline runs, click Summary.
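
The run records that back the Summary view can also be fetched programmatically from the runs sub-resource of the pipeline's workflow. A minimal sketch, again assuming the placeholders from the earlier examples:

# Returns a JSON list of runs with their start time, end time, and status.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "${CDAP_ENDPOINT}/v3/namespaces/default/apps/PIPELINE_NAME/workflows/DataPipelineWorkflow/runs"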

View run records

After a pipeline run completes, you can view the run record. By default, you can view the last 30 days of run records. Cloud Data Fusion deletes them after that period. You can extend that period using the REST API.

REST API

To retain run records for more than 30 days, update the app.run.records.ttl options using the following command:

curl -X PATCH \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    'https://datafusion.googleapis.com/v1beta1/projects/PROJECT_NAME/locations/REGION_NAME/instances/INSTANCE_NAME?updateMask=options' \
    -d '{ "options": { "app.run.records.ttl.days": "DAYS", "app.run.records.ttl.frequency.hours": "HOURS" } }'

Replace the following:

  • PROJECT_NAME: the Google Cloud project name
  • REGION_NAME: the Cloud Data Fusion instance's region—for example, us-east4
  • INSTANCE_NAME: the Cloud Data Fusion instance ID
  • DAYS: the amount of time, in days, to retain run records for old pipeline runs—for example, 30
  • HOURS: the frequency, in hours, to check for and delete old run records—for example, 24

Example:

curl -X PATCH \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    'https://datafusion.googleapis.com/v1beta1/projects/project-1/locations/us-east4/instances/data-fusion-instance-1?updateMask=options' \
    -d '{ "options": { "app.run.records.ttl.days": "30", "app.run.records.ttl.frequency.hours": "24" } }'

What's next