Manage pipeline configurations

This page describes ways you can manage configurations for deployed pipelines.

Before you begin

To follow this page, you need some background knowledge of compute profiles and pipeline performance.

Compute profile configuration

You can change the compute profile or customize the parameters of the default compute profile that runs the pipeline. For more information, see Manage compute profiles and Dataproc provisioner properties.

Pipeline configuration

For each pipeline, you can enable or disable instrumentation, such as timing metrics. Instrumentation is enabled by default. When it is enabled, Cloud Data Fusion generates metrics for each pipeline node every time you run the pipeline. The following metrics appear on the Metrics tab of each node. The exact set varies slightly between sources, transformations, and sinks.

  • Records out
  • Records in
  • Total number of errors
  • Records out per second
  • Min process time (one record)
  • Max process time (one record)
  • Standard deviation
  • Average processing time
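
To make the timing metrics concrete, the following minimal Python sketch derives the same kinds of values from a list of per-record processing times. It is an illustration only: the function name, the input shape, and the choice of population standard deviation are assumptions, and the sketch does not describe how Cloud Data Fusion computes its metrics internally.

```python
import statistics


def summarize_node(process_times_s, records_in, records_out, errors, elapsed_s):
    """Summarize hypothetical raw measurements for one pipeline node.

    process_times_s: per-record processing times in seconds (at least one).
    elapsed_s: wall-clock duration of the node's run in seconds.
    """
    if not process_times_s:
        raise ValueError("at least one processing time is required")
    return {
        "records_in": records_in,
        "records_out": records_out,
        "errors": errors,
        # Throughput is records emitted divided by elapsed wall-clock time.
        "records_out_per_second": records_out / elapsed_s if elapsed_s > 0 else 0.0,
        "min_process_time_s": min(process_times_s),
        "max_process_time_s": max(process_times_s),
        "avg_process_time_s": statistics.fmean(process_times_s),
        # Assumption: population standard deviation of the processing times.
        "stddev_process_time_s": statistics.pstdev(process_times_s),
    }


# Example with made-up measurements for a node that processed four records.
sample = summarize_node(
    [0.002, 0.004, 0.003, 0.003],
    records_in=4,
    records_out=4,
    errors=0,
    elapsed_s=2.0,
)
```

A large gap between the minimum and maximum process time, or a high standard deviation relative to the average, suggests that some records are much more expensive to process than others.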

We recommend that you keep instrumentation turned on, unless the environment is short on resources.

For streaming pipelines, you can also set the Batch interval (in seconds or minutes) for streaming data.

Engine configuration

Apache Spark is the default execution engine. You can pass custom parameters for Spark. For more information, see Parallel processing.
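
Custom Spark parameters are key-value pairs. As a rough sketch, the Python snippet below models such a set of pairs and performs a basic sanity check before you enter them. The property names are standard Apache Spark configuration keys, but the values and the `validate_spark_params` helper are hypothetical and chosen only for illustration. Which properties make sense depends on your pipeline.

```python
# Standard Apache Spark property names; the values are illustrative only.
custom_spark_params = {
    "spark.sql.shuffle.partitions": "200",
    "spark.default.parallelism": "64",
}


def validate_spark_params(params):
    """Return a copy of params after basic checks (hypothetical helper).

    Every key must start with "spark." and every value must be a
    non-empty string, because the pairs are entered as text.
    """
    for key, value in params.items():
        if not key.startswith("spark."):
            raise ValueError(f"not a Spark property name: {key!r}")
        if not isinstance(value, str) or not value:
            raise ValueError(f"value for {key!r} must be a non-empty string")
    return dict(params)


checked_params = validate_spark_params(custom_spark_params)
```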

Resources

You can specify the memory and number of CPUs for the Spark driver and executor. The driver orchestrates the Spark job, and the executor processes the data. For more information, see Resource management.
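
For orientation, the driver and executor settings correspond conceptually to four standard Apache Spark properties. The sketch below shows that correspondence in Python. The `SparkResources` class and the sample values are hypothetical, and the mapping is an assumption made for illustration rather than a description of what Cloud Data Fusion does internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkResources:
    """Hypothetical container for driver and executor sizing."""

    driver_memory_mb: int
    driver_cores: int
    executor_memory_mb: int
    executor_cores: int

    def to_spark_properties(self):
        """Express the sizing as standard Apache Spark properties."""
        return {
            "spark.driver.memory": f"{self.driver_memory_mb}m",
            "spark.driver.cores": str(self.driver_cores),
            "spark.executor.memory": f"{self.executor_memory_mb}m",
            "spark.executor.cores": str(self.executor_cores),
        }


# Illustrative values: a small driver and a larger executor.
resources = SparkResources(
    driver_memory_mb=2048,
    driver_cores=1,
    executor_memory_mb=4096,
    executor_cores=2,
)
resource_props = resources.to_spark_properties()
```

Because the executor does the data processing, it is usually the executor values that you adjust first when a pipeline runs out of memory or runs slowly.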

Pipeline alert

You can configure the pipeline to send alerts and start post-processing tasks after the pipeline run finishes. You create pipeline alerts when you design the pipeline. After you deploy the pipeline, you can view the alerts. To change alert settings, edit the pipeline. For more information, see Create alerts.

Transformation pushdown

You can enable Transformation pushdown if you want a pipeline to execute certain transformations in BigQuery. For more information, see the Transformation Pushdown overview.

What's next