Use Dataflow Runner v2

When you use Dataflow to run your pipeline, the Dataflow runner uploads your pipeline code and dependencies to a Cloud Storage bucket and creates a Dataflow job. This Dataflow job runs your pipeline on managed resources in Google Cloud.
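
For orientation, the following minimal sketch submits a pipeline to Dataflow with the Apache Beam Python SDK. The project, region, and bucket names are placeholders for your own resources.

    # A minimal sketch of submitting a pipeline to Dataflow with the
    # Apache Beam Python SDK. Project, region, and bucket are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',                 # placeholder project ID
        region='us-central1',                 # placeholder region
        temp_location='gs://my-bucket/temp',  # placeholder staging bucket
    )

    # Dataflow uploads the pipeline code and its dependencies, then creates
    # and runs the job on managed Google Cloud resources.
    with beam.Pipeline(options=options) as p:
        p | beam.Create(['hello', 'dataflow']) | beam.Map(print)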

  • For batch pipelines that use the Apache Beam Java SDK versions 2.54.0 or later, Runner v2 is enabled by default.
  • For pipelines that use the Apache Beam Java SDK, Runner v2 is required for multi-language pipelines, pipelines that use custom containers, and Spanner or Bigtable change stream pipelines. In other cases, use the default runner.
  • For pipelines that use the Apache Beam Python SDK versions 2.21.0 or later, Runner v2 is enabled by default. For pipelines that use the Apache Beam Python SDK versions 2.45.0 and later, Dataflow Runner v2 is the only Dataflow runner available.
  • For the Apache Beam SDK for Go, Dataflow Runner v2 is the only Dataflow runner available.

Runner v2 uses a services-based architecture that benefits some pipelines.

Limitations and restrictions

Dataflow Runner v2 has the following requirements and limitations:

  • Dataflow Runner v2 requires Streaming Engine for streaming jobs.
  • Because Dataflow Runner v2 requires Streaming Engine for streaming jobs, any Apache Beam transform that requires Runner v2 also requires Streaming Engine for streaming jobs. For example, the Pub/Sub Lite I/O connector for the Apache Beam SDK for Python is a cross-language transform that requires Runner v2 (a sketch follows this list). If you try to disable Streaming Engine for a job or template that uses this transform, the job fails.
  • For streaming pipelines that use the Apache Beam Java SDK, the classes MapState and SetState are not supported with Runner v2. To use the MapState and SetState classes with Java pipelines, enable Streaming Engine, disable Runner v2, and use the Apache Beam SDK version 2.58.0 or later.
  • For batch and streaming pipelines that use the Apache Beam Java SDK, the classes OrderedListState and AfterSynchronizedProcessingTime are not supported.
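
To make the Streaming Engine requirement concrete, here is a minimal sketch of a Python streaming pipeline that reads with the cross-language Pub/Sub Lite connector. All resource paths are placeholders, and the option and parameter names follow the Beam Python SDK; treat this as a sketch, not a definitive implementation.

    # A minimal sketch of a streaming pipeline that uses the cross-language
    # Pub/Sub Lite connector, which requires Runner v2 and therefore
    # Streaming Engine. All resource names are placeholders.
    import apache_beam as beam
    from apache_beam.io.gcp.pubsublite import ReadFromPubSubLite
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',                 # placeholder
        region='us-central1',                 # placeholder
        temp_location='gs://my-bucket/temp',  # placeholder
        streaming=True,
        enable_streaming_engine=True,  # disabling this makes the job fail
    )

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> ReadFromPubSubLite(
             subscription_path='projects/my-project/locations/'
                               'us-central1-a/subscriptions/my-sub')
         | 'Show' >> beam.Map(print))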

Enable Dataflow Runner v2

To enable Dataflow Runner v2, follow the configuration instructions for your Apache Beam SDK.

Java

Dataflow Runner v2 requires the Apache Beam Java SDK version 2.30.0 or later. Version 2.44.0 or later is recommended.

For batch pipelines that use the Apache Beam Java SDK versions 2.54.0 or later, Runner v2 is enabled by default.

To enable Runner v2, run your job with the --experiments=use_runner_v2 flag.

Some pipelines are automatically opted in to Runner v2. To disable Runner v2 and prevent your pipeline from using it, use the --experiments=disable_runner_v2 pipeline option.

Python

For pipelines that use the Apache Beam Python SDK versions 2.21.0 or later, Runner v2 is enabled by default.

Dataflow Runner v2 isn't supported with the Apache Beam Python SDK versions 2.20.0 and earlier.

In some cases, your pipeline might not use Runner v2 even though the pipeline runs on a supported SDK version. In such cases, to run the job with Runner v2, use the --experiments=use_runner_v2 flag.

To disable Runner v2 when your job is identified as using the auto_runner_v2 experiment, use the --experiments=disable_runner_v2 flag. Disabling Runner v2 isn't supported with the Apache Beam Python SDK versions 2.45.0 and later.
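
As an illustration, the sketch below sets the experiment through pipeline options rather than on the command line; the two approaches are equivalent. The project, region, and bucket names are placeholders.

    # A minimal sketch that opts a Python pipeline in to Runner v2 through
    # the experiments option, equivalent to passing
    # --experiments=use_runner_v2 on the command line.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',                 # placeholder
        region='us-central1',                 # placeholder
        temp_location='gs://my-bucket/temp',  # placeholder
        experiments=['use_runner_v2'],
        # On SDK versions earlier than 2.45.0, ['disable_runner_v2'] opts out.
    )

    with beam.Pipeline(options=options) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)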

Go

Dataflow Runner v2 is the only Dataflow runner available for the Apache Beam SDK for Go. Runner v2 is enabled by default.

Monitor your job

Use the monitoring interface to view Dataflow job metrics, such as memory utilization and CPU utilization.

Worker VM logs are available through the Logs Explorer and the Dataflow monitoring interface. Worker VM logs include logs from the runner harness process and logs from the SDK processes. You can use the VM logs to troubleshoot your job.

Troubleshoot Runner v2

To troubleshoot jobs using Dataflow Runner v2, follow standard pipeline troubleshooting steps. The following list provides additional information about how Dataflow Runner v2 works:

  • Dataflow Runner v2 jobs run two types of processes on the worker VM: the SDK process and the runner harness process. Depending on the pipeline and VM type, there might be one or more SDK processes, but there is only one runner harness process per VM.
  • SDK processes run user code and other language-specific functions. The runner harness process manages everything else.
  • The runner harness process waits for all SDK processes to connect to it before starting to request work from Dataflow.
  • Jobs might be delayed if the worker VM downloads and installs dependencies during SDK process startup. If issues occur in an SDK process, such as during startup or while installing libraries, the worker reports its status as unhealthy. If startup times increase, enable the Cloud Build API on your project and submit your pipeline with the --prebuild_sdk_container_engine=cloud_build parameter (see the sketch after this list).
  • Because Dataflow Runner v2 uses checkpointing, each worker might wait for up to five seconds while buffering changes before sending them for further processing. As a result, a latency of approximately six seconds is expected.
  • To diagnose problems in your user code, examine the worker logs from the SDK processes. If you find any errors in the runner harness logs, contact Support to file a bug.
  • To debug common errors related to Dataflow multi-language pipelines, see the Multi-language Pipelines Tips guide.
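
As mentioned in the list above, slow SDK process startup can often be mitigated by prebuilding the SDK container. The following is a minimal sketch for the Python SDK, assuming the Cloud Build API is enabled on the project; the resource names and requirements file are placeholders, and depending on your setup you might also need to specify a registry to push the prebuilt image to.

    # A minimal sketch that prebuilds the SDK container with Cloud Build so
    # dependencies are installed once at submission time instead of at every
    # SDK process startup. Resource names are placeholders; the Cloud Build
    # API must be enabled on the project.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',                 # placeholder
        region='us-central1',                 # placeholder
        temp_location='gs://my-bucket/temp',  # placeholder
        requirements_file='requirements.txt', # dependencies baked into the image
        prebuild_sdk_container_engine='cloud_build',
    )

    with beam.Pipeline(options=options) as p:
        p | beam.Create(['warm', 'start']) | beam.Map(print)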