To write data from Dataflow to Bigtable, use the Apache Beam Bigtable I/O connector.
Parallelism
Parallelism is controlled by the number of nodes in the Bigtable cluster. Each node manages one or more key ranges, although key ranges can move between nodes as part of load balancing. For more information, see Understand performance in the Bigtable documentation.
You are charged for the number of nodes in your instance's clusters. See Bigtable pricing.
Performance
The following table shows performance metrics for Bigtable I/O write operations. The workloads were run on one e2-standard-2 worker, using the Apache Beam SDK 2.48.0 for Java. They did not use Runner v2.
100M records, 1 kB, 1 column | Throughput (bytes) | Throughput (elements) |
---|---|---|
Write | 65 MBps | 60,000 elements per second |
These metrics are based on simple batch pipelines. They are intended to compare performance between I/O connectors, and are not necessarily representative of real-world pipelines. Dataflow pipeline performance is complex, and is a function of VM type, the data being processed, the performance of external sources and sinks, and user code. Metrics are based on running the Java SDK, and aren't representative of the performance characteristics of other language SDKs. For more information, see Beam IO Performance.
Best practices
In general, avoid using transactions. Transactions aren't guaranteed to be idempotent, and Dataflow might invoke them multiple times due to retries, causing unexpected values.
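The effect of a retried, non-idempotent write can be illustrated with a toy simulation. The sketch below does not use any Bigtable or Beam API: `ToyTable`, `increment`, `put`, and `applyWithRetries` are hypothetical names, and the in-memory map only stands in for a table. It assumes that a transactional operation behaves like a read-modify-write increment, and that a retry simply applies the same write a second time.

```java
import java.util.HashMap;
import java.util.Map;

public class RetrySimulation {

  /** Toy in-memory stand-in for a table (row key to value). Not a Bigtable API. */
  static final class ToyTable {
    private final Map<String, Long> cells = new HashMap<>();

    /** Non-idempotent: the result depends on the value that is already stored. */
    void increment(String rowKey, long delta) {
      cells.merge(rowKey, delta, Long::sum);
    }

    /** Idempotent: applying the same write again leaves the same value. */
    void put(String rowKey, long value) {
      cells.put(rowKey, value);
    }

    long get(String rowKey) {
      return cells.getOrDefault(rowKey, 0L);
    }
  }

  /** Applies the same write several times, as could happen when work is retried. */
  static void applyWithRetries(Runnable write, int attempts) {
    for (int i = 0; i < attempts; i++) {
      write.run();
    }
  }

  public static void main(String[] args) {
    ToyTable table = new ToyTable();

    // The increment is applied twice, so the stored value is 2 instead of 1.
    applyWithRetries(() -> table.increment("clicks", 1), 2);

    // The plain write is applied twice, and the stored value is still 1.
    applyWithRetries(() -> table.put("total", 1), 2);

    System.out.println("clicks = " + table.get("clicks"));
    System.out.println("total = " + table.get("total"));
  }
}
```

In this simulation the plain write is safe to repeat, while the increment is not, which is the reason to prefer simple, repeatable mutations in a pipeline that might retry work.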
A single Dataflow worker might process data for many key ranges, leading to inefficient writes to Bigtable. Using GroupByKey to group data by Bigtable key can significantly improve write performance.
If you write large datasets to Bigtable, consider calling withFlowControl. This setting automatically rate-limits traffic to Bigtable, to ensure the Bigtable servers have enough resources available to serve data.
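The grouping and flow-control suggestions can be combined in one pipeline. The following is a minimal sketch rather than a reference implementation. It assumes a Beam Java SDK version whose `BigtableIO.Write` includes `withFlowControl`, and that the Google Cloud Platform I/O module is on the classpath. The class and helper names, the project, instance, and table IDs, the column family `cf`, and the sample rows are all placeholders. The cell timestamp is fixed once at pipeline construction, an assumption made so that a retried write produces the same cells.

```java
import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class GroupedBigtableWrite {

  // Placeholder column family; it must already exist in the target table.
  static final String FAMILY = "cf";

  /** Builds one SetCell mutation per (qualifier, value) pair of a single row. */
  static Iterable<Mutation> toMutations(
      Iterable<KV<String, String>> cells, long timestampMicros) {
    List<Mutation> mutations = new ArrayList<>();
    for (KV<String, String> cell : cells) {
      mutations.add(
          Mutation.newBuilder()
              .setSetCell(
                  Mutation.SetCell.newBuilder()
                      .setFamilyName(FAMILY)
                      .setColumnQualifier(ByteString.copyFromUtf8(cell.getKey()))
                      .setTimestampMicros(timestampMicros)
                      .setValue(ByteString.copyFromUtf8(cell.getValue())))
              .build());
    }
    return mutations;
  }

  /** Converts each grouped row into the KV<row key, mutations> shape that BigtableIO writes. */
  static class ToRowMutations
      extends DoFn<KV<String, Iterable<KV<String, String>>>, KV<ByteString, Iterable<Mutation>>> {
    private final long timestampMicros;

    ToRowMutations(long timestampMicros) {
      this.timestampMicros = timestampMicros;
    }

    @ProcessElement
    public void processElement(
        @Element KV<String, Iterable<KV<String, String>>> row,
        OutputReceiver<KV<ByteString, Iterable<Mutation>>> out) {
      out.output(
          KV.of(
              ByteString.copyFromUtf8(row.getKey()),
              toMutations(row.getValue(), timestampMicros)));
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Fixed once, in microseconds with millisecond granularity.
    long timestampMicros = System.currentTimeMillis() * 1000L;

    pipeline
        // Sample input: row key mapped to a (column qualifier, value) pair.
        .apply(
            "CreateInput",
            Create.of(
                    KV.of("row-1", KV.of("col-a", "1")),
                    KV.of("row-2", KV.of("col-a", "2")),
                    KV.of("row-1", KV.of("col-b", "3")))
                .withCoder(
                    KvCoder.of(
                        StringUtf8Coder.of(),
                        KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))))
        // Group by row key so that all cells for a row are written together.
        .apply("GroupByRowKey", GroupByKey.<String, KV<String, String>>create())
        .apply("ToMutations", ParDo.of(new ToRowMutations(timestampMicros)))
        .apply(
            "WriteToBigtable",
            BigtableIO.write()
                .withProjectId("my-project") // placeholder
                .withInstanceId("my-instance") // placeholder
                .withTableId("my-table") // placeholder
                .withFlowControl(true)); // rate-limit traffic to Bigtable

    pipeline.run().waitUntilFinish();
  }
}
```

Grouping before the write means each element passed to the connector carries every mutation for one row. Whether this improves throughput for a given workload depends on the key distribution, so it is worth measuring both variants.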
What's next
- Read the Bigtable I/O connector documentation.
- See the list of Google-provided templates.