To read data from Bigtable to Dataflow, use the Apache Beam Bigtable I/O connector.
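A minimal batch read with the connector can be sketched as follows. The project, instance, and table names are placeholders, and the snippet assumes the `beam-sdks-java-io-google-cloud-platform` dependency is on the classpath:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;
import com.google.bigtable.v2.Row;

public class BigtableReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromBigtable",
            BigtableIO.read()
                .withProjectId("my-project")    // placeholder project ID
                .withInstanceId("my-instance")  // placeholder instance ID
                .withTableId("my-table"))       // placeholder table name
        // Each element is a com.google.bigtable.v2.Row; extract its row key.
        .apply("ExtractRowKeys",
            MapElements.into(TypeDescriptors.strings())
                .via((Row row) -> row.getKey().toStringUtf8()));

    p.run().waitUntilFinish();
  }
}
```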
Parallelism
Parallelism is controlled by the number of nodes in the Bigtable cluster. Each node manages one or more key ranges, although key ranges can move between nodes as part of load balancing. For more information, see Reads and performance in the Bigtable documentation.
You are charged for the number of nodes in your instance's clusters. See Bigtable pricing.
Performance
The following table shows performance metrics for Bigtable read operations. The workloads were run on one e2-standard-2 worker, using the Apache Beam SDK 2.48.0 for Java. They did not use Runner v2.
| 100 M records, 1 kB, 1 column | Throughput (bytes) | Throughput (elements) |
|---|---|---|
| Read | 180 MBps | 170,000 elements per second |
These metrics are based on simple batch pipelines. They are intended to compare performance between I/O connectors, and are not necessarily representative of real-world pipelines. Dataflow pipeline performance is complex, and is a function of VM type, the data being processed, the performance of external sources and sinks, and user code. Metrics are based on running the Java SDK, and aren't representative of the performance characteristics of other language SDKs. For more information, see Beam IO Performance.
Best practices
- For new pipelines, use the `BigtableIO` connector, not `CloudBigtableIO`.
- Create separate app profiles for each type of pipeline. App profiles enable better metrics for differentiating traffic between pipelines, both for support and for tracking usage.
- Monitor the Bigtable nodes. If you experience performance bottlenecks, check whether resources such as CPU utilization are constrained within Bigtable. For more information, see Monitoring.
- In general, the default timeouts are well tuned for most pipelines. If a streaming pipeline appears to get stuck reading from Bigtable, try calling `withAttemptTimeout` to adjust the attempt timeout.
- Consider enabling Bigtable autoscaling, or resize the Bigtable cluster to scale with the size of your Dataflow jobs.
- Consider setting `maxNumWorkers` on the Dataflow job to limit load on the Bigtable cluster.
- If significant processing is done on a Bigtable element before a shuffle, calls to Bigtable might time out. In that case, you can call `withMaxBufferElementCount` to buffer elements. This method converts the read operation from streaming to paginated, which avoids the issue.
- If you use a single Bigtable cluster for both streaming and batch pipelines, and the performance degrades on the Bigtable side, consider setting up replication on the cluster. Then separate the batch and streaming pipelines, so that they read from different replicas. For more information, see Replication overview.
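Several of these practices map directly onto `BigtableIO.Read` configuration methods. The sketch below combines them; the project, instance, table, app profile name, timeout, and buffer size are all illustrative values, not recommendations:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.joda.time.Duration;

public class TunedBigtableRead {
  public static void main(String[] args) {
    // When running on Dataflow, pass --maxNumWorkers=20 (or similar) on the
    // command line to cap the worker count and limit load on Bigtable.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromBigtable",
        BigtableIO.read()
            .withProjectId("my-project")        // placeholder
            .withInstanceId("my-instance")      // placeholder
            .withTableId("my-table")            // placeholder
            // A dedicated app profile for this pipeline's traffic.
            .withAppProfileId("dataflow-batch") // placeholder profile name
            // Adjust the per-attempt timeout if reads appear stuck.
            .withAttemptTimeout(Duration.standardSeconds(30))
            // Buffer elements, converting the read from streaming to
            // paginated, to avoid timeouts when heavy per-element
            // processing happens before a shuffle.
            .withMaxBufferElementCount(10_000));

    p.run().waitUntilFinish();
  }
}
```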
What's next
- Read the Bigtable I/O connector documentation.
- See the list of Google-provided templates.