Monitoring
You can monitor Bigtable visually, using charts that are available in the Google Cloud console, or you can programmatically call the Cloud Monitoring API.
In the Google Cloud console, monitoring data is available in the following places:
- Bigtable monitoring
- Bigtable instance overview
- Bigtable cluster overview
- Bigtable table overview
- Cloud Monitoring
- Key Visualizer
The monitoring and overview pages provide a high-level view of your Bigtable usage. You can use Key Visualizer to drill down into your access patterns by row key and troubleshoot specific performance issues.
Understand CPU and disk usage
No matter what tools you use to monitor your instance, it's essential to monitor the CPU and disk usage for each cluster in the instance. If a cluster's CPU or disk usage exceeds certain thresholds, the cluster won't perform well, and it might return errors when you try to read or write data.
CPU usage
The nodes in your clusters use CPU resources to handle reads, writes, and administrative tasks. We recommend that you enable autoscaling, which lets Bigtable automatically add and remove nodes to a cluster based on workload. To learn more about how the number of nodes affects a cluster's performance, see Performance for typical workloads.
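For example, the following is a minimal sketch of enabling autoscaling on an existing cluster with the Python client for Bigtable. The project, instance, and cluster IDs and the scaling targets are placeholder values, and the sketch assumes the autoscaling keyword arguments supported by the google-cloud-bigtable admin client; adjust it to your own environment.

```python
from google.cloud import bigtable

# Placeholder IDs; replace with your own values.
client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

# Describe the cluster with autoscaling limits and a CPU utilization target
# instead of a fixed node count.
cluster = instance.cluster(
    "my-cluster",
    min_serve_nodes=1,
    max_serve_nodes=5,
    cpu_utilization_percent=60,
)

# update() returns a long-running operation; wait for it to finish.
operation = cluster.update()
operation.result(timeout=300)
```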
Bigtable reports the following metrics for CPU usage:
Metric | Description |
---|---|
Average CPU utilization | The average CPU utilization across all nodes in the cluster. Includes change stream activity if a change stream is enabled for a table in the instance. In app profile charts, <system> indicates system background activities such as replication and compaction. System background activities are not client-driven. The recommended maximum values provide headroom for brief spikes in usage. |
CPU utilization of hottest node | CPU utilization for the busiest node in the cluster. This metric continues to be provided for continuity, but in most cases you should use the more accurate metric High-granularity CPU utilization of hottest node. |
High-granularity CPU utilization of hottest node | A fine-grained measurement of CPU utilization for the busiest node in the cluster. We recommend that you use this metric instead of CPU utilization of hottest node because this metric is more accurate. The hottest node is not necessarily the same node over time and can change rapidly, especially during large batch jobs or table scans. If the hottest node is frequently above the recommended value, even when your average CPU utilization is reasonable, you might be accessing a small part of your data much more frequently than the rest of your data. |
Change stream CPU utilization | The average CPU utilization caused by change stream activity across all nodes in the cluster. |
CPU utilization by app profile, method, and table | CPU utilization broken down by app profile, method, and table. If you observe higher than expected CPU usage for a cluster, use this metric to determine whether the CPU usage of a particular app profile, API method, or table is driving the CPU load. |
Disk usage
For each cluster in your instance, Bigtable stores a separate copy of all of the tables in that instance.
Bigtable tracks disk usage in binary units, such as binary gigabytes (GB), where 1 GB is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).
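As a quick worked example, converting a raw byte value reported by a storage metric into binary gigabytes looks like this:

```python
# 1 binary gigabyte (GiB) is 2**30 bytes.
GIB = 2 ** 30  # 1,073,741,824 bytes

stored_bytes = 3_221_225_472  # example metric value
print(f"{stored_bytes / GIB:.2f} GiB")  # prints "3.00 GiB"
```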
Bigtable reports the following metrics for disk usage:
Metric | Description |
---|---|
Storage utilization (bytes) | The amount of data stored in the cluster. Change stream usage is not included for this metric. This value affects your costs. Also, as described below, you might need to add nodes to each cluster as the amount of data increases. |
Storage utilization (% max) | The percentage of the cluster's storage capacity that is being used. The capacity is based on the number of nodes in your cluster. Change stream usage is not included for this metric. In general, do not use more than 70% of the hard limit on total storage, so you have room to add more data. If you do not plan to add significant amounts of data to your instance, you can use up to 100% of the hard limit. If you are using more than the recommended percentage of the storage limit, add nodes to the cluster. You can also delete existing data, but deleted data takes up more space, not less, until a compaction occurs. For details about how this value is calculated, see Storage utilization per node. |
Change stream storage utilization (bytes) | The amount of storage consumed by change stream records for tables in the instance. This storage does not count toward the total storage utilization, and it is not included in the calculation of storage utilization (% max), but you are charged for it. |
Disk load | The percentage your cluster is using of the maximum possible bandwidth for HDD reads. Available only for HDD clusters. If this value is frequently at 100%, you might experience increased latency. Add nodes to the cluster to reduce the disk load percentage. |
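To illustrate how node count relates to storage utilization (% max), here is a minimal sketch of the calculation. The per-node hard limits used below (5 TB for SSD, 16 TB for HDD) are assumptions for illustration only; check the current Bigtable storage quotas for the authoritative values.

```python
# Assumed per-node hard limits, in binary terabytes (illustrative values only).
SSD_TB_PER_NODE = 5
HDD_TB_PER_NODE = 16

def storage_utilization_pct(stored_tb: float, nodes: int, tb_per_node: float) -> float:
    """Return stored data as a percentage of the cluster's hard storage limit."""
    return 100 * stored_tb / (nodes * tb_per_node)

# Example: 12 TB stored on a 4-node SSD cluster uses 60% of the hard limit,
# still below the recommended 70% threshold for growing datasets.
print(storage_utilization_pct(12, 4, SSD_TB_PER_NODE))  # 60.0
```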
Compaction and replicated instances
Storage metrics reflect the data size on disk as of the last compaction. Because compaction happens on a rolling basis over the course of a week, storage usage metrics for a cluster might sometimes temporarily be different from metrics for other clusters in the instance. Observable impacts of this include the following:
- A new cluster that has recently been added to an instance might temporarily show 0 bytes of storage even though all data has successfully been replicated to the new cluster.
- A table might be a different size in each cluster, even when replication is working properly.
- Storage usage metrics might be different in each cluster, even after replication has finished and no writes have been sent for a few days. The internal storage implementation, including how data is divided and stored in a distributed manner, can be different for each cluster, causing the actual usage of storage to differ.
Instance overview
The instance overview page shows the current values of several key metrics for each cluster:
Metric | Description |
---|---|
CPU utilization average | The average CPU utilization across all nodes in the cluster. Includes change stream activity if a change stream is enabled for a table in the instance. In app profile charts, <system> indicates system background activities such as replication and compaction. System background activities are not client-driven. |
CPU utilization of hottest node | CPU utilization for the busiest node in the cluster. This metric continues to be provided for continuity, but in most cases you should use the more accurate metric High-granularity CPU utilization of hottest node. |
High-granularity CPU utilization of hottest node | A fine-grained measurement of CPU utilization for the busiest node in the cluster. We recommend that you use this metric instead of CPU utilization of hottest node because this metric is more accurate. The hottest node is not necessarily the same node over time and can change rapidly, especially during large batch jobs or table scans. Exceeding the recommended maximum for the busiest node can cause latency and other issues for the cluster. |
Rows read | The number of rows read per second. |
Rows written | The number of rows written per second. |
Read throughput | The number of bytes per second of response data sent. This metric refers to the full amount of data that is returned after filters are applied. |
Write throughput | The number of bytes per second that were received when data was written. |
System error rate | The percentage of all requests that failed on the Bigtable server side. |
Replication latency for input | The highest amount of time at the 99th percentile, in seconds, for a write to another cluster to be replicated to this cluster. |
Replication latency for output | The highest amount of time at the 99th percentile, in seconds, for a write to this cluster to be replicated to another cluster. |
To see an overview of these key metrics:
Open the list of Bigtable instances in the Google Cloud console.
Click the instance whose metrics you want to view. The Google Cloud console displays the current metrics for your instance's clusters.
Cluster overview
Use the cluster overview page to understand the current and past status of an individual cluster.
The cluster overview page displays charts showing the following metrics for each cluster:
Metric | Description |
---|---|
Number of nodes | The number of nodes in use for the cluster at a given time. |
Maximum node count target | The maximum number of nodes that Bigtable will scale the cluster up to when autoscaling is enabled. This metric is visible only when autoscaling is enabled for the cluster. You can change this value on the Edit cluster page. |
Minimum node count target | The minimum number of nodes that Bigtable will scale the cluster down to when autoscaling is enabled. This metric is visible only when autoscaling is enabled for the cluster. You can change this value on the Edit cluster page. |
Recommended number of nodes for CPU target | The number of nodes that Bigtable recommends for the cluster based on the CPU utilization target that you set. This metric is visible only when autoscaling is enabled for the cluster. If this number is higher than the maximum node count target, consider raising your CPU utilization target or increasing the maximum number of nodes for the cluster. If this number is lower than the minimum node count target, the cluster might be overprovisioned for your usage, and you should consider lowering the minimum. |
Recommended number of nodes for storage target | The number of nodes that Bigtable recommends for the cluster based on the built-in storage utilization target. This metric is visible only when autoscaling is enabled for the cluster. If this number is higher than the maximum node count target, consider increasing the maximum number of nodes for the cluster. |
CPU utilization | The average CPU utilization across all nodes in the cluster. Includes change stream activity if a change stream is enabled for a table in the instance. In app profile charts, <system> indicates system background activities such as replication and compaction. System background activities are not client-driven. |
Storage utilization | The amount of data stored in the cluster. Change stream usage is not included for this metric. This metric reflects the fact that Bigtable compresses your data when it is stored. |
To view a cluster's overview page, do the following:
Open the list of Bigtable instances in the Google Cloud console.
Click the instance whose metrics you want to view.
Scroll past the section that shows the current status of the clusters' metrics to the list of clusters.
Click the cluster ID to open the cluster's Cluster overview page.
Logs
The Logs chart displays system event log entries for the cluster. System event logs are generated only for clusters that use autoscaling. To learn additional ways to view Bigtable audit logs, see Audit logging.
Table overview
Use the table overview page to understand the current and past status of an individual table.
The table overview page displays charts showing the following metrics for the table. Each chart shows a separate line for each cluster that the table is in.
Metric | Description |
---|---|
Storage utilization (bytes) | The percentage of the cluster's storage capacity that is being used by the table. The capacity is based on the number of nodes in the cluster. For details about how this value is calculated, see Storage utilization per node. |
CPU utilization | The average CPU utilization across all nodes in the cluster. Includes change stream activity if a change stream is enabled for a table in the instance. In app profile charts, <system> indicates system background activities such as replication and compaction. System background activities are not client-driven. |
Read latency | The time for a read request to return a response. Measurement of read latency begins when Bigtable receives the request and ends when the last byte of data is sent to the client. For requests for large amounts of data, read latency can be affected by the client's ability to consume the response. |
Write latency | The time for a write request to return a response. |
Rows read | The number of rows read per second. This metric provides a more useful view of Bigtable's overall throughput than the number of read requests, because a single request can read a large number of rows. |
Rows written | The number of rows written per second. This metric provides a more useful view of Bigtable's overall throughput than the number of write requests, because a single request can write a large number of rows. |
Read requests | The number of random reads and scan requests per second. |
Write requests | The number of write requests per second. |
Read throughput | The number of bytes per second of response data sent. This metric refers to the full amount of data that is returned after filters are applied. |
Write throughput | The number of bytes per second that were received when data was written. |
Automatic failovers | The number of requests that were automatically rerouted from one cluster to another due to a failover scenario, such as a brief outage or delay. Automatic rerouting can occur if an app profile uses multi-cluster routing. This chart does not include manually rerouted requests. |
The table overview page also shows the table's replication state in each cluster in the instance. For each cluster, the page displays the following:
- Status
- Cluster ID
- Zone
- The amount of cluster storage used by the table
- Encryption key and key status
- Date of the latest backup of the selected table
- A link to the Edit cluster page.
To view a table's overview page, do the following:
Open the list of Bigtable instances in the Google Cloud console.
Click the instance whose metrics you want to view.
In the left pane, click Tables. The Google Cloud console displays a list of all the tables in the instance.
Click a table ID to open the table's Table overview page.
Monitor performance over time
Use your Bigtable instance's monitoring page to understand the past performance of your instance. You can analyze the performance of each cluster, and you can break down the metrics for different types of Bigtable resources. Charts can display a period ranging from the past 1 hour to the past 6 weeks.
Monitoring charts for Bigtable resources
The Bigtable monitoring page provides charts for the following types of Bigtable resources:
- Instances
- Tables
- Application profiles
- Replication
Charts on the monitoring page show the following metrics:
Metric | Available for | Description |
---|---|---|
CPU utilization | Instances, Tables, App profiles | The average CPU utilization across all nodes in the cluster. Includes change stream activity if a change stream is enabled for a table in the instance. In app profile charts, <system> indicates system background activities such as replication and compaction. System background activities are not client-driven. |
CPU utilization (hottest node) | Instances | CPU utilization for the busiest node in the cluster. This metric continues to be provided for continuity, but in most cases you should use the more accurate metric High-granularity CPU utilization of hottest node. |
High-granularity CPU utilization (hottest node) | Instances | A fine-grained measurement of CPU utilization for the busiest node in the cluster. We recommend that you use this metric instead of CPU utilization of hottest node because this metric is more accurate. The hottest node is not necessarily the same node over time and can change rapidly, especially during large batch jobs or table scans. Exceeding the recommended maximum for the busiest node can cause latency and other issues for the cluster. |
Read latency | Instances, Tables, App profiles | The time for a read request to return a response. Measurement of read latency begins when Bigtable receives the request and ends when the last byte of data is sent to the client. For requests for large amounts of data, read latency can be affected by the client's ability to consume the response. |
Write latency | Instances, Tables, App profiles | The time for a write request to return a response. |
User error rate | Instances | The rate of errors caused by the content of a request, as opposed to errors on the Bigtable server side. User errors are typically caused by a configuration issue, such as a request that specifies the wrong cluster, table, or app profile. |
System error rate | Instances | The percentage of all requests that failed on the Bigtable server side. |
Automatic failovers | Instances, Tables, App profiles | The number of requests that were automatically rerouted from one cluster to another due to a failover scenario, such as a brief outage or delay. Automatic rerouting can occur if an app profile uses multi-cluster routing. This chart does not include manually rerouted requests. |
Storage utilization (bytes) | Instances, Tables | The amount of data stored in the cluster. Change stream usage is not included for this metric. This metric reflects the fact that Bigtable compresses your data when it is stored. |
Storage utilization (% max) | Instances | The percentage of the cluster's storage capacity that is being used. The capacity is based on the number of nodes in your cluster. Change stream usage is not included for this metric. For details about how this value is calculated, see Storage utilization per node. |
Disk load | Instances | The percentage your cluster is using of the maximum possible bandwidth for HDD reads. Available only for HDD clusters. |
Rows read | Instances, Tables, App profiles | The number of rows read per second. This metric provides a more useful view of Bigtable's overall throughput than the number of read requests, because a single request can read a large number of rows. |
Rows written | Instances, Tables, App profiles | The number of rows written per second. This metric provides a more useful view of Bigtable's overall throughput than the number of write requests, because a single request can write a large number of rows. |
Read requests | Instances, Tables, App profiles | The number of random reads and scan requests per second. |
Write requests | Instances, Tables, App profiles | The number of write requests per second. |
Read throughput | Instances, Tables, App profiles | The number of bytes per second of response data sent. This metric refers to the full amount of data that is returned after filters are applied. |
Write throughput | Instances, Tables, App profiles | The number of bytes per second that were received when data was written. |
Node count | Instances | The number of nodes in the cluster. |
To view metrics for these resources:
Open the list of Bigtable instances in the Google Cloud console.
Click the instance whose metrics you want to view.
In the left pane, click Monitoring. The Google Cloud console displays a series of charts for the instance, as well as a tabular view of the instance's metrics. By default, the Google Cloud console shows metrics for the past hour, and it shows separate metrics for each cluster in the instance.
To view all of the charts, scroll through the pane where the charts are displayed.
To view metrics at the table level, click Tables.
To view metrics for individual app profiles, click Application Profiles.
To view combined metrics for the instance as a whole, find the Group by section above the charts, then click Instance.
To view metrics for a longer period of time, click the arrow next to 1 Hour. Choose a pre-set time range or enter a custom time range, then click Apply.
Charts for replication
The monitoring page provides a chart that shows replication latency over time. You can view the average latency for replicating writes at the 50th, 99th, and 100th percentiles.
To view the replication latency over time:
Open the list of Bigtable instances in the Google Cloud console.
Click the instance whose metrics you want to view.
In the left pane, click Monitoring. The page opens with the Instance tab selected.
Click the Replication tab. The Google Cloud console displays replication latency over time. By default, the Google Cloud console shows replication latency for the past hour.
To toggle between latency charts grouped by table or by cluster, use the Group by menu.
To change which percentile to view, use the Percentile menu.
To view metrics for a longer period of time, click the arrow next to 1 Hour. Choose a pre-set time range or enter a custom time range, then click Apply.
Monitor with Cloud Monitoring
Bigtable exports usage metrics to Cloud Monitoring. You can use these metrics in a variety of ways:
- Monitor programmatically using the Cloud Monitoring API, as shown in the sketch after this list.
- Monitor visually in the Metrics Explorer.
- Set up alerting policies.
- Add Bigtable usage metrics to a custom dashboard.
- Use a graphing library, such as Matplotlib for Python, to plot and analyze the usage metrics for Bigtable.
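As an example of the first approach, the following sketch reads the last hour of average CPU utilization for each cluster through the Cloud Monitoring API, using the google-cloud-monitoring Python client. The project ID is a placeholder, and the sketch assumes the metric type bigtable.googleapis.com/cluster/cpu_load, which corresponds to the average CPU utilization chart.

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# List the raw time series for average CPU utilization, one per cluster.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "bigtable.googleapis.com/cluster/cpu_load"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    cluster = series.resource.labels["cluster"]
    for point in series.points:
        print(cluster, point.interval.end_time, point.value.double_value)
```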
To view usage metrics in the Metrics Explorer:
Open the Monitoring page in the Google Cloud console.
If you are prompted to choose an account, choose the account that you use to access Google Cloud.
Click Resources, then click Metrics Explorer.
Under Find resource type and metric, type bigtable. A list of Bigtable resources and metrics appears.
Click a metric to view a chart for that metric.
For additional information about using Cloud Monitoring, see the Cloud Monitoring documentation.
For a complete list of Bigtable metrics, see Metrics.
Create a storage utilization alert
You can set up an alert to notify you when your Bigtable cluster exceeds a specified threshold. For more information about determining your target storage utilization, see Disk usage.
To create an alerting policy that triggers when the storage utilization for your Bigtable cluster is above a recommended threshold, such as 70%, use the following settings.
New condition field | Value |
---|---|
Resource and Metric | In the Resources menu, select Cloud Bigtable Cluster. In the Metric categories menu, select Cluster. In the Metrics menu, select Storage utilization. (The metric.type is bigtable.googleapis.com/cluster/storage_utilization.) |
Filter | cluster = YOUR_CLUSTER_ID |

Configure alert trigger field | Value |
---|---|
Condition type | Threshold |
Condition triggers if | Any time series violates |
Threshold position | Above threshold |
Threshold value | 70 |
Retest window | 10 minutes |
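If you prefer to manage the alert as code, the following sketch creates an equivalent alerting policy through the Cloud Monitoring API. The project and cluster IDs are placeholders, and the sketch assumes that the storage_utilization metric is reported as a ratio between 0 and 1, so 0.7 corresponds to the 70% threshold in the table above; attach notification channels to the policy as needed.

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

PROJECT_ID = "my-project"       # placeholder
CLUSTER_ID = "YOUR_CLUSTER_ID"  # placeholder

client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Storage utilization above 70%",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type = "bigtable.googleapis.com/cluster/storage_utilization" '
            'AND resource.type = "bigtable_cluster" '
            f'AND resource.labels.cluster = "{CLUSTER_ID}"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.7,                          # 70% of the hard limit
        duration=duration_pb2.Duration(seconds=600),  # 10-minute retest window
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Bigtable cluster storage utilization",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)

created = client.create_alert_policy(
    name=f"projects/{PROJECT_ID}", alert_policy=policy
)
print(created.name)
```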
What's next
- Find out how to troubleshoot issues with Key Visualizer.
- Read about client-side metrics.
- Try the Cloud Monitoring quickstart.
- Learn about creating alerts based on Bigtable metrics.