Monitoring
You can monitor Bigtable visually, using charts that are available in the Google Cloud console, or you can programmatically call the Cloud Monitoring API.
In the Google Cloud console, monitoring data is available in the following places:
- Bigtable monitoring
- Bigtable instance overview
- Bigtable cluster overview
- Bigtable table overview
- Cloud Monitoring
- Key Visualizer
The monitoring and overview pages provide a high-level view of your Bigtable usage. You can use Key Visualizer to drill down into your access patterns by row key and troubleshoot specific performance issues.
Understand CPU and disk usage
No matter what tools you use to monitor your instance, it's essential to monitor the CPU and disk usage for each cluster in the instance. If a cluster's CPU or disk usage exceeds certain thresholds, the cluster won't perform well, and it might return errors when you try to read or write data.
CPU usage
The nodes in your clusters use CPU resources to handle reads, writes, and administrative tasks. We recommend that you enable autoscaling, which lets Bigtable automatically add and remove nodes to a cluster based on workload. To learn more about how the number of nodes affects a cluster's performance, see Performance for typical workloads.
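For example, the following is a minimal sketch of enabling autoscaling on an existing cluster with the Python client for Bigtable. The project, instance, and cluster IDs and the scaling targets are placeholder values, and the sketch assumes the autoscaling keyword arguments supported by the google-cloud-bigtable admin client; adjust it to your own environment.

```python
from google.cloud import bigtable

# Placeholder IDs; replace with your own values.
client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

# Describe the cluster with autoscaling limits and a CPU utilization target
# instead of a fixed node count.
cluster = instance.cluster(
    "my-cluster",
    min_serve_nodes=1,
    max_serve_nodes=5,
    cpu_utilization_percent=60,
)

# update() returns a long-running operation; wait for it to finish.
operation = cluster.update()
operation.result(timeout=300)
```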
Bigtable reports the following metrics for CPU usage:
Metric | Description |
---|---|
Average CPU utilization | The average CPU utilization across all nodes in the cluster. Includes change stream activity if a change stream is enabled for a table in the instance. In app profile charts, <system> indicates system background activities such as replication and compaction. System background activities are not client-driven. The recommended maximum values provide headroom for brief spikes in usage. |
CPU utilization of hottest node | CPU utilization for the busiest node in the cluster. This metric continues to be provided for continuity, but in most cases you should use the more accurate metric High-granularity CPU utilization of hottest node. |
High-granularity CPU utilization of hottest node | A fine-grained measurement of CPU utilization for the busiest node in the cluster. We recommend that you use this metric instead of CPU utilization of hottest node because this metric is more accurate. The hottest node is not necessarily the same node over time and can change rapidly, especially during large batch jobs or table scans. If the hottest node is frequently above the recommended value, even when your average CPU utilization is reasonable, you might be accessing a small part of your data much more frequently than the rest of your data. |
Change stream CPU utilization | The average CPU utilization caused by change stream activity across all nodes in the cluster. |
CPU utilization by app profile, method, and table | CPU utilization broken down by app profile, method, and table. If you observe higher than expected CPU usage for a cluster, use this metric to determine whether the CPU usage of a particular app profile, API method, or table is driving the CPU load. |
Disk usage
For each cluster in your instance, Bigtable stores a separate copy of all of the tables in that instance.
Bigtable tracks disk usage in binary units, such as binary gigabytes (GB), where 1 GB is 2^30 bytes. This unit of measurement is also known as a gibibyte (GiB).
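As a quick worked example, converting a raw byte value reported by a storage metric into binary gigabytes looks like this:

```python
# 1 binary gigabyte (GiB) is 2**30 bytes.
GIB = 2 ** 30  # 1,073,741,824 bytes

stored_bytes = 3_221_225_472  # example metric value
print(f"{stored_bytes / GIB:.2f} GiB")  # prints "3.00 GiB"
```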
Bigtable reports the following metrics for disk usage:
Metric | Description |
---|---|
Storage utilization (bytes) | The amount of data stored in the cluster. Change stream usage is not included for this metric. This value affects your costs. Also, as described below, you might need to add nodes to each cluster as the amount of data increases. |
Storage utilization (% max) | The percentage of the cluster's storage capacity that is being used. The capacity is based on the number of nodes in your cluster. Change stream usage is not included for this metric. In general, do not use more than 70% of the hard limit on total storage, so you have room to add more data. If you do not plan to add significant amounts of data to your instance, you can use up to 100% of the hard limit. If you are using more than the recommended percentage of the storage limit, add nodes to the cluster. You can also delete existing data, but deleted data takes up more space, not less, until a compaction occurs. For details about how this value is calculated, see Storage utilization per node. |
Change stream storage utilization (bytes) | The amount of storage consumed by change stream records for tables in the instance. This storage does not count toward the total storage utilization, and it is not included in the calculation of storage utilization (% max), but you are charged for it. |
Disk load | The percentage your cluster is using of the maximum possible bandwidth for HDD reads. Available only for HDD clusters. If this value is frequently at 100%, you might experience increased latency. Add nodes to the cluster to reduce the disk load percentage. |
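To illustrate how node count relates to storage utilization (% max), here is a minimal sketch of the calculation. The per-node hard limits used below (5 TB for SSD, 16 TB for HDD) are assumptions for illustration only; check the current Bigtable storage quotas for the authoritative values.

```python
# Assumed per-node hard limits, in binary terabytes (illustrative values only).
SSD_TB_PER_NODE = 5
HDD_TB_PER_NODE = 16

def storage_utilization_pct(stored_tb: float, nodes: int, tb_per_node: float) -> float:
    """Return stored data as a percentage of the cluster's hard storage limit."""
    return 100 * stored_tb / (nodes * tb_per_node)

# Example: 12 TB stored on a 4-node SSD cluster uses 60% of the hard limit,
# still below the recommended 70% threshold for growing datasets.
print(storage_utilization_pct(12, 4, SSD_TB_PER_NODE))  # 60.0
```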
Compaction and replicated instances
Storage metrics reflect the data size on disk as of the last compaction. Because compaction happens on a rolling basis over the course of a week, storage usage metrics for a cluster might sometimes temporarily be different from metrics for other clusters in the instance. Observable impacts of this include the following:
- A new cluster that has recently been added to an instance might temporarily show 0 bytes of storage even though all data has successfully been replicated to the new cluster.
- A table might be a different size in each cluster, even when replication is working properly.
- Storage usage metrics might be different in each cluster, even after replication has finished and no writes have been sent for a few days. The internal storage implementation, including how data is divided and stored in a distributed manner, can be different for each cluster, causing the actual usage of storage to differ.
Instance overview
The instance overview page shows the current values of several key metrics for each cluster:
Metric | Description |
---|---|
CPU utilization average | The average CPU utilization across all nodes in the cluster. Includes change stream activity if a change stream is enabled for a table in the instance. In app profile charts, <system> indicates system background activities such as replication and compaction. System background activities are not client-driven. |
CPU utilization of hottest node | CPU utilization for the busiest node in the cluster. This metric continues to be provided for continuity, but in most cases you should use the more accurate metric High-granularity CPU utilization of hottest node. |
High-granularity CPU utilization of hottest node | A fine-grained measurement of CPU utilization for the busiest node in the cluster. We recommend that you use this metric instead of CPU utilization of hottest node because this metric is more accurate. The hottest node is not necessarily the same node over time and can change rapidly, especially during large batch jobs or table scans. Exceeding the recommended maximum for the busiest node can cause latency and other issues for the cluster. |
Rows read | The number of rows read per second. |
Rows written | The number of rows written per second. |
Read throughput | The number of bytes per second of response data sent. This metric refers to the full amount of data that is returned after filters are applied. |
Write throughput | The number of bytes per second that were received when data was written. |
System error rate | The percentage of all requests that failed on the Bigtable server side. |
Replication latency for input | The highest amount of time at the 99th percentile, in seconds, for a write to another cluster to be replicated to this cluster. |
Replication latency for output | The highest amount of time at the 99th percentile, in seconds, for a write to this cluster to be replicated to another cluster. |
To see an overview of these key metrics:
Open the list of Bigtable instances in the Google Cloud console.
Click the instance whose metrics you want to view. The Google Cloud console displays the current metrics for your instance's clusters.
Cluster overview
Use the cluster overview page to understand the current and past status of an individual cluster.
The cluster overview page displays charts showing the following metrics for each cluster:
Metric | Description |
---|---|
Number of nodes | The number of nodes in use for the cluster at a given time. |
Maximum node count target | The maximum number of nodes that Bigtable will scale the cluster up to when autoscaling is enabled. This metric is visible only when autoscaling is enabled for the cluster. You can change this value on the Edit cluster page. |
Minimum node count target | The minimum number of nodes that Bigtable will scale the cluster down to when autoscaling is enabled. This metric is visible only when autoscaling is enabled for the cluster. You can change this value on the Edit cluster page. |
Recommended number of nodes for CPU target | The number of nodes that Bigtable recommends for the cluster based on the CPU utilization target that you set. This metric is visible only when autoscaling is enabled for the cluster. If this number is higher than the maximum node count target, consider raising your CPU utilization target or increasing the maximum number of nodes for the cluster. If this number is lower than the minimum node count target, the cluster might be overprovisioned for your usage, and you should consider lowering the minimum. |
Recommended number of nodes for storage target | The number of nodes that Bigtable recommends for the cluster based on the built-in storage utilization target. This metric is visible only when autoscaling is enabled for the cluster. If this number is higher than the maximum node count target, consider increasing the maximum number of nodes for the cluster. |
CPU utilization | The average CPU utilization across all nodes in the cluster. Includes change stream activity if a change stream is enabled for a table in the instance. In app profile charts, <system> indicates system background activities such as replication and compaction. System background activities are not client-driven. |
Storage utilization | The amount of data stored in the cluster. Change stream usage is not included for this metric. This metric reflects the fact that Bigtable compresses your data when it is stored. |
To view a cluster's overview page, do the following:
Open the list of Bigtable instances in the Google Cloud console.
Click the instance whose metrics you want to view.
Scroll past the section that shows the current status of the clusters' metrics to the list of clusters.
Click the cluster ID to open the cluster's Cluster overview page.
Logs
The Logs chart displays system event log entries for the cluster. System event logs are generated only for clusters that use autoscaling. To learn additional ways to view Bigtable audit logs, see Audit logging.
Table overview
Use the table overview page to understand the current and past status of an individual table.
The table overview page displays charts showing the following metrics for the table. Each chart shows a separate line for each cluster that the table is in.
Metric | Description |
---|---|
Storage utilization (bytes) | The percentage of the cluster's storage capacity that is being used by the table. The capacity is based on the number of nodes in the cluster. For details about how this value is calculated, see Storage utilization per node. |
CPU utilization | The average CPU utilization across all nodes in the cluster. Includes change stream activity if a change stream is enabled for a table in the instance. In app profile charts, <system> indicates system background activities such as replication and compaction. System background activities are not client-driven. |
Read latency | The time for a read request to return a response. Measurement of read latency begins when Bigtable receives the request and ends when the last byte of data is sent to the client. For requests for large amounts of data, read latency can be affected by the client's ability to consume the response. |
Write latency | The time for a write request to return a response. |
Rows read | The number of rows read per second. This metric provides a more useful view of Bigtable's overall throughput than the number of read requests, because a single request can read a large number of rows. |
Rows written | The number of rows written per second. This metric provides a more useful view of Bigtable's overall throughput than the number of write requests, because a single request can write a large number of rows. |
Read requests | The number of random reads and scan requests per second. |
Write requests | The number of write requests per second. |
Read throughput | The number of bytes per second of response data sent. This metric refers to the full amount of data that is returned after filters are applied. |
Write throughput | The number of bytes per second that were received when data was written. |
Automatic failovers | The number of requests that were automatically rerouted from one cluster to another due to a failover scenario, such as a brief outage or delay. Automatic rerouting can occur if an app profile uses multi-cluster routing. This chart does not include manually rerouted requests. |
The table overview page also shows the table's replication state in each cluster in the instance. For each cluster, the page displays the following:
- Status
- Cluster ID
- Zone
- The amount of cluster storage used by the table
- Encryption key and key status
- Date of the latest backup of the selected table
- A link to the Edit cluster page.
To view a table's overview page, do the following:
Open the list of Bigtable instances in the Google Cloud console.
Click the instance whose metrics you want to view.
In the left pane, click Tables. The Google Cloud console displays a list of all the tables in the instance.
Click a table ID to open the table's Table overview page.
Monitor performance over time
Use your Bigtable instance's monitoring page to understand the past performance of your instance. You can analyze the performance of each cluster, and you can break down the metrics for different types of Bigtable resources. Charts can display a period ranging from the past 1 hour to the past 6 weeks.
Monitoring charts for Bigtable resources
The Bigtable monitoring page provides charts for the following types of Bigtable resources:
- Instances
- Tables
- Application profiles
- Replication
Charts on the monitoring page show the following metrics:
Metric | Available for | Description |
---|---|---|
CPU utilization | Instances, Tables, App profiles | The average CPU utilization across all nodes in the cluster. Includes change stream activity if a change stream is enabled for a table in the instance. In app profile charts, <system> indicates system background activities such as replication and compaction. System background activities are not client-driven. |
CPU utilization (hottest node) | Instances | CPU utilization for the busiest node in the cluster. This metric continues to be provided for continuity, but in most cases you should use the more accurate metric High-granularity CPU utilization of hottest node. |
High-granularity CPU utilization (hottest node) | Instances | A fine-grained measurement of CPU utilization for the busiest node in the cluster. We recommend that you use this metric instead of CPU utilization of hottest node because this metric is more accurate. The hottest node is not necessarily the same node over time and can change rapidly, especially during large batch jobs or table scans. Exceeding the recommended maximum for the busiest node can cause latency and other issues for the cluster. |
Read latency | Instances, Tables, App profiles | The time for a read request to return a response. Measurement of read latency begins when Bigtable receives the request and ends when the last byte of data is sent to the client. For requests for large amounts of data, read latency can be affected by the client's ability to consume the response. |
Write latency | Instances, Tables, App profiles | The time for a write request to return a response. |
User error rate | Instances | The rate of errors caused by the content of a request, as opposed to errors on the Bigtable server side. User errors are typically caused by a configuration issue, such as a request that specifies the wrong cluster, table, or app profile. |
System error rate | Instances | The percentage of all requests that failed on the Bigtable server side. |
Automatic failovers | Instances, Tables, App profiles | The number of requests that were automatically rerouted from one cluster to another due to a failover scenario, such as a brief outage or delay. Automatic rerouting can occur if an app profile uses multi-cluster routing. This chart does not include manually rerouted requests. |
Storage utilization (bytes) | Instances, Tables | The amount of data stored in the cluster. Change stream usage is not included for this metric. This metric reflects the fact that Bigtable compresses your data when it is stored. |
Storage utilization (% max) | Instances | The percentage of the cluster's storage capacity that is being used. The capacity is based on the number of nodes in your cluster. Change stream usage is not included for this metric. For details about how this value is calculated, see Storage utilization per node. |
Disk load | Instances | The percentage your cluster is using of the maximum possible bandwidth for HDD reads. Available only for HDD clusters. |
Rows read | Instances, Tables, App profiles | The number of rows read per second. This metric provides a more useful view of Bigtable's overall throughput than the number of read requests, because a single request can read a large number of rows. |
Rows written | Instances, Tables, App profiles | The number of rows written per second. This metric provides a more useful view of Bigtable's overall throughput than the number of write requests, because a single request can write a large number of rows. |
Read requests | Instances, Tables, App profiles | The number of random reads and scan requests per second. |
Write requests | Instances, Tables, App profiles | The number of write requests per second. |
Read throughput | Instances, Tables, App profiles | The number of bytes per second of response data sent. This metric refers to the full amount of data that is returned after filters are applied. |
Write throughput | Instances, Tables, App profiles | The number of bytes per second that were received when data was written. |
Node count | Instances | The number of nodes in the cluster. |
To view metrics for these resources:
Open the list of Bigtable instances in the Google Cloud console.
Click the instance whose metrics you want to view.
In the left pane, click Monitoring. The Google Cloud console displays a series of charts for the instance, as well as a tabular view of the instance's metrics. By default, the Google Cloud console shows metrics for the past hour, and it shows separate metrics for each cluster in the instance.
To view all of the charts, scroll through the pane where the charts are displayed.
To view metrics at the table level, click Tables.
To view metrics for individual app profiles, click Application Profiles.
To view combined metrics for the instance as a whole, find the Group by section above the charts, then click Instance.
To view metrics for a longer period of time, click the arrow next to 1 Hour. Choose a pre-set time range or enter a custom time range, then click Apply.
Charts for replication
The monitoring page provides a chart that shows replication latency over time. You can view the average latency for replicating writes at the 50th, 99th, and 100th percentiles.
To view the replication latency over time:
Open the list of Bigtable instances in the Google Cloud console.
Click the instance whose metrics you want to view.
In the left pane, click Monitoring. The page opens with the Instance tab selected.
Click the Replication tab. The Google Cloud console displays replication latency over time. By default, the Google Cloud console shows replication latency for the past hour.
To toggle between latency charts grouped by table or by cluster, use the Group by menu.
To change which percentile to view, use the Percentile menu.
To view metrics for a longer period of time, click the arrow next to 1 Hour. Choose a pre-set time range or enter a custom time range, then click Apply.
Monitor with Cloud Monitoring
Bigtable exports usage metrics to Cloud Monitoring. You can use these metrics in a variety of ways:
- Monitor programmatically using the Cloud Monitoring API, as shown in the sketch after this list.
- Monitor visually in the Metrics Explorer.
- Set up alerting policies.
- Add Bigtable usage metrics to a custom dashboard.
- Use a graphing library, such as Matplotlib for Python, to plot and analyze the usage metrics for Bigtable.
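As an example of the first approach, the following sketch reads the last hour of average CPU utilization for each cluster through the Cloud Monitoring API, using the google-cloud-monitoring Python client. The project ID is a placeholder, and the sketch assumes the metric type bigtable.googleapis.com/cluster/cpu_load, which corresponds to the average CPU utilization chart.

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# List the raw time series for average CPU utilization, one per cluster.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "bigtable.googleapis.com/cluster/cpu_load"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    cluster = series.resource.labels["cluster"]
    for point in series.points:
        print(cluster, point.interval.end_time, point.value.double_value)
```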
To view usage metrics in the Metrics Explorer:
Open the Monitoring page in the Google Cloud console.
If you are prompted to choose an account, choose the account that you use to access Google Cloud.
Click Resources, then click Metrics Explorer.
Under Find resource type and metric, type bigtable. A list of Bigtable resources and metrics appears.
Click a metric to view a chart for that metric.
For additional information about using Cloud Monitoring, see the Cloud Monitoring documentation.
For a complete list of Bigtable metrics, see Metrics.
Create a storage utilization alert
You can set up an alert to notify you when your Bigtable cluster exceeds a specified threshold. For more information about determining your target storage utilization, see Disk usage.
To create an alerting policy that triggers when the storage utilization for your Bigtable cluster is above a recommended threshold, such as 70%, use the following settings.
New condition field | Value |
---|---|
Resource and Metric | In the Resources menu, select Cloud Bigtable Cluster. In the Metric categories menu, select Cluster. In the Metrics menu, select Storage utilization. (The metric.type is bigtable.googleapis.com/cluster/storage_utilization.) |
Filter | cluster = YOUR_CLUSTER_ID |

Configure alert trigger field | Value |
---|---|
Condition type | Threshold |
Condition triggers if | Any time series violates |
Threshold position | Above threshold |
Threshold value | 70 |
Retest window | 10 minutes |
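If you prefer to manage the alert as code, the following sketch creates an equivalent alerting policy through the Cloud Monitoring API. The project and cluster IDs are placeholders, and the sketch assumes that the storage_utilization metric is reported as a ratio between 0 and 1, so 0.7 corresponds to the 70% threshold in the table above; attach notification channels to the policy as needed.

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

PROJECT_ID = "my-project"       # placeholder
CLUSTER_ID = "YOUR_CLUSTER_ID"  # placeholder

client = monitoring_v3.AlertPolicyServiceClient()

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Storage utilization above 70%",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type = "bigtable.googleapis.com/cluster/storage_utilization" '
            'AND resource.type = "bigtable_cluster" '
            f'AND resource.labels.cluster = "{CLUSTER_ID}"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.7,                          # 70% of the hard limit
        duration=duration_pb2.Duration(seconds=600),  # 10-minute retest window
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Bigtable cluster storage utilization",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)

created = client.create_alert_policy(
    name=f"projects/{PROJECT_ID}", alert_policy=policy
)
print(created.name)
```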
What's next
- Find out how to troubleshoot issues with Key Visualizer.
- Read about client-side metrics.
- Try the Cloud Monitoring quickstart.
- Learn about creating alerts based on Bigtable metrics.