Alerts help you stay informed about the health and performance of your air-gapped deployments. They provide timely notifications when specific conditions are met, letting you do the following:
- Proactively address issues: Detect and respond to problems before they impact users or business operations.
- Reduce downtime: Minimize service disruptions by taking corrective action quickly.
- Maintain service levels: Ensure your applications meet performance and availability targets.
- Gain operational insights: Identify trends and patterns in your environment to optimize resource utilization and performance.
This page provides an overview of creating and managing alerts in Google Distributed Cloud (GDC) air-gapped environments. It explains how to use monitoring data to proactively identify and respond to critical events within your applications and infrastructure.
Alerting policy types
Metric-based alerting policies track monitoring data and notify specific people when a resource meets a pre-established condition. For example, an alerting policy that monitors the CPU utilization of a virtual machine sends a notification when utilization stays above a defined threshold for a set duration. Alternatively, a policy that monitors an uptime check might notify on-call and development teams when the check fails.
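For example, the following sketch shows what such a policy can look like as a Prometheus alerting rule. It assumes that node-level CPU metrics (here, node_cpu_seconds_total from a node exporter) are being scraped; the group name, threshold, and duration are illustrative values, not GDC defaults.

```yaml
groups:
- name: vm-health                  # illustrative group name
  rules:
  - alert: HighCpuUtilization
    # Average non-idle CPU across all cores of an instance, as a percentage.
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    # Fire only if the condition holds for 15 minutes, to ignore short spikes.
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "CPU utilization on {{ $labels.instance }} is above 90%"
```

How you deliver rules like this to the monitoring stack depends on your environment; the snippet uses the upstream Prometheus rule-file format.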
On the other hand, to monitor recurring events in your logs over time, use log-based metrics to create alerting policies. Log-based metrics generate numerical data from logging data. Log-based metrics are suitable when you want to do any of the following:
- Count the message occurrences in your logs, like a warning or error. Receive a notification when the number of events crosses a threshold, as in the example after this list.
- Observe trends in your data, like latency values in your logs. Receive a notification if the values change unacceptably.
- Create charts to display the numeric data extracted from your logs.
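As a sketch of the first case, the following Prometheus-format rule alerts on a log-based counter. The metric name application_error_logs_total is hypothetical and stands in for whatever counter your log-based metric configuration produces; the threshold and time windows are illustrative.

```yaml
groups:
- name: log-based-alerts
  rules:
  - alert: ErrorLogRateHigh
    # application_error_logs_total is a hypothetical counter produced by a
    # log-based metric that counts ERROR entries. Alert if more than 50
    # errors were logged in the last 10 minutes.
    expr: increase(application_error_logs_total[10m]) > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Error log volume is unusually high"
```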
In GDC, alerts can generate pages and tickets for critical errors. Pages require immediate attention from an operator, while tickets are less urgent.
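One common way to express this distinction is a severity label on each alert that the routing layer acts on. The following Alertmanager-style sketch routes critical alerts to a paging receiver and everything else to a ticketing receiver; the receiver names and endpoint URLs are hypothetical, not GDC defaults.

```yaml
route:
  # Default destination: open a ticket for anything not matched below.
  receiver: ticket-queue
  routes:
  # Critical alerts page the on-call operator immediately.
  - matchers:
    - severity="critical"
    receiver: oncall-pager
receivers:
- name: oncall-pager
  webhook_configs:
  - url: https://pager.example.internal/hooks/alertmanager    # illustrative endpoint
- name: ticket-queue
  webhook_configs:
  - url: https://tickets.example.internal/hooks/alertmanager   # illustrative endpoint
```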
Key components
The GDC alerting service uses the following components:
- Prometheus: An open-source monitoring system widely used for collecting and storing metrics. It provides a powerful query language (PromQL) for defining alert rules.
- Monitoring platform: A managed monitoring service that collects metrics from various sources, including Prometheus. It offers advanced features like Grafana dashboards, custom metrics, and alerting.
- Alertmanager: A component responsible for receiving, processing, and routing alerts. It supports grouping, silencing, and inhibiting alerts to reduce noise and improve efficiency.
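As a sketch of the grouping and inhibition features, the following snippet uses the upstream Alertmanager configuration format; the receiver name, intervals, and endpoint URL are illustrative. Silences are typically created at runtime through the Alertmanager UI or API rather than in this file.

```yaml
route:
  receiver: default-receiver        # illustrative receiver name
  # Batch related alerts into a single notification.
  group_by: ['alertname', 'cluster']
  group_wait: 30s        # wait before sending the first notification for a group
  group_interval: 5m     # wait before notifying about new alerts in an existing group
  repeat_interval: 4h    # resend unresolved alerts at most this often
inhibit_rules:
  # Suppress warning-level alerts while a critical alert is already firing
  # for the same alert name on the same cluster.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'cluster']
receivers:
  - name: default-receiver
    webhook_configs:
      - url: https://hooks.example.internal/alerts   # illustrative endpoint
```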
Alerting workflow
GDC provides an alerting framework that integrates with various monitoring tools and services. The typical workflow involves the following stages:
- Data collection: Use tools like Prometheus and Fluent Bit to collect metrics and logs from your applications, infrastructure, and Kubernetes clusters.
- Monitoring: Store and visualize the collected data in Grafana dashboards.
- Alerting rules: Define alert rules based on specific conditions, such as CPU usage exceeding a threshold or application errors exceeding a certain rate.
- Alertmanager: Alertmanager receives alerts triggered by the defined rules and handles notification routing and silencing.
- Notifications: Receive alerts through various channels, such as email, messages, or webhooks.
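Notification channels are defined as Alertmanager receivers. The following sketch shows a hypothetical email receiver and a chat webhook receiver; the addresses, SMTP relay, and URL are placeholders for your environment's actual endpoints.

```yaml
receivers:
- name: team-email
  email_configs:
  - to: oncall@example.internal           # illustrative address
    from: alertmanager@example.internal
    smarthost: smtp.example.internal:587  # illustrative SMTP relay
- name: chat-webhook
  webhook_configs:
  - url: https://chat.example.internal/hooks/alerts   # illustrative webhook endpoint
    send_resolved: true   # also notify when the alert clears
```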
Best practices
When setting up alerts, consider the following best practices:
- Define clear and actionable alerts: Ensure your alerts provide specific information about the issue and suggest appropriate actions, as in the example after this list.
- Set appropriate severity levels: Categorize alerts based on their impact and urgency to prioritize response efforts.
- Avoid alert fatigue: Fine-tune your alert rules to minimize false positives and unnecessary notifications.
- Test your alerts regularly: Verify that your alerts are triggered correctly and notifications are delivered as expected.
- Document your alerting strategy: Document your alert rules, notification channels, and escalation procedures.
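As an illustration of the first two practices, the following rule sketch pairs a severity label with annotations that tell the responder what happened and where to look. The service name, metric (http_requests_total), threshold, and runbook URL are all placeholders for values from your own environment.

```yaml
groups:
- name: checkout-service           # illustrative group and service names
  rules:
  - alert: CheckoutErrorRateHigh
    # Ratio of 5xx responses to all responses over the last 5 minutes.
    expr: |
      sum(rate(http_requests_total{job="checkout", code=~"5.."}[5m]))
        / sum(rate(http_requests_total{job="checkout"}[5m])) > 0.05
    for: 10m
    labels:
      severity: critical           # maps to your paging policy
    annotations:
      summary: "Checkout 5xx error ratio above 5% for 10 minutes"
      description: "Check recent deployments and upstream dependency status."
      runbook_url: https://runbooks.example.internal/checkout-errors   # placeholder
```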