Grafana Alerting Architecture
Kloudfuse uses the Grafana Alert Engine to execute alerts and the Prometheus-derived Alert Manager to handle notifications. This documentation helps operations teams understand the alerting architecture and how its components work together.
The following is the basic workflow for Alerts at a high level:

- Each Evaluation Group refers to its configuration to determine how and when its Alert Rules are executed.
- Alert Rules that trigger forward their alert instances to the Alert Manager.
- The Alert Manager organizes the triggered alerts, deciding how to group them together. It then uses Notification Policies to determine which Contact Points receive specific notifications.
- A Contact Point makes an external call to forward the details to a target endpoint (SMTP, PagerDuty, Webhook, and so on).
Alert Rules
Source: Grafana — Alert rules
An Alert Rule executes a query (PromQL, FuseQL, and so on) to check whether the result crosses a threshold. When the threshold is breached, the rule transitions an alert instance from Normal to Pending or Firing. Each rule specifies:

- Query and condition — a data source, the query sent to it, and the threshold or expression that determines whether the result is in an alerting state.
- Labels — key-value pairs attached to the alert instance, used by the Alert Manager to route alerts to the correct contact point.
- Annotations — descriptive metadata such as a summary, description, or runbook URL. Annotations appear in the notification message but do not affect routing.
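The three fields above can be sketched as a small data structure. This is a hedged illustration, not Grafana's internal model; the names (`AlertRule`, `breaches`) and the simple greater-than condition are assumptions made for the example:

```python
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    """Illustrative alert rule: a query, a threshold condition, and metadata."""
    name: str
    query: str                                       # e.g. a PromQL expression (not executed here)
    threshold: float                                 # alerting state when the result exceeds this
    labels: dict = field(default_factory=dict)       # used by the Alert Manager for routing
    annotations: dict = field(default_factory=dict)  # shown in the notification message

    def breaches(self, result: float) -> bool:
        """Return True when the evaluated result is in an alerting state."""
        return result > self.threshold

rule = AlertRule(
    name="HighCPU",
    query='avg(cpu_usage{cluster="prod"})',
    threshold=0.9,
    labels={"severity": "critical", "cluster": "prod"},
    annotations={"summary": "CPU usage above 90%"},
)
print(rule.breaches(0.95))  # True: this instance would leave the Normal state
```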
Alert States
Source: Grafana — Alert rule evaluation
Each alert rule instance moves through the following states:
| State | Description |
|---|---|
| Normal | The condition is not met. No notification is sent. |
| Pending | The condition has been met, but the Pending period has not yet elapsed. The alert does not fire until it has been in this state for the configured duration, reducing noise from transient spikes. |
| Alerting | The condition has breached the threshold for longer than the pending period. |
| Firing | The condition has been met for longer than the pending period. The alert is sent to the Alert Manager. |
| Recovering | The state of a firing alert whose threshold is no longer breached, but for less than the "keep firing for" period. |
| Error | The query failed to execute, for example because of a query timeout or invalid query syntax. You can configure the rule to alert on this state, but we recommend avoiding that. |
| No Data | The query returned no data. You can configure how the alert behaves (Normal, Alerting, or No Data) when it encounters this situation. |
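The transitions in the table can be sketched as a small state machine. This is an illustration only: it collapses the Alerting and Firing rows into a single Firing state and ignores the Error and No Data states:

```python
# Illustrative state machine for one alert instance. Timings and state names
# follow the table above; this is a sketch, not Grafana's implementation.
NORMAL, PENDING, FIRING, RECOVERING = "Normal", "Pending", "Firing", "Recovering"

def next_state(state, breached, secs_in_state, pending_period, keep_firing_for):
    if state == NORMAL:
        return PENDING if breached else NORMAL
    if state == PENDING:
        if not breached:
            return NORMAL                      # transient spike: no notification sent
        return FIRING if secs_in_state >= pending_period else PENDING
    if state == FIRING:
        return RECOVERING if not breached else FIRING
    if state == RECOVERING:
        if breached:
            return FIRING                      # flapped back before fully resolving
        return NORMAL if secs_in_state >= keep_firing_for else RECOVERING
    return state

# A breach shorter than the pending period never fires:
s = next_state(NORMAL, True, 0, 300, 120)      # -> Pending
s = next_state(s, False, 60, 300, 120)         # -> back to Normal
print(s)  # Normal
```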
Evaluation Groups
Source: Grafana — Evaluation Groups
An Evaluation Group is a single process that schedules one or more alert rules and controls how, and how frequently, they run. This frequency is referred to as the evaluation interval.
Grafana-managed rules, within the same group, are executed concurrently. They are evaluated at different times over the same evaluation interval but display the same evaluation timestamp.
Data source-managed rules, within the same group, are evaluated sequentially, one after the other. This is useful to ensure that recording rules are evaluated before alert rules.
NOTE: The Kfuse UI does not support creating Data source-managed rules.
Evaluation Groups are executed independently of each other, which allows Alert Rules to run concurrently: with 100 groups, up to 100 rules can execute at the same time.
Separate groups also limit the blast radius of problems: if one group has an issue, for example a query taking too long, the impact is confined to that group, and the rest continue without interruption. The trade-off is that more groups require more resources, specifically memory, CPU, and network bandwidth.
Key properties of an Evaluation Group:

- Folder (namespace) — a logical grouping for access control and organization. Rules in different folders can share a group name without conflict.
- Evaluation interval — how frequently the entire group is evaluated (for example, 1m or 5m). All alert rules in the group inherit this interval.
- Rules — one or more alert rules that share the evaluation interval.
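A minimal sketch of how independent groups allow concurrent evaluation, assuming a thread pool stands in for Grafana's scheduler (the group and rule names are invented for the example):

```python
import concurrent.futures

# Illustrative scheduler: each evaluation group is an independent unit of
# work, so groups run concurrently and a slow group only delays its own rules.
groups = {
    "cpu-alerts":  {"interval": "1m", "rules": ["HighCPU", "CPUThrottled"]},
    "disk-alerts": {"interval": "5m", "rules": ["DiskFull"]},
}

def evaluate_group(name, group):
    # In Grafana, each rule in the group would run its query here.
    return (name, [f"evaluated {r}" for r in group["rules"]])

with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(evaluate_group, n, g) for n, g in groups.items()]
    results = dict(f.result() for f in futures)

print(results["disk-alerts"])  # ['evaluated DiskFull']
```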
Alert Manager
Source: Grafana — Alertmanager
The Alert Manager is based on the architecture of the Prometheus alerting system. It receives firing and resolved alert results from alert rules and sends notifications for those alerts.
It uses Notification Policies to route alerts based on the labels attached to them. A single alert can match multiple notification policies.
Notification Policies
Source: Grafana — Notification policies
A Notification Policy defines how an alert is routed to a Contact Point. Notification Policies are structured as a tree: a root Default Policy with child policies nested underneath it.
Each policy contains:

- Label matchers — the conditions that must be met for the policy to apply (for example, severity=critical).
- Contact point — the destination that receives the notification.
- Grouping and timing overrides (optional) — settings that override the default label grouping and timing behavior.
- Child policies — policies with more specific criteria to fine-tune notifications.
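The routing walk can be sketched as a recursive search over the tree, where the most specific matching policy wins. This is a simplification: real Alertmanager routing also supports regex matchers and a continue flag that lets one alert match several policies. The policy names and contact points below are invented:

```python
# Illustrative notification-policy tree: each node has label matchers,
# a contact point, and child policies.
policy_tree = {
    "matchers": {},                 # root/default policy matches everything
    "contact_point": "default-email",
    "children": [
        {"matchers": {"severity": "critical"},
         "contact_point": "pagerduty",
         "children": [
             {"matchers": {"team": "db"},
              "contact_point": "db-oncall-slack",
              "children": []},
         ]},
    ],
}

def route(alert_labels, policy):
    """Return the contact point of the deepest policy whose matchers all apply."""
    if not all(alert_labels.get(k) == v for k, v in policy["matchers"].items()):
        return None
    for child in policy["children"]:
        hit = route(alert_labels, child)
        if hit:
            return hit
    return policy["contact_point"]

print(route({"severity": "critical", "team": "db"}, policy_tree))  # db-oncall-slack
print(route({"severity": "warning"}, policy_tree))                 # default-email
```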
Grouping Alerts
The Alert Manager allows you to bundle alerts into a smaller number of notifications. This is useful when you want to avoid spamming a contact with too many notifications that trigger at once. This can be configured in the Notification Policy using the Group by labels (for example, grouping all alerts with the same cluster and severity labels into a single message).
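A sketch of the grouping idea, bundling alerts that share the group-by labels into one notification each (the alert names and label values are invented for the example):

```python
from collections import defaultdict

# Illustrative grouping: firing alerts that share the "group by" labels
# (here cluster and severity) are bundled into a single notification.
alerts = [
    {"name": "HighCPU",  "cluster": "prod",  "severity": "critical"},
    {"name": "DiskFull", "cluster": "prod",  "severity": "critical"},
    {"name": "HighCPU",  "cluster": "stage", "severity": "warning"},
]

group_by = ("cluster", "severity")
notifications = defaultdict(list)
for alert in alerts:
    key = tuple(alert[label] for label in group_by)
    notifications[key].append(alert["name"])

# Three alerts collapse into two notifications:
print(dict(notifications))
# {('prod', 'critical'): ['HighCPU', 'DiskFull'], ('stage', 'warning'): ['HighCPU']}
```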
Suppress Notifications
Source: Grafana — Configure silences
You can Suppress Notifications for a specific time frame. This only suppresses the notifications being sent out; it does not disable Alert Evaluations. This is useful during planned work, when you don’t want notifications while a system is intentionally unavailable.
Suppression Schedule
Source: Grafana — Configure mute timings
A Suppression Schedule is similar to Suppress Notifications. The difference is that Suppress Notifications covers a single, one-off event: it is defined, applied, and done. For a recurring event, for example maintenance every Sunday at 8am, use a Suppression Schedule instead.
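A sketch of the recurring-window idea behind a Suppression Schedule, assuming a window of Sunday 08:00 to 10:00. Grafana expresses this declaratively as a mute timing rather than in code; this only illustrates the check that such a schedule performs:

```python
from datetime import datetime

def in_maintenance_window(ts: datetime) -> bool:
    """True during the illustrative recurring window: Sundays 08:00-10:00."""
    return ts.weekday() == 6 and 8 <= ts.hour < 10  # Sunday is weekday() == 6

print(in_maintenance_window(datetime(2024, 6, 2, 8, 30)))  # True  (a Sunday)
print(in_maintenance_window(datetime(2024, 6, 3, 8, 30)))  # False (a Monday)
```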
Contact Points
Source: Grafana — Contact points
A Contact Point is an endpoint configured to receive a notification about an alert. Kloudfuse supports these endpoint types in the UI:
| Type | Description |
|---|---|
| Email | Sends notifications via SMTP to one or more email addresses. |
| Slack | Posts messages to a Slack channel using an incoming webhook or API token. |
| PagerDuty | Creates incidents in PagerDuty using an integration API key. |
| Microsoft Teams | Posts messages to a Teams channel via incoming webhook. |
| Webhook | Sends an HTTP POST payload to any custom endpoint. |
| Google Chat | Posts messages to a Google Chat space via incoming webhook. |
| OpsGenie | Creates alerts in OpsGenie using an API key. |
For detailed configuration steps for each contact point type, see Contact Points.
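As a rough illustration of the Webhook type, the sketch below builds a JSON body resembling what a webhook contact point sends. The field names approximate Grafana's webhook notifier and may differ across versions, so treat the shape as illustrative rather than authoritative:

```python
import json

# Simplified approximation of a webhook contact-point payload; the label
# values and annotation text are invented for the example.
payload = {
    "status": "firing",
    "commonLabels": {"cluster": "prod", "severity": "critical"},
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighCPU", "cluster": "prod"},
            "annotations": {"summary": "CPU usage above 90%"},
        }
    ],
}

body = json.dumps(payload)
# An HTTP client would POST `body` with Content-Type: application/json
# to the configured endpoint.
print(len(payload["alerts"]))  # 1
```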