Grafana Alerting Architecture
Kloudfuse uses the Grafana Alert Engine to execute alerts and the Prometheus-derived Alert Manager to handle notifications. This documentation helps operations teams understand the alerting architecture and how its components work together.
The following is the basic workflow for Alerts at a high level:

- Each Evaluation Group refers to its configuration to determine how and when its Alert Rules are executed.
- Alert Rules that trigger forward their alert instances to the Alert Manager.
- The Alert Manager organizes the triggered alerts, deciding how to group them together. It then uses Notification Policies to determine which Contact Points receive specific notifications.
- A Contact Point makes an external call to forward the details to a target endpoint (SMTP, PagerDuty, Webhook, and so on).
Alert Rules
Source: Grafana — Alert rules
An Alert Rule executes a query (PromQL, FuseQL, and so on) to check whether the result crosses a threshold. When the threshold is breached, the rule transitions an alert instance from Normal to Pending or Firing. Each rule specifies:

- Query and condition — a data source, the query sent to it, and the threshold or expression that determines whether the result is in an alerting state.
- Labels — key-value pairs attached to the alert instance, used by the Alert Manager to route alerts to the correct contact point.
- Annotations — descriptive metadata such as a summary, description, or runbook URL. Annotations appear in the notification message but do not affect routing.
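The three fields above can be sketched as a small data structure. This is a hedged illustration, not Grafana's internal model; the names (`AlertRule`, `breaches`) and the simple greater-than condition are assumptions made for the example:

```python
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    """Illustrative alert rule: a query, a threshold condition, and metadata."""
    name: str
    query: str                                       # e.g. a PromQL expression (not executed here)
    threshold: float                                 # alerting state when the result exceeds this
    labels: dict = field(default_factory=dict)       # used by the Alert Manager for routing
    annotations: dict = field(default_factory=dict)  # shown in the notification message

    def breaches(self, result: float) -> bool:
        """Return True when the evaluated result is in an alerting state."""
        return result > self.threshold

rule = AlertRule(
    name="HighCPU",
    query='avg(cpu_usage{cluster="prod"})',
    threshold=0.9,
    labels={"severity": "critical", "cluster": "prod"},
    annotations={"summary": "CPU usage above 90%"},
)
print(rule.breaches(0.95))  # True: this instance would leave the Normal state
```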
Alert States
Source: Grafana — Alert rule evaluation
Each alert rule instance moves through the following states:
| State | Description |
|---|---|
| Normal | The condition is not met. No notification is sent. |
| Pending | The condition has been met, but the Pending period has not yet elapsed. The alert does not fire until it has been in this state for the configured duration, reducing noise from transient spikes. |
| Alerting | The condition has breached the threshold for longer than the pending period. |
| Firing | The condition has been met for longer than the pending period. The alert is sent to the Alert Manager. |
| Recovering | The state of a firing alert whose threshold is no longer breached, but for less than the "keep firing for" period. |
| Error | The query failed to execute, for example because of a query timeout or invalid query syntax. You can configure the rule to alert on this state, but we recommend avoiding that. |
| No Data | The query returned no data. You can configure how the alert behaves (Normal, Alerting, or No Data) when it encounters this situation. |
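The transitions in the table can be sketched as a small state machine. This is an illustration only: it collapses the Alerting and Firing rows into a single Firing state and ignores the Error and No Data states:

```python
# Illustrative state machine for one alert instance. Timings and state names
# follow the table above; this is a sketch, not Grafana's implementation.
NORMAL, PENDING, FIRING, RECOVERING = "Normal", "Pending", "Firing", "Recovering"

def next_state(state, breached, secs_in_state, pending_period, keep_firing_for):
    if state == NORMAL:
        return PENDING if breached else NORMAL
    if state == PENDING:
        if not breached:
            return NORMAL                      # transient spike: no notification sent
        return FIRING if secs_in_state >= pending_period else PENDING
    if state == FIRING:
        return RECOVERING if not breached else FIRING
    if state == RECOVERING:
        if breached:
            return FIRING                      # flapped back before fully resolving
        return NORMAL if secs_in_state >= keep_firing_for else RECOVERING
    return state

# A breach shorter than the pending period never fires:
s = next_state(NORMAL, True, 0, 300, 120)      # -> Pending
s = next_state(s, False, 60, 300, 120)         # -> back to Normal
print(s)  # Normal
```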
Evaluation Groups
Source: Grafana — Evaluation Groups
An Evaluation Group is a single process that schedules one or more alert rules and controls how, and how frequently, they run. This frequency is referred to as the evaluation interval.
Grafana-managed rules, within the same group, are executed concurrently. They are evaluated at different times over the same evaluation interval but display the same evaluation timestamp.
Data source-managed rules, within the same group, are evaluated sequentially, one after the other. This is useful to ensure that recording rules are evaluated before alert rules.
NOTE: The Kfuse UI does not support creating Data source-managed rules.
Evaluation Groups are executed independently of each other, which allows Alert Rules to run concurrently: with 100 groups, up to 100 rules can execute at the same time.
Separate groups also limit the blast radius of problems: if one group has an issue, for example a query taking too long, the impact is confined to that group, and the rest continue without interruption. The trade-off is that more groups require more resources, specifically memory, CPU, and network bandwidth.
Key properties of an Evaluation Group:

- Folder (namespace) — a logical grouping for access control and organization. Rules in different folders can share a group name without conflict.
- Evaluation interval — how frequently the entire group is evaluated (for example, 1m or 5m). All alert rules in the group inherit this interval.
- Rules — one or more alert rules that share the evaluation interval.
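A minimal sketch of how independent groups allow concurrent evaluation, assuming a thread pool stands in for Grafana's scheduler (the group and rule names are invented for the example):

```python
import concurrent.futures

# Illustrative scheduler: each evaluation group is an independent unit of
# work, so groups run concurrently and a slow group only delays its own rules.
groups = {
    "cpu-alerts":  {"interval": "1m", "rules": ["HighCPU", "CPUThrottled"]},
    "disk-alerts": {"interval": "5m", "rules": ["DiskFull"]},
}

def evaluate_group(name, group):
    # In Grafana, each rule in the group would run its query here.
    return (name, [f"evaluated {r}" for r in group["rules"]])

with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(evaluate_group, n, g) for n, g in groups.items()]
    results = dict(f.result() for f in futures)

print(results["disk-alerts"])  # ['evaluated DiskFull']
```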
Alert Manager
Source: Grafana — Alertmanager
The Alert Manager is based on the architecture of the Prometheus alerting system. It receives firing and resolved alert results from alert rules and sends notifications for those alerts.
It uses Notification Policies to route alerts based on the labels attached to them. A single alert can match multiple notification policies.
Notification Policies
Source: Grafana — Notification policies
A Notification Policy defines how an alert is routed to a Contact Point. Notification Policies are structured as a tree: a root Default Policy with child policies nested underneath it.
Each policy contains:

- Label matchers — the conditions that must be met for the policy to apply (for example, severity=critical).
- Contact point — the destination that receives the notification.
- Grouping and timing overrides (optional) — settings that override the default label grouping and timing behavior.
- Child policies — policies with more specific criteria to fine-tune notifications.
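The routing walk can be sketched as a recursive search over the tree, where the most specific matching policy wins. This is a simplification: real Alertmanager routing also supports regex matchers and a continue flag that lets one alert match several policies. The policy names and contact points below are invented:

```python
# Illustrative notification-policy tree: each node has label matchers,
# a contact point, and child policies.
policy_tree = {
    "matchers": {},                 # root/default policy matches everything
    "contact_point": "default-email",
    "children": [
        {"matchers": {"severity": "critical"},
         "contact_point": "pagerduty",
         "children": [
             {"matchers": {"team": "db"},
              "contact_point": "db-oncall-slack",
              "children": []},
         ]},
    ],
}

def route(alert_labels, policy):
    """Return the contact point of the deepest policy whose matchers all apply."""
    if not all(alert_labels.get(k) == v for k, v in policy["matchers"].items()):
        return None
    for child in policy["children"]:
        hit = route(alert_labels, child)
        if hit:
            return hit
    return policy["contact_point"]

print(route({"severity": "critical", "team": "db"}, policy_tree))  # db-oncall-slack
print(route({"severity": "warning"}, policy_tree))                 # default-email
```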
Grouping Alerts
The Alert Manager allows you to bundle alerts into a smaller number of notifications. This is useful when you want to avoid spamming a contact with too many notifications that trigger at once. This can be configured in the Notification Policy using the Group by labels (for example, grouping all alerts with the same cluster and severity labels into a single message).
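A sketch of the grouping idea, bundling alerts that share the group-by labels into one notification each (the alert names and label values are invented for the example):

```python
from collections import defaultdict

# Illustrative grouping: firing alerts that share the "group by" labels
# (here cluster and severity) are bundled into a single notification.
alerts = [
    {"name": "HighCPU",  "cluster": "prod",  "severity": "critical"},
    {"name": "DiskFull", "cluster": "prod",  "severity": "critical"},
    {"name": "HighCPU",  "cluster": "stage", "severity": "warning"},
]

group_by = ("cluster", "severity")
notifications = defaultdict(list)
for alert in alerts:
    key = tuple(alert[label] for label in group_by)
    notifications[key].append(alert["name"])

# Three alerts collapse into two notifications:
print(dict(notifications))
# {('prod', 'critical'): ['HighCPU', 'DiskFull'], ('stage', 'warning'): ['HighCPU']}
```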
Suppress Notifications
Source: Grafana — Configure silences
You can Suppress Notifications for a specific time frame. This only suppresses the notifications being sent out; it does not disable Alert Evaluations. This is useful during planned work, when you don’t want notifications while a system is intentionally unavailable.
Suppression Schedule
Source: Grafana — Configure mute timings
A Suppression Schedule is similar to Suppress Notifications. The difference is that Suppress Notifications covers a single, one-off event: it is defined, applied, and done. For a recurring event, for example maintenance every Sunday at 8am, use a Suppression Schedule instead.
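A sketch of the recurring-window idea behind a Suppression Schedule, assuming a window of Sunday 08:00 to 10:00. Grafana expresses this declaratively as a mute timing rather than in code; this only illustrates the check that such a schedule performs:

```python
from datetime import datetime

def in_maintenance_window(ts: datetime) -> bool:
    """True during the illustrative recurring window: Sundays 08:00-10:00."""
    return ts.weekday() == 6 and 8 <= ts.hour < 10  # Sunday is weekday() == 6

print(in_maintenance_window(datetime(2024, 6, 2, 8, 30)))  # True  (a Sunday)
print(in_maintenance_window(datetime(2024, 6, 3, 8, 30)))  # False (a Monday)
```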
Contact Points
Source: Grafana — Contact points
A Contact Point is an endpoint configured to receive a notification about an alert. Kloudfuse supports these endpoint types in the UI:
| Type | Description |
|---|---|
| Email | Sends notifications via SMTP to one or more email addresses. |
| Slack | Posts messages to a Slack channel using an incoming webhook or API token. |
| PagerDuty | Creates incidents in PagerDuty using an integration API key. |
| Microsoft Teams | Posts messages to a Teams channel via incoming webhook. |
| Webhook | Sends an HTTP POST payload to any custom endpoint. |
| Google Chat | Posts messages to a Google Chat space via incoming webhook. |
| OpsGenie | Creates alerts in OpsGenie using an API key. |
For detailed configuration steps for each contact point type, see Contact Points.
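As a rough illustration of the Webhook type, the sketch below builds a JSON body resembling what a webhook contact point sends. The field names approximate Grafana's webhook notifier and may differ across versions, so treat the shape as illustrative rather than authoritative:

```python
import json

# Simplified approximation of a webhook contact-point payload; the label
# values and annotation text are invented for the example.
payload = {
    "status": "firing",
    "commonLabels": {"cluster": "prod", "severity": "critical"},
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighCPU", "cluster": "prod"},
            "annotations": {"summary": "CPU usage above 90%"},
        }
    ],
}

body = json.dumps(payload)
# An HTTP client would POST `body` with Content-Type: application/json
# to the configured endpoint.
print(len(payload["alerts"]))  # 1
```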