Grouping
Summary
Grouping classifies alerts of a similar nature into a single notification. This is particularly useful during large outages, when many systems fail simultaneously and hundreds or thousands of alerts may fire at once.
Example: a network partition occurs while dozens or hundreds of service instances are running in the cluster, and half of them can no longer reach the database. The Prometheus alert rules are configured to fire an alert for every service instance that cannot communicate with the database, so hundreds of alerts are sent to Alertmanager.
As a user, you want to receive only a single page while still being able to see exactly which service instances are affected. Alertmanager can therefore be configured to group alerts by cluster and alert name, so that it sends a single compact notification.
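For example, a route like the following would collapse the database-connectivity alerts from each cluster into one notification per cluster. This is only a minimal sketch: the cluster label and the receiver name team-pager are assumptions, not part of this setup.

route:
  group_by: ['cluster', 'alertname']   # one notification per cluster and alert name
  receiver: 'team-pager'               # hypothetical receiver; it must be defined under receivers in the same file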
Which alerts are grouped together, when the grouped notifications are sent, and who receives them are all configured by the routing tree in the configuration file, in the route section.
Configuration parameters
route:
  # Labels used for grouping. By default all alerts are grouped together; once grouping labels
  # are specified, Alertmanager groups alerts by these labels.
  group_by: ['alertname', 'app']
  # Initial wait before sending a notification for a new group of alerts. Allows time for an
  # inhibiting alert to arrive, or to collect more initial alerts for the same group;
  # usually 0 to a few minutes.
  group_wait: 30s
  # How long to wait before sending a notification about new alerts that are added to a group
  # for which an initial notification has already been sent; usually 5 minutes or more.
  group_interval: 40s
  # How long to wait before re-sending a notification that has already been sent successfully;
  # usually at least 3 hours.
  repeat_interval: 1m
Configuration example
route:
  group_by: ['alertname', 'app']
  group_wait: 30s
  group_interval: 40s
  repeat_interval: 1m
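For context, the route block above sits inside a complete alertmanager.yml next to a receiver definition. Below is a minimal sketch assuming email notifications (which the screenshots later in this article use); the SMTP settings, the receiver name and the recipient address are placeholders, not values from this setup.

global:
  smtp_smarthost: 'smtp.example.com:465'     # placeholder SMTP server
  smtp_from: 'alertmanager@example.com'      # placeholder sender address
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'app']
  group_wait: 30s
  group_interval: 40s
  repeat_interval: 1m
  receiver: 'email-default'                  # hypothetical receiver name

receivers:
  - name: 'email-default'
    email_configs:
      - to: 'ops@example.com'                # placeholder recipient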
Alertmanager front-end view
Inhibition
Inhibition is the concept of suppressing notifications for certain alerts when certain other alerts are already firing.
Example: an alert is firing which indicates that an entire cluster is unreachable. Alertmanager can be configured to mute all other alerts related to that cluster while this alert is firing. This prevents hundreds or thousands of notifications for alerts that are unrelated to the actual problem.
Inhibition is configured through Alertmanager's configuration file.
An inhibition rule mutes alerts (targets) that match one set of matchers while an alert (source) exists that matches another set of matchers. For the label names listed in equal, the target and source alerts must have the same label values.
Semantically, a missing label and a label with an empty value are the same thing. Therefore, if all the label names listed in equal are missing from both the source and target alerts, the inhibition rule still applies.
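To make the cluster example above concrete, a rule along the following lines would mute every other alert from a cluster while the cluster-down alert is firing. This is a sketch only; the alert name ClusterUnreachable and the cluster label are assumptions.

inhibit_rules:
  - source_matchers:
      - alertname="ClusterUnreachable"      # hypothetical "entire cluster is unreachable" alert
    target_matchers:
      - alertname!="ClusterUnreachable"     # mute all other alerts ...
    equal: ['cluster']                      # ... but only those with the same cluster label value;
                                            # if the label is missing from both sides, the rule still applies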
Configuration parameters
# DEPRECATED: Use target_matchers below.
# Matchers that have to be fulfilled in the alerts to be muted.
# Target alert exact (string) label matches.
target_match:
  [ <labelname>: <labelvalue>, ... ]

# DEPRECATED: Use target_matchers below.
# Target alert regex label matches.
target_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that have to be fulfilled by the target
# alerts to be muted.
target_matchers:
  [ - <matcher> ... ]

# DEPRECATED: Use source_matchers below.
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
# Source alert exact (string) label matches.
source_match:
  [ <labelname>: <labelvalue>, ... ]

# DEPRECATED: Use source_matchers below.
# Source alert regex label matches.
source_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers for which one or more alerts have
# to exist for the inhibition to take effect.
source_matchers:
  [ - <matcher> ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
Configuration example
Alertmanager configuration example
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match_re:
      severity: '.*'
    equal: ['instance']
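Since source_match and target_match_re are marked as deprecated in the parameter list above, the same rule can also be written with the newer matchers syntax; the following is an equivalent sketch of the rule above.

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity=~".*"
    equal: ['instance']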
Prometheus alert rule configuration example
- alert: Insufficient host memory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 99
  for: 10s
  labels:
    severity: "critical"
  annotations:
    summary: Insufficient host memory (Host address {{ $labels.instance }})
    description: "The host memory is full. The current memory is less than 70 percent\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

- alert: warning Test memory
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 98
  for: 10s
  labels:
    severity: "warning"
  annotations:
    summary: warning Insufficient host memory (Host address {{ $labels.instance }})
    description: "The host memory is full. The current memory is less than 70 percent\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
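For completeness, here is a sketch of the prometheus.yml pieces that load such a rule file and point Prometheus at Alertmanager. The file path rules/memory.yml and the Alertmanager address are assumptions; in a standalone rule file, the two alerts above would sit under a groups/rules block.

# prometheus.yml (excerpt)
rule_files:
  - "rules/memory.yml"                       # assumed path of the rule file containing the alerts above

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]        # assumed Alertmanager address (default port)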
Triggered alert information in the Prometheus UI
Viewing the alert email
You can see that the email contains no alert with the title "warning Test memory", so the inhibition rule has taken effect.