Alertmanager: grouping, inhibition (suppression), silence

grouping

summary

    Grouping combines alerts of a similar nature into a single notification. This is particularly useful during large outages, when many systems fail at the same time and may trigger hundreds or thousands of alerts at once.

    Example: a network partition occurs while dozens or hundreds of service instances are running in the cluster, and half of them can no longer reach the database. The alert rules in Prometheus are configured to fire an alert for each service instance that cannot communicate with the database. As a result, hundreds of alerts are sent to Alertmanager.

As a user, you only want to receive a single page while still being able to see exactly which service instances are affected. Alertmanager can therefore be configured to group alerts by cluster and alert name, so that it sends a single compact notification.

    Which alerts are grouped together, the timing of the grouped notifications, and the recipients of those notifications are configured via the routing tree in the configuration file, in the `route` section.

configuration parameter

route:
  group_by: ['alertname', 'app']    # Labels used for grouping. By default all alerts are batched together; once grouping labels are specified, Alertmanager groups alerts by these labels.
  group_wait: 30s                   # How long to wait before sending the initial notification for a new group, to allow more initial alerts of the same group to be collected; usually 0 to a few minutes.
  group_interval: 40s               # How long to wait before sending a notification about new alerts added to a group that has already received its initial notification; usually 5 minutes or more.
  repeat_interval: 1m               # How long to wait before re-sending a notification that was already sent successfully; usually at least 3 hours.
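To make the interaction of these three timers concrete, here is a minimal timeline sketch in plain Python. It is purely illustrative, mimicking the documented behaviour rather than Alertmanager's actual code, and uses the example values above:

```python
# Illustrative sketch only -- NOT Alertmanager's real implementation.
# It walks through the notification timeline of a single alert group.

GROUP_WAIT = 30       # group_wait: 30s
GROUP_INTERVAL = 40   # group_interval: 40s
REPEAT_INTERVAL = 60  # repeat_interval: 1m

# t=0: the first alert of a new group arrives.
first_alert_at = 0
# The initial notification is delayed by group_wait so that other alerts
# of the same group can be batched into it.
initial_notify = first_alert_at + GROUP_WAIT

# t=35: another alert joins the same group.
new_alert_at = 35
# The updated group is announced no earlier than group_interval after
# the previous notification.
update_notify = max(new_alert_at, initial_notify + GROUP_INTERVAL)

# With no further changes, the same group is re-sent after repeat_interval
# (in practice the repeat is evaluated on group_interval ticks).
repeat_notify = update_notify + REPEAT_INTERVAL
```

With these values, the first notification goes out at t=30s, the update at t=70s, and the repeat at t=130s.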

Configuration example

route:                                  
  group_by: ['alertname', 'app']               
  group_wait: 30s
  group_interval: 40s
  repeat_interval: 1m
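For completeness, a `route` only takes effect together with a receiver that the grouped notifications are delivered to. A minimal sketch of a full routing tree follows; the receiver name and email address are illustrative assumptions, not part of the original setup:

```yaml
route:
  receiver: 'default-receiver'      # hypothetical receiver name
  group_by: ['alertname', 'app']
  group_wait: 30s
  group_interval: 40s
  repeat_interval: 1m

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops@example.com'       # illustrative address
```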

Alertmanager front-end view

 

inhibition

    Inhibition is a mechanism for suppressing notifications for certain alerts when certain other alerts are already firing.

    Example: an alert fires indicating that an entire cluster is unreachable. While this particular alert is firing, Alertmanager can be configured to mute all other alerts concerning that cluster. This prevents hundreds or thousands of notifications for alerts that are unrelated to the actual problem.

    Inhibition rules are configured through Alertmanager's configuration file.

    An inhibition rule mutes alerts (targets) matching one set of matchers while an alert (source) exists that matches another set of matchers. For the label names in the `equal` list, both the target and the source alerts must have the same label values.

    Semantically, a missing label and a label with an empty value are the same thing. Therefore, if all the label names listed in `equal` are missing from both the source and the target alert, the inhibition rule still applies.
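As an illustration of this empty-value semantics, consider a hypothetical rule that requires the `cluster` label to be equal. If neither the source nor the target alert carries a `cluster` label at all, both sides are treated as having the empty value, so they count as equal and the rule still applies:

```yaml
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    # Applies even when neither alert has a `cluster` label:
    # a missing label equals an empty value on both sides.
    equal: ['cluster']
```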

configuration parameter

# DEPRECATED: Use target_matchers below.
# Matchers that have to be fulfilled in the alerts to be muted.
# (equality matchers for the target alerts)
target_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use target_matchers below.
# (regex matchers for the target alerts)
target_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers that have to be fulfilled by the target
# alerts to be muted.
target_matchers:
  [ - <matcher> ... ]

# DEPRECATED: Use source_matchers below.
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
# (equality matchers for the source alerts)
source_match:
  [ <labelname>: <labelvalue>, ... ]
# DEPRECATED: Use source_matchers below.
# (regex matchers for the source alerts)
source_match_re:
  [ <labelname>: <regex>, ... ]

# A list of matchers for which one or more alerts have
# to exist for the inhibition to take effect.
source_matchers:
  [ - <matcher> ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]

Configuration example

Alertmanager configuration example

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match_re:
      severity: '.*'
    equal: ['instance']
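Since `source_match` and `target_match_re` are deprecated, the same rule can also be written with the newer matchers syntax; the behaviour should be identical:

```yaml
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity=~".*"        # regex matcher: any severity
    equal: ['instance']
```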

prometheus alert rule configuration example

  - alert: Insufficient host memory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 99
    for: 10s
    labels:
      severity: "critical"
    annotations:
      summary: Insufficient host memory (host address {{ $labels.instance }})
      description: "Available memory is below 99% (deliberately loose threshold, so the alert fires for testing)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: warning Test memory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 98
    for: 10s
    labels:
      severity: "warning"
    annotations:
      summary: warning Insufficient host memory (host address {{ $labels.instance }})
      description: "Available memory is below 98% (deliberately loose threshold, so the alert fires for testing)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Viewing the triggered alerts in Prometheus

Viewing the alert email

You can see that the email contains no alert titled "warning Test memory", so the inhibition has taken effect.

 

Posted on Mon, 08 Nov 2021 18:45:08 -0500 by n8w