Implementing Prometheus alarms with Alertmanager

Prometheus does not send alarm notifications itself; it relies on a separate component, Alertmanager. Alertmanager receives the alarms sent by Prometheus, processes them, and sends notifications to the specified users or groups.

The process of Prometheus triggering an alarm is as follows:

Prometheus server -> threshold triggered -> condition lasts longer than the specified duration -> Alertmanager -> grouping, inhibition, silencing -> media type -> email, DingTalk, WeChat, etc.

1, Install and configure Alertmanager

Download and unzip

$ wget
$ tar -zxf alertmanager-0.20.0.linux-amd64.tar.gz -C /usr/local/
$ cd /usr/local/
$ mv alertmanager-0.20.0.linux-amd64 alertmanager

Here is a pitfall:

Alertmanager stores its data locally, and the default storage path is data/, relative to the working directory. Set the path explicitly with --storage.path; otherwise startup may fail with an error like this:

Jun 01 17:04:56 localhost.localdomain alertmanager[9742]: level=error ts=2020-06-01T09:04:56.626Z caller=main.go:236 msg="Unable to create data directory" err="mkdir data/: permission denied"

Create storage directory

mkdir -p /usr/local/alertmanager/data

Use --storage.path together with --config.file, which specifies the path of the Alertmanager configuration file.

Create startup file

cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Service]
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data
EOF
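
The unit file above contains little more than the ExecStart line. A complete unit usually also has [Unit] and [Install] sections. The following is a minimal sketch, not the author's original file: the Description text, User=prometheus (inferred from the chown step later in this article), Restart=on-failure, and the scratch path held in UNIT_FILE are all assumptions. It writes to a temporary file so the result can be reviewed before it is copied to /usr/lib/systemd/system/.

```shell
# Sketch only: write the unit to a scratch file, review it, then install it.
UNIT_FILE="$(mktemp)"   # hypothetical scratch location, not the real unit path

cat > "$UNIT_FILE" <<'EOF'
[Unit]
Description=Alertmanager
After=network.target

[Service]
# Assumed service account; matches the later chown to prometheus:prometheus
User=prometheus
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

echo "Unit written to $UNIT_FILE"
# After review (as root):
# cp "$UNIT_FILE" /usr/lib/systemd/system/alertmanager.service
```

The [Install] section is what lets systemctl enable hook the service into the boot sequence.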

Configure alarm information

Before configuration, back up the configuration file of alertmanager

cp /usr/local/alertmanager/alertmanager.yml /usr/local/alertmanager/alertmanager.yml_bak

Then modify alertmanager.yml

$ cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: ''
  smtp_from: ''
  smtp_auth_username: ''
  smtp_auth_password: 'PNRUAELMPDOMTEMP' # This is the mailbox authorization code, not the login password
  smtp_require_tls: false

route:   # route sets the distribution policy of alarms
  group_by: ['alertname']  # Which labels are used as the grouping basis
  group_wait: 30s   # How long to wait after the first alarm of a new group, so that alarms of the same group arriving within 30s are sent together
  group_interval: 10s  # How long to wait before notifying about new alarms added to a group that has already been notified
  repeat_interval: 20m  # How long to wait before re-sending an alarm that is still firing; reduces repeated emails
  receiver: 'default-receiver'  # Default receiver
  routes:   # Child routes: which receivers get which alarms
  - receiver: 'default-receiver'
    continue: true
    group_wait: 10s
  - receiver: 'ding-receiver'
    group_wait: 10s
    match_re:  # Match by label (regular expression): alarms with the label dest=hzjf go to ding-receiver
      dest: hzjf

receivers:
- name: 'default-receiver'
  email_configs:
  - to: ''
- name: "ding-receiver"
  webhook_configs:
  - url: 'http://xx.xx.xx.xx/dingtalk'
    send_resolved: true
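
Before starting the service, it may help to validate the edited file. The Alertmanager release tarball also ships amtool, whose check-config subcommand parses a configuration file and reports errors. A minimal sketch, assuming amtool sits next to the alertmanager binary under /usr/local/alertmanager; AM_HOME and check_am_config are illustrative names, not part of Alertmanager:

```shell
AM_HOME="${AM_HOME:-/usr/local/alertmanager}"   # assumed install directory

# Validate an Alertmanager config file with amtool.
# Returns amtool's exit status, or 2 if amtool is not available.
check_am_config() {
  config="${1:-$AM_HOME/alertmanager.yml}"
  if [ ! -x "$AM_HOME/amtool" ]; then
    echo "amtool not found in $AM_HOME; cannot check $config" >&2
    return 2
  fi
  "$AM_HOME/amtool" check-config "$config"
}

# Example: check_am_config /usr/local/alertmanager/alertmanager.yml
```

Returning a distinct status when amtool is missing keeps "could not check" separate from "checked and found errors".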

Start Alertmanager

$ chown -R prometheus:prometheus /usr/local/alertmanager
$ systemctl daemon-reload
$ systemctl start alertmanager.service
$ systemctl enable alertmanager.service
$ systemctl status alertmanager.service
$ ss -tnl|grep 9093

Web UI: http://alertmanager_ip:9093
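
Besides opening the web UI, the notification path can be exercised by posting a hand-made alarm to Alertmanager's HTTP API (POST /api/v2/alerts). A minimal sketch, assuming Alertmanager listens on localhost:9093 and curl is installed; the alert name TestAlert, the AM_ADDR variable, and the helper names are made up for illustration. Setting the label dest=hzjf should also match the ding-receiver route configured above.

```shell
AM_ADDR="${AM_ADDR:-localhost:9093}"   # assumed Alertmanager address

# Build a one-alert JSON payload; $1 = alertname, $2 = value of the dest label.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","dest":"%s"},"annotations":{"summary":"manual test alert"}}]' "$1" "$2"
}

# POST the payload to Alertmanager (API v2).
send_test_alert() {
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "$(test_alert_payload "$1" "$2")" \
    "http://$AM_ADDR/api/v2/alerts"
}

# Example: send_test_alert TestAlert hzjf
```

Because the payload carries no endsAt, the test alarm should resolve on its own once resolve_timeout has passed without it being sent again.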

Then there is another pitfall: email alarms are configured, but sending mail fails with the following error:

Jun 05 13:35:21 localhost.localdomain postfix/sendmail[9446]: fatal: parameter inet_interfaces: no local interface found for ::1

To fix it, change the postfix configuration file:

vim /etc/postfix/
# Change
inet_interfaces = localhost
# to
inet_interfaces = all
# That is all

2, Configure Prometheus to communicate with Alertmanager

vim /usr/local/prometheus/prometheus.yml

alerting:
  alertmanagers:  # Configure alertmanager
  - static_configs:
    - targets:
      -  # alertmanager server ip:port

rule_files:      # Alarm rule file
  - 'rules/*.yml'

3, Configure alarm rules

As configured above, the alarm rule files live in the rules directory, so create that directory first:

mkdir -p /usr/local/prometheus/rules

Then create alarm rules. Here we create three alarm rules:

cat > /usr/local/prometheus/rules/node.yml <<"EOF"
groups:
- name: hostStatsAlert
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."
  - alert: hostCpuUsageAlert
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 85
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 85
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
EOF

The fields of each rule are:

  • alert: the name of the alarm rule.
  • expr: the alarm trigger condition, written as a PromQL expression; it is evaluated to check whether any time series meets the condition.
  • for: evaluation wait time, optional. The alarm is sent only after the trigger condition has held for this long; while waiting, a newly generated alarm is in the pending state.
  • labels: custom labels; a set of additional labels attached to the alarm.
  • annotations: a set of additional information, such as text describing the alarm in detail. Annotations are sent to Alertmanager together with the alarm when it fires.
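
To make the two usage expressions concrete, the same arithmetic can be reproduced outside Prometheus. The sketch below uses awk on made-up sample values (an idle fraction of 0.12, and memory figures in MiB); the function names are illustrative only.

```shell
# CPU usage % from the idle fraction (the value avg(irate(...{mode="idle"}[5m])) yields).
cpu_usage_pct() {
  awk -v idle="$1" 'BEGIN { printf "%.1f\n", 100 - idle * 100 }'
}

# Memory usage % from free, cached, buffers and total (all in the same unit).
mem_usage_pct() {
  awk -v f="$1" -v c="$2" -v b="$3" -v t="$4" \
    'BEGIN { printf "%.1f\n", 100 - (f + c + b) / t * 100 }'
}

cpu_usage_pct 0.12               # prints 88.0, above the 85 threshold
mem_usage_pct 512 256 256 8192   # prints 87.5, above the 85 threshold
```

Both sample values exceed 85, so the corresponding rules would enter the pending state and fire if the condition held for the full 1m.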

Check alarm rules

$ /usr/local/prometheus/promtool check rules /usr/local/prometheus/rules/node.yml 
Checking /usr/local/prometheus/rules/node.yml
  SUCCESS: 3 rules found

Restart Prometheus so that the alarm rules take effect.

$ chown -R prometheus:prometheus /usr/local/prometheus/rules
$ systemctl restart prometheus

4, Verification

First, the three configured alarm rules are visible on the Alerts page of the Prometheus web UI.

1. Verify InstanceDown alarm rules

Stop the node_exporter service on a node, and then observe the effect.

$ systemctl stop node_exporter

  • Green indicates normal.
  • A PENDING status (yellow in the Prometheus UI) indicates that the alarm has not yet been sent to Alertmanager, because for: 30s is configured in the rule.
  • After 30s, the status changes from PENDING to FIRING. At this point Prometheus sends the alarm to Alertmanager, where it appears in the web UI.
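
The state transitions described above can be summarized in a few lines. This only illustrates the logic; it is not how Prometheus is implemented, and it ignores the evaluation interval. alert_state is a hypothetical helper that takes how long the expression has been continuously true and the rule's for duration, both in seconds.

```shell
# $1 = seconds the expression has been continuously true (negative = currently false)
# $2 = the rule's "for" duration in seconds
alert_state() {
  if [ "$1" -lt 0 ]; then
    echo inactive   # condition not met
  elif [ "$1" -lt "$2" ]; then
    echo pending    # condition met, still inside the "for" window
  else
    echo firing     # condition held for the whole window; alarm goes to Alertmanager
  fi
}

alert_state -1 30   # inactive
alert_state 10 30   # pending
alert_state 30 30   # firing
```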

Email received:

2. Verify hostCpuUsageAlert alarm rules

We can manually increase the CPU utilization of the system:

cat /dev/zero>/dev/null

After running the command, CPU usage rises rapidly.
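
Note that a single cat process keeps only one core busy, while the rule averages idle time over all cores of an instance, so on a multi-core machine one process may not push usage above 85%. A possible variation, sketched under the assumption that the coreutils commands nproc and timeout are available, starts one busy loop per core and stops them automatically; burn_cpu and its 180-second default are illustrative choices.

```shell
# Keep $2 cores busy (default: all cores) for $1 seconds (default: 180).
burn_cpu() {
  seconds="${1:-180}"
  workers="${2:-$(nproc)}"
  i=0
  while [ "$i" -lt "$workers" ]; do
    # Each worker spins in an empty loop until timeout kills it.
    timeout "$seconds" sh -c 'while :; do :; done' &
    i=$((i + 1))
  done
  wait   # return once every worker has been stopped
}

# Example: burn_cpu 180
```

The built-in time limit avoids leaving stray load generators running after the test.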

Email received:


Posted on Sun, 28 Jun 2020 22:35:57 -0400 by flamtech