Prometheus itself does not send alert notifications; that job is handled by the separate Alertmanager component. Alertmanager receives the alerts fired by the Prometheus server, processes them, and delivers them to the specified users or groups.
The alerting flow in Prometheus looks like this:
Prometheus server -> alerting rule triggered -> condition holds beyond the specified duration -> Alertmanager -> grouping, inhibition, silencing -> notification integration -> email, DingTalk, WeChat, etc.
1, Install and configure Alertmanager
Download and unzip
$ wget https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz
$ tar -zxf alertmanager-0.20.0.linux-amd64.tar.gz -C /usr/local/
$ cd /usr/local/
$ mv alertmanager-0.20.0.linux-amd64 alertmanager
One gotcha here:
Alertmanager stores its data locally, and the default storage path is data/ relative to the working directory. We set it explicitly with the --storage.path flag; otherwise an error like the following is reported:
Jun 01 17:04:56 localhost.localdomain alertmanager[9742]: level=error ts=2020-06-01T09:04:56.626Z caller=main.go:236 msg="Unable to create data directory" err="mkdir data/: permission denied"
Create storage directory
mkdir -p /usr/local/alertmanager/data
We also use the --config.file flag to specify the path of the Alertmanager configuration file.
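Before creating the systemd unit, the two flags can be tried out by running Alertmanager in the foreground with the default alertmanager.yml that ships in the tarball (a quick sanity check; stop it with Ctrl+C when done):
/usr/local/alertmanager/alertmanager \
  --config.file=/usr/local/alertmanager/alertmanager.yml \
  --storage.path=/usr/local/alertmanager/data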
Create the systemd unit file
cat > /usr/lib/systemd/system/alertmanager.service <<EOF
[Unit]
Description=alertmanager
Documentation=https://github.com/prometheus/alertmanager
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
Configure alert notifications
Before configuring, back up the Alertmanager configuration file:
cp /usr/local/alertmanager/alertmanager.yml /usr/local/alertmanager/alertmanager.yml_bak
Then modify alertmanager.yml
$ cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'PNRUAELMPDOMTEMP'   # the mailbox authorization code, not the login password
  smtp_require_tls: false
route:                           # the route block defines how alerts are distributed
  group_by: ['alertname']        # which label to group alerts by
  group_wait: 30s                # how long to wait after the first alert of a group, so that alerts of the same group are sent together
  group_interval: 10s            # interval between two batches of notifications for the same group
  repeat_interval: 20m           # how long before a still-firing alert is re-sent, to reduce duplicate emails
  receiver: 'default-receiver'   # default receiver
  routes:                        # sub-routes decide which groups go to which receivers
  - receiver: 'default-receiver'
    continue: true
    group_wait: 10s
  - receiver: 'ding-receiver'
    group_wait: 10s
    match_re:                    # route by label: ding-receiver matches alerts labeled dest=hzjf
      dest: hzjf
receivers:
- name: 'default-receiver'
  email_configs:
  - to: '[email protected]'
- name: "ding-receiver"
  webhook_configs:
  - url: 'http://xx.xx.xx.xx/dingtalk'
    send_resolved: true
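The release tarball also contains the amtool utility next to the alertmanager binary, so the edited file can be validated before the service is (re)started (a sketch; the path follows the layout used above). It prints a summary of the route tree and receivers it found, or the parse error otherwise:
$ /usr/local/alertmanager/amtool check-config /usr/local/alertmanager/alertmanager.yml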
Start Alertmanager
$ chown -R prometheus:prometheus /usr/local/alertmanager
$ systemctl daemon-reload
$ systemctl start alertmanager.service
$ systemctl enable alertmanager.service
$ systemctl status alertmanager.service
$ ss -tnl | grep 9093
View the web UI at http://alertmanager_ip:9093
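If the UI does not come up, the listener can also be probed from the shell; Alertmanager exposes a simple health endpoint on the same port (assuming it runs on the local host):
$ curl http://127.0.0.1:9093/-/healthy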
Then there is another gotcha: with email alerting configured, sending mail fails with the error below, and the postfix configuration has to be adjusted:
Jun 05 13:35:21 localhost.localdomain postfix/sendmail[9446]: fatal: parameter inet_interfaces: no local interface found for ::1
vim /etc/postfix/main.cf
# change
inet_interfaces = localhost
# to
inet_interfaces = all
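After editing main.cf, restart postfix so the change takes effect (assuming postfix is managed by systemd, as on most current distributions):
$ systemctl restart postfix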
2, Configure Prometheus to communicate with Alertmanager
vim /usr/local/prometheus/prometheus.yml
alerting:
  alertmanagers:              # configure Alertmanager
  - static_configs:
    - targets:
      - 127.0.0.1:9093        # Alertmanager server IP and port

rule_files:                   # alerting rule files
  - 'rules/*.yml'
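promtool, shipped alongside the prometheus binary, can validate the whole main configuration, including the rule_files globs, before Prometheus is restarted later on (a sketch; paths follow the layout used in this article):
$ /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml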
3, Configure alarm rules
As configured above, the alerting rule definition files live under the rules directory, so first create that directory:
mkdir -p /usr/local/prometheus/rules
Then create the alerting rules. Here we define three of them:
cat > /usr/local/prometheus/rules/node.yml <<"EOF"
groups:
- name: hostStatsAlert
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."
  - alert: hostCpuUsageAlert
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 85
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    expr: 100 - (node_memory_MemFree_bytes + node_memory_Cached_bytes + node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 85
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
EOF
- alert: the name of the alerting rule.
- expr: the alert trigger condition, a PromQL expression used to compute whether any time series currently satisfies it (see the ad-hoc query example after this list).
- for: evaluation wait time, optional. The alert is only sent after the trigger condition has held for this duration; while waiting, the newly created alert is in the pending state.
- labels: custom labels that let the user attach an extra set of labels to the alert.
- annotations: an extra set of descriptive fields, e.g. text describing the alert in detail. The contents of annotations are sent along to Alertmanager as parameters when the alert fires.
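Before an expression goes into a rule, it can be evaluated ad hoc against the Prometheus HTTP query API (or in the expression browser); a quick sanity check, assuming Prometheus listens on its default port 9090 on the local host:
$ curl -sG 'http://127.0.0.1:9090/api/v1/query' --data-urlencode 'query=up == 0'
An empty result array means that no time series currently matches the trigger condition.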
Check the alerting rules
$ /usr/local/prometheus/promtool check rules /usr/local/prometheus/rules/node.yml
Checking /usr/local/prometheus/rules/node.yml
  SUCCESS: 3 rules found
Restart Prometheus so the alerting rules take effect.
$ chown -R prometheus:prometheus /usr/local/prometheus/rules
$ systemctl restart prometheus
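As an alternative to a full restart, Prometheus re-reads its configuration and rule files when it receives SIGHUP, which avoids interrupting scraping (a sketch, assuming the process is named prometheus):
$ kill -HUP $(pidof prometheus)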
4, Verification
First, the three configured alerting rules can be seen on the Alerts page of the Prometheus web UI.
Stop the node_exporter service on node 192.168.0.182 and watch what happens.
$ systemctl stop node_exporter
- Green indicates normal.
- A red PENDING status means the alert has not yet been sent to Alertmanager, because for: 30s is configured in the rule.
- After 30s the status changes from PENDING to FIRING. At that point Prometheus sends the alert to Alertmanager, and one alert can be seen on the Alertmanager page.
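Besides the web UI, the alerts currently held by Alertmanager can be listed on the command line with amtool (a sketch; assumes Alertmanager is reachable on the local host):
$ /usr/local/alertmanager/amtool alert query --alertmanager.url=http://127.0.0.1:9093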
Email received:
We can manually increase the CPU utilization of the system:
cat /dev/zero>/dev/null
After running this command, CPU usage rises quickly.
Email received:
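To finish the test, stop the load with Ctrl+C, or bound it up front with timeout so it stops by itself; once CPU usage drops back below the threshold the alert resolves, and a resolved notification goes to receivers that have send_resolved enabled (a sketch):
$ timeout 600 cat /dev/zero > /dev/null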
Reference articles:
https://www.cnblogs.com/xiaobaozi-95/p/10740511.html
https://www.bookstack.cn/read/prometheus-book/alert-prometheus-alert-rule.md