Best practice: monitoring network layer metrics with custom monitoring

How do you monitor TCP/UDP connection status metrics at the network layer of a cloud server?

We recommend the custom monitoring feature of Tencent Cloud's cloud monitoring service.

Custom monitoring is currently in internal testing and is free to use. No review is required, and it is ready to use as soon as the service is activated. You are welcome to join the internal test through the application page.

This article describes how to use shell commands and the SDK to report key network layer metrics to custom monitoring, and how to view the metrics and configure alarms there.

Background

Regularly monitor the key network layer metrics on the CVM instance, and send an SMS alarm when a metric meets the alarm condition you set.

Prerequisites

  • You have purchased a Tencent Cloud CVM (cloud server) instance.
  • Python 2.7 or later and the pip tool are installed on the CVM instance.

Data reporting

Step 1: prepare the reporting environment

1. Execute the following command to install the Python SDK.

pip install tencentcloud-sdk-python

2. Create the configuration file ~/.ServerMonitor.json on the CVM instance with the following contents:

{
"SecretId": "xxxxx",
"SecretKey": "xxxx",
"Region": "ap-guangzhou"
}

Note:
Region: the region to report to. See the region list for the available values.

3. Run the following shell command so that only the file owner (the current user) can read and write the configuration file.

chmod 0600 ~/.ServerMonitor.json
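
Steps 2 and 3 can also be combined in a short script. The following is a minimal sketch, not part of the original procedure: `write_conf` and `conf_mode` are hypothetical helper names, and the demo writes to a temporary directory with placeholder values rather than to your real home directory.

```python
import json
import os
import stat
import tempfile


def write_conf(path, secret_id, secret_key, region):
    """Write the config file so that only the owner can read and write it."""
    conf = {"SecretId": secret_id, "SecretKey": secret_key, "Region": region}
    # Creating the file with mode 0o600 avoids a window in which it is
    # readable by other users (umask can only remove bits, never add them).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(conf, f, indent=4)
    # An existing file keeps its old mode, so enforce 0600 explicitly.
    os.chmod(path, 0o600)
    return path


def conf_mode(path):
    """Return the permission bits of the file, for example 0o600."""
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Placeholder values and a throwaway location, for illustration only.
    demo_path = os.path.join(tempfile.mkdtemp(), ".ServerMonitor.json")
    write_conf(demo_path, "xxxxx", "xxxx", "ap-guangzhou")
    print(oct(conf_mode(demo_path)))
```

To write the real file, pass `os.path.expanduser("~/.ServerMonitor.json")` as the path together with your own key and region.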

Step 2: collect and report data

1. Create a file named ServerMonitor.py with the following contents. It collects and reports the data. For a description of each network layer metric, see the metric table at the end of this article.

#!/usr/bin/env python
#
# A simple server monitor demo that uses the Tencent Cloud PutMonitorData API
import json
import os
import re
import socket
import sys
import time

from tencentcloud.common import credential
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException
from tencentcloud.monitor.v20180724 import monitor_client, models

GLOBAL_CONF = None


def load_conf():
    """Load SecretId, SecretKey and Region from ~/.ServerMonitor.json."""
    conf_path = os.path.expanduser("~/.ServerMonitor.json")
    if not os.path.exists(conf_path):
        print("config file %s not found!" % conf_path)
        sys.exit(1)
    config_error_msg = """load config error, sample format:
{
"SecretId": "xxxxxxx",
"SecretKey": "xxxxxxx",
"Region": "ap-guangzhou"
}
"""
    try:
        with open(conf_path) as f:
            conf = json.load(f)
        if not isinstance(conf, dict):
            raise ValueError("config file format error")
    except (IOError, OSError, ValueError):
        # unreadable file or invalid JSON
        print(config_error_msg)
        sys.exit(1)
    if not conf.get("SecretId") or not conf.get("SecretKey") or not conf.get("Region"):
        print(config_error_msg)
        sys.exit(1)
    return conf


def get_lan_ip():
    """
    get the LAN IP using a fake UDP connection
    this does not really 'connect' to any server
    """
    # can be any routable address
    fake_dest = ("10.10.10.10", 53)
    lan_ip = ""
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.connect(fake_dest)
        lan_ip = s.getsockname()[0]
        s.close()
    except Exception:
        # fall back to an empty string if the address cannot be determined
        pass
    return lan_ip


class MonitorBase(object):

    def __init__(self, sleep_time):
        self.sleep_time = sleep_time
        self.result1 = None

    def get_metrics(self):
        """
        collect metrics from system
        return metrics as dict: { "key1":v1, "key2": v2 }
        """
        return {}

    def process(self):
        """
        call get_metrics twice, sleep_time seconds apart, and return the difference
        return metrics as dict: { "key1":v1, "key2": v2 }
        """
        result = self.get_metrics()
        # keep the first sample so subclasses can read instantaneous values
        self.result1 = result
        if self.sleep_time == 0:
            return result

        time.sleep(self.sleep_time)
        result2 = self.get_metrics()
        metrics = {}
        for key in result2.keys():
            metrics[key] = int(result2[key]) - int(result.get(key, 0))
            # work around counter wraparound (assumes 32-bit counters)
            if metrics[key] < 0:
                metrics[key] += 4294967296
        return metrics

    def report(self):
        """
        report metrics to cloud api
        :return:
        """
        metrics = self.process()
        try:
            cred = credential.Credential(GLOBAL_CONF["SecretId"], GLOBAL_CONF["SecretKey"])
            http_profile = HttpProfile()
            http_profile.endpoint = "monitor.tencentcloudapi.com"

            client_profile = ClientProfile()
            client_profile.httpProfile = http_profile
            client = monitor_client.MonitorClient(cred, GLOBAL_CONF["Region"], client_profile)

            req = models.PutMonitorDataRequest()
            from pprint import pprint
            # limit metrics to report
            metrics_allowed = ["TcpActiveOpens", "TcpPassiveOpens", "TcpAttemptFails", "TcpEstabResets",
                               "TcpRetransSegs", "TcpExtListenOverflows", "UdpInDatagrams", "UdpOutDatagrams",
                               "UdpInErrors", "UdpNoPorts", "UdpSndbufErrors"]
            report_data = {"Metrics": [], "AnnounceInstance": get_lan_ip()}
            for k, v in metrics.items():
                if k in metrics_allowed:
                    report_data["Metrics"].append({"MetricName": k, "Value": v})
            req.from_json_string(json.dumps(report_data))
            pprint(report_data)
            resp = client.PutMonitorData(req)
            print(resp.to_json_string())
        except TencentCloudSDKException as err:
            print(err)


class NetMonitor(MonitorBase):
    """
    parse /proc/net/snmp & /proc/net/netstat
    """

    def get_metrics(self):
        snmp_dict = {}
        with open("/proc/net/snmp") as snmp_file:
            snmp_lines = snmp_file.readlines()
        with open("/proc/net/netstat") as netstat_file:
            snmp_lines.extend(netstat_file.readlines())

        sep = re.compile(r'[:\s]+')
        keys = []
        # each protocol appears as a header line followed by a value line
        for n, line in enumerate(snmp_lines, 1):
            fields = sep.split(line.strip())
            proto = fields.pop(0)
            if n % 2 == 1:
                # header line
                keys = fields
            else:
                # value line
                try:
                    values = [int(field) for field in fields]
                except ValueError as e:
                    print(e)
                    # skip this line instead of reusing stale values
                    continue
                kv = dict(zip(keys, values))
                proto_dict = snmp_dict.setdefault(proto, {})
                proto_dict.update(kv)
        return snmp_dict


class NetSnmpIpTcpUdp(NetMonitor):
    """
    Get ip/tcp/udp information from /proc/net/snmp
    """

    def get_metrics(self):
        snmp_dict = super(NetSnmpIpTcpUdp, self).get_metrics()
        metrics = {}
        for proto in ("Tcp", "Ip", "Udp", "Icmp", "TcpExt"):
            if proto not in snmp_dict:
                continue
            for k, v in snmp_dict[proto].items():
                # prefix the key with its protocol, e.g. Tcp + AttemptFails
                k = proto + k
                metrics[k] = v
        return metrics

    def process(self):
        report_dict = super(NetSnmpIpTcpUdp, self).process()
        # CurrEstab is an instantaneous value, not a cumulative counter,
        # so use the sampled value instead of the difference
        if self.result1 is not None:
            report_dict['TcpCurrEstab'] = self.result1['TcpCurrEstab']
        return report_dict


if __name__ == "__main__":
    GLOBAL_CONF = load_conf()
    # monitor class -> sampling interval in seconds
    process_dict = {
        NetSnmpIpTcpUdp: 60,
    }
    children = []
    for key in process_dict.keys():
        try:
            pid = os.fork()
        except OSError:
            sys.exit("Unable to create child process!")
        if pid == 0:
            # child process: sample, report, then exit
            monitor = key(process_dict[key])
            monitor.report()
            sys.exit(0)
        else:
            children.append(pid)

    # parent process: wait for every child to finish
    for pid in children:
        os.waitpid(pid, 0)

Note:
The SecretId, SecretKey and Region values that the code reads from the configuration file must be filled in according to your own account.
1. Region: the region to report to. See the region list for the available values.
2. SecretId and SecretKey: obtain them from the API key management page.

2. Put the ServerMonitor.py file in the /usr/local/bin directory.
3. Add ServerMonitor.py to crontab as a scheduled task, so that the network layer metrics are reported automatically every minute.

chmod a+x /usr/local/bin/ServerMonitor.py
crontab -l > /tmp/cron.bak
echo "* * * * * /usr/local/bin/ServerMonitor.py > /tmp/ServerMonitor.log 2>&1" >> /tmp/cron.bak
crontab /tmp/cron.bak
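
To check what the script collects before relying on the scheduled task, it helps to look at the parsing step in isolation. The NetMonitor class depends on /proc/net/snmp listing each protocol as a header line followed by a value line. The sketch below applies the same pairing to a shortened sample; the sample text and its numbers are made up, and `parse_snmp_text` and `flatten` are illustrative names that do not appear in ServerMonitor.py.

```python
import re

# Shortened, made-up sample in the same layout as /proc/net/snmp:
# one header line and one value line per protocol.
SAMPLE = """\
Tcp: ActiveOpens PassiveOpens AttemptFails EstabResets CurrEstab
Tcp: 120 45 3 7 12
Udp: InDatagrams NoPorts InErrors OutDatagrams
Udp: 900 2 0 870
"""


def parse_snmp_text(text):
    """Return {proto: {key: int}} built from header/value line pairs."""
    sep = re.compile(r"[:\s]+")
    result = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Walk the lines two at a time: (header line, value line).
    for header, values in zip(lines[0::2], lines[1::2]):
        keys = sep.split(header.strip())
        vals = sep.split(values.strip())
        proto = keys.pop(0)
        if vals.pop(0) != proto or len(keys) != len(vals):
            continue  # malformed pair, skip it
        result.setdefault(proto, {}).update(zip(keys, (int(v) for v in vals)))
    return result


def flatten(parsed):
    """Prefix each key with its protocol, for example Tcp + AttemptFails."""
    return {proto + k: v for proto, kv in parsed.items() for k, v in kv.items()}


if __name__ == "__main__":
    print(flatten(parse_snmp_text(SAMPLE))["TcpAttemptFails"])
```

On a Linux host, passing the contents of /proc/net/snmp to `parse_snmp_text` should give the cumulative counters that the script samples twice per run.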

Data query

After the data is reported, you can see it on the Metric view page.
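
Each data point shown there comes from the request body that MonitorBase.report assembles. The following sketch rebuilds that body without calling the API, so you can compare it with the output the script writes to its log. `build_report_data` is an illustrative name, the IP address and values are made up, and the sorting is added here only to make the output order predictable.

```python
import json

# Metric names the script is allowed to report (copied from ServerMonitor.py).
METRICS_ALLOWED = [
    "TcpActiveOpens", "TcpPassiveOpens", "TcpAttemptFails", "TcpEstabResets",
    "TcpRetransSegs", "TcpExtListenOverflows", "UdpInDatagrams", "UdpOutDatagrams",
    "UdpInErrors", "UdpNoPorts", "UdpSndbufErrors",
]


def build_report_data(metrics, announce_instance):
    """Keep only the allowed metrics and wrap them in the request body layout."""
    return {
        "AnnounceInstance": announce_instance,
        "Metrics": [
            {"MetricName": name, "Value": value}
            for name, value in sorted(metrics.items())
            if name in METRICS_ALLOWED
        ],
    }


if __name__ == "__main__":
    # TcpCurrEstab is collected by the script but is not in the allowed list,
    # so it is dropped from the request body.
    sample = {"TcpAttemptFails": 1, "TcpCurrEstab": 12, "UdpNoPorts": 0}
    print(json.dumps(build_report_data(sample, "10.0.0.8"), indent=2))
```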

Note:
1. The alarm configuration and alarm receipt described below cover only one scenario.
2. To configure alarms for the other network layer metrics that are reported, repeat steps 2-3 of the alarm configuration below.

Configuring an alarm

Scenario: regularly monitor the number of TCP connection failures at the network layer, and send an SMS alarm when the number of TCP connection failures is greater than 0.

1. Confirm that the user's message channel has been verified. You can check the verification status on the CAM verification page.

2. Go to the custom monitoring Metric view page and select [·] > [Configure alarm] in the upper right corner of the metric view.

3. Configure the alarm rule according to the scenario. For more detailed configuration steps, see Configure alarm strategy.
In this example, an SMS alarm is sent when the number of TCP connection failures is greater than 0 for one statistical period (1 minute), and the alarm is repeated once every 5 minutes.

Receiving the alarm

If the number of TCP connection failures is greater than 0, an SMS alarm is received 5 minutes later, with content like the following:

[Tencent Cloud] Cloud monitoring custom monitoring metric alarm triggered
Account ID: 34xxxxxxx, nickname: Custom monitoring
Alarm details
Alarm content: Metric view | Number of TCP connection failures is greater than 0
Alarm object: TcpAttemptFails
Current data: 1
APPID: 125xxxxxxx
Alarm policy: View alarm
Trigger time: 22:36:00 on December 9, 2019 (UTC+08:00)

Metric descriptions

Metric description                              Metric name              Unit
TCP active connections                          TcpActiveOpens           times
TCP passive connections                         TcpPassiveOpens          times
TCP connection failures                         TcpAttemptFails          times
TCP connections reset abnormally                TcpEstabResets           times
TCP segments retransmitted                      TcpRetransSegs           count
TCP listen queue overflows                      TcpExtListenOverflows    times
UDP datagrams received                          UdpInDatagrams           count
UDP datagrams sent                              UdpOutDatagrams          count
UDP datagrams received with errors              UdpInErrors              count
UDP datagrams received for unreachable ports    UdpNoPorts               count
UDP send buffer full                            UdpSndbufErrors          times
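
The values reported for these metrics are not the raw kernel counters. They are the increase over one sampling interval, as computed in MonitorBase.process. Below is a minimal sketch of that calculation, including the wraparound correction used by the script. `counter_delta` and `interval_metrics` are illustrative names, the sample numbers are made up, and the 32-bit counter width is the assumption the script itself makes.

```python
COUNTER_MODULUS = 4294967296  # 2**32, the wrap value assumed by the script


def counter_delta(before, after, modulus=COUNTER_MODULUS):
    """Increase of a cumulative counter between two samples."""
    delta = int(after) - int(before)
    if delta < 0:
        # The counter wrapped past its maximum between the two samples.
        delta += modulus
    return delta


def interval_metrics(sample1, sample2):
    """Per-key deltas; keys missing from the first sample count from 0."""
    return {k: counter_delta(sample1.get(k, 0), v) for k, v in sample2.items()}


if __name__ == "__main__":
    first = {"TcpAttemptFails": 3, "UdpInDatagrams": 4294967290}
    second = {"TcpAttemptFails": 4, "UdpInDatagrams": 10}
    print(interval_metrics(first, second))
```

With the script's 60-second interval, a TcpAttemptFails value of 1 therefore means one failed connection attempt during that minute, which is what the alarm rule "greater than 0" is evaluated against.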

Tags: network JSON socket Python

Posted on Fri, 14 Feb 2020 04:55:33 -0500 by Anant