Using Grafana and Arthas to automatically capture thread stacks of abnormal Java processes

Preface

Recently we saw timeout exceptions during peak business hours, caused by CPU saturation. Monitoring showed that some Pods on a Node were suddenly consuming a large amount of CPU.

Q: Is there no CPU limit? Wouldn't limiting CPU usage solve this?
A: Not fundamentally. The container engine is Docker, and Docker relies on cgroups, whose isolation is a known weak point. When the problem occurs, cgroups cannot clamp the abnormal process's CPU usage immediately, so a short spike still gets through. In practice, a container limited to 2 cores may briefly use 4, 5, 6 or more cores during such a burst; the container is then killed, and K8S tries to recreate it.

So how do we solve it?

1. Use a container engine with better isolation, such as kata (VM-level isolation).
2. Optimize the application.

Option 1

Option 1 is the more thorough fix and only has to be applied once, cluster-wide. However, the technology is relatively new and we do not know what other problems it might introduce. We plan to set aside some nodes later to try kata containers.

Option 2

Option 2 places higher demands on application developers, who have to investigate and fix each case themselves. Its short-term payoff is high, though, so we rolled it out first.

How to implement?

Unless a program has a very serious bug, its CPU peaks are usually short. By the time a person reacts, it is too late to capture anything by hand, and doing so is exhausting. We want a program that automatically captures the thread stacks when CPU usage crosses a threshold, so the cause can be analyzed and optimized afterwards. It should run at most once within a given period, so that repeated captures do not make the application unavailable.

This is very close to what the alerting mechanisms of Grafana and Prometheus already provide. All we need to do is receive the alert webhook and capture the thread stacks in the corresponding container.
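To make "receive the alert webhook" concrete, here is a minimal sketch that decodes the body Grafana's legacy alerting sends to a webhook channel (the ruleName, state and evalMatches fields) and picks out the containers to inspect. The Target type and the label names node, namespace, pod and container are assumptions based on the labels the query groups by; the project may extract them differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// evalMatch mirrors one entry of "evalMatches" in the webhook body:
// the series name, its value and its labels.
type evalMatch struct {
	Metric string            `json:"metric"`
	Value  float64           `json:"value"`
	Tags   map[string]string `json:"tags"`
}

// alertPayload holds the subset of the webhook body used here.
type alertPayload struct {
	RuleName    string      `json:"ruleName"`
	State       string      `json:"state"`
	EvalMatches []evalMatch `json:"evalMatches"`
}

// Target identifies the container whose thread stacks should be captured.
// Hypothetical type, for illustration only.
type Target struct {
	Node, Namespace, Pod, Container string
}

// parseTargets returns one Target per matching series while the alert is
// firing. The label names are assumptions about the configured query.
func parseTargets(body []byte) ([]Target, error) {
	var p alertPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, fmt.Errorf("decode webhook body: %w", err)
	}
	if p.State != "alerting" {
		return nil, nil // states such as "ok" or "no_data" need no capture
	}
	targets := make([]Target, 0, len(p.EvalMatches))
	for _, m := range p.EvalMatches {
		targets = append(targets, Target{
			Node:      m.Tags["node"],
			Namespace: m.Tags["namespace"],
			Pod:       m.Tags["pod"],
			Container: m.Tags["container"],
		})
	}
	return targets, nil
}

func main() {
	// Sample body with made-up values.
	body := []byte(`{"ruleName":"Pod high CPU Stack grab","state":"alerting",
		"evalMatches":[{"metric":"demo","value":1.7,
		"tags":{"node":"node-1","namespace":"default","pod":"demo-0","container":"app"}}]}`)
	targets, err := parseTargets(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range targets {
		fmt.Printf("capture %s/%s (container %s) on %s\n", t.Namespace, t.Pod, t.Container, t.Node)
	}
}
```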

So we used Grafana and wrote a program to do exactly this.

Project information

Development language: Go, Shell
Project address: https://github.com/majian159/k8s-java-debug-daemon

k8s-java-debug-daemon

It uses Grafana's alerting mechanism together with Alibaba's Arthas to capture the stacks of threads with high CPU usage.
The overall flow is:

Add a webhook-type alert notification channel to Grafana. The address is the URL of this program (the default hook path is /hooks).
Configure a Grafana chart and set the alert threshold.
When the webhook fires, the program automatically copies the crawl.sh script into the corresponding Pod container and executes it.
The program saves the script's stdout to a local file.
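The last two steps can be sketched with the kubectl CLI. This is an assumption made for the example: the project itself may talk to the Kubernetes API directly. The remote path /tmp/crawl.sh, the use of sh, and the output file naming are hypothetical, and kubectl cp additionally requires tar to be present in the container image.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// copyArgs builds the arguments for "kubectl cp" that copy a local script
// into a container.
func copyArgs(ns, pod, container, script, remote string) []string {
	return []string{"cp", script, ns + "/" + pod + ":" + remote, "-c", container}
}

// execArgs builds the arguments for "kubectl exec" that run the copied
// script inside the container.
func execArgs(ns, pod, container, remote string) []string {
	return []string{"exec", "-n", ns, pod, "-c", container, "--", "sh", remote}
}

// capture copies crawl.sh into the container, runs it, and writes its
// stdout to dir/<ns>_<pod>_<container>.log. It returns the file path.
func capture(ns, pod, container, dir string) (string, error) {
	const script, remote = "crawl.sh", "/tmp/crawl.sh" // assumed paths
	cp := exec.Command("kubectl", copyArgs(ns, pod, container, script, remote)...)
	if out, err := cp.CombinedOutput(); err != nil {
		return "", fmt.Errorf("kubectl cp: %v: %s", err, out)
	}
	out, err := exec.Command("kubectl", execArgs(ns, pod, container, remote)...).Output()
	if err != nil {
		return "", fmt.Errorf("kubectl exec: %w", err)
	}
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	path := filepath.Join(dir, strings.Join([]string{ns, pod, container}, "_")+".log")
	return path, os.WriteFile(path, out, 0o644)
}

func main() {
	// Print the commands instead of running them, so the sketch works
	// without a cluster. Calling capture(...) would execute them.
	fmt.Println("kubectl", strings.Join(copyArgs("default", "demo-0", "app", "crawl.sh", "/tmp/crawl.sh"), " "))
	fmt.Println("kubectl", strings.Join(execArgs("default", "demo-0", "app", "/tmp/crawl.sh"), " "))
}
```

Passing the arguments as a slice rather than a shell string avoids quoting problems when pod or namespace names come straight from an alert payload.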

Effect preview

Default behavior

At most 10 operations run concurrently per node
You can change this in ./internal/defaultvalue.go

var defaultNodeLockManager = nodelock.NewLockManager(10)
The in-cluster Kubernetes configuration is used by default
You can change this in ./internal/defaultvalue.go

func DefaultKubernetesClient(){}
// default
func getConfigByInCluster(){}
func getConfigByOutOfCluster(){}
By default, stack storage backed by local files is used. Stacks are written to the stacks directory under the working directory
You can change this in ./internal/defaultvalue.go

func GetDefaultNodeLockManager(){}
By default, the stacks of the 50 busiest threads are captured (can be changed in crawl.sh)
The sampling interval is 2 seconds (can be changed in crawl.sh)
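The per-node limit of 10 concurrent operations can be pictured as one counting semaphore per node. The sketch below is not the project's nodelock package, whose API beyond NewLockManager(10) is not shown here; NodeLimiter and its methods are hypothetical names used only to illustrate the idea.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeLimiter caps the number of concurrent operations per node.
// Each node gets a buffered channel that acts as a counting semaphore.
type NodeLimiter struct {
	mu    sync.Mutex
	max   int
	slots map[string]chan struct{}
}

// NewNodeLimiter returns a limiter that allows max operations per node.
func NewNodeLimiter(max int) *NodeLimiter {
	return &NodeLimiter{max: max, slots: make(map[string]chan struct{})}
}

// TryAcquire reserves a slot on node without blocking. It returns false
// when the node is already running max operations.
func (l *NodeLimiter) TryAcquire(node string) bool {
	l.mu.Lock()
	ch, ok := l.slots[node]
	if !ok {
		ch = make(chan struct{}, l.max)
		l.slots[node] = ch
	}
	l.mu.Unlock()
	select {
	case ch <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot previously reserved with TryAcquire. Releasing a
// node that holds no slots is a no-op.
func (l *NodeLimiter) Release(node string) {
	l.mu.Lock()
	ch := l.slots[node]
	l.mu.Unlock()
	select {
	case <-ch:
	default:
	}
}

func main() {
	limiter := NewNodeLimiter(2) // small limit to keep the demo short
	fmt.Println(limiter.TryAcquire("node-1")) // slot 1 of 2
	fmt.Println(limiter.TryAcquire("node-1")) // slot 2 of 2
	fmt.Println(limiter.TryAcquire("node-1")) // node is full
	fmt.Println(limiter.TryAcquire("node-2")) // other nodes are unaffected
	limiter.Release("node-1")
	fmt.Println(limiter.TryAcquire("node-1")) // a slot is free again
}
```

A non-blocking TryAcquire fits the webhook use case: when a node is already busy capturing stacks, an extra alert can simply be dropped instead of queueing more load onto that node.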

How to use

Docker Image

majian159/java-debug-daemon

Create a new notification channel for Grafana

Notes

Enable Send reminders. Otherwise, once an alert has fired, Grafana will not send it again for as long as the alert remains unresolved.
Send reminder every controls how often the alert is re-sent while it stays active.

Create a new alarm chart for Grafana

If you prefer, you can import the following panel configuration directly and adjust it to your needs.

{
  "datasource": "prometheus",
  "alert": {
    "alertRuleTags": {},
    "conditions": [
      {
        "evaluator": { "params": [ 1 ], "type": "gt" },
        "operator": { "type": "and" },
        "query": { "params": [ "A", "5m", "now" ] },
        "reducer": { "params": [], "type": "last" },
        "type": "query"
      }
    ],
    "executionErrorState": "keep_state",
    "for": "10s",
    "frequency": "30s",
    "handler": 1,
    "name": "Pod high CPU Stack grab",
    "noDataState": "no_data",
    "notifications": [ { "uid": "AGOJRCqWz" } ]
  },
  "aliasColors": {},
  "bars": false,
  "dashLength": 10,
  "dashes": false,
  "fill": 1,
  "fillGradient": 0,
  "gridPos": { "h": 9, "w": 24, "x": 0, "y": 2 },
  "hiddenSeries": false,
  "id": 14,
  "legend": { "alignAsTable": true, "avg": true, "current": true, "max": true, "min": false, "rightSide": true, "show": true, "total": false, "values": true },
  "lines": true,
  "linewidth": 1,
  "nullPointMode": "null",
  "options": { "dataLinks": [] },
  "percentage": false,
  "pointradius": 2,
  "points": false,
  "renderer": "flot",
  "seriesOverrides": [],
  "spaceLength": 10,
  "stack": false,
  "steppedLine": false,
  "targets": [
    {
      "expr": "container_memory_working_set_bytes* on (namespace, pod) group_left(node) max by(namespace, pod, node, container) (kube_pod_info)",
      "legendFormat": "{} - {} - {} - {}",
      "refId": "A"
    }
  ],
  "thresholds": [ { "colorMode": "critical", "fill": true, "line": true, "op": "gt", "value": 1 } ],
  "timeFrom": null,
  "timeRegions": [],
  "timeShift": null,
  "title": "Pod CPU",
  "tooltip": { "shared": true, "sort": 0, "value_type": "individual" },
  "type": "graph",
  "xaxis": { "buckets": null, "mode": "time", "name": null, "show": true, "values": [] },
  "yaxes": [
    { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true },
    { "format": "short", "label": null, "logBase": 1, "max": null, "min": null, "show": true }
  ],
  "yaxis": { "align": false, "alignLevel": null }
}

Queries configuration

In Metrics, enter:

container_memory_working_set_bytes * on (namespace, pod) group_left(node) max by(namespace, pod, node, container) (kube_pod_info)

In Legend, enter:

{} - {} - {} - {}

The configuration is as follows:

Alert configuration

IS ABOVE
The CPU usage threshold. With a value of 1, the alert fires when CPU usage exceeds 1 core. Adjust it to your needs.
Evaluate every
How often the rule is evaluated.
For
The pending time: how long the condition must hold before the alert fires.

The configuration should be as follows:

Build

Binary

# Build for the current platform
make
# Build for a specific target OS (GOOS: linux, darwin, windows, freebsd)
make GOOS=linux