SLS machine learning best practice: time series anomaly detection

On the SLS platform, machine learning functions can detect anomalies in time series, which makes routine inspection and analysis more efficient. The full list of functions is documented at: https://help.aliyun.com/document_detail/93210.html

By combining these functions, we can build charts for routine inspection. Below, we break down step by step how to get the corresponding results:

  • The complete inspection query, and the most complex one in this post, is as follows:
* |
SELECT res.name AS INSTANCE
FROM
  (SELECT ts_anomaly_filter(INSTANCE, ts, ds, preds, probs, cast(5 AS bigint), cast(1 AS bigint)) AS res
   FROM
     (SELECT INSTANCE,
             res[1] AS ts,
             res[2] AS ds,
             res[3] AS preds,
             res[4] AS uppers,
             res[5] AS lowers,
             res[6] AS probs
      FROM
        (SELECT INSTANCE,
                array_transpose(ts_predicate_arma(TIME, value, 5, 1, 1, 1, 1, TRUE)) AS res
         FROM
           (SELECT (TIME/1000) AS TIME,
                   labels['instance'] AS INSTANCE,
                   value
            FROM
              (SELECT promql_query_range('1 - avg(irate(node_cpu_seconds_total{instance=~".*",mode="idle"}[10m])) by (instance) ', '10m') AS t
               FROM metrics)
            ORDER BY TIME ASC)
         GROUP BY INSTANCE)))

Let's break this SQL down to see how the result is built step by step.

  • First, we fetch the series to be inspected:
* |
SELECT (TIME/1000) AS TIME,
       labels['instance'] AS INSTANCE,
       value
FROM
  (SELECT promql_query_range('1 - avg(irate(node_cpu_seconds_total{instance=~".*",mode="idle"}[10m])) by (instance) ', '10m') AS t
   FROM metrics)

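Here, PromQL retrieves CPU utilization from SLS for each of the N monitored instances, sampled at a 10-minute step. Note that the expression is 1 minus the average idle rate, so despite the mode="idle" selector the value is utilization, not the idle ratio. TIME is divided by 1000, presumably to convert milliseconds to the seconds that the detection functions expect. To inspect the data visually, we can plot the series in a flow chart.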
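
To make the PromQL expression concrete, here is a minimal Python sketch of what `1 - avg(irate(...idle...))` computes for one instance. The sample data and the helper names (`irate`, `cpu_utilization`) are made up for illustration, and the rate is simplified to the last two samples of each counter; this is not how SLS or Prometheus implement it internally.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp_s, counter) samples."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)

def cpu_utilization(idle_samples_by_cpu):
    """1 - avg(idle rate) across the CPU cores of one instance."""
    rates = [irate(s) for s in idle_samples_by_cpu.values()]
    return 1 - sum(rates) / len(rates)

# Hypothetical idle-seconds counters for two cores of one instance.
idle = {
    "cpu0": [(0, 100.0), (15, 112.0)],  # idle 12 s out of 15 s
    "cpu1": [(0, 200.0), (15, 209.0)],  # idle 9 s out of 15 s
}
utilization = cpu_utilization(idle)
print(round(utilization, 3))  # 0.3
```

The cores are idle 80% and 60% of the time, so the instance is on average 70% idle and 30% busy.
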

  • Next, we need to detect anomalies in the N series. SLS provides anomaly detection functions that support GROUP BY, so all instances can be inspected in one query:
* |
SELECT INSTANCE,
       ts_predicate_arma(TIME, value, 5, 1, 1, 1.0, 1.0, TRUE)
FROM
  (SELECT (TIME/1000) AS TIME,
          labels['instance'] AS INSTANCE,
          value
   FROM
     (SELECT promql_query_range('1 - avg(irate(node_cpu_seconds_total{instance=~".*",mode="idle"}[10m])) by (instance) ', '10m') AS t
      FROM metrics))
GROUP BY INSTANCE
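
Conceptually, the GROUP BY form runs the detector once per instance. The Python sketch below shows that pattern only. The detector is a deliberately simple stand-in (a mean plus or minus k standard deviations band), not the ARMA model behind ts_predicate_arma, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def detect(points, k=2.0):
    """Stand-in detector (NOT ARMA): flag values outside mean +/- k * std.
    Returns one row per point: (ts, value, pred, upper, lower, flag)."""
    values = [v for _, v in points]
    mu, sigma = mean(values), pstdev(values)
    upper, lower = mu + k * sigma, mu - k * sigma
    return [(ts, v, mu, upper, lower, 1.0 if (v > upper or v < lower) else 0.0)
            for ts, v in points]

def detect_per_instance(rows, k=2.0):
    """Group (instance, ts, value) rows by instance, then detect per group."""
    groups = defaultdict(list)
    for instance, ts, value in rows:
        groups[instance].append((ts, value))
    return {inst: detect(sorted(pts), k) for inst, pts in groups.items()}

# Hypothetical input: instance "a" is flat, instance "b" ends with a spike.
rows = [("a", t, 0.2) for t in range(0, 6000, 600)]
rows += [("b", t, 0.3) for t in range(0, 5400, 600)]
rows += [("b", 5400, 0.95)]

results = detect_per_instance(rows)
for instance, series in results.items():
    print(instance, [ts for ts, *_rest, flag in series if flag])
```

Each instance maps to a list of per-point rows, which is the same kind of nested result the SQL query returns in its second column.
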

With this SQL, all N series are checked in one pass. In the result table, the first column is the instance and the second column holds the detection result for that instance's series. But how do we work with such a complex nested result?

SLS provides helper functions for unpacking and transforming the output of ts_predicate_arma. First, we transpose the array in the result with array_transpose:

* |
SELECT INSTANCE,
       array_transpose(ts_predicate_arma(TIME, value, 5, 1, 1, 1.0, 1.0, TRUE)) AS res
FROM
  (SELECT (TIME/1000) AS TIME,
          labels['instance'] AS INSTANCE,
          value
   FROM
     (SELECT promql_query_range('1 - avg(irate(node_cpu_seconds_total{instance=~".*",mode="idle"}[10m])) by (instance) ', '10m') AS t
      FROM metrics))
GROUP BY INSTANCE
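
The effect of the transpose can be pictured with a small Python sketch. It assumes the detector output is one row per point with six fields, which is what the later res[1] to res[6] indexing suggests; the sample values and the helper name `transpose` are made up.

```python
def transpose(rows):
    """Turn a list of equal-length rows into a list of columns."""
    return [list(col) for col in zip(*rows)]

# Hypothetical detector output, one row per point:
# (ts, value, pred, upper, lower, prob)
rows = [
    [1590000000, 0.20, 0.21, 0.30, 0.12, 0.0],
    [1590000600, 0.22, 0.21, 0.30, 0.12, 0.0],
    [1590001200, 0.90, 0.22, 0.31, 0.13, 1.0],
]

res = transpose(rows)
# SQL arrays are 1-based, so res[1] in the query corresponds to res[0] here.
ts, ds, preds, uppers, lowers, probs = res
print(probs)  # [0.0, 0.0, 1.0]
```

After the transpose, each field is a single array covering the whole series, which is the shape the filtering step takes as input.
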

array_transpose turns the function's per-point result into one array per field. Indexing into the transposed result then gives named columns for subsequent processing:

* |
SELECT INSTANCE,
       res[1] AS ts,
       res[2] AS ds,
       res[3] AS preds,
       res[4] AS uppers,
       res[5] AS lowers,
       res[6] AS probs
FROM
  (SELECT INSTANCE,
          array_transpose(ts_predicate_arma(TIME, value, 5, 1, 1, 1.0, 1.0, TRUE)) AS res
   FROM
     (SELECT (TIME/1000) AS TIME,
             labels['instance'] AS INSTANCE,
             value
      FROM
        (SELECT promql_query_range('1 - avg(irate(node_cpu_seconds_total{instance=~".*",mode="idle"}[10m])) by (instance) ', '10m') AS t
         FROM metrics))
   GROUP BY INSTANCE)

The query returns one row per instance, with the columns ts, ds, preds, uppers, lowers and probs.

From these results we keep only the anomalies we care about, which is what the ts_anomaly_filter function is for. See the documentation for its parameters: https://help.aliyun.com/document_detail/93210.html
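
As a rough mental model of this filtering step, the Python sketch below keeps a series only if one of its most recent points is anomalous in the requested direction. The parameter meanings are assumptions made for illustration (n_watch as the number of trailing points to inspect; anomaly_type 1 for upward, -1 for downward, 0 for both); the linked documentation is the authority on the actual behaviour of ts_anomaly_filter.

```python
def anomaly_filter(name, ts, ds, preds, probs, n_watch, anomaly_type):
    """Return a summary if any of the last n_watch points is anomalous,
    else None. Assumed semantics, not the SLS implementation."""
    hits = []
    recent = list(zip(ts, ds, preds, probs))[-n_watch:]
    for t, actual, predicted, prob in recent:
        if prob <= 0:
            continue  # point is not flagged as anomalous
        direction = 1 if actual > predicted else -1
        if anomaly_type in (0, direction):
            hits.append((t, actual, predicted, prob))
    return {"name": name, "anomalies": hits} if hits else None

# Hypothetical series: host-1 spikes upward at the end, host-2 is quiet.
spiky = anomaly_filter("host-1", [1, 2, 3], [0.2, 0.2, 0.9],
                       [0.2, 0.2, 0.2], [0.0, 0.0, 1.0], 5, 1)
quiet = anomaly_filter("host-2", [1, 2, 3], [0.2, 0.2, 0.2],
                       [0.2, 0.2, 0.2], [0.0, 0.0, 0.0], 5, 1)

flagged = [r["name"] for r in (spiky, quiet) if r is not None]
print(flagged)  # ['host-1']
```

The returned `name` field mirrors the `res.name` selected by the outermost query, so the final result is the list of instances that need attention.
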
This brings us back to the complete SQL shown at the start. Once the result table is available, further analysis can be wired up through the drill-down (jump) configuration in the SLS console. The specific configuration is as follows:

Configure the DrillDown operation to explore the data visually.

With this configuration, selecting an entry jumps to the corresponding view.

Tags: Operation & Maintenance SQL

Posted on Fri, 22 May 2020 06:37:28 -0400 by j.bouwers