Log Service data processing: Nginx log parsing in practice

Introduction to the data processing service

The data processing service is a log ETL service launched by Alibaba Cloud Log Service. It mainly addresses conversion, filtering, distribution, and enrichment of log data during processing.

The data processing service is integrated into Log Service.
Common scenarios supported by data processing:

Data distribution scenarios:

  1. Data normalization (one to one)
  2. Data dispatch (one to many)

Next, we take Nginx log parsing as an example to give you a quick introduction to data processing in Alibaba Cloud Log Service.

The Nginx log to be parsed

Suppose we collect the default Nginx access log in simple mode. The default Nginx log format is as follows:

log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                  '$status $body_bytes_sent "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for"';

This is how the logs look on the machine.
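For illustration, a single line in this format might look like the following made-up example:

192.168.1.10 - - [10/May/2020:09:41:19 -0400] "GET /api/user?id=123&name=test HTTP/1.1" 200 612 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36" "-"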

When collected by Alibaba Cloud Log Service in simple mode, the whole raw line is stored in a single content field.

Enabling data processing in the console

In the console, click the "data processing" button, enter the DSL statements in the input box, and click "Preview Data" to see the effect of the processing.

Field extraction with regular expressions

Use a regular expression to extract the fields of the Nginx log. Named capture groups in the regular expression set the names of the output fields.

e_regex("Source field name", "Regular or named capture regular", "Target field name or array(Optional)", mode="fill-auto")

A recommended aid for writing and testing regular expressions: https://regex101.com/

The DSL statement actually used:

e_regex("content",'(?<remote_addr>[0-9:\.]*) - (?<remote_user>[a-zA-Z0-9\-_]*) \[(?<local_time>[a-zA-Z0-9\/ :\-]*)\] "(?<request>[^"]*)" (?<status>[0-9]*) (?<body_bytes_sent>[0-9\-]*) "(?<refer>[^"]*)" "(?<http_user_agent>[^"]*)"')

Time field processing

The local time in the default format is not easy to read, so we parse it into a more readable format.

The DSL functions used:

e_set("Field name", "Fixed value or expression function", ..., mode="overwrite")

dt_strftime(datetime expression, "format string ")

dt_strptime('Values such as v("Field name")', "format string ")

The DSL statement actually used:

e_set("local_time", dt_strftime(dt_strptime(v("local_time"),"%d/%b/%Y:%H:%M:%S %z"),"%Y-%m-%d %H:%M:%S"))

Request URI parsing

Next, we want to break down the request field. The request field is composed of the HTTP method, the URI, and the HTTP version.
We can do this with the following functions:

e_regex("Source field name", "Regular or named capture regular", "Target field name or array(Optional)", mode="fill-auto")

# Perform urldecode
url_decoding('Values such as v("Field name")')

# Set field value
e_set("Field name", "Fixed value or expression function", ..., mode="overwrite")

e_kv take request url Medium key value Key value pair extraction 

The DSL statements actually used:

e_regex("request", "(?<request_method>[^\s]*) (?<request_uri>[^\s]*) (?<http_version>[^\s]*)")
e_set("request_uri", url_decoding(v("request_uri")))
e_kv("request_uri")

The result:

HTTP status code mapping

If we want to map HTTP status codes to descriptive text, for example map 404 to Not Found, we can use the e_dict_map function.

e_dict_map("dictionary, such as {'k1': 'v1', 'k2': 'v2'}", "source field regular expression or list", "target field name")

In the DSL below, if no key matches, the value of the '*' key is used as the default.

e_dict_map({'200':'OK',
            '304' : '304 Not Modified',
            '400':'Bad Request',
            '401':'Unauthorized',
            '403':'Forbidden',
            '404':'Not Found',
            '500':'Internal Server Error',
            '*':'unknown'}, "status", "status_desc")
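The same lookup logic in plain Python, to make the fallback behavior explicit (the dictionary contents mirror the DSL example above):

status_map = {
    "200": "OK",
    "304": "304 Not Modified",
    "400": "Bad Request",
    "401": "Unauthorized",
    "403": "Forbidden",
    "404": "Not Found",
    "500": "Internal Server Error",
    "*": "unknown",
}

def map_status(status):
    # Fall back to the '*' entry when the status code is not in the table
    return status_map.get(status, status_map["*"])

print(map_status("404"))   # Not Found
print(map_status("418"))   # unknown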

The result:

Determining the client operating system from the User-Agent

If we want to know which operating system the client is using, we can apply regular-expression matching to the User-Agent information. The DSL functions used:

e_switch("Condition 1 e_match(...)", "Operation 1 is as follows e_regex(...)", "Condition 2", "Operation 2", ..., default="Optional operation without matching")

regex_match('Values such as v("Field name")', r"regular expression ", full=False)

e_set("Field name", "Fixed value or expression function", ..., mode="overwrite")

The DSL statement actually used:

e_switch(regex_match(v("content"), "Mac"), e_set("os", "osx"),
         regex_match(v("content"), "Linux"), e_set("os", "linux"),
         regex_match(v("content"), "Windows"), e_set("os", "windows"),
         default=e_set("os", "unknown")
)
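A plain-Python sketch of the same first-match-wins logic (the strings below are made up; the DSL example above matches against the whole content field):

import re

def detect_os(text):
    # First matching rule wins, mirroring e_switch; otherwise fall back to "unknown"
    rules = [("Mac", "osx"), ("Linux", "linux"), ("Windows", "windows")]
    for pattern, os_name in rules:
        if re.search(pattern, text):
            return os_name
    return "unknown"

print(detect_os("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4)"))   # osx
print(detect_os("curl/7.68.0"))                                       # unknown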

The result:

Delivering 4xx logs to a separate Logstore

You can use the e_output function to deliver logs and regex_match to match fields.

regex_match('value, such as v("field name")', r"regular expression", full=False)

e_output(name=None, project=None, logstore=None, topic=None, source=None, tags=None)

e_if("condition 1, such as e_match(...)", "operation 1, such as e_regex(...)", "condition 2", "operation 2", ...)

The DSL statement actually used:

e_if(regex_match(v("status"), "^4.*"),
     e_output(name="logstore_4xx",
              project="dashboard-demo",
              logstore="dsl-nginx-out-4xx"))

You can see the effect in the preview. (When saving the processing job, you need to configure the AccessKey information of the target project and Logstore.)

Complete DSL code and going online

Complete DSL code

# General field extraction
e_regex("content",'(?<remote_addr>[0-9:\.]*) - (?<remote_user>[a-zA-Z0-9\-_]*) \[(?<local_time>[a-zA-Z0-9\/ :\-]*)\] "(?<request>[^"]*)" (?<status>[0-9]*) (?<body_bytes_sent>[0-9\-]*) "(?<refer>[^"]*)" "(?<http_user_agent>[^"]*)"')

# Set local_time
e_set("local_time", dt_strftime(dt_strptime(v("local_time"),"%d/%b/%Y:%H:%M:%S %z"),"%Y-%m-%d %H:%M:%S"))

# URI field extraction
e_regex("request", "(?<request_method>[^\s]*) (?<request_uri>[^\s]*) (?<http_version>[^\s]*)")
e_set("request_uri", url_decoding(v("request_uri")))
e_kv("request_uri")

# HTTP status code mapping
e_dict_map({'200':'OK',
            '304':'304 Not Modified',
            '400':'Bad Request',
            '401':'Unauthorized',
            '403':'Forbidden',
            '404':'Not Found',
            '500':'Internal Server Error',
            '*':'unknown'}, "status", "status_desc")

# Determine the OS from the User-Agent information
e_switch(regex_match(v("content"), "Mac"), e_set("os", "osx"),
         regex_match(v("content"), "Linux"), e_set("os", "linux"),
         regex_match(v("content"), "Windows"), e_set("os", "windows"),
         default=e_set("os", "unknown")
)

# Deliver 4xx logs to a separate Logstore
e_if(regex_match(v("status"), "^4.*"),
     e_output(name="logstore_4xx", project="dashboard-demo", logstore="dsl-nginx-out-4xx"))

After the code is finalized, save the data processing job on the page.

Configure the target Logstore information. If e_output is used, you need to set the additional storage target name, project, and Logstore to match the values in the code.

After saving, the job goes online. You can find the task under Data Processing > Processing; click it to see the processing delay and other information.
If you need to make changes, you can also modify the job through the same entry.

Reference resources

  1. Alibaba Cloud Log Service - Introduction to data processing: https://help.aliyun.com/document_detail/125384.html
  2. Alibaba Cloud Log Service - Overview of data processing functions: https://help.aliyun.com/document_detail/159702.html
