18 classic practical cases of awk

Introduction

I collected these cases myself; most of them come from my own experience. Some are classics and some are representative.

I also recorded videos for these cases, "18 classic practical cases of awk". You are welcome to have a look.

Insert several new fields

Insert three fields e f g after the b in "a b c d".

echo a b c d|awk '{$3="e f g "$3}1'
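
Execution result:

a b e f g c d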

Format whitespace

Remove the leading and trailing whitespace from each line, and left-align the fields.

      aaaa        bbb     ccc                 
   bbb     aaa ccc
ddd       fff             eee gg hh ii jj
awk 'BEGIN{OFS="\t"}{$1=$1;print}' a.txt

Execution result:

aaaa    bbb     ccc
bbb     aaa     ccc
ddd     fff     eee     gg      hh      ii      jj

Filter IPv4 addresses

Extract all IPv4 addresses, except that of the lo interface, from the output of the ifconfig command.
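
A minimal sketch (the exact output format of ifconfig varies between versions, so the "inet <address>" layout assumed here may need adjusting): read the output in paragraph mode so that each interface block is one record, skip the lo record, and print the value that follows "inet".

ifconfig | awk '
    BEGIN{RS=""}                  # paragraph mode: one record per interface block
    !/^lo/{
        for(i=1;i<=NF;i++){
            if($i=="inet"){print $(i+1)}
        }
    }'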

Read a section in the .ini configuration file

[base]
name=os_repo
baseurl=https://xxx/centos/$releasever/os/$basearch
gpgcheck=0

enable=1

[mysql]
name=mysql_repo
baseurl=https://xxx/mysql-repo/yum/mysql-5.7-community/el/$releasever/$basearch

gpgcheck=0
enable=1

[epel]
name=epel_repo
baseurl=https://xxx/epel/$releasever/$basearch
gpgcheck=0
enable=1
[percona]
name=percona_repo
baseurl = https://xxx/percona/release/$releasever/RPMS/$basearch
enabled = 1
gpgcheck = 0
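
A minimal sketch that prints a single section, assuming we want the [mysql] section and that the file is named a.ini (both names are assumptions; substitute your own): set a flag when the wanted section header appears, clear it at the next section header, and print while the flag is set.

awk '
    index($0,"[mysql]"){flag=1; print; next}   # the wanted section header
    /^\[/{flag=0}                              # any other section header ends it
    flag                                       # print lines while inside the section
' a.ini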

Deduplicate by one field

Remove lines whose uid=xxx duplicates an earlier line (keep the first occurrence of each uid).

2019-01-13_12:00_index?uid=123
2019-01-13_13:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
2019-01-14_12:00_index?uid=123
2019-01-14_13:00_index?uid=123
2019-01-15_14:00_index?uid=333
2019-01-16_15:00_index?uid=9710
awk -F"?" '!arr[$2]++{print}' a.txt

Result:

2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710

Frequency statistics

Count how many times each value appears in the data below.

portmapper
portmapper
portmapper
portmapper
portmapper
portmapper
status
status
mountd
mountd
mountd
mountd
mountd
mountd
nfs
nfs
nfs_acl
nfs
nfs
nfs_acl
nlockmgr
nlockmgr
nlockmgr
nlockmgr
nlockmgr
awk '{arr[$1]++}END{OFS="\t";for(idx in arr){print arr[idx],idx}}' a.txt
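
A possible result (the traversal order of for(idx in arr) is unspecified, so the lines may appear in a different order):

6       portmapper
2       status
6       mountd
4       nfs
2       nfs_acl
5       nlockmgr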

Count the number of TCP connection states

$ netstat -tnap
Proto Recv-Q Send-Q Local Address   Foreign Address  State       PID/Program name
tcp        0      0 0.0.0.0:22      0.0.0.0:*        LISTEN      1139/sshd
tcp        0      0 127.0.0.1:25    0.0.0.0:*        LISTEN      2285/master
tcp        0     96 192.168.2.17:22 192.168.2.1:2468 ESTABLISHED 87463/sshd: root@pt
tcp        0      0 192.168.2.17:22 192.168.2.1:5821 ESTABLISHED 89359/sshd: root@no
tcp6       0      0 :::3306         :::*             LISTEN      2289/mysqld
tcp6       0      0 :::22           :::*             LISTEN      1139/sshd
tcp6       0      0 ::1:25          :::*             LISTEN      2285/master

Statistical results:

5: LISTEN
2: ESTABLISHED

One-liners:

netstat -tna | awk '/^tcp/{arr[$6]++}END{for(state in arr){print arr[state] ": " state}}'
netstat -tna | /usr/bin/grep 'tcp' | awk '{print $6}' | sort | uniq -c

Count the number of non-200 status codes per IP in the log

Log sample data:

111.202.100.141 - - [2019-11-07T03:11:02+08:00] "GET /robots.txt HTTP/1.1" 301 169 

Count, per IP, the requests whose status code is not 200, and take the 10 IPs with the most such requests.

# Method 1
awk '$8!=200{arr[$1]++}END{for(i in arr){print arr[i],i}}' access.log | sort -k1nr | head -n 10

# Method 2 (uses gawk's PROCINFO["sorted_in"]):
awk '
    $8!=200{arr[$1]++}
    END{
        PROCINFO["sorted_in"]="@val_num_desc";
        for(i in arr){
            if(cnt++==10){exit}
            print arr[i],i
        }
}' access.log

Unique IP statistics

The columns are: url | access IP | access time | visitor

a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.23|2015-11-20 20:34:48|guest
c.com.cn|202.109.134.24|2015-11-20 20:34:48|guest
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
a.com.cn|202.109.134.24|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.25|2015-11-20 20:34:48|guest

Requirement: count the number of unique access IPs for each URL (deduplicated), and save each URL's result to a corresponding file. The result looks like this:

a.com.cn  2
b.com.cn  2
c.com.cn  1

There are three corresponding files:

a.com.cn.txt
b.com.cn.txt
c.com.cn.txt

Code:
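
A minimal sketch (the input file name a.txt and the per-URL file naming url".txt" are assumptions based on the listing above): count each (url, IP) pair only once, then print one summary line per URL and also write it to that URL's own file.

awk '
    BEGIN{FS="|"}
    !seen[$1,$2]++{cnt[$1]++}                   # count each (url, ip) pair once
    END{
        for(url in cnt){
            print url, cnt[url]                 # summary to stdout
            print url, cnt[url] > (url".txt")   # per-url file, e.g. a.com.cn.txt
        }
    }
' a.txt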

Handling missing data in fields

ID  name    gender  age  email          phone
1   Bob     male    28   abc@qq.com     18023394012
2   Alice   female  24   def@gmail.com  18084925203
3   Tony    male    21                  17048792503
4   Kevin   male    21   bbb@189.com    17023929033
5   Alex    male    18   ccc@xyz.com    18185904230
6   Andy    female       ddd@139.com    18923902352
7   Jerry   female  25   exdsa@189.com  18785234906
8   Peter   male    20   bax@qq.com     17729348758
9   Steven          23   bc@sohu.com    15947893212
10  Bruce   female  27   bcbd@139.com   13942943905

When fields can be missing, it is very hard to split the line correctly with FS. To handle this special case, gawk provides the FIELDWIDTHS variable.

FIELDWIDTHS splits fields by character width, i.e. each field occupies a fixed number of characters.

# print the 4th column (age); each "skip:width" pair skips some characters and then takes a fixed-width field
awk '{print $4}' FIELDWIDTHS="2 2:6 2:6 2:3 2:13 2:11" a.txt

Process data whose fields contain the field separator

The following is a line in a CSV file that separates the fields with commas.

Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA

Requirement: obtain the third field "1234 A Pretty Street, NE".

When a field itself contains the field separator, it is very hard to split the line correctly with FS. gawk provides the FPAT variable to solve this special case.

FPAT defines fields by what they match instead of by what separates them: every substring matching the FPAT regular expression becomes a field (just as grep highlights each matching part, each FPAT match is saved into $1, $2, $3, ...).

echo 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA' |\
awk 'BEGIN{FPAT="[^,]+|\"[^\"]*\""}{print $1,$3}'
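
Execution result:

Robbins "1234 A Pretty Street, NE"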

Take a specified number of characters from a field

16  001agdcdafasd
16  002agdcxxxxxx
23  001adfadfahoh
23  002fsdadggggg

Get:

16  001
16  002
23  001
23  002
awk '{print $1,substr($2,1,3)}' a.txt
awk 'BEGIN{FIELDWIDTHS="2 2:3"}{print $1,$2}' a.txt

Row and column transformation

name age
alice 21
ryan 30

The conversion results in:

name alice ryan
age 21 30
awk '
    {
      for(i=1;i<=NF;i++){
        if(!(i in arr)){
          arr[i]=$i
        } else {
            arr[i]=arr[i]" "$i
        }
      }
    }
    END{
        for(i=1;i<=NF;i++){
            print arr[i]
        }
    }
' a.txt

Row-column conversion 2

File contents:

74683 1001
74683 1002
74683 1011
74684 1000
74684 1001
74684 1002
74685 1001
74685 1011
74686 1000
....
100085 1000
100085 1001

The file has only two columns. I want to process it into:

74683 1001 1002 1011
74684 1000 1001 1002
...

Whenever the numbers in the first column are the same, put their second-column values on one row, separated by spaces.

{
  if($1 in arr){
    arr[$1] = arr[$1]" "$2
  } else {
    arr[$1] = $2
  }
  
}

END{
  for(i in arr){
    printf "%s %s\n",i,arr[i]
  }
}
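
One possible way to run it, assuming the program above is saved as conv.awk and the data as a.txt (both names are only for illustration):

awk -f conv.awk a.txt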

Filter logs for a given time range

Filtering logs with grep/sed/awk and regular expressions alone becomes very difficult when the filter must be accurate to the hour, minute and second.

However, gawk provides the mktime() function, which converts a time specification into an epoch value.

# November 10, 2019 03:42:40 converted to epoch
$ awk 'BEGIN{print mktime("2019 11 10 03 42 40")}'
1573328560

So you can extract the time string from each log line, take out its year, month, day, hour, minute and second, and feed them to mktime() to build the corresponding epoch value. Because epoch values are plain numbers, comparing them compares the times.

The strptime1() function converts a string in the 2019-11-10T03:42:40+08:00 format to an epoch value, and the strptime2() function does the same for the 10/Nov/2019:23:53:44+08:00 format. The program below uses strptime2() and compares the result with which_time to filter the log down to the second; a sketch of strptime1() is given after the program.

BEGIN{
  # Build the epoch value of the time we want to filter by
  which_time = mktime("2019 11 10 03 42 40")
}

{
  # Get the date time string part of the log
  match($0,"^.*\\[(.*)\\].*",arr)
  
  # Convert date time string to epoch value
  tmp_time = strptime2(arr[1])
  
  # Compare time by comparing epoch values
  if(tmp_time > which_time){
    print 
  }
}

# The format of the time string constructed is: "10/Nov/2019:23:53:44+08:00"
function strptime2(str   ,dt_str,arr,Y,M,D,H,m,S) {
  dt_str = gensub("[/:+]"," ","g",str)
  # dt_str = "10 Nov 2019 23 53 44 08 00"
  split(dt_str,arr," ")
  Y=arr[3]
  M=mon_map(arr[2])
  D=arr[1]
  H=arr[4]
  m=arr[5]
  S=arr[6]
  return mktime(sprintf("%s %s %s %s %s %s",Y,M,D,H,m,S))
}

function mon_map(str   ,mons){
  mons["Jan"]=1
  mons["Feb"]=2
  mons["Mar"]=3
  mons["Apr"]=4
  mons["May"]=5
  mons["Jun"]=6
  mons["Jul"]=7
  mons["Aug"]=8
  mons["Sep"]=9
  mons["Oct"]=10
  mons["Nov"]=11
  mons["Dec"]=12
  return mons[str]
}
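
The text above also mentions strptime1() for the 2019-11-10T03:42:40+08:00 format; here is a minimal sketch along the same lines (like strptime2(), it ignores the timezone offset):

# The format of the time string is: "2019-11-10T03:42:40+08:00"
function strptime1(str   ,dt_str,arr,Y,M,D,H,m,S) {
  dt_str = gensub("[-T:+]"," ","g",str)
  # dt_str = "2019 11 10 03 42 40 08 00"
  split(dt_str,arr," ")
  Y=arr[1]
  M=arr[2]
  D=arr[3]
  H=arr[4]
  m=arr[5]
  S=arr[6]
  return mktime(sprintf("%s %s %s %s %s %s",Y,M,D,H,m,S))
}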

Remove comments between /* and */

Sample data:

/*AAAAAAAAAA*/
1111
222

/*aaaaaaaaa*/
32323
12341234
12134 /*bbbbbbbbbb*/ 132412

14534122
/*
    cccccccccc
*/
xxxxxx /*ddddddddddd
    cccccccccc
    eeeeeee
*/ yyyyyyyy
5642341
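
A minimal sketch (assuming the data is in a.txt, at most one comment per line, and no /* or */ inside strings): comments that open and close on the same line are stripped in place; for a multi-line comment, the text before /* and after */ is kept on one line and everything in between is skipped.

awk '
{
    # strip a comment that opens and closes on this line
    gsub(/\/\*.*\*\//,"")

    if(!in_comment && match($0,/\/\*/)){
        # a comment opens here and continues on later lines:
        # print the text before /* (no newline yet) and start skipping
        in_comment = 1
        printf "%s", substr($0,1,RSTART-1)
        next
    }

    if(in_comment){
        if(match($0,/\*\//)){
            # the comment closes here: print the text after */
            in_comment = 0
            print substr($0,RSTART+2)
        }
        next
    }

    print
}
' a.txt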

Judge the relationship between adjacent sections

From a file like the following, find every section that contains false and whose immediately preceding section is an i-order section, and output both sections together.

2019-09-12 07:16:27 [-][
  'data' => [
    'http://192.168.100.20:2800/api/payment/i-order',
  ],
]
2019-09-12 07:16:27 [-][
  'data' => [
    false,
  ],
]
2019-09-21 07:16:27 [-][
  'data' => [
    'http://192.168.100.20:2800/api/payment/i-order',
  ],
]
2019-09-21 07:16:27 [-][
  'data' => [
    'http://192.168.100.20:2800/api/payment/i-user',
  ],
]
2019-09-17 18:34:37 [-][
  'data' => [
    false,
  ],
]
BEGIN{
  RS="]\n"
  ORS=RS
}
{
  if(/false/ && prev ~ /i-order/){
    print prev
    print
  }
  prev=$0
}

Processing of two files

There are two files, file1 and file2, both of which have the same format.

Requirement: first delete the fifth column of file2, then subtract the first column of file1 from the first column of file2, and put the result in the position where the fifth column used to be. How do we write this script?

file1: 
50.481  64.634  40.573  1.00  0.00
51.877  65.004  40.226  1.00  0.00
52.258  64.681  39.113  1.00  0.00
52.418  65.846  40.925  1.00  0.00
49.515  65.641  40.554  1.00  0.00
49.802  66.666  40.358  1.00  0.00
48.176  65.344  40.766  1.00  0.00
47.428  66.127  40.732  1.00  0.00
51.087  62.165  40.940  1.00  0.00
52.289  62.334  40.897  1.00  0.00
file2: 
48.420  62.001  41.252  1.00  0.00
45.555  61.598  41.361  1.00  0.00
45.815  61.402  40.325  1.00  0.00
44.873  60.641  42.111  1.00  0.00
44.617  59.688  41.648  1.00  0.00
44.500  60.911  43.433  1.00  0.00
43.691  59.887  44.228  1.00  0.00
43.980  58.629  43.859  1.00  0.00
42.372  60.069  44.032  1.00  0.00
43.914  59.977  45.551  1.00  0.00
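
A minimal sketch (the subtraction direction, file2's first column minus file1's, follows the wording above; assigning $5 rebuilds each line with single spaces between fields, so the original column alignment is not preserved): remember file1's first column by line number on the first pass, then rewrite file2's fifth column on the second pass.

awk '
    NR==FNR{f1[FNR]=$1; next}    # first pass: remember file1 column 1 per line
    {
        $5 = $1 - f1[FNR]        # file2 $1 minus file1 $1, placed in column 5
        print
    }
' file1 file2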
