Introduction
I collected these cases myself. Most of them come from my own experience; some are classics and some are representative.
I have also recorded videos of these awk cases: 18 classic practical cases of awk. You are welcome to have a look.
Insert several new fields
Insert three new fields e f g after the field b of "a b c d".
echo a b c d|awk '{$3="e f g "$3}1'
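The trailing 1 is a shorthand pattern that prints the modified line, so the result is:

a b e f g c d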
Format whitespace
Remove the leading and trailing blanks of each line and left-align the fields.
      aaaa        bbb     ccc
   bbb     aaa ccc
ddd       fff             eee gg hh ii jj
awk 'BEGIN{OFS="\t"}{$1=$1;print}' a.txt
Execution result:
aaaa	bbb	ccc
bbb	aaa	ccc
ddd	fff	eee	gg	hh	ii	jj
Filter IPv4 addresses
Filter all IPv4 addresses from the output of the ifconfig command, excluding the lo interface.
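No solution is shown here; a minimal sketch, assuming a modern Linux ifconfig whose address lines start with "inet" and whose loopback block starts with "lo":

ifconfig |
awk '
    # Paragraph mode: each interface block is one record, one line per field
    BEGIN{ RS=""; FS="\n" }
    # Skip the lo block, then pull the address out of each "inet " line
    !/^lo/{
        for(i=1;i<=NF;i++){
            if($i ~ /inet /){
                split($i, parts, " ")
                print parts[2]
            }
        }
    }
'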
Read a section from an .ini configuration file
[base]
name=os_repo
baseurl=https://xxx/centos/$releasever/os/$basearch
gpgcheck=0
enable=1

[mysql]
name=mysql_repo
baseurl=https://xxx/mysql-repo/yum/mysql-5.7-community/el/$releasever/$basearch
gpgcheck=0
enable=1

[epel]
name=epel_repo
baseurl=https://xxx/epel/$releasever/$basearch
gpgcheck=0
enable=1

[percona]
name=percona_repo
baseurl = https://xxx/percona/release/$releasever/RPMS/$basearch
enabled = 1
gpgcheck = 0
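No extraction code is shown above; assuming the task is to print one named section (here [mysql]), a minimal sketch:

awk '
    # Start at the [mysql] header and stop before the next [section] header
    index($0,"[mysql]"){
        print
        while( (getline line) > 0 ){
            if(line ~ /^\[.*\]/){ exit }
            print line
        }
    }
' a.txt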
Deduplicate by a field
Remove lines whose uid=xxx value duplicates an earlier line.
2019-01-13_12:00_index?uid=123
2019-01-13_13:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
2019-01-14_12:00_index?uid=123
2019-01-14_13:00_index?uid=123
2019-01-15_14:00_index?uid=333
2019-01-16_15:00_index?uid=9710
awk -F"?" '!arr[$2]++{print}' a.txt
Result:
2019-01-13_12:00_index?uid=123
2019-01-13_14:00_index?uid=333
2019-01-13_15:00_index?uid=9710
Frequency statistics
portmapper
portmapper
portmapper
portmapper
portmapper
portmapper
status
status
mountd
mountd
mountd
mountd
mountd
mountd
nfs
nfs
nfs_acl
nfs
nfs
nfs_acl
nlockmgr
nlockmgr
nlockmgr
nlockmgr
nlockmgr
awk '{arr[$1]++}END{OFS="\t";for(idx in arr){print arr[idx],idx}}' a.txt
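For the sample above this prints each count and item separated by a tab; the for-in loop returns them in arbitrary order, for example:

6	portmapper
2	status
6	mountd
4	nfs
2	nfs_acl
5	nlockmgr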
Count the number of TCP connection states
$ netstat -tnap
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1139/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN      2285/master
tcp        0     96 192.168.2.17:22         192.168.2.1:2468        ESTABLISHED 87463/sshd: root@pt
tcp        0      0 192.168.2.17:22         192.168.2.1:5821        ESTABLISHED 89359/sshd: root@no
tcp6       0      0 :::3306                 :::*                    LISTEN      2289/mysqld
tcp6       0      0 :::22                   :::*                    LISTEN      1139/sshd
tcp6       0      0 ::1:25                  :::*                    LISTEN      2285/master
Statistical results:
5: LISTEN
2: ESTABLISHED
One-liners:
netstat -tna | awk '/^tcp/{arr[$6]++}END{for(state in arr){print arr[state] ": " state}}'
netstat -tna | /usr/bin/grep 'tcp' | awk '{print $6}' | sort | uniq -c
Count IP accesses with non-200 status codes in the log
Log sample data:
111.202.100.141 - - [2019-11-07T03:11:02+08:00] "GET /robots.txt HTTP/1.1" 301 169
Count, per IP, the requests whose status code is not 200, and output the top 10 IPs by count.
# Method 1
awk '$8!=200{arr[$1]++}END{for(i in arr){print arr[i],i}}' access.log | sort -k1nr | head -n 10

# Method 2 (GNU awk)
awk '
    $8!=200{arr[$1]++}
    END{
        PROCINFO["sorted_in"]="@val_num_desc";
        for(i in arr){
            if(cnt++==10){exit}
            print arr[i],i
        }
    }
' access.log
Unique IP statistics
The columns are: URL, access IP, access time, visitor.
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.23|2015-11-20 20:34:48|guest
c.com.cn|202.109.134.24|2015-11-20 20:34:48|guest
a.com.cn|202.109.134.23|2015-11-20 20:34:43|guest
a.com.cn|202.109.134.24|2015-11-20 20:34:43|guest
b.com.cn|202.109.134.25|2015-11-20 20:34:48|guest
Requirement: count the number of unique access IPs for each URL (deduplicated), and save a file per URL listing its IPs. The summary looks like this:
a.com.cn 2
b.com.cn 2
c.com.cn 1
There are three corresponding files:
a.com.cn.txt
b.com.cn.txt
c.com.cn.txt
Code:
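No code is given above; a minimal sketch using GNU awk arrays of arrays (the file and variable names are illustrative):

awk '
    BEGIN{ FS="|" }
    # Count each (url, ip) pair only the first time it is seen
    !seen[$1][$2]++ { cnt[$1]++ }
    END{
        for(url in cnt){
            # Summary line: URL and its number of unique IPs
            print url, cnt[url]
            # One file per URL, containing its unique IPs
            for(ip in seen[url]){
                print ip > (url".txt")
            }
        }
    }
' a.txt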
Handling missing data in fields
ID  name    gender  age  email          phone
1   Bob     male    28   abc@qq.com     18023394012
2   Alice   female  24   def@gmail.com  18084925203
3   Tony    male    21                  17048792503
4   Kevin   male    21   bbb@189.com    17023929033
5   Alex    male    18   ccc@xyz.com    18185904230
6   Andy    female       ddd@139.com    18923902352
7   Jerry   female  25   exdsa@189.com  18785234906
8   Peter   male    20   bax@qq.com     17729348758
9   Steven          23   bc@sohu.com    15947893212
10  Bruce   female  27   bcbd@139.com   13942943905
When fields are missing, splitting the line directly with FS is unreliable. To handle this special requirement, gawk provides the FIELDWIDTHS variable, which splits fields by character width.
awk '{print $4}' FIELDWIDTHS="2 2:6 2:6 2:3 2:13 2:11" a.txt
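Assuming the fixed-width layout above, this prints the age column, with a blank value where the age is missing:

age
28
24
21
21
18

25
20
23
27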
Process data whose fields contain the field separator
The following is a line in a CSV file that separates the fields with commas.
Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
Requirement: obtain the third field "1234 A Pretty Street, NE".
When a field contains the field separator itself, splitting with FS directly is very difficult. gawk provides the FPAT variable to handle this special requirement.
FPAT holds a regular expression; every successful match of it becomes a field and is stored in $1, $2, $3, ... (much like how grep colors the part of the line that matched).
echo 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA' |\
awk 'BEGIN{FPAT="[^,]+|\".*\""}{print $1,$3}'
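This prints the first and third fields; note that the surrounding double quotes are part of the matched field:

Robbins "1234 A Pretty Street, NE"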
Take a specified number of characters from a field
16  001agdcdafasd
16  002agdcxxxxxx
23  001adfadfahoh
23  001fsdadggggg
Desired result:
16 001
16 002
23 001
23 001
awk '{print $1,substr($2,1,3)}' a.txt
awk 'BEGIN{FIELDWIDTHS="2 2:3"}{print $1,$2}' a.txt
Row and column transformation
name age
alice 21
ryan 30
The conversion results in:
name alice ryan
age 21 30
awk '
    {
        for(i=1;i<=NF;i++){
            if(!(i in arr)){
                arr[i]=$i
            } else {
                arr[i]=arr[i]" "$i
            }
        }
    }
    END{
        for(i=1;i<=NF;i++){
            print arr[i]
        }
    }
' a.txt
Row column conversion 2
Document content:
74683 1001
74683 1002
74683 1011
74684 1000
74684 1001
74684 1002
74685 1001
74685 1011
74686 1000
....
100085 1000
100085 1001
The file has only two columns; I want to process it into:
74683 1001 1002 1011
74684 1000 1001 1002
...
That is, whenever the value in the first column is the same, put the corresponding second-column values on one row, separated by spaces.
{
    if($1 in arr){
        arr[$1] = arr[$1]" "$2
    } else {
        arr[$1] = $2
    }
}

END{
    for(i in arr){
        printf "%s %s\n", i, arr[i]
    }
}
Filter logs for a given time range
Filtering logs with grep/sed/awk and regular expressions is very difficult when the range must be accurate to the hour, minute, and second. However, gawk provides the mktime() function, which converts a time to an epoch value.
# Convert November 10, 2019 03:42:40 to an epoch value
$ awk 'BEGIN{print mktime("2019 11 10 03 42 40")}'
1573328560
This way, you can take the time-string part of each log line, pull out its year, month, day, hour, minute, and second, and hand them to mktime() to build the corresponding epoch value. Because epoch values are numbers, they can be compared directly to decide which time is earlier or later.
The following strptime1() converts a string in the format 2019-11-10T03:42:40+08:00 to an epoch value and compares it with which_time to filter the log with second-level precision.
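The strptime1() code is not included above; a minimal sketch of how it could look (assuming, as in the strptime2 example below, that the time string sits between [ and ] in each log line):

BEGIN{
    # Epoch value of the time to filter by
    which_time = mktime("2019 11 10 03 42 40")
}
{
    # Extract the date-time string, e.g. 2019-11-10T03:42:40+08:00
    match($0,"^.*\\[(.*)\\].*",ts)
    if(strptime1(ts[1]) > which_time){ print }
}
# "2019-11-10T03:42:40+08:00" -> epoch value
function strptime1(str   ,parts,Y,M,D,H,m,S) {
    # Collect the runs of digits: year month day hour minute second ...
    patsplit(str, parts, "[0-9]{1,4}")
    Y=parts[1]; M=parts[2]; D=parts[3]; H=parts[4]; m=parts[5]; S=parts[6]
    return mktime(sprintf("%s %s %s %s %s %s",Y,M,D,H,m,S))
}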
The following strptime2() converts a string in the format 10/Nov/2019:23:53:44+08:00 to an epoch value and compares it with which_time to filter the log with second-level precision.
BEGIN{
    # Build the epoch value of the time to filter by
    which_time = mktime("2019 11 10 03 42 40")
}
{
    # Extract the date-time string part of the log line
    match($0,"^.*\\[(.*)\\].*",arr)
    # Convert the date-time string to an epoch value
    tmp_time = strptime2(arr[1])
    # Compare times by comparing their epoch values
    if(tmp_time > which_time){
        print
    }
}
# The time string handled here has the format: "10/Nov/2019:23:53:44+08:00"
function strptime2(str   ,dt_str,arr,Y,M,D,H,m,S) {
    dt_str = gensub("[/:+]"," ","g",str)
    # dt_str = "10 Nov 2019 23 53 44 08 00"
    split(dt_str,arr," ")
    Y=arr[3]
    M=mon_map(arr[2])
    D=arr[1]
    H=arr[4]
    m=arr[5]
    S=arr[6]
    return mktime(sprintf("%s %s %s %s %s %s",Y,M,D,H,m,S))
}
function mon_map(str   ,mons){
    mons["Jan"]=1
    mons["Feb"]=2
    mons["Mar"]=3
    mons["Apr"]=4
    mons["May"]=5
    mons["Jun"]=6
    mons["Jul"]=7
    mons["Aug"]=8
    mons["Sep"]=9
    mons["Oct"]=10
    mons["Nov"]=11
    mons["Dec"]=12
    return mons[str]
}
Remove /* ... */ comments
Sample data:
/*AAAAAAAAAA*/
1111
222
/*aaaaaaaaa*/
32323
12341234
12134 /*bbbbbbbbbb*/ 132412
14534122
/*
cccccccccc
*/
xxxxxx
/*ddddddddddd
cccccccccc
eeeeeee
*/
yyyyyyyy
5642341
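No solution is shown above; one approach (a sketch, not necessarily the author's method) is to slurp the whole input and strip C-style comments with a single gsub():

awk '
    # Collect the whole file into one string
    { buf = buf $0 "\n" }
    END{
        # C-style comment: /* ... */, possibly spanning several lines
        gsub("/\\*([^*]|\\*+[^*/])*\\*+/", "", buf)
        printf "%s", buf
    }
' a.txt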
Relating a paragraph to the one before it
From a file of the following form, find every 'false' paragraph whose preceding paragraph is an i-order paragraph, and output both paragraphs together.
2019-09-12 07:16:27 [-][
    'data' => [
        'http://192.168.100.20:2800/api/payment/i-order',
    ],
]
2019-09-12 07:16:27 [-][
    'data' => [
        false,
    ],
]
2019-09-21 07:16:27 [-][
    'data' => [
        'http://192.168.100.20:2800/api/payment/i-order',
    ],
]
2019-09-21 07:16:27 [-][
    'data' => [
        'http://192.168.100.20:2800/api/payment/i-user',
    ],
]
2019-09-17 18:34:37 [-][
    'data' => [
        false,
    ],
]
BEGIN{
    RS="]\n"
    ORS=RS
}
{
    if(/false/ && prev ~ /i-order/){
        print prev
        print
    }
    prev=$0
}
Processing two files
There are two files, file1 and file2, both of which have the same format.
Requirement: delete the fifth column of file2, subtract the first column of file1 from the first column of file2, and put the result in the position of the original fifth column. How should this script be written?
file1:
50.481  64.634  40.573  1.00  0.00
51.877  65.004  40.226  1.00  0.00
52.258  64.681  39.113  1.00  0.00
52.418  65.846  40.925  1.00  0.00
49.515  65.641  40.554  1.00  0.00
49.802  66.666  40.358  1.00  0.00
48.176  65.344  40.766  1.00  0.00
47.428  66.127  40.732  1.00  0.00
51.087  62.165  40.940  1.00  0.00
52.289  62.334  40.897  1.00  0.00

file2:
48.420  62.001  41.252  1.00  0.00
45.555  61.598  41.361  1.00  0.00
45.815  61.402  40.325  1.00  0.00
44.873  60.641  42.111  1.00  0.00
44.617  59.688  41.648  1.00  0.00
44.500  60.911  43.433  1.00  0.00
43.691  59.887  44.228  1.00  0.00
43.980  58.629  43.859  1.00  0.00
42.372  60.069  44.032  1.00  0.00
43.914  59.977  45.551  1.00  0.00
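No solution is given above; a minimal sketch, assuming the difference is file2's column 1 minus file1's column 1 (as stated in the requirement) and three decimal places in the output:

awk '
    # First pass (file1): remember its first column by line number
    NR==FNR{ col1[FNR] = $1; next }
    # Second pass (file2): replace the 5th column with the difference
    {
        $5 = sprintf("%.3f", $1 - col1[FNR])
        print
    }
' file1 file2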