Text processing in Shell programming

preface

In daily work and study, we often have to process text files (such as log files): splitting, searching, replacing, deleting, and so on. Does the Shell provide commands for these tasks? Let's find out together in this article.

cut

The grep command finds the lines in a file that match a condition, while the cut command extracts columns from each line according to a separator. The default separator is TAB.

# cut [options] file name

Options:

-f column number:	Which column (or columns) to extract
-d separator:	Split each line into columns at the specified separator character


# The default delimiter for cut is TAB

The command is easiest to understand by trying it out. We first create a new file whose columns are separated by TABs, and then extract columns from it with the cut command.

Example 1

[root@VM-0-5-centos ~]# vim student.txt
ID      NAME    Gender  MARK
1       zs      M       98
2       ls      F       99

# 1. Extract the second column
[root@VM-0-5-centos ~]# cut -f 2 student.txt
NAME
zs
ls

# 2. Extract the second and fourth columns
[root@VM-0-5-centos ~]# cut -f 2,4 student.txt
NAME	MARK
zs	98
ls	99

# 3. Extract columns using a specified delimiter

# 3.1 this is the text we need to process
[root@VM-0-5-centos ~]# cat /etc/passwd | grep /bin/bash | grep -v root
lifelmy:x:1000:1000::/home/lifelmy:/bin/bash
user1:x:1001:1001::/home/user1:/bin/bash
user3:x:1002:1003::/home/user3:/bin/bash

# 3.2 extract the first column
[root@VM-0-5-centos ~]# cat /etc/passwd | grep /bin/bash | grep -v root | cut -f 1 -d :
lifelmy
user1
user3

From the examples above, we can see that cut can extract columns from lines. It has one drawback, however: every single occurrence of the separator character starts a new column. If columns are separated by a varying number of spaces (one space in some places, two or more in others), each extra space produces an empty column and cut cannot divide the columns correctly.

Example 2

[root@VM-0-5-centos ~]# df  -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        909M     0  909M    0% /dev
tmpfs           919M   24K  919M    1% /dev/shm
tmpfs           919M  540K  919M    1% /run
tmpfs           919M     0  919M    0% /sys/fs/cgroup
/dev/vda1        50G  5.6G   42G   12% /
tmpfs           184M     0  184M    0% /run/user/0

# Use a space as the separator to extract the first column
[root@VM-0-5-centos ~]# df  -h | cut -f 1 -d " "
Filesystem
devtmpfs
tmpfs
tmpfs
tmpfs
/dev/vda1
tmpfs

# But when we extract the second column, we get only empty lines, not the "Size" column we want
[root@VM-0-5-centos ~]# df  -h | cut -f 2 -d " "
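
One way to keep using cut on such output is to squeeze each run of spaces into a single space first. The sketch below assumes the tr command is available (it is not covered in this article) and uses one made-up sample line instead of real df output:

```shell
# Sample line with several spaces between columns (made-up data)
line='tmpfs           919M   24K  919M    1% /dev/shm'

# Without squeezing, field 2 is the empty string between the first two spaces
raw=$(printf '%s\n' "$line" | cut -d ' ' -f 2)

# tr -s ' ' collapses each run of spaces into one, so cut sees real columns
squeezed=$(printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f 2)

echo "raw=[$raw] squeezed=[$squeezed]"   # raw=[] squeezed=[919M]
```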

awk

As example 2 above shows, cut has its limitations. Is there another command that solves this problem? This is where the awk command comes in. Before introducing awk in detail, let's first look at the output commands it uses.

The awk command supports two output commands, print and printf:

  • print: automatically appends a line break after each output (print exists only inside awk; Linux has no standalone print command by default)

  • printf: the standard formatted-output command; it does not append a line break, so one must be added manually where needed

printf

printf "output type and output format" output content

[Output type]:

%ns:	Output a string; n is the minimum field width in characters
%ni:	Output an integer; n is the minimum field width in characters
%m.nf:	Output a floating-point number; m is the minimum total width and n is the number of decimal places. %8.2f means a total width of 8 characters: 2 decimal places, the decimal point, and up to 5 characters for the integer part

[Output format]:

\a:	Alert (bell) sound
\b:	Backspace
\f:	Form feed
\n:	Line feed (newline)
\r:	Carriage return
\t:	Horizontal tab
\v:	Vertical tab

Example

# %-5s prints a string left-aligned in a field of width 5 ('-' means left alignment; without it, the default is right alignment).
# %-4.2f prints a number left-aligned with a minimum width of 4 and two decimal places; longer values such as 90.35 are printed in full.

[root@VM-0-5-centos ~]# printf "%-5s %-10s %-4s\n" NO Name Mark
NO    Name       Mark
[root@VM-0-5-centos ~]# printf "%-5s %-10s %-4.2f\n" 01 Tom 90.3456
01    Tom        90.35
[root@VM-0-5-centos ~]# printf "%-5s %-10s %-4.2f\n" 02 Jack 89.2345
02    Jack       89.23
[root@VM-0-5-centos ~]# printf "%-5s %-10s %-4.2f\n" 03 Jeff 98.4323
03    Jeff       98.43


[root@VM-0-5-centos ~]# printf '%d %d %d\n' 12 34 56
12 34 56
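
To make the effect of the width and the '-' flag easier to see, here is a small sketch that wraps each field in brackets. The values are arbitrary, and the results assume a locale that uses "." as the decimal point:

```shell
# Brackets make the padding visible
a=$(printf '[%8.2f]' 3.14159)    # right-aligned, width 8: [    3.14]
b=$(printf '[%-8.2f]' 3.14159)   # left-aligned, same width: [3.14    ]
c=$(printf '[%5s]' abc)          # string padded on the left: [  abc]
d=$(printf '[%2s]' abcdef)       # width is only a minimum:   [abcdef]

printf '%s\n' "$a" "$b" "$c" "$d"
```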

Awk is a programming language for processing text and data under Linux/Unix. The data can come from standard input (stdin), from one or more files, or from the output of other commands. It supports advanced features such as user-defined functions and dynamic regular expressions. It can be used directly on the command line, but is more often used in scripts. Awk provides arrays and built-in functions and has a C-like syntax. Flexibility is its biggest advantage.

awk 'BEGIN{action} Condition 1{Action 1} Condition 2{Action 2} END{action}' file


Condition (pattern):
     Relational expressions are generally used as conditions
     x>10
     x<=10

Action:
     Formatted output
     Flow control statements

An awk script consists of three parts: a BEGIN block, general blocks that can be guarded by conditions, and an END block. All three parts are optional, and any of them may be omitted from a script.

The BEGIN action is executed before the first line of the file is read; the END action is executed after the whole file has been read.

test

# 1. Before the first line is read, the BEGIN action runs; then every line is unconditionally split on whitespace (the default) and its fifth column is printed; finally the END action runs.
#    The field is passed as an argument to "%s" so that the % in the data is not treated as a format specifier.
[root@VM-0-5-centos ~]# df -h | awk 'BEGIN{printf "this is begin \n"} {printf "%s\n", $5} END{printf "this is end \n"}'
this is begin
Use%
0%
1%
1%
0%
12%
0%
this is end

# 2. Set i=0 before the first line is read, increment i by one for each line, and finally output the value of i (the number of lines)
[root@VM-0-5-centos ~]# df -h | awk 'BEGIN{ i=0 } { i++ } END{ print i }'
7

# 3. Combine examples 1 and 2: print the fifth column followed by a running line count
[root@VM-0-5-centos ~]# df -h | awk 'BEGIN{i=0; printf "start \n"} {printf "%s\t", $5} {i++; printf "%d\n", i} END{printf "end\n"}'
start
Use%	1
0%	2
1%	3
1%	4
0%	5
12%	6
0%	7
end
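
The array support mentioned earlier can be sketched with a small counting example. The input is made-up data in the same layout as student.txt, and the array name count is only illustrative:

```shell
# Made-up sample in the student.txt layout (TAB-separated)
data=$(printf 'ID\tNAME\tGender\tMARK\n1\tzs\tM\t98\n2\tls\tF\t99\n3\tww\tM\t87\n')

# NR > 1 skips the header; count[] is an associative array keyed by column 3.
# The order of "for (g in count)" is unspecified, so the result is sorted.
result=$(printf '%s\n' "$data" |
    awk 'NR > 1 { count[$3]++ } END { for (g in count) print g, count[g] }' |
    sort)

printf '%s\n' "$result"
```

With this sample the script reports one F and two M entries.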

By default, awk splits each line on whitespace (any run of spaces or TABs counts as a single separator). For files that use a different separator, the main use of BEGIN is to set the separator before the file is processed, by assigning it to the built-in variable FS.

# Data to process
[root@VM-0-5-centos ~]# cat /etc/passwd | grep syslog
syslog:x:996:994::/home/syslog:/bin/false

# Use BEGIN to set the separator to ":", and then output the third column of data
[root@VM-0-5-centos ~]# cat /etc/passwd | grep syslog | awk 'BEGIN {FS=":"} {printf $3 "\n"}'
996

# Using relational operator conditions
[root@VM-0-5-centos ~]# cat /etc/passwd  | awk 'BEGIN {FS=":"} $3>995{printf $3 "\n"}'
999
998
997
996
1000
1001
1002
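
As an alternative to setting FS inside BEGIN, awk also accepts the separator through its -F option. A minimal sketch on made-up passwd-style lines (the user names are hypothetical):

```shell
# Made-up lines in /etc/passwd format
sample='alice:x:1000:1000::/home/alice:/bin/bash
daemon:x:2:2::/sbin:/sbin/nologin
bob:x:1001:1001::/home/bob:/bin/bash'

# -F ':' has the same effect as BEGIN{FS=":"};
# print the name and UID of every entry whose UID is at least 1000
users=$(printf '%s\n' "$sample" |
    awk -F ':' '$3 >= 1000 { printf "%s %s\n", $1, $3 }')

printf '%s\n' "$users"
```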

Here is a concrete problem to think about. We used the df -h command above to check disk usage on Linux. Suppose we want a scheduled task that checks the usage of a disk every day and sends an email alert when usage reaches 80%. Here we only handle the first step: determining whether the disk usage exceeds a given threshold.

Script file

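#!/bin/bash

# Ask for the threshold (an integer percentage, e.g. 80)
read -r -p 'please input rate: ' r
# Take the Use% column of the vda1 line and strip the trailing "%"
rate=$(df -h | grep 'vda1' | awk '{print $5}' | cut -d '%' -f 1)
# Quote both values so the test does not break on empty input
if [ "$rate" -gt "$r" ]
        then
                echo 'alarm'
fi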

test

[root@VM-0-5-centos ~]# chmod 755 test.sh

[root@VM-0-5-centos ~]# ./test.sh
please input rate: 20
[root@VM-0-5-centos ~]# ./test.sh
please input rate: 10
alarm
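
For a scheduled task an interactive prompt is inconvenient, so the threshold could instead be passed as an argument. The following is only a sketch: the function name check_usage is hypothetical, and it reads df-style text from standard input so that it can be tried on sample data. It raises the alarm when usage reaches the limit (>=), whereas the script above uses a strict greater-than test.

```shell
# check_usage LIMIT: read df-style output on stdin and print an alarm line
# for every filesystem whose Use% column (field 5) is at or above LIMIT.
check_usage() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)          # strip the trailing percent sign
        if (pct + 0 >= limit + 0)   # force a numeric comparison
            printf "alarm: %s at %s%%\n", $6, pct
    }'
}

# Made-up sample data; in real use this would be: df -h | check_usage 80
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        50G  5.6G   42G   12% /
tmpfs           919M  540K  919M    1% /run'

printf '%s\n' "$sample" | check_usage 10
```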

sed

sed is a lightweight stream editor available on almost all Unix platforms (including Linux). It is a very important tool in text processing and works well together with regular expressions.

When processing, sed stores the current line in a temporary buffer called the "pattern space", applies the sed commands to the contents of the buffer, and then sends the contents of the buffer to the screen. It then moves on to the next line, repeating until the end of the file. The file itself does not change unless you redirect the output to a file or use the -i option. sed is mainly used to automate edits to one or more files, simplify repeated operations on files, write conversion programs, and so on.

# sed [options] 'action' file name

Options:

-n	By default, sed outputs all data to the screen. With this option, only the lines explicitly printed by a sed command (for example with p) are output
-e	Allows multiple sed commands to be applied to the input data
-i	Writes the result of the sed modifications directly to the file that was read, instead of printing it to the screen

Actions:

a:	Append: add one or more lines after the current line. When adding multiple lines, every line except the last must end with '\' to indicate that the data is not yet complete
c:	Change: replace the original line with the string that follows 'c'. When replacing with multiple lines, every line except the last must end with '\'
i:	Insert: insert one or more lines before the current line. When inserting multiple lines, every line except the last must end with '\'
d:	Delete: delete the specified line
p:	Print: output the specified line
s:	Substitute: replace one string with another. The format is "line-number s/old string/new string/g"; the trailing 'g' means that all matches in the line are replaced (similar to vim)
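
The backslash continuation used by a, c and i can be sketched as follows, on made-up input supplied through a pipe rather than a file:

```shell
# After line 1, append two lines. The "a\" form puts the text on the
# following lines; the first appended line ends with "\" to signal
# that more text follows.
out=$(printf 'one\ntwo\n' | sed '1a\
first added line\
second added line')

printf '%s\n' "$out"
```

The two added lines appear between "one" and "two".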

test

# Files to process
[root@VM-0-5-centos ~]# cat student.txt
ID	NAME	Gender	MARK
1	zs	M	98
2	ls	F	99

# Print out the second line. By default, not only the specified line but also all lines in the original file will be output
[root@VM-0-5-centos ~]# sed '2p' student.txt
ID	NAME	Gender	MARK
1	zs	M	98
1	zs	M	98
2	ls	F	99

# With -n, only the lines explicitly printed by sed are output
[root@VM-0-5-centos ~]# sed -n '2p' student.txt
1	zs	M	98

# Delete the data in line 2 without modifying the original file (modified in the buffer)
[root@VM-0-5-centos ~]# sed '2d' student.txt
ID	NAME	Gender	MARK
2	ls	F	99
[root@VM-0-5-centos ~]# cat student.txt
ID	NAME	Gender	MARK
1	zs	M	98
2	ls	F	99

# Add a piece of data after line 3 without modifying the original data (modified in the buffer)
[root@VM-0-5-centos ~]# sed '3a hello world' student.txt
ID	NAME	Gender	MARK
1	zs	M	98
2	ls	F	99
hello world

# Insert a piece of data before line 3 without modifying the original data (modified in the buffer)
[root@VM-0-5-centos ~]# sed '3i hello world' student.txt
ID	NAME	Gender	MARK
1	zs	M	98
hello world
2	ls	F	99

# Replace the data in line 3 without modifying the original data (modified in the buffer)
[root@VM-0-5-centos ~]# sed '3c hello world' student.txt
ID	NAME	Gender	MARK
1	zs	M	98
hello world

# Replace the string field without modifying the original data (modified in the buffer)
[root@VM-0-5-centos ~]# sed  '2s/98/100/g' student.txt
ID	NAME	Gender	MARK
1	zs	M	100
2	ls	F	99

# String replacement, directly modify the file
[root@VM-0-5-centos ~]# sed -i '2s/98/100/g' student.txt
[root@VM-0-5-centos ~]# cat student.txt
ID	NAME	Gender	MARK
1	zs	M	100
2	ls	F	99

# Replace multiple places
[root@VM-0-5-centos ~]# sed -i '2s/M/F/; 2s/zs/ww/' student.txt
[root@VM-0-5-centos ~]# cat student.txt
ID	NAME	Gender	MARK
1	ww	F	100
2	ls	F	99
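
The -e option listed earlier is not used in the tests above. A minimal sketch, again on made-up input through a pipe, which also selects the line with a regular expression instead of a line number (an addressing form this article does not otherwise cover):

```shell
# Made-up sample in the student.txt layout (TAB-separated)
data=$(printf 'ID\tNAME\tGender\tMARK\n1\tzs\tM\t98\n2\tls\tF\t99\n')

# Each -e adds one editing command; /^1/ limits both commands to
# lines that start with "1"
out=$(printf '%s\n' "$data" | sed -e '/^1/s/M/F/' -e '/^1/s/zs/ww/')

printf '%s\n' "$out"
```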

summary

This article covered three commands:

  • cut: extracts columns from a line of text. The default separator is TAB, and another separator can be specified. If the separator is a space, there must be exactly one space between adjacent columns;

  • awk: a powerful programming language that supports arrays, functions and more, with a C-like syntax;

  • sed: a powerful stream editor that makes it easy to process a file line by line.

PS: This article only briefly introduces the most commonly used parts of each command; all three offer far more for readers who want to dig deeper.

more

Personal blog: https://lifelmy.github.io/

WeChat official account: long Coding Road

Tags: Linux shell

Posted on Mon, 22 Nov 2021 10:44:12 -0500 by maxpouliot