awk field separators: the special case of the space separator

Today, while formatting some text with awk, I ran into a strange behavior. I looked it up in the awk manual and am recording what I found here.

Sample Text Content

The file used in all the awk examples below is file.txt. Its contents are as follows:

# cat -A file.txt
     1^Iroot:x:0:0:root:/root:/bin/bash$
     2^Ibin:x:1:1:bin:/bin:/sbin/nologin$
     3^Idaemon:x:2:2:daemon:/sbin:/sbin/nologin$

Phenomenon Description

When multiple separators (including the space) are specified with a bracket expression, as in awk -F '[ ]', consecutive spaces are split into multiple fields instead of being treated as a single separator.

By default, awk uses whitespace characters (spaces, tabs, and newlines) as field separators. To make the comparison clearer, the examples here set the separator explicitly with the -F option. A simple comparison:

Let's first specify a single space as the separator and print the second field:

# awk -F " " '{print NF, $2}' file.txt
2 root:x:0:0:root:/root:/bin/bash
2 bin:x:1:1:bin:/bin:/sbin/nologin
2 daemon:x:2:2:daemon:/sbin:/sbin/nologin

Now specify the space inside a bracket expression and print the sixth field:

# awk -F "[ ]" '{print NF, $6}' file.txt
6 1 root:x:0:0:root:/root:/bin/bash
6 2 bin:x:1:1:bin:/bin:/sbin/nologin
6 3 daemon:x:2:2:daemon:/sbin:/sbin/nologin

Isn't it strange? With -F " " as the separator there are only two fields per line, whereas with -F "[ ]" there are six fields per line: $1 through $5 are empty, and $6 holds everything else.
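The comparison can be reproduced without file.txt. The following is a minimal sketch, assuming the first line of the file is five spaces, the number 1, a tab, and the passwd entry (as the cat -A output suggests). The variable names are only illustrative.

```shell
# Recreate the first line of file.txt: five spaces, "1", a tab, the passwd entry.
line=$(printf '     1\troot:x:0:0:root:/root:/bin/bash')

# FS = " " is the special case: leading blanks are stripped, runs of blanks collapse.
default_nf=$(printf '%s\n' "$line" | awk -F ' ' '{ print NF }')

# FS = "[ ]" is a regular expression: every single space separates two fields.
# Print NF and the number of empty fields.
bracket=$(printf '%s\n' "$line" |
    awk -F '[ ]' '{ empty = 0; for (i = 1; i <= NF; i++) if ($i == "") empty++; print NF, empty }')

echo "FS=\" \"   -> NF=$default_nf"            # NF=2
echo "FS=\"[ ]\" -> NF and empty: $bracket"    # 6 5
```

The five empty fields correspond to the five leading spaces: each space ends one (empty) field.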

Check the awk manual:
4.5.1 Whitespace Normally Separates Fields

If awk interpreted this value in the usual way, each space character would separate fields, so two spaces in a row would make an empty field between them. The reason this does not happen is that a single space as the value of FS is a special case: it is taken to specify the default manner of delimiting fields.

If FS is any other single character, such as ",", then each occurrence of that character separates two fields. Two consecutive occurrences delimit an empty field. If the character occurs at the beginning or the end of the line, that too delimits an empty field. The space character is the only single character that does not follow these rules.

4.5.2 Using Regular Expressions to Separate Fields

There is an important difference between the two cases of 'FS = " "' (a single space) and 'FS = "[ \t\n]+"' (a regular expression matching one or more spaces, TABs, or newlines). For both values of FS, fields are separated by runs (multiple adjacent occurrences) of spaces, TABs, and/or newlines. However, when the value of FS is " ", awk first strips leading and trailing whitespace from the record and then decides where the fields are. 

awk manual
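The difference described in section 4.5.2 can be checked with a record that starts with a space. This is a small sketch. The input ' a b c d' and the variable names are chosen here for illustration.

```shell
record=' a b c d'

# FS = " ": the leading space is stripped first, so $1 is "a" and $2 is "b".
special=$(printf '%s\n' "$record" | awk 'BEGIN { FS = " " } { print NF, $2 }')

# FS = "[ \t\n]+": the leading space is a separator, so $1 is empty and $2 is "a".
regex=$(printf '%s\n' "$record" | awk 'BEGIN { FS = "[ \t\n]+" } { print NF, $2 }')

echo "$special"   # 4 b
echo "$regex"     # 5 a
```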

The two passages from the manual explain this strange behavior. In short:

  • When FS is a single space " ", consecutive spaces do not produce empty fields. awk first strips the leading and trailing whitespace from the record, then splits it into fields.
  • If FS is any other single character, such as ",", two consecutive occurrences delimit an empty field. If the character appears at the beginning or end of the line, that also delimits an empty field. The space character, as the default separator, is the only single character that does not follow these rules.
  • If the space is specified as -F '[ ]', FS is a regular expression that matches exactly one space. The space then loses its special role as the default separator and follows the same rules as any other character.
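The three rules can be seen side by side in a short sketch. show_fields is a hypothetical helper defined here (not part of awk) that wraps every field in brackets so that empty fields become visible. The sample inputs are made up for illustration.

```shell
# Hypothetical helper: split stdin with the given separator and print each field in brackets.
show_fields() {
    awk -F "$1" '{ for (i = 1; i <= NF; i++) printf "[%s]", $i; print "" }'
}

space_default=$(printf ' a  b \n' | show_fields ' ')    # special case
comma=$(printf ',a,,b,\n' | show_fields ',')            # ordinary single character
space_bracket=$(printf ' a  b \n' | show_fields '[ ]')  # space as a regular expression

echo "$space_default"   # [a][b]
echo "$comma"           # [][a][][b][]
echo "$space_bracket"   # [][a][][b][]
```

With '[ ]', the space produces exactly the same pattern of empty fields as the comma does.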

Summary

In light of the above, let's look at a few more examples to summarize what we learned today.

Example:

Summary:

  • Example 1 specifies no separator, so the default is used, and the consecutive whitespace characters at the beginning of the line are removed automatically.
  • Example 2 specifies a single space as the separator, which is equivalent to the default separator.
  • Example 3 specifies one or more consecutive colons or tabs as the separator. Here the consecutive spaces at the beginning of the line are counted as part of the first field.
  • Example 4 specifies one or more consecutive spaces, colons, or tabs as the separator. In this case the consecutive spaces at the beginning of the line delimit a separate, empty first field.
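The four cases summarized above can be sketched as follows. The exact commands of the original examples are not preserved in this text, so the separators below ("[:\t]+" and "[ :\t]+") and the variable names are assumptions reconstructed from the descriptions. The input is the first line of file.txt.

```shell
# First line of file.txt: five spaces, "1", a tab, the passwd entry.
line=$(printf '     1\troot:x:0:0:root:/root:/bin/bash')

# Each command prints NF and the first field in brackets.
ex1=$(printf '%s\n' "$line" | awk '{ printf "%d [%s]", NF, $1 }')                           # default FS
ex2=$(printf '%s\n' "$line" | awk -F ' ' '{ printf "%d [%s]", NF, $1 }')                    # single space
ex3=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[:\t]+" } { printf "%d [%s]", NF, $1 }')   # colons or tabs
ex4=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[ :\t]+" } { printf "%d [%s]", NF, $1 }')  # spaces, colons, tabs

echo "Example 1: $ex1"   # 2 [1]
echo "Example 2: $ex2"   # 2 [1]
echo "Example 3: $ex3"   # 8 [     1]
echo "Example 4: $ex4"   # 9 []
```

In Example 3 the leading spaces stay inside $1. In Example 4 they match the separator, so $1 is empty and the number 1 moves to $2.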

Tags: Linux

Posted on Tue, 23 Jun 2020 12:55:31 -0400 by Brudus