Python regular expression explanation

I. re

Let's start with the functions in the re module, which can be used as they are. This assumes you already know a little about regular expressions. If you don't know them at all, read the regular expression part further down first.

1.1 match

The match method matches a pattern from the beginning of the string, and returns None if the match fails.

re.match(pattern, string, flags=0)

pattern: the regular expression
string: the string to be matched
flags: the matching mode (case sensitivity, single-line or multi-line matching)

On success, match returns a re.Match object. The methods of Match are described in detail later.

import re

content = "Cats are smarter than dogs"

# The first parameter is a regular expression, and re.I means ignore case.
match = re.match(r'(cats)', content, re.I)
print(type(match))
print(match.groups())

match = re.match(r'dogs', content, re.I)
print(type(match))
# match is None here, so calling match.groups() would raise AttributeError
# print(match.groups())

match is mainly used to capture groups. groups() returns only the text captured by groups, so a pattern without groups yields an empty tuple; use group() to get the whole match. If flags is re.I, case is ignored.
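
As a minimal sketch of the difference between group() and groups(), the snippet below reuses the sample sentence from the example above. The variable names are only illustrative.

```python
import re

content = "Cats are smarter than dogs"

# With a capturing group: groups() holds the captured text.
with_group = re.match(r'(cats)', content, re.I)
# Without a capturing group: groups() is empty, but group() still works.
without_group = re.match(r'cats', content, re.I)
# 'dogs' is not at the start of the string, so match returns None.
no_match = re.match(r'dogs', content, re.I)

print(with_group.groups())     # ('Cats',)
print(without_group.groups())  # ()
print(without_group.group())   # Cats
print(no_match)                # None
```

Checking the result with `if match:` before calling any method avoids an AttributeError on None.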

Another very important point is that match returns only the first match, at the start of the string, even if the pattern occurs again later:

import re

content = "aa aa smarter aa dogs"

match = re.match(r'(aa)', content, re.I)
if match:
    print(match.groups())

The output above is: ('aa',)

1.2 search

search scans the entire string and returns the first successful match. Unlike match, search does not require the match to start at the beginning of the string.

re.search(pattern, string, flags=0)

import re

content = '+123abc456*def789ghi'

# \w matches [a-zA-Z0-9_]; + means one or more times
reg = r"\w+"
match = re.search(reg, content)

if match:
    print(match.group())
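
To make the contrast with match concrete, here is a small sketch that runs both functions on the same string and pattern. The variable names are illustrative only.

```python
import re

content = '+123abc456*def789ghi'

# match is anchored at index 0, where the character is '+', so \w+ fails.
anchored = re.match(r"\w+", content)
# search scans forward and stops at the first run of word characters.
scanned = re.search(r"\w+", content)

print(anchored)         # None
print(scanned.group())  # 123abc456
print(scanned.span())   # (1, 10)
```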

1.3 sub

Replaces the matches in a string.

re.sub(pattern, repl, string, count=0, flags=0)

pattern: the regular expression
repl: the replacement string; it can also be a function
string: the string to search and replace in
count: the maximum number of replacements; the default, 0, replaces all matches
flags: optional matching mode; the default is 0

Masking an offensive word:

import re

content = "do something fuck you"

rs = re.sub(r'fuck', "*", content)
print(rs)

It's very simple: the word is replaced with a single *.

Now suppose the requirement changes: a blocked word must be replaced with as many * characters as it has letters. How do we handle that?

import re

def calcWords(matched):
    num = len(matched.group())
    return str(num * '*')

content = "do something fuck you"

rs = re.sub(r'fuck', calcWords, content)
print(rs)

repl can be a function: it receives the Match object for each match and returns the replacement string, so the replacement is easy to compute.
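
The count parameter and a function as repl can be sketched as follows. The sample text and the helper name mask are made up for illustration.

```python
import re

content = "spam eggs spam ham spam"

# count=2 limits the replacement to the first two matches.
limited = re.sub(r'spam', '****', content, count=2)


def mask(matched):
    # Keep the first letter and mask the rest of the matched word.
    word = matched.group()
    return word[0] + '*' * (len(word) - 1)


masked = re.sub(r'spam', mask, content)

print(limited)  # **** eggs **** ham spam
print(masked)   # s*** eggs s*** ham s***
```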

1.4 findall

Finds all the substrings in the string that match the regular expression and returns them as a list. If no match is found, an empty list is returned.

re.findall(pattern, string, flags=0)

pattern: the regular expression
string: the string to be matched
flags: optional matching mode; the default is 0

import re

content = '+123abc456*def789ghi'

reg = r"\d+"
rs = re.findall(reg, content)
# ['123', '456', '789']
print(rs)

One quirk of findall is that if the regular expression contains groups, only the text matched by the groups is returned.

import re

content = '+123abc456*def789ghi'

reg = r"\d+([a-z]+)"
rs = re.findall(reg, content)
# ['abc', 'ghi']
print(rs)
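
If the whole match is wanted, one possible workaround is a non-capturing group (?:...). With more than one capturing group, findall returns tuples. A short sketch on the same sample string:

```python
import re

content = '+123abc456*def789ghi'

# Capturing group: only the group text is returned.
only_groups = re.findall(r"\d+([a-z]+)", content)
# Non-capturing group (?:...): the whole match is returned.
whole = re.findall(r"\d+(?:[a-z]+)", content)
# Two capturing groups: each result is a tuple of the groups.
pairs = re.findall(r"(\d+)([a-z]+)", content)

print(only_groups)  # ['abc', 'ghi']
print(whole)        # ['123abc', '789ghi']
print(pairs)        # [('123', 'abc'), ('789', 'ghi')]
```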

1.5 finditer

Find all the substrings that the regular expression matches in the string and return them as an iterator

re.finditer(pattern, string, flags=0)

pattern: the regular expression
string: the string to be matched
flags: optional matching mode; the default is 0

import re

content = '+123abc456*def789ghi'

reg = r"\d+"
rss = re.finditer(reg, content)

# 123 456 789 
for rs in rss:
    print(rs.group(), end=' ')

finditer is similar to findall, but it does not have findall's maddening habit of returning only the groups when the pattern contains groups.

import re

content = '+123abc456*def789ghi'

reg = r"\d+([a-z]+)"
rss = re.finditer(reg, content)

# 123abc 789ghi
for rs in rss:
    print(rs.group(), end=' ')
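
Because finditer yields Match objects, the full match, the groups, and the positions are all available at once. A minimal sketch (the list name results is illustrative):

```python
import re

content = '+123abc456*def789ghi'

# Collect the whole match, the first group, and the span of every match.
results = [(m.group(), m.group(1), m.span())
           for m in re.finditer(r"\d+([a-z]+)", content)]

for whole, letters, span in results:
    print(whole, letters, span)
# 123abc abc (1, 7)
# 789ghi ghi (14, 20)
```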

1.6 split

Splits the string at every substring that matches the pattern and returns a list.

re.split(pattern, string, maxsplit=0, flags=0)

import re

content = '+123abc456*def789ghi'

reg = r"\d+"
rs = re.split(reg, content)

print(rs)
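
The following sketch shows the result of the split above, together with the maxsplit parameter and the effect of a capturing group in the pattern, which keeps the separators in the result. The variable names are illustrative.

```python
import re

content = '+123abc456*def789ghi'

# Plain split: the digit runs are dropped.
plain = re.split(r"\d+", content)
# maxsplit=1: split only at the first match.
limited = re.split(r"\d+", content, maxsplit=1)
# Capturing group: the separators are kept in the result list.
kept = re.split(r"(\d+)", content)

print(plain)    # ['+', 'abc', '*def', 'ghi']
print(limited)  # ['+', 'abc456*def789ghi']
print(kept)     # ['+', '123', 'abc', '456', '*def', '789', 'ghi']
```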

1.7 compile

Compiles a regular expression into a Pattern object. The functions above compile their pattern argument in the same way to get a Pattern object, and then call the method of the same name on that object.

re.compile(pattern, flags=0)

Next, we will introduce the Pattern object.

II. Pattern

A Pattern object is a compiled regular expression. Pattern cannot be instantiated directly; it must be created with re.compile().

2.1 attribute

Attribute     Description
pattern       The regular expression string that was compiled
flags         The matching flags used at compile time, as a number
groups        The number of groups in the expression
groupindex    A dictionary; the key is the name of a group, the value is its number

import re

pattern = re.compile(r'(\w+)(?P<gname>.*)', re.S)

# pattern: (\w+)(?P<gname>.*)
print("pattern:", pattern.pattern)
# flags: 48
print("flags:", pattern.flags)
# groups: 2
print("groups:", pattern.groups)
# groupindex: {'gname': 2}
print("groupindex:", pattern.groupindex)

2.2 method

The functions described above for the re module are also available as Pattern methods, just without the pattern parameter.

In fact, the functions in the re module use their pattern argument to build a Pattern object through re.compile and then call the corresponding method on it.
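
A minimal sketch of using a compiled Pattern directly, reusing the sample string from the earlier sections. Pattern.search also accepts an optional start position as its second argument.

```python
import re

pattern = re.compile(r"\d+")
content = '+123abc456*def789ghi'

# Same names as the module-level functions, minus the pattern argument.
found = pattern.findall(content)
replaced = pattern.sub('#', content)
# Start searching at index 4, which skips the first run of digits.
later = pattern.search(content, 4)

print(found)          # ['123', '456', '789']
print(replaced)       # +#abc#*def#ghi
print(later.group())  # 456
```

Compiling once is mainly a convenience when the same expression is used in several places.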

III. Match

A Match object is the result of a successful match and holds the information about it. You can use the attributes and methods of Match to get that information.

3.1 attribute

Attribute    Description
string       The text that was matched against
re           The Pattern object that produced this match
pos          The index in the text where the search started
endpos       The index in the text where the search ended
lastindex    The index of the last captured group; None if no group was captured
lastgroup    The name of the last captured group; None if it has no name or no group was captured

import re

content = "123456<H1>first</h1>123456"

reg = r'\d+<[hH](?P<num>[1-6])>.*?</[hH](?P=num)>'

match = re.match(reg, content)

# string: 123456<H1>first</h1>123456
print("string:", match.string)
# re: re.compile('\\d+<[hH](?P<num>[1-6])>.*?</[hH](?P=num)>')
print("re:", match.re)
# pos: 0
print("pos:", match.pos)
# endpos: 26
print("endpos:", match.endpos)
# lastindex: 1
print("lastindex:", match.lastindex)
# lastgroup: num
print("lastgroup:", match.lastgroup)

I feel that the attributes of Match are of limited use; its methods are more helpful.

3.2 method

Method                  Description
groups()                Returns a tuple with the strings matched by all groups
group([group1, ...])    Returns the string matched by one group, or a tuple when several groups are given
start(group)            Returns the start position of the group in the original string
end(group)              Returns the end position of the group in the original string
span(group)             Returns the (start, end) positions of the group in the original string as a tuple
groupdict()             Returns a dictionary of the named groups, keyed by group name
expand(template)        Returns the template string with group references by number or name filled in

Note: parameterless group is equivalent to group(0), which returns the whole matched string

import re

match = re.match(r'(\w+) (\w+) (?P<name>.*)', 'You love sun')

# groups(): ('You', 'love', 'sun')
print("groups():", match.groups())
# group(2,3): ('love', 'sun')
print("group(2,3):", match.group(2, 3))
# start(2): 4
print("start(2):", match.start(2))
# end(2): 8
print("end(2):", match.end(2))
# span(2): (4, 8)
print("span(2):", match.span(2))
# groupdict(): {'name': 'sun'}
print("groupdict():", match.groupdict())
# expand(r'I \2 \1!'): I love You!
print(r"expand(r'I \2 \1!'):", match.expand(r'I \2 \1!'))

The Match methods above are the important part, because they are how we actually get at the matched text.
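
Named groups work with the same methods. The sketch below uses a made-up date string and hypothetical group names (year, month, day) to show group() by name, groupdict(), span(), and expand() with \g<name> references.

```python
import re

match = re.match(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
                 '2019-10-30 20:55')

whole = match.group(0)         # the whole matched text
year = match.group('year')     # a group looked up by name
parts = match.groupdict()      # all named groups as a dictionary
month_span = match.span('month')
# \g<name> refers to a named group inside a template.
swapped = match.expand(r'\g<day>/\g<month>/\g<year>')

print(whole)       # 2019-10-30
print(year)        # 2019
print(parts)       # {'year': '2019', 'month': '10', 'day': '30'}
print(month_span)  # (5, 7)
print(swapped)     # 30/10/2019
```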

IV. Regular expression

4.1 commonly used

Expression    Description
.             Matches any character except a line break; with the re.S flag, it matches any character including a line break
?             Matches the preceding expression 0 or 1 times; placed after another quantifier (*?, +?, ??) it makes that quantifier non-greedy
+             Matches the preceding expression 1 or more times
*             Matches the preceding expression 0 or more times
[]            A set of characters, listed individually; [abc] matches 'a', 'b', or 'c'
[^]           Characters not in the set; [^abc] matches any character other than a, b, c
^             Matches the beginning of the string; in multiline mode, also the beginning of each line
\A            Matches the beginning of the string
$             Matches the end of the string; in multiline mode, also the end of each line
\Z            Matches the end of the string
{n}           Exactly n times; "fo{2}d" matches food, but not fod or foood
{n,}          At least n times; "fo{2,}d" matches food and foood, but not fod
{n,m}         From n to m times; "fo{2,3}d" matches food and foood, but not fod or fooood
|             a|b matches a or b
-             Inside [], denotes a range; [0-9] matches any digit from 0 to 9

The most commonly used is ., which matches any single character: a.b can match abb, acb, adb, a+b, a8b, etc.

? matches 0 or 1 times: abb? can match ab and abb, but not abbabb, because ? applies only to the fragment immediately before it.

+ matches at least once: abb+ can match abb, abbb, abbbb, etc., but cannot match ab.

* matches 0 or more times: abb* can match ab, abb, abbb, abbbb, etc.

[] holds a set of characters, and the relationship between the characters is "or".
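
The quantifiers can be tried out directly. This is a small sketch with made-up input strings; it also shows the non-greedy variants +? and *?, which match as little as possible.

```python
import re

text = 'abbbb'

optional = re.match(r'ab?', text).group()      # at most one b
one_or_more = re.match(r'ab+', text).group()   # greedy: all the b's
zero_or_more = re.match(r'ab*', 'a').group()   # zero b's is fine
lazy_plus = re.match(r'ab+?', text).group()    # non-greedy: one b
lazy_star = re.match(r'ab*?', text).group()    # non-greedy: no b
char_set = re.findall(r'[abc]', 'a1b2c3d4')    # any one of a, b, c

print(optional)      # ab
print(one_or_more)   # abbbb
print(zero_or_more)  # a
print(lazy_plus)     # ab
print(lazy_star)     # a
print(char_set)      # ['a', 'b', 'c']
```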

4.2 boundaries and whitespace

Expression    Description
\t            Tab character
\n            Line feed
\r            Carriage return
\f            Form feed
\w            Matches digits, letters, and the underscore; for ASCII text, equivalent to [a-zA-Z0-9_]
\W            Matches anything that is not a digit, letter, or underscore; for ASCII text, equivalent to [^a-zA-Z0-9_]
\s            Matches a whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
\S            Matches a non-whitespace character; for ASCII text, equivalent to [^ \t\n\r\f\v]
\d            Matches a digit; for ASCII text, equivalent to [0-9]
\D            Matches a non-digit; for ASCII text, equivalent to [^0-9]
\b            Matches a word boundary; 'er\b' can match the 'er' in 'over', but not the 'er' in 'service'
\B            Matches a non-word boundary; 'er\B' can match the 'er' in 'service', but not the 'er' in 'over'
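
A short sketch of \b, \B, \d and \s in action. The sample sentence is made up, built around the words 'over' and 'service' from the table.

```python
import re

text = 'over the service counter'

# Words that end in 'er': \b requires a word boundary after 'er'.
ends_in_er = re.findall(r'\w*er\b', text)
# 'er' that is followed by another word character (no boundary).
inner_er = [m.start() for m in re.finditer(r'er\B', text)]

digits = re.findall(r'\d+', 'a1 b22 c333')
fields = re.split(r'\s+', 'a \t b\nc')

print(ends_in_er)  # ['over', 'counter']
print(inner_er)    # [10]
print(digits)      # ['1', '22', '333']
print(fields)      # ['a', 'b', 'c']
```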

4.3 grouping

Expression    Description
(re)          A capturing group; nested groups are numbered from left to right, from outside to inside
\number       A backreference to a group; \1, \2, \3 ... refer to groups 1, 2, 3 ...
(?P<name>)    A named group; name is used as the alias of the group
(?P=name)     A backreference to a named group

The most important use of grouping is backreferences, that is, referring back to text that a group has already matched.

Exercise: how would you match all the h tags (h1 to h6) in HTML?

reg = '<[hH][1-6]>.*?</[hH][1-6]>'

Many people would write an expression like the one above. What is wrong with it?

Here's an example:

import re

content = '''
    <html>
    <body>
      <H1>first</h1>
      <p>p tag</p>
      <h2>h2</h2>
      <h3>Illegal label</h4>
    </body>
    </html>
'''

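rs = re.findall(r'<[hH][1-6]>.*?</[hH][1-6]>', content)
# ['<H1>first</h1>', '<h2>h2</h2>', '<h3>Illegal label</h4>']
print(rs)

rs = re.findall(r'<[hH]([1-6])>.*?</[hH]\1>', content)
# ['1', '2']
print(rs)

rs = re.findall(r'((<[hH]([1-6])>).*?</[hH]\3>)', content)
# [('<H1>first</h1>', '<H1>', '1'), ('<h2>h2</h2>', '<h2>', '2')]
print(rs)

rs = re.findall(r'((<[hH](?P<num>[1-6])>).*?</[hH](?P=num)>)', content)
# [('<H1>first</h1>', '<H1>', '1'), ('<h2>h2</h2>', '<h2>', '2')]
print(rs)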

Looking at the output, we can see that with

reg = '<[hH][1-6]>.*?</[hH][1-6]>'

the mismatched '<h3>Illegal label</h4>' part is matched as well.

We can solve this problem by grouping and then referring to grouping.

reg1 = r'<[hH]([1-6])>.*?</[hH]\1>'
reg2 = r'((<[hH]([1-6])>).*?</[hH]\3>)'

Because findall returns only the groups when the pattern contains groups, we use reg2, which wraps the whole tag in an outer group.

Why is it \3?

Because according to the principle of "from left to right, from outside to inside", ([1-6]) is the third group.

If you don't want to count, or are afraid of miscounting, you can use a named group.

reg = '((<[hH](?P<num>[1-6])>).*?</[hH](?P=num)>)'

4.4 lookahead and lookbehind

Expression    Description
(?=re)        Positive lookahead; a(?=\d) matches an a that is followed by a digit
(?!re)        Negative lookahead; a(?!\d) matches an a that is not followed by a digit
(?<=re)       Positive lookbehind; (?<=\d)a matches an a that is preceded by a digit
(?<!re)       Negative lookbehind; (?<!\d)a matches an a that is not preceded by a digit

Lookahead and lookbehind are also very useful. Their key property is that they do not consume the re part: it must be present (or absent) at that position, but it is not included in the match. Let's look at an example to help understand.

import re

content = '''
http://www.mycollege.vip
https://mail.mycollege.vip
ftp://ftp.mycollege.vip
'''

# Lookahead: match the scheme in front of ':', without consuming the ':'
rs = re.findall(r'.+(?=:)', content)
# ['http', 'https', 'ftp']
print(rs)

# Lookbehind: match the domain name after '//', without consuming the '//'
rs = re.findall(r'(?<=//).+', content)
# ['www.mycollege.vip', 'mail.mycollege.vip', 'ftp.mycollege.vip']
print(rs)

# Lookbehind: match the number after '$', without consuming the '$'
price = '''
item1:$99.9
CX99:$199
ZZ88:$999
'''

rs = re.findall(r'(?<=\$)[0-9.]+', price)
# ['99.9', '199', '999']
print(rs)

# Lookbehind and lookahead combined: the text between <title> and </title>
title = '''
<head>
    <title>this is title</title>
</head>
'''

rs = re.findall(r'(?<=<title>).*(?=</title>)', title)
# ['this is title']
print(rs)
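
The examples above cover only the positive forms. As a sketch of the negative forms a(?!\d) and (?<!\d)a, the snippet below records the index of every matching 'a' in a made-up string.

```python
import re

text = 'a1 a2 ab 3a xa'

# Negative lookahead: an 'a' that is NOT followed by a digit.
not_before_digit = [m.start() for m in re.finditer(r'a(?!\d)', text)]
# Negative lookbehind: an 'a' that is NOT preceded by a digit.
not_after_digit = [m.start() for m in re.finditer(r'(?<!\d)a', text)]

print(not_before_digit)  # [6, 10, 13]
print(not_after_digit)   # [0, 3, 6, 13]
```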

4.5 other matching

Expression       Description
(?:re)           Non-capturing version of (re); (?:abc){2} matches abcabc
(?imsuxL:re)     Applies the flags named by the letters inside the parentheses; (?i:abc) also matches ABC
(?-imsx:re)      Turns the named flags off inside the parentheses (in Python, only i, m, s, and x can be turned off)
(?#...)          A comment; the content after # is ignored
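
A brief sketch of these constructs with made-up inputs. Scoped inline flags such as (?i:...) are assumed to be available, which requires Python 3.6 or later.

```python
import re

# (?:abc){2} repeats the group without capturing anything.
repeated = re.fullmatch(r'(?:abc){2}', 'abcabc')
# (?i:abc) ignores case for 'abc' only; 'def' stays case sensitive.
scoped = re.findall(r'(?i:abc)def', 'ABCdef abcDEF')
# (?#...) is a comment and does not affect matching.
commented = re.findall(r'\d+(?#one or more digits)', 'a1b22')

print(repeated.group())   # abcabc
print(repeated.groups())  # ()
print(scoped)             # ['ABCdef']
print(commented)          # ['1', '22']
```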

4.6 flags

Flag    Description
re.I    IGNORECASE, makes matching case insensitive
re.M    MULTILINE, multi-line matching; affects ^ and $
re.S    DOTALL, makes . match any character, including newline
re.X    VERBOSE, ignores whitespace and comments in the expression and allows comments to be added with #
re.L    LOCALE
re.U    UNICODE
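
As a rough illustration of re.S and re.X, and of combining flags with |, consider the sketch below. The strings and the phone-number-like pattern are invented for the example.

```python
import re

text = 'first\nsecond'

# Without re.S, '.' does not match the newline between the two words.
without_dotall = re.findall(r'first.second', text)
with_dotall = re.findall(r'first.second', text, re.S)

# re.X lets the pattern be spread over lines with comments.
verbose = re.compile(r"""
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
""", re.X)
parts = verbose.match('555-1234').groups()

# Flags are combined with the | operator.
combined = re.findall(r'^line.end$', 'LINE\nEND', re.I | re.S)

print(without_dotall)  # []
print(with_dotall)     # ['first\nsecond']
print(parts)           # ('555', '1234')
print(combined)        # ['LINE\nEND']
```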

re.M is the multi-line matching mode:

  1. ^ can match the beginning of the string, or the beginning of a line (right after a \n in the string)
  2. $ can match the end of the string, or the end of a line (right before a \n in the string)

Single-line mode:

  1. ^ is equivalent to \A
  2. $ is equivalent to \Z, except that $ also matches just before a newline at the very end of the string

import re

content = '''
first line
second line
third line
'''

# ['first', 'second', 'third']
rs = re.findall(r'^(.*) line$', content, re.M)
# []
# rs = re.findall(r'^(.*) line$', content)
# []
# rs = re.findall(r'\A(.*) line\Z', content, re.M)
print(rs)

In the small example above, the multi-line mode matches successfully and the single-line mode does not. Without re.M, ^ cannot match the position after a \n, and the content starts with a line break, so nothing matches at the beginning.

Conversely, when deciding whether to use re.M, we only need to check whether the pattern contains ^ or $. If it does not, re.M is unnecessary. If it does, we need to consider whether ^ and $ should match the positions right after and right before a \n (line break). If so, add re.M.
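
The effect of re.M on ^ and $, and the small difference between $ and \Z, can be checked with a sketch like this one. The text is made up and deliberately ends with a line break.

```python
import re

text = 'first line\nsecond line\n'

# Without re.M, $ matches only at the end (or before a final newline).
single_dollar = re.findall(r'line$', text)
# \Z matches only at the very end, and the string ends with '\n'.
strict_end = re.findall(r'line\Z', text)
# With re.M, $ matches at the end of every line.
multi_dollar = re.findall(r'line$', text, re.M)

# The same applies to ^ at the start of each line.
single_caret = re.findall(r'^\w+', text)
multi_caret = re.findall(r'^\w+', text, re.M)

print(single_dollar)  # ['line']
print(strict_end)     # []
print(multi_dollar)   # ['line', 'line']
print(single_caret)   # ['first']
print(multi_caret)    # ['first', 'second']
```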

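re.L and re.U are harder to understand, and they changed a lot between Python 2.x and 3.x. In Python 3 they are rarely needed: re.U is already the default for str patterns, and re.L applies only to bytes patterns. If you are interested, you can read the documentation.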

V. Documentation

re


Posted on Wed, 30 Oct 2019 20:55:40 -0400 by severndigital