Day 66: Searching the Document with Beautiful Soup

by bean

About Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML or XML files. It provides simple operations for the tedious work of navigating, searching, and modifying a document. Because it is easy to use, Beautiful Soup will save you a lot of working time.

In the previous article, we introduced how to use Beautiful Soup to traverse the nodes in a document. In this article, we continue by learning how to use Beautiful Soup to find exactly the content you want in a document.

Searching documents with Beautiful Soup

To keep the story going smoothly, we continue to use the HTML text from the previous article; all the examples below are based on it.

html_doc = """
<html><head><title>index</title></head>
<body>
<p class="title"><b>home page</b></p>
<p class="main">My favorite websites
<a href="https://www.google.com" class="website" id="google">Google</a>
<a href="https://www.baidu.com" class="website" id="baidu">Baidu</a>
<a href="https://cn.bing.com" class="website" id="bing">Bing</a>
</p>
<div><!--This is the comment--></div>
<p class="content1">...</p>
<p class="content2">...</p>
</body>
</html>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "lxml")

Filters

Before we formally get into searching documents, we need to understand Beautiful Soup's filters, which run through the whole search API. They can be applied to a tag name, an attribute, a string, or any combination of them. That sounds a bit convoluted, so let's just look at a few examples.

1. Find tags by tag name. The following example finds all b tags in the document. Note that the value you pass in should be a Unicode string, to avoid encoding errors when Beautiful Soup parses it.

# demo 1
tags = soup.find_all('b')
print(tags)

#Output results
[<b>home page</b>]

2. If you pass in a regular expression object, Beautiful Soup filters the content against it using the regular expression's search() method.

# demo 2
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

#Output results
body
b

3. If you pass in a list, Beautiful Soup returns everything that matches any element of the list.

# demo 3
for tag in soup.find_all(['a', 'b']):
    print(tag)

#Output results
<b>home page</b>
<a class="website" href="https://www.google.com" id="google">Google</a>
<a class="website" href="https://www.baidu.com" id="baidu">Baidu</a>
<a class="website" href="https://cn.bing.com" id="bing">Bing</a>

4. True matches anything. The following example finds every tag in the document, but no text strings.

# demo 4
for tag in soup.find_all(True):
    print(tag.name, end=', ')
 
#Output results
html, head, title, body, p, b, p, a, a, a, div, p, p, 

5. A function. We can define a function that takes exactly one argument. If the function returns True, the current element matches and is kept; if it returns False, it is skipped. The following example finds all tags that have both a class attribute and an id attribute.

# demo 5
def has_id_class(tag):
    return tag.has_attr('id') and tag.has_attr('class')

tags = soup.find_all(has_id_class)
for tag in tags:
    print(tag)

#Output results
<a class="website" href="https://www.google.com" id="google">Google</a>
<a class="website" href="https://www.baidu.com" id="baidu">Baidu</a>
<a class="website" href="https://cn.bing.com" id="bing">Bing</a>

In most cases, a string filter is enough. Beyond that, the function filter lets us implement all kinds of custom matching logic.

find_all() function

This function searches all descendant nodes under the current node. Its signature is: find_all(name, attrs, recursive, text, **kwargs). We can pass in a tag name to find matching nodes; the examples above already cover that, so we will not repeat it here. Let's look at a few other uses.
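Before those, here is a minimal sketch of the attrs parameter from the signature, which the demos below do not otherwise cover (it reuses the same html_doc).

# sketch: attrs accepts a dict mapping attribute names to values
tags = soup.find_all("a", attrs={"class": "website", "id": "baidu"})
print(tags)

#Output results (expected)
[<a class="website" href="https://www.baidu.com" id="baidu">Baidu</a>]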

1. If a keyword argument passed to find_all() is not one of its built-in parameter names, the search treats it as a filter on a tag attribute of that name. The following example finds the node whose id is google.

The values used to search an attribute of a given name can be a string, a regular expression, a list, or True, i.e. the filters we introduced above.

# demo 6
tags = soup.find_all(id='google')
print(tags[0]['href'])

for tag in soup.find_all(id=True): # Find all tags with id attribute
    print(tag['href'])

#Output results
https://www.google.com
https://www.google.com
https://www.baidu.com
https://cn.bing.com
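
Attribute values can likewise be filtered with a regular expression or a list; a quick sketch (re was already imported in demo 2):

# sketch: regular expression and list filters on attribute values
import re

for tag in soup.find_all(href=re.compile("bing")):   # href containing "bing"
    print(tag['id'])

for tag in soup.find_all(id=["google", "baidu"]):    # id equal to either value
    print(tag['id'])

#Output results (expected)
bing
google
baidu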

2. Search by CSS class name. The word class, which CSS uses for this attribute, is a reserved keyword in Python, so starting from Beautiful Soup 4.1.1 you can use the class_ parameter to search for tags with a given CSS class name:

The class_ parameter also accepts different types of filters: a string, a regular expression, a function, or True.

# demo 7
tags = soup.find_all("a", class_="website")
for tag in tags:
    print(tag['href'])

def has_seven_characters(css_class):
    return css_class is not None and len(css_class) == 7

for tag in soup.find_all(class_=has_seven_characters):
    print(tag['id'])

#Output results
https://www.google.com
https://www.baidu.com
https://cn.bing.com
google
baidu
bing
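
A short sketch of class_ with a regular expression and with True (using the same soup object):

# sketch: class_ with a regular expression and with True
import re

for tag in soup.find_all(class_=re.compile("^web")):  # class names starting with "web"
    print(tag['id'])

print(len(soup.find_all(class_=True)))                # every tag that has a class attribute

#Output results (expected)
google
baidu
bing
7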

At the same time, because the class attribute can hold multiple CSS classes, we can search for any single one of them on its own.

# demo 8
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
tags = css_soup.find_all("p", class_="strikeout")
print(tags)

#Output results
[<p class="body strikeout"></p>]
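
Matching on more than one class at once with a string only works if it equals the class attribute exactly, in the same order; a CSS selector, sketched below, is order-independent:

# sketch: matching multiple classes at once
print(css_soup.find_all("p", class_="body strikeout"))   # exact attribute string: matches
print(css_soup.find_all("p", class_="strikeout body"))   # different order: no match
print(css_soup.select("p.strikeout.body"))               # CSS selector: order does not matter

#Output results (expected)
[<p class="body strikeout"></p>]
[]
[<p class="body strikeout"></p>]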

3. Besides searching the document by tag name and CSS class, you can also search by content using the text parameter. The text parameter can be combined with other arguments to complete a search.

# demo 9
tags = soup.find_all(text="Google")
print("google : ", tags)

tags = soup.find_all(text=["Baidu", "Bing"])
print("baidu & bing : ", tags)

tags = soup.find_all('a', text="Google")
print("a[text=google] : ", tags)

#Output results
google :  ['Google']
baidu & bing :  ['Baidu', 'Bing']
a[text=google] :  [<a class="website" href="https://www.google.com" id="google">Google</a>]

4. Limit the number of results

Sometimes the document tree is too large and we don't want to search all of it: we only want a certain number of results, or only the direct children rather than all descendants. Just specify the limit or recursive parameter.

# demo 10
tag = soup.find_all("a", limit=1)
print(tag)

tags = soup.find_all("p", recursive=False)
print(tags)

#Output results
[<a class="website" href="https://www.google.com" id="google">Google</a>]
[]

Because the only direct child of the soup object is the html tag, there is no p tag at that level, so an empty list is returned.

find() function

This function returns only one result and is equivalent to find_all(some_args, limit=1). The difference is that find() returns the result directly, while find_all() returns a list of results. In addition, when nothing matches, find_all() returns an empty list, whereas find() returns None. There are no other differences in use.
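
A minimal sketch of the difference, reusing the soup object from above:

# sketch: find() vs find_all()
print(soup.find("a"))               # the first <a> tag itself
print(soup.find_all("a", limit=1))  # the same tag, wrapped in a list
print(soup.find("table"))           # no match: None
print(soup.find_all("table"))       # no match: empty list

#Output results (expected)
<a class="website" href="https://www.google.com" id="google">Google</a>
[<a class="website" href="https://www.google.com" id="google">Google</a>]
None
[]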

Other functions

Besides find_all() and find(), Beautiful Soup provides ten more search APIs. Five of them take the same search parameters as find_all(), and the other five take the same parameters as find(); the only difference is the part of the document they search.

find_parents() and find_parent() search the parents of the current node.

find_next_siblings() and find_next_sibling() search the siblings parsed after the current node.

find_previous_siblings() and find_previous_sibling() search the siblings parsed before the current node.

find_all_next() and find_next() search the tags and strings that come after the current node.

find_all_previous() and find_previous() search the tags and strings that come before the current node.

In each of the five pairs above, the former returns a list of all nodes that meet the search criteria, while the latter returns only the first node that does.

Since these ten APIs are used in exactly the same way as find_all() and find(), they are not demonstrated one by one here; readers can explore them on their own.
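
Still, a minimal sketch of two of them (assuming the same soup object) shows that only the search scope changes:

# sketch: same arguments as find()/find_all(), different search scope
google = soup.find("a", id="google")

print(google.find_parent("p")['class'])      # the nearest <p> ancestor
print(google.find_next_sibling("a")['id'])   # the next <a> sibling

#Output results (expected)
['main']
baidu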

CSS selector

Tags can also be found with CSS selector syntax by passing a string to the .select() method of a Tag or BeautifulSoup object.

1. Search through descendant tags, level by level.

# demo 11
tags = soup.select("body a")
for tag in tags:
    print(tag['href'])

#Output results
https://www.google.com
https://www.baidu.com
https://cn.bing.com

2. Find the direct child tags under a tag

# demo 12
tags = soup.select("p > a")
print(tags)

tags = soup.select("p > #google")
print(tags)

#Output results
[<a class="website" href="https://www.google.com" id="google">Google</a>, <a class="website" href="https://www.baidu.com" id="baidu">Baidu</a>, <a class="website" href="https://cn.bing.com" id="bing">Bing</a>]
[<a class="website" href="https://www.google.com" id="google">Google</a>]

3. Find directly by CSS class name

# demo 13
tags = soup.select(".website")
for tag in tags:
    print(tag.string)

#Output results
Google
Baidu
Bing

4. Find by the id attribute of a tag

# demo 14
tags = soup.select("#google")
print(tags)

#Output results
[<a class="website" href="https://www.google.com" id="google">Google</a>]

5. Find by the value of an attribute

# demo 15
tags = soup.select('a[href="https://cn.bing.com"]')
print(tags)

#Output results
[<a class="website" href="https://cn.bing.com" id="bing">Bing</a>]

Summary of Beautiful Soup

This chapter introduced Beautiful Soup's document-searching operations. Mastering these APIs helps us locate the nodes we want faster and more precisely. Don't be scared by the number of functions: we really only need to master find_all() and find(), and the other APIs are used in the same way. With a little practice you can get up to speed quickly.

Code address

Example code: https://github.com/JustDoPython/python-100-day/tree/master/day-066

References

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#

Follow the official account "python technology" and reply with "python" to learn and communicate together.
