Traverse the document tree
An HTML or XML document becomes a document tree once Beautiful Soup (bs) has parsed it. The top-level node is a tag, which contains many child nodes; each child can be a string or another tag. The following example document is used throughout to demonstrate traversal.
html_doc = """<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
Child node
A child node can be a string or a tag. bs provides many attributes and methods for traversing child nodes, but a string has no children of its own, so traversal stops there.
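As a quick illustration (a minimal sketch using a throwaway document, not part of the original example): a tag has .contents, but its string child does not, so you cannot traverse past a string:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>hello</b>", "html.parser")

print(soup.b.contents)  # the tag's children: ['hello']
try:
    # a NavigableString cannot contain anything, so it has no .contents
    soup.b.string.contents
except AttributeError:
    print("a string has no children to traverse")
```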
Traversal by tag name
For example, in the document above, to get the first a tag you can simply write soup.a. If that child has children of its own, for instance the string inside the a tag, you can keep chaining attributes: soup.a.string returns the string child of the first a tag.
Note: a child node accessed by tag name this way (e.g. soup.a) is always the first matching tag found in the document.
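To verify this (a small sketch with a reduced document), compare the dotted-name shortcut with find_all():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>',
                     "html.parser")

# soup.a is always the first <a> in the document...
print(soup.a["id"])  # link1
# ...i.e. the same node that find_all('a') lists first
print(soup.a is soup.find_all("a")[0])  # True
```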
If you need all a tags in the current document, use the find_all() method of the BeautifulSoup object, which returns a list:
list_a = soup.find_all('a')
print(list_a)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Traversal through .contents and .children
The .contents attribute returns a tag's child nodes as a list, for example:
list_tag = soup.a.contents
print(list_tag)
# ['Elsie']
Remember that the BeautifulSoup object itself is also a document, so it also has child nodes:
list_soup = soup.contents
print(list_soup[0].name)
# html
The .children attribute lets you iterate over a tag's child nodes with a for loop:
a_tag = soup.a
for i in a_tag.children:
    print(i)
# Elsie
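One difference worth noting (my own aside, not from the original text): .contents is a real list, while .children is an iterator over the same nodes, so it must be looped over or materialized with list():

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

print(soup.p.contents)        # a list you can index: [<b>one</b>, <i>two</i>]
print(list(soup.p.children))  # an iterator over the same child nodes
```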
.descendants attribute
The .descendants attribute lets you recursively iterate over all descendant nodes of a tag:
for i in soup.head.descendants:
    print(i)
# <title>The Dormouse's story</title>
# The Dormouse's story
.string attribute
If a tag has only one child and that child is a NavigableString, or only one child whose own .string is defined, you can use .string to obtain that unique string, for example:
title_tag = soup.title
print(title_tag.string)
# The Dormouse's story

head_tag = soup.head
print(head_tag)
# <head><title>The Dormouse's story</title></head>
print(head_tag.string)
# The Dormouse's story
Note: if a tag has more than one child node (counting whitespace such as spaces and line breaks), .string returns None.
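A short sketch of both cases (using a throwaway document): a tag whose only child is another tag delegates .string to that child, while a tag with several children returns None:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p><b>only child</b></p><p>two</p></body>",
                     "html.parser")

# <p> has a single <b> child, so .string recurses into it
print(soup.p.string)     # only child
# <body> has two <p> children, so there is no unique string
print(soup.body.string)  # None
```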
.strings and .stripped_strings
If a tag contains multiple strings, you can iterate over them with .strings:
for stri in soup.strings:
    print(repr(stri))
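The heading also mentions .stripped_strings, which works like .strings but skips whitespace-only strings and strips leading and trailing whitespace from the rest; a brief sketch with a small document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  one  </p>\n<p>two</p>", "html.parser")

print(list(soup.strings))           # ['  one  ', '\n', 'two']
print(list(soup.stripped_strings))  # ['one', 'two']
```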
Parent node
Every tag or string has a parent node.
.parent
The parent node of a tag or string can be obtained through its .parent attribute, for example:
tag = soup.title
print(tag.parent)
# <head><title>The Dormouse's story</title></head>
print(soup.parent)
# None
.parents
.parents, as the name suggests, lets you iterate over all ancestors of a tag or string, ending at the BeautifulSoup object (whose own .parent is None):
tag = soup.a
for ele in tag.parents:
    print(ele.name)
# p
# body
# html
# [document]
Sibling node
Children that share the same parent are called siblings.
.next_sibling and .previous_sibling
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
print(sibling_soup.b.next_sibling)
# <c>text2</c>
print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.previous_sibling)
# <b>text1</b>
print(sibling_soup.c.next_sibling)
# None
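In a real document like html_doc above, sibling nodes are frequently strings rather than tags; the sketch below (using a pared-down version of that document) shows that the first a tag's next sibling is the text between the tags, not the second a tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>',
                     "html.parser")

# the next sibling is the string ',\n' sitting between the two tags
print(repr(soup.a.next_sibling))               # ',\n'
# the second <a> is two siblings away
print(soup.a.next_sibling.next_sibling["id"])  # link2
```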
.next_siblings and .previous_siblings
These attributes let you iterate over all of the current node's following or preceding siblings, for example:
for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ';\nand they lived at the bottom of a well.'
Back and forward
Let's look at an html document:
<html>
<head>...</head>
<body>
<a>...</a>
</body>
</html>
When bs parses this document, it handles it as a linear sequence of events: the <html> tag is opened, the <head> tag is opened, the head's content is parsed, the <head> tag is closed, the <body> tag is opened, and so on. The element attributes follow this parse order rather than the tree structure.
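This parse order can be made visible by chaining .next_element from the root (a small sketch with a trimmed document):

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<html><head><title>t</title></head></html>",
                     "html.parser")

# walk the document in the order the parser saw it
order = []
el = soup.html
while el is not None:
    # tags contribute their name, strings their text
    order.append(el.name if isinstance(el, Tag) else "text " + el)
    el = el.next_element
print(order)  # ['html', 'head', 'title', 'text t']
```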
.next_element and .previous_element
.next_element points to the object (a tag or string) that was parsed immediately after the current element. The result is sometimes the same as .next_sibling, but usually it is different:
last_a_tag = soup.find("a", id="link3")
print(repr(last_a_tag.next_element))
# 'Tillie'
print(repr(last_a_tag.next_sibling))
# ';\nand they lived at the bottom of a well.'
.previous_element is the opposite of .next_element: it points to the object that was parsed immediately before the current one:
last_a_tag = soup.find("a", id="link3")
print(repr(last_a_tag.previous_element))
# ' and\n'
print(repr(last_a_tag.previous_element.next_element))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
.next_elements and .previous_elements
Via .next_elements and .previous_elements you can iterate over everything that was parsed after or before the current element:
last_a_tag = soup.find("a", id="link3")
for element in last_a_tag.next_elements:
    print(repr(element))
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n'
# <p class="story">...</p>
# '...'
# '\n'