Python third-party library BeautifulSoup4: documentation study notes

Traverse the document tree

An HTML or XML document becomes a document tree after BeautifulSoup processes it. The top-level node is a tag, and that tag contains many child nodes, which can be strings or other tags. Let's learn to traverse the document tree with an example document.

html_doc = """<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
      and they lived at the bottom of a well.
    </p>
  </body>
</html>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Child nodes

A child node can be a string or a tag. bs4 provides many operations and attributes for traversing child nodes, but note that a string has no children of its own, so it does not support further traversal.

Traversal by tag name

For example, in the example document above, if you want to get the first <a> tag, you can simply write soup.a. Dotted access can be chained when a child node has children of its own: the <a> tag has a string child node, so soup.a.string retrieves that string from the soup object.

Note: a tag obtained this way is always the first tag with that name found in the document.
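
A quick sketch of both accesses against the example document above:

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.a.string)
# Elsie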

If you need to get all <a> tags in the current document, you can use the find_all() method of the BeautifulSoup object, which returns a list:

list_a = soup.find_all('a')
print(list_a)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Traversal through .contents and .children

Through the .contents attribute, the child nodes of a tag can be obtained as a list, for example:

list_tag = soup.a.contents
print(list_tag)

# ['Elsie']

Remember that the BeautifulSoup object itself is also a document, so it also has child nodes:

list_soup = soup.contents
print(list_soup[0].name)

# html

Through the tag.children attribute, you can loop over the child nodes of a tag with a for statement:

a_tag = soup.a
for i in a_tag.children:
    print(i)

# Elsie

.descendants property

Through .descendants, you can recursively iterate over all descendant nodes under a tag:

for i in soup.head.descendants:
    print(i)

# <title>The Dormouse's story</title>
# The Dormouse's story
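
The difference from .children is easy to see by counting, a small sketch on the head tag of the example document:

head_tag = soup.head
print(len(list(head_tag.children)))
print(len(list(head_tag.descendants)))

# 1   (the <title> tag is the only direct child)
# 2   (the <title> tag plus its string)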

.string attribute

If a tag has only one child and that child is a NavigableString, or if its only child is a tag that itself has a .string, you can use .string to obtain that unique string, for example:

title_tag = soup.title
print(title_tag.string)
# The Dormouse's story

head_tag = soup.head
print(head_tag)
print(head_tag.string)
# <head><title>The Dormouse's story</title></head>
# The Dormouse's story

Note: if a tag has more than one child node (strings of spaces and line breaks count too), then .string returns None.
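
For example, the <html> tag in the example document has several direct children (including newline strings), so:

print(soup.html.string)

# None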

.strings and .stripped_strings

If a tag contains multiple strings, you can use .strings to loop through them:

for stri in soup.strings:
    print(repr(stri))
    
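These strings keep all their whitespace. The .stripped_strings generator works the same way but strips leading and trailing whitespace and skips strings that consist entirely of whitespace:

for stri in soup.stripped_strings:
    print(repr(stri))

# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'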

Parent nodes

Every tag or string has a parent node.

.parent

The parent node of a tag or string can be obtained through its .parent attribute, for example:

tag = soup.title
print(tag.parent)
# <head><title>The Dormouse's story</title></head>
print(soup.parent)
# None
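
A string works the same way; its parent is the tag that directly contains it, sketched here on the title string:

title_string = soup.title.string
print(title_string.parent)

# <title>The Dormouse's story</title>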

.parents

.parents, as the name suggests, lets you iterate over all the parents of a tag or string, all the way up until None is reached:

tag = soup.a
for ele in tag.parents:
    print(ele.name)

# p
# body
# html
# [document]

Sibling nodes

Nodes that have the same parent and sit at the same level are called siblings.

.next_sibling and .previous_sibling

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

print(sibling_soup.b.next_sibling)
print(sibling_soup.b.previous_sibling)
print(sibling_soup.c.previous_sibling)
print(sibling_soup.c.next_sibling)

# <c>text2</c>
# None
# <b>text1</b>
# None
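
In a real document, the direct sibling of a tag is usually a string of punctuation or whitespace rather than the next tag. In the example document, the first <a> tag's .next_sibling is the text between it and the second <a> tag:

print(repr(soup.a.next_sibling))
print(soup.a.next_sibling.next_sibling)

# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>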

.next_siblings and .previous_siblings

These let you iterate over all of the current node's following or preceding siblings, for example:

for sibling in soup.a.next_siblings:
    print(repr(sibling))

# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ';\nand they lived at the bottom of a well.'

Going back and forth

Let's look at an HTML document:

<html>
    <head>...</head>
    <body>
        <a>...</a>
    </body>
</html>

When bs4 parses this document, it processes it in sequence: open the <html> tag, open the <head> tag, handle the content inside the head, close the <head> tag, open the <body> tag, and so on. The attributes below follow this parse order rather than the tree structure.

.next_element and .previous_element

.next_element points to the object (a tag or string) that was parsed immediately after the current element. The result is sometimes the same as .next_sibling, but usually it is different:

last_a_tag = soup.find("a", id="link3")
print(repr(last_a_tag.next_element))
print(repr(last_a_tag.next_sibling))

# 'Tillie'
# ';\nand they lived at the bottom of a well.'

.previous_element is the opposite of .next_element: it points to the object that was parsed immediately before the current one:

last_a_tag = soup.find("a", id="link3")
print(repr(last_a_tag.previous_element))
print(repr(last_a_tag.previous_element.next_element))

# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

.next_elements and .previous_elements

Via .next_elements and .previous_elements, you can iterate over all the objects that were parsed after or before the current element:

last_a_tag = soup.find("a", id="link3")
for element in last_a_tag.next_elements:
    print(repr(element))

# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n'
# <p class="story">...</p>
# '...'
# '\n'
