Python and html parsing

This article was originally written by CDFMLR and is published on my personal home page https://clownote.github.io as well as on CSDN. I cannot guarantee that the CSDN layout is correct; please visit clownote for a better reading experience.

Regular Expressions

Regular expression is a powerful tool for string processing. It has its own specific grammar structure, which can efficiently implement string retrieval, replacement, matching validation and other operations.

Regular expressions can be used to extract the information we want from HTML, just as we did earlier when crawling Zhihu Explore questions with the requests library.

The following table lists common matching rules for regular expressions:

Pattern Description
\w Matches alphanumerics and underscores
\W Matches non-alphanumerics and non-underscores
\s Matches any whitespace character, equivalent to [ \t\n\r\f\v]
\S Matches any non-whitespace character
\d Matches any digit, equivalent to [0-9]
\D Matches any non-digit
\A Matches the start of the string
\Z Matches the end of the string; if a trailing line break exists, matches just before it
\z Matches the end of the string
\G Matches the point where the last match finished
\n Matches a line break
\t Matches a tab
^ Matches the beginning of the string
$ Matches the end of the string
. Matches any character except a line break; when the re.DOTALL flag is specified, matches any character including line breaks
[...] Represents a set of characters, listed individually: [amk] matches 'a', 'm', or 'k'
[^...] Matches characters not in the brackets: [^abc] matches any character other than a, b, or c
* Matches 0 or more of the preceding expression
+ Matches 1 or more of the preceding expression
? Matches 0 or 1 of the preceding expression; also makes a quantifier non-greedy
{n} Matches exactly n repetitions of the preceding expression
{n, m} Matches n to m repetitions of the preceding expression, greedy
a|b Matches a or b
( ) Matches the expression inside the parentheses, which also marks a group

With greedy matching, .* matches as many characters as possible; with non-greedy matching, .*? matches as few characters as possible.
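
For example, a quick sketch of the difference (the sample string here is made up for illustration):

import re

s = '<a>first</a><a>second</a>'
print(re.findall(r'<a>(.*)</a>', s))   # greedy: ['first</a><a>second']
print(re.findall(r'<a>(.*?)</a>', s))  # non-greedy: ['first', 'second']
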
[Note] Escaping: to match one of the symbols in the table literally in a regular expression, we need to escape it, that is, prefix it with \. For example, to match a literal parenthesis (, we write \(.

For example, when we crawled questions from Zhihu Explore, we used the following regular expression:

explore-feed.*?question_link.*?>(.*?)</

It matches the following in the requested html:

<div class="explore-feed feed-item" data-offset="1">
<h2><a class="question_link" href="/question/311635229/answer/..." target="_blank" data-id="..." data-za-element-name="Title">
......
</a></h2>
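
In code, extracting the titles might look like the following minimal sketch, assuming the page source has already been fetched into a string html:

import re

# html is assumed to hold the page source fetched earlier with requests;
# re.S lets .*? match across line breaks
pattern = re.compile('explore-feed.*?question_link.*?>(.*?)</', re.S)
titles = re.findall(pattern, html)
print(titles)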

RE in Python

The Python built-in re library provides support for regular expressions.

Here's how to use the re library:

match()

The match() method takes a regular expression and the string to match, and detects whether the string matches the regular expression.

The match() method determines whether or not a match is made and returns a Match object if the match is successful or None if it is not.

match() matches from the beginning, that is, if the first character cannot be matched, it will not match.

We can often do this:

import re

test = 'String to test'
if re.match(r'regular expression', test):
    print('Match')
else:
    print('Failed')

Extract Grouping:

>>> import re
>>> text = 'abc 123'

>>> print(re.match(r'\s+\w\d', text))
None

>>> r = re.match(r'\w*? (\d{3})', text)
>>> r
<re.Match object; span=(0, 7), match='abc 123'>
>>> r.group()
'abc 123'
>>> r.group(0)
'abc 123'
>>> r.group(1)
'123'
  • group() and group(0) output the complete match result.
  • group(1), group(2), group(3), ... output the first, second, third, ... results captured by the parenthesized () groups (only group(1) exists in this example).

Modifier

Modifier Description
re.I Makes matching case-insensitive
re.L Locale-aware matching
re.M Multi-line matching, affects ^ and $
re.S Makes . match any character, including line breaks
re.U Parses characters according to the Unicode character set; affects \w, \W, \b, \B
re.X Verbose mode, allowing regular expressions to be formatted more flexibly and written more readably

These modifiers can be passed in as the third parameter of re.match to produce the effect described above.

result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)

search()

Unlike match(), search() scans the entire string during a match and returns the first successful match, or None if none exists anywhere.

>>> p = r'\d+'
>>> c = 'asd123sss'
>>> r = re.search(p, c)
>>> r
<re.Match object; span=(3, 6), match='123'>
>>> r.group()
'123'

findall()

findall() searches the entire string and returns all matches of the regular expression as a list.

>>> import re
>>> p = r'\d+'
>>> c = 'asd123dfg456;;;789'
>>> re.findall(p, c)
['123', '456', '789']

XPath & LXML

XPath (XML Path Language) is a language designed to look up information in an XML document, and it also applies to HTML.

We can use XPath to extract information when crawling.

[Note] LXML needs to be installed.

Common XPath Rules

Expression Description
nodename Selects all child nodes of this node
/ Selects direct children from the current node
// Selects descendants from the current node
. Selects the current node
.. Selects the parent of the current node
@ Selects attributes

We use XPath rules beginning with // to select all nodes that meet the requirements.

In addition, common operators can be found in XPath Operators.

Import HTML

Import HTML from string

Import the etree module from the lxml library, declare a piece of HTML text, and call the HTML class to initialize it; with that we have successfully constructed an XPath parsing object.

The etree module can automatically correct the HTML text.

Calling the tostring() method outputs the corrected HTML code as bytes (which can be converted to str using the decode() method):

from lxml import etree
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
Import HTML from a file
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

Get Nodes

Get all nodes

Get all the nodes in an HTML document using the rule //*:

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())

result = html.xpath('//*')

print(result)

We get a list of Element types.

Get all specified tags

If we want to get all the li nodes, we can change the rule in html.xpath() above to '//li':

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)

If no matching results are found, html.xpath() returns [].

Get Child Nodes

Select all the direct a children of the li nodes, using the rule '//li/a':

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

To get all the descendant a nodes under li instead, you can use: //li//a
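
A minimal sketch of the descendant rule, using the same test.html as the examples above:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# //li//a matches a nodes at any depth under li, not only direct children
result = html.xpath('//li//a')
print(result)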

Get nodes with specific attributes

Use the @ symbol for attribute filtering.
node[...] selects the nodes that satisfy the restriction inside the brackets.

Select the a node whose href is link4.html; the rule is '//a[@href="link4.html"]':

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]')
print(result)
Get Parent Node

If we want to get the parent node of the above example, then get its class attribute:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
# You can also use the node axis '//a[@href="link4.html"]/parent::*/@class'

print(result)

For details on the use of node axes, see XPath Axes

Get the text in the node

The text() method in XPath can get direct text in a node (excluding text in its children).

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/text()')
print(result)
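
To also collect the text inside descendant nodes, //text() can be used instead of /text(); a small sketch with the same file:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# /text() returns only the li nodes' direct text;
# //text() also descends into the a children
print(html.xpath('//li[@class="item-0"]//text()'))
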
Get Attributes
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)

The exact-match method above works only when an attribute has a single value.
For the following:

<li class="li li-first"><a href="link.html">first item</a></li>

The class attribute of the li node has two values, and this method will fail. We can use the contains() function:

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

Here you can also use the operator and to connect:

from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

Supplement

For details, follow the links: XPath Tutorial, lxml Library.

BeautifulSoup

BeautifulSoup provides simple, Python-style functions for navigating, searching, and modifying the parse tree. It parses the document to give users the data they need to capture, and it can greatly improve parsing efficiency.

BeautifulSoup has complete official Chinese documentation; see the Official BeautifulSoup Documentation.

⚠[Note] Both BeautifulSoup (bs4) and lxml need to be installed.

BeautifulSoup can use a variety of parsers, the main ones being as follows:

Parser Usage Advantages Disadvantages
Python standard library BeautifulSoup(markup, "html.parser") Python's built-in standard library, moderate speed, reasonable document fault tolerance Versions before Python 2.7.3 or 3.2.2 have poor fault tolerance
lxml HTML parser BeautifulSoup(markup, "lxml") Fast, document fault tolerant The C language library needs to be installed
lxml XML parser BeautifulSoup(markup, "xml") Fast, the only XML parser The C language library needs to be installed
html5lib BeautifulSoup(markup, "html5lib") Best fault tolerance, parses documents the way a browser does, generates HTML5-format documents Slow, does not rely on external extensions

The lxml HTML parser is the one commonly used; parsing works as follows:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello</p>', 'lxml')    
print(soup.p.string)

Initialization of BeautifulSoup objects

You can import HTML and initialize a BeautifulSoup object with the following code; the input is corrected automatically (for example, unclosed tags are closed).

soup = BeautifulSoup(markup, "lxml")   # markup is an HTML string

After initialization, we can also output the string to be parsed in a standard indented format:

print(soup.prettify())
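
For instance, a minimal sketch of the auto-correction (the broken fragment is made up):

from bs4 import BeautifulSoup

# The unclosed <li> and <ul> tags are completed automatically
soup = BeautifulSoup('<ul><li>Foo<li>Bar', 'lxml')
print(soup.prettify())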

Node selector

Select Label

When selecting an element, you can select a node element directly by calling the name of the node.
Calling the string property gets the text within the node.

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

print(soup.title)           # <title>The Dormouse's story</title>
print(type(soup.title))     # <class 'bs4.element.Tag'>
print(soup.title.string)    # The Dormouse's story
print(soup.head)            # <head><title>The Dormouse's story</title></head>
print(soup.p)               # <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Nested Selection

We can also make nested selections, chaining downward like parent → child → grandchild:

print(soup.head.title.string)
Association Selection

Sometimes it's difficult to select the desired node element in one step, so we can select a node element first, then use it as a benchmark to select its child, parent, sibling, and so on.

Get descendant nodes

Once a node element is selected, if you want to get its immediate child nodes, you can call the contents property and return a list listing all the child nodes in turn.

Nodes such as the p tag may contain both text and child nodes; the returned list includes them all.

soup.p.contents     # Notice that the text is cut into sections

'''(result)
[
    'Once upon a time ... were\n',
    <a class="sister" href="..." id="link1"><!-- Elsie --></a>,
    ',\n',
    <a class="sister" href="..." id="link2">Lacie</a>,
    ' and\n',
    <a class="sister" href="..." id="link3">Tillie</a>,
    ';\nand ... well.'
]
'''

When querying child nodes, we can also use the children attribute, which returns a list_iterator object; converted to a list, it is the same as contents:

>>> soup.p.children
<list_iterator object at 0x109d6a8d0>
>>> a = list(soup.p.children)
>>> b = soup.p.contents
>>> a == b
True

We can enumerate the child nodes one by one:

for i, child in enumerate(soup.p.children):
    print(i, child)

To get all descendant nodes (nodes at every level below), the descendants property can be invoked. descendants recursively queries all descendants (depth-first) and returns a generator, such as <generator object Tag.descendants at 0x109d297c8>:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, d in enumerate(soup.p.descendants):
    print(i, d)
Get parent and ancestor nodes

If you want to get the parent node of a node element, you can call the parent property and return a node:

>>> soup.span.parent
# The result is <p>...</p>

If we want to get all the ancestor nodes (each level up, all the way to the whole html), we can call the parents property, which returns a generator:

>>> soup.span.parents
<generator object PageElement.parents at 0x109d29ed0>
>>> list(soup.span.parents)
# The results were [<p>...</p>, <div>...</div>, <body>...</body>, <html>...</html>]

⚠[Note] parent gives the one direct parent; parents gives all the ancestors.

Get Sibling Nodes

To get sibling nodes, we can call four different attributes, each doing something different:

  • next_sibling: Gets the node to the next sibling and returns the node.
  • previous_sibling: Gets the previous sibling node and returns the node.
  • next_siblings: Gets all sibling nodes down and returns a generator.
  • previous_siblings: Gets all sibling nodes up and returns a generator.
>>> from bs4 import BeautifulSoup
>>> html = """
... <html>
...     <body>
...         <p class="story">
...             Once upon a time there were three little sisters; and their names were
...             <a href="http://example.com/elsie" class="sister" id="link1">
...                 <span>Elsie</span>
...             </a>
...             Hello
...             <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
...             and
...             <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
...             and they lived at the bottom of a well.
...         </p>
... """
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
>>> soup.a.next_sibling
'\n            Hello\n            '
>>> soup.a.previous_sibling
'\n            Once upon a time there were three little sisters; and their names were\n            '
>>> soup.a.next_siblings
<generator object PageElement.next_siblings at 0x1110e57c8>
>>> soup.a.previous_siblings
<generator object PageElement.previous_siblings at 0x1110e5de0>
>>> for i in soup.a.previous_siblings:
...     print(i)
... 

            Once upon a time there were three little sisters; and their names were
            
>>> for i in soup.a.next_siblings:
...     print(i)
... 

            Hello
            
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 
            and
            
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

            and they lived at the bottom of a well.
        
>>> 

Method selector

Sometimes it is difficult to find the desired node directly with node selectors. In such cases we can use methods like find_all() and find(), passing in the corresponding parameters to query flexibly, get the desired nodes, and then use association selection to easily obtain the information we need.

find()

find() takes attributes or text, queries for qualifying elements, and returns the first matching element.

find(name , attrs , recursive , text , **kwargs)

Examples are as follows:

from bs4 import BeautifulSoup
import re

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(soup.find(attrs={'class': 'element'}))
print(soup.find(text=re.compile('.*?o.*?', re.S)))      # The result returns the text of the first node that matches the regular expression (the result is not a node)
find_all()

find_all() is similar to find(), but it queries all eligible elements and returns a list of all matching elements.
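
A minimal sketch, reusing the html string from the find() example above:

from bs4 import BeautifulSoup

# html is assumed to be the same panel/list snippet used for find()
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))                   # every ul node, in a list
print(soup.find_all(attrs={'class': 'element'}))  # every node with class="element"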

More

There are also other find_* methods, such as find_parents(), find_next_siblings(), find_previous_siblings(), etc. They are used in almost the same way, but their search scope differs; see the documentation for details.
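
For example, a small sketch of find_parents() (the HTML fragment is made up for illustration):

from bs4 import BeautifulSoup

html = '<div class="outer"><p><a href="#">link</a></p></div>'
soup = BeautifulSoup(html, 'lxml')
# find_parents() searches ancestors instead of descendants
print(soup.a.find_parents('div'))   # all div ancestors of the a node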

CSS Selector

BeautifulSoup also provides a CSS selector.
With CSS selectors, simply call the select() method and pass in the appropriate CSS selector; the result is a list of the nodes that match the selector:

from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Extract Information

Get the full tag

To get the complete HTML code of a tag, simply write its node selector:

soup.title

Get Tag Name

Use the name attribute to get the name of the node (p, a, title, pre, etc.):

print(soup.title.name)

Get tag content

As we have said before, calling the string property gives you the text within the node:

soup.title.string

[Note] If there are other tags nested inside the tag, .string does not work and returns None:

>>> from bs4 import BeautifulSoup
>>> html = '<p>Foo<a href="#None">Bar</a></p>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.string)
None

You can also get content using the node's get_text() method:

soup.p.get_text()

With get_text, you can get all the text under the tag, including the text in its child nodes:

>>> from bs4 import BeautifulSoup
>>> html = '<p>Foo<a href="#None">Bar</a></p>'
>>> soup = BeautifulSoup(html, 'lxml')
>>> print(soup.p.string)
None
>>> print(soup.p.get_text())
FooBar
Get Attributes

Each node may have multiple attributes, such as id and class. We can call attrs to get all of the attributes, and then fetch a specific one using the dictionary's usual access methods (brackets plus the attribute name, or the get() method):

print(soup.p.attrs)
print(soup.p.attrs['name'])

'''(results)
{'class': ['title'], 'name': 'dromouse'}
dromouse
'''

You can also use brackets and attribute names directly:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
    # Two lines of code in a loop are equivalent

PyQuery

pyquery: a jquery-like library for python

PyQuery's usage is basically the same as jQuery's.

PyQuery needs to be installed.

Initialization

PyQuery initialization accepts data sources in various forms, such as a string containing HTML, a URL of the source, a local file name, and so on.

String Initialization
from pyquery import PyQuery as pq

html = '''
<h1>Header</h1>
<p>Something</p>
<p>Other thing</p>
<div>
    <p>In div</p>
</div>
'''

doc = pq(html)      # Pass in an HTML string
print(doc('p'))     # Pass in a CSS selector

'''(results)
<p>Something</p>
<p>Other thing</p>
<p>In div</p>

'''
URL Initialization
from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com', encoding='utf-8')      # If encoding is omitted here, Chinese text may come out garbled
print(doc('title'))

'''(result)
<title>Baidu once, you know</title>
'''
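
File Initialization

pyquery can also read a local file directly through the filename parameter; a minimal sketch, assuming the test.html sample file from the XPath section exists locally:

from pyquery import PyQuery as pq

# Assumes ./test.html exists locally
doc = pq(filename='./test.html')
print(doc('li'))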

CSS Selector

For details, see the CSS Selector Table.

Find Node

  • Find child nodes with the children('css-selector') method; with no argument, all children are returned.
  • Find descendant nodes with the find('css-selector') method; the argument cannot be empty!
  • Find the parent node with the parent('css-selector') method; with no argument, all parents are returned.
  • Find ancestor nodes with the parents('css-selector') method; with no argument, all ancestors are returned.
  • Find sibling nodes with the siblings('css-selector') method; with no argument, all siblings are returned.
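
For example, a small sketch of children() and siblings() (the HTML fragment is made up):

from pyquery import PyQuery as pq

html = '''
<div>
    <ul>
        <li class="item-0">first</li>
        <li class="item-1">second</li>
    </ul>
</div>
'''
d = pq(html)
print(d('div').children('ul'))   # direct ul children of div
print(d('.item-0').siblings())   # all siblings of .item-0

The interactive session below uses find() on the Baidu page loaded earlier:
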
>>> p = doc('div')
>>> p
[<div#wrapper>, <div#head>, <div.head_wrapper>, <div.s_form>, <div.s_form_wrapper>, <div#lg>, <div#u1>, <div#ftCon>, <div#ftConw>]
>>> type(p)
<class 'pyquery.pyquery.PyQuery'>
>>> p.find('#head')
[<div#head>]
>>> print(p.find('#head'))
<div id="head"> ... </div> 

Traversal

The results selected with PyQuery can be traversed:

>>> for i in p.parent():
...     print(i, type(i))
... 
<Element a at 0x1055332c8> <class 'lxml.html.HtmlElement'>
<Element a at 0x105533368> <class 'lxml.html.HtmlElement'>
<Element a at 0x105533458> <class 'lxml.html.HtmlElement'>

Note that each item here is an lxml Element and is handled with lxml.

Extract Information

attr() to get attributes
a = doc('a')
print(a.attr('href'))

attr() must be passed the name of the attribute to select.
If the object contains more than one node, calling attr() on it returns only the result for the first node. To get each one, traversal is required; see the sketch below.
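
A minimal sketch of traversal with the items() method, which yields each node as a PyQuery object (the sample links are made up):

from pyquery import PyQuery as pq

html = '''
<a href="link1.html">first</a>
<a href="link2.html">second</a>
'''
doc = pq(html)
a = doc('a')
print(a.attr('href'))          # only the first node's href: link1.html
for item in a.items():         # items() yields each node as a PyQuery object
    print(item.attr('href'))   # link1.html, then link2.html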

text() Get text
a = doc('a')
a.text()

This outputs the text of all the selected nodes joined together, including text in descendant nodes.

Node Operation

PyQuery can also modify nodes, but that is not our focus here.
