Learn more about XPath (use XPath to collect your favorite blog) [Python crawler: beginner to advanced] (04)

Hello, I'm Manong Feige. Thank you for reading this article, and feel free to like, favorite, and share it.
This article is the fourth in the crawler column. It focuses on combining the lxml library with XPath to parse web pages and extract their content.
It is packed with practical content, so bookmarking it is recommended. The series is continuously updated; if you have any questions or requests, please leave a message and let me know.

Preface (why write this article)

In the last article, we briefly introduced the basic concepts of HTML and XML and focused on XPath syntax. In this article, let's put that into practice: you will learn how to use the lxml library to load and parse web pages, and then use XPath syntax to locate specific elements and node information.

Introduction to lxml Library

The lxml library is an HTML/XML parser whose main job is to parse and extract HTML/XML data.
Like the re module, lxml is implemented in C, which makes it a high-performance Python HTML/XML parser. Combined with the XPath syntax learned earlier, it can quickly locate specific elements and node information on a web page.

Using pip to install the lxml library

pip install lxml

Parsing HTML fragments using lxml Library

The lxml library can parse any XML or HTML fragment passed to it, provided the fragment has no syntax errors.

#lxml_test.py
from lxml import etree

text = """
<div id="content_views" class="htmledit_views">
    <p style="text-align:center;"><strong>Whole network ID Name:<b>Mainong Feige</b></strong></p>
    <p style="text-align:right;"><strong>Scan the code and join the technical exchange group!</strong></span></p>
    <p style="text-align:right;"><img src="https://img-blog.csdnimg.cn/5df64755954146a69087352b41640653.png"/></p>
    <div style="text-align:left;">Personal micro signal<img src="https://img-blog.csdnimg.cn/09bddad423ad4bbb89200078c1892b1e.png"/></div>
</div>
"""
# Use etree.HTML to parse strings into HTML documents
html = etree.HTML(text)
print("call etree.HTML=", html)
# Serialize the Element object into a string
result = etree.tostring(html)
print("take Element Serialize objects into strings=", result)

result2 = etree.tostring(html, encoding='utf-8').decode()
print(result2)

From the output above, we can see that the etree.HTML(text) method parses the string into an HTML document, that is, an Element object, and that etree.tostring(html) serializes the HTML document back into a string. The serialized result is a bytes object in which Chinese cannot be displayed normally; you need to specify encoding='utf-8' and call the decode() method to output Chinese properly. The decoded output also keeps its formatting, i.e. it is not printed on a single line.
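As a quick check (a minimal sketch with a made-up fragment, not part of the original example), you can confirm that the undecoded result is a bytes object in which Chinese is escaped, while the decoded result is an ordinary string:

from lxml import etree

fragment = "<p>码农飞哥</p>"
element = etree.HTML(fragment)

raw = etree.tostring(element)  # bytes, Chinese escaped as character references
readable = etree.tostring(element, encoding='utf-8').decode()  # str, Chinese readable

print(type(raw), type(readable))  # <class 'bytes'> <class 'str'>
print(readable)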

Using the lxml library to load HTML files

The lxml library can parse not only XML/HTML fragments but also complete HTML/XML files. A file named test.html is created below and then parsed with the etree.parse method.

<div id="content_views" class="htmledit_views">
    <p style="text-align:center;"><strong>Whole network ID Name:<b>Mainong Feige</b></strong></p>
    <p style="text-align:right;"><strong>Scan the code and join the technical exchange group!</strong></p>
    <p style="text-align:right;"><img src="https://img-blog.csdnimg.cn/5df64755954146a69087352b41640653.png"/></p>
    <div style="text-align:left;">Personal micro signal<img src="https://img-blog.csdnimg.cn/09bddad423ad4bbb89200078c1892b1e.png"/></div>
</div>

Then create a file named html_parse.py. Note that this file is in the same directory as test.html.

# html_parse.py
from lxml import etree
#Read the external file test.html
html = etree.parse('./test.html')
result = etree.tostring(html, encoding='utf-8').decode()
print(result)

The analysis result is:

It can be seen that when the parsed test.html is a well-formed HTML code fragment, it loads normally, because the parse method uses the XML parser by default.
However, when test.html is a standard, complete HTML file, the XML parser can no longer handle it. If test.html is changed to a complete HTML page and parsed directly with the XML parser, an error is reported.
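As a sketch of that situation (the file content below is assumed, since the original post showed it only as a screenshot): a complete HTML page with an unclosed <meta> tag is legal HTML but not well-formed XML, so the default XML parser raises an XMLSyntaxError.

from lxml import etree

# A complete HTML page (hypothetical content standing in for the screenshot)
full_html = """<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <title>test</title>
</head>
<body>
    <p>Whole network ID Name: Manong Feige</p>
</body>
</html>"""

with open('./test.html', 'w', encoding='utf-8') as f:
    f.write(full_html)

try:
    etree.parse('./test.html')  # default XML parser
except etree.XMLSyntaxError as e:
    print("XML parser failed:", e)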

For complete HTML files, you need to create an HTML parser with the etree.HTMLParser method and then pass it to the parse method, as shown in the following code.

from lxml import etree

# Define parser
html_parser = etree.HTMLParser(encoding='utf-8')
# Read the external file test.html
html = etree.parse('./test.html', parser=html_parser)
result = etree.tostring(html, encoding='utf-8').decode()
print(result)

The operation result is:

Hands-on practice begins

After understanding the basic use of lxml, let's take Manong Feige's blog as an example. The requirement here is to crawl all the articles under the blog (not including the article bodies for now) and save the crawled data to a local txt file. First, let's see what the blog looks like. Knowledge points covered in previous posts will not be repeated in detail here; the focus is on how to quickly locate specific elements and data with XPath.

Step 1: get the article groups

First, get the article groups, once again with the help of the all-purpose XPath Helper plugin. Open the debugging window with F12. You can see that each article group is placed in a <div class="article-item-box csdn-tracking-statistics xh-highlight"></div> element. Therefore, all article groups can be obtained with the expression //div[@class="article-item-box csdn-tracking-statistics"].

Code examples are as follows:

from lxml import etree
import requests

response = requests.get("https://feige.blog.csdn.net/", timeout=10) # send request
html = response.content.decode()
html = etree.HTML(html)  # Parse into an Element object
li_temp_list = html.xpath('//div[@class="article-item-box csdn-tracking-statistics"]')  # Get the article groups
print(li_temp_list)

The operation result is:

Here, we get 40 Element objects through the html.xpath('//div[@class="article-item-box csdn-tracking-statistics"]') call. These 40 Element objects correspond to all the articles on the current page that we need to crawl. Each Element object looks like the following.

Next, serialize the Element object through the result = etree.tostring(li_temp_list[0], encoding='utf-8').decode() method. The result is:

<div class="article-item-box csdn-tracking-statistics" data-articleid="121003473">
      <img class="settop" src="https://csdnimg.cn/release/blogv2/dist/pc/img/listFixedTop.png" alt=""/>
      <h4 class="">
        <a href="https://feige.blog.csdn.net/article/details/121003473" data-report-click="{&quot;spm&quot;:&quot;1001.2014.3001.5190&quot;}" target="_blank">
            <span class="article-type type-1 float-none">original</span>
            A first look at XPath (mastering XPath syntax) [Python crawler: beginner to advanced] (03)
        </a> 
      </h4>
      <p class="content">
        With the sharp weapon of XPath in hand, crawler parsing is nothing to worry about
      </p>
      <div class="info-box d-flex align-content-center">
        <p>
          <span class="date">2021-10-28 14:56:05</span>
          <span class="read-num"><img src="https://csdnimg.cn/release/blogv2/dist/pc/img/readCountWhite.png" alt=""/>335</span>
          <span class="read-num"><img src="https://csdnimg.cn/release/blogv2/dist/pc/img/commentCountWhite.png" alt=""/>5</span>
        </p>
      </div>
    </div>

Step 2: get the article link

From the output above, we can see that the article link is in <a href="https://feige.blog.csdn.net/article/details/121003473">, the parent of the a element is the h4 element, and the parent of the h4 element is the <div class="article-item-box csdn-tracking-statistics"> element.

The XPath expression is //div[@class="article-item-box csdn-tracking-statistics"]/h4/a/@href. That is, first select all <div class="article-item-box csdn-tracking-statistics"> elements with //div[@class="article-item-box csdn-tracking-statistics"], then find their child element h4 with /h4, find h4's child element a with /a, and finally take the link content with /@href.

However, in the first step we already obtained the Element objects of the <div class="article-item-box csdn-tracking-statistics"> elements. Therefore, there is no need to repeat //div[@class="article-item-box csdn-tracking-statistics"] here; just replace it with . to search from the current element. Reference code is as follows:

#Omit the first step of the code
href = li_temp_list[0].xpath('./h4/a/@href')[0]
print(href)

The result is: https://feige.blog.csdn.net/article/details/121003473. Note that the xpath method returns a list, so you need to extract its first element.
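Building on that, the same relative expression can be applied to every Element obtained in step 1 to collect the links of all articles on the page (a small sketch reusing the li_temp_list variable from above):

# Collect the href of every article group found in step 1
href_list = []
for div in li_temp_list:
    hrefs = div.xpath('./h4/a/@href')
    if hrefs:  # skip groups without a link, just in case
        href_list.append(hrefs[0])
print(len(href_list), href_list[:3])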

Step 3: get the title content

Following the same idea as for the link, we can also get the title. The title text sits directly inside the a tag, so the only extra step is to extract it with the text() function. The expression is //div[@class="article-item-box csdn-tracking-statistics"]/h4/a/text().

The reference code is:

title_list = li_temp_list[0].xpath('./h4/a/text()')
print(title_list)
print(title_list[1])

The operation result is:

['\n            ', '\n          A first look at XPath (mastering XPath syntax) [Python crawler: beginner to advanced] (03)\n        ']

          A first look at XPath (mastering XPath syntax) [Python crawler: beginner to advanced] (03)

Because the first element of the list contains only a newline and spaces, you need to take the second element.
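The extracted title still carries the surrounding newlines and spaces. One way to clean it up (a sketch of an alternative, not part of the original code) is to join all the text fragments under the a tag and strip the whitespace:

# Join every text fragment under the <a> tag and strip the surrounding whitespace
title = ''.join(li_temp_list[0].xpath('./h4/a/text()')).strip()
print(title)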

Complete reference code

import requests
from lxml import etree
import json
import os


class FeiGe:
    # initialization
    def __init__(self, url, pages):
        self.url = url
        self.pages = pages
        self.headers = {
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"}

    def get_total_url_list(self):
        """
        Get the list of all page URLs to crawl
        :return:
        """
        url_list = []
        for i in range(self.pages):
            url_list.append(self.url + "/article/list/" + str(i + 1) + "?t=1")
        return url_list

    def parse_url(self, url):
        """
        Send a request, get a response, and etree handle html
        :param url:
        :return:
        """
        print('parsing url:', url)
        response = requests.get(url, headers=self.headers, timeout=10)  # Send request
        html = response.content.decode()
        html = etree.HTML(html)
        return html

    def get_title_href(self, url):
        """
        Get the title and href of every article on one page
        :param url:
        :return:
        """
        html = self.parse_url(url)
        # Get article grouping
        li_temp_list = html.xpath('//div[@class="article-item-box csdn-tracking-statistics"]')
        total_items = []
        # Traversal grouping
        for i in li_temp_list:
            href = i.xpath('./h4/a/@href')[0] if len(i.xpath('./h4/a/@href')) > 0 else None
            title = i.xpath('./h4/a/text()')[1] if len(i.xpath('./h4/a/text()')) > 0 else None
            summary = i.xpath('./p')[0] if len(i.xpath('./p')) > 0 else None
            # Put in dictionary
            item = dict(href=href,
                        text=title.replace('\n', '').strip() if title else None,
                        summary=summary.text.replace('\n', '').strip() if summary is not None and summary.text else None)
            total_items.append(item)
        return total_items

    def save_item(self, item):
        """
        Save a item
        :param item:
        :return:
        """
        with open('feige_blog.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(item, ensure_ascii=False, indent=2))
            f.write("\n")

    def run(self):

        # Build the list of page URLs according to the URL pattern
        url_list = self.get_total_url_list()
        if os.path.exists('feige_blog.txt'):
            os.remove('feige_blog.txt')  # Start from a clean output file
        for url in url_list:
            # Traversal url_list sends a request and gets a response
            total_item = self.get_title_href(url)
            for item in total_item:
                print(item)
                self.save_item(item)


if __name__ == '__main__':
    fei_ge = FeiGe("https://feige.blog.csdn.net/", 8)
    fei_ge.run()

The operation result is:

This class is mainly divided into several parts.

  1. The initialization method __init__(self, url, pages) mainly sets the domain name of the blog to be crawled, the number of blog pages, and the request headers.
  2. The get_total_url_list(self) method builds the links of all pages; pay attention to the pattern of the links here.
  3. The parse_url(self, url) method mainly fetches the page for the given url and parses it into an HTML Element.
  4. The get_title_href(self, url) method has two parts: first it calls parse_url(self, url) to get the HTML page for the link, and then it parses that page to locate the elements holding the data we need and puts them into the list total_items.
  5. The save_item(self, item) method saves the data in the total_items returned by step 4 to the feige_blog.txt file.
  6. The run(self) method is the main entry point and calls all the other methods in order.

Summary

This article took Manong Feige's blog as an example to demonstrate, in practice, how to use XPath to crawl the data we want.
