03 Crawler Case: Baidu Tieba (Post Bar)

Introduction

This case only consolidates the use of XPath syntax and does not go into deeper crawler topics.

Analysis:

Generally, crawling a web page involves the following steps:

1. Get the address of the crawling target

2. Clarify the requirements, i.e. decide which data on the page to obtain

3. Once the goal is clear, parse the structure of the web page with the corresponding parsing syntax (XPath) to extract the desired data

4. Code implementation

1. Get the address of the crawling target

Take Baidu Post Bar as an example:

Programmer Bar - Baidu Tieba: "Programmer home, the world's largest Chinese programmer exchange area. Programmers are professionals engaged in program development and maintenance; they are generally divided into program designers and program coders, though the boundary between the two is not very clear."

2. Clarify the requirements

Obtain the title of each post in the bar, as shown below:

3. Use XPath to parse the structure of the web page and extract the desired data

1. Quickly locate the tag of the target data with the browser's developer tools

1. Right-click the web page and click Inspect

  2. The developer tools panel then pops up

3. Click the arrow icon in the upper left corner

 

  4. Hover the mouse over some data on the page, and the panel automatically jumps to the corresponding tag. Since we want post titles, hover over the title of a post.

The tag is now located, but is this what we want? Not quite: it points to only a single post, while we want the titles of all posts on the whole page (note: the page, not the site).

In fact, locating one post is half the job, because every post shares the same tag structure and attribute values; anyone who has done front-end work knows this. So the question is: since they are all the same, why not match on the attribute values directly? This is another bit of front-end knowledge. The layout of most websites is like a jigsaw puzzle, and the posts live inside one piece of it; we first want to find that piece. In XPath terms, we locate tags by combining path lookup with attribute-value matching. Matching on an attribute value alone also works, but selecting across the whole document at once is risky and may pick up unwanted data.

Idea:

1. First find the common parent tag of all posts via the path

2. Since every post has the same tag structure and attribute values, analyze just one of them. An example later will show the effect.
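The idea above can be sketched with a minimal, self-contained example. The HTML fragment below is a made-up stand-in for the real page, not Tieba's actual markup:

```python
from lxml import etree

# Hypothetical fragment mimicking the post-bar layout: every post <li>
# shares the same tag structure and attribute values.
html = """
<ul id="thread_list">
  <li><div><a class="j_th_tit" title="Post one">Post one</a></div></li>
  <li><div><a class="j_th_tit" title="Post two">Post two</a></div></li>
</ul>
"""

tree = etree.HTML(html)

# Step 1: locate the common parent (<ul id="thread_list">) by path;
# Step 2: since every <li> is structured identically, one relative
# path picks up all of them at once.
titles = tree.xpath('//ul[@id="thread_list"]/li/div/a/text()')
print(titles)  # ['Post one', 'Post two']
```

Because the path runs through the shared parent, analyzing a single post is enough to collect all of them.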

2. Use the XPath Helper plug-in to verify that the XPath syntax obtains the correct data

Once we know where the data lives, we write XPath syntax to locate it. How do we know whether the syntax is correct? We verify it with the XPath Helper plug-in.

1. First, observe how many posts a single page of the bar contains

         Excluding advertisements, there are 50 posts. So if the syntax we write returns 50 items of data, it is correct.

2. To use the plug-in, click the puzzle-piece icon in the upper right corner

 

3. Click the XPath Helper plug-in

  4. Two boxes pop up

  5. The box on the left takes our XPath syntax; the box on the right automatically shows the data matched by that syntax

Writing XPath syntax

1. Locating with attribute values only (not recommended)

Hover over one of the posts with the mouse:

<a rel="noreferrer" href="/p/7598616748" title="Provide a daily budget of tens of millions and charge high prices for a long time wap Quantity, regular senseless advertising, looking for webmaster cooperation" target="_blank"
class="j_th_tit xh-highlight">Provide a daily budget of tens of millions and charge high prices for a long time wap Quantity, regular senseless advertising, looking for webmaster cooperation
</a>

Syntax:

//a[@class="j_th_tit"]

Effect:

This returns 50 items of data, so the syntax is correct. Then why is it not recommended? Suppose that instead of the class attribute we use the rel attribute:

//a[@rel="noreferrer"]

Effect:

230 items of data, which is obviously wrong.
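The over-matching is easy to reproduce on a small made-up fragment (again, not real Tieba markup): rel="noreferrer" also appears on links other than post titles, so matching on it alone over-selects.

```python
from lxml import etree

# Hypothetical fragment: rel="noreferrer" appears on the title link
# AND on an author link, so it is not unique to titles.
html = """
<li>
  <a rel="noreferrer" class="j_th_tit" title="A post">A post</a>
  <a rel="noreferrer" class="frs-author-name">some_user</a>
</li>
"""
tree = etree.HTML(html)

by_rel = tree.xpath('//a[@rel="noreferrer"]')
by_class = tree.xpath('//a[@class="j_th_tit"]')
print(len(by_rel))    # 2 -- over-selects
print(len(by_class))  # 1 -- only the title link
```

The class attribute works here only because it happens to be unique to title links; in general, combine it with a path, as the next section does.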



2. Combining path lookup with attribute lookup (recommended)

First, find the tag of the post's puzzle piece, i.e. the tag that encloses this whole block of post content:

  From the figure we can see that the post's information sits inside the li tag, so the data we want must be in the li tag. Analyze the li tag:

  We can search down level by level toward the a tag. One problem: the same tag appears twice at the same level. How do we tell them apart? As shown below:

We can see that the content we want is under the second div tag. Distinguishing the two is simple: use the attribute value.

Syntax:

//ul[@id="thread_list"]/li/div/div[@class="col2_right j_threadlist_li_right "]/div/div/a

Effect:

 

How you write the syntax is up to you, as long as it returns the correct data.
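The sibling-div disambiguation can be sketched on a simplified, made-up fragment; the class values imitate the syntax above (note the trailing space inside the class attribute):

```python
from lxml import etree

# Hypothetical fragment: two sibling <div>s at the same level inside
# each <li>; only the one with the right class holds the title link.
html = """
<ul id="thread_list">
  <li>
    <div class="col2_left">99 replies</div>
    <div class="col2_right j_threadlist_li_right "><a class="j_th_tit" title="A post">A post</a></div>
  </li>
</ul>
"""
tree = etree.HTML(html)

# Select the second div by its class value instead of its position
# (mind the trailing space in the class string).
titles = tree.xpath(
    '//ul[@id="thread_list"]/li'
    '/div[@class="col2_right j_threadlist_li_right "]/a/text()'
)
print(titles)  # ['A post']
```

Selecting by attribute value rather than by position (div[2]) keeps the expression readable and robust if the column order ever changes.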

3. Get the data

So far we have only located the tag containing the data; we have not actually extracted the value. How we extract it depends on the form the data takes inside the tag:

<a rel="noreferrer" href="/p/7598616748" title="Provide a daily budget of tens of millions and charge high prices for a long time wap Quantity, regular senseless advertising, looking for webmaster cooperation" target="_blank"
class="j_th_tit xh-highlight">Provide a daily budget of tens of millions and charge high prices for a long time wap Quantity, regular senseless advertising, looking for webmaster cooperation
</a>

As you can see, the value of the a tag's title attribute is the data we want, and the text between <a> and </a> holds the same value. So there are two ways to write the extraction; pick whichever you prefer.

Syntax:

Method 1:

//ul[@id="thread_list"]/li/div/div[@class="col2_right j_threadlist_li_right "]/div/div/a/text()

Effect:

 

  Method 2:

//ul[@id="thread_list"]/li/div/div[@class="col2_right j_threadlist_li_right "]/div/div/a/@title

  Effect:
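The two writing methods can be compared on a single made-up <a> tag (the title text here is invented):

```python
from lxml import etree

# One made-up post link: the title appears both as the text node
# and as the title attribute, so both XPath forms return it.
html = ('<a rel="noreferrer" class="j_th_tit" '
        'title="Looking for cooperation">Looking for cooperation</a>')
tree = etree.HTML(html)

by_text = tree.xpath('//a[@class="j_th_tit"]/text()')   # method 1
by_attr = tree.xpath('//a[@class="j_th_tit"]/@title')   # method 2
print(by_text)  # ['Looking for cooperation']
print(by_attr)  # ['Looking for cooperation']
```

On the real page the text node and the title attribute may differ slightly (e.g. truncation), so check both against the page before choosing.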

4. Code implementation

Code:

import requests
from lxml import etree


def get_html_content(url):

    # Disguise the User-Agent to avoid simple anti-crawler checks
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
    }

    # Send a GET request to the url
    response = requests.get(url, headers=headers)

    # Decode the returned byte stream into text
    content = response.text

    # Part of the page content is wrapped in HTML comments; strip the
    # comment markers so the wrapped tags become visible to the parser
    content = content.replace('<!--', '')
    content = content.replace('-->', '')

    # Turn the HTML text into an object that lxml can query with XPath
    tree = etree.HTML(content)

    # Extract the post titles with the XPath syntax worked out above
    titles = tree.xpath('//ul[@id="thread_list"]/li/div/div[@class="col2_right j_threadlist_li_right "]/div/div/a/text()')

    # Print the results in a loop
    for title in titles:
        print(title)


if __name__ == "__main__":

    url = "https://tieba.baidu.com/f?kw=%E7%A8%8B%E5%BA%8F%E5%91%98&ie=utf-8&pn=50"
    get_html_content(url)

Result:

Tags: Python, crawler

Posted on Fri, 05 Nov 2021 19:11:56 -0400 by leandro