Crawling the project dataset

2021SC@SDUSC

Brief introduction

According to the project schedule, we need to crawl Baidu Scholar to build datasets for testing the performance of different models. However, several problems came up during the actual crawl, such as page duplication and page inaccessibility.

Page duplication: problem description and solution

After crawling, I found that a large number of duplicate pages appeared. On the one hand this wastes a lot of time, and on the other hand it takes extra effort to deduplicate afterwards. At first, I tried to use the title of the first paper on each page as the page's unique identifier and remove duplicates with a set, but this approach still does not avoid the wasted time.
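
A minimal sketch of this first attempt (assuming the result pages use the same "t c_font" title markup as in the full code below; seen_first_titles and is_duplicate_page are names invented here for illustration):

seen_first_titles = set()

def is_duplicate_page(soup):
    # Use the title of the first paper on the page as the page's identifier
    first = soup.find("h3", class_="t c_font")
    if first is None:
        return False
    title = first.get_text(strip=True)
    if title in seen_first_titles:
        return True
    seen_first_titles.add(title)
    return False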

Further analysis of the algorithm and of the Baidu Scholar pages gradually revealed the source of the problem. The original crawler has a page_num parameter indicating the number of pages it expects to crawl, and it traverses all search results by replacing the "pn=?" parameter in the URL:

next_page = fir_page.replace("pn=10", "pn={:d}".format(i * 10))

However, for some keywords the search results do not reach page_num pages; for example, the search for "retrieve" returns only 68 pages.

When the value of "pn=?" exceeds this critical value, Baidu Scholar simply returns to the first page, and the crawler keeps crawling from there. It is precisely this mechanism that causes the page duplication.

To solve this problem, I analyzed the page structure further. When a next page exists, the pager shows a right-arrow icon (the c-icon-pager-next element).

The crawler can therefore check whether this right-arrow element exists and continue traversing only if it does. The logic is as follows:

next_icon_soup = soup_new.find(id='page')
if next_icon_soup == None:
    # If the pager is missing, the whole page is garbled; skip it
    print('None_for_next')
    continue
next_icon_soup2 = next_icon_soup.find_all('a')[-1].find(class_='c-icon-pager-next')
if next_icon_soup2 == None:
    break

However, while solving this problem I found that some pages return garbled results. In that case next_icon_soup2 is necessarily None, so the loop would exit even though a next page actually exists. Therefore I first check the parent element of next_icon_soup2 (the element with id='page'); if the parent does not exist, the returned result must be garbled, and the page is skipped instead of ending the traversal.
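
To make this decision explicit, here is a small, self-contained sketch of the check, factored into a helper; PageState and classify_page are hypothetical names, not part of the original crawler:

from enum import Enum

class PageState(Enum):
    GARBLED = 1    # pager container missing: the response is garbled, skip this page
    LAST = 2       # pager present but no right arrow: genuinely the last page
    HAS_NEXT = 3   # right arrow present: keep traversing

def classify_page(soup):
    pager = soup.find(id='page')
    if pager is None:
        return PageState.GARBLED
    links = pager.find_all('a')
    if not links or links[-1].find(class_='c-icon-pager-next') is None:
        return PageState.LAST
    return PageState.HAS_NEXT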

Page inaccessibility: problem description and solution

As mentioned above, some pages return garbled results. This also complicates deduplication, and simply ignoring such pages would lose a lot of data. The root cause is probably the site's anti-crawler mechanism. My initial idea was therefore to sleep for 1 s after each request to avoid detection, but the effect was not ideal.

For further analysis, I tried changing the crawler's request header for inaccessible pages, after which they could be accessed normally. Based on this, I adopted the following mechanism: first crawl with request header A; if the response is garbled, switch to request header B; if it is still garbled, switch back to header A, and so on. If the page still cannot be accessed after more than 10 attempts, abandon it. With this mechanism, most pages can be crawled normally.
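
A minimal sketch of this retry mechanism, assuming the two request headers (headers and headers2) defined in the full code below and the 'timeout-img' marker described in the next paragraph; fetch_with_retry is a hypothetical helper, not part of the original crawler:

import time

import requests
from bs4 import BeautifulSoup

def fetch_with_retry(url, headers_a, headers_b, max_tries=10):
    # Alternate between the two request headers until the page is no longer garbled
    header_cycle = [headers_a, headers_b]
    for attempt in range(max_tries):
        response = requests.get(url, headers=header_cycle[attempt % 2])
        soup = BeautifulSoup(response.text, "lxml")
        if soup.find(class_='timeout-img') is None:   # readable page, return it
            return soup
        time.sleep(1)
    return None  # give up on this page after max_tries attempts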

The next problem is how to judge whether a crawled result is garbled. After analysis, I found that garbled pages contain an element with the class "timeout-img". When this element appears on a page, the page is treated as inaccessible.

time_out = soup_new.find(class_='timeout-img')
if time_out_num > 10:
    fail_num += 1
    flag = False
    print('fail...')
    break

The overall implementation of this part is as follows:

while time_out != None:
    time_out_num += 1
    time.sleep(1)
    print(next_page)
    print(' timeout...')
    response = requests.get(next_page, headers=headers2)
    soup_new = BeautifulSoup(response.text, "lxml")
    # Check for timeout again
    time_out = soup_new.find(class_='timeout-img')
    if time_out_num > 10:
        fail_num += 1
        flag = False
        print('fail...')
        break

Summary

After solving the above problems, the crawler runs normally and constructs a data set that meets the requirements.

The run results (keyword: beef) and screenshots of part of the resulting dataset are omitted here.

Full code:

from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import requests
import re
from collections import defaultdict

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36 Edg/94.0.992.50'}
headers2 = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'}
def driver_open(key_word):
    url = "http://xueshu.baidu.com/"
#     driver = webdriver.PhantomJS("D:/phantomjs-2.1.1-windows/bin/phantomjs.exe")
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(2)
    driver.find_element_by_class_name('s_ipt').send_keys(key_word)
    time.sleep(2)
    driver.find_element_by_class_name('s_btn_wr').click()
    time.sleep(2)
    content = driver.page_source.encode('utf-8')

    soup = BeautifulSoup(content, 'lxml')
    return soup

def page_url_list(soup, page=0):
    global soup_new
    fir_page = "http://xueshu.baidu.com" + soup.find(id='page').find('a')["href"]
    urls_list = []
    num = 0
    fail_num = 0
    for i in range(page):
        num+=1
        print(i)
        # time.sleep(1)
        next_page = fir_page.replace("pn=10", "pn={:d}".format(i * 10))
        response = requests.get(next_page,headers = headers)
        soup_new = BeautifulSoup(response.text, "lxml")
        time_out = soup_new.find(class_='timeout-img')
        time_out_num = 0
        flag = True
        while time_out != None:
            time_out_num += 1
            time.sleep(1)
            print(next_page)
            print(' timeout...')
            response = requests.get(next_page,headers = headers2)
            soup_new = BeautifulSoup(response.text, "lxml")
            # Check for timeout
            time_out = soup_new.find(class_='timeout-img')
            if time_out_num > 10:
                fail_num += 1
                flag = False
                print('fail...')
                break
        if flag:
            c_fonts = soup_new.find_all("h3", class_="t c_font")
            for c_font in c_fonts:
                url = c_font.find("a")["href"]
                urls_list.append(url)
            # If there is no new page, stop searching
            next_icon_soup = soup_new.find(id='page')
            if next_icon_soup == None:
                # If the pager is missing, the page is garbled; skip it
                print('None_for_next')
                continue
            next_icon_soup2 = next_icon_soup.find_all('a')[-1].find(class_='c-icon-pager-next')
            if next_icon_soup2 == None:
                break
    urls_list = set(urls_list)
    print("Total number of links")
    print(len(urls_list))
    print('Number of failed links:',fail_num*10)
    return urls_list



def get_item_info(url):
    content_details = requests.get(url,headers = headers)
    soup = BeautifulSoup(content_details.text, "lxml")
    # Extract article title
    try:
        title = ''.join(list(soup.select('#dtl_l > div > h3 > a')[0].stripped_strings))
    except(IndexError):
        title = ''


    # Extract summary
    try:
        abstract = list(soup.select('div.abstract_wr p.abstract')[0].stripped_strings)[0].replace("\u3000", ' ')
    except(IndexError):
        abstract = ''


    # Extract keywords
    try:
        key_words = ';'.join(key_word for key_word in list(soup.select('div.dtl_search_word > div')[0].stripped_strings)[1:-1:2])
    except(IndexError):
        key_words = ''
    return title,  abstract,  key_words

def get_all_data(urls_list):
    dit = defaultdict(list)
    num = 0
    len_list = len(urls_list)
    for url in urls_list:
        num += 1
        print('{}/{}'.format(num, len_list))
        title,abstract, key_words = get_item_info(url)
        if (len(str(title)) > 0 and len(str(abstract)) > 0 and len(str(key_words)) > 0):
            dit["title"].append(title)
            dit["abstract"].append(abstract)
            dit["key_words"].append(key_words)
    return dit

def save_csv(dit,num):
    data = pd.DataFrame(dit)
    print(data)
    columns = ["title",  "abstract", "key_words"]
    if num == 1:
        data.to_csv("data.csv", mode='a',index=False, columns=columns)
    else:
        data.to_csv("data.csv", mode='a', index=False , header=False)
    print("That's OK!")



if __name__ == "__main__":
    key_words=['beef']
    num=1
    for key_word in key_words:
        soup = driver_open(key_word)
        urls_list = page_url_list(soup, page=100)
        dit = get_all_data(urls_list)
        save_csv(dit,num)
        num+=1

