python learning notes | crawling examples, Douban film top250


The purpose of writing this article is to review and learn with you. The example code in the video takes the data of the Douban film top250 as an example. I will mark the code according to what I learned from the video for ease of understanding and learning from the video .

1, Code composition

First, we need to introduce four libraries

import re
from bs4 import BeautifulSoup
import xlwt
import urllib.request

Then, define the main function, put the functions defined by crawling web pages, parsing data and saving data in the main function, and the later code can be executed at run time; baseurl is the home page link we want to crawl, then assign the defined data acquisition function to the datalist, set the savepath save path, and finally define the savedata function to save all data. At the end of the code, don't forget if name = = 'main': main(), so that the main function can be run.

def main():
    baseurl = ''
    # 1. Crawl web pages
    datalist = getData(baseurl)
    savepath = '.\\Douban film Top250.xlsx'
    # 3. Save data
if __name__ == '__main__':  # When the program is executed

2, Crawl web pages

Define the getData function, which is mainly used to call the askurl(url) function to obtain the page information, and then put it in the html. An empty list is defined in the datalist, which is used to store the data.

def getData(baseurl):
    datalist = []
    for i in range(0,10):  # Call the function to obtain page information, 10 times
        url = baseurl+str(i*25)
        html = askurl(url)  # Save the obtained web page source code

Define a function askurl(url) to simulate the server sending a request to Douban, and then save the web page source code sent by Douban.

def askurl(url):
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
    req = urllib.request.Request(url,headers = head)    
html = ''
        response = urllib.request.urlopen(req)
        html ='utf-8')
        # print(html)
    except Exception as er:
        if hasattr(er,'code'):
        if hasattr(er,'reason'):
return html

The urllib.request.Request() function is mainly used here. It requires two parameters, url and headers, to simulate the browser sending a request. The head can be found in the web page we want to crawl (right-click "check", refresh the web page, click network, and user agent)

Then define an empty variable html to store the web page source code. The urllib.request.urlopen function is to receive the information sent by the web page, then decode it into UTF-8 format with'utf-8 '), and then process it in python. try... except Exception as er is a trial and error function. If the code runs incorrectly, an error will be printed out. If the askurl(url) function is run, it will print out as shown in the following figure, and the web page content is preliminarily obtained

3, Parse data

The above content will be passed back to the GetData (base URL) function, so the steps of parsing data are completed in this function.
Global variables are defined before getData to parse data. Taking findLink as an example, how to find the movie link, movie name and other information we need in the html web page source code? This requires re.compile() function to define a pattern first, and then re.findall() function to find it. In the web source code, the detailed link of the film is as follows

We imitate its format
The contents of single quotation marks represent the patterns of all links to be found, where. *? It is a regular expression operator, as shown in the figure below. Adding an r in front can keep all the characters of the found link. Other picture links and titles of the film are also written in this way.

# Movie details link
findLink = re.compile(r'<a href="(.*?)">')  # Create regular expression objects that represent rules (patterns of strings)
# Links to movie pictures
findImgsrc = re.compile(r'<img.*src="(.*?)"',re.S)  # re.S include line breaks in characters
# Film title
findTitle = re.compile(r'span class="title">(.*?)</span>')
# score
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
# Number of evaluators
findJudge = re.compile(r'<span>(\d*)Human evaluation</span>')
# survey
findInq = re.compile(r'<span class="inq">(.*)</span>')
# Find the content of the movie
findBd = re.compile(r'<p class="">(.*?)</p>',re.S)

BeautifulSoup is mainly used to parse the data. First, use html.parser to parse the web page source code in the previously obtained HTML, because there are too many contents in HTML. If you don't narrow the search range, you will extract a lot of duplicate information. Open the console on the web page and we can see div ', class_= The "item" tab contains all the information of the film, so we use find_all() function, find 'div' and class_= The content in "item" to avoid repetition.

        soup = BeautifulSoup(html,'html.parser')
        # soup.find_all, find the string that meets the requirements to form a list
        for item in soup.find_all('div',class_="item"):  # class should be underlined to indicate the attribute value

Str(item) converts the contents into string type, which is easier to find. Then use the re.findall() function (different from the find_all function, don't write it wrong) to find the string in the item with the same findLink mode, [0] represents the first string found; Then use append(link) to append the string to the data list, as are other picture links, picture links, film titles, etc.
Note that the movie name has Chinese and foreign data, while some movie foreign names are empty. At this time, it is necessary to process and place an error message. Classify them with if... else statements. If the foreign name is empty, add the upper character data.append(')

            data = []  # Save all the information about a movie
            item = str(item)  # Convert to string type
            # Linkget the first link to the movie
            link = re.findall(findLink,item)[0]  # The re library is used to find the specified string through regular expression
            imgSrc = re.findall(findImgsrc,item)[0]
            titles = re.findall(findTitle,item)
                ctitle = titles[0]
                otitle = titles[1].replace('/','')  # . replace remove irrelevant symbols
                data.append(' ')  # If there is no foreign name, leave it blank to avoid mistakes
            rating = re.findall(findRating,item)[0]
            judgeNum = re.findall(findJudge,item)
            inq = re.findall(findInq,item)
                inq = inq[0].replace('. ','')  # Remove the period
                data.append(' ')

bd represents the relevant content of the movie, as shown in the figure below. re.sub('content to be replaced', 'content after replacement', variable) function can be used to set / and
Replace it and put it in data.
When you put bd into data, you can use the strip() function to remove the space between the front and back

bd = re.findall(findBd,item)[0]
            bd = re.sub('<br(\s+)?/>(\s+)?',' ',bd)  # Remove < br / >
            bd = re.sub('/',' ',bd)
            data.append(bd.strip())  # Remove the spaces before and after
            datalist.append(data)  # Put all the information into the datalist
    return datalist

Runtime effects

4, Save data

Define the saveData(datalist,savepath) function. The parameters datalist and savepath have been handled previously. xlwt is mainly used to save the data to Excel. First, use the Workbook function (the initial letter should be capitalized) to create an Excel object in utf-8 code, and then name the worksheet as Douban movie Top250.xlsx, cell_overwrite_ok=True indicates that the written content can be overwritten. col is the table column title, which is written to the cells one by one with a for loop

def saveData(datalist,savepath):
    book = xlwt.Workbook(encoding='utf-8')  # Create a workbook object with uppercase letters
sheet = book.add_sheet('Douban film Top250.xlsx',cell_overwrite_ok=True)
col = ('Movie link','pictures linking','Movie title (Chinese)','Film name (foreign language)','score','Number of evaluations','survey','Relevant information')
    for i in range(0,8):
        sheet.write(0,i,col[i])  # Print column headers

The data in the datalist is written in two for loops, with a total of 250 rows and 8 columns in each row. Finally, it is saved. The path is. \ Douban movie Top250.xlsx in the main function, which means that it is directly saved under the current project of python.

    for i in range(0,250):
        print("The first%d strip" %(i+1))
        data = datalist[i]
        for j in range(0,8):

The final running result will generate xlsx file on the left, which can be opened in Excel

Finally, pay attention to check whether the indentation is wrong when writing all the code, which is often easy to ignore.
Let's study together~

Tags: Python Database crawler

Posted on Tue, 30 Nov 2021 14:06:37 -0500 by Whitestripes9805