Python 3 crawler: crawling the Douban Movie Top 250

The first target site is the Douban Movie Top 250. The URL is: https://movie.douban.com/top250?start=0&filter=

Look at the parameters after the '?' in the URL. The first parameter, start, is the offset of the first movie on the page rather than a page number: start=0 gives the first page, start=25 gives the second page, and so on in steps of 25.
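A quick sketch to make the mapping concrete (these are offsets, not page numbers):

# start jumps in steps of 25: offset 0 is page 1, offset 25 is page 2, ...
for n in range(10):
    print("page %d: https://movie.douban.com/top250?start=%d&filter=" % (n + 1, n * 25))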

 

1. Page analysis:

[Screenshot: the Top 250 page in the browser]

Identify the elements to be crawled: ranking, name, director, one-line comment, and rating. Chrome's developer tools can be used to inspect where each element lives in the page.

 

Each movie's information sits inside its own <li></li> element, all of them under a single <ol>.

[Screenshot: location of the target elements in the developer tools]

With the target elements identified, we can start crawling.

2. Crawling:

Tools:

  Python3

  requests

  BeautifulSoup
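If these are not installed yet, they can usually be added with pip install requests beautifulsoup4 lxml (lxml is the parser name passed to BeautifulSoup below).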

 

1. Get information about each movie

def get_html(web_url):  # fetching the page is the routine part of any crawler
    header = {
        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16"}
    html = requests.get(url=web_url, headers=header).text  # without .text we would get a Response object; .text gives a string
    Soup = BeautifulSoup(html, "lxml")
    data = Soup.find("ol").find_all("li")  # return only what we need: every <li> under the <ol>
    return data

 

requests.get() returns a Response object for the URL passed in.

The .text attribute gives the response body as a str.

find_all() then collects every <li> tag under the <ol> tag in the parsed HTML.
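Here is a minimal, self-contained sketch of that filtering step, run against a made-up HTML fragment instead of a live request:

from bs4 import BeautifulSoup

# A tiny fragment imitating the structure of the Top 250 list page.
html = """
<ol>
  <li><em>1</em><span>Movie A</span></li>
  <li><em>2</em><span>Movie B</span></li>
</ol>
"""
soup = BeautifulSoup(html, "lxml")
data = soup.find("ol").find_all("li")
print(len(data))                      # 2 -- one <li> per movie
print(data[0].find("em").get_text())  # 1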

 

2. Extract the information and save it to a text file

def get_info(all_move):
    f = open("F:\\Pythontest1\\douban.txt", "a")

    for info in all_move:
        # ranking
        nums = info.find('em')
        num = nums.get_text()

        # name: the first <span> in each <li> is the title, so find() is enough
        names = info.find("span")
        name = names.get_text()

        # director: this <p> block contains many characters that have to be stripped
        charactors = info.find("p")
        charactor = charactors.get_text().replace(" ", "").replace("\n", "")  # tidy the text up
        charactor = charactor.replace("\xa0", "").replace("\xee", "").replace("\xf6", "").replace("\u0161", "").replace(
            "\xf4", "").replace("\xfb", "").replace("\u2027", "").replace("\xe5", "")

        # comment: some movies have no one-line comment, so check before indexing
        remarks = info.find_all("span", {"class": "inq"})
        if remarks:
            remark = remarks[0].get_text().replace("\u22ef", "")
        else:
            remark = "There is no comment on this movie"
        print(remarks)  # debug output: the matched comment spans

        # score
        scores = info.find_all("span", {"class": "rating_num"})
        score = scores[0].get_text()

        f.write(num + ',')
        f.write(name + "\n")
        f.write(charactor + "\n")
        f.write(remark + "\n")
        f.write(score)
        f.write("\n\n")

    f.close()  # remember to close the file
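As a side note, an easy way to never forget f.close() is a with block, which closes the file automatically even if an exception occurs. A minimal sketch of the same function's skeleton:

def get_info(all_move):
    # "with" closes the file automatically when the block is left
    with open("F:\\Pythontest1\\douban.txt", "a") as f:
        for info in all_move:
            ...  # same field extraction and f.write() calls as above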

 

Note that some of the scraped text contains "illegal" characters, in the sense that writing them to the text file fails, so they are stripped with the replace() function.

The rest should be self-explanatory~
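Incidentally, these characters are only "illegal" relative to the encoding the file was opened with: on Windows, open() defaults to the system locale encoding, which often cannot represent characters such as \u0161. One possible alternative, assuming a UTF-8 output file is acceptable downstream, is to open the file as UTF-8 so nothing needs to be stripped:

# Assumption: whatever reads douban.txt later accepts UTF-8.
f = open("F:\\Pythontest1\\douban.txt", "a", encoding="utf-8")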

 

3. Full code

from bs4 import BeautifulSoup
import requests
import os


def get_html(web_url):  # fetching the page is the routine part of any crawler
    header = {
        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16"}
    html = requests.get(url=web_url, headers=header).text  # without .text we would get a Response object; .text gives a string
    Soup = BeautifulSoup(html, "lxml")
    data = Soup.find("ol").find_all("li")  # return only what we need: every <li> under the <ol>
    return data


def get_info(all_move):
    f = open("F:\\Pythontest1\\douban.txt", "a")

    for info in all_move:
        # ranking
        nums = info.find('em')
        num = nums.get_text()

        # name: the first <span> in each <li> is the title, so find() is enough
        names = info.find("span")
        name = names.get_text()

        # director: this <p> block contains many characters that have to be stripped
        charactors = info.find("p")
        charactor = charactors.get_text().replace(" ", "").replace("\n", "")  # tidy the text up
        charactor = charactor.replace("\xa0", "").replace("\xee", "").replace("\xf6", "").replace("\u0161", "").replace(
            "\xf4", "").replace("\xfb", "").replace("\u2027", "").replace("\xe5", "")

        # comment: some movies have no one-line comment, so check before indexing
        remarks = info.find_all("span", {"class": "inq"})
        if remarks:
            remark = remarks[0].get_text().replace("\u22ef", "")
        else:
            remark = "There is no comment on this movie"
        print(remarks)  # debug output: the matched comment spans

        # score
        scores = info.find_all("span", {"class": "rating_num"})
        score = scores[0].get_text()

        f.write(num + ',')
        f.write(name + "\n")
        f.write(charactor + "\n")
        f.write(remark + "\n")
        f.write(score)
        f.write("\n\n")

    f.close()  # remember to close the file


if __name__ == "__main__":
    if not os.path.exists("F:\\Pythontest1"):  # create the output folder if it does not exist
        os.mkdir("F:\\Pythontest1")
    if os.path.exists("F:\\Pythontest1\\douban.txt"):  # delete any file left over from a previous run
        os.remove("F:\\Pythontest1\\douban.txt")

    page = 0  # start offset: the Top 250 spans 10 pages of 25 movies, so offsets run 0..225
    while page <= 225:
        web_url = "https://movie.douban.com/top250?start=%s&filter=" % page
        all_move = get_html(web_url)  # the <li> blocks for one page
        get_info(all_move)  # extract the fields and append them to the local file
        page += 25
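As a stylistic aside, the same paging loop can be written with range(), which makes the 10-pages-of-25 structure explicit; a minimal equivalent sketch:

if __name__ == "__main__":
    # range(0, 250, 25) yields the offsets 0, 25, ..., 225
    for page in range(0, 250, 25):
        web_url = "https://movie.douban.com/top250?start=%s&filter=" % page
        get_info(get_html(web_url))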
