Tools:
Python 3 + PyCharm + Chrome
Modules used:
(1) requests: sends simple HTTP requests.
(2) lxml: an HTML parsing library that is faster and more powerful than BeautifulSoup.
(3) pandas: a powerful library for data processing.
(4) time: sets the interval between crawler requests.
(5) random: generates random numbers to vary that interval together with time.
(6) tqdm: displays the program's progress.
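As a hedged sketch of how time, random, and tqdm work together in a polite crawl loop (fetch_page is a hypothetical placeholder for the request logic shown later):

import time
import random
from tqdm import tqdm

def fetch_page(page):
    # hypothetical stand-in for the real request logic shown later
    pass

for page in tqdm(range(10)):            # tqdm wraps the iterable and draws a progress bar
    fetch_page(page)
    time.sleep(random.randrange(6, 9))  # sleep 6-8 seconds so requests are spaced out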
Steps:
1. Open the short-comment page of the Douban movie "I Am Not the God of Medicine", right-click and choose Inspect (or press F12), then select a user name and a comment to see the corresponding part of the page's code highlighted.
2. Send a GET request with the requests module and set the response encoding to utf-8; then add a check on whether the resource was fetched successfully (status code 200) and print the result. We use lxml to parse the short comments of "I Am Not the God of Medicine": find the element located in step 1, right-click it, and choose Copy > Copy XPath to get its path. The crawled comments may have extra spaces at the beginning and end, so we use the string method strip() to remove them. A minimal sketch of this step follows.
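A minimal sketch of this step, assuming the first page of the movie's comment URL from the tutorial (no headers and only basic error reporting):

import requests
from lxml import etree

url = "https://movie.douban.com/subject/26752088/comments?start=0&limit=20&sort=new_score&status=P"
res = requests.get(url)
res.encoding = "utf-8"  # Douban serves utf-8; set it explicitly to avoid garbled text
if res.status_code == 200:
    print("Page fetched successfully!")
else:
    print("Fetch failed with status code:", res.status_code)
x = etree.HTML(res.text)  # parse the HTML so XPath queries can be run against it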
The code to get the user name and short comment is as follows:
name = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a/text()'.format(i))
content = x.xpath('//*[@id="comments"]/div[{}]/div[2]/p/text()'.format(i))
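Since xpath() returns a list, a hedged expansion of this snippet checks that the result is non-empty before indexing and then strips the surrounding whitespace (x is the parsed tree from the step-2 sketch above):

name_list, content_list = [], []
for i in range(1, 21):  # Douban shows 20 comments per page
    name = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a/text()'.format(i))
    content = x.xpath('//*[@id="comments"]/div[{}]/div[2]/p/text()'.format(i))
    if name and content:  # skip slots where the XPath matched nothing
        name_list.append(name[0])
        content_list.append(content[0].strip())  # remove leading/trailing whitespace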
3. After obtaining the data, we build a dictionary from the two lists, construct a DataFrame from the dictionary, and write the data to a CSV file with the pandas module. The code is as follows:
infos = {'name': name_list, 'content': content_list}
data = pd.DataFrame(infos, columns=['name', 'content'])
data.to_csv("Douban I am not the God of Medicine.csv")
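Two optional to_csv arguments (standard pandas options, not used in the tutorial's code) are worth knowing: index=False drops the automatic row-index column, and encoding="utf_8_sig" writes a byte-order mark so that Excel displays the Chinese text correctly:

data.to_csv("Douban I am not the God of Medicine.csv", index=False, encoding="utf_8_sig")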
Running screenshot:
Here we crawled the user names and comment text from ten pages of short comments; with the help of the tqdm module, we can clearly see the progress of the crawl.
Screenshot of some short comments:
Final code:
# Crawl short reviews of the movie "I Am Not the God of Medicine"
import requests
from lxml import etree
from tqdm import tqdm
import time
import random
import pandas as pd

name_list, content_list = [], []


def get_content(page):
    # start=page*20 pages through the comments, 20 per page
    url = "https://movie.douban.com/subject/26752088/comments?start={}&limit=20&sort=new_score&status=P&percent_type=" \
        .format(page * 20)
    res = requests.get(url)
    res.encoding = "utf-8"
    if res.status_code == 200:
        print("\nPage {} of short comments crawled successfully!".format(page + 1))
    else:
        print("\nPage {} crawling failed!".format(page + 1))

    x = etree.HTML(res.text)
    for i in range(1, 21):  # each page holds 20 comments
        name = x.xpath('//*[@id="comments"]/div[{}]/div[2]/h3/span[2]/a/text()'.format(i))
        content = x.xpath('//*[@id="comments"]/div[{}]/div[2]/p/text()'.format(i))
        if name and content:  # skip entries the XPath did not match
            name_list.append(name[0])
            content_list.append(content[0].strip())  # strip leading/trailing spaces


if __name__ == '__main__':
    for i in tqdm(range(0, 10)):  # crawl ten pages with a progress bar
        get_content(i)
        time.sleep(random.randrange(6, 9))  # pause 6-8 seconds between pages
    infos = {'name': name_list, 'content': content_list}
    data = pd.DataFrame(infos, columns=['name', 'content'])
    data.to_csv("Douban I am not the God of Medicine.csv")
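If Douban starts returning non-200 responses, one common cause is the missing browser User-Agent header; a hedged fix (the header string below is only an example) is to pass headers to requests.get:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
res = requests.get(url, headers=headers)

Note also that Douban typically serves only the first ten or so pages of short comments to visitors who are not logged in, which matches the ten pages crawled here.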