Download a complete novel from Vertex Novel: a Python crawler

This program is just for practice. The site is a free novel website that I often read novels on, which is why I chose it. It also has its own download function that supports various formats (TXT, CHM, UMD, JAR, APK, HTML), so it probably has no anti-crawling measures in place, and I only set the request headers. The content is saved in TXT format.

The content involves the use of requests (including an encoding problem), XPath, string handling (split() produces a list, which is written out item by item to get the line-break effect), and file I/O.
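To make the encoding and string-handling steps concrete, here is a minimal sketch that uses only the standard library. The helper name `format_chapter` and the GBK sample bytes are assumptions for illustration; the script itself detects the encoding with `r.apparent_encoding` and extracts the text with XPath.

```python
def format_chapter(title, raw, encoding):
    """Decode raw chapter bytes and lay them out one paragraph per line.

    Hypothetical helper: it mirrors the layout the script writes to the
    TXT file (three tabs before the title, four spaces before each line).
    """
    # Decoding with the wrong codec garbles Chinese text, so the page
    # encoding has to be known before the text is processed.
    text = raw.decode(encoding, errors="replace")
    # split() with no arguments splits on any Unicode whitespace,
    # including the full-width space U+3000 used for paragraph indents.
    paragraphs = text.split()
    lines = ["\t" * 3 + title]
    lines += [" " * 4 + p for p in paragraphs]
    return "\n".join(lines) + "\n"


# Sample bytes standing in for a downloaded chapter (assumed GBK-encoded).
sample = "\u3000\u3000第一段。\r\n\u3000\u3000第二段。".encode("gbk")
print(format_chapter("Chapter 1", sample, "gbk"))
```

Because split() discards all whitespace, the original indentation is lost, which is why each paragraph gets a fresh four-space indent when it is written out.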

Vertex novel: https://www.booktxt.net

What the code does: enter the name of a novel; if it exists on Vertex Novel, it is downloaded directly. The full script is as follows:

# -*- coding:utf-8 -*-
import sys
from urllib.parse import urljoin

import requests
from lxml import etree

novel_name = ''  # Global variable storing the novel name
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}


def get_url(name):
    '''
    Find the novel's page on Vertex Novel through Baidu
    name: Novel name
    '''
    # "site: booktxt.net" + the novel name restricts the search to that site
    baidu = 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=site%3A%20booktxt.net%20' + name

    # Get the novel's URL on Vertex Novel
    r = requests.get(baidu, headers=headers)
    html = etree.HTML(r.content)
    try:
        # Extract the first result link; exit the program if there is none
        url = html.xpath('//*[@id="1"]/h3/a/@href')[0]
        url = requests.get(url, headers=headers).url  # Follow the redirect
    except (IndexError, requests.RequestException):
        print("The novel doesn't exist!")
        sys.exit(0)
    if url[-4:] == 'html':  # Search result is a chapter page: invalid result
        print("The novel doesn't exist!")
        sys.exit(0)
    get_chapter(url)  # Fetch the novel's chapters


def get_chapter(url):
    '''
    Get the title of the novel that was found and ask whether to download it

    :param url: Link to the novel
    '''
    global novel_name

    r = requests.get(url=url, headers=headers)
    coding = r.apparent_encoding  # Detect the page encoding

    html = etree.HTML(r.content, parser=etree.HTMLParser(encoding=coding))

    novel_name = html.xpath('//*[@id="info"]/h1/text()')[0]
    print('The title of the novel: ' + novel_name + '\nDownload? yes/no\n')
    flag = input()
    if flag == 'no':
        print('Exit system')
        sys.exit(0)

    # Chapter list, skipping the first 8 <dd> entries
    chapters = html.xpath('//*[@id="list"]/dl/dd[position()>8]')
    for item in chapters:
        chapter_name = item.xpath('./a')[0].text  # Name of each chapter
        print(chapter_name)
        link = item.xpath('./a/@href')[0]  # Link to each chapter
        full_link = urljoin(url, link)  # Full address of each chapter
        print(full_link)
        get_text(chapter_name, full_link)


def get_text(name, link):
    '''
    Get the content of a chapter and append it to the TXT file
    :param name: Chapter name
    :param link: Chapter link
    '''
    r = requests.get(url=link, headers=headers)
    coding = r.apparent_encoding

    html = etree.HTML(r.content, parser=etree.HTMLParser(encoding=coding))
    # Get the chapter text and split it on whitespace into a list of strings
    text = html.xpath('string(//*[@id="content"])').split()
    # Append to "<novel name>.txt", creating the file if needed
    with open('{}.txt'.format(novel_name), 'a', encoding='utf-8') as f:
        f.write('\t' * 3 + name + '\n')  # Chapter name
        for line in text:
            f.write(' ' * 4 + line + '\n')


if __name__ == '__main__':
    novel_name = input('Please enter a novel name:')
    get_url(novel_name)
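The URL handling in the script (building the site-restricted Baidu query, rejecting chapter pages, and forming full chapter links) can be tried out without any network access. The following is a sketch under assumptions: the helper names `build_search_url`, `is_chapter_page` and `chapter_url` are hypothetical, the example paths are made up, and `urlencode` is used here in place of the script's manual string concatenation.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlsplit

SITE = "booktxt.net"


def build_search_url(name):
    """Build a Baidu query restricted to SITE (hypothetical helper).

    urlencode percent-escapes the novel name, so Chinese titles and
    spaces are safe to pass in.
    """
    query = urlencode({"ie": "utf-8", "wd": "site:{} {}".format(SITE, name)})
    return "https://www.baidu.com/s?" + query


def is_chapter_page(url):
    """Return True when the URL path ends in 'html'.

    The script treats such a search result as a single chapter rather
    than the novel's index page, and rejects it.
    """
    return urlsplit(url).path.endswith("html")


def chapter_url(index_url, href):
    """Resolve a chapter href (relative or absolute) against the index URL."""
    return urljoin(index_url, href)


# Made-up example paths, for illustration only.
index = "https://www.booktxt.net/1_1234/"
print(build_search_url("三体"))
print(is_chapter_page(index))
print(chapter_url(index, "567.html"))
```

Using urljoin gives the same result as plain concatenation when the href is relative and the index URL ends with a slash, and it also works if the site returns absolute paths.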

4 December 2019, 05:10 | Views: 5821
