Download a complete novel from "Top Novel" (booktxt.net) -- a Python crawler

This program is just for practice. First of all, it is a free novel website; I often read novels there, which is why I chose it. Moreover, the site actually has its own download function and supports several formats (TXT, CHM, UMD, JAR, APK, HTML), so it probably has no anti-crawling measures in place, and I only set a request header. The downloaded content is saved in TXT format.

The code covers: the use of requests (and the page-encoding problem), the use of XPath, string handling (split() turns the chapter text into a list, which is used to restore line breaks), and file I/O.
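To make the encoding problem concrete, here is a minimal sketch of those pieces in isolation (the chapter URL below is a placeholder, not a real page):

import requests
from lxml import etree

link = 'https://www.booktxt.net/0_1/1.html'  # placeholder chapter URL
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get(link, headers=headers)
print(r.encoding)           # encoding declared in the response headers, often missing or wrong
print(r.apparent_encoding)  # encoding sniffed from the body, e.g. GBK/GB2312

# Parse the raw bytes with the sniffed encoding; using r.text here would
# decode with r.encoding and can produce mojibake on GBK pages.
html = etree.HTML(r.content, parser=etree.HTMLParser(encoding=r.apparent_encoding))

# string() flattens the content div into one string; split() then yields a
# list of paragraphs, which can be written back out one per line.
paragraphs = html.xpath('string(//*[@id="content"])').split()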

Top Novel: https://www.booktxt.net

What the code does: enter the name of a novel; if it exists on Top Novel, the whole book is downloaded into a single TXT file.


# -*- coding:utf-8 -*-
import requests
from lxml import etree

novel_name = ''  # Global variable storing the novel's name
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
def get_url(name):
    '''
    Find the novel's page on Top Novel via a Baidu site: search.
    name: novel name
    '''
    # site:booktxt.net + the novel name restricts the search to the target site
    baidu = 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=site%3A%20booktxt.net%20' + name

    # Fetch the Baidu results page and pull out the novel's URL
    r = requests.get(baidu, headers=headers)
    html = etree.HTML(r.content)
    try:
        # The first result's href is a baidu.com/link?url=... redirect;
        # following it and reading .url yields the real booktxt.net address.
        # If there is no result, the xpath list is empty and [0] raises.
        url = html.xpath('//*[@id="1"]/h3/a/@href')[0]
        url = requests.get(url, headers=headers).url
    except (IndexError, requests.RequestException):
        print("The novel doesn't exist!")
        exit(0)
    if url[-4:] == 'html':  # The hit is a single chapter page, not a novel index
        print("The novel doesn't exist!")
        exit(0)
    get_chapter(url)  # Fetch the novel's chapter list
def get_chapter(url):
    '''
    Show the title of the novel that was found and ask whether to download it.

    :param url: link to the novel's index page
    '''
    global novel_name

    r = requests.get(url=url, headers=headers)
    coding = r.apparent_encoding  # Encoding sniffed from the page body

    html = etree.HTML(r.content, parser=etree.HTMLParser(encoding=coding))

    novel_name = html.xpath('//*[@id="info"]/h1/text()')[0]
    print('The title of the novel: ' + novel_name + '\nDownload? yes/no\n')
    flag = input()
    if flag == 'no':
        print('Exit system')
        exit(0)

    # Chapter list; position()>8 skips the first few <dd> entries, which
    # hold the "latest chapters" links duplicated above the full list
    chapters = html.xpath('//*[@id="list"]/dl/dd[position()>8]')
    for item in chapters:
        chapter_name = item.xpath('./a')[0].text  # Name of the chapter
        print(chapter_name)
        link = item.xpath('./a/@href')[0]  # Relative link to the chapter
        full_link = url + link  # Full chapter address
        print(full_link)
        get_text(chapter_name, full_link)
def get_text(name, link):
    '''
    Fetch the content of one chapter and append it to the TXT file.
    :param name: chapter name
    :param link: chapter link
    :return:
    '''

    r = requests.get(url=link, headers=headers)
    coding = r.apparent_encoding

    html = etree.HTML(r.content, parser=etree.HTMLParser(encoding=coding))
    # string() flattens the content div into one string; split() breaks it
    # on whitespace into a list, one paragraph per item
    text = html.xpath('string(//*[@id="content"])').split()
    # print(text)
    # Append to the <novel name>.txt file
    with open('{}.txt'.format(novel_name), 'a+', encoding='utf-8') as f:
        f.write('\t' * 3 + name + '\n')  # Chapter name
        for line in text:
            f.write(' ' * 4 + line + '\n')
if __name__ == '__main__':
    novel_name = input('Please enter a novel name:')
    get_url(novel_name)
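One design note: get_text reopens the output file in 'a+' (append) mode for every chapter, so running the script a second time appends a second copy of the book to the same file. A possible restructuring (my sketch, not the original code; it reuses the headers defined above) opens the file once per run in 'w' mode and folds the two functions together:

def download_novel(url):
    '''Variant of get_chapter/get_text that opens the output file once.'''
    r = requests.get(url=url, headers=headers)
    html = etree.HTML(r.content, parser=etree.HTMLParser(encoding=r.apparent_encoding))
    name = html.xpath('//*[@id="info"]/h1/text()')[0]

    # 'w' truncates any file left over from an earlier run
    with open('{}.txt'.format(name), 'w', encoding='utf-8') as f:
        for item in html.xpath('//*[@id="list"]/dl/dd[position()>8]'):
            chapter_name = item.xpath('./a')[0].text
            full_link = url + item.xpath('./a/@href')[0]

            page = requests.get(full_link, headers=headers)
            chapter = etree.HTML(page.content,
                                 parser=etree.HTMLParser(encoding=page.apparent_encoding))
            f.write('\t' * 3 + chapter_name + '\n')
            for line in chapter.xpath('string(//*[@id="content"])').split():
                f.write(' ' * 4 + line + '\n')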

