I have recently been learning to use XPath to parse web pages, so I tried crawling blogs on Jianshu. The first step is crawling an individual user's blog articles; after that, I will try to crawl the articles of the first 100 pages of Jianshu's user list.
The first task is crawling an individual user's article information, including: title, article link, abstract, beta, read count, comment count, number of likes, and publish time.
Without further ado, here is the code:
import requests
from lxml import etree
import os


# Example of an absolute XPath copied from the browser:
# /html/body/div[1]/div/div[1]/div[1]/div[2]/ul/li[3]/div/a/p
def getResponse(url):
    # Disguise the request as coming from a normal browser
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0',
        'Connection': 'close'}
    try:
        r = requests.get(url, headers=header, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r
    except requests.RequestException:
        return None


def ResponseParse(r, alist):
    if r:
        dom = etree.HTML(r.text)
        base_site = 'https://www.jianshu.com'
        # print(str(dom.xpath('/html/body/div[1]/div/div[1]/div[2]/ul/li[4]/div/div/span[3]/text()')))
        # XPath of each <li> in the article list, and of the fields inside it
        li_path = './/div[@id="list-container"]/ul/li'
        title_path = './/a[@class="title"]/text()'
        href_path = './/a/@href'
        abstract_path = './/div/p[@class="abstract"]/text()'
        beta_path = './/div/div/span[@class="jsd-meta"]/i[@class="iconfont ic-paid1"]'
        read_path = './/div/div/a/i[@class="iconfont ic-list-read"]'
        comment_path = './/div/div/a/i[@class="iconfont ic-list-comments"]'
        love_path = './/div/div/span/i[@class="iconfont ic-list-like"]'
        time_path = './/div/div/span[@class="time"]/@data-shared-at'
        lis = dom.xpath(li_path)
        for li in lis:
            article_list = []
            # print(etree.tostring(li, method='html', encoding='utf-8'))
            title = li.xpath(title_path)[0]
            href = li.xpath(href_path)[0]
            abstract = li.xpath(abstract_path)[0]
            link = base_site + href
            if title:
                article_list.append(title)
                print("Title:" + str(title))
            else:
                article_list.append("nothing")
            if link:
                print("Links:" + link)
                article_list.append(link)
            else:
                article_list.append("None")
            if abstract:
                print("Abstract:" + abstract.strip().replace("\n", ""))
                article_list.append(abstract.strip().replace("\n", ""))
            else:
                print("Abstract:" + "nothing")
                article_list.append("nothing")
            # Locate the icon tag first, then use '..' to step back to its parent
            # and read the text next to the icon.
            betas_ = li.xpath(beta_path)
            if betas_:
                betas = betas_[0].xpath('../text()')
                print("beta:", str(betas[1]).strip().replace("\n", ""))
                article_list.append(str(betas[1]).strip().replace("\n", ""))
            else:
                print("beta:", 0)
                article_list.append("0")
            reads_ = li.xpath(read_path)
            if reads_:
                reads = reads_[0].xpath('../text()')
                print("Reading:", str(reads[1]).strip().replace("\n", ""))
                article_list.append(str(reads[1]).strip().replace("\n", ""))
            else:
                print("Reading:", 0)
                article_list.append("0")
            comments_ = li.xpath(comment_path)
            if comments_:
                comments = comments_[0].xpath('../text()')
                print("Commentary:", str(comments[1]).strip().replace("\n", ""))
                article_list.append(str(comments[1]).strip().replace("\n", ""))
            else:
                print("Commentary:", 0)
                article_list.append("0")
            likes_ = li.xpath(love_path)
            if likes_:
                likes = likes_[0].xpath('../text()')
                print("Favorite number:", str(likes[0]).strip().replace("\n", ""))
                article_list.append(str(likes[0]).strip().replace("\n", ""))
            else:
                print("Favorite number:", 0)
                article_list.append("0")
            # The publish time is only available as the data-shared-at attribute
            time = li.xpath(time_path)
            print("Published:" + str(time)[2:-8])
            article_list.append(str(time)[2:-8])
            print('\n')
            alist.append(article_list)
        return len(lis)
    else:
        print("Crawling failed!")
        return 0


def Get_Total_Article(url):
    # Read the author's total article count from the profile page
    r = getResponse(url)
    dom = etree.HTML(r.text)
    article_num = int(dom.xpath('//div[@class="info"]//li[3]//p/text()')[0].strip())
    return article_num


def Get_author_name(url):
    # Absolute XPath for reference: /html/body/div[1]/div/div[1]/div[1]/div[1]/a
    r = getResponse(url)
    dom = etree.HTML(r.text)
    author_name = dom.xpath('.//div[@class="main-top"]/div[@class="title"]/a[@class="name"]/text()')[0].strip()
    return author_name


def WriteTxt(alist, name):
    save_dir = os.path.join(os.getcwd(), 'Article list')
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)
    save_name = os.path.join(save_dir, name)
    # chr(12288) is a full-width space used as the padding character
    out = ("Title:{0:{8}<10}\nLinks:{1:{8}<10}\nAbstract:{2:{8}<20}\nbeta:{3:{8}<10}\n"
           "Reading:{4:{8}<10}\nCommentary:{5:{8}<10}\nFavorite number:{6:{8}<10}\nRelease time:{7:{8}<10}\n")
    with open(save_name, 'w', encoding="utf-8") as f:
        for i in range(len(alist)):
            f.write(out.format(alist[i][0], alist[i][1], alist[i][2], alist[i][3],
                               alist[i][4], alist[i][5], alist[i][6], alist[i][7], chr(12288)))
            f.write("\n")
    print("Data written successfully:" + save_name)


def main():
    # Author homepage used as the example throughout this post
    url_ = 'https://www.jianshu.com/u/cbe095ae3d58'
    url = url_ + "?order_by=shared_at&page={}"
    print(url)
    author_name = Get_author_name(url.format(1))
    save_name = author_name + ".txt"
    article_total_count = Get_Total_Article(url.format(1))
    # Each page shows 9 articles, so round the page count up
    if article_total_count % 9 == 0:
        spider_num = article_total_count // 9
    else:
        spider_num = article_total_count // 9 + 1
    article_count = 0
    article_list = []
    for i in range(spider_num):
        spider_url = url.format(str(i + 1))
        r = getResponse(spider_url)
        article_count += ResponseParse(r, article_list)
    WriteTxt(article_list, save_name)
    print("Author:" + author_name + ", a total of " + str(article_count) + " articles!")


if __name__ == '__main__':
    main()

Program walkthrough
Next, several important methods in the program are explained.
getResponse(url) is used to fetch the response for a given link. Jianshu's anti-crawling measures are fairly simple: you only need to disguise the request headers as a browser.
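As a quick sanity check, here is a minimal usage sketch, assuming the code above is already defined and using the example profile URL from later in this post:

r = getResponse('https://www.jianshu.com/u/cbe095ae3d58')
if r:
    print(r.status_code)   # 200 when the disguised headers are accepted
    print(len(r.text))     # rough size of the returned HTML
else:
    print("Request failed")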
The core of the program is parsing the obtained response, which is implemented in ResponseParse(r, alist); the parameter alist is used to store the article information.
Take https://www.jianshu.com/u/cbe095ae3d58 as an example and open the page source in the browser developer tools (F12).
We can clearly see that the list information for every article on the personal homepage sits inside an <li> tag, so we write out the XPath for the <li> tags:
li_path = './/div[@id="list-container"]/ul/li'
This is used to get all the <li> tags. Looking more closely at the source, we can find the path of each piece of information we need relative to its <li> tag.
According to the observed source code information, you can write out the xpath of all the required information:
title_path = './/a[@class="title"]/text()'
href_path = './/a/@href'
abstract_path = './/div/p[@class="abstract"]/text()'
beta_path = './/div/div/span[@class="jsd-meta"]/i[@class="iconfont ic-paid1"]'
read_path = './/div/div/a/i[@class="iconfont ic-list-read"]'
comment_path = './/div/div/a/i[@class="iconfont ic-list-comments"]'
love_path = './/div/div/span/i[@class="iconfont ic-list-like"]'
time_path = './/div/div/span[@class="time"]/@data-shared-at'
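These relative paths can be tested without touching the network. The sketch below runs a few of them against a small hand-written HTML fragment that only mimics the structure of one article <li>; it is not Jianshu's real markup:

from lxml import etree

# Simplified stand-in for one entry of the article list
html = '''
<div id="list-container">
  <ul>
    <li>
      <div class="content">
        <a class="title" href="/p/abc123">Sample title</a>
        <p class="abstract">Sample abstract text</p>
      </div>
    </li>
  </ul>
</div>
'''
dom = etree.HTML(html)
for li in dom.xpath('.//div[@id="list-container"]/ul/li'):
    print(li.xpath('.//a[@class="title"]/text()')[0])          # Sample title
    print(li.xpath('.//a/@href')[0])                           # /p/abc123
    print(li.xpath('.//div/p[@class="abstract"]/text()')[0])   # Sample abstract text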
It is worth noting that for beta, read count, comment count, and like count, the XPath deliberately points one level deeper, at the <i> icon tag inside the element we actually need. While debugging I found that some articles have beta information and some do not, and likewise for likes; all of these counts sit in similar <span> and <a> tags, so selecting those tags by position causes them to get mixed up. Instead, the distinctive icon tag is used to locate the right element, and when the value is needed, '..' steps back up to the parent tag and reads the text it contains.
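A minimal illustration of this '..' trick, again on a simplified fragment rather than the real page:

from lxml import etree

# On the real page there is whitespace before the <i> icon,
# which is why ResponseParse reads text()[1].
frag = etree.HTML('<a class="meta">\n<i class="iconfont ic-list-read"></i> 1024\n</a>')
icon = frag.xpath('.//i[@class="iconfont ic-list-read"]')[0]
texts = icon.xpath('../text()')   # step back up to the parent <a> and take its text nodes
print(texts[1].strip())           # '1024'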
There is also a catch when getting the publish time: using './/div/div/span[@class="time"]/text()' does not return any time information, so instead we read the data-shared-at attribute value.
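A small sketch of the difference, on a made-up fragment (the timestamp value is illustrative):

from lxml import etree

frag = etree.HTML('<li><span class="time" data-shared-at="2019-03-01T10:20:30+08:00"></span></li>')
print(frag.xpath('.//span[@class="time"]/text()'))           # [] -- the tag has no text
print(frag.xpath('.//span[@class="time"]/@data-shared-at'))  # ['2019-03-01T10:20:30+08:00']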
The data is also cleaned while parsing the text, and when a piece of information does not exist, a placeholder value is assigned to it.
The data structure here is simple: the information of each article is collected into its own list, which is appended to alist as soon as it is complete.
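In other words, after ResponseParse has run, alist has roughly the following shape (all values below are made up for illustration):

alist = [
    ["Some title", "https://www.jianshu.com/p/xxxx", "Some abstract",
     "0.4", "370", "4", "13", "2019-03-01T10:20"],
]
title, link, abstract, beta, reads, comments, likes, published = alist[0]
print(title, reads)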
When we need the XPath of a tag or element, we can select the element in the page source, right-click it, choose Copy, and then Copy XPath.
In addition, the XPath each person constructs may differ, so feel free to build the expressions according to your own understanding.
In fact, the crawler is nearly complete once the page can be parsed, but to remove the effect of the page's pull-down (scroll-to-load) behaviour, look more closely at the URL: https://www.jianshu.com/u/cbe095ae3d58?order_by=shared_at&page=1
The page value in the link controls which page is shown, and each page corresponds to 9 articles. With this rule, we can calculate how many URLs need to be crawled for each author.
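The calculation in main() is just a ceiling division; a compact equivalent sketch (the article count below is an example value):

import math

article_total_count = 103              # example value
# What main() does with the modulo check:
if article_total_count % 9 == 0:
    spider_num = article_total_count // 9
else:
    spider_num = article_total_count // 9 + 1
# Equivalent one-liners:
assert spider_num == math.ceil(article_total_count / 9) == (article_total_count + 8) // 9

url = 'https://www.jianshu.com/u/cbe095ae3d58' + '?order_by=shared_at&page={}'
for i in range(spider_num):
    print(url.format(i + 1))           # one URL per page of 9 articles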
That means we need the author's total article count, which is what the Get_Total_Article(url) method does: it simply parses the page and reads the text at the corresponding path.
At this point, the crawler can loop over the page URLs until all of the author's article information has been crawled.
To save the crawled information locally, we also need a method for writing the data out. To identify each file uniquely, the file is named after the nickname shown on the author's homepage, which is why the Get_author_name(url) method was written.
With that, the crawler can automatically fetch all of an author's article information on Jianshu.
After that, building on the crawling of a single author's articles, I tried to crawl and save the article information of most of the site's authors. Crawling article information from Jianshu's user list will be covered in a follow-up post.