I've been blogging for a while, and it only just occurred to me that the posts on the CSDN blog channel can be crawled too, so I crawled them.
It's actually very simple. Open the CSDN blog homepage. There's a latest-articles page, right? That's the target: all the latest articles.
Press F12 to open the developer tools and capture the data API. The interface is easy to find.
The extracted link looks like this:
https://blog.csdn.net/api/articles?type=more&category=newarticles&shown_offset=1540381234000000
The latest-articles page turns out to be a waterfall-flow (infinite scroll) page that keeps loading as you scroll down. Only one parameter changes: shown_offset. From my years of experience, this parameter is a timestamp, and it must be the timestamp of the last record in the previous batch.
With that theory in mind, take a look at the data. Yep, guessed right~
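To check the guess by hand, here is a small sketch that pulls shown_offset out of the URL above and decodes it. It assumes the 16-digit value is a Unix timestamp in microseconds, which is my reading of the number and not something CSDN documents. decode_shown_offset is a made-up helper name.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

URL = (
    "https://blog.csdn.net/api/articles"
    "?type=more&category=newarticles&shown_offset=1540381234000000"
)


def decode_shown_offset(url):
    """Return the shown_offset query parameter as a UTC datetime.

    Assumption: the value is microseconds since the Unix epoch.
    """
    query = parse_qs(urlparse(url).query)
    offset = int(query["shown_offset"][0])
    seconds, micros = divmod(offset, 1_000_000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=micros)


print(decode_shown_offset(URL).isoformat())  # 2018-10-24T11:40:34+00:00
```

Under that assumption the example value decodes to 24 October 2018, 11:40:34 UTC, which is a believable time for a "latest article".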
Check the data returned by the blog API to confirm it is correct.
Now for the code. This step is very simple: fetch the link with requests.
import time

import pymongo
import requests

START_URL = "https://www.csdn.net/api/articles?type=more&category=newarticles&shown_offset={}"
HEADERS = {
    "Accept": "application/json",
    "Host": "www.csdn.net",
    "Referer": "https://www.csdn.net/nav/newarticles",
    "User-Agent": "Your own browser configuration",
    "X-Requested-With": "XMLHttpRequest",
}

# Assumption: a local MongoDB instance; the database and collection names
# here are placeholders. The connection setup is covered in the previous article.
collection = pymongo.MongoClient("mongodb://localhost:27017")["csdn"]["articles"]


def get_url(url):
    # Loop instead of recursing, so a long crawl cannot hit the recursion limit.
    while url:
        try:
            res = requests.get(url, headers=HEADERS, timeout=3)
            articles = res.json()
            if not articles["status"]:
                break
            need_data = articles["articles"]
            if need_data:
                collection.insert_many(need_data)  # Insert the batch
                print("Inserted {} records".format(len(need_data)))
            last_shown_offset = articles["shown_offset"]  # Timestamp of the last record
            if not last_shown_offset:
                break
            time.sleep(1)
            url = START_URL.format(last_shown_offset)
        except Exception as e:
            print(e)
            print("Pausing for 60 s; the failing URL is {}".format(url))
            time.sleep(60)  # After a failure, wait 60 s and retry the same URL
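The paging logic can be tried without touching the network. The sketch below is not the crawler above but a stripped-down variant: crawl takes a fetch function (a hypothetical parameter) that returns the parsed JSON for a given offset, follows shown_offset until it comes back empty, and gives up after max_pages. The field names status, articles, and shown_offset come from the code above. The fake pages are invented test data.

```python
def crawl(fetch, start_offset="", max_pages=100):
    """Follow shown_offset from page to page and collect every article.

    fetch(offset) must return a dict shaped like the API response:
    {"status": ..., "articles": [...], "shown_offset": ...}.
    """
    collected = []
    offset = start_offset
    for _ in range(max_pages):
        page = fetch(offset)
        if not page.get("status"):
            break
        collected.extend(page.get("articles") or [])
        offset = page.get("shown_offset")
        if not offset:  # an empty offset means there is nothing more to load
            break
    return collected


# Invented stand-in for the real API: three pages keyed by offset.
FAKE_PAGES = {
    "": {"status": True, "articles": [{"title": "a"}, {"title": "b"}], "shown_offset": 200},
    200: {"status": True, "articles": [{"title": "c"}], "shown_offset": 100},
    100: {"status": True, "articles": [], "shown_offset": 0},
}

articles = crawl(FAKE_PAGES.__getitem__)
print([a["title"] for a in articles])  # ['a', 'b', 'c']
```

Against the real interface, fetch could be something like lambda offset: requests.get(START_URL.format(offset), headers=HEADERS, timeout=3).json(), with the sleep and retry handling added around it.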
With the data in hand, of course it has to be saved, if only as a token gesture. MongoDB operations were covered in the previous article, so go back and take a look.