Getting Started with Python Crawlers 17-100: Crawling CSDN Blog Data

Preface

I've been blogging for a while, and it suddenly occurred to me that the blog channel itself can be crawled too, so I went ahead and did it.



It's actually quite simple. Open the CSDN blog homepage. It has a "latest articles" section, and everything in it is a newly published post.

Press F12 to open the developer tools and capture the data API; the endpoint is easy to find.

The extracted link looks like this:

https://blog.csdn.net/api/articles?type=more&category=newarticles&shown_offset=1540381234000000
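To see what the link is made of, it can be split into its path and query parameters. This is only a small sketch using the standard library; the variable names are mine, and the URL is the one captured above.

```python
from urllib.parse import urlsplit, parse_qs

# The link captured from the browser's network panel (copied from the text above)
url = ("https://blog.csdn.net/api/articles"
       "?type=more&category=newarticles&shown_offset=1540381234000000")

parts = urlsplit(url)
# parse_qs returns a list per key; each key appears once here, so keep the first value
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.path)
for name, value in params.items():
    print(name, "=", value)
```

Three parameters come out: type, category, and shown_offset.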

It turns out that the latest-articles page is an infinite-scroll (waterfall) page that keeps loading more as you scroll down. Only one parameter changes between requests: shown_offset. From my years of experience, this parameter is a timestamp, and it should be the timestamp of the last item in the previous batch of data.

With that theory in mind, I looked at the data, and the guess was right.

Check the data returned by the API to confirm that it is correct.

Writing the code

This step is very simple: fetch the link with requests.
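A quick way to sanity-check the timestamp guess is to decode the value from the link. The sketch below assumes the 16-digit number is microseconds since the Unix epoch; that unit is my assumption, not something the API documents, and offset_to_datetime is just a helper name made up for this example.

```python
from datetime import datetime, timezone


def offset_to_datetime(shown_offset):
    """Decode shown_offset, assuming it is microseconds since the Unix epoch."""
    seconds, micros = divmod(int(shown_offset), 1_000_000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.replace(microsecond=micros)


print(offset_to_datetime(1540381234000000).isoformat())
```

Under that assumption the value decodes to 24 October 2018 (UTC), which at least looks like a real publication time rather than a random ID.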

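import requests
import pymongo
import time

# Assumption: MongoDB runs locally on the default port; the database and
# collection names below are placeholders, adjust them to your own setup.
client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client["csdn"]["articles"]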

START_URL = "https://www.csdn.net/api/articles?type=more&category=newarticles&shown_offset={}"
HEADERS = {
    "Accept":"application/json",
    "Host":"www.csdn.net",
    "Referer":"https://www.csdn.net/nav/newarticles",
    "User-Agent":"Your own browser's User-Agent string",  # placeholder, fill in your own
    "X-Requested-With":"XMLHttpRequest"
}


def get_url(url):
    """Fetch one page of articles, store it, then follow shown_offset to the next page."""
    try:
        res = requests.get(url,
                           headers=HEADERS,
                           timeout=3)

        articles = res.json()
        if articles["status"]:
            need_data = articles["articles"]
            if need_data:
                collection.insert_many(need_data)  # Insert this batch into MongoDB
                print("Inserted {} records".format(len(need_data)))
            last_shown_offset = articles["shown_offset"]  # Timestamp of the last item in this batch
            if last_shown_offset:
                time.sleep(1)
                get_url(START_URL.format(last_shown_offset))
    except Exception as e:
        print(e)
        print("Pausing for 60 s; the failing URL is {}".format(url))

        time.sleep(60)  # After an error, wait 60 s, then retry the same URL
        get_url(url)
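
One thing to watch with the function above: it calls itself once per page (and once more per retry), so a long crawl may eventually hit Python's recursion limit, which is 1000 frames by default. Below is a minimal loop-based sketch of the same paging idea. The name crawl_iteratively and its fetch, save, max_pages, and delay parameters are my own placeholders, not part of the original code, and the sketch leaves out the 60 s retry on errors to stay short.

```python
import time


def crawl_iteratively(fetch, save, first_offset, max_pages=None, delay=1.0):
    """Follow shown_offset from page to page with a loop instead of recursion.

    fetch(offset) returns a dict shaped like the API response used above:
        {"status": ..., "articles": [...], "shown_offset": ...}
    save(batch) stores one list of articles.
    Returns the total number of articles passed to save.
    """
    offset = first_offset
    pages = 0
    total = 0
    while offset and (max_pages is None or pages < max_pages):
        data = fetch(offset)
        if not data.get("status"):
            break  # the API reported a failure, so stop paging
        batch = data.get("articles") or []
        if batch:
            save(batch)
            total += len(batch)
        pages += 1
        next_offset = data.get("shown_offset")
        if next_offset == offset:
            break  # the offset did not advance; avoid refetching the same page
        offset = next_offset
        time.sleep(delay)  # be polite between requests
    return total
```

To connect it to the code above, fetch could be something like lambda offset: requests.get(START_URL.format(offset), headers=HEADERS, timeout=3).json(), and save could be collection.insert_many.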

With the data in hand, it should of course be saved, if only as a token gesture. MongoDB operations were covered in the previous article; you can flip back to it.

Tags: Python JSON Database

Posted on Mon, 02 Dec 2019 13:39:58 -0500 by datona