Crawling related news by keyword with a Python crawler




The text and images in this article come from the Internet and are for learning and exchange only, not for any commercial purpose. Copyright belongs to the original author. If you have any questions, please contact us promptly.

The anti-crawling mechanisms of major websites have now reached the point of madness, such as Dianping's character encryption and Weibo's login verification. By comparison, the anti-crawling mechanisms of news sites are a little weaker. So today we take Sina News as an example and analyze how to grab related news by keyword with a Python crawler.

First of all, if you search directly within Sina News, you will find that the results show at most 20 pages. So instead we search from Sina's homepage, where there is no page limit.







Analysis of web page structure

<div class="pagebox" id="_function_code_page">
<b><span class="pagebox_cur_page">1</span></b>
<a href="javascript:;" onclick="getNewsData('')" title="Page 2">2</a>
<a href="javascript:;" onclick="getNewsData('')" title="Page 3">3</a>
<a href="javascript:;" onclick="getNewsData('')" title="Page 4">4</a>
<a href="javascript:;" onclick="getNewsData('')" title="Page 5">5</a>
<a href="javascript:;" onclick="getNewsData('')" title="Page 6">6</a>
<a href="javascript:;" onclick="getNewsData('')" title="Page 7">7</a>
<a href="javascript:;" onclick="getNewsData('')" title="Page 8">8</a>
<a href="javascript:;" onclick="getNewsData('')" title="Page 9">9</a>
<a href="javascript:;" onclick="getNewsData('')" title="Page 10">10</a>
<a href="javascript:;" onclick="getNewsData('');" title="next page">next page</a>
</div>



After entering a keyword and searching, I found that the URL does not change, yet the page content is updated. Experience tells me this is done through Ajax, so I pulled up Sina's page source to take a look.

Obviously, every page turn sends a request to some address by clicking the a tag. But if you paste that address directly into the browser's address bar and press Enter:



Then congratulations, you get an error.


Take a closer look at the onclick in the HTML, and you find that it calls a function named getNewsData. Searching for this function in the relevant JS file, you can see that it constructs the request URL before each Ajax call and sends a GET request, with the data returned as JSONP (for cross-domain access).
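Because the response is JSONP, the JSON payload arrives wrapped in a callback call and cannot be fed to a JSON parser as-is. A minimal sketch of peeling off that wrapper (the callback name varies, so the regex just matches any `name(...)` shell):

```python
import json
import re

def strip_jsonp(text):
    # JSONP responses look like `someCallback({...});` -- pull out the JSON
    # between the outermost parentheses; fall back to the raw text when the
    # body is already plain JSON.
    m = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', text, re.S)
    return json.loads(m.group(1) if m else text)

data = strip_jsonp('jsonpCallback({"result": {"list": []}});')
```

With a wrapper like the above, the rest of the pipeline can treat JSONP and plain JSON responses identically.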

So we only need to imitate its request format to get the data.

var loopnum = 0;
function getNewsData(url){
    var oldurl = url;
    if(key == ''){  // `key` is the search keyword read elsewhere on the page
        $("#result").html("<span>No search keyword</span>");
        return false;
    }
    if(!url) url = '' + encodeURIComponent(key);  // base URL elided in the original post
    var stime = getStartDay();
    var etime = getEndDay();
    url += '&stime=' + stime + '&etime=' + etime + '&sort=rel&highlight=1&num=10&ie=utf-8'; //'&from=sina_index_hot_words&sort=time&highlight=1&num=10&ie=utf-8';
    $.ajax({
        url: url,
        type: 'GET',
        dataType: 'jsonp',
        cache: false,
        success: function(data){ /* the callback function is too long to reproduce */ }
    });
}
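Mirroring the query string that getNewsData assembles above, here is a hedged Python sketch of building the same URL. The base URL and the `q`/`page` parameter names are assumptions, since the post elides them; the remaining parameters come straight from the JS:

```python
from urllib.parse import urlencode

def build_search_url(base, keyword, stime, etime, page=1):
    # stime/etime/sort/highlight/num/ie match the JS above;
    # "q" and "page" are guessed names for the elided parameters.
    params = {
        "q": keyword,
        "stime": stime,
        "etime": etime,
        "sort": "rel",
        "highlight": 1,
        "num": 10,
        "ie": "utf-8",
        "page": page,
    }
    return base + "?" + urlencode(params)

url = build_search_url("https://search.example.com/news", "python",
                       "2020-06-01", "2020-06-27")
```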


Send request

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
}
params = {
}

response = requests.get("", params=params, headers=headers)


This time the requests library is used to construct the same URL and send the request. The result: a cold 403 Forbidden:
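When a scrape fails like this, it helps to branch on the status code before trying to parse the body. A small helper along these lines (the hint strings are my own, not from the original post):

```python
def explain_status(code):
    # Map the HTTP statuses commonly met while scraping to a debugging hint.
    hints = {
        200: "OK: parse the response body",
        403: "Forbidden: the server likely wants browser-like headers or cookies",
        429: "Too Many Requests: slow down or add delays between pages",
    }
    return hints.get(code, "Unexpected HTTP status {}".format(code))

print(explain_status(403))
```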



So let's go back to the website and see what went wrong.




Find the returned JSON file in the developer tools and inspect its request headers: they contain a Cookie. So when constructing our own headers, we can simply copy the browser's request headers. Run it again, and we get a 200 response! The rest is simple: parse the returned data and write it to Excel.
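One convenient way to carry the copied browser headers (Cookie included) across every request is a requests.Session. A minimal sketch; the cookie string below is a placeholder, so paste the real value shown in the developer tools:

```python
import requests

session = requests.Session()
# Headers copied from the browser's developer tools. The Cookie value here
# is a placeholder, not a working credential.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
    "Cookie": "SINAGLOBAL=placeholder; ULV=placeholder",
})
# Every session.get(...) from here on sends these headers automatically.
```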


Full code

import requests
import json
import xlwt

def getData(page, news):
    headers = {
        "Host": "",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
        "Accept": "*/*",
        "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Referer": r"",
        "Cookie": "ustat=__172.16.93.31_1580710312_0.68442000; genTime=1580710312; vt=99; Apache=9855012519393.69.1585552043971; SINAGLOBAL=9855012519393.69.1585552043971; ULV=1585552043972:1:1:1:9855012519393.69.1585552043971:; historyRecord={'href':'','refer':''}; SMART=0; dfz_loc=gd-default",
        "TE": "Trailers"

    params = {

    response = requests.get("", params=params, headers=headers)
    dic = json.loads(response.text)
    news += dic["result"]["list"]

    return news

def writeData(news):
    workbook = xlwt.Workbook(encoding = 'utf-8')
    worksheet = workbook.add_sheet('MySheet')

    worksheet.write(0, 0, "title")
    worksheet.write(0, 1, "time")
    worksheet.write(0, 2, "media")
    worksheet.write(0, 3, "website")

    for i in range(len(news)):
        worksheet.write(i+1, 0, news[i]["origin_title"])
        worksheet.write(i+1, 1, news[i]["datetime"])
        worksheet.write(i+1, 2, news[i]["media"])
        worksheet.write(i+1, 3, news[i]["url"])

    workbook.save('data.xls')

def main():
    news = []
    for i in range(1, 501):
        news = getData(i, news)
    writeData(news)

if __name__ == '__main__':
    main()


Final result


Tags: Python JSON Javascript IE Windows

Posted on Sat, 27 Jun 2020 04:59:32 -0400 by OuchMedia