
Preface
The text and images in this article are taken from the Internet and are intended for learning and exchange only, not for any commercial use. Copyright belongs to the original authors; if you have any concerns, please contact us promptly for handling.
Nowadays the anti-crawling mechanisms of major websites have reached an almost fanatical level: Dianping obfuscates characters with custom fonts, Weibo requires login verification, and so on. By comparison, the anti-crawling measures of news sites are somewhat weaker. So today we take Sina News as an example and analyze how to crawl news matching a keyword with a Python crawler.
First of all, if you search directly within the news site, you will find it displays at most 20 pages of results. If you search from Sina's homepage instead, there is no such page limit, so that is where we start.

Analysis of web page structure
<div class="pagebox" id="_function_code_page">
  <b><span class="pagebox_cur_page">1</span></b>
  <a href="javascript:;" onclick="getNewsData('https://interface.sina.cn/homepage/search.d.json?t=&q=%E6%97%85%E6%B8%B8&pf=0&ps=0&page=2')" title="Page 2">2</a>
  <a href="javascript:;" onclick="getNewsData('https://interface.sina.cn/homepage/search.d.json?t=&q=%E6%97%85%E6%B8%B8&pf=0&ps=0&page=3')" title="Page 3">3</a>
  <a href="javascript:;" onclick="getNewsData('https://interface.sina.cn/homepage/search.d.json?t=&q=%E6%97%85%E6%B8%B8&pf=0&ps=0&page=4')" title="Page 4">4</a>
  <a href="javascript:;" onclick="getNewsData('https://interface.sina.cn/homepage/search.d.json?t=&q=%E6%97%85%E6%B8%B8&pf=0&ps=0&page=5')" title="Page 5">5</a>
  <a href="javascript:;" onclick="getNewsData('https://interface.sina.cn/homepage/search.d.json?t=&q=%E6%97%85%E6%B8%B8&pf=0&ps=0&page=6')" title="Page 6">6</a>
  <a href="javascript:;" onclick="getNewsData('https://interface.sina.cn/homepage/search.d.json?t=&q=%E6%97%85%E6%B8%B8&pf=0&ps=0&page=7')" title="Page 7">7</a>
  <a href="javascript:;" onclick="getNewsData('https://interface.sina.cn/homepage/search.d.json?t=&q=%E6%97%85%E6%B8%B8&pf=0&ps=0&page=8')" title="Page 8">8</a>
  <a href="javascript:;" onclick="getNewsData('https://interface.sina.cn/homepage/search.d.json?t=&q=%E6%97%85%E6%B8%B8&pf=0&ps=0&page=9')" title="Page 9">9</a>
  <a href="javascript:;" onclick="getNewsData('https://interface.sina.cn/homepage/search.d.json?t=&q=%E6%97%85%E6%B8%B8&pf=0&ps=0&page=10')" title="Page 10">10</a>
  <a href="javascript:;" onclick="getNewsData('https://interface.sina.cn/homepage/search.d.json?t=&q=%E6%97%85%E6%B8%B8&pf=0&ps=0&page=2');" title="next page">next page</a>
</div>
After opening sina.com.cn and searching for a keyword, I noticed that the URL in the address bar does not change even though the page content updates. Experience says this is done with Ajax, so I pulled down the page source to take a look.
Clearly, each page turn sends a request to a URL via the a tag's onclick handler. But if you paste that URL straight into the browser's address bar and press Enter:

Congratulations, you get an error.
Look more closely at the onclick in the HTML: it calls a function named getNewsData. Searching for this function in the site's JS files shows that it builds the request URL before each Ajax call and sends a GET request, with the data coming back as JSON (JSONP, to get around the cross-domain restriction).
So all we need to do is imitate its request format to fetch the data ourselves.
var loopnum = 0;

function getNewsData(url) {
    var oldurl = url;
    if (!key) {
        $("#result").html("<span>No search keyword</span>");
        return false;
    }
    if (!url) {
        url = 'https://interface.sina.cn/homepage/search.d.json?q=' + encodeURIComponent(key);
    }
    var stime = getStartDay();
    var etime = getEndDay();
    url += '&stime=' + stime + '&etime=' + etime + '&sort=rel&highlight=1&num=10&ie=utf-8';
    //'&from=sina_index_hot_words&sort=time&highlight=1&num=10&ie=utf-8';
    $.ajax({
        type: 'GET',
        dataType: 'jsonp',
        cache: false,
        url: url,
        success: // The callback function is too long to reproduce here
    })
}
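The URL that getNewsData() assembles can be rebuilt in Python without a browser. A minimal sketch (`build_search_url` is a hypothetical helper name; the keyword and date window are the article's examples, and `urlencode` percent-encodes the keyword the same way `encodeURIComponent` does):

```python
from urllib.parse import urlencode

def build_search_url(key, stime, etime, page=1):
    """Rebuild the URL that getNewsData() assembles before its ajax call."""
    base = "https://interface.sina.cn/homepage/search.d.json"
    params = {
        "q": key,          # search keyword, UTF-8 percent-encoded by urlencode
        "page": page,
        "stime": stime,    # start of the search window, YYYY-MM-DD
        "etime": etime,    # end of the search window, YYYY-MM-DD
        "sort": "rel",
        "highlight": 1,
        "num": 10,
        "ie": "utf-8",
    }
    return base + "?" + urlencode(params)

url = build_search_url("旅游", "2019-03-30", "2020-03-31")
print(url)
```

The keyword 旅游 ("travel") encodes to %E6%97%85%E6%B8%B8, which is exactly the q= value visible in the pagination links above.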
Send request
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
}
params = {
    "t": "",
    "q": "旅游",        # the search keyword ("travel"), matching the URLs above
    "pf": "0",
    "ps": "0",
    "page": "1",
    "stime": "2019-03-30",
    "etime": "2020-03-31",
    "sort": "rel",
    "highlight": "1",
    "num": "10",
    "ie": "utf-8"
}
response = requests.get("https://interface.sina.cn/homepage/search.d.json?",
                        params=params, headers=headers)
print(response)
Here the requests library is used to construct the same URL and send the request. The result was a cold 403 Forbidden:
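A bare `print(response)` only shows the status line; the useful value is `status_code`. A sketch using a hand-built `Response` object, so it runs without hitting the network (a real blocked call produces the same object with the same status):

```python
import requests

# Construct a Response by hand so this sketch needs no network access;
# a rejected live request would carry the same status_code.
response = requests.models.Response()
response.status_code = 403

print(response)               # <Response [403]>
print(response.status_code)   # 403
if not response.ok:           # ok is False for any 4xx/5xx status
    print("Blocked: compare our headers with the browser's request headers.")
```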

So we go back to the site to see what went wrong.


Find the returned JSON file in the developer tools and inspect its request headers: the browser's request carries a Cookie. So when constructing headers, we can simply copy the browser's request headers wholesale. Run again, and the response is 200! The rest is simple: parse the returned data and write it to Excel.
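The returned JSON nests the articles under result.list. A minimal parsing sketch with a hand-made sample (the values are invented for illustration, but the field names match those the full code reads):

```python
import json

# Hand-made sample shaped like the interface's response; illustrative
# values only, but the keys mirror what the full code below uses.
raw = '''
{"result": {"list": [
    {"origin_title": "Sample headline",
     "datetime": "2020-03-31 10:00:00",
     "media": "Sina",
     "url": "https://news.sina.cn/example"}
]}}
'''

dic = json.loads(raw)
for item in dic["result"]["list"]:
    print(item["origin_title"], "|", item["media"], "|", item["datetime"])
```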

Full code
import requests
import json
import xlwt


def getData(page, news):
    """Fetch one page of search results and append them to news."""
    headers = {
        "Host": "interface.sina.cn",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
        "Accept": "*/*",
        "Accept-Language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Referer": r"http://www.sina.com.cn/mid/search.shtml?range=all&c=news&q=%E6%97%85%E6%B8%B8&from=home&ie=utf-8",
        # Copied from the browser's request headers -- without the Cookie
        # the interface answers 403.
        "Cookie": "ustat=__172.16.93.31_1580710312_0.68442000; genTime=1580710312; vt=99; Apache=9855012519393.69.1585552043971; SINAGLOBAL=9855012519393.69.1585552043971; ULV=1585552043972:1:1:1:9855012519393.69.1585552043971:; historyRecord={'href':'https://news.sina.cn/','refer':'https://sina.cn/'}; SMART=0; dfz_loc=gd-default",
        "TE": "Trailers"
    }
    params = {
        "t": "",
        "q": "旅游",        # the search keyword ("travel")
        "pf": "0",
        "ps": "0",
        "page": page,
        "stime": "2019-03-30",
        "etime": "2020-03-31",
        "sort": "rel",
        "highlight": "1",
        "num": "10",
        "ie": "utf-8"
    }
    response = requests.get("https://interface.sina.cn/homepage/search.d.json?",
                            params=params, headers=headers)
    dic = json.loads(response.text)
    news += dic["result"]["list"]   # each entry is one article
    return news


def writeData(news):
    """Dump the collected articles into data.xls."""
    workbook = xlwt.Workbook(encoding='utf-8')
    worksheet = workbook.add_sheet('MySheet')
    worksheet.write(0, 0, "title")
    worksheet.write(0, 1, "time")
    worksheet.write(0, 2, "media")
    worksheet.write(0, 3, "website")
    for i in range(len(news)):
        print(news[i])
        worksheet.write(i + 1, 0, news[i]["origin_title"])
        worksheet.write(i + 1, 1, news[i]["datetime"])
        worksheet.write(i + 1, 2, news[i]["media"])
        worksheet.write(i + 1, 3, news[i]["url"])
    workbook.save('data.xls')


def main():
    news = []
    for i in range(1, 501):   # crawl pages 1..500
        news = getData(i, news)
    writeData(news)


if __name__ == '__main__':
    main()
Final result
