Python Crawler Tour

Preface:

Recently I learned some basic Python syntax and, following teacher Liao Xuefeng, built a simple website with an asynchronous IO framework (although I don't fully understand parts of it yet). Along the way I slowly discovered another interesting use of Python: writing crawlers. I'll study it properly when I have plenty of time. I recommend the "little ape circle" course, which is very good.

0x00: Understanding crawlers

Classification of crawlers by usage scenario
——General-purpose crawlers:
An important component of scraping systems; they crawl an entire page of data.
——Focused crawlers:
Built on top of general-purpose crawlers; they crawl specific local content from a page.
——Incremental crawlers:
Monitor a website for data updates and intelligently crawl only the newly updated data.

Anti-crawling mechanism: portal websites can prevent crawlers from scraping their data by adopting corresponding strategies or technical countermeasures.
Anti-anti-crawling strategy: crawlers can, in turn, break through a portal site's anti-crawling mechanisms by formulating relevant strategies or technical means, so as to obtain the site's data.

robots.txt protocol:

A gentleman's agreement. It specifies which data on a website may be crawled and which may not.
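
If you want to check this protocol programmatically, the standard library ships urllib.robotparser; here is a minimal sketch (the Sogou URL is just an example):

from urllib.robotparser import RobotFileParser

# download and parse the site's robots.txt
rp = RobotFileParser('https://www.sogou.com/robots.txt')
rp.read()
# can_fetch(user_agent, url) tells whether that agent may crawl the url
print(rp.can_fetch('*', 'https://www.sogou.com/web'))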

0x01: HTTP & HTTPS protocols

HTTP protocol:

A form of data interaction between server and client

Common request header information:

User-Agent: the identity of the request carrier
Connection: whether to close the connection or keep it alive after the request completes

Common response header information:

Content-Type: the type of data the server returns to the client in the response
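
To see both kinds of headers in practice, you can inspect a response from the requests module (introduced below); a small illustrative sketch, any URL works:

import requests

r = requests.get('https://www.sogou.com/')
# headers our request carried (requests fills in defaults)
print(r.request.headers.get('User-Agent'))
print(r.request.headers.get('Connection'))
# headers the server sent back
print(r.headers.get('Content-Type'))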

HTTPS protocol

Secure Hypertext Transfer Protocol: HTTP with data encryption involved

Encryption methods:

  1. Symmetric key encryption
  2. Asymmetric key encryption
  3. Certificate key encryption

Symmetric key encryption

The client encrypts the data and sends both the ciphertext and the key to the server; the server then uses the key to decrypt the ciphertext.
Disadvantage: the key travels with the data, so it is easy to intercept and therefore insecure.
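
Purely as an illustration (not part of any crawl code), the third-party cryptography package shows the idea: one shared key both encrypts and decrypts, which is exactly why intercepting that key breaks the scheme. A minimal sketch, assuming pip install cryptography:

from cryptography.fernet import Fernet

key = Fernet.generate_key()           # the shared secret key
cipher = Fernet(key)
ciphertext = cipher.encrypt(b'hello')
# anyone who intercepts both the ciphertext and the key can do this step too
print(cipher.decrypt(ciphertext))     # b'hello'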

Asymmetric key encryption

The server generates a public/private key pair and sends only the public key to the client; the client encrypts with the public key, and only the server's private key can decrypt. For example: RSA encryption.
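
A minimal sketch of the RSA idea with the same cryptography package (an illustration, not the real HTTPS handshake): intercepting the public key alone reveals nothing, because decryption needs the private key that never leaves the server:

from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# the server keeps the private key and hands out the public key
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
ciphertext = public_key.encrypt(b'hello', oaep)  # anyone can encrypt
print(private_key.decrypt(ciphertext, oaep))     # only the key holder decrypts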

Certificate key encryption

A trusted third-party certificate authority signs the server's public key into a certificate, so the client can verify that the public key really belongs to the server and has not been swapped out by a man in the middle.

0x02: requests module

The requests module is a Python module for making network requests (a third-party package, installed with pip install requests). It is powerful and very efficient. Its job is to simulate a browser initiating requests; since it simulates the browser, the crawler built on it must do the same work a browser does.

requests coding workflow:

  1. Specify url
  2. Initiate request
  3. Get response data
  4. Persistent storage

Exercise 1: crawl the page data of Sogou homepage (basic process)

Crawling Code:

# Crawl the page data of the Sogou homepage
import requests
if __name__ == "__main__":
	# step 1: specify the url
	url = 'https://www.sogou.com/'
	# step 2: initiate the request
	# get() returns a response object
	response = requests.get(url=url)
	# step 3: get the response data
	# .text returns the response data as a string
	page_text = response.text
	print(page_text)
	# step 4: persistent storage
	with open('D:/Reptile/sogou.html', 'w', encoding='utf-8') as f:
		f.write(page_text)
	print('Crawl finished')

Response:
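
One practical caveat (my addition, not from the tutorial): requests guesses the text encoding from the response headers, so Chinese pages sometimes come out garbled. If that happens, align the encoding with what requests detects in the content itself before reading .text:

# if page_text shows garbled Chinese, set the encoding explicitly
response.encoding = response.apparent_encoding
page_text = response.text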

Exercise 2: Web page collector (GET)

Before writing it, you need to understand some anti-crawling mechanisms:

UA: User-Agent (the identity of the request carrier)

UA detection: a portal site's server checks the carrier identity of each incoming request. If the carrier identity is a browser, the request is treated as a normal request; if the carrier identity is not browser-based, the request is treated as abnormal (a crawler), and the server is very likely to reject it.

UA camouflage: encapsulate a corresponding User-Agent in a dictionary and add it to the request code.

The Sogou search engine is used here.

For example, when searching for "Sword Art Online", the URL is followed by a bunch of parameters, but one parameter, query, stands out: it holds the content we want to search for. Keeping only this parameter still returns results, which makes the url easier to specify.

In addition, UA camouflage should be added to keep the request from being rejected.

Now write the crawl code:

import requests

if __name__ == "__main__":
	# UA camouflage: encapsulate the User-Agent in a dictionary
	header = {
		'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
	}
	# specify the url
	url = "https://www.sogou.com/sogou"
	# custom parameter
	kw = input('please input param: ')
	# encapsulate the parameter in a dictionary
	param = {
		'query': kw
	}
	# initiate the request; params passes the custom parameters, headers passes the UA
	response = requests.get(url=url, params=param, headers=header)
	# get the response data
	page_text = response.text
	# persistent storage
	fileName = kw + '.html'
	with open(fileName, 'w', encoding='utf-8') as f:
		f.write(page_text)
	print('Crawl succeeded')

The crawl succeeds.

If you want to crawl other content, just change the parameters

Exercise 3: crack Baidu Translate (POST)

Before writing it, learn about AJAX:

AJAX is a technique for creating fast, dynamic web pages.
AJAX lets a web page update asynchronously by exchanging a small amount of data with the server in the background. This means part of a page can be updated without reloading the entire page.


Open Baidu Translate and you will find that the page refreshes locally: AJAX is used here, and after each request succeeds a partial refresh is performed. Therefore, as long as we capture the corresponding AJAX request, we can store the corresponding translation result in a json text file.

Why store it as a json text file? We can tell by capturing the information the server returns.

Next, capture the AJAX request (for example, type "dog") and note the parameters it carries. Then write the crawl code:

import requests
import json

# specify the url of the POST request
post_url = 'https://fanyi.baidu.com/sug'
# UA camouflage
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
# custom POST parameters
word = input('please input: ')
data = {
    'kw': word
}
# initiate the request
post_response = requests.post(url=post_url, data=data, headers=headers)

# get the response
# note: json() returns a Python object (use json() only if you are sure the response data is JSON)
dict_text = post_response.json()
# print(dict_text)
# save the data
fileName = word + '.json'
fp = open(fileName, 'w', encoding='utf-8')
# note: the returned json contains Chinese, so ensure_ascii must be False (Chinese cannot be encoded as ASCII)
json.dump(dict_text, fp=fp, ensure_ascii=False)
fp.close()
print('Finish')

The crawl succeeds.

Exercise 4: crawling Douban movies

Click any movie category and scroll down: each batch shows 20 movies, and after a short pause more movies load as you keep scrolling. The page itself does not navigate; only part of it changes, so we can capture the AJAX request.


These parameters are carried with the AJAX request. Now that we know which parameters are carried, we can write the crawl script.

# Crawl Douban movies
import requests
import json
if __name__ == "__main__":
    url = "https://movie.douban.com/j/chart/top_list"
    # parameters
    params = {
        "type": "24",
        "interval_id": "100:90",
        "action": "",
        "start": "0",
        "limit": "20",
    }
    # UA camouflage
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
    }
    # request
    r = requests.get(url=url, params=params, headers=headers)
    # response
    data = r.json()
    # storage
    fp = open("Film.json", "w", encoding="utf-8")
    # indent=4: pretty-print the json with a 4-space indent
    json.dump(data, fp=fp, ensure_ascii=False, indent=4)
    fp.close()
    print("Crawl succeeded")

Assignment: crawling KFC restaurant location information


When querying for restaurants, only part of the page changes, so we can conclude it is AJAX; capture the request.
The request is sent via POST.

Note that the response here is no longer of json type, but plain text.

Code the crawler:

import requests

if __name__ == "__main__":
    url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
    # UA camouflage
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    key = input("Please enter the query address\n")
    page = input("Please enter the query page number\n")
    # POST parameters observed in the captured request
    data = {
        'cname': '',
        'pid': '',
        'keyword': key,
        'pageIndex': page,
        'pageSize': '10',
    }

    response = requests.post(url=url, data=data, headers=headers)
    data = response.text
    print(data)
    fileName = key + '.txt'
    with open(fileName, 'w', encoding='utf-8') as fp:
        fp.write(data)
    print("Crawl succeeded")

The crawl succeeds.

Conclusion:

That's it for this first study session. The harvest has been great, and next time I will record an even more fun crawler journey!!
