Contents
1. Fetch the Baidu homepage and print it
2. Fetch a picture and download it locally
3. Fetch a video and download it locally
4. Sogou keyword search crawling
5. Crawl Baidu Translate
6. Crawl the Douban movie chart
7. Crawl JK avatar images
1. Fetch the Baidu homepage and print it
import requests url="https://www.baidu.com/" #ua camouflage param={ ' User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36' } response =requests.get(url,params=param) response.encoding = response.apparent_encoding print(response.text)
Note the line response.encoding = response.apparent_encoding.
response.encoding is the encoding taken from the charset field of the HTTP response header. If the header has no charset field, it defaults to ISO-8859-1, which cannot represent Chinese; this is what produces garbled output.
response.apparent_encoding is the encoding inferred from the content of the page itself, so it is more accurate than encoding. When a page comes out garbled, assign apparent_encoding to encoding.
Its purpose is to prevent garbled text.
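To see what is actually happening, you can print both attributes before the assignment. This is a minimal sketch of my own; the exact values depend on what Baidu's server returns:

import requests

response = requests.get("https://www.baidu.com/")
print(response.encoding)           # from the HTTP header; ISO-8859-1 if no charset is given
print(response.apparent_encoding)  # guessed from the page content, typically utf-8 here
response.encoding = response.apparent_encoding
print(response.text)               # Chinese now displays correctly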
2. Fetch a picture and download it locally
import requests url = "https://cn.bing.com/images/search?view=detailV2&ccid=XQzISsWk&id=979B73C4E472CCA4C34C216CD0693FDC05421E1E&thid=OIP.XQzISsWklI6N2WY4wwyZSwHaHa&mediaurl=https%3A%2F%2Ftse1-mm.cn.bing.net%2Fth%2Fid%2FR-C.5d0cc84ac5a4948e8dd96638c30c994b%3Frik%3DHh5CBdw%252fadBsIQ%26riu%3Dhttp%253a%252f%252fp2.music.126.net%252fPFVNR3tU9DCiIY71NdUDcQ%253d%253d%252f109951165334518246.jpg%26ehk%3Do08VEDcuKybQIPsOGrNpQ2glID%252fIiEV7cw%252bFo%252fzopiM%253d%26risl%3D1%26pid%3DImgRaw%26r%3D0&exph=1410&expw=1410&q=%e5%bc%a0%e6%9d%b0&simid=608020541519853506&form=IRPRST&ck=68F7B9052016D84898D3E330A6F4BC38&selectedindex=2&ajaxhist=0&ajaxserp=0&vt=0&sim=11" r = requests.get(url) with open("zhangjie.jpg","wb") as f: f.write(r.content) print("over!!!")
The result is a file named zhangjie.jpg.
However, this photo cannot be opened on my Windows computer.
I searched for information for a long time and still have not solved the problem; if any reader knows why, please advise in the comments.
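A likely explanation (my own diagnosis, not confirmed in the original post): the url above is the Bing image detail page, so requests receives HTML rather than JPEG bytes, and the saved file is not a valid image. The actual image address is carried in the mediaurl query parameter of that page URL, so one possible fix is to extract and download it instead:

import requests
from urllib.parse import urlparse, parse_qs

# the long search-page URL from the snippet above
page_url = "https://cn.bing.com/images/search?view=detailV2&ccid=XQzISsWk&id=979B73C4E472CCA4C34C216CD0693FDC05421E1E&thid=OIP.XQzISsWklI6N2WY4wwyZSwHaHa&mediaurl=https%3A%2F%2Ftse1-mm.cn.bing.net%2Fth%2Fid%2FR-C.5d0cc84ac5a4948e8dd96638c30c994b%3Frik%3DHh5CBdw%252fadBsIQ%26riu%3Dhttp%253a%252f%252fp2.music.126.net%252fPFVNR3tU9DCiIY71NdUDcQ%253d%253d%252f109951165334518246.jpg%26ehk%3Do08VEDcuKybQIPsOGrNpQ2glID%252fIiEV7cw%252bFo%252fzopiM%253d%26risl%3D1%26pid%3DImgRaw%26r%3D0&exph=1410&expw=1410&q=%e5%bc%a0%e6%9d%b0&simid=608020541519853506&form=IRPRST&ck=68F7B9052016D84898D3E330A6F4BC38&selectedindex=2&ajaxhist=0&ajaxserp=0&vt=0&sim=11"

# parse_qs() splits the query string into a dict and percent-decodes the values;
# "mediaurl" holds the address of the real image file
img_url = parse_qs(urlparse(page_url).query)["mediaurl"][0]

r = requests.get(img_url)
with open("zhangjie.jpg", "wb") as f:
    f.write(r.content)   # now genuine JPEG bytes
print("over!!!")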
3. Fetch a video and download it locally
import requests

url = 'https://haokan.baidu.com/v?pd=wisenatural&vid=12502968524882193208'
r = requests.get(url)
with open('jk.mp4', 'wb') as f:
    f.write(r.content)
print('Download complete')
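For large video files, a streaming variant avoids holding the whole response in memory. This is a sketch of my own using the same URL: requests' stream=True plus iter_content() writes the file chunk by chunk.

import requests

url = 'https://haokan.baidu.com/v?pd=wisenatural&vid=12502968524882193208'
with requests.get(url, stream=True) as r:
    with open('jk.mp4', 'wb') as f:
        # download in 8 KB chunks instead of loading everything at once
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
print('Download complete')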
4. Sogou keyword search crawling
import requests

# Web collector
url = "https://www.sogou.com/web/"
kw = input('enter a word: ')
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
}
param = {
    'query': kw
}
response = requests.get(url=url, params=param, headers=header)
page_txt = response.text
filename = kw + '.html'
with open(filename, "w", encoding="utf-8") as fp:
    fp.write(page_txt)
print(filename, "Saved successfully")
5. Crawl Baidu Translate
In the browser's developer tools you can see that this is a POST request (a new request is sent each time you type a letter in English input mode).
import requests
import json

# Target url
post_url = "https://fanyi.baidu.com/sug"
# UA camouflage
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
}
# Data to send
word = input("enter a word:")
data = {
    'kw': word
}
response = requests.post(url=post_url, data=data, headers=header)
dic_obj = response.json()
# print(dic_obj)
# Persistent storage
filename = word + '.json'
with open(filename, "w", encoding="utf-8") as fp:
    json.dump(dic_obj, fp=fp, ensure_ascii=False)
print("over!!!")
The result is saved as a .json file. Copy the contents of the JSON file,
then open a JSON formatting site such as https://www.bejson.com/ and paste them in to view the structure.
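Alternatively (a sketch of my own), json.dumps with indent pretty-prints the saved result locally, with no need for an online formatter; 'dog.json' here stands for whatever word was entered:

import json

with open('dog.json', encoding='utf-8') as fp:   # hypothetical filename for the word "dog"
    dic_obj = json.load(fp)
# indent=2 produces the same readable layout an online formatter would
print(json.dumps(dic_obj, ensure_ascii=False, indent=2))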
6. Crawl the Douban movie chart
import requests
import json

url = "https://movie.douban.com/j/chart/top_list"
params = {
    'type': '24',
    'interval_id': '100:90',
    'action': '',
    'start': '100',
    'limit': '20'
}
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
}
response = requests.get(url=url, params=params, headers=header)
list_obj = response.json()
with open("./b.json", "w", encoding="utf-8") as fp:
    json.dump(list_obj, fp, ensure_ascii=False)
# Finally, paste the saved JSON into an online formatter to read it
print("over!!!")
Again, note how important params is here: requests encodes the dictionary into the query string that the site's endpoint expects, as the sketch below shows.
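A minimal sketch of my own showing the URL that requests actually builds from the params dictionary:

import requests

url = "https://movie.douban.com/j/chart/top_list"
params = {'type': '24', 'interval_id': '100:90', 'action': '', 'start': '100', 'limit': '20'}
response = requests.get(url=url, params=params)
# requests appends the dict as a percent-encoded query string:
print(response.url)
# https://movie.douban.com/j/chart/top_list?type=24&interval_id=100%3A90&action=&start=100&limit=20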
7. Crawl JK avatar images
import requests
import re
import urllib.request
import time
import os

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'
}
url = "https://cn.bing.com/images/async?q=jk%E5%88%B6%E6%9C%8D%E5%A5%B3%E7%94%9F%E5%A4%B4%E5%83%8F&first=118&count=35&relp=35&cw=1177&ch=705&tsc=ImageBasicHover&datsrc=I&layout=RowBased&mmasync=1&SFX=4"
r = requests.get(url=url, headers=header)
c = r.text
# re.S lets '.' cross line breaks so the pattern can span several lines of HTML
pattern = re.compile(r'<div class="imgpt".*?<div class="img_cont hoff">.*?src="(.*?)".*?</div>', re.S)
items = re.findall(pattern, c)
os.makedirs('photo', exist_ok=True)
for a in items:
    print("Download picture: " + a)
    # name each file with the current timestamp; the 2-second pause keeps names unique
    urllib.request.urlretrieve(a, 'photo/' + str(int(time.time())) + '.jpg')
    time.sleep(2)
Supplement 1: without the re.S flag, '.' does not match newline characters, so matching is effectively performed line by line and a pattern cannot continue across a line break.
With re.S, '.' matches any character including newlines, so the regular expression matches against the string as a whole.
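A minimal demonstration of the difference (my own example; the HTML fragment and URL are made up):

import re

text = '<div class="img_cont hoff">\nsrc="http://example.com/a.jpg"\n</div>'
pattern = r'<div class="img_cont hoff">.*?src="(.*?)"'
print(re.findall(pattern, text))        # [] because '.' cannot cross the newline
print(re.findall(pattern, text, re.S))  # ['http://example.com/a.jpg']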
Supplement 2: os.makedirs(name, mode=0o777, exist_ok=False)
Purpose
Creates a multi-level directory tree (to create a single level, use os.mkdir).
Parameter description
name: the path of the directory to create
mode: the permission bits to set on the directory, 0o777 (octal) by default.
exist_ok: whether an existing directory raises an exception. If exist_ok is False (the default), FileExistsError is raised when the target directory already exists; if exist_ok is True, no exception is raised. See the example below.
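A quick illustration of exist_ok (my own example):

import os

os.makedirs('photo/2021/10', exist_ok=True)   # creates all intermediate levels, never raises
try:
    os.makedirs('photo/2021/10')              # exist_ok defaults to False ...
except FileExistsError:
    print('directory already exists')         # ... so an existing target raises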
Supplement 3:
Description
urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)
Function description
Copies the network object denoted by url to a local file. If the url points to a local file, the object is not copied unless a filename is supplied. Returns a tuple (filename, headers), where filename is the local file name under which the object can be found and headers is the info() method of the object returned by urlopen() (for a remote object).
The second parameter, if present, specifies the file location to copy to (if absent, a temporary file with a generated name is used). The third parameter, if present, is a callback function that is called once when the network connection is established and once after each block is read. The callback receives three arguments: the number of blocks transferred so far, the block size in bytes, and the total size of the file. The total size may be -1 on older FTP servers that do not return a file size in response to a retrieval request.
Parameter description
url: an external or local URL
filename: the local path where the data is saved (if this parameter is not given, urllib generates a temporary file to hold the data);
reporthook: a callback function triggered when the server is connected and each time a data block is transferred; it can be used to display the current download progress, as shown in the sketch below.
data: the data POSTed to the server. The method returns a tuple (filename, headers): filename is the local save path and headers is the server's response header.
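Here is a small reporthook sketch of my own that prints download progress; the image URL is hypothetical:

import urllib.request

def show_progress(block_count, block_size, total_size):
    # called once when the connection opens and once per block read;
    # total_size may be -1 if the server does not report a length
    if total_size > 0:
        percent = min(100, block_count * block_size * 100 // total_size)
        print(f'\rdownloaded {percent}%', end='')

# hypothetical URL for illustration
filename, headers = urllib.request.urlretrieve(
    'http://example.com/a.jpg', 'a.jpg', reporthook=show_progress)
print('\nsaved to', filename)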