Crawler requests module, from getting started to "getting jailed": basics + hands-on analysis

πŸ“’πŸ“’πŸ“’πŸ“£πŸ“£πŸ“£
🌻🌻🌻 Hello everyone, my name is Dream. I'm a fun-loving Python blogger and still a newbie, so please go easy on me 😜😜😜
πŸ…πŸ…πŸ… I'm a rising-star creator in the Python field on CSDN and a college sophomore. You're welcome to team up and study with me
πŸ’• Motto: this paradise never lacks geniuses, and hard work is your final admission ticket! πŸš€πŸš€πŸš€
πŸ’“ Finally, may we all shine where no one is watching and make progress together 🍺🍺🍺
πŸ‰πŸ‰πŸ‰ "Sad ten thousand times, there will still be Dream; I've been waiting for you in the warmest place" — that's me singing! Ha ha ha~ 🌈🌈🌈
🌟🌟🌟✨✨✨

The requests library is quite similar to the urllib library, but urllib is a little dated, so requests is what is generally used nowadays. Let's learn about it.

1, Basic use

1. Documentation

Official documentation:
http://cn.python-requests.org/zh_CN/latest/

Quickstart:
http://cn.python-requests.org/zh_CN/latest/user/quickstart.html

2. Installation

```
pip install requests
```

After a successful installation you will see a success message; if the library is already installed, pip prints `Requirement already satisfied`.
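
To double-check the install, you can also print the installed version from Python (a tiny sketch; your version number will differ):

```python
import requests

# Print the installed version, e.g. 2.26.0
print(requests.__version__)
```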

3. The type and attributes of the response

1. Type

```python
import requests

url = 'https://www.baidu.com/'
response = requests.get(url=url)

# One type with six attributes
# The type of the response
print(type(response))
# <class 'requests.models.Response'>
```

2. Return the page source as a string

```python
# Return the page source as a string
print(response.text)
```

3. Return the URL of the request

```python
# Return the URL of the request
print(response.url)
# https://www.baidu.com/
```

4. Return the response body as binary data

```python
# Return the response body as binary data (bytes)
print(response.content)
```

5. Return the status code of the response

```python
# Return the status code of the response
print(response.status_code)
# 200
```

6. Return the response headers

```python
# Return the response headers
print(response.headers)
```
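
The heading above promises six attributes but only five appear (text, url, content, status_code, headers); the remaining one is `response.encoding`, which is also the key to a common pitfall: `.text` decodes `.content` using the guessed encoding, and for pages like Baidu's the guess can be wrong, garbling Chinese characters. A small sketch of the fix (the exact guessed encoding may vary):

```python
import requests

response = requests.get('https://www.baidu.com/')
# The sixth attribute: the encoding used to decode .text,
# guessed from the response headers (often ISO-8859-1 here)
print(response.encoding)

response.encoding = 'utf-8'  # override the guess
print(response.text)         # Chinese characters now decode correctly
```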

2, A quick comparison of urllib and requests

1. urllib

# (1) One type and six methods
# (2) get requests
# (3) post requests (Baidu Translate)
# (4) ajax get requests
# (5) ajax post requests
# (6) cookie login (Weibo)
# (7) Proxies

2. requests

# (1) One type and six attributes
# (2) get requests
# (3) post requests
# (4) Proxies
# (5) cookie login with a verification code
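
To make the comparison concrete, here is a minimal sketch of the same GET request written both ways; note that urllib needs an explicit Request object while requests does not (User-Agent shortened for readability):

```python
import urllib.request
import requests

url = 'https://www.baidu.com/'
headers = {'User-Agent': 'Mozilla/5.0'}

# urllib: build a Request object by hand, then open it and decode the bytes
request = urllib.request.Request(url=url, headers=headers)
content_urllib = urllib.request.urlopen(request).read().decode('utf-8')

# requests: one call, no Request object customization, .text is already a str
content_requests = requests.get(url=url, headers=headers).text
```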

3, Using the requests methods

1. get request with requests

(1) Requesting the Baidu search interface

```python
import requests

url = 'https://www.baidu.com/s'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}
data = {
    'wd': 'Beijing'
}

# url: the requested resource path
# params: the query parameters
# kwargs: a dict of extra options
response = requests.get(url=url, params=data, headers=headers)
content = response.text
print(content)
```

(2) Summary of characteristics

1. Parameters are passed via params
2. Parameters do not need to be urlencoded (see the sketch below)
3. No request object customization is needed
4. The ? in the request resource path is optional
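
Point 2 is easy to verify: pass a non-ASCII keyword and look at `response.url` — requests percent-encodes the parameters for us. A minimal sketch (User-Agent shortened for readability):

```python
import requests

response = requests.get(
    'https://www.baidu.com/s',
    params={'wd': 'εŒ—δΊ¬'},
    headers={'User-Agent': 'Mozilla/5.0'},
)
# The Chinese keyword has been percent-encoded automatically
print(response.url)
# https://www.baidu.com/s?wd=%E5%8C%97%E4%BA%AC
```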

2. post request with requests

(1) Requesting Baidu Translate

```python
# -*- coding: utf-8 -*-
# @Author: it's time, I love brother Xu
# Ollie, let's do it!!!
import requests
import json

url = 'https://fanyi.baidu.com/sug'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}
data = {
    'kw': 'eye'
}
# url: the request address
# data: the request parameters
# kwargs: a dict of extra options
response = requests.post(url=url, data=data, headers=headers)
content = response.text

print(content)
# Note: json.loads no longer accepts an encoding argument (removed in Python 3.9)
obj = json.loads(content)
print(obj)
```

(2) Summary of characteristics

1. post requests do not require encoding or decoding (response.json() even parses the body directly; see the sketch below)
2. The parameter of a post request is data
3. No request object customization is needed
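
One more convenience on top of these points: the Response object has a `.json()` method, so the `json.loads` step in the example above can be dropped entirely. A minimal sketch (User-Agent shortened for readability):

```python
import requests

url = 'https://fanyi.baidu.com/sug'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.post(url=url, data={'kw': 'eye'}, headers=headers)

# .json() decodes the body and parses the JSON in one step
obj = response.json()
print(obj)
```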

3. cookie login with requests

(1) Logging in to the Gushiwen (ancient poetry) website

1. Open the Gushiwen website:
https://so.gushiwen.cn/

2. The login page:

```python
# Login page
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}
```

3. Get the source code of the page

```python
# Get the source code of the page
response = requests.get(url=url, headers=headers)
content = response.text
```

4. Parse the page source to get `__VIEWSTATE` and `__VIEWSTATEGENERATOR`

```python
# Parse the page source to get __VIEWSTATE and __VIEWSTATEGENERATOR
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'lxml')

# Get __VIEWSTATE
viewstate = soup.select('#__VIEWSTATE')[0].attrs.get('value')

# Get __VIEWSTATEGENERATOR
viewstategenerator = soup.select('#__VIEWSTATEGENERATOR')[0].attrs.get('value')
```

5. Get the verification code image

```python
# Get the verification code image (its src is a relative path)
code = soup.select('#imgCode')[0].attrs.get('src')
code_url = 'https://so.gushiwen.cn' + code
```

6. Save the verification code image locally, then look at it and type it in

```python
# Save the verification code image locally, then look at it and type it in.
# requests provides a Session object: requests made through the same session
# share cookies, so the code we download stays tied to our login attempt.
session = requests.Session()
# Fetch the content at the verification code URL
response_code = session.get(code_url)
# Note: use the binary data here
content_code = response_code.content
# Mode 'wb' writes binary data to the file
with open('code.jpg', 'wb') as fp:
    fp.write(content_code)

code_name = input('Please enter the verification code: ')
```

7. Click login

```python
url_post = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'

data_post = {
    '__VIEWSTATE': viewstate,
    '__VIEWSTATEGENERATOR': viewstategenerator,
    'from': 'http://so.gushiwen.cn/user/collect.aspx',
    'email': '18300396393',
    'pwd': '20020102XYPxyp',
    'code': code_name,
    'denglu': '登录',  # the value of the login button on the site
}
# Post through the same session (and to url_post, not url) so the cookies carry over
response_post = session.post(url=url_post, headers=headers, data=data_post)
content_post = response_post.text
with open('gushiwen.html', 'w', encoding='utf-8') as fp:
    fp.write(content_post)
```

8. Run the script and type in the verification code saved as code.jpg (screenshot omitted).

9. Open the saved gushiwen.html in a browser; if the logged-in collection page appears, the login worked. Success, confetti! πŸŽ‰

(2) Difficulties

1. Hidden fields (`__VIEWSTATE` and `__VIEWSTATEGENERATOR`; a small helper is sketched below)
2. The verification code
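
For the first difficulty, here is a sketch of a small helper (my own addition, not from the original walkthrough) that collects every `<input type="hidden">` on the login page in one pass, instead of selecting `__VIEWSTATE` and `__VIEWSTATEGENERATOR` one by one:

```python
from bs4 import BeautifulSoup


def hidden_fields(html):
    """Return {name: value} for every hidden input on the page."""
    soup = BeautifulSoup(html, 'lxml')
    return {
        tag.get('name') or tag.get('id'): tag.get('value', '')
        for tag in soup.select('input[type=hidden]')
    }

# Usage sketch: merge the scraped fields into the login form data
# data_post = {**hidden_fields(content), 'email': ..., 'pwd': ..., 'code': ...}
```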

4, Automatic verification code recognition

1. First, find the Chaojiying (Super Eagle) website:

http://www.chaojiying.com/

A usable account and password: account: action, password: action

2. Then find Python in the development documentation and, once inside, download the Python-language Demo (screenshots omitted).

3. Modify the code

Put the downloaded Demo into our project folder and look at its code:

1. Replace the username and password with our own
2. Follow the prompts and replace the id with our own
3. Generate our own software id in the user center
4. Finally, add () after print (the Demo was written for Python 2)
5. The call returns a dictionary, so we can read the recognized verification code from the value of the corresponding key

4. Source code sharing:

```python
#!/usr/bin/env python
# coding:utf-8

import requests
from hashlib import md5


class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: the captcha type; see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: the picture ID of a wrongly recognized captcha
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


if __name__ == '__main__':
    chaojiying = Chaojiying_Client('action', 'action', '925358')  # username, password, software id (generated in the user center >> software ID)
    im = open('a.jpg', 'rb').read()  # path of the local image file; on Windows you may need to escape backslashes in the path
    print(chaojiying.PostPic(im, 1902).get('pic_str'))  # 1902 is the captcha type; see the price table on the official website
```
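
To tie sections 3 and 4 together, here is a hypothetical glue snippet (not part of the official Demo) that feeds the code.jpg saved during the Gushiwen login into Chaojiying_Client, replacing the manual input() step. It assumes the class above is defined and uses the placeholder credentials:

```python
# Hypothetical glue: recognize the Gushiwen verification code automatically
# instead of typing it in. Assumes Chaojiying_Client from the Demo above.
chaojiying = Chaojiying_Client('action', 'action', '925358')  # placeholder credentials

with open('code.jpg', 'rb') as fp:       # the image saved by session.get(code_url)
    im = fp.read()

result = chaojiying.PostPic(im, 1902)    # 1902: the captcha type used in the Demo
code_name = result.get('pic_str')        # the recognized text of the code
```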


❀️ Recommended previous articles ❀️:

Python crawler ❀️ Urllib usage collection — ⚑ one-click easy entry to crawlers ⚑

Have you grasped the psychology of love in crawlers? A cup of Starbucks warms your whole winter — bs4 parsing for crawlers, from getting started to falling into pits

❀️ I'm not alone! ❀️ Xpath crawler — your most loyal partner: by the old rules, give me one minute, and ten thousand words will get you started with Xpath! ⚑

Python crawler in practice ❀️ Parse the page from scratch and grab the data — crawl any number of pages of Douban movies. Don't understand? Come to me! ❀️

The sky is blue and the crawlers are waiting for you ❀️ post requests ⚑ cookie login ⚑ handler processors ⚑

🌲🌲🌲 Well, that's all I want to share with you today
❀️❀️❀️ If you liked it, don't hold back your one-click triple (like, favorite, follow)~

Tags: crawler requests
