Python Crawler 9: Form Interaction and Simulated Login

Contents

9.1 Simulated Login

9.1.1 Cookie Overview

9.1.2 Submitting Cookies to Simulate Login

9.2 Form Interaction

9.2.1 The POST Method

9.2.2 Viewing the Page Source and Submitting the Form

         Whether a page is a simple static page or one built with asynchronous loading, its information is fetched by requesting the URL with the GET method. But how do you get the information behind a login form? This section explains the POST method of the Requests library: filling in a form by reading its source code and reverse engineering it to obtain page information, and simulating login to a website by submitting Cookie information.

        The main knowledge points of this section are as follows:

        Form interaction: use the POST method of the Requests library to interact with forms

        Cookie: understand the basic concept of a cookie

        Simulated login: learn to use Cookie information to simulate logging in to a website

9.1 Simulated Login

                Sometimes form fields are encrypted or otherwise obfuscated, which makes constructing the form difficult. In that case, you can submit Cookie information instead to simulate login.

9.1.1 Cookie Overview

        A cookie is data that some websites store on the user's local machine in order to identify the user and track sessions. Online shopping sites, for example, recommend goods of interest to users by tracking their cookie information. And because cookies hold the user's information, we can simulate logging in to a website by submitting them.
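        To make this concrete, here is a minimal sketch (using the public httpbin.org test service, which is only a stand-in and not part of this tutorial's target site) showing how the Requests library exposes cookies that a server sets:

import requests

# httpbin's /cookies/set endpoint sets the given cookie and redirects;
# a Session keeps every cookie the server sets in its cookie jar.
session = requests.Session()
session.get('https://httpbin.org/cookies/set/sessionid/abc123')

for cookie in session.cookies:
    print(cookie.name, '=', cookie.value)   # sessionid = abc123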

9.1.2 Submitting Cookies to Simulate Login

        Next, take yaozh.com as an example: find the Cookie information and submit it to simulate logging in to the site.

        (1) Open yaozh.com, open Chrome's developer tools, and select the Network tab.

        (2) Enter your account and password manually and log in. You will see many requests load in the Network panel.

        (3) You do not need the request for the login page itself; look directly at the requests made after logging in, as shown in the following figure.

        (1) Pass the Cookie in the request headers

import requests

member_url = 'https://www.yaozh.com/member/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
    # Copy the full Cookie string from the logged-in request in DevTools
    'Cookie': '_ga=GA1.2.1508206450.1629445703; UtzD_f52b_saltkey=NzQM8wJu; UtzD_f52b_lastvisit=1629442261; yaozh_uidhas=1; UtzD_f52b_ulastactivity=1629445859%7C0; _gid=GA1.2.468845100.1630220751; yaozh_userId=828458; PHPSESSID=q3p104a4m5oidc0ftanp7n4rk7; Hm_lvt_65968db3ac154c3089d7f9a4cbb98c94=1629705497,1629792799,1630220751,1630292865; yaozh_mylogin=1630304933; UtzD_f52b_creditnotice=0D0D2D0D0D0D0D0D0D721338; UtzD_f52b_creditbase=0D0D6D0D0D0D0D0D0; UtzD_f52b_creditrule=%E6%AF%8F%E5%A4%A9%E7%99%BB%E5%BD%95; acw_tc=707c9f9816303118236322722e19d2bb3bf070bb34e00eed1b4400b5b6ab67; UtzD_f52b_lastact=1630311824%09uc.php%09; Hm_lpvt_65968db3ac154c3089d7f9a4cbb98c94=1630311838'
}

response = requests.get(member_url, headers=headers)

print(response.text)

        Note that the Cookie value you copy directly is sometimes incorrect. If you get an error here, check the value on the website again.
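        A quick way to confirm the copied Cookie works is to search the response for a string that only appears after login. A small sketch, continuing from the code above:

# 'response' holds the member page from the previous example.
# The marker string is an assumption; substitute any text (for example
# your username) that only appears on the page after login.
if '18303056364' in response.text:
    print('Cookie accepted: logged in')
else:
    print('Cookie rejected: copy it again from the browser')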

        (2) Pass cookies directly as a parameter

import requests

member_url = 'https://www.yaozh.com/member/'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
}

cookies = '_ga=GA1.2.1508206450.1629445703; UtzD_f52b_saltkey=NzQM8wJu; UtzD_f52b_lastvisit=1629442261; yaozh_uidhas=1; UtzD_f52b_ulastactivity=1629445859%7C0; _gid=GA1.2.468845100.1630220751; yaozh_userId=828458; PHPSESSID=q3p104a4m5oidc0ftanp7n4rk7; Hm_lvt_65968db3ac154c3089d7f9a4cbb98c94=1629705497,1629792799,1630220751,1630292865; yaozh_mylogin=1630304933; UtzD_f52b_creditnotice=0D0D2D0D0D0D0D0D0D721338; UtzD_f52b_creditbase=0D0D6D0D0D0D0D0D0; UtzD_f52b_creditrule=%E6%AF%8F%E5%A4%A9%E7%99%BB%E5%BD%95; acw_tc=707c9f9816303118236322722e19d2bb3bf070bb34e00eed1b4400b5b6ab67; UtzD_f52b_lastact=1630311824%09uc.php%09; Hm_lpvt_65968db3ac154c3089d7f9a4cbb98c94=1630311838'

"""
# The cookies parameter needs a dictionary; loop version:
cook_dict = {}
cookies_list = cookies.split('; ')
for cookie in cookies_list:
    cook_dict[cookie.split('=', 1)[0]] = cookie.split('=', 1)[1]
"""

# Same thing as a dictionary comprehension (maxsplit=1, since values may contain '=')
cook_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[1] for cookie in cookies.split('; ')}

response = requests.get(member_url, headers=headers, cookies=cook_dict)

data = response.content.decode()
print(data)
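        If several pages need the same login state, one option (a sketch, not something the tutorial requires) is to attach the cookie dictionary to a Session once, so every later request sends it automatically:

import requests

session = requests.Session()
session.cookies.update(cook_dict)   # the cookie jar accepts a plain dict here

# Every request made through this session now carries the login cookies
member_page = session.get(member_url, headers=headers)
print(member_page.status_code)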

9.2 Form Interaction

        This section explains how to use the POST method of the Requests library: submit a form by studying its source code in the web page, and obtain the fields the form submits through reverse engineering, so as to interact with the form.

9.2.1 The POST Method

        The POST method of the Requests library is simple to use: just pass a dictionary to the data parameter. When the request is made, the dictionary is automatically encoded as form data, which completes the form submission.

import requests

# Replace url with your form's action address;
# httpbin.org/post is a public test endpoint that echoes the form back
url = 'https://httpbin.org/post'

params = {
    'key1': 'value1',
    'key2': 'value2',
    'key3': 'value3'
}

html = requests.post(url, data=params)   # POST method: params is sent as form data

print(html.text)

9.2.2 Viewing the Page Source and Submitting the Form

        (1) Open yaozh.com, locate the login form, and use Chrome's "Inspect" feature to find the login element, as shown in the figure below:

          Pay attention to several parameters here. The first two are username and pwd, and the other two are the formhash and backurl in the last two lines. These four parameters are the fields in the Form Data, so they are very important.

          (2) Construct the code

import requests

url = 'https://www.yaozh.com/login'   # login address
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}

data = {   # parameters passed in the form
    'username': '18303056364',
    'pwd': 'qfkjr8yn',
    'formhash': '5AE08D06CB',
    'backurl': 'https%3A%2F%2Fwww.yaozh.com%2F'
}

session = requests.Session()                       # establish a session
r = session.post(url, data=data, headers=headers)  # log in with the form parameters

member_url = 'https://www.yaozh.com/member/'       # page to visit after login
r2 = session.get(member_url, headers=headers)      # the session carries the login cookie
print(r2.text)

        One point to note: the values of formhash and backurl must be found before logging in. They can be found in the source code of the login page while you are still logged out, and formhash changes over time.
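        Because formhash changes, hard-coding it will eventually fail, so it is better to fetch it from the login page right before posting. Below is a sketch of one way to do that, continuing from the code above; it assumes the login page carries a hidden tag like <input name="formhash" value="...">, which you should confirm in the page source first:

import re

# 'session', 'headers' and 'data' come from the code above
login_page = session.get('https://www.yaozh.com/login', headers=headers)

# The regular expression is an assumption about the markup; adjust it to
# match the actual hidden field in the page source.
match = re.search(r'name="formhash"\s+value="(.*?)"', login_page.text)
if match:
    data['formhash'] = match.group(1)   # refresh the value before posting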

Tags: Python crawler

Posted on Tue, 14 Sep 2021 14:50:52 -0400 by Pedro Sim