[Python industry analysis] crawler analysis of BOSS direct recruitment information acquisition

Today, we are going to officially use the program to crawl the BOSS recruitment data. I will improve the program step by step from the most basic step to help you understand the crawler program. There are still many problems that I can't solve. I also hope that someone can leave a message to help us

First, let's visit the next page to see if the result is consistent with the browser's visit

There are too many information returned from specific pages. We can find that the Title of different pages is different, and it changes according to our query conditions. So we can only focus on the change of Title for the moment

from bs4 import BeautifulSoup as bs
import requests


def de_title(func):
    def wrapper(*args, **kwargs):
        req = func(*args, **kwargs)
        content = req.content.decode("utf-8")
        content = bs(content, "html.parser")
        print(func.__name__, content.find("title").text)
    return wrapper


@de_title
def test1():
    req = requests.get('https://www.zhipin.com')
    return req


@de_title
def test2():
    req = requests.get('https://www.zhipin.com/job_detail/?query=python&city=101010100&industry=&position=100109')
    return req


if __name__ == "__main__":
    test1()
    test2()


It can be seen from the returned result that the home page is returned normally, but the result of the query page is different from the expected result
I will go to the browser again to refresh the next query page. If your network is not good, you can see that there will be a loading process, please wait a moment
After Baidu, I learned that this is the generation of cookies

Get Cookies

It's true that the above code is very simple. It's just for visiting, without adding cookies. boss must have done a lot of anti creep functions.
Let's analyze the wave first. We can crawl to the homepage to show that the homepage does not need cookie verification. We can open F12 of the browser to enter the Application to see how cookies look

You can see that there is a cookie on the left side, and then you can right-click the URL to see Clear. We will Clear the saved cookies first, and when will they be generated
The key of cookies has__ c and Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a, seems to be the effective time, but the actual test does not seem to take an hour.



There is also an encrypted field in cookies__ zp_stoken__ Analyze the cookie production mechanism of boss again, I can't trigger js to generate effective cookies, and__ zp_stoken__ It's hard to encrypt. Finding relevant articles means that you can't understand them. Give up

At this point, it can be thought that since the homepage does not need to be cached, we can first visit the homepage, and continue to visit the subsequent pages using the cookies generated by the homepage, which seems to be OK, but there is a difference between the acquisition process of the crawler and the browser we use. After the browser obtains the page, it will parse the html to render the js, css, etc., and when the boss generates the cookies, it needs to obtain the js The current route is redirected to the characteristic route. After many calculations, the crawler does not render directly. It gives up again

However, in order to obtain some data, only the latest cookies can be obtained from the browser. Refresh the browser regularly to keep the cookies effective. True low

GO! Get Cookies from browser

import os
import json
import base64
import sqlite3
from win32crypt import CryptUnprotectData
from cryptography.hazmat.primitives.ciphers.aead import AESGCM


def get_string(local_state):
    with open(local_state, 'r', encoding='utf-8') as f:
        s = json.load(f)['os_crypt']['encrypted_key']
    return s


def pull_the_key(base64_encrypted_key):
    encrypted_key_with_header = base64.b64decode(base64_encrypted_key)
    encrypted_key = encrypted_key_with_header[5:]
    key = CryptUnprotectData(encrypted_key, None, None, None, 0)[1]
    return key


def decrypt_string(key, data):
    nonce, cipherbytes = data[3:15], data[15:]
    aesgcm = AESGCM(key)
    plainbytes = aesgcm.decrypt(nonce, cipherbytes, None)
    plaintext = plainbytes.decode('utf-8')
    return plaintext


def get_cookie_from_chrome(host):
    local_state = os.environ['LOCALAPPDATA'] + r'\Google\Chrome\User Data\Local State'
    cookie_path = os.environ['LOCALAPPDATA'] + r"\Google\Chrome\User Data\Default\Cookies"

    sql = "select host_key,name,encrypted_value from cookies where host_key='%s'" % host

    with sqlite3.connect(cookie_path) as conn:
        cu = conn.cursor()
        res = cu.execute(sql).fetchall()
        cu.close()
        cookies = {}
        key = pull_the_key(get_string(local_state))
        for host_key, name, encrypted_value in res:
            if encrypted_value[0:3] == b'v10':
                cookies[name] = decrypt_string(key, encrypted_value)
            else:
                cookies[name] = CryptUnprotectData(encrypted_value)[1].decode()

        # print(cookies)
        return cookies


if __name__ == "__main__":
    print(get_cookie_from_chrome('.zhipin.com'))

# Print results
> {'Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a': '1591673534', 'Hm_lvt_194df3105ad7148dcf2b98a91b5e727a': '1591090007,1591669802', '__a': '8822883.1591091039.1591091039.1591669802.22.2.14.22', '__c': '1591669802', '__g': '-', '__l': 'l=%2Fwww.zhipin.com%2Fshanghai%2F&r=&friend_source=0&friend_source=0', '__zp_stoken__': 'ddfaaCzFwRCxhSDdVFyZWXhMbWlVzT3c9XEtcFFtqWzJsClIXOkAaLHYPQU4EVQFRQSADE3tSDRQoX3dkHBwcGUxZKzhQID5pY35mGiMvDT8aR2cvOlt0Ukc5YSoYQitNAxlGbCBbZz9gTSU%3D', 'lastCity': '101020100'}

Add Cookies to the request

from tp.boss.get_cookies import get_cookie_from_chrome
from bs4 import BeautifulSoup as bs
import requests


@de_title
def test3():
    cookie_dict = get_cookie_from_chrome('.zhipin.com')
    # To convert a dictionary to CookieJar:
    cookies = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
    s = requests.Session()
    s.cookies = cookies
    req = s.get('https://www.zhipin.com/job_detail/?query=python&city=101010100&industry=&position=100109')
    return req

if __name__ == "__main__":
    test1()
    test2()
    test3()


There is another situation. This is not the Title that should be found in the query, but the problem of Cookies seems to be in the past. Please print the details and have a look

It's a restricted IP address. Try adding a header

from tp.boss.get_cookies import get_cookie_from_chrome
from bs4 import BeautifulSoup as bs
import requests
import random


@de_title
def test4():
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; ...) Gecko/20100101 Firefox/61.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
        "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15"
    ]

    headers = {
        "user-agent": random.choice(user_agent_list)
    }


    cookie_dict = get_cookie_from_chrome('.zhipin.com')
    # To convert a dictionary to CookieJar:
    cookies = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
    s = requests.Session()
    s.cookies = cookies
    s.headers = headers
    req = s.get('https://www.zhipin.com/job_detail/?query=python&city=101010100&industry=&position=100109')
    return req

if __name__ == "__main__":
    test1()
    test2()
    test3()
    test4()



We finally saw the recruitment information
If you can't, you need to refresh your browser. Oh, it's too low...

That's it today

Tags: Windows Python JSON Google

Posted on Tue, 09 Jun 2020 03:16:16 -0400 by PW