Crawling the Super Curriculum app's list of schools and departments with Python 3

There are three steps: 1. install the software; 2. grab the data; 3. write the program.

Install software

You need: Python 3 (pick an IDE according to personal preference; I use Anaconda's Jupyter Notebook); Fiddler 5.0 for packet capture; and the latest version of the Super Curriculum app on the phone.

Grab data

1. Set up the packet-capture environment. First, put the computer and the phone on the same Wi-Fi network. In the phone's wireless settings, add the Wi-Fi network (or long-press the network name and choose "Modify network" if already connected), tap Advanced options, set the proxy to "Manual", enter the computer's IP address as the proxy host name, and enter the capture port as the proxy port. Note that this port must be the same port Fiddler captures on; Fiddler's default is 8888, and it must not clash with anything else on the computer. Anaconda's Jupyter Notebook also defaults to port 8888, which gave me 403 errors for a long time. Once this is set, use the app normally on the phone (any network request will do); if the traffic shows up in Fiddler, the environment is ready (a small sanity-check sketch follows this list).
2. Start grabbing data. Open the Super Curriculum app on the phone, go to the personal information page, choose to change school (there is a prompt that the school can only be changed twice; ignore it, as long as you don't actually save), and type any school into the search box (here I shamelessly entered Tsinghua University).
Then, in Fiddler, you can see the request that was made.
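
As a quick sanity check before touching the phone, you can send one request from the computer through the proxy; if Fiddler is listening on its default port 8888, the request shows up in its session list. A minimal sketch (the target URL is just a throwaway example):

import requests

# Assumes Fiddler is capturing on its default port 8888 on this computer
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
try:
    r = requests.get("http://example.com", proxies=proxies, timeout=5)
    print("Proxy reachable, status:", r.status_code)
except requests.exceptions.ProxyError:
    print("Nothing is listening on port 8888 - is Fiddler running?")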

Here I was a bit confused. In the app you search for a school by text, but the captured request POSTs a schoolId, so there must be a table somewhere mapping names to ids. Looking through the captured requests again, there is a call to getUpdateSchoolList, but its return value is empty, so I suspect the school table is cached in a local file on the phone. If anyone captures the traffic right after a fresh install or registration, please try it and let me know what that request returns.
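
Whatever the app does internally, once getUpdateSchoolList has returned the full list (which is exactly what get_school_name() below collects), resolving a typed school name to its schoolId is just a local lookup. A minimal sketch of that idea, assuming the list of {"name": ..., "id": ...} dicts returned by get_school_name():

def find_school_id(school_dict, keyword):
    # Search the locally held school list for names containing the keyword,
    # which is presumably what the app does against its cached table.
    return [s["id"] for s in school_dict if keyword in s["name"]]

# e.g. find_school_id(get_school_name(), "清华")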

Programming

The idea:
Simulate the request → receive the returned data → unpack it → write it to a file

import requests
import csv

def get_school_name():
    headers = {
        "Cookie": "JSESSIONID=289617D002DA1E5DBAB1F9B6851AB63C-memcached1; Path=/; HttpOnly",
        "User-Agent": "Dalvik/2.1.0 (Linux; U; Android 8.0.0; LLD-AL20 Build/HONORLLD-AL20)-SuperFriday_9.6.1",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Content-Length": "146",
        "Host": "120.55.151.61",
        "Connection": "Keep-Alive",
        "Accept-Encoding": "gzip",
    }
    data = "phoneModel=LLD-AL20&searchType=5&phoneBrand=HONOR&channel=huawei&page=1&platform=1&versionNumber=9.6.1&content=160027&phoneVersion=26&timestamp=0&"
    url = "http://120.55.151.61/V2/School/getUpdateSchoolList.action"
    # Build a list of {name, schoolId} entries for every school known to the app
    s = requests.Session()
    r = s.post(url = url,data = data,headers = headers)
    print(r.status_code)
    # print(r.text)
    # Parse the response as JSON; eval() on r.text would trip over the
    # JavaScript-style literals false/true/null (hence the old false -> False hack).
    school_list = r.json()["data"]["schoolBOs"]
    school_dict = []
    for school in school_list:
        # keep only the school name and its schoolId
        school_dict.append({"name": school["name"], "id": school["schoolId"]})
#     print(school_dict)
    return school_dict

def get_school_content(s_list):
    
    s = requests.Session()
    school_dict_list = []
    print("Start of crawling")
    for school in s_list:
        if school["name"] != "":
            print(school["name"],school["id"])
            url = "http://120.55.151.61/V2/Academy/findBySchoolId.action"
            data = "phoneModel=LLD-AL20&phoneBrand=HONOR&schoolId=" + str(school["id"]) + "&channel=huawei&platform=1&versionNumber=9.6.1&phoneVersion=26&"
#             data = "phoneModel=LLD-AL20&phoneBrand=HONOR&schoolId=" + '1001' + "&channel=huawei&platform=1&versionNumber=9.6.1&phoneVersion=26&"

            headers = {
                "Cookie": "JSESSIONID=289617D002DA1E5DBAB1F9B6851AB63C-memcached1; Path=/; HttpOnly",
                "User-Agent": "Dalvik/2.1.0 (Linux; U; Android 8.0.0; LLD-AL20 Build/HONORLLD-AL20)-SuperFriday_9.6.1",
                "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
                "Content-Length": "113",
                "Host": "120.55.151.61",
                "Connection": "Keep-Alive",
            "Accept-Encoding": "gzip"
            }
            r = s.post(url = url,data = data,headers = headers)
            content = r.json()["data"]  # parse as JSON here as well, instead of eval()
            temp_list = []
#             print(r.text)
            for item in content:
                temp_list.append(item["name"])
            school_dict_list.append({"name":school["name"],"id":school["id"],"content":temp_list})
            
    print("End of crawling")
    return school_dict_list
    
def save1(school_dict_list):
    # Dump the whole list to a text file as a Python-literal string
    f = open("C:\\Users\\KSH\\Desktop\\Reptile.txt", "w", encoding="utf-8")
    f.write(str(school_dict_list))
    f.close()
    
def save2(school_dict_list):
    # Write one CSV row per school: name, schoolId, then all of its departments.
    # newline="" avoids the blank lines csv.writer otherwise produces on Windows.
    f = open("C:\\Users\\KSH\\Desktop\\Reptile.csv", "w", newline="", encoding="utf-8")
    csv_writer = csv.writer(f)
    csv_writer.writerow(["university", "schoolID", "Department"])
    for school in school_dict_list:
        contentList = [school["name"], school["id"]]
        contentList.extend(school["content"])
        csv_writer.writerow(contentList)
    f.close()
    
# a = [{"name": "Tsinghua University", "id":"1001"},{"name": "Tsinghua University", "id":"1001"}]
a = get_school_name()
b = get_school_content(a)
save2(b)
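
If you want to reload the crawled result later, dumping it as JSON is friendlier than save1's plain str() text file. A minimal optional sketch; the file path just mirrors the examples above:

import json

def save3(school_dict_list):
    # Save the crawled list as JSON so it can be read back later with json.load()
    with open("C:\\Users\\KSH\\Desktop\\Reptile.json", "w", encoding="utf-8") as f:
        json.dump(school_dict_list, f, ensure_ascii=False, indent=2)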

Full code download
Crawled csv file download

Reference article:
Crawler re-exploration in practice (5) - crawling APP data - Super Curriculum [1] - unrivalled children
