Hitokoto-Spider: a development diary for a Hitokoto ("one sentence") library crawler

Original address: http://bili33.top/2020/02/11/Hitokoto-Spider/

Recently, even though classes have started again (held at home), the two-and-a-half-hour break at noon and the free time at night leave me really idle; what to do? Then I spotted Octoparse on a junior classmate's desktop, and remembering him grabbing the Hitokoto library with Octoparse, I thought: why don't I write a Hitokoto crawler myself? No sooner said than done, so I got to work.

Below is my development diary, written partly from memory, so some details may be fuzzy.

Project address: https://github.com/gamernotile/hitokoto-spider

Monday, February 10, 2020 weather: not very good

Today is the first day of (online) school. DingTalk's conferencing feature is really terrible: the video lags by 3 seconds and the interactive whiteboard by 3 minutes. And the meetings can be replayed but not downloaded??? Fine, out comes Fiddler to capture the packets.

(12:00) Time to start on the crawler. First, let's see how to use the API. The official website shows the following table (for convenience, I copied it straight from the source):

Time | Affected API | Adjustment
Before June 2018 | Old API (http://api.hitokoto.cn and https://sslapi.hitokoto.cn) | The old API will be merged into the v1 API by June via a parsing switch; after the adjustment, requesting this API is no different from requesting the v1 API, but its stability will no longer be maintained.
Before July 2018 | v1 API (https://v1.hitokoto.cn) | The final version of the v1 API will be released. The v1 interface will exist for a long time (even after v2 is released, feel free to keep using it).
v2 release (time unknown) | v2 API (domain unknown) | The v2 API goes online.

Those are the addresses of Hitokoto's APIs. I chose https://v1.hitokoto.cn

With the API picked, I wanted to see what a request returns, so I visited the address and got a big blob of text:

{
  "id": 79,                 // the entry's ID
  "hitokoto": "So, their ceremony is not over.",  // the hitokoto text itself
  "type": "a",              // type; see the official site for details -- here "a" means Anime
  "from": "There is something wrong with my love story",  // the source of the hitokoto
  "from_who": null,         // who said it
  "creator": "Abu carbon.", // the user who submitted this entry
  "creator_uid": 0,         // the submitter's uid, can be ignored
  "reviewer": 0,            // the reviewer's uid, can be ignored
  "uuid": "f6aa4116-5a0f-4ab0-807a-bf3838a5fd23",  // the entry's uuid, can be ignored
  "created_at": "1468605909"  // creation time, shown as a Unix timestamp
}

I looked at it and thought: doesn't the format of this thing look like JSON???

Then I scrolled down and found:

It really is JSON (the comments above were added by me; real JSON doesn't allow comments).

Once we've figured out the structure of the returned result, we can roll up our sleeves and start writing code!

Open VS Code, create a new folder, and work inside it.

First, decide what to use to make HTTP requests. I've used the @Dawnnnnnn/bilibili-live-tools project myself, and while setting it up (it was painful to install the requests library on the school server running Windows Server 2016 Datacenter), I learned very well that requests can simulate HTTP requests. So, install it first!

So in PowerShell I ran

$ pip install requests

to install the requests library, and then typed the following in my Python file:

import requests as r

Because I'm lazy, I like to abbreviate library names; with as r, I don't have to type so many letters. Really convenient.

Next, we need to figure out how to use the requests library, so I went and looked up these functions:

requests.get('https://github.com/timeline.json')                 # GET request
requests.post("http://httpbin.org/post")                         # POST request
requests.put("http://httpbin.org/put")                           # PUT request
requests.delete("http://httpbin.org/delete")                     # DELETE request
requests.head("http://httpbin.org/get")                          # HEAD request
requests.options("http://httpbin.org/get")                       # OPTIONS request

Since I want to GET the result, I take the GET request and type

res = r.get('https://v1.hitokoto.cn/')

This assigns the fetched result to res; we can then pull what we want out of res. Let's run the file once:

import requests as r
res = r.get('https://v1.hitokoto.cn/')
print(res)

Then the returned value is: <Response [200]>

Isn't this just a status code??? I don't want this; I want the actual result...

Then I committed, closed the lid of my trashy Lenovo, and went to bed.

Tuesday, February 11, 2020 weather: still not very good

Development continues today. Yesterday I only got a status code, but now I have an idea!

I just tweaked the print call, printing an attribute of res instead, and the code became

import requests as r
res = r.get('https://v1.hitokoto.cn/')
print(res.text)

And then it printed exactly what I wanted.
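
For reference, the Response object that r.get returns carries the data in its attributes; printing the object itself only shows its repr (hence yesterday's <Response [200]>). A minimal sketch of the attributes this project ends up using:

import requests as r

res = r.get('https://v1.hitokoto.cn/')
print(res.status_code)   # 200 -- the HTTP status code (what print(res) was hinting at)
print(res.text)          # the raw body as a string: the JSON blob from earlier
data = res.json()        # the body parsed into a Python dict
print(data["hitokoto"])  # just the sentence itself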

This project was originally going to be a joint effort with my junior @Soulxyz. I pushed the prototype to the repository, he forked it, and then came back and handed me a file named csvdomo.py (click to download); I suspect he meant to type csvdemo.

I opened it and found: isn't this just 10 concurrent fetches??? And doesn't it fetch everything all over again every time???

So I decided to set his version aside and work on the project myself. Later he told me that Hitokoto's international node allows a higher QPS, and I switched to the international node.

The next step is repeated grabbing. For this part I want the user to enter the number of grabs (an int named num), then use a for loop to grab repeatedly. I wrote the following:

print("Please input the number of grabs. If you want to grab all, please go to https://hitokoto.cn/status view the current total number of words and fill in:“)
num=input()
for i in range(num):
    res = r.get('https://v1.hitokoto.cn/')

This gives us repeated grabbing. As for why you still have to type in the number even to grab everything: because I really have no way to fetch that total from inside the program.

With grabbing working, the next step is writing the results to a file.

My plan is to write csv files. I had used the relevant functions in a course from South China University of Technology, so I went searching for them, and it turns out Python supports csv natively, no pandas needed, so I used the built-in csv library.

(Class is starting, gotta go)

(16:30) I'm back! Now to solve writing the csv file. I found the relevant method online:

First open a file and bind it to the variable file, then def two functions, one for creating (the header) and one for appending, like this:

import csv
def create_csv(path):
    with open(path,"w+",newline="",encoding="utf8") as file:
        # newline="" stops csv from writing extra blank lines between rows
        csv_file = csv.writer(file)
        head = ["id","sort","hitokoto"] # create the csv header
        csv_file.writerow(head)
def append_csv(path):
    with open(path,"a+",newline='',encoding="utf8") as file:
        csv_file = csv.writer(file)
        data = [inputs]     # inputs is a global list holding the row to write
        csv_file.writerows(data)

So now I have two functions defined, and I'll just use them directly below.
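
A quick sketch of how the two are meant to be wired together (file name here is hypothetical; note that append_csv reads the row from the global inputs):

path = "hitokoto.csv"                                      # hypothetical output file
create_csv(path)                                           # write the header row once
inputs = [79, "Anime", "So, their ceremony is not over."]  # one row of data
append_csv(path)                                           # append it to the file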

I also wrote the code that lets the user supply the output path and the number of grabs; the variables are then used directly:

print("Please input the output name of the file (please input the suffix of the file by yourself csv How to save):")
path=input()    # output file
print("Please input the number of grabs. If you want to grab all, please go to https://hitokoto.cn/status view the current total number of words and fill in:“)
num=int(input()) # Grasping quantity

path holds the file path and num drives the number of grabs below. Perfect!

Next, lay out the captured data:

data=res.json()
if data["type"]== "a": sorts=("Anime")  # map the type code back to a category name
if data["type"]== "b": sorts=("Comic")
if data["type"]== "c": sorts=("Game")
if data["type"]== "d": sorts=("Novel")
if data["type"]== "e": sorts=("Myself")
if data["type"]== "f": sorts=("Internet")
if data["type"]== "g": sorts=("Other")
inputs=[data["id"],sorts,data["hitokoto"]]
# print(res.text)   # prints the raw hitokoto; remove the leading # if you want it
append_csv(path)
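
As an aside, the if-chain could be compressed into a dict lookup; a sketch (not what I wrote at the time; the code-to-name mapping is the same one as in the if-chain above):

SORT_NAMES = {"a": "Anime", "b": "Comic", "c": "Game", "d": "Novel",
              "e": "Myself", "f": "Internet", "g": "Other"}
sorts = SORT_NAMES.get(data["type"], "Other")   # fall back to Other for unknown codes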

That completes the write operation! With csv writing done, the last thing left is deduplication; otherwise the output will be full of repeats, and this turned out to be the most troublesome part.

My initial idea: every time I fetch a result, store its id in a variable called ids, initialized first as

ids=[]

This initializes ids as an empty list, and ids get appended to it. On every fetch, traverse ids and compare each stored id with the newly obtained one; if they match, we've fetched a duplicate, so discard the result and fetch again.

I wrote my dedup logic out like this (taken from that day's commit):

ids=['0']
i=1
for i in range(num):
    res = r.get('https://international.v1.hitokoto.cn/',timeout=60) # get a response from the server; the response carries the json body (res.text) and a status code
    data=res.json() # parse the json body into a dict
    t=1
    for t in range(i):
        if data["id"]==ids[t]:
            # the id already exists, i.e. this one was grabbed before.
            # An error may be thrown here; it has not been fixed yet (the array-out-of-range BUG)
            break
        else:
            t=t+1   # increment by hand
            if t==i:
                ids.append(data["id"])
    if data["type"]== "a": sorts=("Anime")  # map the type code back to a category name
    if data["type"]== "b": sorts=("Comic")
    if data["type"]== "c": sorts=("Game")
    if data["type"]== "d": sorts=("Novel")
    if data["type"]== "e": sorts=("Myself")
    if data["type"]== "f": sorts=("Internet")
    if data["type"]== "g": sorts=("Other")
    inputs=[data["id"],sorts,data["hitokoto"]]  # the row that append_csv writes (the commit assigned this to temp by mistake)
    print(res.text)
    append_csv(path)

Then, partway through a run, it tells me the list index is out of range?!!!

This problem is genuinely annoying: C++ will happily let you index past the end of an array, but Python won't, it just throws an error at you. So every time I hit this problem I really don't want to deal with it, but the program can't run without this part!!! So I commented the section out for now, closed the trashy Lenovo, and went to sleep!

Wednesday, February 12, 2020 weather: it's raining!

Class ended at 11:20 this morning, and I started writing code, hoping to finish the duplicate-detection feature by noon.

Thinking of yesterday's out-of-range headache, I decided to abandon that approach: no more storing things in a plain list!

I remembered that the South China University of Technology course had something like array[t] in it, which brought to mind a sharp tool: the array module!

The new idea is to store the data in an array named temp, initializing the variable first:

from array import array
temp = array('i',[0])

As for why I put a 0 in it: I found that if I don't, it throws an error at me.

This initializes an int array named temp containing the single element 0. Then, each time a result is fetched, save its id into temp:

temp.append(data["id"])

This way, each fetch traverses the contents of temp: a match means it's already been grabbed, no match means it hasn't. Deduplication done!

# (excerpt: time and datetime are imported earlier in the script, and delay, start_Pro, temp, num, and path are set up above)
for i in range(num):
    time.sleep(delay)
    print("Getting a new word")
    res = r.get('https://international.v1.hitokoto.cn/',timeout=60)
    # get a response from the server; the response carries the json body (res.text) and a status code
    data=res.json() # parse the json body into a dict
    temp_minus=temp.count-1
    if temp_minus!=0:
        t=1
        print("Detecting whether the result has been grabbed")
        for t in range(temp.count):
            if(int(data["id"])==temp[t]):
                print("Found a result that was already grabbed, discarding")
                i=i-1
                break
            elif(t==temp.count-1):
                print("New result, saving to file")
                if data["type"]== "a": sorts=("Anime")  # map the type code back to a category name
                if data["type"]== "b": sorts=("Comic")
                if data["type"]== "c": sorts=("Game")
                if data["type"]== "d": sorts=("Novel")
                if data["type"]== "e": sorts=("Myself")
                if data["type"]== "f": sorts=("Internet")
                if data["type"]== "g": sorts=("Other")
                inputs=[data["id"],sorts,data["hitokoto"]]
                # print(res.text)   # prints the raw hitokoto; remove the leading # if you want it
                append_csv(path)
                temp.append(data["id"])
                end_Pro=datetime.datetime.now()
                print("Completed quantity:"+str(i+1)+',Used time:'+str(end_Pro-start_Pro))
                break
    else:
        if data["type"]== "a": sorts=("Anime")  # map the type code back to a category name
        if data["type"]== "b": sorts=("Comic")
        if data["type"]== "c": sorts=("Game")
        if data["type"]== "d": sorts=("Novel")
        if data["type"]== "e": sorts=("Myself")
        if data["type"]== "f": sorts=("Internet")
        if data["type"]== "g": sorts=("Other")
        inputs=[data["id"],sorts,data["hitokoto"]]
        # print(res.text) # prints the raw hitokoto; remove the leading # if you want it
        append_csv(path)
        temp.append(data["id"])
        end_Pro=datetime.datetime.now()
        print("Completed quantity:"+str(i+1)+',Used time:'+str(end_Pro-start_Pro))

That's the final shape, and then a problem popped up again: temp.count didn't seem to be recognized correctly, so I opened a new file to test what it actually outputs:

emmmm, what is this???

Then I turned to the all-knowing Baidu for help. Netizens told me I could use len(temp) to get the number of elements in the array. I changed it, and finally the thing runs!!!
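
In hindsight the confusion is easy to reproduce (a quick sketch reconstructed after the fact; array.count is a method that counts occurrences of a value, so it has to be called with an argument):

from array import array

temp = array('i', [0])
print(temp.count)     # <built-in method count of array.array object ...> -- a bound method, not a number
print(temp.count(0))  # 1 -- count() wants an argument: how many times the value 0 appears
print(len(temp))      # 1 -- the number of elements, which is what I actually wanted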

And then the next question arrived... What is this??? Why is the completed quantity not consecutive??? I figured it's probably that i still advances when a duplicate is caught (though this test run caught an awful lot of duplicates, the same ones over and over... being unlucky to the extreme is its own kind of luck).

Later I tried adding an i=i-1 in the appropriate spot, but it still didn't work. Maybe the loop variable just can't be changed??? (It turns out a for loop reassigns i from range(num) at the top of every iteration, so decrementing it inside the body has no lasting effect.)

Fine, I'll define a counter variable myself and use a while loop instead.
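
A minimal sketch of the while-loop shape (the names here are my own; the actual code is in the repo). The point is that the counter only advances when a new entry is actually saved:

saved = 0
ids = []                       # ids that have already been written out
while saved < num:
    res = r.get('https://international.v1.hitokoto.cn/', timeout=60)
    data = res.json()
    if data["id"] in ids:      # duplicate: discard it and fetch again
        continue
    ids.append(data["id"])
    inputs = [data["id"], data["type"], data["hitokoto"]]  # row for append_csv (type-code-to-name mapping omitted for brevity)
    append_csv(path)
    saved += 1                 # only count rows actually written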

Finally it works: no more duplicate results! (A connection timed out partway through and stopped the program, so I only grabbed 2159 entries even though I wanted the whole Hitokoto library.) Click me to download the 2159 entries; they are UTF-8 encoded, so re-save them as GBK if you want to view them in Excel.
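
In hindsight there's a trick that skips the save-as-GBK dance: writing the header with encoding="utf-8-sig" puts a BOM at the start of the file, so Excel detects UTF-8 on its own. A sketch of that tweak to create_csv (not what the script did; the later appends can stay plain utf8):

import csv

# the BOM written by utf-8-sig at the start of the file lets Excel open it as UTF-8 directly
with open("hitokoto.csv", "w", newline="", encoding="utf-8-sig") as file:
    csv.writer(file).writerow(["id", "sort", "hitokoto"])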

Finally done!!! The project can at last run as expected!

Problems still to be solved:

  • JSON config file support
  • GUI support


Digression:

This was my first time writing a crawler and putting the Python I've learned to use. The project has been uploaded to https://github.com/gamernotile/hitokoto-spider

I'm also really glad someone was willing to work on this project with me. I hope you can make good use of it and stop using Octoparse; that VIP-paywalled software is no friend to us programmers. Better to do it yourself.
