Notes on a failed crawl

One day, a piece of misleading news sent me to the public information website to look up designated drugstores. The results were not much of a success, but the process was fun, so I can squeeze one more throwaway article out of it. The original text follows:

The site's search function is limited, so I wrote a crawler. The idea has two steps: first collect each drugstore's name and numeric ID from the paginated list, then look up the specific address of each drugstore by its ID.

There are only 773 entries, so I ran the crawl semi-manually. The code is rough and was finished in ten minutes, but it works well enough. The code is shared below:


package com.fun

import com.fun.frame.Save
import com.fun.frame.httpclient.FanLibrary
import com.fun.utils.Regex
import com.fun.utils.WriteRead
import net.sf.json.JSONObject
import org.apache.http.client.methods.HttpGet
import org.apache.http.client.methods.HttpPost

class sd extends FanLibrary {

    /** Collected result lines, one per drugstore. */
    static names = []

    public static void main(String[] args) {

        test2(1)   // quick smoke test of the detail request

        // Step 1 (already run once, hence commented out): the list has 52 pages;
        // fetch each page and save the collected rows to the file "hospital".
//        52.times {
//            test(it + 1)
//        }
//        Save.saveStringList(names, "hospital")

        // Step 2: read the saved rows back, look up each drugstore's address
        // by its numeric ID, and append the address to the row.
        def lines = WriteRead.readTxtFileByLine(LONG_Path + "hospital")
        lines.each {
            def s = it.split(" ", 3)[1]   // the ID field of the saved row
            output(s)
            it += test2(changeStringToInt(s))
            names << it
        }
        Save.saveStringList(names, "address.txt")

        testOver()
    }

    /** Step 1: fetch one page of the drugstore list and collect its rows. */
    static def test(int i) {
        String url = "http://ybj.beijing.gov.cn/ddyy/ddyy2/list"

        HttpPost httpPost = getHttpPost(url, getJson("page=" + i))

        JSONObject response = getHttpResponse(httpPost)
        def all = response.getString("content").replaceAll("\\s", EMPTY)

        // Each drugstore is one <tr>...</tr> row; drop the table header row.
        def infos = Regex.regexAll(all, "<tr>.*?</tr>")
        infos.remove(0)
        output(infos)
        output(infos.size())
        infos.each { x ->
            // Strip the leading "<tr>" and turn the remaining tags into spaces.
            names << x.substring(4).replaceAll("<.*?>", SPACE_1)
        }
        output(names.size())
    }

    /** Step 2: fetch one drugstore's detail table by ID and return it as plain text. */
    static def test2(int i) {
        String url = "http://ybj.beijing.gov.cn/ddyy/ddyy2/findByName?id=" + i

        HttpGet httpGet = getHttpGet(url)

        JSONObject response = getHttpResponse(httpGet)
        def all = response.getString("content").replaceAll("\\s", EMPTY)

        def infos = Regex.regexAll(all, "<tr>.*?</tr>")
        output(infos)
        def address = EMPTY
        infos.each { x ->
            address += x.substring(4).replaceAll("<.*?>", SPACE_1)
        }
        output(address)
        return address
    }
}
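For readers who do not have the FunTester framework on hand, the same two requests can be sketched in plain Groovy. This is only a minimal sketch under assumptions: that the list endpoint accepts a JSON body like {"page":1} (which is what getJson("page=" + i) suggests), that both endpoints return JSON carrying the HTML table in a content field (as response.getString("content") implies), and that the names PlainCrawl, listPage, detail and rows are mine, not part of the original code.

import groovy.json.JsonSlurper

class PlainCrawl {

    static final String BASE = "http://ybj.beijing.gov.cn/ddyy/ddyy2"

    // POST the page number as a JSON body (assumed format) and return the
    // "content" HTML with all whitespace removed, mirroring the original code.
    static String listPage(int page) {
        def conn = new URL(BASE + "/list").openConnection()
        conn.requestMethod = "POST"
        conn.doOutput = true
        conn.setRequestProperty("Content-Type", "application/json")
        conn.outputStream.withWriter("UTF-8") { it << """{"page":$page}""" }
        def json = new JsonSlurper().parse(conn.inputStream)
        json.content.replaceAll("\\s", "")
    }

    // GET one drugstore's detail table by ID.
    static String detail(int id) {
        def json = new JsonSlurper().parseText(new URL(BASE + "/findByName?id=" + id).text)
        json.content.replaceAll("\\s", "")
    }

    // Split the HTML into <tr> rows and strip the tags, as the original regexes do.
    static List<String> rows(String html) {
        (html =~ "<tr>.*?</tr>").collect { it.replaceAll("<.*?>", " ").trim() }
    }

    static void main(String[] args) {
        // Step 1 on the first page, then step 2 for one ID as a demo.
        rows(listPage(1)).drop(1).each { println it }   // drop(1) skips the header row
        println rows(detail(1)).join(" ")
    }
}

Since the crawl was a "failed" one, the exact request and column layout may need adjusting against the live page.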

Here are the page structures for the two requests (screenshots omitted):
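The screenshots did not survive in this copy. Judging from the regexes above, each row of the returned content field should look roughly like the made-up example below; the real column layout was only visible in the screenshots, so treat the field order as an assumption.

// Made-up row in the shape the "<tr>.*?</tr>" regex captures.
def row = "<tr><td>1</td><td>SomeDrugstore</td><td>SomeAddress</td></tr>"
def text = row.substring(4).replaceAll("<.*?>", " ")
println text   // " 1  SomeDrugstore  SomeAddress  " - every tag becomes a space

Note the leading space in the result: it is what makes it.split(" ", 3)[1] in main return the first visible column of the saved row.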

Due to limited space, I have organized the results into an Excel document; anyone who needs it can add me on WeChat to request a copy.

  • Solemn statement: this article was first published under the public account FunTester, and third parties (except Tencent Cloud) are prohibited from reproducing or publishing it.
