Using HtmlUnit in Java to grab data loaded by JavaScript

HtmlUnit is an open source Java library for fetching and analyzing web pages: once it has loaded a page, you can use its API to inspect the content of that page. The project simulates the behavior of a browser and is often described as an open source Java implementation of a browser. Because this browser has no graphical interface, it runs very fast. It uses the Rhino JavaScript engine to execute the scripts on the page.

Put simply, it is a browser written in Java that has no user interface, and because nothing has to be rendered on screen it runs quickly. HtmlUnit provides a set of APIs for tasks such as filling in forms, submitting forms, and simulating clicks on links. Thanks to the built-in Rhino engine, it can also execute JavaScript.
Fetching and parsing pages is fast, so HtmlUnit is recommended for scenarios that need to run the scripts on a web page before reading its data.

Before using this tool, add the HtmlUnit jar and the jars it depends on to your project.

Code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

// The class name is illustrative; the original listing omitted the enclosing class.
public class JsDataGrabber {

    public static String url = "http://www.XXX.cn/XXX"; // Address of the page to grab data from

    public static void main(String[] args) throws IOException {
        WebClient wc = new WebClient(BrowserVersion.FIREFOX_52);
        wc.getOptions().setJavaScriptEnabled(true); // Enable the JS interpreter (true by default)
        wc.setJavaScriptTimeout(100000); // JS execution timeout, in milliseconds
        wc.getOptions().setCssEnabled(false); // Disable CSS support
        wc.getOptions().setThrowExceptionOnScriptError(false); // Do not throw an exception when a script fails
        wc.getOptions().setTimeout(10000); // Connection timeout, here 10 s; 0 means wait indefinitely
        wc.setAjaxController(new NicelyResynchronizingAjaxController()); // Enable AJAX support
        // Wrap the connection so that every response the page triggers can be inspected
        wc.setWebConnection(new WebConnectionWrapper(wc) {
            @Override
            public WebResponse getResponse(WebRequest request) throws IOException {
                WebResponse response = super.getResponse(request);
                String data = response.getContentAsString();
                // Placeholder: replace the string with a field name that identifies the JS data to grab
                if (data.contains("{\"identifying field in the JS data\"")) {
                    System.out.println(data);
                    writeFile(data); // Write the data obtained from the JS response to a TXT file
                }
                return response;
            }
        });
        try {
            HtmlPage page = wc.getPage(url);
            System.out.println("page:" + page);
            Thread.sleep(1000); // Give background JavaScript and AJAX requests time to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Restore the interrupt flag
            e.printStackTrace();
        } finally {
            // Close the WebClient
            wc.close();
        }
    }

    /**
     * Write the data to a TXT file.
     */
    public static void writeFile(String data) {
        File writeName = new File("data.txt"); // Relative path; data.txt is created if it does not exist
        // FileWriter creates the file if needed and overwrites an existing file with the same name
        try (BufferedWriter out = new BufferedWriter(new FileWriter(writeName))) {
            out.write(data);
            out.flush(); // Push the contents of the buffer into the file
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
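
One thing to watch in the listing above: writeFile opens data.txt with a plain FileWriter, which truncates the file, so if several responses contain the marker only the last one survives. A minimal sketch of an appending variant follows, using only the JDK. The class CapturedDataWriter and its method names are hypothetical helpers made up for this illustration; they are not part of HtmlUnit.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper (not an HtmlUnit class): keeps every matching response body. */
final class CapturedDataWriter {
    private final String marker; // substring that identifies the data of interest
    private final Path target;   // file that collects all matching bodies

    CapturedDataWriter(String marker, Path target) {
        this.marker = marker;
        this.target = target;
    }

    /** Returns true when the body is non-null and contains the marker. */
    boolean matches(String body) {
        return body != null && body.contains(marker);
    }

    /**
     * Appends the body, followed by a line separator, to the target file
     * if it contains the marker. Returns true when something was written.
     */
    boolean appendIfMatches(String body) throws IOException {
        if (!matches(body)) {
            return false;
        }
        Files.write(target,
                (body + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,   // create the file on first use
                StandardOpenOption.APPEND);  // keep what earlier responses wrote
        return true;
    }
}
```

Inside getResponse, the check and the write could then probably be reduced to a single call such as writer.appendIfMatches(response.getContentAsString()), where writer is a CapturedDataWriter created once before the page is loaded.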

Tags: Java Javascript

Posted on Thu, 03 Oct 2019 17:00:07 -0400 by prashanth0626