Java crawler framework WebMagic

WebMagic's architecture is modeled on Scrapy (the Python crawler framework), while the implementation builds on mature Java tools such as HttpClient and Jsoup.

WebMagic consists of four components (Downloader, PageProcessor, Scheduler, Pipeline):

  • Downloader: downloads pages
  • PageProcessor: parses pages and extracts data
  • Scheduler: manages the crawl queue and de-duplicates URLs
  • Pipeline: stores and post-processes the extracted results

Objects that flow through WebMagic:

  • Request: wraps a URL to be crawled; it is the only way for PageProcessor to drive Downloader.
  • Page: represents the content downloaded by Downloader.
  • ResultItems: a Map-like container that carries the results extracted by PageProcessor to Pipeline.

Crawler engine Spider:

  • Spider is the core of WebMagic's internal flow. Each of the four components above is a property of Spider, and different behavior is obtained by swapping these properties.
  • Spider is also the entry point of a WebMagic run; it encapsulates crawler creation, starting, stopping, multithreading, and other functions.
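As a sketch, assembling a Spider with explicit components might look like this (MyProcessor stands for any PageProcessor implementation; the Downloader and Scheduler shown are WebMagic's defaults, so setting them is optional):

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.scheduler.QueueScheduler;

Spider.create(new MyProcessor())                   // the PageProcessor property
        .setDownloader(new HttpClientDownloader()) // the Downloader property (default)
        .setScheduler(new QueueScheduler())        // the Scheduler property (default)
        .addPipeline(new ConsolePipeline())        // a Pipeline property
        .addUrl("http://httpbin.org/get")
        .run();                                    // start the crawl on the current thread
```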

Using Maven to install WebMagic

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>


WebMagic uses slf4j-log4j12 as its slf4j implementation. If you use your own slf4j implementation, you need to exclude this dependency from the project.

<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>

If you don't use Maven, you can download the latest jar package from http://webmagic.io, unzip it, and then import the jars into your project.

Start developing the first crawler

After the WebMagic dependency has been added to the project, you can start developing your first crawler!
Here is a quick check: run the main method below and verify that it executes normally.

package com.example.demo;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class DemoPageGet implements PageProcessor {
    private Site site = Site.me();

    @Override
    public void process(Page page) {
        System.out.println(page.getHtml());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new DemoPageGet()).addUrl("http://httpbin.org/get").run();
    }
}

Write basic Crawlers

In WebMagic, to implement a basic crawler, you only need to write a class and implement the PageProcessor interface.

In this part, we introduce how to write a PageProcessor through the GithubRepoPageProcessor example, which crawls GitHub repository pages.

The customization of PageProcessor is divided into three parts: the configuration of crawler, the extraction of page elements and the discovery of links.

public class GithubRepoPageProcessor implements PageProcessor {

    // Part 1: site crawling configuration, including encoding, crawl interval, retry count, etc.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    // process is the core interface of custom crawler logic, where extraction logic is written
    public void process(Page page) {
        // Part 2: define how to extract page information and save it
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='entry-title public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            //skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));

        // Part 3: extract the URLs of subsequent pages to crawl
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/[\\w\\-]+/[\\w\\-]+)").all());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {

        Spider.create(new GithubRepoPageProcessor())
                // Start crawling from "https://github.com/code4craft"
                .addUrl("https://github.com/code4craft")
                //Enable 5 threads to grab
                .thread(5)
                //Start crawler
                .run();
    }
}

Append requested links

First, collect links through regex matching or string concatenation, for example: page.getHtml().links().regex("").all()
Then pass them to page.addTargetRequests(urls), which adds these links to the queue of URLs to be crawled.
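The link-filtering regex can be checked in isolation with plain java.util.regex; the pattern below is the one from the GitHub example:

```java
import java.util.regex.Pattern;

public class LinkRegexDemo {
    public static void main(String[] args) {
        // The same pattern that is passed to page.getHtml().links().regex(...)
        Pattern repo = Pattern.compile("(https://github\\.com/[\\w\\-]+/[\\w\\-]+)");
        // A repository URL matches and would be added to the crawl queue
        System.out.println(repo.matcher("https://github.com/code4craft/webmagic").matches()); // true
        // A bare user page lacks the owner/repo path and does not match
        System.out.println(repo.matcher("https://github.com/code4craft").matches()); // false
    }
}
```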

Crawler configuration

Spider: the entry point of the crawler. The other components of Spider (Downloader, Scheduler, Pipeline) can all be configured through its set methods.

Site: some configuration information of the site itself, such as encoding, HTTP header, timeout, retry policy, and proxy, can be configured by setting the site object.
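A fuller Site configuration might look like this (the values shown are illustrative, not recommendations):

```java
import us.codecraft.webmagic.Site;

Site site = Site.me()
        .setCharset("UTF-8")           // page encoding
        .setTimeOut(10000)             // timeout in milliseconds
        .setRetryTimes(3)              // retries on download failure
        .setSleepTime(1000)            // delay between requests in milliseconds
        .setUserAgent("Mozilla/5.0")   // HTTP User-Agent header
        .addHeader("Accept-Language", "en-US,en"); // extra HTTP header
```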

Configuring an HTTP proxy. Since version 0.7.1, WebMagic uses a new proxy API, ProxyProvider. Because ProxyProvider is positioned more as a "component" than as "configuration" of Site, the proxy is no longer set on Site but on HttpClientDownloader.
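Setting a proxy on HttpClientDownloader might be sketched like this (based on the 0.7.x API; the proxy addresses are placeholders):

```java
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

HttpClientDownloader downloader = new HttpClientDownloader();
// SimpleProxyProvider rotates through the given proxies
downloader.setProxyProvider(SimpleProxyProvider.from(
        new Proxy("127.0.0.1", 1087),
        new Proxy("127.0.0.1", 1088)));
Spider.create(new GithubRepoPageProcessor())
        .setDownloader(downloader)   // the proxy now lives on the Downloader, not on Site
        .addUrl("https://github.com/code4craft")
        .run();
```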

For more information, see the official documentation.

Extraction of page elements

WebMagic mainly uses three data extraction techniques:

  • XPath
  • regular expressions
  • CSS selectors

In addition, content in JSON format can be parsed with JsonPath.
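Inside process(Page page), the three techniques (plus JsonPath for JSON responses) look roughly like this; the selectors and field names below are hypothetical, for illustration only:

```java
// Extract with XPath
page.putField("title1", page.getHtml().xpath("//h1/text()").toString());
// Extract with a regular expression
page.putField("title2", page.getHtml().regex("<h1>(.*?)</h1>").toString());
// Extract with a CSS selector (selector, attribute)
page.putField("title3", page.getHtml().css("h1", "text").toString());
// Extract from a JSON response with JsonPath
page.putField("name", page.getJson().jsonPath("$.name").toString());
```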

Save results with Pipeline

The component WebMagic uses to save results is called Pipeline.

For example, "output the results to the console" is done by a built-in Pipeline called ConsolePipeline.

Now, suppose I want to save the results in JSON format instead. What do I do?

I only need to change the Pipeline implementation to JsonFilePipeline.

public static void main(String[] args) {
    Spider.create(new GithubRepoPageProcessor())
            // Start crawling from "https://github.com/code4craft"
            .addUrl("https://github.com/code4craft")
            .addPipeline(new JsonFilePipeline("./webmagic"))
            //Enable 5 threads to grab
            .thread(5)
            //Start crawler
            .run();
}
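If no built-in Pipeline fits, you can also implement the Pipeline interface yourself. A minimal sketch that prints the "author" field extracted by the GithubRepoPageProcessor above:

```java
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

public class PrintAuthorPipeline implements Pipeline {
    @Override
    public void process(ResultItems resultItems, Task task) {
        // ResultItems behaves like a Map keyed by the names used in putField
        String author = resultItems.get("author");
        System.out.println("author: " + author);
    }
}
```

Register it with .addPipeline(new PrintAuthorPipeline()) just like JsonFilePipeline above.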

Simulate POST request method

Since version 0.7.1, the old NameValuePair approach has been abandoned; instead, method and requestBody fields were added to the Request object.

import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.model.HttpRequestBody;
import us.codecraft.webmagic.utils.HttpConstant;

Request request = new Request("http://xxx/path");
request.setMethod(HttpConstant.Method.POST);
request.setRequestBody(HttpRequestBody.json("{'id':1}", "utf-8"));

HttpRequestBody also provides several built-in factory methods, covering the most common cases: form submission and JSON submission.
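For example, a form submission might be sketched like this (the parameter name mirrors the JSON example above and is illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.model.HttpRequestBody;
import us.codecraft.webmagic.utils.HttpConstant;

Request request = new Request("http://xxx/path");
request.setMethod(HttpConstant.Method.POST);
// Build application/x-www-form-urlencoded parameters
Map<String, Object> params = new HashMap<>();
params.put("id", "1");
request.setRequestBody(HttpRequestBody.form(params, "utf-8"));
```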


Posted on Thu, 11 Jun 2020 23:38:02 -0400 by Drabin