Using WebMagic to crawl recruitment information and store it in HBase

1. First, a look at the website we're going to crawl.

This is a typical list page + detail page scenario, which WebMagic suits very well. So what is WebMagic? It is a Java crawler framework modeled on Python's Scrapy. Its main features are:

- Completely modular design with strong extensibility.
- A simple core that still covers the whole crawling workflow; flexible, powerful, and good material for learning how crawlers work.
- Rich APIs for extracting page content.
- Zero-configuration use: a crawler can be implemented with just a POJO and annotations.
- Multi-threading support.
- Distributed crawling support.
- Support for crawling pages rendered dynamically with JavaScript.
- Few framework dependencies, so it can be embedded flexibly into projects.

For a beginner like me, it is the best introduction. See its official documentation for details.

2. Now let's look at the data we're going to crawl.

We encapsulate it in a User class:

package linyirencaiwang;

public class User {
    private String key;             // keyword
    private String name;            // user name
    private String sex;             // gender
    private String minzu;           // ethnicity
    private String location;        // location
    private String identity;        // identity / education
    private String school;          // school
    private String major;           // major
    private String work_experience; // work experience
    private String hope_position;   // desired position
    private String hope_palce;      // desired workplace
    private String hope_salary;     // desired salary
    private String work_type;       // desired job type

    public String getMinzu() {
        return minzu;
    }
    public void setMinzu(String minzu) {
        this.minzu = minzu;
    }
    public String getWork_experience() {
        return work_experience;
    }
    public void setWork_experience(String work_experience) {
        this.work_experience = work_experience;
    }
    public String getHope_position() {
        return hope_position;
    }
    public void setHope_position(String hope_position) {
        this.hope_position = hope_position;
    }
    public String getHope_palce() {
        return hope_palce;
    }
    public void setHope_palce(String hope_palce) {
        this.hope_palce = hope_palce;
    }
    public String getHope_salary() {
        return hope_salary;
    }
    public void setHope_salary(String hope_salary) {
        this.hope_salary = hope_salary;
    }
    public String getWork_type() {
        return work_type;
    }
    public void setWork_type(String work_type) {
        this.work_type = work_type;
    }
    public String getKey() {
        return key;
    }
    public void setKey(String key) {
        this.key = key;
    }
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public String getIdentity() {
        return identity;
    }
    public void setIdentity(String identity) {
        this.identity = identity;
    }
    public String getLocation() {
        return location;
    }
    public void setLocation(String location) {
        this.location = location;
    }
    public String getSex() {
        return sex;
    }
    public void setSex(String sex) {
        this.sex = sex;
    }
    public String getSchool() {
        return school;
    }
    public void setSchool(String school) {
        this.school = school;
    }
    public String getMajor() {
        return major;
    }
    public void setMajor(String major) {
        this.major = major;
    }

    @Override
    public String toString() {
        return "User [name=" + name + ", sex=" + sex + ", minzu=" + minzu + ", location="
                + location + ", identity=" + identity + ", school=" + school + ", major=" + major
                + ", work_experience=" + work_experience + ", hope_position=" + hope_position
                + ", hope_palce=" + hope_palce + ", hope_salary=" + hope_salary
                + ", work_type=" + work_type + "]";
    }
}

3. Next is the crawler (page-processing) class.

This class plugs into the WebMagic framework. You only need to write regular expressions for the list-page URL and the detail-page URL, and WebMagic does the matching: if the current URL matches the list page, the detail-page URLs it contains and the remaining list-page URLs are added to the crawl queue; otherwise it is a detail page, and its information is extracted with XPath.
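Since everything hinges on those two URL patterns, it's worth smoke-testing them with plain JDK regex before running the spider. A minimal sketch of the list/detail distinction described above (the sample URLs are invented to match the patterns, not taken from the live site):

```java
public class UrlClassifier {
    // The same patterns used by the crawler class
    static final String URL_LIST = "http://rc\\.lyrc\\.net/Companyzp\\.aspx\\?Page=[1-9]{1,3}";
    static final String URL_POST = "/Person_Lookzl/id-[0-9]{4}\\.html";

    // Decide which kind of page a URL is (String.matches requires a full match)
    static String classify(String url) {
        if (url.matches(URL_LIST)) return "list";
        if (url.matches(URL_POST)) return "detail";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify("http://rc.lyrc.net/Companyzp.aspx?Page=12")); // list
        System.out.println(classify("/Person_Lookzl/id-1234.html"));               // detail
        System.out.println(classify("http://rc.lyrc.net/index.aspx"));             // other
    }
}
```

Note one quirk of the original pattern: `[1-9]{1,3}` never matches a page number containing a `0` (such as `Page=10`), so `[0-9]` after the first digit would be safer for deep pagination.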

package linyirencaiwang;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.FilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class Test implements PageProcessor {
    private LinyirencaiDao LinyirencaiDao = new LinyircDaoImpL();
    public static final String URL_LIST = "http://rc\\.lyrc\\.net/Companyzp\\.aspx\\?Page=[1-9]{1,3}";
    public static final String URL_POST = "/Person_Lookzl/id-[0-9]{4}\\.html";

    static int size = 1;

    // Part 1: site-level crawl configuration: encoding, crawl interval, retry count, etc.
    // (the original left this assignment blank; these are typical values)
    private Site site = Site.me().setCharset("gbk").setRetryTimes(3).setSleepTime(1000);

    public void process(Page page) {
        if (page.getUrl().regex(URL_LIST).match()) {
            // Part 3: on a list page, queue the detail pages and the remaining list pages
            page.addTargetRequests(page.getHtml().links().regex(URL_POST).all());
            page.addTargetRequests(page.getHtml().css("div#paging").links().regex("/Companyzp\\.aspx\\?Page=").all());
        } else {
            // Part 2: on a detail page, extract the information and save it
            System.out.println("Record No. " + size++);
            User user = new User();
            String key = "0"; // keyword
            String name = page.getHtml().xpath("//*[@width='61%']/table/tbody/tr[1]/td[2]/text()").get();     // user name
            String sex = page.getHtml().xpath("//*[@width='61%']/table/tbody/tr[1]/td[4]/text()").get();      // gender
            String minzu = page.getHtml().xpath("//*[@width='61%']/table/tbody/tr[2]/td[4]/text()").get();    // ethnicity
            String location = page.getHtml().xpath("//*[@width='61%']/table/tbody/tr[3]/td[4]/text()").get(); // location
            String identity = page.getHtml().xpath("//td[@width='283']/text()").get();                        // identity / education
            String school = page.getHtml().xpath("//td[@width='201']/text()").get();                          // school
            String major = page.getHtml().xpath("//*[@width='90%']/tbody/tr[2]/td[4]/text()").get();          // major
            String work_experience = page.getHtml().xpath("//td[@width='773']/table/tbody/tr/td/table[6]/tbody/tr[2]/td[2]/text()").get(); // work experience
            String hope_position = page.getHtml().xpath("//td[@width='773']/table/tbody/tr/td/table[8]/tbody/tr[5]/td[2]/text()").get();   // desired position
            String hope_palce = page.getHtml().xpath("//td[@width='773']/table/tbody/tr/td/table[8]/tbody/tr[4]/td[2]/text()").get();      // desired workplace
            String hope_salary = page.getHtml().xpath("//td[@width='773']/table/tbody/tr/td/table[8]/tbody/tr[2]/td[2]/text()").get();     // desired salary
            String work_type = page.getHtml().xpath("//td[@width='773']/table/tbody/tr/td/table[8]/tbody/tr[1]/td[2]/text()").get();       // desired job type

            user.setKey(key);
            user.setName(name);
            user.setSex(sex);
            user.setMinzu(minzu);
            user.setLocation(location);
            user.setIdentity(identity);
            user.setSchool(school);
            user.setMajor(major);
            user.setWork_experience(work_experience);
            user.setHope_position(hope_position);
            user.setHope_palce(hope_palce);
            user.setHope_salary(hope_salary);
            user.setWork_type(work_type);
            LinyirencaiDao.saveUser(user);
        }
    }

    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        long startTime, endTime;
        startTime = System.currentTimeMillis();
        System.out.println("[Please wait patiently, a big wave of data is on its way to your bowl...]");
        Spider.create(new Test())
                .addUrl("") // start URL left blank in the original
                //.addPipeline(new FilePipeline("D:\\webmagic\\"))
                //.addPipeline(new ConsolePipeline())
                .run();
        endTime = System.currentTimeMillis();
        System.out.println("[Crawler finished] Crawled " + size + " records in "
                + ((endTime - startTime) / 1000) + " seconds and saved them to the database. Please check!");
    }
}
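XPath expressions as position-dependent as the ones above are fragile, so it helps to smoke-test them against a small saved fragment before pointing them at the live site. A JDK-only sketch (WebMagic uses its own HTML parser, but `javax.xml.xpath` is close enough for a sanity check; the markup and values below are invented for illustration):

```java
import java.io.ByteArrayInputStream;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;

public class XPathSmokeTest {
    // Evaluate an XPath expression against a well-formed XML/XHTML string
    static String eval(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    // Invented fragment shaped like the resume page's name/gender row
    static final String HTML = "<div width=\"61%\"><table><tbody>"
            + "<tr><td>Name</td><td>Zhang San</td><td>Gender</td><td>Male</td></tr>"
            + "</tbody></table></div>";

    public static void main(String[] args) throws Exception {
        // Same shape as the crawler's name and sex expressions
        System.out.println(eval(HTML, "//*[@width='61%']/table/tbody/tr[1]/td[2]/text()")); // Zhang San
        System.out.println(eval(HTML, "//*[@width='61%']/table/tbody/tr[1]/td[4]/text()")); // Male
    }
}
```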



4. The Dao and DaoImpl classes.

Create the table and design the column family in the database. Since this is just a small demo, I simply create a person2 table whose only column family is info. After some experimenting, crawling every field took too long, so to keep the run a bit shorter only the name is inserted for now.
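HBase's logical model is essentially a sorted, nested map: rowkey → column family → qualifier → value. The person2 layout above can be pictured with plain JDK maps (the candidate name is hypothetical):

```java
import java.util.Map;
import java.util.TreeMap;

public class SchemaSketch {
    // Build the logical shape of one row of table person2
    static Map<String, Map<String, Map<String, String>>> build() {
        Map<String, Map<String, Map<String, String>>> person2 = new TreeMap<>();
        person2.computeIfAbsent("Zhang San", k -> new TreeMap<>()) // rowkey: the candidate's name
               .computeIfAbsent("info", k -> new TreeMap<>())      // the single column family
               .put("name", "Zhang San");                          // qualifier "name" -> value
        return person2;
    }

    public static void main(String[] args) {
        System.out.println(build()); // {Zhang San={info={name=Zhang San}}}
    }
}
```

Note one design consequence: because the rowkey is the candidate's name, two candidates with the same name would overwrite each other's row; a real deployment would fold something unique (such as the detail page's id) into the rowkey.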

package linyirencaiwang;

public interface LinyirencaiDao {

    public void saveUser(User user);
}

package linyirencaiwang;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LinyircDaoImpL implements LinyirencaiDao {

    public void saveUser(User user) {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", ",,"); // ZooKeeper hosts left blank in the original

        try {
            HBHelper helper = HBHelper.getHelper(conf);
            helper.createTable("person2", "info");
            // helper.dropTable("person");
            // helper.createTable("person", "info");

            // Only the name is inserted for now, to keep the crawl short
            helper.insterRow("person2", user.getName(), "info", "name", user.getName());
            // helper.insterRow("person", user.getName(), "info", "sex", user.getSex());
            // helper.insterRow("person", user.getName(), "info", "minzu", user.getMinzu());
            // helper.getData("person", "row1", "info");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}


5. The tool class is as follows.

package linyirencaiwang;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBHelper {
    private static Connection connection = null;
    private static Admin admin = null;
    private static HBHelper helper = null;

    public HBHelper(Configuration conf) {
        try {
            connection = HBHelper.getConnInstance(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Singleton accessor, as used by LinyircDaoImpL
    public static synchronized HBHelper getHelper(Configuration conf) {
        if (helper == null) {
            helper = new HBHelper(conf);
        }
        return helper;
    }

    private static synchronized Connection getConnInstance(Configuration conf) throws IOException {
        if (connection == null) {
            try {
                connection = ConnectionFactory.createConnection(conf);
                admin = connection.getAdmin();
            } catch (IOException e) {
                e.printStackTrace();
            }
            System.out.println("Connected to the database successfully");
        }
        return connection;
    }

    public void close() {
        try {
            if (null != admin)
                admin.close();
            if (null != connection)
                connection.close();
            System.out.println("Closed the database successfully");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void createTable(String table, String... colfams)
            throws IOException {
        createTable(table, null, colfams);
    }

    public void createTable(String table, byte[][] splitKeys, String... colfams)
            throws IOException {
        if (existsTable(table)) {
            return; // skip creation if the table already exists
        }
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf(table));
        for (String cf : colfams) {
            HColumnDescriptor coldef = new HColumnDescriptor(cf);
            desc.addFamily(coldef);
        }
        if (splitKeys != null) {
            admin.createTable(desc, splitKeys);
        } else {
            admin.createTable(desc);
        }
    }

    /**
     * @Title disableTable
     * @Description Disable a table
     * @param table
     * @throws IOException
     */
    public void disableTable(String table) throws IOException {
        admin.disableTable(TableName.valueOf(table));
    }

    public boolean existsTable(String table)
            throws IOException {
        return admin.tableExists(TableName.valueOf(table));
    }

    /**
     * @Title dropTable
     * @Description Drop a table (disable first, then delete)
     * @param table
     * @throws IOException
     */
    public void dropTable(String table) throws IOException {
        if (existsTable(table)) {
            disableTable(table);
            admin.deleteTable(TableName.valueOf(table));
        }
    }

    public void insterRow(String tableName, String rowkey, String colFamily, String col, String val) throws IOException {
        Table table = connection.getTable(TableName.valueOf(tableName));
        Put put = new Put(Bytes.toBytes(rowkey));
        put.addColumn(Bytes.toBytes(colFamily), Bytes.toBytes(col), Bytes.toBytes(val));
        table.put(put);

        // Batch insertion
        /* List<Put> putList = new ArrayList<Put>();
           putList.add(put);
           table.put(putList); */
        table.close();
    }

    // Formatted output
    public void showCell(Result result) {
        Cell[] cells = result.rawCells();

        for (Cell cell : cells) {
            System.out.println("RowName:" + new String(CellUtil.cloneRow(cell)) + ", "
                    + "Timestamp:" + cell.getTimestamp() + ", "
                    + "ColumnFamily:" + new String(CellUtil.cloneFamily(cell)) + ", "
                    + "ColumnQualifier:" + new String(CellUtil.cloneQualifier(cell)) + ", "
                    + "Value:" + new String(CellUtil.cloneValue(cell)));
        }
    }

    // Scan data in bulk
    public void scanData(String tableName/*, String startRow, String stopRow*/) throws IOException {
        Table table = connection.getTable(TableName.valueOf(tableName));
        Scan scan = new Scan();
        ResultScanner resultScanner = table.getScanner(scan);
        for (Result result : resultScanner) {
            showCell(result);
        }
        table.close();
    }
}

6. The overall framework is shown in the figure.

Tags: Hadoop Apache HBase Java

Posted on Mon, 01 Apr 2019 16:18:30 -0400 by merebel