1. First, take a look at the website we're going to crawl: http://rc.lyrc.net/Companyzp.aspx?Page=1
This is a typical list-page + detail-page scenario, and WebMagic is very well suited to it. So what is WebMagic? It is a Java crawler framework modeled on Python's Scrapy. Its main features are:
Completely modular design with strong extensibility.
A simple core that nevertheless covers the whole crawling workflow; flexible and powerful, and good material for learning how crawlers work.
Provides a rich API for extracting data from pages.
Requires no configuration files; a crawler can be written as a POJO plus annotations (see the sketch after this list).
Supports multi-threading.
Supports distributed crawling.
Supports crawling JS-rendered (dynamic) pages.
Has almost no framework dependencies, so it can be flexibly embedded into existing projects.
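As a taste of that annotation mode, here is a minimal sketch I put together. ResumeItem, its XPath, and the URL patterns are my own placeholders, not code from this article; check the official docs for the exact annotation API:

package linyirencaiwang;

import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;
import us.codecraft.webmagic.pipeline.ConsolePageModelPipeline;

// Pages matching @TargetUrl are mapped onto ResumeItem objects;
// pages matching @HelpUrl are only scanned for more links.
@TargetUrl("http://rc.lyrc.net/Person_Lookzl/id-\\d+.html")
@HelpUrl("http://rc.lyrc.net/Companyzp.aspx?Page=\\d+")
public class ResumeItem {

    // Hypothetical XPath; a real one would come from inspecting the detail page.
    @ExtractBy("//*[@width='61%']/table/tbody/tr[1]/td[2]/text()")
    private String name;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000),
                        new ConsolePageModelPipeline(), ResumeItem.class)
                .addUrl("http://rc.lyrc.net/Companyzp.aspx?Page=1")
                .thread(5)
                .run();
    }
}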
For a complete beginner like me, it is the best way to get started.
The official documentation is at http://webmagic.io/docs/zh/
2. Now let's look at the data we're going to crawl.
We encapsulate it in a User class:
package linyirencaiwang;

public class User {
    private String key;             // keyword
    private String name;            // user name
    private String sex;             // gender
    private String minzu;           // ethnicity
    private String location;        // location
    private String identity;        // identity / education
    private String school;          // school
    private String major;           // major
    private String work_experience; // work experience
    private String hope_position;   // desired position
    private String hope_palce;      // desired workplace ("palce" kept as spelled in the original source)
    private String hope_salary;     // desired salary
    private String work_type;       // desired type of work

    public String getKey() { return key; }
    public void setKey(String key) { this.key = key; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getSex() { return sex; }
    public void setSex(String sex) { this.sex = sex; }
    public String getMinzu() { return minzu; }
    public void setMinzu(String minzu) { this.minzu = minzu; }
    public String getLocation() { return location; }
    public void setLocation(String location) { this.location = location; }
    public String getIdentity() { return identity; }
    public void setIdentity(String identity) { this.identity = identity; }
    public String getSchool() { return school; }
    public void setSchool(String school) { this.school = school; }
    public String getMajor() { return major; }
    public void setMajor(String major) { this.major = major; }
    public String getWork_experience() { return work_experience; }
    public void setWork_experience(String work_experience) { this.work_experience = work_experience; }
    public String getHope_position() { return hope_position; }
    public void setHope_position(String hope_position) { this.hope_position = hope_position; }
    public String getHope_palce() { return hope_palce; }
    public void setHope_palce(String hope_palce) { this.hope_palce = hope_palce; }
    public String getHope_salary() { return hope_salary; }
    public void setHope_salary(String hope_salary) { this.hope_salary = hope_salary; }
    public String getWork_type() { return work_type; }
    public void setWork_type(String work_type) { this.work_type = work_type; }

    @Override
    public String toString() {
        return "User [name=" + name + ", sex=" + sex + ", minzu=" + minzu
                + ", location=" + location + ", identity=" + identity
                + ", school=" + school + ", major=" + major
                + ", work_experience=" + work_experience
                + ", hope_position=" + hope_position
                + ", hope_palce=" + hope_palce
                + ", hope_salary=" + hope_salary
                + ", work_type=" + work_type + "]";
    }
}
3. Next comes the crawling (page-processing) class.
This class plugs into the WebMagic framework. You only need to supply regular expressions for the list-page URL and the detail-page URL; WebMagic does the matching. In process(), if the current URL matches the list-page pattern, the detail-page links and the remaining list-page links found on it are added to the crawl queue; otherwise the page is a detail page, and its fields are extracted with XPath.
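Before the full class, here is a tiny standalone check with plain java.util.regex (the sample URLs are made up) showing what those two patterns do and do not match:

package linyirencaiwang;

import java.util.regex.Pattern;

public class UrlPatternCheck {
    public static void main(String[] args) {
        // Same patterns as URL_LIST and URL_POST in the class below.
        Pattern list = Pattern.compile("http://rc\\.lyrc\\.net/Companyzp\\.aspx\\?Page=[1-9]");
        Pattern post = Pattern.compile("/Person_Lookzl/id-[0-9]\\.html");

        System.out.println(list.matcher("http://rc.lyrc.net/Companyzp.aspx?Page=3").find());     // true
        System.out.println(post.matcher("http://rc.lyrc.net/Person_Lookzl/id-7.html").find());   // true
        // Note: [0-9] matches a single digit, so multi-digit ids such as id-42.html do NOT match.
        System.out.println(post.matcher("/Person_Lookzl/id-42.html").find());                    // false
    }
}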
package linyirencaiwang;

import java.util.List;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.FilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

public class Test implements PageProcessor {

    private LinyirencaiDao linyirencaiDao = new LinyircDaoImpL();

    public static final String URL_LIST = "http://rc\\.lyrc\\.net/Companyzp\\.aspx\\?Page=[1-9]";
    public static final String URL_POST = "/Person_Lookzl/id-[0-9]\\.html";

    // Part 1: site-level crawler configuration: encoding, crawl interval, retry count, etc.
    static int size = 1;
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Part 2: define how to extract and save page information.
        List<String> urls = page.getHtml().css("div#paging").links()
                .regex("/Companyzp\\.aspx\\?Page=").all();
        if (page.getUrl().regex(URL_LIST).match()) {
            // List page: queue the detail-page links and the remaining list-page links.
            page.addTargetRequests(page.getHtml().links().regex(URL_POST).all());
            page.addTargetRequests(page.getHtml().links().regex(URL_LIST).all());
            page.addTargetRequests(urls);
        } else {
            // Detail page: extract the candidate's information with XPath.
            System.out.println("Record No. " + size);
            size++;
            User user = new User();
            String key = "0"; // keyword
            String name = page.getHtml().xpath("//*[@width='61%']/table/tbody/tr[1]/td[2]/text()").get();     // user name
            String sex = page.getHtml().xpath("//*[@width='61%']/table/tbody/tr[1]/td[4]/text()").get();      // gender
            String minzu = page.getHtml().xpath("//*[@width='61%']/table/tbody/tr[2]/td[4]/text()").get();    // ethnicity
            String location = page.getHtml().xpath("//*[@width='61%']/table/tbody/tr[3]/td[4]/text()").get(); // location
            String identity = page.getHtml().xpath("//*td[@width='283']/text()").get();                       // identity / education
            String school = page.getHtml().xpath("//*td[@width='201']/text()").get();                         // school
            String major = page.getHtml().xpath("//*[@width='90%']/tbody/tr[2]/td[4]/text()").get();          // major
            String work_experience = page.getHtml().xpath("//td[@width='773']/table/tbody/tr/td/table[6]/tbody/tr[2]/td[2]/text()").get(); // work experience
            String hope_position = page.getHtml().xpath("//td[@width='773']/table/tbody/tr/td/table[8]/tbody/tr[5]/td[2]/text()").get();   // desired position
            String hope_palce = page.getHtml().xpath("//td[@width='773']/table/tbody/tr/td/table[8]/tbody/tr[4]/td[2]/text()").get();      // desired workplace
            String hope_salary = page.getHtml().xpath("//td[@width='773']/table/tbody/tr/td/table[8]/tbody/tr[2]/td[2]/text()").get();     // desired salary
            String work_type = page.getHtml().xpath("//td[@width='773']/table/tbody/tr/td/table[8]/tbody/tr[1]/td[2]/text()").get();       // desired type of work

            user.setHope_palce(hope_palce);
            user.setHope_position(hope_position);
            user.setHope_salary(hope_salary);
            user.setIdentity(identity);
            user.setKey(key);
            user.setLocation(location);
            user.setMajor(major);
            user.setMinzu(minzu);
            user.setName(name);
            user.setSchool(school);
            user.setSex(sex);
            user.setWork_experience(work_experience);
            user.setWork_type(work_type);

            System.out.println(user.toString());
            System.out.println();
            linyirencaiDao.saveUser(user);
        }
        // Part 3: finding the next URLs to crawl is handled in the list-page branch above.
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        long startTime, endTime;
        startTime = System.currentTimeMillis();
        System.out.println("Please wait patiently for a big wave of data to come to your bowl....");
        Spider.create(new Test())
                .addUrl("http://rc.lyrc.net/Companyzp.aspx?Page=1")
                //.addPipeline(new FilePipeline("D:\\webmagic\\"))
                //.addPipeline(new ConsolePipeline())
                .thread(5)
                .run();
        endTime = System.currentTimeMillis();
        System.out.println("[Crawler finished] Grabbed " + size + " records in " + (endTime - startTime) + " ms");
    }
}
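The article never shows LinyirencaiDao or LinyircDaoImpL, which the class above depends on. Here is a minimal assumed version, just enough to compile and run; the real implementation presumably persists to a database:

package linyirencaiwang;

public interface LinyirencaiDao {
    void saveUser(User user);
}

// Hypothetical stub implementation: just logs the user.
// Swap in real persistence (e.g. JDBC or MyBatis) as needed.
package linyirencaiwang;

public class LinyircDaoImpL implements LinyirencaiDao {
    @Override
    public void saveUser(User user) {
        System.out.println("Saved: " + user.getName());
    }
}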