Hands-on instructions for building a distributed Java-based crawler system

Personal blog: http://www.x0100.top

1 Overview

Without using a crawler framework, we set out to implement a distributed crawler system that can save data to different stores, such as MySQL and HBase.

This system is developed around interface-oriented coding, which gives it some extensibility. Interested readers can see the design directly in the code. Although the code is still tightly coupled in many places, much of it could be extracted and made configurable with some time and effort.

Due to time constraints, I have only written crawlers for the Jingdong and Suning websites. However, random scheduling of crawlers across different websites is fully supported. Given the code structure, it would not be difficult to add crawlers for Gome, Tmall, and other sites, though it would take considerable time and effort, because parsing the data on each site is genuinely time-consuming. For example, when I crawled the prices of Suning products, the prices were fetched asynchronously and the API was a long string of numbers; it took me several hours to discover its pattern, though I also admit my experience is limited.

In addition to basic data crawling, the design of this system pays more attention to the following aspects:

  • 1. How to achieve distribution: the same program, once packaged and distributed to different nodes, runs without affecting overall data crawling
  • 2. How to achieve random round-robin URL scheduling; the core is randomizing across different top-level domains
  • 3. How to periodically add seed URLs to the URL repository, so the crawling system never stops
  • 4. How to monitor the crawler node processes and send mail alerts
  • 5. How to implement a random IP proxy library, similar to point 2, for anti-crawling purposes

This is a general introduction to the system. The code itself carries very detailed comments, so interested readers can refer to it. Finally, I will present some analysis of the crawled data.

Also note that this crawler system is implemented in Java, but the language itself is not what matters most; interested readers can try Python.

2 Distributed Crawler System Architecture

The overall system architecture is as follows:

So from the above architecture, we can see that the whole system is mainly divided into three parts:

  • Crawler System
  • URL Scheduling System
  • Monitoring and Alarm System

The crawler system crawls the data. Because it is designed to be distributed, the crawler itself can run on different server nodes.

The URL repository is the core of the URL scheduling system. The so-called URL repository simply uses Redis to store the list of URLs to be crawled, and the URL scheduler consumes them according to certain policies. From this point of view, the URL repository is also a URL queue.

The monitoring and alarm system mainly monitors the crawler nodes. Although losing one of the parallel crawler nodes has no effect on overall data crawling (it only slows the crawler down), we still want to be notified actively when a node goes down, rather than discovering it passively.

The following sections introduce the design of these three parts, combined with some code snippets. Readers interested in the complete implementation can refer to the source code directly.

3 Crawler System

(Note: zookeeper monitoring belongs to the monitoring and alarm system, and URL dispatcher belongs to the URL dispatch system)

The crawler system is a stand-alone process. We package the crawler system into a jar and distribute it to different nodes for execution, so that data can be crawled in parallel and the crawler's efficiency improved.

3.1 Random IP Proxy

Random IP proxies are added mainly to counter anti-crawling measures; if you have an IP proxy library and can use a different proxy at random when building each http client, it helps greatly against anti-crawling.

To use the IP proxy Library in the system, you need to add the available proxy address information to the text file first:

# IPProxyRepository.txt
58.60.255.104:8118
219.135.164.245:3128
27.44.171.27:9999
219.135.164.245:3128
58.60.255.104:8118
58.252.6.165:9000
......

It is important to note that the above proxy IPs are some I got from the Spur Agent site; they are not necessarily usable. It is recommended that you buy a batch of proxy IPs at your own expense, which saves a lot of the time and effort of hunting for working proxies.

Then, in the utility class that builds the http client, when the class is first used, these proxy IPs are loaded into memory, into a Java HashMap:

// IP proxy repository map: host -> port
private static Map<String, Integer> IPProxyRepository = new HashMap<>();
private static String[] keysArray = null;   // keysArray makes it easy to pick a random proxy

/**
 * Load the IP proxy library into the map the first time the class is used, via a static initializer block
 */
static {
    InputStream in = HttpUtil.class.getClassLoader().getResourceAsStream("IPProxyRepository.txt");  // Load the text file containing the proxy IPs
    // Build a buffered reader over the stream; try-with-resources closes it when done
    try (BufferedReader bfr = new BufferedReader(new InputStreamReader(in))) {
        // Read each line and add it to the map
        String line;
        while ((line = bfr.readLine()) != null) {
            String[] split = line.split(":");   // ":" is the delimiter; the data format in the file should be 192.168.1.1:4893
            String host = split[0];
            int port = Integer.valueOf(split[1]);
            IPProxyRepository.put(host, port);
        }
        Set<String> keys = IPProxyRepository.keySet();
        keysArray = keys.toArray(new String[keys.size()]);  // keysArray makes it easy to pick a random proxy
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Then, each time an http client is built, the map is checked first: if a proxy IP is available, it is used; otherwise, no proxy is set:

CloseableHttpClient httpClient = null;
HttpHost proxy = null;
if (IPProxyRepository.size() > 0) {  // Set proxy if ip proxy address library is not empty
    proxy = getRandomProxy();
    httpClient = HttpClients.custom().setProxy(proxy).build();  // Create httpclient object
} else {
    httpClient = HttpClients.custom().build();  // Create httpclient object
}
HttpGet request = new HttpGet(url); // Build http get request
......

Random proxy objects are generated by the following methods:

/**
 * Return a random proxy object
 *
 * @return a randomly chosen HttpHost from the proxy repository
 */
public static HttpHost getRandomProxy() {
    // Randomly get host:port and build proxy object
    Random random = new Random();
    String host = keysArray[random.nextInt(keysArray.length)];
    int port = IPProxyRepository.get(host);
    HttpHost proxy = new HttpHost(host, port);  // Set http proxy
    return proxy;
}

With the design above, the random IP proxy function is basically implemented. Of course, there is much that could be improved: for example, when a request fails with a given proxy, this could be recorded, and after a certain number of failures the proxy could be deleted from the library, with a log generated for developers or maintainers to refer to. This would be possible, but I will not go that far here.
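As a rough illustration of that improvement idea, here is a minimal sketch of per-proxy failure tracking. The class name ProxyHealthTracker and the threshold of 3 failures are my own assumptions, not part of the original system:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: count failed requests per proxy ("host:port")
 * and signal when a proxy should be evicted from the repository.
 */
public class ProxyHealthTracker {
    private static final int MAX_FAILURES = 3;  // Assumed eviction threshold
    private final Map<String, Integer> failureCounts = new ConcurrentHashMap<>();

    /**
     * Record one failed request through the given proxy.
     * @return true if the proxy has now failed too often and should be removed
     */
    public boolean recordFailure(String proxyKey) {
        int failures = failureCounts.merge(proxyKey, 1, Integer::sum);
        return failures >= MAX_FAILURES;
    }

    /** Reset the failure counter after a successful request */
    public void recordSuccess(String proxyKey) {
        failureCounts.remove(proxyKey);
    }
}
```

A caller would invoke recordFailure after a timeout or connection error, and remove the proxy from the HashMap (and keysArray) once it returns true.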

3.2 Web Downloader

The web page downloader downloads the data of web pages, and is developed mainly against the following interface:

/**
 * Web Data Download
 */
public interface IDownload {
    /**
     * Download web page data for a given url
     * @param url
     * @return
     */
    public Page download(String url);
}

Based on this interface, only an HTTP GET downloader is implemented in the system, but it is enough for our needs:

/**
 * Data Download Implementation Class
 */
public class HttpGetDownloadImpl implements IDownload {

    @Override
    public Page download(String url) {
        Page page = new Page();
        String content = HttpUtil.getHttpContent(url);  // Get Web Page Data
        page.setUrl(url);
        page.setContent(content);
        return page;
    }
}

3.3 Page Parser

The web page parser parses the data we are interested in from the downloaded pages and saves it into an object, for further processing by the data store, which persists it into different repositories. It is developed based on the following interface:

/**
 * Web Page Data Parsing
 */
public interface IParser {
    public void parser(Page page);
}

The web page parser is also a very important component of the system. Its logic is not complex, but there is a lot of code: the parser may differ for different products on different sites, so it must be developed per site, since the page templates used by Jingdong are obviously different from Suning's, and Tmall's are certainly different from Jingdong's. So this is developed entirely according to one's own needs; duplicate code discovered during parser development can then be abstracted into a utility class.

Currently, the data of mobile phone commodities in Jingdong and Suning are crawled in the system, so these two implementation classes are written:

/**
 * Parser implementation for Jingdong product pages
 */
public class JDHtmlParserImpl implements IParser {
    ......
}

/**
 * Parser implementation for Suning product pages
 */
public class SNHtmlParserImpl implements IParser {
    ......
}

 

3.4 Data Storage

The data store saves the data objects produced by the web page parser to different destinations. For the mobile phone products crawled here, the data object is the following Page object:

/**
 * Web page object, containing both page content and commodity data
 */
public class Page {
    private String content;             // Page content

    private String id;                  // Commodity id
    private String source;              // Source of the commodity
    private String brand;               // Commodity brand
    private String title;               // Commodity title
    private Float price;                // Commodity price (boxed so it can be null when parsing fails)
    private Integer commentCount;       // Number of comments (boxed for the same reason)
    private String url;                 // Commodity address
    private String imgUrl;              // Commodity picture address
    private String params;              // Commodity specification parameters

    private List<String> urls = new ArrayList<>();  // Container for commodity URLs parsed from list pages
}

Correspondingly, in MySQL, the table data structure is as follows:

-- ----------------------------
-- Table structure for phone
-- ----------------------------
DROP TABLE IF EXISTS `phone`;
CREATE TABLE `phone` (
  `id` varchar(30) CHARACTER SET armscii8 NOT NULL COMMENT 'commodity id',
  `source` varchar(30) NOT NULL COMMENT 'Sources of goods, such as jd suning gome etc.',
  `brand` varchar(30) DEFAULT NULL COMMENT 'Mobile phone brand',
  `title` varchar(255) DEFAULT NULL COMMENT 'Mobile Title on Commodity Page',
  `price` float(10,2) DEFAULT NULL COMMENT 'Mobile Price',
  `comment_count` varchar(30) DEFAULT NULL COMMENT 'Mobile Comment',
  `url` varchar(500) DEFAULT NULL COMMENT 'Mobile Detail Address',
  `img_url` varchar(500) DEFAULT NULL COMMENT 'Picture Address',
  `params` text COMMENT 'Mobile parameters, json Format Storage',
  PRIMARY KEY (`id`,`source`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The table structure in HBase is as follows:

## cf1 storage id source price comment brand url
## cf2 stores title params imgUrl
create 'phone', 'cf1', 'cf2'

## View created tables in the HBase shell
hbase(main):135:0> desc 'phone'
Table phone is ENABLED                                                                                                
phone                                                                                                                 
COLUMN FAMILIES DESCRIPTION                                                                                           
{NAME => 'cf1', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK
_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => 
'65536', REPLICATION_SCOPE => '0'}                                                                                    
{NAME => 'cf2', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK
_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => 
'65536', REPLICATION_SCOPE => '0'}                                                                                    
2 row(s) in 0.0350 seconds

That is, two column families are created in HBase, cf1 and cf2, where cf1 holds the id, source, price, comment, brand, and url fields, and cf2 holds the title, params, and imgUrl fields.

Different data stores use different implementation classes, but they are all developed based on the same interface as the following:

/**
 * Storage of commodity data
 */
public interface IStore {
    public void store(Page page);
}

There are then a storage implementation class for MySQL, one for HBase, and one that outputs to the console. MySQL's, for example, is simply a data insertion statement:

/**
 * Write data to mysql table using dbc database connection pool
 */
public class MySQLStoreImpl implements IStore {
    private QueryRunner queryRunner = new QueryRunner(DBCPUtil.getDataSource());

    @Override
    public void store(Page page) {
        String sql = "insert into phone(id, source, brand, title, price, comment_count, url, img_url, params) values(?, ?, ?, ?, ?, ?, ?, ?, ?)";
        try {
            queryRunner.update(sql, page.getId(),
                    page.getSource(),
                    page.getBrand(),
                    page.getTitle(),
                    page.getPrice(),
                    page.getCommentCount(),
                    page.getUrl(),
                    page.getImgUrl(),
                    page.getParams());
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}

The storage implementation class for HBase is the common insert statement code for the HBase Java API:

......
// cf1:price
Put pricePut = new Put(rowKey);
// You must make a null judgment or you will get a null pointer exception
pricePut.addColumn(cf1, "price".getBytes(), page.getPrice() != null ? String.valueOf(page.getPrice()).getBytes() : "".getBytes());
puts.add(pricePut);
// cf1:comment
Put commentPut = new Put(rowKey);
commentPut.addColumn(cf1, "comment".getBytes(), page.getCommentCount() != null ? String.valueOf(page.getCommentCount()).getBytes() : "".getBytes());
puts.add(commentPut);
// cf1:brand
Put brandPut = new Put(rowKey);
brandPut.addColumn(cf1, "brand".getBytes(), page.getBrand() != null ? page.getBrand().getBytes() : "".getBytes());
puts.add(brandPut);
......

Of course, you can manually choose where you want to store your data when initializing the crawler:

// 3. Inject the store implementation
iSpider.setStore(new HBaseStoreImpl());

At present, the code does not support storing to multiple places at the same time, although with the current architecture this would be fairly easy to achieve; only the corresponding code needs modifying. In fact, you could also save the data to MySQL first and then import it into HBase via Sqoop; see my Sqoop article for details.
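If you did want to store to multiple places at once under the current architecture, one natural approach is a composite store that itself implements IStore and fans each page out to several delegates. This is only a sketch: CompositeStoreImpl is a hypothetical name, and the minimal IStore/Page stubs below stand in for the project's real classes so the example is self-contained:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-ins for the project's IStore and Page; the real system
// would use the actual interfaces shown above instead
interface IStore {
    void store(Page page);
}

class Page {
    private String id;
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}

/**
 * Hypothetical composite store: implements IStore itself and
 * forwards each parsed page to every configured delegate store
 */
public class CompositeStoreImpl implements IStore {
    private final List<IStore> stores = new ArrayList<>();

    public CompositeStoreImpl addStore(IStore store) {
        stores.add(store);
        return this;    // Return this so stores can be chained fluently
    }

    @Override
    public void store(Page page) {
        for (IStore store : stores) {
            store.store(page);  // Each backend receives the same page object
        }
    }
}
```

With something like this, the injection step might hypothetically become `iSpider.setStore(new CompositeStoreImpl().addStore(new MySQLStoreImpl()).addStore(new HBaseStoreImpl()));`.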

Note that if you are sure you want to save your data to HBase, make sure you have a working cluster environment and that the following configuration files are on the classpath:

core-site.xml
hbase-site.xml
hdfs-site.xml

Readers interested in this can experiment with it. If you have not used HBase before, simply use MySQL storage by injecting the MySQL store when initializing the crawler:

// 3. Inject the store implementation
iSpider.setStore(new MySQLStoreImpl());

4 URL Scheduling System

The URL scheduling system is the bridge and key to making the crawler system distributed. It is through the URL scheduling system that the whole crawler system can obtain URLs randomly and efficiently (with Redis as storage), which is what makes the system distributed.

4.1 URL Repository

From the architecture diagram, we can see that the so-called URL repository is just a Redis repository: we use Redis to save the list of URL addresses in our system. This is what makes our programs distributed: as long as the URLs are saved uniquely, no matter how many crawlers we run, the saved data will be unique and not duplicated. This is how distribution is achieved.

At the same time, URL addresses in the URL repository are consumed through a queue, as will become clear in the URL scheduler section.

In addition, in our url repository, we mainly store the following data:

  • Seed URL List

Stored in Redis as a set, so duplicate seed URLs are not added.

Seed URLs are persisted. After a certain interval, the URL timer retrieves URLs from the seed list and injects them into the high-priority URL queue that our crawlers consume, so the crawlers keep crawling data without the program having to be restarted.

  • High priority URL queue

Stored in Redis as a list, consumed as a queue.

What is a high-priority URL queue? It is actually used to save list URLs.

So what is a list url?

To put it plainly, a list page contains more than one commodity. Taking Jingdong as an example, we open a list page of mobile phones:

This address contains not a specific commodity URL but a list of the data we need to crawl (mobile phone products). By parsing each such high-priority URL, we obtain many specific commodity URLs. The specific commodity URL is the low-priority URL, which is saved in the low-priority URL queue.
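As a minimal sketch of that step (extracting specific commodity URLs from a downloaded list page), a regular expression over the page HTML is enough to illustrate the idea. The pattern below assumes Jingdong item links of the form https://item.jd.com/<digits>.html; a real parser would likely use a proper HTML parser instead:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical sketch: pull commodity URLs out of list-page HTML.
 * The class name and the URL pattern are assumptions of mine.
 */
public class ListPageUrlExtractor {
    private static final Pattern ITEM_URL =
            Pattern.compile("https://item\\.jd\\.com/\\d+\\.html");

    public static List<String> extractItemUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = ITEM_URL.matcher(html);
        while (m.find()) {
            String url = m.group();
            if (!urls.contains(url)) {  // De-duplicate while keeping order
                urls.add(url);
            }
        }
        return urls;
    }
}
```

Each extracted URL would then be handed to offerLower, so it lands in the low-priority queue.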

In this case, the data stored in this system is similar to the following:

jd.com.higher
    --https://list.jd.com/list.html?cat=9987,653,655&page=1
    ... 
suning.com.higher
    --https://list.suning.com/0-20006-0.html
    ...
  • Low Priority URL Queue

Stored in Redis as a list, consumed as a queue.

A low priority URL is actually the URL of a specific product, such as the following mobile phone product:

By downloading the url's data and parsing it, we can get the data we want.

In this case, the data stored in this system is similar to the following:

jd.com.lower
    --https://item.jd.com/23545806622.html
    ...
suning.com.lower
    --https://product.suning.com/0000000000/690128156.html
    ...

 

4.2 URL Scheduler

The so-called URL scheduler is really the Java-code scheduling strategy over the URL repository. However, because its core is scheduling, it is explained here as the URL scheduler. At present, the scheduling is based on the following interface:

/**
 * url repository
 * Main functions:
 *      Add URLs to the repository (high priority: list urls; low priority: commodity urls)
 *      Get URLs from the repository (high priority URLs first, then low priority URLs if none)
 *
 */
public interface IRepository {

    /**
     * Method of getting url
     * Get URLs from the repository (get high priority URLs first, then low priority URLs if not)
     * @return
     */
    public String poll();

    /**
     * Add merchandise list url to high priority list
     * @param highUrl
     */
    public void offerHigher(String highUrl);

    /**
     * Add commodity url to low priority list
     * @param lowUrl
     */
    public void offerLower(String lowUrl);

}

Its implementation as a URL repository based on Redis is as follows:

/**
 * Redis-based whole-web crawler repository; urls to crawl are retrieved at random:
 *
 * The Redis data structures used to store the urls are as follows:
 * 1.Set of domain names to crawl (data type: set; this must be added to Redis beforehand)
 *      key
 *          spider.website.domains
 *      value(set)
 *          jd.com  suning.com  gome.com
 *      key obtained via the constant SpiderConstants.SPIDER_WEBSITE_DOMAINS_KEY
 * 2.A high- and a low-priority url queue per domain name (data type: list; added dynamically by the crawler after resolving seed urls)
 *      key
 *          jd.com.higher
 *          jd.com.lower
 *          suning.com.higher
 *          suning.com.lower
 *          gome.com.higher
 *          gome.com.lower
 *      value(list)
 *          the corresponding lists of urls to resolve
 *      key obtained via a random domain name + the constant SpiderConstants.SPIDER_DOMAIN_HIGHER_SUFFIX or SpiderConstants.SPIDER_DOMAIN_LOWER_SUFFIX
 * 3.Seed url set
 *      key
 *          spider.seed.urls
 *      value(set)
 *          seed urls of the data to crawl
 *       key obtained via the constant SpiderConstants.SPIDER_SEED_URLS_KEY
 *
 *       The urls in the seed url set are periodically pushed by the url timer into the high priority url queues
 */
public class RandomRedisRepositoryImpl implements IRepository {

    /**
     * Construction method
     */
    public RandomRedisRepositoryImpl() {
        init();
    }

    /**
     * Initialization method: on startup, delete all high- and low-priority url queues that exist in redis
     * Otherwise, stopping and restarting the program before the URLs in the queues are exhausted would leave duplicate URLs in the url repository
     */
    public void init() {
        Jedis jedis = JedisUtil.getJedis();
        Set<String> domains = jedis.smembers(SpiderConstants.SPIDER_WEBSITE_DOMAINS_KEY);
        String higherUrlKey;
        String lowerUrlKey;
        for(String domain : domains) {
            higherUrlKey = domain + SpiderConstants.SPIDER_DOMAIN_HIGHER_SUFFIX;
            lowerUrlKey = domain + SpiderConstants.SPIDER_DOMAIN_LOWER_SUFFIX;
            jedis.del(higherUrlKey, lowerUrlKey);
        }
        JedisUtil.returnJedis(jedis);
    }

    /**
     * Get a url from the queues. The current strategy:
     *      1.Try the high priority url queue first
     *      2.Then try the low priority url queue
     *  For our actual scenario, list urls should be parsed before commodity urls
     *  Note, however, that in a distributed multithreaded environment this cannot be fully guaranteed: at some point the
     *  high priority queue may be momentarily empty while another thread is still parsing the last high priority url;
     *  a thread polling at that moment finds no high priority url and takes one from the low priority queue instead.
     *  This is particularly important to keep in mind in practical analysis
     * @return the next url to crawl, or null if both queues are empty
     */
    @Override
    public String poll() {
        // Randomly get a top-level domain name from set
        Jedis jedis = JedisUtil.getJedis();
        String randomDomain = jedis.srandmember(SpiderConstants.SPIDER_WEBSITE_DOMAINS_KEY);    // jd.com
        String key = randomDomain + SpiderConstants.SPIDER_DOMAIN_HIGHER_SUFFIX;                // jd.com.higher
        String url = jedis.lpop(key);
        if(url == null) {   // If null, get from low priority
            key = randomDomain + SpiderConstants.SPIDER_DOMAIN_LOWER_SUFFIX;    // jd.com.lower
            url = jedis.lpop(key);
        }
        JedisUtil.returnJedis(jedis);
        return url;
    }

    /**
     * Add URLs to the high priority url queue
     * @param highUrl
     */
    @Override
    public void offerHigher(String highUrl) {
        offerUrl(highUrl, SpiderConstants.SPIDER_DOMAIN_HIGHER_SUFFIX);
    }

    /**
     * Add URLs to the low priority url queue
     * @param lowUrl
     */
    @Override
    public void offerLower(String lowUrl) {
        offerUrl(lowUrl, SpiderConstants.SPIDER_DOMAIN_LOWER_SUFFIX);
    }

    /**
     * General method for adding urls, abstracted from offerHigher and offerLower
     * @param url   url to be added
     * @param urlTypeSuffix  url type suffix, .higher or .lower
     */
    public void offerUrl(String url, String urlTypeSuffix) {
        Jedis jedis = JedisUtil.getJedis();
        String domain = SpiderUtil.getTopDomain(url);   // Get the corresponding top-level domain name for the url, such as jd.com
        String key = domain + urlTypeSuffix;            // Split url queue key s, such as jd.com.higher
        jedis.lpush(key, url);                          // Add url to url queue
        JedisUtil.returnJedis(jedis);
    }
}

This code analysis shows how URLs are scheduled in the url repository (Redis).
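One detail the repository relies on is SpiderUtil.getTopDomain, which maps a url like https://list.jd.com/... to its top-level domain jd.com. Its implementation is not shown here, but a naive version might look like the following sketch (TopDomainUtil is a hypothetical name; note that this simple last-two-labels rule breaks for multi-part suffixes such as .com.cn):

```java
/**
 * Hypothetical sketch of what SpiderUtil.getTopDomain might do:
 * strip the scheme and path, then keep the last two labels of the host.
 */
public class TopDomainUtil {
    public static String getTopDomain(String url) {
        String host = url.replaceFirst("^https?://", "");   // Drop the scheme
        int slash = host.indexOf('/');
        if (slash != -1) {
            host = host.substring(0, slash);                // Drop the path and query string
        }
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        // Keep only the last two labels, e.g. list.jd.com -> jd.com
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }
}
```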

4.3 URL Timer

After some time, the URLs in both the high-priority and low-priority URL queues will be consumed. To let the program continue crawling data with less human intervention, seed URLs can be inserted into Redis beforehand; the URL timer then periodically takes URLs from the seed list and pushes them into the high-priority URL queue, so the program keeps crawling data continuously.

Whether a url needs to be crawled again after being consumed depends on your business needs, so this step is not required; it is simply provided. After all, the data we crawl is itself updated from time to time, and if you want the crawled data to be refreshed regularly, the timer is very important. Note, however, that once you decide you need to crawl data repeatedly, the store implementation must deal with duplicate data: a duplicate should become an update operation. The stores I have written currently do not include this; interested readers can add it themselves, by checking whether the row already exists before inserting.
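As one possible approach for the MySQL store (a sketch only, assuming the `phone` table shown above with its composite primary key on `id` and `source`), the plain INSERT could be replaced with an upsert, so a re-crawled product updates the existing row instead of raising a duplicate-key error:

```sql
-- Sketch: upsert on the (id, source) primary key, so re-crawling
-- a product updates the row instead of failing on a duplicate key
INSERT INTO phone(id, source, brand, title, price, comment_count, url, img_url, params)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
  brand = VALUES(brand),
  title = VALUES(title),
  price = VALUES(price),
  comment_count = VALUES(comment_count),
  url = VALUES(url),
  img_url = VALUES(img_url),
  params = VALUES(params);
```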

Also note that the URL timer is a separate process that needs to be started separately.

The timer is based on the Quartz implementation, and the following is the code for its job:

/**
 * Get the seed url from the url repository regularly every day and add it to the high priority list
 */
public class UrlJob implements Job {

    // log4j Logging
    private Logger logger = LoggerFactory.getLogger(UrlJob.class);

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        /**
         * 1.Get the seed url from the specified url seed warehouse
         * 2.Add a seed url to the high priority list
         */
        Jedis jedis = JedisUtil.getJedis();
        Set<String> seedUrls = jedis.smembers(SpiderConstants.SPIDER_SEED_URLS_KEY);  // spider.seed.urls is a Redis set, which prevents duplicate seed URLs
        for(String seedUrl : seedUrls) {
            String domain = SpiderUtil.getTopDomain(seedUrl);   // Top-level domain name of the seed url
            jedis.lpush(domain + SpiderConstants.SPIDER_DOMAIN_HIGHER_SUFFIX, seedUrl);  // The high priority queue is a list, so push rather than sadd
            logger.info("Get seed:{}", seedUrl);
        }
        }
        JedisUtil.returnJedis(jedis);
//        System.out.println("Scheduler Job Test...");
    }

}

The scheduler is implemented as follows:

/**
 * url timing scheduler, which periodically stores seed URLs into the corresponding url repository
 *
 * Business rule: store the seed urls into the repository at 1:10 a.m. every day
 */
public class UrlJobScheduler {

    public UrlJobScheduler() {
        init();
    }

    /**
     * Initialize Scheduler
     */
    public void init() {
        try {
            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

            // Scheduling of tasks will not start unless the following start method is executed
            scheduler.start();

            String name = "URL_SCHEDULER_JOB";
            String group = "URL_SCHEDULER_JOB_GROUP";
            JobDetail jobDetail = new JobDetail(name, group, UrlJob.class);
            String cronExpression = "0 10 1 * * ?";
            Trigger trigger = new CronTrigger(name, group, cronExpression);

            // Schedule Tasks
            scheduler.scheduleJob(jobDetail, trigger);

        } catch (SchedulerException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        UrlJobScheduler urlJobScheduler = new UrlJobScheduler();
        urlJobScheduler.start();
    }

    /**
     * Keep the scheduling process alive
     * Because we fetch the seed urls from the designated repository every day and push them into the high priority url queue,
     * this program must run uninterrupted, so it must not exit
     */
    private void start() {
        while (true) {
            try {
                Thread.sleep(Long.MAX_VALUE);   // Block instead of busy-waiting; the Quartz scheduler threads keep running
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
}

5 Monitoring and alarm system

The monitoring and alarm system exists mainly so that users actively discover node downtime rather than stumbling on it passively. Because the crawlers may run continuously in practice, and we deploy them on multiple nodes, it is necessary to monitor the nodes and to detect and fix problems as they occur. Note that the monitoring and alarm system is an independent process and needs to be started separately.

5.1 Basic Principles

First you need to create an /ispider node in zookeeper:

[zk: localhost:2181(CONNECTED) 1] create /ispider ispider
Created /ispider

The monitoring and alarm system is developed mainly on top of zookeeper; the system watches the children of this node directory in zookeeper:

[zk: localhost:2181(CONNECTED) 0] ls /ispider
[]

A temporary node directory is registered under this node directory when the crawler starts:

[zk: localhost:2181(CONNECTED) 0] ls /ispider
[192.168.43.166]

When a node goes down, its temporary node directory is deleted by zookeeper:

[zk: localhost:2181(CONNECTED) 0] ls /ispider
[]

Because we watch the node directory /ispider, when zookeeper deletes a node directory under it (or adds one), zookeeper sends a notification to our monitor: our watcher is called back, and we carry out the alerting actions in that callback, completing the monitoring and alarm function.

5.2 zookeeper Java API usage instructions

You could use zookeeper's native Java API; I did so in another RPC framework I wrote (with remote communication based on Netty at the bottom layer), but that code is obviously much more complex, and it requires more learning and understanding of zookeeper itself to use correctly.

So, to reduce the difficulty of development, we use curator, a third-party API that wraps zookeeper, to develop the zookeeper client program.

5.3 Crawler system zookeeper registration

When the crawler system starts, our program starts a zookeeper client to register its node information, mainly its IP address, with zookeeper, creating a node named after the crawler node's IP address under the /ispider node directory, such as /ispider/192.168.43.116. The code is as follows:

/**
 * Register zk
 */
private void registerZK() {
    String zkStr = "uplooking01:2181,uplooking02:2181,uplooking03:2181";
    int baseSleepTimeMs = 1000;
    int maxRetries = 3;
    RetryPolicy retryPolicy = new ExponentialBackoffRetry(baseSleepTimeMs, maxRetries);
    CuratorFramework curator = CuratorFrameworkFactory.newClient(zkStr, retryPolicy);
    curator.start();
    String ip = null;
    try {
        // Register with zk by creating an ephemeral node under the /ispider directory
        ip = InetAddress.getLocalHost().getHostAddress();
        curator.create().withMode(CreateMode.EPHEMERAL).forPath("/ispider/" + ip, ip.getBytes());
    } catch (UnknownHostException e) {
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Note that the nodes we create are ephemeral nodes; for the monitoring and alerting function to work, they must be ephemeral.

5.4 Monitor

First we need to watch a node directory in zookeeper; in our system that directory is /ispider:

public SpiderMonitorTask() {
    String zkStr = "uplooking01:2181,uplooking02:2181,uplooking03:2181";
    int baseSleepTimeMs = 1000;
    int maxRetries = 3;
    RetryPolicy retryPolicy = new ExponentialBackoffRetry(baseSleepTimeMs, maxRetries);
    curator = CuratorFrameworkFactory.newClient(zkStr, retryPolicy);
    curator.start();
    try {
        previousNodes = curator.getChildren().usingWatcher(this).forPath("/ispider");
    } catch (Exception e) {
        e.printStackTrace();
    }
}

The code above registers a zookeeper watcher, a callback that receives notifications and executes our alerting logic:

/**
 * This method is called when the directory corresponding to the zk being monitored changes
 * By comparing the latest node state with the initial or last one, we know who caused the node change.
 * @param event
 */
@Override
public void process(WatchedEvent event) {
    try {
        List<String> currentNodes = curator.getChildren().usingWatcher(this).forPath("/ispider");
        //            HashSet<String> previousNodesSet = new HashSet<>(previousNodes);
        if(currentNodes.size() > previousNodes.size()) { // More nodes than before: a new node has joined
            for(String node : currentNodes) {
                if(!previousNodes.contains(node)) {
                    // The current node is the new node
                    logger.info("----New crawler node {} has joined", node);
                }
            }
        } else if(currentNodes.size() < previousNodes.size()) {  // Fewer nodes than before: a node went down, send an alert mail or SMS
            for(String node : previousNodes) {
                if(!currentNodes.contains(node)) {
                    // This node has gone down; send an alert mail
                    logger.info("----Crawler node {} has gone down", node);
                    MailUtil.sendMail("A crawler node has been dropped. Please check the crawler node manually. The node information is:", node);
                }
            }
        } // If the numbers of added and removed nodes are equal, the sizes match and neither branch above fires. Interested readers can handle this case as well.
        previousNodes = currentNodes;   // Update the last node list to be the latest
    } catch (Exception e) {
        e.printStackTrace();
    }
    // With zookeeper's native API a watch fires only once, so it would have to be re-registered after every notification;
    // with curator this is handled because usingWatcher(this) above re-registers the watcher on each call
}

Of course, this logic for deciding whether a node went down is not fully precise. If node additions and deletions happen at the same moment, the size comparison above cannot distinguish them, so if you need more precision you can modify the code accordingly.
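A more robust check is to diff the two node lists as sets instead of comparing sizes; that way simultaneous additions and removals are both detected. Below is a minimal sketch of that idea (`NodeDiff` is a hypothetical helper, not part of the original code):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Detect added AND removed crawler nodes via set difference, independent of list sizes
public class NodeDiff {

    // Nodes present now but not before: newly registered crawler nodes
    public static Set<String> added(List<String> previous, List<String> current) {
        Set<String> result = new HashSet<>(current);
        result.removeAll(previous);
        return result;
    }

    // Nodes present before but not now: crawler nodes that went down
    public static Set<String> removed(List<String> previous, List<String> current) {
        Set<String> result = new HashSet<>(previous);
        result.removeAll(current);
        return result;
    }
}
```

Inside `process()`, you would alert for every element of `removed(previousNodes, currentNodes)`, so a node failure is caught even when a new node joins in the same event window.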

5.5 Mail Sending Module

The mail module uses template code; just note that the sender information must be replaced with your own mailbox credentials.

Here is the mail received when a crawler node goes down:

In fact, if you buy an SMS service, you can also send a text message to our mobile phones through the SMS API.

 

6 Actual Warfare: Crawl the data of mobile phone commodities in Jingdong and Suning

As mentioned when introducing this system, I only wrote web parsers for Jingdong and Suning Easy-to-buy, so the next step is to crawl the mobile phone commodity data of those two sites in full.

6.1 Environment Description

You need to make sure the Redis and Zookeeper services are available. If you want to store the data in HBase, you also need to make sure that HBase in the Hadoop cluster is available and that the relevant configuration files have been added to the crawler's classpath.

It is also important to note that the URL timer and the monitoring/alert system run as separate processes, and both are optional.

6.2 Crawl Result

The data was crawled twice, saving it to MySQL and to HBase respectively; the resulting figures are shown below.

6.2.1 Save to MySQL

mysql> select count(*) from phone;
+----------+
| count(*) |
+----------+
|    12052 |
+----------+
1 row in set

mysql> select count(*) from phone where source='jd.com';
+----------+
| count(*) |
+----------+
|     9578 |
+----------+
1 row in set

mysql> select count(*) from phone where source='suning.com';
+----------+
| count(*) |
+----------+
|     2474 |
+----------+
1 row in set

View the data in the visualizer:

6.2.2 Save to HBase

hbase(main):225:0* count 'phone'
Current count: 1000, row: 11155386088_jd.com
Current count: 2000, row: 136191393_suning.com
Current count: 3000, row: 16893837301_jd.com
Current count: 4000, row: 19036619855_jd.com
Current count: 5000, row: 1983786945_jd.com
Current count: 6000, row: 1997392141_jd.com
Current count: 7000, row: 21798495372_jd.com
Current count: 8000, row: 24154264902_jd.com
Current count: 9000, row: 25687565618_jd.com
Current count: 10000, row: 26458674797_jd.com
Current count: 11000, row: 617169906_suning.com
Current count: 12000, row: 769705049_suning.com                 
12348 row(s) in 1.5720 seconds

=> 12348

View data in HDFS:

6.2.3 Data volume and actual situation analysis

  • JD.COM

The Jingdong mobile phone list has about 160 pages, with 60 commodities per list page, so the total is around 9,600; our figure is basically consistent with that. Log analysis shows that the lost records are generally caused by connection timeouts, so it is better to run the crawler on hosts with a good network environment, and better still if you also have an IP proxy address library. In addition, connection timeouts could be handled further in the program itself: if a url fails, add it to a retry url queue. I have not done this yet; interested readers can try it.
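The retry idea mentioned above could be sketched roughly as follows. `RetryQueue` is a hypothetical helper that is not in the original code, and it keeps the queue in process memory for simplicity; in the real system you would likely back it with Redis so all crawler nodes share it:

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// When a URL fails (e.g. connection timeout), re-queue it a bounded number
// of times instead of dropping it
public class RetryQueue {
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();
    private final Map<String, Integer> attempts = new ConcurrentHashMap<>();
    private final int maxRetries;

    public RetryQueue(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Returns true if the URL was re-queued, false if it exhausted its retries
    public boolean offerFailed(String url) {
        int n = attempts.merge(url, 1, Integer::sum); // count this failure
        if (n > maxRetries) {
            return false; // give up on this URL, just log it
        }
        queue.offer(url);
        return true;
    }

    // Worker threads poll here before (or after) taking fresh URLs from the repository
    public String poll() {
        return queue.poll();
    }
}
```

The worker would call `offerFailed(url)` in its timeout handler and drain the queue alongside the normal URL repository.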

  • Suning Easy-to-buy

Suning's mobile phone list has about 100 pages, with 60 commodities per page, so the total should be around 6,000. But our data is only on the order of 3,000 (minus a few records lost to connection failures caused by frequent crawling). Why?

This is because each Suning list page loads only 30 commodities at first; when you scroll down, it loads the other 30 through a separate API call. This is true for every list page, so we effectively missed half of the commodity data. Knowing the cause, it is not hard to fix, but for lack of time I did not do it; interested readers can tinker with it.

6.3 Log analysis of crawler system performance

In our crawler system, every key step, such as web page download and data parsing, is logged, so the logs can be used to roughly analyse the relevant timing parameters.

2018-04-01 21:26:03 [pool-1-thread-1] [cn.xpleaf.spider.utils.HttpUtil] [INFO] - Download web page: https://list.jd.com/list.html?cat=9987,653,655&page=1, elapsed time: 590 ms, proxy information: null:null
2018-04-01 21:26:03 [pool-1-thread-1] [cn.xpleaf.spider.core.parser.Impl.JDHtmlParserImpl] [INFO] - Parse list page: https://list.jd.com/list.html?cat=9987,653,655&page=1, elapsed time: 46 ms
2018-04-01 21:26:03 [pool-1-thread-3] [cn.xpleaf.spider.core.parser.Impl.SNHtmlParserImpl] [INFO] - Parse list page: https://list.suning.com/0-20006-0.html, elapsed time: 49 ms
2018-04-01 21:26:04 [pool-1-thread-5] [cn.xpleaf.spider.utils.HttpUtil] [INFO] - Download web page: https://item.jd.com/6737464.html, elapsed time: 219 ms, proxy information: null:null
2018-04-01 21:26:04 [pool-1-thread-2] [cn.xpleaf.spider.utils.HttpUtil] [INFO] - Download web page: https://list.jd.com/list.html?cat=9987,653,655&page=2&sort=sort_rank_asc&trans=1&JL=6_0_0, elapsed time: 276 ms, proxy information: null:null
2018-04-01 21:26:04 [pool-1-thread-4] [cn.xpleaf.spider.utils.HttpUtil] [INFO] - Download web page: https://list.suning.com/0-20006-99.html, elapsed time: 300 ms, proxy information: null:null
2018-04-01 21:26:04 [pool-1-thread-4] [cn.xpleaf.spider.core.parser.Impl.SNHtmlParserImpl] [INFO] - Parse list page: https://list.suning.com/0-20006-99.html, elapsed time: 4 ms
......
2018-04-01 21:27:49 [pool-1-thread-3] [cn.xpleaf.spider.utils.HttpUtil] [INFO] - Download web page: https://club.jd.com/comment/productCommentSummaries.action?referenceIds=23934388891, elapsed time: 176 ms, proxy information: null:null
2018-04-01 21:27:49 [pool-1-thread-3] [cn.xpleaf.spider.core.parser.Impl.JDHtmlParserImpl] [INFO] - Parse commodity page: https://item.jd.com/23934388891.html, elapsed time: 413 ms
2018-04-01 21:27:49 [pool-1-thread-2] [cn.xpleaf.spider.utils.HttpUtil] [INFO] - Download web page: https://review.suning.com/ajax/review_satisfy/general-00000010017793337-0070079092-----satisfy.htm, elapsed time: 308 ms, proxy information: null:null
2018-04-01 21:27:49 [pool-1-thread-2] [cn.xpleaf.spider.core.parser.Impl.SNHtmlParserImpl] [INFO] - Parse commodity page: https://product.suning.com/0070079092/10017793337.html, elapsed time: 588 ms
......

On average, downloading one commodity web page takes 200 to 500 ms, depending on the network conditions at the time.

In addition, if you want to compute the total time spent crawling one commodity, you can derive it from the following figures in the log:

  • Time to download data from a merchandise page
  • Time to get price data
  • Time to get comment data
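As a rough illustration of how such timings could be pulled out of the logs, here is a small sketch (`LogTiming` is a hypothetical helper, not part of the original system); the regex tolerates the slightly different phrasings and spacing that appear in the log lines:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract and sum the millisecond timings from crawler log lines
public class LogTiming {

    // Matches fragments like "elapsed time: 590 ms" or "consumption time: 46ms"
    private static final Pattern MS =
            Pattern.compile("(?:elapsed|consumption) time:\\s*(\\d+)\\s*ms");

    // Sum every "... time: N ms" occurrence across the given log lines,
    // e.g. the download, price and comment timings for one commodity
    public static long totalMillis(String... logLines) {
        long total = 0;
        for (String line : logLines) {
            Matcher m = MS.matcher(line);
            while (m.find()) {
                total += Long.parseLong(m.group(1));
            }
        }
        return total;
    }
}
```

Feeding it the download, price and comment lines of one commodity gives the per-commodity total.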

On my host (CPU: E5 10-core, RAM: 32GB, running one virtual machine and three virtual machines respectively), the results were:

Number of nodes | Threads per node | Quantity of goods                              | Time
1               | 5                | Jingdong + Suning, nearly 13,000 commodity records | 141 minutes
3               | 5                | Jingdong + Suning, nearly 13,000 commodity records | 65 minutes

You can see that with three nodes the time does not drop to 1/3 of the original, because at that point the main factor limiting crawler performance is the network: the number of nodes and threads raises the number of concurrent requests, but bandwidth is fixed, and without proxies the frequent requests also increase connection failures, which costs additional time. With a random proxy library, things would be much better.

However, it is certain that scaling out by adding crawler nodes can significantly reduce crawl time, which is one of the benefits of a distributed crawling system.

7 Anti-crawl strategies used in crawl systems

In the design of the whole crawler system, the following strategies are mainly used to achieve the purpose of anti-crawler:

  • Use proxies for access --> IP proxy library, random IP proxies
  • Random scheduling across top-level domain urls --> url scheduling system
  • Each thread sleeps for a short random period between crawling two commodity records
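The third strategy can be sketched in a few lines. `PoliteDelay` is a hypothetical helper, and the 500-1500 ms bounds are illustrative assumptions, not values from the original system:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sleep a random short interval between two commodity requests so the
// access pattern looks less mechanical to the target site
public class PoliteDelay {

    // Pick a random delay in [minMs, maxMs)
    public static long nextDelayMs(long minMs, long maxMs) {
        return ThreadLocalRandom.current().nextLong(minMs, maxMs);
    }

    // Call between two crawls of commodity pages
    public static void pause() throws InterruptedException {
        Thread.sleep(nextDelayMs(500, 1500));
    }
}
```

Using `ThreadLocalRandom` keeps the delay generation contention-free when several crawler threads pause independently.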

8 Summary

Note that this system is implemented in Java, but I personally feel the language itself is not the point; the core of this article is the design and understanding of the whole system, and I wanted to share the architecture of such a distributed crawler system with you. If you are interested in the source code, you can view it on my GitHub:

GitHub: https://github.com/xpleaf/ispider


Tags: Jedis Zookeeper HBase Redis

Posted on Mon, 09 Mar 2020 22:27:41 -0400 by zero_ZX