Crawler tricks: how I crawl Indeed job data

Recently I learned Node.js crawling with the request module and wanted to build a crawling project of my own. After some research I settled on Indeed as the target site: crawl Indeed's job data and then build my own job search engine on top of it. The site is online now, and although its features are still simple, I am posting the link to the job search engine as proof that the crawl project is useful. Now let's go through the overall design of the crawler.

Determining the entry page

As we all know, a crawler needs an entry page; starting from it, the crawler keeps following links until the whole site has been crawled. This first step already gave me trouble. Normally you would pick the home page or the list pages as the entry, but Indeed's list pages are limited: a search does not expose the complete list, only about the first 100 pages of results. That turned out not to matter, because Indeed has a Browse Jobs page from which you can reach every list, organized by region and by job category. The parsing code for that page is pasted below.

start: async (page) => {
  const host = URL.parse(page.url).hostname;
  const tasks = [];
  try {
    const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
    $('#states > tbody > tr > td > a').each((i, ele) => {
      const url = URL.resolve(page.url, $(ele).attr('href'));
      tasks.push({ _id: md5(url), type: 'city', host, url, done: 0, name: $(ele).text() });
    });
    $('#categories > tbody > tr > td > a').each((i, ele) => {
      const url = URL.resolve(page.url, $(ele).attr('href'));
      tasks.push({ _id: md5(url), type: 'category', host, url, done: 0, name: $(ele).text() });
    });
    const res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
    res && console.log(`${host}-start insert ${res.insertedCount} from ${tasks.length} tasks`);
    return 1;
  } catch (err) {
    console.error(`${host}-start parse ${page.url} ${err}`);
    return 0;
  }
}

The HTML is parsed with cheerio, and the search-by-region and search-by-category links are inserted into the database.
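
For reference, a single inserted task record looks roughly like this (the hash and URL values are purely illustrative):

{
  _id: '0800fc577294c34e0b28ad2839435945', // md5(url), so the same URL can only be inserted once
  type: 'city',                            // tells the crawler which parser to run on this page
  host: 'www.indeed.com',
  url: 'https://www.indeed.com/l-California-jobs.html', // illustrative
  done: 0,                                 // 0 = not yet crawled
  name: 'California'
}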

Crawler architecture

Here is a brief description of the crawler architecture. The database is MongoDB. Every page to be crawled has a task record with fields such as _id, url, done, type and host; the _id is generated with md5(url) to avoid duplicates. Each type has its own HTML parsing method, and the main business logic is concentrated in these parsing methods, such as the code pasted above.
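
The parsers access collections through handles like global.com.task and global.com.city; that setup is not shown in the post, but a minimal sketch with the official mongodb driver could look like this (the connection string, database name and collection names are assumptions):

const { MongoClient } = require('mongodb');

const init = async () => {
  const client = await MongoClient.connect('mongodb://localhost:27017', { useUnifiedTopology: true });
  const db = client.db('indeed');
  // expose the collection handles the way the parsers expect them
  global.com = {
    task: db.collection('task'),
    city: db.collection('city'),
    category: db.collection('category'),
    company: db.collection('company'),
    job: db.collection('job'),
  };
};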

Pages are downloaded with the request module, wrapped in a thin layer that turns the callback into a promise so it can be called conveniently with async/await. The code is as follows.

const req = require('request');

const request = req.defaults({
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
  },
  timeout: 30000,
  encoding: null
});

const fetch = (url) => new Promise((resolve) => {
    console.log(`down ${url} started`);
    request(encodeURI(url), (err, res, body) => {
      if (res && res.statusCode === 200) {
        console.log(`down ${url} 200`);
        resolve(body);
      } else {
        console.error(`down ${url} ${res && res.statusCode} ${err}`);
        if (res && res.statusCode) {
          resolve(res.statusCode);
        } else {
          // timeout errors (ETIMEDOUT / ESOCKETTIMEDOUT) are reported as 600
          resolve(600);
        }
      }
    });
  });

Some simple anti-crawling measures are in place: the User-Agent is set to a common desktop browser string and the timeout to 30 seconds. The option encoding: null makes request return the raw buffer instead of a decoded string. The advantage is that whether a page is GBK or UTF-8, you simply specify the right encoding when parsing the HTML; if you hard-coded encoding: 'utf-8' and a page were actually GBK, its content would come out garbled.
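
For example, the buffer is only decoded at parse time, so the charset can be chosen per page (Indeed itself is UTF-8; the GBK value below is just to show the idea):

const iconv = require('iconv-lite');
const cheerio = require('cheerio');

const parse = (page) => {
  // page.con is the raw buffer returned by the fetch wrapper
  const html = iconv.decode(page.con, 'utf-8'); // pass 'gbk' instead for a GBK-encoded site
  return cheerio.load(html, { decodeEntities: false });
};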

request normally takes a callback; wrapped in a promise, the function resolves to the page buffer on success, to the HTTP status code on failure, and to 600 on a timeout. Anyone familiar with Node.js should find this easy to follow.
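
A quick usage sketch of the wrapper, following the convention above that a buffer means success and a number means failure (the URL is just an example):

const main = async () => {
  const res = await fetch('https://www.indeed.com/jobs?q=nodejs&sort=date');
  if (Buffer.isBuffer(res)) {
    console.log(`got ${res.length} bytes`); // raw page buffer, decode later with iconv-lite
  } else {
    console.log(`request failed with code ${res}`); // HTTP status code, or 600 on timeout
  }
};

main();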

Complete parsing code

const URL = require('url');
const md5 = require('md5');
const cheerio = require('cheerio');
const iconv = require('iconv-lite');

const json = (data) => {
  let res;
  try {
    res = JSON.parse(data);
  } catch (err) {
    console.error(err);
  }
  return res;
};

const rules = [
  /\/jobs\?q=.*&sort=date&start=\d+/,
  /\/jobs\?q=&l=.*&sort=date&start=\d+/
];

const fns = {

  start: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    try {
      const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
      $('#states > tbody > tr > td > a').each((i, ele) => {
        const url = URL.resolve(page.url, $(ele).attr('href'));
        tasks.push({ _id: md5(url), type: 'city', host, url, done: 0, name: $(ele).text() });
      });
      $('#categories > tbody > tr > td > a').each((i, ele) => {
        const url = URL.resolve(page.url, $(ele).attr('href'));
        tasks.push({ _id: md5(url), type: 'category', host, url, done: 0, name: $(ele).text() });
      });
      const res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-start insert ${res.insertedCount} from ${tasks.length} tasks`);
      return 1;
    } catch (err) {
      console.error(`${host}-start parse ${page.url} ${err}`);
      return 0;
    }
  },

  city: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    const cities = [];
    try {
      const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
      $('#cities > tbody > tr > td > p.city > a').each((i, ele) => {
        // https://www.indeed.com/l-Charlotte,-NC-jobs.html
        let tmp = $(ele).attr('href').match(/l-(?<loc>.*)-jobs.html/u);
        if (!tmp) {
          tmp = $(ele).attr('href').match(/l=(?<loc>.*)/u);
        }
        const { loc } = tmp.groups;
        const url = `https://www.indeed.com/jobs?l=${decodeURIComponent(loc)}&sort=date`;
        tasks.push({ _id: md5(url), type: 'search', host, url, done: 0 });
        cities.push({ _id: `${$(ele).text()}_${page.name}`, parent: page.name, name: $(ele).text(), url });
      });
      let res = await global.com.city.insertMany(cities, { ordered: false }).catch(() => {});
      res && console.log(`${host}-city insert ${res.insertedCount} from ${cities.length} cities`);

      res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-city insert ${res.insertedCount} from ${tasks.length} tasks`);
      return 1;
    } catch (err) {
      console.error(`${host}-city parse ${page.url} ${err}`);
      return 0;
    }
  },

  category: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    const categories = [];
    try {
      const $ = cheerio.load(iconv.decode(page.con, 'utf-8'), { decodeEntities: false });
      $('#titles > tbody > tr > td > p.job > a').each((i, ele) => {
        const { query } = $(ele).attr('href').match(/q-(?<query>.*)-jobs.html/u).groups;
        const url = `https://www.indeed.com/jobs?q=${decodeURIComponent(query)}&sort=date`;
        tasks.push({ _id: md5(url), type: 'search', host, url, done: 0 });
        categories.push({ _id: `${$(ele).text()}_${page.name}`, parent: page.name, name: $(ele).text(), url });
      });
      let res = await global.com.category.insertMany(categories, { ordered: false }).catch(() => {});
      res && console.log(`${host}-category insert ${res.insertedCount} from ${categories.length} categories`);

      res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-category insert ${res.insertedCount} from ${tasks.length} tasks`);
      return 1;
    } catch (err) {
      console.error(`${host}-category parse ${page.url} ${err}`);
      return 0;
    }
  },

  search: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    const durls = [];
    try {
      const con = iconv.decode(page.con, 'utf-8');
      const $ = cheerio.load(con, { decodeEntities: false });
      // each match looks like `jobmap[0]= {jk:'...',cmplnk:'...'}`;
      // eval-ing it assigns that object into the local jobmap array
      const list = con.match(/jobmap\[\d+\]= {.*}/g);
      const jobmap = [];
      if (list) {
        // eslint-disable-next-line no-eval
        list.map((item) => eval(item));
      }
      for (const item of jobmap) {
        const cmplink = URL.resolve(page.url, item.cmplnk);
        const { query } = URL.parse(cmplink, true);
        let name;
        if (query.q) {
          // eslint-disable-next-line prefer-destructuring
          name = query.q.split(' #')[0].split('#')[0];
        } else {
          const tmp = cmplink.match(/q-(?<text>.*)-jobs.html/u);
          if (!tmp) {
            // eslint-disable-next-line no-continue
            continue;
          }
          const { text } = tmp.groups;
          // eslint-disable-next-line prefer-destructuring
          name = text.replace(/-/g, ' ').split(' #')[0];
        }
        const surl = `https://www.indeed.com/cmp/_cs/cmpauto?q=${name}&n=10&returnlogourls=1&returncmppageurls=1&caret=8`;
        const burl = `https://www.indeed.com/viewjob?jk=${item.jk}&from=vjs&vjs=1`;
        const durl = `https://www.indeed.com/rpc/jobdescs?jks=${item.jk}`;
        tasks.push({ _id: md5(surl), type: 'suggest', host, url: surl, done: 0 });
        tasks.push({ _id: md5(burl), type: 'brief', host, url: burl, done: 0 });
        durls.push({ _id: md5(durl), type: 'detail', host, url: durl, done: 0 });
      }
      $('a[href]').each((i, ele) => {
        const tmp = URL.resolve(page.url, $(ele).attr('href'));
        const [url] = tmp.split('#');
        const { path, hostname } = URL.parse(url);
        for (const rule of rules) {
          if (rule.test(path)) {
            if (hostname == host) {
              // tasks.push({ _id: md5(url), type: 'list', host, url: decodeURI(url), done: 0 });
            }
            break;
          }
        }
      });

      let res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-search insert ${res.insertedCount} from ${tasks.length} tasks`);

      res = await global.com.task.insertMany(durls, { ordered: false }).catch(() => {});
      res && console.log(`${host}-search insert ${res.insertedCount} from ${durls.length} tasks`);

      return 1;
    } catch (err) {
      console.error(`${host}-search parse ${page.url} ${err}`);
      return 0;
    }
  },

  suggest: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    const companies = [];
    try {
      const con = page.con.toString('utf-8');
      const data = json(con);
      for (const item of data) {
        const id = item.overviewUrl.replace('/cmp/', '');
        const cmpurl = `https://www.indeed.com/cmp/${id}`;
        const joburl = `https://www.indeed.com/cmp/${id}/jobs?clearPrefilter=1`;
        tasks.push({ _id: md5(cmpurl), type: 'company', host, url: cmpurl, done: 0 });
        tasks.push({ _id: md5(joburl), type: 'jobs', host, url: joburl, done: 0 });
        companies.push({ _id: id, name: item.name, url: cmpurl });
      }

      let res = await global.com.company.insertMany(companies, { ordered: false }).catch(() => {});
      res && console.log(`${host}-suggest insert ${res.insertedCount} from ${companies.length} companies`);

      res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-suggest insert ${res.insertedCount} from ${tasks.length} tasks`);
      return 1;
    } catch (err) {
      console.error(`${host}-suggest parse ${page.url} ${err}`);
      return 0;
    }
  },

  // list: () => {},

  jobs: async (page) => {
    const host = URL.parse(page.url).hostname;
    const tasks = [];
    const durls = [];
    try {
      const con = iconv.decode(page.con, 'utf-8');
      const tmp = con.match(/window._initialData=(?<text>.*);<\/script><script>window._sentryData/u);
      let data;
      if (tmp) {
        const { text } = tmp.groups;
        data = json(text);
        if (data.jobList && data.jobList.pagination && data.jobList.pagination.paginationLinks) {
          for (const item of data.jobList.pagination.paginationLinks) {
            // eslint-disable-next-line max-depth
            if (item.href) {
              item.href = item.href.replace(/\u002F/g, '/');
              const url = URL.resolve(page.url, decodeURI(item.href));
              tasks.push({ _id: md5(url), type: 'jobs', host, url: decodeURI(url), done: 0 });
            }
          }
        }
        if (data.jobList && data.jobList.jobs) {
          for (const job of data.jobList.jobs) {
            const burl = `https://www.indeed.com/viewjob?jk=${job.jobKey}&from=vjs&vjs=1`;
            const durl = `https://www.indeed.com/rpc/jobdescs?jks=${job.jobKey}`;
            tasks.push({ _id: md5(burl), type: 'brief', host, url: burl, done: 0 });
            durls.push({ _id: md5(durl), type: 'detail', host, url: durl, done: 0 });
          }
        }
      } else {
        console.log(`${host}-jobs ${page.url} has no _initialData`);
      }
      let res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-jobs insert ${res.insertedCount} from ${tasks.length} tasks`);

      res = await global.com.task.insertMany(durls, { ordered: false }).catch(() => {});
      res && console.log(`${host}-jobs insert ${res.insertedCount} from ${durls.length} tasks`);

      return 1;
    } catch (err) {
      console.error(`${host}-jobs parse ${page.url} ${err}`);
      return 0;
    }
  },

  brief: async (page) => {
    const host = URL.parse(page.url).hostname;
    try {
      const con = page.con.toString('utf-8');
      const data = json(con);
      data.done = 0;
      data.views = 0;
      data.host = host;
      // format publish date
      if (data.vfvm && data.vfvm.jobAgeRelative) {
        const str = data.vfvm.jobAgeRelative;
        const tmp = str.split(' ');
        const [first, second] = tmp;
        if (first == 'Just' || first == 'Today') {
          data.publishDate = Date.now();
        } else {
          const num = first.replace(/\+/, '');
          if (second == 'hours') {
            const date = new Date();
            const time = date.getTime();
            // eslint-disable-next-line no-mixed-operators
            date.setTime(time - num * 60 * 60 * 1000);
            data.publishDate = date.getTime();
          } else if (second == 'days') {
            const date = new Date();
            const time = date.getTime();
            // eslint-disable-next-line no-mixed-operators
            date.setTime(time - num * 24 * 60 * 60 * 1000);
            data.publishDate = date.getTime();
          } else {
            data.publishDate = Date.now();
          }
        }
      }
      await global.com.job.updateOne({ _id: data.jobKey }, { $set: data }, { upsert: true }).catch(() => { });

      const tasks = [];
      const url = `https://www.indeed.com/jobs?l=${data.jobLocationModel.jobLocation}&sort=date`;
      tasks.push({ _id: md5(url), type: 'search', host, url, done: 0 });
      const res = await global.com.task.insertMany(tasks, { ordered: false }).catch(() => {});
      res && console.log(`${host}-brief insert ${res.insertedCount} from ${tasks.length} tasks`);
      return 1;
    } catch (err) {
      console.error(`${host}-brief parse ${page.url} ${err}`);
      return 0;
    }
  },

  detail: async (page) => {
    const host = URL.parse(page.url).hostname;
    try {
      const con = page.con.toString('utf-8');
      const data = json(con);
      const [jobKey] = Object.keys(data);
      await global.com.job.updateOne({ _id: jobKey }, { $set: { content: data[jobKey], done: 1 } }).catch(() => { });
      return 1;
    } catch (err) {
      console.error(`${host}-detail parse ${page.url} ${err}`);
      return 0;
    }
  },

  run: (page) => {
    if (page.type == 'list') {
      page.type = 'search';
    }
    const fn = fns[page.type];
    if (fn) {
      return fn(page);
    }
    console.error(`${page.url} parser not found`);
    return 0;
  }

};

module.exports = fns;

Each parsing method inserts some new links, and every new task record carries a type field that tells the crawler which parser to use on it; this is how every page eventually gets parsed. For example, the start method inserts records of type city and category; a page record of type city is handled by the city method, which in turn inserts links of type search; and so the loop continues until the brief and detail methods finally produce the job summary and the job description, respectively.
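
None of this runs without a driver loop that pulls pending tasks, downloads them and dispatches them to fns.run; that part is not in the post, but a minimal sketch, assuming the fetch wrapper above and the global.com.task collection are already set up (the batch size, module path and done update are my assumptions), could look like this:

const fns = require('./parse'); // the parsing methods above; path is an assumption

const crawl = async () => {
  for (;;) {
    // grab a batch of pages that have not been crawled yet
    const pages = await global.com.task.find({ done: 0 }).limit(10).toArray();
    if (pages.length === 0) break;
    for (const page of pages) {
      const body = await fetch(page.url);
      if (!Buffer.isBuffer(body)) continue; // download failed, leave it for a retry
      page.con = body;
      const ok = await fns.run(page); // dispatch to the parser matching page.type
      if (ok) {
        await global.com.task.updateOne({ _id: page._id }, { $set: { done: 1 } });
      }
    }
  }
};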

In fact, the key to a crawler is these HTML parsing methods; with them you can extract whatever structured content you want.

Data Index

This part is easy. With the structured data obtained above, create a new mapping in Elasticsearch, then write a program that periodically adds the job data to the ES index. I did not put the content field into the index, because the job description text is fairly long and indexing it takes too much memory; the server was already running out of memory. ><
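
The indexing program itself is not shown; a rough sketch with the official @elastic/elasticsearch client might look like the following (the index name, field names and the done flag are assumptions, not the real schema):

const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

const indexJobs = async () => {
  // pick jobs that have been fully parsed; the query is an assumption
  const jobs = await global.com.job.find({ done: 1 }).limit(500).toArray();
  if (jobs.length === 0) return;
  const body = jobs.flatMap((job) => [
    { index: { _index: 'jobs', _id: job._id } },
    {
      // field names here are illustrative, not Indeed's actual payload keys
      title: job.jobTitle,
      company: job.companyName,
      publishDate: job.publishDate,
      // the long content field is deliberately left out to save memory
    },
  ]);
  await client.bulk({ refresh: true, body });
};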

DEMO

Finally, here is the link again for you to try: job search engine.
