Teach you how to use a Python web crawler to collect HD game wallpapers with multithreading

1, Background introduction

Hello, I'm Pippi. Different kinds of data call for different crawling methods, whether they are pictures, video, audio or text. Because the target website has a huge number of image resources, today we use multithreading to collect 4K HD wallpapers from it.

2, Page analysis

Target site:

http://www.bizhi88.com/3840x2160/

As shown in the figure, there are 278 pages in total. Here we crawl the wallpaper pictures from the first 100 pages and save them locally.

Parse page

As shown in the figure, the pictures we want are all inside one large container (<div class="flex-img auto mt"></div>), and each div inside it corresponds to one high-definition wallpaper.

In each of these div tags, the wallpaper data we need consists of two items: 1. the link; 2. the name. They are extracted with the following xpath expressions:

imgLink = each.xpath("./a[1]/img/@data-original")[0]
name = each.xpath("./a[1]/img/@alt")[0]

One note:

The img tag has both an src attribute and a data-original attribute, and both hold a URL for the picture. We generally use the latter, because data-original is a custom attribute that already contains the real image address (the site lazy-loads its images), while the src attribute is only filled in after the page has finished loading, so we may not get a valid address from it.
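For robustness, here is a small sketch (the get_img_link helper is hypothetical, not part of the original code) that prefers data-original and falls back to src when the lazy-load attribute is missing:

# Hypothetical helper: prefer the lazy-load attribute, fall back to src.
def get_img_link(each):
    link = each.xpath("./a[1]/img/@data-original")
    if not link:                                  # no data-original on this img tag
        link = each.xpath("./a[1]/img/@src")      # fall back to the plain src attribute
    return link[0] if link else None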

3, Crawling approach

As mentioned above, there is far too much picture data to download one by one in a plain for loop, so we must use multiple threads or processes and hand the task queue over to a thread pool or process pool. In Python, multiprocessing.Pool and multiprocessing.dummy are both very easy to use:

  • multiprocessing.dummy module: the dummy module provides a thread pool (multithreading);
  • multiprocessing module: the multiprocessing module provides a process pool (multiprocessing);

The multiprocessing.dummy module and the multiprocessing module share the same API, so switching the code between threads and processes is easy, as the sketch below shows.
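For example, a minimal sketch of that swap (the squared() worker and the sample data are purely illustrative):

from multiprocessing.dummy import Pool     # thread pool
# from multiprocessing import Pool         # process pool: same API, only the import changes

def squared(x):                            # illustrative worker function
    return x * x

if __name__ == '__main__':                 # the guard is required for the real process Pool
    pool = Pool(4)
    print(pool.map(squared, [1, 2, 3, 4])) # [1, 4, 9, 16]
    pool.close()
    pool.join()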

Page url rule:

'http://www.bizhi88.com/s/470/1.html'  # page 1
'http://www.bizhi88.com/s/470/2.html'  # page 2
'http://www.bizhi88.com/s/470/3.html'  # page 3

Build url:

page = 'http://www.bizhi88.com/s/470/{}.html'.format(i)

Then we define two functions of our own: one to crawl and parse the page (spider) and one to download the data (download). We start the thread pool, use a for loop to build the 100 page URLs and store them in a list as the URL queue, and use the pool.map() method to run the spider over that queue;

def map(self, func, iterable, chunksize=None):
    '''
    Apply `func` to each element in `iterable`, collecting the results
    in a list that is returned.
    '''

Here we use: pool.map(spider, page)  # spider: crawler function; page: URL queue

Function: take each element of the list in turn as the function argument and submit it as a task to the pool (threads in our case, since we use multiprocessing.dummy);

Parameter 1: the function to execute;

Parameter 2: an iterable whose elements are passed into the function one by one as arguments;

4, Data acquisition

Import the required libraries

from lxml import etree                                  # HTML parsing
import requests                                         # HTTP requests
from multiprocessing.dummy import Pool as ThreadPool    # thread pool for concurrency
import time                                             # timing
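The spider function below sends a headers dictionary with every request, but it is not shown in the original code. A minimal sketch, assuming a plain browser User-Agent is enough for this site:

# Assumed request headers: a generic browser User-Agent (adjust as needed).
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}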

Page data parsing

def spider(url):
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # every wallpaper sits in its own div inside the flex-img container
    contents = selector.xpath("//div[@class='flex-img auto mt']/div")
    for each in contents:
        imgLink = each.xpath("./a[1]/img/@data-original")[0]
        name = each.xpath("./a[1]/img/@alt")[0]

        item = {}            # one dict per wallpaper
        item['Link'] = imgLink
        item['name'] = name
        download_pic(item)   # download the wallpaper right away

Download the pictures

def download_pic(contdict):
    name = contdict['name']
    link = contdict['Link']
    data = requests.get(link)                   # fetch the image bytes
    with open('img/' + name + '.jpg', 'wb') as f:
        f.write(data.content)
    print('Picture ' + name + ' downloaded successfully!')
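One assumption in download_pic is that an img/ folder already exists next to the script; otherwise the open() call fails. A small sketch to create it once before crawling:

import os

os.makedirs('img', exist_ok=True)   # create the output folder if it is not there yet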

The main() function

def main():
    pool = ThreadPool(6)                 # a thread pool with 6 worker threads
    page = []
    for i in range(1, 101):              # build the 100 page URLs
        newpage = 'http://www.bizhi88.com/s/470/{}.html'.format(i)
        page.append(newpage)
    result = pool.map(spider, page)      # run spider over the URL queue
    pool.close()                         # no new tasks can be submitted after this
    pool.join()                          # wait for all tasks to finish

Explanation:

  1. In the main function, we first create a thread pool with six threads;
  2. We dynamically build the 100 URLs with a for loop;
  3. We use the map() function to run the parse-and-store spider over the URL queue in the thread pool;
  4. pool.close() does not shut the pool down immediately; it only switches its state so that no new tasks can be submitted, and pool.join() then waits for the queued tasks to finish (see the sketch after this list);
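A tiny illustrative sketch (not part of the original script) of what point 4 means in practice: work that is already submitted still runs, but new submissions are rejected:

from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(2)
pool.map(print, [1, 2, 3])        # already-submitted work runs to completion
pool.close()                      # after close(), the pool accepts no new tasks...
pool.join()                       # ...and join() waits for the workers to finish
try:
    pool.apply_async(print, (4,))
except ValueError:                # submitting after close() raises ValueError
    print('pool is closed: no new tasks can be added')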

5, Program running

if __name__ == '__main__':
    start = time.time()   # start timing
    main()
    end = time.time()     # stop timing
    print(end - start)    # elapsed time

The results are as follows:

Of course, only part of the output is shown here; in total, more than 2,000 images were crawled.

6, Summary

This time we used multithreading to crawl the HD pictures of a wallpaper website. Downloading the data with plain synchronous requests is obviously slow, so we used multiple threads to download the pictures, which improved the crawling efficiency.
