1, Background introduction
Hello, I'm Pippi. Different kinds of data call for different capture methods: pictures, video, audio and text. Because this website hosts a huge number of images, today we use multithreading to collect the 4K HD wallpapers of a wallpaper site.
2, Page analysis
Target site:
http://www.bizhi88.com/3840x2160/
As shown in the figure, there are 278 pages in total. Here we crawl the first 100 pages of wallpapers and save the images locally.

Parse page

As shown in the figure, the wallpaper images all sit inside one large container (`<div class="flex-img auto mt"></div>`), and each child div corresponds to one high-definition wallpaper.
Each of these div tags holds the wallpaper's data: 1. the image link; 2. the name. The XPath expressions used to parse them are:
```python
imgLink = each.xpath("./a[1]/img/@data-original")[0]
name = each.xpath("./a[1]/img/@alt")[0]
```
One note:
The img tag has both a src attribute and a data-original attribute, and both point to the image URL. We generally use the latter: data-original is a custom attribute that holds the image's actual address (it is used for lazy loading), while src is only filled in once the page has fully loaded, so reading it too early may not give us the address.
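As a minimal sketch of this fallback logic (the markup below is a made-up example, not taken from bizhi88.com), one could prefer data-original and fall back to src:

```python
from lxml import etree

# Made-up snippet of a lazily loaded image, for illustration only
html = '<div><a><img src="placeholder.gif" data-original="http://example.com/wall.jpg" alt="demo"></a></div>'
img = etree.HTML(html).xpath("//img")[0]

# Prefer data-original (the real address), fall back to src if it is missing
url = img.get("data-original") or img.get("src")
print(url)  # http://example.com/wall.jpg
```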
3, Crawling approach
As mentioned above, there is far too much image data to download one picture at a time in a plain for loop, so we use multithreading or multiprocessing and hand the URL queue to a thread pool or process pool for processing. In Python, multiprocessing.Pool and multiprocessing.dummy are both very easy to use:
- multiprocessing.dummy module: provides a thread pool (multithreading);
- multiprocessing module: provides a process pool (multiprocessing);
The multiprocessing.dummy module exposes the same API as the multiprocessing module, so switching the code between threads and processes is trivial; a small sketch follows.
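A minimal sketch (not the author's code) showing that only the import changes when switching between the two pools:

```python
# Only the import changes; the pool API stays the same
from multiprocessing.dummy import Pool   # thread pool (good for I/O-bound work)
# from multiprocessing import Pool       # process pool (good for CPU-bound work)

def square(x):
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
```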
Page url rule:
```python
'http://www.bizhi88.com/s/470/1.html'  # page 1
'http://www.bizhi88.com/s/470/2.html'  # page 2
'http://www.bizhi88.com/s/470/3.html'  # page 3
```
Build url:
```python
page = 'http://www.bizhi88.com/s/470/{}.html'.format(i)
```
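Equivalently, as a small sketch, the whole URL queue for the first 100 pages can be built up front with a list comprehension:

```python
# Build the URL queue for the first 100 pages in one expression
pages = ['http://www.bizhi88.com/s/470/{}.html'.format(i) for i in range(1, 101)]
```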
Next, we define two functions: one to crawl and parse a page (spider) and one to download the image data (download). We then start the thread pool, build the 100 page URLs in a for loop, store them in a list as the URL queue, and use the pool.map() method to run the spider function over that queue;
```python
def map(self, fn, *iterables, timeout=None, chunksize=1):
    """Returns an iterator equivalent to map(fn, iter)"""
```
Here we use: `pool.map(spider, page)`  # spider: crawler function; page: URL queue
- Function: takes each element of the iterable as an argument, creates a task and submits it to the pool;
- Parameter 1: the function to execute;
- Parameter 2: an iterable whose elements are passed into the function one by one as its argument;
4, Data acquisition
Import related third-party libraries
```python
from lxml import etree                                  # parsing
import requests                                         # requests
from multiprocessing.dummy import Pool as ThreadPool    # concurrency
import time                                             # timing
```
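The spider function below also uses a headers variable; a minimal placeholder (the User-Agent string here is an assumption, not from the original post) could be:

```python
# Placeholder request headers; the User-Agent value is only an example
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
```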
Page data analysis
```python
def spider(url):
    html = requests.get(url, headers=headers)   # headers defined above
    selector = etree.HTML(html.text)
    contents = selector.xpath("//div[@class='flex-img auto mt']/div")
    item = {}
    for each in contents:
        imgLink = each.xpath("./a[1]/img/@data-original")[0]
        name = each.xpath("./a[1]/img/@alt")[0]
        item['Link'] = imgLink
        item['name'] = name
        download_pic(item)   # hand the parsed item to the download function below
```
Download the pictures
```python
def download_pic(contdict):
    name = contdict['name']
    link = contdict['Link']
    with open('img/' + name + '.jpg', 'wb') as f:
        data = requests.get(link)
        cont = data.content
        f.write(cont)
        print('Picture ' + name + ' downloaded successfully!')
```
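As an optional hardening sketch (not part of the original code; download_pic_safe is a made-up name), one might make sure the img/ directory exists, strip illegal file-name characters and add a request timeout:

```python
import os
import re
import requests

def download_pic_safe(contdict):
    # Hypothetical hardened variant; same dict keys as download_pic above
    os.makedirs('img', exist_ok=True)                        # create img/ if missing
    name = re.sub(r'[\\/:*?"<>|]', '_', contdict['name'])    # sanitize the file name
    resp = requests.get(contdict['Link'], timeout=10)        # avoid hanging requests
    with open('img/' + name + '.jpg', 'wb') as f:
        f.write(resp.content)
    print('Picture ' + name + ' downloaded successfully!')
```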
main() function
```python
def main():
    pool = ThreadPool(6)
    page = []
    for i in range(1, 101):
        newpage = 'http://www.bizhi88.com/s/470/{}.html'.format(i)
        page.append(newpage)
    result = pool.map(spider, page)
    pool.close()
    pool.join()
```
Explanation:
- In the main function, we first create a thread pool with six threads;
- Dynamically build 100 URLs through the for loop;
- The map() method hands the URLs to the thread pool, where each page is parsed by spider and its images are saved;
- pool.close() does not terminate the pool immediately; it only switches the pool into a state where no new tasks can be submitted, and pool.join() then waits for all submitted tasks to finish;
5, Program running
```python
if __name__ == '__main__':
    start = time.time()    # start timing
    main()
    end = time.time()      # stop timing
    print(end - start)     # elapsed time
```
The results are as follows:

Of course, only part of the output is shown here; in total more than 2,000 images were crawled.
6, Summary
This time, we used multithreading to crawl the HD images of a wallpaper website. Downloading the data with plain synchronous requests calls would obviously be slow, so we used a thread pool to download the images, which greatly improves crawling efficiency.
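To make the efficiency claim concrete, here is a toy benchmark sketch (not taken from the original post; slow_task merely simulates a slow network request) comparing a serial loop with a six-thread pool:

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

def slow_task(x):
    time.sleep(0.5)            # stand-in for one slow network request
    return x

if __name__ == '__main__':
    t0 = time.time()
    list(map(slow_task, range(12)))                  # serial: roughly 6 s
    print('serial:  %.2f s' % (time.time() - t0))

    t0 = time.time()
    with ThreadPool(6) as pool:
        pool.map(slow_task, range(12))               # 6 threads: roughly 1 s
    print('threads: %.2f s' % (time.time() - t0))
```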