Python multithreaded crawler: batch-downloading doutula meme images (note: continuously updated)


Crawling target: doutula (starting URL: http://www.doutula.com/photo/list/?page=1)

Crawled content: doutula emoticon images

Tools: the requests library for sending requests and receiving responses.

Data parsing, extraction, and cleaning with XPath (via lxml).

Multithreaded crawling based on the threading module.
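The XPath extraction step can be tried offline on a snippet of HTML shaped like doutula's list page. The markup below is a simplified assumption for illustration, not the live page:

```python
from lxml import etree

# Simplified stand-in for one doutula list page (assumed markup, not the live site)
sample = """
<div class="page-content text-center">
  <img class="img-responsive lazy" alt="hello!" data-original="http://img.example.com/a.jpg"/>
  <img class="gif" alt="skip me" data-original="http://img.example.com/b.gif"/>
</div>
"""

html = etree.HTML(sample)
# Same XPath as the crawler: every <img> whose class attribute is not exactly 'gif'
imgs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
for img in imgs:
    print(img.get("alt"), img.get("data-original"))
```

Only the first image matches; the one with class "gif" is filtered out by the predicate.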

Crawling results: (screenshot of the downloaded images omitted in this text version)

Rationale: a crawler spends most of its time waiting on network I/O and disk I/O, so multithreading lets those waits overlap and speeds up crawling considerably.
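The effect is easy to see with simulated I/O waits; here `time.sleep` stands in for a blocking network request (a toy benchmark, not the crawler itself):

```python
import threading
import time

def fake_request(results, i):
    # Simulate a blocking network/disk wait of 0.2 s
    time.sleep(0.2)
    results[i] = i

def run_threaded(n=5):
    results = [None] * n
    threads = [threading.Thread(target=fake_request, args=(results, i)) for i in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = run_threaded()
# Five 0.2 s waits overlap, so the wall time stays close to 0.2 s instead of 1.0 s
print(results, round(elapsed, 2))
```

This works because the waits release the GIL; for CPU-bound work, threads would not give this speedup.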

Design: the code is organized with object-oriented encapsulation and uses the producer-consumer model to schedule crawling and downloading across multiple threads.
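Stripped of the crawling details, the producer-consumer scheduling looks like this. This is a minimal sketch using queue.Queue, with a None sentinel telling the consumer when to stop:

```python
import threading
from queue import Queue

q = Queue()
collected = []

def producer():
    # Put work items on the queue, then a sentinel to signal completion
    for i in range(5):
        q.put(i * i)
    q.put(None)

def consumer():
    while True:
        item = q.get()
        if item is None:  # sentinel: the producer is finished
            break
        collected.append(item)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
print(collected)  # the consumer saw every produced item, in order
```

queue.Queue handles the locking internally, so producer and consumer never touch shared state directly.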

The full code follows (see the comments for details; questions and exchanges are welcome):

import os
import threading
import re
from queue import Queue

import requests
from urllib import request
from lxml import etree

# Global request headers shared by every thread
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
}


class Producer(threading.Thread):
    """
    Producer: pulls a list-page URL from the page queue, parses it,
    and puts each extracted (image URL, filename) pair into the image queue.
    """

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        # Initialize threading.Thread first, then attach the two queues
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            # Stop producing once the page queue is exhausted
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            self.parse_url(url)

    def parse_url(self, url):
        """
        Parse one list page and enqueue every image found on it.
        :param url: the list-page URL to parse
        """
        response = requests.get(url=url, headers=headers)
        html = etree.HTML(response.text)
        # lxml parses the HTML; the XPath query returns a list of Element objects
        url_gifs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
        for url_gif in url_gifs:
            # element.get(attribute_name) returns the attribute's value
            url_name = url_gif.get("alt")
            # Strip punctuation that is illegal or awkward in file names
            url_name = re.sub(r"[\!!\.\??]", "", url_name).strip()
            url_link = url_gif.get("data-original")
            # os.path.splitext() splits off the file extension of the URL
            url_suffix = os.path.splitext(url_link)[1]
            filename = url_name + url_suffix
            # put() accepts any object; here a (url, filename) tuple
            self.img_queue.put((url_link, filename))


class Consumer(threading.Thread):
    """
    Consumer: pulls (image URL, filename) pairs from the image queue
    and downloads each image to disk.
    """

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            # Best-effort exit: stop when both queues look empty
            if self.page_queue.empty() and self.img_queue.empty():
                break
            url, filename = self.img_queue.get()
            # urllib.request.urlretrieve downloads the URL straight to a file
            request.urlretrieve(url, os.path.join("GIF", filename))
            print(filename + " ------- download complete!")


def main():
    # page_queue holds the start URLs; img_queue holds (image URL, filename) pairs
    page_queue = Queue(100)    # maximum number of queued pages
    img_queue = Queue(1000)    # maximum number of queued images
    os.makedirs("GIF", exist_ok=True)  # make sure the download directory exists
    for i in range(1, 101):
        start_url = "http://www.doutula.com/photo/list/?page={}".format(i)
        page_queue.put(start_url)  # enqueue every start URL
    # Spawn producer and consumer threads: request, parse, extract, download
    for i in range(10):
        Producer(page_queue, img_queue).start()
    for i in range(10):
        Consumer(page_queue, img_queue).start()


if __name__ == '__main__':
    main()
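One caveat: the `empty()` checks in Consumer.run are racy. A consumer can see both queues momentarily empty while a producer is still parsing a page, and exit early. A race-free alternative is the sentinel pattern: enqueue one terminator per consumer once all producers are done. The sketch below simulates that shutdown with inline "producers" and fake (url, filename) pairs, so it runs without any network access:

```python
import threading
from queue import Queue

NUM_CONSUMERS = 3
img_queue = Queue()
downloaded = []

def consumer():
    while True:
        item = img_queue.get()
        if item is None:  # sentinel: all work is done
            break
        downloaded.append(item)  # in the real crawler: urlretrieve here

consumers = [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
for c in consumers:
    c.start()

# Producers would normally run in their own threads; simulated inline here
for i in range(10):
    img_queue.put(("http://example.com/img{}.jpg".format(i), "img{}.jpg".format(i)))

# One sentinel per consumer guarantees every consumer wakes up and exits
for _ in range(NUM_CONSUMERS):
    img_queue.put(None)
for c in consumers:
    c.join()

print(len(downloaded))  # every queued item was consumed exactly once
```

In the real crawler, the main thread would join the producer threads first, then enqueue the sentinels.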

Tags: Python, networking

Posted on Tue, 03 Dec 2019 00:52:36 -0500 by anita999