Python multithreaded crawler: batch-downloading doutula meme images (note: continuously updated)


Crawling target: doutula (starting URL: http://www.doutula.com/photo/list/?page=1)

Crawled content: doutula emoticon images

Tools: the requests library for sending requests and receiving responses.

Data parsing, extraction, and cleaning with XPath (via lxml).

Multithreaded crawling based on the threading module.
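The XPath extraction step can be tried offline on a snippet of HTML shaped like doutula's list page. The markup below is a simplified assumption for illustration, not the live page:

```python
from lxml import etree

# Simplified stand-in for one doutula list page (assumed markup, not the live site)
sample = """
<div class="page-content text-center">
  <img class="img-responsive lazy" alt="hello!" data-original="http://img.example.com/a.jpg"/>
  <img class="gif" alt="skip me" data-original="http://img.example.com/b.gif"/>
</div>
"""

html = etree.HTML(sample)
# Same XPath as the crawler: every <img> whose class attribute is not exactly 'gif'
imgs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
for img in imgs:
    print(img.get("alt"), img.get("data-original"))
```

Only the first image matches; the one with class "gif" is filtered out by the predicate.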

Crawling results: (screenshot of the downloaded images omitted in this text version)

Rationale: a crawler spends most of its time waiting on network I/O and disk I/O, so multithreading lets those waits overlap and speeds up crawling considerably.
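The effect is easy to see with simulated I/O waits; here `time.sleep` stands in for a blocking network request (a toy benchmark, not the crawler itself):

```python
import threading
import time

def fake_request(results, i):
    # Simulate a blocking network/disk wait of 0.2 s
    time.sleep(0.2)
    results[i] = i

def run_threaded(n=5):
    results = [None] * n
    threads = [threading.Thread(target=fake_request, args=(results, i)) for i in range(n)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = run_threaded()
# Five 0.2 s waits overlap, so the wall time stays close to 0.2 s instead of 1.0 s
print(results, round(elapsed, 2))
```

This works because the waits release the GIL; for CPU-bound work, threads would not give this speedup.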

Design: the code is organized with object-oriented encapsulation and uses the producer-consumer model to schedule crawling and downloading across multiple threads.
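Stripped of the crawling details, the producer-consumer scheduling looks like this. This is a minimal sketch using queue.Queue, with a None sentinel telling the consumer when to stop:

```python
import threading
from queue import Queue

q = Queue()
collected = []

def producer():
    # Put work items on the queue, then a sentinel to signal completion
    for i in range(5):
        q.put(i * i)
    q.put(None)

def consumer():
    while True:
        item = q.get()
        if item is None:  # sentinel: the producer is finished
            break
        collected.append(item)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
print(collected)  # the consumer saw every produced item, in order
```

queue.Queue handles the locking internally, so producer and consumer never touch shared state directly.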

The full code follows (see the comments for details; questions and exchanges are welcome):

import os
import threading
import re
from queue import Queue

import requests
from urllib import request
from lxml import etree

# Global request headers shared by every thread
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
}


class Producer(threading.Thread):
    """
    Producer: pulls a list-page URL from the page queue, parses it,
    and puts each extracted (image URL, filename) pair into the image queue.
    """

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        # Initialize threading.Thread first, then attach the two queues
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            # Stop producing once the page queue is exhausted
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            self.parse_url(url)

    def parse_url(self, url):
        """
        Parse one list page and enqueue every image found on it.
        :param url: the list-page URL to parse
        """
        response = requests.get(url=url, headers=headers)
        html = etree.HTML(response.text)
        # lxml parses the HTML; the XPath query returns a list of Element objects
        url_gifs = html.xpath("//div[@class='page-content text-center']//img[@class!='gif']")
        for url_gif in url_gifs:
            # element.get(attribute_name) returns the attribute's value
            url_name = url_gif.get("alt")
            # Strip punctuation that is illegal or awkward in file names
            url_name = re.sub(r"[\!!\.\??]", "", url_name).strip()
            url_link = url_gif.get("data-original")
            # os.path.splitext() splits off the file extension of the URL
            url_suffix = os.path.splitext(url_link)[1]
            filename = url_name + url_suffix
            # put() accepts any object; here a (url, filename) tuple
            self.img_queue.put((url_link, filename))


class Consumer(threading.Thread):
    """
    Consumer: pulls (image URL, filename) pairs from the image queue
    and downloads each image to disk.
    """

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            # Best-effort exit: stop when both queues look empty
            if self.page_queue.empty() and self.img_queue.empty():
                break
            url, filename = self.img_queue.get()
            # urllib.request.urlretrieve downloads the URL straight to a file
            request.urlretrieve(url, os.path.join("GIF", filename))
            print(filename + " ------- download complete!")


def main():
    # page_queue holds the start URLs; img_queue holds (image URL, filename) pairs
    page_queue = Queue(100)    # maximum number of queued pages
    img_queue = Queue(1000)    # maximum number of queued images
    os.makedirs("GIF", exist_ok=True)  # make sure the download directory exists
    for i in range(1, 101):
        start_url = "http://www.doutula.com/photo/list/?page={}".format(i)
        page_queue.put(start_url)  # enqueue every start URL
    # Spawn producer and consumer threads: request, parse, extract, download
    for i in range(10):
        Producer(page_queue, img_queue).start()
    for i in range(10):
        Consumer(page_queue, img_queue).start()


if __name__ == '__main__':
    main()
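One caveat: the `empty()` checks in Consumer.run are racy. A consumer can see both queues momentarily empty while a producer is still parsing a page, and exit early. A race-free alternative is the sentinel pattern: enqueue one terminator per consumer once all producers are done. The sketch below simulates that shutdown with inline "producers" and fake (url, filename) pairs, so it runs without any network access:

```python
import threading
from queue import Queue

NUM_CONSUMERS = 3
img_queue = Queue()
downloaded = []

def consumer():
    while True:
        item = img_queue.get()
        if item is None:  # sentinel: all work is done
            break
        downloaded.append(item)  # in the real crawler: urlretrieve here

consumers = [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
for c in consumers:
    c.start()

# Producers would normally run in their own threads; simulated inline here
for i in range(10):
    img_queue.put(("http://example.com/img{}.jpg".format(i), "img{}.jpg".format(i)))

# One sentinel per consumer guarantees every consumer wakes up and exits
for _ in range(NUM_CONSUMERS):
    img_queue.put(None)
for c in consumers:
    c.join()

print(len(downloaded))  # every queued item was consumed exactly once
```

In the real crawler, the main thread would join the producer threads first, then enqueue the sentinels.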

Tags: Python, networking

Posted on Tue, 03 Dec 2019 00:52:36 -0500 by anita999