Basic concepts and operations of asynchronous Crawlers

Catalog

1) Concepts:

2) Multi-threaded asynchronous Crawlers

Open separate threads for blocking operations

First

Second

Third

Full code:

3) Process Pool & Thread Pool

4) Multi-threaded and multi-process

5) Use aiohttp

1. Ordinary requests

2. Add Request Parameters

3. Customize User-Agent in Request Header

4. Custom cookies in request header

1) Concepts:

Crawling is an IO-intensive task. For example, if we use the requests library to crawl a site, after sending a request the program has to wait for the site to respond before it can continue. While waiting for the response, the whole crawler sits idle doing nothing (that is, it is blocked). Is there a way to optimize this situation?

Simply put: the difference from a synchronous crawler is that a synchronous crawler can only handle one request at a time, because each get call blocks until the response arrives; an asynchronous crawler can keep doing other work while it waits.
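For contrast, here is a minimal sketch of the blocking, synchronous style described above (the URLs are placeholders): each requests.get call holds up the entire program until its response arrives, so the total time is the sum of all the waits.

import requests
import time

urls = ['https://www.baidu.com/', 'https://www.sogou.com/']

start = time.time()
for url in urls:
    #requests.get blocks: nothing else runs until this response arrives
    response = requests.get(url)
    print(url, response.status_code)
print("Total time:", time.time() - start)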

2) Multi-threaded asynchronous Crawlers

Open separate threads for blocking operations
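The steps below use asyncio coroutines; as a quick illustration of the heading itself, here is a minimal sketch (not from the original code) that opens a separate thread for each blocking operation, so one slow download does not hold up the others.

import threading
import time

def get_page(url):
    print("Downloading:", url)
    time.sleep(2)          #stand-in for a blocking network request
    print("Download successful:", url)

urls = ['www.baidu.com', 'www.sogou.com', 'www.doubanjia.com']

#One thread per blocking operation
threads = [threading.Thread(target=get_page, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()               #wait for every thread to finish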

Step 1: Prefix def with the async keyword. Calling an async-decorated function does not run its body; it returns a coroutine object.

import asyncio
async def request1(url):
    print("Requesting url:", url)
    print("Request succeeded:", url)
    return url
#Calling an async-decorated function returns a coroutine object instead of running it
c = request1('https://www.baidu.com/')

Step 2: Run the coroutine. We have three options.

First

Create an event loop object   +   register the coroutine object in the loop and start the loop

#1. Create an event loop object
loop = asyncio.get_event_loop()

#2. Register the coroutine object in the loop and start the loop
loop.run_until_complete(c)

Second

Create an event loop object   +   create a task object from the coroutine via the loop   +   register the task object in the loop and start it

#task usage
loop = asyncio.get_event_loop()

#Create a task object from the coroutine via the loop
task = loop.create_task(c)
print(task)     #pending at this point

loop.run_until_complete(task)
print(task)     #finished, result available

Third

Create an event loop object   +   wrap the coroutine in a future object with asyncio.ensure_future   +   register the future object in the loop and start it

#future usage
loop = asyncio.get_event_loop()
future_task = asyncio.ensure_future(c)
print(future_task)      #pending at this point
loop.run_until_complete(future_task)
print(future_task)      #finished, result available
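Note: on Python 3.7 and later, asyncio.run() bundles all three variants into one call; it creates the loop, runs the coroutine to completion, and closes the loop. A minimal equivalent of the examples above:

import asyncio

async def request1(url):
    print("Requesting url:", url)
    print("Request succeeded:", url)
    return url

#asyncio.run creates an event loop, runs the coroutine, and closes the loop
asyncio.run(request1('https://www.baidu.com/'))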

Full code:

""" 
CSDN : heart_6662
PYTHON amateur   
"""

import asyncio
async def request1(url):
    print("Requesting url:", url)
    print("Request succeeded:", url)
    return url
#Calling an async-decorated function returns a coroutine object instead of running it
c = request1('https://www.baidu.com/')


def callback_func(task):
    #task.result() returns the return value of the coroutine wrapped in the task object
    print(task.result())

#Binding a callback
loop = asyncio.get_event_loop()
task = asyncio.ensure_future(c)

#Bind the callback function to the task object before running the loop
task.add_done_callback(callback_func)
loop.run_until_complete(task)

 

3) Process Pool & Thread Pool

Pools reduce the overhead of the system repeatedly creating and destroying processes or threads.

First, the single-threaded version:

#Single-threaded version
import time

def get_page(name):
    print("Downloading:", name)
    time.sleep(2)
    print("Download successful:", name)

name_list = ["aa", "cc", "bb", "dd"]

start_time = time.time()

#Download every name in the list, one after another
for name in name_list:
    get_page(name)

end_time = time.time()
print("%d seconds" % (end_time - start_time))

 

 

After switching to a thread pool:

""" 
CSDN : heart_6662
PYTHON amateur   
"""


#Using a thread pool
import time
from multiprocessing.dummy import Pool

start_time = time.time()

def get_page(name):
    print("Downloading:", name)
    time.sleep(2)
    print("Download successful:", name)

name_list = ["aa", "cc", "bb", "dd"]

#Instantiate a thread pool object with 4 worker threads
pool = Pool(4)
#map blocks until get_page has run for every item in name_list
pool.map(get_page, name_list)
end_time = time.time()
print("%d seconds" % (end_time - start_time))

Much faster: with four worker threads the four 2-second downloads overlap, so the total time drops from about 8 seconds to about 2.
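The section title also mentions a process pool, which the original code does not show. Here is a minimal sketch, assuming concurrent.futures.ProcessPoolExecutor, of the same job done with worker processes instead of threads (more useful when the work is CPU-bound rather than IO-bound):

from concurrent.futures import ProcessPoolExecutor
import time

def get_page(name):
    print("Downloading:", name)
    time.sleep(2)          #stand-in for the work being done
    print("Download successful:", name)

if __name__ == '__main__':   #guard required when spawning worker processes
    name_list = ["aa", "cc", "bb", "dd"]
    start_time = time.time()
    with ProcessPoolExecutor(max_workers=4) as pool:
        #map distributes the items across the 4 worker processes
        list(pool.map(get_page, name_list))
    print("%d seconds" % (time.time() - start_time))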

4) Multi-threaded and multi-process

1. Put multiple URLs into a container (the {} below creates a set)

2. A for loop turns each URL into a task, one by one

Note: you cannot call a blocking, synchronous function such as time.sleep(2) inside a multi-task coroutine;
blocking operations encountered in asyncio must be awaitable and explicitly suspended with await:

await asyncio.sleep(2)

""" 
CSDN : heart_6662
PYTHON amateur   
"""
import asyncio
import time
async def request1(url):
    print("Requesting url:", url)
    #You cannot use a blocking call such as time.sleep(2) in a multi-task coroutine:
    #it would stall the whole event loop. Blocking operations must be awaited instead.
    await asyncio.sleep(2)
    print("Request succeeded:", url)
    return url

start = time.time()
urls = {
    'www.baidu.com',
    'www.sogou.com',
    'www.doubanjia.com'
}
tasks=[]
for url in urls:
    c = request1(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

print(time.time()-start)   #roughly 2 seconds: the three awaits overlap instead of running back to back


 

5) Use aiohttp

Install the library (in cmd):

pip install aiohttp

1. Ordinary requests

import aiohttp
import asyncio


async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.csdn.net/') as response:
            print(await response.text())


loop = asyncio.get_event_loop()
tasks = [fetch(), ]
loop.run_until_complete(asyncio.wait(tasks))

2. Add Request Parameters

import aiohttp
import asyncio

params = {'name': 'zhangsan', 'age': 10}


async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get('https://www.baidu.com/s', params=params) as response:
            print(response.url)


loop = asyncio.get_event_loop()
tasks = [fetch(), ]
loop.run_until_complete(asyncio.wait(tasks))

3. Customize User-Agent in Request Header

import aiohttp
import asyncio

headers = {
            "User-Agent": "my-user-agent"
        }


async def fetch():
    async with aiohttp.ClientSession() as session:
        async with session.get('http://httpbin.org/user-agent', headers=headers) as response:
            print(await response.text())


loop = asyncio.get_event_loop()
tasks = [fetch(), ]
loop.run_until_complete(asyncio.wait(tasks))

4. Custom cookies in request header

import aiohttp  
import asyncio  
  
url = 'http://httpbin.org/cookies'  
cookies = {'cookies_name': 'test_cookies'}  
  
  
async def fetch():  
    async with aiohttp.ClientSession() as session:  
        async with session.get(url, cookies=cookies) as response:  
            print(await response.text())  
  
  
loop = asyncio.get_event_loop()  
tasks = [fetch(), ]  
loop.run_until_complete(asyncio.wait(tasks))
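Putting sections 4 and 5 together, here is a minimal sketch (the URLs are placeholders) that fetches several pages concurrently with a single ClientSession, replacing the asyncio.sleep stand-in with real aiohttp requests:

import aiohttp
import asyncio
import time

urls = [
    'https://www.baidu.com/',
    'https://www.csdn.net/',
    'http://httpbin.org/get',
]


async def fetch(session, url):
    async with session.get(url) as response:
        text = await response.text()
        print(url, "returned", len(text), "characters")


async def main():
    #One session is shared by all requests; each fetch runs as its own task
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch(session, url)) for url in urls]
        await asyncio.wait(tasks)


start = time.time()
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
print(time.time() - start)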
