Financial data acquisition (2): Take the multi-process parallel highway

Summary:

1: This article uses multiple processes to rework a simple US stock crawler program;

2: There are many ways to create a process. This article only uses the multiprocessing module's Process object to create processes. It also discusses efficiency and communication problems across multiple processes and shows the code.

3: The development environment is PyCharm, and all programs are based on Python 3.8;

4: As always, this article focuses on practical, hands-on content;
 

Catalog

1. Concepts

1.1 Last Review

1.2 Supplement to multi-process

1.2.1 The three states of a process: ready, running, and blocked

1.2.2 Synchronous and asynchronous, blocking and non-blocking

1.2.3 Disadvantages of multi-process

2. Rebuilding crawler programs based on multiple processes

2.1 About multiprocess efficiency

2.2 Inter-process communication issues

2.2.1 Manager

2.2.2 Queue and Pipe

3. Summary

1. Concepts

1.1 Last Review

Readers not interested in theory can skip the concepts and start from Part 2.

In the previous article, the author discussed in detail the principles and logic behind four concepts: multithreading, multiprocessing, concurrency, and parallelism, and used multithreading to rework a small US stock crawler program (link to the original: https://blog.csdn.net/simon1223z/article/details/120321212). As an analyst who once drove a Ferrari one-handed in his dreams, the author is naturally not satisfied with mere threaded pleasures.

As we know from the previous article, multiprocessing is also a powerful tool for improving program efficiency, and because of the GIL lock baked into Python's original design, multi-process parallelism that exploits multiple CPU cores can in theory be faster than multithreaded concurrency! However, the "pseudo" multithreading is not necessarily bad, and true multiprocessing is not necessarily perfect.

1.2 Supplement to multi-process

Although the previous article covered multithreading, its focus was on threads, and it only briefly mentioned that the memory consumed by multiple processes can hurt efficiency. In fact, multiprocessing has other drawbacks. (The author is not deliberately bashing anything here; multithreading is in fact still very useful, it is just that the last article said nothing about its advantages, so this one will naturally add a few.)

The author will illustrate with a vivid example, the household chores we all know from childhood: boiling water takes 5 minutes, mopping the floor takes 10 minutes, cooking rice takes 25 minutes, and cooking the dishes takes 10 minutes; how can we get them done more efficiently? Obviously we do not wait for the rice to be ready before doing anything else. The same is true for computer programs. This example is borrowed below to explain the three states of a process (ready, running, and blocked) and two other pairs of concepts: synchronous vs. asynchronous, and blocking vs. non-blocking.

1.2.1 The three states of a process: ready, running, and blocked

This is actually a very easy concept to grasp. If you think of boiling water and cooking as a program, it is ready before the power switch is pressed and running after the switch is pressed. When the power goes out and the boiling and cooking have to stop, the program is blocked. It is worth noting, however, that when a computer program comes out of the blocked state it must first return to the ready state and wait for the "power switch" to be pressed again before it can enter the running state; it cannot jump from blocked straight back to running, as shown in Figure 1.

Figure 1: Three states of a process

ps. The three-state concept is introduced here to explain blocking and non-blocking, which will be used later.

1.2.2 Synchronous and asynchronous, blocking and non-blocking

1): Synchronous and asynchronous are two ways in which a program is submitted for execution

Synchronous submission is the clumsiest way to handle the chores: while the rice cooks, wait until it is done before doing the next thing. In a program, this shows up as the program being stuck in place while a task runs.

Asynchronous submission means doing other things while the rice cooks. In a program, a task is submitted to the CPU, and instead of waiting for it to finish executing, the program goes on to submit other tasks. When the result of the first task comes back depends on when the asynchronous callback mechanism is triggered. (Readers interested in the conditions of the asynchronous callback mechanism can find extensions at the end of this article.)

Show the principle of synchronization in code:

import time

def task_1():
    a = 1
    b = 2
    c = a + b
    time.sleep(5)

def task_2(a):
    print(a)

if __name__ == '__main__':
    task_1()
    task_2("I am a synchronized submission task")

The main program contains two tasks, task_1 and task_2. When the main program runs, task_2 must wait for task_1 to complete before it can execute; that is, nothing else gets done during the 5 seconds that task_1 sleeps. Most of the code we usually write executes exactly like this: a typical synchronous submission.
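For contrast, here is a minimal sketch, not from the original article, of an asynchronous submission using concurrent.futures with a callback that fires when the task finishes; task_1 and on_done are illustrative names:

import time
from concurrent.futures import ThreadPoolExecutor

def task_1():
    time.sleep(5)                    # simulate a slow task
    return "I am an asynchronously submitted task"

def on_done(future):                 # callback triggered when the task completes
    print(future.result())

if __name__ == '__main__':
    with ThreadPoolExecutor() as pool:
        f = pool.submit(task_1)      # submit and return immediately
        f.add_done_callback(on_done) # the result is handled whenever it arrives
        print("The main program keeps going without waiting for task_1")

The last print runs immediately, long before task_1 wakes up; the callback then prints the result about 5 seconds later.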

2): Blocking and non-blocking are two states of program execution

Blocking corresponds to the blocked state among the three process states; non-blocking corresponds to the ready and running states.
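As a minimal sketch, not from the original article, of the difference: join() blocks the caller until the child process finishes, while polling is_alive() keeps the caller non-blocking; slow_task is an illustrative name:

import time
from multiprocessing import Process

def slow_task():
    time.sleep(3)          # simulate cooking the rice

if __name__ == '__main__':
    p = Process(target=slow_task)
    p.start()
    # Blocking: p.join() here would park the main process until slow_task ends
    # Non-blocking: keep the main process in the ready/running states and just check in
    while p.is_alive():
        print("main process does other housework...")
        time.sleep(1)
    print("slow_task finished, main process continues")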

1.2.3 Disadvantages of multi-process

After understanding the four concepts above, consider: Which combination of the above concepts is most effective?

              Synchronous                  Asynchronous
Non-blocking  Synchronous & Non-blocking   Asynchronous & Non-blocking
Blocking      Synchronous & Blocking       Asynchronous & Blocking

Obviously: asynchronous > synchronous and non-blocking > blocking, so asynchronous & non-blocking is the most efficient.

ps: We should try to keep code asynchronous and non-blocking, but this is hard to achieve in practice; much of the software on the market cannot even run in a completely non-blocking state.

Since asynchronous, non-blocking programs are the most efficient, let's use the housework example together with code to see how the time can be compressed.

Import the necessary modules first:

import datetime,time
import threading

1): Housework done by synchronous submission

def bio_water():
    a = 5
    time.sleep(a)
    print("The water boiled")
    
def mop():
    a = 10
    time.sleep(a)
    print("The floor is wiped clean")
    
def rice():
    a = 25
    time.sleep(a)
    print("The rice is ready")

def cooking():
    a = 10
    time.sleep(a)
    print("The rice is ready")

if __name__ == '__main__':
    start = datetime.datetime.now() 
    bio_water()
    mop()
    rice()
    cooking()
    end = datetime.datetime.now() 
    print("Housework time", end - start)

Run results:

All the household chores above are non-blocking; excluding the small overhead of creating variables and allocating memory, it takes 50 minutes to finish everything. (ps: real housework takes too long, so below seconds are used to represent minutes)

2): Asynchronous submission

Since multiprocessing has not been introduced yet, the author will use last article's multithreading to fake an asynchronous submission:

## Create 4 threads
t1 = threading.Thread(target = bio_water, name="Boiling water")
t2 = threading.Thread(target = mop, name="Wipe the floor")
t3 = threading.Thread(target = rice, name="Cook rice")
t4 = threading.Thread(target = cooking, name="cook")

start = datetime.datetime.now() 
Pool = [t1,t2,t3,t4]
for t in Pool:
    t.start()
for t in Pool:
    t.join()
end = datetime.datetime.now() 
print("Housework time", end - start)

Run results:

All the household chores above are non-blocking; excluding the small overhead of creating variables and allocating memory, it takes 25 minutes to finish everything.

Those 25 minutes are the 25 minutes of cooking the rice. In fact, they cannot be compressed any further, whether by a multithreaded run, a multithreaded asynchronous non-blocking run, a multi-process asynchronous non-blocking parallel run, or even by buying a second rice cooker and calling Mom and Dad in to cook together (assuming the cookers are identical and cooking still takes 25 minutes): it still takes 25 minutes.

The author thinks these are the biggest drawbacks that multiprocessing cannot overcome: creating processes is not as simple and direct as creating the same number of threads, and it costs both time and memory; and for tasks like housework, program design alone cannot compress the time any further; how far the run time can be compressed depends entirely on the single task that takes the longest.

However, for tasks like crawlers, multiprocessing can be very useful. Next, the author will use multiple processes to modify the previous crawler program and see whether the time can be compressed further and efficiency improved.

2. Rebuilding crawler programs based on multiple processes

Multiprocessing is implemented in Python through the multiprocessing module, and the code for creating processes is very similar to that for creating threads; essentially only the names differ.

from multiprocessing import Process

You can see that the main parameters of the Process object in this module are similar to those of multithreading.

class multiprocessing.Process(group=None, target=None, name=None, args=(), kwargs={}, *, daemon=None)

Subclassing multiprocessing.Process can also create processes. Reference documents are available at the end of this article.
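Although it is not part of the author's crawler, a minimal hedged sketch of the subclassing approach might look like this (CrawTask and the example URLs are made-up names):

from multiprocessing import Process

class CrawTask(Process):                 # inherit from Process
    def __init__(self, urls):
        super().__init__()
        self.urls = urls

    def run(self):                       # run() is what start() executes in the child process
        for url in self.urls:
            print(self.name, "would crawl", url)

if __name__ == '__main__':
    t = CrawTask(["https://example.com/page1", "https://example.com/page2"])
    t.start()
    t.join()

With that noted, recall the original single-process crawler from the previous article: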

import requests
import re
import datetime # Only for testing program run time
 
start_org = datetime.datetime.now() 
stock_list=[]
#pages={"NQ":164,"SP500":26,"DJ":2}
page=[164,26,2]
a = 0
for i in range(len(page)):
    markets=i+1
    market_list=[]
    for i2 in range(int(page[i])):
        pages=i2+1
        url="https://stock.finance.sina.com.cn/usstock/api/jsonp.php/IO.XSRV2.CallbackList['fTqwo9s8$wLka1yh']/US_CategoryService.getChengfen?page="+str(pages)+"&num=20&sort=&asc=0&market=&id=&type="+ str(markets)
        #The real web address Sina uses to turn pages is hidden in js
        response = requests.get(url).text
        symbol = re.findall('"symbol":"(.*?)","cname"',response,re.S)
        market_list.extend(symbol)
        print("Crawled{}".format(symbol))
    stock_list.append(market_list)
 
stock_list
end_org = datetime.datetime.now()
print("Execute program time",end_org-start_org)
(The code above is from the author's previous article: https://blog.csdn.net/simon1223z/article/details/120321212)

Now wrap the crawling code in a function and add two processes:

def craw(*data): 
    url = data[0]
    stock_list = data[1]
    for i in url:
        response = requests.get(i).text
        symbol = re.findall('"symbol":"(.*?)","cname"',response,re.S)
        stock_list.extend(symbol)
        print("Crawled{}".format(symbol))
    print("Crawl complete, %d records crawled in total" % len(stock_list))
    return stock_list

start_org = datetime.datetime.now() 
page, url_list, stock_list = [164,26,2], [], []
a = 0

for i in range(len(page)):
    markets=i+1
    for i2 in range(int(page[i])):
        pages=i2+1
        url="https://stock.finance.sina.com.cn/usstock/api/jsonp.php/IO.XSRV2.CallbackList['fTqwo9s8$wLka1yh']/US_CategoryService.getChengfen?page="+str(pages)+"&num=20&sort=&asc=0&market=&id=&type="+ str(markets)
        url_list.append(url)
        
t1 = multiprocessing.Process(target = craw, args=(url_list[:len(url_list)//2], stock_list), name="task1")
t2 = multiprocessing.Process(target = craw, args=(url_list[len(url_list)//2:], stock_list), name="task2")
t1.start()
t2.start()
t1.join()
t2.join()
print("Total List",stock_list, len(stock_list),id(stock_list))

Run it, and instead of proper results it falls flat on its face:

Did the pipe break??

This is a point where it is easy to trip up. One of the biggest differences between creating processes and creating threads is that under Windows, processes must be created inside main. Because processes do not interfere with each other, think of the app-cloning feature on Android phones: the clone and the main app actually reuse the same code, and the clone is loaded in a way similar to importing a module. Likewise, on Windows each child process re-imports the script; guarding process creation with if __name__ == '__main__': lets the child import the code like a module instead of re-running the process-creating statements and falling into an endless loop.

Simply put, just remember: under Windows, multiple processes must be created under if __name__ == '__main__':.

import datetime, multiprocessing, requests, re

page, url_list = [164, 26, 2], []
a = 0
for i in range(len(page)):
    markets = i + 1
    for i2 in range(int(page[i])):
        pages = i2 + 1
        url = "https://stock.finance.sina.com.cn/usstock/api/jsonp.php/IO.XSRV2.CallbackList['fTqwo9s8$wLka1yh']/US_CategoryService.getChengfen?page=" + str(
            pages) + "&num=20&sort=&asc=0&market=&id=&type=" + str(markets)
        url_list.append(url)


def craw(*data):  
    url = data[0]
    stock_list = data[1]
    for i in url:
        response = requests.get(i).text
        symbol = re.findall('"symbol":"(.*?)","cname"', response, re.S)
        stock_list.extend(symbol)
        print("Crawled{}".format(symbol))
    print("Crawl complete, total crawl%d Bar data" % len(stock_list))
    return stock_list


if __name__ == '__main__':
    stock_list = []
    start = datetime.datetime.now()
    t1 = multiprocessing.Process(target=craw, args=(url_list[:len(url_list) // 2], stock_list), name="task1")
    t2 = multiprocessing.Process(target=craw, args=(url_list[len(url_list) // 2:], stock_list), name="task2")
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print("Total List", stock_list, len(stock_list), id(stock_list))
    end = datetime.datetime.now()
    print("Dual Process Execution Time", end - start)

Run:

Keep in mind that last time multithreading took only 41 seconds; allowing for network fluctuations and other factors, two threads can be considered roughly as fast as two processes. There is also a more serious problem: the crawled data never ends up in stock_list. Let's deal with these two issues one at a time, starting with efficiency.

2.1 About multiprocess efficiency

Remember that last time the author created five threads in one breath, and the result was only 1 second faster than two threads. This time the author pulls the same trick and creates four processes:

 

* 34 seconds! This beats the speed of 5 threads! Encouraged, the author keeps going and spins up 10 processes in one breath!

The following constructs a process pool of ten processes:

if __name__ == '__main__':
    start = datetime.datetime.now()
    devisor, process_pool, stock_list = len(url_list) // 10, [], []
    for i in range(0, 10):
        t1 = multiprocessing.Process(target=craw, args=(url_list[devisor * i:devisor * (i + 1)], stock_list),
                                     name="task{}".format(i))
        process_pool.append(t1)
        t1.start()
    for i in process_pool:
        i.join()

    print("Total List", stock_list, len(stock_list), id(stock_list))
    end = datetime.datetime.now()
    print("Ten process execution times", end - start)

Run:

Well, it is the same anticlimax as two threads versus five threads last time.

In practice, multi-process performance differs from machine to machine and depends largely on the computer itself. Open the Task Manager and resource monitor and you can see that my computer has only a 4-core CPU and has been running a pile of applications continuously for 8 days. On a higher-end machine with more cores, multi-process performance should be much stronger.
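If you want to check your own machine's core count before piling on processes, a quick sketch:

import multiprocessing, os

print("CPU cores reported by multiprocessing:", multiprocessing.cpu_count())
print("CPU cores reported by os:", os.cpu_count())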

To deal a ton of damage to this already overwhelmed computer, the author cranked the process count straight up to 100, and the result was a surprise:

The computer hardly stuttered at all, and the total time dropped to 20 seconds.

*The author tried 1 to 200 processes. Within what the computer could bear, the number of processes and crawl efficiency were positively correlated, but as the number of processes grows the marginal gain shrinks, and once the computer's load limit is exceeded the marginal gain turns negative. Because everyone's computer, network conditions and so on differ greatly, the author wrote a piece of code that runs everything from a single process up to 200 processes and finally exports a csv, from which a scatter plot can be generated. Interested readers can try it themselves; the code takes a long time to run, so this article will not further expand on the relationship between efficiency and process count.

import datetime, multiprocessing, requests, re, time
import pandas as pd

page, url_list = [164, 26, 2], []
a = 0

for i in range(len(page)):
    markets = i + 1
    for i2 in range(int(page[i])):
        pages = i2 + 1
        url = "https://stock.finance.sina.com.cn/usstock/api/jsonp.php/IO.XSRV2.CallbackList['fTqwo9s8$wLka1yh']/US_CategoryService.getChengfen?page=" + str(
            pages) + "&num=20&sort=&asc=0&market=&id=&type=" + str(markets)
        url_list.append(url)


def craw(*data): 
    url = data[0]
    stock_list = data[1]
    for i in url:
        response = requests.get(i).text
        symbol = re.findall('"symbol":"(.*?)","cname"', response, re.S)
        stock_list.extend(symbol)
        print("Crawled{}".format(symbol))


if __name__ == '__main__':
    num , timer = [], []
    for a in range(1,3):  # change to range(1, 201) to sweep from a single process up to 200
        start = time.time()
        devisor, process_pool, stock_list = len(url_list) // a, [], []
        for i in range(0, a):
            t1 = multiprocessing.Process(target=craw, args=(url_list[devisor * i:devisor * (i + 1)], stock_list),
                                     name="task{}".format(i))
            process_pool.append(t1)
            t1.start()
        for i in process_pool:
            i.join()
        num.append(a)
    # print("Total List", stock_list, len(stock_list), id(stock_list))
        end = time.time()
        print(a,"Process Execution Time", end - start)
        t = end - start
        timer.append(t)
    time = {"number":num, "run_time":timer }
    df = pd.DataFrame(time)
    df.to_csv("C:/Users/Administrator/Desktop/program_data.csv")

2.2 Inter-process communication issues

*The serious problem mentioned above, that the crawled data never ends up in stock_list, is actually caused by the communication problem between processes. Processes are isolated from each other and do not affect each other when they run. Threads are different: threads are more like workers on the same assembly line, sharing data and memory, while processes are more like individuals on two different assembly lines who have no influence on each other.

When the factory owner orders two assembly lines, each line only produces according to its own process, which is exactly what happens in the code above. As shown in Figure 2, there is effectively a "firewall" between processes (only a figurative analogy by the author; no such concept exists for real processes). When the owner hands stock_list to the two processes, each process only works on its own copy, and the processed stock_lists do not interfere with each other. This is similar to a local variable inside a function: the processes I created did append to stock_list, but since the appending happened inside the child processes, the print() that follows still shows an empty list, and this isolation cannot be solved even by declaring the variable global.

Figure 2: Inter-process firewall

Currently, there are three main ways to achieve inter-process communication: 1) the queue, Queue; 2) the pipe, Pipe; 3) Managers.

Queue and Pipe only pass data between processes and do not share it, while Managers allow one process to change another process's data.

2.2.1 Manager

Manager() is also an object under multiprocessing, and its power lies in the variety of data it can share: dictionaries, lists, and Array arrays can all be shared.

import multiprocessing, requests, re, time

page, url_list = [164, 26, 2], []
a = 0

for i in range(len(page)):
    markets = i + 1
    for i2 in range(int(page[i])):
        pages = i2 + 1
        url = "https://stock.finance.sina.com.cn/usstock/api/jsonp.php/IO.XSRV2.CallbackList['fTqwo9s8$wLka1yh']/US_CategoryService.getChengfen?page=" + str(
            pages) + "&num=20&sort=&asc=0&market=&id=&type=" + str(markets)
        url_list.append(url)


def craw(*data):  
    url = data[0]
    sl = data[1] # List shared by processes
    stock_list = []

    for i in url:
        response = requests.get(i).text
        symbol = re.findall('"symbol":"(.*?)","cname"', response, re.S)
        stock_list.extend(symbol)
        print("Crawled{}".format(symbol))
    sl.append(stock_list)  # Add to Shared List
    #print("Crawl complete, this process crawls%d pieces of data"%len(stock_list)))
    return stock_list


if __name__ == '__main__':
    m = multiprocessing.Manager()
    sl = m.list()  # Instantiate and share as a list

    for a in range(4,5):  # Only 4 processes are used as an example; changing the range here controls the number of processes
        start = time.time()
        devisor, process_pool, stock_list = len(url_list) // a, [], []
        for i in range(0, a):
            t1 = multiprocessing.Process(target=craw, args=(url_list[devisor * i:devisor * (i + 1)],  sl),
                                     name="task{}".format(i)) 
            process_pool.append(t1)
            t1.start()
        for i in process_pool:
            i.join()

    print("Total List", sl)

        

The most important key above is

    m = multiprocessing.Manager()
    sl = m.list() 

It creates a shared list, which is then simply passed to each process so that each one can append its processed data to it: simple, practical, and powerful.
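Since Manager can also share dictionaries, here is a minimal hedged sketch, unrelated to the crawler, of several processes updating one shared dict (count and the "hits" key are illustrative names):

from multiprocessing import Manager, Process

def count(shared_dict, key):
    # the update happens in the child process but is visible to the parent
    shared_dict[key] = shared_dict.get(key, 0) + 1

if __name__ == '__main__':
    m = Manager()
    d = m.dict()                     # shared dictionary
    ps = [Process(target=count, args=(d, "hits")) for _ in range(4)]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    # usually prints {'hits': 4}; without a lock, concurrent updates can occasionally be lost
    print(dict(d))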

2.2.2 Queue and Pipe

Simply put, a queue acts like a product warehouse: when a process finishes processing data, it stores the new data in the queue, and the data is taken out at the end when it is needed. Unlike Manager, the sharing offered by queues and pipes is limited, and Queue.get() is required to retrieve the processed data. Everything else works much like Manager, so here is a code demonstration of Queue; Pipe is not covered in detail (a minimal sketch is appended at the end of this subsection).

import multiprocessing, requests, re, time
from multiprocessing import Queue

page, url_list = [164, 26, 2], []
a = 0

for i in range(len(page)):
    markets = i + 1
    for i2 in range(int(page[i])):
        pages = i2 + 1
        url = "https://stock.finance.sina.com.cn/usstock/api/jsonp.php/IO.XSRV2.CallbackList['fTqwo9s8$wLka1yh']/US_CategoryService.getChengfen?page=" + str(
            pages) + "&num=20&sort=&asc=0&market=&id=&type=" + str(markets)
        url_list.append(url)


def craw(*data):  
    url = data[0]
    stock_list = []
    q = data[1]  # Queue passed in from the main process
    for i in url:
        response = requests.get(i).text
        symbol = re.findall('"symbol":"(.*?)","cname"', response, re.S)
        stock_list.extend(symbol)
        print("Crawled{}".format(symbol))
    q.put(stock_list)
    #print("Crawl complete, this process crawls%d pieces of data"%len(stock_list)))
    return stock_list


if __name__ == '__main__':
    q = Queue()
    num , timer = [], []
    for a in range(4,5):
        start = time.time()
        devisor, process_pool, stock_list = len(url_list) // a, [], []
        for i in range(0, a):
            t1 = multiprocessing.Process(target=craw, args = (url_list[devisor * i:devisor * (i + 1)],  q),
                                     name="task{}".format(i))
            process_pool.append(t1)
            t1.start()
        for i in process_pool:
            i.join()

    print("Total List", q.get())

Additionally, Queue can also be used to communicate between multiple threads; last time the author crudely used global variables instead, so communication between threads was not discussed.
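Pipe was not demonstrated above; for completeness, here is a minimal hedged sketch of two processes exchanging data through a pipe (worker and the sample symbols are illustrative):

from multiprocessing import Pipe, Process

def worker(conn):
    conn.send(["AAPL", "TSLA"])   # pretend these are crawled symbols
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    print(parent_conn.recv())     # receives the list sent by the child process
    p.join()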

3. Summary

From the roughly 2 minutes of the original single thread to today's 20 seconds with more than 100 processes, the result is quite satisfying.

Multithreading and multiprocessing are big topics with a wide range of application scenarios. For most programs, they really do help improve efficiency. For my own research, multithreading and multiprocessing are the foundation for acquiring more financial data efficiently. But even the cleverest cook cannot make a meal without rice: the next thing to think about is how to manage this data effectively.

Written here before dawn. Creation is not easy; passing friends, your appreciation is welcome.

If you don't give up, we'll work together.

Further reading

Introduction and example of Python multiprocessing.Manager (sharing data between processes) - weixin_38170065's blog, CSDN: https://blog.csdn.net/weixin_38170065/article/details/99895724

Is multithreading bound to increase processing speed? - Jianshu: https://www.jianshu.com/p/616d074214d2

[python] multiprocessing in detail - the Process module (1) - brucewong0516's blog, CSDN: https://blog.csdn.net/brucewong0516/article/details/85776194

One article to finish Python multi-process (full) - Zhihu: https://zhuanlan.zhihu.com/p/64702600

Tags: Python crawler
