Python multithreading and multiprocessing

Recently, while working on a project, I needed to process a large amount of data, so I immediately thought of using multithreading or multiprocessing to speed things up. However, multithreading and multiprocessing behave differently in Python than in many other languages, so their use cases differ as well.

The difference between multithreading and multiprocessing

Remember the classic textbook introduction: "a process is the smallest unit of resource allocation, and a thread is the smallest unit of CPU scheduling". Put simply, threads share their process's memory, so multithreading has low memory overhead, cheap scheduling, and high CPU utilization; each process has its own memory, so multiprocessing has higher memory overhead and more expensive scheduling. With this in mind, programmers coming from C++ or Java will choose multithreading or multiprocessing according to the scenario, but following the same rule in Python can lead to problems.
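For reference, the threading and multiprocessing standard-library APIs deliberately mirror each other, so the choice between them is about behavior, not syntax. A minimal sketch:

from threading import Thread
from multiprocessing import Process

def work(name):
    print("working in", name)

if __name__ == "__main__":
    t = Thread(target=work, args=("a thread",))    # shares the parent's memory
    p = Process(target=work, args=("a process",))  # gets its own memory space
    t.start(); p.start()
    t.join(); p.join()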

Python multithreading

Python is an interpreted language. To keep the interpreter's internal state thread-safe, CPython uses a GIL (Global Interpreter Lock). Each process has one GIL, and if the process has multiple threads, they compete for the lock in order to execute. This means the threads do not run in parallel: only one thread executes Python bytecode at a time, which is why many people say Python multithreading is "fake".
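A minimal sketch (not from the original post) that makes this visible: a CPU-bound counting task takes roughly as long with two threads as it does sequentially, because the threads merely take turns holding the GIL.

import time
from threading import Thread

def count(n):
    # pure CPU work: decrement a counter to zero
    while n > 0:
        n -= 1

N = 10_000_000

start = time.time()
count(N)
count(N)
print("sequential:", time.time() - start)

start = time.time()
t1 = Thread(target=count, args=(N,))
t2 = Thread(target=count, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print("two threads:", time.time() - start)  # about the same, sometimes slower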

Then why implement it at all if it is fake? Because when a task uses little CPU and spends most of its time waiting on something else, multithreading can still provide a speedup, for example IO operations or network requests. A typical case is a crawler: most of its time goes into HTTP requests, which are slow and frequently time out. While a thread waits on a request, it gives up the CPU so other threads can make use of it, as the sketch below shows.
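A minimal sketch of this pattern, assuming the requests library and placeholder URLs:

from concurrent.futures import ThreadPoolExecutor
import requests

urls = ["https://example.com"] * 10  # placeholder URLs

def fetch(url):
    try:
        # the GIL is released while waiting on the network
        return requests.get(url, timeout=5).status_code
    except requests.RequestException:
        return None  # timeouts and failed requests are common, as noted above

with ThreadPoolExecutor(max_workers=8) as executor:
    print(list(executor.map(fetch, urls)))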

Scenario 1: multi-process pywsgi

By default, pywsgi runs in a single process and will block when the request volume is high. If the server has multiple cores, you can start multiple processes to raise concurrency and make full use of the multi-core CPU:

from multiprocessing import Process, cpu_count

from flask import Flask
from gevent import pywsgi

def serve_forever(server):
    # each worker process accepts connections on the shared listening socket
    server.start_accepting()
    server._stop_event.wait()

app = Flask(__name__, static_folder='app', static_url_path='')
server = pywsgi.WSGIServer(('0.0.0.0', ASK_SERVICE_PORT), app)
server.start()  # create the listening socket before forking so the workers share it
for i in range(cpu_count() * 2 + 1):
    Process(target=serve_forever, args=(server,)).start()

The number of processes is set to cpu_count() * 2 + 1, a common rule of thumb for worker counts.

Scenario 2: asynchronous execution of an API interface

Some API requests in the project take a long time to execute. If the result is only returned after the work finishes, the caller is blocked. Instead, the interface can be made asynchronous: after receiving the parameters, put them on a queue or in Redis, and let a background thread or process consume the queue and execute the task.

For example, the submission interface: after receiving a request, it puts the data into Redis and returns immediately.

def submit_material_by_url(self, url, tree_name, node_id):
    # enqueue the payload in Redis and return without blocking the caller
    data = {
        "url": url,
        "tree_name": tree_name,
        "node_id": node_id
    }
    self.data_redis.put_url(json.dumps(data))
    return {'err': ErrorCode.SUCCESS}

Start a worker thread to run the download task.

# worker thread: download and process data for each submitted URL
self.process = Thread(target=self._process, name="process_submit_material_by_url")
self.process.start()


def _process(self):
    while True:
        data = self.data_redis.get_url()
        try:
            if data is None:
                continue
            data_dict = json.loads(data)
            url = data_dict["url"]
            tree_name = data_dict["tree_name"]
            node_id = data_dict["node_id"]
            self.query_skill_tree_view.submit_material_by_url(url, tree_name, node_id)
        except Exception:
            logger.error(traceback.format_exc())
            # on failure, put the payload back on the queue and retry later
            if data is not None:
                self.data_redis.put_url(data)
            time.sleep(2)
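The data_redis helper (put_url/get_url) is not shown in the post. A minimal sketch of what it might look like, assuming redis-py with a Redis list as the queue; the class name, connection details, and key are hypothetical:

import redis

class DataRedis:
    def __init__(self, host="localhost", port=6379, key="material_urls"):
        self.client = redis.Redis(host=host, port=port)
        self.key = key

    def put_url(self, data):
        # push the JSON payload onto the left end of the list
        self.client.lpush(self.key, data)

    def get_url(self):
        # blocking pop from the right end; returns None on timeout,
        # which matches the `if data is None: continue` check above
        item = self.client.brpop(self.key, timeout=1)
        return item[1].decode("utf-8") if item else None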

Scenario 3: multi-process word segmentation

When segmenting a large amount of text, a single process is slow and often needs to run for a very long time. Since word segmentation is CPU-bound, multiple processes can be used to speed it up.

Here, the map operation of a process pool is used, and the segmented results are written to a file.

def _cut_data(self, processes=4):
    """Segment the corpus with a pool of worker processes."""
    data_list = []
    with open(self._blog_data_selected_path) as file:
        for line in file:
            data_list.append(line.strip().lower())

    # distribute the lines across the worker processes
    pool = Pool(processes)
    new_data = pool.map(self._simple_cut, data_list)
    pool.close()
    pool.join()

    with open(self._blog_data_selected_seg_path, "w") as file:
        file.writelines([line + "\n" for line in new_data])

    return True
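_simple_cut is not shown in the post. A minimal sketch, assuming the jieba word-segmentation library and space-joined tokens as output:

import jieba

def _simple_cut(self, line):
    # jieba.cut returns a generator of tokens; join them with spaces
    return " ".join(jieba.cut(line))

Note that Pool.map sends the callable to the worker processes by pickling it, so when passing a bound method like self._simple_cut, the instance itself must be picklable.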
