Python crawler multithreading

Multiple processes

Applications running on a system, such as a browser or PyCharm, can all run at the same time. Roughly speaking, each running application is a process, and several running applications are several processes. That is why, when the computer freezes, you can still open the task manager and close whichever application is consuming too many resources. Early computers had a single CPU, which can execute only one process at any moment while the other processes wait. Multiple processes appear to run simultaneously because the CPU switches between them at high speed. On a multi-core machine, several cores can genuinely execute several tasks at the same time.

Multithreading

The CPU executes a process through its threads: a thread is the unit of execution inside a process, and one process can contain multiple threads. As an analogy, WeChat is a process and each chat window is a thread. In Python (more precisely in standard CPython, because of its global interpreter lock), only one thread executes Python code at any given moment. For this reason Python multithreading is sometimes called pseudo multithreading: the threads take turns rather than running truly in parallel. When several threads share data, thread locks are used to solve the resulting resource competition.
Multithreading still pays off for a crawler because it makes full use of waiting time. For example, thread 1 sends a request to url1. While it waits for the response, thread 2 can send a request to url2, and while that one waits, thread 3 can send a request to url3. By the time the response from url1 arrives, thread 1 can carry on with its next step. Doing other work during the waiting time maximizes crawling efficiency.
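The overlap of waiting times described above can be sketched as follows. This is only an illustration, not a real crawler: fake_request and crawl_concurrently are hypothetical names, and time.sleep stands in for the time spent waiting on a network response.

```python
import threading
import time


def fake_request(url, delay, results):
    """Stand-in for a network request: sleep instead of doing real I/O."""
    time.sleep(delay)  # the thread is idle here, so other threads can run
    results[url] = "response from " + url


def crawl_concurrently(urls, delay=0.3):
    """Start one thread per URL and return (results, elapsed_seconds)."""
    results = {}
    threads = [
        threading.Thread(target=fake_request, args=(url, delay, results))
        for url in urls
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every simulated request has finished
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = crawl_concurrently(["url1", "url2", "url3"])
    print(sorted(results))
    print("elapsed: %.2f s" % elapsed)
```

Because the three waits overlap, the elapsed time should be close to one delay rather than the sum of three.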

Multithreading creation

Create by function

The Thread class in the threading module takes a parameter named target, which receives a function object. Pass the function name without parentheses, because parentheses would call the function immediately instead of handing it to the thread. Put everything the thread should do into that function, pass it as target, and call start() to start the thread.
Note: first define the function that holds the work for the thread, then pass it to the Thread class in the threading module.
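When the target function needs arguments, they can be supplied through the args and kwargs parameters of Thread. The sketch below assumes a hypothetical download function that only prints; a real crawler would send a request there.

```python
import threading


def download(url, retries=1):
    # Hypothetical task: a real crawler would send the request here.
    print("downloading %s (retries=%d)" % (url, retries))


def start_downloads(urls):
    """Create and start one thread per URL, then wait for all of them."""
    threads = []
    for url in urls:
        # target receives the function object itself: download, not download()
        t = threading.Thread(target=download, args=(url,), kwargs={"retries": 3})
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return threads


if __name__ == "__main__":
    start_downloads(["url1", "url2"])
```

Note the trailing comma in args=(url,): a single argument still has to be passed as a tuple.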

Create by class

Define a custom class that inherits from the parent class threading.Thread and overrides the run() method. Calling start() on an instance then executes run() in a new thread.
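A subclass can also carry its own data. The following sketch assumes a hypothetical SquareThread class that receives a number in its constructor and stores the result on the instance, which is one common way to get a value back from a thread.

```python
import threading


class SquareThread(threading.Thread):
    """Hypothetical worker that squares a number in its own thread."""

    def __init__(self, number):
        super().__init__()  # required so that Thread can set itself up
        self.number = number
        self.result = None

    def run(self):
        # start() arranges for run() to be called in the new thread
        self.result = self.number * self.number


def square_in_thread(number):
    worker = SquareThread(number)
    worker.start()
    worker.join()  # wait for run() to finish before reading the result
    return worker.result


if __name__ == "__main__":
    print(square_in_thread(7))
```

If __init__ is overridden, forgetting the super().__init__() call makes start() fail, so it is worth keeping as the first line.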

import threading
import time


# 1. Create multiple threads through a function
def demo1():
    # The work the thread will do
    print("Child thread!")


if __name__ == '__main__':
    # To start several threads, use a for loop
    for i in range(7):
        # Create a Thread object t and pass demo1 as its target.
        # This only stores the function; no thread is running yet.
        t = threading.Thread(target=demo1)
        # start() creates the thread and starts it
        t.start()

# Signature of Thread.__init__, for reference:
#   def __init__(self, group=None, target=None, name=None,
#                args=(), kwargs=None, *, daemon=None)
#   group:  reserved, leave as None
#   target: the function the thread will run
#   name:   the name of the thread
#   args:   positional arguments for target, as a tuple
#   kwargs: keyword arguments for target, as a dict


# 2. Create a thread through a class
# MyThread inherits from the parent class threading.Thread
class MyThread(threading.Thread):
    # Override the run method
    def run(self):
        for i in range(5):
            print("This is a child thread!")


if __name__ == '__main__':
    m = MyThread()  # Instantiate the class
    m.start()       # start() executes run() in a new thread


# Small case
def test():
    for i in range(4):
        print("Child thread")
        # time.sleep(1)
        # Without a forced wait, the moment at which the main
        # thread prints 111 is not fixed


if __name__ == '__main__':
    t = threading.Thread(target=test)
    t.start()
    print("111")
    # If test() were called as a normal function, 111 would be printed last.
    # With a thread, the two run side by side: for example, one "Child thread"
    # may be printed, then 111, then the remaining "Child thread" lines.

Summary: in an ordinary single-threaded program, the program ends when the last line of the main code has run. With multithreading, a sub thread keeps running regardless of whether the main thread has reached its last statement (print("111") in the example), and the program does not exit until the sub thread (the print("Child thread") loop) has finished. In other words, the main thread waits for its child threads to finish before exiting. This holds for ordinary, non-daemon threads, which are the default.
If the main thread must continue only after the sub thread has finished, either add time.sleep before the main thread's code to force a wait, or call t.join(). time.sleep only works if the chosen delay happens to be long enough, whereas t.join() blocks the main thread until the sub thread has finished, however long that takes.

import threading
import time


# Small case: run the sub thread first and then the main thread
def test():
    for i in range(4):
        print("Child thread")
        # time.sleep(1)


if __name__ == '__main__':
    t = threading.Thread(target=test)
    t.start()
    # The first method: force a wait with time.sleep
    time.sleep(3)
    # The second method: use join (either method alone is enough)
    t.join()
    print("111")
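With several sub threads, a common pattern is to start all of them first and join them afterwards. The sketch below uses hypothetical worker and run_all functions and records events in a list, so the order can be inspected.

```python
import threading
import time


def worker(index, log):
    time.sleep(0.05 * index)  # pretend each task takes a different time
    log.append("child %d done" % index)


def run_all(count=3):
    log = []
    threads = [
        threading.Thread(target=worker, args=(i, log)) for i in range(count)
    ]
    for t in threads:
        t.start()  # start everything first ...
    for t in threads:
        t.join()   # ... then wait for each thread in turn
    log.append("main continues")
    return log


if __name__ == "__main__":
    for line in run_all():
        print(line)
```

Calling join() inside the same loop as start() would make each thread finish before the next one starts, which removes the benefit of multithreading.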

View the number of threads

The main thread always exists. In addition to it, we create multiple sub threads. How can we check how many threads there are? Let's review the built-in enumerate() first.

# enumerate() review
test_lst = ['xxx', 'yyy', 'zzz']

# Two ways to read list data: 1. by element only, 2. by index and element
for i in test_lst:
    print(i)

for i in enumerate(test_lst):
    print(type(i), i)  # Each item is a tuple containing the index and the element

for index, i in enumerate(test_lst):
    print(index, i)  # Unpacked into two values: the index and the element

threading.enumerate() returns a list of all threads that are currently alive, including the main thread. Threads that have already finished are not in the list. Despite the similar name, it is a different function from the built-in enumerate(). The main thread usually waits until the sub threads have run to the end before exiting.
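As a small sketch of how the list changes over time, the code below keeps two sub threads alive with a threading.Event so that the counts are predictable. The function names and the worker-N thread names are chosen for illustration only; threading.active_count() returns the same number as len(threading.enumerate()).

```python
import threading


def wait_for_signal(stop_event):
    stop_event.wait()  # stay alive until the main thread sets the event


def snapshot_threads(count=2):
    stop_event = threading.Event()
    workers = [
        threading.Thread(
            target=wait_for_signal, args=(stop_event,), name="worker-%d" % i
        )
        for i in range(count)
    ]
    before = threading.active_count()
    for t in workers:
        t.start()
    names = [t.name for t in threading.enumerate()]
    during = threading.active_count()
    stop_event.set()  # let the workers finish
    for t in workers:
        t.join()
    after = threading.active_count()
    return before, during, after, names


if __name__ == "__main__":
    print(snapshot_threads())
```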

import threading
import time


# threading.enumerate() returns the list of threads that are still alive.
def test1():
    for i in range(5):
        print("demo1--%d" % i)


def test2():
    for i in range(5):
        print("demo2--%d" % i)


if __name__ == '__main__':
    t1 = threading.Thread(target=test1)
    t2 = threading.Thread(target=test2)
    t1.start()
    t2.start()
    print(threading.enumerate())
    # Returns between one and three live threads (the main thread is always
    # included), depending on how quickly the child threads finish

The same program with a forced wait in each sub thread:

import threading
import time


def test1():
    for i in range(5):
        time.sleep(1)
        print("demo1--%d" % i)


def test2():
    for i in range(5):
        time.sleep(1)
        print("demo2--%d" % i)


if __name__ == '__main__':
    t1 = threading.Thread(target=test1)
    t2 = threading.Thread(target=test2)
    t1.start()
    t2.start()
    print(threading.enumerate())
    # Returns the three threads currently alive: t1 and t2 are still in their
    # forced wait, so the main thread reaches this line first

Now change the program slightly so that it exits when only one thread is left.

import threading
import time


def test1():
    for i in range(8):
        time.sleep(1)
        print("demo1--%d" % i)


def test2():
    for i in range(5):
        time.sleep(1)
        print("demo2--%d" % i)


if __name__ == '__main__':
    t1 = threading.Thread(target=test1)
    t2 = threading.Thread(target=test2)
    t1.start()
    t2.start()
    while True:
        print(threading.enumerate())
        time.sleep(1)
        if len(threading.enumerate()) <= 1:
            break
    # Keep looping and exit when only one thread (the main thread) is left.
    # demo1 runs 8 times and demo2 runs 5 times.
    # At first three threads are running. After demo2 prints 4, its thread
    # has finished and only two threads are left.
    # After demo1 prints 7, its thread has finished too, leaving only the
    # main thread.

threading.Thread() only creates a Thread object that stores the function. It is start() that actually creates and starts the thread.

import threading
import time


def test1():
    for i in range(3):
        time.sleep(1)
        print("demo1--%d" % i)


def test2():
    for i in range(3):
        time.sleep(1)
        print("demo2--%d" % i)


if __name__ == '__main__':
    print('front', threading.enumerate())  # 1 thread
    t1 = threading.Thread(target=test1)  # This only passes in the function
    print('in', threading.enumerate())  # 1 thread
    t2 = threading.Thread(target=test2)
    t2.start()  # Create and start a thread
    print('after', threading.enumerate())  # 2 threads
    # t1 is never started, so it never appears in the list.
    # Only after start() is the thread really created and started
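The same point can be checked on a single Thread object with is_alive() and the ident attribute. The following sketch uses a hypothetical lifecycle function and an Event to hold the thread open while it is inspected.

```python
import threading


def lifecycle():
    release = threading.Event()
    t = threading.Thread(target=release.wait)

    alive_before = t.is_alive()  # object created, thread not started yet
    ident_before = t.ident       # no thread identifier before start()

    t.start()
    alive_running = t.is_alive()  # the thread is blocked in release.wait()

    release.set()  # let the thread finish
    t.join()
    alive_after = t.is_alive()

    return alive_before, ident_before, alive_running, alive_after


if __name__ == "__main__":
    print(lifecycle())
```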

How multithreading works

Before any sub thread is created and started, only the main thread is running. Starting a sub thread is like a company hiring a new employee and assigning them a task: the sub thread goes off to do its work while the main thread carries on with its own. The two run at the same time without getting in each other's way. If another sub thread is added, all three run at the same time, each doing its own job.

Resource competition between threads

Use a thread lock to solve the problem of resource competition between threads.
Review local and global variables

a = 10  # global variable


def fn():
    # global a
    # If a is declared global here, "outside (1)" still prints 10, and both
    # the inside print and the final outside print show 99
    a = 99  # local variable
    print("Inside the function, a is %d" % a)  # 99


print("Outside the function (1), a is %d" % a)  # 10
fn()
print("Outside the function, a is %d" % a)  # 10

Multithreading without resource contention.

import threading

num = 100


def demo1():
    global num
    num += 1
    print("demo1--%d" % num)


def demo2():
    print("demo2--%d" % num)


def main():
    t1 = threading.Thread(target=demo1)  # Pass the function in
    t2 = threading.Thread(target=demo2)
    t1.start()  # Create and start a thread
    t2.start()
    print("main--%d" % num)


if __name__ == '__main__':
    main()

# The result is normally 101 for all three. num is a global variable: t1 adds
# 1 to it, giving 101, and t2 and the main thread then read the updated value.
# t2 and the main thread only read num. It cannot be ruled out that one of
# them runs before t1 has updated num, but the probability is very small.

Now pass the number of iterations into the function as an argument. When that number is large enough, resource competition can occur.

import threading
import time

num = 0


def demo1(nums):
    global num
    for i in range(nums):
        num += 1
    print("demo1--%d" % num)


def demo2(nums):
    global num
    for i in range(nums):
        num += 1
    print("demo2--%d" % num)


def main():
    # args must be a tuple; with a single item, add a trailing comma
    t1 = threading.Thread(target=demo1, args=(1000000,))  # Pass the function in
    t2 = threading.Thread(target=demo2, args=(1000000,))
    t1.start()  # Create and start a thread
    t2.start()
    time.sleep(3)
    print("main--%d" % num)


if __name__ == '__main__':
    main()

# With 10000 passed in: demo1--10000, demo2--20000, main--20000.
# With 1000000 passed in, the three values can change from run to run and the
# total can be less than 2000000: this is resource competition. Which thread
# gets the CPU at a given moment is not under our control. With a small value
# the loops are short and rarely interrupt each other; with many iterations
# the threads are more likely to interrupt each other in the middle of an
# update. Whether this is actually observed may depend on the Python version
# and on timing.

In the example above, num starts at 0. Suppose demo1 gets the CPU first: it reads num, adds 1, and assigns the result back, so num is 1. It then reads num again (1) and adds 1, but before it can assign the result back, demo2 gets the CPU. demo2 also reads num as 1, adds 1, and assigns 2 back to num. When demo1 resumes, it assigns its own result, which is also 2. Three +1 operations were performed, but only two took effect and one was lost. This kind of competition during the run makes the final result smaller than expected.
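Because a real race depends on timing, it is hard to reproduce on demand. The sketch below forces the bad interleaving with a threading.Barrier so that the lost update always happens. It is a constructed illustration of the mechanism, not what the interpreter does internally, and the names unsafe_increment and run_lost_update are hypothetical.

```python
import threading

counter = 0
barrier = threading.Barrier(2)  # makes both threads pause at the same point


def unsafe_increment():
    global counter
    value = counter      # step 1: read the shared variable
    barrier.wait()       # both threads have now read the same old value
    counter = value + 1  # step 2: write back, overwriting the other thread


def run_lost_update():
    global counter
    counter = 0
    threads = [threading.Thread(target=unsafe_increment) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_lost_update())  # two increments were attempted
```

Both threads read 0 before either writes, so both write 1: two increments were performed but the counter only advanced by one.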
This is the point at which a thread lock is needed. One workaround is to call time.sleep(3) after t1.start() so that demo1 finishes before demo2 starts. But if demo1 needs less than 3 seconds, the rest of the wait is wasted, and if you wait only 1 second and demo1 has not finished by then, the problem is not solved. A lock solves this properly. threading.Lock() creates a lock object; among its methods, acquire() locks and release() unlocks.

import threading
import time

num = 0
# A Lock can only be locked once at a time
lock = threading.Lock()


def demo1(nums):
    global num
    # Lock
    lock.acquire()
    for i in range(nums):
        num += 1
    # Unlock
    lock.release()
    print("demo1--%d" % num)


def demo2(nums):
    global num
    # Lock
    lock.acquire()
    for i in range(nums):
        num += 1
    # Unlock
    lock.release()
    print("demo2--%d" % num)


def main():
    # args must be a tuple; with a single item, add a trailing comma
    t1 = threading.Thread(target=demo1, args=(1000000,))  # Pass the function in
    t2 = threading.Thread(target=demo2, args=(1000000,))
    t1.start()  # Create and start a thread
    t2.start()
    time.sleep(3)
    print("main--%d" % num)


if __name__ == '__main__':
    main()
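A lock can also be used as a context manager with the with statement, which calls acquire() on entry and release() on exit, even if the body raises an exception. The following is a sketch of that variant with hypothetical add and run_with_lock functions; unlike the code above it locks around each single increment rather than around the whole loop.

```python
import threading

num = 0
lock = threading.Lock()


def add(times):
    global num
    for _ in range(times):
        with lock:  # acquire() on entry, release() on exit
            num += 1


def run_with_lock(times=100000):
    global num
    num = 0
    threads = [threading.Thread(target=add, args=(times,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # join instead of a fixed time.sleep
    return num


if __name__ == "__main__":
    print(run_with_lock())
```

Locking per increment lets the two threads interleave but costs more lock operations; locking the whole loop is cheaper but makes the threads run one after the other.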

Put the lock around the code where resource competition may occur. Every acquire must be matched by exactly one release. A Lock cannot be acquired a second time while it is still held: just as there is no point in locking a door that is already locked, the second acquire() blocks and the program hangs at that line, waiting for a release that never comes.
An RLock (reentrant lock) can be acquired several times in a row by the same thread, as long as it is released the same number of times.
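The difference between the two lock types can be checked without hanging the program by making the second acquire non-blocking. This is a minimal sketch; try_second_acquire is a hypothetical helper.

```python
import threading


def try_second_acquire(lock):
    """Acquire once, then try again without blocking; report the second try."""
    lock.acquire()
    second = lock.acquire(blocking=False)  # returns at once instead of hanging
    if second:
        lock.release()  # undo the second acquire
    lock.release()      # undo the first acquire
    return bool(second)


if __name__ == "__main__":
    print("Lock: ", try_second_acquire(threading.Lock()))
    print("RLock:", try_second_acquire(threading.RLock()))
```

With a plain Lock the second attempt fails, while an RLock lets the same thread in again.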

import threading
import time

num = 0
# An RLock can be acquired several times, and must be released
# the same number of times
rlock = threading.RLock()


def demo1(nums):
    global num
    # Lock
    rlock.acquire()
    rlock.acquire()
    for i in range(nums):
        num += 1
    # Unlock
    rlock.release()
    rlock.release()
    print("demo1--%d" % num)


def demo2(nums):
    global num
    # Lock
    rlock.acquire()
    rlock.acquire()
    rlock.acquire()
    for i in range(nums):
        num += 1
    # Unlock
    rlock.release()
    rlock.release()
    rlock.release()
    print("demo2--%d" % num)


def main():
    # args must be a tuple; with a single item, add a trailing comma
    t1 = threading.Thread(target=demo1, args=(1000000,))  # Pass the function in
    t2 = threading.Thread(target=demo2, args=(1000000,))
    t1.start()  # Create and start a thread
    t2.start()
    time.sleep(3)
    print("main--%d" % num)


if __name__ == '__main__':
    main()

Whether you use Lock or RLock, problems occur if the number of acquire calls does not match the number of release calls. Place the lock where resource competition may occur.
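One situation where repeated acquisition arises naturally is a locked function calling another locked function. The sketch below assumes hypothetical add_one and add_many functions that share one RLock; using with keeps the acquire and release counts matched automatically.

```python
import threading

rlock = threading.RLock()
total = 0


def add_one():
    global total
    with rlock:  # may already be held by the calling thread
        total += 1


def add_many(count):
    with rlock:  # first acquire
        for _ in range(count):
            add_one()  # nested acquire by the same thread


def run_nested(count=1000):
    global total
    total = 0
    threads = [
        threading.Thread(target=add_many, args=(count,)) for _ in range(2)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(run_nested())
```

With a plain Lock in place of the RLock, add_one would block forever on its first call from add_many.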

Thread queue

Feature: first in, first out (FIFO)

from queue import Queue

"""
empty(): Determine whether the queue is empty
full():  Determine whether the queue is full
get():   Fetch data from the queue
put():   Put data into the queue
"""
# Instantiate the object, and then you can use its methods
q = Queue()

# empty() returns a Boolean: True if the queue is empty, False if it is not.
# If it is not empty, you can use get to take values from the queue
print(q.empty())  # True

# full() returns True if the queue is full and False if it is not.
# If the queue is not full, you can use put to add data to it
print(q.full())  # False

Now add some data to test it.

from queue import Queue, Empty

# With no capacity given, the queue is unbounded (limited only by memory)
# q = Queue()
# You can specify the capacity during initialization
q = Queue(3)  # Capacity is 3
print(q.empty())
print(q.full())
q.put(1)
q.put(2)
q.put(3)
print('*' * 50)
print(q.empty())
print(q.full())
# With no capacity specified and only 1 added: the queue is empty and not
# full before the put, and neither empty nor full after it.
# The result is True, False, False, False.
# With a capacity of 3 and 1, 2 and 3 added, the queue reaches its limit.
# The result is True, False, False, True.
# Putting more into a full queue blocks the program at that line.
# q.put(4, timeout=2) raises queue.Full after 2 seconds;
# q.put_nowait(4) raises queue.Full immediately.

print(q.get())
print(q.get())
print(q.get())
try:
    print(q.get(timeout=2))
except Empty:
    print("queue.Empty raised after 2 seconds")
# First in, first out. Taking one more item than the queue holds blocks the
# program at that line; q.get(timeout=2) raises queue.Empty after 2 seconds.
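For a crawler, a queue is typically used to hand URLs to a fixed number of worker threads. The following is a minimal sketch of that pattern under stated assumptions: fetch is a hypothetical stand-in that does no real network access, and None is used as a made-up signal telling a worker to stop.

```python
import threading
from queue import Queue


def fetch(url):
    # Hypothetical stand-in for a real HTTP request.
    return "content of " + url


def worker(tasks, results):
    while True:
        url = tasks.get()  # blocks until an item is available
        if url is None:    # stop signal: no more work for this thread
            tasks.task_done()
            break
        results.put((url, fetch(url)))
        tasks.task_done()


def crawl(urls, worker_count=3):
    tasks, results = Queue(), Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(worker_count)
    ]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)  # one stop signal per worker
    tasks.join()         # wait until every item has been marked done
    for t in threads:
        t.join()
    return dict(results.get() for _ in urls)


if __name__ == "__main__":
    print(crawl(["url1", "url2", "url3", "url4"]))
```

Queue does its own locking internally, so the workers need no explicit Lock around get() and put().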