Anaconda ships with its own Dask out of the box. For the Chinese version of this article, see https://www.heywhale.com/mw/project/610c8f40fe727700176ae461
Dask has four available schedulers:

- threaded: a scheduler backed by a thread pool
- processes: a scheduler backed by a process pool
- single-threaded (also known as "sync"): a synchronous scheduler, useful for debugging
- distributed: a distributed scheduler for executing task graphs on multiple machines. Distributed is recommended, and it is the only one we will use later.
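The first three schedulers can be selected per call via the `scheduler=` keyword of `compute()`. A minimal sketch (it uses `dask.delayed`, covered later, just to have something to run):

```python
import dask

@dask.delayed
def inc(x):
    return x + 1

# A lazy sum of inc(0)..inc(4); nothing runs until compute() is called
total = dask.delayed(sum)([inc(i) for i in range(5)])

# Pick a scheduler per call; scheduler="processes" works the same way
print(total.compute(scheduler="threads"))      # thread pool
print(total.compute(scheduler="synchronous"))  # single-threaded, for debugging
```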
1. Basic environment configuration
- Configure Dask on every node of the cluster. Anaconda already includes Dask; if it is missing, install it with pip. Note that the Python environments must be consistent across nodes; a Docker image is one way to keep them identical.
- On any machine, install the latest Anaconda, then install dask-labextension (via pip, or find it on the JupyterLab extensions page) and run jupyter lab build.
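The steps above might look like the following in a shell (package names are the standard ones on PyPI; adjust versions to your setup):

```shell
# On every node: install dask with all optional dependencies,
# plus the distributed scheduler
pip install "dask[complete]" distributed

# On the machine used for monitoring: the JupyterLab extension
pip install dask-labextension
jupyter lab build
```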
2. Distributed computing environment dask.distributed
2.1 Creating a cluster on a single machine
```python
from dask.distributed import Client, LocalCluster

# Simplest form: by default one worker is created per CPU core
c = Client()

# Parameters can also be specified explicitly
c = Client(LocalCluster(n_workers=3, threads_per_worker=7, processes=True))
```
2.2 distributed creation method
If you want to use the distributed scheduler, run dask-scheduler on the master node; you will see output similar to the following:
```
$ dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://192.168.10.100:8786
distributed.scheduler - INFO -        bokeh at:                    :8787
distributed.scheduler - INFO - Local Directory: /tmp/scheduler-wydqn90b
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Receive client connection: Client-237debe6-cd07-11e8-9edd-a0c589feaf42
```
Then run dask-worker tcp://192.168.10.100:8786 on each worker node.
Finally, connect to the master node from any machine:

```python
from dask.distributed import Client

client = Client('tcp://192.168.10.100:8786')  # or tcp://localhost:8786 on the master itself
```
3. Data types
The main collections are dask.array, dask.dataframe, and dask.bag.
3.1 Arrays
Apart from the linalg routines, dask.array implements most of the NumPy API.
```python
import dask.array as da

x = da.random.uniform(low=0, high=10, size=(10000, 10000),  # normal NumPy code
                      chunks=(1000, 1000))  # break into chunks of size 1000x1000
y = x + x.T - x.mean(axis=0)  # use normal syntax for high-level algorithms
```
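A smaller, runnable variant that can be checked against plain NumPy (the shapes and values here are chosen arbitrarily for illustration):

```python
import numpy as np
import dask.array as da

x = da.arange(10, chunks=4)   # three chunks of 4, 4, and 2 elements
y = (x ** 2).sum()            # lazy: nothing is computed yet
result = y.compute()          # triggers the actual computation
print(result)  # 285
assert result == (np.arange(10) ** 2).sum()
```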
3.2 DataFrames
Similar to pandas:
```python
import dask.dataframe as dd

df = dd.read_csv('2018-*-*.csv', parse_dates=['timestamp'],  # normal pandas code
                 blocksize=64000000)  # break text into 64MB chunks
s = df.groupby('name').balance.mean()  # use normal syntax for high-level algorithms
```
3.3 Bags / lists
Dask Bag can read data from files line by line; the .take method then returns a specified number of elements for inspection.
Dask Bag implements operations such as map, filter, fold, and groupby, executing them in parallel over Python iterators with a small memory footprint. It is similar to a parallel version of PyToolz or a Pythonic version of the PySpark RDD. Here is an example:
```python
import json
import dask.bag as db

b = db.read_text('*.json').map(json.loads)
total = (b.filter(lambda d: d['name'] == 'Alice')
          .map(lambda d: d['balance'])
          .sum())
total.compute()
```
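A self-contained variant using from_sequence (the records are invented), which also demonstrates .take:

```python
import dask.bag as db

records = [{"name": "Alice", "balance": 100},
           {"name": "Bob", "balance": 200},
           {"name": "Alice", "balance": 300}]
b = db.from_sequence(records, npartitions=2)
print(b.take(2))  # peek at the first two records

total = (b.filter(lambda d: d["name"] == "Alice")
          .map(lambda d: d["balance"])
          .sum())
print(total.compute())  # 400
```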
4. Calculation process
4.1 The future interface
Similar to futures in asyncio, there are several important methods:
```python
client.scatter(data)       # distribute data to the workers
client.map(func, args)     # map a function over many inputs (cf. starmap)
client.submit(func, arg)   # submit a single task
client.gather(futures)     # collect results
future.result()            # result of a single future
```
Execution starts asynchronously in the background as soon as submit is called; results can then be retrieved with the gather method or a future's result() method.
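A minimal end-to-end sketch on a local in-process cluster (processes=False avoids spawning worker processes, which keeps the example easy to run anywhere):

```python
from dask.distributed import Client

def square(x):
    return x * x

client = Client(processes=False)          # in-process local cluster
futures = client.map(square, range(5))    # tasks start running immediately
single = client.submit(square, 10)        # a single asynchronous task
results = client.gather(futures)          # block until all are done
val = single.result()                     # block on one future
print(results, val)  # [0, 1, 4, 9, 16] 100
client.close()
```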
4.2 The delayed interface
Evaluation is deferred, somewhat like a generator. First import delayed from dask; there are two modes of use:
- Use the @delayed decorator
- Use delayed(function)
```python
from dask import delayed

def inc(x):
    return x + 1

def double(x):
    return x * 2

def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]
output = []
for x in data:
    a = delayed(inc)(x)
    b = delayed(double)(x)
    c = delayed(add)(a, b)
    output.append(c)

total = delayed(sum)(output)
total.compute()  # 50
```
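The same pipeline can also be written with the decorator form, which makes every call to the function lazy:

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def double(x):
    return x * 2

@delayed
def add(x, y):
    return x + y

# Each call returns a lazy Delayed object instead of running immediately
output = [add(inc(x), double(x)) for x in [1, 2, 3, 4, 5]]
total = delayed(sum)(output)
print(total.compute())  # 50
```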
4.3 Some benchmark results
For the task below, the overhead is very large. Let's evaluate the performance:
On a single machine, regardless of configuration, the run takes roughly 8 min 30 s.
A two-machine cluster with dask.delayed takes about 6 min.
A three-machine cluster with dask.delayed takes about 5 min 20 s.
Pulling the data takes about 3 min 30 s whether running on a single machine or distributed. Subtracting this part, the remaining computation time with 3, 2, and 1 machines is roughly 1.83, 2.5, and 5 min respectively, which is close to linear scaling.
dask.future was also tested on the three-machine cluster, taking about 6 min. In addition, multi-process execution on a single machine was tested: 443 s with 8 processes and 482 s with 24 processes.