Big Data Series 14: Introduction to Dask

Anaconda ships with Dask out of the box.

Dask has four available schedulers:
· threaded: a scheduler backed by a thread pool
· processes: a scheduler backed by a process pool
· single-threaded (also known as "sync"): a synchronous scheduler, useful for debugging
· distributed: a distributed scheduler for executing task graphs across multiple machines. This is the recommended option, and the only one used later in this article.

1. Basic environment configuration

  • Configure Dask on every machine in the cluster. Anaconda ships with Dask; if it is missing, install it with pip. Note that the Python environment must be consistent across machines; a Docker image is a convenient way to keep it identical.
  • Install the latest Anaconda on any machine, then install dask-labextension (via pip install, or by searching for it in the JupyterLab extensions page), and finally run jupyter lab build.

2. Distributed computing environment dask.distributed

2.1 Creating a client on a single machine

from dask.distributed import Client, LocalCluster
# The simplest way: by default, one worker is created per CPU core
c = Client()
# Parameters can also be specified explicitly
c = Client(LocalCluster(n_workers=3, threads_per_worker=7, processes=True))

2.2 Creating a distributed cluster

To use the distributed scheduler, run dask-scheduler on the master node; you will see output similar to the following:

$ dask-scheduler       # Create scheduler

distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://
distributed.scheduler - INFO -       bokeh at:                     :8787
distributed.scheduler - INFO - Local Directory:    /tmp/scheduler-wydqn90b
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Receive client connection: Client-237debe6-cd07-11e8-9edd-a0c589feaf42

Then run dask-worker on each worker node, passing the scheduler's tcp:// address.
Finally, connect to the scheduler from any machine:

client = Client('tcp://localhost:8786')

3. Data types

The main collections are dask.array, dask.dataframe and dask.bag.

3.1 Array

Apart from the linalg module, most of the NumPy API is implemented.

import dask.array as da
x = da.random.uniform(low=0, high=10, size=(10000, 10000),  # normal numpy code
                      chunks=(1000, 1000))  # break into chunks of size 1000x1000
y = x + x.T - x.mean(axis=0)  # Use normal syntax for high level algorithms
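Dask arrays are lazy: the lines above only build a task graph, and nothing runs until .compute() is called. A minimal sketch (using da.ones instead of random data so the result is deterministic):

```python
import dask.array as da

# A lazy 4x4 array of ones, split into four 2x2 chunks
x = da.ones((4, 4), chunks=(2, 2))

# Reductions also stay lazy; .compute() triggers execution
total = x.sum().compute()
print(total)  # 16.0
```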

3.2 DataFrames

Similar to pandas:

import dask.dataframe as dd
df = dd.read_csv('2018-*-*.csv', parse_dates=['timestamp'],  # normal pandas code
                 blocksize=64000000)  # break text into 64MB chunks
s = df.groupby('name').balance.mean()  # Use normal syntax for high level algorithms

3.3 Bags / lists

Dask Bag can read data from a file line by line; the .take() method then returns the specified number of elements.
Dask Bag implements operations such as map, filter, fold and groupby, executing them in parallel with Python iterators while using very little memory. It is similar to a parallel version of PyToolz or a Pythonic version of the PySpark RDD. Here is an example:

import json
import dask.bag as db
b = db.read_text('*.json').map(json.loads)
total = (b.filter(lambda d: d['name'] == 'Alice')
          .map(lambda d: d['balance'])
          .sum())
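read_text needs JSON files on disk; for a quick, self-contained check, db.from_sequence builds a bag from an in-memory list (the records below are invented for illustration):

```python
import dask.bag as db

records = [{'name': 'Alice', 'balance': 100},
           {'name': 'Bob',   'balance': 50},
           {'name': 'Alice', 'balance': 200}]
b = db.from_sequence(records, npartitions=2)

# Filter/map pipeline over the records; .compute() runs it
total = (b.filter(lambda d: d['name'] == 'Alice')
          .map(lambda d: d['balance'])
          .sum()
          .compute())
print(total)  # 300
```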

4. Computation

4.1 Futures

Similar to futures in asyncio, the Client has several important methods:

client.scatter(data)                # pre-distribute data to the workers
client.submit(function, *args)      # launch a single task
client.map(function, list_of_args)  # launch one task per element, like map

Once submit is called, execution begins asynchronously in the background. Results can be retrieved with client.gather() or with a future's .result() method.
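A minimal sketch of the futures workflow, assuming the distributed package is installed; Client(processes=False) starts an in-process scheduler, so no separate cluster is needed:

```python
from dask.distributed import Client

# In-process client: scheduler and workers run inside this Python process
client = Client(processes=False, n_workers=1, threads_per_worker=2)

def inc(x):
    return x + 1

# submit() returns a Future immediately; work starts in the background
futures = [client.submit(inc, i) for i in range(5)]

# gather() blocks until every result is ready
results = client.gather(futures)
print(results)  # [1, 2, 3, 4, 5]

client.close()
```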

4.2 Delayed

delayed evaluates lazily, somewhat like a generator's yield. First import it with from dask import delayed. There are two ways to use it:

  1. Use the @delayed decorator
  2. Wrap the function: delayed(function)
import dask
from dask import delayed

def inc(x):
    return x + 1

def double(x):
    return x * 2

def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]
output = []
for x in data:
    a = delayed(inc)(x)
    b = delayed(double)(x)
    c = delayed(add)(a, b)
    output.append(c)
total = dask.delayed(sum)(output)
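The total above is still a lazy Delayed object; .compute() executes the whole graph. A self-contained repeat of the example with that final step added:

```python
from dask import delayed

def inc(x):
    return x + 1

def double(x):
    return x * 2

def add(x, y):
    return x + y

output = [delayed(add)(delayed(inc)(x), delayed(double)(x))
          for x in [1, 2, 3, 4, 5]]
total = delayed(sum)(output)

# Each element is (x + 1) + (2 * x); the sum over 1..5 is 50
result = total.compute()
print(result)  # 50
```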

4.3 Some benchmark results

For a task with substantial overhead, performance was measured as follows:

Running locally, regardless of configuration, takes roughly 8 min 30 s.
A two-machine cluster with dask.delayed takes about 6 minutes.
A three-machine cluster with dask.delayed takes about 5 min 20 s.
Pulling the data takes about 3 min 30 s whether run on a single machine or distributed. Excluding that part, the remaining computation time with 3, 2 and 1 machines is roughly 1.83, 2.5 and 5 minutes respectively.
dask.future was also tested on the three-machine cluster, taking about 6 minutes. Multi-process execution on a single machine was also tested: 443 s with 8 processes and 482 s with 24 processes.

Tags: Python Docker

Posted on Fri, 17 Sep 2021 23:25:26 -0400 by Mindwreck