Dask parallel task scheduling

Introduction to Dask

Dask is a flexible library for parallel computing in Python.

Dask consists of two parts:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • "Big data" collections (such as parallel arrays, dataframes, and lists) that extend common interfaces such as NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task scheduler; a sketch of the two parts working together follows this list.

Dask emphasizes the following advantages:

  • Familiar: provides parallelized NumPy array and Pandas DataFrame objects
  • Flexible: provides a task scheduling interface for more custom workloads and for integration with other projects (a dask.delayed sketch follows this list)
  • Native: enables distributed computing in pure Python, with access to the PyData stack
  • Fast: operates with low overhead, low latency, and the minimal serialization necessary for fast numerical algorithms
  • Scales up: runs resiliently on clusters with thousands of cores
  • Scales down: trivial to set up and run on a laptop in a single process
  • Responsive: designed with interactive computing in mind, providing rapid feedback and diagnostics to aid humans
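
A minimal sketch of the custom task-scheduling interface mentioned above, using dask.delayed; the functions inc and add are arbitrary examples, not part of Dask.

from dask import delayed

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# Build a small task graph lazily; nothing runs until .compute() is called,
# and the two inc calls can execute in parallel.
a = delayed(inc)(1)
b = delayed(inc)(2)
total = delayed(add)(a, b)
print(total.compute())   # -> 5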

Dask analysis

(env36) [scfan@fdm tools]$ dask-scheduler
(env36) [scfan@fdm ~]$ dask-worker 10.0.2.14:8786

python3

Dask resource analysis
Dask task management
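
A minimal sketch of connecting to the scheduler started above from a python3 session, in order to inspect resources and submit tasks; distributed.Client is the real API and 10.0.2.14:8786 matches the scheduler address shown in the logs below.

from dask.distributed import Client

# Connect to the running scheduler; the dashboard (port 8787 by default)
# then shows per-task progress and per-worker resource usage.
client = Client('tcp://10.0.2.14:8786')
print(client)                      # summary of workers, threads, memory

# Submit a trivial task to verify the cluster is working.
future = client.submit(lambda x: x + 1, 10)
print(future.result())             # -> 11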

Advantages and disadvantages of Dask

Advantages

  • Supports both single-machine and distributed environments
  • Pandas-like API, so existing code needs few modifications

Disadvantages

  • Dask-DataFrame
    • Reading Excel files is not supported. Supported readers: read_csv, read_table, read_fwf, read_parquet, read_hdf, read_json, read_orc (a workaround is sketched below)
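
One possible workaround for the missing Excel reader, assuming the file fits in memory: read it with pandas and convert the result to a Dask DataFrame. The file name report.xlsx and the partition count are arbitrary examples.

import pandas as pd
import dask.dataframe as dd

# pandas reads the Excel file eagerly (requires openpyxl or xlrd),
# then the resulting frame is split into Dask partitions.
pdf = pd.read_excel('report.xlsx')
ddf = dd.from_pandas(pdf, npartitions=4)
print(ddf.head())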

Dask deployment

Enclosure

Performance testing

Pandas and Dask performance was compared on the self-built modeling field-processing node; a timing sketch follows.
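
A minimal sketch of how such a comparison could be timed; the synthetic groupby workload, column names, and row count below are assumptions for illustration, not the actual field-processing test.

import time
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Synthetic frame standing in for the field-processing data.
pdf = pd.DataFrame({'key': np.random.randint(0, 100, 5_000_000),
                    'val': np.random.random(5_000_000)})
ddf = dd.from_pandas(pdf, npartitions=8)

t0 = time.perf_counter()
pdf.groupby('key')['val'].mean()
print('pandas:', time.perf_counter() - t0, 's')

t0 = time.perf_counter()
ddf.groupby('key')['val'].mean().compute()
print('dask:  ', time.perf_counter() - t0, 's')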

Reference resources

Dask & pandas syntax difference table

Github-Dask Collections API compatibility

Example

# A Dask Series is not a pandas.core.series.Series, so this type check
# only applies in pandas mode.
if data_mode.upper() == 'DASK':
    pass
else:
    if varname.startswith('df') and not isinstance(argls[index], pandas.core.series.Series):
        raise RuntimeError('Argument %s must be a column' % (index + 1))

# Dask DataFrame.replace has no inplace parameter
if data_mode == 'DASK':
    data = data.replace(to_replace='nan', value='')
else:
    data.replace(to_replace='nan', value='', inplace=True)

# Dask DataFrame.to_csv
# data.to_csv('a1.csv') creates a directory
# data.to_csv(['a1.csv']) creates a single file
# data.to_csv('a-*.csv') creates multiple partitioned files
if data_mode == 'DASK':
    data.to_csv(['a1.csv'], index=False)
else:
    data.to_csv('a.csv', index=False)

Dask & pandas detailed syntax performance differences

Startup procedure

Dask-scheduler

Start the dask scheduler

(env36) [scfan@fdm tools]$ dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.scheduler - INFO - Local Directory:    /tmp/scheduler-bdk4b7li
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:      tcp://10.0.2.14:8786
distributed.scheduler - INFO -   dashboard at:                     :8787
distributed.scheduler - INFO - Register tcp://10.0.2.14:30547
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.2.14:30547
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://10.0.2.14:9190
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.0.2.14:9190
distributed.core - INFO - Starting established connection

Dask scheduler visual interface

Dask-Worker

Start the worker

(env36) [scfan@fdm tools]$ dask-worker 10.0.2.14:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.0.2.14:12075'
distributed.diskutils - INFO - Found stale lock file and directory '/home/scfan/project/FISAMS/branches/branch_scfan/src/server/fdm/tools/worker-yyz2l21f', purging
distributed.dashboard.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: pip install jupyter-server-proxy
distributed.worker - INFO -       Start worker at:      tcp://10.0.2.14:17181
distributed.worker - INFO -          Listening to:      tcp://10.0.2.14:17181
distributed.worker - INFO -          dashboard at:            10.0.2.14:36300
distributed.worker - INFO - Waiting to connect to:       tcp://10.0.2.14:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                   10.32 GB
distributed.worker - INFO -       Local Directory: /home/scfan/project/FISAMS/branches/branch_scfan/src/server/fdm/tools/worker-5304u4tp
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:       tcp://10.0.2.14:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection

Dask worker visual interface

Dask comparison

Dask drawbacks

  • DataFrame
    • No direct SQL support; dask.dataframe.read_sql_table can be used instead (a sketch follows this list)
    • Supported data formats
      • Tabular: Parquet, ORC, CSV, line-delimited JSON, Avro, text
      • Arrays: HDF5, NetCDF, Zarr, GRIB
      • Excel is not supported
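
A minimal sketch of reading from a database with dask.dataframe.read_sql_table, assuming a SQLAlchemy-compatible connection string; the table name, index column, and URI below are hypothetical placeholders.

import dask.dataframe as dd

# read_sql_table needs an indexed numeric or date column ('id' here) so the
# table can be split into partitions that are read in parallel.
ddf = dd.read_sql_table('measurements',
                        'postgresql://user:pass@10.0.2.14/mydb',
                        index_col='id', npartitions=8)
print(ddf.head())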

Dask advantages

  • Dask can tolerate worker node failures
  • Dask is relatively new (first released in 2015), but it is mature and is kept up to date alongside Pandas releases
  • Dask is a general parallel programming solution; like Pandas it is easy to use, with only minor differences from the pandas API
  • Local Dask execution supports profiling and inspection (a diagnostics sketch follows this list)
    • https://docs.dask.org/en/latest/diagnostics-local.html#example
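
A minimal sketch of the local diagnostics referenced above, using ProgressBar and the profilers from dask.diagnostics; the array workload is an arbitrary example, and Bokeh is needed for visualize().

import dask.array as da
from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler

x = da.random.random((10000, 10000), chunks=(1000, 1000))

# ProgressBar prints the progress of the local scheduler; the profilers
# record per-task timing and CPU/memory usage.
with ProgressBar(), Profiler() as prof, ResourceProfiler() as rprof:
    (x + x.T).sum().compute()

# Renders an interactive Bokeh plot of the recorded task profile.
prof.visualize()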

Dask supported features

  • Supports the single-machine scheduler and the distributed scheduler (local or cluster)
  • Dask worker resource control (a sketch follows this list)
    • --resources
      Abstract resources that tasks can be constrained by, such as GPU=2 MEM=10e9.
      Resources are applied to each worker process separately (only relevant when several worker processes are started with --nprocs).
  • Visual interface
    • http://192.168.172.72:27831/status Scheduler
    • http://192.168.172.72:8787/status Worker
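
A sketch of using worker resources, assuming a worker was started with --resources; the GPU label and the train function below are illustrative, while Client.submit(..., resources=...) is the real distributed API.

# Start a worker that advertises abstract resources (shell):
#   dask-worker 10.0.2.14:8786 --resources "GPU=2 MEM=10e9"

from dask.distributed import Client

client = Client('tcp://10.0.2.14:8786')

def train(x):
    # placeholder for a GPU-bound task
    return x * 2

# This task will only be scheduled on workers advertising at least one GPU.
future = client.submit(train, 21, resources={'GPU': 1})
print(future.result())   # -> 42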

Tags: Python jupyter Excel JSON

Posted on Sat, 01 Feb 2020 09:02:05 -0500 by flashmonkey