Setup Prometheus monitoring

Prometheus is a widely popular tool for monitoring and alerting a wide variety of systems. A distributed cluster offers a number of Prometheus metrics if the prometheus_client package is installed. The metrics are exposed in Prometheus’ text-based format at the /metrics endpoint on both schedulers and workers.

Available metrics

Apart from the metrics exposed per default by the prometheus_client, schedulers and workers expose a number of Dask-specific metrics.

Scheduler metrics

The scheduler exposes the following metrics about itself:

Metric name

Description

dask_scheduler_clients

Number of clients connected

dask_scheduler_desired_workers

Number of workers scheduler needs for task graph

dask_scheduler_workers

Number of workers known by scheduler

dask_scheduler_tasks

Number of tasks known by scheduler

dask_scheduler_tasks_suspicious_total

Total number of times a task has been marked suspicious

dask_scheduler_tasks_forgotten_total

Total number of processed tasks no longer in memory and already removed from the scheduler job queue

Note: Task groups on the scheduler which have all tasks in the forgotten state are not included.

dask_scheduler_prefix_state_totals_total

Accumulated count of task prefix in each state

dask_scheduler_tick_duration_maximum_seconds

Maximum tick duration observed since Prometheus last scraped metrics

dask_scheduler_tick_count_total

Total number of ticks observed since the server started

Semaphore metrics

The following metrics about semaphores are available on the scheduler:

Metric name

Description

dask_semaphore_max_leases

Maximum leases allowed per semaphore

Note: This will be constant for each semaphore during its lifetime.

dask_semaphore_active_leases

Amount of currently active leases per semaphore

dask_semaphore_pending_leases

Amount of currently pending leases per semaphore

dask_semaphore_acquire_total

Total number of leases acquired per semaphore

dask_semaphore_release_total

Total number of leases released per semaphore

Note: If a semaphore is closed while there are still leases active, this count will not equal semaphore_acquired_total after execution.

dask_semaphore_average_pending_lease_time_s

Exponential moving average of the time it took to acquire a lease per semaphore

Note: This only includes time spent on scheduler side, it does not include time spent on communication.

Note: This average is calculated based on order of leases instead of time of lease acquisition.

Work-stealing metrics

If work-stealing is enabled, the scheduler exposes these metrics:

Metric name

Description

dask_stealing_request_count_total

Total number of stealing requests

dask_stealing_request_cost_total

Total cost of stealing requests

Worker metrics

The worker exposes these metrics about itself:

Metric name

Description

dask_worker_tasks

Number of tasks at worker

dask_worker_threads

Number of worker threads

dask_worker_latency_seconds

Latency of worker connection

dask_worker_memory_bytes

Memory breakdown

dask_worker_transfer_incoming_bytes

Total size of open data transfers from other workers

dask_worker_transfer_incoming_count

Number of open data transfers from other workers

dask_worker_transfer_incoming_count_total

Total number of data transfers from other workers since the worker was started

dask_worker_transfer_outgoing_bytes

Total size of open data transfers to other workers

dask_worker_transfer_outgoing_count

Number of open data transfers to other workers

dask_worker_transfer_outgoing_count_total

Total number of data transfers to other workers since the worker was started

dask_worker_concurrent_fetch_requests

Deprecated: This metric has been renamed to transfer_incoming_count.

Number of open fetch requests to other workers

dask_worker_tick_duration_maximum_seconds

Maximum tick duration observed since Prometheus last scraped metrics

dask_worker_tick_count_total

Total number of ticks observed since the server started

If the crick package is installed, the worker additionally exposes:

Metric name

Description

dask_worker_tick_duration_median_seconds

Median tick duration at worker

dask_worker_task_duration_median_seconds

Median task runtime at worker

dask_worker_transfer_bandwidth_median_bytes

Bandwidth for transfer at worker