SC17 Denver, CO

Slurm User Group Meeting

Authors: Morris Jette (SchedMD LLC)

Abstract: Slurm is an open source workload manager used on many TOP500 systems. It provides a rich set of features, including topology-aware optimized resource allocation, the ability to expand and shrink jobs on demand, the ability to power down idle nodes and restart them as needed, hierarchical bank accounts with fair-share job prioritization, and many resource limits. The meeting will consist of three parts: The Slurm development team will present details about changes in the new version 17.11, describe the Slurm roadmap, and solicit user feedback. Everyone interested in Slurm use and/or development is encouraged to attend.

Long Description: Slurm is a free open source workload manager in widespread use today with a steadily growing customer base. As of the June 2017 TOP500 list, Slurm was used on 5 of the top 10 systems. Slurm is vendor-neutral with about 250 individual contributors from a multitude of computer vendors, national laboratories, and universities. SC is our best venue to gather such a diverse global community.

The Slurm BOF has been held at the previous six SC conferences with attendance increasing each year (approximately 45, 60, 80, 120, 170 and 200 people in the previous six meetings). The format has been similar each year, with developers presenting users with information about recent work and gathering requirements for future work.

The goals of the Slurm BOF are to inform users about recent developments and plans for future work, and to gather requirements for that work. There are two major releases of Slurm each year and the advances in each are substantial. SC is a great venue to keep the user community informed about these developments.

The first presentation will highlight changes in Slurm version 17.11, to be released in November 2017, including support for heterogeneous resource allocations and for managing the workload on an enterprise-wide basis across multiple clusters. Slurm has previously supported only homogeneous resource allocations, with each spawned task getting identical resources (memory, CPUs, GPUs, etc. identical for every task). Heterogeneous resource support will permit an arbitrary collection of resource allocations to be combined in a single Slurm job and used to spawn one or more applications. Slurm has also supported resource management only on a single cluster in the past. The latest version of Slurm permits multiple clusters to be configured into a “federation,” with the workload and resources of that entire federation optimized with respect to resource allocations.
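
As a rough illustration of the difference between homogeneous and heterogeneous allocations, the Python sketch below models a job as a list of components, where every task within a component gets identical resources but components may differ from one another. The class names, field names, and resource figures are hypothetical and chosen only for illustration; this is not Slurm's API or internal data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    """Hypothetical group of tasks that all receive identical resources."""
    ntasks: int
    cpus_per_task: int
    mem_mb_per_task: int
    gpus_per_task: int = 0

    def total_cpus(self) -> int:
        return self.ntasks * self.cpus_per_task

    def total_mem_mb(self) -> int:
        return self.ntasks * self.mem_mb_per_task


@dataclass
class HeterogeneousJob:
    """Hypothetical single job built from several differing components."""
    components: List[Component]

    def total_tasks(self) -> int:
        return sum(c.ntasks for c in self.components)

    def total_cpus(self) -> int:
        return sum(c.total_cpus() for c in self.components)

    def total_mem_mb(self) -> int:
        return sum(c.total_mem_mb() for c in self.components)

    def is_homogeneous(self) -> bool:
        # Homogeneous means every task has the same per-task resources.
        shapes = {
            (c.cpus_per_task, c.mem_mb_per_task, c.gpus_per_task)
            for c in self.components
        }
        return len(shapes) <= 1


# Illustrative figures: one large-memory coordinator task plus 128 workers.
job = HeterogeneousJob([
    Component(ntasks=1, cpus_per_task=1, mem_mb_per_task=16384),
    Component(ntasks=128, cpus_per_task=4, mem_mb_per_task=1024),
])

print(job.total_tasks(), job.total_cpus(), job.total_mem_mb())
print("homogeneous:", job.is_homogeneous())
```

Under the earlier homogeneous model, a job like this would have to request the largest per-task resources for every task; combining components lets each group ask only for what it needs.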

A second presentation will highlight changes planned for future releases of Slurm, especially version 18.08 to be released in August 2018.

We also seek user guidance at the BOF and via a survey and discussion in order to help prioritize development for future work.

Survey questions:
* Name
* Organization
* Email
* Slurm user (yes or no)
* Computer description (node counts and vendors)
* Typical workload (job sizes and run times)
* Current features that are most important to you
* Additional features desired (priority ordered)
* Interested in participating in Slurm consortium?
* Other comments

Conference Presentation: pdf
