slurm-bridge

Download the slurm-bridge repository here, start using bridge with the quickstart guide, or read on to learn more.

Slurm and Kubernetes are workload managers originally designed for different kinds of workloads. Kubernetes excels at scheduling workloads that run for an indefinite amount of time, with potentially vague resource requirements, on a single node, with loose policy, and it can elastically scale its resource pool to meet demand. Slurm excels at quickly scheduling workloads that run for a finite amount of time, with well-defined resource requirements and topology, on multiple nodes, with strict policy, against a known resource pool.

Why you need slurm-bridge and what it can do


This project enables users to take advantage of the best features of both workload managers. It contains a Kubernetes scheduler to manage select workloads from Kubernetes, which allows for co-location of Kubernetes and Slurm workloads within the same cluster. This means the same hardware can be used to run both traditional HPC and cloud-like workloads, reducing operating costs.

Using slurm-bridge, workloads can be submitted from a Kubernetes context as a Pod, PodGroup, Job, or JobSet, or from a Slurm context using salloc or sbatch. Workloads submitted via Slurm execute as they would in a Slurm-only environment, using slurmd. Workloads submitted from Kubernetes have their resource requirements translated by slurm-bridge into a representative Slurm job, which serves as a placeholder and is scheduled by the Slurm controller. Once the Slurm controller allocates resources to that placeholder, slurm-bridge binds the workload's pod(s) to the allocated node(s). From that point, the kubelet launches and runs the pod(s) just as it would in a standard Kubernetes cluster.
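For illustration, a Pod typically opts into this scheduling path by naming the bridge scheduler in spec.schedulerName. The sketch below is a minimal, hypothetical example; the scheduler name, image, and resource values are assumptions, so check the names configured in your own slurm-bridge deployment.

```yaml
# Hypothetical Pod routed to the bridge scheduler instead of the
# default Kubernetes scheduler. The schedulerName value is an
# assumption; use the name your slurm-bridge deployment registers.
apiVersion: v1
kind: Pod
metadata:
  name: bridge-example
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: main
      image: busybox:stable
      command: ["sleep", "60"]
      resources:
        requests:
          cpu: "2"      # translated into the placeholder Slurm job's CPU request
          memory: 1Gi   # translated into the placeholder Slurm job's memory request
```

The resource requests are what slurm-bridge translates into the placeholder Slurm job, so specifying them accurately matters more here than in a typical Kubernetes deployment.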


For additional architectural notes, see the architecture docs.

Features


slurm-bridge enables scheduling of Kubernetes workloads using the Slurm scheduler, and can take advantage of most of the scheduling features of Slurm itself. These include:

  • Priority: assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).
  • Preemption: stops one or more low-priority jobs so that a high-priority job can run.
  • QoS: applies sets of policies affecting scheduling priority, preemption, and resource limits.
  • Fairshare: distributes resources equitably among users and accounts based on historical usage.
  • Reservations: reserves resources for select users or groups.

Supported Versions


  • Kubernetes Version: >= v1.29
  • Slurm Version: >= 25.05

Current Limitations


  • Each pod receives an exclusive, whole-node allocation.

Get started using slurm-bridge with the quickstart guide!
