Kubernetes Operator for Slurm Clusters
Run Slurm on Kubernetes, by SchedMD. A Slinky project.
Overview
Slurm and Kubernetes are workload managers originally designed for different kinds of workloads. In broad strokes: Kubernetes excels at scheduling workloads that typically run for an indefinite amount of time, with potentially vague resource requirements, on a single node, with loose policy, but can scale its resource pool infinitely to meet demand; Slurm excels at quickly scheduling workloads that run for a finite amount of time, with well defined resource requirements and topology, on multiple nodes, with strict policy, but its resource pool is known.
This project enables the best of both workload managers, unified on Kubernetes. It contains a Kubernetes operator to deploy and manage certain components of Slurm clusters. This repository implements custom controllers and custom resource definitions (CRDs) designed for the lifecycle (creation, upgrade, graceful shutdown) of Slurm clusters.
[Figure: Slurm Operator Architecture]
For additional architectural notes, see the architecture docs.
Slurm Cluster
Slurm clusters are very flexible and can be configured in various ways. Our Slurm helm chart provides a reference implementation that is highly customizable and tries to expose everything Slurm has to offer.
[Figure: Slurm Architecture]
For additional information about Slurm, see the slurm docs.
Features
Controller
The Slurm control-plane is responsible for scheduling Slurm workloads onto its worker nodes and managing their states.
Slurm High Availability (HA) is effectively achieved through Kubernetes recreating the Slurm controller pod if it crashes. This is generally faster than the time it takes a backup Slurm controller to assume control after the primary crashes. Because Slurm's own HA mechanism is not used, a shared filesystem is not required for this.
Changes to the Slurm configuration files are automatically detected and the Slurm cluster is reconfigured seamlessly with zero downtime of the Slurm control-plane.
[!NOTE] The kubelet's `configMapAndSecretChangeDetectionStrategy` and `syncFrequency` settings directly affect when pods have their mounted ConfigMaps and Secrets updated. By default, the kubelet is in `Watch` mode with a polling frequency of 60 seconds.
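To see what a given kubelet is currently using for these settings, its configz endpoint can be queried. A minimal sketch, assuming access to the nodes/proxy subresource and that jq is available; `<node-name>` is a placeholder:

```sh
# Query a node's kubelet configuration through the API server proxy and show
# the two settings mentioned in the note above.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" \
  | jq '.kubeletconfig | {configMapAndSecretChangeDetectionStrategy, syncFrequency}'
```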
NodeSets
A set of homogeneous Slurm workers (compute nodes), which are delegated to execute the Slurm workload.
The operator takes the workload running on Slurm nodes into consideration when it needs to scale in, upgrade, or otherwise handle node failures. Slurm nodes are marked as drained before their eventual termination during a scale-in or upgrade.
Slurm node states (e.g. Idle, Allocated, Mixed, Down, Drain, Not Responding, etc.) are applied to each NodeSet pod via pod conditions; each NodeSet pod carries a pod status that reflects its own Slurm node state.
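These conditions can be inspected with standard tooling, for example (the pod name below is illustrative):

```sh
# Show the pod's conditions; the Slurm node state is surfaced among them.
kubectl describe pod cpu-1-0 --namespace=slurm
```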
The NodeSet CRD supports a scalingMode field that controls how many
pods are created and how they are scaled. This allows you to choose
between replica-based scaling (like a StatefulSet) or one-pod-per-node
scaling (like a DaemonSet).
StatefulSet (default)
Behavior: The controller maintains a fixed number of pods according to the `replicas` field.
Use when: A fixed or scalable number of Slurm worker pods is needed. Scale-to-zero and horizontal autoscaling (e.g. HPA) apply to this mode.
Note: Each pod has a stable identity (e.g. ordinal-based naming).
DaemonSet
Behavior: The controller schedules one pod per Kubernetes node that matches the NodeSet's pod template (e.g. `nodeSelector`, `tolerations`). Pod count follows the number of matching nodes; adding or removing nodes automatically adds or removes pods.
Use when: 1:1 alignment between Kubernetes and Slurm (slurmd) nodes is needed.
Note: The `replicas` field is ignored. Pod identity is tied to the node (e.g. node name) rather than an ordinal.
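A minimal NodeSet sketch illustrating these fields follows. The API group comes from the installed CRD names, but the `apiVersion`, the `scalingMode` enum values, and the surrounding schema here are assumptions for illustration only; consult the Slurm chart and installed CRDs for the authoritative layout.

```sh
# Hypothetical NodeSet manifest; apiVersion and field layout are assumed,
# shown only to illustrate scalingMode and replicas.
kubectl apply --namespace=slurm -f - <<'EOF'
apiVersion: slinky.slurm.net/v1beta1   # assumed CRD version
kind: NodeSet
metadata:
  name: cpu
spec:
  scalingMode: StatefulSet   # default; use DaemonSet for one pod per matching node
  replicas: 4                # honored in StatefulSet mode; ignored in DaemonSet mode
EOF
```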
The operator supports NodeSet scale-to-zero, scaling the resource down to zero replicas. Hence, any Horizontal Pod Autoscaler (HPA) that also supports scale-to-zero pairs well with NodeSets.
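Assuming the NodeSet exposes the standard scale subresource (as scalable workloads do), scaling to zero and back might look like this; the NodeSet name is illustrative:

```sh
# Scale a NodeSet named "cpu" down to zero workers, then back up.
kubectl scale nodeset/cpu --replicas=0 --namespace=slurm
kubectl scale nodeset/cpu --replicas=4 --namespace=slurm
```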
NodeSets can be resolved by hostname. This allows login pods and worker pods to communicate directly using predictable hostnames (e.g., cpu-1-0, gpu-2-1).
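For example, from a login pod (the hostnames follow the pattern above and are illustrative):

```sh
# Resolve a worker pod by its predictable hostname.
getent hosts cpu-1-0

# Run a command on that specific worker via Slurm (assumes the Slurm node
# name matches the pod hostname, as in the pattern above).
srun --nodelist=cpu-1-0 hostname
```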
LoginSets
A set of homogeneous login nodes (submit node, jump host) for Slurm, which manage user identity via SSSD.
The operator supports LoginSet scale-to-zero, scaling the resource down to zero replicas. Hence, any Horizontal Pod Autoscaler (HPA) that also supports scale-to-zero pairs well with LoginSets.
Hybrid Support
Sometimes a Slurm cluster has some, but not all, of its components in Kubernetes. The operator and its CRDs are designed to support these use cases.
Slurm
Slurm is a full-featured HPC workload manager. To highlight a few features (a short job-submission example follows the list):
Accounting: collect accounting information for every job and job step executed.
Partitions: job queues with sets of resources and constraints (e.g. job size limit, job time limit, users permitted).
Reservations: reserve resources for jobs being executed by select users and/or select accounts.
Job Dependencies: defer the start of jobs until the specified dependencies have been satisfied.
Job Containers: jobs which run an unprivileged OCI container bundle.
MPI: launch parallel MPI jobs, supports various MPI implementations.
Priority: assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).
Preemption: stop one or more low-priority jobs to let a high-priority job run.
QoS: sets of policies affecting scheduling priority, preemption, and resource limits.
Fairshare: distribute resources equitably among users and accounts based on historical usage.
Node Health Check: periodically check node health via script.
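A minimal sketch tying a few of these together (partitions, time limits, and job dependencies) from a login node; the partition name and script names are illustrative:

```sh
# Submit a job to a partition with a time limit, capturing its job ID.
jobid=$(sbatch --parsable --partition=all --time=00:10:00 preprocess.sh)

# Submit a dependent job that only starts after the first completes successfully.
sbatch --dependency=afterok:${jobid} --partition=all analyze.sh
```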
Compatibility
| Software | Minimum Version |
|---|---|
| Kubernetes | |
| Slurm | |
| Cgroup | |
Quick Start
Install cert-manager with its CRDs:
helm install \
cert-manager oci://quay.io/jetstack/charts/cert-manager \
--namespace cert-manager --create-namespace \
--set crds.enabled=true
Install the slurm-operator and its CRDs:
helm install slurm-operator-crds oci://ghcr.io/slinkyproject/charts/slurm-operator-crds
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--namespace=slinky --create-namespace
Install a Slurm cluster:
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--namespace=slurm --create-namespace
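As a quick sanity check, the operator and Slurm pods should come up in their respective namespaces (exact pod names depend on the chart values):

```sh
# Pods should reach Running/Ready.
kubectl get pods --namespace=slinky
kubectl get pods --namespace=slurm
```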
For additional instructions, see the installation guide.
Upgrades
Slinky versions are expressed as X.Y.Z, where X is the major version, Y is the minor version, and Z is the patch version, following Semantic Versioning terminology.
See versioning for more details.
1.Y Releases
New Slinky versions may update the Slinky
CRDs
with new fields and deprecate old ones. During CRD version changes
(e.g. v1beta1 => v1beta2), deprecated fields may be removed.
Through the Kubernetes API, CRD versions are automatically converted to
the stored version. Therefore old CRD versions will still work, but it
is recommended to use the new CRD version as indicated by the installed
Slinky CRDs.
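To see which versions an installed Slinky CRD serves, and which is the storage version, for example:

```sh
# List served and storage versions for the NodeSet CRD.
kubectl get crd nodesets.slinky.slurm.net \
  -o jsonpath='{range .spec.versions[*]}{.name}{"\t"}served={.served}{"\t"}storage={.storage}{"\n"}{end}'
```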
To upgrade between Slinky v1.Y versions (e.g. v1.0.Z =>
v1.1.Z), upgrade the slurm-operator-crds chart followed by the
slurm-operator chart, or both at the same time by upgrading the
slurm-operator chart when using crds.enabled=true.
helm upgrade slurm-operator-crds oci://ghcr.io/slinkyproject/charts/slurm-operator-crds \
--version $SLINKY_VERSION
helm upgrade slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--namespace slinky --version $SLINKY_VERSION
Slurm charts may remain on an older Slinky release series (e.g. v1.0.x) while the slurm-operator and its CRDs are on a newer one (e.g. v1.1.x). It is still recommended to upgrade the Slurm charts to the same Slinky release series as the slurm-operator to make use of new fields, features, and functionality.
Please review changes made to the CRDs and the Slurm chart. Update your
values.yaml appropriately and upgrade the Slurm chart.
helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
--namespace slurm --version $SLINKY_VERSION
0.Y Releases
Breaking changes may be introduced into existing versions of the Slinky CRDs. To upgrade between v0.Y versions (e.g. v0.1.Z =>
v0.2.Z), uninstall all Slinky charts and delete Slinky CRDs, then
install the new release like normal.
helm --namespace=slurm uninstall slurm
helm --namespace=slinky uninstall slurm-operator
helm uninstall slurm-operator-crds
If the CRDs were not installed via the slurm-operator-crds helm chart, delete them manually:
kubectl delete customresourcedefinitions.apiextensions.k8s.io accountings.slinky.slurm.net
kubectl delete customresourcedefinitions.apiextensions.k8s.io clusters.slinky.slurm.net # defunct
kubectl delete customresourcedefinitions.apiextensions.k8s.io loginsets.slinky.slurm.net
kubectl delete customresourcedefinitions.apiextensions.k8s.io nodesets.slinky.slurm.net
kubectl delete customresourcedefinitions.apiextensions.k8s.io restapis.slinky.slurm.net
kubectl delete customresourcedefinitions.apiextensions.k8s.io tokens.slinky.slurm.net
Documentation
Project documentation is located in the docs directory of this repository.
Slinky documentation is hosted on the web.
Support and Development
Feature requests, code contributions, and bug reports are welcome!
GitHub/GitLab issues and PRs/MRs are handled on a best-effort basis.
The SchedMD official issue tracker is at https://support.schedmd.com/.
To schedule a demo or simply to reach out, please contact SchedMD.
License
Copyright (C) SchedMD LLC.
Licensed under the Apache License, Version 2.0; you may not use this project except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.