Kubernetes Operator for Slurm Clusters
Run Slurm on Kubernetes, by SchedMD. A Slinky project.
Overview
Slurm and Kubernetes are workload managers originally designed for different kinds of workloads. In broad strokes: Kubernetes excels at scheduling workloads that typically run for an indefinite amount of time, with potentially vague resource requirements, on a single node, with loose policy, but can scale its resource pool infinitely to meet demand; Slurm excels at quickly scheduling workloads that run for a finite amount of time, with well defined resource requirements and topology, on multiple nodes, with strict policy, but its resource pool is known.
This project enables the best of both workload managers, unified on Kubernetes. It contains a Kubernetes operator to deploy and manage certain components of Slurm clusters. This repository implements custom controllers and custom resource definitions (CRDs) designed for the lifecycle (creation, upgrade, graceful shutdown) of Slurm clusters.
[Figure: Slurm Operator Architecture]
For additional architectural notes, see the architecture docs.
Slurm Cluster
Slurm clusters are very flexible and can be configured in various ways. Our Slurm helm chart provides a reference implementation that is highly customizable and tries to expose everything Slurm has to offer.
[Figure: Slurm Architecture]
For additional information about Slurm, see the slurm docs.
Features
Controller
The Slurm control-plane is responsible for scheduling Slurm workloads onto its worker nodes and managing their states.
Slurm High Availability (HA) is effectively achieved through Kubernetes recreating the Slurm controller pod if it crashes. This is generally faster than the time it takes a backup Slurm controller to assume control after the primary crashes. Because Slurm's own HA mechanism is not used, a shared filesystem is not required for this.
Changes to the Slurm configuration files are automatically detected and the Slurm cluster is reconfigured seamlessly with zero downtime of the Slurm control-plane.
[!NOTE] The kubelet's `configMapAndSecretChangeDetectionStrategy` and `syncFrequency` settings directly affect when pods have their mounted ConfigMaps and Secrets updated. By default, the kubelet is in `Watch` mode with a polling frequency of 60 seconds.
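To see what a given kubelet is currently using for these settings, its configz endpoint can be queried. A minimal sketch, assuming access to the nodes/proxy subresource and that jq is available; `<node-name>` is a placeholder:

```sh
# Query a node's kubelet configuration through the API server proxy and show
# the two settings mentioned in the note above.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" \
  | jq '.kubeletconfig | {configMapAndSecretChangeDetectionStrategy, syncFrequency}'
```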
NodeSets
A set of homogeneous Slurm workers (compute nodes), which are delegated to execute the Slurm workload.
The operator takes the workload running on Slurm nodes into consideration when it needs to scale in, upgrade, or otherwise handle node failures. Slurm nodes are marked as drained before their eventual termination during a scale-in or upgrade.
Slurm node states (e.g. Idle, Allocated, Mixed, Down, Drain, Not Responding, etc.) are applied to each NodeSet pod via pod conditions; each NodeSet pod carries a pod status that reflects its own Slurm node state.
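These conditions can be inspected with standard tooling, for example (the pod name below is illustrative):

```sh
# Show the pod's conditions; the Slurm node state is surfaced among them.
kubectl describe pod cpu-1-0 --namespace=slurm
```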
The NodeSet CRD supports a scalingMode field that controls how many
pods are created and how they are scaled. This allows you to choose
between replica-based scaling (like a StatefulSet) or one-pod-per-node
scaling (like a DaemonSet).
StatefulSet (default)
Behavior: The controller maintains a fixed number of pods according to the `replicas` field.
Use when: A fixed or scalable number of Slurm worker pods is needed. Scale-to-zero and horizontal autoscaling (e.g. HPA) apply to this mode.
Note: Each pod has a stable identity (e.g. ordinal-based naming).
DaemonSet
Behavior: The controller schedules one pod per Kubernetes node that matches the NodeSet's pod template (e.g. `nodeSelector`, `tolerations`). Pod count follows the number of matching nodes; adding or removing nodes automatically adds or removes pods.
Use when: 1:1 alignment between Kubernetes and Slurm (slurmd) nodes is needed.
Note: The `replicas` field is ignored. Pod identity is tied to the node (e.g. node name) rather than an ordinal.
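A minimal NodeSet sketch illustrating these fields follows. The API group comes from the installed CRD names, but the `apiVersion`, the `scalingMode` enum values, and the surrounding schema here are assumptions for illustration only; consult the Slurm chart and installed CRDs for the authoritative layout.

```sh
# Hypothetical NodeSet manifest; apiVersion and field layout are assumed,
# shown only to illustrate scalingMode and replicas.
kubectl apply --namespace=slurm -f - <<'EOF'
apiVersion: slinky.slurm.net/v1beta1   # assumed CRD version
kind: NodeSet
metadata:
  name: cpu
spec:
  scalingMode: StatefulSet   # default; use DaemonSet for one pod per matching node
  replicas: 4                # honored in StatefulSet mode; ignored in DaemonSet mode
EOF
```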
The operator supports NodeSet scale-to-zero, scaling the resource down to zero replicas. Hence, any Horizontal Pod Autoscaler (HPA) that also supports scale-to-zero pairs well with NodeSets.
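Assuming the NodeSet exposes the standard scale subresource (as scalable workloads do), scaling to zero and back might look like this; the NodeSet name is illustrative:

```sh
# Scale a NodeSet named "cpu" down to zero workers, then back up.
kubectl scale nodeset/cpu --replicas=0 --namespace=slurm
kubectl scale nodeset/cpu --replicas=4 --namespace=slurm
```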
NodeSets can be resolved by hostname. This allows login pods and worker pods to communicate directly using predictable hostnames (e.g., cpu-1-0, gpu-2-1).
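For example, from a login pod (the hostnames follow the pattern above and are illustrative):

```sh
# Resolve a worker pod by its predictable hostname.
getent hosts cpu-1-0

# Run a command on that specific worker via Slurm (assumes the Slurm node
# name matches the pod hostname, as in the pattern above).
srun --nodelist=cpu-1-0 hostname
```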
LoginSets
A set of homogeneous login nodes (submit node, jump host) for Slurm, which manage user identity via SSSD.
The operator supports LoginSet scale-to-zero, scaling the resource down to zero replicas. Hence, any Horizontal Pod Autoscaler (HPA) that also supports scale-to-zero pairs well with LoginSets.
Hybrid Support
Sometimes a Slurm cluster has some, but not all, of its components in Kubernetes. The operator and its CRDs are designed to support these use cases.
Slurm
Slurm is a full-featured HPC workload manager. To highlight a few features (a short job-submission example follows the list):
Accounting: collect accounting information for every job and job step executed.
Partitions: job queues with sets of resources and constraints (e.g. job size limit, job time limit, users permitted).
Reservations: reserve resources for jobs being executed by select users and/or select accounts.
Job Dependencies: defer the start of jobs until the specified dependencies have been satisfied.
Job Containers: jobs which run an unprivileged OCI container bundle.
MPI: launch parallel MPI jobs, supports various MPI implementations.
Priority: assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).
Preemption: stop one or more low-priority jobs to let a high-priority job run.
QoS: sets of policies affecting scheduling priority, preemption, and resource limits.
Fairshare: distribute resources equitably among users and accounts based on historical usage.
Node Health Check: periodically check node health via script.
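A minimal sketch tying a few of these together (partitions, time limits, and job dependencies) from a login node; the partition name and script names are illustrative:

```sh
# Submit a job to a partition with a time limit, capturing its job ID.
jobid=$(sbatch --parsable --partition=all --time=00:10:00 preprocess.sh)

# Submit a dependent job that only starts after the first completes successfully.
sbatch --dependency=afterok:${jobid} --partition=all analyze.sh
```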
Compatibility
| Software | Minimum Version |
|---|---|
| Kubernetes | |
| Slurm | |
| Cgroup | |
Quick Start
Install cert-manager with its CRDs:
helm install \
cert-manager oci://quay.io/jetstack/charts/cert-manager \
--namespace cert-manager --create-namespace \
--set crds.enabled=true
Install the slurm-operator and its CRDs:
helm install slurm-operator-crds oci://ghcr.io/slinkyproject/charts/slurm-operator-crds
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--namespace=slinky --create-namespace
Install a Slurm cluster:
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--namespace=slurm --create-namespace
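As a quick sanity check, the operator and Slurm pods should come up in their respective namespaces (exact pod names depend on the chart values):

```sh
# Pods should reach Running/Ready.
kubectl get pods --namespace=slinky
kubectl get pods --namespace=slurm
```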
For additional instructions, see the installation guide.
Upgrades
Slinky versions are expressed as X.Y.Z, where X is the major version, Y is the minor version, and Z is the patch version, following Semantic Versioning terminology.
See versioning for more details.
1.Y Releases
New Slinky versions may update the Slinky
CRDs
with new fields and deprecate old ones. During CRD version changes
(e.g. v1beta1 => v1beta2), deprecated fields may be removed.
Through the Kubernetes API, CRD versions are automatically converted to
the stored version. Therefore old CRD versions will still work, but it
is recommended to use the new CRD version as indicated by the installed
Slinky CRDs.
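To see which versions an installed Slinky CRD serves, and which is the storage version, for example:

```sh
# List served and storage versions for the NodeSet CRD.
kubectl get crd nodesets.slinky.slurm.net \
  -o jsonpath='{range .spec.versions[*]}{.name}{"\t"}served={.served}{"\t"}storage={.storage}{"\n"}{end}'
```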
To upgrade between Slinky v1.Y versions (e.g. v1.0.Z =>
v1.1.Z), upgrade the slurm-operator-crds chart followed by the
slurm-operator chart, or both at the same time by upgrading the
slurm-operator chart when using crds.enabled=true.
helm upgrade slurm-operator-crds oci://ghcr.io/slinkyproject/charts/slurm-operator-crds \
--version $SLINKY_VERSION
helm upgrade slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--namespace slinky --version $SLINKY_VERSION
Slurm charts may remain on an older Slinky release series (e.g. v1.0.x) while the slurm-operator and its CRDs are on a newer one (e.g. v1.1.x). It is still recommended to upgrade the Slurm charts to the same Slinky release series as the slurm-operator to make use of new fields, features, and functionality.
Please review changes made to the CRDs and the Slurm chart. Update your
values.yaml appropriately and upgrade the Slurm chart.
helm upgrade slurm oci://ghcr.io/slinkyproject/charts/slurm \
--namespace slurm --version $SLINKY_VERSION
0.Y Releases
Breaking changes may be introduced into existing versions of the Slinky CRDs. To upgrade between v0.Y versions (e.g. v0.1.Z =>
v0.2.Z), uninstall all Slinky charts and delete Slinky CRDs, then
install the new release like normal.
helm --namespace=slurm uninstall slurm
helm --namespace=slinky uninstall slurm-operator
helm uninstall slurm-operator-crds
If the CRDs were not installed via the slurm-operator-crds helm chart, delete them manually:
kubectl delete customresourcedefinitions.apiextensions.k8s.io accountings.slinky.slurm.net
kubectl delete customresourcedefinitions.apiextensions.k8s.io clusters.slinky.slurm.net # defunct
kubectl delete customresourcedefinitions.apiextensions.k8s.io loginsets.slinky.slurm.net
kubectl delete customresourcedefinitions.apiextensions.k8s.io nodesets.slinky.slurm.net
kubectl delete customresourcedefinitions.apiextensions.k8s.io restapis.slinky.slurm.net
kubectl delete customresourcedefinitions.apiextensions.k8s.io tokens.slinky.slurm.net
Documentation
Project documentation is located in the docs directory of this repository.
Slinky documentation is hosted on the web.
Support and Development
Feature requests, code contributions, and bug reports are welcome!
GitHub/GitLab issues and PRs/MRs are handled on a best-effort basis.
The SchedMD official issue tracker is at https://support.schedmd.com/.
To schedule a demo or simply to reach out, please contact SchedMD.
License
Copyright (C) SchedMD LLC.
Licensed under the Apache License, Version 2.0; you may not use this project except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.