slurm-bridge
Run Slurm as a Kubernetes scheduler. A Slinky project.
Code Repository
The code repository can be accessed at https://github.com/SlinkyProject/slurm-bridge
Overview
Slurm and Kubernetes are workload managers originally designed for different
kinds of workloads. In broad strokes: Kubernetes excels at scheduling workloads
that typically run for an indefinite amount of time, with potentially vague
resource requirements, on a single node, with loose policy, but can scale its
resource pool infinitely to meet demand; Slurm excels at quickly scheduling
workloads that run for a finite amount of time, with well defined resource
requirements and topology, on multiple nodes, with strict policy, but its
resource pool is known.
This project enables the best of both workload managers. It contains a
Kubernetes scheduler to manage select workloads from Kubernetes.
For additional architectural notes, see the architecture docs.
Supported Slurm Versions
Data Parser: v41
Overall Architecture
This is a basic architecture. A more in-depth description can be found in the docs directory.

Known Issues
- Cgroups are currently disabled, due to difficulties getting core information into the pods.
- Updates may be slow, because the slurm-controller must wait for sequencing before it can be deployed.
License
Copyright (C) SchedMD LLC.
Licensed under the Apache License, Version 2.0; you may not use this project except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
1.1 - Overview
Slurm and Kubernetes are workload managers originally designed for different
kinds of workloads. In broad strokes: Kubernetes excels at scheduling workloads
that typically run for an indefinite amount of time, with potentially vague
resource requirements, on a single node, with loose policy, but can scale its
resource pool infinitely to meet demand; Slurm excels at quickly scheduling
workloads that run for a finite amount of time, with well defined resource
requirements and topology, on multiple nodes, with strict policy, but its
resource pool is known.
This project enables the best of both workload managers. It contains a
Kubernetes scheduler to manage select workloads from Kubernetes.
For additional architectural notes, see the architecture docs.
Features
Slurm
Slurm is a full-featured HPC workload manager. To highlight a few features:
- Priority: assigns priorities to jobs upon submission and
on an ongoing basis (e.g. as they age).
- Preemption: stop one or more low-priority jobs to let a
high-priority job run.
- QoS: sets of policies affecting scheduling priority,
preemption, and resource limits.
- Fairshare: distribute resources equitably among users
and accounts based on historical usage.
Limitations
- Kubernetes Version: >= v1.29
- Slurm Version: >= 25.05
Installation
Install the slurm-bridge scheduler:
helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \
--namespace=slinky --create-namespace
For additional instructions, see the quickstart guide.
License
Copyright (C) SchedMD LLC.
Licensed under the Apache License, Version 2.0; you may not use this project except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
1.2 - Admission
Overview
The admission controller is a webhook for Pods. Any pods created in certain
namespaces will be modified so our scheduler will schedule them instead of the
default scheduler.
Design
Any pods created in certain namespaces will have their .spec.schedulerName
changed to our scheduler.
Managed namespaces are defined as a list of namespaces configured in the admission controller's values.yaml under managedNamespaces[].
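As a rough sketch, the relevant values.yaml entry might look like the excerpt below. Only the managedNamespaces[] key comes from this documentation; the namespace names are examples, so consult the chart's values.yaml for the authoritative layout.
# values.yaml (excerpt) -- pods created in these namespaces are rewritten
# to use the slurm-bridge scheduler. The namespace names are examples only.
managedNamespaces:
  - slurm-workloads
  - hpc-jobs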
Sequence Diagram
sequenceDiagram
autonumber
participant KAPI as Kubernetes API
participant SBA as Slurm-Bridge Admission
KAPI-->>SBA: Watch Pod Create/Update
opt Pod in managed Namespaces
SBA->>KAPI: Update `.spec.schedulerName` and Tolerations
KAPI-->>SBA: Update Response
end %% opt Pod in managed Namespaces
1.3 - Architecture
Overview
This document describes the high-level architecture of the Slinky slurm-bridge.
Big Picture

Directory Map
This project follows the conventions of:
api/
Contains Custom Kubernetes API definitions. These become Custom Resource Definitions (CRDs) and are installed into a Kubernetes cluster.
cmd/
Contains code to be compiled into binary commands.
config/
Contains YAML configuration files used for kustomize deployments.
docs/
Contains project documentation.
hack/
Contains files for development and Kubebuilder. This includes a kind.sh script that can be used to create a kind cluster with all pre-requisites for local testing.
helm/
Contains helm deployments, including the configuration files such as values.yaml.
Helm is the recommended method to install this project into your Kubernetes cluster.
internal/
Contains code that is used internally. This code is not externally importable.
internal/controller/
Contains the controllers.
Each controller is named after the Custom Resource Definition (CRD) it manages. Currently, this consists of the nodeset and the cluster CRDs.
internal/scheduler/
Contains scheduling framework plugins. Currently, this consists of slurm-bridge.
1.4 - Controllers
Overview
This component comprises multiple controllers with specialized tasks.
Node Controller
The node controller is responsible for tainting the managed nodes so the
scheduler component is fully in control of all workload that is bound to those
nodes.
Additionally, this controller will reconcile certain node states for scheduling
purposes. Slurm becomes the source of truth for scheduling among managed nodes.
A managed node is defined as a node that has a colocated kubelet and slurmd on the same physical host, and that slurm-bridge can schedule on.
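To see what the node controller has done to a given cluster, ordinary kubectl commands are enough; the exact taint key applied to managed nodes is defined by slurm-bridge and is not assumed here.
# List every node together with whatever taints are currently set on it.
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints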
sequenceDiagram
autonumber
participant KAPI as Kubernetes API
participant SWC as Slurm Workload Controller
participant SAPI as Slurm REST API
loop Reconcile Loop
KAPI-->>SWC: Watch Kubernetes Nodes
alt Node is managed
SWC->>KAPI: Taint Node
KAPI-->>SWC: Taint Node
else
SWC->>KAPI: Untaint Node
KAPI-->>SWC: Untaint Node
end %% alt Node is managed
alt Node is schedulable
SWC->>SAPI: Drain Node
SAPI-->>SWC: Drain Node
else
SWC->>SAPI: Undrain Node
SAPI-->>SWC: Undrain Node
end %% alt Node is schedulable
end %% loop Reconcile Loop
Workload Controller
The workload controller reconciles Kubernetes Pods and Slurm Jobs. Slurm is the
source of truth for what workload is allowed to run on which managed nodes.
sequenceDiagram
autonumber
participant KAPI as Kubernetes API
participant SWC as Slurm Workload Controller
participant SAPI as Slurm REST API
loop Reconcile Loop
critical Map Slurm Job to Pod
KAPI-->>SWC: Watch Kubernetes Pods
SAPI-->>SWC: Watch Slurm Jobs
option Pod is Terminated
SWC->>SAPI: Terminate Slurm Job
SAPI-->>SWC: Return Status
option Job is Terminated
SWC->>KAPI: Evict Pod
KAPI-->>SWC: Return Status
end %% critical Map Slurm Job to Pod
end %% loop Reconcile Loop
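As a rough way to observe this reconciliation from the outside, the Kubernetes and Slurm views can be compared side by side with ordinary kubectl and squeue invocations; the namespace below is a placeholder.
# Kubernetes view: pods and the nodes they are bound to (namespace is an example).
kubectl --namespace=my-workloads get pods -o wide
# Slurm view: jobs, their states, and allocated node lists.
squeue --Format=JobID,Name,State,NodeList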
1.5 - Quickstart
Overview
This quickstart guide will help you get the slurm-bridge running and configured
with your existing Slurm cluster.
Slurm Configuration
There is a set of assumptions that the slurm-bridge scheduler must make. The Slurm admin must satisfy those assumptions in their Slurm cluster.
Install slurm-bridge
Pre-Requisites
Install the pre-requisite helm charts.
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace --set crds.enabled=true
Slurm Bridge
Download the values file and install slurm-bridge from the OCI package:
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-bridge/refs/tags/v0.3.0/helm/slurm-bridge/values.yaml \
-o values-bridge.yaml
helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \
--values=values-bridge.yaml --version=0.3.0 --namespace=slinky --create-namespace
You can check if your cluster deployed successfully with:
kubectl --namespace=slinky get pods
Your output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-bridge-admission-85f89cf884-8c9jt 1/1 Running 0 1m0s
slurm-bridge-controllers-757f64b875-bsfnf 1/1 Running 0 1m0s
slurm-bridge-scheduler-5484467f55-wtspk 1/1 Running 0 1m0s
Scheduling Workload
Generally speaking, slurm-bridge translates one or more pods into a representative Slurm workload, where Slurm does the underlying scheduling. Certain optimizations can be made, depending on which resource is being translated.
slurm-bridge has specific scheduling support for JobSet and PodGroup resources and their pods. If your workload requires or benefits from co-scheduled pod launch (e.g. MPI, multi-node), consider representing your workload as a JobSet or PodGroup.
Admission Controller
slurm-bridge will only schedule pods that request slurm-bridge as their scheduler. The slurm-bridge admission controller can be configured to automatically make slurm-bridge the scheduler for all pods created in the configured namespaces.
Alternatively, a pod can specify Pod.Spec.schedulerName=slurm-bridge-scheduler from any namespace.
Annotations
Users can better inform or influence how slurm-bridge represents their Kubernetes workload within Slurm by adding annotations on the parent object.
Example “pause” bare pod to illustrate annotations:
apiVersion: v1
kind: Pod
metadata:
  name: pause
  # slurm-bridge annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi
Example “pause” deployment to illustrate annotations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  # slurm-bridge annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      schedulerName: slurm-bridge-scheduler
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.6
          resources:
            limits:
              cpu: "1"
              memory: 100Mi
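To try either example, save it to a file and apply it; the filename below is illustrative.
# Apply the manifest, then confirm which nodes the pods were bound to.
kubectl apply -f pause-deployment.yaml
kubectl get pods -l app=pause -o wide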
JobSets
This section assumes JobSet is installed.
JobSet pods will be co-scheduled and launched together. The JobSet controller is responsible for managing the JobSet status and other Pod interactions once marked as completed.
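As a sketch only, a minimal JobSet scheduled through slurm-bridge might look like the manifest below; it assumes the jobset.x-k8s.io/v1alpha2 API, and the names, counts, and image are illustrative. In a managed namespace the admission controller can inject schedulerName instead.
# Hypothetical JobSet whose pods are scheduled by slurm-bridge.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pause-jobset
spec:
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          parallelism: 4        # four pods launched together
          completions: 4
          template:
            spec:
              schedulerName: slurm-bridge-scheduler
              restartPolicy: Never
              containers:
                - name: worker
                  image: registry.k8s.io/pause:3.6
                  resources:
                    limits:
                      cpu: "1"
                      memory: 100Mi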
PodGroups
This section assumes the PodGroup CRD is installed and the out-of-tree kube-scheduler is installed and configured as a (non-primary) scheduler.
Pods contained within a PodGroup will be co-scheduled and launched together. The
PodGroup controller is responsible for managing the PodGroup status and other
Pod interactions once marked as completed.
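A minimal sketch of a PodGroup and a member pod is shown below; it assumes the scheduler-plugins scheduling.x-k8s.io/v1alpha1 API, and the group name and pod-group label key should be verified against the PodGroup version installed in your cluster.
# Hypothetical PodGroup: its pods are launched only once minMember can be co-scheduled.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: pause-group
spec:
  minMember: 2
---
# One member pod; additional members carry the same pod-group label.
apiVersion: v1
kind: Pod
metadata:
  name: pause-0
  labels:
    scheduling.x-k8s.io/pod-group: pause-group   # label key is an assumption
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi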
1.6 - Scheduler
Overview
The scheduler controller is responsible for scheduling pending pods onto
nodes.
This scheduler is designed to be a non-primary scheduler (i.e. it should not replace the default kube-scheduler). This means that only certain pods should be scheduled via this scheduler (e.g. non-critical pods).
This scheduler represents Kubernetes Pods as a Slurm Job, waits for Slurm to
schedule the Job, then informs Kubernetes on which nodes to allocate the
represented Pods. This scheduler defers scheduling decisions to Slurm, so
certain assumptions about the environment must be met for this to function
correctly.
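To see this in practice for a given pod, both sides can be inspected with standard tooling; the pod name below is a placeholder.
# Kubernetes side: the scheduling events recorded for the pod (name is an example).
kubectl describe pod pause
# Slurm side: the placeholder job, its state, and the allocated nodes.
squeue --Format=JobID,Name,State,NodeList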
Sequence Diagram
sequenceDiagram
autonumber
actor user as User
participant KAPI as Kubernetes API
participant SBS as Slurm-Bridge Scheduler
participant SAPI as Slurm REST API
loop Workload Submission
user->>KAPI: Submit Pod
KAPI-->>user: Return Request Status
end %% loop Workload Submission
loop Scheduling Loop
SBS->>KAPI: Get Next Pod in Workload Queue
KAPI-->>SBS: Return Next Pod in Workload Queue
note over SBS: Honor Slurm scheduling decision
critical Lookup Slurm Placeholder Job
SBS->>SAPI: Get Placeholder Job
SAPI-->>SBS: Return Placeholder Job
option Job is NotFound
note over SBS: Translate Pod(s) into Slurm Job
SBS->>SAPI: Submit Placeholder Job
SAPI-->>SBS: Return Submit Status
option Job is Pending
note over SBS: Check again later...
SBS->>SBS: Requeue
option Job is Allocated
note over SBS: Bind Pod(s) to Node(s) from the Slurm Job
SBS->>KAPI: Bind Pod(s) to Node(s)
KAPI-->>SBS: Return Bind Request Status
end %% Lookup Slurm Placeholder Job
end %% loop Scheduling Loop