Quickstart
This quickstart guide will help you get slurm-bridge running and configured with your existing cluster. If you’d like to try out slurm-bridge locally before deploying it on a cluster, consider following our guide for configuring a local test environment instead.
This document assumes a basic understanding of Kubernetes architecture. It is highly recommended that those who are unfamiliar with the core concepts of Kubernetes review the documentation on Kubernetes, pods, and nodes before getting started.
Pre-requisites
- A functional Slurm cluster with:
- A functional Kubernetes cluster that includes the hosts running a colocated kubelet and slurmd, and:
  - helm installed
  - A “standard” StorageClass configured
  - The DefaultStorageClass admission controller configured on your cluster’s API server
- Matching NodeNames in Slurm and Kubernetes for all overlapping nodes
  - In the event that a colocated node’s Slurm NodeName does not match its Kubernetes Node name, you should patch the Kubernetes node with a label to allow slurm-bridge to map the colocated Kubernetes and Slurm node (see the verification example after this list):
    kubectl patch node $KUBERNETES_NODENAME -p "{\"metadata\":{\"labels\":{\"slinky.slurm.net/slurm-nodename\":\"$SLURM_NODENAME\"}}}"
- cgroups/v2 configured on all hosts with a colocated kubelet and slurmd
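If you applied the NodeName mapping label described above, you can verify that it is set before installing anything, using kubectl's label-column flag:
# Shows the slinky.slurm.net/slurm-nodename label as an extra column
kubectl get node $KUBERNETES_NODENAME -L slinky.slurm.net/slurm-nodename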
Installation
1. Install the required helm charts:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
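Optionally, verify that cert-manager is running before continuing:
kubectl --namespace=cert-manager get pods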
2. Download and configure values.yaml for the slurm-bridge helm chart.
The helm chart used by slurm-bridge has a number of parameters in values.yaml that can be modified to tweak the behavior of slurm-bridge. Most of these values should work without modification.
Download values.yaml:
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-bridge/refs/tags/v0.3.0/helm/slurm-bridge/values.yaml \
  -o values-bridge.yaml
Depending on your Slurm configuration, you may need to configure the following variables:
- schedulerConfig.partition - the default partition with which slurm-bridge will associate jobs. This partition should only include nodes that have both slurmd and the kubelet running. The default value of this variable is slurm-bridge.
- sharedConfig.slurmRestApi - the URL used by slurm-bridge to interact with the Slurm REST API. Changing this value may be necessary if you run the REST API on a different URL or port. The default value of this variable is http://slurm-restapi.slurm:6820.
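For example, those two values appear in values-bridge.yaml in a form similar to the following excerpt (only the relevant keys are shown, with their default values; keep the rest of the downloaded file intact):
# values-bridge.yaml (excerpt)
schedulerConfig:
  # Default Slurm partition that slurm-bridge associates jobs with
  partition: slurm-bridge
sharedConfig:
  # URL slurm-bridge uses to reach the Slurm REST API
  slurmRestApi: http://slurm-restapi.slurm:6820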
3. Download and install the slurm-bridge package from OCI:
helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \
--values=values-bridge.yaml --version=0.3.0 --namespace=slinky --create-namespace
You can check if your cluster deployed successfully with:
kubectl --namespace=slinky get pods
Your output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-bridge-admission-85f89cf884-8c9jt 1/1 Running 0 1m0s
slurm-bridge-controllers-757f64b875-bsfnf 1/1 Running 0 1m0s
slurm-bridge-scheduler-5484467f55-wtspk 1/1 Running 0 1m0s
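You can also verify the status of the helm release itself:
helm --namespace=slinky list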
Running Your First Job
Now that slurm-bridge is configured, we can write a workload. slurm-bridge schedules Kubernetes workloads using the Slurm scheduler by translating a Kubernetes workload in the form of Jobs, JobSets, Pods, and PodGroups into a representative Slurm job, which is used for scheduling purposes. Once a workload is allocated resources, the kubelet binds the Kubernetes workload to the allocated resources and executes it. There are sample workload definitions in the slurm-bridge repo under hack/examples.
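To run the bundled examples by path, clone the repository and run the kubectl commands below from its root:
git clone https://github.com/SlinkyProject/slurm-bridge.git
cd slurm-bridge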
Here’s an example of a simple job, found in hack/examples/job/single.yaml:
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-sleep-single
  namespace: slurm-bridge
  annotations:
    slinky.slurm.net/job-name: job-sleep-single
spec:
  completions: 1
  parallelism: 1
  template:
    spec:
      containers:
        - name: sleep
          image: busybox:stable
          command: [sh, -c, sleep 3]
          resources:
            requests:
              cpu: "1"
              memory: 100Mi
            limits:
              cpu: "1"
              memory: 100Mi
      restartPolicy: Never
Let’s run this job:
❯ kubectl apply -f hack/examples/job/single.yaml
job.batch/job-sleep-single created
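Because the job only sleeps for a few seconds, it completes quickly; you can watch the pod’s progress with:
kubectl --namespace=slurm-bridge get pods --watch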
At this point, Kubernetes has dispatched our job, Slurm has scheduled it, and it has run to completion. Let’s take a look at each place that our job shows up.
On the Slurm side, we can observe the placeholder job that was used to schedule our workload:
slurm@slurm-controller-0:/tmp$ scontrol show jobs
JobId=1 JobName=job-sleep-single
UserId=slurm(401) GroupId=slurm(401) MCS_label=kubernetes
Priority=1 Nice=0 Account=(null) QOS=normal
JobState=CANCELLED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:08 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2025-07-10T15:52:53 EligibleTime=2025-07-10T15:52:53
AccrueTime=2025-07-10T15:52:53
StartTime=2025-07-10T15:52:53 EndTime=2025-07-10T15:53:01 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-07-10T15:52:53 Scheduler=Main
Partition=slurm-bridge AllocNode:Sid=10.244.5.5:1
ReqNodeList=(null) ExcNodeList=(null)
NodeList=slurm-bridge-1
BatchHost=slurm-bridge-1
StepMgrEnabled=Yes
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=96046M,node=1,billing=1
AllocTRES=cpu=4,mem=96046M,node=1,billing=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
Command=(null)
WorkDir=/tmp
AdminComment={"pods":["slurm-bridge/job-sleep-single-8wtc2"]}
OOMKillStep=0
Note that the Command field is equal to (null), and that the JobState field is equal to CANCELLED. This is because this Slurm job is only a placeholder: no work is actually done by the placeholder. Instead, the job is cancelled upon allocation so that the kubelet can bind the workload to the selected node(s) for the duration of the job.
We can also look at this job using kubectl:
❯ kubectl describe job --namespace=slurm-bridge job-sleep-single
Name: job-sleep-single
Namespace: slurm-bridge
Selector: batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
Labels: batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
batch.kubernetes.io/job-name=job-sleep-single
controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
job-name=job-sleep-single
Annotations: slinky.slurm.net/job-name: job-sleep-single
Parallelism: 1
Completions: 1
Completion Mode: NonIndexed
Start Time: Thu, 10 Jul 2025 09:52:53 -0600
Completed At: Thu, 10 Jul 2025 09:53:02 -0600
Duration: 9s
Pods Statuses: 0 Active (0 Ready) / 1 Succeeded / 0 Failed
Pod Template:
Labels: batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
batch.kubernetes.io/job-name=job-sleep-single
controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
job-name=job-sleep-single
Containers:
sleep:
Image: busybox:stable
Port: <none>
Host Port: <none>
Command:
sh
-c
sleep 3
Limits:
cpu: 1
memory: 100Mi
Requests:
cpu: 1
memory: 100Mi
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 14m job-controller Created pod: job-sleep-single-8wtc2
Normal Completed 14m job-controller Job completed
As Kubernetes is the context in which this job actually executed, this is generally the more useful of the two outputs.
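You can also retrieve the pod’s logs through the Job reference, although this particular workload only sleeps and produces no output:
kubectl --namespace=slurm-bridge logs job/job-sleep-single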
Celebrate!
At this point, you should have a cluster running slurm-bridge. Recommended next steps include reading through creating a workload, learning more about the architecture of slurm-bridge, or browsing our how-to guides on administrative tasks.