Quickstart

This quickstart guide will help you get slurm-bridge running and configured with your existing cluster.

If you’d like to try out slurm-bridge locally before deploying it on a cluster, consider following our guide for configuring a local test environment instead.

This document assumes a basic understanding of Kubernetes architecture. It is highly recommended that those who are unfamiliar with the core concepts of Kubernetes review the documentation on Kubernetes, pods, and nodes before getting started.

Pre-requisites


  • A functional Slurm cluster with:
    • A set of hosts within the cluster that are running both a kubelet and slurmd
    • At least one partition consisting solely of nodes with the above configuration
    • MCS labels enabled:
      # slurm.conf
      ...
      MCSPlugin=mcs/label
      MCSParameters=ondemand,ondemandselect
      
  • A functional Kubernetes cluster that includes the hosts running a colocated kubelet and slurmd
  • Matching NodeNames in Slurm and Kubernetes for all overlapping nodes
    • In the event that the colocated node’s Slurm NodeName does not match the Kubernetes Node name, you should patch the Kubernetes node with a label to allow slurm-bridge to map the colocated Kubernetes and Slurm node.
      kubectl patch node $KUBERNETES_NODENAME -p "{\"metadata\":{\"labels\":{\"slinky.slurm.net/slurm-nodename\":\"$SLURM_NODENAME\"}}}"
      
  • cgroups/v2 configured on all hosts with a colocated kubelet and slurmd (a sketch for verifying these prerequisites follows this list)
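
A quick way to sanity-check these prerequisites is sketched below. The partition name slurm-bridge and the mapping label are taken from this guide; adjust them to your environment:

# Confirm MCS labeling is enabled in slurm.conf
scontrol show config | grep -i MCS

# Confirm the colocated partition exists and contains the expected nodes
sinfo --partition=slurm-bridge

# Confirm cgroups/v2 on each colocated host (should print "cgroup2fs")
stat -fc %T /sys/fs/cgroup

# Confirm Kubernetes node names (or mapping labels) line up with Slurm NodeNames
kubectl get nodes --label-columns=slinky.slurm.net/slurm-nodename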

Installation


1. Install the required helm charts:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
	--namespace cert-manager --create-namespace --set crds.enabled=true
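
Before moving on, it is worth confirming that cert-manager came up cleanly (namespace taken from the command above):

kubectl --namespace=cert-manager get pods
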
2. Download and configure values.yaml for the slurm-bridge helm chart

The slurm-bridge helm chart exposes a number of parameters in values.yaml that can be modified to tweak the behavior of slurm-bridge. Most of these values should work without modification.

Downloading values.yaml:

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-bridge/refs/tags/v0.3.0/helm/slurm-bridge/values.yaml \
  -o values-bridge.yaml

Depending on your Slurm configuration, you may need to configure the following variables (an example excerpt is shown after this list):

  • schedulerConfig.partition - the default partition with which slurm-bridge will associate jobs. This partition should only include nodes that run both slurmd and the kubelet. The default value of this variable is slurm-bridge.
  • sharedConfig.slurmRestApi - the URL used by slurm-bridge to interact with the Slurm REST API. Changing this value may be necessary if you run the REST API on a different host or port. The default value of this variable is http://slurm-restapi.slurm:6820
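
For reference, a minimal override of these two values in values-bridge.yaml could look like the following, assuming the dotted parameter names map onto nested YAML keys as is usual for helm charts. The values shown are simply the defaults repeated:

# values-bridge.yaml (excerpt)
schedulerConfig:
  # Partition that contains only nodes running both slurmd and a kubelet
  partition: slurm-bridge

sharedConfig:
  # Adjust if the Slurm REST API is served on a different host or port
  slurmRestApi: http://slurm-restapi.slurm:6820
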
3. Install the slurm-bridge chart from the OCI registry:
helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \
  --values=values-bridge.yaml --version=0.3.0 --namespace=slinky --create-namespace

You can check whether slurm-bridge deployed successfully with:

kubectl --namespace=slinky get pods

Your output should be similar to:

NAME                                        READY   STATUS    RESTARTS      AGE
slurm-bridge-admission-85f89cf884-8c9jt     1/1     Running   0             1m0s
slurm-bridge-controllers-757f64b875-bsfnf   1/1     Running   0             1m0s
slurm-bridge-scheduler-5484467f55-wtspk     1/1     Running   0             1m0s
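
If any of these pods are not Running, their logs are the first place to look. The deployment names below are inferred from the pod names above:

# Deployment names inferred from the sample output above
kubectl --namespace=slinky logs deployment/slurm-bridge-controllers
kubectl --namespace=slinky logs deployment/slurm-bridge-scheduler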

Running Your First Job


Now that slurm-bridge is configured, we can write a workload. slurm-bridge schedules Kubernetes workloads with the Slurm scheduler by translating a Kubernetes workload (a Job, JobSet, Pod, or PodGroup) into a representative Slurm job, which is used for scheduling purposes. Once the workload is allocated resources, the kubelet binds the Kubernetes workload to the allocated resources and executes it. There are sample workload definitions in the hack/examples directory of the slurm-bridge repo.

Here’s an example of a simple job, found in hack/examples/job/single.yaml:

---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-sleep-single
  namespace: slurm-bridge
  annotations:
    slinky.slurm.net/job-name: job-sleep-single
spec:
  completions: 1
  parallelism: 1
  template:
    spec:
      containers:
        - name: sleep
          image: busybox:stable
          command: [sh, -c, sleep 3]
          resources:
            requests:
              cpu: "1"
              memory: 100Mi
            limits:
              cpu: "1"
              memory: 100Mi
      restartPolicy: Never

Let’s run this job:

❯ kubectl apply -f hack/examples/job/single.yaml
job.batch/job-sleep-single created
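
The job only sleeps for a few seconds; you can follow it through to completion with:

kubectl --namespace=slurm-bridge get jobs,pods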

At this point, Kubernetes has dispatched our job, Slurm has scheduled it, and it has run to completion. Let’s take a look at each place that our job shows up.

On the Slurm side, we can observe the placeholder job that was used to schedule our workload:

slurm@slurm-controller-0:/tmp$ scontrol show jobs
JobId=1 JobName=job-sleep-single
   UserId=slurm(401) GroupId=slurm(401) MCS_label=kubernetes
   Priority=1 Nice=0 Account=(null) QOS=normal
   JobState=CANCELLED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:08 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2025-07-10T15:52:53 EligibleTime=2025-07-10T15:52:53
   AccrueTime=2025-07-10T15:52:53
   StartTime=2025-07-10T15:52:53 EndTime=2025-07-10T15:53:01 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-07-10T15:52:53 Scheduler=Main
   Partition=slurm-bridge AllocNode:Sid=10.244.5.5:1
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=slurm-bridge-1
   BatchHost=slurm-bridge-1
   StepMgrEnabled=Yes
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=96046M,node=1,billing=1
   AllocTRES=cpu=4,mem=96046M,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
   Command=(null)
   WorkDir=/tmp
   AdminComment={"pods":["slurm-bridge/job-sleep-single-8wtc2"]}
   OOMKillStep=0

Note that the Command field is equal to (null), and that the JobState field is equal to CANCELLED. This is because this Slurm job is only a placeholder: no work is actually done by the placeholder itself. Instead, the job is cancelled upon allocation so that the kubelet can bind the workload to the selected node(s) for the duration of the job.
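
Completed placeholder jobs eventually age out of scontrol show jobs. If Slurm accounting is configured in your cluster, the same record remains queryable with sacct, for example:

# JobId 1 matches the placeholder job shown above
sacct -j 1 --format=JobID,JobName,State,AllocTRES%40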

We can also look at this job using kubectl:

❯ kubectl describe job --namespace=slurm-bridge job-sleep-single
Name:             job-sleep-single
Namespace:        slurm-bridge
Selector:         batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
Labels:           batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
                  batch.kubernetes.io/job-name=job-sleep-single
                  controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
                  job-name=job-sleep-single
Annotations:      slinky.slurm.net/job-name: job-sleep-single
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Thu, 10 Jul 2025 09:52:53 -0600
Completed At:     Thu, 10 Jul 2025 09:53:02 -0600
Duration:         9s
Pods Statuses:    0 Active (0 Ready) / 1 Succeeded / 0 Failed
Pod Template:
  Labels:  batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
           batch.kubernetes.io/job-name=job-sleep-single
           controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
           job-name=job-sleep-single
  Containers:
   sleep:
    Image:      busybox:stable
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      sleep 3
    Limits:
      cpu:     1
      memory:  100Mi
    Requests:
      cpu:        1
      memory:     100Mi
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  14m   job-controller  Created pod: job-sleep-single-8wtc2
  Normal  Completed         14m   job-controller  Job completed

As Kubernetes is the context in which this job actually executed, this is generally the more useful of the two outputs.
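
Because the pod object still exists after completion, the workload's own output (empty for this sleep example, but useful for real workloads) can also be retrieved directly:

kubectl --namespace=slurm-bridge logs job/job-sleep-single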

Celebrate!

At this point, you should have a cluster running slurm-bridge.

Recommended next steps include reading through creating a workload, learning more about the architecture of slurm-bridge, or browsing our how-to guides on administrative tasks.