0.3.x

1 - Quickstart

This quickstart guide will help you get slurm-bridge running and configured with your existing cluster.

If you’d like to try out slurm-bridge locally before deploying it on a cluster, consider following our guide for configuring a local test environment instead.

This document assumes a basic understanding of Kubernetes architecture. It is highly recommended that those who are unfamiliar with the core concepts of Kubernetes review the documentation on Kubernetes, pods, and nodes before getting started.

Pre-requisites


  • A functional Slurm cluster with:
    • A set of hosts within the cluster that are running both a kubelet and slurmd
    • At least one partition consisting solely of nodes with the above configuration
    • MCS labels enabled:
      # slurm.conf
      ...
      MCSPlugin=mcs/label
      MCSParameters=ondemand,ondemandselect
      
  • A functional Kubernetes cluster that includes the hosts running colocated kubelet and slurmd
  • Matching NodeNames in Slurm and Kubernetes for all overlapping nodes
    • In the event that the colocated node’s Slurm NodeName does not match the Kubernetes Node name, you should patch the Kubernetes node with a label to allow slurm-bridge to map the colocated Kubernetes and Slurm node.
      kubectl patch node $KUBERNETES_NODENAME -p "{\"metadata\":{\"labels\":{\"slinky.slurm.net/slurm-nodename\":\"$SLURM_NODENAME\"}}}"
      
  • cgroups/v2 configured on all hosts with a colocated kubelet and slurmd
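
A quick way to sanity-check the last two prerequisites on a colocated host (illustrative commands; run them on each host that runs both a kubelet and slurmd):

# Should print "cgroup2fs" when cgroup v2 is in use
stat -fc %T /sys/fs/cgroup

# Compare the Slurm NodeName with the Kubernetes node names
scontrol show node "$(hostname)" | grep -o 'NodeName=[^ ]*'
kubectl get nodes --output=name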

Installation


1. Install the required helm charts:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
	--namespace cert-manager --create-namespace --set crds.enabled=true
2. Download and configure values.yaml for the slurm-bridge helm chart

The helm chart used by slurm-bridge has a number of parameters in values.yaml that can be modified to tweak various aspects of slurm-bridge. Most of these values should work without modification.

Downloading values.yaml:

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-bridge/refs/tags/v0.3.0/helm/slurm-bridge/values.yaml \
  -o values-bridge.yaml

Depending on your Slurm configuration, you may need to configure the following variables:

  • schedulerConfig.partition - this is the default partition with which slurm-bridge will associate jobs. This partition should only include nodes that have both slurmd and the kubelet running. The default value of this variable is slurm-bridge.
  • sharedConfig.slurmRestApi - the URL used by slurm-bridge to interact with the Slurm REST API. Changing this value may be necessary if you run the REST API on a different URL or port. The default value of this variable is http://slurm-restapi.slurm:6820
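
For example, both values can be set explicitly in values-bridge.yaml. The snippet below simply restates the documented defaults; it assumes schedulerConfig and sharedConfig are top-level sections of the chart values:

schedulerConfig:
  partition: slurm-bridge
sharedConfig:
  slurmRestApi: http://slurm-restapi.slurm:6820
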
3. Download and install the slurm-bridge package from OCI:
helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \
  --values=values-bridge.yaml --version=0.3.0 --namespace=slinky --create-namespace

You can check if your cluster deployed successfully with:

kubectl --namespace=slinky get pods

Your output should be similar to:

NAME                                        READY   STATUS    RESTARTS      AGE
slurm-bridge-admission-85f89cf884-8c9jt     1/1     Running   0             1m0s
slurm-bridge-controllers-757f64b875-bsfnf   1/1     Running   0             1m0s
slurm-bridge-scheduler-5484467f55-wtspk     1/1     Running   0             1m0s

Running Your First Job


Now that slurm-bridge is configured, we can write a workload. slurm-bridge schedules Kubernetes workloads using the Slurm scheduler by translating a Kubernetes workload (a Job, JobSet, Pod, or PodGroup) into a representative Slurm job, which is used for scheduling purposes. Once the workload is allocated resources, the kubelet binds the Kubernetes workload to the allocated resources and executes it. There are sample workload definitions in the slurm-bridge repository under hack/examples/.

Here’s an example of a simple job, found in hack/examples/job/single.yaml:

---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-sleep-single
  namespace: slurm-bridge
  annotations:
    slinky.slurm.net/job-name: job-sleep-single
spec:
  completions: 1
  parallelism: 1
  template:
    spec:
      containers:
        - name: sleep
          image: busybox:stable
          command: [sh, -c, sleep 3]
          resources:
            requests:
              cpu: "1"
              memory: 100Mi
            limits:
              cpu: "1"
              memory: 100Mi
      restartPolicy: Never

Let’s run this job:

❯ kubectl apply -f hack/examples/job/single.yaml
job.batch/job-sleep-single created
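
While the job is running, you can watch its pod's status change (the pod name suffix will differ on your cluster):

kubectl get pods --namespace=slurm-bridge --watch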

At this point, Kubernetes has dispatched our job, Slurm has scheduled it, and it has run to completion. Let’s take a look at each place that our job shows up.

On the Slurm side, we can observe the placeholder job that was used to schedule our workload:

slurm@slurm-controller-0:/tmp$ scontrol show jobs
JobId=1 JobName=job-sleep-single
   UserId=slurm(401) GroupId=slurm(401) MCS_label=kubernetes
   Priority=1 Nice=0 Account=(null) QOS=normal
   JobState=CANCELLED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:08 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2025-07-10T15:52:53 EligibleTime=2025-07-10T15:52:53
   AccrueTime=2025-07-10T15:52:53
   StartTime=2025-07-10T15:52:53 EndTime=2025-07-10T15:53:01 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-07-10T15:52:53 Scheduler=Main
   Partition=slurm-bridge AllocNode:Sid=10.244.5.5:1
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=slurm-bridge-1
   BatchHost=slurm-bridge-1
   StepMgrEnabled=Yes
   NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=96046M,node=1,billing=1
   AllocTRES=cpu=4,mem=96046M,node=1,billing=4
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
   Command=(null)
   WorkDir=/tmp
   AdminComment={"pods":["slurm-bridge/job-sleep-single-8wtc2"]}
   OOMKillStep=0

Note that the Command field is equal to (null), and that the JobState field is equal to CANCELLED. This is because this Slurm job is only a placeholder - no work is actually done by the placeholder job itself. Instead, the job is cancelled upon allocation so that the kubelet can bind the workload to the selected node(s) for the duration of the job.
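
The AdminComment field is what ties the placeholder back to Kubernetes: it records the pod(s) that the job represents. For example, to pull just that mapping for the job shown above (job ID 1 in this example):

scontrol show job 1 | grep AdminComment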

We can also look at this job using kubectl:

❯ kubectl describe job --namespace=slurm-bridge job-sleep-single
Name:             job-sleep-single
Namespace:        slurm-bridge
Selector:         batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
Labels:           batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
                  batch.kubernetes.io/job-name=job-sleep-single
                  controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
                  job-name=job-sleep-single
Annotations:      slinky.slurm.net/job-name: job-sleep-single
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Thu, 10 Jul 2025 09:52:53 -0600
Completed At:     Thu, 10 Jul 2025 09:53:02 -0600
Duration:         9s
Pods Statuses:    0 Active (0 Ready) / 1 Succeeded / 0 Failed
Pod Template:
  Labels:  batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
           batch.kubernetes.io/job-name=job-sleep-single
           controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
           job-name=job-sleep-single
  Containers:
   sleep:
    Image:      busybox:stable
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      sleep 3
    Limits:
      cpu:     1
      memory:  100Mi
    Requests:
      cpu:        1
      memory:     100Mi
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  14m   job-controller  Created pod: job-sleep-single-8wtc2
  Normal  Completed         14m   job-controller  Job completed

As Kubernetes is the context in which this job actually executed, this is generally the more useful of the two outputs.

Celebrate!

At this point, you should have a cluster running slurm-bridge.

Recommended next steps include reading about creating a workload, learning more about the architecture of slurm-bridge, or browsing our how-to guides on administrative tasks.

2 - Concepts

Concepts related to slurm-bridge internals and design.

2.1 - Admission

Overview


The Kubernetes documentation defines admission controllers as:

a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the resource, but after the request is authenticated and authorized.

It also states that:

Admission control mechanisms may be validating, mutating, or both. Mutating controllers may modify the data for the resource being modified; validating controllers may not.

The slurm-bridge admission controller is a mutating controller. It modifies any pods within certain namespaces (slurm-bridge, by default) to use the slurm-bridge scheduler instead of the default Kube scheduler.

Design


Any pod created in a managed namespace will have its .spec.schedulerName changed to the slurm-bridge scheduler.

Managed namespaces are defined as a list of namespaces configured in the admission controller’s values.yaml under managedNamespaces[].
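
As a sketch, that configuration might look like the following; nesting the list under an admission section is an assumption here, so check the chart's values.yaml for the exact key path in your version:

admission:
  managedNamespaces:
    - slurm-bridge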

Sequence Diagram

sequenceDiagram
  autonumber

  participant KAPI as Kubernetes API
  participant SBA as Slurm-Bridge Admission

  KAPI-->>SBA: Watch Pod Create/Update
  opt Pod in managed Namespaces
    SBA->>KAPI: Update `.spec.schedulerName` and Tolerations
    KAPI-->>SBA: Update Response
  end %% opt Pod in managed Namespaces

2.2 - Architecture

Overview


This document describes the high-level architecture of the Slinky slurm-bridge.

Big Picture


[Architecture diagram: high-level overview of the slurm-bridge components]

Directory Map


This project follows the conventions of Kubebuilder and the Standard Go Project Layout:

api/

Contains Custom Kubernetes API definitions. These become Custom Resource Definitions (CRDs) and are installed into a Kubernetes cluster.

cmd/

Contains code to be compiled into binary commands.

config/

Contains yaml configuration files used for kustomize deployments

docs/

Contains project documentation.

hack/

Contains files for development and Kubebuilder. This includes a kind.sh script that can be used to create a kind cluster with all pre-requisites for local testing.

helm/

Contains helm deployments, including the configuration files such as values.yaml.

Helm is the recommended method to install this project into your Kubernetes cluster.

internal/

Contains code that is used internally. This code is not externally importable.

internal/controller/

Contains the controllers.

Each controller is named after the Custom Resource Definition (CRD) it manages. Currently, this consists of the nodeset and the cluster CRDs.

internal/scheduler/

Contains scheduling framework plugins. Currently, this consists of slurm-bridge.

2.3 - Controllers

Overview


The Kubernetes documentation defines controllers as:

control loops that watch the state of your cluster, then make or request changes where needed. Each controller tries to move the current cluster state closer to the desired state.

Within slurm-bridge, there are multiple controllers that manage the state of different bridge components:

  • Node Controller - Responsible for the state of nodes in the bridge cluster
  • Workload Controller - Responsible for the state of pods and other workloads running on slurm-bridge

Node Controller


The node controller is responsible for tainting the managed nodes so the scheduler component is fully in control of all workloads bound to those nodes.

Additionally, this controller will reconcile certain node states for scheduling purposes. Slurm becomes the source of truth for scheduling among managed nodes.

A managed node is defined as a node that has a colocated kubelet and slurmd on the same physical host and that slurm-bridge can schedule on.
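
To see the result on a given managed node, you can inspect its taints and labels (the node name below is a placeholder):

kubectl describe node <node-name> | grep -A2 Taints
kubectl get node <node-name> --show-labels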

sequenceDiagram
  autonumber

  participant KAPI as Kubernetes API
  participant SWC as Slurm Workload Controller
  participant SAPI as Slurm REST API

  loop Reconcile Loop
    KAPI-->>SWC: Watch Kubernetes Nodes

    alt Node is managed
      SWC->>KAPI: Taint Node
      KAPI-->>SWC: Taint Node
    else
      SWC->>KAPI: Untaint Node
      KAPI-->>SWC: Untaint Node
    end %% alt Node is managed

    alt Node is schedulable
      SWC->>SAPI: Drain Node
      SAPI-->>SWC: Drain Node
    else
      SWC->>SAPI: Undrain Node
      SAPI-->>SWC: Undrain Node
    end %% alt Node is schedulable

  end %% loop Reconcile Loop

Workload Controller


The workload controller reconciles Kubernetes Pods and Slurm Jobs. Slurm is the source of truth for which workloads are allowed to run on which managed nodes.

sequenceDiagram
  autonumber

  participant KAPI as Kubernetes API
  participant SWC as Slurm Workload Controller
  participant SAPI as Slurm REST API

  loop Reconcile Loop

  critical Map Slurm Job to Pod
    KAPI-->>SWC: Watch Kubernetes Pods
    SAPI-->>SWC: Watch Slurm Jobs
  option Pod is Terminated
    SWC->>SAPI: Terminate Slurm Job
    SAPI-->>SWC: Return Status
  option Job is Terminated
    SWC->>KAPI: Evict Pod
    KAPI-->>SWC: Return Status
  end %% critical Map Slurm Job to Pod

  end %% loop Reconcile Loop

2.4 - Scheduler

Overview


In Kubernetes, scheduling refers to making sure that pods are matched to nodes so that the kubelet can run them.

The scheduler controller in slurm-bridge is responsible for scheduling eligible pods onto nodes that are managed by slurm-bridge. In doing so, the slurm-bridge scheduler interacts with the Slurm REST API to acquire allocations for its workloads. In slurm-bridge, slurmctld serves as the source of truth for scheduling decisions.

Design


This scheduler is designed to be a non-primary scheduler (i.e. it should not replace the default kube-scheduler). This means that only certain pods should be scheduled via this scheduler (e.g. non-critical pods).

This scheduler represents Kubernetes Pods as a Slurm Job, waits for Slurm to schedule the Job, then informs Kubernetes on which nodes to allocate the represented Pods. This scheduler defers scheduling decisions to Slurm, hence certain assumptions about the environment must be met for this to function correctly.
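
One practical consequence of this design is that pending Kubernetes pods appear as placeholder jobs on the Slurm side. Assuming the default slurm-bridge partition, you can list them with:

squeue --partition=slurm-bridge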

Sequence Diagram

sequenceDiagram
  autonumber

  actor user as User
  participant KAPI as Kubernetes API
  participant SBS as Slurm-Bridge Scheduler
  participant SAPI as Slurm REST API

  loop Workload Submission
    user->>KAPI: Submit Pod
    KAPI-->>user: Return Request Status
  end %% loop Workload Submission

  loop Scheduling Loop
    SBS->>KAPI: Get Next Pod in Workload Queue
    KAPI-->>SBS: Return Next Pod in Workload Queue

    note over SBS: Honor Slurm scheduling decision
    critical Lookup Slurm Placeholder Job
      SBS->>SAPI: Get Placeholder Job
      SAPI-->>SBS: Return Placeholder Job
    option Job is NotFound
      note over SBS: Translate Pod(s) into Slurm Job
      SBS->>SAPI: Submit Placeholder Job
      SAPI-->>SBS: Return Submit Status
    option Job is Pending
      note over SBS: Check again later...
      SBS->>SBS: Requeue
    option Job is Allocated
      note over SBS: Bind Pod(s) to Node(s) from the Slurm Job
      SBS->>KAPI: Bind Pod(s) to Node(s)
      KAPI-->>SBS: Return Bind Request Status
    end %% Lookup Slurm Placeholder Job
  end %% loop Scheduling Loop

3 - Tasks

Guides to tasks related to the administration of a cluster running slurm-bridge.

3.1 - Running slurm-bridge locally

You may want to run slurm-bridge on a single machine in order to test the software or familiarize yourself with it prior to installing it on your cluster. This should only be done for testing and evaluation purposes and should not be used for production environments.

We have provided the hack/kind.sh script to do this using kind.

This document assumes a basic understanding of Kubernetes architecture. It is highly recommended that those who are unfamiliar with the core concepts of Kubernetes review the documentation on Kubernetes, pods, and nodes before getting started.

Pre-requisites


  • go 1.17+ must be installed on your system

Setting up your environment


  1. Install Kind using go install:
go install sigs.k8s.io/kind@v0.29.0

If you get kind: command not found when running the next step, you may need to add $GOPATH/bin to your PATH:

export GOPATH=$HOME/go
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
  2. Confirm that kind is working properly by running the following commands:
kind create cluster

kubectl get nodes --all-namespaces

kind delete cluster
  3. Clone the slurm-bridge repo and enter it:
git clone git@github.com:SlinkyProject/slurm-bridge.git
cd slurm-bridge

Installing slurm-bridge within your environment


Provided with slurm-bridge is the script hack/kind.sh that interfaces with kind to deploy the slurm-bridge helm chart within your local environment.

  1. Create your cluster using hack/kind.sh:
hack/kind.sh --bridge
  2. Familiarize yourself with and use your test environment:
kubectl get pods --namespace=slurm-bridge
kubectl get pods --namespace=slurm
kubectl get pods --namespace=slinky

Celebrate!

At this point, you should have a kind cluster running slurm-bridge.

Cleaning up


hack/kind.sh also provides a mechanism to destroy your test environment. To destroy your kind cluster, run:

hack/kind.sh --delete

3.2 - Creating a Workload

In Slurm, all workloads are represented by jobs. In slurm-bridge, however, there are a number of forms that workloads can take. While workloads can still be submitted as a Slurm job, slurm-bridge also enables users to submit workloads through Kubernetes. Most workloads that can be submitted to slurm-bridge from within Kubernetes are represented by an existing Kubernetes batch workload primitive.

At this time, slurm-bridge has scheduling support for Jobs, JobSets, Pods, and PodGroups. If your workload requires or benefits from co-scheduled pod launch (e.g. MPI, multi-node), consider representing your workload as a JobSet or PodGroup.

Using the slurm-bridge Scheduler


slurm-bridge uses an admission controller to control which resources are scheduled using the slurm-bridge-scheduler. The slurm-bridge-scheduler is designed as a non-primary scheduler and is not intended to replace the default kube-scheduler. Only pods that request slurm-bridge as their scheduler, or that are created in a configured namespace, are scheduled by slurm-bridge. By default, the admission controller automatically sets slurm-bridge as the scheduler for all pods in the configured namespaces.

Alternatively, a pod in any namespace can specify Pod.Spec.schedulerName=slurm-bridge-scheduler to indicate that it should be scheduled using the slurm-bridge-scheduler.

You can learn more about the slurm-bridge admission controller in the Admission section of the concepts documentation.
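
To confirm which scheduler a pod was assigned (managed pods should report slurm-bridge-scheduler), you can query its spec; the pod name here is a placeholder:

kubectl get pod <pod-name> --namespace=slurm-bridge \
  --output=jsonpath='{.spec.schedulerName}{"\n"}'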

Annotations


Users can inform or influence how slurm-bridge represents their Kubernetes workload within Slurm by adding annotations to the parent object.

Example “pause” bare pod to illustrate annotations:

apiVersion: v1
kind: Pod
metadata:
  name: pause
  # `slurm-bridge` annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi

Example “pause” deployment to illustrate annotations:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  # `slurm-bridge` annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      schedulerName: slurm-bridge-scheduler
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.6
          resources:
            limits:
              cpu: "1"
              memory: 100Mi

JobSets


This section assumes the JobSet controller is installed.

JobSet pods will be coscheduled and launched together. The JobSet controller is responsible for managing the JobSet status and other Pod interactions once marked as completed.
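
As an illustration only (this is a sketch, not one of the repository's examples, and the names are made up), a minimal JobSet submitted to a managed namespace might look like the following; the API version follows upstream JobSet (jobset.x-k8s.io/v1alpha2), so adjust it to the version you have installed:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: jobset-sleep
  namespace: slurm-bridge
spec:
  replicatedJobs:
    - name: workers
      replicas: 2
      template:
        spec:
          completions: 1
          parallelism: 1
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: sleep
                  image: busybox:stable
                  command: [sh, -c, sleep 3]
                  resources:
                    limits:
                      cpu: "1"
                      memory: 100Mi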

PodGroups


This section assumes the PodGroup CRD is installed and that the out-of-tree kube-scheduler is installed and configured as a (non-primary) scheduler.

Pods contained within a PodGroup will be co-scheduled and launched together. The PodGroup controller is responsible for managing the PodGroup status and other Pod interactions once marked as completed.
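
For reference, a PodGroup and one of its member pods might be declared as follows. This is a sketch based on the upstream scheduler-plugins conventions (API group scheduling.x-k8s.io/v1alpha1 and the scheduling.x-k8s.io/pod-group label); verify the schema against the version you installed, and create at least minMember pods carrying the same label:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: pg-sleep
  namespace: slurm-bridge
spec:
  minMember: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: pg-sleep-0
  namespace: slurm-bridge
  labels:
    scheduling.x-k8s.io/pod-group: pg-sleep
spec:
  schedulerName: slurm-bridge-scheduler
  restartPolicy: Never
  containers:
    - name: sleep
      image: busybox:stable
      command: [sh, -c, sleep 30]
      resources:
        limits:
          cpu: "1"
          memory: 100Mi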