Documentation
Welcome to Slinky, a set of powerful integration tools
designed to bring Slurm
capabilities into Kubernetes. Whether you’re managing
high-performance computing (HPC) workloads or operating within cloud-native
environments, Slinky helps bring together the best of both worlds for efficient
resource management and scheduling.
Slinky is made by SchedMD, the lead developers of Slurm.

1 - slurm-bridge
Download the slurm-bridge repository
here, start using bridge with
the quickstart guide, or read on to learn more.
Slurm and Kubernetes are workload managers originally designed for different
kinds of workloads. Kubernetes excels at scheduling workloads that run for an
indefinite amount of time, with potentially vague resource requirements, on a
single node, with loose policy, but can scale its resource pool infinitely to
meet demand; Slurm excels at quickly scheduling workloads that run for a finite
amount of time, with well defined resource requirements and topology, on
multiple nodes, with strict policy, and a known resource pool.
Why you need slurm-bridge
and what it can do
This project enables users to take advantage of the best features of both
workload managers. It contains a Kubernetes scheduler to manage select
workloads from Kubernetes, which allows for co-location of Kubernetes and Slurm
workloads within the same cluster. This means the same hardware can be used to
run both traditional HPC and cloud-like workloads, reducing operating costs.
Using slurm-bridge
, workloads can be submitted from within a Kubernetes
context as a Pod
, PodGroup
, Job
, or JobSet
, or from a Slurm context
using salloc
or sbatch
. Workloads submitted via Slurm will execute as they
would in a Slurm-only environment, using slurmd
. Workloads submitted from
Kubernetes will have their resource requirements translated into a
representative Slurm job by slurm-bridge
. That job will serve as a placeholder
and will be scheduled by the Slurm controller. Upon resource allocation to a K8s
workload by the Slurm controller, slurm-bridge
will bind the workload’s pod(s)
to the allocated node(s). At that point, the kubelet will launch and run the pod
the same as it would within a standard Kubernetes instance.
For additional architectural notes, see the
architecture docs.
Features
slurm-bridge
enables scheduling of Kubernetes workloads using the Slurm
scheduler, and can take advantage of most of the scheduling features of Slurm
itself. These include:
- Priority: assigns priorities to jobs upon submission and
on an ongoing basis (e.g. as they age).
- Preemption: stop one or more low-priority jobs to let a
high-priority job run.
- QoS: sets of policies affecting scheduling priority,
preemption, and resource limits.
- Fairshare: distribute resources equitably among users
and accounts based on historical usage.
- Reservations: reserve resources for select users or groups.
Supported Versions
- Kubernetes Version: >= v1.29
- Slurm Version: >= 25.05
Current Limitations
- Exclusive, whole node allocations are made for each pod.
Get started using slurm-bridge
with the quickstart guide!
Versions:
1.1 - 0.3.x
1.1.1 - Quickstart
This quickstart guide will help you get slurm-bridge
running and configured
with your existing cluster.
If you’d like to try out slurm-bridge
locally before deploying it on a
cluster, consider following our guide for configuring a local test environment
instead.
This document assumes a basic understanding of
Kubernetes architecture. It
is highly recommended that those who are unfamiliar with the core concepts of
Kubernetes review the documentation on
Kubernetes,
pods, and
nodes before getting
started.
Pre-requisites
- A functional Slurm cluster with:
  - cgroups/v2 configured on all hosts with a colocated kubelet and slurmd
- A functional Kubernetes cluster that includes the hosts running a colocated
  kubelet and slurmd, and:
  - Matching NodeNames in Slurm and Kubernetes for all overlapping nodes
  - In the event that a colocated node's Slurm NodeName does not match its
    Kubernetes Node name, patch the Kubernetes node with a label that allows
    slurm-bridge to map the colocated Kubernetes and Slurm nodes:
kubectl patch node $KUBERNETES_NODENAME -p "{\"metadata\":{\"labels\":{\"slinky.slurm.net/slurm-nodename\":\"$SLURM_NODENAME\"}}}"
Installation
1. Install the required helm charts:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
2. Download and configure values.yaml for the slurm-bridge helm chart.
The helm chart used by slurm-bridge has a number of parameters in values.yaml
that can be modified to tweak various aspects of slurm-bridge. Most of these
values should work without modification.
Download values.yaml:
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-bridge/refs/tags/v0.3.0/helm/slurm-bridge/values.yaml \
  -o values-bridge.yaml
Depending on your Slurm configuration, you may need to configure the following
variables:
schedulerConfig.partition
- the default partition with which slurm-bridge will associate jobs. This
partition should only include nodes that have both slurmd and the kubelet
running. The default value of this variable is slurm-bridge.
sharedConfig.slurmRestApi
- the URL used by slurm-bridge
to interact with
the Slurm REST API. Changing this value may be necessary if you run the REST
API on a different URL or port. The default value of this variable is
http://slurm-restapi.slurm:6820
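For example, a minimal set of overrides in values-bridge.yaml might look like
the following (illustrative sketch; the key names come from the settings above,
adjust the values for your site):
# Illustrative overrides for values-bridge.yaml
schedulerConfig:
  # Default Slurm partition for jobs created by slurm-bridge.
  partition: slurm-bridge
sharedConfig:
  # URL of the Slurm REST API that slurm-bridge talks to.
  slurmRestApi: http://slurm-restapi.slurm:6820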
3. Download and install the slurm-bridge
package from OCI:
helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \
--values=values-bridge.yaml --version=0.3.0 --namespace=slinky --create-namespace
You can check if your cluster deployed successfully with:
kubectl --namespace=slinky get pods
Your output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-bridge-admission-85f89cf884-8c9jt 1/1 Running 0 1m0s
slurm-bridge-controllers-757f64b875-bsfnf 1/1 Running 0 1m0s
slurm-bridge-scheduler-5484467f55-wtspk 1/1 Running 0 1m0s
Running Your First Job
Now that slurm-bridge
is configured, we can write a workload. slurm-bridge
schedules Kubernetes workloads using the Slurm scheduler by translating a
Kubernetes workload (a Job, JobSet, Pod, or PodGroup)
into a representative Slurm job, which is used for scheduling purposes. Once a
workload is allocated resources, the Kubelet binds the Kubernetes workload to
the allocated resources and executes it. There are sample workload definitions
in the slurm-bridge
repo
here.
Here's an example of a simple job, found in hack/examples/job/single.yaml:
---
apiVersion: batch/v1
kind: Job
metadata:
  name: job-sleep-single
  namespace: slurm-bridge
  annotations:
    slinky.slurm.net/job-name: job-sleep-single
spec:
  completions: 1
  parallelism: 1
  template:
    spec:
      containers:
        - name: sleep
          image: busybox:stable
          command: [sh, -c, sleep 3]
          resources:
            requests:
              cpu: "1"
              memory: 100Mi
            limits:
              cpu: "1"
              memory: 100Mi
      restartPolicy: Never
Let’s run this job:
❯ kubectl apply -f hack/examples/job/single.yaml
job.batch/job-sleep-single created
At this point, Kubernetes has dispatched our job, it was scheduled by Slurm, and
executed to completion. Let’s take a look at each place that our job shows up.
On the Slurm side, we can observe the placeholder job that was used to schedule
our workload:
slurm@slurm-controller-0:/tmp$ scontrol show jobs
JobId=1 JobName=job-sleep-single
UserId=slurm(401) GroupId=slurm(401) MCS_label=kubernetes
Priority=1 Nice=0 Account=(null) QOS=normal
JobState=CANCELLED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:08 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2025-07-10T15:52:53 EligibleTime=2025-07-10T15:52:53
AccrueTime=2025-07-10T15:52:53
StartTime=2025-07-10T15:52:53 EndTime=2025-07-10T15:53:01 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-07-10T15:52:53 Scheduler=Main
Partition=slurm-bridge AllocNode:Sid=10.244.5.5:1
ReqNodeList=(null) ExcNodeList=(null)
NodeList=slurm-bridge-1
BatchHost=slurm-bridge-1
StepMgrEnabled=Yes
NumNodes=1 NumCPUs=4 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=96046M,node=1,billing=1
AllocTRES=cpu=4,mem=96046M,node=1,billing=4
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) LicensesAlloc=(null) Network=(null)
Command=(null)
WorkDir=/tmp
AdminComment={"pods":["slurm-bridge/job-sleep-single-8wtc2"]}
OOMKillStep=0
Note that the Command field is equal to (null), and that the JobState field is
equal to CANCELLED. This is because this Slurm job is only a placeholder - no
work is actually done by the placeholder. Instead, the job is cancelled upon
allocation so that the kubelet can bind the workload to the selected node(s)
for the duration of the job.
We can also look at this job using kubectl
:
❯ kubectl describe job --namespace=slurm-bridge job-sleep-single
Name: job-sleep-single
Namespace: slurm-bridge
Selector: batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
Labels: batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
batch.kubernetes.io/job-name=job-sleep-single
controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
job-name=job-sleep-single
Annotations: slinky.slurm.net/job-name: job-sleep-single
Parallelism: 1
Completions: 1
Completion Mode: NonIndexed
Start Time: Thu, 10 Jul 2025 09:52:53 -0600
Completed At: Thu, 10 Jul 2025 09:53:02 -0600
Duration: 9s
Pods Statuses: 0 Active (0 Ready) / 1 Succeeded / 0 Failed
Pod Template:
Labels: batch.kubernetes.io/controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
batch.kubernetes.io/job-name=job-sleep-single
controller-uid=8a03f5f6-f0c0-4216-ac0b-8c9b70c92eec
job-name=job-sleep-single
Containers:
sleep:
Image: busybox:stable
Port: <none>
Host Port: <none>
Command:
sh
-c
sleep 3
Limits:
cpu: 1
memory: 100Mi
Requests:
cpu: 1
memory: 100Mi
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 14m job-controller Created pod: job-sleep-single-8wtc2
Normal Completed 14m job-controller Job completed
As Kubernetes is the context in which this job actually executed, this is
generally the more useful of the two outputs.
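To see which node the pod was bound to (it should match the NodeList reported
by Slurm above), list the pods with wide output:
kubectl get pods --namespace=slurm-bridge -o wide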
Celebrate!
At this point, you should have a cluster running slurm-bridge
.
Recommended next steps involve reading through
creating a workload, learning more about the
architecture of slurm-bridge
, or browsing our
how-to-guides on administrative tasks.
1.1.2 - Concepts
Concepts related to slurm-bridge
internals and design.
1.1.2.1 - Admission
Overview
The Kubernetes documentation
defines admission controllers as:
a piece of code that intercepts requests to the Kubernetes API server prior to
persistence of the resource, but after the request is authenticated and
authorized.
It also states that:
Admission control mechanisms may be validating, mutating, or both. Mutating
controllers may modify the data for the resource being modified; validating
controllers may not.
The slurm-bridge
admission controller is a mutating controller. It modifies
any pods within certain namespaces (slurm-bridge
, by default) to use the
slurm-bridge
scheduler instead of the default Kube scheduler.
Design
Any pods created in certain namespaces will have their .spec.schedulerName
changed to our scheduler.
Managed namespaces are defined as a list of namespaces configured under
managedNamespaces[] in the admission controller's values.yaml.
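A hypothetical values.yaml excerpt (the exact nesting may differ between chart
versions) could look like:
# Namespaces whose pods are rewritten to use the slurm-bridge scheduler.
managedNamespaces:
  - slurm-bridge
  - my-hpc-namespace  # example of an additional managed namespace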
Sequence Diagram
sequenceDiagram
autonumber
participant KAPI as Kubernetes API
participant SBA as Slurm-Bridge Admission
KAPI-->>SBA: Watch Pod Create/Update
opt Pod in managed Namespaces
SBA->>KAPI: Update `.spec.schedulerName` and Tolerations
KAPI-->>SBA: Update Response
end %% opt Pod in managed Namespaces
1.1.2.2 - Architecture
Overview
This document describes the high-level architecture of the Slinky
slurm-bridge
.
Big Picture

Directory Map
This project follows the conventions of:
api/
Contains Custom Kubernetes API definitions. These become Custom Resource
Definitions (CRDs) and are installed into a Kubernetes cluster.
cmd/
Contains code to be compiled into binary commands.
config/
Contains yaml configuration files used for kustomize
deployments
docs/
Contains project documentation.
hack/
Contains files for development and Kubebuilder. This includes a kind.sh script
that can be used to create a kind cluster with all pre-requisites for local
testing.
helm/
Contains helm deployments, including the configuration files such as
values.yaml.
Helm is the recommended method to install this project into your Kubernetes
cluster.
internal/
Contains code that is used internally. This code is not externally importable.
internal/controller/
Contains the controllers.
Each controller is named after the Custom Resource Definition (CRD) it manages.
Currently, this consists of the nodeset and the cluster CRDs.
internal/scheduler/
Contains scheduling framework plugins. Currently, this consists of
slurm-bridge
.
1.1.2.3 - Controllers
Overview
The Kubernetes documentation
defines controllers as:
control loops that watch the state of your cluster, then make or request
changes where needed. Each controller tries to move the current cluster state
closer to the desired state.
Within slurm-bridge
, there are multiple controllers that manage the state of
different bridge components:
- Node Controller - Responsible for the state of nodes in the bridge cluster
- Workload Controller - Responsible for the state of pods and other
workloads running on
slurm-bridge
Node Controller
The node controller is responsible for tainting the managed nodes so that the
scheduler component is fully in control of all workloads bound to those nodes.
Additionally, this controller will reconcile certain node states for scheduling
purposes. Slurm becomes the source of truth for scheduling among managed nodes.
A managed node is defined as a node that has a colocated kubelet and slurmd on
the same physical host and that slurm-bridge can schedule on.
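To confirm that the node controller has tainted a managed node, inspect the
node's taints (the exact taint key depends on your slurm-bridge version):
kubectl describe node $KUBERNETES_NODENAME | grep -A3 -i taints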
sequenceDiagram
autonumber
participant KAPI as Kubernetes API
participant SNC as Slurm-Bridge Node Controller
participant SAPI as Slurm REST API
loop Reconcile Loop
KAPI-->>SNC: Watch Kubernetes Nodes
alt Node is managed
SNC->>KAPI: Taint Node
KAPI-->>SNC: Taint Node
else
SNC->>KAPI: Untaint Node
KAPI-->>SNC: Untaint Node
end %% alt Node is managed
alt Node is schedulable
SNC->>SAPI: Drain Node
SAPI-->>SNC: Drain Node
else
SNC->>SAPI: Undrain Node
SAPI-->>SNC: Undrain Node
end %% alt Node is schedulable
end %% loop Reconcile Loop
Workload Controller
The workload controller reconciles Kubernetes Pods and Slurm Jobs. Slurm is the
source of truth for what workload is allowed to run on which managed nodes.
sequenceDiagram
autonumber
participant KAPI as Kubernetes API
participant SWC as Slurm Workload Controller
participant SAPI as Slurm REST API
loop Reconcile Loop
critical Map Slurm Job to Pod
KAPI-->>SWC: Watch Kubernetes Pods
SAPI-->>SWC: Watch Slurm Jobs
option Pod is Terminated
SWC->>SAPI: Terminate Slurm Job
SAPI-->>SWC: Return Status
option Job is Terminated
SWC->>KAPI: Evict Pod
KAPI-->>SWC: Return Status
end %% critical Map Slurm Job to Pod
end %% loop Reconcile Loop
1.1.2.4 - Scheduler
Overview
In Kubernetes, scheduling refers to making sure that pods are matched to nodes
so that the kubelet can run them.
The scheduler controller in slurm-bridge
is responsible for scheduling
eligible pods onto nodes that are managed by slurm-bridge
. In doing so, the
slurm-bridge
scheduler interacts with the Slurm REST API in order to acquire
allocations for its workloads. In slurm-bridge
, slurmctld
serves as the
source of truth for scheduling decisions.
Design
This scheduler is designed to be a non-primary scheduler (i.e. it should not
replace the default kube-scheduler). This means that only certain pods should
be scheduled via this scheduler (e.g. non-critical pods).
This scheduler represents Kubernetes Pods as a Slurm Job, waits for Slurm to
schedule the Job, then informs Kubernetes on which nodes to allocate the
represented Pods. This scheduler defers scheduling decisions to Slurm, hence
certain assumptions about the environment must be met for this to function
correctly.
Sequence Diagram
sequenceDiagram
autonumber
actor user as User
participant KAPI as Kubernetes API
participant SBS as Slurm-Bridge Scheduler
participant SAPI as Slurm REST API
loop Workload Submission
user->>KAPI: Submit Pod
KAPI-->>user: Return Request Status
end %% loop Workload Submission
loop Scheduling Loop
SBS->>KAPI: Get Next Pod in Workload Queue
KAPI-->>SBS: Return Next Pod in Workload Queue
note over SBS: Honor Slurm scheduling decision
critical Lookup Slurm Placeholder Job
SBS->>SAPI: Get Placeholder Job
SAPI-->>SBS: Return Placeholder Job
option Job is NotFound
note over SBS: Translate Pod(s) into Slurm Job
SBS->>SAPI: Submit Placeholder Job
SAPI-->>SBS: Return Submit Status
option Job is Pending
note over SBS: Check again later...
SBS->>SBS: Requeue
option Job is Allocated
note over SBS: Bind Pod(s) to Node(s) from the Slurm Job
SBS->>KAPI: Bind Pod(s) to Node(s)
KAPI-->>SBS: Return Bind Request Status
end %% Lookup Slurm Placeholder Job
end %% loop Scheduling Loop
1.1.3 - Tasks
Guides to tasks related to the administration of a cluster running
slurm-bridge
.
1.1.3.1 - Running slurm-bridge locally
You may want to run slurm-bridge
on a single machine in order to test the
software or familiarize yourself with it prior to installing it on your cluster.
This should only be done for testing and evaluation purposes and should not be
used for production environments.
We have provided a script to do this using Kind and
the
hack/kind.sh
script.
This document assumes a basic understanding of
Kubernetes architecture. It
is highly recommended that those who are unfamiliar with the core concepts of
Kubernetes review the documentation on
Kubernetes,
pods, and
nodes before getting
started.
Pre-requisites
- go 1.17+ must be installed on your system
Setting up your environment
- Install Kind using
go install
:
go install sigs.k8s.io/kind@v0.29.0
If you get kind: command not found
when running the next step, you may need to
add GOPATH to your PATH:
export GOPATH=$HOME/go
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
- Confirm that kind is working properly by running the following commands:
kind create cluster
kubectl get nodes --all-namespaces
kind delete cluster
- Clone the
slurm-bridge
repo and enter it:
git clone git@github.com:SlinkyProject/slurm-bridge.git
cd slurm-bridge
Installing slurm-bridge
within your environment
Provided with slurm-bridge
is the script hack/kind.sh
that interfaces with
kind to deploy the slurm-bridge
helm chart within your local environment.
- Create your cluster using
hack/kind.sh
:
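Assuming you are in the root of the slurm-bridge repository, a minimal
invocation might be (check the script itself for the flags supported by your
version):
./hack/kind.sh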
- Familiarize yourself with and use your test environment:
kubectl get pods --namespace=slurm-bridge
kubectl get pods --namespace=slurm
kubectl get pods --namespace=slinky
Celebrate!
At this point, you should have a kind cluster running slurm-bridge
.
Cleaning up
hack/kind.sh
provides a mechanism by which to destroy your test environment.
Run:
hack/kind.sh --delete
To destroy your kind cluster.
1.1.3.2 - Creating a Workload
In Slurm, all workloads are represented by jobs. In slurm-bridge
, however,
there are a number of forms that workloads can take. While workloads can still
be submitted as a Slurm job, slurm-bridge
also enables users to submit
workloads through Kubernetes. Most workloads that can be submitted to
slurm-bridge
from within Kubernetes are represented by an existing Kubernetes
batch workload primitive.
At this time, slurm-bridge
has scheduling support for Jobs,
JobSets, Pods, and PodGroups. If your workload
requires or benefits from co-scheduled pod launch (e.g. MPI, multi-node),
consider representing your workload as a JobSet or
PodGroup.
Using the slurm-bridge
Scheduler
slurm-bridge
uses an
admission controller
to control which resources are scheduled using the slurm-bridge-scheduler
. The
slurm-bridge-scheduler
is designed as a non-primary scheduler and is not
intended to replace the default
kube-scheduler.
The slurm-bridge
admission controller only schedules pods that request
slurm-bridge
as their scheduler or are in a configured namespace. By default,
the slurm-bridge
admission controller is configured to automatically use
slurm-bridge
as the scheduler for all pods in the configured namespaces.
Alternatively, a pod can specify Pod.Spec.schedulerName=slurm-bridge-scheduler
from any namespace to indicate that it should be scheduled using the
slurm-bridge-scheduler
.
You can learn more about the slurm-bridge
admission controller
here.
Annotations
Users can better inform or influence slurm-bridge
how to represent their
Kubernetes workload within Slurm by adding
annotations on the parent Object.
Example “pause” bare pod to illustrate annotations:
apiVersion: v1
kind: Pod
metadata:
  name: pause
  # `slurm-bridge` annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi
Example “pause” deployment to illustrate annotations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  # `slurm-bridge` annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      schedulerName: slurm-bridge-scheduler
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.6
          resources:
            limits:
              cpu: "1"
              memory: 100Mi
JobSets
This section assumes the JobSet operator is installed.
JobSet pods will be coscheduled and launched together. The JobSet controller is
responsible for managing the JobSet status and other Pod interactions once
marked as completed.
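As a sketch (assuming the JobSet operator serves the jobset.x-k8s.io/v1alpha2
API; the name and sizes here are illustrative), a two-pod JobSet that
slurm-bridge can co-schedule might look like:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: jobset-sleep
  namespace: slurm-bridge
spec:
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          completions: 2
          parallelism: 2
          template:
            spec:
              containers:
                - name: sleep
                  image: busybox:stable
                  command: [sh, -c, sleep 3]
                  resources:
                    requests:
                      cpu: "1"
                      memory: 100Mi
                    limits:
                      cpu: "1"
                      memory: 100Mi
              restartPolicy: Never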
PodGroups
This section assumes PodGroups CRD is installed and the
out-of-tree kube-scheduler is installed and configured as a (non-primary)
scheduler.
Pods contained within a PodGroup will be co-scheduled and launched together. The
PodGroup controller is responsible for managing the PodGroup status and other
Pod interactions once marked as completed.
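As an illustrative sketch (assuming the scheduler-plugins PodGroup CRD at
scheduling.x-k8s.io/v1alpha1; the pod-group label key shown follows the
scheduler-plugins convention and may differ in your installation), a PodGroup
requiring two pods to start together might look like the following. A second
pod carrying the same label would be defined similarly.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: pg-sleep
  namespace: slurm-bridge
spec:
  # Minimum number of pods that must be scheduled together.
  minMember: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: pg-sleep-0
  namespace: slurm-bridge
  labels:
    # Associates this pod with the PodGroup above (scheduler-plugins convention).
    scheduling.x-k8s.io/pod-group: pg-sleep
spec:
  containers:
    - name: sleep
      image: busybox:stable
      command: [sh, -c, sleep 3]
      resources:
        limits:
          cpu: "1"
          memory: 100Mi
  restartPolicy: Never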
2 - slurm-operator
Download the slurm-operator repository
here, start using the
operator with a quickstart guide, or read on to learn more.
Slurm and Kubernetes are workload managers originally designed for different
kinds of workloads. Kubernetes excels at scheduling workloads that run for an
indefinite amount of time, with potentially vague resource requirements, on a
single node, with loose policy, but can scale its resource pool infinitely to
meet demand; Slurm excels at quickly scheduling workloads that run for a finite
amount of time, with well defined resource requirements and topology, on
multiple nodes, with strict policy, and a known resource pool.
This project enables the best of both workload managers, unified on Kubernetes.
Why you need slurm-operator
and what it can do
slurm-operator
allows users to run workloads on Slurm within a Kubernetes
cluster, taking advantage of many of the advanced scheduling features of Slurm
within a cloud-native environment. This allows for more advanced scheduling than
could be achieved within Kubernetes alone, and allows for advanced policies for
job priority and allocations, pre-emption, and accounting.
It contains a Kubernetes operator to deploy and manage certain components of
Slurm clusters. This repository implements custom-controllers and custom
resource definitions (CRDs) designed for managing the lifecycle of Slurm
clusters.

For additional architectural notes, see the architecture docs.
Features
Management of Slurm clusters on Kubernetes. The operator takes into
consideration the running workload among Slurm nodes as it scales in, upgrades,
or otherwise handles node failures. Slurm nodes are marked as drain before
their eventual termination pending scale-in or upgrade.
Slurm
Slurm is a full featured HPC workload manager. To highlight a few features:
- Accounting: collect accounting information for every
job and job step executed.
- Partitions: job queues with sets of resources and
constraints (e.g. job size limit, job time limit, users permitted).
- Reservations: reserve resources for jobs being
executed by select users and/or select accounts.
- Job Dependencies: defer the start of jobs until the
specified dependencies have been satisfied.
- Job Containers: jobs which run an unprivileged OCI
container bundle.
- Priority: assigns priorities to jobs upon submission and
on an ongoing basis (e.g. as they age).
- Preemption: stop one or more low-priority jobs to let a
high-priority job run.
- QoS: sets of policies affecting scheduling priority,
preemption, and resource limits.
- Fairshare: distribute resources equitably among users
and accounts based on historical usage.
- Node Health Check: periodically check node health via
script.
Quickstart
See the quickstart guide for instructions on installing slurm-operator
.
Supported Versions
- Kubernetes Version >= v1.29
- Slurm Version >= 24.11
Current Limitations
- cgroup v2 is the only supported cgroup implementation for Slinky
Upgrades
In 0.X.0 releases, breaking changes may be introduced into newer CRDs. To
upgrade between these versions, uninstall all Slinky charts and delete Slinky
CRDs, then install the new release like normal.
helm --namespace=slurm uninstall slurm
helm --namespace=slinky uninstall slurm-operator
kubectl delete crd clusters.slinky.slurm.net
kubectl delete crd nodesets.slinky.slurm.net
License
Copyright (C) SchedMD LLC.
Licensed under the Apache License, Version 2.0; you may not use this project
except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
2.1 - 0.1.x
2.1.1 - Overview
slurm-operator
This project provides a framework that runs
Slurm in
Kubernetes.
Overview
This project deploys Slurm on
Kubernetes. These pods coexist with other running
workloads on Kubernetes. This project provides controls over the Slurm cluster
configuration and deployment, along with configurable autoscaling policy for
Slurm compute nodes.
This project allows for much of the functionality within Slurm for workload
management. This includes:
- Priority scheduling: Determine job execution order based on priorities and
weights such as age
- Fair share: Resources are distributed equitably among users based on
historical usage.
- Quality of Service (QoS): sets of policies, such as resource limits,
priorities, preemption, and backfilling.
- Job accounting: Information for every job and job step executed
- Job dependencies: Allow users to specify relationships between jobs, such as
starting a job only after another job starts, succeeds, fails, or reaches a
particular state.
- Workflows with partitioning: Divide cluster resources into sections for job
management
To best enable Slurm in Kubernetes, the project uses
Custom Resources (CRs)
and an
Operator to
extend Kubernetes with custom behaviors for Slurm clusters. In addition,
Helm is used for managing the deployment of the various
components of this project to Kubernetes.
Supported Slurm Versions
Slurm 24.05 (data parsers v40, v41)
Quickstart
See the
Quickstart Guide
to install.
Overall Architecture
This is a basic architecture. A more in depth description can be found
in the docs directory.

Known Issues
- CGroups is currently disabled, due to difficulties getting core information
into the pods.
- Updates may be slow, due to needing to wait for sequencing before the
slurm-controller can be deployed.
License
Copyright (C) SchedMD LLC.
Licensed under the Apache License, Version 2.0; you may not use this project
except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
2.1.2 - User
2.1.2.1 - Autoscaling
The slurm-operator may be configured to autoscale NodeSets pods based on Slurm
metrics. This guide discusses how to configure autoscaling using KEDA.
Getting Started
Before attempting to autoscale NodeSets, Slinky should be fully deployed to a
Kubernetes cluster and Slurm jobs should be able to run.
Dependencies
Autoscaling requires additional services that are not included in Slinky. Follow
documentation to install Prometheus, Metrics Server, and KEDA.
Prometheus will install tools to report metrics and view them with Grafana. The
Metrics Server is needed to report CPU and memory usage for tools like
kubectl top
. KEDA is recommended for autoscaling as it provides usability
improvements over the standard Horizontal Pod Autoscaler (HPA).
To add KEDA in the helm install, run
helm repo add kedacore https://kedacore.github.io/charts
Install the slurm-exporter. This chart is installed as a dependency of the
slurm helm chart by default. Configure using helm/slurm/values.yaml.
Verify KEDA Metrics API Server is running
$ kubectl get apiservice -l app.kubernetes.io/instance=keda
NAME SERVICE AVAILABLE AGE
v1beta1.external.metrics.k8s.io keda/keda-operator-metrics-apiserver True 22h
KEDA provides the metrics apiserver required by HPA to scale on custom metrics
from Slurm. An alternative like Prometheus Adapter could be used for this, but
KEDA offers usability enhancements and improvements to HPA in addition to
including a metrics apiserver.
Autoscaling
Autoscaling NodeSets allows Slurm partitions to expand and contract in response
to the CPU and memory usage. Using Slurm metrics, NodeSets may also scale based
on Slurm specific information like the number of pending jobs or the size of the
largest pending job in a partition. There are many ways to configure
autoscaling. Experiment with different combinations based on the types of jobs
being run and the resources available in the cluster.
NodeSet Scale Subresource
Scaling a resource in Kubernetes requires that resources such as Deployments and
StatefulSets support the scale subresource. This is also true of the NodeSet
Custom Resource.
The scale subresource gives a standard interface to observe and control the
number of replicas of a resource. In the case of NodeSet, it allows Kubernetes
and related services to control the number of slurmd
replicas running as part
of the NodeSet.
Note: NodeSets with replicas: null
are intended to scale similar to a
DaemonSet. This is not an appropriate type of NodeSet to use with Autoscaling as
the Slinky operator will handle scaling NodeSet replicas across the cluster
based on the selection criteria.
To manually scale a NodeSet, use the kubectl scale
command. In this example,
the NodeSet (nss) slurm-compute-radar
is scaled to 1.
$ kubectl scale -n slurm nss/slurm-compute-radar --replicas=1
nodeset.slinky.slurm.net/slurm-compute-radar scaled
$ kubectl get pods -o wide -n slurm -l app.kubernetes.io/instance=slurm-compute-radar
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
slurm-compute-radar-p8jwh 1/1 Running 0 2m48s 10.244.4.17 kind-worker <none> <none>
This corresponds to the Slurm partition radar
.
$ kubectl exec -n slurm statefulset/slurm-controller -- sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
radar up infinite 1 idle kind-worker
NodeSets may be scaled to zero. In this case, there are no replicas of slurmd
running and all jobs scheduled to that partition will remain in a pending state.
$ kubectl scale nss/slurm-compute-radar -n slurm --replicas=0
nodeset.slinky.slurm.net/slurm-compute-radar scaled
For NodeSets to scale on demand, an autoscaler needs to be deployed. KEDA allows
resources to scale from 0<->1 and also creates an HPA to scale based on scalers
like Prometheus and more.
KEDA ScaledObject
KEDA uses the Custom Resource ScaledObject to monitor and scale a resource. It
will automatically create the HPA needed to scale based on external triggers
like Prometheus. With Slurm metrics, NodeSets may be scaled based on data
collected from the Slurm restapi.
This example ScaledObject will watch the number of jobs pending for the
partition radar
and scale the NodeSet slurm-compute-radar
until a threshold
value is satisfied or maxReplicaCount
is reached.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scale-radar
spec:
  scaleTargetRef:
    apiVersion: slinky.slurm.net/v1alpha1
    kind: NodeSet
    name: slurm-compute-radar
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.prometheus:9090
        query: slurm_partition_pending_jobs{partition="radar"}
        threshold: "5"
Note: The Prometheus trigger is using metricType: Value
instead of the
default AverageValue
. AverageValue
calculates the replica count by averaging
the threshold across the current replica count.
Check the ScaledObject documentation for a full list of allowable options.
In this scenario, the ScaledObject scale-radar
will query the Slurm metric
slurm_partition_pending_jobs
from Prometheus with the label
partition="radar"
.
When there is activity on the trigger (at least one pending job), KEDA will
scale the NodeSet to minReplicaCount
and then let HPA handle scaling up to
maxReplicaCount
or back down to minReplicaCount
. When there is no activity
on the trigger after a configurable amount of time, KEDA will scale the NodeSet
to idleReplicaCount
. See the KEDA documentation on idleReplicaCount for
more examples.
Note: The only supported value for idleReplicaCount
is 0 due to
limitations on how the HPA controller works.
To verify a KEDA ScaledObject, apply it to the cluster in the appropriate
namespace on a NodeSet that has no replicas.
$ kubectl scale nss/slurm-compute-radar -n slurm --replicas=0
nodeset.slinky.slurm.net/slurm-compute-radar scaled
Wait for Slurm to report that the partition has no nodes.
slurm@slurm-controller-0:/tmp$ sinfo -p radar
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
radar up infinite 0 n/a
Apply the ScaledObject using kubectl
to the correct namespace and verify the
KEDA and HPA resources are created.
$ kubectl apply -f scaledobject.yaml -n slurm
scaledobject.keda.sh/scale-radar created
$ kubectl get -n slurm scaledobjects
NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS AUTHENTICATION READY ACTIVE FALLBACK PAUSED AGE
scale-radar slinky.slurm.net/v1alpha1.NodeSet slurm-compute-radar 1 5 prometheus True False Unknown Unknown 28s
$ kubectl get -n slurm hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
keda-hpa-scale-radar NodeSet/slurm-compute-radar <unknown>/5 1 5 0 32s
Once the ScaledObject and HPA are created, initiate some jobs to test that
the NodeSet
scale subresource is scaled in response.
$ sbatch --wrap "sleep 30" --partition radar --exclusive
The NodeSet will scale to minReplicaCount
in response to activity on the
trigger. Once the number of pending jobs crosses the configured threshold
(submit more exclusive jobs to the partition), more replicas will be created to
handle the additional demand. Until the threshold
is exceeded, the NodeSet
will remain at minReplicaCount
.
Note: This example only works well for single node jobs, unless threshold
is set to 1. In that case, HPA will continue to scale up the NodeSet as long as
there is a pending job, until it reaches the maxReplicaCount.
After the default coolDownPeriod
of 5 minutes without activity on the trigger,
KEDA will scale the NodeSet down to 0.
2.1.2.2 - Slurm
Slurm
Slurm is an open source, fault-tolerant, and highly scalable cluster
management and job scheduling system for large and small Linux clusters. Slurm
requires no kernel modifications for its operation and is relatively
self-contained. As a cluster workload manager, Slurm has three key functions.
First, it allocates exclusive and/or non-exclusive access to resources
(compute nodes) to users for some duration of time so they can perform work.
Second, it provides a framework for starting, executing, and monitoring work
(normally a parallel job) on the set of allocated nodes. Finally, it
arbitrates contention for resources by managing a queue of pending work.
Optional plugins can be used for accounting, advanced reservation, gang
scheduling (time sharing for parallel jobs), backfill scheduling, topology
optimized resource selection, resource limits by user or bank account, and
sophisticated multifactor job prioritization algorithms.
Architecture

See the Slurm architecture docs for more information.
2.1.3 - Dev
2.1.3.1 - Architecture
Overview
This document describes the high-level architecture of the Slinky
slurm-operator
.
Big Picture

The slurm-operator
follows the Kubernetes operator
pattern.
Operators are software extensions to Kubernetes that make use of custom
resources to manage applications and their components. Operators follow
Kubernetes principles, notably the control loop.
The slurm-operator
has one controller for each Custom Resource Definition
(CRD) that it is responsible to manage. Each controller has a control loop where
the state of the Custom Resource (CR) is reconciled.
Often, an operator is only concerned about data reported by the Kubernetes API.
In our case, we are also concerned about data reported by the Slurm API, which
influences how the slurm-operator
reconciles certain CRs.
Directory Map
This project follows the conventions of:
api/
Contains Custom Kubernetes API definitions. These become Custom Resource
Definitions (CRDs) and are installed into a Kubernetes cluster.
cmd/
Contains code to be compiled into binary commands.
config/
Contains yaml configuration files used for kustomize deployments.
docs/
Contains project documentation.
hack/
Contains files for development and Kubebuilder. This includes a kind.sh script
that can be used to create a kind cluster with all pre-requisites for local
testing.
helm/
Contains helm deployments, including the configuration files such as
values.yaml.
Helm is the recommended method to install this project into your Kubernetes
cluster.
internal/
Contains code that is used internally. This code is not externally importable.
internal/controller/
Contains the controllers.
Each controller is named after the Custom Resource Definition (CRD) it manages.
Currently, this consists of the nodeset and the cluster CRDs.
2.1.3.2 - Cluster Control
Overview
This controller is responsible for managing and reconciling the Cluster CRD. A
Cluster CR represents communication to a Slurm cluster via slurmrestd and
auth/jwt.
This controller uses the Slurm client library.
Sequence Diagram
sequenceDiagram
autonumber
actor User as User
participant KAPI as Kubernetes API
participant CC as Cluster Controller
participant SAPI as Slurm REST API
box Operator Internals
participant SCM as Slurm Client Map
participant SEC as Slurm Event Channel
end %% Operator Internals
note over KAPI: Handle CR Creation
User->>KAPI: Create Cluster CR
KAPI-->>CC: Watch Cluster CRD
CC->>+KAPI: Get referenced secret
KAPI-->>-CC: Return secret
create participant SC as Slurm Client
CC->>+SC: Create Slurm Client for Cluster
SC-->>-CC: Return Slurm Client Status
loop Watch Slurm Nodes
SC->>+SAPI: Get Slurm Nodes
SAPI-->>-SC: Return Slurm Nodes
SC->>SEC: Add Event for Cache Delta
end %% loop Watch Slurm Nodes
CC->>SCM: Add Slurm Client to Map
CC->>+SC: Ping Slurm Control Plane
SC->>+SAPI: Ping Slurm Control Plane
SAPI-->>-SC: Return Ping
SC-->>-CC: Return Ping
CC->>KAPI: Update Cluster CR Status
note over KAPI: Handle CR Deletion
User->>KAPI: Delete Cluster CR
KAPI-->>CC: Watch Cluster CRD
SCM-->>CC: Lookup Slurm Client
destroy SC
CC-)SC: Shutdown Slurm Client
CC->>SCM: Remove Slurm Client from Map
2.1.3.3 - Develop
This document aims to provide enough information that you can get started with
development on this project.
Getting Started
You will need a Kubernetes cluster to run against. You can use KIND to get a
local cluster for testing, or run against your choice of remote cluster.
Note: Your controller will automatically use the current context in your
kubeconfig file (i.e. whatever cluster kubectl cluster-info
shows).
Dependencies
Install KIND and Golang binaries for pre-commit hooks.
sudo apt-get install golang
make install
Pre-Commit
Install pre-commit and install the git hooks.
sudo apt-get install pre-commit
pre-commit install
Docker
Install Docker and configure rootless Docker.
Afterward, test that your user account can communicate with Docker.
Helm
Install Helm.
sudo snap install helm --classic
Skaffold
Install Skaffold.
curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64 && \
sudo install skaffold /usr/local/bin/
If google-cloud-sdk is installed, skaffold is available as an additional
component.
sudo apt-get install -y google-cloud-cli-skaffold
Kubernetes Client
Install kubectl.
sudo snap install kubectl --classic
If google-cloud-sdk is installed, kubectl is available as an additional
component.
sudo apt-get install -y kubectl
Running on the Cluster
For development, all Helm deployments use a values-dev.yaml
. If they do not
exist in your environment yet or you are unsure, safely copy the values.yaml
as a base by running:
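A minimal way to do that (paths are illustrative; copy the values.yaml that
sits next to whichever chart you are deploying):
cp helm/slurm-operator/values.yaml helm/slurm-operator/values-dev.yaml
cp helm/slurm/values.yaml helm/slurm/values-dev.yaml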
Automatic
You can use Skaffold to build and push images, and deploy components using:
cd helm/slurm-operator/
skaffold run
NOTE: The skaffold.yaml
is configured to inject the image and tag into the
values-dev.yaml
so they are correctly referenced.
Operator
The slurm operator aims to follow the Kubernetes Operator
pattern.
It uses Controllers, which provide a reconcile function
responsible for synchronizing resources until the desired state is reached on
the cluster.
Install CRDs
When deploying a helm chart with skaffold or helm, the CRDs defined in its
crds/
directory will be installed if not already present in the cluster.
Uninstall CRDs
To delete the Operator CRDs from the cluster:
WARNING: CRDs do not upgrade! The old ones must be uninstalled first so the
new ones can be installed. This should only be done in development.
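With a Kubebuilder-style Makefile this is typically a make target similar to
the following (assumption; check this project's Makefile for the exact target):
make uninstall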
Modifying the API Definitions
If you are editing the API definitions, generate the manifests such as CRs or
CRDs using:
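In a Kubebuilder-based project this is usually done with make targets along the
lines of the following (assumption; check this project's Makefile for the exact
targets):
make manifests generate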
Slurm Version Changed
If the Slurm version has changed, generate the new OpenAPI spec and its golang
client code using:
NOTE: Update code interacting with the API in accordance with the
slurmrestd plugin lifecycle.
Running the operator locally
Install the operator’s CRDs with make install
.
Launch the operator via the VSCode debugger using the “Launch Operator” launch
task.
Because the operator will be running outside of Kubernetes and needs to
communicate with the Slurm cluster, set the following options in your Slurm helm
chart’s values.yaml
:
debug.enable=true
debug.localOperator=true
If running on a Kind cluster, also set:
debug.disableCgroups=true
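Put together, the corresponding section of the Slurm chart's values-dev.yaml
might look like this (sketch based on the option names above):
debug:
  enable: true
  localOperator: true
  disableCgroups: true  # only needed when running on a Kind cluster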
If the Slurm helm chart is being deployed with skaffold, run
skaffold run --port-forward --tail
. It is configured to automatically
port-forward the restapi for the local operator to
communicate with the Slurm cluster.
If skaffold is not used, manually run
kubectl port-forward --namespace slurm services/slurm-restapi 6820:6820
for
the local operator to communicate with the Slurm cluster.
After starting the operator, verify it is able to contact the Slurm cluster by
checking that the Cluster CR has been marked ready:
$ kubectl get --namespace slurm clusters.slinky.slurm.net
NAME READY AGE
slurm true 110s
See skaffold port-forwarding to learn how skaffold
automatically detects which services to forward.
2.1.3.4 - NodeSet Controller
Overview
The nodeset controller is responsible for managing and reconciling the NodeSet
CRD, which represents a set of homogeneous Slurm Nodes.
Design
This controller is responsible for managing and reconciling the NodeSet CRD. In
addition to the regular responsibility of managing resources in Kubernetes via
the Kubernetes API, this controller should take into consideration the state of
Slurm to make certain reconciliation decisions.
Sequence Diagram
sequenceDiagram
autonumber
actor User as User
participant KAPI as Kubernetes API
participant NS as NodeSet Controller
box Operator Internals
participant SCM as Slurm Client Map
participant SEC as Slurm Event Channel
end %% Operator Internals
participant SC as Slurm Client
participant SAPI as Slurm REST API
loop Watch Slurm Nodes
SC->>+SAPI: Get Slurm Nodes
SAPI-->>-SC: Return Slurm Nodes
SC->>SEC: Add Event for Cache Delta
end %% loop Watch Slurm Nodes
note over KAPI: Handle CR Update
SEC-->>NS: Watch Event Channel
User->>KAPI: Update NodeSet CR
KAPI-->>NS: Watch NodeSet CRD
opt Scale-out Replicas
NS->>KAPI: Create Pods
end %% Scale-out Replicas
opt Scale-in Replicas
SCM-->>NS: Lookup Slurm Client
NS->>+SC: Drain Slurm Node
SC->>+SAPI: Drain Slurm Node
SAPI-->>-SC: Return Drain Slurm Node Status
SC-->>-NS: Drain Slurm Node
alt Slurm Node is Drained
NS->>KAPI: Delete Pod
else
NS->>NS: Check Again Later
end %% alt Slurm Node is Drained
end %% opt Scale-in Replicas
2.1.4 - Quickstart Guides
2.1.4.1 - Basic Quickstart
QuickStart Guide
Overview
This quickstart guide will help you get the slurm-operator running and deploy
Slurm clusters to Kubernetes.
Install
Pre-Requisites
Install the pre-requisite helm charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true
Slurm Operator
Download values and install the slurm-operator from OCI package.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm-operator/values.yaml \
  -o values-operator.yaml
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --values=values-operator.yaml --version=0.1.0 --namespace=slinky --create-namespace
Make sure the cluster deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-operator-7444c844d5-dpr5h 1/1 Running 0 5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh 1/1 Running 0 5m00s
Slurm Cluster
Download values and install a Slurm cluster from OCI package.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm/values.yaml \
  -o values-slurm.yaml
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.1.0 --namespace=slurm --create-namespace
Make sure the slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME                              READY   STATUS    RESTARTS   AGE
slurm-accounting-0                1/1     Running   0          5m00s
slurm-compute-debug-0             1/1     Running   0          5m00s
slurm-controller-0                2/2     Running   0          5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
slurm-mariadb-0                   1/1     Running   0          5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client
commands:
kubectl --namespace=slurm exec -it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. host slurm@slurm-controller-0), run the following
commands to quickly test Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
See Slurm Commands for
more details on how to interact with Slurm.
2.1.4.2 - QuickStart Guide for Google GKE
This quickstart guide will help you get the slurm-operator running and deploy
Slurm clusters to GKE.
Setup
Setup a cluster on GKE.
gcloud container clusters create slinky-cluster \
  --location=us-central1-a \
  --num-nodes=2 \
  --node-taints "" \
  --machine-type=c2-standard-16
Setup kubectl to point to your new cluster.
gcloud container clusters get-credentials slinky-cluster
Pre-Requisites
Install the pre-requisite helm charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true
Slurm Operator
Download values and install the slurm-operator from OCI package.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm-operator/values.yaml \
  -o values-operator.yaml
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --version 0.1.0 \
  -f values-operator.yaml \
  --namespace=slinky \
  --create-namespace
Make sure the cluster deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s
Slurm Cluster
Download values and install a Slurm cluster from OCI package.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm/values.yaml \
  -o values-slurm.yaml
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --version 0.1.0 \
  -f values-slurm.yaml \
  --namespace=slurm \
  --create-namespace
Make sure the slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME                              READY   STATUS    RESTARTS   AGE
slurm-accounting-0                1/1     Running   0          5m00s
slurm-compute-debug-l4bd2         1/1     Running   0          5m00s
slurm-controller-0                2/2     Running   0          5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
slurm-mariadb-0                   1/1     Running   0          5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client
commands:
kubectl --namespace=slurm exec \
-it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. host slurm@slurm-controller-0), run the following
commands to quickly test Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
See Slurm Commands for
more details on how to interact with Slurm.
2.1.4.3 - QuickStart Guide for Microsoft AKS
This quickstart guide will help you get the slurm-operator running and deploy
Slurm clusters to AKS.
Setup
Setup a resource group on AKS
az group create --name slinky --location westus2
Setup a cluster on AKS
az aks create \
--resource-group slinky \
--name slinky \
--location westus2 \
--node-vm-size Standard_D2s_v3
Setup kubectl to point to your new cluster.
az aks get-credentials --resource-group slinky --name slinky
Pre-Requisites
Install the pre-requisite helm charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true
Slurm Operator
Download values and install the slurm-operator from OCI package.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm-operator/values.yaml \
  -o values-operator.yaml
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--version 0.1.0 \
-f values-operator.yaml \
--namespace=slinky \
--create-namespace
Make sure the cluster deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s
Slurm Cluster
Download values and install a Slurm cluster from OCI package.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm/values.yaml \
  -o values-slurm.yaml
By default, the values-slurm.yaml file uses standard for
controller.persistence.storageClass and
mariadb.primary.persistence.storageClass. You will need to update these values
to default to use AKS's default storageClass.
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--version 0.1.0 \
-f values-slurm.yaml \
--namespace=slurm \
--create-namespace
Make sure the slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME                              READY   STATUS    RESTARTS   AGE
slurm-accounting-0                1/1     Running   0          5m00s
slurm-compute-debug-l4bd2         1/1     Running   0          5m00s
slurm-controller-0                2/2     Running   0          5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
slurm-mariadb-0                   1/1     Running   0          5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client
commands:
kubectl --namespace=slurm exec \
-it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. host slurm@slurm-controller-0), run the following
commands to quickly test Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
See Slurm Commands for
more details on how to interact with Slurm.
2.2 - 0.2.x
2.2.1 - Overview
slurm-operator
This project provides a framework that runs
Slurm in
Kubernetes.
Overview
This project deploys Slurm on
Kubernetes. These pods coexist with other running
workloads on Kubernetes. This project provides controls over the Slurm cluster
configuration and deployment, along with configurable autoscaling policy for
Slurm compute nodes. This project allows for much of the functionality within
Slurm for workload management. This includes:
- Priority scheduling: Determine job execution order based on priorities and
weights such as age
- Fair share: Resources are distributed equitably among users based on
historical usage.
- Quality of Service (QoS): sets of policies, such as resource limits,
priorities, preemption, and backfilling.
- Job accounting: Information for every job and job step executed
- Job dependencies: Allow users to specify relationships between jobs, such as
starting a job only after another job starts, succeeds, fails, or reaches a
particular state.
- Workflows with partitioning: Divide cluster resources into sections for job
management
To best enable Slurm in Kubernetes, the project uses
Custom Resources (CRs)
and an
Operator to
extend Kubernetes with custom behaviors for Slurm clusters. In addition,
Helm is used for managing the deployment of the various
components of this project to Kubernetes.
Supported Slurm Versions
Data Parser: v41
Quickstart
See the
Quickstart Guide
to install.
Overall Architecture
This is a basic architecture. A more in depth description can be found
in the docs directory.

Known Issues
- slurmd fails to dynamically register with its pod's resource limits. slurmd
currently registers and asserts the host machine's resources instead of its pod
resource limits.
- slurmd fails to start with cgroups enforcement enabled within a pod.
cgroup.conf is forced to CgroupPlugin=disabled to avoid init failure.
License
Copyright (C) SchedMD LLC.
Licensed under the Apache License, Version 2.0; you may not use this project
except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed
under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR
CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
2.2.2 - Architecture
Overview
This document describes the high-level architecture of the Slinky
slurm-operator
.
Big Picture

The slurm-operator
follows the Kubernetes operator
pattern.
Operators are software extensions to Kubernetes that make use of custom
resources to manage applications and their components. Operators follow
Kubernetes principles, notably the control loop.
The slurm-operator
has one controller for each Custom Resource Definition
(CRD) that it is responsible for managing. Each controller has a control loop where
the state of the Custom Resource (CR) is reconciled.
Often, an operator is only concerned about data reported by the Kubernetes API.
In our case, we are also concerned about data reported by the Slurm API, which
influences how the slurm-operator
reconciles certain CRs.
Directory Map
This project follows the conventions of:
api/
Contains Custom Kubernetes API definitions. These become Custom Resource
Definitions (CRDs) and are installed into a Kubernetes cluster.
cmd/
Contains code to be compiled into binary commands.
config/
Contains yaml configuration files used for kustomize deployments.
docs/
Contains project documentation.
hack/
Contains files for development and Kubebuilder. This includes a kind.sh script
that can be used to create a kind cluster with all pre-requisites for local
testing.
helm/
Contains helm deployments, including the configuration files such as
values.yaml.
Helm is the recommended method to install this project into your Kubernetes
cluster.
internal/
Contains code that is used internally. This code is not externally importable.
internal/controller/
Contains the controllers.
Each controller is named after the Custom Resource Definition (CRD) it manages.
Currently, this consists of the nodeset and the cluster CRDs.
2.2.3 - Autoscaling
The slurm-operator may be configured to autoscale NodeSets pods based on Slurm
metrics. This guide discusses how to configure autoscaling using KEDA.
Getting Started
Before attempting to autoscale NodeSets, Slinky should be fully deployed to a
Kubernetes cluster and Slurm jobs should be able to run.
Dependencies
Autoscaling requires additional services that are not included in Slinky. Follow
documentation to install Prometheus, Metrics Server, and KEDA.
Prometheus will install tools to report metrics and view them with Grafana. The
Metrics Server is needed to report CPU and memory usage for tools like
kubectl top
. KEDA is recommended for autoscaling as it provides usability
improvements over the standard Horizontal Pod Autoscaler (HPA).
To add the KEDA chart repository to helm, run:
helm repo add kedacore https://kedacore.github.io/charts
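Note: the command above only registers the chart repository. A minimal install of
the KEDA chart itself might look like the following (the release name and
namespace here are common defaults, not prescribed by this project):
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace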
Install the slurm-exporter. This chart is installed as a dependency of the
slurm helm chart by default. Configure using helm/slurm/values.yaml.
Verify KEDA Metrics API Server is running
$ kubectl get apiservice -l app.kubernetes.io/instance=keda
NAME SERVICE AVAILABLE AGE
v1beta1.external.metrics.k8s.io keda/keda-operator-metrics-apiserver True 22h
KEDA provides the metrics apiserver required by HPA to scale on custom metrics
from Slurm. An alternative like Prometheus Adapter could be used for this, but
KEDA offers usability enhancements and improvements to HPA in addition to
including a metrics apiserver.
Autoscaling
Autoscaling NodeSets allows Slurm partitions to expand and contract in response
to CPU and memory usage. Using Slurm metrics, NodeSets may also scale based
on Slurm specific information like the number of pending jobs or the size of the
largest pending job in a partition. There are many ways to configure
autoscaling. Experiment with different combinations based on the types of jobs
being run and the resources available in the cluster.
NodeSet Scale Subresource
Scaling a resource in Kubernetes requires that resources such as Deployments and
StatefulSets support the scale subresource. This is also true of the NodeSet
Custom Resource.
The scale subresource gives a standard interface to observe and control the
number of replicas of a resource. In the case of NodeSet, it allows Kubernetes
and related services to control the number of slurmd
replicas running as part
of the NodeSet.
To manually scale a NodeSet, use the kubectl scale
command. In this example,
the NodeSet (nss) slurm-compute-radar
is scaled to 1.
$ kubectl scale -n slurm nss/slurm-compute-radar --replicas=1
nodeset.slinky.slurm.net/slurm-compute-radar scaled
$ kubectl get pods -o wide -n slurm -l app.kubernetes.io/instance=slurm-compute-radar
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
slurm-compute-radar-0 1/1 Running 0 2m48s 10.244.4.17 kind-worker <none> <none>
This corresponds to the Slurm partition radar
.
$ kubectl exec -n slurm statefulset/slurm-controller -- sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
radar up infinite 1 idle kind-worker
NodeSets may be scaled to zero. In this case, there are no replicas of slurmd
running and all jobs scheduled to that partition will remain in a pending state.
$ kubectl scale nss/slurm-compute-radar -n slurm --replicas=0
nodeset.slinky.slurm.net/slurm-compute-radar scaled
For NodeSets to scale on demand, an autoscaler needs to be deployed. KEDA allows
resources to scale from 0<->1 and also creates an HPA to scale based on scalers
like Prometheus and more.
KEDA ScaledObject
KEDA uses the Custom Resource ScaledObject to monitor and scale a resource. It
will automatically create the HPA needed to scale based on external triggers
like Prometheus. With Slurm metrics, NodeSets may be scaled based on data
collected from the Slurm restapi.
This example ScaledObject will watch the number of jobs pending for the
partition radar
and scale the NodeSet slurm-compute-radar
until a threshold
value is satisfied or maxReplicaCount
is reached.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scale-radar
spec:
  scaleTargetRef:
    apiVersion: slinky.slurm.net/v1alpha1
    kind: NodeSet
    name: slurm-compute-radar
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.prometheus:9090
        query: slurm_partition_pending_jobs{partition="radar"}
        threshold: "5"
Note: The Prometheus trigger is using metricType: Value
instead of the
default AverageValue
. AverageValue
calculates the replica count by averaging
the threshold across the current replica count.
Check ScaledObject documentation for a full list of allowable options.
In this scenario, the ScaledObject scale-radar
will query the Slurm metric
slurm_partition_pending_jobs
from Prometheus with the label
partition="radar"
.
When there is activity on the trigger (at least one pending job), KEDA will
scale the NodeSet to minReplicaCount
and then let HPA handle scaling up to
maxReplicaCount
or back down to minReplicaCount
. When there is no activity
on the trigger after a configurable amount of time, KEDA will scale the NodeSet
to idleReplicaCount
. See the KEDA documentation on idleReplicaCount for
more examples.
Note: The only supported value for idleReplicaCount
is 0 due to
limitations on how the HPA controller works.
To verify a KEDA ScaledObject, apply it to the cluster in the appropriate
namespace on a NodeSet that has no replicas.
$ kubectl scale nss/slurm-compute-radar -n slurm --replicas=0
nodeset.slinky.slurm.net/slurm-compute-radar scaled
Wait for Slurm to report that the partition has no nodes.
slurm@slurm-controller-0:/tmp$ sinfo -p radar
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
radar up infinite 0 n/a
Apply the ScaledObject using kubectl
to the correct namespace and verify the
KEDA and HPA resources are created.
$ kubectl apply -f scaledobject.yaml -n slurm
scaledobject.keda.sh/scale-radar created
$ kubectl get -n slurm scaledobjects
NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS AUTHENTICATION READY ACTIVE FALLBACK PAUSED AGE
scale-radar slinky.slurm.net/v1alpha1.NodeSet slurm-compute-radar 1 5 prometheus True False Unknown Unknown 28s
$ kubectl get -n slurm hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
keda-hpa-scale-radar NodeSet/slurm-compute-radar <unknown>/5 1 5 0 32s
Once the ScaledObject and HPA are created, initiate some jobs to test that the
NodeSet
scale subresource is scaled in response.
$ sbatch --wrap "sleep 30" --partition radar --exclusive
The NodeSet will scale to minReplicaCount
in response to activity on the
trigger. Once the number of pending jobs crosses the configured threshold
(submit more exclusive jobs to the partition), more replicas will be created to
handle the additional demand. Until the threshold
is exceeded, the NodeSet
will remain at minReplicaCount
.
Note: This example only works well for single-node jobs, unless threshold
is set to 1. In that case, HPA will continue to scale up the NodeSet as long as
there is a pending job, until it reaches the maxReplicaCount.
After the default coolDownPeriod
of 5 minutes without activity on the trigger,
KEDA will scale the NodeSet down to 0.
2.2.4 - Cluster Controller
Overview
This controller is responsible for managing and reconciling the Cluster CRD. A
Cluster CR represents communication with a Slurm cluster via slurmrestd and
auth/jwt.
This controller uses the Slurm client library.
Sequence Diagram
sequenceDiagram
autonumber
actor User as User
participant KAPI as Kubernetes API
participant CC as Cluster Controller
box Operator Internals
participant SCM as Slurm Client Map
participant SEC as Slurm Event Channel
end %% Operator Internals
note over KAPI: Handle CR Creation
User->>KAPI: Create Cluster CR
KAPI-->>CC: Watch Cluster CRD
CC->>+KAPI: Get referenced secret
KAPI-->>-CC: Return secret
create participant SC as Slurm Client
CC->>+SC: Create Slurm Client for Cluster
SC-->>-CC: Return Slurm Client Status
loop Watch Slurm Nodes
SC->>+SAPI: Get Slurm Nodes
SAPI-->>-SC: Return Slurm Nodes
SC->>SEC: Add Event for Cache Delta
end %% loop Watch Slurm Nodes
CC->>SCM: Add Slurm Client to Map
CC->>+SC: Ping Slurm Control Plane
SC->>+SAPI: Ping Slurm Control Plane
SAPI-->>-SC: Return Ping
SC-->>-CC: Return Ping
CC->>KAPI: Update Cluster CR Status
note over KAPI: Handle CR Deletion
User->>KAPI: Delete Cluster CR
KAPI-->>CC: Watch Cluster CRD
SCM-->>CC: Lookup Slurm Client
destroy SC
CC-)SC: Shutdown Slurm Client
CC->>SCM: Remove Slurm Client from Map
participant SAPI as Slurm REST API
2.2.5 - Develop
This document aims to provide enough information that you can get started with
development on this project.
Getting Started
You will need a Kubernetes cluster to run against. You can use KIND to get a
local cluster for testing, or run against your choice of remote cluster.
Note: Your controller will automatically use the current context in your
kubeconfig file (i.e. whatever cluster kubectl cluster-info
shows).
Dependencies
Install KIND and Golang binaries for pre-commit hooks.
sudo apt-get install golang
make install
Pre-Commit
Install pre-commit and install the git hooks.
sudo apt-get install pre-commit
pre-commit install
Docker
Install Docker and configure rootless Docker.
Afterward, test that your user account can communicate with Docker.
Helm
Install Helm.
sudo snap install helm --classic
Skaffold
Install Skaffold.
curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64 && \
sudo install skaffold /usr/local/bin/
If google-cloud-sdk is installed, skaffold is available as an additional
component.
sudo apt-get install -y google-cloud-cli-skaffold
Kubernetes Client
Install kubectl.
sudo snap install kubectl --classic
If google-cloud-sdk is installed, kubectl is available as an additional
component.
sudo apt-get install -y kubectl
Running on the Cluster
For development, all Helm deployments use a values-dev.yaml
. If they do not
exist in your environment yet or you are unsure, safely copy the values.yaml
as a base by running:
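For example (a sketch; the exact paths depend on which charts you are working on):
cp helm/slurm-operator/values.yaml helm/slurm-operator/values-dev.yaml
cp helm/slurm/values.yaml helm/slurm/values-dev.yaml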
Automatic
You can use Skaffold to build and push images, and deploy components using:
cd helm/slurm-operator/
skaffold run
NOTE: The skaffold.yaml
is configured to inject the image and tag into the
values-dev.yaml
so they are correctly referenced.
Operator
The slurm operator aims to follow the Kubernetes Operator
pattern.
It uses Controllers, which provide a reconcile function
responsible for synchronizing resources until the desired state is reached on
the cluster.
Install CRDs
When deploying a helm chart with skaffold or helm, the CRDs defined in its
crds/
directory will be installed if not already present in the cluster.
Uninstall CRDs
To delete the Operator CRDs from the cluster:
WARNING: CRDs do not upgrade! The old ones must be uninstalled first so the
new ones can be installed. This should only be done in development.
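As a sketch, assuming the CRDs live in the chart’s crds/ directory as noted
above:
kubectl delete -f helm/slurm-operator/crds/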
Modifying the API Definitions
If you are editing the API definitions, generate the manifests such as CRs or
CRDs using:
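In a Kubebuilder-based project this is typically the manifests target (an
assumption; check the project Makefile):
make manifests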
Slurm Version Changed
If the Slurm version has changed, generate the new OpenAPI spec and its golang
client code using:
NOTE: Update code interacting with the API in accordance with the
slurmrestd plugin lifecycle.
Running the operator locally
Install the operator’s CRDs with make install
.
Launch the operator via the VSCode debugger using the “Launch Operator” launch
task.
Because the operator will be running outside of Kubernetes and needs to
communicate with the Slurm cluster, set the following options in your Slurm helm
chart’s values.yaml
:
debug.enable=true
debug.localOperator=true
If running on a Kind cluster, also set:
debug.disableCgroups=true
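Expressed as YAML in values.yaml, these options correspond to:
debug:
  enable: true
  localOperator: true
  disableCgroups: true # only needed on a Kind cluster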
If the Slurm helm chart is being deployed with skaffold, run
skaffold run --port-forward --tail
. It is configured to automatically
port-forward the restapi for the local operator to
communicate with the Slurm cluster.
If skaffold is not used, manually run
kubectl port-forward --namespace slurm services/slurm-restapi 6820:6820
for
the local operator to communicate with the Slurm cluster.
After starting the operator, verify it is able to contact the Slurm cluster by
checking that the Cluster CR has been marked ready:
$ kubectl get --namespace slurm clusters.slinky.slurm.net
NAME READY AGE
slurm true 110s
See skaffold port-forwarding to learn how skaffold
automatically detects which services to forward.
2.2.6 - NodeSet Controller
Overview
The nodeset controller is responsible for managing and reconciling the NodeSet
CRD, which represents a set of homogeneous Slurm Nodes.
Design
This controller is responsible for managing and reconciling the NodeSet CRD. In
addition to the regular responsibility of managing resources in Kubernetes via
the Kubernetes API, this controller should take into consideration the state of
Slurm to make certain reconciliation decisions.
Sequence Diagram
sequenceDiagram
autonumber
actor User as User
participant KAPI as Kubernetes API
participant NS as NodeSet Controller
box Operator Internals
participant SCM as Slurm Client Map
participant SEC as Slurm Event Channel
end %% Operator Internals
participant SC as Slurm Client
participant SAPI as Slurm REST API
loop Watch Slurm Nodes
SC->>+SAPI: Get Slurm Nodes
SAPI-->>-SC: Return Slurm Nodes
SC->>SEC: Add Event for Cache Delta
end %% loop Watch Slurm Nodes
note over KAPI: Handle CR Update
SEC-->>NS: Watch Event Channel
User->>KAPI: Update NodeSet CR
KAPI-->>NS: Watch NodeSet CRD
opt Scale-out Replicas
NS->>KAPI: Create Pods
end %% Scale-out Replicas
opt Scale-in Replicas
SCM-->>NS: Lookup Slurm Client
NS->>+SC: Drain Slurm Node
SC->>+SAPI: Drain Slurm Node
SAPI-->>-SC: Return Drain Slurm Node Status
SC-->>-NS: Drain Slurm Node
alt Slurm Node is Drained
NS->>KAPI: Delete Pod
else
NS->>NS: Check Again Later
end %% alt Slurm Node is Drained
end %% opt Scale-in Replicas
2.2.7 - Slurm
Slurm
Slurm is an open source, fault-tolerant, and highly scalable cluster
management and job scheduling system for large and small Linux clusters. Slurm
requires no kernel modifications for its operation and is relatively
self-contained. As a cluster workload manager, Slurm has three key functions.
First, it allocates exclusive and/or non-exclusive access to resources
(compute nodes) to users for some duration of time so they can perform work.
Second, it provides a framework for starting, executing, and monitoring work
(normally a parallel job) on the set of allocated nodes. Finally, it
arbitrates contention for resources by managing a queue of pending work.
Optional plugins can be used for accounting, advanced reservation, gang
scheduling (time sharing for parallel jobs), backfill scheduling, topology
optimized resource selection, resource limits by user or bank account, and
sophisticated multifactor job prioritization algorithms.
Architecture

See the Slurm architecture docs for more information.
2.2.8 - Quickstart Guides
2.2.8.1 - QuickStart Guide for Amazon EKS
This quickstart guide will help you get the slurm-operator running and deploy
Slurm clusters to Amazon EKS.
Setup
Setup a cluster on EKS.
eksctl create cluster \
--name slinky-cluster \
--region us-west-2 \
--nodegroup-name slinky-nodes \
--node-type t3.medium \
--nodes 2
Setup kubectl to point to your new cluster.
aws eks --region us-west-2 update-kubeconfig --name slinky-cluster
Pre-Requisites
Install the pre-requisite helm charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true
Install EBS CSI Driver
helm install aws-ebs-csi aws-ebs-csi-driver/aws-ebs-csi-driver -n kube-system
AWS Permissions
You will need to make sure your IAM user has the proper permissions.
Step 1: Identify the IAM Role
Run the following AWS CLI command to get the IAM role attached to your EKS
worker nodes:
aws eks describe-nodegroup \
--cluster-name slinky-cluster \
--nodegroup-name slinky-nodes \
--query "nodegroup.nodeRole" \
--output text
This will return something like:
arn:aws:iam::017820679962:role/eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK
The IAM role name here is
eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK.
Step 2: Attach the Required IAM Policy for EBS CSI Driver
Attach the AmazonEBSCSIDriverPolicy managed IAM policy to this role.
Run the following command:
aws iam attach-role-policy \
  --role-name eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
Create StorageClass
You will need to create a StorageClass to use.
Here is an example storageclass.yaml file for a StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp2
  fsType: ext4
Create the StorageClass using your storageclass.yaml file.
kubectl apply -f storageclass.yaml
Slurm Operator
Download values and install the slurm-operator. You will need to update the
operator and webhook repository values to point to the desired container
repository.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm-operator/values.yaml \
  -o values-operator.yaml
helm install slurm-operator \
  -f values-operator.yaml \
  --namespace=slinky \
  --create-namespace \
  helm/slurm-operator
Make sure the cluster deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-operator-7444c844d5-dpr5h 1/1 Running 0 5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh 1/1 Running 0 5m00s
Slurm Cluster
Download values and install a Slurm cluster.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm/values.yaml \
  -o values-slurm.yaml
helm install slurm \
  -f values-slurm.yaml \
  --namespace=slurm \
  --create-namespace \
  helm/slurm
Make sure the slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-accounting-0 1/1 Running 0 5m00s
slurm-compute-debug-l4bd2 1/1 Running 0 5m00s
slurm-controller-0 2/2 Running 0 5m00s
slurm-exporter-7b44b6d856-d86q5 1/1 Running 0 5m00s
slurm-mariadb-0 1/1 Running 0 5m00s
slurm-restapi-5f75db85d9-67gpl 1/1 Running 0 5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client
commands:
kubectl --namespace=slurm exec \
-it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. host slurm@slurm-controller-0), run the following
commands to quickly test Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
See Slurm Commands for
more details on how to interact with Slurm.
2.2.8.2 - QuickStart Guide for Google GKE
This quickstart guide will help you get the slurm-operator running and deploy
Slurm clusters to GKE.
Setup
Setup a cluster on GKE.
gcloud container clusters create slinky-cluster \
--location=us-central1-a \
--num-nodes=2 \
--node-taints "" \
--machine-type=c2-standard-16
Setup kubectl to point to your new cluster.
gcloud container clusters get-credentials slinky-cluster
Pre-Requisites
Install the pre-requisite helm charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true
Slurm Operator
Download values and install the slurm-operator. You will need to update the
operator and webhook repository values to point to the desired container
repository.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm-operator/values.yaml \
  -o values-operator.yaml
helm install slurm-operator \
  -f values-operator.yaml \
  --namespace=slinky \
  --create-namespace \
  helm/slurm-operator
Make sure the cluster deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-operator-7444c844d5-dpr5h 1/1 Running 0 5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh 1/1 Running 0 5m00s
Slurm Cluster
Download values and install a Slurm cluster.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm/values.yaml \
  -o values-slurm.yaml
helm install slurm \
  -f values-slurm.yaml \
  --namespace=slurm \
  --create-namespace \
  helm/slurm
Make sure the slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-accounting-0 1/1 Running 0 5m00s
slurm-compute-debug-l4bd2 1/1 Running 0 5m00s
slurm-controller-0 2/2 Running 0 5m00s
slurm-exporter-7b44b6d856-d86q5 1/1 Running 0 5m00s
slurm-mariadb-0 1/1 Running 0 5m00s
slurm-restapi-5f75db85d9-67gpl 1/1 Running 0 5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client
commands:
kubectl --namespace=slurm exec \
-it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. host slurm@slurm-controller-0), run the following
commands to quickly test Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
See Slurm Commands for
more details on how to interact with Slurm.
2.2.8.3 - QuickStart Guide for Microsoft AKS
This quickstart guide will help you get the slurm-operator running and deploy
Slurm clusters to AKS.
Setup
Setup a resource group on AKS
az group create --name slinky --location westus2
Setup a cluster on AKS
az aks create \
--resource-group slinky \
--name slinky \
--location westus2 \
--node-vm-size Standard_D2s_v3
Setup kubectl to point to your new cluster.
az aks get-credentials --resource-group slinky --name slinky
Pre-Requisites
Install the pre-requisite helm charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace prometheus --create-namespace --set installCRDs=true
Slurm Operator
Download values and install the slurm-operator. You will need to update the
operator and webhook repository values to point to the desired container
repository.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm-operator/values.yaml \
  -o values-operator.yaml
Make sure you are authenticated and the proper role is assigned to pull your
images.
az acr login -n slinky
az aks show \
--resource-group slinky \
--name slinky \
--query identityProfile.kubeletidentity.clientId \
-o tsv
az role assignment create --assignee <clientId from above> \
--role AcrPull \
--scope $(az acr show --name slinky --query id -o tsv)
helm install slurm-operator \
-f values-operator.yaml \
--namespace=slinky \
--create-namespace \
helm/slurm-operator
Make sure the cluster deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-operator-7444c844d5-dpr5h 1/1 Running 0 5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh 1/1 Running 0 5m00s
Slurm Cluster
Download values and install a Slurm cluster.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm/values.yaml \
  -o values-slurm.yaml
helm install slurm \
  -f values-slurm.yaml \
  --namespace=slurm \
  --create-namespace \
  helm/slurm
By default the values-slurm.yaml file uses standard
for
controller.persistence.storageClass
and
mariadb.primary.persistence.storageClass
. You will need to update this value
to default
to use AKS’s default storageClass.
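For example, the relevant overrides in values-slurm.yaml would look like the
following (a sketch based on the keys above):
controller:
  persistence:
    storageClass: default
mariadb:
  primary:
    persistence:
      storageClass: default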
Make sure the slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-accounting-0 1/1 Running 0 5m00s
slurm-compute-debug-l4bd2 1/1 Running 0 5m00s
slurm-controller-0 2/2 Running 0 5m00s
slurm-exporter-7b44b6d856-d86q5 1/1 Running 0 5m00s
slurm-mariadb-0 1/1 Running 0 5m00s
slurm-restapi-5f75db85d9-67gpl 1/1 Running 0 5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client
commands:
kubectl --namespace=slurm exec \
-it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. host slurm@slurm-controller-0), run the following
commands to quickly test Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
See Slurm Commands for
more details on how to interact with Slurm.
2.3 - 0.3.x
2.3.1 - Concepts
Concepts related to slurm-operator
internals and design.
2.3.1.1 - Architecture
Overview
This document describes the high-level architecture of the Slinky
slurm-operator
.
Operator
The following diagram illustrates the operator, from a communication
perspective.

The slurm-operator
follows the Kubernetes operator
pattern.
Operators are software extensions to Kubernetes that make use of custom
resources to manage applications and their components. Operators follow
Kubernetes principles, notably the control loop.
The slurm-operator
has one controller for each Custom Resource Definition
(CRD) that it is responsible for managing. Each controller has a control loop where
the state of the Custom Resource (CR) is reconciled.
Often, an operator is only concerned about data reported by the Kubernetes API.
In our case, we are also concerned about data reported by the Slurm API, which
influences how the slurm-operator
reconciles certain CRs.
Slurm
The following diagram illustrates a containerized Slurm cluster, from a
communication perspective.

For additional information about Slurm, see the slurm docs.
Hybrid
The following hybrid diagram is an example. There are many different
configurations for a hybrid setup. The core takeaways are: slurmd can be on
bare-metal and still be joined to your containerized Slurm cluster; external
services that your Slurm cluster needs or wants (e.g. AD/LDAP, NFS, MariaDB) do
not have to live in Kubernetes to be functional with your Slurm cluster.
Autoscale
Kubernetes supports resource autoscaling. In the context of Slurm, autoscaling
Slurm compute nodes can be quite useful when your Kubernetes and Slurm clusters
have workload fluctuations.

See the autoscaling guide for additional information.
Directory Map
This project follows the conventions of:
api/
Contains Custom Kubernetes API definitions. These become Custom Resource
Definitions (CRDs) and are installed into a Kubernetes cluster.
cmd/
Contains code to be compiled into binary commands.
config/
Contains yaml configuration files used for kustomize deployments.
docs/
Contains project documentation.
hack/
Contains files for development and Kubebuilder. This includes a kind.sh script
that can be used to create a kind cluster with all pre-requisites for local
testing.
helm/
Contains helm deployments, including the configuration files such as
values.yaml.
Helm is the recommended method to install this project into your Kubernetes
cluster.
internal/
Contains code that is used internally. This code is not externally importable.
internal/controller/
Contains the controllers.
Each controller is named after the Custom Resource Definition (CRD) it manages.
Currently, this consists of the nodeset and the cluster CRDs.
2.3.1.2 - Cluster Controller
Overview
This controller is responsible for managing and reconciling the Cluster CRD. A
Cluster CR represents communication with a Slurm cluster via slurmrestd and
auth/jwt.
This controller uses the Slurm client library.
Sequence Diagram
sequenceDiagram
autonumber
actor User as User
participant KAPI as Kubernetes API
participant CC as Cluster Controller
box Operator Internals
participant SCM as Slurm Client Map
participant SEC as Slurm Event Channel
end %% Operator Internals
note over KAPI: Handle CR Creation
User->>KAPI: Create Cluster CR
KAPI-->>CC: Watch Cluster CRD
CC->>+KAPI: Get referenced secret
KAPI-->>-CC: Return secret
create participant SC as Slurm Client
CC->>+SC: Create Slurm Client for Cluster
SC-->>-CC: Return Slurm Client Status
loop Watch Slurm Nodes
SC->>+SAPI: Get Slurm Nodes
SAPI-->>-SC: Return Slurm Nodes
SC->>SEC: Add Event for Cache Delta
end %% loop Watch Slurm Nodes
CC->>SCM: Add Slurm Client to Map
CC->>+SC: Ping Slurm Control Plane
SC->>+SAPI: Ping Slurm Control Plane
SAPI-->>-SC: Return Ping
SC-->>-CC: Return Ping
CC->>KAPI: Update Cluster CR Status
note over KAPI: Handle CR Deletion
User->>KAPI: Delete Cluster CR
KAPI-->>CC: Watch Cluster CRD
SCM-->>CC: Lookup Slurm Client
destroy SC
CC-)SC: Shutdown Slurm Client
CC->>SCM: Remove Slurm Client from Map
participant SAPI as Slurm REST API
2.3.1.3 - NodeSet Controller
Overview
The nodeset controller is responsible for managing and reconciling the NodeSet
CRD, which represents a set of homogeneous Slurm Nodes.
Design
This controller is responsible for managing and reconciling the NodeSet CRD. In
addition to the regular responsibility of managing resources in Kubernetes via
the Kubernetes API, this controller should take into consideration the state of
Slurm to make certain reconciliation decisions.
Sequence Diagram
sequenceDiagram
autonumber
actor User as User
participant KAPI as Kubernetes API
participant NS as NodeSet Controller
box Operator Internals
participant SCM as Slurm Client Map
participant SEC as Slurm Event Channel
end %% Operator Internals
participant SC as Slurm Client
participant SAPI as Slurm REST API
loop Watch Slurm Nodes
SC->>+SAPI: Get Slurm Nodes
SAPI-->>-SC: Return Slurm Nodes
SC->>SEC: Add Event for Cache Delta
end %% loop Watch Slurm Nodes
note over KAPI: Handle CR Update
SEC-->>NS: Watch Event Channel
User->>KAPI: Update NodeSet CR
KAPI-->>NS: Watch NodeSet CRD
opt Scale-out Replicas
NS->>KAPI: Create Pods
end %% Scale-out Replicas
opt Scale-in Replicas
SCM-->>NS: Lookup Slurm Client
NS->>+SC: Drain Slurm Node
SC->>+SAPI: Drain Slurm Node
SAPI-->>-SC: Return Drain Slurm Node Status
SC-->>-NS: Drain Slurm Node
alt Slurm Node is Drained
NS->>KAPI: Delete Pod
else
NS->>NS: Check Again Later
end %% alt Slurm Node is Drained
end %% opt Scale-in Replicas
2.3.1.4 - Cluster CRD
2.3.1.5 - NodeSet CRD
2.3.1.6 - Slurm
Slurm is an open source, fault-tolerant, and highly scalable cluster
management and job scheduling system for large and small Linux clusters. Slurm
requires no kernel modifications for its operation and is relatively
self-contained. As a cluster workload manager, Slurm has three key functions.
First, it allocates exclusive and/or non-exclusive access to resources
(compute nodes) to users for some duration of time so they can perform work.
Second, it provides a framework for starting, executing, and monitoring work
(normally a parallel job) on the set of allocated nodes. Finally, it
arbitrates contention for resources by managing a queue of pending work.
Optional plugins can be used for accounting, advanced reservation, gang
scheduling (time sharing for parallel jobs), backfill scheduling, topology
optimized resource selection, resource limits by user or bank account, and
sophisticated multifactor job prioritization algorithms.
Architecture

See the Slurm architecture docs for more information.
2.3.2 - Tasks
Guides to tasks related to the administration of a cluster running
slurm-operator
.
2.3.2.1 - Autoscaling
Getting Started
Before attempting to autoscale NodeSets, Slinky should be fully deployed to a
Kubernetes cluster and Slurm jobs should be able to run.
Dependencies
Autoscaling requires additional services that are not included in Slinky. Follow
documentation to install Prometheus, Metrics Server, and KEDA.
Prometheus will install tools to report metrics and view them with Grafana. The
Metrics Server is needed to report CPU and memory usage for tools like
kubectl top
. KEDA is recommended for autoscaling as it provides usability
improvements over the standard Horizontal Pod Autoscaler (HPA).
To add the KEDA chart repository to helm, run:
helm repo add kedacore https://kedacore.github.io/charts
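Note: adding the repository only registers the chart source. Installing the KEDA
chart itself might look like the following (release name and namespace are
common defaults, not prescribed by this project):
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace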
Install the slurm-exporter. This chart is installed as a dependency of the
slurm helm chart by default. Configure using helm/slurm/values.yaml.
Verify KEDA Metrics API Server is running
$ kubectl get apiservice -l app.kubernetes.io/instance=keda
NAME SERVICE AVAILABLE AGE
v1beta1.external.metrics.k8s.io keda/keda-operator-metrics-apiserver True 22h
KEDA provides the metrics apiserver required by HPA to scale on custom metrics
from Slurm. An alternative like Prometheus Adapter could be used for this, but
KEDA offers usability enhancements and improvements to HPA in addition to
including a metrics apiserver.
Autoscaling
Autoscaling NodeSets allows Slurm partitions to expand and contract in response
to CPU and memory usage. Using Slurm metrics, NodeSets may also scale based
on Slurm specific information like the number of pending jobs or the size of the
largest pending job in a partition. There are many ways to configure
autoscaling. Experiment with different combinations based on the types of jobs
being run and the resources available in the cluster.
NodeSet Scale Subresource
Scaling a resource in Kubernetes requires that resources such as Deployments and
StatefulSets support the scale subresource. This is also true of the NodeSet
Custom Resource.
The scale subresource gives a standard interface to observe and control the
number of replicas of a resource. In the case of NodeSet, it allows Kubernetes
and related services to control the number of slurmd
replicas running as part
of the NodeSet.
To manually scale a NodeSet, use the kubectl scale
command. In this example,
the NodeSet (nss) slurm-compute-radar
is scaled to 1.
$ kubectl scale -n slurm nss/slurm-compute-radar --replicas=1
nodeset.slinky.slurm.net/slurm-compute-radar scaled
$ kubectl get pods -o wide -n slurm -l app.kubernetes.io/instance=slurm-compute-radar
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
slurm-compute-radar-0 1/1 Running 0 2m48s 10.244.4.17 kind-worker <none> <none>
This corresponds to the Slurm partition radar
.
$ kubectl exec -n slurm statefulset/slurm-controller -- sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
radar up infinite 1 idle kind-worker
NodeSets may be scaled to zero. In this case, there are no replicas of slurmd
running and all jobs scheduled to that partition will remain in a pending state.
$ kubectl scale nss/slurm-compute-radar -n slurm --replicas=0
nodeset.slinky.slurm.net/slurm-compute-radar scaled
For NodeSets to scale on demand, an autoscaler needs to be deployed. KEDA allows
resources to scale from 0<->1 and also creates an HPA to scale based on scalers
like Prometheus and more.
KEDA ScaledObject
KEDA uses the Custom Resource ScaledObject to monitor and scale a resource. It
will automatically create the HPA needed to scale based on external triggers
like Prometheus. With Slurm metrics, NodeSets may be scaled based on data
collected from the Slurm restapi.
This example ScaledObject will watch the number of jobs pending for the
partition radar
and scale the NodeSet slurm-compute-radar
until a threshold
value is satisfied or maxReplicaCount
is reached.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scale-radar
spec:
  scaleTargetRef:
    apiVersion: slinky.slurm.net/v1alpha1
    kind: NodeSet
    name: slurm-compute-radar
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 3
  triggers:
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.prometheus:9090
        query: slurm_partition_pending_jobs{partition="radar"}
        threshold: "5"
Note: The Prometheus trigger is using metricType: Value
instead of the
default AverageValue
. AverageValue
calculates the replica count by averaging
the threshold across the current replica count.
Check ScaledObject documentation for a full list of allowable options.
In this scenario, the ScaledObject scale-radar
will query the Slurm metric
slurm_partition_pending_jobs
from Prometheus with the label
partition="radar"
.
When there is activity on the trigger (at least one pending job), KEDA will
scale the NodeSet to minReplicaCount
and then let HPA handle scaling up to
maxReplicaCount
or back down to minReplicaCount
. When there is no activity
on the trigger after a configurable amount of time, KEDA will scale the NodeSet
to idleReplicaCount
. See the KEDA documentation on idleReplicaCount for
more examples.
Note: The only supported value for idleReplicaCount
is 0 due to
limitations on how the HPA controller works.
To verify a KEDA ScaledObject, apply it to the cluster in the appropriate
namespace on a NodeSet that has no replicas.
$ kubectl scale nss/slurm-compute-radar -n slurm --replicas=0
nodeset.slinky.slurm.net/slurm-compute-radar scaled
Wait for Slurm to report that the partition has no nodes.
slurm@slurm-controller-0:/tmp$ sinfo -p radar
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
radar up infinite 0 n/a
Apply the ScaledObject using kubectl
to the correct namespace and verify the
KEDA and HPA resources are created.
$ kubectl apply -f scaledobject.yaml -n slurm
scaledobject.keda.sh/scale-radar created
$ kubectl get -n slurm scaledobjects
NAME SCALETARGETKIND SCALETARGETNAME MIN MAX TRIGGERS AUTHENTICATION READY ACTIVE FALLBACK PAUSED AGE
scale-radar slinky.slurm.net/v1alpha1.NodeSet slurm-compute-radar 1 5 prometheus True False Unknown Unknown 28s
$ kubectl get -n slurm hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
keda-hpa-scale-radar NodeSet/slurm-compute-radar <unknown>/5 1 5 0 32s
Once the ScaledObject and HPA are created, initiate some jobs to test that the
NodeSet
scale subresource is scaled in response.
$ sbatch --wrap "sleep 30" --partition radar --exclusive
The NodeSet will scale to minReplicaCount
in response to activity on the
trigger. Once the number of pending jobs crosses the configured threshold
(submit more exclusive jobs to the partition), more replicas will be created to
handle the additional demand. Until the threshold
is exceeded, the NodeSet
will remain at minReplicaCount
.
Note: This example only works well for single-node jobs, unless threshold
is set to 1. In that case, HPA will continue to scale up the NodeSet as long as
there is a pending job, until it reaches the maxReplicaCount.
After the default coolDownPeriod
of 5 minutes without activity on the trigger,
KEDA will scale the NodeSet down to 0.
2.3.2.2 - Development
This document aims to provide enough information that you can get started with
development on this project.
Getting Started
You will need a Kubernetes cluster to run against. You can use KIND to get a
local cluster for testing, or run against your choice of remote cluster.
Note: Your controller will automatically use the current context in your
kubeconfig file (i.e. whatever cluster kubectl cluster-info
shows).
Dependencies
Install KIND and Golang binaries for pre-commit hooks.
sudo apt-get install golang
make install
Pre-Commit
Install pre-commit and install the git hooks.
sudo apt-get install pre-commit
pre-commit install
Docker
Install Docker and configure rootless Docker.
Afterward, test that your user account can communicate with Docker.
Helm
Install Helm.
sudo snap install helm --classic
Skaffold
Install Skaffold.
curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64 && \
sudo install skaffold /usr/local/bin/
If google-cloud-sdk is installed, skaffold is available as an additional
component.
sudo apt-get install -y google-cloud-cli-skaffold
Kubernetes Client
Install kubectl.
sudo snap install kubectl --classic
If google-cloud-sdk is installed, kubectl is available as an additional
component.
sudo apt-get install -y kubectl
Running on the Cluster
For development, all Helm deployments use a values-dev.yaml
. If they do not
exist in your environment yet or you are unsure, safely copy the values.yaml
as a base by running:
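For example (a sketch; adjust the paths for the charts you are working on):
cp helm/slurm-operator/values.yaml helm/slurm-operator/values-dev.yaml
cp helm/slurm/values.yaml helm/slurm/values-dev.yaml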
Automatic
You can use Skaffold to build and push images, and deploy components using:
cd helm/slurm-operator/
skaffold run
NOTE: The skaffold.yaml
is configured to inject the image and tag into the
values-dev.yaml
so they are correctly referenced.
Operator
The slurm operator aims to follow the Kubernetes Operator
pattern.
It uses Controllers, which provide a reconcile function
responsible for synchronizing resources until the desired state is reached on
the cluster.
Install CRDs
When deploying a helm chart with skaffold or helm, the CRDs defined in its
crds/
directory will be installed if not already present in the cluster.
Uninstall CRDs
To delete the Operator CRDs from the cluster:
WARNING: CRDs do not upgrade! The old ones must be uninstalled first so the
new ones can be installed. This should only be done in development.
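As a sketch, assuming the CRDs live in the chart’s crds/ directory as described
above:
kubectl delete -f helm/slurm-operator/crds/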
Modifying the API Definitions
If you are editing the API definitions, generate the manifests such as CRs or
CRDs using:
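In a Kubebuilder-based project this is typically the manifests target (an
assumption; check the project Makefile):
make manifests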
Slurm Version Changed
If the Slurm version has changed, generate the new OpenAPI spec and its golang
client code using:
NOTE: Update code interacting with the API in accordance with the
slurmrestd plugin lifecycle.
Running the operator locally
Install the operator’s CRDs with make install
.
Launch the operator via the VSCode debugger using the “Launch Operator” launch
task.
Because the operator will be running outside of Kubernetes and needs to
communicate with the Slurm cluster, set the following options in your Slurm helm
chart’s values.yaml
:
debug.enable=true
debug.localOperator=true
If running on a Kind cluster, also set:
debug.disableCgroups=true
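Expressed as YAML in values.yaml, these options correspond to:
debug:
  enable: true
  localOperator: true
  disableCgroups: true # only needed on a Kind cluster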
If the Slurm helm chart is being deployed with skaffold, run
skaffold run --port-forward --tail
. It is configured to automatically
port-forward the restapi for the local operator to
communicate with the Slurm cluster.
If skaffold is not used, manually run
kubectl port-forward --namespace slurm services/slurm-restapi 6820:6820
for
the local operator to communicate with the Slurm cluster.
After starting the operator, verify it is able to contact the Slurm cluster by
checking that the Cluster CR has been marked ready:
$ kubectl get --namespace slurm clusters.slinky.slurm.net
NAME READY AGE
slurm true 110s
See skaffold port-forwarding to learn how skaffold
automatically detects which services to forward.
Slurm Cluster
Get into a Slurm pod that can submit workload.
kubectl --namespace=slurm exec -it deployments/slurm-login -- bash -l
kubectl --namespace=slurm exec -it statefulsets/slurm-controller -- bash -l
cloud-provider-kind -enable-lb-port-mapping &
SLURM_LOGIN_PORT="$(kubectl --namespace=slurm get services -l app.kubernetes.io/name=login,app.kubernetes.io/instance=slurm -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ports[0].port}")"
SLURM_LOGIN_IP="$(kubectl --namespace=slurm get services -l app.kubernetes.io/name=login,app.kubernetes.io/instance=slurm -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}")"
ssh -p "$SLURM_LOGIN_PORT" "${USER}@${SLURM_LOGIN_IP}"
2.3.2.3 - Using Pyxis
Overview
This guide describes how to configure your Slurm cluster to use pyxis (and
enroot), a Slurm SPANK plugin for containerized jobs with Nvidia GPU
support.
Configure plugstack.conf
to include the pyxis configuration.
Warning: In plugstack.conf
, you must use glob syntax to avoid slurmctld
failure while trying to resolve the paths in the includes. Only the login and
slurmd pods should actually have the pyxis libraries installed.
slurm:
  configFiles:
    plugstack.conf: |
      include /usr/share/pyxis/*
...
Configure one or more NodeSets and the login pods to use a pyxis OCI image.
login:
  image:
    repository: ghcr.io/slinkyproject/login-pyxis
  ...
compute:
  nodesets:
    - name: debug
      image:
        repository: ghcr.io/slinkyproject/slurmd-pyxis
      ...
To make enroot activity permissible in the login container, set
securityContext.privileged=true.
login:
  image:
    repository: ghcr.io/slinkyproject/login-pyxis
  securityContext:
    privileged: true
Test
Submit a job to a Slurm node.
$ srun --partition=debug grep PRETTY /etc/os-release
PRETTY_NAME="Ubuntu 24.04.2 LTS"
Submit a job to a Slurm node with pyxis and it will launch in its requested
container.
$ srun --partition=debug --container-image=alpine:latest grep PRETTY /etc/os-release
pyxis: importing docker image: alpine:latest
pyxis: imported docker image: alpine:latest
PRETTY_NAME="Alpine Linux v3.21"
Warning: SPANK plugins will only work on the specific Slurm nodes that have
them installed and are configured to use them. It is best to constrain where
jobs run with --partition=<partition>, --batch=<features>, and/or
--constraint=<features> to ensure a compatible computing environment.
If the login container has securityContext.privileged=true
, enroot activity is
permissible. You can test the functionality with the following:
enroot import docker://alpine:latest
2.3.3 - Quickstart Guides
2.3.3.1 - QuickStart Guide for Amazon EKS
This quickstart guide will help you get the slurm-operator running and deploy
Slurm clusters to Amazon EKS.
Setup
Setup a cluster on EKS:
eksctl create cluster \
--name slinky-cluster \
--region us-west-2 \
--nodegroup-name slinky-nodes \
--node-type t3.medium \
--nodes 2
Setup kubectl to point to your new cluster:
aws eks --region us-west-2 update-kubeconfig --name slinky-cluster
Pre-Requisites
Install the pre-requisite helm charts:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace prometheus \
--create-namespace \
--set installCRDs=true
Install EBS CSI Driver
helm install aws-ebs-csi aws-ebs-csi-driver/aws-ebs-csi-driver \
--namespace kube-system
AWS Permissions
You will need to make sure your IAM user has the proper permissions.
Step 1: Identify the IAM Role
Run the following AWS CLI command to get the IAM role attached to your EKS
worker nodes:
aws eks describe-nodegroup \
--cluster-name slinky-cluster \
--nodegroup-name slinky-nodes \
--query "nodegroup.nodeRole" \
--output text
This will return something like:
arn:aws:iam::017820679962:role/eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK
The IAM role name here is
eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK.
Step 2: Attach the Required IAM Policy for EBS CSI Driver
Attach the AmazonEBSCSIDriverPolicy managed IAM policy to this role.
Run the following command:
aws iam attach-role-policy \
--role-name eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK \
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
Create StorageClass
You will need to create a StorageClass to use.
Here is an example storageclass.yaml file for a StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp2
  fsType: ext4
Create the StorageClass using your storageclass.yaml file.
kubectl apply -f storageclass.yaml
Slurm Operator
Download values and install the slurm-operator. You will need to update the
operator and webhook repository values to point to the desired container
repository.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.3.0/helm/slurm-operator/values.yaml \
-o values-operator.yaml
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--values=values-operator.yaml \
--version=0.3.0 \
--namespace=slinky \
--create-namespace
Make sure the cluster deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-operator-7444c844d5-dpr5h 1/1 Running 0 5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh 1/1 Running 0 5m00s
Slurm Cluster
Download values and install a Slurm cluster.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.3.0/helm/slurm/values.yaml \
-o values-slurm.yaml
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--values=values-slurm.yaml \
--version=0.3.0 \
--namespace=slurm \
--create-namespace
Make sure the slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-accounting-0 1/1 Running 0 5m00s
slurm-compute-debug-0 1/1 Running 0 5m00s
slurm-controller-0 2/2 Running 0 5m00s
slurm-exporter-7b44b6d856-d86q5 1/1 Running 0 5m00s
slurm-mariadb-0 1/1 Running 0 5m00s
slurm-restapi-5f75db85d9-67gpl 1/1 Running 0 5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client
commands:
kubectl --namespace=slurm exec -it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. host slurm@slurm-controller-0), run the following
commands to quickly test Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
See Slurm Commands for
more details on how to interact with Slurm.
2.3.3.2 - QuickStart Guide for Google GKE
This quickstart guide will help you get the slurm-operator running and deploy
Slurm clusters to GKE.
Setup
Setup a cluster on GKE.
gcloud container clusters create slinky-cluster \
--location=us-central1-a \
--num-nodes=2 \
--node-taints "" \
--machine-type=c2-standard-16
Setup kubectl to point to your new cluster.
gcloud container clusters get-credentials slinky-cluster
Pre-Requisites
Install the pre-requisite helm charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace prometheus \
--create-namespace \
--set installCRDs=true
Slurm Operator
Download values and install the slurm-operator. You will need to update the
operator and webhook repository values to point to the desired container
repository.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.3.0/helm/slurm-operator/values.yaml -o values-operator.yaml
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--values=values-operator.yaml \
--version=0.3.0 \
--namespace=slinky \
--create-namespace
Make sure the cluster deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-operator-7444c844d5-dpr5h 1/1 Running 0 5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh 1/1 Running 0 5m00s
Slurm Cluster
Download values and install a Slurm cluster.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.3.0/helm/slurm/values.yaml -o values-slurm.yaml
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--values=values-slurm.yaml \
--version=0.3.0 \
--namespace=slurm \
--create-namespace
Make sure the slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-accounting-0 1/1 Running 0 5m00s
slurm-compute-debug-0 1/1 Running 0 5m00s
slurm-controller-0 2/2 Running 0 5m00s
slurm-exporter-7b44b6d856-d86q5 1/1 Running 0 5m00s
slurm-mariadb-0 1/1 Running 0 5m00s
slurm-restapi-5f75db85d9-67gpl 1/1 Running 0 5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client
commands:
kubectl -n slurm exec -it slurm-controller-0 -- bash --login
On the controller pod (e.g. host slurm@slurm-controller-0), run the following
commands to quickly test Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
See Slurm Commands for
more details on how to interact with Slurm.
2.3.3.3 - QuickStart Guide for Microsoft AKS
This quickstart guide will help you get the slurm-operator running and deploy
Slurm clusters to AKS.
Setup
Setup a resource group on AKS:
az group create --name slinky --location westus2
Setup a cluster on AKS:
az aks create \
--resource-group slinky \
--name slinky \
--location westus2 \
--node-vm-size Standard_D2s_v3
Setup kubectl to point to your new cluster:
az aks get-credentials --resource-group slinky --name slinky
Pre-Requisites
Install the pre-requisite helm charts:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace prometheus \
--create-namespace \
--set installCRDs=true
Slurm Operator
Download values and install the slurm-operator. You will need to update the
operator and webhook repository values to point to the desired container
repository.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.3.0/helm/slurm-operator/values.yaml -o values-operator.yaml
Make sure you are authenticated and the proper role is assigned to pull your
images:
az acr login -n slinky
az aks show \
--resource-group slinky \
--name slinky \
--query identityProfile.kubeletidentity.clientId \
-o tsv
az role assignment create --assignee <clientId from above> \
--role AcrPull \
--scope $(az acr show --name slinky --query id -o tsv)
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--values=values-operator.yaml \
--version=0.3.0 \
--namespace=slinky \
--create-namespace
Make sure the cluster deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-operator-7444c844d5-dpr5h 1/1 Running 0 5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh 1/1 Running 0 5m00s
Slurm Cluster
Download values and install a Slurm cluster.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.3.0/helm/slurm/values.yaml -o values-slurm.yaml
By default the values-slurm.yaml file uses standard
for
controller.persistence.storageClass
and
mariadb.primary.persistence.storageClass
. You will need to update this value
to default
to use AKS’s default storageClass.
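As a sketch based on the keys above, the overrides in values-slurm.yaml would
look like:
controller:
  persistence:
    storageClass: default
mariadb:
  primary:
    persistence:
      storageClass: default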
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--values=values-slurm.yaml \
--version=0.3.0 \
--namespace=slurm \
--create-namespace
Make sure the slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-accounting-0 1/1 Running 0 5m00s
slurm-compute-debug-0 1/1 Running 0 5m00s
slurm-controller-0 2/2 Running 0 5m00s
slurm-exporter-7b44b6d856-d86q5 1/1 Running 0 5m00s
slurm-mariadb-0 1/1 Running 0 5m00s
slurm-restapi-5f75db85d9-67gpl 1/1 Running 0 5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client
commands:
kubectl --namespace=slurm exec -it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. host slurm@slurm-controller-0), run the following
commands to quickly test Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
See Slurm Commands for
more details on how to interact with Slurm.