0.3.x

1 - Overview

Slurm and Kubernetes are workload managers originally designed for different kinds of workloads. In broad strokes: Kubernetes excels at scheduling workloads that typically run for an indefinite amount of time, with potentially vague resource requirements, on a single node, with loose policy, but it can scale its resource pool virtually without limit to meet demand; Slurm excels at quickly scheduling workloads that run for a finite amount of time, with well-defined resource requirements and topology, on multiple nodes, with strict policy, but its resource pool is fixed and known in advance.

This project enables the best of both workload managers. It contains a Kubernetes scheduler that manages selected workloads from Kubernetes while deferring the actual placement decisions to Slurm.

(Figure: slurm-bridge overview)

For additional architectural notes, see the architecture docs.

Features

Slurm

Slurm is a full featured HPC workload manager. To highlight a few features:

  • Priority: assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).
  • Preemption: stops one or more low-priority jobs to let a high-priority job run.
  • QoS: sets of policies affecting scheduling priority, preemption, and resource limits.
  • Fairshare: distributes resources equitably among users and accounts based on historical usage.
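
These behaviors are driven by Slurm-side configuration rather than by slurm-bridge itself; as a rough illustration only (parameter values are examples, not recommendations):

# slurm.conf (illustrative excerpt)
# multifactor priority: job priority derived from age, fairshare, job size, QOS, etc.
PriorityType=priority/multifactor
# allow jobs in a higher-priority QOS to preempt jobs in lower-priority ones
PreemptType=preempt/qos
PreemptMode=REQUEUE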

Limitations

  • Kubernetes Version: >= v1.29
  • Slurm Version: >= 25.05

Installation

Install the slurm-bridge scheduler:

helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \
  --namespace=slinky --create-namespace

For additional instructions, see the quickstart guide.

License

Copyright (C) SchedMD LLC.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

2 - Admission

Overview

The admission controller is a mutating webhook for Pods. Any Pod created in one of the managed namespaces is modified so that our scheduler, rather than the default scheduler, schedules it.

Design

Any Pod created in a managed namespace has its .spec.schedulerName changed to our scheduler.

Managed namespaces are defined as a list of namespaces, configured via managedNamespaces[] in the admission controller’s values.yaml.
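
For example, the list might look like the following in values.yaml (a minimal sketch; the key path and namespace names are illustrative and may differ in your chart version):

# values.yaml (slurm-bridge admission) -- illustrative only
managedNamespaces:
  - slurm-bridge
  - batch-jobs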

Sequence Diagram

sequenceDiagram
  autonumber

  participant KAPI as Kubernetes API
  participant SBA as Slurm-Bridge Admission

  KAPI-->>SBA: Watch Pod Create/Update
  opt Pod in managed Namespaces
    SBA->>KAPI: Update `.spec.schedulerName` and Tolerations
    KAPI-->>SBA: Update Response
  end %% opt Pod in managed Namespaces

3 - Architecture

Overview

This document describes the high-level architecture of the Slinky slurm-bridge.

Big Picture

(Figure: slurm-bridge big-picture architecture)

Directory Map

This project’s directory layout follows common Go and Kubebuilder project conventions:

api/

Contains Custom Kubernetes API definitions. These become Custom Resource Definitions (CRDs) and are installed into a Kubernetes cluster.

cmd/

Contains code to be compiled into binary commands.

config/

Contains YAML configuration files used for Kustomize deployments.

docs/

Contains project documentation.

hack/

Contains files for development and Kubebuilder. This includes a kind.sh script that can be used to create a kind cluster with all prerequisites for local testing.
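
For example, a local test cluster can be brought up roughly like this (a sketch; consult the script itself for its supported options):

# from the repository root
./hack/kind.sh
# confirm the kind cluster exists
kind get clusters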

helm/

Contains helm deployments, including the configuration files such as values.yaml.

Helm is the recommended method to install this project into your Kubernetes cluster.

internal/

Contains code that is used internally. This code is not externally importable.

internal/controller/

Contains the controllers.

Each controller is named after the Custom Resource Definition (CRD) it manages. Currently, this consists of the nodeset and the cluster CRDs.

internal/scheduler/

Contains scheduling framework plugins. Currently, this consists of slurm-bridge.

4 - Controllers

Overview

This component comprises multiple controllers with specialized tasks.

Node Controller

The node controller is responsible for tainting the managed nodes so that the scheduler component is fully in control of all workloads bound to those nodes.

Additionally, this controller will reconcile certain node states for scheduling purposes. Slurm becomes the source of truth for scheduling among managed nodes.

A managed node is a node that has a kubelet and slurmd colocated on the same physical host and that slurm-bridge is permitted to schedule on.
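
To see the effect on a given node, inspect its taints directly (the exact taint key applied by the node controller is version-specific; this only shows where to look):

kubectl get node $NODE_NAME --output=jsonpath='{.spec.taints}'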

sequenceDiagram
  autonumber

  participant KAPI as Kubernetes API
  participant SWC as Slurm Workload Controller
  participant SAPI as Slurm REST API

  loop Reconcile Loop
    KAPI-->>SWC: Watch Kubernetes Nodes

    alt Node is managed
      SWC->>KAPI: Taint Node
      KAPI-->>SWC: Taint Node
    else
      SWC->>KAPI: Untaint Node
      KAPI-->>SWC: Untaint Node
    end %% alt Node is managed

    alt Node is schedulable
      SWC->>SAPI: Drain Node
      SAPI-->>SWC: Drain Node
    else
      SWC->>SAPI: Undrain Node
      SAPI-->>SWC: Undrain Node
    end %% alt Node is schedulable

  end %% loop Reconcile Loop

Workload Controller

The workload controller reconciles Kubernetes Pods and Slurm Jobs. Slurm is the source of truth for what workload is allowed to run on which managed nodes.
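
One way to observe this reconciliation is to compare the Pods on the Kubernetes side with the corresponding Jobs on the Slurm side (how the placeholder jobs are named is an internal detail and may vary between versions):

# Pods in a managed namespace
kubectl --namespace=$MANAGED_NAMESPACE get pods --output=wide
# Slurm jobs: job ID, name, state, node list
squeue -o '%i %j %T %N'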

sequenceDiagram
  autonumber

  participant KAPI as Kubernetes API
  participant SWC as Slurm Workload Controller
  participant SAPI as Slurm REST API

  loop Reconcile Loop

  critical Map Slurm Job to Pod
    KAPI-->>SWC: Watch Kubernetes Pods
    SAPI-->>SWC: Watch Slurm Jobs
  option Pod is Terminated
    SWC->>SAPI: Terminate Slurm Job
    SAPI-->>SWC: Return Status
  option Job is Terminated
    SWC->>KAPI: Evict Pod
    KAPI-->>SWC: Return Status
  end %% critical Map Slurm Job to Pod

  end %% loop Reconcile Loop

5 - Quickstart

Overview

This quickstart guide will help you get the slurm-bridge running and configured with your existing Slurm cluster.

Slurm Configuration

The slurm-bridge-scheduler makes a set of assumptions about its environment. The Slurm admin must satisfy these assumptions in their Slurm cluster.

  • There exists a set of hosts that have colocated kubelet and slurmd (installed on the same host, typically running as a systemd service).
  • Slurm is configured with a partition that only contains hosts with colocated kubelet and slurmd.
    • The partition name must match the one configured in values.yaml used to deploy the slurm-bridge helm chart (default: “slurm-bridge”).
    • Example slurm.conf snippet.
      # slurm.conf
      ...
      NodeSet=kubernetes Feature=kubernetes
      PartitionName=slurm-bridge Nodes=kubernetes
      
  • In the event that the colocated node’s Slurm NodeName does not match the Kubernetes Node name, you should patch the Kubernetes node with a label to allow slurm-bridge to map the colocated Kubernetes and Slurm node.
    kubectl patch node $KUBERNETES_NODENAME -p "{\"metadata\":{\"labels\":{\"slinky.slurm.net/slurm-nodename\":\"$SLURM_NODENAME\"}}}"
    
  • Slurm has Multi-Category Security enabled for labels.
    • Example slurm.conf snippet.
    # slurm.conf
    ...
    MCSPlugin=mcs/label
    MCSParameters=ondemand,ondemandselect
    

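Once the cluster is prepared as described above, the partition and the node-name mapping can be sanity-checked with standard tooling:

# confirm the bridge partition exists and contains the colocated hosts
scontrol show partition slurm-bridge
# confirm the mapping label is present on the Kubernetes node (if one was needed)
kubectl get node $KUBERNETES_NODENAME --show-labels
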
Install slurm-bridge

Pre-Requisites

Install the prerequisite Helm charts.

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace --set crds.enabled=true
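
You can confirm that cert-manager is running before continuing:

kubectl --namespace=cert-manager get pods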

Slurm Bridge

Download the values file and install slurm-bridge from the OCI package:

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-bridge/refs/tags/v0.3.0/helm/slurm-bridge/values.yaml \
  -o values-bridge.yaml
helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \
  --values=values-bridge.yaml --version=0.3.0 --namespace=slinky --create-namespace

You can check if your cluster deployed successfully with:

kubectl --namespace=slinky get pods

Your output should be similar to:

NAME                                        READY   STATUS    RESTARTS      AGE
slurm-bridge-admission-85f89cf884-8c9jt     1/1     Running   0             1m0s
slurm-bridge-controllers-757f64b875-bsfnf   1/1     Running   0             1m0s
slurm-bridge-scheduler-5484467f55-wtspk     1/1     Running   0             1m0s

Scheduling Workload

Generally speaking, slurm-bridge translates one or more pods into a representative Slurm workload, where Slurm does the underlying scheduling. Certain optimizations can be made, depending on which resource is being translated.

slurm-bridge has specific scheduling support for JobSet and PodGroup resources and their pods. If your workload requires or benefits from co-scheduled pod launch (e.g. MPI, multi-node), consider representing your workload as a JobSet or PodGroup.

Admission Controller

slurm-bridge will only schedule pods that request slurm-bridge as their scheduler. The slurm-bridge admission controller can be configured to automatically make slurm-bridge the scheduler for all pods created in the configured namespaces.

Alternatively, a pod can specify Pod.Spec.schedulerName=slurm-bridge-scheduler from any namespace.

Annotations

Users can inform or influence how slurm-bridge represents their Kubernetes workload within Slurm by adding annotations on the parent object.

Example “pause” bare pod to illustrate annotations:

apiVersion: v1
kind: Pod
metadata:
  name: pause
  # slurm-bridge annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi

Example “pause” deployment to illustrate annotations:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  # slurm-bridge annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      schedulerName: slurm-bridge-scheduler
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.6
          resources:
            limits:
              cpu: "1"
              memory: 100Mi

JobSets

This section assumes the JobSet CRD and its controller are installed.

JobSet pods will be co-scheduled and launched together. The JobSet controller is responsible for managing the JobSet status and other Pod interactions once marked as completed.
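
As a sketch only, a minimal JobSet targeting the slurm-bridge scheduler might look like the following (this assumes the jobset.x-k8s.io/v1alpha2 API; the name, replica counts, and image are placeholders):

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pause-jobset
  # slurm-bridge annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
spec:
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          parallelism: 2
          completions: 2
          template:
            spec:
              schedulerName: slurm-bridge-scheduler
              restartPolicy: Never
              containers:
                - name: pause
                  image: registry.k8s.io/pause:3.6
                  resources:
                    limits:
                      cpu: "1"
                      memory: 100Mi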

PodGroups

This section assumes the PodGroup CRD is installed, and that the out-of-tree kube-scheduler providing it is installed and configured as a (non-primary) scheduler.

Pods contained within a PodGroup will be co-scheduled and launched together. The PodGroup controller is responsible for managing the PodGroup status and other Pod interactions once marked as completed.
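
As a sketch only, a PodGroup and one of its member Pods might look like the following (this assumes the scheduling.x-k8s.io/v1alpha1 PodGroup API and its pod-group label from the out-of-tree scheduler; verify the API group and label against the version you have installed):

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: pause-group
spec:
  minMember: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: pause-group-member-0
  labels:
    # associates this Pod with the PodGroup above
    scheduling.x-k8s.io/pod-group: pause-group
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi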

6 - Scheduler

Overview

The scheduler controller is responsible for scheduling pending pods onto nodes.

This scheduler is designed to be a non-primary scheduler (e.g. should not replace the default kube-scheduler). This means that only certain pods should be scheduled via this scheduler (e.g. non-critical pods).

This scheduler represents Kubernetes Pods as a Slurm Job, waits for Slurm to schedule the Job, then informs Kubernetes which nodes to bind the represented Pods to. This scheduler defers scheduling decisions to Slurm, hence certain assumptions about the environment must be met for it to function correctly.

Sequence Diagram

sequenceDiagram
  autonumber

  actor user as User
  participant KAPI as Kubernetes API
  participant SBS as Slurm-Bridge Scheduler
  participant SAPI as Slurm REST API

  loop Workload Submission
    user->>KAPI: Submit Pod
    KAPI-->>user: Return Request Status
  end %% loop Workload Submission

  loop Scheduling Loop
    SBS->>KAPI: Get Next Pod in Workload Queue
    KAPI-->>SBS: Return Next Pod in Workload Queue

    note over SBS: Honor Slurm scheduling decision
    critical Lookup Slurm Placeholder Job
      SBS->>SAPI: Get Placeholder Job
      SAPI-->>SBS: Return Placeholder Job
    option Job is NotFound
      note over SBS: Translate Pod(s) into Slurm Job
      SBS->>SAPI: Submit Placeholder Job
      SAPI-->>SBS: Return Submit Status
    option Job is Pending
      note over SBS: Check again later...
      SBS->>SBS: Requeue
    option Job is Allocated
      note over SBS: Bind Pod(s) to Node(s) from the Slurm Job
      SBS->>KAPI: Bind Pod(s) to Node(s)
      KAPI-->>SBS: Return Bind Request Status
    end %% Lookup Slurm Placeholder Job
  end %% loop Scheduling Loop