0.3.x
- 1: Overview
- 2: Admission
- 3: Architecture
- 4: Controllers
- 5: Quickstart
- 6: Scheduler
1 - Overview
Slurm and Kubernetes are workload managers originally designed for different kinds of workloads. In broad strokes: Kubernetes excels at scheduling workloads that typically run for an indefinite amount of time, with potentially vague resource requirements, on a single node, with loose policy, but it can grow its resource pool to meet demand; Slurm excels at quickly scheduling workloads that run for a finite amount of time, with well-defined resource requirements and topology, on multiple nodes, with strict policy, but its resource pool is fixed and known in advance.
This project enables the best of both workload managers. It contains a Kubernetes scheduler that manages selected workloads from Kubernetes, deferring the actual scheduling decisions to Slurm.
For additional architectural notes, see the architecture docs.
Features
Slurm
Slurm is a full-featured HPC workload manager. To highlight a few features:
- Priority: assigns priorities to jobs upon submission and on an ongoing basis (e.g. as they age).
- Preemption: stop one or more low-priority jobs to let a high-priority job run.
- QoS: sets of policies affecting scheduling priority, preemption, and resource limits.
- Fairshare: distribute resources equitably among users and accounts based on historical usage.
Limitations
Installation
Install the slurm-bridge scheduler:
helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \
--namespace=slinky --create-namespace
For additional instructions, see the quickstart guide.
License
Copyright (C) SchedMD LLC.
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this project except in compliance with the License.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
2 - Admission
Overview
The admission controller is a mutating webhook for Pods. Any pod created in a managed namespace is modified so that our scheduler, rather than the default scheduler, schedules it.
Design
Any pod created in a managed namespace will have its .spec.schedulerName changed to our scheduler. Managed namespaces are defined as a list of namespaces configured in the admission controller’s values.yaml under managedNamespaces[].
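A minimal sketch of the relevant values.yaml entry is shown below; the nesting under an admission key and the namespace names are assumptions for illustration only, so check the chart’s values.yaml for the exact structure.
# Hedged values.yaml sketch: namespaces handled by the admission webhook
# (key nesting assumed, namespace names hypothetical).
admission:
  managedNamespaces:
    - slurm-bridge   # hypothetical namespace
    - batch          # hypothetical namespace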
Sequence Diagram
sequenceDiagram
  autonumber
  participant KAPI as Kubernetes API
  participant SBA as Slurm-Bridge Admission
  KAPI-->>SBA: Watch Pod Create/Update
  opt Pod in managed Namespaces
    SBA->>KAPI: Update `.spec.schedulerName` and Tolerations
    KAPI-->>SBA: Update Response
  end %% opt Pod in managed Namespaces
3 - Architecture
Overview
This document describes the high-level architecture of the Slinky slurm-bridge.
Big Picture
Directory Map
This project follows the conventions of the Standard Go Project Layout and Kubebuilder. The main directories are:
api/
Contains Custom Kubernetes API definitions. These become Custom Resource Definitions (CRDs) and are installed into a Kubernetes cluster.
cmd/
Contains code to be compiled into binary commands.
config/
Contains YAML configuration files used for Kustomize deployments.
docs/
Contains project documentation.
hack/
Contains files for development and Kubebuilder. This includes a kind.sh script that can be used to create a kind cluster with all pre-requisites for local testing.
helm/
Contains helm deployments, including the configuration files such as values.yaml.
Helm is the recommended method to install this project into your Kubernetes cluster.
internal/
Contains code that is used internally. This code is not externally importable.
internal/controller/
Contains the controllers.
Currently, this consists of the node and workload controllers described in the Controllers section.
internal/scheduler/
Contains scheduling framework plugins. Currently, this consists of slurm-bridge.
4 - Controllers
Overview
This component is comprised of multiple controllers with specialized tasks.
Node Controller
The node controller is responsible for tainting the managed nodes so that the scheduler component is fully in control of all workloads bound to those nodes.
Additionally, this controller will reconcile certain node states for scheduling purposes. Slurm becomes the source of truth for scheduling among managed nodes.
A managed node is defined as a node with a colocated kubelet and slurmd on the same physical host, which slurm-bridge can schedule on.
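As a hedged illustration, the slinky.slurm.net/slurm-nodename label from the quickstart is how such a colocated node is mapped when its Kubernetes and Slurm names differ; the node and Slurm names below are hypothetical.
# Sketch of a managed node (hypothetical names). The label is only needed
# when the Kubernetes Node name differs from the Slurm NodeName.
apiVersion: v1
kind: Node
metadata:
  name: worker-1
  labels:
    slinky.slurm.net/slurm-nodename: node01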
sequenceDiagram
  autonumber
  participant KAPI as Kubernetes API
  participant SWC as Slurm Workload Controller
  participant SAPI as Slurm REST API
  loop Reconcile Loop
    KAPI-->>SWC: Watch Kubernetes Nodes
    alt Node is managed
      SWC->>KAPI: Taint Node
      KAPI-->>SWC: Taint Node
    else
      SWC->>KAPI: Untaint Node
      KAPI-->>SWC: Untaint Node
    end %% alt Node is managed
    alt Node is schedulable
      SWC->>SAPI: Drain Node
      SAPI-->>SWC: Drain Node
    else
      SWC->>SAPI: Undrain Node
      SAPI-->>SWC: Undrain Node
    end %% alt Node is schedulable
  end %% loop Reconcile Loop
Workload Controller
The workload controller reconciles Kubernetes Pods and Slurm Jobs. Slurm is the source of truth for what workload is allowed to run on which managed nodes.
sequenceDiagram
  autonumber
  participant KAPI as Kubernetes API
  participant SWC as Slurm Workload Controller
  participant SAPI as Slurm REST API
  loop Reconcile Loop
    critical Map Slurm Job to Pod
      KAPI-->>SWC: Watch Kubernetes Pods
      SAPI-->>SWC: Watch Slurm Jobs
    option Pod is Terminated
      SWC->>SAPI: Terminate Slurm Job
      SAPI-->>SWC: Return Status
    option Job is Terminated
      SWC->>KAPI: Evict Pod
      KAPI-->>SWC: Return Status
    end %% critical Map Slurm Job to Pod
  end %% loop Reconcile Loop
5 - Quickstart
Overview
This quickstart guide will help you get the slurm-bridge running and configured with your existing Slurm cluster.
Slurm Configuration
There is a set of assumptions that the slurm-bridge-scheduler must make. The Slurm admin must satisfy those assumptions in their Slurm cluster:
- There exists a set of hosts that have colocated kubelet and slurmd (installed on the same host, typically running as a systemd service).
- Slurm is configured with a partition that only contains hosts with a colocated kubelet and slurmd.
- The partition name must match the one configured in the values.yaml used to deploy the slurm-bridge helm chart (default: “slurm-bridge”); see the values sketch after this list.
- Example slurm.conf snippet:
  # slurm.conf
  ...
  NodeSet=kubernetes Feature=kubernetes
  PartitionName=slurm-bridge Nodes=kubernetes
- In the event that the colocated node’s Slurm NodeName does not match the
Kubernetes Node name, you should patch the Kubernetes node with a label to
allow slurm-bridge to map the colocated Kubernetes and Slurm node.
kubectl patch node $KUBERNETES_NODENAME -p "{\"metadata\":{\"labels\":{\"slinky.slurm.net/slurm-nodename\":\"$SLURM_NODENAME\"}}}"
- Slurm has Multi-Category Security enabled for labels.
- Example slurm.conf snippet:
  # slurm.conf
  ...
  MCSPlugin=mcs/label
  MCSParameters=ondemand,ondemandselect
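As referenced above, the partition name is set in the chart’s values.yaml. The sketch below is illustrative only; the key name is an assumption, so consult the chart’s values.yaml for the real structure.
# Hedged values.yaml sketch (key name assumed, not authoritative).
# The value must match the PartitionName created for slurm-bridge in slurm.conf.
partition: slurm-bridge   # hypothetical key; default partition name per these docs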
Install slurm-bridge
Pre-Requisites
Install the pre-requisite helm charts.
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace --set crds.enabled=true
Slurm Bridge
Download values and install the slurm-bridge from the OCI package:
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-bridge/refs/tags/v0.3.0/helm/slurm-bridge/values.yaml \
-o values-bridge.yaml
helm install slurm-bridge oci://ghcr.io/slinkyproject/charts/slurm-bridge \
--values=values-bridge.yaml --version=0.3.0 --namespace=slinky --create-namespace
You can check if your cluster deployed successfully with:
kubectl --namespace=slinky get pods
Your output should be similar to:
NAME READY STATUS RESTARTS AGE
slurm-bridge-admission-85f89cf884-8c9jt 1/1 Running 0 1m0s
slurm-bridge-controllers-757f64b875-bsfnf 1/1 Running 0 1m0s
slurm-bridge-scheduler-5484467f55-wtspk 1/1 Running 0 1m0s
Scheduling Workload
Generally speaking, slurm-bridge translates one or more pods into a representative Slurm workload, where Slurm does the underlying scheduling. Certain optimizations can be made, depending on which resource is being translated.
slurm-bridge has specific scheduling support for JobSet and PodGroup resources and their pods. If your workload requires or benefits from co-scheduled pod launch (e.g. MPI, multi-node), consider representing your workload as a JobSet or PodGroup.
Admission Controller
slurm-bridge will only schedule pods that request slurm-bridge as their scheduler. The slurm-bridge admission controller can be configured to automatically make slurm-bridge the scheduler for all pods created in the configured namespaces.
Alternatively, a pod can specify Pod.Spec.schedulerName=slurm-bridge-scheduler
from any namespace.
Annotations
Users can better inform or influence how slurm-bridge represents their Kubernetes workload within Slurm by adding annotations on the parent object.
Example “pause” bare pod to illustrate annotations:
apiVersion: v1
kind: Pod
metadata:
  name: pause
  # slurm-bridge annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi
Example “pause” deployment to illustrate annotations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  # slurm-bridge annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      schedulerName: slurm-bridge-scheduler
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.6
          resources:
            limits:
              cpu: "1"
              memory: 100Mi
JobSets
This section assumes the JobSet CRD and its controller are installed.
JobSet pods will be co-scheduled and launched together. The JobSet controller is responsible for managing the JobSet status and other Pod interactions once marked as completed.
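As a hedged illustration (names, replica counts, and annotation values below are placeholders, not a canonical slurm-bridge example), a JobSet whose pods request the slurm-bridge scheduler might look like this:
# Hypothetical JobSet sketch: pods request the slurm-bridge scheduler, and
# slurm-bridge annotations sit on the parent object per the Annotations section.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example-jobset                 # hypothetical name
  annotations:
    slinky.slurm.net/timelimit: "5"    # optional slurm-bridge annotation
spec:
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          parallelism: 2
          completions: 2
          template:
            spec:
              schedulerName: slurm-bridge-scheduler
              restartPolicy: Never
              containers:
                - name: pause
                  image: registry.k8s.io/pause:3.6
                  resources:
                    limits:
                      cpu: "1"
                      memory: 100Mi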
PodGroups
This section assumes the PodGroup CRD is installed and that the out-of-tree kube-scheduler is installed and configured as a (non-primary) scheduler.
Pods contained within a PodGroup will be co-scheduled and launched together. The PodGroup controller is responsible for managing the PodGroup status and other Pod interactions once marked as completed.
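A hedged sketch follows, assuming the PodGroup CRD from the scheduler-plugins project (scheduling.x-k8s.io/v1alpha1) and its pod-group label convention; names and counts are placeholders, and the exact label key may vary by scheduler-plugins version.
# Hypothetical PodGroup sketch: the member pod is labeled for the group and
# requests the slurm-bridge scheduler.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: example-podgroup               # hypothetical name
spec:
  minMember: 2                         # launch only once 2 members can be co-scheduled
---
apiVersion: v1
kind: Pod
metadata:
  name: pause-0
  labels:
    scheduling.x-k8s.io/pod-group: example-podgroup  # associates the pod with the group
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi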
6 - Scheduler
Overview
The scheduler controller is responsible for scheduling pending pods onto nodes.
This scheduler is designed to be a non-primary scheduler (i.e. it should not replace the default kube-scheduler). This means that only certain pods should be scheduled via this scheduler (e.g. non-critical pods).
This scheduler represents Kubernetes Pods as a Slurm Job, waits for Slurm to schedule the Job, then informs Kubernetes of the nodes on which to place the represented Pods. This scheduler defers scheduling decisions to Slurm, hence certain assumptions about the environment must be met for it to function correctly.
Sequence Diagram
sequenceDiagram
  autonumber
  actor user as User
  participant KAPI as Kubernetes API
  participant SBS as Slurm-Bridge Scheduler
  participant SAPI as Slurm REST API
  loop Workload Submission
    user->>KAPI: Submit Pod
    KAPI-->>user: Return Request Status
  end %% loop Workload Submission
  loop Scheduling Loop
    SBS->>KAPI: Get Next Pod in Workload Queue
    KAPI-->>SBS: Return Next Pod in Workload Queue
    note over SBS: Honor Slurm scheduling decision
    critical Lookup Slurm Placeholder Job
      SBS->>SAPI: Get Placeholder Job
      SAPI-->>SBS: Return Placeholder Job
    option Job is NotFound
      note over SBS: Translate Pod(s) into Slurm Job
      SBS->>SAPI: Submit Placeholder Job
      SAPI-->>SBS: Return Submit Status
    option Job is Pending
      note over SBS: Check again later...
      SBS->>SBS: Requeue
    option Job is Allocated
      note over SBS: Bind Pod(s) to Node(s) from the Slurm Job
      SBS->>KAPI: Bind Pod(s) to Node(s)
      KAPI-->>SBS: Return Bind Request Status
    end %% critical Lookup Slurm Placeholder Job
  end %% loop Scheduling Loop