Architecture

Overview

This document describes the high-level architecture of the Slinky slurm-bridge.

A pod scheduled by slurm-bridge coordinates with Slurm through a placeholder job that represents the pod workload. The placeholder job uses the external job capability introduced in Slurm 25.05. An external Slurm job operates like any other job in Slurm, with the exception that an external job is launched by something other than Slurm. In the case of slurm-bridge, the placeholder job determines where and when a pod runs, but kubelet launches the pod instead of slurmd.

Pod Flowchart

[Diagram: the pod scheduling flow between Kubernetes and Slurm]

The above diagram represents the process of scheduling a pod with Slurm through the following sequence:

  1. A pod is applied to a configured slurm-bridge namespace.

  2. The pod is sent to the slurm-bridge admission webhook.

  3. The slurm-bridge scheduler begins placement of the pod.

  4. slurm-bridge coordinates with Slurm to create a “placeholder job”.

  5. The placeholder job is scheduled to node1.

  6. slurm-bridge determines the placeholder job has started on node1.

  7. The scheduler binds the pod to node1.

  8. kubelet starts the pod on node1.

During its lifecycle, the slurm-bridge controller continually reconciles events from Kubernetes and Slurm.
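
The sequence above can be sketched in Go pseudocode. All types and function names here are illustrative stand-ins, not the actual slurm-bridge API; the real implementation works with Kubernetes API objects and Slurm jobs.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type Pod struct {
	Name, Namespace, NodeName string
}

type PlaceholderJob struct {
	ID   int
	Node string // node Slurm selected for the job
}

// submitPlaceholderJob stands in for asking Slurm to create an
// external placeholder job representing the pod workload (step 4).
func submitPlaceholderJob(p Pod) PlaceholderJob {
	return PlaceholderJob{ID: 1, Node: "node1"} // Slurm places the job on node1 (step 5)
}

// bindPod stands in for the scheduler binding the pod to the node
// where the placeholder job started (steps 6-7); kubelet then
// launches the pod on that node (step 8).
func bindPod(p *Pod, j PlaceholderJob) {
	p.NodeName = j.Node
}

func main() {
	pod := Pod{Name: "example", Namespace: "slurm-bridge"} // steps 1-3
	job := submitPlaceholderJob(pod)
	bindPod(&pod, job)
	fmt.Printf("pod %s bound to %s\n", pod.Name, pod.NodeName)
}
```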

Directory Map

This project's directory layout follows these conventions:

cmd/

Contains code to be compiled into binary commands.

config/

Contains YAML configuration files used for Kustomize deployments.

docs/

Contains project documentation.

hack/

Contains files for development and Kubebuilder. This includes a kind.sh script that can be used to create a kind cluster with all prerequisites for local testing.

helm/

Contains Helm deployments, including configuration files such as values.yaml.

Helm is the recommended method to install this project into your Kubernetes cluster.

internal/

Contains code that is used internally. This code is not externally importable.

internal/admission/

Contains the admission webhook.

The webhook sets the scheduler name for pods created in the configured namespace and enforces policy on labels and annotations used by the Slurm scheduler.
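
As a minimal sketch of the webhook's mutation logic, the core behavior can be thought of as follows. The pod type is simplified (the real webhook operates on AdmissionReview requests carrying corev1.Pod objects), and the scheduler name and namespace used below are assumptions for illustration:

```go
package main

import "fmt"

// Simplified pod spec for illustration only.
type PodSpec struct {
	Namespace     string
	SchedulerName string
}

// Namespaces configured for slurm-bridge; the name is an assumption.
var bridgeNamespaces = map[string]bool{"slurm-bridge": true}

// mutate mirrors the webhook conceptually: pods created in a
// configured namespace get the slurm-bridge scheduler assigned so
// the default scheduler leaves them alone.
func mutate(spec *PodSpec) {
	if bridgeNamespaces[spec.Namespace] && spec.SchedulerName == "" {
		spec.SchedulerName = "slurm-bridge"
	}
}

func main() {
	spec := PodSpec{Namespace: "slurm-bridge"}
	mutate(&spec)
	fmt.Println(spec.SchedulerName)
}
```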

internal/controller/

Contains the node and pod controllers.

The pod controller syncs the state of pods running in Kubernetes with the associated placeholder job managed by Slurm, and vice versa. Similarly, the node controller syncs node states between Kubernetes and Slurm.
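
As a rough illustration of this reconciliation, the pod controller can be thought of as mapping placeholder-job states to actions on the pod. The state names below follow Slurm's job states, but the action vocabulary is a simplified sketch, not the controller's actual logic:

```go
package main

import "fmt"

// mapJobState translates a Slurm placeholder-job state into the
// action the pod controller would take; illustrative only.
func mapJobState(state string) string {
	switch state {
	case "PENDING":
		return "wait for placement"
	case "RUNNING":
		return "bind pod"
	case "COMPLETED":
		return "mark pod succeeded"
	case "FAILED", "CANCELLED":
		return "delete pod"
	default:
		return "requeue and re-check"
	}
}

func main() {
	for _, s := range []string{"PENDING", "RUNNING", "COMPLETED"} {
		fmt.Printf("%s -> %s\n", s, mapJobState(s))
	}
}
```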

internal/scheduler/

Contains scheduling framework plugins. Currently, this consists of the slurm-bridge plugin.
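
A sketch of how such a plugin gates binding, in the spirit of a kube-scheduler framework PreBind extension point. The real plugin implements the scheduling framework's interfaces; the shape below is illustrative only:

```go
package main

import (
	"errors"
	"fmt"
)

// slurmBridgePlugin is a hypothetical stand-in for the real plugin.
type slurmBridgePlugin struct {
	// jobStarted reports whether the placeholder job for the pod is
	// running on the candidate node; stubbed out in this sketch.
	jobStarted func(pod, node string) bool
}

// PreBind blocks binding until the placeholder job has started,
// mirroring steps 6-7 of the pod flowchart.
func (p *slurmBridgePlugin) PreBind(pod, node string) error {
	if !p.jobStarted(pod, node) {
		return errors.New("placeholder job not yet started")
	}
	return nil
}

func main() {
	plugin := &slurmBridgePlugin{
		jobStarted: func(pod, node string) bool { return node == "node1" },
	}
	fmt.Println(plugin.PreBind("example", "node1"))
}
```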