Concepts related to slurm-bridge internals and design.
Concepts
- 1: Admission
- 2: Architecture
- 3: Controllers
- 4: Scheduler
1 - Admission
Overview
The Kubernetes documentation defines admission controllers as:
a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the resource, but after the request is authenticated and authorized.
It also states that:
Admission control mechanisms may be validating, mutating, or both. Mutating controllers may modify the data for the resource being modified; validating controllers may not.
The slurm-bridge admission controller is a mutating controller. It modifies any pods within certain namespaces (slurm-bridge, by default) to use the slurm-bridge scheduler instead of the default kube-scheduler.
Design
Any pods created in certain namespaces will have their .spec.schedulerName changed to our scheduler.

Managed namespaces are defined as a list of namespaces configured under managedNamespaces[] in the admission controller's values.yaml.
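The sketch below illustrates the core of that mutation in Go. It is a simplified illustration, not the project's actual webhook code: the mutatePod helper is hypothetical, and the scheduler name shown is an assumption (the real value is set by the Helm chart).

```go
package admission

import corev1 "k8s.io/api/core/v1"

// Assumed scheduler name for illustration; the real value comes from the
// admission controller's Helm configuration.
const schedulerName = "slurm-bridge-scheduler"

// mutatePod points pods in a managed namespace at the slurm-bridge
// scheduler and leaves all other pods to the default kube-scheduler.
func mutatePod(pod *corev1.Pod, managedNamespaces []string) {
	for _, ns := range managedNamespaces {
		if pod.Namespace == ns {
			pod.Spec.SchedulerName = schedulerName
			return
		}
	}
}
```

The real webhook also updates tolerations, as shown in the sequence diagram below.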
Sequence Diagram
```mermaid
sequenceDiagram
    autonumber
    participant KAPI as Kubernetes API
    participant SBA as Slurm-Bridge Admission
    KAPI-->>SBA: Watch Pod Create/Update
    opt Pod in managed Namespaces
        SBA->>KAPI: Update `.spec.schedulerName` and Tolerations
        KAPI-->>SBA: Update Response
    end %% opt Pod in managed Namespaces
```
2 - Architecture
Overview
This document describes the high-level architecture of the Slinky slurm-bridge.
Big Picture
Directory Map
This project follows the conventions of:
- `api/` - Contains Custom Kubernetes API definitions. These become Custom Resource Definitions (CRDs) and are installed into a Kubernetes cluster.
- `cmd/` - Contains code to be compiled into binary commands.
- `config/` - Contains YAML configuration files used for kustomize deployments.
- `docs/` - Contains project documentation.
- `hack/` - Contains files for development and Kubebuilder. This includes a kind.sh script that can be used to create a kind cluster with all prerequisites for local testing.
- `helm/` - Contains Helm deployments, including configuration files such as values.yaml. Helm is the recommended method to install this project into your Kubernetes cluster.
- `internal/` - Contains code that is used internally. This code is not externally importable.
- `internal/controller/` - Contains the controllers. Each controller is named after the Custom Resource Definition (CRD) it manages. Currently, this consists of the nodeset and cluster CRDs.
- `internal/scheduler/` - Contains scheduling framework plugins. Currently, this consists of slurm-bridge.
3 - Controllers
Overview
The Kubernetes documentation defines controllers as:
control loops that watch the state of your cluster, then make or request changes where needed. Each controller tries to move the current cluster state closer to the desired state.
Within slurm-bridge, there are multiple controllers that manage the state of different bridge components:
- Node Controller - Responsible for the state of nodes in the bridge cluster
- Workload Controller - Responsible for the state of pods and other workloads running on slurm-bridge
Node Controller
The node controller is responsible for tainting the managed nodes so that the scheduler component is fully in control of all workloads bound to those nodes.
Additionally, this controller will reconcile certain node states for scheduling purposes. Slurm becomes the source of truth for scheduling among managed nodes.
A managed node is defined as a node that has a colocated kubelet and slurmd on the same physical host, and that slurm-bridge can schedule on.
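As a rough sketch of what the tainting step involves, the Go snippet below adds a taint using client-go types. The taint key here is hypothetical; the real key and effect are whatever slurm-bridge defines.

```go
package controller

import corev1 "k8s.io/api/core/v1"

// bridgeTaint is a placeholder taint used for illustration only; the real
// key and effect are defined by slurm-bridge.
var bridgeTaint = corev1.Taint{
	Key:    "slurm-bridge/managed", // hypothetical key
	Effect: corev1.TaintEffectNoSchedule,
}

// ensureTaint adds the taint to a managed node if it is not already present.
// A controller would then persist the change with an Update call against the
// Kubernetes API.
func ensureTaint(node *corev1.Node) {
	for _, t := range node.Spec.Taints {
		if t.Key == bridgeTaint.Key && t.Effect == bridgeTaint.Effect {
			return // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, bridgeTaint)
}
```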
```mermaid
sequenceDiagram
    autonumber
    participant KAPI as Kubernetes API
    participant SNC as Slurm Node Controller
    participant SAPI as Slurm REST API
    loop Reconcile Loop
        KAPI-->>SNC: Watch Kubernetes Nodes
        alt Node is managed
            SNC->>KAPI: Taint Node
            KAPI-->>SNC: Taint Node
        else
            SNC->>KAPI: Untaint Node
            KAPI-->>SNC: Untaint Node
        end %% alt Node is managed
        alt Node is schedulable
            SNC->>SAPI: Drain Node
            SAPI-->>SNC: Drain Node
        else
            SNC->>SAPI: Undrain Node
            SAPI-->>SNC: Undrain Node
        end %% alt Node is schedulable
    end %% loop Reconcile Loop
```
Workload Controller
The workload controller reconciles Kubernetes Pods and Slurm Jobs. Slurm is the source of truth for what workload is allowed to run on which managed nodes.
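The snippet below sketches that reconciliation in Go under stated assumptions: the slurmClient interface and the callbacks are illustrative stand-ins, not the controller's real types.

```go
package controller

import "context"

// slurmClient is an illustrative abstraction over the few Slurm REST API
// calls this sketch needs; it is not the project's actual client.
type slurmClient interface {
	CancelJob(ctx context.Context, jobID int64) error
	JobTerminated(ctx context.Context, jobID int64) (bool, error)
}

// reconcilePair keeps one Pod and its Slurm job in sync: a terminated Pod
// cancels the job, and a terminated job evicts the Pod.
func reconcilePair(ctx context.Context, sc slurmClient, jobID int64,
	podTerminated bool, evictPod func(context.Context) error) error {
	if podTerminated {
		return sc.CancelJob(ctx, jobID)
	}
	done, err := sc.JobTerminated(ctx, jobID)
	if err != nil {
		return err
	}
	if done {
		return evictPod(ctx)
	}
	return nil
}
```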
```mermaid
sequenceDiagram
    autonumber
    participant KAPI as Kubernetes API
    participant SWC as Slurm Workload Controller
    participant SAPI as Slurm REST API
    loop Reconcile Loop
        critical Map Slurm Job to Pod
            KAPI-->>SWC: Watch Kubernetes Pods
            SAPI-->>SWC: Watch Slurm Jobs
        option Pod is Terminated
            SWC->>SAPI: Terminate Slurm Job
            SAPI-->>SWC: Return Status
        option Job is Terminated
            SWC->>KAPI: Evict Pod
            KAPI-->>SWC: Return Status
        end %% critical Map Slurm Job to Pod
    end %% loop Reconcile Loop
```
4 - Scheduler
Overview
In Kubernetes, scheduling refers to making sure that pods are matched to nodes so that the kubelet can run them.
The scheduler controller in slurm-bridge is responsible for scheduling eligible pods onto nodes that are managed by slurm-bridge. In doing so, the slurm-bridge scheduler interacts with the Slurm REST API in order to acquire allocations for its workloads. In slurm-bridge, slurmctld serves as the source of truth for scheduling decisions.
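To make the "acquire allocations" step concrete, the Go sketch below shows the kind of translation involved: a pod's resource requests become the CPU and memory ask of a placeholder Slurm job. The placeholderJob type and its fields are illustrative only and do not mirror the actual slurmrestd payload.

```go
package scheduler

import corev1 "k8s.io/api/core/v1"

// placeholderJob is an illustrative stand-in for a Slurm job submission;
// the real REST payload has different fields.
type placeholderJob struct {
	Name     string
	CPUs     int64
	MemoryMB int64
}

// toPlaceholderJob sums the pod's container resource requests into a
// whole-CPU and MiB memory ask for the placeholder job.
func toPlaceholderJob(pod *corev1.Pod) placeholderJob {
	job := placeholderJob{Name: pod.Namespace + "/" + pod.Name}
	for _, c := range pod.Spec.Containers {
		req := c.Resources.Requests
		job.CPUs += req.Cpu().Value()                    // fractional CPUs round up
		job.MemoryMB += req.Memory().Value() / (1 << 20) // bytes to MiB
	}
	return job
}
```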
Design
This scheduler is designed to be a non-primary scheduler (i.e. it should not replace the default kube-scheduler). This means that only certain pods should be scheduled via this scheduler (e.g. non-critical pods).
This scheduler represents Kubernetes Pods as a Slurm Job, waits for Slurm to schedule the Job, then informs Kubernetes on which nodes to allocate the represented Pods. This scheduler defers scheduling decisions to Slurm, hence certain assumptions about the environment must be met for this to function correctly.
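The "inform Kubernetes" step corresponds to the standard pod binding subresource. Below is a minimal sketch with client-go, assuming the node name comes back from the Slurm allocation; it shows only the API call involved, not the scheduler plugin's full bind phase.

```go
package scheduler

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bindPod binds a pod to the node that Slurm allocated for its placeholder
// job by creating a Binding through the Kubernetes API.
func bindPod(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod, nodeName string) error {
	binding := &corev1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		Target:     corev1.ObjectReference{Kind: "Node", Name: nodeName},
	}
	return cs.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
}
```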
Sequence Diagram
```mermaid
sequenceDiagram
    autonumber
    actor user as User
    participant KAPI as Kubernetes API
    participant SBS as Slurm-Bridge Scheduler
    participant SAPI as Slurm REST API
    loop Workload Submission
        user->>KAPI: Submit Pod
        KAPI-->>user: Return Request Status
    end %% loop Workload Submission
    loop Scheduling Loop
        SBS->>KAPI: Get Next Pod in Workload Queue
        KAPI-->>SBS: Return Next Pod in Workload Queue
        note over SBS: Honor Slurm scheduling decision
        critical Lookup Slurm Placeholder Job
            SBS->>SAPI: Get Placeholder Job
            SAPI-->>SBS: Return Placeholder Job
        option Job is NotFound
            note over SBS: Translate Pod(s) into Slurm Job
            SBS->>SAPI: Submit Placeholder Job
            SAPI-->>SBS: Return Submit Status
        option Job is Pending
            note over SBS: Check again later...
            SBS->>SBS: Requeue
        option Job is Allocated
            note over SBS: Bind Pod(s) to Node(s) from the Slurm Job
            SBS->>KAPI: Bind Pod(s) to Node(s)
            KAPI-->>SBS: Return Bind Request Status
        end %% Lookup Slurm Placeholder Job
    end %% loop Scheduling Loop
```