Concepts

Concepts related to slurm-bridge internals and design.

1 - Admission

Overview


The Kubernetes documentation defines admission controllers as:

a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the resource, but after the request is authenticated and authorized.

It also states that:

Admission control mechanisms may be validating, mutating, or both. Mutating controllers may modify the data for the resource being modified; validating controllers may not.

The slurm-bridge admission controller is a mutating controller. It modifies any pods within certain namespaces (slurm-bridge, by default) to use the slurm-bridge scheduler instead of the default kube-scheduler.

Design


Any pod created in a managed namespace will have its .spec.schedulerName changed to the slurm-bridge scheduler, along with the tolerations needed to run on managed (tainted) nodes.

Managed namespaces are defined as a list of namespaces configured via managedNamespaces[] in the admission controller's values.yaml.
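
For illustration, a mutating webhook built with controller-runtime could apply this defaulting roughly as sketched below. The type, field, and value names (podSchedulerDefaulter, managedNamespaces, "slurm-bridge-scheduler") are assumptions for this example, not the actual slurm-bridge implementation.

package webhook

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// podSchedulerDefaulter mutates pods created in managed namespaces so they
// are scheduled by the slurm-bridge scheduler. Illustrative sketch only.
type podSchedulerDefaulter struct {
    schedulerName     string              // e.g. "slurm-bridge-scheduler" (assumed name)
    managedNamespaces map[string]struct{} // e.g. {"slurm-bridge": {}}
}

var _ admission.CustomDefaulter = &podSchedulerDefaulter{}

// Default is called for every pod create/update request routed to the
// webhook; pods outside managed namespaces are left untouched.
func (d *podSchedulerDefaulter) Default(ctx context.Context, obj runtime.Object) error {
    pod, ok := obj.(*corev1.Pod)
    if !ok {
        return fmt.Errorf("expected a Pod but got %T", obj)
    }
    if _, managed := d.managedNamespaces[pod.Namespace]; !managed {
        return nil
    }
    // Hand the pod off to the slurm-bridge scheduler instead of the
    // default kube-scheduler.
    pod.Spec.SchedulerName = d.schedulerName
    return nil
}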

Sequence Diagram

sequenceDiagram
  autonumber

  participant KAPI as Kubernetes API
  participant SBA as Slurm-Bridge Admission

  KAPI-->>SBA: Watch Pod Create/Update
  opt Pod in managed Namespaces
    SBA->>KAPI: Update `.spec.schedulerName` and Tolerations
    KAPI-->>SBA: Update Response
  end %% opt Pod in managed Namespaces

2 - Architecture

Overview


This document describes the high-level architecture of the Slinky slurm-bridge.

Big Picture


[Figure: high-level architecture of the Slinky slurm-bridge]

Directory Map


This project follows common Go and Kubebuilder project-layout conventions:

api/

Contains Custom Kubernetes API definitions. These become Custom Resource Definitions (CRDs) and are installed into a Kubernetes cluster.

cmd/

Contains code to be compiled into binary commands.

config/

Contains YAML configuration files used for Kustomize deployments.

docs/

Contains project documentation.

hack/

Contains files for development and Kubebuilder. This includes a kind.sh script that can be used to create a kind cluster with all prerequisites for local testing.

helm/

Contains helm deployments, including the configuration files such as values.yaml.

Helm is the recommended method to install this project into your Kubernetes cluster.

internal/

Contains code that is used internally. This code is not externally importable.

internal/controller/

Contains the controllers.

Each controller is named after the Custom Resource Definition (CRD) it manages. Currently, this consists of the nodeset and the cluster CRDs.

internal/scheduler/

Contains scheduling framework plugins. Currently, this consists of slurm-bridge.

3 - Controllers

Overview


The Kubernetes documentation defines controllers as:

control loops that watch the state of your cluster, then make or request changes where needed. Each controller tries to move the current cluster state closer to the desired state.

Within slurm-bridge, there are multiple controllers that manage the state of different bridge components:

  • Node Controller - Responsible for the state of nodes in the bridge cluster
  • Workload Controller - Responsible for the state of pods and other workloads running on slurm-bridge

Node Controller


The node controller is responsible for tainting the managed nodes so that the scheduler component is fully in control of all workloads bound to those nodes.

Additionally, this controller will reconcile certain node states for scheduling purposes. Slurm becomes the source of truth for scheduling among managed nodes.

A managed node is a node with a colocated kubelet and slurmd on the same physical host, onto which slurm-bridge can schedule workloads.
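
As a rough sketch, tainting a managed node with a controller-runtime client might look like the following; the taint key and function name are assumptions for illustration, not the actual taint used by slurm-bridge.

package controller

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// managedTaintKey is illustrative; the real taint key may differ.
const managedTaintKey = "slurm-bridge.example.com/managed"

// taintManagedNode adds a NoSchedule taint to a managed node so that only
// pods tolerating the taint (i.e. pods placed by the slurm-bridge
// scheduler) can be bound to it.
func taintManagedNode(ctx context.Context, c client.Client, node *corev1.Node) error {
    for _, t := range node.Spec.Taints {
        if t.Key == managedTaintKey {
            return nil // already tainted, nothing to do
        }
    }
    patch := client.MergeFrom(node.DeepCopy())
    node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
        Key:    managedTaintKey,
        Effect: corev1.TaintEffectNoSchedule,
    })
    return c.Patch(ctx, node, patch)
}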

sequenceDiagram
  autonumber

  participant KAPI as Kubernetes API
  participant SWC as Slurm Workload Controller
  participant SAPI as Slurm REST API

  loop Reconcile Loop
    KAPI-->>SWC: Watch Kubernetes Nodes

    alt Node is managed
      SWC->>KAPI: Taint Node
      KAPI-->>SWC: Taint Node
    else
      SWC->>KAPI: Untaint Node
      KAPI-->>SWC: Untaint Node
    end %% alt Node is managed

    alt Node is schedulable
      SWC->>SAPI: Drain Node
      SAPI-->>SWC: Drain Node
    else
      SWC->>SAPI: Undrain Node
      SAPI-->>SWC: Undrain Node
    end %% alt Node is schedulable

  end %% loop Reconcile Loop

Workload Controller


The workload controller reconciles Kubernetes Pods and Slurm Jobs. Slurm is the source of truth for what workload is allowed to run on which managed nodes.
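
A minimal sketch of that reconciliation is shown below, with a hypothetical slurmJobs interface standing in for the Slurm REST API client; all names are illustrative, and pod eviction is simplified to a delete.

package controller

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// slurmJobs is a hypothetical stand-in for the Slurm REST API client.
type slurmJobs interface {
    CancelJob(ctx context.Context, jobID int32) error
    JobTerminated(ctx context.Context, jobID int32) (bool, error)
}

// syncPodAndJob keeps a pod and its Slurm job in step: when one side has
// terminated, the other is cleaned up.
func syncPodAndJob(ctx context.Context, k8s client.Client, slurm slurmJobs, pod *corev1.Pod, jobID int32) error {
    // Pod finished (or failed): terminate the corresponding Slurm job.
    if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
        return slurm.CancelJob(ctx, jobID)
    }
    // Slurm job finished: remove the pod so Kubernetes state matches Slurm.
    done, err := slurm.JobTerminated(ctx, jobID)
    if err != nil {
        return err
    }
    if done {
        return k8s.Delete(ctx, pod)
    }
    return nil
}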

sequenceDiagram
  autonumber

  participant KAPI as Kubernetes API
  participant SWC as Slurm Workload Controller
  participant SAPI as Slurm REST API

  loop Reconcile Loop

  critical Map Slurm Job to Pod
    KAPI-->>SWC: Watch Kubernetes Pods
    SAPI-->>SWC: Watch Slurm Jobs
  option Pod is Terminated
    SWC->>SAPI: Terminate Slurm Job
    SAPI-->>SWC: Return Status
  option Job is Terminated
    SWC->>KAPI: Evict Pod
    KAPI-->>SWC: Return Status
  end %% critical Map Slurm Job to Pod

  end %% loop Reconcile Loop

4 - Scheduler

Overview


In Kubernetes, scheduling refers to making sure that pods are matched to nodes so that the kubelet can run them.

The scheduler controller in slurm-bridge is responsible for scheduling eligible pods onto nodes that are managed by slurm-bridge. In doing so, the slurm-bridge scheduler interacts with the Slurm REST API to acquire resource allocations for its workloads. In slurm-bridge, slurmctld serves as the source of truth for scheduling decisions.

Design


This scheduler is designed to be a non-primary scheduler (i.e. it is not intended to replace the default kube-scheduler). This means that only certain pods (e.g. non-critical pods) should be scheduled via this scheduler.

This scheduler represents a Kubernetes Pod (or group of Pods) as a Slurm placeholder Job, waits for Slurm to schedule that Job, and then informs Kubernetes which nodes the represented Pods should be bound to. Because the scheduler defers scheduling decisions to Slurm, certain assumptions about the environment must be met for it to function correctly.
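
A simplified sketch of one scheduling pass is shown below; the placeholderJob and slurmAPI types, the bind callback, and the method names are assumptions for illustration (the real scheduler is implemented as a scheduling framework plugin and talks to the Slurm REST API).

package scheduler

import (
    "context"

    corev1 "k8s.io/api/core/v1"
)

// placeholderJob is a hypothetical view of the Slurm job that represents a
// pod (or pod group) while Slurm makes the scheduling decision.
type placeholderJob struct {
    Pending bool     // Slurm has not allocated resources yet
    Nodes   []string // allocated node names once the job is running
}

// slurmAPI is a hypothetical stand-in for the Slurm REST API client.
type slurmAPI interface {
    GetPlaceholderJob(ctx context.Context, pod *corev1.Pod) (job *placeholderJob, found bool, err error)
    SubmitPlaceholderJob(ctx context.Context, pod *corev1.Pod) error
}

// schedulePod defers the placement decision to Slurm: submit a placeholder
// job on first sight of a pod, wait while Slurm is still deciding, and bind
// the pod once Slurm has allocated nodes.
func schedulePod(ctx context.Context, slurm slurmAPI, bind func(node string) error, pod *corev1.Pod) error {
    job, found, err := slurm.GetPlaceholderJob(ctx, pod)
    switch {
    case err != nil:
        return err
    case !found:
        // Translate the pod into a Slurm job and let Slurm schedule it.
        return slurm.SubmitPlaceholderJob(ctx, pod)
    case job.Pending:
        // Check again later; the pod stays in the scheduling queue.
        return nil
    default:
        // Honor Slurm's decision: bind the pod to an allocated node.
        return bind(job.Nodes[0])
    }
}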

Sequence Diagram

sequenceDiagram
  autonumber

  actor user as User
  participant KAPI as Kubernetes API
  participant SBS as Slurm-Bridge Scheduler
  participant SAPI as Slurm REST API

  loop Workload Submission
    user->>KAPI: Submit Pod
    KAPI-->>user: Return Request Status
  end %% loop Workload Submission

  loop Scheduling Loop
    SBS->>KAPI: Get Next Pod in Workload Queue
    KAPI-->>SBS: Return Next Pod in Workload Queue

    note over SBS: Honor Slurm scheduling decision
    critical Lookup Slurm Placeholder Job
      SBS->>SAPI: Get Placeholder Job
      SAPI-->>SBS: Return Placeholder Job
    option Job is NotFound
      note over SBS: Translate Pod(s) into Slurm Job
      SBS->>SAPI: Submit Placeholder Job
      SAPI-->>SBS: Return Submit Status
    option Job is Pending
      note over SBS: Check again later...
      SBS->>SBS: Requeue
    option Job is Allocated
      note over SBS: Bind Pod(s) to Node(s) from the Slurm Job
      SBS->>KAPI: Bind Pod(s) to Node(s)
      KAPI-->>SBS: Return Bind Request Status
    end %% Lookup Slurm Placeholder Job
  end %% loop Scheduling Loop