Tasks

Guides to tasks related to the administration of a cluster running slurm-bridge.

1 - Running slurm-bridge locally

You may want to run slurm-bridge on a single machine in order to test the software or familiarize yourself with it prior to installing it on your cluster. This should only be done for testing and evaluation purposes and should not be used for production environments.

We provide the hack/kind.sh script, which uses Kind to stand up a local test environment.

This document assumes a basic understanding of Kubernetes architecture. It is highly recommended that those who are unfamiliar with the core concepts of Kubernetes review the documentation on Kubernetes, pods, and nodes before getting started.

Pre-requisites


  • go 1.17+ must be installed on your system

Setting up your environment


  1. Install Kind using go install:
go install sigs.k8s.io/kind@v0.29.0

If you get kind: command not found when running the next step, you may need to add GOPATH to your PATH:

export GOPATH=$HOME/go
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
  2. Confirm that kind is working properly by running the following commands:
kind create cluster

kubectl get nodes

kind delete cluster
  3. Clone the slurm-bridge repo and enter it:
git clone git@github.com:SlinkyProject/slurm-bridge.git
cd slurm-bridge

Installing slurm-bridge within your environment


slurm-bridge ships with the hack/kind.sh script, which interfaces with kind to deploy the slurm-bridge Helm chart within your local environment.

  1. Create your cluster using hack/kind.sh:
hack/kind.sh --bridge
  2. Familiarize yourself with and use your test environment:
kubectl get pods --namespace=slurm-bridge
kubectl get pods --namespace=slurm
kubectl get pods --namespace=slinky
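If everything deployed correctly, the pods in these namespaces should eventually reach the Running state. A few additional standard kubectl commands (not specific to slurm-bridge) can help you explore or debug the environment:

# List every pod in the cluster, including the slurm-bridge, slurm, and slinky namespaces
kubectl get pods --all-namespaces

# If a pod is stuck in Pending, recent events often explain why
kubectl get events --namespace=slurm-bridge --sort-by=.metadata.creationTimestamp

# Inspect the nodes that kind created for the cluster
kubectl get nodes -o wide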

Celebrate!

At this point, you should have a kind cluster running slurm-bridge.

Cleaning up


hack/kind.sh also provides a way to tear down your test environment.

Run the following to destroy your kind cluster:

hack/kind.sh --delete
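If you want to confirm that no kind clusters remain afterward, you can list them:

kind get clusters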

2 - Creating a Workload

In Slurm, all workloads are represented by jobs. In slurm-bridge, however, there are a number of forms that workloads can take. While workloads can still be submitted as a Slurm job, slurm-bridge also enables users to submit workloads through Kubernetes. Most workloads that can be submitted to slurm-bridge from within Kubernetes are represented by an existing Kubernetes batch workload primitive.

At this time, slurm-bridge has scheduling support for Jobs, JobSets, Pods, and PodGroups. If your workload requires or benefits from co-scheduled pod launch (e.g. MPI, multi-node), consider representing your workload as a JobSet or PodGroup.
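For example, a Kubernetes Job can be routed to slurm-bridge simply by setting the scheduler name in its pod template. The sketch below is illustrative only; the job name, image, and resource limits are arbitrary placeholders, and the scheduler name assumes the default slurm-bridge-scheduler described in the next section:

apiVersion: batch/v1
kind: Job
metadata:
  name: sleep-job
spec:
  completions: 1
  template:
    spec:
      # Request the slurm-bridge scheduler instead of the default kube-scheduler
      schedulerName: slurm-bridge-scheduler
      restartPolicy: Never
      containers:
        - name: sleep
          image: busybox:1.36
          command: ["sleep", "30"]
          resources:
            limits:
              cpu: "1"
              memory: 100Mi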

Using the slurm-bridge Scheduler


slurm-bridge uses an admission controller to control which resources are scheduled using the slurm-bridge-scheduler. The slurm-bridge-scheduler is designed as a non-primary scheduler and is not intended to replace the default kube-scheduler. The slurm-bridge admission controller only routes pods to the slurm-bridge-scheduler when they request it as their scheduler or are created in a configured namespace. By default, the slurm-bridge admission controller is configured to automatically use slurm-bridge as the scheduler for all pods in the configured namespaces.

Alternatively, a pod can specify Pod.Spec.schedulerName=slurm-bridge-scheduler from any namespace to indicate that it should be scheduled using the slurm-bridge-scheduler.

You can learn more about the slurm-bridge admission controller in its documentation.

Annotations


Users can influence how slurm-bridge represents their Kubernetes workload within Slurm by adding annotations to the parent object.

Example “pause” bare pod to illustrate annotations:

apiVersion: v1
kind: Pod
metadata:
  name: pause
  # `slurm-bridge` annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi
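
Assuming the manifest above is saved as pause.yaml, it could be submitted and inspected with standard kubectl commands:

kubectl apply -f pause.yaml
kubectl get pod pause -o wide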

Example “pause” deployment to illustrate annotations:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  # `slurm-bridge` annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      schedulerName: slurm-bridge-scheduler
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.6
          resources:
            limits:
              cpu: "1"
              memory: 100Mi

JobSets


This section assumes the JobSet operator is installed.

JobSet pods will be co-scheduled and launched together. The JobSet controller is responsible for managing JobSet status and other pod interactions once the JobSet is marked as completed.
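
As a sketch, a JobSet that routes its pods to slurm-bridge might look like the following. This assumes the JobSet v1alpha2 API from the JobSet operator; the names, replica counts, and image are arbitrary placeholders:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example-jobset
  # `slurm-bridge` annotations on the parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
spec:
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          parallelism: 2
          completions: 2
          template:
            spec:
              # Route the pods of this JobSet to the slurm-bridge scheduler
              schedulerName: slurm-bridge-scheduler
              restartPolicy: Never
              containers:
                - name: worker
                  image: busybox:1.36
                  command: ["sleep", "30"]
                  resources:
                    limits:
                      cpu: "1"
                      memory: 100Mi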

PodGroups


This section assumes the PodGroup CRD is installed and that the out-of-tree kube-scheduler is installed and configured as a (non-primary) scheduler.

Pods contained within a PodGroup will be co-scheduled and launched together. The PodGroup controller is responsible for managing the PodGroup status and other pod interactions once the PodGroup is marked as completed.
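
As a sketch, a PodGroup and one of its member pods might look like the following. This assumes the PodGroup v1alpha1 API and pod-group label from the out-of-tree scheduler-plugins project; the names, minMember value, and image are placeholders:

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: example-group
spec:
  # Do not launch any member pod until at least 2 can be scheduled together
  minMember: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: example-group-pod-0
  labels:
    # Associates this pod with the PodGroup above
    scheduling.x-k8s.io/pod-group: example-group
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi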