Guides to tasks related to the administration of a cluster running slurm-bridge.
Tasks
1 - Running slurm-bridge locally
You may want to run slurm-bridge on a single machine in order to test the software or familiarize yourself with it prior to installing it on your cluster. This should only be done for testing and evaluation purposes and should not be used for production environments.
We have provided the hack/kind.sh script to do this using Kind.
This document assumes a basic understanding of Kubernetes architecture. It is highly recommended that those who are unfamiliar with the core concepts of Kubernetes review the documentation on Kubernetes, pods, and nodes before getting started.
Pre-requisites
- go 1.17+ must be installed on your system
Setting up your environment
- Install Kind using go install:
go install sigs.k8s.io/kind@v0.29.0
If you get kind: command not found when running the next step, you may need to add $GOPATH/bin to your PATH:
export GOPATH=$HOME/go
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin
- Confirm that kind is working properly by running the following commands:
kind create cluster
kubectl get nodes --all-namespaces
kind delete cluster
- Clone the slurm-bridge repo and enter it:
git clone git@github.com:SlinkyProject/slurm-bridge.git
cd slurm-bridge
Installing slurm-bridge
within your environment
Provided with slurm-bridge
is the script hack/kind.sh
that interfaces with
kind to deploy the slurm-bridge
helm chart within your local environment.
- Create your cluster using hack/kind.sh:
hack/kind.sh --bridge
- Familiarize yourself with and use your test environment:
kubectl get pods --namespace=slurm-bridge
kubectl get pods --namespace=slurm
kubectl get pods --namespace=slinky
Celebrate!
At this point, you should have a kind cluster running slurm-bridge.
Cleaning up
hack/kind.sh provides a mechanism to destroy your test environment. To destroy your kind cluster, run:
hack/kind.sh --delete
2 - Creating a Workload
In Slurm, all workloads are represented by jobs. In slurm-bridge, however, there are a number of forms that workloads can take. While workloads can still be submitted as a Slurm job, slurm-bridge also enables users to submit workloads through Kubernetes. Most workloads that can be submitted to slurm-bridge from within Kubernetes are represented by an existing Kubernetes batch workload primitive.
At this time, slurm-bridge has scheduling support for Jobs, JobSets, Pods, and PodGroups. If your workload requires or benefits from co-scheduled pod launch (e.g. MPI, multi-node), consider representing your workload as a JobSet or PodGroup.
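For instance, a standard Kubernetes Job can be handed to slurm-bridge by setting its schedulerName, as described in the next section. The following is a minimal illustrative sketch, not an official example: the job name, image, sleep command, and resource values are placeholders, and the annotation shown is the slinky.slurm.net/timelimit annotation covered under Annotations below.
apiVersion: batch/v1
kind: Job
metadata:
  name: sleep-job
  # `slurm-bridge` annotations on the parent object (see Annotations below)
  annotations:
    slinky.slurm.net/timelimit: "5"
spec:
  completions: 1
  parallelism: 1
  template:
    spec:
      # opt this workload into the slurm-bridge scheduler
      schedulerName: slurm-bridge-scheduler
      restartPolicy: Never
      containers:
        - name: sleep
          image: busybox:1.36
          command: ["sleep", "30"]
          resources:
            limits:
              cpu: "1"
              memory: 100Mi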
Using the slurm-bridge Scheduler
slurm-bridge uses an admission controller to control which resources are scheduled using the slurm-bridge-scheduler. The slurm-bridge-scheduler is designed as a non-primary scheduler and is not intended to replace the default kube-scheduler.
The slurm-bridge admission controller only schedules pods that request slurm-bridge as their scheduler or are in a configured namespace. By default, the slurm-bridge admission controller is configured to automatically use slurm-bridge as the scheduler for all pods in the configured namespaces. Alternatively, a pod in any namespace can specify Pod.Spec.schedulerName=slurm-bridge-scheduler to indicate that it should be scheduled using the slurm-bridge-scheduler.
You can learn more about the slurm-bridge admission controller here.
Annotations
Users can better inform or influence how slurm-bridge represents their Kubernetes workload within Slurm by adding annotations on the parent object.
Example “pause” bare pod to illustrate annotations:
apiVersion: v1
kind: Pod
metadata:
  name: pause
  # `slurm-bridge` annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6
      resources:
        limits:
          cpu: "1"
          memory: 100Mi
Example “pause” deployment to illustrate annotations:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
  # `slurm-bridge` annotations on parent object
  annotations:
    slinky.slurm.net/timelimit: "5"
    slinky.slurm.net/account: foo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      schedulerName: slurm-bridge-scheduler
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.6
          resources:
            limits:
              cpu: "1"
              memory: 100Mi
JobSets
This section assumes JobSet is installed.
JobSet pods will be co-scheduled and launched together. The JobSet controller is responsible for managing the JobSet status and other Pod interactions once marked as completed.
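As an illustrative sketch, a minimal JobSet scheduled through slurm-bridge could look like the following. The name, replica counts, image, and sleep command are placeholders, and the apiVersion shown is the upstream JobSet API (jobset.x-k8s.io/v1alpha2); check the JobSet release installed in your cluster for the correct version.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example-jobset
spec:
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          completions: 2
          parallelism: 2
          template:
            spec:
              # all pods of the JobSet use the slurm-bridge scheduler
              schedulerName: slurm-bridge-scheduler
              restartPolicy: Never
              containers:
                - name: worker
                  image: busybox:1.36
                  command: ["sleep", "30"]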
PodGroups
This section assumes the PodGroup CRD is installed and the out-of-tree kube-scheduler is installed and configured as a (non-primary) scheduler.
Pods contained within a PodGroup will be co-scheduled and launched together. The PodGroup controller is responsible for managing the PodGroup status and other Pod interactions once marked as completed.
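As a sketch only, a PodGroup and one member pod following the out-of-tree scheduler-plugins conventions might look like the following. The names and minMember value are placeholders, and the apiVersion (scheduling.x-k8s.io/v1alpha1) and the scheduling.x-k8s.io/pod-group label are assumptions based on the scheduler-plugins project; consult the scheduler-plugins and slurm-bridge documentation for the exact API version and label expected in your deployment.
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: example-group
spec:
  # minimum number of member pods that must be schedulable together
  minMember: 2
---
apiVersion: v1
kind: Pod
metadata:
  name: example-group-member-0
  labels:
    # label associating this pod with the PodGroup above (scheduler-plugins convention)
    scheduling.x-k8s.io/pod-group: example-group
spec:
  schedulerName: slurm-bridge-scheduler
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.6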