
Tasks

Guides to tasks related to the administration of a cluster running slurm-operator.

1 - Autoscaling

Getting Started

Before attempting to autoscale NodeSets, Slinky should be fully deployed to a Kubernetes cluster and Slurm jobs should be able to run.

Dependencies

Autoscaling requires additional services that are not included in Slinky. Follow each project's documentation to install Prometheus, Metrics Server, and KEDA.

Prometheus installs the tooling to report metrics and view them with Grafana. The Metrics Server is needed to report CPU and memory usage for tools like kubectl top. KEDA is recommended for autoscaling as it provides usability improvements over the standard Horizontal Pod Autoscaler (HPA).
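For reference, the upstream chart repositories for Prometheus and Metrics Server can be added to helm as shown below; the actual installation steps should follow each project's documentation.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/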

To add the KEDA chart repository to helm, run

helm repo add kedacore https://kedacore.github.io/charts
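
Then install KEDA from that repository; the release name and namespace below are arbitrary choices.

helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace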

Install the slurm-exporter. This chart is installed as a dependency of the slurm helm chart by default. Configure using helm/slurm/values.yaml.
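
For example, a values.yaml fragment along these lines would toggle the exporter; the key names here are an assumption, so consult helm/slurm/values.yaml for the actual structure.

# Hypothetical fragment; verify the real keys in helm/slurm/values.yaml.
slurm-exporter:
  enabled: true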

Verify that the KEDA Metrics API Server is running:

$ kubectl get apiservice -l app.kubernetes.io/instance=keda
NAME                              SERVICE                                AVAILABLE   AGE
v1beta1.external.metrics.k8s.io   keda/keda-operator-metrics-apiserver   True        22h

KEDA provides the metrics apiserver required by HPA to scale on custom metrics from Slurm. An alternative like Prometheus Adapter could be used for this, but KEDA offers usability enhancements and improvements to HPA in addition to including a metrics apiserver.

Autoscaling

Autoscaling NodeSets allows Slurm partitions to expand and contract in response to the CPU and memory usage. Using Slurm metrics, NodeSets may also scale based on Slurm specific information like the number of pending jobs or the size of the largest pending job in a partition. There are many ways to configure autoscaling. Experiment with different combinations based on the types of jobs being run and the resources available in the cluster.

NodeSet Scale Subresource

Scaling a resource in Kubernetes requires that resources such as Deployments and StatefulSets support the scale subresource. This is also true of the NodeSet Custom Resource.

The scale subresource gives a standard interface to observe and control the number of replicas of a resource. In the case of NodeSet, it allows Kubernetes and related services to control the number of slurmd replicas running as part of the NodeSet.

To manually scale a NodeSet, use the kubectl scale command. In this example, the NodeSet (nss) slurm-compute-radar is scaled to 1.

$ kubectl scale -n slurm nss/slurm-compute-radar --replicas=1
nodeset.slinky.slurm.net/slurm-compute-radar scaled

$ kubectl get pods -o wide -n slurm -l app.kubernetes.io/instance=slurm-compute-radar
NAME                    READY   STATUS    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES
slurm-compute-radar-0   1/1     Running   0          2m48s   10.244.4.17   kind-worker   <none>           <none>

This corresponds to the Slurm partition radar.

$ kubectl exec -n slurm statefulset/slurm-controller -- sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
radar        up   infinite      1   idle kind-worker

NodeSets may be scaled to zero. In this case, there are no replicas of slurmd running and all jobs scheduled to that partition will remain in a pending state.

$ kubectl scale nss/slurm-compute-radar -n slurm --replicas=0
nodeset.slinky.slurm.net/slurm-compute-radar scaled

For NodeSets to scale on demand, an autoscaler needs to be deployed. KEDA allows resources to scale between 0 and 1 replicas and also creates an HPA to scale further based on scalers such as Prometheus.

KEDA ScaledObject

KEDA uses the Custom Resource ScaledObject to monitor and scale a resource. It will automatically create the HPA needed to scale based on external triggers like Prometheus. With Slurm metrics, NodeSets may be scaled based on data collected from the Slurm restapi.

This example ScaledObject will watch the number of jobs pending for the partition radar and scale the NodeSet slurm-compute-radar until a threshold value is satisfied or maxReplicaCount is reached.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: scale-radar
spec:
  scaleTargetRef:
    apiVersion: slinky.slurm.net/v1alpha1
    kind: NodeSet
    name: slurm-compute-radar
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: http://prometheus-kube-prometheus-prometheus.prometheus:9090
        query: slurm_partition_pending_jobs{partition="radar"}
        threshold: "5"

Note: The Prometheus trigger uses metricType: Value instead of the default AverageValue. With AverageValue, the HPA averages the metric across the current replicas (dividing the reported value by the replica count) before comparing it to the threshold, which is not desired for a partition-wide count of pending jobs.

Check ScaledObject documentation for a full list of allowable options.

In this scenario, the ScaledObject scale-radar will query the Slurm metric slurm_partition_pending_jobs from Prometheus with the label partition="radar".

When there is activity on the trigger (at least one pending job), KEDA will scale the NodeSet to minReplicaCount and then let HPA handle scaling up to maxReplicaCount or back down to minReplicaCount. When there is no activity on the trigger after a configurable amount of time, KEDA will scale the NodeSet to idleReplicaCount. See the KEDA documentation on idleReplicaCount for more examples.

Note: The only supported value for idleReplicaCount is 0 due to limitations on how the HPA controller works.

To verify a KEDA ScaledObject, apply it to the cluster in the appropriate namespace on a NodeSet that has no replicas.

$ kubectl scale nss/slurm-compute-radar -n slurm --replicas=0
nodeset.slinky.slurm.net/slurm-compute-radar scaled

Wait for Slurm to report that the partition has no nodes.

slurm@slurm-controller-0:/tmp$ sinfo -p radar
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
radar        up   infinite      0    n/a

Apply the ScaledObject using kubectl to the correct namespace and verify the KEDA and HPA resources are created.

$ kubectl apply -f scaledobject.yaml -n slurm
scaledobject.keda.sh/scale-radar created

$ kubectl get -n slurm scaledobjects
NAME           SCALETARGETKIND                     SCALETARGETNAME        MIN   MAX   TRIGGERS     AUTHENTICATION   READY   ACTIVE   FALLBACK   PAUSED    AGE
scale-radar    slinky.slurm.net/v1alpha1.NodeSet   slurm-compute-radar    1     5     prometheus                    True    False    Unknown    Unknown   28s

$ kubectl get -n slurm hpa
NAME                    REFERENCE                      TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-scale-radar    NodeSet/slurm-compute-radar    <unknown>/5   1         5         0          32s

Once the ScaledObject and HPA are created, initiate some jobs to test that the NodeSet scale subresource is scaled in response.

$ sbatch --wrap "sleep 30" --partition radar --exclusive

The NodeSet will scale to minReplicaCount in response to activity on the trigger. Once the number of pending jobs crosses the configured threshold (submit more exclusive jobs to the partition), more replicas will be created to handle the additional demand. Until the threshold is exceeded, the NodeSet will remain at minReplicaCount.
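
One way to watch this happen is to monitor the HPA and the NodeSet pods while submitting more jobs:

kubectl get hpa -n slurm --watch
kubectl get pods -n slurm -l app.kubernetes.io/instance=slurm-compute-radar --watch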

Note: This example only works well for single-node jobs unless threshold is set to 1, in which case HPA will continue to scale up the NodeSet as long as there is a pending job, until it reaches maxReplicaCount.

After the default cooldownPeriod of 5 minutes without activity on the trigger, KEDA will scale the NodeSet down to 0.
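
The scaling cadence can also be tuned on the ScaledObject; pollingInterval and cooldownPeriod (shown below with KEDA's defaults) control how often the trigger is checked and how long KEDA waits before scaling back to idleReplicaCount.

spec:
  pollingInterval: 30   # seconds between trigger checks (KEDA default)
  cooldownPeriod: 300   # seconds of trigger inactivity before scaling to idleReplicaCount (KEDA default)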

2 - Development

This document aims to provide enough information that you can get started with development on this project.

Getting Started

You will need a Kubernetes cluster to run against. You can use KIND to get a local cluster for testing, or run against your choice of remote cluster.

Note: Your controller will automatically use the current context in your kubeconfig file (i.e. whatever cluster kubectl cluster-info shows).
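
If using KIND, a disposable local cluster can be created with:

kind create cluster

Run kind delete cluster to tear it down when finished.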

Dependencies

Install KIND and Golang binaries for pre-commit hooks.

sudo apt-get install golang
make install

Pre-Commit

Install pre-commit and install the git hooks.

sudo apt-get install pre-commit
pre-commit install

Docker

Install Docker and configure rootless Docker.

Afterwards, test that your user account can communicate with docker.

docker run hello-world

Helm

Install Helm.

sudo snap install helm --classic

Skaffold

Install Skaffold.

curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64 && \
sudo install skaffold /usr/local/bin/

If google-cloud-sdk is installed, skaffold is available as an additional component.

sudo apt-get install -y google-cloud-cli-skaffold

Kubernetes Client

Install kubectl.

sudo snap install kubectl --classic

If google-cloud-sdk is installed, kubectl is available as an additional component.

sudo apt-get install -y kubectl

Running on the Cluster

For development, all Helm deployments use a values-dev.yaml. If these files do not exist in your environment yet, or you are unsure, safely copy each chart's values.yaml as a base by running:

make values-dev

Automatic

You can use Skaffold to build and push images, and deploy components using:

cd helm/slurm-operator/
skaffold run

NOTE: The skaffold.yaml is configured to inject the image and tag into the values-dev.yaml so they are correctly referenced.
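
For an iterative loop, skaffold's standard dev mode can be used instead; it rebuilds and redeploys on file changes:

cd helm/slurm-operator/
skaffold dev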

Operator

The slurm operator aims to follow the Kubernetes Operator pattern.

It uses Controllers, which provide a reconcile function responsible for synchronizing resources until the desired state is reached on the cluster.

Install CRDs

When deploying a helm chart with skaffold or helm, the CRDs defined in its crds/ directory will be installed if not already present in the cluster.
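
To install or update the CRDs without deploying a chart, the repository's make target (also used below when running the operator locally) can be run directly:

make install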

Uninstall CRDs

To delete the Operator CRDs from the cluster:

make uninstall

WARNING: CRDs do not upgrade! The old ones must be uninstalled first so the new ones can be installed. This should only be done in development.

Modifying the API Definitions

If you are editing the API definitions, generate the manifests such as CRs or CRDs using:

make manifests

Slurm Version Changed

If the Slurm version has changed, generate the new OpenAPI spec and its golang client code using:

make generate

NOTE: Update code interacting with the API in accordance with the slurmrestd plugin lifecycle.

Running the operator locally

Install the operator’s CRDs with make install.

Launch the operator via the VSCode debugger using the “Launch Operator” launch task.

Because the operator will be running outside of Kubernetes and needs to communicate with the Slurm cluster, set the following options in your Slurm helm chart's values.yaml (a combined sketch follows the lists below):

  • debug.enable=true
  • debug.localOperator=true

If running on a Kind cluster, also set:

  • debug.disableCgroups=true
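
Taken together, a minimal sketch of the corresponding values.yaml section (key names taken from the options above) looks like:

debug:
  enable: true
  localOperator: true
  disableCgroups: true   # only needed on a Kind cluster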

If the Slurm helm chart is being deployed with skaffold, run skaffold run --port-forward --tail. It is configured to automatically port-forward the restapi for the local operator to communicate with the Slurm cluster.

If skaffold is not used, manually run kubectl port-forward --namespace slurm services/slurm-restapi 6820:6820 for the local operator to communicate with the Slurm cluster.

After starting the operator, verify it is able to contact the Slurm cluster by checking that the Cluster CR has been marked ready:

$ kubectl get --namespace slurm clusters.slinky.slurm.net
NAME     READY   AGE
slurm    true    110s

See skaffold port-forwarding to learn how skaffold automatically detects which services to forward.

Slurm Cluster

Get into a Slurm pod from which workload can be submitted, either by exec or by SSH to the login service.

# Exec into the login pod or the controller pod:
kubectl --namespace=slurm exec -it deployments/slurm-login -- bash -l
kubectl --namespace=slurm exec -it statefulsets/slurm-controller -- bash -l

# Alternatively (on a Kind cluster), SSH to the login service through its LoadBalancer:
cloud-provider-kind -enable-lb-port-mapping &
SLURM_LOGIN_PORT="$(kubectl --namespace=slurm get services -l app.kubernetes.io/name=login,app.kubernetes.io/instance=slurm -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ports[0].port}")"
SLURM_LOGIN_IP="$(kubectl --namespace=slurm get services -l app.kubernetes.io/name=login,app.kubernetes.io/instance=slurm -o jsonpath="{.items[0].status.loadBalancer.ingress[0].ip}")"
ssh -p "$SLURM_LOGIN_PORT" "${USER}@${SLURM_LOGIN_IP}"

3 - Using Pyxis

Overview

This guide describes how to configure your Slurm cluster to use pyxis (and enroot), a Slurm SPANK plugin for containerized jobs with NVIDIA GPU support.

Configure

Configure plugstack.conf to include the pyxis configuration.

Warning: In plugstack.conf, you must use glob syntax to avoid slurmctld failing while trying to resolve the included paths. Only the login and slurmd pods should actually have the pyxis libraries installed.

slurm:
  configFiles:
    plugstack.conf: |
      include /usr/share/pyxis/*
  ...

Configure one or more NodeSets and the login pods to use a pyxis OCI image.

login:
  image:
    repository: ghcr.io/slinkyproject/login-pyxis
  ...
compute:
  nodesets:
    - name: debug
      image:
        repository: ghcr.io/slinkyproject/slurmd-pyxis
      ...

Enroot activity in the login container requires securityContext.privileged=true.

login:
  image:
    repository: ghcr.io/slinkyproject/login-pyxis
  securityContext:
    privileged: true

Test

Submit a job to a Slurm node.

$ srun --partition=debug grep PRETTY /etc/os-release
PRETTY_NAME="Ubuntu 24.04.2 LTS"

Submit a job to a Slurm node with pyxis and it will launch in its requested container.

$ srun --partition=debug --container-image=alpine:latest grep PRETTY /etc/os-release
pyxis: importing docker image: alpine:latest
pyxis: imported docker image: alpine:latest
PRETTY_NAME="Alpine Linux v3.21"

Warning: SPANK plugins will only work on Slurm nodes that have them installed and are configured to use them. It is best to constrain where jobs run with --partition=<partition>, --batch=<features>, and/or --constraint=<features> to ensure a compatible computing environment.

If the login container has securityContext.privileged=true, enroot activity is permissible. You can test the functionality with the following:

enroot import docker://alpine:latest
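
If the import succeeds, the resulting squashfs file (named alpine+latest.sqsh by enroot's default naming) can be unpacked and started as well:

enroot create --name alpine alpine+latest.sqsh
enroot start alpine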