slurm-operator

This project provides a framework for running Slurm in Kubernetes.

Overview

This project deploys Slurm on Kubernetes. The Slurm pods coexist with other workloads running on Kubernetes. The project provides control over Slurm cluster configuration and deployment, along with a configurable autoscaling policy for Slurm compute nodes.

This project retains much of Slurm's workload-management functionality, including:

  • Priority scheduling: Determine job execution order based on weighted factors such as job age
  • Fair share: Distribute resources equitably among users based on historical usage
  • Quality of Service (QoS): Apply sets of policies such as resource limits, priorities, preemption, and backfill
  • Job accounting: Record information for every job and job step executed
  • Job dependencies: Allow users to specify relationships between jobs, such as starting only after another job begins, succeeds, fails, or reaches a particular state (see the example after this list)
  • Workflows with partitioning: Divide cluster resources into partitions for job management
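
For example, job dependencies are expressed with the standard Slurm client commands used throughout this guide; a brief sketch (the job ID 101 is illustrative, and the commands assume a running cluster such as the ones deployed in the quickstarts below):

sbatch --wrap="sleep 60"                           # submit a job; suppose it is assigned job ID 101
sbatch --dependency=afterok:101 --wrap="hostname"  # runs only after job 101 completes successfully
squeue --format="%i %T %E"                         # list jobs with their state and any remaining dependency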

Supported Slurm Versions

Data Parser: v41

  • 24.05
  • 24.11

Overall Architecture

This is the basic architecture. A more in-depth description can be found in the docs directory.

[Architecture diagram]

Known Issues

  • Cgroups are currently disabled due to difficulties getting core information into the pods.
  • Updates may be slow because the operator must wait for sequencing before the slurm-controller can be deployed.

License

Copyright (C) SchedMD LLC.

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this project except in compliance with the License.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

1 - 0.1.x

1.1 - Quickstart Guides

1.1.1 - Basic Quickstart

QuickStart Guide

Overview

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to Kubernetes.

Install

Pre-Requisites

Install the pre-requisite helm charts.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
	--namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
	--namespace prometheus --create-namespace --set installCRDs=true

Slurm Operator

Download values and install the slurm-operator from OCI package.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm-operator/values.yaml \
  -o values-operator.yaml
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --values=values-operator.yaml --version=0.1.0 --namespace=slinky --create-namespace

Make sure the cluster deployed successfully with:

kubectl --namespace=slinky get pods

Output should be similar to:

NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Slurm Cluster

Download values and install a Slurm cluster from OCI package.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm/values.yaml \
  -o values-slurm.yaml
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
  --values=values-slurm.yaml --version=0.1.0 --namespace=slurm --create-namespace

Make sure the slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods

Output should be similar to:

NAME                              READY   STATUS    RESTARTS       AGE
slurm-accounting-0                1/1     Running   0              5m00s
slurm-compute-debug-0             1/1     Running   0              5m00s
slurm-controller-0                2/2     Running   0              5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0              5m00s
slurm-mariadb-0                   1/1     Running   0              5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0              5m00s

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

kubectl --namespace=slurm exec -it statefulsets/slurm-controller -- bash --login

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly verify that Slurm is functioning:

sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue

See Slurm Commands for more details on how to interact with Slurm.

1.1.2 - QuickStart Guide for Google GKE

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to GKE.

Setup

Set up a cluster on GKE.

gcloud container clusters create slinky-cluster \
    --location=us-central1-a \
    --num-nodes=2 \
    --node-taints "" \
    --machine-type=c2-standard-16

Set up kubectl to point to your new cluster.

gcloud container clusters get-credentials slinky-cluster --location=us-central1-a

Pre-Requisites

Install the pre-requisite helm charts.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace prometheus --create-namespace --set installCRDs=true

Slurm Operator

Download values and install the slurm-operator from OCI package.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm-operator/values.yaml \
    -o values-operator.yaml
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
    --version 0.1.0 \
    -f values-operator.yaml \
    --namespace=slinky \
    --create-namespace

Make sure the cluster deployed successfully with:

kubectl --namespace=slinky get pods

Output should be similar to:

NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Slurm Cluster

Download values and install a Slurm cluster from OCI package.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm/values.yaml \
    -o values-slurm.yaml
helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
    --version 0.1.0 \
    -f values-slurm.yaml \
    --namespace=slurm \
    --create-namespace

Make sure the slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods

Output should be similar to:

NAME                              READY   STATUS    RESTARTS       AGE
slurm-accounting-0                1/1     Running   0              5m00s
slurm-compute-debug-l4bd2         1/1     Running   0              5m00s
slurm-controller-0                2/2     Running   0              5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0              5m00s
slurm-mariadb-0                   1/1     Running   0              5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0              5m00s

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

kubectl --namespace=slurm exec \
    -it statefulsets/slurm-controller -- bash --login

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly verify that Slurm is functioning:

sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue

See Slurm Commands for more details on how to interact with Slurm.

1.1.3 - QuickStart Guide for Microsoft AKS

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to AKS.

Setup

Set up a resource group on AKS.

az group create --name slinky --location westus2

Set up a cluster on AKS.

az aks create \
    --resource-group slinky \
    --name slinky \
    --location westus2 \
    --node-vm-size Standard_D2s_v3

Set up kubectl to point to your new cluster.

az aks get-credentials --resource-group slinky --name slinky

Pre-Requisites

Install the pre-requisite helm charts.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace prometheus --create-namespace --set installCRDs=true

Slurm Operator

Download values and install the slurm-operator from OCI package.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm-operator/values.yaml \
    -o values-operator.yaml
    
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
    --version 0.1.0 \
    -f values-operator.yaml \
    --namespace=slinky \
    --create-namespace

Make sure the cluster deployed successfully with:

kubectl --namespace=slinky get pods

Output should be similar to:

NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Slurm Cluster

Download values and install a Slurm cluster from OCI package.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.1.0/helm/slurm/values.yaml \
    -o values-slurm.yaml

By default, the values-slurm.yaml file sets controller.persistence.storageClass and mariadb.primary.persistence.storageClass to standard. You will need to update both values to default to use AKS’s default StorageClass.

helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
    --version 0.1.0 \
    -f values-slurm.yaml \
    --namespace=slurm \
    --create-namespace
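
Alternatively, instead of editing the downloaded file, the same change can be passed on the command line. This is only a sketch that reuses the value paths from the note above; confirm the default StorageClass name on your cluster with kubectl get storageclass:

helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
    --version 0.1.0 \
    -f values-slurm.yaml \
    --set controller.persistence.storageClass=default \
    --set mariadb.primary.persistence.storageClass=default \
    --namespace=slurm \
    --create-namespace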

Make sure the slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods

Output should be similar to:

NAME                              READY   STATUS    RESTARTS       AGE
slurm-accounting-0                1/1     Running   0              5m00s
slurm-compute-debug-l4bd2         1/1     Running   0              5m00s
slurm-controller-0                2/2     Running   0              5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0              5m00s
slurm-mariadb-0                   1/1     Running   0              5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0              5m00s

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

kubectl --namespace=slurm exec \
    -it statefulsets/slurm-controller -- bash --login

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly verify that Slurm is functioning:

sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue

See Slurm Commands for more details on how to interact with Slurm.

2 - 0.2.x

2.1 - Quickstart Guides

2.1.1 - QuickStart Guide for Amazon EKS

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to Amazon EKS.

Setup

Set up a cluster on EKS.

eksctl create cluster \
    --name slinky-cluster \
    --region us-west-2 \
    --nodegroup-name slinky-nodes \
    --node-type t3.medium \
    --nodes 2

Set up kubectl to point to your new cluster.

aws eks --region us-west-2 update-kubeconfig --name slinky-cluster

Pre-Requisites

Install the pre-requisite helm charts.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace prometheus --create-namespace --set installCRDs=true

Install EBS CSI Driver

helm install aws-ebs-csi aws-ebs-csi-driver/aws-ebs-csi-driver -n kube-system
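
A quick optional check that the driver pods came up; the controller and node pods typically include ebs-csi in their names, though exact names vary by chart version:

kubectl --namespace=kube-system get pods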

AWS Permissions

You will need to make sure your IAM user has the proper permissions.

Step 1: Identify the IAM Role

Run the following AWS CLI command to get the IAM role attached to your EKS worker nodes:

aws eks describe-nodegroup \
    --cluster-name slinky-cluster \
    --nodegroup-name slinky-nodes \
    --query "nodegroup.nodeRole" \
    --output text
    

This will return something like:

arn:aws:iam::017820679962:role/eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK

The IAM role name here is eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK.

Step 2: Attach the Required IAM Policy for EBS CSI Driver

Attach the AmazonEBSCSIDriverPolicy managed IAM policy to this role.

Run the following command:

aws iam attach-role-policy \
  --role-name eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
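
To confirm the policy is now attached, an optional check using the role name from Step 1:

aws iam list-attached-role-policies \
  --role-name eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK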

Create StorageClass

You will need to create a StorageClass for the cluster's persistent volumes.

Here is an example storageclass.yaml file for a StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp2
  fsType: ext4

Create the StorageClass using your storageclass.yaml file.

kubectl apply -f storageclass.yaml
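
An optional check that the StorageClass exists before installing the charts:

kubectl get storageclass standard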

Slurm Operator

Download values and install the slurm-operator. You will need to update the operator and webhook repository values to point to the desired container repository.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm-operator/values.yaml \
    -o values-operator.yaml
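
The repository overrides live in values-operator.yaml. The following is only a sketch; the key and image names are assumptions, so verify them against the downloaded file before installing:

# Sketch of values-operator.yaml overrides; <your-registry> is a placeholder.
operator:
  image:
    repository: <your-registry>/slurm-operator
webhook:
  image:
    repository: <your-registry>/slurm-operator-webhook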

helm install slurm-operator \
    -f values-operator.yaml \
    --namespace=slinky \
    --create-namespace \
    helm/slurm-operator

Make sure the cluster deployed successfully with:

kubectl --namespace=slinky get pods

Output should be similar to:

NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Slurm Cluster

Download values and install a Slurm cluster.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm/values.yaml \
    -o values-slurm.yaml

helm install slurm \
    -f values-slurm.yaml \
    --namespace=slurm \
    --create-namespace \
    helm/slurm

Make sure the slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods

Output should be similar to:

NAME                              READY   STATUS    RESTARTS       AGE
slurm-accounting-0                1/1     Running   0              5m00s
slurm-compute-debug-l4bd2         1/1     Running   0              5m00s
slurm-controller-0                2/2     Running   0              5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0              5m00s
slurm-mariadb-0                   1/1     Running   0              5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0              5m00s

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

kubectl --namespace=slurm exec \
    -it statefulsets/slurm-controller -- bash --login

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly verify that Slurm is functioning:

sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue

See Slurm Commands for more details on how to interact with Slurm.

2.1.2 - QuickStart Guide for Google GKE

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to GKE.

Setup

Set up a cluster on GKE.

gcloud container clusters create slinky-cluster \
    --location=us-central1-a \
    --num-nodes=2 \
    --node-taints "" \
    --machine-type=c2-standard-16

Set up kubectl to point to your new cluster.

gcloud container clusters get-credentials slinky-cluster --location=us-central1-a

Pre-Requisites

Install the pre-requisite helm charts.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace prometheus --create-namespace --set installCRDs=true

Slurm Operator

Download values and install the slurm-operator. You will need to update the operator and webhook repository values to point to the desired container repository.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm-operator/values.yaml \
    -o values-operator.yaml
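
The repository overrides are made in values-operator.yaml before installing. A rough sketch follows; the key names are assumptions and the Artifact Registry path is only a placeholder, so verify both against the downloaded file:

# Sketch of values-operator.yaml overrides; the registry path is a placeholder.
operator:
  image:
    repository: us-central1-docker.pkg.dev/<project>/<repo>/slurm-operator
webhook:
  image:
    repository: us-central1-docker.pkg.dev/<project>/<repo>/slurm-operator-webhook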

helm install slurm-operator \
    -f values-operator.yaml \
    --namespace=slinky \
    --create-namespace \
    helm/slurm-operator

Make sure the cluster deployed successfully with:

kubectl --namespace=slinky get pods

Output should be similar to:

NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Slurm Cluster

Download values and install a Slurm cluster.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm/values.yaml \
    -o values-slurm.yaml

helm install slurm \
    -f values-slurm.yaml \
    --namespace=slurm \
    --create-namespace \
    helm/slurm

Make sure the slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods

Output should be similar to:

NAME                              READY   STATUS    RESTARTS       AGE
slurm-accounting-0                1/1     Running   0              5m00s
slurm-compute-debug-l4bd2         1/1     Running   0              5m00s
slurm-controller-0                2/2     Running   0              5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0              5m00s
slurm-mariadb-0                   1/1     Running   0              5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0              5m00s

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

kubectl --namespace=slurm exec \
    -it statefulsets/slurm-controller -- bash --login

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly verify that Slurm is functioning:

sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue

See Slurm Commands for more details on how to interact with Slurm.

2.1.3 - QuickStart Guide for Microsoft AKS

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to AKS.

Setup

Set up a resource group on AKS.

az group create --name slinky --location westus2

Set up a cluster on AKS.

az aks create \
    --resource-group slinky \
    --name slinky \
    --location westus2 \
    --node-vm-size Standard_D2s_v3

Set up kubectl to point to your new cluster.

az aks get-credentials --resource-group slinky --name slinky

Pre-Requisites

Install the pre-requisite helm charts.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace prometheus --create-namespace --set installCRDs=true

Slurm Operator

Download values and install the slurm-operator. You will need to update the operator and webhook repository values to point to the desired container repository.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm-operator/values.yaml \
    -o values-operator.yaml
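
The repository overrides are made in values-operator.yaml before installing. A rough sketch follows; the key names are assumptions, and the ACR path assumes a registry named slinky as used in the commands below, so verify both against your environment:

# Sketch of values-operator.yaml overrides; the ACR path is an assumption.
operator:
  image:
    repository: slinky.azurecr.io/slurm-operator
webhook:
  image:
    repository: slinky.azurecr.io/slurm-operator-webhook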

Make sure you are authenticated to your container registry (ACR) and that the AcrPull role is assigned so the cluster can pull your images.

az acr login -n slinky

az aks show \
    --resource-group slinky \
    --name slinky \
    --query identityProfile.kubeletidentity.clientId \
    -o tsv

az role assignment create --assignee <clientId from above> \
     --role AcrPull \
     --scope $(az acr show --name slinky --query id -o tsv)

helm install slurm-operator \
    -f values-operator.yaml \
    --namespace=slinky \
    --create-namespace \
    helm/slurm-operator

Make sure the cluster deployed successfully with:

kubectl --namespace=slinky get pods

Output should be similar to:

NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Slurm Cluster

Download values and install a Slurm cluster.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm/values.yaml \
    -o values-slurm.yaml

helm install slurm \
    -f values-slurm.yaml \
    --namespace=slurm \
    --create-namespace \
    helm/slurm

By default, the values-slurm.yaml file sets controller.persistence.storageClass and mariadb.primary.persistence.storageClass to standard. You will need to update both values to default to use AKS’s default StorageClass before running the install above.
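
Alternatively, instead of editing the downloaded file, the same change can be passed on the command line. This is only a sketch that reuses the value paths from the note above; confirm the default StorageClass name on your cluster with kubectl get storageclass:

helm install slurm \
    -f values-slurm.yaml \
    --set controller.persistence.storageClass=default \
    --set mariadb.primary.persistence.storageClass=default \
    --namespace=slurm \
    --create-namespace \
    helm/slurm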

Make sure the slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods

Output should be similar to:

NAME                              READY   STATUS    RESTARTS       AGE
slurm-accounting-0                1/1     Running   0              5m00s
slurm-compute-debug-l4bd2         1/1     Running   0              5m00s
slurm-controller-0                2/2     Running   0              5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0              5m00s
slurm-mariadb-0                   1/1     Running   0              5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0              5m00s

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

kubectl --namespace=slurm exec \
    -it statefulsets/slurm-controller -- bash --login

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly verify that Slurm is functioning:

sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue

See Slurm Commands for more details on how to interact with Slurm.