1 - QuickStart Guide for Amazon EKS

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to Amazon EKS.

Setup

Set up a cluster on EKS.

 eksctl create cluster \
 --name slinky-cluster \
 --region us-west-2 \
 --nodegroup-name slinky-nodes \
 --node-type t3.medium \
 --nodes 2 

Set up kubectl to point to your new cluster.

 aws eks --region us-west-2 update-kubeconfig --name slinky-cluster

Pre-Requisites

Install the pre-requisite helm charts.

 helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
 helm repo add kedacore https://kedacore.github.io/charts
 helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
 helm repo add bitnami https://charts.bitnami.com/bitnami
 helm repo add jetstack https://charts.jetstack.io
 helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
 helm repo update
 helm install cert-manager jetstack/cert-manager \
   --namespace cert-manager --create-namespace --set crds.enabled=true
 helm install prometheus prometheus-community/kube-prometheus-stack \
   --namespace prometheus --create-namespace --set installCRDs=true

Install EBS CSI Driver

 helm install aws-ebs-csi aws-ebs-csi-driver/aws-ebs-csi-driver -n kube-system
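Before moving on, you can confirm the driver pods are running. The label selector below is an assumption about how the chart labels its pods; if it returns nothing, list all pods in kube-system instead.

 # Verify the EBS CSI controller and node pods are Running
 kubectl get pods \
   --namespace kube-system \
   --selector app.kubernetes.io/name=aws-ebs-csi-driver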

AWS Permissions

You will need to make sure the IAM role attached to your EKS worker nodes has the proper permissions.

Step 1: Identify the IAM Role

Run the following AWS CLI command to get the IAM role attached to your EKS worker nodes:

 aws eks describe-nodegroup \
 --cluster-name slinky-cluster \
 --nodegroup-name slinky-nodes \
 --query "nodegroup.nodeRole" \
 --output text

This will return something like:

 arn:aws:iam::017820679962:role/eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK

The IAM role name here is eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK.
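The role name is the final path segment of the ARN. If you prefer not to copy it by hand, you can capture it in a shell variable; the variable names below are just for illustration.

 # Grab the node role ARN and strip everything up to the last "/" to get the role name
 NODE_ROLE_ARN=$(aws eks describe-nodegroup \
   --cluster-name slinky-cluster \
   --nodegroup-name slinky-nodes \
   --query "nodegroup.nodeRole" \
   --output text)
 NODE_ROLE_NAME="${NODE_ROLE_ARN##*/}"
 echo "${NODE_ROLE_NAME}"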

Step 2: Attach the Required IAM Policy for EBS CSI Driver

Attach the AmazonEBSCSIDriverPolicy managed IAM policy to this role.

Run the following command:

 aws iam attach-role-policy \
   --role-name eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK \
   --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy

Create StorageClass

You will need to create a StorageClass for the Slurm cluster to use.

Here is an example storageclass.yaml file:

 apiVersion: storage.k8s.io/v1
 kind: StorageClass
 metadata:
   name: standard
 provisioner: ebs.csi.aws.com
 reclaimPolicy: Delete
 volumeBindingMode: WaitForFirstConsumer
 parameters:
   type: gp2
   fsType: ext4

Create the StorageClass using your storageclass.yaml file.

 kubectl apply -f storageclass.yaml 
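Verify the StorageClass exists; it should list ebs.csi.aws.com as the provisioner. The name standard matters because values-slurm.yaml defaults to standard for the controller and mariadb persistence.

 kubectl get storageclass standard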

Slurm Operator

Download values and install the slurm-operator. You will need to update the operator and webhook repository values to point to the desired container repository.

 curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm-operator/values.yaml \
   -o values-operator.yaml

helm install slurm-operator \
 -f values-operator.yaml \
 --namespace=slinky \
 --create-namespace \
 helm/slurm-operator 
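For example, if the images live in your own registry, the repository fields in values-operator.yaml would be edited to something like the sketch below before running the helm install above. The key paths and image names are assumptions based on the chart layout, and registry.example.com is a placeholder; confirm both against the downloaded file.

 # values-operator.yaml (excerpt) -- hypothetical registry, verify key paths in your chart version
 operator:
   image:
     repository: registry.example.com/slinky/slurm-operator
 webhook:
   image:
     repository: registry.example.com/slinky/slurm-operator-webhook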

Make sure the operator deployed successfully with:

 kubectl --namespace=slinky get pods

Output should be similar to:

 NAME                                      READY   STATUS    RESTARTS   AGE
 slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
 slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Slurm Cluster

Download values and install a Slurm cluster.

 curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm/values.yaml \
   -o values-slurm.yaml

helm install slurm \
 -f values-slurm.yaml \
 --namespace=slurm \
 --create-namespace \
 helm/slurm 

Make sure the Slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods 

Output should be similar to:

 NAME                              READY   STATUS    RESTARTS   AGE
 slurm-accounting-0                1/1     Running   0          5m00s
 slurm-compute-debug-l4bd2         1/1     Running   0          5m00s
 slurm-controller-0                2/2     Running   0          5m00s
 slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
 slurm-mariadb-0                   1/1     Running   0          5m00s
 slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

 kubectl --namespace=slurm exec \
 -it statefulsets/slurm-controller -- bash --login 

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly test Slurm is functioning:

 sinfo
 srun hostname
 sbatch --wrap="sleep 60"
 squeue
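Because the cluster runs accounting (the slurm-accounting pod), you can also check the batch job's final state once it completes with something like the following, replacing <jobid> with the ID printed by sbatch:

 # Query accounting for the job submitted above
 sacct --jobs=<jobid> --format=JobID,JobName,State,Elapsed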

See Slurm Commands for more details on how to interact with Slurm.

2 - QuickStart Guide for Google GKE

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to GKE.

Setup

Set up a cluster on GKE.

 gcloud container clusters create slinky-cluster \
 --location=us-central1-a \
 --num-nodes=2 \
 --node-taints "" \
 --machine-type=c2-standard-16 

Set up kubectl to point to your new cluster.

 gcloud container clusters get-credentials slinky-cluster
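A quick way to confirm kubectl is pointed at the new cluster is to list its nodes; the two nodes created above should report a Ready status.

 kubectl get nodes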

Pre-Requisites

Install the pre-requisite helm charts.

 helm repo add
prometheus-community https://prometheus-community.github.io/helm-charts helm
repo add kedacore https://kedacore.github.io/charts helm repo add metrics-server
https://kubernetes-sigs.github.io/metrics-server/ helm repo add bitnami
https://charts.bitnami.com/bitnami helm repo add jetstack
https://charts.jetstack.io helm repo update helm install cert-manager
jetstack/cert-manager \
 --namespace cert-manager --create-namespace --set crds.enabled=true helm
install prometheus prometheus-community/kube-prometheus-stack \
 --namespace prometheus --create-namespace --set installCRDs=true
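Before installing the operator, it is worth confirming that cert-manager and the Prometheus stack came up cleanly:

 kubectl get pods --namespace cert-manager
 kubectl get pods --namespace prometheus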

Slurm Operator

Download values and install the slurm-operator. You will need to update the operator and webhook repository values to point to the desired container repository.

 curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm-operator/values.yaml \
   -o values-operator.yaml

helm install slurm-operator \
 -f values-operator.yaml \
 --namespace=slinky \
 --create-namespace \
 helm/slurm-operator 
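If you would rather not edit the downloaded file, the same repository values can be overridden on the command line. The key paths below are assumptions based on the chart layout and registry.example.com is a placeholder; check them against values-operator.yaml.

 # Alternative: override the image repositories at install time (hypothetical registry)
 helm install slurm-operator \
   -f values-operator.yaml \
   --set operator.image.repository=registry.example.com/slinky/slurm-operator \
   --set webhook.image.repository=registry.example.com/slinky/slurm-operator-webhook \
   --namespace=slinky \
   --create-namespace \
   helm/slurm-operator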

Make sure the operator deployed successfully with:

 kubectl --namespace=slinky get pods

Output should be similar to:

 NAME                                      READY   STATUS    RESTARTS   AGE
 slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
 slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Slurm Cluster

Download values and install a Slurm cluster.

 curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm/values.yaml \
   -o values-slurm.yaml

helm install slurm \
 -f values-slurm.yaml \
 --namespace=slurm \
 --create-namespace \
 helm/slurm 

Make sure the Slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods 

Output should be similar to:

 NAME                              READY   STATUS    RESTARTS   AGE
 slurm-accounting-0                1/1     Running   0          5m00s
 slurm-compute-debug-l4bd2         1/1     Running   0          5m00s
 slurm-controller-0                2/2     Running   0          5m00s
 slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
 slurm-mariadb-0                   1/1     Running   0          5m00s
 slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

 kubectl --namespace=slurm exec \
 -it statefulsets/slurm-controller -- bash --login 

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly test Slurm is functioning:

 sinfo
 srun hostname
 sbatch --wrap="sleep 60"
 squeue

See Slurm Commands for more details on how to interact with Slurm.

3 - QuickStart Guide for Microsoft AKS

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to AKS.

Setup

Set up a resource group on AKS.

 az group create --name slinky --location westus2

Set up a cluster on AKS.

 az aks create \
 --resource-group slinky \
 --name slinky \
 --location westus2 \
 --node-vm-size Standard_D2s_v3 

Set up kubectl to point to your new cluster.

 az aks get-credentials --resource-group slinky --name slinky

Pre-Requisites

Install the pre-requisite helm charts.

 helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
 helm repo add kedacore https://kedacore.github.io/charts
 helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
 helm repo add bitnami https://charts.bitnami.com/bitnami
 helm repo add jetstack https://charts.jetstack.io
 helm repo update
 helm install cert-manager jetstack/cert-manager \
   --namespace cert-manager --create-namespace --set crds.enabled=true
 helm install prometheus prometheus-community/kube-prometheus-stack \
   --namespace prometheus --create-namespace --set installCRDs=true

Slurm Operator

Download values and install the slurm-operator. You will need to update the operator and webhook repository values to point to the desired container repository.

 curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm-operator/values.yaml \
   -o values-operator.yaml
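If your images are hosted in the ACR registry used below (an ACR named slinky has the login server slinky.azurecr.io), the repository values in values-operator.yaml would be edited to something like the following sketch. The key paths and image names are assumptions; confirm them against the downloaded file.

 # values-operator.yaml (excerpt) -- assumes images were pushed to the slinky ACR registry
 operator:
   image:
     repository: slinky.azurecr.io/slurm-operator
 webhook:
   image:
     repository: slinky.azurecr.io/slurm-operator-webhook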

Make sure you are authenticated and the proper role is assigned to pull your images.

 az acr login -n slinky

az aks show \
 --resource-group slinky \
 --name slinky \
 --query identityProfile.kubeletidentity.clientId \
 -o tsv

az role assignment create --assignee <clientId from above> \
 --role AcrPull \
 --scope $(az acr show --name slinky --query id -o tsv)

helm install slurm-operator \
 -f values-operator.yaml \
 --namespace=slinky \
 --create-namespace \
 helm/slurm-operator 

Make sure the operator deployed successfully with:

 kubectl --namespace=slinky get pods

Output should be similar to:

 NAME                                      READY   STATUS    RESTARTS   AGE
 slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
 slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Slurm Cluster

Download values and install a Slurm cluster.

 curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.2.1/helm/slurm/values.yaml \
   -o values-slurm.yaml

helm install slurm \
 -f values-slurm.yaml \
 --namespace=slurm \
 --create-namespace \
 helm/slurm 

By default the values-slurm.yaml file uses standard for controller.persistence.storageClass and mariadb.primary.persistence.storageClass. You will need to update these values to default, AKS's default StorageClass, before installing.
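For example, the relevant sections of values-slurm.yaml would look something like this after the change (the nesting shown follows the key paths named above; confirm it against the downloaded file):

 # values-slurm.yaml (excerpt) -- use the AKS default StorageClass
 controller:
   persistence:
     storageClass: default
 mariadb:
   primary:
     persistence:
       storageClass: default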

Make sure the Slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods 

Output should be similar to:

 NAME                              READY   STATUS    RESTARTS   AGE
 slurm-accounting-0                1/1     Running   0          5m00s
 slurm-compute-debug-l4bd2         1/1     Running   0          5m00s
 slurm-controller-0                2/2     Running   0          5m00s
 slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
 slurm-mariadb-0                   1/1     Running   0          5m00s
 slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

 kubectl --namespace=slurm exec \
 -it statefulsets/slurm-controller -- bash --login 

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly test Slurm is functioning:

 sinfo
 srun hostname
 sbatch --wrap="sleep 60"
 squeue

See Slurm Commands for more details on how to interact with Slurm.