Quickstart Guides
- 1: QuickStart Guide for Amazon EKS
- 2: QuickStart Guide for Google GKE
- 3: QuickStart Guide for Microsoft AKS
1 - QuickStart Guide for Amazon EKS
This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to Amazon EKS.
Setup
Set up a cluster on EKS.
eksctl create cluster \
--name slinky-cluster \
--region us-west-2 \
--nodegroup-name slinky-nodes \
--node-type t3.medium \
--nodes 2
Set up kubectl to point to your new cluster.
aws eks --region us-west-2 update-kubeconfig --name slinky-cluster
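To confirm kubectl now points at the new cluster, list the nodes; the two nodes created above should report a Ready status:
kubectl get nodes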
Prerequisites
Install the prerequisite Helm charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace prometheus --create-namespace --set installCRDs=true
Install EBS CSI Driver
helm install aws-ebs-csi aws-ebs-csi-driver/aws-ebs-csi-driver -n kube-system
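To verify that the driver registered with the cluster, list the CSI drivers; ebs.csi.aws.com (the provisioner used by the StorageClass below) should appear in the output:
kubectl get csidriver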
AWS Permissions
You will need to make sure the IAM role used by your EKS worker nodes has the proper permissions.
Step 1: Identify the IAM Role
Run the following AWS CLI command to get the IAM role attached to your EKS worker nodes:
aws eks describe-nodegroup \
--cluster-name slinky-cluster \
--nodegroup-name slinky-nodes \
--query "nodegroup.nodeRole" \
--output text
This will return something like:
arn:aws:iam::017820679962:role/eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK
Step 2: Attach the Required IAM Policy for EBS CSI Driver
Attach the AmazonEBSCSIDriverPolicy managed IAM policy to this role.
Run the following command, substituting the role name returned in Step 1:
aws iam attach-role-policy \
--role-name eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK \
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
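You can confirm the policy is attached by listing the policies on the role:
aws iam list-attached-role-policies \
--role-name eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK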
Create StorageClass
You will need to create a StorageClass to use.
Here is an example storageclass.yaml file:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp2
  fsType: ext4
Create the StorageClass using your storageclass.yaml file.
kubectl apply -f storageclass.yaml
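Verify that the new StorageClass is available:
kubectl get storageclass standard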
Slurm Operator
Download the values file and install the slurm-operator. You will need to update the operator and webhook repository values to point to the desired container registry.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm-operator/values.yaml \
-o values-operator.yaml
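As a rough sketch of the edit to make in values-operator.yaml, the snippet below points the operator and webhook images at a private registry. The key names shown are illustrative assumptions rather than the chart's actual schema; check the downloaded values file for the real keys.
# Illustrative only: these key names are assumptions, consult values-operator.yaml
operator:
  image:
    repository: <your-registry>/slurm-operator
webhook:
  image:
    repository: <your-registry>/slurm-operator-webhook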
helm install slurm-operator \
-f values-operator.yaml \
--namespace=slinky \
--create-namespace \
helm/slurm-operator
Make sure the operator deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s
Slurm Cluster
Download values and install a Slurm cluster.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm/values.yaml \
-o values-slurm.yaml
helm install slurm \
-f values-slurm.yaml \
--namespace=slurm \
--create-namespace \
helm/slurm
Make sure the Slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME                              READY   STATUS    RESTARTS   AGE
slurm-accounting-0                1/1     Running   0          5m00s
slurm-compute-debug-l4bd2         1/1     Running   0          5m00s
slurm-controller-0                2/2     Running   0          5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
slurm-mariadb-0                   1/1     Running   0          5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client commands:
kubectl --namespace=slurm exec \
-it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. as the slurm user on slurm-controller-0), run the following commands to quickly verify that Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
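Because the deployment above includes the slurm-accounting pod, the batch job submitted with sbatch should also appear in the accounting records once it completes; sacct is a standard Slurm client command:
sacct --format=JobID,JobName,State,Elapsed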
See Slurm Commands for more details on how to interact with Slurm.
2 - QuickStart Guide for Google GKE
This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to GKE.
Setup
Set up a cluster on GKE.
gcloud container clusters create slinky-cluster \
--location=us-central1-a \
--num-nodes=2 \
--node-taints "" \
--machine-type=c2-standard-16
Set up kubectl to point to your new cluster.
gcloud container clusters get-credentials slinky-cluster
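To confirm kubectl now points at the new cluster, list the nodes; the two nodes created above should report a Ready status:
kubectl get nodes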
Prerequisites
Install the prerequisite Helm charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace prometheus --create-namespace --set installCRDs=true
Slurm Operator
Download the values file and install the slurm-operator. You will need to update the operator and webhook repository values to point to the desired container registry.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm-operator/values.yaml \
-o values-operator.yaml
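As a rough sketch of the edit to make in values-operator.yaml, the snippet below points the operator and webhook images at a private registry. The key names shown are illustrative assumptions rather than the chart's actual schema; check the downloaded values file for the real keys.
# Illustrative only: these key names are assumptions, consult values-operator.yaml
operator:
  image:
    repository: <your-registry>/slurm-operator
webhook:
  image:
    repository: <your-registry>/slurm-operator-webhook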
helm install slurm-operator \
-f values-operator.yaml \
--namespace=slinky \
--create-namespace \
helm/slurm-operator
Make sure the operator deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s
Slurm Cluster
Download values and install a Slurm cluster.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm/values.yaml \
-o values-slurm.yaml
helm install slurm \
-f values-slurm.yaml \
--namespace=slurm \
--create-namespace \
helm/slurm
Make sure the Slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME                              READY   STATUS    RESTARTS   AGE
slurm-accounting-0                1/1     Running   0          5m00s
slurm-compute-debug-l4bd2         1/1     Running   0          5m00s
slurm-controller-0                2/2     Running   0          5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
slurm-mariadb-0                   1/1     Running   0          5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client commands:
kubectl --namespace=slurm exec \
-it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. as the slurm user on slurm-controller-0), run the following commands to quickly verify that Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
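Because the deployment above includes the slurm-accounting pod, the batch job submitted with sbatch should also appear in the accounting records once it completes; sacct is a standard Slurm client command:
sacct --format=JobID,JobName,State,Elapsed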
See Slurm Commands for more details on how to interact with Slurm.
3 - QuickStart Guide for Microsoft AKS
This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to AKS.
Setup
Set up an Azure resource group for AKS.
az group create --name slinky --location westus2
Set up a cluster on AKS.
az aks create \
--resource-group slinky \
--name slinky \
--location westus2 \
--node-vm-size Standard_D2s_v3
Set up kubectl to point to your new cluster.
az aks get-credentials --resource-group slinky --name slinky
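To confirm kubectl now points at the new cluster, list the nodes; they should report a Ready status:
kubectl get nodes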
Prerequisites
Install the prerequisite Helm charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace prometheus --create-namespace --set installCRDs=true
Slurm Operator
Download the values file and install the slurm-operator. You will need to update the operator and webhook repository values to point to the desired container registry.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm-operator/values.yaml \
-o values-operator.yaml
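As a rough sketch of the edit to make in values-operator.yaml, the snippet below points the operator and webhook images at a private registry. The key names shown are illustrative assumptions rather than the chart's actual schema; check the downloaded values file for the real keys.
# Illustrative only: these key names are assumptions, consult values-operator.yaml
operator:
  image:
    repository: <your-registry>/slurm-operator
webhook:
  image:
    repository: <your-registry>/slurm-operator-webhook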
Make sure you are authenticated with your container registry and that the AcrPull role is assigned so the cluster can pull your images.
az acr login -n slinky
az aks show \
--resource-group slinky \
--name slinky \
--query identityProfile.kubeletidentity.clientId \
-o tsv
az role assignment create --assignee <clientId from above> \
--role AcrPull \
--scope $(az acr show --name slinky --query id -o tsv)
helm install slurm-operator \
-f values-operator.yaml \
--namespace=slinky \
--create-namespace \
helm/slurm-operator
Make sure the operator deployed successfully with:
kubectl --namespace=slinky get pods
Output should be similar to:
NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s
Slurm Cluster
Download values and install a Slurm cluster.
curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm/values.yaml \
-o values-slurm.yaml
By default, values-slurm.yaml sets controller.persistence.storageClass and mariadb.primary.persistence.storageClass to standard. Update these values to default before installing so the Slurm cluster uses AKS's default StorageClass.
helm install slurm \
-f values-slurm.yaml \
--namespace=slurm \
--create-namespace \
helm/slurm
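If you prefer not to edit values-slurm.yaml, the same overrides can be passed on the helm command line; the two keys are the ones named above:
helm install slurm \
-f values-slurm.yaml \
--set controller.persistence.storageClass=default \
--set mariadb.primary.persistence.storageClass=default \
--namespace=slurm \
--create-namespace \
helm/slurm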
Make sure the Slurm cluster deployed successfully with:
kubectl --namespace=slurm get pods
Output should be similar to:
NAME                              READY   STATUS    RESTARTS   AGE
slurm-accounting-0                1/1     Running   0          5m00s
slurm-compute-debug-l4bd2         1/1     Running   0          5m00s
slurm-controller-0                2/2     Running   0          5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0          5m00s
slurm-mariadb-0                   1/1     Running   0          5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0          5m00s
Testing
To test Slurm functionality, connect to the controller to use Slurm client commands:
kubectl --namespace=slurm exec \
-it statefulsets/slurm-controller -- bash --login
On the controller pod (e.g. as the slurm user on slurm-controller-0), run the following commands to quickly verify that Slurm is functioning:
sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
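Because the deployment above includes the slurm-accounting pod, the batch job submitted with sbatch should also appear in the accounting records once it completes; sacct is a standard Slurm client command:
sacct --format=JobID,JobName,State,Elapsed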
See Slurm Commands for more details on how to interact with Slurm.