QuickStart Guide for Amazon EKS

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to Amazon EKS.

Setup

Set up a cluster on EKS.

eksctl create cluster \
    --name slinky-cluster \
    --region us-west-2 \
    --nodegroup-name slinky-nodes \
    --node-type t3.medium \
    --nodes 2

Set up kubectl to point to your new cluster.

aws eks --region us-west-2 update-kubeconfig --name slinky-cluster
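
To confirm that kubectl now points at the new cluster, one option is to count the worker nodes that report Ready. The sketch below is a small helper, not part of eksctl or kubectl. The function name count_ready_nodes is hypothetical, and it assumes the default column layout of `kubectl get nodes` (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Count nodes whose STATUS column is exactly "Ready".
# Reads default-format `kubectl get nodes` output on stdin.
# Assumption: a cordoned node ("Ready,SchedulingDisabled") is not counted.
count_ready_nodes() {
    awk 'NR > 1 && $2 == "Ready" { n++ } END { print n + 0 }'
}

# Intended use against a live cluster (requires kubectl access):
#   kubectl get nodes | count_ready_nodes
```

With the two-node group created above, the count should reach 2 once both nodes have joined the cluster.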

Prerequisites

Install the prerequisite Helm charts.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add kedacore https://kedacore.github.io/charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm repo update
helm install cert-manager jetstack/cert-manager \
    --namespace cert-manager --create-namespace --set crds.enabled=true
helm install prometheus prometheus-community/kube-prometheus-stack \
    --namespace prometheus --create-namespace --set installCRDs=true

Install EBS CSI Driver

helm install aws-ebs-csi aws-ebs-csi-driver/aws-ebs-csi-driver -n kube-system

AWS Permissions

Make sure the IAM role attached to your EKS worker nodes has the permissions the EBS CSI driver needs.

Step 1: Identify the IAM Role

Run the following AWS CLI command to get the IAM role attached to your EKS worker nodes:

aws eks describe-nodegroup \
    --cluster-name slinky-cluster \
    --nodegroup-name slinky-nodes \
    --query "nodegroup.nodeRole" \
    --output text
    

This will return something like:

arn:aws:iam::017820679962:role/eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK

The IAM role name is the final segment of the ARN, here eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK. Your role name will differ; use your own value in the next step.
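
Rather than copying the role name by hand, it can be derived from the ARN with shell parameter expansion. This is a minimal sketch: the function name role_name_from_arn and the variable names are hypothetical, and the ARN is the sample value shown above.

```shell
# Extract the role name (everything after the last "/") from an IAM role ARN.
role_name_from_arn() {
    printf '%s\n' "${1##*/}"
}

# Example with the sample ARN from this guide; substitute your own ARN.
NODE_ROLE_ARN="arn:aws:iam::017820679962:role/eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK"
NODE_ROLE_NAME="$(role_name_from_arn "$NODE_ROLE_ARN")"
echo "$NODE_ROLE_NAME"
```

The resulting NODE_ROLE_NAME can then be passed to --role-name when attaching the policy.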

Step 2: Attach the Required IAM Policy for EBS CSI Driver

Attach the AmazonEBSCSIDriverPolicy managed IAM policy to this role.

Run the following command:

aws iam attach-role-policy \
  --role-name eksctl-slurm-cluster-nodegroup-my-nod-NodeInstanceRole-hpbRU4WRvvlK \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
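
To check that the attachment took effect, the policies attached to the role can be listed and searched for the EBS CSI policy ARN. The helper below is a sketch under assumptions: the function name has_ebs_csi_policy is hypothetical, and it expects a whitespace-separated list of policy ARNs on stdin, such as the text output of `aws iam list-attached-role-policies`.

```shell
# Succeeds if AmazonEBSCSIDriverPolicy appears among the policy ARNs on stdin.
has_ebs_csi_policy() {
    tr -s '[:space:]' '\n' \
        | grep -qx 'arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy'
}

# Intended use against a real account (requires AWS credentials;
# ROLE_NAME is a placeholder for your node role name):
#   aws iam list-attached-role-policies --role-name "$ROLE_NAME" \
#       --query 'AttachedPolicies[].PolicyArn' --output text \
#       | has_ebs_csi_policy && echo "policy attached"
```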

Create StorageClass

Create a StorageClass backed by the EBS CSI driver so that persistent volumes can be provisioned dynamically.

Here is an example storageclass.yaml file:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp2
  fsType: ext4

Create the StorageClass using your storageclass.yaml file.

kubectl apply -f storageclass.yaml

Slurm Operator

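Download the default values file and install the slurm-operator. Before installing, update the operator and webhook repository values in values-operator.yaml to point to the desired container repository. The chart path helm/slurm-operator in the install command is relative to a local checkout of the slurm-operator repository.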

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm-operator/values.yaml \
    -o values-operator.yaml

helm install slurm-operator \
    -f values-operator.yaml \
    --namespace=slinky \
    --create-namespace \
    helm/slurm-operator

Make sure the operator deployed successfully with:

kubectl --namespace=slinky get pods

Output should be similar to:

NAME                                      READY   STATUS    RESTARTS   AGE
slurm-operator-7444c844d5-dpr5h           1/1     Running   0          5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh   1/1     Running   0          5m00s

Slurm Cluster

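Download the default values file and install a Slurm cluster. As with the operator, the chart path helm/slurm is relative to a local checkout of the slurm-operator repository.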

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/heads/main/helm/slurm/values.yaml \
    -o values-slurm.yaml

helm install slurm \
    -f values-slurm.yaml \
    --namespace=slurm \
    --create-namespace \
    helm/slurm

Make sure the Slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods

Output should be similar to:

NAME                              READY   STATUS    RESTARTS       AGE
slurm-accounting-0                1/1     Running   0              5m00s
slurm-compute-debug-l4bd2         1/1     Running   0              5m00s
slurm-controller-0                2/2     Running   0              5m00s
slurm-exporter-7b44b6d856-d86q5   1/1     Running   0              5m00s
slurm-mariadb-0                   1/1     Running   0              5m00s
slurm-restapi-5f75db85d9-67gpl    1/1     Running   0              5m00s
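
Instead of checking the pod listings by eye, the same check can be scripted. The following is a sketch, not part of the slurm-operator tooling: the function name all_pods_ready is hypothetical, and it assumes the default `kubectl get pods` columns shown above (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Succeeds only if every pod row on stdin is Running with all containers ready.
# Fails if there are no pod rows at all.
all_pods_ready() {
    awk '
        NR == 1 { next }                  # skip the header row
        {
            rows++
            split($2, ready, "/")         # READY looks like "1/1" or "2/2"
            if ($3 != "Running" || ready[1] != ready[2]) {
                print "not ready: " $1
                bad++
            }
        }
        END { if (rows == 0 || bad > 0) exit 1 }
    '
}

# Intended use against a live cluster (requires kubectl access):
#   kubectl --namespace=slinky get pods | all_pods_ready && echo "operator ready"
#   kubectl --namespace=slurm  get pods | all_pods_ready && echo "cluster ready"
```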

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

kubectl --namespace=slurm exec \
    -it statefulsets/slurm-controller -- bash --login

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly test that Slurm is functioning:

sinfo
srun hostname
sbatch --wrap="sleep 60"
squeue
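
To follow the submitted job without reading the queue by hand, the squeue output can be summarized by state. This is a sketch under assumptions: the function name count_jobs_in_state is hypothetical, and it relies on the default squeue column order (JOBID, PARTITION, NAME, USER, ST, ...) with job names that contain no spaces.

```shell
# Count jobs in a given state code (e.g. R = running, PD = pending)
# from default-format `squeue` output on stdin.
count_jobs_in_state() {
    awk -v state="$1" 'NR > 1 && $5 == state { n++ } END { print n + 0 }'
}

# Intended use on the controller pod:
#   squeue | count_jobs_in_state R
#   squeue | count_jobs_in_state PD
```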

See Slurm Commands for more details on how to interact with Slurm.