QuickStart Guide for Google GKE

This quickstart guide will help you get the slurm-operator running and deploy Slurm clusters to GKE.

Setup

Setup a cluster on GKE.

gcloud container clusters create slinky-cluster \
 --location=us-central1-a \
 --num-nodes=2 \
 --node-taints "" \
 --machine-type=c2-standard-16

Setup kubectl to point to your new cluster.

gcloud container clusters get-credentials slinky-cluster

Pre-Requisites

Install the pre-requisite helm charts.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add jetstack https://charts.jetstack.io

helm repo update helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set crds.enabled=true

helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace prometheus \
--create-namespace \
--set installCRDs=true

Slurm Operator

Download values and install the slurm-operator. You will need to update the operator and webhook repository values to point to the desired container repository.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.3.0/helm/slurm-operator/values.yaml -o values-operator.yaml

helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
--values=values-operator.yaml \
--version=0.3.0 \
--namespace=slinky \
--create-namespace

Make sure the cluster deployed successfully with:

kubectl --namespace=slinky get pods

Output should be similar to:

NAME READY STATUS RESTARTS AGE
slurm-operator-7444c844d5-dpr5h 1/1 Running 0 5m00s
slurm-operator-webhook-6fd8d7857d-zcvqh 1/1 Running 0 5m00s

Slurm Cluster

Download values and install a Slurm cluster.

curl -L https://raw.githubusercontent.com/SlinkyProject/slurm-operator/refs/tags/v0.3.0/helm/slurm/values.yaml -o values-slurm.yaml

helm install slurm oci://ghcr.io/slinkyproject/charts/slurm \
--values=values-slurm.yaml \
--version=0.3.0 \
--namespace=slurm \
--create-namespace

Make sure the slurm cluster deployed successfully with:

kubectl --namespace=slurm get pods

Output should be similar to:

NAME READY STATUS RESTARTS AGE
slurm-accounting-0 1/1 Running 0 5m00s
slurm-compute-debug-0 1/1 Running 0
5m00s slurm-controller-0 2/2 Running 0 5m00s
slurm-exporter-7b44b6d856-d86q5 1/1 Running 0 5m00s
slurm-mariadb-0 1/1 Running 0 5m00s
slurm-restapi-5f75db85d9-67gpl 1/1 Running 0 5m00s

Testing

To test Slurm functionality, connect to the controller to use Slurm client commands:

kubectl -n slurm exec -it slurm-controller-0 -- bash --login

On the controller pod (e.g. host slurm@slurm-controller-0), run the following commands to quickly test Slurm is functioning:

sinfo
srun
hostname sbatch --wrap="sleep 60" squeue

See Slurm Commands for more details on how to interact with Slurm.