Usage Tutorial
Overview
This guide provides an example workflow for running PyTorch on Slurm using the Slurm Operator. By the end of this tutorial, the Graph Convolutional Network (GCN) PyTorch example should be running on a Slurm cluster within Kubernetes.
Warning
These instructions are not intended for use in a production environment.
Prerequisites
- A deployment of slurm-operator with a LoginSet enabled.
- The latest official release of the following repositories cloned locally:
  - Slurm Operator: slurm-operator
  - Slinky Containers: containers
Building Images
In-depth instructions can be found at Building Slinky Containers.
The Slinky project uses Docker Buildx Bake to build modular images for each
Slurm component. To extend or change functionality, modify the Dockerfile
containing the images’ build configuration. For example, the configuration for
images built with Slurm 25.11 and Rocky Linux 9 can be found within the
containers repo at:
containers/schedmd/slurm/25.11/rockylinux9/Dockerfile.
To install PyTorch for this example, change the packages that dnf installs in
the base-extra layer. This is done by appending the required Python and PyTorch
packages to the list, cloning the PyTorch examples repository, and installing
the GCN example’s dependencies from within its directory via pip. For example:
################################################################################
FROM base AS base-extra
SHELL ["bash", "-c"]
RUN --mount=type=cache,target=/var/cache/dnf,sharing=locked <<EOR
# Install Extra Packages
set -xeuo pipefail
dnf -q -y install --setopt='install_weak_deps=False' \
    openmpi python3-pip python3-setuptools python3-scipy
EOR
# PyTorch
RUN git clone --depth=1 https://github.com/pytorch/examples.git /tmp/examples
WORKDIR /tmp/examples/gcn
RUN pip install -r requirements.txt
################################################################################
After making modifications, images can be tagged by setting the SUFFIX
environment variable, which appends a -<SUFFIX> to the tag of the version you
choose to build. Then, from the containers repo’s root directory, the images can
be built by running the following command, where <version> is replaced with the
Slurm version and <distribution> with the desired distribution:
docker buildx bake --file containers/schedmd/slurm/docker-bake.hcl --file containers/schedmd/slurm/<version>/<distribution>/slurm.hcl
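As a sketch of the tagging scheme described above, assuming SUFFIX is joined to the version with a hyphen, where pytorch is a hypothetical suffix value:

```shell
# Illustration of the assumed tag scheme: <version>-<SUFFIX>.
VERSION="25.11"   # Slurm version chosen for the build
SUFFIX="pytorch"  # hypothetical suffix exported before building
TAG="${VERSION}-${SUFFIX}"
echo "${TAG}"     # -> 25.11-pytorch
```

The build itself would then be invoked with the variable set, e.g. `SUFFIX=pytorch` prefixed to the bake command above.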
After building the images, they must be made available to your cluster via a container registry (for example, by re-tagging them with docker tag and pushing them with docker push).
Loading
The Slurm Helm chart’s values file can be found in the slurm-operator repo at
helm/slurm/values.yaml. The values under nodesets.slinky.slurmd.image, where
slinky is replaced by the name of your NodeSet, specify the repository and tag
of the image the Slurm Operator uses to deploy that NodeSet’s slurmd pods.
Modify these values to reference the container registry that hosts the images
built in the previous step, along with the tag that was applied to them.
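As an illustration, the relevant excerpt of values.yaml might look like the following sketch; the registry and tag shown are placeholders, and the nodeset name slinky should match your own:

```yaml
# values.yaml (excerpt) -- repository and tag below are placeholder values.
nodesets:
  slinky:              # replace with the name of your NodeSet
    slurmd:
      image:
        repository: registry.example.com/slinky/slurmd  # your registry
        tag: 25.11-pytorch                              # tag applied at build time
```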
Note
Take note of the imagePullPolicy that Kubernetes applies by default: Always when the image tag is latest, IfNotPresent otherwise.
Running
The functionality can be tested by running PyTorch within Slurm. Create the
following sbatch script as the file pytorch-sbatch.sh:
#!/bin/bash
# Slurm Parameters
#SBATCH -n 3 # Run n tasks
#SBATCH -N 3 # Run across N nodes
#SBATCH -t 0-00:10 # Time limit in D-HH:MM
#SBATCH -p slinky # Partition to submit to
#SBATCH --mem=100 # Memory pool for all cores
#SBATCH -o pytorch-output_%j.out # File to which STDOUT will be written, %j inserts jobid
#SBATCH -e pytorch-errors_%j.err # File to which STDERR will be written, %j inserts jobid
echo "This was run on $SLURM_JOB_NODELIST."
srun -D /tmp/examples/gcn python3 main.py --epochs 200 --lr 0.01 --l2 5e-4 --dropout-p 0.5 --hidden-dim 16 --val-every 20 --include-bias
This script runs the GCN PyTorch example across three nodes; modify or scale it
as needed. It must be copied to your slurm-login-slinky pod, which can be done
with the following command, where <hash> is replaced with the hash of the
current slurm-login-slinky pod:
kubectl -n slurm cp ~/location/of/script/pytorch-sbatch.sh slurm-login-slinky-<hash>:/root/
The script can then be run from the slurm-login-slinky pod with:
sbatch pytorch-sbatch.sh
The output should be available in a file named pytorch-output_<JOB>.out,
created on the worker node on which the job ran. Any errors will be in
pytorch-errors_<JOB>.err. To retrieve these files, copy them from the
slurm-worker-slinky pod with the following command, where <N> is replaced by
the ordinal of the pod on which the job ran, and <JOB> by the job id:
kubectl -n slurm cp slurm-worker-slinky-<N>:/root/pytorch-output_<JOB>.out .
For further reading, see the Slinky and Slurm documentation.