Hybrid

Overview

A hybrid cluster is one that combines more than one type of infrastructure orchestration: bare-metal, Virtual Machines (VMs), containers (e.g. Kubernetes, Docker), and cloud infrastructure (e.g. AWS, GCP, Azure, OpenStack).

Through the slurm-operator and its CRDs, a hybrid Slurm cluster can be expressed such that some Slurm cluster components live in Kubernetes and other components live externally.

Networking

In the context of a hybrid configuration, there are two traffic routes to take into account: Internal-Internal communication; and External-Internal communication. Kubernetes Internal-Internal communication typically is pod to pod traffic, which is a flat network with DNS. External-Internal communication typically involves external traffic being proxied via NAT to a pod.

Slurm expects a fully connected network with bidirectional communication between all Slurm daemons and clients. This means NAT-style networks will generally impede communication.

Therefore, the network must be configured so that Slurm components can communicate directly. There are two setups to choose from, each with its own benefits and drawbacks.

Host Network

This approach avoids the Kubernetes NAT by having the Slurm pods use the Kubernetes node's host network directly. While it is the simplest approach, it does have security and Slurm configuration considerations.

Each Slurm pod would be configured as follows.

hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet

Note

Controller and Accounting do not support this option due to a Slurm configuration race with Kubernetes.

Warning

Only one pod with host network enabled can run on a Kubernetes node at a time. It will inherit the node's hostname and run within the host's network namespace, giving the pod access to the entire network and all ports.
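As a sketch, these fields sit at the pod spec level. A minimal standalone Pod manifest illustrating their placement might look like the following; the pod name, container name, and image are placeholders, not Slinky defaults.

```yaml
# Illustrative Pod using the node's host network.
apiVersion: v1
kind: Pod
metadata:
  name: slurmd-host-network   # placeholder name
  namespace: slurm
spec:
  hostNetwork: true                   # share the node's network namespace
  dnsPolicy: ClusterFirstWithHostNet  # keep cluster DNS resolution working
  containers:
    - name: slurmd
      image: example.com/slurmd:latest  # placeholder image
      ports:
        - containerPort: 6818           # slurmd's default port
```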

Network Peering

This approach configures network peering such that internal and external services can directly communicate. While it is the most complex approach, it does not diminish security, and minimal Slurm configuration is needed.

This typically involves configuring an advanced CNI, like Calico, with network peering for bidirectional communication across Kubernetes boundaries.
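For example, with Calico this can be achieved by peering the cluster with an external router over BGP so pod addresses are routable from outside. A minimal sketch of a Calico BGPPeer resource follows; the peer IP and AS number are placeholders for your site's router.

```yaml
# Illustrative Calico BGPPeer; peerIP and asNumber are placeholders.
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: external-slurm-peer
spec:
  peerIP: 192.0.2.1   # address of the external BGP router
  asNumber: 64512     # the router's AS number
```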

Generally, no special configuration is required for the Slurm helm chart.

Slurm Configuration

Slinky currently requires that Slurm use configless, auth/slurm, auth/jwt, and use_client_ids. This dictates how Slurm clusters can be defined.

Copy slurm.key as a secret in Kubernetes.

kubectl create secret generic external-auth-slurm \
  --namespace=slurm --from-file="slurm.key=/etc/slurm/slurm.key"

Copy jwt_hs256.key as a secret in Kubernetes.

kubectl create secret generic external-auth-jwths256 \
  --namespace=slurm --from-file="jwt_hs256.key=/etc/slurm/jwt_hs256.key"

When configuring the Slurm helm chart, set the Slurm key and JWT key references to the secrets that were copied into Kubernetes; otherwise, Slurm components will be unable to authenticate with the rest of the Slurm cluster.

slurmKeyRef:
  name: external-auth-slurm
  key: slurm.key
jwtHs256KeyRef:
  name: external-auth-jwths256
  key: jwt_hs256.key

Warning

Mixing containerized slurmd with non-containerized slurmd may be problematic due to Slurm’s assumed homogeneous configuration across all nodes. Notably, cgroup.conf with IgnoreSystemd=yes may not work on both types of nodes.
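For reference, the containerized workers typically rely on a cgroup.conf along these lines, which an external slurmd on a systemd-managed host may not tolerate. This is an illustrative fragment, not the chart's exact rendered configuration.

```
# cgroup.conf shared via configless mode (illustrative)
CgroupPlugin=cgroup/v2
IgnoreSystemd=yes    # needed inside containers, but may break slurmd
                     # on bare-metal hosts managed by systemd
```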

External Slurmdbd

When slurmdbd is external to Kubernetes, the Slurm helm chart needs to have the accounting CR configured such that it knows how to communicate with it.

accounting:
  external: true
  externalConfig:
    host: $SLURMDBD_HOST
    port: $SLURMDBD_PORT # Default: 6819

External Slurmctld

When slurmctld is external to Kubernetes, the Slurm helm chart needs to have the controller CR configured such that it knows how to communicate with it.

controller:
  external: true
  externalConfig:
    host: $SLURMCTLD_HOST
    port: $SLURMCTLD_PORT # Default: 6817

External Slurmd

When slurmd is external to Kubernetes, the Slurm helm chart only provides additional workers. The external slurmd must be started with the following option.

slurmd --conf-server "${SLURMCTLD_HOST}:${SLURMCTLD_PORT}"
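On a systemd-managed host, one way to persist this flag is a drop-in override for the slurmd unit. The path layout is standard systemd; the controller hostname below is a placeholder.

```ini
# /etc/systemd/system/slurmd.service.d/override.conf (illustrative)
[Service]
ExecStart=
ExecStart=/usr/sbin/slurmd --conf-server slurmctld.example.com:6817
```

Apply with `systemctl daemon-reload` followed by `systemctl restart slurmd`.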

External Login

When login hosts are external to Kubernetes, the Slurm helm chart only provides additional login pods. The external sackd must be started with the following option.

sackd --conf-server "${SLURMCTLD_HOST}:${SLURMCTLD_PORT}"

External Slurmrestd

The Slurm helm chart always provides a slurmrestd pod so that the slurm-operator can use it to correctly take action on Slurm resources within Kubernetes.

You may still run a slurmrestd that is accessible outside of Kubernetes to handle requests from outside the cluster.
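One way to expose such an instance is a LoadBalancer Service in front of the slurmrestd pods. This is a sketch only: the selector label is an assumption and must be matched to your deployment's actual pod labels.

```yaml
# Illustrative Service exposing slurmrestd outside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: slurmrestd-external
  namespace: slurm
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: slurmrestd  # hypothetical label; match your pods
  ports:
    - port: 6820        # slurmrestd's default port
      targetPort: 6820
```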