Design and Implementation of a Slurm-Based HPC Cluster

#architecture #automation #distributedsystems #infrastructure

The Problem
Managing a growing fleet of GPU and HPC servers one by one doesn't scale. Here's how we fixed it.

Our department has several computing resources, including GPU servers and HPC servers. Previously, these machines were managed and accessed individually. As the number of servers and users grew, so did the inefficiencies.

From an administrator's perspective, it was difficult to monitor resource usage and ensure fair utilization across machines. Some servers would become heavily loaded while others sat idle, with no automatic balancing in place.

From a user's perspective, researchers and students had to decide manually which server to use, with no visibility into current availability or load.

To address these challenges, we built a Slurm-based HPC cluster. Users now submit jobs with their resource requirements, and the scheduler automatically selects an appropriate compute node. This simplifies resource management and allows the department's computing infrastructure to be utilized far more efficiently.

What Is Slurm?

Slurm (originally an acronym for Simple Linux Utility for Resource Management) is an open-source workload manager widely used in HPC environments. It handles job queuing, scheduling, and resource allocation across a cluster of machines, letting users focus on their work rather than infrastructure logistics.

Architecture

The cluster consists of three main components:

Login node — the entry point where users connect and submit jobs
Controller node (CERF) — runs slurmctld, the central Slurm daemon responsible for scheduling decisions
Compute nodes — kepler (GPU node) and aiken (HPC node), each running slurmd to receive and execute assigned jobs

Users submit jobs to the controller, which schedules and dispatches them to the appropriate compute node based on resource availability and job requirements.

Communication between Slurm components is authenticated with MUNGE, which relies on a secret key shared by all cluster nodes. Any node holding the key can create and validate credentials, so the Slurm daemons can trust that a request really comes from the user and host it claims to. User authentication (logging in) is handled separately through the standard Linux user management system.
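
A common way to confirm that this trust is in place is the round-trip check described in the MUNGE documentation: encode a credential on one node and decode it on another. The sketch below wraps that check in two helper functions. The function names are hypothetical (they are not part of MUNGE or Slurm), and the remote check assumes SSH access between nodes, which is usually limited to administrators.

```shell
# Sketch: verify that MUNGE credentials round-trip locally and across nodes.
# check_munge_local / check_munge_remote are hypothetical helper names.

# Encode an empty-payload credential and decode it on the same node.
check_munge_local() {
    munge -n | unmunge
}

# Encode here, decode on a remote node. Decoding fails if the two nodes
# do not share the same MUNGE key, or if their clocks differ too much.
check_munge_remote() {
    remote_host="$1"
    munge -n | ssh "$remote_host" unmunge
}

# Example (as an administrator on the controller):
#   check_munge_local && check_munge_remote kepler
```

If the remote decode fails while the local one succeeds, the usual suspects are a mismatched key file or unsynchronized clocks.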

A Slurm partition groups the available compute resources, and the controller maintains real-time information about each node's state — allowing it to make informed scheduling decisions.
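
To make the pieces above concrete, here is a minimal sketch of how such a layout could be expressed in slurm.conf. It is not the cluster's actual configuration: the node and controller names come from this article, but the cluster name, the partition name, and every CPU, memory, and GPU count are placeholder assumptions.

```conf
# slurm.conf (excerpt) - illustrative values only
ClusterName=deptcluster      # hypothetical cluster name
SlurmctldHost=cerf           # controller node (hostname case assumed)
AuthType=auth/munge          # MUNGE authentication between daemons
GresTypes=gpu                # track GPUs as a generic resource

# Compute nodes: CPU, memory (MB), and GPU counts are placeholders
NodeName=kepler CPUs=32 RealMemory=128000 Gres=gpu:2 State=UNKNOWN
NodeName=aiken  CPUs=64 RealMemory=256000 State=UNKNOWN

# One partition grouping both nodes; the name "main" is an assumption
PartitionName=main Nodes=kepler,aiken Default=YES MaxTime=INFINITE State=UP
```

A working setup needs more than this excerpt (for example, a gres.conf on the GPU node describing its devices), but the sketch shows where node resources and the partition are declared so the controller can match jobs against them.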

Submitting Jobs

With the cluster operational, users submit workloads through standard Slurm commands. Instead of SSH-ing directly into a compute node, they work from the login node using commands such as:

srun — run a command interactively on an allocated node
sbatch — submit a batch script to be executed asynchronously
squeue — view the current job queue and job statuses
scancel — cancel a running or pending job

The request is sent to the Slurm controller, which identifies a suitable compute node and dispatches the job for execution.
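
As a rough illustration of that flow, the sketch below writes a small batch script and submits it. The file name, job name, and resource values are assumptions chosen for the example, and the nvidia-smi call assumes an NVIDIA driver on the GPU node. The guard around sbatch simply keeps the snippet harmless on a machine without Slurm.

```shell
# Sketch: write a small batch script, then submit it from the login node.
# File name, job name, and resource values are illustrative assumptions.
cat > gpu_hello.sbatch <<'EOF'
#!/bin/bash
# Name shown in squeue
#SBATCH --job-name=gpu-hello
# Output file; %j expands to the job ID
#SBATCH --output=gpu-hello-%j.out
# One task with four CPU cores
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
# One GPU, so only a node that offers GPUs qualifies
#SBATCH --gres=gpu:1
# Wall-clock limit (HH:MM:SS)
#SBATCH --time=00:10:00

# Everything below runs on the compute node Slurm selects.
echo "Running on $(hostname)"
nvidia-smi
EOF

# Submit only where Slurm is installed (i.e. on the login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch gpu_hello.sbatch     # prints the assigned job ID
    squeue -u "$USER"           # list your pending and running jobs
else
    echo "sbatch not found: run this on the cluster's login node"
fi
```

Because the script asks for a GPU, the scheduler can only place it on a node that advertises one. A pending or running job can be removed with scancel followed by its job ID.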

The figure below shows a simple srun command being submitted from the login node, scheduled by the controller, and executed on the kepler GPU node — with the output returned directly to the user's terminal.
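
An interactive request of that kind might look like the following sketch. The helper name run_on_gpu is hypothetical, and the resource flags are assumptions rather than the exact command from the figure.

```shell
# Sketch: run one command interactively on a GPU node via Slurm.
# run_on_gpu is a hypothetical helper; nothing is submitted until it is called.
run_on_gpu() {
    # --gres=gpu:1 requests one GPU; "$@" is the command to run on the node.
    srun --ntasks=1 --gres=gpu:1 "$@"
}

# Example (on the login node):
#   run_on_gpu hostname
```

Called with hostname, this prints the name of whichever node Slurm allocated, and the output appears in the terminal where the command was typed.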
