
Erythix

SLURM in a nutshell: Architecture, Observability and Security for HPC Clusters

SLURM powers Frontier, LUMI, Perlmutter, and most of the TOP500. If you work with GPU clusters, AI training infrastructure, or scientific computing, understanding how it works is not optional.


What is SLURM?

SLURM (Simple Linux Utility for Resource Management) is an open-source cluster workload manager originally developed at Lawrence Livermore National Laboratory [1]. It is now the de facto standard for HPC environments worldwide, deployed on more than 60% of TOP500 systems [2].

It has three core responsibilities:

Resource allocation assigns compute nodes to jobs based on configured policies: partitions, Quality of Service (QOS) rules, and fairshare weights. It accounts for CPU cores, memory, GPU devices, and network topology simultaneously.

Job scheduling queues submitted jobs and launches them when resources become available. The default algorithm is backfill scheduling, which fills scheduling gaps with smaller jobs without delaying the larger ones already queued.

Accounting records every resource consumption event — who ran what, on which nodes, for how long, consuming how much CPU, memory, and GPU — via a dedicated daemon connected to a relational database.

It operates on a heartbeat model: nodes report their state to a central controller, which dispatches queued jobs as resources free up.


Architecture

The Four Daemons

+------------------------------------------------------------------+
|                        CONTROL PLANE                             |
|                                                                  |
|   +------------------+          +------------------+            |
|   |   slurmctld      |<-------->|   slurmdbd       |            |
|   |   TCP 6817       |          |   TCP 6819       |            |
|   |                  |          |                  |            |
|   |  Scheduler       |          |  Accounting GW   |            |
|   |  State manager   |          |  Only DB client  |            |
|   +--------+---------+          +--------+---------+            |
|            |                             |                       |
+------------|-----------------------------|-----------------------+
             |                             |
             | TCP 6818                    | SQL TCP 3306
             v                             v
+---------------------------+    +--------------------+
|   COMPUTE NODES           |    |   MariaDB          |
|                           |    |   Accounting DB    |
|   slurmd   slurmd   ...   |    +--------------------+
|   node01   node02         |
|                           |
|   cgroups v2 enforcement  |
|   Prolog / Epilog hooks   |
+---------------------------+
             ^
             |
     +-------+--------+
     |   slurmrestd   |
     |   TCP 6820     |
     |   OpenAPI/JWT  |
     +----------------+

slurmctld — Controller Daemon (TCP 6817)

The brain of the cluster. It maintains the global state of every node and every job in memory, periodically checkpointing to disk (the StateSaveLocation directory). On restart after a failure, it replays this state to resume operations without losing queued or running jobs.

Key responsibilities:

  • Runs the scheduler plugin (backfill by default, with optional gang scheduling)
  • Manages node state transitions (IDLE, ALLOCATED, DOWN, DRAIN, FAIL)
  • Dispatches jobs to slurmd on compute nodes
  • Enforces partition and QOS limits
  • Processes all client commands (sbatch, srun, scontrol)

High availability is supported via a primary/backup pair. If the primary slurmctld fails, the backup takes over within seconds, with minimal job disruption [3].

slurmd — Node Daemon (TCP 6818)

One instance runs on every compute node. It is the execution agent: it receives job steps dispatched by slurmctld, spawns user processes inside cgroup hierarchies, monitors resource consumption continuously, and sends periodic heartbeats back to the controller.

When a heartbeat is missed beyond the configured SlurmdTimeout, the controller marks the node as DOWN and can optionally reschedule its jobs.
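The liveness logic can be sketched in a few lines of Python (a toy model for illustration, not SLURM's implementation; all names are invented):

```python
import time

SLURMD_TIMEOUT = 300  # seconds, mirrors SlurmdTimeout in slurm.conf

class NodeTracker:
    """Toy model of the controller's node-liveness bookkeeping."""

    def __init__(self, timeout=SLURMD_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}   # node name -> last heartbeat timestamp
        self.state = {}       # node name -> "IDLE" | "DOWN" | ...

    def heartbeat(self, node, now=None):
        """Record a heartbeat; a DOWN node that reports in comes back."""
        self.last_seen[node] = now if now is not None else time.monotonic()
        if self.state.get(node) == "DOWN":
            self.state[node] = "IDLE"
        else:
            self.state.setdefault(node, "IDLE")

    def sweep(self, now=None):
        """Mark nodes DOWN if no heartbeat arrived within the timeout."""
        now = now if now is not None else time.monotonic()
        for node, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.state[node] = "DOWN"
        return self.state
```

With a 300 s timeout, a node silent for 350 s is swept into DOWN, and recovers to IDLE on its next heartbeat.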

slurmd also runs the site-defined Prolog script before launching each job (environment setup, filesystem mounting, health checks) and the Epilog script after completion (cleanup, unmounting, node validation).

slurmdbd — Database Daemon (TCP 6819)

The exclusive gateway to the accounting database. No other daemon connects to MariaDB directly. This design creates a single point of control for all historical data: job records, resource consumption, user associations, QOS definitions, and the fairshare tree.

slurmdbd can run on a dedicated server, isolated from the controller. Losing it does not stop job execution — running jobs continue — but new accounting records are buffered locally on slurmctld and flushed when connectivity is restored.

slurmrestd — REST API Daemon (TCP 6820)

Available since SLURM 20.11 [4], slurmrestd exposes the full SLURM management interface as an OpenAPI-documented REST API. It bridges REST calls to internal SLURM RPC, enabling integration with web portals, JupyterHub, workflow orchestrators (Nextflow, Snakemake, Apache Airflow), and cloud bursting systems.

Authentication is via JWT tokens. The API surface is significant and must be treated as a privileged endpoint.


Communication Flows

User (sbatch / srun / salloc)
        |
        | TCP 6817 — job submit, validated against associations + QOS
        v
  +-------------+   TCP 6819   +-------------+   SQL   +-----------+
  | slurmctld   |<------------>| slurmdbd    |-------->| MariaDB   |
  +------+------+   accounting +-------------+         +-----------+
         |
         | TCP 6818 — job dispatch (JobID, allocated nodes, resources)
         |
    +----+----+
    |         |
slurmd #1   slurmd #2  ...
    |
    +-- cgroups v2 (memory.max, cpu.max, devices allowlist)
    +-- Prolog  (runs as root before job)
    +-- job step (runs as user)
    +-- Epilog  (runs as root after job)
    +-- heartbeat -> slurmctld every SlurmdTimeout/3

slurmrestd --REST/JWT--> slurmctld (internal RPC)

All inter-daemon messages: signed + timestamped by MUNGE

Every message exchanged between slurmctld, slurmdbd, and slurmd is signed and timestamped by MUNGE (MUNGE Uid 'N' Gid Emporium). A credential contains the UID/GID of the originating process, a timestamp, and a configurable TTL. Replayed credentials are rejected [5].
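A minimal sketch of the same idea (an HMAC-signed, timestamped credential with a TTL check), purely for illustration; MUNGE's real credential format and cipher options differ:

```python
import hashlib, hmac, json, time

TTL = 300  # seconds, matching MUNGE's default credential lifetime

def mint(key: bytes, uid: int, gid: int, now=None) -> bytes:
    """Create a signed, timestamped credential (MUNGE-like, much simplified)."""
    payload = json.dumps({
        "uid": uid, "gid": gid,
        "ts": now if now is not None else time.time(),
    }).encode()
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    return payload + b"." + sig

def verify(key: bytes, cred: bytes, now=None):
    """Check the signature and the TTL; raise on tampering or expiry."""
    payload, sig = cred.rsplit(b".", 1)
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    fields = json.loads(payload)
    now = now if now is not None else time.time()
    if now - fields["ts"] > TTL:
        raise ValueError("credential expired (possible replay)")
    return fields["uid"], fields["gid"]
```

A credential minted at T verifies at T+100 but is rejected at T+400, which is the replay-protection property the TTL provides.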


Scheduling Deep Dive

Backfill Scheduling

The default sched/backfill plugin extends simple first-in-first-out scheduling by maintaining a time-ordered reservation list. When a large job cannot start immediately, the scheduler looks for smaller jobs that can be inserted into the scheduling gap without pushing back the start time of the large job [6].

This is why you sometimes see a small 2-node job start before a 100-node job that was submitted earlier: the 100-node job is waiting for enough nodes to free up, and the 2-node job fits in the current available capacity without affecting the projected start time.

Queue state:
  Job A: 100 nodes, submitted T+0, cannot start (only 20 nodes free)
  Job B: 10 nodes, submitted T+10

Backfill logic:
  - Job A projected start: T+45 (when enough nodes finish current jobs)
  - Job B can complete before T+45 if started now
  - Job B is scheduled immediately without delaying Job A
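The decision rule above can be sketched as a single predicate (a simplified model; real backfill also reasons about per-node resources and reservations):

```python
def can_backfill(candidate_nodes: int, candidate_walltime: int,
                 free_nodes: int, head_job_start: int, now: int = 0) -> bool:
    """A small job may start now iff it fits in the currently free nodes
    AND finishes before the head-of-queue job's projected start time."""
    fits = candidate_nodes <= free_nodes
    done_in_time = now + candidate_walltime <= head_job_start
    return fits and done_in_time
```

In the scenario above: Job B (10 nodes) backfills at T+10 if its walltime ends before T+45, but not if its walltime would push past Job A's projected start, and not if it needs more than the 20 free nodes.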

Priority Calculation

SLURM computes a weighted sum for each queued job [7]:

Priority = w_age        * factor_age
         + w_fairshare  * factor_fairshare
         + w_jobsize    * factor_jobsize
         + w_qos        * factor_qos
         + w_partition  * factor_partition
         + w_assoc      * factor_assoc

The fairshare factor is the most important for multi-tenant clusters. It is computed using a decay algorithm: resource usage from the past contributes less weight over time (configured by PriorityDecayHalfLife). A user who ran 10,000 CPU-hours last week has a lower fairshare score than a user who has not submitted a job in two weeks, pushing the inactive user's jobs to higher priority.

The tool sprio shows the current priority breakdown for every queued job.
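The decay and the weighted sum can be sketched as follows (illustrative only; SLURM's real fairshare normalizes usage across the association tree, which this skips):

```python
def fairshare_decayed_usage(usage_cpu_hours: float, age_days: float,
                            half_life_days: float = 7.0) -> float:
    """Past usage loses half its weight every PriorityDecayHalfLife."""
    return usage_cpu_hours * 0.5 ** (age_days / half_life_days)

def priority(weights: dict, factors: dict) -> float:
    """The multifactor weighted sum, as sprio breaks it down."""
    return sum(weights[k] * factors.get(k, 0.0) for k in weights)
```

With a 7-day half-life, 10,000 CPU-hours burned a week ago weigh like 5,000 today and 2,500 in another week, which is exactly why the idle user's jobs drift upward in priority.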

QOS and Associations

The association tree controls access at every level:

Cluster: mycluster
  |
  +-- Account: research_lab        (FairShare: 40)
  |       |
  |       +-- User: alice          (FairShare: 20)
  |       |     QOS: normal, gpu_priority
  |       |     MaxTRES: cpu=256,gres/gpu=8
  |       |
  |       +-- User: bob            (FairShare: 20)
  |             QOS: normal
  |             MaxTRES: cpu=128
  |
  +-- Account: ops_team            (FairShare: 60)
          |
          +-- User: carol          (FairShare: 60)
                QOS: normal, high_priority, infra
                MaxTRES: cpu=512,gres/gpu=32

A QOS defines hard limits (GrpTRES, MaxTRESPerJob, MaxWallDurationPerJob) and soft priority boosts. When a user submits a job requesting resources beyond their association or QOS limits, the job is rejected at submission time, not at scheduling time.
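The submission-time check amounts to comparing the requested TRES against the limit; a sketch (parse_tres and violations are invented helper names):

```python
def parse_tres(spec: str) -> dict:
    """Parse a TRES string like 'cpu=256,gres/gpu=8' into a dict."""
    out = {}
    for item in spec.split(","):
        name, _, value = item.partition("=")
        out[name.strip()] = int(value)
    return out

def violations(requested: str, limit: str) -> list:
    """Return TRES names where the request exceeds the limit;
    a non-empty list means rejection at submission time."""
    req, lim = parse_tres(requested), parse_tres(limit)
    return [k for k, v in req.items() if k in lim and v > lim[k]]
```

For bob (MaxTRES: cpu=128), a request of cpu=512 is flagged immediately, whereas a request within limits passes cleanly.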


Job Lifecycle

  SUBMIT         QUEUE          ALLOCATE         RUN          COMPLETE
     |               |               |              |               |
 sbatch          PENDING          Nodes          RUNNING        COMPLETED
 script.sh        state          reserved         state           state
     |               |               |              |               |
     v               v               v              v               v
 slurmctld      Scheduler        slurmd         slurmd          slurmdbd
 validates      computes         runs           monitors        records
 resources      priority         Prolog         CPU/mem/GPU     all metrics
 + QOS limits   backfill         cgroups        heartbeats      to MariaDB
                analysis         configured     to controller

Submission

#!/bin/bash
#SBATCH --job-name=train_llm
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:a100:8
#SBATCH --mem=512G
#SBATCH --time=48:00:00
#SBATCH --account=research_lab
#SBATCH --qos=gpu_priority

module load cuda/12.2
srun python train.py --config config.yaml

slurmctld validates this script against:

  1. The partition definition (nodes available, max wall time)
  2. The user's association (account exists, user is a member)
  3. The QOS (resource limits not exceeded)
  4. Current cluster capacity (enough GPUs exist)

If all checks pass, the job receives a JobID and enters the PENDING state.

Execution on Nodes

When slurmctld dispatches the job, each slurmd on the allocated nodes:

  1. Runs the site Prolog (as root)
  2. Creates the cgroup hierarchy for the job
  3. Sets memory.max, cpu.max, and the GPU device allowlist
  4. Spawns slurmstepd, which drops privileges to the user and executes the job step
  5. Monitors consumption every JobAcctGatherFrequency seconds
  6. Runs the Epilog on completion (as root)
  7. Reports final resource usage to slurmctld, which forwards it to slurmdbd
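Steps 2-3 boil down to computing a few cgroup control-file values from the allocation; a sketch of that translation (cgroup_limits is an invented helper):

```python
def cgroup_limits(mem_mb: int, cores: int, period_us: int = 100_000) -> dict:
    """Translate a job allocation into cgroup v2 control-file values.
    cpu.max is '<quota> <period>' in microseconds: quota = cores * period."""
    return {
        "memory.max": str(mem_mb * 1024 * 1024),  # bytes
        "memory.swap.max": "0",                   # no swap for HPC jobs
        "cpu.max": f"{cores * period_us} {period_us}",
    }
```

A 64-core, 64 GB allocation yields cpu.max = "6400000 100000" (64 full cores against the default 100 ms period) and memory.max in bytes.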

Job Arrays

For parameter sweeps, job arrays avoid submitting thousands of individual jobs:

#SBATCH --array=0-99%10    # 100 tasks, max 10 running simultaneously

PARAM=${SLURM_ARRAY_TASK_ID}
python experiment.py --seed $PARAM

Each task gets its own JobID (formatted as ArrayJobID_TaskID) and its own accounting record. The %10 limits concurrent tasks to avoid saturating the cluster.
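Parsing the array spec is mechanical; a sketch (helper names invented, and the 4242 JobID is just an example value):

```python
def parse_array(spec: str):
    """Parse an --array spec like '0-99%10' into (task_ids, max_concurrent)."""
    body, _, throttle = spec.partition("%")
    start, _, end = body.partition("-")
    ids = list(range(int(start), int(end or start) + 1))
    return ids, int(throttle) if throttle else None

def task_job_id(array_job_id: int, task_id: int) -> str:
    """Format the per-task identifier as ArrayJobID_TaskID."""
    return f"{array_job_id}_{task_id}"
```

So '0-99%10' expands to 100 task IDs with at most 10 running at once, and task 7 of array job 4242 is reported as 4242_7.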


Observability Stack

Architecture

Compute Nodes
+------------------------+    +------------------------+
| slurmd                 |    | DCGM Exporter          |
|                        |    | (NVIDIA GPU metrics)   |
| slurm-exporter :8080   |    | :9400                  |
|  slurm_jobs_running    |    |  DCGM_FI_DEV_GPU_UTIL  |
|  slurm_jobs_pending    |    |  DCGM_FI_DEV_MEM_COPY  |
|  slurm_nodes_alloc     |    |  DCGM_FI_DEV_NVLINK_*  |
|  slurm_cpus_idle       |    |  label: slurm_job_id   |
+----------+-------------+    +----------+-------------+
           |                             |
           | Prometheus scrape           | Prometheus scrape
           v                             v
+-----------------------------------------------+
|   VMAgent (per node or centralized)           |
|   Relabeling, filtering, remote_write         |
+-------------------+---------------------------+
                    |
                    | remote_write
                    v
+-----------------------------------------------+
|   VictoriaMetrics (vminsert / vmstorage)      |
|   Long-term storage, MetricsQL                |
+-------------------+---------------------------+
                    |
                    | datasource
                    v
+-----------------------------------------------+
|   Grafana                                     |
|   Job efficiency dashboards                   |
|   GPU heatmaps, fairshare visualization       |
|   Alerting (PagerDuty, Slack)                 |
+-----------------------------------------------+

slurm-exporter

The prometheus-slurm-exporter shells out to SLURM CLI tools (squeue, sinfo, sacct) and exposes the parsed results as metrics on port 8080 [8].

Key metrics exposed:

Metric                    Description
slurm_jobs_running        Count of running jobs, by partition
slurm_jobs_pending        Count of pending jobs, by reason
slurm_nodes_alloc         Nodes in ALLOCATED state
slurm_nodes_idle          Nodes in IDLE state
slurm_nodes_down          Nodes in DOWN/DRAIN state
slurm_cpus_total          Total CPUs in cluster
slurm_cpus_idle           Idle CPUs
slurm_account_cpu_count   CPUs used per account

A known limitation: the exporter calls CLI binaries, which adds latency and load at scale (thousands of jobs). At very large scale, prefer reading directly from slurmctld's state files or using slurmrestd as a data source.
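Querying slurmrestd from Python needs nothing beyond the standard library; a sketch, assuming the v0.0.40 OpenAPI path (the version segment varies by SLURM release, so adjust to match yours):

```python
import json
import urllib.request

API_VERSION = "v0.0.40"  # release-dependent; check your slurmrestd

def build_request(base_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated request for the slurmrestd job list."""
    return urllib.request.Request(
        f"{base_url}/slurm/{API_VERSION}/jobs",
        headers={"X-SLURM-USER-TOKEN": token},  # JWT from `scontrol token`
    )

def list_jobs(base_url: str, token: str):
    """Fetch the current job list as Python dicts."""
    with urllib.request.urlopen(build_request(base_url, token)) as resp:
        return json.load(resp)["jobs"]
```

Pointed at a collector instead of a shell loop, this avoids the CLI fork-per-scrape overhead the exporter suffers from at scale.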

DCGM Exporter and GPU Correlation

The NVIDIA DCGM Exporter exposes per-GPU hardware metrics [9]:

DCGM_FI_DEV_GPU_UTIL{gpu="0", UUID="...", hostname="node01"} 94
DCGM_FI_DEV_FB_USED{gpu="0", ...} 38654
DCGM_FI_DEV_POWER_USAGE{gpu="0", ...} 387
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0", ...} 198432

To correlate GPU metrics with SLURM jobs, DCGM can be configured to expose the SLURM_JOB_ID environment variable as a label. This enables Grafana queries like:

# GPU efficiency for a specific job
DCGM_FI_DEV_GPU_UTIL{slurm_job_id="12345"}

This is the key insight for AI/ML workloads: raw GPU utilization tells you if GPUs are busy, but job_id correlation tells you which specific training run, user, or team is responsible.

Why VictoriaMetrics for HPC

Prometheus alone struggles with HPC-scale workloads for three reasons:

  1. Cardinality: a 1000-node cluster with 8 GPUs each, running thousands of jobs, generates millions of unique time series
  2. Retention: HPC accounting requires months or years of metrics for capacity planning and user reporting
  3. Query performance: job efficiency reports aggregate over large time ranges with complex label filters

VictoriaMetrics addresses all three [10]:

# vmagent config: distributed collection on compute nodes
scrape_configs:
  - job_name: slurm
    static_configs:
      - targets: ["localhost:8080"]
  - job_name: dcgm
    static_configs:
      - targets: ["localhost:9400"]

remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"

Compression ratios on HPC workloads are typically 10-15x better than Prometheus TSDB, and MetricsQL supports advanced aggregations like quantile_over_time and increase that are essential for wait time analysis.
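Two example MetricsQL queries in that spirit (the wait-time metric name is hypothetical and depends on which exporter you deploy; the DCGM metric and job label match the setup above):

```promql
# P95 queue wait over the last 7 days
# (slurm_job_wait_seconds is an illustrative metric name)
quantile_over_time(0.95, slurm_job_wait_seconds[7d])

# Average GPU utilization per job over the last 24h
avg by (slurm_job_id) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h]))
```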

KPIs That Actually Matter

Most HPC operators track GPU utilization and stop there. That is not enough. The metrics that reveal actual cluster health:

Metric             Formula                         Why it matters
CPU efficiency     used_cpus / alloc_cpus          Reveals job over-allocation and poor sizing
Memory waste       alloc_mem - max_rss             Often 40-60% on ML clusters
Wait time P95      start_time - submit_time        Scheduler health indicator
Fairshare drift    factor_fairshare over 30d       Detects long-term resource monopolies
GPU occupancy      DCGM_GPU_UTIL weighted by job   Distinguishes idle allocation from compute-bound
Job failure rate   failed / (completed + failed)   Infrastructure reliability signal

A sacct query for job efficiency after the fact:

sacct -j 12345 \
  --format=JobID,CPUTime,CPUTimeRAW,AveCPU,MaxRSS,ReqMem,Elapsed \
  --units=G
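Turning those sacct columns into an efficiency number takes a little unit handling, since ReqMem and MaxRSS carry suffixes; a sketch (helper names invented):

```python
def parse_size(s: str) -> float:
    """Convert a sacct size like '38654M' or '512G' to megabytes."""
    units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}
    s = s.strip().rstrip("nc")  # ReqMem may end in n (per node) or c (per CPU)
    if s and s[-1] in units:
        return float(s[:-1]) * units[s[-1]]
    return float(s)

def memory_waste(req_mem: str, max_rss: str) -> float:
    """Fraction of the memory allocation the job never touched."""
    return 1.0 - parse_size(max_rss) / parse_size(req_mem)
```

The training job above, allocated 512G with a MaxRSS of 38654M, wasted roughly 93% of its memory reservation, exactly the kind of over-allocation the KPI table is meant to surface.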

Security

Authentication: MUNGE

MUNGE is the default authentication mechanism for all inter-daemon communication [5]. Every message is signed with a shared secret (/etc/munge/munge.key), timestamped, and includes the originating UID/GID. A receiving daemon verifies the signature and rejects credentials outside the configured TTL window, preventing replay attacks.

Node A                              Node B
+------------------+                +------------------+
|  slurmctld       |                |  slurmd          |
|                  |--[credential]->|                  |
|  signs with      |                |  verifies with   |
|  munge.key       |                |  munge.key       |
|                  |<--[response]---|                  |
+------------------+                +------------------+

Credential contains:
  - UID / GID of sender
  - Timestamp (TTL: 300s default)
  - Realm (optional)
  - Payload (encrypted)

Key operational requirements:

  • munge.key must be identical on all nodes (controller + compute + login + slurmdbd server)
  • File permissions must be 0400, owned by the munge user
  • Distribution should use a secrets manager (HashiCorp Vault, Ansible Vault) rather than manual scp
  • Key rotation requires a coordinated restart of all SLURM daemons — the most disruptive operation on a live cluster

Key rotation procedure on a live cluster:

# 1. Generate new key on the controller
mungekey --create --keyfile /etc/munge/munge.key.new

# 2. Distribute to all nodes (use your config management tool)
ansible all -m copy \
  -a "src=/etc/munge/munge.key.new dest=/etc/munge/munge.key mode=0400 owner=munge"

# 3. Restart munge everywhere simultaneously (parallel SSH)
ansible all -m service -a "name=munge state=restarted"

# 4. Restart SLURM daemons in order
ansible compute -m service -a "name=slurmd state=restarted"
ansible controller -m service -a "name=slurmctld state=restarted"
ansible dbd -m service -a "name=slurmdbd state=restarted"

Resource Isolation: cgroups v2

Without cgroup enforcement, a job that allocates 64GB of memory can consume 512GB, triggering OOM kills across all other jobs on the node. SLURM's cgroup plugin prevents this [11].

slurmd receives job dispatch
        |
        v
Creates cgroup hierarchy:
/sys/fs/cgroup/system.slice/slurmstepd.scope/job_12345/
        |
        +-- memory.max        = 65536M   (allocated memory)
        +-- memory.swap.max   = 0        (no swap for HPC jobs)
        +-- cpu.max           = 6400000 100000  (64 cores x 100ms period)
        +-- GPU access        = GPUs 0,1 only (cgroup v2 filters devices
                                via eBPF; there is no devices.allow file)

Essential cgroup.conf settings:

CgroupPlugin=autodetect
ConstrainRAMSpace=yes       # OOM kill if job exceeds memory limit
ConstrainSwapSpace=yes      # Disable swap for job processes
ConstrainCores=yes          # Pin processes to allocated CPU cores
ConstrainDevices=yes        # Restrict GPU access to allocated devices
AllowedRAMSpace=100         # No tolerance: enforce hard limit
# Core binding: use TaskPlugin=task/affinity in slurm.conf
# (the TaskAffinity option in cgroup.conf is deprecated)

ConstrainRAMSpace=yes is non-negotiable in any multi-tenant environment. Without it, a misbehaving job can take down an entire node.

Authorization: RBAC and Associations

SLURM's authorization model is hierarchical. Access is validated at every layer:

Level 1 — Cluster
  Who can submit at all?

Level 2 — Account
  Which budget/project does the job charge to?
  What is the fairshare allocation?

Level 3 — User
  Individual limits within the account.

Level 4 — QOS
  Hard limits on resources, wall time, and concurrent jobs.
  Priority boosts or penalties.

Level 5 — Partition
  Which physical nodes? What maximum wall time?
  Restricted to specific groups (AllowGroups)?

Managing associations with sacctmgr:

# Create account hierarchy
sacctmgr add cluster mycluster
sacctmgr add account research_lab cluster=mycluster fairshare=40
sacctmgr add user alice account=research_lab defaultaccount=research_lab

# Define QOS
sacctmgr add qos gpu_priority \
  MaxTRESPerUser=cpu=256,gres/gpu=8 \
  MaxWallDurationPerJob=48:00:00 \
  Priority=100

# Assign QOS to user
sacctmgr modify user alice set qos+=gpu_priority

API Security: JWT and TLS

slurmrestd is the largest attack surface in a modern SLURM deployment. A compromised API token provides full cluster control: job submission, node management, user impersonation.

Hardening checklist:

# 1. Generate JWT signing key (HS256 is symmetric: use random bytes, not RSA)
dd if=/dev/random of=/etc/slurm/jwt_hs256.key bs=32 count=1
chmod 0600 /etc/slurm/jwt_hs256.key
chown slurm: /etc/slurm/jwt_hs256.key

# In slurm.conf:
# AuthAltTypes=auth/jwt
# AuthAltParameters=jwt_key=/etc/slurm/jwt_hs256.key

# 2. Issue short-lived tokens (1 hour max)
scontrol token username=alice lifespan=3600

# 3. Run behind nginx with rate limiting
# nginx.conf excerpt:
# limit_req_zone $binary_remote_addr zone=slurm_api:10m rate=10r/s;
# location /slurm/ {
#   limit_req zone=slurm_api burst=20 nodelay;
#   proxy_pass http://127.0.0.1:6820;
# }

# 4. Restrict port 6820 by firewall
# Only the proxy IP should reach slurmrestd directly
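For intuition, HS256 signing and verification fit in a few lines of standard-library Python (a sketch; SLURM's tokens use the 'sun' claim for the username, but treat the exact claim set here as illustrative):

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> bytes:
    """Base64url without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def make_token(key: bytes, username: str, lifespan: int, now=None) -> str:
    """Mint an HS256 JWT the way `scontrol token` conceptually does."""
    now = int(now if now is not None else time.time())
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = b64url(json.dumps({"sun": username, "iat": now,
                                "exp": now + lifespan}).encode())
    signing_input = header + b"." + claims
    sig = b64url(hmac.new(key, signing_input, hashlib.sha256).digest())
    return (signing_input + b"." + sig).decode()

def verify_token(key: bytes, token: str, now=None) -> dict:
    """Check the HMAC and the exp claim; raise on failure."""
    head, claims, sig = token.encode().split(b".")
    expected = b64url(hmac.new(key, head + b"." + claims,
                               hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    pad = b"=" * (-len(claims) % 4)
    decoded = json.loads(base64.urlsafe_b64decode(claims + pad))
    now = now if now is not None else time.time()
    if decoded["exp"] < now:
        raise ValueError("token expired")
    return decoded
```

This makes the threat model concrete: anyone holding jwt_hs256.key can mint tokens for any user, which is why the key's permissions and the short lifespan matter.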

Inter-daemon TLS is a recent addition (SLURM 24.11 introduced a TLS plugin); the parameters depend on your release, so check the documentation for your version before enabling it in slurm.conf:

TLSType=tls/s2n

Audit Trail

slurmdbd maintains a complete, immutable audit trail. Every job submission, modification, start, and completion is recorded with full resource accounting. This data is queryable via sacct:

# Full accounting for a user, last 30 days
sacct -u alice \
  --starttime=$(date -d '30 days ago' +%Y-%m-%d) \
  --format=JobID,JobName,Account,QOS,Partition,NCPUS,NNodes,\
           ReqMem,MaxRSS,CPUTime,Elapsed,State,ExitCode \
  --units=G

# Cluster-wide report
sreport cluster utilization \
  start=2024-01-01 end=2024-03-31 \
  -t hourper

For SIEM integration, SLURM writes structured logs to syslog. These can be forwarded to Wazuh, Elastic SIEM, or Splunk for correlation with authentication events and anomaly detection.


Key Configuration Files

File              Purpose                                   Critical settings
slurm.conf        Main config: nodes, partitions, plugins   SelectType, PriorityType, AccountingStorageType
slurmdbd.conf     Accounting daemon: DB credentials         Permissions must be 0600
cgroup.conf       Resource enforcement                      ConstrainRAMSpace, ConstrainDevices
gres.conf         GPU/FPGA topology and binding             GPU count, MIG partitions
topology.conf     Network topology for MPI placement        Switch hierarchy, InfiniBand fabric
acct_gather.conf  Per-job energy and I/O metrics            RAPL, InfiniBand, Lustre

Annotated slurm.conf for a GPU cluster

# Identity
ClusterName=mycluster
SlurmctldHost=controller01
SlurmctldHost=controller02  # HA backup

# Ports
SlurmctldPort=6817
SlurmdPort=6818

# Scheduler
SchedulerType=sched/backfill
SelectType=select/cons_tres           # Consumable resources: Track
SelectTypeParameters=CR_Core_Memory   # individual CPUs and memory
SchedulerParameters=bf_max_job_test=500,bf_resolution=60

# Priority (multifactor)
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightJobSize=100
PriorityDecayHalfLife=7-0             # 7 days half-life for fairshare
PriorityMaxAge=7-0

# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=controller01
AccountingStoragePort=6819
# Database credentials live in slurmdbd.conf, never in slurm.conf
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30             # Collect every 30s

# Task and process tracking
TaskPlugin=task/cgroup,task/affinity
ProctrackType=proctrack/cgroup

# GRES (GPU)
GresTypes=gpu

# Timeouts
SlurmdTimeout=300
SlurmctldTimeout=120
MessageTimeout=10

# Logging
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldDebug=info
SlurmdDebug=info

# Nodes (example: 16 nodes, 8x A100 each)
NodeName=node[01-16] \
  CPUs=64 \
  RealMemory=512000 \
  Gres=gpu:a100:8 \
  State=UNKNOWN

# Partitions
PartitionName=gpu \
  Nodes=node[01-16] \
  MaxTime=INFINITE \
  DefaultTime=24:00:00 \
  State=UP \
  Default=YES

PartitionName=debug \
  Nodes=node[01-02] \
  MaxTime=1:00:00 \
  Priority=100 \
  State=UP

Operational Runbook: Common Tasks

Drain a node for maintenance

# Drain: no new jobs, current jobs finish
scontrol update NodeName=node05 State=DRAIN Reason="scheduled maintenance"

# Check when node will be empty
squeue -w node05

# After jobs finish, confirm drain
scontrol show node node05 | grep State

# Return to service
scontrol update NodeName=node05 State=RESUME

Hold and release a job

# Hold a pending job (prevents scheduling)
scontrol hold 12345

# Release
scontrol release 12345

# Requeue a failed running job
scontrol requeue 12345

Identify wasted resources

# Jobs where memory usage < 50% of allocation
# (--units=M puts ReqMem and MaxRSS in the same unit; the awk +0 coercion
#  strips the trailing suffix before comparing)
sacct --format=JobID,ReqMem,MaxRSS,CPUTime,AveCPU \
  --state=COMPLETED \
  --starttime=2024-01-01 \
  --units=M --parsable2 --noheader \
  | awk -F'|' '$2+0 > 0 && $3+0 > 0 && ($3+0)/($2+0) < 0.5'

Summary

SLURM in one diagram:

User submits job (sbatch / srun / salloc)
        |
        v
slurmctld
   validates resources (partitions + associations + QOS)
   queues job (PENDING)
   computes priority (fairshare + QOS + age + jobsize)
   runs backfill scheduling
   dispatches to allocated nodes (RUNNING)
   records lifecycle to slurmdbd
        |
        +-- slurmdbd -> MariaDB (full accounting, audit trail)
        |
        +-- slurmd on each node
                |
                +-- cgroups v2   (memory, CPU, GPU isolation)
                +-- Prolog       (pre-job setup, root)
                +-- slurmstepd   (user process, MPI launch)
                +-- Epilog       (post-job cleanup, root)
                +-- heartbeat    (node health to slurmctld)
                |
                +-- slurm-exporter :8080  (job + node metrics)
                +-- DCGM Exporter  :9400  (GPU metrics + job_id)
                        |
                        v
                VMAgent -> VictoriaMetrics -> Grafana

Security stack:
  MUNGE           inter-daemon auth (shared key, signed credentials)
  cgroups v2      resource isolation (memory, CPU, GPU per job)
  Associations    RBAC + fairshare (cluster > account > user > QOS)
  JWT + TLS       API security (slurmrestd behind reverse proxy)
  sacct / slurmdbd  audit trail (full accounting, queryable)

The three files to master before anything else: slurm.conf, cgroup.conf, gres.conf. Everything else builds on top of them.


References


This article is part of the HPC Observability series. Next: Building GPU efficiency dashboards with VictoriaMetrics and Grafana for AI training workloads.


  1. Yoo, A.B., Jette, M.A., Grondona, M. (2003). SLURM: Simple Linux Utility for Resource Management. Lecture Notes in Computer Science, 2862, 44-60. https://doi.org/10.1007/10968987_3 

  2. TOP500 Editors (2023). Statistics on Resource Management Software. TOP500 Project. https://www.top500.org/statistics/details/rmsoftware/1 

  3. SchedMD LLC. (2024). High Availability in SLURM. SLURM Documentation. https://slurm.schedmd.com/high_availability.html 

  4. SchedMD LLC. (2024). REST API Guide. SLURM Documentation. https://slurm.schedmd.com/rest.html 

  5. Grondona, M. (2024). MUNGE Authentication Service. GitHub. https://github.com/dun/munge 

  6. Lifka, D. (1995). The ANL/IBM SP Scheduling System. Job Scheduling Strategies for Parallel Processing, 295-303. https://doi.org/10.1007/3-540-60153-8_31 

  7. SchedMD LLC. (2024). Multifactor Priority Plugin. SLURM Documentation. https://slurm.schedmd.com/priority_multifactor.html 

  8. Penso, V. et al. (2024). prometheus-slurm-exporter. GitHub. https://github.com/vpenso/prometheus-slurm-exporter 

  9. NVIDIA Corporation. (2024). DCGM Exporter. GitHub. https://github.com/NVIDIA/dcgm-exporter 

  10. VictoriaMetrics Team. (2024). VictoriaMetrics Documentation. https://docs.victoriametrics.com 

  11. SchedMD LLC. (2024). Cgroups Guide. SLURM Documentation. https://slurm.schedmd.com/cgroups.html 
