
Erythix

SLURM in a nutshell: Architecture, Observability and Security for HPC Clusters

SLURM powers Frontier, LUMI, Perlmutter, and most of the TOP500. If you work with GPU clusters, AI training infrastructure, or scientific computing, understanding how it works is not optional.


What is SLURM?

SLURM (Simple Linux Utility for Resource Management) is an open-source cluster workload manager originally developed at Lawrence Livermore National Laboratory [1]. It is now the de facto standard for HPC environments worldwide, deployed on more than 60% of TOP500 systems [2].

It has three core responsibilities:

Resource allocation assigns compute nodes to jobs based on configured policies: partitions, Quality of Service (QOS) rules, and fairshare weights. It accounts for CPU cores, memory, GPU devices, and network topology simultaneously.

Job scheduling queues submitted jobs and launches them when resources become available. The default algorithm is backfill scheduling, which fills scheduling gaps with smaller jobs without delaying the larger ones already queued.

Accounting records every resource consumption event — who ran what, on which nodes, for how long, consuming how much CPU, memory, and GPU — via a dedicated daemon connected to a relational database.

It operates on a heartbeat model: nodes report their state to a central controller, which dispatches queued jobs as resources free up.


Architecture

The Four Daemons

+------------------------------------------------------------------+
|                        CONTROL PLANE                             |
|                                                                  |
|   +------------------+          +------------------+            |
|   |   slurmctld      |<-------->|   slurmdbd       |            |
|   |   TCP 6817       |          |   TCP 6819       |            |
|   |                  |          |                  |            |
|   |  Scheduler       |          |  Accounting GW   |            |
|   |  State manager   |          |  Only DB client  |            |
|   +--------+---------+          +--------+---------+            |
|            |                             |                       |
+------------|-----------------------------|-----------------------+
             |                             |
             | TCP 6818                    | SQL TCP 3306
             v                             v
+---------------------------+    +--------------------+
|   COMPUTE NODES           |    |   MariaDB          |
|                           |    |   Accounting DB    |
|   slurmd   slurmd   ...   |    +--------------------+
|   node01   node02         |
|                           |
|   cgroups v2 enforcement  |
|   Prolog / Epilog hooks   |
+---------------------------+
             ^
             |
     +-------+--------+
     |   slurmrestd   |
     |   TCP 6820     |
     |   OpenAPI/JWT  |
     +----------------+

slurmctld — Controller Daemon (TCP 6817)

The brain of the cluster. It maintains the global state of every node and every job in memory, periodically checkpointing to disk (the StateSaveLocation directory). On restart after a failure, it replays this state to resume operations without losing queued or running jobs.

Key responsibilities:

  • Runs the scheduler plugin (backfill by default, with optional gang scheduling)
  • Manages node state transitions (IDLE, ALLOCATED, DOWN, DRAIN, FAIL)
  • Dispatches jobs to slurmd on compute nodes
  • Enforces partition and QOS limits
  • Processes all client commands (sbatch, srun, scontrol)

High availability is supported via a primary/backup pair. If the primary slurmctld fails, the backup takes over within seconds, with minimal job disruption [3].

slurmd — Node Daemon (TCP 6818)

One instance runs on every compute node. It is the execution agent: it receives job steps dispatched by slurmctld, spawns user processes inside cgroup hierarchies, monitors resource consumption continuously, and sends periodic heartbeats back to the controller.

When a heartbeat is missed beyond the configured SlurmdTimeout, the controller marks the node as DOWN and can optionally reschedule its jobs.
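The liveness logic can be sketched in a few lines of Python (a toy model for illustration, not SLURM's implementation; all names are invented):

```python
import time

SLURMD_TIMEOUT = 300  # seconds, mirrors SlurmdTimeout in slurm.conf

class NodeTracker:
    """Toy model of the controller's node-liveness bookkeeping."""

    def __init__(self, timeout=SLURMD_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}   # node name -> last heartbeat timestamp
        self.state = {}       # node name -> "IDLE" | "DOWN" | ...

    def heartbeat(self, node, now=None):
        """Record a heartbeat; a DOWN node that reports in comes back."""
        self.last_seen[node] = now if now is not None else time.monotonic()
        if self.state.get(node) == "DOWN":
            self.state[node] = "IDLE"
        else:
            self.state.setdefault(node, "IDLE")

    def sweep(self, now=None):
        """Mark nodes DOWN if no heartbeat arrived within the timeout."""
        now = now if now is not None else time.monotonic()
        for node, seen in self.last_seen.items():
            if now - seen > self.timeout:
                self.state[node] = "DOWN"
        return self.state
```

With a 300 s timeout, a node silent for 350 s is swept into DOWN, and recovers to IDLE on its next heartbeat.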

slurmd also runs the site-defined Prolog script before launching each job (environment setup, filesystem mounting, health checks) and the Epilog script after completion (cleanup, unmounting, node validation).

slurmdbd — Database Daemon (TCP 6819)

The exclusive gateway to the accounting database. No other daemon connects to MariaDB directly. This design creates a single point of control for all historical data: job records, resource consumption, user associations, QOS definitions, and the fairshare tree.

slurmdbd can run on a dedicated server, isolated from the controller. Losing it does not stop job execution — running jobs continue — but new accounting records are buffered locally on slurmctld and flushed when connectivity is restored.

slurmrestd — REST API Daemon (TCP 6820)

Available since SLURM 20.11 [4], slurmrestd exposes the full SLURM management interface as an OpenAPI-documented REST API. It bridges REST calls to internal SLURM RPC, enabling integration with web portals, JupyterHub, workflow orchestrators (Nextflow, Snakemake, Apache Airflow), and cloud bursting systems.

Authentication is via JWT tokens. The API surface is significant and must be treated as a privileged endpoint.


Communication Flows

User (sbatch / srun / salloc)
        |
        | TCP 6817 — job submit, validated against associations + QOS
        v
  +-------------+   TCP 6819   +-------------+   SQL   +-----------+
  | slurmctld   |<------------>| slurmdbd    |-------->| MariaDB   |
  +------+------+   accounting +-------------+         +-----------+
         |
         | TCP 6818 — job dispatch (JobID, allocated nodes, resources)
         |
    +----+----+
    |         |
slurmd #1   slurmd #2  ...
    |
    +-- cgroups v2 (memory.max, cpu.max, devices allowlist)
    +-- Prolog  (runs as root before job)
    +-- job step (runs as user)
    +-- Epilog  (runs as root after job)
    +-- heartbeat -> slurmctld every SlurmdTimeout/3

slurmrestd --REST/JWT--> slurmctld (internal RPC)

All inter-daemon messages: signed + timestamped by MUNGE

Every message exchanged between slurmctld, slurmdbd, and slurmd is signed and timestamped by MUNGE (MUNGE Uid 'N' Gid Emporium). A credential contains the UID/GID of the originating process, a timestamp, and a configurable TTL. Replayed credentials are rejected [5].
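A minimal sketch of the same idea (an HMAC-signed, timestamped credential with a TTL check), purely for illustration; MUNGE's real credential format and cipher options differ:

```python
import hashlib, hmac, json, time

TTL = 300  # seconds, matching MUNGE's default credential lifetime

def mint(key: bytes, uid: int, gid: int, now=None) -> bytes:
    """Create a signed, timestamped credential (MUNGE-like, much simplified)."""
    payload = json.dumps({
        "uid": uid, "gid": gid,
        "ts": now if now is not None else time.time(),
    }).encode()
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    return payload + b"." + sig

def verify(key: bytes, cred: bytes, now=None):
    """Check the signature and the TTL; raise on tampering or expiry."""
    payload, sig = cred.rsplit(b".", 1)
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    fields = json.loads(payload)
    now = now if now is not None else time.time()
    if now - fields["ts"] > TTL:
        raise ValueError("credential expired (possible replay)")
    return fields["uid"], fields["gid"]
```

A credential minted at T verifies at T+100 but is rejected at T+400, which is the replay-protection property the TTL provides.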


Scheduling Deep Dive

Backfill Scheduling

The default sched/backfill plugin extends simple first-in-first-out scheduling by maintaining a time-ordered reservation list. When a large job cannot start immediately, the scheduler looks for smaller jobs that can be inserted into the scheduling gap without pushing back the start time of the large job [6].

This is why you sometimes see a small 2-node job start before a 100-node job that was submitted earlier: the 100-node job is waiting for enough nodes to free up, and the 2-node job fits in the current available capacity without affecting the projected start time.

Queue state:
  Job A: 100 nodes, submitted T+0, cannot start (only 20 nodes free)
  Job B: 10 nodes, submitted T+10

Backfill logic:
  - Job A projected start: T+45 (when enough nodes finish current jobs)
  - Job B can complete before T+45 if started now
  - Job B is scheduled immediately without delaying Job A
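The decision rule above can be sketched as a single predicate (a simplified model; real backfill also reasons about per-node resources and reservations):

```python
def can_backfill(candidate_nodes: int, candidate_walltime: int,
                 free_nodes: int, head_job_start: int, now: int = 0) -> bool:
    """A small job may start now iff it fits in the currently free nodes
    AND finishes before the head-of-queue job's projected start time."""
    fits = candidate_nodes <= free_nodes
    done_in_time = now + candidate_walltime <= head_job_start
    return fits and done_in_time
```

In the scenario above: Job B (10 nodes) backfills at T+10 if its walltime ends before T+45, but not if its walltime would push past Job A's projected start, and not if it needs more than the 20 free nodes.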

Priority Calculation

SLURM computes a weighted sum for each queued job [7]:

Priority = w_age        * factor_age
         + w_fairshare  * factor_fairshare
         + w_jobsize    * factor_jobsize
         + w_qos        * factor_qos
         + w_partition  * factor_partition
         + w_assoc      * factor_assoc

The fairshare factor is the most important for multi-tenant clusters. It is computed using a decay algorithm: resource usage from the past contributes less weight over time (configured by PriorityDecayHalfLife). A user who ran 10,000 CPU-hours last week has a lower fairshare score than a user who has not submitted a job in two weeks, pushing the inactive user's jobs to higher priority.

The tool sprio shows the current priority breakdown for every queued job.
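The decay and the weighted sum can be sketched as follows (illustrative only; SLURM's real fairshare normalizes usage across the association tree, which this skips):

```python
def fairshare_decayed_usage(usage_cpu_hours: float, age_days: float,
                            half_life_days: float = 7.0) -> float:
    """Past usage loses half its weight every PriorityDecayHalfLife."""
    return usage_cpu_hours * 0.5 ** (age_days / half_life_days)

def priority(weights: dict, factors: dict) -> float:
    """The multifactor weighted sum, as sprio breaks it down."""
    return sum(weights[k] * factors.get(k, 0.0) for k in weights)
```

With a 7-day half-life, 10,000 CPU-hours burned a week ago weigh like 5,000 today and 2,500 in another week, which is exactly why the idle user's jobs drift upward in priority.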

QOS and Associations

The association tree controls access at every level:

Cluster: mycluster
  |
  +-- Account: research_lab        (FairShare: 40)
  |       |
  |       +-- User: alice          (FairShare: 20)
  |       |     QOS: normal, gpu_priority
  |       |     MaxTRES: cpu=256,gres/gpu=8
  |       |
  |       +-- User: bob            (FairShare: 20)
  |             QOS: normal
  |             MaxTRES: cpu=128
  |
  +-- Account: ops_team            (FairShare: 60)
          |
          +-- User: carol          (FairShare: 60)
                QOS: normal, high_priority, infra
                MaxTRES: cpu=512,gres/gpu=32

A QOS defines hard limits (GrpTRES, MaxTRESPerJob, MaxWallDurationPerJob) and soft priority boosts. When a user submits a job requesting resources beyond their association or QOS limits, the job is rejected at submission time, not at scheduling time.
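The submission-time check amounts to comparing the requested TRES against the limit; a sketch (parse_tres and violations are invented helper names):

```python
def parse_tres(spec: str) -> dict:
    """Parse a TRES string like 'cpu=256,gres/gpu=8' into a dict."""
    out = {}
    for item in spec.split(","):
        name, _, value = item.partition("=")
        out[name.strip()] = int(value)
    return out

def violations(requested: str, limit: str) -> list:
    """Return TRES names where the request exceeds the limit;
    a non-empty list means rejection at submission time."""
    req, lim = parse_tres(requested), parse_tres(limit)
    return [k for k, v in req.items() if k in lim and v > lim[k]]
```

For bob (MaxTRES: cpu=128), a request of cpu=512 is flagged immediately, whereas a request within limits passes cleanly.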


Job Lifecycle

  SUBMIT         QUEUE          ALLOCATE         RUN          COMPLETE
     |               |               |              |               |
 sbatch          PENDING          Nodes          RUNNING        COMPLETED
 script.sh        state          reserved         state           state
     |               |               |              |               |
     v               v               v              v               v
 slurmctld      Scheduler        slurmd         slurmd          slurmdbd
 validates      computes         runs           monitors        records
 resources      priority         Prolog         CPU/mem/GPU     all metrics
 + QOS limits   backfill         cgroups        heartbeats      to MariaDB
                analysis         configured     to controller

Submission

#!/bin/bash
#SBATCH --job-name=train_llm
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:a100:8
#SBATCH --mem=512G
#SBATCH --time=48:00:00
#SBATCH --account=research_lab
#SBATCH --qos=gpu_priority

module load cuda/12.2
srun python train.py --config config.yaml

slurmctld validates this script against:

  1. The partition definition (nodes available, max wall time)
  2. The user's association (account exists, user is a member)
  3. The QOS (resource limits not exceeded)
  4. Current cluster capacity (enough GPUs exist)

If all checks pass, the job receives a JobID and enters the PENDING state.

Execution on Nodes

When slurmctld dispatches the job, each slurmd on the allocated nodes:

  1. Runs the site Prolog (as root)
  2. Creates the cgroup hierarchy for the job
  3. Sets memory.max, cpu.max, and the GPU device allowlist
  4. Spawns slurmstepd, which drops privileges to the user and executes the job step
  5. Monitors consumption every JobAcctGatherFrequency seconds
  6. Runs the Epilog on completion (as root)
  7. Reports final resource usage to slurmctld, which forwards it to slurmdbd
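Steps 2-3 boil down to computing a few cgroup control-file values from the allocation; a sketch of that translation (cgroup_limits is an invented helper):

```python
def cgroup_limits(mem_mb: int, cores: int, period_us: int = 100_000) -> dict:
    """Translate a job allocation into cgroup v2 control-file values.
    cpu.max is '<quota> <period>' in microseconds: quota = cores * period."""
    return {
        "memory.max": str(mem_mb * 1024 * 1024),  # bytes
        "memory.swap.max": "0",                   # no swap for HPC jobs
        "cpu.max": f"{cores * period_us} {period_us}",
    }
```

A 64-core, 64 GB allocation yields cpu.max = "6400000 100000" (64 full cores against the default 100 ms period) and memory.max in bytes.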

Job Arrays

For parameter sweeps, job arrays avoid submitting thousands of individual jobs:

#SBATCH --array=0-99%10    # 100 tasks, max 10 running simultaneously

PARAM=${SLURM_ARRAY_TASK_ID}
python experiment.py --seed $PARAM

Each task gets its own JobID (formatted as ArrayJobID_TaskID) and its own accounting record. The %10 limits concurrent tasks to avoid saturating the cluster.
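Parsing the array spec is mechanical; a sketch (helper names invented, and the 4242 JobID is just an example value):

```python
def parse_array(spec: str):
    """Parse an --array spec like '0-99%10' into (task_ids, max_concurrent)."""
    body, _, throttle = spec.partition("%")
    start, _, end = body.partition("-")
    ids = list(range(int(start), int(end or start) + 1))
    return ids, int(throttle) if throttle else None

def task_job_id(array_job_id: int, task_id: int) -> str:
    """Format the per-task identifier as ArrayJobID_TaskID."""
    return f"{array_job_id}_{task_id}"
```

So '0-99%10' expands to 100 task IDs with at most 10 running at once, and task 7 of array job 4242 is reported as 4242_7.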


Observability Stack

Architecture

Compute Nodes
+------------------------+    +------------------------+
| slurmd                 |    | DCGM Exporter          |
|                        |    | (NVIDIA GPU metrics)   |
| slurm-exporter :8080   |    | :9400                  |
|  slurm_jobs_running    |    |  DCGM_FI_DEV_GPU_UTIL  |
|  slurm_jobs_pending    |    |  DCGM_FI_DEV_MEM_COPY  |
|  slurm_nodes_alloc     |    |  DCGM_FI_DEV_NVLINK_*  |
|  slurm_cpus_idle       |    |  label: slurm_job_id   |
+----------+-------------+    +----------+-------------+
           |                             |
           | Prometheus scrape           | Prometheus scrape
           v                             v
+-----------------------------------------------+
|   VMAgent (per node or centralized)           |
|   Relabeling, filtering, remote_write         |
+-------------------+---------------------------+
                    |
                    | remote_write
                    v
+-----------------------------------------------+
|   VictoriaMetrics (vminsert / vmstorage)      |
|   Long-term storage, MetricsQL                |
+-------------------+---------------------------+
                    |
                    | datasource
                    v
+-----------------------------------------------+
|   Grafana                                     |
|   Job efficiency dashboards                   |
|   GPU heatmaps, fairshare visualization       |
|   Alerting (PagerDuty, Slack)                 |
+-----------------------------------------------+

slurm-exporter

The prometheus-slurm-exporter shells out to SLURM CLI tools (squeue, sinfo, sacct) and exposes the parsed results as metrics on port 8080 [8].

Key metrics exposed:

Metric                    Description
slurm_jobs_running        Count of running jobs, by partition
slurm_jobs_pending        Count of pending jobs, by reason
slurm_nodes_alloc         Nodes in ALLOCATED state
slurm_nodes_idle          Nodes in IDLE state
slurm_nodes_down          Nodes in DOWN/DRAIN state
slurm_cpus_total          Total CPUs in cluster
slurm_cpus_idle           Idle CPUs
slurm_account_cpu_count   CPUs used per account

A known limitation: the exporter calls CLI binaries, which adds latency and load at scale (thousands of jobs). At very large scale, prefer reading directly from slurmctld's state files or using slurmrestd as a data source.
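Querying slurmrestd from Python needs nothing beyond the standard library; a sketch, assuming the v0.0.40 OpenAPI path (the version segment varies by SLURM release, so adjust to match yours):

```python
import json
import urllib.request

API_VERSION = "v0.0.40"  # release-dependent; check your slurmrestd

def build_request(base_url: str, token: str) -> urllib.request.Request:
    """Build an authenticated request for the slurmrestd job list."""
    return urllib.request.Request(
        f"{base_url}/slurm/{API_VERSION}/jobs",
        headers={"X-SLURM-USER-TOKEN": token},  # JWT from `scontrol token`
    )

def list_jobs(base_url: str, token: str):
    """Fetch the current job list as Python dicts."""
    with urllib.request.urlopen(build_request(base_url, token)) as resp:
        return json.load(resp)["jobs"]
```

Pointed at a collector instead of a shell loop, this avoids the CLI fork-per-scrape overhead the exporter suffers from at scale.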

DCGM Exporter and GPU Correlation

The NVIDIA DCGM Exporter exposes per-GPU hardware metrics [9]:

DCGM_FI_DEV_GPU_UTIL{gpu="0", UUID="...", hostname="node01"} 94
DCGM_FI_DEV_FB_USED{gpu="0", ...} 38654
DCGM_FI_DEV_POWER_USAGE{gpu="0", ...} 387
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0", ...} 198432

To correlate GPU metrics with SLURM jobs, DCGM can be configured to expose the SLURM_JOB_ID environment variable as a label. This enables Grafana queries like:

# GPU efficiency for a specific job
DCGM_FI_DEV_GPU_UTIL{slurm_job_id="12345"}

This is the key insight for AI/ML workloads: raw GPU utilization tells you if GPUs are busy, but job_id correlation tells you which specific training run, user, or team is responsible.

Why VictoriaMetrics for HPC

Prometheus alone struggles with HPC-scale workloads for three reasons:

  1. Cardinality: a 1000-node cluster with 8 GPUs each, running thousands of jobs, generates millions of unique time series
  2. Retention: HPC accounting requires months or years of metrics for capacity planning and user reporting
  3. Query performance: job efficiency reports aggregate over large time ranges with complex label filters

VictoriaMetrics addresses all three [10]:

# vmagent config: distributed collection on compute nodes
scrape_configs:
  - job_name: slurm
    static_configs:
      - targets: ["localhost:8080"]
  - job_name: dcgm
    static_configs:
      - targets: ["localhost:9400"]

remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"

Compression ratios on HPC workloads are typically 10-15x better than Prometheus TSDB, and MetricsQL supports advanced aggregations like quantile_over_time and increase that are essential for wait time analysis.
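Two example MetricsQL queries in that spirit (the wait-time metric name is hypothetical and depends on which exporter you deploy; the DCGM metric and job label match the setup above):

```promql
# P95 queue wait over the last 7 days
# (slurm_job_wait_seconds is an illustrative metric name)
quantile_over_time(0.95, slurm_job_wait_seconds[7d])

# Average GPU utilization per job over the last 24h
avg by (slurm_job_id) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[24h]))
```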

KPIs That Actually Matter

Most HPC operators track GPU utilization and stop there. That is not enough. The metrics that reveal actual cluster health:

Metric             Formula                         Why it matters
CPU efficiency     used_cpus / alloc_cpus          Reveals job over-allocation and poor sizing
Memory waste       alloc_mem - max_rss             Often 40-60% on ML clusters
Wait time P95      start_time - submit_time        Scheduler health indicator
Fairshare drift    factor_fairshare over 30d       Detects long-term resource monopolies
GPU occupancy      DCGM_GPU_UTIL weighted by job   Distinguishes idle allocation from compute-bound
Job failure rate   failed / (completed + failed)   Infrastructure reliability signal

A sacct query for job efficiency after the fact:

sacct -j 12345 \
  --format=JobID,CPUTime,CPUTimeRAW,AveCPU,MaxRSS,ReqMem,Elapsed \
  --units=G
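Turning those sacct columns into an efficiency number takes a little unit handling, since ReqMem and MaxRSS carry suffixes; a sketch (helper names invented):

```python
def parse_size(s: str) -> float:
    """Convert a sacct size like '38654M' or '512G' to megabytes."""
    units = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}
    s = s.strip().rstrip("nc")  # ReqMem may end in n (per node) or c (per CPU)
    if s and s[-1] in units:
        return float(s[:-1]) * units[s[-1]]
    return float(s)

def memory_waste(req_mem: str, max_rss: str) -> float:
    """Fraction of the memory allocation the job never touched."""
    return 1.0 - parse_size(max_rss) / parse_size(req_mem)
```

The training job above, allocated 512G with a MaxRSS of 38654M, wasted roughly 93% of its memory reservation, exactly the kind of over-allocation the KPI table is meant to surface.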

Security

Authentication: MUNGE

MUNGE is the default authentication mechanism for all inter-daemon communication [5]. Every message is signed with a shared secret (/etc/munge/munge.key), timestamped, and includes the originating UID/GID. A receiving daemon verifies the signature and rejects credentials outside the configured TTL window, preventing replay attacks.

Node A                              Node B
+------------------+                +------------------+
|  slurmctld       |                |  slurmd          |
|                  |--[credential]->|                  |
|  signs with      |                |  verifies with   |
|  munge.key       |                |  munge.key       |
|                  |<--[response]---|                  |
+------------------+                +------------------+

Credential contains:
  - UID / GID of sender
  - Timestamp (TTL: 300s default)
  - Realm (optional)
  - Payload (encrypted)

Key operational requirements:

  • munge.key must be identical on all nodes (controller + compute + login + slurmdbd server)
  • File permissions must be 0400, owned by the munge user
  • Distribution should use a secrets manager (HashiCorp Vault, Ansible Vault) rather than manual scp
  • Key rotation requires a coordinated restart of all SLURM daemons — the most disruptive operation on a live cluster

Key rotation procedure on a live cluster:

# 1. Generate new key on the controller
mungekey --create --keyfile /etc/munge/munge.key.new

# 2. Distribute to all nodes (use your config management tool)
ansible all -m copy \
  -a "src=/etc/munge/munge.key.new dest=/etc/munge/munge.key mode=0400 owner=munge"

# 3. Restart munge everywhere simultaneously (parallel SSH)
ansible all -m service -a "name=munge state=restarted"

# 4. Restart SLURM daemons in order
ansible compute -m service -a "name=slurmd state=restarted"
ansible controller -m service -a "name=slurmctld state=restarted"
ansible dbd -m service -a "name=slurmdbd state=restarted"

Resource Isolation: cgroups v2

Without cgroup enforcement, a job that allocates 64GB of memory can consume 512GB, triggering OOM kills across all other jobs on the node. SLURM's cgroup plugin prevents this [11].

slurmd receives job dispatch
        |
        v
Creates cgroup hierarchy:
/sys/fs/cgroup/system.slice/slurmstepd.scope/job_12345/
        |
        +-- memory.max        = 65536M   (allocated memory)
        +-- memory.swap.max   = 0        (no swap for HPC jobs)
        +-- cpu.max           = 6400000 100000  (64 cores x 100ms period)
        +-- GPU access        = GPUs 0,1 only (cgroup v2 filters devices
                                via eBPF; there is no devices.allow file)

Essential cgroup.conf settings:

CgroupPlugin=autodetect
ConstrainRAMSpace=yes       # OOM kill if job exceeds memory limit
ConstrainSwapSpace=yes      # Disable swap for job processes
ConstrainCores=yes          # Pin processes to allocated CPU cores
ConstrainDevices=yes        # Restrict GPU access to allocated devices
AllowedRAMSpace=100         # No tolerance: enforce hard limit
# Core binding: use TaskPlugin=task/affinity in slurm.conf
# (the TaskAffinity option in cgroup.conf is deprecated)

ConstrainRAMSpace=yes is non-negotiable in any multi-tenant environment. Without it, a misbehaving job can take down an entire node.

Authorization: RBAC and Associations

SLURM's authorization model is hierarchical. Access is validated at every layer:

Level 1 — Cluster
  Who can submit at all?

Level 2 — Account
  Which budget/project does the job charge to?
  What is the fairshare allocation?

Level 3 — User
  Individual limits within the account.

Level 4 — QOS
  Hard limits on resources, wall time, and concurrent jobs.
  Priority boosts or penalties.

Level 5 — Partition
  Which physical nodes? What maximum wall time?
  Restricted to specific groups (AllowGroups)?

Managing associations with sacctmgr:

# Create account hierarchy
sacctmgr add cluster mycluster
sacctmgr add account research_lab cluster=mycluster fairshare=40
sacctmgr add user alice account=research_lab defaultaccount=research_lab

# Define QOS
sacctmgr add qos gpu_priority \
  MaxTRESPerUser=cpu=256,gres/gpu=8 \
  MaxWallDurationPerJob=48:00:00 \
  Priority=100

# Assign QOS to user
sacctmgr modify user alice set qos+=gpu_priority

API Security: JWT and TLS

slurmrestd is the largest attack surface in a modern SLURM deployment. A compromised API token provides full cluster control: job submission, node management, user impersonation.

Hardening checklist:

# 1. Generate JWT signing key (HS256 is symmetric: use random bytes, not RSA)
dd if=/dev/random of=/etc/slurm/jwt_hs256.key bs=32 count=1
chmod 0600 /etc/slurm/jwt_hs256.key
chown slurm: /etc/slurm/jwt_hs256.key

# In slurm.conf:
# AuthAltTypes=auth/jwt
# AuthAltParameters=jwt_key=/etc/slurm/jwt_hs256.key

# 2. Issue short-lived tokens (1 hour max)
scontrol token username=alice lifespan=3600

# 3. Run behind nginx with rate limiting
# nginx.conf excerpt:
# limit_req_zone $binary_remote_addr zone=slurm_api:10m rate=10r/s;
# location /slurm/ {
#   limit_req zone=slurm_api burst=20 nodelay;
#   proxy_pass http://127.0.0.1:6820;
# }

# 4. Restrict port 6820 by firewall
# Only the proxy IP should reach slurmrestd directly
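For intuition, HS256 signing and verification fit in a few lines of standard-library Python (a sketch; SLURM's tokens use the 'sun' claim for the username, but treat the exact claim set here as illustrative):

```python
import base64, hashlib, hmac, json, time

def b64url(data: bytes) -> bytes:
    """Base64url without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def make_token(key: bytes, username: str, lifespan: int, now=None) -> str:
    """Mint an HS256 JWT the way `scontrol token` conceptually does."""
    now = int(now if now is not None else time.time())
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = b64url(json.dumps({"sun": username, "iat": now,
                                "exp": now + lifespan}).encode())
    signing_input = header + b"." + claims
    sig = b64url(hmac.new(key, signing_input, hashlib.sha256).digest())
    return (signing_input + b"." + sig).decode()

def verify_token(key: bytes, token: str, now=None) -> dict:
    """Check the HMAC and the exp claim; raise on failure."""
    head, claims, sig = token.encode().split(b".")
    expected = b64url(hmac.new(key, head + b"." + claims,
                               hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    pad = b"=" * (-len(claims) % 4)
    decoded = json.loads(base64.urlsafe_b64decode(claims + pad))
    now = now if now is not None else time.time()
    if decoded["exp"] < now:
        raise ValueError("token expired")
    return decoded
```

This makes the threat model concrete: anyone holding jwt_hs256.key can mint tokens for any user, which is why the key's permissions and the short lifespan matter.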

Inter-daemon TLS is a recent addition (SLURM 24.11 introduced a TLS plugin); the parameters depend on your release, so check the documentation for your version before enabling it in slurm.conf:

TLSType=tls/s2n

Audit Trail

slurmdbd maintains a complete, immutable audit trail. Every job submission, modification, start, and completion is recorded with full resource accounting. This data is queryable via sacct:

# Full accounting for a user, last 30 days
sacct -u alice \
  --starttime=$(date -d '30 days ago' +%Y-%m-%d) \
  --format=JobID,JobName,Account,QOS,Partition,NCPUS,NNodes,\
           ReqMem,MaxRSS,CPUTime,Elapsed,State,ExitCode \
  --units=G

# Cluster-wide report
sreport cluster utilization \
  start=2024-01-01 end=2024-03-31 \
  -t hourper

For SIEM integration, SLURM writes structured logs to syslog. These can be forwarded to Wazuh, Elastic SIEM, or Splunk for correlation with authentication events and anomaly detection.


Key Configuration Files

File              Purpose                                   Critical settings
slurm.conf        Main config: nodes, partitions, plugins   SelectType, PriorityType, AccountingStorageType
slurmdbd.conf     Accounting daemon: DB credentials         Permissions must be 0600
cgroup.conf       Resource enforcement                      ConstrainRAMSpace, ConstrainDevices
gres.conf         GPU/FPGA topology and binding             GPU count, MIG partitions
topology.conf     Network topology for MPI placement        Switch hierarchy, InfiniBand fabric
acct_gather.conf  Per-job energy and I/O metrics            RAPL, InfiniBand, Lustre

Annotated slurm.conf for a GPU cluster

# Identity
ClusterName=mycluster
SlurmctldHost=controller01
SlurmctldHost=controller02  # HA backup

# Ports
SlurmctldPort=6817
SlurmdPort=6818

# Scheduler
SchedulerType=sched/backfill
SelectType=select/cons_tres           # Consumable resources: Track
SelectTypeParameters=CR_Core_Memory   # individual CPUs and memory
SchedulerParameters=bf_max_job_test=500,bf_resolution=60

# Priority (multifactor)
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightJobSize=100
PriorityDecayHalfLife=7-0             # 7 days half-life for fairshare
PriorityMaxAge=7-0

# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=controller01
AccountingStoragePort=6819
# Database credentials live in slurmdbd.conf, never in slurm.conf
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30             # Collect every 30s

# Task and process tracking
TaskPlugin=task/cgroup,task/affinity
ProctrackType=proctrack/cgroup

# GRES (GPU)
GresTypes=gpu

# Timeouts
SlurmdTimeout=300
SlurmctldTimeout=120
MessageTimeout=10

# Logging
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldDebug=info
SlurmdDebug=info

# Nodes (example: 16 nodes, 8x A100 each)
NodeName=node[01-16] \
  CPUs=64 \
  RealMemory=512000 \
  Gres=gpu:a100:8 \
  State=UNKNOWN

# Partitions
PartitionName=gpu \
  Nodes=node[01-16] \
  MaxTime=INFINITE \
  DefaultTime=24:00:00 \
  State=UP \
  Default=YES

PartitionName=debug \
  Nodes=node[01-02] \
  MaxTime=1:00:00 \
  Priority=100 \
  State=UP

Operational Runbook: Common Tasks

Drain a node for maintenance

# Drain: no new jobs, current jobs finish
scontrol update NodeName=node05 State=DRAIN Reason="scheduled maintenance"

# Check when node will be empty
squeue -w node05

# After jobs finish, confirm drain
scontrol show node node05 | grep State

# Return to service
scontrol update NodeName=node05 State=RESUME

Hold and release a job

# Hold a pending job (prevents scheduling)
scontrol hold 12345

# Release
scontrol release 12345

# Requeue a failed running job
scontrol requeue 12345

Identify wasted resources

# Jobs where memory usage < 50% of allocation
# (--units=M puts ReqMem and MaxRSS in the same unit; the awk +0 coercion
#  strips the trailing suffix before comparing)
sacct --format=JobID,ReqMem,MaxRSS,CPUTime,AveCPU \
  --state=COMPLETED \
  --starttime=2024-01-01 \
  --units=M --parsable2 --noheader \
  | awk -F'|' '$2+0 > 0 && $3+0 > 0 && ($3+0)/($2+0) < 0.5'

Summary

SLURM in one diagram:

User submits job (sbatch / srun / salloc)
        |
        v
slurmctld
   validates resources (partitions + associations + QOS)
   queues job (PENDING)
   computes priority (fairshare + QOS + age + jobsize)
   runs backfill scheduling
   dispatches to allocated nodes (RUNNING)
   records lifecycle to slurmdbd
        |
        +-- slurmdbd -> MariaDB (full accounting, audit trail)
        |
        +-- slurmd on each node
                |
                +-- cgroups v2   (memory, CPU, GPU isolation)
                +-- Prolog       (pre-job setup, root)
                +-- slurmstepd   (user process, MPI launch)
                +-- Epilog       (post-job cleanup, root)
                +-- heartbeat    (node health to slurmctld)
                |
                +-- slurm-exporter :8080  (job + node metrics)
                +-- DCGM Exporter  :9400  (GPU metrics + job_id)
                        |
                        v
                VMAgent -> VictoriaMetrics -> Grafana

Security stack:
  MUNGE           inter-daemon auth (shared key, signed credentials)
  cgroups v2      resource isolation (memory, CPU, GPU per job)
  Associations    RBAC + fairshare (cluster > account > user > QOS)
  JWT + TLS       API security (slurmrestd behind reverse proxy)
  sacct / slurmdbd  audit trail (full accounting, queryable)

The three files to master before anything else: slurm.conf, cgroup.conf, gres.conf. Everything else builds on top of them.


References


This article is part of the HPC Observability series. Next: Building GPU efficiency dashboards with VictoriaMetrics and Grafana for AI training workloads.


  1. Yoo, A.B., Jette, M.A., Grondona, M. (2003). SLURM: Simple Linux Utility for Resource Management. Lecture Notes in Computer Science, 2862, 44-60. https://doi.org/10.1007/10968987_3 

  2. TOP500 Editors (2023). Statistics on Resource Management Software. TOP500 Project. https://www.top500.org/statistics/details/rmsoftware/1 

  3. SchedMD LLC. (2024). High Availability in SLURM. SLURM Documentation. https://slurm.schedmd.com/high_availability.html 

  4. SchedMD LLC. (2024). REST API Guide. SLURM Documentation. https://slurm.schedmd.com/rest.html 

  5. Grondona, M. (2024). MUNGE Authentication Service. GitHub. https://github.com/dun/munge 

  6. Lifka, D. (1995). The ANL/IBM SP Scheduling System. Job Scheduling Strategies for Parallel Processing, 295-303. https://doi.org/10.1007/3-540-60153-8_31 

  7. SchedMD LLC. (2024). Multifactor Priority Plugin. SLURM Documentation. https://slurm.schedmd.com/priority_multifactor.html 

  8. Penso, V. et al. (2024). prometheus-slurm-exporter. GitHub. https://github.com/vpenso/prometheus-slurm-exporter 

  9. NVIDIA Corporation. (2024). DCGM Exporter. GitHub. https://github.com/NVIDIA/dcgm-exporter 

  10. VictoriaMetrics Team. (2024). VictoriaMetrics Documentation. https://docs.victoriametrics.com 

  11. SchedMD LLC. (2024). Cgroups Guide. SLURM Documentation. https://slurm.schedmd.com/cgroups.html 
