SLURM powers Frontier, LUMI, Perlmutter, and the majority of the TOP500. If you work with GPU clusters, AI training infrastructure, or scientific computing, understanding how it works is not optional.
What is SLURM?
SLURM (Simple Linux Utility for Resource Management) is an open-source cluster workload manager originally developed at Lawrence Livermore National Laboratory [1]. It is now the de facto standard for HPC environments worldwide, deployed on more than 60% of TOP500 systems [2].
It has three core responsibilities:
Resource allocation assigns compute nodes to jobs based on configured policies: partitions, Quality of Service (QOS) rules, and fairshare weights. It accounts for CPU cores, memory, GPU devices, and network topology simultaneously.
Job scheduling queues submitted jobs and launches them when resources become available. The default algorithm is backfill scheduling, which fills scheduling gaps with smaller jobs without delaying the larger ones already queued.
Accounting records every resource consumption event — who ran what, on which nodes, for how long, consuming how much CPU, memory, and GPU — via a dedicated daemon connected to a relational database.
It operates on a heartbeat model: nodes report their state to a central controller, which dispatches queued jobs as resources free up.
Architecture
The Four Daemons
+------------------------------------------------------------------+
| CONTROL PLANE |
| |
| +------------------+ +------------------+ |
| | slurmctld |<-------->| slurmdbd | |
| | TCP 6817 | | TCP 6819 | |
| | | | | |
| | Scheduler | | Accounting GW | |
| | State manager | | Only DB client | |
| +--------+---------+ +--------+---------+ |
| | | |
+------------|-----------------------------|-----------------------+
| |
| TCP 6818 | SQL TCP 3306
v v
+---------------------------+ +--------------------+
| COMPUTE NODES | | MariaDB |
| | | Accounting DB |
| slurmd slurmd ... | +--------------------+
| node01 node02 |
| |
| cgroups v2 enforcement |
| Prolog / Epilog hooks |
+---------------------------+
^
|
+-------+--------+
| slurmrestd |
| TCP 6820 |
| OpenAPI/JWT |
+----------------+
slurmctld — Controller Daemon (TCP 6817)
The brain of the cluster. It maintains the global state of every node and every job in memory, periodically checkpointing to disk (the StateSaveLocation directory). On restart after a failure, it replays this state to resume operations without losing queued or running jobs.
Key responsibilities:
- Runs the scheduler plugin (backfill by default, with optional gang scheduling)
- Manages node state transitions (IDLE, ALLOCATED, DOWN, DRAIN, FAIL)
- Dispatches jobs to slurmd on compute nodes
- Enforces partition and QOS limits
- Processes all client commands (sbatch, srun, scontrol)
High availability is supported via a primary/backup pair. If the primary slurmctld fails, the backup takes over within seconds, with minimal job disruption [3].
slurmd — Node Daemon (TCP 6818)
One instance runs on every compute node. It is the execution agent: it receives job steps dispatched by slurmctld, spawns user processes inside cgroup hierarchies, monitors resource consumption continuously, and sends periodic heartbeats back to the controller.
When a heartbeat is missed beyond the configured SlurmdTimeout, the controller marks the node as DOWN and can optionally reschedule its jobs.
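The rule can be sketched in a few lines of shell (values hypothetical; SLURM implements this inside slurmctld, not as a script):

```shell
# Illustration only: a node is marked DOWN once no heartbeat has
# arrived within SlurmdTimeout seconds.
slurmd_timeout=300
last_heartbeat_age=412   # seconds since last slurmd ping (hypothetical)

if [ "$last_heartbeat_age" -gt "$slurmd_timeout" ]; then
  echo "node01: state -> DOWN"
else
  echo "node01: healthy"
fi
```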
slurmd also runs the site-defined Prolog script before launching each job (environment setup, filesystem mounting, health checks) and the Epilog script after completion (cleanup, unmounting, node validation).
slurmdbd — Database Daemon (TCP 6819)
The exclusive gateway to the accounting database. No other daemon connects to MariaDB directly. This design creates a single point of control for all historical data: job records, resource consumption, user associations, QOS definitions, and the fairshare tree.
slurmdbd can run on a dedicated server, isolated from the controller. Losing it does not stop job execution — running jobs continue — but new accounting records are buffered locally on slurmctld and flushed when connectivity is restored.
slurmrestd — REST API Daemon (TCP 6820)
Available since SLURM 20.11 [4], slurmrestd exposes the full SLURM management interface as an OpenAPI-documented REST API. It bridges REST calls to internal SLURM RPC, enabling integration with web portals, JupyterHub, workflow orchestrators (Nextflow, Snakemake, Apache Airflow), and cloud bursting systems.
Authentication is via JWT tokens. The API surface is significant and must be treated as a privileged endpoint.
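A minimal request sketch, assuming slurmrestd on localhost:6820 and API version v0.0.40 (the version segment in the path varies by release). The leading echo makes this a dry run that only prints the command; drop it to actually call the API:

```shell
# Assumed endpoint and API version; both depend on your deployment.
SLURM_API="http://localhost:6820"
API_VER="v0.0.40"
TOKEN="${SLURM_JWT:-placeholder}"   # real token: scontrol token username=alice lifespan=3600

# slurmrestd authenticates requests via these two headers:
echo curl -s \
  -H "X-SLURM-USER-NAME: alice" \
  -H "X-SLURM-USER-TOKEN: ${TOKEN}" \
  "${SLURM_API}/slurm/${API_VER}/jobs"
```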
Communication Flows
User (sbatch / srun / salloc)
|
| TCP 6817 — job submit, validated against associations + QOS
v
+-------------+ TCP 6819 +-------------+ SQL +-----------+
| slurmctld |<------------>| slurmdbd |-------->| MariaDB |
+------+------+ accounting +-------------+ +-----------+
|
| TCP 6818 — job dispatch (JobID, allocated nodes, resources)
|
+----+----+
| |
slurmd #1 slurmd #2 ...
|
+-- cgroups v2 (memory.max, cpu.max, devices allowlist)
+-- Prolog (runs as root before job)
+-- job step (runs as user)
+-- Epilog (runs as root after job)
+-- heartbeat -> slurmctld every SlurmdTimeout/3
slurmrestd --REST/JWT--> slurmctld (internal RPC)
All inter-daemon messages: signed + timestamped by MUNGE
Every message exchanged between slurmctld, slurmdbd, and slurmd is signed and timestamped by MUNGE (MUNGE Uid 'N' Gid Emporium). A credential contains the UID/GID of the originating process, a timestamp, and a configurable TTL. Replayed credentials are rejected [5].
Scheduling Deep Dive
Backfill Scheduling
The default sched/backfill plugin extends simple first-in-first-out scheduling by maintaining a time-ordered reservation list. When a large job cannot start immediately, the scheduler looks for smaller jobs that can be inserted into the scheduling gap without pushing back the start time of the large job [6].
This is why you sometimes see a small 2-node job start before a 100-node job that was submitted earlier: the 100-node job is waiting for enough nodes to free up, and the 2-node job fits in the current available capacity without affecting the projected start time.
Queue state:
Job A: 100 nodes, submitted T+0, cannot start (only 20 nodes free)
Job B: 10 nodes, submitted T+10
Backfill logic:
- Job A projected start: T+45 (when enough nodes finish current jobs)
- Job B can complete before T+45 if started now
- Job B is scheduled immediately without delaying Job A
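A toy sketch of the backfill decision for this queue state (illustration only; SLURM's actual algorithm walks a full reservation list across all pending jobs):

```shell
# Toy backfill check with the numbers from the example above.
free_nodes=20            # nodes idle right now
a_projected_start=45     # minutes until Job A (100 nodes) can start
b_nodes=10               # Job B node request
b_walltime=30            # Job B time limit in minutes

# B may start now iff it fits in the free capacity AND finishes
# before A's projected start, so A is never delayed.
if [ "$b_nodes" -le "$free_nodes" ] && [ "$b_walltime" -le "$a_projected_start" ]; then
  echo "backfill: start Job B now"
else
  echo "backfill: Job B waits"
fi
```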
Priority Calculation
SLURM computes a weighted sum for each queued job [7]:
Priority = w_age * factor_age
+ w_fairshare * factor_fairshare
+ w_jobsize * factor_jobsize
+ w_qos * factor_qos
+ w_partition * factor_partition
+ w_assoc * factor_assoc
The fairshare factor is the most important for multi-tenant clusters. It is computed using a decay algorithm: resource usage from the past contributes less weight over time (configured by PriorityDecayHalfLife). A user who ran 10,000 CPU-hours last week has a lower fairshare score than a user who has not submitted a job in two weeks, pushing the inactive user's jobs to higher priority.
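A sketch of the decay arithmetic with the figures above (10,000 CPU-hours consumed 7 days ago, a 7-day half-life; illustration only):

```shell
# effective_usage = usage * 0.5^(age / half_life)
awk 'BEGIN {
  usage = 10000        # CPU-hours, consumed one week ago
  age = 7; half_life = 7
  printf "effective usage: %.0f CPU-hours\n", usage * 0.5^(age / half_life)
}'
```

After exactly one half-life, the 10,000 CPU-hours count as 5,000 in the fairshare calculation.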
The tool sprio shows the current priority breakdown for every queued job.
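Plugging hypothetical factors into the weighted sum above (each factor is normalized to [0,1], as in SLURM; the weights are illustrative values in a typical slurm.conf range):

```shell
# Weights: hypothetical slurm.conf values. Factors: per-job, normalized.
awk 'BEGIN {
  w_fairshare = 100000; f_fairshare = 0.45
  w_age       = 1000;   f_age       = 0.30
  w_jobsize   = 100;    f_jobsize   = 0.10
  priority = w_fairshare*f_fairshare + w_age*f_age + w_jobsize*f_jobsize
  printf "priority=%d\n", priority   # compare against: sprio -j <jobid>
}'
```

With these numbers the fairshare term (45,000) dwarfs age (300) and job size (10), which is the intended behavior on multi-tenant clusters.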
QOS and Associations
The association tree controls access at every level:
Cluster: mycluster
|
+-- Account: research_lab (FairShare: 40)
| |
| +-- User: alice (FairShare: 20)
| | QOS: normal, gpu_priority
| | MaxTRES: cpu=256,gres/gpu=8
| |
| +-- User: bob (FairShare: 20)
| QOS: normal
| MaxTRES: cpu=128
|
+-- Account: ops_team (FairShare: 60)
|
+-- User: carol (FairShare: 60)
QOS: normal, high_priority, infra
MaxTRES: cpu=512,gres/gpu=32
A QOS defines hard limits (GrpTRES, MaxTRESPerJob, MaxWallDurationPerJob) and soft priority boosts. When a user submits a job requesting resources beyond their association or QOS limits, the job is rejected at submission time, not at scheduling time.
Job Lifecycle
SUBMIT QUEUE ALLOCATE RUN COMPLETE
| | | | |
sbatch PENDING Nodes RUNNING COMPLETED
script.sh state reserved state state
| | | | |
v v v v v
slurmctld Scheduler slurmd slurmd slurmdbd
validates computes runs monitors records
resources priority Prolog CPU/mem/GPU all metrics
+ QOS limits backfill cgroups heartbeats to MariaDB
analysis configured to controller
Submission
#!/bin/bash
#SBATCH --job-name=train_llm
#SBATCH --partition=gpu
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:a100:8
#SBATCH --mem=512G
#SBATCH --time=48:00:00
#SBATCH --account=research_lab
#SBATCH --qos=gpu_priority
module load cuda/12.2
srun python train.py --config config.yaml
slurmctld validates this script against:
- The partition definition (nodes available, max wall time)
- The user's association (account exists, user is a member)
- The QOS (resource limits not exceeded)
- Current cluster capacity (enough GPUs exist)
If all checks pass, the job receives a JobID and enters the PENDING state.
Execution on Nodes
When slurmctld dispatches the job, each slurmd on the allocated nodes:
- Runs the site Prolog (as root)
- Creates the cgroup hierarchy for the job
- Sets memory.max, cpu.max, and the GPU device allowlist
- Spawns slurmstepd, which drops privileges to the user and executes the job step
- Monitors consumption every JobAcctGatherFrequency seconds
- Runs the Epilog on completion (as root)
- Reports final resource usage to slurmctld, which forwards it to slurmdbd
Job Arrays
For parameter sweeps, job arrays avoid submitting thousands of individual jobs:
#SBATCH --array=0-99%10 # 100 tasks, max 10 running simultaneously
PARAM=${SLURM_ARRAY_TASK_ID}
python experiment.py --seed $PARAM
Each task gets its own JobID (formatted as ArrayJobID_TaskID) and its own accounting record. The %10 limits concurrent tasks to avoid saturating the cluster.
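Inside each task, SLURM exports the array variables; a simulation of how the accounting JobID is formed (values hypothetical):

```shell
# Set by SLURM inside a real array task; hard-coded here for illustration.
SLURM_ARRAY_JOB_ID=9000
SLURM_ARRAY_TASK_ID=7
echo "accounting record: ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}"
```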
Observability Stack
Architecture
Compute Nodes
+------------------------+ +------------------------+
| slurmd | | DCGM Exporter |
| | | (NVIDIA GPU metrics) |
| slurm-exporter :8080 | | :9400 |
| slurm_jobs_running | | DCGM_FI_DEV_GPU_UTIL |
| slurm_jobs_pending | | DCGM_FI_DEV_MEM_COPY |
| slurm_nodes_alloc | | DCGM_FI_DEV_NVLINK_* |
| slurm_cpus_idle | | label: slurm_job_id |
+----------+-------------+ +----------+-------------+
| |
| Prometheus scrape | Prometheus scrape
v v
+-----------------------------------------------+
| VMAgent (per node or centralized) |
| Relabeling, filtering, remote_write |
+-------------------+---------------------------+
|
| remote_write
v
+-----------------------------------------------+
| VictoriaMetrics (vminsert / vmstorage) |
| Long-term storage, MetricsQL |
+-------------------+---------------------------+
|
| datasource
v
+-----------------------------------------------+
| Grafana |
| Job efficiency dashboards |
| GPU heatmaps, fairshare visualization |
| Alerting (PagerDuty, Slack) |
+-----------------------------------------------+
slurm-exporter
The prometheus-slurm-exporter scrapes SLURM CLI tools (squeue, sinfo, sacct) and exposes metrics on port 8080 [8].
Key metrics exposed:
| Metric | Description |
|---|---|
| `slurm_jobs_running` | Count of running jobs, by partition |
| `slurm_jobs_pending` | Count of pending jobs, by reason |
| `slurm_nodes_alloc` | Nodes in ALLOCATED state |
| `slurm_nodes_idle` | Nodes in IDLE state |
| `slurm_nodes_down` | Nodes in DOWN/DRAIN state |
| `slurm_cpus_total` | Total CPUs in cluster |
| `slurm_cpus_idle` | Idle CPUs |
| `slurm_account_cpu_count` | CPUs used per account |
A known limitation: the exporter calls CLI binaries, which adds latency and load at scale (thousands of jobs). At very large scale, prefer reading directly from slurmctld's state files or using slurmrestd as a data source.
DCGM Exporter and GPU Correlation
The NVIDIA DCGM Exporter exposes per-GPU hardware metrics [9]:
DCGM_FI_DEV_GPU_UTIL{gpu="0", UUID="...", hostname="node01"} 94
DCGM_FI_DEV_FB_USED{gpu="0", ...} 38654
DCGM_FI_DEV_POWER_USAGE{gpu="0", ...} 387
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL{gpu="0", ...} 198432
To correlate GPU metrics with SLURM jobs, dcgm-exporter's HPC job-mapping support can attach the job ID as a label, typically fed by Prolog/Epilog scripts that record which job owns each GPU. This enables Grafana queries like:
# GPU efficiency for a specific job
DCGM_FI_DEV_GPU_UTIL{slurm_job_id="12345"}
This is the key insight for AI/ML workloads: raw GPU utilization tells you if GPUs are busy, but job_id correlation tells you which specific training run, user, or team is responsible.
Why VictoriaMetrics for HPC
Prometheus alone struggles with HPC-scale workloads for three reasons:
- Cardinality: a 1000-node cluster with 8 GPUs each, running thousands of jobs, generates millions of unique time series
- Retention: HPC accounting requires months or years of metrics for capacity planning and user reporting
- Query performance: job efficiency reports aggregate over large time ranges with complex label filters
VictoriaMetrics addresses all three [10]:
# vmagent config: distributed collection on compute nodes
scrape_configs:
- job_name: slurm
static_configs:
- targets: ["localhost:8080"]
- job_name: dcgm
static_configs:
- targets: ["localhost:9400"]
remote_write:
- url: "http://victoriametrics:8428/api/v1/write"
Compression on HPC-style workloads is typically several times better than Prometheus TSDB, and MetricsQL supports aggregations like quantile_over_time and increase that make wait-time analysis practical over long ranges.
KPIs That Actually Matter
Most HPC operators track GPU utilization and stop there. That is not enough. The metrics that reveal actual cluster health:
| Metric | Formula | Why it matters |
|---|---|---|
| CPU efficiency | `used_cpus / alloc_cpus` | Reveals job over-allocation and poor sizing |
| Memory waste | `alloc_mem - max_rss` | Often 40-60% on ML clusters |
| Wait time P95 | `start_time - submit_time` | Scheduler health indicator |
| Fairshare drift | `factor_fairshare` over 30d | Detects long-term resource monopolies |
| GPU occupancy | `DCGM_GPU_UTIL` weighted by job | Distinguishes idle allocation from compute-bound |
| Job failure rate | `failed / (completed + failed)` | Infrastructure reliability signal |
A sacct query for job efficiency after the fact:
sacct -j 12345 \
--format=JobID,CPUTime,CPUTimeRAW,AveCPU,MaxRSS,ReqMem,Elapsed \
--units=G
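To turn such records into an efficiency percentage, parsable output (sacct -P) is easier to post-process. The heredoc below stands in for real sacct output (values hypothetical):

```shell
# Sample rows standing in for:
#   sacct -j 12345 -P --units=G --format=JobID,ReqMem,MaxRSS
cat <<'EOF' | awk -F'|' 'NR > 1 && $3 != "" {
    req = $2; rss = $3
    gsub(/G/, "", req); gsub(/G/, "", rss)    # strip the unit suffix
    printf "%s mem_efficiency=%.0f%%\n", $1, 100 * rss / req
}'
JobID|ReqMem|MaxRSS
12345|512G|
12345.batch|512G|118G
EOF
```

Here the batch step peaked at 118G of a 512G allocation: 23% memory efficiency, exactly the kind of over-allocation the KPI table above is meant to surface.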
Security
Authentication: MUNGE
MUNGE is the default authentication mechanism for all inter-daemon communication [5]. Every message is signed with a shared secret (/etc/munge/munge.key), timestamped, and includes the originating UID/GID. A receiving daemon verifies the signature and rejects credentials outside the configured TTL window, preventing replay attacks.
Node A Node B
+------------------+ +------------------+
| slurmctld | | slurmd |
| |--[credential]->| |
| signs with | | verifies with |
| munge.key | | munge.key |
| |<--[response]---| |
+------------------+ +------------------+
Credential contains:
- UID / GID of sender
- Timestamp (TTL: 300s default)
- Realm (optional)
- Payload (encrypted)
Key operational requirements:
- munge.key must be identical on all nodes (controller + compute + login + slurmdbd server)
- File permissions must be 0400, owned by the munge user
- Distribution should use a secrets manager (HashiCorp Vault, Ansible Vault) rather than manual scp
- Key rotation requires a coordinated restart of all SLURM daemons — the most disruptive operation on a live cluster
Key rotation procedure on a live cluster:
# 1. Generate new key on the controller
mungekey --create --keyfile /etc/munge/munge.key.new
# 2. Distribute to all nodes (use your config management tool)
ansible all -m copy \
-a "src=/etc/munge/munge.key.new dest=/etc/munge/munge.key mode=0400 owner=munge"
# 3. Restart munge everywhere simultaneously (parallel SSH)
ansible all -m service -a "name=munge state=restarted"
# 4. Restart SLURM daemons in order
ansible compute -m service -a "name=slurmd state=restarted"
ansible controller -m service -a "name=slurmctld state=restarted"
ansible dbd -m service -a "name=slurmdbd state=restarted"
Resource Isolation: cgroups v2
Without cgroup enforcement, a job that allocates 64GB of memory can consume 512GB, triggering OOM kills across all other jobs on the node. SLURM's cgroup plugin prevents this [11].
slurmd receives job dispatch
|
v
Creates cgroup hierarchy:
/sys/fs/cgroup/system.slice/slurmstepd.scope/job_12345/
|
+-- memory.max = 65536M (allocated memory)
+-- memory.swap.max = 0 (no swap for HPC jobs)
+-- cpu.max = 6400000 100000 (64 cores x 100ms period)
+-- devices.allow = c 195:0 (GPU 0 only)
+-- devices.allow = c 195:1 (GPU 1 only)
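The cpu.max quota shown above is derived as cores times the scheduling period (both in microseconds):

```shell
# cgroup v2 cpu.max is "<quota> <period>"; quota = cores * period.
cores=64
period=100000          # default 100ms period, in microseconds
echo "cpu.max = $((cores * period)) $period"
```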
Essential cgroup.conf settings:
CgroupPlugin=autodetect
ConstrainRAMSpace=yes # OOM kill if job exceeds memory limit
ConstrainSwapSpace=yes # Disable swap for job processes
ConstrainCores=yes # Pin processes to allocated CPU cores
ConstrainDevices=yes # Restrict GPU access to allocated devices
AllowedRAMSpace=100 # No tolerance: enforce hard limit
TaskAffinity=yes # Bind threads to cores
ConstrainRAMSpace=yes is non-negotiable in any multi-tenant environment. Without it, a misbehaving job can take down an entire node.
Authorization: RBAC and Associations
SLURM's authorization model is hierarchical. Access is validated at every layer:
Level 1 — Cluster
Who can submit at all?
Level 2 — Account
Which budget/project does the job charge to?
What is the fairshare allocation?
Level 3 — User
Individual limits within the account.
Level 4 — QOS
Hard limits on resources, wall time, and concurrent jobs.
Priority boosts or penalties.
Level 5 — Partition
Which physical nodes? What maximum wall time?
Restricted to specific groups (AllowGroups)?
Managing associations with sacctmgr:
# Create account hierarchy
sacctmgr add cluster mycluster
sacctmgr add account research_lab cluster=mycluster fairshare=40
sacctmgr add user alice account=research_lab defaultaccount=research_lab
# Define QOS
sacctmgr add qos gpu_priority \
MaxTRESPerUser=cpu=256,gres/gpu=8 \
MaxWallDurationPerJob=48:00:00 \
Priority=100
# Assign QOS to user
sacctmgr modify user alice set qos+=gpu_priority
API Security: JWT and TLS
slurmrestd is the largest attack surface in a modern SLURM deployment. A compromised API token provides full cluster control: job submission, node management, user impersonation.
Hardening checklist:
# 1. Generate JWT signing key (HS256 uses a 32-byte random symmetric
#    key, per the SchedMD JWT guide; an RSA key is not appropriate here)
dd if=/dev/random of=/etc/slurm/jwt_hs256.key bs=32 count=1
chmod 0600 /etc/slurm/jwt_hs256.key
chown slurm: /etc/slurm/jwt_hs256.key
# In slurm.conf:
# AuthAltTypes=auth/jwt
# AuthAltParameters=jwt_key=/etc/slurm/jwt_hs256.key
# 2. Issue short-lived tokens (1 hour max)
scontrol token username=alice lifespan=3600
# 3. Run behind nginx with rate limiting
# nginx.conf excerpt:
# limit_req_zone $binary_remote_addr zone=slurm_api:10m rate=10r/s;
# location /slurm/ {
# limit_req zone=slurm_api burst=20 nodelay;
# proxy_pass http://127.0.0.1:6820;
# }
# 4. Restrict port 6820 by firewall
# Only the proxy IP should reach slurmrestd directly
Native inter-daemon TLS is a recent addition (SLURM 24.11 and later), configured via the TlsType parameter in slurm.conf; check the documentation for your release. Earlier versions rely on MUNGE message signing plus network-level isolation of the SLURM ports.
Audit Trail
slurmdbd maintains a complete, immutable audit trail. Every job submission, modification, start, and completion is recorded with full resource accounting. This data is queryable via sacct:
# Full accounting for a user, last 30 days
sacct -u alice \
--starttime=$(date -d '30 days ago' +%Y-%m-%d) \
--format=JobID,JobName,Account,QOS,Partition,NCPUS,NNodes,\
ReqMem,MaxRSS,CPUTime,Elapsed,State,ExitCode \
--units=G
# Cluster-wide report
sreport cluster utilization \
start=2024-01-01 end=2024-03-31 \
-t hourper
For SIEM integration, SLURM writes structured logs to syslog. These can be forwarded to Wazuh, Elastic SIEM, or Splunk for correlation with authentication events and anomaly detection.
Key Configuration Files
| File | Purpose | Critical settings |
|---|---|---|
| `slurm.conf` | Main config: nodes, partitions, plugins | `SelectType`, `PriorityType`, `AccountingStorageType` |
| `slurmdbd.conf` | Accounting daemon: DB credentials | Permissions must be `0600` |
| `cgroup.conf` | Resource enforcement | `ConstrainRAMSpace`, `ConstrainDevices` |
| `gres.conf` | GPU/FPGA topology and binding | GPU count, MIG partitions |
| `topology.conf` | Network topology for MPI placement | Switch hierarchy, InfiniBand fabric |
| `acct_gather.conf` | Per-job energy and I/O metrics | RAPL, InfiniBand, Lustre |
Annotated slurm.conf for a GPU cluster
# Identity
ClusterName=mycluster
SlurmctldHost=controller01
SlurmctldHost=controller02 # HA backup
# Ports
SlurmctldPort=6817
SlurmdPort=6818
# Scheduler
SchedulerType=sched/backfill
SelectType=select/cons_tres # Consumable resources: Track
SelectTypeParameters=CR_Core_Memory # individual CPUs and memory
SchedulerParameters=bf_max_job_test=500,bf_resolution=60
# Priority (multifactor)
PriorityType=priority/multifactor
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightJobSize=100
PriorityDecayHalfLife=7-0 # 7 days half-life for fairshare
PriorityMaxAge=7-0
# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=controller01
AccountingStoragePort=6819
AccountingStorageUser=slurm
# DB credentials (StoragePass) belong in slurmdbd.conf, not slurm.conf
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30 # Collect every 30s
# Task and process tracking
TaskPlugin=task/cgroup,task/affinity
ProctrackType=proctrack/cgroup
# GRES (GPU)
GresTypes=gpu
# Timeouts
SlurmdTimeout=300
SlurmctldTimeout=120
MessageTimeout=10
# Logging
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmctldDebug=info
SlurmdDebug=info
# Nodes (example: 16 nodes, 8x A100 each)
NodeName=node[01-16] \
CPUs=64 \
RealMemory=512000 \
Gres=gpu:a100:8 \
State=UNKNOWN
# Partitions
PartitionName=gpu \
Nodes=node[01-16] \
MaxTime=INFINITE \
DefaultTime=24:00:00 \
State=UP \
Default=YES
PartitionName=debug \
Nodes=node[01-02] \
MaxTime=1:00:00 \
Priority=100 \
State=UP
Operational Runbook: Common Tasks
Drain a node for maintenance
# Drain: no new jobs, current jobs finish
scontrol update NodeName=node05 State=DRAIN Reason="scheduled maintenance"
# Check when node will be empty
squeue -w node05
# After jobs finish, confirm drain
scontrol show node node05 | grep State
# Return to service
scontrol update NodeName=node05 State=RESUME
Hold and release a job
# Hold a pending job (prevents scheduling)
scontrol hold 12345
# Release
scontrol release 12345
# Requeue a failed running job
scontrol requeue 12345
Identify wasted resources
# Jobs where memory usage < 50% of allocation
# (-P gives pipe-separated output; strip the G suffix before dividing)
sacct -P --noheader --units=G \
      --format=JobID,ReqMem,MaxRSS,CPUTime,AveCPU \
      --state=COMPLETED \
      --starttime=2024-01-01 \
  | awk -F'|' '{r=$2; m=$3; gsub(/G/,"",r); gsub(/G/,"",m); if (r+0 > 0 && m+0 > 0 && m/r < 0.5) print}'
Summary
SLURM in one diagram:
User submits job (sbatch / srun / salloc)
|
v
slurmctld
validates resources (partitions + associations + QOS)
queues job (PENDING)
computes priority (fairshare + QOS + age + jobsize)
runs backfill scheduling
dispatches to allocated nodes (RUNNING)
records lifecycle to slurmdbd
|
+-- slurmdbd -> MariaDB (full accounting, audit trail)
|
+-- slurmd on each node
|
+-- cgroups v2 (memory, CPU, GPU isolation)
+-- Prolog (pre-job setup, root)
+-- slurmstepd (user process, MPI launch)
+-- Epilog (post-job cleanup, root)
+-- heartbeat (node health to slurmctld)
|
+-- slurm-exporter :8080 (job + node metrics)
+-- DCGM Exporter :9400 (GPU metrics + job_id)
|
v
VMAgent -> VictoriaMetrics -> Grafana
Security stack:
MUNGE inter-daemon auth (shared key, signed credentials)
cgroups v2 resource isolation (memory, CPU, GPU per job)
Associations RBAC + fairshare (cluster > account > user > QOS)
JWT + TLS API security (slurmrestd behind reverse proxy)
sacct / slurmdbd audit trail (full accounting, queryable)
The three files to master before anything else: slurm.conf, cgroup.conf, gres.conf. Everything else builds on top of them.
This article is part of the HPC Observability series. Next: Building GPU efficiency dashboards with VictoriaMetrics and Grafana for AI training workloads.
References
1. Yoo, A.B., Jette, M.A., Grondona, M. (2003). SLURM: Simple Linux Utility for Resource Management. Lecture Notes in Computer Science, 2862, 44-60. https://doi.org/10.1007/10968987_3
2. TOP500 Editors (2023). Statistics on Resource Management Software. TOP500 Project. https://www.top500.org/statistics/details/rmsoftware/1
3. SchedMD LLC. (2024). High Availability in SLURM. SLURM Documentation. https://slurm.schedmd.com/high_availability.html
4. SchedMD LLC. (2024). REST API Guide. SLURM Documentation. https://slurm.schedmd.com/rest.html
5. Grondona, M. (2024). MUNGE Authentication Service. GitHub. https://github.com/dun/munge
6. Lifka, D. (1995). The ANL/IBM SP Scheduling System. Job Scheduling Strategies for Parallel Processing, 295-303. https://doi.org/10.1007/3-540-60153-8_31
7. SchedMD LLC. (2024). Multifactor Priority Plugin. SLURM Documentation. https://slurm.schedmd.com/priority_multifactor.html
8. Penso, V. et al. (2024). prometheus-slurm-exporter. GitHub. https://github.com/vpenso/prometheus-slurm-exporter
9. NVIDIA Corporation. (2024). DCGM Exporter. GitHub. https://github.com/NVIDIA/dcgm-exporter
10. VictoriaMetrics Team. (2024). VictoriaMetrics Documentation. https://docs.victoriametrics.com
11. SchedMD LLC. (2024). Cgroups Guide. SLURM Documentation. https://slurm.schedmd.com/cgroups.html