If you work with HPC clusters, chances are you use Slurm every day to submit jobs, monitor queues, and manage compute resources.
Most users know commands like sbatch, squeue, and sinfo, but fewer understand what actually happens internally when a job is submitted.
This article explains how Slurm handles resource allocation behind the scenes, from job submission to execution on compute nodes.
⸻
What Happens When You Submit a Job?
When a user runs:
sbatch job.sh
Slurm begins a multi-step workflow internally.
The main components involved are:
- slurmctld → Central controller daemon
- slurmd → Compute node daemon
- slurmdbd → Accounting database daemon (optional but common)
- Scheduler plugin
- Select plugin
- Cgroups/task plugins
Each component has a specific role in resource allocation.
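Example (a quick way to see which of these plugins and daemons a particular site uses; output varies by Slurm version and configuration):
scontrol ping                    # is slurmctld (and any backup controller) responding?
scontrol show config | grep -E 'SchedulerType|SelectType|TaskPlugin|ProctrackType|AccountingStorageType'
sinfo -N -l | head               # per-node state that slurmd reports back to slurmctld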
⸻
Step 1: Job Submission
The sbatch command sends the job request to slurmctld.
The request includes:
- Number of nodes
- CPUs
- Memory
- GPUs
- Time limit
- Partition
- Constraints
- QoS
- Account information
Example:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=128G
#SBATCH --time=04:00:00
At this stage, Slurm creates a job record and places it into the pending queue.
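Example (job ID 12345 is a placeholder): you can inspect the new job record and see why it is still pending:
scontrol show job 12345                       # full job record held by slurmctld
squeue -j 12345 -o "%.10i %.9P %.8T %.12r"    # state and pending reason
squeue -j 12345 --start                       # scheduler's estimated start time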
⸻
Step 2: Job Validation
Before scheduling the job, Slurm validates several things internally.
User & Account Checks
Slurm verifies:
- User permissions
- Account associations
- QoS limits
- Fairshare policies
- Partition access
If accounting is enabled, slurmdbd provides usage statistics and limits.
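Example (assuming accounting via slurmdbd is enabled; exact field names vary slightly between Slurm versions): the associations and QoS limits checked at this step are visible with sacctmgr:
sacctmgr show associations user=$USER format=Cluster,Account,Partition,QOS,Fairshare
sacctmgr show qos format=Name,Priority,MaxWall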
⸻
Step 3: Scheduler Evaluation
Now the scheduler starts evaluating the job.
The default scheduler in Slurm is:
sched/backfill
This scheduler performs two important tasks:
Main Scheduling Pass
It checks:
- Available resources
- Job priority
- Node states
- Reservations
- Limits
Backfill Scheduling
Backfill allows lower-priority jobs to start early in scheduling gaps, as long as doing so does not delay the expected start time of higher-priority jobs.
This improves overall cluster utilization.
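Example (a sketch: the parameter names are real backfill options, but the values are illustrative and site-specific):
# Query the running configuration:
scontrol show config | grep -E 'SchedulerType|SchedulerParameters'
# Typical slurm.conf entries on a backfill-scheduled cluster:
SchedulerType=sched/backfill
SchedulerParameters=bf_window=1440,bf_max_job_test=500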
⸻
How Job Priority Is Calculated
Slurm calculates a dynamic priority score.
Factors include:
- Fairshare usage
- Job age
- Job size
- Partition priority
- QoS priority
- Association priority
Internally, the priority plugin (priority/multifactor on most production clusters) combines these values into a single score, with each factor scaled by a site-configured weight.
Simplified example:
Priority = (AgeWeight × Age) + (FairshareWeight × Fairshare) + (JobSizeWeight × JobSize) + (PartitionWeight × Partition) + (QOSWeight × QoS)
Higher score means earlier scheduling.
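If the priority/multifactor plugin is enabled, the weights live in slurm.conf and sprio shows the per-factor breakdown. Example (weights and job ID are illustrative):
# slurm.conf
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=500
PriorityWeightPartition=1000
PriorityWeightQOS=2000
# Per-factor priority breakdown for a pending job:
sprio -l -j 12345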
⸻
Step 4: Resource Selection
Once the scheduler decides to run the job, Slurm uses the select plugin.
Most clusters use:
select/cons_tres
This plugin handles consumable resources using TRES.
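Example (a typical slurm.conf sketch; CR_Core_Memory makes both cores and memory consumable, and parameters vary by site):
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory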
⸻
What Are TRES?
TRES stands for:
Trackable RESources
Examples:
- CPU
- Memory
- GPU
- Node
- License
- Burst buffer
This model allows Slurm to track resources very precisely.
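Example (the sacctmgr query lists what a cluster actually tracks; the slurm.conf line is a common but site-specific addition that makes GPUs a tracked TRES):
sacctmgr show tres
# slurm.conf
AccountingStorageTRES=gres/gpu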
⸻
Internal Node Selection
The select plugin now determines:
- Which nodes are eligible
- How CPUs are distributed
- Memory allocation
- GPU placement
- Socket/core binding
Slurm checks node topology information stored in memory by slurmctld.
Example:
NodeA:
64 CPUs
512 GB RAM
4 GPUs
If the job requests:
32 CPUs + 2 GPUs
Slurm reserves exactly those resources internally.
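The matching request in a job script could look like this (a sketch; --gres assumes the site defines a gpu GRES):
#SBATCH --ntasks=32
#SBATCH --gres=gpu:2
#SBATCH --mem=128G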
⸻
Step 5: Resource Reservation
After node selection, Slurm marks resources as allocated.
Internally:
- CPUs become unavailable to other jobs
- Memory counters are reduced
- GPUs are reserved
- Node state changes
You can observe this using:
scontrol show node
or
squeue
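Example (a trimmed sketch of the node record while the job above is running; the node name and values are illustrative):
scontrol show node NodeA

NodeName=NodeA CPUAlloc=32 CPUTot=64
   RealMemory=512000 AllocMem=131072
   Gres=gpu:4
   AllocTRES=cpu=32,mem=128G,gres/gpu=2
   State=MIXED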
⸻
Step 6: Launching the Job
Now slurmctld contacts the slurmd daemon on each allocated node.
The compute node daemon performs:
- Environment setup
- UID/GID validation
- Cgroup creation
- CPU binding
- Memory enforcement
- Task launching
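Which of these steps are actually enforced depends on the plugins a site enables. A common slurm.conf combination (a sketch, not a universal default):
TaskPlugin=task/cgroup,task/affinity
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup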
⸻
How Cgroups Enforce Limits
Modern Slurm clusters rely heavily on Linux cgroups.
Cgroups ensure a job cannot exceed allocated resources.
Examples:
- CPU enforcement: only the allocated CPU cores are accessible
- Memory enforcement: memory usage beyond the limit triggers an OOM kill
- GPU isolation: only the assigned GPUs are visible
This is why users see:
CUDA_VISIBLE_DEVICES=0
set automatically inside GPU jobs: the device cgroup hides unassigned GPUs, while Slurm's GRES plugin exports the matching environment variable.
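That enforcement is switched on in cgroup.conf. A minimal sketch of the relevant options:
# cgroup.conf
ConstrainCores=yes        # restrict tasks to the allocated cores
ConstrainRAMSpace=yes     # enforce the memory limit (OOM kill on breach)
ConstrainDevices=yes      # hide GPUs and other devices not assigned to the job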
⸻
CPU Binding and Affinity
Slurm also handles CPU affinity internally.
This improves:
- NUMA locality
- Cache efficiency
- MPI performance
Example:
srun --cpu-bind=cores
Internally, Slurm maps tasks to specific CPU cores using topology-aware scheduling.
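Example (a sketch; ./my_app is a placeholder): asking Slurm to print the binding it applied is a quick way to verify placement:
srun --cpu-bind=verbose,cores ./my_app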
⸻
Step 7: Job Execution
Once everything is configured:
- Processes start
- Accounting begins
- Usage metrics are collected
Slurm tracks:
- CPU time
- Memory usage
- GPU usage
- Energy consumption
- Exit codes
These statistics are later visible through:
sacct
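Example (job ID and field list are illustrative; the available fields depend on the accounting plugins enabled):
sacct -j 12345 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,ReqTRES,State,ExitCode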
⸻
What Happens When the Job Finishes?
After completion:
- Resources are released
- Node state is updated
- Accounting data is stored
- Scheduler reevaluates pending jobs
The released resources immediately become available for new allocations.
⸻
Why Understanding This Matters
Knowing how Slurm allocates resources helps administrators and users:
- Troubleshoot pending jobs
- Optimize scheduling
- Improve cluster utilization
- Diagnose CPU or memory contention
- Tune fairshare policies
- Understand performance bottlenecks
It also makes debugging much easier when dealing with common pending-reason codes such as:
- ReqNodeNotAvail
- Resources
- Priority
- QOSMaxCpuPerUserLimit
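These strings appear in the Reason column of squeue. Example (the %r field in the format string prints the pending reason):
squeue -u $USER -o "%.18i %.9P %.20j %.8T %.12r"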
⸻
Final Thoughts
Slurm does much more than simply queue jobs.
Internally, it performs:
- Policy validation
- Priority calculations
- Topology-aware scheduling
- Precise resource accounting
- Cgroup enforcement
- Distributed task launching
Understanding these internals gives HPC administrators better control over cluster performance and helps users write more efficient jobs.
The next time you run sbatch, remember that an entire scheduling engine is working behind the scenes to decide exactly where and how your workload should run.