Muhammad Zubair Bin Akbar
How Slurm Handles Resource Allocation Internally

If you work with HPC clusters, chances are you use Slurm every day to submit jobs, monitor queues, and manage compute resources.

Most users know commands like sbatch, squeue, and sinfo, but fewer understand what actually happens internally when a job is submitted.

This article explains how Slurm handles resource allocation behind the scenes, from job submission to execution on compute nodes.

What Happens When You Submit a Job?

When a user runs:

sbatch job.sh

Slurm internally begins a multi-step workflow.

The main components involved are:

  • slurmctld → Central controller daemon
  • slurmd → Compute node daemon
  • slurmdbd → Accounting database daemon (optional but common)
  • Scheduler plugin
  • Select plugin
  • Cgroups/task plugins

Each component has a specific role in resource allocation.

Step 1: Job Submission

The sbatch command sends the job request to slurmctld.

The request includes:

  • Number of nodes
  • CPUs
  • Memory
  • GPUs
  • Time limit
  • Partition
  • Constraints
  • QoS
  • Account information

Example:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=128G
#SBATCH --time=04:00:00

At this stage, Slurm creates a job record and places it into the pending queue.
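The submission step can be sketched as a tiny model: a job record is created and appended to a pending queue. The class and field names below are invented for illustration, not Slurm's actual internal structures:

```python
from dataclasses import dataclass
from collections import deque

# Hypothetical, simplified job record; slurmctld's real record holds
# far more state (priority, QoS, constraints, accounting info, ...).
@dataclass
class JobRecord:
    job_id: int
    nodes: int
    cpus_per_node: int
    mem_gb: int
    time_limit: str
    state: str = "PENDING"

pending_queue = deque()

def submit(job: JobRecord) -> int:
    """Mimics sbatch: register the job and return its id."""
    pending_queue.append(job)
    return job.job_id

# The #SBATCH directives above, expressed as a record
jid = submit(JobRecord(job_id=1001, nodes=2, cpus_per_node=32,
                       mem_gb=128, time_limit="04:00:00"))
print(jid, pending_queue[0].state)  # 1001 PENDING
```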

Step 2: Job Validation

Before scheduling the job, Slurm validates several things internally.

User & Account Checks

Slurm verifies:

  • User permissions
  • Account associations
  • QoS limits
  • Fairshare policies
  • Partition access

If accounting is enabled, slurmdbd provides usage statistics and limits.
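A minimal sketch of this validation pass, assuming two of the checks listed above (partition access and a QoS CPU cap); the dictionaries and reason strings are illustrative, though the reason codes mirror what `squeue` reports:

```python
# Illustrative validation: reject a job before it ever reaches the
# scheduler if it violates the user's associations or QoS limits.
def validate(job, user):
    if job["partition"] not in user["allowed_partitions"]:
        return False, "PartitionAccessDenied"
    if job["cpus"] > user["qos_max_cpus"]:
        return False, "QOSMaxCpuPerUserLimit"
    return True, "OK"

user = {"allowed_partitions": {"batch", "gpu"}, "qos_max_cpus": 256}
print(validate({"partition": "batch", "cpus": 64}, user))  # (True, 'OK')
print(validate({"partition": "debug", "cpus": 64}, user))  # rejected
```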

Step 3: Scheduler Evaluation

Now the scheduler starts evaluating the job.

The default scheduler in Slurm is:

sched/backfill

This scheduler performs two important tasks:

Main Scheduling Pass

It checks:

  • Available resources
  • Job priority
  • Node states
  • Reservations
  • Limits

Backfill Scheduling

Backfill allows smaller jobs to run without delaying higher priority jobs.

This improves overall cluster utilization.
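The core backfill rule can be sketched in a few lines: a lower-priority job may start now only if it both fits in the currently free resources and will finish before the highest-priority pending job's reserved start time. This is a toy model, not Slurm's actual backfill code:

```python
# Toy backfill check (times in arbitrary units).
def can_backfill(job_runtime, free_cpus, job_cpus, reservation_start, now=0):
    fits = job_cpus <= free_cpus
    finishes_in_time = now + job_runtime <= reservation_start
    return fits and finishes_in_time

# A big job has the machine reserved at t=10; a small 4-CPU job that
# runs for 5 units can slot in without delaying it, a 20-unit job cannot.
print(can_backfill(job_runtime=5, free_cpus=16, job_cpus=4,
                   reservation_start=10))   # True
print(can_backfill(job_runtime=20, free_cpus=16, job_cpus=4,
                   reservation_start=10))   # False
```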

How Job Priority Is Calculated

Slurm calculates a dynamic priority score.

Factors include:

  • Fairshare usage
  • Job age
  • Job size
  • Partition priority
  • QoS priority
  • Association priority

Internally, the priority plugin combines these values into a single score.

Example (each factor is normalized to the range 0–1, then multiplied by an admin-configured weight and summed):

Priority = (PriorityWeightAge * AgeFactor)
         + (PriorityWeightFairshare * FairshareFactor)
         + (PriorityWeightJobSize * JobSizeFactor)
         + (PriorityWeightPartition * PartitionFactor)
         + (PriorityWeightQOS * QOSFactor)

Higher score means earlier scheduling.
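A minimal sketch of this weighted-sum calculation; the weight values below are made up for the example, not Slurm defaults:

```python
# Each factor is assumed already normalized to 0.0-1.0, as the
# multifactor priority plugin does before weighting.
WEIGHTS = {"age": 1000, "fairshare": 10000, "jobsize": 500,
           "partition": 1000, "qos": 2000}

def priority(factors):
    """Combine normalized factors into a single integer score."""
    return int(sum(WEIGHTS[k] * factors.get(k, 0.0) for k in WEIGHTS))

print(priority({"age": 0.5, "fairshare": 0.8, "jobsize": 0.1,
                "partition": 1.0, "qos": 0.0}))  # 9550
```

With these weights, fairshare dominates: a heavy past consumer of the cluster loses far more priority than an old but large job gains.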

Step 4: Resource Selection

Once the scheduler decides to run the job, Slurm uses the select plugin.

Most clusters use:

select/cons_tres

This plugin handles consumable resources using TRES.

What Are TRES?

TRES stands for Trackable RESources.

Examples:

  • CPU
  • Memory
  • GPU
  • Node
  • License
  • Burst buffer

This model lets Slurm track and account for each resource type precisely.
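TRES bookkeeping can be pictured as a counter map per node: allocation subtracts from the free counters, job completion adds back. A purely illustrative sketch (the `gres/gpu` key follows Slurm's naming, the rest is invented):

```python
# Free TRES counters for one node.
node_tres = {"cpu": 64, "mem_gb": 512, "gres/gpu": 4}

def allocate(tres, request):
    """Subtract the request from the free counters, all-or-nothing."""
    if any(request[k] > tres.get(k, 0) for k in request):
        return False          # insufficient resources, job stays pending
    for k, v in request.items():
        tres[k] -= v
    return True

print(allocate(node_tres, {"cpu": 32, "gres/gpu": 2}))  # True
print(node_tres["cpu"], node_tres["gres/gpu"])          # 32 2
```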

Internal Node Selection

The select plugin now determines:

  • Which nodes are eligible
  • How CPUs are distributed
  • Memory allocation
  • GPU placement
  • Socket/core binding

Slurm checks node topology information stored in memory by slurmctld.

Example:

NodeA:
  64 CPUs
  512 GB RAM
  4 GPUs

If the job requests:

32 CPUs + 2 GPUs

Slurm reserves exactly those resources internally.
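The eligibility-filtering part of node selection can be sketched as a first-fit pass over the node inventory. The inventory and policy below are toy values; real selection also weighs topology, weights, and packing strategy:

```python
# Toy node inventory: free resources per node.
nodes = {
    "nodeA": {"cpu": 64, "mem_gb": 512, "gpu": 4},
    "nodeB": {"cpu": 32, "mem_gb": 256, "gpu": 0},
}

def first_fit(request):
    """Return the first node that can satisfy every requested resource."""
    for name, free in nodes.items():
        if all(free.get(k, 0) >= v for k, v in request.items()):
            return name
    return None  # no eligible node: the job waits in the queue

print(first_fit({"cpu": 32, "gpu": 2}))  # nodeA
print(first_fit({"cpu": 16, "gpu": 8}))  # None
```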

Step 5: Resource Reservation

After node selection, Slurm marks resources as allocated.

Internally:

  • CPUs become unavailable to other jobs
  • Memory counters are reduced
  • GPUs are reserved
  • Node state changes

You can observe this using:

scontrol show node

or

squeue

Step 6: Launching the Job

Now slurmctld contacts the slurmd daemon on allocated nodes.

The compute node daemon performs:

  • Environment setup
  • UID/GID validation
  • Cgroup creation
  • CPU binding
  • Memory enforcement
  • Task launching

How Cgroups Enforce Limits

Modern Slurm clusters heavily rely on Linux cgroups.

Cgroups ensure a job cannot exceed allocated resources.

Examples:

CPU Enforcement

Only allocated CPU cores are accessible

Memory Enforcement

Memory usage beyond the limit triggers an OOM kill

GPU Isolation

Only assigned GPUs are visible

This is why users see:

CUDA_VISIBLE_DEVICES=0

automatically set inside jobs.
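The environment side of GPU isolation can be sketched as follows: before launching the task, the node daemon exports only the assigned GPU indices, so CUDA enumerates just those devices. The helper name is invented; `CUDA_VISIBLE_DEVICES` itself is the real variable:

```python
import os

def launch_env(assigned_gpus):
    """Build the task environment with only the assigned GPUs visible."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in assigned_gpus)
    return env

print(launch_env([0])["CUDA_VISIBLE_DEVICES"])     # 0
print(launch_env([2, 3])["CUDA_VISIBLE_DEVICES"])  # 2,3
```

Note that on real clusters cgroup device isolation backs this up, so a job cannot defeat it simply by unsetting the variable.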

CPU Binding and Affinity

Slurm also handles CPU affinity internally.

This improves:

  • NUMA locality
  • Cache efficiency
  • MPI performance

Example:

srun --cpu-bind=cores

Internally, Slurm maps tasks to specific CPU cores using topology-aware scheduling.
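The idea behind topology-aware binding can be sketched as packing each task's cores within one socket so its threads share a NUMA domain. This toy mapper is not Slurm's algorithm, just an illustration of the constraint:

```python
def bind_tasks(n_tasks, cores_per_task, cores_per_socket=8):
    """Assign core lists to tasks without straddling socket boundaries."""
    bindings, core = [], 0
    for _ in range(n_tasks):
        # If the task would cross a socket boundary, skip to the next socket
        if core % cores_per_socket + cores_per_task > cores_per_socket:
            core += cores_per_socket - core % cores_per_socket
        bindings.append(list(range(core, core + cores_per_task)))
        core += cores_per_task
    return bindings

# Three 3-core tasks on 8-core sockets: the third task starts on socket 1
print(bind_tasks(3, 3))  # [[0, 1, 2], [3, 4, 5], [8, 9, 10]]
```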

Step 7: Job Execution

Once everything is configured:

  • Processes start
  • Accounting begins
  • Usage metrics are collected

Slurm tracks:

  • CPU time
  • Memory usage
  • GPU usage
  • Energy consumption
  • Exit codes

These statistics are later visible through:

sacct

What Happens When the Job Finishes?

After completion:

  1. Resources are released
  2. Node state is updated
  3. Accounting data is stored
  4. Scheduler reevaluates pending jobs

The released resources immediately become available for new allocations.

Why Understanding This Matters

Knowing how Slurm allocates resources helps administrators and users:

  • Troubleshoot pending jobs
  • Optimize scheduling
  • Improve cluster utilization
  • Diagnose CPU or memory contention
  • Tune fairshare policies
  • Understand performance bottlenecks

It also makes debugging much easier when you encounter common job reason codes like:

  • ReqNodeNotAvail
  • Resources
  • Priority
  • QOSMaxCpuPerUserLimit

Final Thoughts

Slurm does much more than simply queue jobs.

Internally, it performs:

  • Policy validation
  • Priority calculations
  • Topology-aware scheduling
  • Precise resource accounting
  • Cgroup enforcement
  • Distributed task launching

Understanding these internals gives HPC administrators better control over cluster performance and helps users write more efficient jobs.

The next time you run sbatch, remember that an entire scheduling engine is working behind the scenes to decide exactly where and how your workload should run.
