If you work with HPC clusters, chances are you use Slurm every day to submit jobs, monitor queues, and manage compute resources.
Most users know commands like sbatch, squeue, and sinfo, but fewer understand what actually happens internally when a job is submitted.
This article explains how Slurm handles resource allocation behind the scenes, from job submission to execution on compute nodes.
⸻
What Happens When You Submit a Job?
When a user runs:
sbatch job.sh
Slurm begins a multi-step workflow internally.
The main components involved are:
- slurmctld → Central controller daemon
- slurmd → Compute node daemon
- slurmdbd → Accounting database daemon (optional but common)
- Scheduler plugin
- Select plugin
- Cgroups/task plugins
Each component has a specific role in resource allocation.
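Example (a quick way to see which of these plugins and daemons a particular site uses; output varies by Slurm version and configuration):
scontrol ping                    # is slurmctld (and any backup controller) responding?
scontrol show config | grep -E 'SchedulerType|SelectType|TaskPlugin|ProctrackType|AccountingStorageType'
sinfo -N -l | head               # per-node state that slurmd reports back to slurmctld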
⸻
Step 1: Job Submission
The sbatch command sends the job request to slurmctld.
The request includes:
- Number of nodes
- CPUs
- Memory
- GPUs
- Time limit
- Partition
- Constraints
- QoS
- Account information
Example:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=128G
#SBATCH --time=04:00:00
At this stage, Slurm creates a job record and places it into the pending queue.
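Example (job ID 12345 is a placeholder): you can inspect the new job record and see why it is still pending:
scontrol show job 12345                       # full job record held by slurmctld
squeue -j 12345 -o "%.10i %.9P %.8T %.12r"    # state and pending reason
squeue -j 12345 --start                       # scheduler's estimated start time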
⸻
Step 2: Job Validation
Before scheduling the job, Slurm validates several things internally.
User & Account Checks
Slurm verifies:
- User permissions
- Account associations
- QoS limits
- Fairshare policies
- Partition access
If accounting is enabled, slurmdbd provides usage statistics and limits.
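Example (assuming accounting via slurmdbd is enabled; exact field names vary slightly between Slurm versions): the associations and QoS limits checked at this step are visible with sacctmgr:
sacctmgr show associations user=$USER format=Cluster,Account,Partition,QOS,Fairshare
sacctmgr show qos format=Name,Priority,MaxWall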
⸻
Step 3: Scheduler Evaluation
Now the scheduler starts evaluating the job.
The default scheduler in Slurm is:
sched/backfill
This scheduler performs two important tasks:
Main Scheduling Pass
It checks:
- Available resources
- Job priority
- Node states
- Reservations
- Limits
Backfill Scheduling
Backfill allows lower-priority jobs to start early in scheduling gaps, as long as doing so does not delay the expected start time of higher-priority jobs.
This improves overall cluster utilization.
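Example (a sketch: the parameter names are real backfill options, but the values are illustrative and site-specific):
# Query the running configuration:
scontrol show config | grep -E 'SchedulerType|SchedulerParameters'
# Typical slurm.conf entries on a backfill-scheduled cluster:
SchedulerType=sched/backfill
SchedulerParameters=bf_window=1440,bf_max_job_test=500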
⸻
How Job Priority Is Calculated
Slurm calculates a dynamic priority score.
Factors include:
- Fairshare usage
- Job age
- Job size
- Partition priority
- QoS priority
- Association priority
Internally, the priority plugin (priority/multifactor on most production clusters) combines these values into a single score, with each factor scaled by a site-configured weight.
Simplified example:
Priority = (AgeWeight × Age) + (FairshareWeight × Fairshare) + (JobSizeWeight × JobSize) + (PartitionWeight × Partition) + (QOSWeight × QoS)
Higher score means earlier scheduling.
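If the priority/multifactor plugin is enabled, the weights live in slurm.conf and sprio shows the per-factor breakdown. Example (weights and job ID are illustrative):
# slurm.conf
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=500
PriorityWeightPartition=1000
PriorityWeightQOS=2000
# Per-factor priority breakdown for a pending job:
sprio -l -j 12345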
⸻
Step 4: Resource Selection
Once the scheduler decides to run the job, Slurm uses the select plugin.
Most clusters use:
select/cons_tres
This plugin handles consumable resources using TRES.
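Example (a typical slurm.conf sketch; CR_Core_Memory makes both cores and memory consumable, and parameters vary by site):
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory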
⸻
What Are TRES?
TRES stands for:
Trackable RESources
Examples:
- CPU
- Memory
- GPU
- Node
- License
- Burst buffer
This model allows Slurm to track resources very precisely.
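Example (the sacctmgr query lists what a cluster actually tracks; the slurm.conf line is a common but site-specific addition that makes GPUs a tracked TRES):
sacctmgr show tres
# slurm.conf
AccountingStorageTRES=gres/gpu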
⸻
Internal Node Selection
The select plugin now determines:
- Which nodes are eligible
- How CPUs are distributed
- Memory allocation
- GPU placement
- Socket/core binding
Slurm checks node topology information stored in memory by slurmctld.
Example:
NodeA:
64 CPUs
512 GB RAM
4 GPUs
If the job requests:
32 CPUs + 2 GPUs
Slurm reserves exactly those resources internally.
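The matching request in a job script could look like this (a sketch; --gres assumes the site defines a gpu GRES):
#SBATCH --ntasks=32
#SBATCH --gres=gpu:2
#SBATCH --mem=128G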
⸻
Step 5: Resource Reservation
After node selection, Slurm marks resources as allocated.
Internally:
- CPUs become unavailable to other jobs
- Memory counters are reduced
- GPUs are reserved
- Node state changes
You can observe this using:
scontrol show node
or
squeue
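Example (a trimmed sketch of the node record while the job above is running; the node name and values are illustrative):
scontrol show node NodeA

NodeName=NodeA CPUAlloc=32 CPUTot=64
   RealMemory=512000 AllocMem=131072
   Gres=gpu:4
   AllocTRES=cpu=32,mem=128G,gres/gpu=2
   State=MIXED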
⸻
Step 6: Launching the Job
Now slurmctld contacts the slurmd daemon on each allocated node.
The compute node daemon performs:
- Environment setup
- UID/GID validation
- Cgroup creation
- CPU binding
- Memory enforcement
- Task launching
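Which of these steps are actually enforced depends on the plugins a site enables. A common slurm.conf combination (a sketch, not a universal default):
TaskPlugin=task/cgroup,task/affinity
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup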
⸻
How Cgroups Enforce Limits
Modern Slurm clusters rely heavily on Linux cgroups.
Cgroups ensure a job cannot exceed allocated resources.
Examples:
- CPU enforcement: only the allocated CPU cores are accessible
- Memory enforcement: memory usage beyond the limit triggers an OOM kill
- GPU isolation: only the assigned GPUs are visible
This is why users see:
CUDA_VISIBLE_DEVICES=0
set automatically inside GPU jobs: the device cgroup hides unassigned GPUs, while Slurm's GRES plugin exports the matching environment variable.
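That enforcement is switched on in cgroup.conf. A minimal sketch of the relevant options:
# cgroup.conf
ConstrainCores=yes        # restrict tasks to the allocated cores
ConstrainRAMSpace=yes     # enforce the memory limit (OOM kill on breach)
ConstrainDevices=yes      # hide GPUs and other devices not assigned to the job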
⸻
CPU Binding and Affinity
Slurm also handles CPU affinity internally.
This improves:
- NUMA locality
- Cache efficiency
- MPI performance
Example:
srun --cpu-bind=cores
Internally, Slurm maps tasks to specific CPU cores using topology-aware scheduling.
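Example (a sketch; ./my_app is a placeholder): asking Slurm to print the binding it applied is a quick way to verify placement:
srun --cpu-bind=verbose,cores ./my_app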
⸻
Step 7: Job Execution
Once everything is configured:
- Processes start
- Accounting begins
- Usage metrics are collected
Slurm tracks:
- CPU time
- Memory usage
- GPU usage
- Energy consumption
- Exit codes
These statistics are later visible through:
sacct
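Example (job ID and field list are illustrative; the available fields depend on the accounting plugins enabled):
sacct -j 12345 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,ReqTRES,State,ExitCode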
⸻
What Happens When the Job Finishes?
After completion:
- Resources are released
- Node state is updated
- Accounting data is stored
- Scheduler reevaluates pending jobs
The released resources immediately become available for new allocations.
⸻
Why Understanding This Matters
Knowing how Slurm allocates resources helps administrators and users:
- Troubleshoot pending jobs
- Optimize scheduling
- Improve cluster utilization
- Diagnose CPU or memory contention
- Tune fairshare policies
- Understand performance bottlenecks
It also makes debugging much easier when dealing with common pending-reason codes such as:
- ReqNodeNotAvail
- Resources
- Priority
- QOSMaxCpuPerUserLimit
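These strings appear in the Reason column of squeue. Example (the %r field in the format string prints the pending reason):
squeue -u $USER -o "%.18i %.9P %.20j %.8T %.12r"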
⸻
Final Thoughts
Slurm does much more than simply queue jobs.
Internally, it performs:
- Policy validation
- Priority calculations
- Topology-aware scheduling
- Precise resource accounting
- Cgroup enforcement
- Distributed task launching
Understanding these internals gives HPC administrators better control over cluster performance and helps users write more efficient jobs.
The next time you run sbatch, remember that an entire scheduling engine is working behind the scenes to decide exactly where and how your workload should run.