If you’ve worked with Slurm long enough, you’ve definitely seen this:
PD (Pending)
You submit a job, everything looks fine… and then nothing happens.
No errors. No logs. Just waiting.
Let’s break down why this happens and, more importantly, how to fix it without guessing.
What “Pending” Actually Means
A Slurm job in PENDING (PD) state simply means:
The scheduler hasn’t found a suitable way to run your job yet.
That could be due to:
- Resource shortages
- Configuration limits
- Priority issues
- Or constraints you didn’t even realize you set
The key is: Slurm always tells you why — you just need to ask the right way.
Step 1: Check the Real Reason
Run:
squeue -j <job_id> -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
Look at the NODELIST(REASON) column.
Common outputs:
- (Resources)
- (Priority)
- (ReqNodeNotAvail)
- (QOSMaxJobsPerUserLimit)
This reason is your starting point — no guesswork needed.
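When you have many queued jobs, it helps to see every pending reason at a glance. A minimal sketch — the `summarize_reasons` helper is my own name, not a Slurm tool:

```shell
# Count pending jobs grouped by their reason code.
summarize_reasons() {
  sort | uniq -c | sort -rn
}

# On a real cluster, feed it squeue output (%r prints the bare reason,
# -h drops the header, -t PENDING filters by state):
#   squeue -u "$USER" -t PENDING -h -o "%r" | summarize_reasons

# Demo with sample reasons, since the helper itself needs no cluster:
printf 'Priority\nPriority\nResources\n' | summarize_reasons
```

If most of your jobs share one reason, fix that first before touching the rest.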
Most Common Reasons (and Fixes)
1. (Resources) — Not Enough Resources Available
Meaning:
Your job is asking for more than what’s currently free.
Example:
- Too many CPUs
- Too much memory
- GPU request when none are available
Fix:
- Reduce your request:
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
- Or wait (if the request is valid but large)
Pro tip: Check cluster usage with:
sinfo
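Plain sinfo is terse. The `%C` format field breaks CPUs down as Allocated/Idle/Other/Total per partition, which gives a much quicker picture of what is actually free:

```shell
# Per-partition view: partition, node state, node count, and CPUs (A/I/O/T).
# All flags are standard sinfo options; run this on a login node.
sinfo -o "%P %t %D %C"
```

If the Idle count is near zero everywhere, your job is waiting on real scarcity, not a misconfiguration.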
2. (Priority) — Your Job Is in Line
Meaning:
Other jobs have higher priority than yours.
Slurm prioritizes based on:
- Fairshare
- Job age
- Partition rules
- QOS
Fix:
- Check priority:
sprio -j <job_id>
- If possible:
- Use a different partition
- Reduce requested resources (smaller jobs start faster)
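To see why your priority is low, look at its individual factors and your fairshare standing. Both commands are standard Slurm tools; the exact columns depend on how the cluster weights its priority plugin:

```shell
# Break a job's priority into its factors (age, fairshare, partition, QOS).
sprio -j <job_id>

# Show your fairshare usage; a low FairShare value means your account has
# consumed more than its share recently, so new jobs rank lower for a while.
sshare -u "$USER"
```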
3. (ReqNodeNotAvail) — Requested Node Is Not Usable
Meaning:
You requested a node that is:
- Down
- Drained
- Reserved
Fix:
- Avoid hardcoding nodes:
#SBATCH --nodelist=node01 ❌
- Check node state:
sinfo -R
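If you really need a class of hardware rather than one specific machine, features/constraints (or exclusion) keep the scheduler flexible. The feature name below is hypothetical — list what your cluster actually defines with `sinfo -o "%N %f"`:

```shell
# Prefer a hardware feature over a hardcoded hostname
# ("skylake" is an illustrative feature name, not a Slurm builtin):
#SBATCH --constraint=skylake

# Or keep the scheduler's free choice but skip a known-bad node:
#SBATCH --exclude=node01
```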
4. (QOSMaxJobsPerUserLimit) — You Hit a Limit
Meaning:
You’ve hit the maximum number of jobs your QOS allows you to run (or queue) at once.
Fix:
- Check your running jobs:
squeue -u $USER
- Wait or cancel unnecessary jobs
- Talk to your admin if limits are too restrictive
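You can usually read the limit itself instead of guessing at it. Assuming accounting (slurmdbd) is configured, `sacctmgr` shows the per-user caps on each QOS:

```shell
# Per-user job caps on each QOS (requires slurmdbd accounting).
# MaxJobsPU = max running jobs per user; MaxSubmitPU = max queued + running.
sacctmgr show qos format=Name,MaxJobsPU,MaxSubmitPU
```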
5. (PartitionLimit) — Partition Constraints
Meaning:
Your job exceeds partition limits (time, memory, nodes).
Fix:
- Check partition config:
sinfo -o "%P %l %m %c"
- Adjust your script:
#SBATCH --time=01:00:00
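For the complete set of partition limits (default and maximum time, node counts, allowed QOS), `scontrol` shows more than the sinfo one-liner:

```shell
# Dump every limit the partition enforces; compare against your #SBATCH lines.
scontrol show partition <partition_name>
```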
Advanced Debugging (Admins & Power Users)
If the reason isn’t obvious:
Check job details:
scontrol show job <job_id>
Look at scheduler decisions:
sdiag
Check Slurm logs:
- slurmctld.log
- slurmd.log
These often reveal hidden issues like:
- Invalid accounts
- Association limits
- Misconfigured QOS
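Association limits in particular never show up in squeue output. If accounting is set up, this shows the limits attached to your user/account association — the field list here is a reasonable subset, adjust as needed:

```shell
# Limits tied to your user/account association (requires slurmdbd).
sacctmgr show assoc where user="$USER" \
  format=Account,User,Partition,MaxJobs,MaxSubmit,QOS
```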
Real-World Tip: Smaller Jobs Start Faster
Slurm’s backfill scheduler favors jobs it can slot into gaps left around larger reservations — and smaller requests fit into more gaps.
If your job asks for:
- 2 GPUs → might wait hours
- 1 GPU → might start immediately
Strategy:
Break large jobs into smaller chunks when possible.
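A job array is the idiomatic way to do this: each element asks for one GPU and can start independently as gaps open up. The script name and array range below are illustrative:

```shell
#!/bin/bash
# Ten independent single-GPU tasks instead of one large multi-GPU job.
#SBATCH --array=0-9
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# Hypothetical worker script; SLURM_ARRAY_TASK_ID selects the chunk (0..9).
srun ./process_chunk --index "$SLURM_ARRAY_TASK_ID"
```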
Quick Checklist
Before blaming Slurm, check:
- Did I request too many resources?
- Am I hitting a user/job limit?
- Is my priority too low?
- Did I accidentally constrain nodes?
- Does my partition allow this job?
Final Thought
Slurm isn’t “stuck” when jobs are pending — it’s being strict and logical.
The difference between a beginner and an experienced HPC user is simple:
Beginners wait. Experts check the reason and fix it.