If you are new to High Performance Computing, one of the first things you will do is submit a job using Slurm.
It can feel confusing at first, but once you understand the basics it becomes straightforward.
Let’s walk through how to write your first Slurm job script.
What Is a Slurm Job Script?
A Slurm job script is just a simple shell script that tells the scheduler:
- What resources you need
- How long your job will run
- What command should be executed
Instead of running your program directly, you submit this script to Slurm, which queues the job and runs it when the requested resources become available.
Basic Structure of a Job Script
A typical Slurm script looks like this:
#!/bin/bash
#SBATCH --job-name=test_job
#SBATCH --output=output.log
#SBATCH --error=error.log
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
echo "Hello from Slurm!"
hostname
Let’s break this down.
Understanding the SBATCH Directives
Lines starting with #SBATCH are instructions for Slurm. The shell treats them as comments, but the scheduler reads them as resource requests.
Job Name
#SBATCH --job-name=test_job
This is just a label that identifies your job in queue listings.
Output and Error Files
#SBATCH --output=output.log
#SBATCH --error=error.log
- output.log → stores normal output
- error.log → stores errors
Very useful for debugging.
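If you run many jobs, fixed filenames like output.log get overwritten each time. Slurm supports filename patterns such as %j (job ID) and %x (job name), so each job writes to its own files:

```shell
#SBATCH --output=%x_%j.out   # e.g. test_job_12345.out
#SBATCH --error=%x_%j.err    # e.g. test_job_12345.err
```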
Time Limit
#SBATCH --time=00:10:00
This caps the job at 10 minutes of wall-clock time. If it runs longer, Slurm kills it.
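The value accepts several formats: plain minutes, HH:MM:SS, and a days-hours form are all valid:

```shell
#SBATCH --time=30            # 30 minutes
#SBATCH --time=01:30:00      # 1 hour 30 minutes (HH:MM:SS)
#SBATCH --time=2-00:00:00    # 2 days (D-HH:MM:SS)
```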
Tasks and CPUs
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
- ntasks → number of processes
- cpus-per-task → CPU cores per process
For simple jobs, keep both as 1.
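When you do request more than one core for a multi-threaded program, a common pattern is to pass the allocation on through the SLURM_CPUS_PER_TASK environment variable, which Slurm sets inside a running job. A minimal sketch (the :-1 fallback keeps it runnable outside Slurm, and OMP_NUM_THREADS assumes an OpenMP program):

```shell
#SBATCH --cpus-per-task=4

# Tell an OpenMP program how many threads to use;
# fall back to 1 when not running under Slurm.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "Using $OMP_NUM_THREADS threads"
```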
Memory
#SBATCH --mem=1G
This requests 1 GB of RAM.
If your job uses more than it requested, it may be killed with an out-of-memory error.
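Note that --mem requests memory per node. An alternative is --mem-per-cpu, which scales the request with the number of allocated cores:

```shell
#SBATCH --mem=1G           # 1 GB total per node for the job
#SBATCH --mem-per-cpu=1G   # 1 GB for each allocated CPU core
```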
What Goes Inside the Script
After the SBATCH lines, you add the commands you want to run:
echo "Hello from Slurm!"
hostname
In real use cases, this could be:
- Running a Python script
- Executing a simulation
- Launching an MPI job
Example:
python my_script.py
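Putting it together, a job script for a Python program might look like the sketch below. The module load line is an assumption; module names differ between clusters, and my_script.py stands in for your own program.

```shell
#!/bin/bash
#SBATCH --job-name=py_job
#SBATCH --output=py_job_%j.log
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G

# Load the cluster's Python environment (name varies by site)
module load python

python my_script.py
```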
Submitting the Job
Once your script is ready, save it as:
job.sh
Then submit it:
sbatch job.sh
You will get a job ID like:
Submitted batch job 12345
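If you want the job ID inside a script, for example to chain jobs together, you can capture it. sbatch --parsable prints only the ID; alternatively, strip the last word off the normal output line, as in this sketch using the example line above:

```shell
# On a real cluster: jobid=$(sbatch --parsable job.sh)
# Here we parse the human-readable line instead:
line="Submitted batch job 12345"
jobid=${line##* }   # keep everything after the last space
echo "$jobid"
```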
Checking Job Status
To see if your job is running:
squeue -u your_username
To get more details:
scontrol show job 12345
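A few useful variations (these assume squeue and watch are available on the login node):

```shell
squeue -u $USER                # only your jobs
squeue -j 12345                # a single job by ID
watch -n 10 squeue -u $USER    # refresh the view every 10 seconds
```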
Viewing Output
After the job finishes:
cat output.log
cat error.log
This is where you check results or debug issues.
Common Beginner Mistakes
A few things that often go wrong:
- Requesting too little memory → job fails
- Setting very short time limits → job gets killed
- Running heavy jobs on login node instead of using Slurm
- Forgetting to check error logs
Final Thoughts
Writing your first Slurm job script might seem small, but it is the foundation of everything you do in HPC.
Once you understand this, you can:
- Run bigger workloads
- Scale across multiple nodes
- Work with GPUs and parallel jobs
Start simple, test small, and build from there.