A practical introduction to Linux, SLURM, GPU clusters, AI workloads, and supercomputing workflows.
Supercomputers are no longer reserved only for physicists and national laboratories.
Today, AI engineers, machine learning researchers, data scientists, and students increasingly rely on High Performance Computing (HPC) systems for:
- large-scale AI training
- scientific simulations
- big data analytics
- distributed computing
- computational research
Recently, I started exploring the documentation of the CINECA HPC infrastructure, one of Europe’s major supercomputing environments, and I realized something important:
Most beginners struggle not because HPC is impossible, but because the ecosystem feels overwhelming at first.
This article breaks down the core concepts into a practical workflow that beginners can actually understand.
What Is an HPC Cluster?
An HPC (High Performance Computing) cluster is a network of powerful computers, called nodes, that work together to solve large computational problems.
Instead of training a model on:
- one laptop CPU
- one small GPU
you may gain access to:
- hundreds of CPU cores
- multiple A100/H100 GPUs
- terabytes of memory
- ultra-fast networking
- distributed storage systems
Systems like those hosted at CINECA are used for:
- AI training
- weather forecasting
- genomics
- computational physics
- climate simulations
- deep learning research
Understanding the Architecture
Most HPC systems follow this structure:
Your Laptop
↓
Login Node
↓
SLURM Scheduler
↓
Compute Nodes
↓
GPU/CPU Execution
Let’s simplify what each layer means.
| Component | Purpose |
|---|---|
| Login Node | Used for coding, editing, compiling |
| Compute Node | Where actual jobs run |
| GPU Node | Used for AI and deep learning |
| Scheduler | Manages resources and queues |
| Storage System | Stores datasets and outputs |
One important rule:
Never run heavy workloads directly on login nodes.
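If you are ever unsure which kind of node you are on, a quick check like this helps (the login-node naming is only a convention and varies by centre):

```bash
hostname                                      # login nodes are often named login01, login02, ...
echo "${SLURM_JOB_ID:-no SLURM allocation}"   # prints the fallback text unless you are inside a job
```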
The Biggest Mindset Shift in HPC
On a laptop, you usually do this:
Write code → Run immediately → See output
In HPC, the workflow becomes:
Write code
↓
Submit job
↓
Wait in queue
↓
Resources allocated
↓
Execution starts
↓
Logs/results generated
HPC is asynchronous.
That is one of the biggest transitions beginners must adapt to.
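In shell terms, the asynchronous loop looks roughly like this (train.sbatch and output.log are placeholder names, introduced properly later in this article):

```bash
sbatch train.sbatch     # returns immediately with a job ID; nothing runs yet
squeue -u $USER         # the job waits in the queue (state PD) until resources free up
tail -f output.log      # once it runs, you follow the log file instead of a live terminal
```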
Connecting to the Cluster
Most HPC systems are accessed using SSH.
Example:
ssh username@login.clustername.cineca.it
Modern systems may also require:
- 2FA
- SSH certificates
- OTP authentication
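To avoid retyping the full hostname every time, many people add an alias to ~/.ssh/config. A minimal sketch reusing the placeholder hostname above (the alias name and key path are assumptions):

```bash
cat >> ~/.ssh/config <<'EOF'
Host cineca
    HostName login.clustername.cineca.it
    User username
    IdentityFile ~/.ssh/id_ed25519
EOF

ssh cineca    # 2FA/OTP prompts still apply exactly as the centre configures them
```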
Linux Skills You MUST Know
Before touching supercomputers seriously, Linux fundamentals are essential.
Navigation Commands
pwd      # print the current directory
ls       # list files and directories
cd       # change directory
mkdir    # create a directory
rm       # remove files
cp       # copy files
mv       # move or rename files
File Inspection
cat      # print a whole file
less     # scroll through a file
head     # show the first lines of a file
tail     # show the last lines of a file
nano     # simple terminal editor
vim      # powerful terminal editor
Process Monitoring
top      # live view of running processes
htop     # friendlier interactive process viewer
ps       # list processes
kill     # stop a process by its PID
If you are weak in Linux, HPC will feel painful very quickly.
Understanding Modules
HPC environments rarely allow users to install software directly on the system.
Instead, they use modules.
Modules dynamically load:
- Python versions
- CUDA versions
- compilers
- MPI libraries
- AI frameworks
Checking Available Modules
module avail
Loading Python
module load python
Viewing Loaded Modules
module list
Clearing Modules
module purge
This prevents dependency conflicts between projects.
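A typical session for a GPU Python project might look like this. The exact module names and versions are assumptions; always check module avail on your own cluster first:

```bash
module purge                 # start from a clean environment
module avail python          # see which Python modules exist
module load python/3.11      # hypothetical version string
module load cuda/12.1        # hypothetical version string
module list                  # confirm what is now loaded
```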
Understanding SLURM
Most HPC systems use a scheduler called SLURM.
SLURM manages:
- job queues
- GPU allocation
- runtime limits
- cluster resources
You do not directly “take” GPUs.
You request resources from SLURM.
Essential SLURM Commands
View Available Partitions
sinfo
View Running Jobs
squeue
Submit a Job
sbatch train.sh
Start Interactive Session
srun --pty bash
Cancel a Job
scancel JOB_ID
Your First SLURM Script
Create a file:
nano train.sbatch
Example content:
#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --mem=32G
#SBATCH --output=output.log
module purge
module load cuda
module load python
python train.py
Submit it:
sbatch train.sbatch
SLURM responds with something like:
Submitted batch job 12345
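If you want to script around the job ID, sbatch --parsable prints only the numeric ID, which is easy to capture:

```bash
JOBID=$(sbatch --parsable train.sbatch)
echo "Submitted job $JOBID"
squeue -j "$JOBID"    # check its state right away
```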
Understanding SBATCH Parameters
| Parameter | Meaning |
|---|---|
| --job-name | Name of the job |
| --partition | Queue name |
| --nodes | Number of machines |
| --cpus-per-task | CPU cores |
| --gres=gpu:1 | Request GPU |
| --time | Runtime limit |
| --mem | RAM allocation |
Monitoring Jobs
Check Queue
squeue -u username
Live Monitoring
watch squeue -u username
View Job Details
scontrol show job 12345
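For jobs that have already finished, sacct shows accounting data, assuming accounting is enabled on your cluster:

```bash
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS
```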
Understanding Job States
| State | Meaning |
|---|---|
| PD | Pending |
| R | Running |
| CG | Completing |
| CD | Completed |
| F | Failed |
If your job stays in PD, it usually means:
- resources are busy
- queue is full
- requested resources are too large
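SLURM also reports why a job is pending. One way to see the reason (the %r field prints values such as Resources or Priority):

```bash
squeue -u username -o "%.10i %.12T %.20r"   # job ID, state, and pending reason
scontrol show job 12345 | grep -i reason    # same information for a single job
```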
Interactive GPU Sessions
Interactive sessions are useful for:
- debugging
- notebook testing
- experimentation
Example:
srun --partition=gpu \
--gres=gpu:1 \
--cpus-per-task=8 \
--mem=32G \
--pty bash
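Once the shell opens on a compute node, it is worth verifying the allocation before experimenting:

```bash
nvidia-smi            # should list exactly the GPU(s) SLURM assigned to you
echo $SLURM_JOB_ID    # confirms you are inside an allocation
exit                  # ends the session and releases the resources
```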
Python Environments on HPC
Never rely entirely on system Python for serious AI work.
Most researchers use:
- Conda
- virtual environments
Create Environment
conda create -n ml python=3.11
Activate:
conda activate ml
Install packages:
pip install torch transformers datasets
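Inside a batch script, the same environment has to be activated before training starts. A sketch, assuming an Anaconda module exists on the cluster (the module name varies per system):

```bash
module load anaconda3    # name is an assumption; check module avail on your system
source activate ml       # or `conda activate ml` if conda is initialised in your shell
python train.py
```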
GPU Training Workflow
A common AI workflow looks like this:
Local Machine
↓
Upload Dataset
↓
Prepare Environment
↓
Write SLURM Script
↓
Submit Training
↓
Monitor Logs
↓
Download Results
Storage Systems Matter More Than Beginners Think
HPC systems typically have:
- home storage
- scratch storage
- project storage
- high-speed temporary storage
A common beginner mistake is training directly from slow home directories.
For large AI workloads:
- use scratch storage
- optimize data loading
- clean temporary files regularly
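A common pattern is to stage the dataset onto scratch once and clean up old temporary files regularly. All paths below are placeholders:

```bash
cp -r $HOME/datasets/my_dataset /scratch/project/$USER/      # stage once, train many times
# ... run training against the scratch copy ...
find /scratch/project/$USER/tmp -type f -mtime +7 -delete    # prune temp files older than a week
```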
Transferring Files
SCP Example
scp dataset.zip username@cluster:/scratch/project/
RSYNC Example
rsync -av dataset/ username@cluster:/scratch/project/
Running PyTorch on HPC
Simple GPU check:
import torch
print(torch.cuda.is_available())        # True if a GPU is visible to PyTorch
print(torch.cuda.get_device_name(0))    # name of the first allocated GPU
SLURM script:
#SBATCH --gres=gpu:1
python train.py
Multi-GPU Training
Example:
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
PyTorch distributed execution:
torchrun --nproc_per_node=4 train.py
This is where HPC becomes significantly more powerful than normal laptops.
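Putting it together, a single-node, 4-GPU batch script might look like this sketch (partition, module names, and resource sizes are assumptions, following the earlier example):

```bash
#!/bin/bash
#SBATCH --job-name=multi_gpu_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --time=08:00:00
#SBATCH --mem=128G
#SBATCH --output=multi_gpu.log

module purge
module load cuda
module load python

torchrun --nproc_per_node=4 train.py
```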
MPI and Distributed Computing
Scientific applications often use MPI.
Example:
module load openmpi
mpirun -np 16 ./simulation
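Inside a batch job, srun is usually the preferred launcher because it inherits the allocation from SLURM (whether srun or mpirun works best depends on how the MPI library was built). A sketch for a 16-rank run across two nodes:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_simulation
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --output=simulation.log

module load openmpi
srun ./simulation     # 16 MPI ranks in total (2 nodes x 8 tasks per node)
```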
MPI is heavily used in:
- physics
- computational fluid dynamics
- simulations
- weather forecasting
Containers in HPC
Unlike cloud-native environments that heavily use Docker, HPC systems often use:
- Singularity
- Apptainer
Example:
module load singularity
Run container:
singularity exec container.sif python train.py
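For GPU workloads, add the --nv flag so the container can see the host's NVIDIA driver:

```bash
singularity exec --nv container.sif python train.py
# Apptainer uses the same syntax:
# apptainer exec --nv container.sif python train.py
```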
Common Beginner Mistakes
Running Heavy Jobs on Login Nodes
Login nodes are shared by everyone, so heavy processes slow them down for all users and annoy system administrators very quickly.
Requesting Too Many Resources
Large requests stay in the queue longer.
Ignoring Logs
Logs explain most failures.
Example:
tail -f output.log
CUDA Version Mismatches
Your PyTorch CUDA version must match cluster CUDA support.
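A quick way to compare the two sides (the one-liner prints the CUDA version your PyTorch build expects):

```bash
nvidia-smi | head -n 5                                   # driver and CUDA version on the node
python -c "import torch; print(torch.version.cuda)"      # CUDA version PyTorch was built against
```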
HPC vs Cloud GPUs
| HPC | Cloud |
|---|---|
| Shared, allocation-based infrastructure | Commercial, on-demand infrastructure |
| Often research-funded | Pay-as-you-go |
| Queue-based | Immediate provisioning |
| Strong interconnects | Flexible scaling |
| Excellent for huge workloads | Excellent for startups |
Recommended Learning Path
Stage 1 — Linux
Learn:
- bash
- file systems
- SSH
- permissions
Stage 2 — SLURM
Learn:
- job submission
- monitoring
- partitions
- scheduling
Stage 3 — Python Environments
Learn:
- Conda
- CUDA
- pip
- virtual environments
Stage 4 — GPU Computing
Learn:
- PyTorch
- distributed training
- checkpointing
Stage 5 — Advanced HPC
Learn:
- MPI
- NCCL
- DeepSpeed
- multi-node training
Practice Projects for Beginners
Beginner
- Submit a hello-world SLURM job (a minimal example follows this list)
- Train MNIST on GPU
- Monitor job states
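For the first project, the hello-world script can be as small as this sketch:

```bash
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --time=00:05:00
#SBATCH --output=hello.log

echo "Hello from $(hostname) at $(date)"
```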
Intermediate
- Multi-GPU image classifier
- Jupyter on HPC
- Distributed preprocessing pipeline
Advanced
- DeepSpeed training
- MPI simulations
- Large language model fine-tuning
Final Thoughts
The people who stand out in HPC environments are usually not just the best programmers.
They are the people who understand:
- systems
- optimization
- storage bottlenecks
- resource management
- debugging workflows
- automation
Supercomputing is less about “running code fast” and more about learning how to think computationally at scale.
If you are entering AI, scientific computing, or large-scale machine learning research, learning HPC may become one of the most valuable technical skills you acquire.
Useful Resources
- CINECA HPC Documentation
- SLURM Documentation
- PyTorch Distributed Training Docs
- OpenMPI Documentation
- Apptainer Documentation
If you are already working with supercomputers or HPC clusters, what was the hardest concept for you when you first started?