Ikegbo Ogochukwu

From Zero to Supercomputing: A Beginner-Friendly Guide to Using HPC Clusters Like CINECA

A practical introduction to Linux, SLURM, GPU clusters, AI workloads, and supercomputing workflows.

Supercomputers are no longer reserved only for physicists and national laboratories.

Today, AI engineers, machine learning researchers, data scientists, and students increasingly rely on High Performance Computing (HPC) systems for:

  • large-scale AI training
  • scientific simulations
  • big data analytics
  • distributed computing
  • computational research

Recently, I started exploring the documentation of the CINECA HPC infrastructure, one of Europe’s major supercomputing environments, and I realized something important:

Most beginners struggle not because HPC is impossible, but because the ecosystem feels overwhelming at first.

This article breaks down the core concepts into a practical workflow that beginners can actually understand.


What Is an HPC Cluster?

An HPC (High Performance Computing) cluster is a network of powerful computers called nodes working together to solve computational problems.

Instead of training a model on:

  • one laptop CPU
  • one small GPU

you may gain access to:

  • hundreds of CPU cores
  • multiple A100/H100 GPUs
  • terabytes of memory
  • ultra-fast networking
  • distributed storage systems

Systems like those hosted at CINECA are used for:

  • AI training
  • weather forecasting
  • genomics
  • computational physics
  • climate simulations
  • deep learning research

Understanding the Architecture

Most HPC systems follow this structure:

Your Laptop
     ↓
Login Node
     ↓
SLURM Scheduler
     ↓
Compute Nodes
     ↓
GPU/CPU Execution

Let’s simplify what each layer means.

Component        Purpose
Login Node       Coding, editing, compiling
Compute Node     Where actual jobs run
GPU Node         AI and deep learning workloads
Scheduler        Manages resources and queues
Storage System   Stores datasets and outputs

One important rule:

Never run heavy workloads directly on login nodes.


The Biggest Mindset Shift in HPC

On a laptop, you usually do this:

Write code → Run immediately → See output

In HPC, the workflow becomes:

Write code
↓
Submit job
↓
Wait in queue
↓
Resources allocated
↓
Execution starts
↓
Logs/results generated

HPC is asynchronous.

That is one of the biggest transitions beginners must adapt to.


Connecting to the Cluster

Most HPC systems are accessed using SSH.

Example:

ssh username@login.clustername.cineca.it

Modern systems may also require:

  • 2FA
  • SSH certificates
  • OTP authentication
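
To avoid retyping the full address every time, you can add a shortcut to your SSH config. The alias, hostname, and username below are placeholders — substitute your cluster's real login address:

# ~/.ssh/config — alias, hostname, and username are placeholders
Host cineca
    HostName login.clustername.cineca.it
    User your_username

After that, connecting is simply:

ssh cineca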

Linux Skills You MUST Know

Before touching supercomputers seriously, Linux fundamentals are essential.

Navigation Commands

pwd
ls
cd
mkdir
rm
cp
mv

File Inspection

cat
less
head
tail
nano
vim

Process Monitoring

top
htop
ps
kill

If you are weak in Linux, HPC will feel painful very quickly.


Understanding Modules

HPC environments rarely let users install software system-wide.

Instead, they use modules.

Modules dynamically load:

  • Python versions
  • CUDA versions
  • compilers
  • MPI libraries
  • AI frameworks

Checking Available Modules

module avail

Loading Python

module load python

Viewing Loaded Modules

module list

Clearing Modules

module purge

This prevents dependency conflicts between projects.
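
In practice, a clean per-project setup often chains these together. The module names and versions below are only examples — run module avail to see what your cluster actually provides:

# Module names/versions are examples — check "module avail" on your cluster
module purge
module load python/3.11 cuda/12.1
module list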


Understanding SLURM

Most HPC systems use a scheduler called SLURM (the Simple Linux Utility for Resource Management).

SLURM manages:

  • job queues
  • GPU allocation
  • runtime limits
  • cluster resources

You do not directly “take” GPUs.

You request resources from SLURM.


Essential SLURM Commands

View Available Partitions

sinfo

View Running Jobs

squeue

Submit a Job

sbatch train.sh

Start Interactive Session

srun --pty bash

Cancel a Job

scancel JOB_ID

Your First SLURM Script

Create a file:

nano train.sbatch

Example content:

#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --mem=32G
#SBATCH --output=output.log

module purge
module load cuda
module load python

python train.py

Submit it:

sbatch train.sbatch

SLURM responds with something like:

Submitted batch job 12345

Understanding SBATCH Parameters

Parameter          Meaning
--job-name         Name of the job
--partition        Queue (partition) to submit to
--nodes            Number of machines
--ntasks           Number of tasks (processes) to run
--cpus-per-task    CPU cores per task
--gres=gpu:1       Request one GPU
--time             Maximum runtime (walltime)
--mem              RAM allocation
--output           File where the job's output is written

Monitoring Jobs

Check Queue

squeue -u username

Live Monitoring

watch squeue -u username

View Job Details

scontrol show job 12345
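
Once a job finishes, it disappears from squeue. If job accounting is enabled on the cluster, sacct reports on completed jobs:

# 12345 is the example job ID from above
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS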

Understanding Job States

State   Meaning
PD      Pending
R       Running
CG      Completing
CD      Completed
F       Failed

If your job stays in PD, it usually means one of the following (the squeue check after this list shows the scheduler's exact reason):

  • resources are busy
  • queue is full
  • requested resources are too large
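
To see that reason directly, ask squeue for the reason column (%R); values such as Resources or Priority are common:

# %R prints the scheduler's reason (e.g. Resources, Priority)
squeue -u username -t PD -o "%.18i %.30j %.8T %R"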

Interactive GPU Sessions

Interactive sessions are useful for:

  • debugging
  • notebook testing
  • experimentation

Example:

srun --partition=gpu \
     --gres=gpu:1 \
     --cpus-per-task=8 \
     --mem=32G \
     --pty bash

Python Environments on HPC

Never rely entirely on system Python for serious AI work.

Most researchers use:

  • Conda
  • virtual environments

Create Environment

conda create -n ml python=3.11

Activate:

conda activate ml

Install packages:

pip install torch transformers datasets
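
One caveat: pip installs a PyTorch build tied to a specific CUDA version. If the default build does not match the cluster's CUDA module, you can select a matching wheel explicitly (cu121 below is only an example — check which CUDA versions your cluster provides):

# cu121 is an example — match it to the cluster's CUDA module
pip install torch --index-url https://download.pytorch.org/whl/cu121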

GPU Training Workflow

A common AI workflow looks like this:

Local Machine
    ↓
Upload Dataset
    ↓
Prepare Environment
    ↓
Write SLURM Script
    ↓
Submit Training
    ↓
Monitor Logs
    ↓
Download Results

Storage Systems Matter More Than Beginners Think

HPC systems typically have:

  • home storage
  • scratch storage
  • project storage
  • high-speed temporary storage

A common beginner mistake is training directly from slow home directories.

For large AI workloads (see the staging sketch after this list):

  • use scratch storage
  • optimize data loading
  • clean temporary files regularly
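
A common pattern is to stage data onto scratch at the start of a job, roughly like this. The $SCRATCH variable and paths here are illustrative — check your site's documentation for the real storage variables:

# Hypothetical staging step inside a job script ($SCRATCH is illustrative)
cp "$HOME/datasets/dataset.tar" "$SCRATCH/"
cd "$SCRATCH"
tar -xf dataset.tar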

Transferring Files

SCP Example

scp dataset.zip username@cluster:/scratch/project/

RSYNC Example

rsync -av dataset/ username@cluster:/scratch/project/

Running PyTorch on HPC

Simple GPU check:

import torch

# Confirm that PyTorch can see the GPU(s) allocated to the job
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))

The matching SLURM script simply adds a GPU request to the directives shown in train.sbatch above:

#SBATCH --gres=gpu:1

python train.py

Multi-GPU Training

Example:

#SBATCH --nodes=1
#SBATCH --gres=gpu:4

PyTorch distributed execution:

torchrun --nproc_per_node=4 train.py
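
Putting these pieces together, a single-node, four-GPU job script might look like this (partition and module names are placeholders — adjust to your cluster):

#!/bin/bash
# Partition and module names are placeholders — adjust to your cluster
#SBATCH --job-name=ddp_train
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --time=08:00:00
#SBATCH --output=ddp.log

module purge
module load cuda python

torchrun --nproc_per_node=4 train.py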

This is where HPC becomes significantly more powerful than normal laptops.


MPI and Distributed Computing

Scientific applications often use MPI (the Message Passing Interface) to distribute work across many nodes.

Example:

module load openmpi

mpirun -np 16 ./simulation
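
Under SLURM, it is common to let the scheduler launch the MPI ranks via srun instead. The node and task counts below are illustrative:

#!/bin/bash
# Illustrative sizes: 2 nodes x 8 MPI ranks each
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00

module load openmpi
srun ./simulation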

MPI is heavily used in:

  • physics
  • computational fluid dynamics
  • simulations
  • weather forecasting

Containers in HPC

Unlike cloud-native environments that heavily use Docker, HPC systems often use:

  • Singularity
  • Apptainer (the open-source successor to Singularity)

Example:

module load singularity

Run container:

singularity exec container.sif python train.py
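
Images are typically built elsewhere or pulled from a registry; Singularity/Apptainer can convert Docker images directly. The image below is just an example:

# Example image — any docker:// URI your site allows works the same way
singularity pull pytorch.sif docker://pytorch/pytorch:latest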

Common Beginner Mistakes

Running Heavy Jobs on Login Nodes

This annoys system administrators very quickly.

Requesting Too Many Resources

The more resources you request, the longer your job typically waits in the queue.

Ignoring Logs

Logs explain most failures.

Example:

tail -f output.log

CUDA Version Mismatches

The CUDA version your PyTorch build was compiled against must be compatible with the CUDA drivers and modules available on the cluster.
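
A quick way to compare both sides (nvidia-smi typically only works on nodes that actually have GPUs):

nvidia-smi                                            # driver version and max supported CUDA
python -c "import torch; print(torch.version.cuda)"   # CUDA version of the installed PyTorch build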


HPC vs Cloud GPUs

HPC                            Cloud
Shared infrastructure          Commercial infrastructure
Often research-funded          Pay-as-you-go
Queue-based                    Immediate provisioning
Strong interconnects           Flexible scaling
Excellent for huge workloads   Excellent for startups

Recommended Learning Path

Stage 1 — Linux

Learn:

  • bash
  • file systems
  • SSH
  • permissions

Stage 2 — SLURM

Learn:

  • job submission
  • monitoring
  • partitions
  • scheduling

Stage 3 — Python Environments

Learn:

  • Conda
  • CUDA
  • pip
  • virtual environments

Stage 4 — GPU Computing

Learn:

  • PyTorch
  • distributed training
  • checkpointing

Stage 5 — Advanced HPC

Learn:

  • MPI
  • NCCL
  • DeepSpeed
  • multi-node training

Practice Projects for Beginners

Beginner

  • Submit hello-world SLURM job
  • Train MNIST on GPU
  • Monitor job states

Intermediate

  • Multi-GPU image classifier
  • Jupyter on HPC
  • Distributed preprocessing pipeline

Advanced

  • DeepSpeed training
  • MPI simulations
  • Large language model fine-tuning

Final Thoughts

The people who stand out in HPC environments are usually not just the best programmers.

They are the people who understand:

  • systems
  • optimization
  • storage bottlenecks
  • resource management
  • debugging workflows
  • automation

Supercomputing is less about “running code fast” and more about learning how to think computationally at scale.

If you are entering AI, scientific computing, or large-scale machine learning research, learning HPC may become one of the most valuable technical skills you acquire.




If you are already working with supercomputers or HPC clusters, what was the hardest concept for you when you first started?
