A practical introduction to Linux, SLURM, GPU clusters, AI workloads, and supercomputing workflows.
Supercomputers are no longer reserved only for physicists and national laboratories.
Today, AI engineers, machine learning researchers, data scientists, and students increasingly rely on High Performance Computing (HPC) systems for:
- large-scale AI training
- scientific simulations
- big data analytics
- distributed computing
- computational research
Recently, I started exploring the documentation of the CINECA HPC infrastructure, one of Europe’s major supercomputing environments, and I realized something important:
Most beginners struggle not because HPC is impossible, but because the ecosystem feels overwhelming at first.
This article breaks down the core concepts into a practical workflow that beginners can actually understand.
What Is an HPC Cluster?
An HPC (High Performance Computing) cluster is a network of powerful computers, called nodes, that work together to solve large computational problems.
Instead of training a model on:
- one laptop CPU
- one small GPU
you may gain access to:
- hundreds of CPU cores
- multiple A100/H100 GPUs
- terabytes of memory
- ultra-fast networking
- distributed storage systems
Systems like those hosted at CINECA are used for:
- AI training
- weather forecasting
- genomics
- computational physics
- climate simulations
- deep learning research
Understanding the Architecture
Most HPC systems follow this structure:
Your Laptop
↓
Login Node
↓
SLURM Scheduler
↓
Compute Nodes
↓
GPU/CPU Execution
Let’s simplify what each layer means.
| Component | Purpose |
|---|---|
| Login Node | Used for coding, editing, compiling |
| Compute Node | Where actual jobs run |
| GPU Node | Used for AI and deep learning |
| Scheduler | Manages resources and queues |
| Storage System | Stores datasets and outputs |
One important rule:
Never run heavy workloads directly on login nodes.
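If you are ever unsure which kind of node you are on, a quick check like this helps (the login-node naming is only a convention and varies by centre):

```bash
hostname                                      # login nodes are often named login01, login02, ...
echo "${SLURM_JOB_ID:-no SLURM allocation}"   # prints the fallback text unless you are inside a job
```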
The Biggest Mindset Shift in HPC
On a laptop, you usually do this:
Write code → Run immediately → See output
In HPC, the workflow becomes:
Write code
↓
Submit job
↓
Wait in queue
↓
Resources allocated
↓
Execution starts
↓
Logs/results generated
HPC is asynchronous.
That is one of the biggest transitions beginners must adapt to.
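In shell terms, the asynchronous loop looks roughly like this (train.sbatch and output.log are placeholder names, introduced properly later in this article):

```bash
sbatch train.sbatch     # returns immediately with a job ID; nothing runs yet
squeue -u $USER         # the job waits in the queue (state PD) until resources free up
tail -f output.log      # once it runs, you follow the log file instead of a live terminal
```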
Connecting to the Cluster
Most HPC systems are accessed using SSH.
Example:
ssh username@login.clustername.cineca.it
Modern systems may also require:
- 2FA
- SSH certificates
- OTP authentication
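To avoid retyping the full hostname every time, many people add an alias to ~/.ssh/config. A minimal sketch reusing the placeholder hostname above (the alias name and key path are assumptions):

```bash
cat >> ~/.ssh/config <<'EOF'
Host cineca
    HostName login.clustername.cineca.it
    User username
    IdentityFile ~/.ssh/id_ed25519
EOF

ssh cineca    # 2FA/OTP prompts still apply exactly as the centre configures them
```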
Linux Skills You MUST Know
Before touching supercomputers seriously, Linux fundamentals are essential.
Navigation Commands
pwd      # print the current directory
ls       # list files and directories
cd       # change directory
mkdir    # create a directory
rm       # remove files
cp       # copy files
mv       # move or rename files
File Inspection
cat      # print a whole file
less     # scroll through a file
head     # show the first lines of a file
tail     # show the last lines of a file
nano     # simple terminal editor
vim      # powerful terminal editor
Process Monitoring
top      # live view of running processes
htop     # friendlier interactive process viewer
ps       # list processes
kill     # stop a process by its PID
If you are weak in Linux, HPC will feel painful very quickly.
Understanding Modules
HPC environments rarely allow users to install software directly on the system.
Instead, they use modules.
Modules dynamically load:
- Python versions
- CUDA versions
- compilers
- MPI libraries
- AI frameworks
Checking Available Modules
module avail
Loading Python
module load python
Viewing Loaded Modules
module list
Clearing Modules
module purge
This prevents dependency conflicts between projects.
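A typical session for a GPU Python project might look like this. The exact module names and versions are assumptions; always check module avail on your own cluster first:

```bash
module purge                 # start from a clean environment
module avail python          # see which Python modules exist
module load python/3.11      # hypothetical version string
module load cuda/12.1        # hypothetical version string
module list                  # confirm what is now loaded
```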
Understanding SLURM
Most HPC systems use a scheduler called SLURM.
SLURM manages:
- job queues
- GPU allocation
- runtime limits
- cluster resources
You do not directly “take” GPUs.
You request resources from SLURM.
Essential SLURM Commands
View Available Partitions
sinfo
View Running Jobs
squeue
Submit a Job
sbatch train.sh
Start Interactive Session
srun --pty bash
Cancel a Job
scancel JOB_ID
Your First SLURM Script
Create a file:
nano train.sbatch
Example content:
#!/bin/bash
#SBATCH --job-name=my_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --mem=32G
#SBATCH --output=output.log
module purge
module load cuda
module load python
python train.py
Submit it:
sbatch train.sbatch
SLURM responds with something like:
Submitted batch job 12345
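If you want to script around the job ID, sbatch --parsable prints only the numeric ID, which is easy to capture:

```bash
JOBID=$(sbatch --parsable train.sbatch)
echo "Submitted job $JOBID"
squeue -j "$JOBID"    # check its state right away
```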
Understanding SBATCH Parameters
| Parameter | Meaning |
|---|---|
| --job-name | Name of the job |
| --partition | Queue name |
| --nodes | Number of machines |
| --cpus-per-task | CPU cores |
| --gres=gpu:1 | Request GPU |
| --time | Runtime limit |
| --mem | RAM allocation |
Monitoring Jobs
Check Queue
squeue -u username
Live Monitoring
watch squeue -u username
View Job Details
scontrol show job 12345
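For jobs that have already finished, sacct shows accounting data, assuming accounting is enabled on your cluster:

```bash
sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS
```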
Understanding Job States
| State | Meaning |
|---|---|
| PD | Pending |
| R | Running |
| CG | Completing |
| CD | Completed |
| F | Failed |
If your job stays in PD, it usually means:
- resources are busy
- queue is full
- requested resources are too large
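SLURM also reports why a job is pending. One way to see the reason (the %r field prints values such as Resources or Priority):

```bash
squeue -u username -o "%.10i %.12T %.20r"   # job ID, state, and pending reason
scontrol show job 12345 | grep -i reason    # same information for a single job
```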
Interactive GPU Sessions
Interactive sessions are useful for:
- debugging
- notebook testing
- experimentation
Example:
srun --partition=gpu \
--gres=gpu:1 \
--cpus-per-task=8 \
--mem=32G \
--pty bash
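Once the shell opens on a compute node, it is worth verifying the allocation before experimenting:

```bash
nvidia-smi            # should list exactly the GPU(s) SLURM assigned to you
echo $SLURM_JOB_ID    # confirms you are inside an allocation
exit                  # ends the session and releases the resources
```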
Python Environments on HPC
Never rely entirely on system Python for serious AI work.
Most researchers use:
- Conda
- virtual environments
Create Environment
conda create -n ml python=3.11
Activate:
conda activate ml
Install packages:
pip install torch transformers datasets
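Inside a batch script, the same environment has to be activated before training starts. A sketch, assuming an Anaconda module exists on the cluster (the module name varies per system):

```bash
module load anaconda3    # name is an assumption; check module avail on your system
source activate ml       # or `conda activate ml` if conda is initialised in your shell
python train.py
```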
GPU Training Workflow
A common AI workflow looks like this:
Local Machine
↓
Upload Dataset
↓
Prepare Environment
↓
Write SLURM Script
↓
Submit Training
↓
Monitor Logs
↓
Download Results
Storage Systems Matter More Than Beginners Think
HPC systems typically have:
- home storage
- scratch storage
- project storage
- high-speed temporary storage
A common beginner mistake is training directly from slow home directories.
For large AI workloads:
- use scratch storage
- optimize data loading
- clean temporary files regularly
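A common pattern is to stage the dataset onto scratch once and clean up old temporary files regularly. All paths below are placeholders:

```bash
cp -r $HOME/datasets/my_dataset /scratch/project/$USER/      # stage once, train many times
# ... run training against the scratch copy ...
find /scratch/project/$USER/tmp -type f -mtime +7 -delete    # prune temp files older than a week
```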
Transferring Files
SCP Example
scp dataset.zip username@cluster:/scratch/project/
RSYNC Example
rsync -av dataset/ username@cluster:/scratch/project/
Running PyTorch on HPC
Simple GPU check:
import torch
print(torch.cuda.is_available())        # True if a GPU is visible to PyTorch
print(torch.cuda.get_device_name(0))    # name of the first allocated GPU
SLURM script:
#SBATCH --gres=gpu:1
python train.py
Multi-GPU Training
Example:
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
PyTorch distributed execution:
torchrun --nproc_per_node=4 train.py
This is where HPC becomes significantly more powerful than normal laptops.
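Putting it together, a single-node, 4-GPU batch script might look like this sketch (partition, module names, and resource sizes are assumptions, following the earlier example):

```bash
#!/bin/bash
#SBATCH --job-name=multi_gpu_training
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --time=08:00:00
#SBATCH --mem=128G
#SBATCH --output=multi_gpu.log

module purge
module load cuda
module load python

torchrun --nproc_per_node=4 train.py
```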
MPI and Distributed Computing
Scientific applications often use MPI.
Example:
module load openmpi
mpirun -np 16 ./simulation
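Inside a batch job, srun is usually the preferred launcher because it inherits the allocation from SLURM (whether srun or mpirun works best depends on how the MPI library was built). A sketch for a 16-rank run across two nodes:

```bash
#!/bin/bash
#SBATCH --job-name=mpi_simulation
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=01:00:00
#SBATCH --output=simulation.log

module load openmpi
srun ./simulation     # 16 MPI ranks in total (2 nodes x 8 tasks per node)
```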
MPI is heavily used in:
- physics
- computational fluid dynamics
- simulations
- weather forecasting
Containers in HPC
Unlike cloud-native environments that heavily use Docker, HPC systems often use:
- Singularity
- Apptainer
Example:
module load singularity
Run container:
singularity exec container.sif python train.py
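For GPU workloads, add the --nv flag so the container can see the host's NVIDIA driver:

```bash
singularity exec --nv container.sif python train.py
# Apptainer uses the same syntax:
# apptainer exec --nv container.sif python train.py
```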
Common Beginner Mistakes
Running Heavy Jobs on Login Nodes
Login nodes are shared by everyone, so heavy processes slow them down for all users and annoy system administrators very quickly.
Requesting Too Many Resources
Large requests stay in the queue longer.
Ignoring Logs
Logs explain most failures.
Example:
tail -f output.log
CUDA Version Mismatches
Your PyTorch CUDA version must match cluster CUDA support.
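A quick way to compare the two sides (the one-liner prints the CUDA version your PyTorch build expects):

```bash
nvidia-smi | head -n 5                                   # driver and CUDA version on the node
python -c "import torch; print(torch.version.cuda)"      # CUDA version PyTorch was built against
```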
HPC vs Cloud GPUs
| HPC | Cloud |
|---|---|
| Shared, allocation-based infrastructure | Commercial, on-demand infrastructure |
| Often research-funded | Pay-as-you-go |
| Queue-based | Immediate provisioning |
| Strong interconnects | Flexible scaling |
| Excellent for huge workloads | Excellent for startups |
Recommended Learning Path
Stage 1 — Linux
Learn:
- bash
- file systems
- SSH
- permissions
Stage 2 — SLURM
Learn:
- job submission
- monitoring
- partitions
- scheduling
Stage 3 — Python Environments
Learn:
- Conda
- CUDA
- pip
- virtual environments
Stage 4 — GPU Computing
Learn:
- PyTorch
- distributed training
- checkpointing
Stage 5 — Advanced HPC
Learn:
- MPI
- NCCL
- DeepSpeed
- multi-node training
Practice Projects for Beginners
Beginner
- Submit a hello-world SLURM job (a minimal example follows this list)
- Train MNIST on GPU
- Monitor job states
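For the first project, the hello-world script can be as small as this sketch:

```bash
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --time=00:05:00
#SBATCH --output=hello.log

echo "Hello from $(hostname) at $(date)"
```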
Intermediate
- Multi-GPU image classifier
- Jupyter on HPC
- Distributed preprocessing pipeline
Advanced
- DeepSpeed training
- MPI simulations
- Large language model fine-tuning
Final Thoughts
The people who stand out in HPC environments are usually not just the best programmers.
They are the people who understand:
- systems
- optimization
- storage bottlenecks
- resource management
- debugging workflows
- automation
Supercomputing is less about “running code fast” and more about learning how to think computationally at scale.
If you are entering AI, scientific computing, or large-scale machine learning research, learning HPC may become one of the most valuable technical skills you acquire.
Useful Resources
- CINECA HPC Documentation
- SLURM Documentation
- PyTorch Distributed Training Docs
- OpenMPI Documentation
- Apptainer Documentation
If you are already working with supercomputers or HPC clusters, what was the hardest concept for you when you first started?