DEV Community: Muhammad Zubair Bin Akbar

CPU Pinning and Affinity in HPC: Why Performance Changes Drastically

Muhammad Zubair Bin Akbar — Sat, 16 May 2026 18:58:33 +0000

In HPC environments, users often notice something confusing:

The same application, same input, and same number of CPUs can produce very different performance results across runs.

One of the biggest reasons behind this is CPU pinning and CPU affinity.

Without proper CPU placement, processes can bounce between cores, compete for cache, and suffer from NUMA penalties. In large parallel workloads, this can drastically reduce performance.

This post explains what CPU pinning and affinity are, why they matter in HPC, and how they affect real workloads.

What Is CPU Affinity?

CPU affinity controls which CPU cores a process or thread is allowed to run on.

The operating system scheduler can still move the process between the allowed cores, but only within that defined CPU set.

For example:

A process may be allowed to run only on cores 0 to 7
The scheduler can move it between those cores if needed

Affinity helps improve cache locality and reduces unnecessary movement across the entire system.
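As a rough illustration of what an allowed CPU set looks like from inside a process, the sketch below uses Python's standard os module. It assumes a Linux system (these calls are not available everywhere), and the helper name restrict_to is made up for this example.

```python
import os

def restrict_to(cpus):
    """Limit the calling process to `cpus` and return the resulting allowed set."""
    os.sched_setaffinity(0, cpus)        # pid 0 means "the calling process"
    return os.sched_getaffinity(0)

original = os.sched_getaffinity(0)       # CPUs the scheduler may use right now
subset = set(sorted(original)[:2])       # keep at most the two lowest-numbered CPUs
narrowed = restrict_to(subset)
print("allowed before:", sorted(original))
print("allowed after: ", sorted(narrowed))

os.sched_setaffinity(0, original)        # restore the original allowed set
```

After the call, the scheduler can still move the process, but only between the CPUs left in the set.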

What Is CPU Pinning?

CPU pinning is the actual act of locking a process or thread to a CPU core.

In HPC clusters, schedulers like Slurm often handle this automatically through CPU binding options.

For example:

MPI rank 0 stays on core 0
MPI rank 1 stays on core 1

This minimizes CPU migrations and provides more predictable performance for HPC workloads.

Pinning typically provides:

Better cache locality
Reduced scheduler overhead
Predictable performance
Lower NUMA latency
Fewer CPU migrations

Without pinning, Linux may move tasks between cores frequently depending on system activity.

Why Performance Changes So Much

Modern HPC nodes are complex.

A single node may contain:

Multiple CPU sockets
NUMA regions
Shared and private caches
Hyperthreading
Hundreds of logical CPUs

When processes move randomly between CPUs, several problems appear.

Cache Locality Problems

CPUs rely heavily on cache memory.

If a thread keeps running on the same core, cached data remains available and execution becomes faster.

When the thread migrates to another core:

Cache must be rebuilt
Memory access latency increases
CPU cycles are wasted

This becomes extremely expensive for tightly coupled MPI applications.

NUMA Effects

NUMA stands for Non Uniform Memory Access.

In multi socket systems, memory attached to the local CPU socket is faster to access than memory attached to another socket.

If a process runs on Socket 0 but accesses memory allocated on Socket 1:

Memory latency increases
Bandwidth decreases
Application performance drops

This is one of the most common reasons HPC jobs scale poorly.

Example of Bad CPU Placement

Consider a dual socket server:

Socket 0 → cores 0 to 31
Socket 1 → cores 32 to 63

If an MPI application launches ranks without proper affinity:

Rank 0 may start on core 2
Later move to core 40
Then back to core 10

Now the application suffers from:

Remote memory access
Cache misses
CPU migration overhead

The result can be a major slowdown even though CPU usage appears high.
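The scenario above can be replayed in a few lines. This is only a sketch of the bookkeeping, assuming the dual socket layout from the example (32 cores per socket); the function names are illustrative.

```python
CORES_PER_SOCKET = 32  # matches the example: cores 0-31 and 32-63

def socket_of(core):
    """Return the socket index that a core belongs to in this layout."""
    return core // CORES_PER_SOCKET

def count_moves(trace):
    """Count migrations and cross-socket migrations in a list of core ids."""
    migrations = 0
    cross_socket = 0
    for prev, cur in zip(trace, trace[1:]):
        if cur != prev:
            migrations += 1
            if socket_of(cur) != socket_of(prev):
                cross_socket += 1
    return migrations, cross_socket

trace = [2, 40, 10]  # rank 0 starts on core 2, moves to 40, then to 10
migrations, cross_socket = count_moves(trace)
print(f"{migrations} migrations, {cross_socket} across sockets")
```

Both moves in this trace cross the socket boundary, so memory first touched on one socket ends up being accessed remotely from the other.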

MPI and CPU Binding

MPI applications are very sensitive to where processes run on the CPU.

If MPI ranks keep moving between cores:

Cache data gets lost
Memory access becomes slower
Communication latency increases

To avoid this, MPI runtimes and schedulers use CPU binding or pinning.

For example with Open MPI:

mpirun --bind-to core --map-by socket ./app

With Slurm:

srun --cpu-bind=cores ./app

These settings keep MPI processes fixed to specific CPU cores, which usually provides more stable and faster performance in HPC workloads.
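To build intuition for what mapping by socket and binding to cores produces, here is a simplified model in Python. It assumes plain round-robin placement across sockets on the dual socket node from the earlier example; the real Open MPI mapper has more rules and its ranking can differ between versions, so check actual placement with mpirun --report-bindings.

```python
def map_by_socket(n_ranks, sockets):
    """Place ranks round-robin across sockets, one core per rank.

    `sockets` is a list of lists of core ids. Returns {rank: core}.
    """
    next_free = [0] * len(sockets)           # next unused core index per socket
    placement = {}
    for rank in range(n_ranks):
        s = rank % len(sockets)              # alternate between sockets
        if next_free[s] >= len(sockets[s]):
            raise ValueError("not enough cores for the requested ranks")
        placement[rank] = sockets[s][next_free[s]]
        next_free[s] += 1
    return placement

sockets = [list(range(0, 32)), list(range(32, 64))]
placement = map_by_socket(4, sockets)
for rank, core in placement.items():
    print(f"rank {rank} -> core {core}")
```

Spreading ranks over both sockets gives each rank more memory bandwidth, while the one core per rank binding keeps every rank from migrating.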

Hyperthreading Can Also Matter

Some workloads perform poorly when pinned to logical CPUs instead of physical cores.

For compute intensive applications:

Two threads sharing one physical core may compete for resources
Floating point performance may decrease
Memory bandwidth may become limited

This is why many HPC sites disable hyperthreading for production workloads.

Real World Performance Difference

In many HPC benchmarks:

Proper CPU affinity can improve performance by roughly 10% to 40%, depending on the application and node topology
NUMA aware placement can reduce latency significantly
Communication heavy MPI jobs benefit the most

Applications such as:

CFD solvers
Molecular dynamics
Finite element simulations
AI training workloads
Weather modeling

are highly sensitive to CPU placement.

How to Check CPU Affinity

Useful Linux tools include:

taskset -p <pid>

numactl --show

lscpu

hwloc-ls

The hwloc package is especially useful for visualizing CPU topology and NUMA layout.
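Tools such as taskset -cp <pid> and the Cpus_allowed_list field in /proc/<pid>/status print the allowed CPUs in a compact list format like 0-3,8. The helper below is a small sketch for expanding that format in scripts; the function name is illustrative.

```python
def parse_cpu_list(text):
    """Expand a CPU list such as "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))  # ranges are inclusive
        else:
            cpus.append(int(part))
    return cpus

print(parse_cpu_list("0-3,8"))
```

Comparing the expanded list against the socket layout from lscpu shows quickly whether a process is confined to one NUMA domain.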

Best Practices in HPC

1. Use Scheduler Managed Affinity

Let the cluster scheduler manage CPU placement whenever possible.

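For example, in a job script (--cpu-bind is an srun option, so it belongs on the srun line rather than in an #SBATCH directive):

#SBATCH --cpus-per-task=8
srun --cpu-bind=cores ./app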

2. Keep MPI Ranks NUMA Aware

Try to keep MPI ranks and memory allocations within the same NUMA domain.

Tools like numactl can help.

3. Benchmark Different Configurations

Different applications behave differently.

Always test:

Core binding
Socket binding
NUMA placement
Hyperthreading enabled vs disabled

4. Monitor CPU Migrations

High CPU migrations can indicate poor affinity configuration.

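Useful commands (pidstat -w reports context switches per task, while perf stat can count both context switches and CPU migrations for a command):

pidstat -w

perf stat -e context-switches,cpu-migrations ./app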

Final Thoughts

CPU pinning and affinity are often overlooked in HPC environments, but they directly affect application scalability and runtime consistency.

Two jobs using the same resources can perform very differently simply because of process placement. Understanding CPU topology, NUMA behavior, and scheduler affinity policies is essential for getting the best performance from modern HPC clusters.

In many cases, properly placing and pinning processes to CPU cores can improve performance without upgrading the hardware.

How Slurm Handles Resource Allocation Internally

Muhammad Zubair Bin Akbar — Wed, 13 May 2026 19:05:33 +0000

If you work with HPC clusters, chances are you use Slurm every day to submit jobs, monitor queues, and manage compute resources.

Most users know commands like sbatch, squeue, and sinfo, but fewer understand what actually happens internally when a job is submitted.

This article explains how Slurm handles resource allocation behind the scenes, from job submission to execution on compute nodes.

⸻

What Happens When You Submit a Job?

When a user runs:

sbatch job.sh

Slurm begins a multi step workflow internally.

The main components involved are:

slurmctld → Central controller daemon
slurmd → Compute node daemon
slurmdbd → Accounting database daemon (optional but common)
Scheduler plugin
Select plugin
Cgroups/task plugins

Each component has a specific role in resource allocation.

⸻

Step 1: Job Submission

The sbatch command sends the job request to slurmctld.

The request includes:

Number of nodes
CPUs
Memory
GPUs
Time limit
Partition
Constraints
QoS
Account information

Example:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --mem=128G
#SBATCH --time=04:00:00

At this stage, Slurm creates a job record and places it into the pending queue.

⸻

Step 2: Job Validation

Before scheduling the job, Slurm validates several things internally.

User & Account Checks

Slurm verifies:

User permissions
Account associations
QoS limits
Fairshare policies
Partition access

If accounting is enabled, slurmdbd provides usage statistics and limits.

⸻

Step 3: Scheduler Evaluation

Now the scheduler starts evaluating the job.

The default scheduler in Slurm is:

sched/backfill

This scheduler performs two important tasks:

Main Scheduling Pass

It checks:

Available resources
Job priority
Node states
Reservations
Limits

Backfill Scheduling

Backfill lets lower priority jobs start early on idle resources, as long as doing so does not delay the expected start time of higher priority jobs.

This improves overall cluster utilization.
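The core backfill test can be sketched as a single check. This is a deliberately simplified model, not Slurm's algorithm: it assumes one pool of identical nodes and one reserved start time for the top priority job, and all names are illustrative.

```python
def can_backfill(job_nodes, job_runtime_min, idle_nodes, minutes_until_reserved_start):
    """Return True if a waiting job can run now without delaying the reserved job.

    Simplification: the job must fit in the idle nodes and must finish
    (by its requested time limit) before the reserved start time.
    """
    fits_now = job_nodes <= idle_nodes
    ends_in_time = job_runtime_min <= minutes_until_reserved_start
    return fits_now and ends_in_time

# 4 idle nodes, and the top priority job is expected to start in 60 minutes.
print(can_backfill(job_nodes=2, job_runtime_min=30, idle_nodes=4,
                   minutes_until_reserved_start=60))   # short job fits in the gap
print(can_backfill(job_nodes=2, job_runtime_min=90, idle_nodes=4,
                   minutes_until_reserved_start=60))   # would overrun the gap
```

This is also why accurate time limits help: the scheduler can only backfill a job if its requested wall time fits into the gap.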

⸻

How Job Priority Is Calculated

Slurm calculates a dynamic priority score.

Factors include:

Fairshare usage
Job age
Job size
Partition priority
QoS priority
Association priority

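Internally, the priority plugin multiplies each factor by a site-configured weight and sums the results into a single score.

Example (simplified):

Priority = (PriorityWeightAge × Age) + (PriorityWeightFairshare × Fairshare) + (PriorityWeightJobSize × JobSize) + (PriorityWeightPartition × Partition) + (PriorityWeightQOS × QoS) + (PriorityWeightAssoc × Association)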

Higher score means earlier scheduling.
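In Slurm's multifactor priority plugin, each factor is normalized to a value between 0 and 1 and multiplied by a weight set by the site. The sketch below shows that weighted sum; the weights and factor values are made-up numbers for illustration, and the real plugin includes further terms (such as TRES and nice adjustments) that are left out here.

```python
def job_priority(weights, factors):
    """Weighted sum of normalized priority factors (each factor in 0.0..1.0)."""
    return sum(weights[name] * factors[name] for name in weights)

# Hypothetical site weights; real clusters set these in slurm.conf.
weights = {"age": 1000, "fairshare": 10000, "job_size": 500,
           "partition": 1000, "qos": 2000}

# Hypothetical normalized factors for one pending job.
factors = {"age": 0.5, "fairshare": 0.25, "job_size": 0.5,
           "partition": 1.0, "qos": 0.0}

print(job_priority(weights, factors))
```

Because fairshare carries the largest weight in this made-up setup, a user's recent usage moves the score far more than job age does.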

⸻

Step 4: Resource Selection

Once the scheduler decides to run the job, Slurm uses the select plugin.

Most clusters use:

select/cons_tres

This plugin handles consumable resources using TRES.

⸻

What Are TRES?

TRES stands for:

Trackable RESources

Examples:

CPU
Memory
GPU
Node
License
Burst buffer

This model allows Slurm to track resources very precisely.

⸻

Internal Node Selection

The select plugin now determines:

Which nodes are eligible
How CPUs are distributed
Memory allocation
GPU placement
Socket/core binding

Slurm checks node topology information stored in memory by slurmctld.

Example:

NodeA:
  64 CPUs
  512 GB RAM
  4 GPUs

If the job requests:

32 CPUs + 2 GPUs

Slurm reserves exactly those resources internally.
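The bookkeeping in this example can be modeled with a small class. This is a toy model of consumable resource tracking, not Slurm's data structures; the class and field names are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpus: int
    free_mem_gb: int
    free_gpus: int

    def can_fit(self, cpus, gpus=0, mem_gb=0):
        """Check whether the request fits in what is still free."""
        return (cpus <= self.free_cpus and gpus <= self.free_gpus
                and mem_gb <= self.free_mem_gb)

    def allocate(self, cpus, gpus=0, mem_gb=0):
        """Reserve resources, reducing the free counters."""
        if not self.can_fit(cpus, gpus, mem_gb):
            raise ValueError(f"{self.name} cannot fit the request")
        self.free_cpus -= cpus
        self.free_gpus -= gpus
        self.free_mem_gb -= mem_gb

node_a = Node("NodeA", free_cpus=64, free_mem_gb=512, free_gpus=4)
node_a.allocate(cpus=32, gpus=2)   # the request from the example
print(node_a)
```

After the allocation, a second job of the same size still fits on NodeA, which is the point of tracking resources individually instead of handing out whole nodes.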

⸻

Step 5: Resource Reservation

After node selection, Slurm marks resources as allocated.

Internally:

CPUs become unavailable to other jobs
Memory counters are reduced
GPUs are reserved
Node state changes

You can observe this using:

scontrol show node

squeue

⸻

Step 6: Launching the Job

Now slurmctld contacts the slurmd daemon on allocated nodes.

The compute node daemon performs:

Environment setup
UID/GID validation
Cgroup creation
CPU binding
Memory enforcement
Task launching

⸻

How Cgroups Enforce Limits

Modern Slurm clusters heavily rely on Linux cgroups.

Cgroups ensure a job cannot exceed allocated resources.

Examples:

CPU Enforcement

Only allocated CPU cores are accessible

Memory Enforcement

Memory usage beyond limit triggers OOM kill

GPU Isolation

Only assigned GPUs are visible

This is why users see:

CUDA_VISIBLE_DEVICES=0

automatically set inside jobs.
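A job can inspect this variable to see which GPUs it was given. The sketch below is a minimal reader, assuming the variable holds a comma-separated list (entries may be indices or device UUIDs); the function name is illustrative.

```python
import os

def visible_gpus(env=None):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is not set at all, and an empty
    list when it is set but empty (no GPUs visible).
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [item for item in value.split(",") if item]

print(visible_gpus({"CUDA_VISIBLE_DEVICES": "0,1"}))  # simulated job environment
print(visible_gpus())                                  # the current environment
```

Logging this value at the start of a job makes it easier to confirm later that the job really ran on the GPUs it requested.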

⸻

CPU Binding and Affinity

Slurm also handles CPU affinity internally.

This improves:

NUMA locality
Cache efficiency
MPI performance

Example:

srun --cpu-bind=cores

Internally, Slurm maps tasks to specific CPU cores using topology-aware scheduling.

⸻

Step 7: Job Execution

Once everything is configured:

Processes start
Accounting begins
Usage metrics are collected

Slurm tracks:

CPU time
Memory usage
GPU usage
Energy consumption
Exit codes

These statistics are later visible through:

sacct

⸻

What Happens When the Job Finishes?

After completion:

Resources are released
Node state is updated
Accounting data is stored
Scheduler reevaluates pending jobs

The released resources immediately become available for new allocations.

⸻

Why Understanding This Matters

Knowing how Slurm allocates resources helps administrators and users:

Troubleshoot pending jobs
Optimize scheduling
Improve cluster utilization
Diagnose CPU or memory contention
Tune fairshare policies
Understand performance bottlenecks

It also makes debugging much easier when dealing with issues like:

ReqNodeNotAvail
Resources
Priority
QOSMaxCpuPerUserLimit

⸻

Final Thoughts

Slurm does much more than simply queue jobs.

Internally, it performs:

Policy validation
Priority calculations
Topology aware scheduling
Precise resource accounting
Cgroup enforcement
Distributed task launching

Understanding these internals gives HPC administrators better control over cluster performance and helps users write more efficient jobs.

The next time you run sbatch, remember that an entire scheduling engine is working behind the scenes to decide exactly where and how your workload should run.

InfiniBand vs Omni Path vs Ethernet for AI Workloads

Muhammad Zubair Bin Akbar — Mon, 11 May 2026 19:27:00 +0000

AI workloads are pushing HPC and data center networks harder than ever. Training large language models, distributed deep learning, and high speed data pipelines depend heavily on fast interconnects between compute nodes.

When GPUs spend more time waiting for data than processing it, the network becomes the bottleneck.

Three major networking technologies are commonly discussed in AI and HPC environments:

InfiniBand
Intel Omni Path
Ethernet

Each comes with different strengths, trade offs, and real world use cases.

⸻

Why Network Fabric Matters in AI

Modern AI training is rarely limited to a single GPU or node.

Distributed frameworks like:

PyTorch DDP
DeepSpeed
Horovod
TensorFlow Distributed

constantly exchange gradients, parameters, and synchronization data between nodes.

The faster this communication happens, the better the training performance scales.

Key factors include:

Latency
Bandwidth
RDMA support
Scalability
Congestion handling
GPU communication efficiency
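To see why bandwidth matters so much, a back-of-the-envelope estimate helps. The sketch below assumes a ring all-reduce, in which each worker sends and receives about 2 × (N − 1) / N times the message size, and it ignores latency and overlap with compute. Real libraries may pick other algorithms, so treat the result as an order-of-magnitude figure only; the function name and the example numbers are illustrative.

```python
def ring_allreduce_seconds(message_bytes, n_workers, bandwidth_bytes_per_s):
    """Bandwidth-only time estimate for a ring all-reduce."""
    if n_workers < 2:
        return 0.0                                   # nothing to exchange
    traffic_factor = 2 * (n_workers - 1) / n_workers
    return traffic_factor * message_bytes / bandwidth_bytes_per_s

# Example: 1 GB of gradients, 8 workers, 50 GB/s per link (400 Gbit/s).
estimate = ring_allreduce_seconds(1e9, 8, 50e9)
print(f"{estimate * 1000:.1f} ms per all-reduce")
```

If this exchange happens on every training step, halving the link bandwidth roughly doubles the time spent in it, which is where GPUs start waiting on the network.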

⸻

1. InfiniBand

InfiniBand, today supplied mainly by NVIDIA (formerly Mellanox), is widely considered the gold standard for high performance AI and HPC clusters.

It is designed specifically for ultra low latency and extremely high throughput communication.

Key Features

RDMA (Remote Direct Memory Access)
GPUDirect RDMA support
Very low latency
High bandwidth (HDR, NDR generations)
Adaptive routing
Lossless communication

Why AI Clusters Love InfiniBand

Large AI workloads generate massive all reduce traffic between GPUs.

InfiniBand performs exceptionally well because it minimizes CPU involvement and allows direct GPU to GPU communication across nodes.

This improves:

Multi node GPU scaling
Training efficiency
Synchronization speed
Cluster utilization

Common Use Cases

Large scale LLM training
HPC supercomputers
GPU heavy AI clusters
Research environments

Limitations

Expensive hardware
Complex deployment
Specialized networking expertise required

⸻

2. Omni Path

Omni-Path was Intel’s answer to InfiniBand for HPC environments; the technology is now developed by Cornelis Networks.

It focused on delivering high throughput with strong scalability at a potentially lower cost.

Key Features

Low latency fabric
High port density
Efficient MPI communication
Good scalability for HPC workloads

Strengths

Omni Path performed well in:

MPI based HPC clusters
Scientific simulations
CPU centric workloads

It also reduced switch complexity in some deployments due to its architecture.

Challenges for AI Workloads

While Omni Path worked well for traditional HPC, it struggled to gain traction in GPU dominated AI ecosystems.

Reasons included:

Limited GPU ecosystem support
Less mature GPUDirect integration
Smaller vendor ecosystem
Reduced industry adoption over time

Today, most modern AI deployments lean toward InfiniBand or high speed Ethernet instead.

⸻

3. Ethernet

Broadcom and other vendors continue pushing Ethernet into AI networking with higher speeds like:

100GbE
200GbE
400GbE
800GbE

Ethernet remains the most widely deployed networking technology globally.

Key Features

Easy integration
Lower cost
Massive ecosystem support
Simpler operations
Familiar tooling

Ethernet in Modern AI

Traditional Ethernet had higher latency compared to InfiniBand, but newer technologies have improved performance significantly.

Examples include:

RoCE (RDMA over Converged Ethernet)
SmartNICs
DPU acceleration
Lossless Ethernet configurations

Many organizations now run AI workloads successfully on high speed Ethernet fabrics.

Strengths

Cost effective scaling
Easier maintenance
Better compatibility with enterprise environments
Flexible vendor choices

Weaknesses

Usually higher latency than InfiniBand
Congestion tuning can become complex
RoCE requires careful configuration

⸻

Which One Should You Choose?

Choose InfiniBand if:

You train large AI models
You run multi node GPU clusters
Maximum performance matters
Budget is less of a concern

Choose Omni Path if:

You already operate Intel HPC infrastructure
Your workloads are MPI heavy
GPU scaling is not the main priority

Choose Ethernet if:

You want operational simplicity
You need enterprise compatibility
Budget matters
Your AI workloads are medium scale

⸻

Final Thoughts

There is no universal winner.

The right interconnect depends on:

Workload type
Cluster scale
Budget
GPU usage
Operational expertise

For cutting edge AI training, InfiniBand still dominates performance focused deployments.

For enterprise AI environments, Ethernet continues evolving rapidly and closing the gap.

Omni Path played an important role in HPC networking, but its presence in modern AI infrastructure has become much smaller compared to InfiniBand and Ethernet.

As AI clusters continue growing, networking decisions are becoming just as important as CPU and GPU selection.

How HPC Clusters Accelerate AI/ML Training

Muhammad Zubair Bin Akbar — Sat, 09 May 2026 21:36:44 +0000

Artificial Intelligence and Machine Learning are growing faster than ever. From large language models to computer vision and scientific simulations, modern AI workloads require massive computing power.

Training a model on a normal workstation can take days, weeks, or even months. This is where High Performance Computing, also known as HPC, becomes extremely valuable.

An HPC cluster allows researchers, engineers, startups, and enterprises to train AI models faster, process larger datasets, and scale workloads efficiently.

What is an HPC Cluster?

An HPC cluster is a group of interconnected servers working together as a single powerful computing environment.

These clusters usually contain:

Multiple compute nodes
High core count CPUs
Powerful GPUs
High speed networking
Parallel storage systems
Job scheduling software like Slurm

Instead of relying on a single machine, workloads are distributed across many systems.

Why AI and ML Need HPC

Modern AI training involves billions of calculations. Large datasets and deep neural networks demand huge computational resources.

Without HPC infrastructure, organizations often face:

Slow training times
GPU bottlenecks
Memory limitations
Storage performance issues
Scaling challenges

HPC solves these problems by providing distributed computing and parallel execution.

Faster Model Training

One of the biggest advantages of HPC is reduced training time.

For example, training a deep learning model on a single GPU may take several days. Using an HPC cluster with multiple GPUs across several nodes can reduce this time dramatically.

Frameworks such as:

PyTorch
TensorFlow
Horovod
DeepSpeed

can distribute training across many GPUs simultaneously.

This allows data parallelism and model parallelism at scale.

Efficient GPU Utilization

GPUs are expensive resources. HPC clusters help maximize GPU usage efficiently.

Schedulers like Slurm can:

Allocate GPUs dynamically
Queue workloads efficiently
Prevent resource conflicts
Improve overall cluster utilization

This ensures that GPUs remain productive instead of sitting idle.

Scalability for Large Datasets

AI models continue to grow in size. Datasets now reach terabytes or even petabytes.

HPC clusters provide scalable storage systems such as:

Lustre
BeeGFS
GPFS

These parallel file systems allow high speed data access from multiple nodes at the same time.

As a result, training pipelines become faster and more reliable.

Distributed Training Made Easier

Modern AI frameworks are designed to work well with HPC environments.

Using technologies like:

NCCL
MPI
RDMA
Omni Path or InfiniBand networking

clusters can achieve low latency communication between GPUs and compute nodes.

This becomes critical when training large transformer models or running multi GPU workloads.

Better Resource Sharing

HPC clusters are ideal for universities, research labs, and enterprises where many users need access to computing resources.

Instead of every team purchasing separate hardware, a centralized HPC environment allows shared access to:

GPUs
CPUs
Memory
Storage
Software environments

This reduces cost and improves operational efficiency.

AI Use Cases That Benefit from HPC

HPC clusters are widely used for:

Large Language Models
Computer Vision
Medical Imaging
Weather Prediction
Drug Discovery
Financial Modeling
Autonomous Vehicle Research
Scientific Simulations

Many of these workloads are impossible to run efficiently on a single machine.

Challenges to Consider

Although HPC offers major advantages, there are still challenges:

Infrastructure cost
Power and cooling requirements
GPU availability
Network complexity
Cluster management
Software compatibility

However, the long term performance gains usually outweigh the initial setup effort.

Final Thoughts

AI and Machine Learning workloads are becoming increasingly demanding. Traditional systems are often not enough to handle modern training requirements.

HPC clusters provide the computing power, scalability, and efficiency needed for advanced AI development.

Whether you are training deep learning models, processing massive datasets, or running distributed workloads, HPC can significantly accelerate your AI journey.

As AI continues to evolve, HPC infrastructure will become even more important for research and innovation.

NFS vs Parallel File Systems in HPC: How to Choose the Right Storage Architecture

Muhammad Zubair Bin Akbar — Thu, 07 May 2026 21:49:09 +0000

When building or expanding an HPC cluster, one of the biggest architectural decisions is storage design. Many small and mid-sized clusters start with NFS because it is simple, reliable, and easy to manage. But as workloads grow, storage often becomes the hidden bottleneck.

So the real question is:

When is NFS enough, and when does an HPC cluster actually require a parallel file system like Lustre, BeeGFS, or GPFS?

This article breaks down the practical factors that help HPC admins make that decision.

⸻

Understanding the Difference

NFS (Network File System)

NFS is a centralized file-sharing system where compute nodes access data from a single storage server.

Why admins love it

Easy to configure
Minimal infrastructure
Simple backups
Lower operational overhead
Great for small clusters

Common HPC usage

Home directories
Software repositories
Small research workloads
Shared scripts and configuration files

⸻

Parallel File Systems

A parallel file system distributes storage operations across multiple servers and disks simultaneously.

Examples include:

Lustre
BeeGFS
IBM GPFS / Spectrum Scale
WekaFS

Why they exist

They are designed for:

Massive throughput
High concurrency
Thousands of simultaneous reads/writes
Large-scale HPC and AI workloads

⸻

The Real Decision: Workload, Not Cluster Size

One of the biggest misconceptions is:

“Large cluster = parallel file system.”

Not always.

A 500-node cluster running lightweight CPU simulations may work perfectly fine with NFS.

Meanwhile, a 20-node GPU AI cluster can completely overwhelm NFS in days.

The decision depends more on:

I/O behavior
Data size
Concurrency
Metadata pressure
Performance expectations

⸻

Key Factors That Decide Between NFS and Parallel Storage

1. Number of Concurrent Jobs

This is usually the first warning sign.

NFS works well when:

Few jobs access storage simultaneously
Workloads are mostly compute-heavy
Files are read occasionally

Problems start when:

Hundreds of jobs hit storage together
Many users submit jobs simultaneously
Applications continuously read/write checkpoints

Symptoms

Jobs stuck in I/O wait
Slow application startup
Hanging MPI jobs
High NFS server load

If your storage server becomes the cluster bottleneck, parallel storage should be considered.

⸻

2. I/O Pattern of Applications

Different applications stress storage differently.

NFS handles well:

Sequential reads
Small user datasets
Software sharing
Log files
Light checkpointing

Parallel file systems are better for:

Large checkpoint files
Frequent writes
Multi-node parallel reads
AI training datasets
CFD and FEM simulations
Genomics pipelines
High-throughput workflows

Example

A simulation writing:

1 GB every hour → NFS is usually fine

A deep learning job where:

32 GPUs constantly read millions of small images → NFS may collapse quickly

⸻

3. Metadata Operations

This is one of the most ignored storage bottlenecks in HPC.

Metadata operations include:

Opening files
Closing files
Listing directories
Creating small files
File existence checks

AI and genomics workloads often generate:

Millions of tiny files
Heavy directory scans

NFS struggles badly under metadata storms because a single server handles everything.

Parallel file systems distribute metadata handling across multiple servers.

⸻

4. Storage Throughput Requirements

Ask yourself:

How much aggregate bandwidth does the cluster need?

Example

If:

50 nodes each require 500 MB/s
Total required throughput = 25 GB/s

A single NFS server is unlikely to sustain this consistently.

Parallel storage is specifically designed for aggregate throughput scaling.
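The arithmetic in this example is easy to script when sizing storage. The sketch below repeats the calculation and, as an added assumption, compares it with the line rate of a single 100 Gbit/s network link to show a lower bound on how many such links the storage side would need. Function names are illustrative, and disk and protocol overheads are ignored.

```python
import math

def aggregate_gb_per_s(nodes, mb_per_s_per_node):
    """Total required throughput in GB/s (decimal units: 1 GB = 1000 MB)."""
    return nodes * mb_per_s_per_node / 1000

def links_needed(required_gb_per_s, link_gbit_per_s):
    """Minimum number of network links at line rate (8 bits per byte)."""
    link_gb_per_s = link_gbit_per_s / 8
    return math.ceil(required_gb_per_s / link_gb_per_s)

required = aggregate_gb_per_s(nodes=50, mb_per_s_per_node=500)
print(f"required: {required} GB/s")
print(f"100 Gbit/s links needed at line rate: {links_needed(required, 100)}")
```

Even before disks are considered, the network alone puts this demand beyond a server with a single 100 Gbit/s port.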

⸻

5. GPU Workloads

GPU clusters expose storage weaknesses extremely fast.

Why?

Because GPUs process data faster than CPUs and can become idle waiting for storage.

Common signs

GPU utilization drops
Data loader bottlenecks
Training stalls
NCCL timeout side effects
Slow checkpoint saves

For modern AI clusters, storage throughput becomes just as important as GPU performance.

⸻

6. Checkpointing Frequency

Large HPC jobs periodically save state to disk.

This is called checkpointing.

NFS struggles when:

Hundreds of jobs checkpoint together
Checkpoint files are huge
Writes occur frequently

This creates:

I/O spikes
Server saturation
Job slowdowns

Parallel file systems distribute write operations and handle burst traffic much better.

⸻

7. Scalability Expectations

Think beyond today.

NFS is usually enough for:

Labs
University research groups
Small clusters
Development environments

Parallel storage becomes attractive when:

Cluster growth is expected
More users are added regularly
GPU adoption increases
Storage demand grows every quarter

Migrating later is possible, but painful.

Planning early saves operational headaches.

⸻

8. High Availability Requirements

With NFS:

One storage server often becomes a single point of failure

If that server goes down:

Jobs fail
Mounts freeze
Users lose access

Parallel file systems typically support:

Redundant metadata servers
Distributed storage targets
Better failover models

This matters heavily in production HPC environments.

⸻

When NFS Is Completely Fine

NFS is still a perfectly valid HPC solution when:

Cluster size is small or medium
Workloads are CPU-heavy
I/O demand is modest
User count is limited
Budgets are constrained
Simulations are compute-bound
Storage traffic is predictable

Many successful HPC environments run on NFS for years without major issues.

Do not deploy complex parallel storage just because it sounds “enterprise.”

Operational simplicity matters.

⸻

When a Parallel File System Becomes Necessary

You should seriously evaluate parallel storage if you observe:

High I/O wait times
Saturated NFS server CPU/network
GPU starvation
Slow checkpointing
Metadata bottlenecks
Thousands of simultaneous file operations
Multi-GB/s throughput demand
Frequent user complaints about storage slowness

At that point, storage is no longer infrastructure.

It becomes part of application performance.

⸻

Practical Rule of Thumb

Stay with NFS if:

Storage is not your bottleneck
Applications are compute-heavy
Simplicity is more valuable than scale

Move to parallel storage if:

Storage limits job performance
GPU utilization suffers
I/O scales faster than compute
Metadata load becomes extreme

⸻

Final Thoughts

There is no universal answer in HPC storage architecture.

The best storage system is not the most advanced one.

It is the one that:

Matches workload behavior
Scales with demand
Stays operationally manageable
Delivers consistent performance

For many clusters, NFS remains the right choice.

But once storage starts limiting compute performance, a parallel file system stops being optional and becomes necessary infrastructure.

⸻

How Modules Work in HPC

Muhammad Zubair Bin Akbar — Wed, 06 May 2026 15:31:47 +0000

If you have ever logged into an HPC cluster and typed something like:

module load gcc

…you have already used one of the most important tools in HPC environments: an environment module system, which on many clusters is Lmod.

But what’s actually happening behind the scenes? And why do we even need modules in the first place?

Let’s break it down in a simple, practical way.

The Problem: Too Many Software Versions

HPC systems are shared by many users, and different projects often need different versions of the same software.

For example:

One user needs Python 3.8
Another needs Python 3.11
Someone else depends on a specific GCC compiler version

Installing everything globally would create conflicts and chaos.

So instead of forcing one version on everyone, HPC systems use environment modules.

What Lmod Actually Does

Lmod is a system that dynamically modifies your shell environment so you can switch between software versions easily.

When you run:

module load python/3.11

Lmod:

Updates your PATH
Sets environment variables like LD_LIBRARY_PATH
Ensures dependencies are correctly configured

In simple terms:

It prepares your environment so the right software works correctly.

Think of It Like This

Imagine your environment as a workspace.

Each module you load:

Adds tools to your workspace
Configures them correctly
Avoids interfering with other tools

Without modules, you’d have to manually set everything yourself every time.

Basic Commands You’ll Use

List available modules

module avail

Load a module

module load gcc/12.2

Unload a module

module unload gcc

See what’s currently loaded

module list

Swap versions easily

module swap python/3.8 python/3.11

What Are Modulefiles?

Behind every module is a modulefile.

This is just a script (usually written in Lua for Lmod) that tells the system:

What paths to add
What variables to set
What dependencies to load

Example idea:

prepend_path("PATH", "/opt/gcc/12.2/bin")

You don’t usually need to edit these, but it helps to know they exist.
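What a prepend_path line does to your environment can be mimicked in Python. This is only a sketch of the effect on a PATH-style string, not Lmod's implementation; dropping an existing copy of the directory before prepending is an assumption of the sketch, and the function operates on a plain dictionary rather than the real environment.

```python
def prepend_path(env, name, directory, sep=":"):
    """Put `directory` at the front of a PATH-style variable in `env`."""
    current = env.get(name, "")
    # Keep existing entries, skipping empty ones and any earlier copy.
    parts = [p for p in current.split(sep) if p and p != directory]
    env[name] = sep.join([directory] + parts)
    return env

env = {"PATH": "/usr/bin:/bin"}
prepend_path(env, "PATH", "/opt/gcc/12.2/bin")
print(env["PATH"])
```

Because the new directory comes first, the shell finds the module's gcc before any system-wide one, which is how loading a module changes which version runs.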

Handling Dependencies Automatically

One of the biggest advantages of Lmod is dependency management.

If you load something like:

module load openmpi

Lmod can automatically:

Load the correct compiler
Avoid incompatible versions
Prevent conflicts

This saves a lot of debugging time.

Common Gotchas

1. Mixing incompatible modules

Loading different compilers and MPI stacks together can break things.

Stick to consistent toolchains.

2. Forgetting to load modules in job scripts

What works in your shell might fail in Slurm if modules aren’t loaded.

Always include:

module load <required-modules>

3. Dirty environments

If things behave strangely:

module purge

This unloads all currently loaded modules so you can start again from a clean environment.

Why Lmod Matters in HPC

Lmod makes HPC usable at scale by:

Avoiding software conflicts
Supporting multiple users and workflows
Simplifying environment setup
Making jobs reproducible

Without it, managing software on clusters would be painful and error prone.

Final Thoughts

You don’t need to understand every detail of Lmod to use it effectively.

Just remember:

Modules control your environment.
Your environment controls your results.

Once you get comfortable with modules, debugging HPC jobs becomes much easier.

Inside Job Logs: What to Look For When Things Break

Muhammad Zubair Bin Akbar — Mon, 04 May 2026 22:37:22 +0000

When a job fails on an HPC cluster, your first instinct might be to rerun it and hope for a different outcome. That rarely works. The real answers are almost always sitting quietly in your job logs.

Understanding how to read those logs effectively can save hours of guesswork and help you fix issues faster and more confidently.

Start With the Basics: Exit Codes

Every job finishes with an exit code. This is the simplest signal of what happened.

0 means success
Non-zero values indicate failure

In Slurm, you will often see something like:

ExitCode=1:0

The first number is the job’s exit status, and the second is the signal. If the signal is non-zero, it usually points to something more abrupt, like a kill or crash.
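When scanning many jobs, it helps to split this field programmatically. The helper below is a small sketch that accepts either the full ExitCode=1:0 form or just 1:0; the function name is illustrative.

```python
def parse_exit_code(field):
    """Split "ExitCode=1:0" (or "1:0") into (exit_status, signal)."""
    value = field.split("=", 1)[-1]      # drop the "ExitCode=" prefix if present
    status, signal = value.split(":")
    return int(status), int(signal)

status, signal = parse_exit_code("ExitCode=1:0")
print(f"exit status {status}, signal {signal}")
```

A result such as (0, 9) would mean the process did not return an error itself but was terminated by signal 9 (SIGKILL).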

Check Standard Output and Error Files

Slurm writes logs to files like:

slurm-<jobid>.out

Or custom paths defined in your job script:

#SBATCH --output=job.out
#SBATCH --error=job.err

These files are your primary source of truth.

stdout shows normal program output
stderr shows warnings, errors, and crashes

Always read stderr first when debugging.

Look for the First Error, Not the Last

A common mistake is focusing on the last line of the log. In reality, the root cause often appears much earlier.

For example:

File not found: input.dat
Segmentation fault (core dumped)

The segmentation fault is just a consequence. The missing file is the real issue.
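One way to make "find the first error" a habit is a small scanner that stops at the earliest suspicious line. This is only a sketch: the keyword list in `ERROR_PATTERN` is an assumption and should be tuned to the messages your own applications print.

```python
import re

# Hypothetical keyword list; extend it for your own software stack.
ERROR_PATTERN = re.compile(
    r"error|not found|no such file|traceback|segmentation fault|killed",
    re.IGNORECASE,
)

def first_error(lines):
    """Return (line_number, text) of the earliest matching line, or None."""
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            return number, line.rstrip("\n")
    return None

log = [
    "Starting run",
    "File not found: input.dat",
    "Segmentation fault (core dumped)",
]
print(first_error(log))  # (2, 'File not found: input.dat')
```

For a real job you could pass an open file instead of a list, for example `first_error(open("job.err"))`.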

Memory Issues: Subtle but Common

Memory problems show up in different ways depending on how the system enforces limits.

Typical signs include:

Out Of Memory
Killed
oom-kill event

In Slurm, you might also see:

slurmstepd: error: Detected 1 oom-kill event(s)

If this happens, your job likely exceeded its allocated memory. Increase --mem or optimize memory usage.

Node-Level Failures vs Application Errors

Not every failure is your fault.

Application Errors

Segmentation faults
Python tracebacks
Missing libraries

These point to issues in your code or environment.

System or Node Issues

Block device required
I/O error
Node unreachable messages

These suggest problems with the compute node, filesystem, or scheduler.

If multiple jobs fail on the same node, it’s a strong signal of a node issue.

Environment and Dependency Problems

A job might fail simply because something isn’t loaded.

Look for:

command not found
module: not found
libXYZ.so: cannot open shared object file

These errors usually mean:

Missing modules
Incorrect environment setup
Wrong software versions

Double-check your module loads and environment variables.

MPI and Multi-Node Clues

For parallel jobs, logs can get noisy. Focus on patterns:

Rank-specific failures
Communication errors
Timeouts

Examples include:

MPI_ABORT was invoked
NCCL error
connection timed out

These often point to network issues, misconfiguration, or mismatched libraries.
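The signatures quoted in the memory, environment, and MPI sections can be grouped into a rough classifier. This is a sketch, not an exhaustive rule set: the category names and the `SIGNATURES` table are assumptions built only from the example messages in this post.

```python
# Category names and substrings are illustrative, taken from the examples above.
SIGNATURES = [
    ("memory", ("oom-kill", "out of memory")),
    ("environment", ("command not found", "module: not found",
                     "cannot open shared object file")),
    ("mpi/network", ("mpi_abort", "nccl error", "connection timed out")),
]

def classify(line):
    """Map one log line to a coarse failure category, or 'unknown'."""
    lowered = line.lower()
    for category, needles in SIGNATURES:
        if any(needle in lowered for needle in needles):
            return category
    return "unknown"

print(classify("slurmstepd: error: Detected 1 oom-kill event(s)"))  # memory
print(classify("libXYZ.so: cannot open shared object file"))        # environment
print(classify("MPI_ABORT was invoked"))                            # mpi/network
```

A classifier like this only narrows the search. The surrounding log lines still decide the actual root cause.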

Timing and Resource Clues

Sometimes the issue isn’t a crash, but inefficiency or limits.

Look for:

Jobs stopping exactly at walltime
Slow startup or long idle times
Uneven resource usage

Slurm accounting tools like sacct and seff can complement logs and give a clearer picture.
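To check whether a job stopped at its walltime, you can compare the elapsed time with the time limit. The sketch below assumes both values are in the `[D-]HH:MM:SS` form that `sacct` prints for its `Elapsed` and `Timelimit` fields; special values such as `UNLIMITED` are not handled, and the 60 second tolerance is an arbitrary choice.

```python
def slurm_time_to_seconds(value):
    """Convert a '[D-]HH:MM:SS' string to a number of seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def hit_walltime(elapsed, timelimit, tolerance_seconds=60):
    """True if the job ran to within tolerance_seconds of its limit (or past it)."""
    remaining = slurm_time_to_seconds(timelimit) - slurm_time_to_seconds(elapsed)
    return remaining <= tolerance_seconds

print(slurm_time_to_seconds("1-00:00:00"))   # 86400
print(hit_walltime("02:00:12", "02:00:00"))  # True
print(hit_walltime("00:10:00", "02:00:00"))  # False
```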

Build a Debugging Habit

Instead of reacting randomly to failures, follow a consistent approach:

Check exit code
Read stderr from top to bottom
Identify the first real error
Correlate with resource usage and job settings
Verify environment and dependencies

Over time, patterns become familiar, and debugging gets faster.

Final Thoughts

Logs are not just noise. They are structured clues about what went wrong and why.

The more time you spend understanding them, the less time you waste guessing. In HPC environments, that difference matters.

Shared vs Distributed Memory – Why It Matters More Than You Think

Muhammad Zubair Bin Akbar — Sun, 03 May 2026 22:00:20 +0000

When people start working with high performance computing or parallel systems, “memory” often sounds like a background detail. It’s not. The way memory is structured can completely change how your applications behave, scale, and even fail.

Let’s break it down in a practical way.

⸻

What is Shared Memory?

In a shared memory system, all processors access the same memory space.

Think of it like multiple people working on a single Google Doc. Everyone sees the same data, and changes are immediately visible.

Key traits:

One global memory space
Fast communication between threads
Easier to program (generally)
Requires synchronization (locks, semaphores)
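The synchronization trait is easiest to see in code. Here is a minimal sketch with Python threads standing in for any shared memory workers: four threads update one shared counter, and a lock keeps the updates from interleaving. The names `counter` and `add_many` are illustrative.

```python
import threading

counter = 0                # one variable visible to every thread
lock = threading.Lock()    # guards updates to the shared counter

def add_many(times):
    """Increment the shared counter, holding the lock for each update."""
    global counter
    for _ in range(times):
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter)  # 40000
```

Without the lock, the read-modify-write in `counter += 1` is not guaranteed to be atomic, which is the kind of race condition that shared memory programs have to guard against.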

Where you see it:

Multi core CPUs
OpenMP based applications
Single node parallel jobs

The catch:

Shared memory doesn’t scale indefinitely. As you add more cores, contention increases. Memory bandwidth becomes a bottleneck, and performance starts to drop.

⸻

What is Distributed Memory?

In distributed memory systems, each processor (or node) has its own private memory.

Now imagine each person has their own document, and they email updates to each other. Communication is explicit.

Key traits:

Separate memory per node
Communication via message passing
More control, but more complexity
Scales much better across machines

Where you see it:

HPC clusters
MPI based applications
Multi node Slurm jobs

The catch:

You have to manage communication yourself. Poor data exchange design can kill performance.

⸻

Shared vs Distributed: The Real Difference

Memory Access

In shared memory, everything lives in one global space. Any thread can read or modify data directly.

In distributed memory, each node has its own local memory. If you need data from another node, you have to explicitly request it.

Communication Style

Shared memory systems rely on implicit communication. Threads just read and write to the same variables.

Distributed systems are explicit. You send and receive messages, often using MPI. Nothing is shared unless you make it shared.

Performance Behavior

Shared memory is extremely fast at small scale since there’s no network involved.

Distributed memory shines when scaling out. You can add more nodes, but now you pay the cost of network communication.

Complexity

Shared memory is easier to get started with. You can parallelize loops and see quick results.

Distributed memory requires planning. You need to think about data distribution, communication patterns, and synchronization from the beginning.

Bottlenecks

Shared memory systems struggle with contention. Too many threads fighting over the same memory slows everything down.

Distributed systems hit network limits. Latency and bandwidth become the main constraints as you scale.

⸻

Why This Actually Matters

1. Your Code Design Changes

A shared memory program might rely on simple loops with parallel directives.

A distributed memory program forces you to think about:

Data partitioning
Communication patterns
Synchronization across nodes

Same problem, completely different mindset.
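Data partitioning is usually the first of those decisions. Below is a small sketch of one common scheme, block partitioning, where each rank gets a contiguous slice and any leftover items go to the lowest ranks. The function name `block_range` is made up for this example.

```python
def block_range(n_items, n_ranks, rank):
    """Return the half-open [start, stop) slice of items owned by one rank."""
    base, extra = divmod(n_items, n_ranks)
    # The first `extra` ranks each take one additional item.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop

for rank in range(4):
    print(rank, block_range(10, 4, rank))
# 0 (0, 3)
# 1 (3, 6)
# 2 (6, 8)
# 3 (8, 10)
```

Every rank can compute its own slice from its rank number alone, so no communication is needed just to agree on who owns what.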

⸻

2. Scaling Isn’t Automatic

A program that runs perfectly on 8 cores might fall apart on 100 nodes.

Shared memory hits hardware limits
Distributed memory introduces network overhead

Understanding the model helps you predict scaling behavior instead of guessing.

⸻

3. Debugging Becomes a Different Game

Shared memory bugs → race conditions, deadlocks
Distributed memory bugs → hangs, mismatched sends/receives

Both are painful, just in different ways.

⸻

4. Hybrid is the Reality

Modern HPC systems don’t force you to choose one.

Most real workloads use a hybrid model:

MPI between nodes (distributed)
OpenMP within a node (shared)

This is where performance tuning becomes interesting and tricky.

⸻

A Simple Analogy

Shared memory = One kitchen, many cooks
Distributed memory = Many kitchens, coordinated recipes

One is easier to manage. The other scales better.

⸻

Final Thought

If you’re working with HPC, cloud scaling, or even large data pipelines, memory architecture isn’t just a technical detail. It’s a design decision.

Ignoring it leads to:

Poor scaling
Unpredictable performance
Hard-to-debug systems

Understanding it gives you control.

And in distributed systems, control is everything.

How MPI Works Under the Hood (Without the Jargon)

Muhammad Zubair Bin Akbar — Fri, 01 May 2026 21:39:19 +0000

If you have ever run a job on an HPC cluster, chances are you have used MPI without fully knowing what’s happening behind the scenes. And that’s completely normal. MPI often feels like a black box that just “makes parallel jobs work.”

Let’s open that box a bit, without diving into heavy theory or academic jargon.

⸻

The Basic Idea

MPI (Message Passing Interface) is simply a way for multiple processes to talk to each other while running a program.

Think of it like this:

Instead of one program doing all the work, MPI lets you run many copies of the same program. Each copy handles a portion of the task and communicates with others when needed.

⸻

What Actually Happens When You Run an MPI Job?

When you launch an MPI job using something like:

mpirun -np 4 ./my_app

Here’s what’s going on under the hood:

1. Multiple Processes Are Started

MPI doesn’t create threads. It starts completely separate processes.

Each process:

Has its own memory space
Runs independently
Gets a unique ID called a rank

⸻

2. Each Process Knows Its Role

Every MPI process gets a rank:

Rank 0 → usually the coordinator
Rank 1, 2, 3… → workers

Your code uses these ranks to decide who does what.

⸻

3. Communication Happens via Messages

Processes don’t share memory. Instead, they send and receive messages.

Example:

Process 0 sends data → Process 1 receives it
Process 2 broadcasts something → everyone gets it

This is the core of MPI.
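To make the send and receive pattern concrete without needing an MPI installation, here is a toy simulation. It is an assumption-heavy sketch: Python threads stand in for ranks and one queue per rank stands in for the message channel. Real MPI ranks are separate processes, and the helpers `send` and `recv` here are not the MPI API.

```python
import queue
import threading

N_RANKS = 3
mailboxes = [queue.Queue() for _ in range(N_RANKS)]  # one inbox per simulated rank
results = {}

def send(dest, payload):
    """Deliver a message to the inbox of rank `dest`."""
    mailboxes[dest].put(payload)

def recv(rank):
    """Block until a message arrives for `rank` (give up after 5 seconds)."""
    return mailboxes[rank].get(timeout=5)

def run_rank(rank):
    if rank == 0:
        # Coordinator: hand each worker a slice, then combine the partial sums.
        data = list(range(1, 11))
        send(1, data[:5])
        send(2, data[5:])
        results["total"] = recv(0) + recv(0)
    else:
        # Worker: receive a slice, compute, send the result back to rank 0.
        chunk = recv(rank)
        send(0, sum(chunk))

ranks = [threading.Thread(target=run_rank, args=(r,)) for r in range(N_RANKS)]
for t in ranks:
    t.start()
for t in ranks:
    t.join()

print(results["total"])  # 55
```

Nothing is exchanged unless one side sends and the other receives. If a worker never sent its result, rank 0 would sit in `recv` until the timeout, which is the same shape as a hang in a real MPI job.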

⸻

What Does “Sending a Message” Really Mean?

When one process sends data:

The data is copied into a buffer
MPI hands it to the system (network or shared memory)
It travels to the target process
The receiving process copies it into its memory

If processes are:

On the same node → shared memory is used
On different nodes → network (like InfiniBand or Ethernet)

⸻

How MPI Uses the Hardware

MPI is smarter than it looks. It adapts based on where processes are running:

Same Node

Uses shared memory (fast)
No real “network” involved

Different Nodes

Uses high-speed interconnects
Optimized protocols to reduce latency

Good MPI implementations automatically pick the best method.

⸻

Synchronization (Keeping Everyone in Check)

Sometimes processes need to wait for each other.

MPI provides mechanisms like:

Barriers → everyone pauses until all reach a point
Collective operations → like broadcast, reduce

This ensures coordination across processes.
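A barrier is simple to demonstrate. The sketch below again uses Python threads as stand-ins for ranks (an assumption made to keep it self-contained) and `threading.Barrier` in the role that `MPI_Barrier` plays in MPI: no participant starts phase 2 until all of them have finished phase 1.

```python
import threading

N = 4
barrier = threading.Barrier(N)   # releases only when all N participants arrive
events = []
events_lock = threading.Lock()

def run_phases(rank):
    with events_lock:
        events.append(("phase1", rank))
    barrier.wait()               # everyone pauses here until all have arrived
    with events_lock:
        events.append(("phase2", rank))

workers = [threading.Thread(target=run_phases, args=(r,)) for r in range(N)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Every phase1 entry is recorded before any phase2 entry.
print([phase for phase, _ in events])
```

The order of ranks within each phase can vary from run to run, but the phase ordering itself is guaranteed by the barrier.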

⸻

A Simple Mental Model

Imagine a group project:

Each person (process) works on their part
They occasionally send updates to others
One person might collect results and combine everything

MPI is just the system that:

Assigns roles
Handles communication
Keeps things in sync

⸻

Why Things Sometimes Go Wrong

MPI issues often come from:

One process waiting for a message that never arrives
Mismatched send/receive calls
Network or node issues
Poor workload distribution

Because everything runs independently, small mistakes can cause hangs or failures.

⸻

Why MPI Is Still So Widely Used

Despite newer technologies, MPI remains dominant in HPC because:

It scales extremely well
Works across thousands of nodes
Gives precise control over communication
Is highly optimized for performance

⸻

Final Thoughts

MPI isn’t magic. It’s just a well-designed system for:

Running multiple processes
Passing messages between them
Coordinating work efficiently

Once you understand that, debugging and optimizing MPI jobs becomes much easier.

Bare Metal vs Virtual Machines vs Containers in HPC

Muhammad Zubair Bin Akbar — Thu, 30 Apr 2026 19:29:11 +0000

High Performance Computing isn’t just about powerful CPUs and fast interconnects. The way workloads are deployed matters just as much. Whether you’re running simulations, AI training, or large-scale data processing, choosing between bare metal, virtual machines, and containers can directly impact performance, flexibility, and efficiency.

Let’s break it down in a practical way.

Bare Metal: Maximum Performance, Minimum Abstraction

Bare metal means running workloads directly on physical hardware without any virtualization layer.

Why HPC loves it:

Full access to CPU, memory, GPUs, and high-speed networks
No virtualization overhead
Best for tightly coupled MPI jobs

Where it shines:

Large-scale simulations
CFD, weather modeling
Latency-sensitive workloads

Trade-offs:

Harder to manage at scale
Less flexible environment control
Software conflicts can become painful

Bare metal is still the gold standard in traditional HPC clusters, especially when every microsecond counts.

Virtual Machines: Isolation with Overhead

Virtual Machines (VMs) add a hypervisor layer, allowing multiple OS instances on the same hardware.

Why they’re used:

Strong isolation between workloads
Easy to snapshot, clone, and migrate
Good for multi-tenant environments

Where they fit in HPC:

Cloud-based HPC setups
Development and testing environments
Workloads that don’t need ultra-low latency

Trade-offs:

Performance overhead (CPU, I/O, networking)
Limited access to specialized hardware (unless using passthrough)

VMs are more common in cloud HPC than in on-prem clusters, where performance loss is less acceptable.

Containers: Lightweight and Portable

Containers package applications with their dependencies, running on the host OS without a full VM.

Why they’re gaining popularity:

Near bare-metal performance
Easy reproducibility
Portable across environments

Popular tools in HPC:

Docker (less common in production HPC)
Singularity / Apptainer (designed for HPC)

Where they shine:

AI/ML workloads
Research reproducibility
Complex software stacks

Trade-offs:

Shared kernel (less isolation than VMs)
Requires proper integration with schedulers like Slurm

Containers strike a strong balance between performance and flexibility, which is why they’re rapidly becoming standard in modern HPC environments.

Choosing the Right Model for Your Use Case

Instead of thinking about which one is “better,” it’s more useful to map each approach to real HPC scenarios.

Go with bare metal when:

You’re running tightly coupled MPI jobs across nodes
Network latency and bandwidth are critical
You need full GPU or accelerator performance
You’re operating a traditional on-prem HPC cluster

This is typical in scientific computing, engineering simulations, and large-scale physics workloads.

Use virtual machines when:

You’re running HPC workloads in the cloud
Multiple users or teams need strict isolation
You want to spin up environments quickly for testing
Performance is important, but not the absolute priority

VMs make sense in hybrid HPC setups or when infrastructure flexibility matters more than squeezing out every bit of performance.

Choose containers when:

You need reproducible environments across clusters
Your workloads depend on complex or conflicting libraries
You’re running AI/ML pipelines or modern data workloads
You want users to bring their own software stack easily

Containers are especially powerful in research environments where portability and consistency are critical.

The Real-World Approach

Most modern HPC environments don’t rely on just one model.

A common pattern looks like this:

Bare metal nodes for raw compute power
Containers for application portability
Virtual machines in cloud or hybrid layers

This hybrid approach gives you the best of all worlds: performance, flexibility, and scalability.

Final Thought

HPC is evolving beyond just hardware. The focus is shifting toward how efficiently workloads can be deployed and reproduced.

Bare metal still dominates performance-critical workloads. Containers are redefining usability and portability. Virtual machines fill the gap where flexibility and isolation are needed.

The right choice depends on what you’re optimizing for, not what’s trending.

What Actually Happens When You Run sbatch in Slurm

Muhammad Zubair Bin Akbar — Tue, 28 Apr 2026 20:40:47 +0000

If you work with HPC clusters, you likely use sbatch every day. You submit a script and expect it to run.

But that single command triggers a full workflow inside Slurm.

Understanding this internal flow helps you debug issues faster, optimize job performance, and better understand how your cluster behaves.

⸻

Step 1: Submitting the Job

When you run:

sbatch job.sh

You are not starting the job. You are submitting a request to Slurm.

The script includes:

Resource requirements such as CPUs, memory, GPUs
Job metadata like name and output paths
The actual commands to execute

At this point, Slurm simply accepts the job.
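To see what a submission actually carries, the sketch below pulls the `#SBATCH` directives out of a job script. It is a simplified reader, not Slurm's own parser: it only understands the `--name=value` form, and the sample script contents are made up. It does mirror one documented behavior, which is that directives placed after the first command are ignored.

```python
def parse_sbatch_directives(script_text):
    """Collect '--name=value' #SBATCH options that appear before the first command."""
    directives = {}
    for raw_line in script_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if not line.startswith("#"):
            break  # first real command: later #SBATCH lines are not processed
        if not line.startswith("#SBATCH"):
            continue  # shebang or ordinary comment
        option = line[len("#SBATCH"):].strip()
        if option.startswith("--") and "=" in option:
            name, _, value = option[2:].partition("=")
            directives[name] = value
    return directives

# Hypothetical job script used only for this example.
script = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --output=job.out
srun ./my_app
"""
print(parse_sbatch_directives(script))
# {'job-name': 'demo', 'cpus-per-task': '4', 'mem': '8G', 'output': 'job.out'}
```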

⸻

Step 2: Communication with slurmctld

The sbatch command sends the job to the Slurm controller daemon, slurmctld.

This daemon:

Assigns a Job ID
Stores the job details
Marks the job as PENDING

Nothing is running yet.

⸻

Step 3: Job Enters the Queue

The job is now placed in the scheduling queue.

The scheduler evaluates:

Job priority
Fairshare usage
Partition limits
Resource availability

This determines when your job will run.

⸻

Step 4: Scheduling Decision

The scheduler continuously checks:

Free nodes
Resource fragmentation
Backfill opportunities

If your job fits available resources, it gets selected. Otherwise, it stays pending.

⸻

Step 5: Resource Allocation

Once selected, Slurm:

Assigns specific compute nodes
Reserves CPUs, memory, and GPUs
Changes job state to RUNNING

Now your job has allocated resources.

⸻

Step 6: Node-Level Communication

Each compute node runs a daemon called slurmd.

The controller sends job details to these nodes. The nodes prepare the execution environment.

⸻

Step 7: Job Execution via slurmstepd

On the compute node, slurmstepd is launched.

This process:

Starts your application
Manages job steps
Handles output and error streams
Enforces resource limits (through cgroups, where the cluster is configured to use them)

Your script begins executing here.

⸻

Step 8: Monitoring During Execution

While the job runs:

Slurm tracks resource usage
Logs are written to output files
Accounting data is collected

You can monitor the job using:

squeue
scontrol show job <jobid>
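The output of `scontrol show job` is a series of `Key=Value` pairs, which makes it easy to read from a script. The sketch below splits on whitespace, so values that contain spaces are not handled, and the `sample` text is a shortened, made-up example rather than real cluster output.

```python
def parse_scontrol(text):
    """Turn whitespace-separated Key=Value tokens into a dictionary."""
    fields = {}
    for token in text.split():
        key, separator, value = token.partition("=")
        if separator:
            fields[key] = value
    return fields

# Abbreviated, hypothetical output for illustration only.
sample = """JobId=12345 JobName=demo
   JobState=PENDING Reason=Resources
"""
job = parse_scontrol(sample)
print(job["JobState"], job["Reason"])  # PENDING Resources
```

On a cluster, the text would come from something like `subprocess.run(["scontrol", "show", "job", "12345"], capture_output=True, text=True).stdout`.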

⸻

Step 9: Job Completion

When the job finishes:

slurmstepd exits
Resources are released
Temporary processes are cleaned up

The job state becomes COMPLETED, FAILED, TIMEOUT, or CANCELLED.

⸻

Step 10: Accounting and Logs

Finally:

Job statistics are stored
Output files remain available
Usage data is recorded

You can check this using:

sacct
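For scripted checks, `sacct` can print pipe-delimited rows that are easy to split. The sketch below builds the command and parses its output. The choice of fields is an assumption, and the `sample` rows are invented so the example runs without a cluster.

```python
FIELDS = ["JobID", "State", "ExitCode", "Elapsed"]

def sacct_command(job_id):
    """Build a sacct call that prints one pipe-delimited row per job step."""
    return [
        "sacct", "-j", str(job_id),
        "--format=" + ",".join(FIELDS),
        "--parsable2", "--noheader",
    ]

def parse_sacct(output):
    """Convert pipe-delimited sacct rows into a list of dictionaries."""
    return [
        dict(zip(FIELDS, line.split("|")))
        for line in output.splitlines()
        if line.strip()
    ]

# Made-up rows standing in for real accounting output.
sample = "12345|COMPLETED|0:0|00:10:03\n12345.batch|COMPLETED|0:0|00:10:03\n"
for row in parse_sacct(sample):
    print(row["JobID"], row["State"], row["ExitCode"])
# 12345 COMPLETED 0:0
# 12345.batch COMPLETED 0:0
```

To run it for real, pass `sacct_command(job_id)` to `subprocess.run(..., capture_output=True, text=True)` and feed the resulting `stdout` to `parse_sacct`.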

⸻

Full Flow Summary

Submit job using sbatch
slurmctld receives and queues it
Scheduler evaluates priority
Resources are allocated
slurmd prepares nodes
slurmstepd runs the job
Job completes and resources are released

⸻

Common Misconceptions

“sbatch runs the job immediately”
It only submits the job.

“Pending means failure”
It usually means waiting for resources.

“Slurm just runs scripts”
It manages scheduling, allocation, execution, and cleanup.

⸻

Final Thought

sbatch may look simple, but it triggers a complete orchestration pipeline inside Slurm.

Once you understand this flow, debugging becomes easier, performance tuning improves, and cluster behavior becomes predictable.

⸻

4 Practical Boto3 Scripts for S3 Every DevOps Engineer Should Know

Muhammad Zubair Bin Akbar — Sun, 26 Apr 2026 21:54:05 +0000

Working with AWS S3 through the console is fine until you need automation, repeatability, and control. That’s where Boto3 comes in. In this post, we’ll walk through four practical Python scripts to manage S3 efficiently.

⸻

1. List All S3 Buckets with Creation Dates

A simple script to get visibility into your S3 environment.

import boto3
s3 = boto3.client('s3')
response = s3.list_buckets()
print("S3 Buckets:\n")
for bucket in response['Buckets']:
    print(f"Name: {bucket['Name']} | Created On: {bucket['CreationDate']}")

Why this matters:

Useful for audits, inventory tracking, or quick checks across accounts.

⸻

2. Upload a File to S3 with Error Handling

Uploading files is common, but handling failures properly is what makes scripts production-ready.

import boto3
# FileNotFoundError is a Python built-in, so only the botocore exceptions are imported
from botocore.exceptions import NoCredentialsError, ClientError
s3 = boto3.client('s3')
file_name = "test.txt"
bucket_name = "your-bucket-name"
object_name = "uploads/test.txt"
try:
    s3.upload_file(file_name, bucket_name, object_name)
    print("File uploaded successfully.")
except FileNotFoundError:
    print("The file was not found.")
except NoCredentialsError:
    print("Credentials not available.")
except ClientError as e:
    print(f"AWS Error: {e}")

Why this matters:

Prevents silent failures and gives clear debugging output.

⸻

3. Download Files from S3 with Progress Tracking

For large files, progress tracking makes a big difference.

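import threading
import boto3
s3 = boto3.client('s3')
bucket_name = "your-bucket-name"
object_name = "large-file.zip"
file_name = "downloaded.zip"
# The callback receives the bytes moved since the previous call, possibly from
# several transfer threads, so keep a running total behind a lock.
total_transferred = 0
progress_lock = threading.Lock()
def progress_callback(bytes_transferred):
    global total_transferred
    with progress_lock:
        total_transferred += bytes_transferred
        print(f"Transferred: {total_transferred} bytes")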
s3.download_file(
    bucket_name,
    object_name,
    file_name,
    Callback=progress_callback
)
print("Download complete.")

Why this matters:

Gives visibility into long-running downloads, which is especially useful in automation pipelines.

⸻

4. Create and Delete S3 Buckets Programmatically

Automating bucket lifecycle management is useful in testing and dynamic environments.

import boto3
from botocore.exceptions import ClientError
# The client region should match the LocationConstraint used below
s3 = boto3.client('s3', region_name='eu-west-1')
bucket_name = "my-unique-bucket-name-12345"
# Create Bucket
try:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={
            'LocationConstraint': 'eu-west-1'
        }
    )
    print("Bucket created successfully.")
except ClientError as e:
    print(f"Error creating bucket: {e}")
# Delete Bucket
try:
    s3.delete_bucket(Bucket=bucket_name)
    print("Bucket deleted successfully.")
except ClientError as e:
    print(f"Error deleting bucket: {e}")

Note:

Make sure the bucket is empty before deleting, otherwise the delete operation will fail.
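One way to handle that is to empty the bucket first. The sketch below is a minimal helper, assuming a non-versioned bucket (versioned buckets also need their object versions removed). The function name `empty_bucket` is made up; `get_paginator`, `list_objects_v2`, and `delete_objects` are standard S3 client calls.

```python
def empty_bucket(s3_client, bucket_name):
    """Delete every object in a non-versioned bucket; return how many were removed."""
    paginator = s3_client.get_paginator("list_objects_v2")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket_name):
        # Pages from an empty bucket carry no "Contents" key.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted

# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   s3 = boto3.client("s3")
#   empty_bucket(s3, "my-unique-bucket-name-12345")
#   s3.delete_bucket(Bucket="my-unique-bucket-name-12345")
```

Passing the client in as an argument keeps the helper easy to test with a stub and lets you reuse whatever region or credentials the caller already set up.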

⸻

Final Thoughts

These four scripts cover the most common S3 operations:

Visibility (listing buckets)
Data movement (upload/download)
Resource lifecycle (create/delete)

They’re simple, but extremely useful when building automation around AWS.

As you scale, you can extend these with:

Logging
Retry mechanisms
Parallel uploads/downloads
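As an example of the retry idea, here is a generic wrapper with exponential backoff. It is a sketch in plain Python, not a Boto3 feature: the names `retry` and `flaky` are illustrative, and the delay values are arbitrary.

```python
import time

def retry(operation, attempts=3, base_delay=0.5, retry_on=(Exception,), sleep=time.sleep):
    """Call operation(); on a listed exception wait and try again, doubling the delay."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))

# Demo with a stand-in operation that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

# The no-op sleep keeps the demo instant; drop it to get real delays.
print(retry(flaky, retry_on=(ConnectionError,), sleep=lambda seconds: None))  # ok
```

With an S3 call you would wrap the operation in a lambda, for example `retry(lambda: s3.upload_file(file_name, bucket_name, object_name), retry_on=(ClientError,))`, and limit `retry_on` to errors that are worth retrying.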

This is the kind of practical automation that saves time in real environments.