In HPC environments, users often notice something confusing: the same application, with the same input and the same number of CPUs, can produce very different performance results across runs.
One of the biggest reasons is CPU placement, controlled through CPU pinning and CPU affinity.
Without proper CPU placement, processes can bounce between cores, compete for cache, and suffer from NUMA penalties. In large parallel workloads, this can drastically reduce performance.
This post explains what CPU pinning and affinity are, why they matter in HPC, and how they impact real workloads.
What Is CPU Affinity?
CPU affinity controls which CPU cores a process or thread is allowed to run on.
The operating system scheduler can still move the process between the allowed cores, but only within that defined CPU set.
For example:
- A process may be allowed to run only on cores 0 to 7
- The scheduler can move it between those cores if needed
Affinity helps improve cache locality and reduces unnecessary movement across the entire system.
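On Linux, taskset is the usual way to set this allowed CPU set. A minimal sketch matching the example above (./app stands in for your binary):
taskset -c 0-7 ./app        # launch a program restricted to cores 0 to 7
taskset -cp 0-7 <pid>       # change the allowed set of an already running process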
What Is CPU Pinning?
CPU pinning locks a process or thread to a specific CPU core. It is affinity in its strictest form: the allowed CPU set shrinks to a single core.
In HPC clusters, schedulers like Slurm often handle this automatically through CPU binding options.
For example:
- MPI rank 0 stays on core 0
- MPI rank 1 stays on core 1
This minimizes CPU migrations and provides more predictable performance for HPC workloads.
Pinning ensures:
- Better cache locality
- Reduced scheduler overhead
- Predictable performance
- Lower NUMA latency
- Reduced context switching
Without pinning, Linux may move tasks between cores frequently depending on system activity.
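Pinning can be done by hand with taskset, and threaded runtimes expose their own controls; for OpenMP programs, two standard environment variables request per-thread pinning. A minimal sketch (./app is a placeholder):
taskset -c 0 ./app                            # pin the whole process to core 0
OMP_PLACES=cores OMP_PROC_BIND=close ./app    # pin OpenMP threads to consecutive physical cores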
Why Performance Changes So Much
Modern HPC nodes are complex.
A single node may contain:
- Multiple CPU sockets
- NUMA regions
- Shared and private caches
- Hyperthreading
- Hundreds of logical CPUs
When processes move randomly between CPUs, several problems appear.
Cache Locality Problems
CPUs rely heavily on cache memory.
If a thread keeps running on the same core, cached data remains available and execution becomes faster.
When the thread migrates to another core:
- Cache must be rebuilt
- Memory access latency increases
- CPU cycles are wasted
This becomes extremely expensive for tightly coupled MPI applications.
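The effect is measurable. A sketch using standard perf events; running it once with pinning and once without makes the difference visible:
perf stat -e cpu-migrations,cache-misses,cache-references ./app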
NUMA Effects
NUMA stands for Non-Uniform Memory Access.
In multi-socket systems, memory attached to the local CPU socket is faster to reach than memory attached to another socket.
If a process runs on Socket 0 but accesses memory allocated on Socket 1:
- Memory latency increases
- Bandwidth decreases
- Application performance drops
This is one of the most common reasons HPC jobs scale poorly.
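On Linux, numactl can keep both execution and allocation on one node. A minimal sketch, assuming node 0 is the target:
numactl --cpunodebind=0 --membind=0 ./app     # run on node 0 CPUs, allocate from node 0 memory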
Example of Bad CPU Placement
Consider a dual-socket server:
- Socket 0 → cores 0 to 31
- Socket 1 → cores 32 to 63
If an MPI application launches ranks without proper affinity:
- Rank 0 may start on core 2
- Later move to core 40
- Then back to core 10
Now the application suffers from:
- Remote memory access
- Cache misses
- CPU migration overhead
The result can be a major slowdown even though CPU usage appears high.
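You can watch this happening live: the psr column of ps shows the core a task last ran on (<pid> is a placeholder):
watch -n 1 'ps -o pid,psr,comm -p <pid>'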
MPI and CPU Binding
MPI applications are very sensitive to where their ranks run.
If MPI ranks keep moving between cores:
- Cache data gets lost
- Memory access becomes slower
- Communication latency increases
To avoid this, MPI runtimes and schedulers use CPU binding or pinning.
For example with Open MPI:
mpirun --bind-to core --map-by socket ./app
With Slurm:
srun --cpu-bind=cores ./app
These settings keep MPI processes fixed to specific CPU cores, which usually provides more stable and faster performance in HPC workloads.
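Both launchers can also report the bindings they apply, which is worth verifying before a long run:
mpirun --bind-to core --map-by socket --report-bindings ./app
srun --cpu-bind=verbose ./app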
Hyperthreading Can Also Matter
Some workloads perform poorly when pinned to logical CPUs instead of physical cores.
For compute-intensive applications:
- Two threads sharing one physical core may compete for resources
- Floating point performance may decrease
- Memory bandwidth may become limited
This is why many HPC sites disable hyperthreading for production workloads.
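Even when hyperthreading stays enabled in firmware, a job can avoid sharing physical cores. A sketch using Slurm's hint option, with lscpu to inspect which logical CPUs are siblings:
srun --hint=nomultithread ./app
lscpu -e          # the CORE column shows which logical CPUs share a physical core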
Real World Performance Difference
In many HPC benchmarks:
- Proper CPU affinity can improve performance by 10% to 40%
- NUMA aware placement can reduce latency significantly
- Communication-heavy MPI jobs benefit the most
Applications such as:
- CFD solvers
- Molecular dynamics
- Finite element simulations
- AI training workloads
- Weather modeling
are highly sensitive to CPU placement.
How to Check CPU Affinity
Useful Linux tools include:
taskset -p <pid>
numactl --show
lscpu
hwloc-ls
The hwloc package is especially useful for visualizing CPU topology and NUMA layout.
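For a running process, taskset reports the current core list. Illustrative output (the PID and list here are made up):
$ taskset -cp 12345
pid 12345's current affinity list: 0-7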
Best Practices in HPC
1. Use Scheduler-Managed Affinity
Let the cluster scheduler manage CPU placement whenever possible.
For example, in a Slurm batch script (--cpu-bind is an srun option, so it goes on the launch line rather than in an #SBATCH directive):
#SBATCH --cpus-per-task=8
srun --cpu-bind=cores ./app
2. Keep MPI Ranks NUMA Aware
Try to keep MPI ranks and memory allocations within the same NUMA domain.
Tools like numactl can help.
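Before choosing a binding, numactl --hardware prints the node layout, per-node memory, and the distance matrix:
numactl --hardware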
3. Benchmark Different Configurations
Different applications behave differently.
Always test:
- Core binding
- Socket binding
- NUMA placement
- Hyperthreading enabled vs disabled
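A quick way to compare binding policies is to time the same binary under each one. A rough sketch with Open MPI (the policy list and ./app are placeholders for your own run):
for policy in none core socket; do
  echo "bind-to $policy"
  time mpirun --bind-to $policy ./app
done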
4. Monitor CPU Migrations
High CPU migrations can indicate poor affinity configuration.
Useful commands:
pidstat -w
perf stat
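A hedged sketch of both, sampling once per second and counting migrations directly (<pid> is a placeholder):
pidstat -w -p <pid> 1                                # voluntary/involuntary context switches per second
perf stat -e cpu-migrations,context-switches ./app   # count migrations over a full run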
Final Thoughts
CPU pinning and affinity are often overlooked in HPC environments, but they directly affect application scalability and runtime consistency.
Two jobs using the same resources can perform very differently simply because of process placement. Understanding CPU topology, NUMA behavior, and scheduler affinity policies is essential for getting the best performance from modern HPC clusters.
In many cases, properly placing and pinning processes to CPU cores can improve performance without upgrading the hardware.