DEV Community

Muhammad Zubair Bin Akbar

CPU Pinning and Affinity in HPC: Why Performance Changes Drastically

In HPC environments, users often notice something confusing:

The same application, same input, and same number of CPUs can produce very different performance results across runs.

One of the biggest reasons behind this is CPU pinning and CPU affinity.

Without proper CPU placement, processes can bounce between cores, compete for cache, and suffer from NUMA penalties. In large parallel workloads, this can drastically reduce performance.

This blog explains what CPU pinning and affinity are, why they matter in HPC, and how they impact real workloads.


What Is CPU Affinity?

CPU affinity controls which CPU cores a process or thread is allowed to run on.

The operating system scheduler can still move the process between the allowed cores, but only within that defined CPU set.

For example:

  • A process may be allowed to run only on cores 0 to 7
  • The scheduler can move it between those cores if needed

Affinity helps improve cache locality and reduces unnecessary movement across the entire system.
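On Linux you can inspect this allowed set directly; a quick sketch using standard tools (`nproc` from coreutils, `taskset` from util-linux):

```shell
# How many CPUs this shell can see
nproc

# The affinity list of the current shell: the cores it may run on
taskset -cp $$

# The same set, reported by the kernel itself
grep Cpus_allowed_list /proc/self/status
```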


What Is CPU Pinning?

CPU pinning locks a process or thread to a specific CPU core, removing the scheduler's freedom to move it elsewhere.

In HPC clusters, schedulers like Slurm often handle this automatically through CPU binding options.

For example:

  • MPI rank 0 stays on core 0
  • MPI rank 1 stays on core 1

This minimizes CPU migrations and provides more predictable performance for HPC workloads.

Pinning ensures:

  • Better cache locality
  • Reduced scheduler overhead
  • Predictable performance
  • Lower NUMA latency
  • Reduced context switching

Without pinning, Linux may move tasks between cores frequently depending on system activity.
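Pinning can also be applied after a process has started; a minimal sketch that pins a background process to core 0 and reads the result back:

```shell
# Start a throwaway background process
sleep 2 &
pid=$!

# Pin it to core 0, then read back its affinity list to verify
taskset -cp 0 "$pid"
taskset -cp "$pid"

wait "$pid"
```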


Why Performance Changes So Much

Modern HPC nodes are complex.

A single node may contain:

  • Multiple CPU sockets
  • NUMA regions
  • Shared and private caches
  • Hyperthreading
  • Hundreds of logical CPUs

When processes move randomly between CPUs, several problems appear.


Cache Locality Problems

CPUs rely heavily on cache memory.

If a thread keeps running on the same core, cached data remains available and execution becomes faster.

When the thread migrates to another core:

  • Cache must be rebuilt
  • Memory access latency increases
  • CPU cycles are wasted

This becomes extremely expensive for tightly coupled MPI applications.


NUMA Effects

NUMA stands for Non-Uniform Memory Access.

In multi-socket systems, memory attached to the local CPU socket is faster to access than memory attached to another socket.

If a process runs on Socket 0 but accesses memory allocated on Socket 1:

  • Memory latency increases
  • Bandwidth decreases
  • Application performance drops

This is one of the most common reasons HPC jobs scale poorly.
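If the `numactl` package is installed, the node layout and the relative cost of remote access are visible directly (the distance matrix at the end of the output):

```shell
# Show NUMA nodes, their CPUs and memory sizes, and inter-node distances
# (a larger distance means more expensive remote access)
numactl --hardware
```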


Example of Bad CPU Placement

Consider a dual-socket server:

  • Socket 0 → cores 0 to 31
  • Socket 1 → cores 32 to 63

If an MPI application launches ranks without proper affinity:

  • Rank 0 may start on core 2
  • Later move to core 40
  • Then back to core 10

Now the application suffers from:

  • Remote memory access
  • Cache misses
  • CPU migration overhead

The result can be a major slowdown even though CPU usage appears high.
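The PSR column of `ps` reports the core a task last ran on, which makes this kind of wandering visible; a small sketch:

```shell
# Launch a short-lived process and check which core it is currently on
sleep 2 &
pid=$!

# PSR = processor the task last executed on; sample it repeatedly to spot migrations
ps -o pid,psr,comm -p "$pid"

wait "$pid"
```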


MPI and CPU Binding

MPI applications are very sensitive to where processes run on the CPU.

If MPI ranks keep moving between cores:

  • Cache data gets lost
  • Memory access becomes slower
  • Communication latency increases

To avoid this, MPI runtimes and schedulers use CPU binding or pinning.

For example with Open MPI:

mpirun --bind-to core --map-by socket ./app

With Slurm:

srun --cpu-bind=cores ./app

These settings keep MPI processes fixed to specific CPU cores, which usually provides more stable and faster performance in HPC workloads.


Hyperthreading Can Also Matter

Some workloads perform poorly when pinned to logical CPUs instead of physical cores.

For compute-intensive applications:

  • Two threads sharing one physical core may compete for resources
  • Floating point performance may decrease
  • Memory bandwidth may become limited

This is why many HPC sites disable hyperthreading for production workloads.
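Whether two logical CPUs are hyperthread siblings on the same physical core can be read from sysfs on Linux:

```shell
# Logical CPUs that share cpu0's physical core.
# A single number means cpu0 has no SMT sibling (hyperthreading off or absent).
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
```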


Real World Performance Difference

In many HPC benchmarks:

  • Proper CPU affinity can improve performance by 10% to 40%
  • NUMA aware placement can reduce latency significantly
  • Communication-heavy MPI jobs benefit the most

Applications such as:

  • CFD solvers
  • Molecular dynamics
  • Finite element simulations
  • AI training workloads
  • Weather modeling

are highly sensitive to CPU placement.


How to Check CPU Affinity

Useful Linux tools include:

taskset -p <pid>
numactl --show
lscpu
hwloc-ls

The hwloc package is especially useful for visualizing CPU topology and NUMA layout.
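For instance, combining two of these gives a quick picture of the NUMA layout and a process's placement (using the current shell as a stand-in for `<pid>`):

```shell
# NUMA summary: node count and which CPUs belong to each node
lscpu | grep -i numa

# Affinity mask of a specific PID (here, the current shell)
taskset -p $$
```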


Best Practices in HPC

1. Use Scheduler Managed Affinity

Let the cluster scheduler manage CPU placement whenever possible.

For example:

#SBATCH --cpus-per-task=8

srun --cpu-bind=cores ./app

Note that --cpu-bind is an option of srun rather than an #SBATCH directive, so it belongs on the launch line inside the batch script.

2. Keep MPI Ranks NUMA Aware

Try to keep MPI ranks and memory allocations within the same NUMA domain.

Tools like numactl can help.
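A common pattern is to confine both CPU placement and memory allocation to one node, so all allocations stay node-local; a sketch using numactl (with /bin/true standing in for the real application binary):

```shell
# Run a program with CPUs and memory both restricted to NUMA node 0
# ("true" is a stand-in for an application binary such as ./app)
numactl --cpunodebind=0 --membind=0 true
```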


3. Benchmark Different Configurations

Different applications behave differently.

Always test:

  • Core binding
  • Socket binding
  • NUMA placement
  • Hyperthreading enabled vs disabled

4. Monitor CPU Migrations

High CPU migrations can indicate poor affinity configuration.

Useful commands:

pidstat -w 1
perf stat -e context-switches,cpu-migrations ./app

Final Thoughts

CPU pinning and affinity are often overlooked in HPC environments, but they directly affect application scalability and runtime consistency.

Two jobs using the same resources can perform very differently simply because of process placement. Understanding CPU topology, NUMA behavior, and scheduler affinity policies is essential for getting the best performance from modern HPC clusters.

In many cases, properly placing and pinning processes to CPU cores can improve performance without upgrading the hardware.
