In HPC environments, users often notice something confusing: the same application, with the same input and the same number of CPUs, can produce very different performance results across runs.
One of the biggest reasons is CPU placement, controlled through CPU pinning and CPU affinity.
Without proper CPU placement, processes can bounce between cores, compete for cache, and suffer from NUMA penalties. In large parallel workloads, this can drastically reduce performance.
This post explains what CPU pinning and affinity are, why they matter in HPC, and how they impact real workloads.
What Is CPU Affinity?
CPU affinity controls which CPU cores a process or thread is allowed to run on.
The operating system scheduler can still move the process between the allowed cores, but only within that defined CPU set.
For example:
- A process may be allowed to run only on cores 0 to 7
- The scheduler can move it between those cores if needed
Affinity helps improve cache locality and reduces unnecessary movement across the entire system.
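On Linux, taskset is the usual way to set this allowed CPU set. A minimal sketch matching the example above (./app stands in for your binary):
taskset -c 0-7 ./app        # launch a program restricted to cores 0 to 7
taskset -cp 0-7 <pid>       # change the allowed set of an already running process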
What Is CPU Pinning?
CPU pinning locks a process or thread to a specific CPU core. It is affinity in its strictest form: the allowed CPU set shrinks to a single core.
In HPC clusters, schedulers like Slurm often handle this automatically through CPU binding options.
For example:
- MPI rank 0 stays on core 0
- MPI rank 1 stays on core 1
This minimizes CPU migrations and provides more predictable performance for HPC workloads.
Pinning ensures:
- Better cache locality
- Reduced scheduler overhead
- Predictable performance
- Lower NUMA latency
- Reduced context switching
Without pinning, Linux may move tasks between cores frequently depending on system activity.
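Pinning can be done by hand with taskset, and threaded runtimes expose their own controls; for OpenMP programs, two standard environment variables request per-thread pinning. A minimal sketch (./app is a placeholder):
taskset -c 0 ./app                            # pin the whole process to core 0
OMP_PLACES=cores OMP_PROC_BIND=close ./app    # pin OpenMP threads to consecutive physical cores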
Why Performance Changes So Much
Modern HPC nodes are complex.
A single node may contain:
- Multiple CPU sockets
- NUMA regions
- Shared and private caches
- Hyperthreading
- Hundreds of logical CPUs
When processes move randomly between CPUs, several problems appear.
Cache Locality Problems
CPUs rely heavily on cache memory.
If a thread keeps running on the same core, cached data remains available and execution becomes faster.
When the thread migrates to another core:
- Cache must be rebuilt
- Memory access latency increases
- CPU cycles are wasted
This becomes extremely expensive for tightly coupled MPI applications.
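The effect is measurable. A sketch using standard perf events; running it once with pinning and once without makes the difference visible:
perf stat -e cpu-migrations,cache-misses,cache-references ./app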
NUMA Effects
NUMA stands for Non-Uniform Memory Access.
In multi-socket systems, memory attached to the local CPU socket is faster to reach than memory attached to another socket.
If a process runs on Socket 0 but accesses memory allocated on Socket 1:
- Memory latency increases
- Bandwidth decreases
- Application performance drops
This is one of the most common reasons HPC jobs scale poorly.
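On Linux, numactl can keep both execution and allocation on one node. A minimal sketch, assuming node 0 is the target:
numactl --cpunodebind=0 --membind=0 ./app     # run on node 0 CPUs, allocate from node 0 memory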
Example of Bad CPU Placement
Consider a dual-socket server:
- Socket 0 → cores 0 to 31
- Socket 1 → cores 32 to 63
If an MPI application launches ranks without proper affinity:
- Rank 0 may start on core 2
- Later move to core 40
- Then back to core 10
Now the application suffers from:
- Remote memory access
- Cache misses
- CPU migration overhead
The result can be a major slowdown even though CPU usage appears high.
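You can watch this happening live: the psr column of ps shows the core a task last ran on (<pid> is a placeholder):
watch -n 1 'ps -o pid,psr,comm -p <pid>'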
MPI and CPU Binding
MPI applications are very sensitive to where their ranks run.
If MPI ranks keep moving between cores:
- Cache data gets lost
- Memory access becomes slower
- Communication latency increases
To avoid this, MPI runtimes and schedulers use CPU binding or pinning.
For example with Open MPI:
mpirun --bind-to core --map-by socket ./app
With Slurm:
srun --cpu-bind=cores ./app
These settings keep MPI processes fixed to specific CPU cores, which usually provides more stable and faster performance in HPC workloads.
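Both launchers can also report the bindings they apply, which is worth verifying before a long run:
mpirun --bind-to core --map-by socket --report-bindings ./app
srun --cpu-bind=verbose ./app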
Hyperthreading Can Also Matter
Some workloads perform poorly when pinned to logical CPUs instead of physical cores.
For compute-intensive applications:
- Two threads sharing one physical core may compete for resources
- Floating point performance may decrease
- Memory bandwidth may become limited
This is why many HPC sites disable hyperthreading for production workloads.
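Even when hyperthreading stays enabled in firmware, a job can avoid sharing physical cores. A sketch using Slurm's hint option, with lscpu to inspect which logical CPUs are siblings:
srun --hint=nomultithread ./app
lscpu -e          # the CORE column shows which logical CPUs share a physical core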
Real World Performance Difference
In many HPC benchmarks:
- Proper CPU affinity can improve performance by 10% to 40%
- NUMA aware placement can reduce latency significantly
- Communication-heavy MPI jobs benefit the most
Applications such as:
- CFD solvers
- Molecular dynamics
- Finite element simulations
- AI training workloads
- Weather modeling
are highly sensitive to CPU placement.
How to Check CPU Affinity
Useful Linux tools include:
taskset -p <pid>
numactl --show
lscpu
hwloc-ls
The hwloc package is especially useful for visualizing CPU topology and NUMA layout.
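For a running process, taskset reports the current core list. Illustrative output (the PID and list here are made up):
$ taskset -cp 12345
pid 12345's current affinity list: 0-7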
Best Practices in HPC
1. Use Scheduler-Managed Affinity
Let the cluster scheduler manage CPU placement whenever possible.
For example, in a Slurm batch script (--cpu-bind is an srun option, so it goes on the launch line rather than in an #SBATCH directive):
#SBATCH --cpus-per-task=8
srun --cpu-bind=cores ./app
2. Keep MPI Ranks NUMA Aware
Try to keep MPI ranks and memory allocations within the same NUMA domain.
Tools like numactl can help.
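Before choosing a binding, numactl --hardware prints the node layout, per-node memory, and the distance matrix:
numactl --hardware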
3. Benchmark Different Configurations
Different applications behave differently.
Always test:
- Core binding
- Socket binding
- NUMA placement
- Hyperthreading enabled vs disabled
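A quick way to compare binding policies is to time the same binary under each one. A rough sketch with Open MPI (the policy list and ./app are placeholders for your own run):
for policy in none core socket; do
  echo "bind-to $policy"
  time mpirun --bind-to $policy ./app
done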
4. Monitor CPU Migrations
High CPU migrations can indicate poor affinity configuration.
Useful commands:
pidstat -w
perf stat
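A hedged sketch of both, sampling once per second and counting migrations directly (<pid> is a placeholder):
pidstat -w -p <pid> 1                                # voluntary/involuntary context switches per second
perf stat -e cpu-migrations,context-switches ./app   # count migrations over a full run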
Final Thoughts
CPU pinning and affinity are often overlooked in HPC environments, but they directly affect application scalability and runtime consistency.
Two jobs using the same resources can perform very differently simply because of process placement. Understanding CPU topology, NUMA behavior, and scheduler affinity policies is essential for getting the best performance from modern HPC clusters.
In many cases, properly placing and pinning processes to CPU cores can improve performance without upgrading the hardware.