DEV Community

Muhammad Zubair Bin Akbar

Optimizing MPI Performance (Real Examples)

MPI jobs that run are easy.
MPI jobs that run fast and efficiently — that’s where things get interesting.

If your application scales poorly, takes longer than expected, or wastes CPU time, the issue is usually not the code itself… it’s how it’s being run.

That said, performance tuning is rarely about a single fix. The examples below highlight common issues and improvements, but in real-world HPC workloads, these are often just one of several factors impacting performance.

Here’s a practical breakdown of MPI performance tuning, with real examples you can apply immediately.

Where MPI Performance Actually Breaks

Most MPI slowdowns come from:

  • Poor process placement
  • Network bottlenecks
  • Imbalanced workloads
  • Excessive communication
  • Memory/NUMA issues

The tricky part is that these don’t show up as errors — just slow jobs.

1. CPU Binding & Process Placement

The Problem

By default, MPI processes can migrate between cores, leading to cache misses and extra context switching.

The Fix

Bind processes to cores explicitly.

mpirun --bind-to core --map-by socket -np 32 ./app

On Slurm:

srun --cpu-bind=cores ./app

Real Impact

  • Before: ~65% CPU utilization
  • After: ~90%+ CPU utilization

2. NUMA Awareness (Hidden Performance Killer)

On multi-socket systems, memory access is not uniform: each socket reaches its local memory faster than memory attached to another socket.

The Problem

A process runs on one socket but accesses memory from another, increasing latency.

The Fix

Use NUMA-aware mapping:

mpirun --map-by ppr:1:numa ./app

Check layout:

numactl --hardware

Real Impact

  • Reduced memory latency
  • Better scaling across nodes

3. Network Optimization (InfiniBand / Omni-Path)

MPI performance depends heavily on the interconnect.

The Problem

MPI falls back to TCP instead of using high-speed fabric.

The Fix

Set the correct transport layer.

For Intel MPI:

export I_MPI_FABRICS=shm:ofi

For Open MPI:

mpirun --mca pml ucx --mca btl ^tcp ./app

Real Impact

  • Lower latency
  • Faster multi-node scaling

4. Load Imbalance Between Processes

The Problem

Some ranks finish early while others continue working, leaving CPUs idle.

Detect It

Time each rank’s work (for example with MPI_Wtime around the main loop), then run the job and compare the per-rank timings:

mpirun -np 4 ./app

If one rank consistently finishes later than the others, there is imbalance.

The Fix

  • Distribute work evenly
  • Use dynamic scheduling where possible

Real Impact

Even a small imbalance can reduce performance by 30–50%, because every rank waits for the slowest one at each synchronization point.

5. Too Much Communication

MPI applications often slow down due to excessive messaging.

The Problem

Frequent small messages create high communication overhead.

The Fix

  • Batch messages
  • Use collective operations such as:
    • MPI_Bcast
    • MPI_Reduce

Real Example

Replacing multiple MPI_Send calls with a single MPI_Bcast significantly improved runtime in real workloads.

6. Benchmark Before You Guess

Avoid blind optimization. Measure first.

Useful Tools

  • mpirun --report-bindings
  • Intel MPI Benchmarks: IMB-MPI1
  • OSU Micro-Benchmarks: osu_latency

What to Look For

  • Latency
  • Bandwidth
  • CPU utilization

7. Slurm-Specific Optimization

If you are using Slurm, your job script plays a critical role.

Example

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1

srun --cpu-bind=cores ./app

Key Tips

  • Match ntasks with available cores
  • Avoid oversubscription
  • Use --exclusive for consistent performance

Real Scenario (Before vs After)

Before Optimization:

  • No CPU binding
  • Default MPI settings
  • TCP communication
  • Runtime: 120 minutes

After Optimization:

  • Core binding enabled
  • NUMA-aware mapping
  • High-speed fabric (OFI/UCX)
  • Reduced communication

Runtime reduced to approximately 70 minutes, without any changes to the application code.

Final Takeaway

MPI performance is rarely about rewriting your application.

It is about:

  • Running it on the right cores
  • Using the right network
  • Avoiding unnecessary overhead

Small configuration changes can lead to significant improvements, but real-world performance is always influenced by multiple factors working together.

If your MPI job feels slower than expected, the limitation is often not the hardware — it is how efficiently it is being used.
