DEV Community

Muhammad Zubair Bin Akbar

Optimizing MPI Performance (Real Examples)

MPI jobs that run are easy.
MPI jobs that run fast and efficiently — that’s where things get interesting.

If your application scales poorly, takes longer than expected, or wastes CPU time, the issue is usually not the code itself… it’s how it’s being run.

That said, performance tuning is rarely about a single fix. The examples below highlight common issues and improvements, but in real-world HPC workloads, these are often just one of several factors impacting performance.

Here’s a practical breakdown of MPI performance tuning, with real examples you can apply immediately.

Where MPI Performance Actually Breaks

Most MPI slowdowns come from:

  • Poor process placement
  • Network bottlenecks
  • Imbalanced workloads
  • Excessive communication
  • Memory/NUMA issues

The tricky part is that these don’t show up as errors — just slow jobs.

1. CPU Binding & Process Placement

The Problem

By default, MPI processes can migrate between cores, leading to cache misses and extra context switching.

The Fix

Bind processes to cores explicitly.

mpirun --bind-to core --map-by socket -np 32 ./app

On Slurm:

srun --cpu-bind=cores ./app

Real Impact

  • Before: ~65% CPU utilization
  • After: ~90%+ CPU utilization

2. NUMA Awareness (Hidden Performance Killer)

On multi-socket systems, memory access is not uniform: each socket reaches its local memory faster than memory attached to another socket.

The Problem

A process runs on one socket but accesses memory from another, increasing latency.

The Fix

Use NUMA-aware mapping:

mpirun --map-by ppr:1:numa ./app

Check layout:

numactl --hardware

Real Impact

  • Reduced memory latency
  • Better scaling across nodes

3. Network Optimization (InfiniBand / Omni-Path)

MPI performance depends heavily on the interconnect.

The Problem

MPI falls back to TCP instead of using high-speed fabric.

The Fix

Set the correct transport layer.

For Intel MPI:

export I_MPI_FABRICS=shm:ofi

For Open MPI:

mpirun --mca pml ucx --mca btl ^tcp ./app

Real Impact

  • Lower latency
  • Faster multi-node scaling

4. Load Imbalance Between Processes

The Problem

Some ranks finish early while others continue working, leaving CPUs idle.

Detect It

Time each rank’s work (for example with MPI_Wtime around the main loop), then run the job and compare the per-rank timings:

mpirun -np 4 ./app

If one rank consistently finishes later than the others, there is imbalance.

The Fix

  • Distribute work evenly
  • Use dynamic scheduling where possible

Real Impact

Even a small imbalance can reduce performance by 30–50%, because every rank waits for the slowest one at each synchronization point.

5. Too Much Communication

MPI applications often slow down due to excessive messaging.

The Problem

Frequent small messages create high communication overhead.

The Fix

  • Batch messages
  • Use collective operations such as:
    • MPI_Bcast
    • MPI_Reduce

Real Example

Replacing multiple MPI_Send calls with a single MPI_Bcast significantly improved runtime in real workloads.

6. Benchmark Before You Guess

Avoid blind optimization. Measure first.

Useful Tools

  • mpirun --report-bindings
  • Intel MPI Benchmarks: IMB-MPI1
  • OSU Micro-Benchmarks: osu_latency

What to Look For

  • Latency
  • Bandwidth
  • CPU utilization

7. Slurm-Specific Optimization

If you are using Slurm, your job script plays a critical role.

Example

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1

srun --cpu-bind=cores ./app

Key Tips

  • Match ntasks with available cores
  • Avoid oversubscription
  • Use --exclusive for consistent performance

Real Scenario (Before vs After)

Before Optimization:

  • No CPU binding
  • Default MPI settings
  • TCP communication
  • Runtime: 120 minutes

After Optimization:

  • Core binding enabled
  • NUMA-aware mapping
  • High-speed fabric (OFI/UCX)
  • Reduced communication

Runtime reduced to approximately 70 minutes, without any changes to the application code.

Final Takeaway

MPI performance is rarely about rewriting your application.

It is about:

  • Running it on the right cores
  • Using the right network
  • Avoiding unnecessary overhead

Small configuration changes can lead to significant improvements, but real-world performance is always influenced by multiple factors working together.

If your MPI job feels slower than expected, the limitation is often not the hardware — it is how efficiently it is being used.
