As large language models continue to scale, they routinely exceed the memory and compute limits of any single GPU. Tensor parallelism addresses the capacity problem by sharding each layer's weights across multiple GPUs, and often across multiple servers, but it introduces a new challenge: how do we synchronize shards, route requests, and share KV-cache efficiently enough that the cluster behaves like a single cohesive accelerator?
This orchestration gap is exactly what NVIDIA Dynamo is designed to solve.
What Is NVIDIA Dynamo?
NVIDIA Dynamo is a distributed orchestration layer that enhances LLM inference by intelligently coordinating multi-GPU and multi-node workloads. It is inference-engine-agnostic and plugs into frameworks such as TensorRT-LLM, vLLM, and SGLang.
Dynamo introduces several LLM-specific capabilities that dramatically improve system performance:
Key Capabilities
- Disaggregated prefill & decode inference: Maximizes GPU utilization and enables fine-grained latency/throughput trade-offs (a launch sketch follows this overview).
- Dynamic GPU scheduling: Adapts resource allocation to real-time workload demand.
- LLM-aware request routing: Eliminates redundant KV-cache recomputation for faster inference.
- Accelerated data transfer (NIXL): Reduces inter-GPU communication overhead and improves response times.
- KV-cache offloading: Leverages multi-tier memory hierarchies (HBM, DRAM, SSD) for higher throughput at lower cost.
Altogether, Dynamo provides the distributed intelligence required to make large-scale LLM inference behave as though all hardware resources were a single unified accelerator.
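For a concrete taste of the disaggregated pattern, the vLLM backend ships launch scripts for it alongside the aggregated one used later in this guide. A minimal sketch, assuming a launch/disagg.sh script is present in your release (run it inside the container set up in the next section):

```bash
# Hedged sketch: launch prefill and decode as separate worker processes.
# Assumes components/backends/vllm/launch/disagg.sh exists in this
# release; check the launch/ directory if the script name differs.
cd components/backends/vllm
bash launch/disagg.sh
```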
Installation & Setup Guide
1. Clone the Dynamo Repository
```bash
git clone --branch v0.4.1 --depth 1 https://github.com/ai-dynamo/dynamo.git
cd dynamo
```
2. Start Supporting Services and Build the Docker Image

Bring up Dynamo's supporting services (etcd and NATS) with Docker Compose, then build the vLLM container image:

```bash
docker compose -f deploy/docker-compose.yml up -d
./container/build.sh --framework VLLM
```
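Before moving on, it's worth confirming that the supporting services came up. A quick check, assuming the compose file defines the etcd and NATS services Dynamo's runtime depends on:

```bash
# List the services started by the compose file; etcd and NATS should
# show as running before you launch any Dynamo components.
docker compose -f deploy/docker-compose.yml ps
```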
3. Create and Run the Container
```bash
./container/run.sh -it --framework VLLM [--mount-workspace]
```
Or attach to an existing one:
```bash
docker exec -it <container_name> bash
```
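Inside the container, a quick sanity check that the GPUs are visible before launching anything:

```bash
# Confirm the GPUs and driver are visible from inside the container.
nvidia-smi
```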
Running Dynamo on a Single Node
Inside the container, launch Dynamo with a specified model:
```bash
python -m dynamo.vllm --model <path_to_model>
```
If HBM capacity is limited, cap the maximum context length so the model and its KV-cache fit in memory:

```bash
--max-model-len <size>
```
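Putting the two together, a hedged example invocation. The model ID and the 8192-token cap below are placeholders, not values from the original guide; substitute a model you have access to and a context length that fits your HBM:

```bash
# Placeholder values: swap in your own model path/ID and context cap.
python -m dynamo.vllm \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 8192
```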
Then start the aggregated backend services:

```bash
cd components/backends/vllm
bash launch/agg.sh
```
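With the frontend and worker up, you can exercise the deployment over HTTP. A sketch, assuming the frontend exposes an OpenAI-compatible API on port 8000 (a common default; check the launch script's output if yours differs) and that the model name matches what you passed at startup:

```bash
# Port, path, and model name are assumptions; adjust to your deployment.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```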
Running Dynamo with LMCache Integration
To enable LMCache and set the CPU offload budget (LMCACHE_MAX_LOCAL_CPU_SIZE is given in GB):

```bash
LMCACHE_MAX_LOCAL_CPU_SIZE=500 \
python -m dynamo.vllm --model <path_to_model>
```
Launch the LMCache-enabled backend:

```bash
cd components/backends/vllm
bash launch/agg_lmcache.sh
```
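A rough way to see the cache working is to send the same prompt twice and compare total request time; if the prefix is served from LMCache, the second call should come back noticeably faster. The endpoint and model name are the same assumptions as in the earlier request sketch:

```bash
# Identical requests back to back; a faster second response suggests
# the prompt's KV-cache was reused rather than recomputed.
PAYLOAD='{"model": "Qwen/Qwen2.5-7B-Instruct",
  "messages": [{"role": "user", "content": "Summarize the history of GPUs."}],
  "max_tokens": 64}'
for i in 1 2; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" -d "$PAYLOAD"
done
```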

Top comments (2)
Thanks a lot!
I'm trying to run Dynamo in a container, and I get an error when running components/backends/vllm/launch/agg.sh:

No module named 'nvshmem'
Regarding the No module named 'nvshmem' error: there is a solution described here: github.com/ai-dynamo/dynamo/issues...
It's worth going through the instructions there; they should solve the problem.