Yocheved k
Deploying NVIDIA Dynamo & LMCache for LLMs: Installation, Containers, and Integration

As large language models continue to scale, they routinely exceed the memory and compute limits of any single GPU. Tensor parallelism addresses the capacity problem by sharding model weights across multiple GPUs, and often across multiple servers, but it introduces a new challenge: how do we synchronize shards, route requests, and share KV-cache efficiently enough for the cluster to behave like a single cohesive accelerator?
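For reference, inference engines such as vLLM expose tensor parallelism through a single flag. A minimal sketch (the model path is a placeholder, and 4 GPUs on one node are assumed):

# Serve one model sharded across 4 GPUs via tensor parallelism
vllm serve <path_to_model> --tensor-parallel-size 4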

This orchestration gap is exactly what NVIDIA Dynamo is designed to solve.


What Is NVIDIA Dynamo?

NVIDIA Dynamo is a distributed orchestration layer that enhances LLM inference by intelligently coordinating multi-GPU and multi-node workloads. It is inference-engine-agnostic and plugs seamlessly into frameworks such as TRT-LLM, vLLM, SGLang, and others.

Dynamo introduces several LLM-specific capabilities that dramatically improve system performance:

Key Capabilities

  • Disaggregated prefill & decode inference: maximizes GPU utilization and enables fine-grained latency/throughput trade-offs.
  • Dynamic GPU scheduling: adapts resource allocation based on real-time workload demand.
  • LLM-aware request routing: eliminates redundant KV-cache recomputation for faster inference (see the routing sketch after this list).
  • Accelerated data transfer (NIXL): reduces inter-GPU communication overhead and improves response times.
  • KV-cache offloading: leverages multi-tier memory hierarchies (HBM, DRAM, SSD) for higher throughput at lower cost.

Altogether, Dynamo provides the distributed intelligence required to make large-scale LLM inference behave as though all hardware resources were a single unified accelerator.
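To make the routing capability concrete: per the project's documentation, the Dynamo frontend can be started in KV-cache-aware routing mode, so requests land on the worker that already holds the most matching KV blocks. This is a sketch; the exact flag may vary between releases:

# Launch the Dynamo frontend with KV-aware request routing enabled
python -m dynamo.frontend --router-mode kv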


Installation & Setup Guide

1. Clone the Dynamo Repository

git clone --branch v0.4.1 --depth 1 https://github.com/ai-dynamo/dynamo.git
cd dynamo

2. Start Supporting Services and Build the Docker Image

First, bring up the supporting services that Dynamo relies on for coordination (the compose file in the repo starts etcd and NATS):

docker compose -f deploy/docker-compose.yml up -d

Then build the framework-specific container image:

./container/build.sh --framework VLLM
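To confirm the supporting services are up before building, list their status:

docker compose -f deploy/docker-compose.yml ps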

3. Create and Run the Container

./container/run.sh -it --framework VLLM [--mount-workspace]

Or attach to an existing one:

docker exec -it <container_name> bash
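Once inside the container, it's worth verifying that the GPUs are visible before launching anything:

nvidia-smi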


Running Dynamo on a Single Node

Inside the container, launch Dynamo with a specified model:

python -m dynamo.vllm --model <path_to_model>

If HBM capacity is limited, cap the maximum sequence length (and therefore the KV-cache footprint) with:

--max-model-len <size>
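For example, to cap sequences at 8,192 tokens (the value is illustrative; choose one that fits your HBM budget):

python -m dynamo.vllm --model <path_to_model> --max-model-len 8192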

Then start the backend services (agg.sh launches aggregated serving, with prefill and decode colocated on the same worker):

cd components/backends/vllm
bash launch/agg.sh
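With the backend up, you can sanity-check the deployment through the OpenAI-compatible HTTP API. This assumes the frontend listens on its default port 8000; adjust the port and model name to your setup:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<path_to_model>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'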


Running Dynamo with LMCache Integration

To enable LMCache and set the CPU offload buffer size (LMCache interprets this value in GB):

LMCACHE_MAX_LOCAL_CPU_SIZE=500 \
python -m dynamo.vllm --model <path_to_model>

Launch the LMCache-enabled backend:

cd components/backends/vllm
bash launch/agg_lmcache.sh
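LMCache exposes further knobs through environment variables. A hedged sketch, assuming the variable names from LMCache's documentation (verify against the version bundled in your container):

# LMCACHE_MAX_LOCAL_CPU_SIZE: CPU offload buffer size, interpreted in GB
# LMCACHE_CHUNK_SIZE: number of tokens per KV chunk stored in the cache
LMCACHE_MAX_LOCAL_CPU_SIZE=500 \
LMCACHE_CHUNK_SIZE=256 \
python -m dynamo.vllm --model <path_to_model>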

Top comments (2)

Chaya Z.

Thanks a lot! I'm trying to run Dynamo in a container, and I get an error when running components/backends/vllm/launch/agg.sh. This is the error:

No module named 'nvshmem'

Yocheved k • Edited

Regarding the error No module named 'nvshmem' - there is a solution described here: github.com/ai-dynamo/dynamo/issues...

It's worth going through the instructions there; they should solve the problem.