Discussion on: Optimizing LLM Serving: The Engineering Truth of vLLM & NVLink

Great deep-dive! KV cache management and the tensor parallelism trade-offs are real pain points for production LLM serving. NVLink bandwidth, rather than compute, becoming the bottleneck is a fascinating inflection point.

One follow-up question: how are you handling vector embedding storage and retrieval? For inference-time RAG, the database layer often becomes a hidden bottleneck. We are working on moteDB, a Rust-native embedded multimodal DB that co-locates vector search with time-series and structured data to minimize inference pipeline latency on edge deployments. Would love to hear how others tackle the storage side of LLM serving.