Great deep-dive! The KV cache management and tensor parallelism trade-offs are real pain points for production LLM serving. NVLink bandwidth becoming the bottleneck over compute is a fascinating inflection point.
One follow-up question: how are you handling the vector embedding storage/retrieval side? For inference-time RAG, the database layer often becomes a hidden bottleneck. We are working on moteDB, a Rust-native embedded multimodal DB that co-locates vector search with time-series and structured data to minimize inference pipeline latency on edge deployments. Would love to hear how others tackle the storage side of LLM serving.