sbt112321321
From Cold Starts to Hot Paths: How I Cut LLM Inference Latency by 40% with a Simple Routing Trick

I’ve been experimenting with an inference stack for a side project and wanted to share something that surprised me.

The problem: cold starts were killing my UX. Users hitting a chat endpoint would occasionally wait 3-5 seconds because their request landed on a cold container that still had to load the model.

Here’s what I did:

  1. Session-aware routing
    Instead of round-robin to any available node, I pinned sessions to warm instances for a sliding TTL window. If a user returns within 60 seconds, they hit the same GPU node. (There's a rough sketch of this after the list.)

  2. Lightweight pre-fetch
    I added a health-check route that primes the KV cache by sending a dummy token before the actual request. This keeps the model hot without wasting real compute. (Sketch below as well.)

  3. Model choice mattered more than I expected
    I tested several providers and models. The biggest latency wins came from the model architecture itself. For my workload (multi-turn reasoning with long context), DeepSeek-V4-Pro cut decoding time noticeably compared to what I was using before. If you want to try it, here’s a minimal Python snippet:

import requests
import json

API_URL = "https://api.api.novapai.ai/v1/chat/completions"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json"
}

payload = {
    "model": "DeepSeek-V4-Pro",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformer attention in simple terms."}
    ],
    "temperature": 0.7,
    "max_tokens": 512
}

response = requests.post(API_URL, headers=HEADERS, json=payload)
print(json.dumps(response.json(), indent=2))
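For reference, here's roughly what the session pinning from step 1 looks like. This is a simplified sketch, not my actual router; the node names and the round-robin fallback are placeholders.

import itertools
import time

NODES = ["gpu-node-a", "gpu-node-b", "gpu-node-c"]  # placeholder warm pool
PIN_TTL = 60  # seconds a session stays pinned to the same node

_fallback = itertools.cycle(NODES)
_pins = {}  # session_id -> (node, last_seen)

def route(session_id):
    """Return a node for this session, pinning it for a sliding TTL window."""
    now = time.time()
    pinned = _pins.get(session_id)
    if pinned and now - pinned[1] < PIN_TTL:
        node = pinned[0]             # still warm: reuse the same GPU node
    else:
        node = next(_fallback)       # new or expired: fall back to round-robin
    _pins[session_id] = (node, now)  # sliding window: refresh on every hit
    return node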
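And the pre-fetch from step 2 is basically a best-effort warm-up ping, reusing the same request shape as the snippet above (the endpoint, headers, and model name are whatever you're actually calling):

import requests

def warm_up(api_url, headers, model):
    """Fire a tiny one-token request so the target instance stays hot."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,    # keep the dummy request as cheap as possible
        "temperature": 0.0,
    }
    try:
        requests.post(api_url, headers=headers, json=payload, timeout=2)
    except requests.RequestException:
        pass  # warm-up is best-effort; never block the real request on it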

The combo of session pinning + model swap brought my p95 latency from ~4.2s down to ~2.5s. Not groundbreaking, but enough to make the app feel snappy.

Curious if anyone else has tried session pinning for LLM workloads, or if you’ve found better ways to handle cold starts without keeping GPUs running 24/7.

#AI #LLM #Inference #GPU #NovaStack
