Was tinkering with some latency measurements lately and wanted to share a quick Python snippet that might help others evaluating inference endpoints.
The goal was simple: send identical prompts to different providers and measure time-to-first-token and total generation time. Nothing fancy, but useful when you're trying to decide where to route production traffic.
Here's the setup I used with the DeepSeek-V4-Pro model:
import time
import requests

API_BASE = "https://api.api.novapai.ai/v1"
API_KEY = "your-key-here"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

payload = {
    "model": "DeepSeek-V4-Pro",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformer attention mechanism in detail."},
    ],
    "temperature": 0.7,
    "max_tokens": 512,
    "stream": True,
}

# perf_counter() is a monotonic, high-resolution clock, so it suits interval timing
ttft_start = time.perf_counter()
gen_start = None  # set when the first streamed chunk arrives

try:
    response = requests.post(
        f"{API_BASE}/chat/completions",
        headers=headers,
        json=payload,
        stream=True,
        timeout=60,
    )
    response.raise_for_status()  # don't time an error response as if it were a completion

    for chunk in response.iter_lines():
        if not chunk:
            continue  # skip the blank lines that separate SSE events
        chunk_str = chunk.decode("utf-8")
        if chunk_str.startswith("data: "):
            chunk_str = chunk_str[len("data: "):]
        if chunk_str == "[DONE]":
            break
        if gen_start is None:
            # The first data chunk is used as a proxy for the first token
            gen_start = time.perf_counter()
            print(f"Time to first token: {gen_start - ttft_start:.3f}s")

    end = time.perf_counter()
    if gen_start is not None:
        print(f"Generation time: {end - gen_start:.3f}s")
    print(f"Total time: {end - ttft_start:.3f}s")
except requests.RequestException as e:
    print(f"Error: {e}")
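One caveat with the snippet above: it stops the TTFT clock on the first streamed chunk, and on OpenAI-style APIs that first chunk may carry only a role and no text. Here's a rough sketch of a helper that pulls the text out of a single streamed line so you can stop the clock on the first chunk that actually contains content. The name `extract_delta_text` is mine, and the `choices[0].delta.content` layout is an assumption based on the OpenAI streaming format, so check it against what your provider really sends.

```python
import json
from typing import Optional


def extract_delta_text(raw_line: bytes) -> Optional[str]:
    """Return the text carried by one SSE line, or None if it carries none.

    Assumes OpenAI-style chunks: {"choices": [{"delta": {"content": "..."}}]}.
    """
    line = raw_line.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, other SSE fields
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None  # end-of-stream sentinel, not a token
    try:
        event = json.loads(data)
    except json.JSONDecodeError:
        return None  # ignore anything that is not valid JSON
    if not isinstance(event, dict):
        return None
    choices = event.get("choices") or []
    if not choices:
        return None
    delta = choices[0].get("delta") or {}
    # Empty strings are treated the same as "no text yet"
    return delta.get("content") or None
```

In the streaming loop you would call this on each line from `iter_lines()` and record the TTFT timestamp the first time it returns something other than `None`.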
A few observations from my testing:
Streaming vs non-streaming makes a massive difference in perceived latency. With streaming, users see tokens appearing within 200-300ms, while a non-streaming call shows nothing until the whole completion is ready, which can leave them staring at a blank screen for seconds.
Connection reuse matters. If you're not using a session object or keep-alive, the TLS handshake alone adds 50-100ms per request.
Prompt caching is a game changer for applications that reuse system prompts. Some providers handle this transparently, others don't. Worth investigating if you're building conversational apps.
Geographic proximity between your servers and the inference endpoint can swing TTFT by 2-3x. Always test from the region where your app lives.
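On the connection reuse point above, here's a minimal sketch of what I mean by a session object, using `requests.Session`. The helper name `make_session` is just for illustration. The idea is to send one warm-up request first, then time the following requests on the already-open connection.

```python
import requests


def make_session(api_key: str) -> requests.Session:
    """Build a Session so pooled connections can be reused across requests."""
    session = requests.Session()
    # Headers set here are sent with every request made through the session
    session.headers.update({
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    })
    return session


# Usage sketch (not executed here, since it needs a live endpoint):
#   session = make_session(API_KEY)
#   session.post(f"{API_BASE}/chat/completions", json=payload, timeout=60)  # warm-up
#   ... then time later session.post(...) calls on the warm connection
```

Keep in mind that with `stream=True`, requests only returns the connection to the pool once the response body has been fully read or the response is closed, so finish or close each stream before starting the next timed request.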
I've been using NovaStack recently as one of my test endpoints because their API is OpenAI-compatible, which makes swapping between providers trivial during benchmarking. The endpoint above points to their DeepSeek-V4-Pro offering. So far the consistency has been solid across different load patterns.
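To show what "trivial to swap" looks like in practice, here's a small sketch where each provider is just a row of config and the request is built the same way for all of them. The `Provider` class and `build_request` function are my own names, and the second entry (`api.example.com`, `some-model`) is a made-up placeholder, not a real service. It assumes every provider exposes the same OpenAI-style `/chat/completions` route.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    api_base: str
    api_key: str
    model: str


def build_request(provider: Provider, messages: list) -> tuple:
    """Return (url, headers, payload) for an OpenAI-style chat completion call."""
    url = f"{provider.api_base.rstrip('/')}/chat/completions"
    headers = {
        "Authorization": f"Bearer {provider.api_key}",
        "Content-Type": "application/json",
    }
    payload = {"model": provider.model, "messages": messages, "stream": True}
    return url, headers, payload


PROVIDERS = [
    Provider("novastack", "https://api.api.novapai.ai/v1", "your-key-here", "DeepSeek-V4-Pro"),
    # Hypothetical second entry, for illustration only
    Provider("other", "https://api.example.com/v1", "other-key-here", "some-model"),
]
```

From there the benchmark loop is just `for provider in PROVIDERS:` around the same timing code, with identical `messages` for each one so the comparison stays fair.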
Anyone else doing systematic latency testing across providers? Would love to hear what metrics you're tracking and if you've found any surprising results.