Most teams do not need to wait for SDK wrappers to get serious cost visibility.
You can ship useful LLM cost spike detection now with a direct ingest contract and a safe async sender.
This post shows a practical setup that gives you:
- endpoint-level cost attribution
- tenant/user concentration views
- prompt deploy regression detection
- budget and spend-alert workflows
without changing provider traffic paths.
What "No-SDK" actually means
It does not mean "manual forever".
It means:
- Keep provider calls as-is.
- Extract usage metadata from the provider response.
- Send a normalized telemetry payload asynchronously.
SDK wrappers later can reduce boilerplate, but they are not required for production value.
Architecture in 3 layers
Layer A: Provider call + usage extraction
Map provider-specific usage fields into a normalized model.
Layer B: Telemetry sender (safe path)
Send telemetry with a timeout and swallow failures so the user request path is never blocked.
Layer C: Root-cause workflow
Query by endpoint, user/tenant, and promptVersion to explain spikes.
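As a sketch of Layer A, here is a normalizer assuming the OpenAI Chat Completions response shape (`usage.prompt_tokens` / `usage.completion_tokens`); other providers use different field names (Anthropic, for example, reports `input_tokens` / `output_tokens`) and need their own mapping:

```typescript
// Minimal normalized usage model (subset of the full payload contract below).
type NormalizedUsage = {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
};

// OpenAI-style usage block from a Chat Completions response.
type OpenAIUsage = { prompt_tokens: number; completion_tokens: number };

function normalizeOpenAIUsage(model: string, usage: OpenAIUsage): NormalizedUsage {
  return {
    provider: 'openai',
    model,
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens
  };
}
```

One small adapter per provider keeps the rest of the pipeline provider-agnostic.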
Minimal payload contract
{
"externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
"provider": "openai",
"model": "gpt-4o-mini",
"endpointTag": "chat_summary",
"promptVersion": "summary_v3",
"userId": "tenant_acme_hash",
"inputTokens": 1420,
"outputTokens": 518,
"latencyMs": 892,
"status": "success",
"dataMode": "real",
"environment": "prod"
}
Required for reliable diagnosis:
- externalRequestId (stable on retries)
- provider, model, endpointTag, promptVersion
- token counts + latency + status
Recommended:
- userId (hash if needed)
- dataMode and environment
Safe sender pattern (TypeScript)
type TelemetryPayload = {
externalRequestId: string;
provider: string;
model: string;
endpointTag: string;
promptVersion: string;
userId?: string;
inputTokens: number;
outputTokens: number;
latencyMs: number;
status: 'success' | 'error';
dataMode: 'real' | 'test' | 'demo';
environment: 'prod' | 'staging' | 'dev';
};
async function sendTelemetrySafe(payload: TelemetryPayload): Promise<void> {
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 700);
try {
const res = await fetch('https://api.opsmeter.io/v1/ingest', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'X-Api-Key': process.env.OPSMETER_API_KEY ?? ''
},
body: JSON.stringify(payload),
signal: controller.signal
});
// Plan limit reached: telemetry pauses, app traffic should continue.
if (res.status === 402) {
// Mark local telemetry as paused for a short window.
// Do not fail user request path.
return;
}
if (res.status === 429) {
// Respect Retry-After if present.
// Optional: backoff queue here.
return;
}
// Swallow other non-2xx responses on user path.
} catch {
// Swallow: telemetry must never break production requests.
} finally {
clearTimeout(timeout);
}
}
Call it asynchronously after provider response handling.
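Concretely, "asynchronously" means fire-and-forget: build the payload once you have the provider response, kick off the sender without awaiting it, and return to the user. A sketch (the stubbed `sendTelemetrySafe` stands in for the sender above so the snippet runs alone):

```typescript
// Stub of the safe sender above, so this snippet is self-contained.
async function sendTelemetrySafe(_payload: unknown): Promise<void> { /* ... */ }

function handleProviderResponse(payload: { externalRequestId: string }): string {
  // Fire-and-forget: do not await telemetry on the user-facing path.
  // Failures are already swallowed inside the sender.
  void sendTelemetrySafe(payload);
  // Return to the user immediately.
  return payload.externalRequestId;
}
```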
Keep idempotency stable on retries
For the same logical LLM request:
- generate one externalRequestId
- reuse it on retry attempts
If you generate a new ID on each retry, you create fake volume and break root-cause analysis.
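A simple way to enforce this is to generate the ID once, outside the retry loop, and thread the same value through every attempt. A sketch (the wrapper name and retry policy are illustrative):

```typescript
import { randomUUID } from 'node:crypto';

// Generate the ID once per logical request, before the first attempt,
// and pass the same value into every retry.
function withStableRequestId<T>(
  attempt: (externalRequestId: string) => Promise<T>,
  maxAttempts = 3
): Promise<T> {
  const externalRequestId = `req_${randomUUID()}`;
  const run = async (n: number): Promise<T> => {
    try {
      return await attempt(externalRequestId); // same ID on every retry
    } catch (err) {
      if (n + 1 >= maxAttempts) throw err;
      return run(n + 1);
    }
  };
  return run(0);
}
```

The ingest side can then deduplicate on externalRequestId, so retries never inflate volume or cost.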
15-minute spike workflow
0-5 min
- classify as volume spike vs token spike
- check if deploy happened in same window
5-10 min
- rank spend by endpoint
- rank spend by tenant/user
- compare promptVersion cost/request deltas
10-15 min
- cap retries/backoff
- apply temporary token/model constraints
- isolate suspicious traffic
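The 5-10 min ranking step is a plain group-by over ingested events. A sketch with an illustrative per-token price table (the numbers are placeholders; use your provider's actual price sheet):

```typescript
type CostEvent = { endpointTag: string; inputTokens: number; outputTokens: number };

// Illustrative prices in USD per 1M tokens -- substitute real provider pricing.
const PRICE = { inputPerM: 0.15, outputPerM: 0.6 };

function rankSpendByEndpoint(events: CostEvent[]): Array<[string, number]> {
  const spend = new Map<string, number>();
  for (const e of events) {
    const cost =
      (e.inputTokens / 1_000_000) * PRICE.inputPerM +
      (e.outputTokens / 1_000_000) * PRICE.outputPerM;
    spend.set(e.endpointTag, (spend.get(e.endpointTag) ?? 0) + cost);
  }
  // Highest spend first: the top entry is your first root-cause candidate.
  return [...spend.entries()].sort((a, b) => b[1] - a[1]);
}
```

The same group-by works for tenant/user and promptVersion; only the key changes.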
Threshold template that avoids noise
Start simple:
- warning: 80% budget
- exceeded: 100% budget
- burn-rate: >2.5x trailing baseline
- endpoint concentration: >40% spend from one endpoint
Add one owner per threshold class.
No owner = no response.
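The template above maps directly onto a small evaluator. A sketch (function and type names are illustrative; the thresholds are the ones listed):

```typescript
type BudgetStatus = 'ok' | 'warning' | 'exceeded' | 'burn_rate' | 'concentration';

// Thresholds from the template above; tune per environment.
function classifySpend(opts: {
  spend: number;
  budget: number;
  trailingBaseline: number; // e.g. spend in the same window last week
  topEndpointShare: number; // 0..1 share of spend from the biggest endpoint
}): BudgetStatus[] {
  const alerts: BudgetStatus[] = [];
  if (opts.spend >= opts.budget) alerts.push('exceeded');
  else if (opts.spend >= 0.8 * opts.budget) alerts.push('warning');
  if (opts.trailingBaseline > 0 && opts.spend > 2.5 * opts.trailingBaseline) {
    alerts.push('burn_rate');
  }
  if (opts.topEndpointShare > 0.4) alerts.push('concentration');
  return alerts.length ? alerts : ['ok'];
}
```

Route each alert class to its named owner; an unrouted alert is the same as no alert.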
Mistakes to avoid
- sync telemetry on user path
- mixed test/demo/real traffic in same view
- inconsistent endpointTag taxonomy
- missing promptVersion on deploy
- ignoring Retry-After on 429
Why this wins before SDK wrappers
You get high-value controls quickly:
- detect spikes early
- explain cause, not just totals
- ship budget guardrails now
SDKs later improve ergonomics. They are not a blocker for cost governance.
If you want to copy this setup
Use this order:
- implement payload contract
- ship safe async sender
- instrument 2-3 critical endpoints first
- set budget and concentration thresholds
- run one incident drill
That is enough to stop most bill-shock surprises.
If you want a simple way to implement this
I’m building Opsmeter, a telemetry-first tool that attributes LLM spend by endpointTag and promptVersion (and optionally user/customer), with budgets and alerts.
Docs: https://opsmeter.io/docs
Pricing: https://opsmeter.io/pricing
Compare (why totals aren’t enough): https://opsmeter.io/compare
If you’re shipping LLM features in production, I’d love to hear how you handle cost regressions today — and what would make this a must-have.