DEV Community

Opsmeter

No-SDK LLM Cost Spike Detection in Production (Endpoint + User + PromptVersion)

Most teams do not need to wait for SDK wrappers to get serious cost visibility.

You can ship useful LLM cost spike detection now with a direct ingest contract and a safe async sender.

This post shows a practical setup that gives you:

  • endpoint-level cost attribution
  • tenant/user concentration views
  • prompt deploy regression detection
  • budget and spend-alert workflows

without changing provider traffic paths.


What "No-SDK" actually means

It does not mean "manual forever".

It means:

  1. Keep provider calls as-is.
  2. Extract usage metadata from provider response.
  3. Send a normalized telemetry payload asynchronously.

SDK wrappers can reduce boilerplate later, but they are not required for production value.


Architecture in 3 layers

Layer A: Provider call + usage extraction

Map provider-specific usage fields into a normalized model.
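Layer A can be sketched as a small normalizer. The OpenAI field names (`usage.prompt_tokens`, `usage.completion_tokens`) and Anthropic field names (`usage.input_tokens`, `usage.output_tokens`) come from those providers' documented response shapes; any other provider needs its own branch.

```typescript
// Sketch: normalize provider-specific usage fields into one model.
type NormalizedUsage = {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
};

function normalizeUsage(provider: string, response: any): NormalizedUsage {
  if (provider === 'openai') {
    // OpenAI chat completions: usage.prompt_tokens / usage.completion_tokens
    return {
      provider,
      model: response.model,
      inputTokens: response.usage?.prompt_tokens ?? 0,
      outputTokens: response.usage?.completion_tokens ?? 0,
    };
  }
  if (provider === 'anthropic') {
    // Anthropic messages: usage.input_tokens / usage.output_tokens
    return {
      provider,
      model: response.model,
      inputTokens: response.usage?.input_tokens ?? 0,
      outputTokens: response.usage?.output_tokens ?? 0,
    };
  }
  throw new Error(`unknown provider: ${provider}`);
}
```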

Layer B: Telemetry sender (safe path)

Send telemetry with a timeout and swallowed errors so the user request path is never blocked.

Layer C: Root-cause workflow

Query by endpoint, user/tenant, and promptVersion to explain spikes.
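In your telemetry store this is a GROUP BY, but the shape of the root-cause query can be sketched as an in-memory aggregation. The `SpendRecord` type and `costUsd` field are illustrative; the keys mirror the payload contract below.

```typescript
// Sketch: rank spend share by any attribute (endpointTag, userId, promptVersion).
type SpendRecord = {
  endpointTag: string;
  promptVersion: string;
  userId?: string;
  costUsd: number;
};

function spendShareBy(
  records: SpendRecord[],
  key: 'endpointTag' | 'promptVersion' | 'userId',
) {
  const total = records.reduce((sum, r) => sum + r.costUsd, 0);
  const byKey = new Map<string, number>();
  for (const r of records) {
    const k = r[key] ?? 'unknown';
    byKey.set(k, (byKey.get(k) ?? 0) + r.costUsd);
  }
  // Sort descending by cost so the biggest contributor is first.
  return [...byKey.entries()]
    .map(([k, cost]) => ({ key: k, cost, share: total > 0 ? cost / total : 0 }))
    .sort((a, b) => b.cost - a.cost);
}
```

Running the same function over `promptVersion` before and after a deploy gives you the cost/request delta directly.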


Minimal payload contract

```json
{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "endpointTag": "chat_summary",
  "promptVersion": "summary_v3",
  "userId": "tenant_acme_hash",
  "inputTokens": 1420,
  "outputTokens": 518,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
```

Required for reliable diagnosis:

  • externalRequestId (stable on retries)
  • provider, model, endpointTag, promptVersion
  • token counts + latency + status

Recommended:

  • userId (hash if needed)
  • dataMode and environment

Safe sender pattern (TypeScript)

```typescript
type TelemetryPayload = {
  externalRequestId: string;
  provider: string;
  model: string;
  endpointTag: string;
  promptVersion: string;
  userId?: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  status: 'success' | 'error';
  dataMode: 'real' | 'test' | 'demo';
  environment: 'prod' | 'staging' | 'dev';
};

async function sendTelemetrySafe(payload: TelemetryPayload): Promise<void> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 700);

  try {
    const res = await fetch('https://api.opsmeter.io/v1/ingest', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Api-Key': process.env.OPSMETER_API_KEY ?? ''
      },
      body: JSON.stringify(payload),
      signal: controller.signal
    });

    // Plan limit reached: telemetry pauses, app traffic should continue.
    if (res.status === 402) {
      // Mark local telemetry as paused for a short window.
      // Do not fail user request path.
      return;
    }

    if (res.status === 429) {
      // Respect Retry-After if present.
      // Optional: backoff queue here.
      return;
    }

    // Swallow other non-2xx responses on user path.
  } catch {
    // Swallow: telemetry must never break production requests.
  } finally {
    clearTimeout(timeout);
  }
}
```

Call it asynchronously after provider response handling.
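One way to keep it off the request path is a small wrapper that times the provider call and fires the sender without awaiting it. This is a sketch; the generic `send` parameter stands in for `sendTelemetrySafe`, and `buildPayload` is whatever function assembles your payload contract.

```typescript
// Sketch: run the provider call, then send telemetry fire-and-forget.
async function withTelemetry<T, P>(
  work: () => Promise<T>,
  buildPayload: (result: T, latencyMs: number) => P,
  send: (payload: P) => Promise<void>,
): Promise<T> {
  const started = Date.now();
  const result = await work();

  // Not awaited: a slow or failing sender cannot delay the response.
  void send(buildPayload(result, Date.now() - started)).catch(() => {});

  return result;
}
```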


Keep idempotency stable on retries

For the same logical LLM request:

  • generate one externalRequestId
  • reuse it on retry attempts

If you generate a new ID on each retry, you create fake volume and break root-cause analysis.
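A minimal retry wrapper makes the rule concrete: the ID is generated once, outside the retry loop. `randomUUID` is Node's built-in `node:crypto` helper; your ID scheme (ULID, etc.) may differ.

```typescript
import { randomUUID } from 'node:crypto';

// Sketch: one externalRequestId per logical request, reused on every retry.
async function callWithRetries<T>(
  attempt: (externalRequestId: string) => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  const externalRequestId = `req_${randomUUID()}`; // generated once
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt(externalRequestId); // same ID every attempt
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```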


15-minute spike workflow

0-5 min

  • classify as volume spike vs token spike
  • check if deploy happened in same window

5-10 min

  • rank spend by endpoint
  • rank spend by tenant/user
  • compare promptVersion cost/request deltas

10-15 min

  • cap retries/backoff
  • apply temporary token/model constraints
  • isolate suspicious traffic

Threshold template that avoids noise

Start simple:

  • warning: spend at 80% of budget
  • exceeded: spend at 100% of budget
  • burn-rate: >2.5x trailing baseline
  • endpoint concentration: >40% of spend from one endpoint

Add one owner per threshold class.

No owner = no response.
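The template above can be evaluated per window with a few comparisons. The numbers (0.8, 1.0, 2.5, 0.4) mirror the starting values in this post; the `WindowStats` shape is illustrative.

```typescript
// Sketch: evaluate the starting threshold template for one time window.
type WindowStats = {
  spendUsd: number;
  budgetUsd: number;
  trailingBaselineUsd: number; // spend in a same-length trailing window
  topEndpointShare: number;    // 0..1, largest endpoint's share of spend
};

function evaluateThresholds(w: WindowStats): string[] {
  const alerts: string[] = [];
  if (w.spendUsd >= w.budgetUsd) alerts.push('budget_exceeded');
  else if (w.spendUsd >= 0.8 * w.budgetUsd) alerts.push('budget_warning');
  if (w.trailingBaselineUsd > 0 && w.spendUsd > 2.5 * w.trailingBaselineUsd) {
    alerts.push('burn_rate');
  }
  if (w.topEndpointShare > 0.4) alerts.push('endpoint_concentration');
  return alerts;
}
```

Each alert class then routes to its named owner.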


Mistakes to avoid

  • sync telemetry on user path
  • mixed test/demo/real traffic in same view
  • inconsistent endpointTag taxonomy
  • missing promptVersion on deploy
  • ignoring Retry-After on 429

Why this wins before SDK wrappers

You get high-value controls quickly:

  • detect spikes early
  • explain cause, not just totals
  • ship budget guardrails now

SDKs later improve ergonomics. They are not a blocker for cost governance.


If you want to copy this setup

Use this order:

  1. implement payload contract
  2. ship safe async sender
  3. instrument 2-3 critical endpoints first
  4. set budget and concentration thresholds
  5. run one incident drill

That is enough to stop most bill-shock surprises.

If you want a simple way to implement this

I’m building Opsmeter, a telemetry-first tool that attributes LLM spend by endpointTag and promptVersion (and optionally user/customer), with budgets and alerts.

Docs: https://opsmeter.io/docs
Pricing: https://opsmeter.io/pricing
Compare (why totals aren’t enough): https://opsmeter.io/compare
If you’re shipping LLM features in production, I’d love to hear how you handle cost regressions today — and what would make this a must-have.
