DEV Community

Opsmeter

No-SDK LLM Cost Spike Detection in Production (Endpoint + User + PromptVersion)

Most teams do not need to wait for SDK wrappers to get serious cost visibility.

You can ship useful LLM cost spike detection now with a direct ingest contract and a safe async sender.

This post shows a practical setup that gives you:

  • endpoint-level cost attribution
  • tenant/user concentration views
  • prompt deploy regression detection
  • budget and spend-alert workflows

without changing provider traffic paths.


What "No-SDK" actually means

It does not mean "manual forever".

It means:

  1. Keep provider calls as-is.
  2. Extract usage metadata from provider response.
  3. Send a normalized telemetry payload asynchronously.

SDK wrappers can reduce boilerplate later, but they are not required for production value.


Architecture in 3 layers

Layer A: Provider call + usage extraction

Map provider-specific usage fields into a normalized model.
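Layer A can be sketched as a small normalizer. The OpenAI field names (`usage.prompt_tokens`, `usage.completion_tokens`) and Anthropic field names (`usage.input_tokens`, `usage.output_tokens`) come from those providers' documented response shapes; any other provider needs its own branch.

```typescript
// Sketch: normalize provider-specific usage fields into one model.
type NormalizedUsage = {
  provider: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
};

function normalizeUsage(provider: string, response: any): NormalizedUsage {
  if (provider === 'openai') {
    // OpenAI chat completions: usage.prompt_tokens / usage.completion_tokens
    return {
      provider,
      model: response.model,
      inputTokens: response.usage?.prompt_tokens ?? 0,
      outputTokens: response.usage?.completion_tokens ?? 0,
    };
  }
  if (provider === 'anthropic') {
    // Anthropic messages: usage.input_tokens / usage.output_tokens
    return {
      provider,
      model: response.model,
      inputTokens: response.usage?.input_tokens ?? 0,
      outputTokens: response.usage?.output_tokens ?? 0,
    };
  }
  throw new Error(`unknown provider: ${provider}`);
}
```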

Layer B: Telemetry sender (safe path)

Send telemetry with a timeout and swallowed errors so the user request path is never blocked.

Layer C: Root-cause workflow

Query by endpoint, user/tenant, and promptVersion to explain spikes.
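In your telemetry store this is a GROUP BY, but the shape of the root-cause query can be sketched as an in-memory aggregation. The `SpendRecord` type and `costUsd` field are illustrative; the keys mirror the payload contract below.

```typescript
// Sketch: rank spend share by any attribute (endpointTag, userId, promptVersion).
type SpendRecord = {
  endpointTag: string;
  promptVersion: string;
  userId?: string;
  costUsd: number;
};

function spendShareBy(
  records: SpendRecord[],
  key: 'endpointTag' | 'promptVersion' | 'userId',
) {
  const total = records.reduce((sum, r) => sum + r.costUsd, 0);
  const byKey = new Map<string, number>();
  for (const r of records) {
    const k = r[key] ?? 'unknown';
    byKey.set(k, (byKey.get(k) ?? 0) + r.costUsd);
  }
  // Sort descending by cost so the biggest contributor is first.
  return [...byKey.entries()]
    .map(([k, cost]) => ({ key: k, cost, share: total > 0 ? cost / total : 0 }))
    .sort((a, b) => b.cost - a.cost);
}
```

Running the same function over `promptVersion` before and after a deploy gives you the cost/request delta directly.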


Minimal payload contract

```json
{
  "externalRequestId": "req_01HZXB6MQZ2WQ9D2KCF9M4V2QY",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "endpointTag": "chat_summary",
  "promptVersion": "summary_v3",
  "userId": "tenant_acme_hash",
  "inputTokens": 1420,
  "outputTokens": 518,
  "latencyMs": 892,
  "status": "success",
  "dataMode": "real",
  "environment": "prod"
}
```

Required for reliable diagnosis:

  • externalRequestId (stable on retries)
  • provider, model, endpointTag, promptVersion
  • token counts + latency + status

Recommended:

  • userId (hash if needed)
  • dataMode and environment

Safe sender pattern (TypeScript)

```typescript
type TelemetryPayload = {
  externalRequestId: string;
  provider: string;
  model: string;
  endpointTag: string;
  promptVersion: string;
  userId?: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  status: 'success' | 'error';
  dataMode: 'real' | 'test' | 'demo';
  environment: 'prod' | 'staging' | 'dev';
};

async function sendTelemetrySafe(payload: TelemetryPayload): Promise<void> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 700);

  try {
    const res = await fetch('https://api.opsmeter.io/v1/ingest', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Api-Key': process.env.OPSMETER_API_KEY ?? ''
      },
      body: JSON.stringify(payload),
      signal: controller.signal
    });

    // Plan limit reached: telemetry pauses, app traffic should continue.
    if (res.status === 402) {
      // Mark local telemetry as paused for a short window.
      // Do not fail user request path.
      return;
    }

    if (res.status === 429) {
      // Respect Retry-After if present.
      // Optional: backoff queue here.
      return;
    }

    // Swallow other non-2xx responses on user path.
  } catch {
    // Swallow: telemetry must never break production requests.
  } finally {
    clearTimeout(timeout);
  }
}
```

Call it asynchronously after provider response handling.
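One way to keep it off the request path is a small wrapper that times the provider call and fires the sender without awaiting it. This is a sketch; the generic `send` parameter stands in for `sendTelemetrySafe`, and `buildPayload` is whatever function assembles your payload contract.

```typescript
// Sketch: run the provider call, then send telemetry fire-and-forget.
async function withTelemetry<T, P>(
  work: () => Promise<T>,
  buildPayload: (result: T, latencyMs: number) => P,
  send: (payload: P) => Promise<void>,
): Promise<T> {
  const started = Date.now();
  const result = await work();

  // Not awaited: a slow or failing sender cannot delay the response.
  void send(buildPayload(result, Date.now() - started)).catch(() => {});

  return result;
}
```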


Keep idempotency stable on retries

For the same logical LLM request:

  • generate one externalRequestId
  • reuse it on retry attempts

If you generate a new ID on each retry, you create fake volume and break root-cause analysis.
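A minimal retry wrapper makes the rule concrete: the ID is generated once, outside the retry loop. `randomUUID` is Node's built-in `node:crypto` helper; your ID scheme (ULID, etc.) may differ.

```typescript
import { randomUUID } from 'node:crypto';

// Sketch: one externalRequestId per logical request, reused on every retry.
async function callWithRetries<T>(
  attempt: (externalRequestId: string) => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  const externalRequestId = `req_${randomUUID()}`; // generated once
  let lastError: unknown;
  for (let i = 0; i < maxAttempts; i++) {
    try {
      return await attempt(externalRequestId); // same ID every attempt
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}
```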


15-minute spike workflow

0-5 min

  • classify as volume spike vs token spike
  • check if deploy happened in same window

5-10 min

  • rank spend by endpoint
  • rank spend by tenant/user
  • compare promptVersion cost/request deltas

10-15 min

  • cap retries/backoff
  • apply temporary token/model constraints
  • isolate suspicious traffic

Threshold template that avoids noise

Start simple:

  • warning: spend at 80% of budget
  • exceeded: spend at 100% of budget
  • burn-rate: >2.5x trailing baseline
  • endpoint concentration: >40% of spend from one endpoint

Add one owner per threshold class.

No owner = no response.
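The template above can be evaluated per window with a few comparisons. The numbers (0.8, 1.0, 2.5, 0.4) mirror the starting values in this post; the `WindowStats` shape is illustrative.

```typescript
// Sketch: evaluate the starting threshold template for one time window.
type WindowStats = {
  spendUsd: number;
  budgetUsd: number;
  trailingBaselineUsd: number; // spend in a same-length trailing window
  topEndpointShare: number;    // 0..1, largest endpoint's share of spend
};

function evaluateThresholds(w: WindowStats): string[] {
  const alerts: string[] = [];
  if (w.spendUsd >= w.budgetUsd) alerts.push('budget_exceeded');
  else if (w.spendUsd >= 0.8 * w.budgetUsd) alerts.push('budget_warning');
  if (w.trailingBaselineUsd > 0 && w.spendUsd > 2.5 * w.trailingBaselineUsd) {
    alerts.push('burn_rate');
  }
  if (w.topEndpointShare > 0.4) alerts.push('endpoint_concentration');
  return alerts;
}
```

Each alert class then routes to its named owner.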


Mistakes to avoid

  • sync telemetry on user path
  • mixed test/demo/real traffic in same view
  • inconsistent endpointTag taxonomy
  • missing promptVersion on deploy
  • ignoring Retry-After on 429

Why this wins before SDK wrappers

You get high-value controls quickly:

  • detect spikes early
  • explain cause, not just totals
  • ship budget guardrails now

SDKs later improve ergonomics. They are not a blocker for cost governance.


If you want to copy this setup

Use this order:

  1. implement payload contract
  2. ship safe async sender
  3. instrument 2-3 critical endpoints first
  4. set budget and concentration thresholds
  5. run one incident drill

That is enough to stop most bill-shock surprises.

If you want a simple way to implement this

I’m building Opsmeter, a telemetry-first tool that attributes LLM spend by endpointTag and promptVersion (and optionally user/customer), with budgets and alerts.

Docs: https://opsmeter.io/docs
Pricing: https://opsmeter.io/pricing
Compare (why totals aren’t enough): https://opsmeter.io/compare
If you’re shipping LLM features in production, I’d love to hear how you handle cost regressions today — and what would make this a must-have.
