Anthropic API in production: 5 things the docs don't tell you

I've been running the Anthropic API in production across a few small projects for the past couple of months. The docs are good, but they don't cover the things that actually bite you. Here are 5 things I wish I'd known on day one.

A free 1-page cheatsheet with these is at the bottom if you just want a take-home.

1. Prompt caching has a write cost — and that cost can wipe out your savings

The Anthropic docs make caching sound free. It isn't. Cache writes cost about 1.25× normal input rate. Cache reads cost about 0.1× normal input rate. So:

cache_write = 1.25 × normal_input
cache_read  = 0.10 × normal_input
# extra write cost = 0.25×; each read saves 0.90×
breakeven   ≈ 1 reuse of the cached prefix within its TTL

The trap: an A/B experiment that randomizes system prompts. Suddenly each variant has its own cache, each variant is written far more often than it's read, and your bill goes up.

# Symptom in your usage block:
{
  "cache_creation_input_tokens": 50000,  # bad
  "cache_read_input_tokens": 5000        # bad ratio: 10:1
}

Fix: put the A/B variation in messages[], not in system. Keep system stable so every variant shares one cache entry.
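Here's a rough sketch of that shape with the Python SDK; the model alias, prompt text, and run_variant name are my placeholders. The cache breakpoint caches everything up to the marked system block, so messages[] stays outside the cached prefix:

import anthropic

client = anthropic.Anthropic()

STABLE_SYSTEM_PROMPT = "You are a support assistant. Follow the house style guide."

def run_variant(variant_instructions: str, user_input: str):
    return client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        # byte-identical across variants, so every variant reads one shared cache
        # (note: prefixes below the model's minimum cacheable length won't cache)
        system=[{
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        # the experiment lives here, outside the cached prefix
        messages=[{"role": "user", "content": f"{variant_instructions}\n\n{user_input}"}],
    )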

Rule of thumb: if your cache hit ratio is under 50%, caching is probably losing money.
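A cheap way to catch this in production is to compute the ratio from the usage block on each response. A minimal sketch, given a response from messages.create(); the 50% threshold and the print are stand-ins for your own alerting:

def cache_hit_ratio(usage) -> float:
    # usage is response.usage from a messages.create() call
    writes = usage.cache_creation_input_tokens or 0
    reads = usage.cache_read_input_tokens or 0
    total = writes + reads
    return reads / total if total else 0.0

ratio = cache_hit_ratio(response.usage)
if ratio < 0.5:
    print(f"warning: cache hit ratio {ratio:.0%}, caching may be losing money")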

2. 529 is normal — build for it

Anthropic returns 529 (overload) more often than people expect. At peak hours on Sonnet, I've seen 1-3% of requests get a 529 even at reasonable concurrency.

What newcomers do: treat 529 as a bug, fail the request, surface error to user.

What works: build a fallback chain:

Try Sonnet 4.5
  if 529 (3 retries with exp backoff):
    → fall back to Sonnet 4
      if 529 (2 retries):
        → fall back to Haiku
          if 529 (1 retry):
            → return cached canned response with disclaimer

Nobody likes the canned response, but the alternative is ~1% of requests breaking your UI at moments you don't choose. The fallback chain quietly absorbs the 529s and your p99 stays usable.
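Here's a minimal sketch of that chain with the Python SDK. The model aliases, backoff, and canned response are my placeholders, and I turn off the SDK's built-in retries so the chain owns the policy:

import time
import anthropic

client = anthropic.Anthropic(max_retries=0)  # the chain owns retry policy

CHAIN = [
    ("claude-sonnet-4-5", 3),
    ("claude-sonnet-4-0", 2),
    ("claude-haiku-4-5", 1),
]

CANNED = "We're under heavy load right now; here's a short answer in the meantime."

def call_with_fallback(**kwargs):
    for model, retries in CHAIN:
        for attempt in range(retries + 1):
            try:
                return client.messages.create(model=model, **kwargs)
            except anthropic.InternalServerError as e:
                if e.status_code != 529:
                    raise  # real 5xx bugs should still surface
                time.sleep(2 ** attempt)  # exponential backoff, then retry
    return CANNED  # every tier overloaded: degrade instead of breaking the UI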

3. There's no native idempotency key — build it yourself

The Anthropic API doesn't ship idempotency keys (as of mid-2026). Network timeout + your retry = the same prompt billed twice.

Minimum viable client-side idempotency:

import hashlib
import json

import anthropic
import redis

r = redis.Redis()
client = anthropic.Anthropic()

def call_anthropic(body: dict):
    # hash only the fields that determine the output
    key = "anth:" + hashlib.sha256(
        json.dumps(
            {k: body.get(k) for k in ("model", "system", "messages", "max_tokens")},
            sort_keys=True,
        ).encode()
    ).hexdigest()

    cached = r.get(key)
    if cached == b"pending":
        raise RuntimeError("same request already in flight")  # caller backs off and retries
    if cached:
        return json.loads(cached)  # note: replays return a dict, not a Message object

    r.setex(key, 60, b"pending")  # mark in-flight so a racing retry backs off
    resp = client.messages.create(**body)
    r.setex(key, 86400, resp.model_dump_json().encode())  # keep the result for 24h
    return resp
Enter fullscreen mode Exit fullscreen mode

Critical: the hash does NOT include metadata.user_id or stream (those vary per call but don't change the result). This is easy to mess up.

4. Streaming connections drop more than you expect — have a state machine

Not "once a week" drop. More like 0.1-0.5% of streams drop mid-response, especially through proxies / CDNs / corporate networks.

The failure mode that bit me hardest: client hangs forever waiting for message_stop that never comes, because the SSE stream silently closed.

Minimum streaming state machine:

import asyncio

async def stream_with_timeout(client, **kwargs):
    timeout_s = 30  # max silence between SSE events
    buffer = []

    async with client.messages.stream(**kwargs) as stream:
        events = aiter(stream)
        while True:
            try:
                # the timeout has to wrap the await itself: checking a timestamp
                # after each event can never fire while the stream sits silent
                event = await asyncio.wait_for(anext(events), timeout_s)
            except asyncio.TimeoutError:
                return "".join(buffer), "timeout"
            except StopAsyncIteration:
                break

            if event.type == "content_block_delta":
                if event.delta.type == "text_delta":
                    buffer.append(event.delta.text)
            elif event.type == "message_stop":
                return "".join(buffer), "complete"
            elif event.type == "error":
                return "".join(buffer), "error"

    return "".join(buffer), "unexpected_close"

Return the partial output even on failure. Don't silently swallow it.
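On the caller side, something like this (assuming an AsyncAnthropic client; render and render_partial are hypothetical UI helpers):

text, status = await stream_with_timeout(
    client,
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
if status == "complete":
    render(text)
else:
    render_partial(text, reason=status)  # show what arrived, flagged as truncated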

5. Sonnet is usually the wrong choice for classification at scale

This is the most expensive mistake I see teams make.

Classification = short output, easily checked correctness, repeated millions of times.

Math at 10M support tickets, assuming ~1,000 input tokens per ticket (~10B input tokens total):

  • Sonnet ($3 per million input tokens): roughly $30,000 for 10M tickets
  • Haiku: roughly $2,500 for 10M tickets at similar accuracy on well-defined labels

The move: 2-stage pipeline.

For each ticket:
  result = haiku.classify(ticket)
  if result.confidence < 0.85 OR result.label in HARD_LABELS:
    result = sonnet.classify(ticket)
  log(result)
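A minimal sketch of that pipeline, assuming a classify() helper that asks the model for a JSON label plus self-reported confidence; the prompt, threshold, and label names are all placeholders:

import json
import anthropic

client = anthropic.Anthropic()

HARD_LABELS = {"refund_dispute", "legal_threat"}  # labels the cheap model fumbles

def classify(model: str, ticket: str) -> dict:
    resp = client.messages.create(
        model=model,
        max_tokens=100,
        system='Classify the support ticket. Reply with JSON only: {"label": "<label>", "confidence": <0 to 1>}',
        messages=[{"role": "user", "content": ticket}],
    )
    return json.loads(resp.content[0].text)  # assumes the model complied with JSON-only

def classify_ticket(ticket: str) -> dict:
    result = classify("claude-haiku-4-5", ticket)  # cheap first pass
    if result["confidence"] < 0.85 or result["label"] in HARD_LABELS:
        result = classify("claude-sonnet-4-5", ticket)  # escalate the hard cases
    return result

One caveat: the confidence is self-reported by the model, so calibrate that 0.85 threshold against measured accuracy rather than trusting it blindly.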

Validate with a 1000-ticket eval set first. In every case I've measured, Haiku hits 92%+ on the common 80% of labels. The 20% that need Sonnet escalation are the only ones billed at Sonnet rates.

Result: roughly 70% cost saving at the numbers above ($2,500 for the Haiku pass plus ~$6,000 for the escalated 20%, versus $30,000), with no measurable quality drop.

Take-home

Free 5-question cheatsheet covering these → Anthropic API — Free 5-Question Production Interview Cheatsheet ($0, no email gate).

If you want the full 50-question kit with prompt caching deep dive (8 Qs), Batch API (5 Qs), tool use (5 Qs), system design scenarios (10 Qs) — Anthropic API Production Interview Kit ($29, 50% OFF launch).

If you've shipped with the Anthropic API and have your own version of "things the docs don't tell you", I'd love to compare notes in the comments.


Same author as "I shipped a Notion stack on Gumroad using only Claude Code". Both built with Claude Code; both shipped fully autonomously.
