Juan Torchia

Posted on • Originally published at juanchi.dev

The Month Anthropic Didn't Respond: Billing, Trust, and the Hidden Cost of AI API Dependency

There's a belief baked into the dev community about AI APIs that is, with all due respect, pretty incomplete: that they're reliable infrastructure. That you can build on them the same way you build on AWS S3 or Stripe. That if something goes wrong, there's someone on the other end who picks up.

Not necessarily.

A Hacker News thread with 365 points — the kind that shows up on a Tuesday morning and wrecks your week — documented in detail what happened to someone with Anthropic: a month of tickets with no response, billing that kept running, and an institutional silence that doesn't line up with what it costs to use those APIs. I read that thread three times. Not because it surprised me. But because I'd already lived something almost identical.

Anthropic Billing, AI Support, Vendor Lock-in: The Problem Nobody Names

Friday. 11pm. I had a client with a production system using an AI API to process forms — nothing critical in theory, but completely critical in practice because it was the main business flow. The system stopped working. No explicit error in the logs. No email notification. No banner on the status page.

Silence.

// What I was seeing in the logs:
// { status: 200, body: { error: null, result: null } }
// A 200 response with an empty body. An elegant way to die.

async function processForm(data: FormData) {
  const response = await aiClient.complete({
    model: 'whatever-model-was-current',
    prompt: buildPrompt(data),
  });

  // The problem: no validation for empty responses
  // I assumed that if there was no error, there was a result
  // I assumed wrong
  return response.result; // undefined, silently
}

I spent two hours debugging before I realized the problem wasn't my code. It was the API. They'd made a change to the response format — no explicit versioning, no heads-up — and my code was just receiving nulls with a 200 status.

I opened a support ticket. I waited.

The response came 72 hours later. A Friday at 11pm is not a time when AI companies have people on call. This isn't a criticism of the individuals — it's a criticism of the support model that's implicitly sold to you when you're being charged per API call at prices that aren't cheap.

The Real Technical Problem: Building on Sand with Granite-Looking Foundations

The issue isn't that APIs fail. Everything fails. The issue is the asymmetry of information and the asymmetry of power.

When you build on Stripe, you get:

  • Explicit API versioning (/v1/, /v2/)
  • Deprecation notices months in advance
  • Webhooks with verifiable signatures
  • Documented SLAs
  • Support with guaranteed response times based on your tier

When you build on AI APIs today, you get at best:

  • Model versioning (which is not the same as API versioning)
  • Status pages that sometimes reflect reality
  • Rate limits that change without much warning
  • Billing that runs even when the service is degraded
  • Support quality that literally depends on how much you spend per month

// What you should always do — and what I didn't do that night:

interface AIResponse {
  result: string | null;
  metadata: Record<string, unknown>;
}

function validateAIResponse(response: unknown): AIResponse {
  // Never trust the shape of a response from an external API
  // Especially AI APIs where the schema evolves fast
  if (!response || typeof response !== 'object') {
    throw new Error('Invalid response: not an object');
  }

  const r = response as Record<string, unknown>;

  if (!r.result && r.result !== '') {
    // Log with enough context to debug at 11pm
    console.error('[AI] Unexpected empty response', {
      timestamp: new Date().toISOString(),
      shape: Object.keys(r),
    });
    throw new Error('AI response has no result');
  }

  return r as AIResponse;
}

// Basic circuit breaker — not optional, it's infrastructure
class AICircuitBreaker {
  private failures = 0;
  private readonly failureThreshold = 3;
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private lastFailure?: Date;

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      const waitTime = 30_000; // 30 seconds
      const elapsed = Date.now() - (this.lastFailure?.getTime() ?? 0);

      if (elapsed < waitTime) {
        // Fail fast so the caller's fallback can take over; don't leave the user hanging
        throw new Error('AI service temporarily unavailable');
      }

      this.state = 'half-open';
    }

    try {
      const result = await fn();
      this.reset();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }

  private recordFailure() {
    this.failures++;
    this.lastFailure = new Date();
    if (this.failures >= this.failureThreshold) {
      this.state = 'open';
      console.error('[CircuitBreaker] State: OPEN — too many consecutive failures');
    }
  }

  private reset() {
    this.failures = 0;
    this.state = 'closed';
  }
}

This isn't sophisticated. It's the minimum. It's what you should have before pushing any third-party API integration to production. With AI APIs it matters even more, because the silent failure mode — an empty response with a 200 — is more common than in mature APIs.
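
While we're on minimums: a retry with exponential backoff belongs right next to the circuit breaker. Here's a minimal sketch (the helper name and the delays are mine, not from any SDK): retries absorb transient failures, while the breaker stops you from hammering an API that is genuinely down.

```typescript
// Retry with exponential backoff and jitter — a generic sketch, adapt the
// attempt count and base delay to your own latency budget.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        // Backoff grows per attempt (~500ms, ~1s, ~2s...), with jitter so
        // a fleet of clients doesn't retry in lockstep
        const delay = baseDelayMs * 2 ** attempt * (0.5 + Math.random());
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

Wrap the `withRetry` call inside the circuit breaker's `execute`, not the other way around: retries handle the blips, the breaker handles the outages.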

I talk about this in my post about vibe-coding vs stress-coding: there's a massive difference between using AI as a tool and depending on AI as infrastructure. The first one amplifies you. The second is a contract nobody explicitly signed.

The Mistakes You Make When You Trust Too Fast

1. Not modeling the fallback from day one.

When you add an AI integration, the happy path is easy. 99% of the time it works. The problem is the 1% that happens on a Friday at 11pm. What does your app do if the API doesn't respond? If it responds empty? If it responds with 30 seconds of latency? If you don't have answers to those three questions before you deploy, you're not ready.
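
For the latency question in particular, a timeout wrapper is the cheapest insurance you can buy. A hypothetical helper, assuming nothing beyond standard promises; pick the budget that fits your use case, 30 seconds is not a magic number:

```typescript
// Race any promise against a deadline — a minimal sketch.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    // Always clear the timer so the process isn't kept alive by a dead race
    clearTimeout(timer);
  }
}
```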

2. Confusing the status page with reality.

AI vendor status pages are... optimistic. I've seen real degradation while the status page showed green. Implement your own health check:

// Real health check — don't rely solely on the vendor's status page
async function checkAPIHealth(): Promise<boolean> {
  try {
    const start = Date.now();

    // Test call with minimal prompt and strict timeout
    const response = await Promise.race([
      aiClient.complete({
        model: 'your-model',
        prompt: 'Reply with just "ok"',
        maxTokens: 5,
      }),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('Timeout')), 5_000)
      ),
    ]);

    const latency = Date.now() - start;

    // Log latency — latency changes are an early signal of problems
    console.info('[HealthCheck] AI API latency:', latency, 'ms');

    return Boolean(response);
  } catch {
    return false;
  }
}

3. Not tracking cost in real time.

AI API billing is per token, and tokens accumulate. If your app has a bug that makes redundant calls, you find out on the monthly invoice. By the time you open a ticket, you've already burned the money. Set up cost alerts before the problem becomes several days of billing with no support.
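
Here's a minimal sketch of what an in-process cost alert can look like. The price per million tokens and the threshold are made-up numbers for illustration; check your provider's current pricing page:

```typescript
// Track token spend as calls happen and fire an alert callback once a
// threshold is crossed — a sketch, not a billing system.
class CostTracker {
  private totalTokens = 0;
  private alerted = false;

  constructor(
    private readonly usdPerMillionTokens: number,
    private readonly alertThresholdUsd: number,
    private readonly onAlert: (spentUsd: number) => void,
  ) {}

  record(tokens: number): void {
    this.totalTokens += tokens;
    if (!this.alerted && this.spentUsd() >= this.alertThresholdUsd) {
      this.alerted = true; // fire once, not on every call afterwards
      this.onAlert(this.spentUsd());
    }
  }

  spentUsd(): number {
    return (this.totalTokens / 1_000_000) * this.usdPerMillionTokens;
  }
}
```

Call `record()` with the token usage the API returns on each response, and point `onAlert` at Slack, PagerDuty, or wherever you'll actually see it on a Friday night.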

4. Betting everything on a single provider.

This connects to the work I do thinking about orchestration. When I read about Scion, Google's testbed for agents, the first thing I thought wasn't about the technical capabilities — it was about portability: if I switch providers, how much of my business logic do I have to rewrite?

The honest answer in most cases: too much.

// Abstract the provider from the start — what I should have done
interface AIClient {
  complete(options: CompletionOptions): Promise<AIResponse>;
  calculateCost(tokens: number): number;
}

// Swappable implementations
class AnthropicClient implements AIClient {
  async complete(options: CompletionOptions): Promise<AIResponse> {
    // provider-specific implementation
  }
  calculateCost(tokens: number): number { /* ... */ }
}

class OpenAIClient implements AIClient {
  async complete(options: CompletionOptions): Promise<AIResponse> {
    // provider-specific implementation
  }
  calculateCost(tokens: number): number { /* ... */ }
}

// Your business logic doesn't know who's behind the curtain
class FormService {
  constructor(private readonly ai: AIClient) {}

  async process(data: FormData) {
    // This works with any provider
    return this.ai.complete({ prompt: buildPrompt(data) });
  }
}

It's the same principle I apply when building VS Code extensions: abstraction isn't free complexity, it's what lets you change the parts that change without breaking the parts that don't.

FAQ: What People Ask When They Get Burned by AI APIs

Does Anthropic have a documented SLA for their APIs?

Not publicly, at least not in the terms that other infrastructure providers like AWS or GCP offer. There are implicit uptime commitments, but support terms depend on your spending tier. If you don't know how much you need to spend to get priority support, you probably don't have it.

Is it any different with OpenAI or Google AI?

In general, the support problem cuts across AI vendors. The bigger ones have better status infrastructure and more support capacity, but the power asymmetry still exists: they decide when to change the model, when to deprecate a version, when to adjust pricing. You agree to all of that when you sign the ToS.

How do you know if your AI integration is silently failing in production?

If you don't have explicit metrics for: (1) empty response rate, (2) p95 latency, and (3) error rate broken down by type — you don't know. The silent failure mode — 200 with an empty body — doesn't trigger traditional error alerts. You need schema validation on the response and business metrics (how many forms processed this hour vs last hour?) to detect it.
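
You don't need a heavy stack to start tracking those three signals. A minimal in-memory sketch (the class and method names are mine; in production you'd push these into Prometheus, Datadog, or whatever you already run):

```typescript
// Record every AI call and expose the three signals that catch silent failures:
// empty response rate, p95 latency, and errors broken down by type.
class AIMetrics {
  private latencies: number[] = [];
  private total = 0;
  private empty = 0;
  private errorsByType = new Map<string, number>();

  recordCall(latencyMs: number, isEmpty: boolean): void {
    this.total++;
    this.latencies.push(latencyMs);
    if (isEmpty) this.empty++;
  }

  recordError(type: string): void {
    this.errorsByType.set(type, (this.errorsByType.get(type) ?? 0) + 1);
  }

  emptyRate(): number {
    return this.total === 0 ? 0 : this.empty / this.total;
  }

  errorCount(type: string): number {
    return this.errorsByType.get(type) ?? 0;
  }

  p95LatencyMs(): number {
    if (this.latencies.length === 0) return 0;
    const sorted = [...this.latencies].sort((a, b) => a - b);
    return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))];
  }
}
```

Alert on the trend, not the absolute value: an empty rate that doubles hour over hour is the signal that something changed on the other side.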

Is it still worth building on these APIs given the risk?

Yes, but with your eyes open. The question isn't whether to use AI APIs but how. Provider abstraction, circuit breaker, explicit fallback, and your own health check are not optional if you're in production. They're the real entry cost that nobody tells you when you're reading the docs. Same as with accessibility metrics: the number the tool shows you and the user's actual experience are two different things.

How do I structure the fallback without degrading user experience?

Depends on the use case, but the principle is: the user should never see the internal error. If the AI doesn't respond, can you process with simpler logic? Can you queue and process later? Can you show an honest "we're processing, we'll notify you" message? Any of those options is better than a generic 500, or worse, a silently empty result.
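
The "queue and process later" option fits in a few lines. This is a hypothetical, in-memory sketch; in production you'd want something durable, an actual job queue:

```typescript
// Queue work when the AI is down, drain it when the AI comes back.
interface PendingJob {
  id: string;
  data: unknown;
  enqueuedAt: Date;
}

class FallbackQueue {
  private jobs: PendingJob[] = [];

  // Returns the honest message the user sees instead of a 500
  enqueue(id: string, data: unknown): string {
    this.jobs.push({ id, data, enqueuedAt: new Date() });
    return "We're processing your request — we'll notify you when it's done.";
  }

  // Run once the health check goes green again; processes in FIFO order
  async drain(process: (job: PendingJob) => Promise<void>): Promise<number> {
    let processed = 0;
    while (this.jobs.length > 0) {
      const job = this.jobs.shift()!;
      await process(job);
      processed++;
    }
    return processed;
  }
}
```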

Is AI vendor lock-in different from lock-in with other APIs?

Yes, and in the worst way. AWS lock-in is technical but predictable — you migrate the data and rewrite the infrastructure. AI lock-in also includes model behavior: the same prompt can give different results across providers, and your business logic is sometimes built around the idiosyncrasies of one specific model. It's behavioral lock-in, not just API lock-in.

What I'd Do Differently (and What You Should Do Before Next Friday)

I passed Calculus II on my fourth attempt. I was working full time while studying at UBA. I showed up to the exam straight from the office, still in my work clothes. What I learned from that wasn't just math — I learned that the number of attempts it takes to pass something says nothing about whether you're capable. It says how many times you're willing to come back.

With AI APIs, we're on the first collective attempt. The ecosystem is young, the support contracts are immature, and the failure modes are still being discovered in production — literally in other people's production, on a Friday at 11pm.

That's not a reason to stop building. It's a reason to build more carefully.

What I'd do differently:

  1. Provider abstraction from commit one — not after you already have 40 direct calls to the Anthropic SDK
  2. Circuit breaker and your own health check — the vendor's status page is their version of the story, not yours
  3. Strict schema validation on every response — especially if the provider gives you no versioning guarantees
  4. Real-time cost alerts — before billing runs for days with no support
  5. Explicit documented fallback — if the AI doesn't respond, what does the system do? If the answer is "I don't know," you're not ready for production

Side note: the same care you put into the behavior of an external API is the same care you should put into the tools you use every day. When I built my VS Code extension for SSL certificates, I did it precisely because I didn't want to depend on someone else maintaining a critical tool in my workflow. Same principle.

The 365-point HN thread isn't an isolated case. It's a symptom. The hidden cost of AI APIs isn't just the token price — it's the trust cost we still haven't finished calculating.

If you're building something that matters on top of one of these APIs, send me a message. I genuinely want to know how you're solving it.
