<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sindhu Murthy</title>
    <description>The latest articles on DEV Community by Sindhu Murthy (@sindhu_murthy_628835a359d).</description>
    <link>https://dev.to/sindhu_murthy_628835a359d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3776486%2Fe42aa5dc-cb50-4005-9be8-65f14f5cc258.png</url>
      <title>DEV Community: Sindhu Murthy</title>
      <link>https://dev.to/sindhu_murthy_628835a359d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sindhu_murthy_628835a359d"/>
    <language>en</language>
    <item>
      <title>Billing &amp; Account Issues: A Support Engineer's Runbook</title>
      <dc:creator>Sindhu Murthy</dc:creator>
      <pubDate>Wed, 18 Feb 2026 18:29:57 +0000</pubDate>
      <link>https://dev.to/sindhu_murthy_628835a359d/billing-account-issues-a-support-engineers-runbook-118a</link>
      <guid>https://dev.to/sindhu_murthy_628835a359d/billing-account-issues-a-support-engineers-runbook-118a</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; This runbook is a practical reference for support engineers and anyone preparing for a support engineering role with AI API providers. It covers the 6 most common billing incident types — how to diagnose them, how to fix them, and what to communicate to customers. Patterns here apply across providers including OpenAI, Anthropic, Google, Cohere, and others.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  ⚡ Quick Reference
&lt;/h2&gt;

&lt;p&gt;Match the customer's symptom to the incident type, then jump to that section.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Customer Says&lt;/th&gt;
&lt;th&gt;Jump To&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🚫 "My API calls suddenly stopped working"&lt;/td&gt;
&lt;td&gt;Incident 1 — Payment Failure / Credit Exhaustion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;😱 "My bill is way higher than I expected"&lt;/td&gt;
&lt;td&gt;Incident 2 — Unexpected High Bill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📍 "I hit my limit and the API stopped"&lt;/td&gt;
&lt;td&gt;Incident 3 — Spending Limit Reached&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;⏳ "My free credits ran out"&lt;/td&gt;
&lt;td&gt;Incident 4 — Free Tier / Trial Expiry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;💸 "I want a refund for accidental charges"&lt;/td&gt;
&lt;td&gt;Incident 5 — Refund Request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔒 "My account has been suspended / locked"&lt;/td&gt;
&lt;td&gt;Incident 6 — Account Suspension&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;🔵 &lt;strong&gt;Before anything else:&lt;/strong&gt; Always check the provider's status page first (e.g. status.openai.com, status.anthropic.com). If there is an active incident, that is your answer — inform the customer and monitor. Do not proceed further until you have ruled out a provider-side outage.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Incident 1 — Payment Failure / Credit Exhaustion&lt;/li&gt;
&lt;li&gt;Incident 2 — Unexpected High Bill&lt;/li&gt;
&lt;li&gt;Incident 3 — Spending Limit Reached Without Warning&lt;/li&gt;
&lt;li&gt;Incident 4 — Free Tier / Trial Credit Expiry&lt;/li&gt;
&lt;li&gt;Incident 5 — Refund Request for Accidental Usage&lt;/li&gt;
&lt;li&gt;Incident 6 — Account Suspension&lt;/li&gt;
&lt;li&gt;Master Decision Tree&lt;/li&gt;
&lt;li&gt;Support Engineer Troubleshooting Checklist&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Incident 1 — Payment Failure / Credit Exhaustion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Error: 402 Payment Required — API access stops immediately&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"My API calls were working fine and then suddenly stopped."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I'm getting 402 errors on every request."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Nothing in my code changed."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What actually happened
&lt;/h3&gt;

&lt;p&gt;AI API providers stop access &lt;strong&gt;immediately and without a grace period&lt;/strong&gt; when a payment fails or a prepaid credit balance hits $0. Unlike a SaaS subscription that might give you days to fix a payment issue, the API cuts off the moment the billing system flags a failure.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;How to Confirm It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Card expired or declined&lt;/td&gt;
&lt;td&gt;Provider dashboard → Billing → red banner or failed payment status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prepaid credit balance at $0&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → credit balance shows $0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-recharge enabled but card declined&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → payment history → failed recharge entry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Invoice overdue (enterprise accounts)&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → Invoices → unpaid invoice&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Diagnosis Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;402 Payment Required
  │
  ├── Provider Dashboard → Billing section
  │     │
  │     ├── Red banner or "Payment failed" message?
  │     │     └── Card expired / declined
  │     │           FIX: Ask customer to update card.
  │     │               Dashboard → Billing → Payment methods → Update
  │     │               API resumes within ~5 minutes of payment clearing.
  │     │
  │     ├── Credit balance shows $0?
  │     │     ├── Auto-recharge OFF
  │     │     │     FIX: Add credits manually + enable auto-recharge.
  │     │     │
  │     │     └── Auto-recharge ON but balance still $0
  │     │           → The recharge itself failed (card issue).
  │     │             FIX: Same as expired/declined card above.
  │     │
  │     └── Invoice overdue? (enterprise customers)
  │           FIX: Route to finance team for payment processing.
  │
  └── Dashboard looks fine — balance &amp;gt; $0, card valid?
        → Rare sync delay. Wait 10 minutes.
          Still failing? Escalate with account ID + timestamps.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to fix it
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;th&gt;Where&lt;/th&gt;
&lt;th&gt;Time to Resolution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Update payment card&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → Payment methods&lt;/td&gt;
&lt;td&gt;~5 min after payment clears&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add prepaid credits&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → Add credits&lt;/td&gt;
&lt;td&gt;~2–5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enable auto-recharge&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → Auto-recharge settings&lt;/td&gt;
&lt;td&gt;Prevents future incidents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What to tell the customer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your payment method needs to be updated. Go to your provider dashboard under Billing → Payment methods, update your card, and API access should resume within a few minutes. I'd also recommend enabling auto-recharge so your balance never hits zero unexpectedly."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;💚 &lt;strong&gt;Post-resolution:&lt;/strong&gt; Always recommend enabling auto-recharge with a top-up threshold set to at least 2× the customer's average daily spend. This single setting prevents the majority of these tickets.&lt;/p&gt;
&lt;/blockquote&gt;
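&lt;p&gt;The status codes in this runbook map cleanly to incident sections, which makes a first-response triage helper straightforward. This is a minimal sketch (the mapping and wording are assumptions, and real provider SDKs raise their own typed exceptions rather than bare status codes):&lt;/p&gt;

```python
# Map billing-related HTTP status codes to runbook guidance.
# Hypothetical helper -- real provider SDKs expose typed errors instead.
RUNBOOK_GUIDANCE = {
    402: "Payment failure or credit exhaustion. "
         "Check Dashboard -> Billing for a failed payment or a $0 balance.",
    429: "Rate or spend limit. Could be a hard spending cap -- "
         "check Billing -> Spending limits before assuming rate limiting.",
    401: "Auth failure. If every key fails, suspect account-level suspension.",
}

def diagnose(status_code: int) -> str:
    """Return first-response guidance for a billing-related status code."""
    return RUNBOOK_GUIDANCE.get(
        status_code,
        "Not a known billing code -- check the provider status page first.",
    )
```

&lt;p&gt;For example, &lt;code&gt;diagnose(402)&lt;/code&gt; returns the payment-failure guidance, pointing the agent straight at this incident's checklist.&lt;/p&gt;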




&lt;h2&gt;
  
  
  Incident 2 — Unexpected High Bill
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Customer's invoice is significantly higher than expected&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"My bill last month was $40. This month it's $800. Nothing changed."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I think I'm being charged incorrectly."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"We only have 200 users — how is this possible?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What actually happened
&lt;/h3&gt;

&lt;p&gt;Something in the customer's usage changed — even if they don't know what. In practice, 95% of high-bill tickets trace back to one of five root causes. Your job is to identify which one using the Usage dashboard.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;What It Looks Like in Usage Dashboard&lt;/th&gt;
&lt;th&gt;How Common&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Runaway loop&lt;/strong&gt; — bug calling API thousands of times&lt;/td&gt;
&lt;td&gt;One day with a massive spike, thousands of requests in minutes&lt;/td&gt;
&lt;td&gt;Very common&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Model swap&lt;/strong&gt; — switched to a more expensive model&lt;/td&gt;
&lt;td&gt;Usage shifts to a pricier model mid-month&lt;/td&gt;
&lt;td&gt;Very common&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Context bloat&lt;/strong&gt; — sending full documents instead of chunks&lt;/td&gt;
&lt;td&gt;High token count per request, not high request count&lt;/td&gt;
&lt;td&gt;Common&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Retry storm&lt;/strong&gt; — failed requests retrying without backoff&lt;/td&gt;
&lt;td&gt;Clusters of identical requests at the same timestamps&lt;/td&gt;
&lt;td&gt;Common&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Dev key in production&lt;/strong&gt; — test environment hitting real API&lt;/td&gt;
&lt;td&gt;Usage spikes during business hours or CI/CD run times&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Planned vs Unplanned Model Changes — Know the Difference
&lt;/h3&gt;

&lt;p&gt;Using multiple models intentionally for different tasks is one of the &lt;strong&gt;best cost strategies in AI engineering&lt;/strong&gt; — not a problem. The issue is when a model change happens accidentally: a developer swaps a model name in one place without checking the pricing impact, and the bill spikes before anyone notices.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;✅ Intentional Multi-Model Routing&lt;/th&gt;
&lt;th&gt;❌ Accidental Model Swap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deliberately using cheap models for simple tasks, expensive ones for complex tasks&lt;/td&gt;
&lt;td&gt;Someone changes a model name in code without checking pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planned?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — documented in architecture&lt;/td&gt;
&lt;td&gt;No — discovered on the invoice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Is it a problem?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No — this is best practice&lt;/td&gt;
&lt;td&gt;Yes — surprise bill with no warning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Classification → economy model; complex reasoning → premium model&lt;/td&gt;
&lt;td&gt;gpt-4o-mini quietly changed to gpt-4o in a config file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Smart Multi-Model Routing — Recommended Approach
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Recommended Model Tier&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification, routing, tagging, simple Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;Economy (e.g. gpt-4o-mini, claude-haiku, gemini-flash)&lt;/td&gt;
&lt;td&gt;Doesn't need deep reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer-facing chat, summarisation&lt;/td&gt;
&lt;td&gt;Standard (e.g. gpt-4o, claude-sonnet, gemini-pro)&lt;/td&gt;
&lt;td&gt;Good quality-to-cost balance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex analysis, code, legal/financial reasoning&lt;/td&gt;
&lt;td&gt;Premium (e.g. o1, claude-opus, gemini-ultra)&lt;/td&gt;
&lt;td&gt;Worth the cost when accuracy matters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
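&lt;p&gt;The routing table can be pinned down in code so model names live in one reviewed place instead of scattered string literals. A sketch, assuming illustrative model names and task labels (swap in your provider's current identifiers):&lt;/p&gt;

```python
# Centralise model choices so a swap is a reviewed, one-line change.
# Model names below are illustrative -- check your provider's pricing page.
MODEL_TIERS = {
    "economy": "gpt-4o-mini",   # classification, routing, tagging
    "standard": "gpt-4o",       # customer-facing chat, summarisation
    "premium": "o1",            # complex analysis, code, legal reasoning
}

TASK_TO_TIER = {
    "classify": "economy",
    "chat": "standard",
    "summarise": "standard",
    "analyse": "premium",
}

def pick_model(task: str) -> str:
    """Route a task to a model tier; unknown tasks fall back to economy,
    so the failure mode is a cheap answer, not an expensive one."""
    tier = TASK_TO_TIER.get(task, "economy")
    return MODEL_TIERS[tier]
```

&lt;p&gt;Because every call goes through &lt;code&gt;pick_model&lt;/code&gt;, an accidental model swap becomes a visible diff in one file rather than a surprise on the invoice.&lt;/p&gt;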

&lt;h3&gt;
  
  
  The Pricing Gap That Catches People Off Guard
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model Tier&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Approx. Cost per 1M input tokens&lt;/th&gt;
&lt;th&gt;Relative Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Economy / Lightweight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;gpt-4o-mini, claude-haiku, gemini-flash&lt;/td&gt;
&lt;td&gt;~$0.10–0.20&lt;/td&gt;
&lt;td&gt;🟢 Cheapest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;gpt-4o, claude-sonnet, gemini-pro&lt;/td&gt;
&lt;td&gt;~$2.50–3.00&lt;/td&gt;
&lt;td&gt;🟠 ~15–20× more&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Premium / Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;o1, claude-opus, gemini-ultra&lt;/td&gt;
&lt;td&gt;~$15.00+&lt;/td&gt;
&lt;td&gt;🔴 ~100× more&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Always direct customers to their provider's current pricing page&lt;/strong&gt; — these numbers change as models evolve. Use the table above for illustration only.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  How Context Bloat Compounds Cost
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Same number of requests — very different cost:

  Request with 1K tokens:
  └── Cost on a standard model: ~$0.0025

  Request with 10K tokens (full document sent):
  └── Cost on a standard model: ~$0.025  ← 10× more expensive

  500 such requests/day × 30 days:
  ├── 1K tokens:  ~$37.50/month
  └── 10K tokens: ~$375.00/month  ← same traffic, 10× the bill

  FIX: Send only relevant chunks. Use retrieval (RAG).
       Summarize long docs with a cheap model before
       passing to an expensive one.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
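&lt;p&gt;The arithmetic in that block generalises to a one-line estimator. A sketch; the $2.50 per million input tokens rate is illustrative, so always plug in the provider's current price:&lt;/p&gt;

```python
def monthly_input_cost(tokens_per_request: int,
                       requests_per_day: int,
                       usd_per_million_tokens: float,
                       days: int = 30) -> float:
    """Estimate monthly input-token spend for steady traffic."""
    total_tokens = tokens_per_request * requests_per_day * days
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 500 requests/day on a ~$2.50/1M standard model:
lean = monthly_input_cost(1_000, 500, 2.50)      # $37.50/month
bloated = monthly_input_cost(10_000, 500, 2.50)  # $375.00/month
```

&lt;p&gt;Running the customer's own token counts through this during a ticket makes the "same traffic, 10× the bill" effect concrete.&lt;/p&gt;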



&lt;h3&gt;
  
  
  Diagnosis Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer reports high bill
  │
  ├── Dashboard → Usage → set date range to billing period
  │     │
  │     ├── Single-day spike visible?
  │     │     → Likely runaway loop or retry storm.
  │     │       Are requests clustered by timestamp?
  │     │       Clustered      → retry storm (no exponential backoff)
  │     │       Spread but massive volume → runaway loop (code bug)
  │     │
  │     ├── Usage shifted to a more expensive model mid-month?
  │     │     → Model swap.
  │     │       Ask: "Did anyone on your team change the model name recently?"
  │     │
  │     ├── High token count per request?
  │     │     → Context bloat.
  │     │       Ask: "Are you sending full documents or just relevant sections?"
  │     │
  │     └── Usage spread evenly but higher overall?
  │           → Traffic grew OR dev key hitting production API.
  │             Ask: "Do you use the same API key in dev and production?"
  │
  └── Usage dashboard total matches the invoice?
        YES → Usage is legitimate. Explain pricing, suggest optimizations.
        NO  → Escalate with account ID, date range, and the discrepancy figures.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Post-Resolution Recommendations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Root Cause Found&lt;/th&gt;
&lt;th&gt;Recommend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runaway loop&lt;/td&gt;
&lt;td&gt;Set a monthly hard spend limit. Add request-level logging.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model swap&lt;/td&gt;
&lt;td&gt;Lock model names to constants or environment variables. Review pricing on every model change.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context bloat&lt;/td&gt;
&lt;td&gt;Use retrieval-augmented generation (RAG). Send relevant chunks only.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry storm&lt;/td&gt;
&lt;td&gt;Implement exponential backoff with jitter. Cap total retries per request.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dev key in production&lt;/td&gt;
&lt;td&gt;Separate API keys per environment. Set lower spend limits on dev keys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
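&lt;p&gt;For the retry-storm fix recommended above, exponential backoff with jitter and a capped retry budget looks roughly like this (a sketch, assuming the request callable raises on transient failures):&lt;/p&gt;

```python
import random
import time

def call_with_backoff(make_request, max_retries: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0):
    """Retry a callable with capped exponential backoff plus full jitter.

    Prevents retry storms: total attempts are bounded and delays grow,
    so a failing endpoint is not hammered with identical requests.
    """
    for attempt in range(max_retries + 1):
        try:
            return make_request()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted -- surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```

&lt;p&gt;The jitter matters: without it, every client that failed at the same moment retries at the same moment, producing exactly the clustered-timestamp pattern described in the diagnosis flow.&lt;/p&gt;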




&lt;h2&gt;
  
  
  Incident 3 — Spending Limit Reached Without Warning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;API stops mid-month — customer didn't realise a hard limit was set&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"The API just stopped working. I have money in my account."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I'm getting errors even though my balance is positive."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"It was fine yesterday — nothing changed."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What actually happened
&lt;/h3&gt;

&lt;p&gt;Most AI providers allow users to set a &lt;strong&gt;monthly spending cap (hard limit)&lt;/strong&gt;. When this cap is reached, all API calls fail — even with a valid payment method and positive credit balance. This is a customer-configured safety feature, not a bug. The confusion usually happens because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The limit was set a long time ago and forgotten&lt;/li&gt;
&lt;li&gt;Usage grew beyond the original projection&lt;/li&gt;
&lt;li&gt;A spike consumed the monthly budget faster than expected&lt;/li&gt;
&lt;li&gt;The customer confused the &lt;strong&gt;soft limit&lt;/strong&gt; (notification only) with the &lt;strong&gt;hard limit&lt;/strong&gt; (cutoff)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Soft Limit vs Hard Limit — The Critical Difference
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Soft Limit&lt;/th&gt;
&lt;th&gt;Hard Limit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sends an email/alert notification when reached&lt;/td&gt;
&lt;td&gt;Stops all API calls immediately when reached&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Does it cut off the API?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No — API keeps working&lt;/td&gt;
&lt;td&gt;Yes — API stops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error seen when hit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No error — just a notification&lt;/td&gt;
&lt;td&gt;429 or billing-related error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Early warning at 70–80% of budget&lt;/td&gt;
&lt;td&gt;Circuit breaker at 100% of budget&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How Limits Should Be Configured
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly budget: $500
  │
  ├── Soft limit: $375  (75%)
  │     → Notification sent: "You've used 75% of your budget"
  │     → API still works
  │     → Time to review: Is this expected? Should the limit be raised?
  │
  └── Hard limit: $500  (100%)
        → All API calls stop
        → Protects against runaway costs above the budget

  ┌──────────────────────────────────────────────────────┐
  │  $0          $375 (soft)          $500 (hard)         │
  │  ├──────────────┼───────────────────┤                 │
  │  │  SAFE ZONE   │   WARNING ZONE    │   API OFFLINE   │
  └──────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
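&lt;p&gt;The same zones can be enforced client-side, so the application warns at the soft limit instead of discovering the hard limit through failed requests. A sketch using the hypothetical $375/$500 figures from the diagram:&lt;/p&gt;

```python
def budget_zone(spend: float, soft: float = 375.0, hard: float = 500.0) -> str:
    """Classify current spend against soft (warn) and hard (cutoff) limits."""
    if spend >= hard:
        return "offline"  # provider-side hard limit: all API calls fail
    if spend >= soft:
        return "warning"  # notify, review usage, decide whether to raise limit
    return "safe"
```

&lt;p&gt;Here &lt;code&gt;budget_zone(380.0)&lt;/code&gt; returns "warning": the window where the customer should review usage before the cutoff, exactly as the diagram describes.&lt;/p&gt;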



&lt;h3&gt;
  
  
  How to Fix It
&lt;/h3&gt;

&lt;p&gt;Go to the provider's &lt;strong&gt;Billing → Spending limits&lt;/strong&gt; settings and either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raise the hard limit to a higher value (takes effect immediately)&lt;/li&gt;
&lt;li&gt;Wait for the monthly reset (usually the 1st of the calendar month)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Before raising the limit:&lt;/strong&gt; Check the Usage dashboard to confirm whether the spend was expected. If it's from a bug or spike, raising the limit without fixing the root cause just defers the problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What to Tell the Customer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your account has a monthly spending cap set, and you've reached it — that's why the API stopped. This is a safety feature you configured, not a bug. You can raise it in your billing settings. Before you do, I'd recommend checking your usage dashboard to confirm the spending was expected."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Incident 4 — Free Tier / Trial Credit Expiry
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Credits ran out or expired — customer didn't expect it&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"I just created my account and the API is already not working."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I thought I had free credits — why am I getting errors?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"It worked last week. Now I'm getting 402 errors."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What actually happened
&lt;/h3&gt;

&lt;p&gt;New accounts on most AI providers receive a free credit grant. Those credits can disappear in two ways: they get fully consumed, or they expire (free credits usually carry a time limit). Once they're gone, behaviour changes in ways that aren't always obvious.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Error Seen&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free credits fully consumed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;402&lt;/code&gt; on all requests&lt;/td&gt;
&lt;td&gt;Add a payment method in Billing settings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free credits expired (time limit hit)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;402&lt;/code&gt; even if credits appeared available&lt;/td&gt;
&lt;td&gt;Credits are gone — add a payment method&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On free tier with very low rate limits&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;429&lt;/code&gt; even at low request volume&lt;/td&gt;
&lt;td&gt;Add payment method to move to paid tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Upgraded to paid but limits feel unchanged&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;429&lt;/code&gt; at low volume&lt;/td&gt;
&lt;td&gt;Tier upgrades can take time to propagate — check current tier in dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Free Tier vs Paid Tier — Why It Feels Broken
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Who&lt;/th&gt;
&lt;th&gt;Rate Limits&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New accounts, no payment method&lt;/td&gt;
&lt;td&gt;Very restrictive (e.g. 3 RPM on premium models)&lt;/td&gt;
&lt;td&gt;Fine for testing; not suitable for real applications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Paid Tier 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Payment method added + minimum spend reached&lt;/td&gt;
&lt;td&gt;Significantly higher&lt;/td&gt;
&lt;td&gt;Most developers land here first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Paid Tier 2+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Based on cumulative spend history&lt;/td&gt;
&lt;td&gt;Progressively higher&lt;/td&gt;
&lt;td&gt;Limits increase automatically as spend grows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Common confusion:&lt;/strong&gt; A customer adds a payment method but still hits very low rate limits. Most providers require a payment method AND a minimum spend AND a minimum account age — all three conditions must be met before a tier upgrade is applied.&lt;/p&gt;
&lt;/blockquote&gt;
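&lt;p&gt;That three-condition rule is easy to sanity-check before escalating a "my limits didn't increase" ticket. A sketch; the spend and age thresholds here are placeholders, since actual criteria vary by provider:&lt;/p&gt;

```python
def tier_upgrade_ready(has_payment_method: bool,
                       lifetime_spend_usd: float,
                       account_age_days: int,
                       min_spend: float = 5.0,
                       min_age_days: int = 7) -> bool:
    """All three conditions must hold before most providers apply a tier upgrade.

    Thresholds are placeholders -- check the provider's rate-limit docs.
    """
    return (has_payment_method
            and lifetime_spend_usd >= min_spend
            and account_age_days >= min_age_days)
```

&lt;p&gt;Walking a customer through each condition in turn usually surfaces the one they are missing (most often account age).&lt;/p&gt;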

&lt;h3&gt;
  
  
  What to Tell the Customer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your free credits have been used up or have expired. To continue, add a payment method in your billing settings. Once you meet the provider's tier criteria — typically a minimum spend and account age — you'll automatically move to a higher rate limit tier."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Incident 5 — Refund Request for Accidental Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Customer was charged for usage they say was unintentional&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"I had a bug that made thousands of API calls — can I get a refund?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"My account was compromised and someone used my API key."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I forgot to turn off my dev environment."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Refund Policy Reality — Set Expectations Early
&lt;/h3&gt;

&lt;p&gt;Most AI providers have a &lt;strong&gt;no-refund policy for API usage&lt;/strong&gt; because the compute was actually consumed. There is no automatic refund process. That said, some situations may qualify for a &lt;strong&gt;goodwill credit&lt;/strong&gt;. Being honest with customers before they escalate saves everyone time.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Realistic Outcome&lt;/th&gt;
&lt;th&gt;What Helps the Customer's Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bug caused a clear runaway loop&lt;/td&gt;
&lt;td&gt;🟠 Possible goodwill credit&lt;/td&gt;
&lt;td&gt;Application logs, timestamps, request IDs, evidence it was unintentional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Account compromised / key stolen&lt;/td&gt;
&lt;td&gt;🟢 Usually resolved in customer's favour&lt;/td&gt;
&lt;td&gt;Report immediately. Show usage inconsistent with normal activity (IPs, models, times).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider outage caused excessive retries&lt;/td&gt;
&lt;td&gt;🟢 Usually credited&lt;/td&gt;
&lt;td&gt;Reference the outage from the provider's status page with matching timestamps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Didn't realise a model was expensive&lt;/td&gt;
&lt;td&gt;🔴 Very unlikely&lt;/td&gt;
&lt;td&gt;Pricing is publicly listed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Forgot to cancel" / dev env left running&lt;/td&gt;
&lt;td&gt;🔴 Unlikely&lt;/td&gt;
&lt;td&gt;This is what spend limits are for&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How to Handle the Ticket
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer submits refund request
  │
  ├── Evidence of account compromise?
  │     → YES: Flag as security incident.
  │             Ask customer to rotate API key immediately.
  │             Collect: unusual IPs, models used, timestamps.
  │             Escalate to security / trust &amp;amp; safety team.
  │
  ├── Matching provider outage at that time?
  │     → YES: Cross-reference with provider's status page.
  │             If confirmed, credit is likely appropriate. Escalate to billing team.
  │
  ├── Clear code bug with log evidence?
  │     → Collect: timestamps, request IDs, total requests vs. normal baseline.
  │       Escalate to billing team with evidence.
  │       Do NOT promise a refund — only the billing team can approve.
  │
  └── No clear evidence / "I just forgot"?
        → Empathise but set expectations honestly.
          Recommend: hard spend limit + auto-recharge threshold.
          Offer to help configure it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Information to Collect Before Escalating
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Info Needed&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Account / Org ID&lt;/td&gt;
&lt;td&gt;Identifies the account for the billing team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Date range of charges in question&lt;/td&gt;
&lt;td&gt;Narrows the investigation window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request IDs if available&lt;/td&gt;
&lt;td&gt;Allows billing team to trace exact usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Description of what went wrong (customer's words)&lt;/td&gt;
&lt;td&gt;Establishes intent and context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supporting logs or screenshots&lt;/td&gt;
&lt;td&gt;Evidence for goodwill consideration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
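&lt;p&gt;The checklist above can be enforced as a simple completeness gate before a ticket reaches the billing team. A hypothetical helper (the field names are assumptions, not a real ticketing API):&lt;/p&gt;

```python
# Hypothetical field names mirroring the checklist above.
REQUIRED_FIELDS = ("account_id", "date_range", "description")
OPTIONAL_FIELDS = ("request_ids", "logs_or_screenshots")

def escalation_payload(ticket: dict) -> dict:
    """Validate that an escalation carries the required evidence.

    Raises ValueError listing anything missing, so incomplete tickets
    never reach the billing team.
    """
    missing = [f for f in REQUIRED_FIELDS if not ticket.get(f)]
    if missing:
        raise ValueError(f"Cannot escalate -- missing: {', '.join(missing)}")
    return {f: ticket.get(f) for f in REQUIRED_FIELDS + OPTIONAL_FIELDS}
```

&lt;p&gt;A gate like this keeps the round trips down: the billing team gets everything it needs on the first pass, and the customer isn't asked for the same details twice.&lt;/p&gt;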

&lt;blockquote&gt;
&lt;p&gt;🔴 &lt;strong&gt;Never promise a refund.&lt;/strong&gt; Only the billing team can approve credits. Promising what you can't deliver creates a worse outcome than being upfront from the start.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What to Tell the Customer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I understand this is frustrating. The general policy is that API usage is non-refundable since the compute was consumed, but I'll escalate this to our billing team with the details you've shared. They'll review it and follow up. In the meantime, I'd recommend setting a monthly spend limit so this can't happen again — I can walk you through that now if you'd like."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Incident 6 — Account Suspension
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Account locked due to policy violation or fraud flag&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"My account was suddenly disabled. I didn't do anything wrong."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I'm getting 401 errors on a key that worked yesterday."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I got an email saying my account violated usage policies but I don't understand why."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Accounts Get Suspended
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Suspension Type&lt;/th&gt;
&lt;th&gt;Common Triggers&lt;/th&gt;
&lt;th&gt;Who Handles It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automated — Policy violation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Usage patterns matching prohibited use cases, abuse detection&lt;/td&gt;
&lt;td&gt;Trust &amp;amp; Safety team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automated — Fraud flag&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Suspicious payment method, unusual signup signals, sanctioned region&lt;/td&gt;
&lt;td&gt;Trust &amp;amp; Safety / Finance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manual — Policy violation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reported abuse, investigation-triggered review&lt;/td&gt;
&lt;td&gt;Trust &amp;amp; Safety team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manual — Outstanding balance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Invoice not paid after repeated reminders&lt;/td&gt;
&lt;td&gt;Finance / Billing team&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Diagnosis Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer reports account suspended / 401 on all keys
  │
  ├── Can the customer log into the provider dashboard?
  │     │
  │     ├── Login WORKS but API fails
  │     │     → NOT an account suspension.
  │     │       This is a key-level issue.
  │     │       → Treat as Incident 1 or investigate API key directly.
  │     │
  │     └── Login FAILS
  │           → Account-level suspension confirmed. Continue below.
  │
  ├── Did the customer receive a suspension email?
  │     ├── YES — policy violation notice
  │     │     → Route to Trust &amp;amp; Safety.
  │     │       Do NOT reinstate at support level.
  │     │       Do NOT share what triggered the automated system.
  │     │
  │     ├── YES — payment / fraud notice
  │     │     → Outstanding invoice? Route to Finance.
  │     │       Fraud flag?          Route to Trust &amp;amp; Safety.
  │     │
  │     └── NO email received
  │           → Check internally if account is flagged.
  │             Could also be a key issue rather than true suspension.
  │
  └── Customer wants to appeal?
        → Direct to provider's official support/appeal process.
          Do NOT bypass or pre-approve reinstatement at support level.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What You Can and Cannot Do
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Support Engineer CAN&lt;/th&gt;
&lt;th&gt;Support Engineer CANNOT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Policy suspension&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Confirm suspension, route to T&amp;amp;S, explain appeal process&lt;/td&gt;
&lt;td&gt;Reinstate the account, share what triggered the suspension&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fraud flag&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Confirm status, collect info, route to correct team&lt;/td&gt;
&lt;td&gt;Lift the fraud flag, process reinstatement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Outstanding invoice&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Confirm invoice exists, direct to payment, route to Finance&lt;/td&gt;
&lt;td&gt;Waive the amount, manually reinstate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;🔴 &lt;strong&gt;Do not reinstate suspended accounts at the support level.&lt;/strong&gt; All reinstatements for policy or fraud-related suspensions must go through Trust &amp;amp; Safety. Bypassing this process creates liability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What to Tell the Customer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I can see your account has been suspended. I've escalated this to the appropriate team for review. You can also submit a formal appeal through the provider's support portal — include your account ID and a description of your use case. The team will review and respond. I'm not able to share details of what triggered the review, but the appeals team will have full context."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Master Decision Tree
&lt;/h2&gt;

&lt;p&gt;Start here for every billing or account ticket. The error code is the most reliable entry point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Billing or account ticket received
  │
  ├── STEP 1: Check the provider's status page
  │     Active incident? → Inform customer, monitor, close when resolved.
  │     No incident?     → Continue.
  │
  ├── STEP 2: What error is the customer seeing?
  │     │
  │     ├── 402 Payment Required
  │     │     ├── Balance $0 or card failed?    → Incident 1 (Payment Failure)
  │     │     └── Hard spending limit reached?  → Incident 3 (Spending Limit)
  │     │
  │     ├── 401 Unauthorized
  │     │     ├── Account suspended?            → Incident 6 (Account Suspension)
  │     │     └── Key issue (no suspension)?    → API key troubleshooting
  │     │
  │     ├── 403 Forbidden
  │     │     └── Free tier / model access?     → Incident 4 (Free Tier Expiry)
  │     │
  │     ├── No specific error / vague report
  │     │     ├── "Bill too high"               → Incident 2 (Unexpected High Bill)
  │     │     ├── "Want a refund"               → Incident 5 (Refund Request)
  │     │     └── "Account locked"              → Incident 6 (Account Suspension)
  │     │
  │     └── 429 Too Many Requests
  │           → NOT a billing issue.
  │             See the Rate Limits Runbook.
  │
  └── STEP 3: After resolution
        → Send post-resolution recommendation (see each incident section above)
        → Log case notes: incident type, root cause, fix applied
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
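&lt;p&gt;The routing above can be sketched as a small triage function. This is illustrative only (the incident labels are this runbook's own, and real triage also checks the status page and billing dashboard first):&lt;/p&gt;

```python
def triage(status_code=None, symptom=""):
    """Route a billing/account ticket to an incident type, following the decision tree."""
    s = symptom.lower()
    if status_code == 402:
        # Hard spending limit vs. payment failure
        return "Incident 3 (Spending Limit)" if "limit" in s else "Incident 1 (Payment Failure)"
    if status_code == 401:
        return "Incident 6 (Account Suspension)" if "suspend" in s else "API key troubleshooting"
    if status_code == 403:
        return "Incident 4 (Free Tier Expiry)"
    if status_code == 429:
        return "Rate Limits Runbook (not billing)"
    # No specific error: classify by the customer's wording
    if "bill" in s:
        return "Incident 2 (Unexpected High Bill)"
    if "refund" in s:
        return "Incident 5 (Refund Request)"
    if "locked" in s or "suspend" in s:
        return "Incident 6 (Account Suspension)"
    return "Needs more info: ask for the exact error code"
```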






&lt;h2&gt;
  
  
  ✅ Support Engineer Troubleshooting Checklist
&lt;/h2&gt;

&lt;p&gt;Work through this top to bottom for every billing or account ticket.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔍 Step 1 — Initial Triage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Check the &lt;strong&gt;provider's status page&lt;/strong&gt; for active incidents — stop here if one exists&lt;/li&gt;
&lt;li&gt;[ ] Get the &lt;strong&gt;exact HTTP status code&lt;/strong&gt; from the customer's logs (402, 401, 403, 429)&lt;/li&gt;
&lt;li&gt;[ ] Get the &lt;strong&gt;exact error message&lt;/strong&gt; from the response body (e.g. "insufficient_quota", "invalid_api_key")&lt;/li&gt;
&lt;li&gt;[ ] Confirm the &lt;strong&gt;Account / Org ID&lt;/strong&gt; (found in provider dashboard → Settings → Organization)&lt;/li&gt;
&lt;li&gt;[ ] Get &lt;strong&gt;timestamp of last successful request&lt;/strong&gt; and &lt;strong&gt;first failed request&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  💳 Step 2 — Billing Dashboard Check
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Check &lt;strong&gt;payment method status&lt;/strong&gt; — any red banners or declined payments?&lt;/li&gt;
&lt;li&gt;[ ] Check &lt;strong&gt;credit balance&lt;/strong&gt; — is it $0? Is auto-recharge enabled?&lt;/li&gt;
&lt;li&gt;[ ] Check &lt;strong&gt;spending limits&lt;/strong&gt; — has the hard limit been reached this month?&lt;/li&gt;
&lt;li&gt;[ ] Check &lt;strong&gt;account tier&lt;/strong&gt; — Free / Paid Tier 1 / Higher? Does it match what the customer expects?&lt;/li&gt;
&lt;li&gt;[ ] Check for &lt;strong&gt;outstanding invoices&lt;/strong&gt; (enterprise / invoice-billed accounts)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  📊 Step 3 — Usage Investigation &lt;em&gt;(for high-bill tickets)&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Open &lt;strong&gt;Usage dashboard&lt;/strong&gt; for the billing period in question&lt;/li&gt;
&lt;li&gt;[ ] Look for a &lt;strong&gt;single-day spike&lt;/strong&gt; — note the date&lt;/li&gt;
&lt;li&gt;[ ] Filter by &lt;strong&gt;model&lt;/strong&gt; — did usage shift to a more expensive model mid-month?&lt;/li&gt;
&lt;li&gt;[ ] Check &lt;strong&gt;tokens per request&lt;/strong&gt; — high count = context bloat&lt;/li&gt;
&lt;li&gt;[ ] Confirm &lt;strong&gt;usage dashboard total matches invoice total&lt;/strong&gt; — discrepancy? Escalate with both figures&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🔒 Step 4 — Account Status Check &lt;em&gt;(for 401 / suspension tickets)&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Can the customer &lt;strong&gt;log into the provider dashboard&lt;/strong&gt;? Login works but API fails = key issue, not suspension&lt;/li&gt;
&lt;li&gt;[ ] Did the customer receive a &lt;strong&gt;suspension email&lt;/strong&gt;? Policy violation? Fraud flag? Outstanding balance?&lt;/li&gt;
&lt;li&gt;[ ] Verify the &lt;strong&gt;API key is organization-level&lt;/strong&gt;, not a personal key from a departed team member&lt;/li&gt;
&lt;li&gt;[ ] For suspension: &lt;strong&gt;route to Trust &amp;amp; Safety&lt;/strong&gt; — do NOT reinstate at support level&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  📋 Step 5 — Resolution &amp;amp; Close-out
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Confirm &lt;strong&gt;API is working again&lt;/strong&gt; before closing the ticket&lt;/li&gt;
&lt;li&gt;[ ] Send the appropriate &lt;strong&gt;post-resolution recommendation&lt;/strong&gt; based on root cause&lt;/li&gt;
&lt;li&gt;[ ] Add &lt;strong&gt;case notes&lt;/strong&gt;: incident type, root cause, fix applied, recommendation given&lt;/li&gt;
&lt;li&gt;[ ] If escalated: confirm &lt;strong&gt;escalation was received&lt;/strong&gt; with a follow-up timeline set for the customer&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚠️ Always — Safety &amp;amp; Escalation Rules
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Never ask for a full API key&lt;/strong&gt; — if the customer sends one, tell them to rotate it immediately&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Never promise a refund&lt;/strong&gt; — only the billing team can approve credits&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Never reinstate a suspended account&lt;/strong&gt; at the support level — all reinstatements go through Trust &amp;amp; Safety&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;A general troubleshooting reference for support engineers working with AI API providers. Patterns apply across providers — OpenAI, Anthropic, Google, Cohere, and others follow similar billing models.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>billing</category>
      <category>support</category>
    </item>
    <item>
      <title>API Rate Limits &amp; Throttling: What's Actually Happening and How to Fix It</title>
      <dc:creator>Sindhu Murthy</dc:creator>
      <pubDate>Tue, 17 Feb 2026 19:56:07 +0000</pubDate>
      <link>https://dev.to/sindhu_murthy_628835a359d/api-rate-limits-throttling-whats-actually-happening-and-how-to-fix-it-4gk5</link>
      <guid>https://dev.to/sindhu_murthy_628835a359d/api-rate-limits-throttling-whats-actually-happening-and-how-to-fix-it-4gk5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Rate limiting is the #1 reason AI API calls fail in production. It's not a bug — it's the provider protecting their infrastructure. This guide explains what's happening, how to read the signals, and how to stop it from breaking your app.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;Your app has been running fine for weeks. Then on a Monday morning, users start seeing errors. Not everyone — just some. The errors come and go. Sometimes the same question works on the second try.&lt;/p&gt;

&lt;p&gt;Your logs are full of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP 429 — Too Many Requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're being rate limited. And if you handle it wrong, you'll make it worse.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Rate Limiting?
&lt;/h2&gt;

&lt;p&gt;Think of a highway on-ramp with a traffic light. When too many cars try to merge at once, the light turns red and lets them through one at a time. Nobody's banned from the highway — they just have to wait their turn.&lt;/p&gt;

&lt;p&gt;AI providers (OpenAI, Anthropic, Google) work the same way. When too many requests come in, they start telling some customers: &lt;strong&gt;"Slow down."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's a rate limit. It's not an error in your code. It's the provider saying: "I can handle your request, just not right now."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum number of requests allowed in a time window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throttling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The provider actively slowing down or rejecting your requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429 status code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The HTTP response that means "too many requests"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quota&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your total allocation (per minute, per day, or per month)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Three Types of Rate Limits
&lt;/h2&gt;

&lt;p&gt;Most people think there's one rate limit. There are actually three, and they trigger independently.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What It Limits&lt;/th&gt;
&lt;th&gt;Example Limit&lt;/th&gt;
&lt;th&gt;How You Hit It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requests per minute (RPM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Number of API calls&lt;/td&gt;
&lt;td&gt;60 RPM&lt;/td&gt;
&lt;td&gt;Sending too many questions, even short ones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tokens per minute (TPM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total tokens processed&lt;/td&gt;
&lt;td&gt;90,000 TPM&lt;/td&gt;
&lt;td&gt;Sending fewer requests, but each one is huge (long documents, big prompts)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tokens per day (TPD)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Daily token budget&lt;/td&gt;
&lt;td&gt;1,000,000 TPD&lt;/td&gt;
&lt;td&gt;Sustained high usage over hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; You can hit TPM while staying under RPM. A single request with a 50,000-token document eats more than half your minute's budget. You only sent one request — but you're already throttled. Always check your provider's current documentation for exact limits — they change frequently and vary by tier.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How to Read a 429 Error
&lt;/h2&gt;

&lt;p&gt;When you get rate limited, the provider doesn't just say "no." They tell you &lt;strong&gt;when to try again&lt;/strong&gt;. Most people ignore this information.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Response Headers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP/1.1 429 Too Many Requests
retry-after: 2
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 12s
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 0
x-ratelimit-reset-tokens: 28s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Header&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retry-after&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seconds to wait before trying again. &lt;strong&gt;Use this number.&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-limit-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your RPM cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-remaining-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many requests you have left this window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-reset-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When your request limit resets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-limit-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your TPM cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-remaining-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many tokens you have left this window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-reset-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When your token limit resets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You get a 429. The &lt;code&gt;retry-after&lt;/code&gt; header says &lt;code&gt;2&lt;/code&gt;. That means: wait 2 seconds and try again. Not 0 seconds. Not 30 seconds. Exactly 2. The provider is literally telling you the answer.&lt;/p&gt;
&lt;/blockquote&gt;
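&lt;p&gt;A minimal sketch of pulling these values out of a response's headers. It assumes the header names shown above, which vary by provider, so check your provider's current documentation:&lt;/p&gt;

```python
def parse_rate_limit_headers(headers):
    """Extract retry guidance from a 429 response (header names vary by provider)."""
    return {
        "retry_after_s": float(headers.get("retry-after", 1)),
        "remaining_requests": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "remaining_tokens": int(headers.get("x-ratelimit-remaining-tokens", 0)),
    }

# Using the headers from the response above:
info = parse_rate_limit_headers({
    "retry-after": "2",
    "x-ratelimit-remaining-requests": "0",
    "x-ratelimit-remaining-tokens": "0",
})
```

&lt;p&gt;&lt;code&gt;info["retry_after_s"]&lt;/code&gt; is the number to feed into your retry logic.&lt;/p&gt;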




&lt;h2&gt;
  
  
  Status Codes: Which Errors to Retry
&lt;/h2&gt;

&lt;p&gt;Not every error is a rate limit. Here's the simple rule:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Retry?&lt;/th&gt;
&lt;th&gt;What to Do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Too Many Requests&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wait and retry with backoff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server Error&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Once&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Try once more, then check the provider's status page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;503&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service Unavailable&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Provider is overloaded — wait and retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;400&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bad Request&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your request is malformed — fix your code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;401&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unauthorized&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your API key is invalid or expired — fix it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;403&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Forbidden&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your key doesn't have permission for this model or action&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The key rule:&lt;/strong&gt; Only retry on 429, 500, and 503. Everything else means something is wrong on your end — retrying won't help.&lt;/p&gt;
&lt;/blockquote&gt;
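&lt;p&gt;The rule fits in a few lines. A sketch, assuming the status codes in the table above:&lt;/p&gt;

```python
RETRYABLE = {429, 500, 503}  # wait-and-retry; everything else means fix your request

def should_retry(status_code):
    """Apply the table's rule: only retry when the provider might succeed later."""
    return status_code in RETRYABLE
```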




&lt;h2&gt;
  
  
  The Retry Problem (And Why Most Teams Make It Worse)
&lt;/h2&gt;

&lt;p&gt;Here's what happens when teams don't handle rate limits properly:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Retry Storm
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request fails (429)
  → Code immediately retries
    → Also fails (429) — still in the same window
      → Code retries again
        → Also fails
          → 3 users are now each retrying 5 times
            → 15 requests where there were 3
              → Rate limit is now 5x worse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is called a &lt;strong&gt;retry storm&lt;/strong&gt;. Your retry logic is creating more traffic, which causes more 429s, which causes more retries. It's a death spiral.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Retry Approach&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No retry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User sees an error&lt;/td&gt;
&lt;td&gt;Bad UX, but no damage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Immediate retry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same request hits the same limit&lt;/td&gt;
&lt;td&gt;Retry storm — makes it worse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Fixed delay&lt;/strong&gt; (wait 1s every time)&lt;/td&gt;
&lt;td&gt;All retries fire at the same time&lt;/td&gt;
&lt;td&gt;Thundering herd — same problem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exponential backoff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wait 1s, 2s, 4s, 8s&lt;/td&gt;
&lt;td&gt;Spreads load, gives limits time to reset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exponential backoff + jitter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same as above + random 0-1s added&lt;/td&gt;
&lt;td&gt;Prevents synchronized retries across users&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Right Way: Exponential Backoff with Jitter
&lt;/h2&gt;

&lt;p&gt;Instead of retrying immediately (which makes things worse), wait a little longer each time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;First retry:&lt;/strong&gt; wait ~1 second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second retry:&lt;/strong&gt; wait ~2 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third retry:&lt;/strong&gt; wait ~4 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep doubling&lt;/strong&gt; up to a max of 5 retries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Still failing?&lt;/strong&gt; Stop and show the user a helpful error&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Add a small random delay ("jitter") to each wait so that multiple users don't all retry at the exact same moment.&lt;/p&gt;

&lt;p&gt;That's it. Double the wait each time, add a pinch of randomness, and give up after 5 tries.&lt;/p&gt;
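&lt;p&gt;Those five steps can be sketched in Python. Here &lt;code&gt;request_fn&lt;/code&gt; is a hypothetical stand-in for your real API call:&lt;/p&gt;

```python
import random
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0):
    """Retry on 429/500/503 with exponential backoff plus jitter.

    request_fn is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = request_fn()
        if status not in (429, 500, 503):
            return status, body  # success or a non-retryable error
        if attempt == max_retries:
            break  # give up and surface the error to the caller
        # 1s, 2s, 4s, 8s... plus up to base_delay of random jitter
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        time.sleep(delay)
    return status, body
```

&lt;p&gt;If the response includes a &lt;code&gt;retry-after&lt;/code&gt; header, prefer that value over the computed delay.&lt;/p&gt;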




&lt;h2&gt;
  
  
  Preventing Rate Limits Before They Happen
&lt;/h2&gt;

&lt;p&gt;Three strategies, in order of impact:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Request Queuing
&lt;/h3&gt;

&lt;p&gt;Without a queue, every user hits the API directly. With a queue, your app controls the flow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITHOUT QUEUE:
  User A ──→ API
  User B ──→ API     →  100 simultaneous calls  →  429s
  User C ──→ API
  ...
  User Z ──→ API

WITH QUEUE:
  User A ──┐
  User B ──┤
  User C ──┼──→ Queue ──→ 10 requests/sec ──→ API  →  No 429s
  ...      │
  User Z ──┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users A and B get instant responses. User Z waits a few seconds. Nobody gets an error. The queue absorbs the traffic spike and releases it at a rate the API can handle.&lt;/p&gt;
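&lt;p&gt;A minimal in-process sketch of this idea using Python's standard library. A production system would more likely use a shared queue (Redis, SQS) and a proper worker pool, but the principle is the same: one consumer paces the calls.&lt;/p&gt;

```python
import queue
import threading
import time

class PacedQueue:
    """Release queued jobs at a fixed rate so bursts never hit the API all at once."""

    def __init__(self, per_second):
        self.interval = 1.0 / per_second
        self.jobs = queue.Queue()      # FIFO: users are served in arrival order
        self.results = []
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, fn):
        self.jobs.put(fn)

    def _drain(self):
        while True:
            fn = self.jobs.get()
            self.results.append(fn())  # one call...
            time.sleep(self.interval)  # ...then wait before releasing the next
```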

&lt;h3&gt;
  
  
  2. Caching
&lt;/h3&gt;

&lt;p&gt;If 200 users ask "How do I reset my password?" in one day — why call the API 200 times?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exact match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same question → cached answer&lt;/td&gt;
&lt;td&gt;FAQs, common queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic cache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Similar questions → cached answer&lt;/td&gt;
&lt;td&gt;Support bots, knowledge bases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTL-based&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cache expires after X minutes&lt;/td&gt;
&lt;td&gt;Data that changes periodically&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; 200 identical questions per day. Without cache: 200 API calls. With cache: 1 API call + 199 cache hits. Rate limit usage drops by 99.5%.&lt;/p&gt;
&lt;/blockquote&gt;
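&lt;p&gt;A minimal exact-match cache with a TTL. This is a sketch; &lt;code&gt;api_call&lt;/code&gt; stands in for your real provider call:&lt;/p&gt;

```python
import time

class TTLCache:
    """Exact-match cache: identical questions reuse one answer until the TTL expires."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self.store = {}  # question -> (answer, expiry_time)

    def get_or_call(self, question, api_call):
        hit = self.store.get(question)
        if hit and hit[1] > time.time():
            return hit[0]                # cache hit: no API call, no rate-limit cost
        answer = api_call(question)      # cache miss: one real call
        self.store[question] = (answer, time.time() + self.ttl)
        return answer
```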

&lt;h3&gt;
  
  
  3. Smaller Prompts
&lt;/h3&gt;

&lt;p&gt;TPM limits are about total tokens. A 10,000-token request eats 100x more budget than a 100-token request.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Token Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Send only relevant chunks, not full documents&lt;/td&gt;
&lt;td&gt;30-60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shorter system prompts&lt;/td&gt;
&lt;td&gt;10-20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarize long docs with a cheap model first&lt;/td&gt;
&lt;td&gt;50-70%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Monitoring: What to Watch
&lt;/h2&gt;

&lt;p&gt;Don't wait for users to report 429s. Watch these numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Warning&lt;/th&gt;
&lt;th&gt;Critical&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RPM usage %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of limit&lt;/td&gt;
&lt;td&gt;90% of limit&lt;/td&gt;
&lt;td&gt;Enable queuing or caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TPM usage %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of limit&lt;/td&gt;
&lt;td&gt;90% of limit&lt;/td&gt;
&lt;td&gt;Optimize prompt sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429 count/hour&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;10+ per hour&lt;/td&gt;
&lt;td&gt;Check for retry storms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retry rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5% of requests&lt;/td&gt;
&lt;td&gt;15% of requests&lt;/td&gt;
&lt;td&gt;Backoff isn't aggressive enough&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P95 response time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 seconds&lt;/td&gt;
&lt;td&gt;15 seconds&lt;/td&gt;
&lt;td&gt;Rate limit delays hitting UX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daily token spend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of TPD&lt;/td&gt;
&lt;td&gt;90% of TPD&lt;/td&gt;
&lt;td&gt;Will run out of daily quota&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
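&lt;p&gt;The warning/critical thresholds above reduce to a simple check you can run against the &lt;code&gt;x-ratelimit-*&lt;/code&gt; values from each response. A sketch:&lt;/p&gt;

```python
def usage_alert(used, limit):
    """Map utilization onto the table's thresholds: 70% warning, 90% critical."""
    pct = used / limit * 100
    if pct >= 90:
        return "critical"
    if pct >= 70:
        return "warning"
    return None  # healthy: no alert
```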




&lt;h2&gt;
  
  
  Enterprise: The Noisy Neighbor Problem
&lt;/h2&gt;

&lt;p&gt;One enterprise customer runs a batch job — 500 requests in a minute. Your shared API key gets rate limited. Now &lt;strong&gt;every&lt;/strong&gt; customer is affected.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One customer blocks everyone&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Per-tenant rate limiting&lt;/strong&gt; — your app enforces limits per customer before hitting the API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time chat delayed by batch jobs&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Priority queues&lt;/strong&gt; — chat requests go before batch jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared key runs out of quota&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Separate API keys&lt;/strong&gt; — different keys for different customers or use cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unpredictable usage spikes&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Batch vs. real-time separation&lt;/strong&gt; — batch jobs use a different key with lower priority&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
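&lt;p&gt;Per-tenant rate limiting is typically a token bucket keyed by tenant, enforced in your app before the request reaches the shared API key. A minimal sketch (the rates and burst sizes are made-up examples):&lt;/p&gt;

```python
import time
from collections import defaultdict

class TenantLimiter:
    """Per-tenant token bucket: a noisy tenant drains its own bucket, not the shared key."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = defaultdict(
            lambda: {"tokens": float(burst), "last": time.monotonic()}
        )

    def allow(self, tenant_id):
        b = self.buckets[tenant_id]
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at the burst size
        b["tokens"] = min(self.burst, b["tokens"] + (now - b["last"]) * self.rate)
        b["last"] = now
        if b["tokens"] >= 1:
            b["tokens"] -= 1
            return True
        return False  # this tenant must wait; other tenants are unaffected
```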




&lt;h2&gt;
  
  
  Troubleshooting Checklist
&lt;/h2&gt;

&lt;p&gt;When 429s start showing up, work through this in order:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What to Check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;x-ratelimit-remaining-requests&lt;/code&gt; and &lt;code&gt;x-ratelimit-remaining-tokens&lt;/code&gt; — which limit did you hit?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Is it RPM or TPM? Too many requests or too many tokens per request?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Check for retry storms — is your retry count multiplying the problem?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;retry-after&lt;/code&gt; header — are you waiting the recommended time?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Check if one user or tenant is consuming disproportionate quota&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Check prompt sizes — did someone add a huge system prompt or send large documents?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Check for duplicate requests — is the frontend sending the same request multiple times?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Check your tier — did you recently exceed a billing threshold that changes your limits?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Check provider status page — is the provider having capacity issues?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Check time of day — peak hours (US business hours) have tighter effective limits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Common Patterns Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;429s for everyone at once&lt;/td&gt;
&lt;td&gt;Shared rate limit exhausted&lt;/td&gt;
&lt;td&gt;Per-tenant limits or request queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s for one customer only&lt;/td&gt;
&lt;td&gt;That customer is sending too much&lt;/td&gt;
&lt;td&gt;Per-customer throttling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s only during peak hours&lt;/td&gt;
&lt;td&gt;Hitting RPM at high traffic times&lt;/td&gt;
&lt;td&gt;Queue + cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s after deploying new feature&lt;/td&gt;
&lt;td&gt;New feature sends more or larger requests&lt;/td&gt;
&lt;td&gt;Audit token usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s that get worse over time&lt;/td&gt;
&lt;td&gt;Retry storm&lt;/td&gt;
&lt;td&gt;Exponential backoff + jitter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s on token limit but low RPM&lt;/td&gt;
&lt;td&gt;Sending very large prompts&lt;/td&gt;
&lt;td&gt;Reduce context and prompt size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermittent 429s, no pattern&lt;/td&gt;
&lt;td&gt;Hovering near the limit&lt;/td&gt;
&lt;td&gt;Add 20% buffer below your limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s after a billing change&lt;/td&gt;
&lt;td&gt;Tier downgrade reduced limits&lt;/td&gt;
&lt;td&gt;Check provider dashboard for current tier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Rate limits aren't bugs. They're a feature of every AI API. The difference between a junior and a senior engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Junior:&lt;/strong&gt; "The API is broken, it keeps returning errors."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Senior:&lt;/strong&gt; "We're hitting our TPM limit during peak hours. I'm adding a request queue with exponential backoff and caching frequent queries. That should keep us under 70% utilization."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Know your limits. Monitor your usage. Retry smart, not fast. And when in doubt, check the headers — the answer is usually right there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>API Rate Limits &amp; Throttling: What's Actually Happening and How to Fix It</title>
      <dc:creator>Sindhu Murthy</dc:creator>
      <pubDate>Tue, 17 Feb 2026 06:10:40 +0000</pubDate>
      <link>https://dev.to/sindhu_murthy_628835a359d/api-rate-limits-throttling-whats-actually-happening-and-how-to-fix-it-2lc3</link>
      <guid>https://dev.to/sindhu_murthy_628835a359d/api-rate-limits-throttling-whats-actually-happening-and-how-to-fix-it-2lc3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Rate limiting is the #1 reason AI API calls fail in production. It's not a bug — it's the provider protecting their infrastructure. This guide explains what's happening, how to read the signals, and how to stop it from breaking your app.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;Your app has been running fine for weeks. Then on a Monday morning, users start seeing errors. Not everyone — just some. The errors come and go. Sometimes the same question works on the second try.&lt;/p&gt;

&lt;p&gt;Your logs are full of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP 429 — Too Many Requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're being rate limited. And if you handle it wrong, you'll make it worse.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Rate Limiting?
&lt;/h2&gt;

&lt;p&gt;Imagine a restaurant with 10 tables. You can't seat 50 people at once — you ask some to wait.&lt;/p&gt;

&lt;p&gt;AI providers (OpenAI, Anthropic, Google) do the same thing. Their servers have capacity limits. When too many requests come in, they start telling some customers: &lt;strong&gt;"Slow down."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's a rate limit. It's not an error in your code. It's the provider saying: "I can handle your request, just not right now."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum number of requests allowed in a time window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throttling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The provider actively slowing down or rejecting your requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429 status code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The HTTP response that means "too many requests"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quota&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your total allocation (per minute, per day, or per month)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Three Types of Rate Limits
&lt;/h2&gt;

&lt;p&gt;Most people think there's one rate limit. There are actually three, and they trigger independently.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What It Limits&lt;/th&gt;
&lt;th&gt;Example Limit&lt;/th&gt;
&lt;th&gt;How You Hit It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requests per minute (RPM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Number of API calls&lt;/td&gt;
&lt;td&gt;60 RPM&lt;/td&gt;
&lt;td&gt;Sending too many questions, even short ones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tokens per minute (TPM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total tokens processed&lt;/td&gt;
&lt;td&gt;90,000 TPM&lt;/td&gt;
&lt;td&gt;Sending fewer requests, but each one is huge (long documents, big prompts)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tokens per day (TPD)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Daily token budget&lt;/td&gt;
&lt;td&gt;1,000,000 TPD&lt;/td&gt;
&lt;td&gt;Sustained high usage over hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; You can hit TPM while staying under RPM. A single request with a 50,000-token document eats more than half your minute's budget. You only sent one request — but you're already throttled. Always check your provider's current documentation for exact limits — they change frequently and vary by tier.&lt;/p&gt;
&lt;/blockquote&gt;
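The arithmetic above is worth making concrete. A minimal sketch, assuming a 90,000 TPM limit (the function name and figures are illustrative, not any provider's API):

```python
def tpm_utilization(tokens_this_minute: int, tpm_limit: int) -> float:
    """Fraction of the per-minute token budget already consumed."""
    return tokens_this_minute / tpm_limit

# One 50,000-token request against a 90,000 TPM limit:
used = tpm_utilization(50_000, 90_000)
print(f"{used:.0%} of the minute's budget gone after a single request")  # → 56%
```

One request, and more than half the window is spent before any other traffic arrives.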




&lt;h2&gt;
  
  
  How to Read a 429 Error
&lt;/h2&gt;

&lt;p&gt;When you get rate limited, the provider doesn't just say "no." They tell you &lt;strong&gt;when to try again&lt;/strong&gt;. Most people ignore this information.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Response Headers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP/1.1 429 Too Many Requests
retry-after: 2
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 12s
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 0
x-ratelimit-reset-tokens: 28s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Header&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retry-after&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seconds to wait before trying again. &lt;strong&gt;Use this number.&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-limit-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your RPM cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-remaining-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many requests you have left this window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-reset-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When your request limit resets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-limit-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your TPM cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-remaining-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many tokens you have left this window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-reset-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When your token limit resets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You get a 429. The &lt;code&gt;retry-after&lt;/code&gt; header says &lt;code&gt;2&lt;/code&gt;. That means: wait 2 seconds and try again. Not 0 seconds. Not 30 seconds. Exactly 2. The provider is literally telling you the answer.&lt;/p&gt;
&lt;/blockquote&gt;
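Honoring these headers takes only a few lines. A sketch, assuming the headers arrive as a plain dict of lowercase strings and the reset values use the simple `"12s"` form shown above (the header names are real; the helper itself is illustrative):

```python
def seconds_to_wait(headers: dict) -> float:
    """Prefer the explicit retry-after hint; fall back to the reset timers."""
    if "retry-after" in headers:
        return float(headers["retry-after"])
    # Reset headers look like "12s"; wait for the longer of the two windows.
    resets = []
    for name in ("x-ratelimit-reset-requests", "x-ratelimit-reset-tokens"):
        value = headers.get(name)
        if value and value.endswith("s"):
            resets.append(float(value[:-1]))
    return max(resets, default=1.0)

headers = {
    "retry-after": "2",
    "x-ratelimit-reset-requests": "12s",
    "x-ratelimit-reset-tokens": "28s",
}
print(seconds_to_wait(headers))  # → 2.0
```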




&lt;h2&gt;
  
  
  Status Codes: What Each One Means and When to Retry
&lt;/h2&gt;

&lt;p&gt;Not every error is a rate limit. Different status codes mean different things — and some should never be retried.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retryable Errors
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Retry?&lt;/th&gt;
&lt;th&gt;Real-World Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Too Many Requests&lt;/td&gt;
&lt;td&gt;Yes, with backoff&lt;/td&gt;
&lt;td&gt;Your app sends 80 requests in a minute. Your limit is 60 RPM. Requests 61-80 all come back as 429. Wait for the window to reset.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;503&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service Unavailable&lt;/td&gt;
&lt;td&gt;Yes, with backoff&lt;/td&gt;
&lt;td&gt;It's 2 PM EST on a Tuesday. OpenAI's GPT-4o is overloaded because every company in the US is using it. Your request gets a 503. Try again in a few seconds — or switch to a less busy model.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Internal Server Error&lt;/td&gt;
&lt;td&gt;Maybe once&lt;/td&gt;
&lt;td&gt;You send a perfectly valid request. The provider's server crashes mid-response. You get a 500 back. Try once more — if it fails again, it's their problem, not yours. Check the status page.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Non-Retryable Errors
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Retry?&lt;/th&gt;
&lt;th&gt;Real-World Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;400&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bad Request&lt;/td&gt;
&lt;td&gt;No — fix your code&lt;/td&gt;
&lt;td&gt;You set &lt;code&gt;temperature&lt;/code&gt; to &lt;code&gt;2.5&lt;/code&gt; but the max allowed is &lt;code&gt;2.0&lt;/code&gt;. Or you send &lt;code&gt;max_tokens: -1&lt;/code&gt;. Or your JSON body is malformed. The API can't understand what you're asking for. Retrying the same bad request will get the same error every time.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;401&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unauthorized&lt;/td&gt;
&lt;td&gt;No — fix your key&lt;/td&gt;
&lt;td&gt;Your API key is &lt;code&gt;sk-abc123...&lt;/code&gt; but it expired last week. Or someone rotated the key and didn't update the environment variable. Or you're sending the key in the wrong header. No amount of retrying will make an invalid key valid.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;403&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Forbidden&lt;/td&gt;
&lt;td&gt;No — fix permissions&lt;/td&gt;
&lt;td&gt;Your API key is valid, but it only has access to GPT-4o-mini. You're trying to call GPT-4o. Or your organization has a policy that blocks certain models. The key works — it just doesn't have permission for what you're asking.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The key rule:&lt;/strong&gt; Only retry on 429 and 503. A 400 means your request is broken. A 401 means your key is wrong. A 403 means you don't have permission. Waiting and retrying won't fix any of those.&lt;/p&gt;
&lt;/blockquote&gt;
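The key rule can be encoded as a small guard before any retry loop. An illustrative sketch; the single-retry allowance for 500 follows the "maybe once" guidance above:

```python
def should_retry(status: int, attempt: int) -> bool:
    """429/503: retryable. 500: retry a single time. Everything else: fix the request."""
    if status in (429, 503):
        return True
    if status == 500:
        return attempt == 1   # one cautious retry, then check the status page
    return False              # 400/401/403: retrying cannot help

print(should_retry(429, 3))  # → True
print(should_retry(401, 1))  # → False
```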




&lt;h2&gt;
  
  
  The Retry Problem (And Why Most Teams Make It Worse)
&lt;/h2&gt;

&lt;p&gt;Here's what happens when teams don't handle rate limits properly:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Retry Storm
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request fails (429)
  → Code immediately retries
    → Also fails (429) — still in the same window
      → Code retries again
        → Also fails
          → 3 users are now each retrying 5 times
            → 15 requests where there were 3
              → Rate limit is now 5x worse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is called a &lt;strong&gt;retry storm&lt;/strong&gt;. Your retry logic is creating more traffic, which causes more 429s, which causes more retries. It's a death spiral.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Retry Approach&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No retry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User sees an error&lt;/td&gt;
&lt;td&gt;Bad UX, but no damage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Immediate retry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same request hits the same limit&lt;/td&gt;
&lt;td&gt;Retry storm — makes it worse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Fixed delay&lt;/strong&gt; (wait 1s every time)&lt;/td&gt;
&lt;td&gt;All retries fire at the same time&lt;/td&gt;
&lt;td&gt;Thundering herd — same problem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exponential backoff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wait 1s, 2s, 4s, 8s&lt;/td&gt;
&lt;td&gt;Spreads load, gives limits time to reset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exponential backoff + jitter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same as above + random 0-1s added&lt;/td&gt;
&lt;td&gt;Prevents synchronized retries across users&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Right Way: Exponential Backoff with Jitter
&lt;/h2&gt;

&lt;p&gt;This is the industry standard. Every provider recommends it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Retry #&lt;/th&gt;
&lt;th&gt;Base Wait&lt;/th&gt;
&lt;th&gt;With Jitter (random 0-1s)&lt;/th&gt;
&lt;th&gt;Total Wait From First Request&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st&lt;/td&gt;
&lt;td&gt;1 second&lt;/td&gt;
&lt;td&gt;1.0 - 2.0s&lt;/td&gt;
&lt;td&gt;~1.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd&lt;/td&gt;
&lt;td&gt;2 seconds&lt;/td&gt;
&lt;td&gt;2.0 - 3.0s&lt;/td&gt;
&lt;td&gt;~4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd&lt;/td&gt;
&lt;td&gt;4 seconds&lt;/td&gt;
&lt;td&gt;4.0 - 5.0s&lt;/td&gt;
&lt;td&gt;~8.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th&lt;/td&gt;
&lt;td&gt;8 seconds&lt;/td&gt;
&lt;td&gt;8.0 - 9.0s&lt;/td&gt;
&lt;td&gt;~17s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5th&lt;/td&gt;
&lt;td&gt;16 seconds&lt;/td&gt;
&lt;td&gt;16.0 - 17.0s&lt;/td&gt;
&lt;td&gt;~34s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Give up&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Show user a helpful error&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Logic in Plain English
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;attempt = 1
max_retries = 5

while attempt &amp;lt;= max_retries:
    response = call_api()

    if response.status == 200:
        return response        # Success — done

    if response.status == 429:
        wait = (2 ^ attempt) + random(0, 1)    # Exponential + jitter
        sleep(wait)
        attempt += 1

    else:
        raise error            # Not a rate limit — don't retry

show_user("Service is busy, please try again in a minute")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
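The same logic as runnable Python. This is a sketch under stated assumptions: `call_api` is a stand-in returning `(status, body)` that you would replace with your real client, and `sleep` is injectable so the loop can be exercised without actually waiting:

```python
import random
import time

def call_with_backoff(call_api, max_retries=5, base=1.0, sleep=time.sleep):
    """Retry 429/503 with exponential backoff plus 0-1s of jitter."""
    for attempt in range(max_retries):
        status, body = call_api()
        if status == 200:
            return body
        if status in (429, 503):
            # Base waits of 1s, 2s, 4s, 8s, 16s; jitter de-synchronizes clients
            sleep(base * (2 ** attempt) + random.uniform(0, 1))
        else:
            raise RuntimeError(f"non-retryable status {status}")
    raise RuntimeError("still rate limited after retries; surface a friendly error")

# Fake API: fails twice with 429, then succeeds.
responses = iter([(429, None), (429, None), (200, "answer")])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)  # "answer", plus the two recorded backoff waits
```

Injecting `sleep` also makes the retry behavior unit-testable: record the waits instead of sleeping and assert they grow.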






&lt;h2&gt;
  
  
  Preventing Rate Limits Before They Happen
&lt;/h2&gt;

&lt;p&gt;Three strategies, in order of impact:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Request Queuing
&lt;/h3&gt;

&lt;p&gt;Without a queue, every user hits the API directly. With a queue, your app controls the flow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITHOUT QUEUE:
  User A ──→ API
  User B ──→ API     →  100 simultaneous calls  →  429s
  User C ──→ API
  ...
  User Z ──→ API

WITH QUEUE:
  User A ──┐
  User B ──┤
  User C ──┼──→ Queue ──→ 10 requests/sec ──→ API  →  No 429s
  ...      │
  User Z ──┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users A and B get instant responses. User Z waits a few seconds. Nobody gets an error. The queue absorbs the traffic spike and releases it at a rate the API can handle.&lt;/p&gt;
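One common way to implement that "release at a steady rate" behavior is a token bucket. A deterministic sketch with the clock passed in as a parameter (the class name and the 10 req/s figure are illustrative):

```python
class TokenBucket:
    """Allows bursts up to `capacity`, then refills at `rate` tokens per second."""
    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller should queue the request, not drop it

bucket = TokenBucket(rate=10, capacity=10)
burst = [bucket.allow(0.0) for _ in range(12)]
print(burst.count(True))   # → 10 (2 requests must wait in the queue)
```

Requests that get `False` go into the queue and are retried as tokens refill, instead of hitting the API and collecting 429s.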

&lt;h3&gt;
  
  
  2. Caching
&lt;/h3&gt;

&lt;p&gt;If 200 users ask "How do I reset my password?" in one day — why call the API 200 times?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exact match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same question → cached answer&lt;/td&gt;
&lt;td&gt;FAQs, common queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic cache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Similar questions → cached answer&lt;/td&gt;
&lt;td&gt;Support bots, knowledge bases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTL-based&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cache expires after X minutes&lt;/td&gt;
&lt;td&gt;Data that changes periodically&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; 200 identical questions per day. Without cache: 200 API calls. With cache: 1 API call + 199 cache hits. Rate limit usage drops by 99.5%.&lt;/p&gt;
&lt;/blockquote&gt;
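An exact-match cache with a TTL is only a few lines. A sketch only: `ask_llm` is a placeholder for the real API call, and the clock is injectable so expiry can be tested:

```python
import time

class TTLCache:
    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.store = {}   # key -> (value, expiry)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if self.clock() >= expiry:
            del self.store[key]   # stale: evict and treat as a miss
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, self.clock() + self.ttl)

calls = 0
def ask_llm(question):            # placeholder for the real API call
    global calls
    calls += 1
    return f"answer to: {question}"

cache = TTLCache(ttl=300)
def answer(question):
    cached = cache.get(question)
    if cached is not None:
        return cached
    result = ask_llm(question)
    cache.set(question, result)
    return result

for _ in range(200):
    answer("How do I reset my password?")
print(calls)  # → 1 (one API call, 199 cache hits)
```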

&lt;h3&gt;
  
  
  3. Smaller Prompts
&lt;/h3&gt;

&lt;p&gt;TPM limits are about total tokens. A 10,000-token request eats 100x more budget than a 100-token request.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Token Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Send only relevant chunks, not full documents&lt;/td&gt;
&lt;td&gt;30-60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shorter system prompts&lt;/td&gt;
&lt;td&gt;10-20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarize long docs with a cheap model first&lt;/td&gt;
&lt;td&gt;50-70%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Monitoring: What to Watch
&lt;/h2&gt;

&lt;p&gt;Don't wait for users to report 429s. Watch these numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Warning&lt;/th&gt;
&lt;th&gt;Critical&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RPM usage %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of limit&lt;/td&gt;
&lt;td&gt;90% of limit&lt;/td&gt;
&lt;td&gt;Enable queuing or caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TPM usage %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of limit&lt;/td&gt;
&lt;td&gt;90% of limit&lt;/td&gt;
&lt;td&gt;Optimize prompt sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429 count/hour&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;10+ per hour&lt;/td&gt;
&lt;td&gt;Check for retry storms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retry rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5% of requests&lt;/td&gt;
&lt;td&gt;15% of requests&lt;/td&gt;
&lt;td&gt;Backoff isn't aggressive enough&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P95 response time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 seconds&lt;/td&gt;
&lt;td&gt;15 seconds&lt;/td&gt;
&lt;td&gt;Rate limit delays hitting UX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daily token spend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of TPD&lt;/td&gt;
&lt;td&gt;90% of TPD&lt;/td&gt;
&lt;td&gt;Will run out of daily quota&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
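The warning and critical thresholds in the table translate directly into an alerting check. An illustrative sketch (the 70%/90% values come from the table; the function name is made up):

```python
def usage_status(used: float, limit: float) -> str:
    """Classify utilization against the 70% warning / 90% critical thresholds."""
    pct = used / limit
    if pct >= 0.90:
        return "critical"
    if pct >= 0.70:
        return "warning"
    return "ok"

print(usage_status(55, 60))          # RPM at ~92% → "critical"
print(usage_status(45_000, 90_000))  # TPM at 50%  → "ok"
```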




&lt;h2&gt;
  
  
  Enterprise: The Noisy Neighbor Problem
&lt;/h2&gt;

&lt;p&gt;One enterprise customer runs a batch job — 500 requests in a minute. Your shared API key gets rate limited. Now &lt;strong&gt;every&lt;/strong&gt; customer is affected.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One customer blocks everyone&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Per-tenant rate limiting&lt;/strong&gt; — your app enforces limits per customer before hitting the API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time chat delayed by batch jobs&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Priority queues&lt;/strong&gt; — chat requests go before batch jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared key runs out of quota&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Separate API keys&lt;/strong&gt; — different keys for different customers or use cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unpredictable usage spikes&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Batch vs. real-time separation&lt;/strong&gt; — batch jobs use a different key with lower priority&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
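Per-tenant limiting means your app tracks a window per customer and rejects before the shared key is exhausted. A deterministic sketch with injected timestamps (the class name and the 100 RPM figure are illustrative):

```python
from collections import defaultdict, deque

class PerTenantLimiter:
    """Sliding one-minute window of request timestamps, kept per tenant."""
    def __init__(self, rpm_per_tenant: int):
        self.rpm = rpm_per_tenant
        self.windows = defaultdict(deque)

    def allow(self, tenant: str, now: float) -> bool:
        window = self.windows[tenant]
        while window and now - window[0] >= 60.0:
            window.popleft()          # drop timestamps older than a minute
        if len(window) >= self.rpm:
            return False              # this tenant is throttled; others unaffected
        window.append(now)
        return True

limiter = PerTenantLimiter(rpm_per_tenant=100)
batch = [limiter.allow("big-customer", t * 0.1) for t in range(500)]
print(batch.count(True))                      # → 100 (big-customer is capped)
print(limiter.allow("small-customer", 0.0))   # → True (unaffected by the batch job)
```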




&lt;h2&gt;
  
  
  Troubleshooting Checklist
&lt;/h2&gt;

&lt;p&gt;When 429s start showing up, work through this in order:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What to Check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;x-ratelimit-remaining-requests&lt;/code&gt; and &lt;code&gt;x-ratelimit-remaining-tokens&lt;/code&gt; — which limit did you hit?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Is it RPM or TPM? Too many requests or too many tokens per request?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Check for retry storms — is your retry count multiplying the problem?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;retry-after&lt;/code&gt; header — are you waiting the recommended time?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Check if one user or tenant is consuming disproportionate quota&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Check prompt sizes — did someone add a huge system prompt or send large documents?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Check for duplicate requests — is the frontend sending the same request multiple times?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Check your tier — did you recently exceed a billing threshold that changes your limits?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Check provider status page — is the provider having capacity issues?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Check time of day — peak hours (US business hours) have tighter effective limits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Common Patterns Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;429s for everyone at once&lt;/td&gt;
&lt;td&gt;Shared rate limit exhausted&lt;/td&gt;
&lt;td&gt;Per-tenant limits or request queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s for one customer only&lt;/td&gt;
&lt;td&gt;That customer is sending too much&lt;/td&gt;
&lt;td&gt;Per-customer throttling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s only during peak hours&lt;/td&gt;
&lt;td&gt;Hitting RPM at high traffic times&lt;/td&gt;
&lt;td&gt;Queue + cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s after deploying new feature&lt;/td&gt;
&lt;td&gt;New feature sends more or larger requests&lt;/td&gt;
&lt;td&gt;Audit token usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s that get worse over time&lt;/td&gt;
&lt;td&gt;Retry storm&lt;/td&gt;
&lt;td&gt;Exponential backoff + jitter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s on token limit but low RPM&lt;/td&gt;
&lt;td&gt;Sending very large prompts&lt;/td&gt;
&lt;td&gt;Reduce context and prompt size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermittent 429s, no pattern&lt;/td&gt;
&lt;td&gt;Hovering near the limit&lt;/td&gt;
&lt;td&gt;Add 20% buffer below your limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s after a billing change&lt;/td&gt;
&lt;td&gt;Tier downgrade reduced limits&lt;/td&gt;
&lt;td&gt;Check provider dashboard for current tier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Rate limits aren't bugs. They're a feature of every AI API. The difference between a junior and senior engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Junior:&lt;/strong&gt; "The API is broken, it keeps returning errors."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Senior:&lt;/strong&gt; "We're hitting our TPM limit during peak hours. I'm adding a request queue with exponential backoff and caching frequent queries. That should keep us under 70% utilization."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Know your limits. Monitor your usage. Retry smart, not fast. And when in doubt, check the headers — the answer is usually right there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to Troubleshoot RAG in Production: A Field Guide</title>
      <dc:creator>Sindhu Murthy</dc:creator>
      <pubDate>Mon, 16 Feb 2026 23:28:36 +0000</pubDate>
      <link>https://dev.to/sindhu_murthy_628835a359d/how-to-troubleshoot-rag-in-production-a-field-guide-6nb</link>
      <guid>https://dev.to/sindhu_murthy_628835a359d/how-to-troubleshoot-rag-in-production-a-field-guide-6nb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; RAG isn't one system — it's a pipeline with 6 stages. When something breaks, follow the data from start to finish. This guide shows you exactly which log fields to check at each stage and what they mean.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;A customer messages you at 2 PM on a Tuesday:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The AI is giving wrong answers."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's it. No logs. No screenshots. Just vibes.&lt;/p&gt;

&lt;p&gt;You have 25 fields scattered across 6 pipeline stages, and somewhere in there is the answer. This guide tells you where to look.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline at a Glance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query → Embedding → Retrieval → Context Assembly → LLM Call → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mistake most people make: they jump straight to the LLM. "Must be a model problem." It usually isn't. &lt;strong&gt;70% of RAG failures happen before the LLM is ever called&lt;/strong&gt; — in retrieval and context assembly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 1: The Query Comes In
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;request_id&lt;/code&gt; · &lt;code&gt;user_id&lt;/code&gt; · &lt;code&gt;timestamp&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Always start with &lt;code&gt;request_id&lt;/code&gt;.&lt;/strong&gt; This is your case number. Every other log field is useless without it because you can't tell which retrieval, which LLM call, which response belongs to this specific complaint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then check &lt;code&gt;user_id&lt;/code&gt;.&lt;/strong&gt; One user affected = their data or permissions. Hundreds of users at the same time = infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then check &lt;code&gt;timestamp&lt;/code&gt;.&lt;/strong&gt; Correlate with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recent deployments — did someone push a change?&lt;/li&gt;
&lt;li&gt;Known outages — is the LLM provider having issues?&lt;/li&gt;
&lt;li&gt;Batch jobs — did an embedding re-index just run?&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Customer says answers broke "recently." You check timestamps — every bad answer started at 3:47 AM, exactly when a cron job re-indexed the knowledge base with a new embedding model. Mystery solved in 30 seconds.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Stage 2: The Embedding Step
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;embedding_model&lt;/code&gt; · &lt;code&gt;embedding_latency_ms&lt;/code&gt; · &lt;code&gt;embedding_job_failed&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The user's question gets converted into a vector (a list of numbers) so it can be compared against your document vectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The silent killer:&lt;/strong&gt; If this step uses a &lt;strong&gt;different model&lt;/strong&gt; than what was used to index the documents, the vectors live in different mathematical spaces. It's like searching a Spanish library with a French dictionary. Nothing errors out — the results are just irrelevant.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;What to Look For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;embedding_model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Does it match the model used during indexing? If not, every search result is garbage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;embedding_latency_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Normal: 10-50ms. Above 2000ms: embedding service is struggling.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;embedding_job_failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If &lt;code&gt;true&lt;/code&gt;, the query never got embedded. The LLM is answering with zero context — it's guessing.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Search quality drops overnight. No deployments, no config changes. The team upgraded from &lt;code&gt;text-embedding-ada-002&lt;/code&gt; to &lt;code&gt;text-embedding-3-small&lt;/code&gt; for new queries, but stored document vectors are still from the old model. Fix: re-index all documents with the new model.&lt;/p&gt;
&lt;/blockquote&gt;
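A cheap guard against this silent mismatch: record the model used at index time and refuse to search with a different one. A sketch with hypothetical names:

```python
def check_embedding_config(query_model: str, index_model: str) -> None:
    """Fail loudly instead of silently returning irrelevant results."""
    if query_model != index_model:
        raise ValueError(
            f"query embedded with {query_model!r} but index was built with "
            f"{index_model!r}; re-index the documents or pin the query model"
        )

check_embedding_config("text-embedding-3-small", "text-embedding-3-small")  # fine
# check_embedding_config("text-embedding-3-small", "text-embedding-ada-002")
# would raise, turning a silent quality drop into a visible error
```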




&lt;h2&gt;
  
  
  Stage 3: The Retrieval Step
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;collection_name&lt;/code&gt; · &lt;code&gt;top_k&lt;/code&gt; · &lt;code&gt;chunk_size&lt;/code&gt; · &lt;code&gt;chunk_overlap&lt;/code&gt; · &lt;code&gt;retrieved_docs&lt;/code&gt; · &lt;code&gt;result_count&lt;/code&gt; · &lt;code&gt;similarity_score&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;This is where most RAG failures actually happen.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Check &lt;code&gt;result_count&lt;/code&gt; first:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Knowledge base is empty, collection doesn't exist, or query is totally unrelated. Check &lt;code&gt;collection_name&lt;/code&gt; — staging vs. production mix-ups are more common than you'd think.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1-3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Might be fine. Might mean your knowledge base is too small or chunks are too large.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;50+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You're flooding the LLM with noise. Lower &lt;code&gt;top_k&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Then check &lt;code&gt;similarity_score&lt;/code&gt;:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Above 0.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strong matches. Retrieval is working.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0.3 - 0.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mediocre. Docs are somewhat related but might not answer the question.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Below 0.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval is grabbing garbage. The system would give better answers with no context at all.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
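&lt;p&gt;Both checks are easy to automate as a first-pass triage. A minimal sketch — the band boundaries (1/4/50 results, 0.3/0.7 similarity) are this runbook's rules of thumb, not provider constants, so tune them against your own logs:&lt;/p&gt;

```python
import bisect

# Rules-of-thumb bands from the two tables above (assumed thresholds, not
# provider constants). bisect picks the band each value falls into.
COUNT_BANDS = ["empty or wrong collection", "possibly too few", "ok", "flooding the LLM"]
SCORE_BANDS = ["garbage retrieval", "mediocre matches", "strong matches"]

def triage_retrieval(result_count, top_similarity):
    """Band one retrieval log entry into the diagnostic buckets above."""
    count_verdict = COUNT_BANDS[bisect.bisect([1, 4, 50], result_count)]
    score_verdict = SCORE_BANDS[bisect.bisect([0.3, 0.7], top_similarity)]
    return count_verdict, score_verdict
```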

&lt;h3&gt;
  
  
  Then check chunking:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Chunks too large&lt;/strong&gt; (2000+ tokens)&lt;/td&gt;
&lt;td&gt;Similarity score looks decent but the answer is diluted with irrelevant content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Chunks too small&lt;/strong&gt; (50-100 tokens)&lt;/td&gt;
&lt;td&gt;Important context is split across chunks that don't get retrieved together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;No overlap&lt;/strong&gt; (overlap = 0)&lt;/td&gt;
&lt;td&gt;Sentences at chunk boundaries get cut in half. Critical info lost.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Customer asks "What's our refund policy?" and gets an answer about shipping timelines. The top retrieved doc is a 3000-token chunk titled "Order Processing" that mentions refunds in one sentence buried in paragraph 8. Fix: reduce chunk size to 500 tokens so the refund policy lives in its own chunk.&lt;/p&gt;
&lt;/blockquote&gt;
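&lt;p&gt;To make the overlap failure mode concrete, here's a minimal sliding-window chunker. It splits on words rather than tokens for simplicity — real pipelines use a tokenizer — but the overlap arithmetic is identical:&lt;/p&gt;

```python
# Minimal sliding-window chunker (word-based stand-in for token chunking).
# Assumes overlap is smaller than chunk_size.
def chunk_words(words, chunk_size, overlap):
    step = chunk_size - overlap  # with overlap = 0, boundary words never repeat
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]
```

With &lt;code&gt;overlap = 0&lt;/code&gt;, a sentence straddling a chunk boundary is split and neither half embeds well — exactly the "critical info lost" symptom in the table.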




&lt;h2&gt;
  
  
  Stage 4: Context Assembly
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;prompt_tokens&lt;/code&gt; · &lt;code&gt;total_tokens&lt;/code&gt; · &lt;code&gt;context_truncated&lt;/code&gt; · &lt;code&gt;system_prompt&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where retrieved documents get packed into a prompt and sent to the LLM. The main failure: &lt;strong&gt;stuffing more context than the model can handle.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;What to Look For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prompt_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Approaching the model's context window limit? (GPT-4o: 128K, Claude Sonnet: 200K)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context_truncated&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If &lt;code&gt;true&lt;/code&gt;, the LLM is working with incomplete information. It's like summarizing a book using only chapters 1-7 out of 20.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;system_prompt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Did someone change it? "Answer only from provided context" vs. "Be helpful" = very different behavior. The first says "I don't know." The second hallucinates.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Simple questions are correct, complex ones are wrong. Simple questions use 800 tokens, complex ones use 45,000. &lt;code&gt;context_truncated&lt;/code&gt; is &lt;code&gt;true&lt;/code&gt; for every complex query. Fix: set a max context budget and prioritize higher-scoring docs.&lt;/p&gt;
&lt;/blockquote&gt;
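&lt;p&gt;The "max context budget" fix fits in a few lines. A sketch, assuming retrieved docs arrive as &lt;code&gt;(score, token_count, text)&lt;/code&gt; tuples — a simplification of whatever your retriever actually returns:&lt;/p&gt;

```python
import operator

# Pack highest-scoring docs first until the token budget is spent, instead
# of letting the provider silently truncate the prompt.
def pack_context(docs, budget_tokens):
    packed, used = [], 0
    for score, tokens, text in sorted(docs, reverse=True):  # best score first
        if operator.le(used + tokens, budget_tokens):  # doc fits in the budget
            packed.append(text)
            used += tokens
    return packed, used
```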




&lt;h2&gt;
  
  
  Stage 5: The LLM Call
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;model&lt;/code&gt; · &lt;code&gt;temperature&lt;/code&gt; · &lt;code&gt;max_tokens&lt;/code&gt; · &lt;code&gt;api_version&lt;/code&gt; · &lt;code&gt;status_code&lt;/code&gt; · &lt;code&gt;retry_count&lt;/code&gt; · &lt;code&gt;latency_ms&lt;/code&gt; · &lt;code&gt;cache_hit&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Check &lt;code&gt;status_code&lt;/code&gt; first:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Success. Problem is elsewhere.&lt;/td&gt;
&lt;td&gt;Move on.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rate limited.&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;retry_count&lt;/code&gt; — a high count means a retry storm is making things worse.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Provider's problem.&lt;/td&gt;
&lt;td&gt;Retry or failover.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;503&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model overloaded.&lt;/td&gt;
&lt;td&gt;Common during peak hours. Wait or switch models.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
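&lt;p&gt;The standard defense against 429/500/503 is capped exponential backoff with jitter. A minimal sketch — &lt;code&gt;send_request&lt;/code&gt; is a stand-in for your actual API call, and the hard cap on retries is what prevents the retry storm described above:&lt;/p&gt;

```python
import random
import time

RETRYABLE = {429, 500, 503}  # rate limit, provider error, overloaded

def call_with_backoff(send_request, max_retries=4, sleep=time.sleep):
    """send_request is your API call; assumed to return (status_code, body)."""
    for attempt in range(max_retries):
        status, body = send_request()
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable status {status}")
        sleep(min(2 ** attempt + random.random(), 30))  # 1s, 2s, 4s... capped
    raise RuntimeError("retries exhausted; escalate or fail over")
```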

&lt;h3&gt;
  
  
  Then check configuration:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;What to Look For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Is it the model you expect? Config drift is real — someone changes an env var and production silently downgrades.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;temperature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;For RAG, should be 0.0-0.3. At 1.0, the model is improvising instead of sticking to context.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;latency_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Normal: 1-5 seconds. 15-30 seconds: model is overloaded or generating very long responses.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cache_hit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Answers seem outdated? A cache layer might be serving stale responses.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Customer reports "inconsistent" answers — same question, different answers each time. You check &lt;code&gt;temperature&lt;/code&gt;: it's set to 0.8. Every request is a roll of the dice. Fix: set to 0.1 for factual RAG.&lt;/p&gt;
&lt;/blockquote&gt;
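&lt;p&gt;Config drift is cheap to detect from the same log fields. A minimal sketch — the expected values here are illustrative for a factual RAG workload, not a recommendation for every deployment:&lt;/p&gt;

```python
# What production is supposed to be running (illustrative values).
EXPECTED = {"model": "gpt-4o", "temperature": 0.1}

def config_drift(log_entry):
    """Return {field: (actual, expected)} for every drifted field."""
    return {key: (log_entry.get(key), want)
            for key, want in EXPECTED.items()
            if log_entry.get(key) != want}
```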




&lt;h2&gt;
  
  
  Stage 6: The Response
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;completion_tokens&lt;/code&gt; · &lt;code&gt;finish_reason&lt;/code&gt; · &lt;code&gt;error_message&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Check &lt;code&gt;finish_reason&lt;/code&gt;:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;stop&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model finished naturally. This is good.&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;length&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hit &lt;code&gt;max_tokens&lt;/code&gt; limit. Answer cut off mid-sentence.&lt;/td&gt;
&lt;td&gt;Increase &lt;code&gt;max_tokens&lt;/code&gt; or add "Be concise" to system prompt.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;content_filter&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blocked by safety filters. User sees an error for a legitimate question.&lt;/td&gt;
&lt;td&gt;Adjust content filter settings.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
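&lt;p&gt;A small lookup makes this table actionable in code. The values follow the OpenAI-style response schema; other providers name these differently — Anthropic, for example, reports a &lt;code&gt;stop_reason&lt;/code&gt; of &lt;code&gt;max_tokens&lt;/code&gt; instead of &lt;code&gt;length&lt;/code&gt;:&lt;/p&gt;

```python
# finish_reason values follow the OpenAI-style schema; fixes are the ones
# from the table above.
FIXES = {
    "stop": None,  # completed naturally: nothing to do
    "length": "raise max_tokens or ask the model to be concise",
    "content_filter": "review safety-filter settings for false positives",
}

def diagnose_finish(finish_reason):
    return FIXES.get(finish_reason, "unknown finish_reason: check provider docs")
```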

&lt;h3&gt;
  
  
  Check &lt;code&gt;completion_tokens&lt;/code&gt;:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Likely Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Very low (10-20 tokens)&lt;/td&gt;
&lt;td&gt;Model defaulting to "I don't know" — retrieval probably returned nothing useful&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Very high (4000+ tokens)&lt;/td&gt;
&lt;td&gt;Model is rambling — tighten the system prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;And always check &lt;code&gt;error_message&lt;/code&gt;.&lt;/strong&gt; Sometimes the answer is literally written in the error. Read it before you start investigating.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Users report the AI "cuts off mid-sentence." &lt;code&gt;finish_reason&lt;/code&gt; = &lt;code&gt;length&lt;/code&gt; on every affected request. &lt;code&gt;max_tokens&lt;/code&gt; is set to 256 — not enough for detailed technical answers. Fix: increase to 1024.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The 10-Step Checklist
&lt;/h2&gt;

&lt;p&gt;When a ticket comes in, work through this in order:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What to Check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Get the &lt;code&gt;request_id&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;timestamp&lt;/code&gt; — correlate with deployments/outages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;user_id&lt;/code&gt; — one user or many?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;embedding_job_failed&lt;/code&gt; — did embedding work?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;result_count&lt;/code&gt; + &lt;code&gt;similarity_score&lt;/code&gt; — did retrieval return good docs?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;context_truncated&lt;/code&gt; — did the full context reach the LLM?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;status_code&lt;/code&gt; — did the LLM call succeed?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;model&lt;/code&gt; + &lt;code&gt;temperature&lt;/code&gt; — is the LLM configured correctly?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;finish_reason&lt;/code&gt; — did the response complete?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;error_message&lt;/code&gt; — does it just tell you?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Steps 1-3&lt;/strong&gt; scope the problem. &lt;strong&gt;Steps 4-6&lt;/strong&gt; catch 70% of issues. &lt;strong&gt;Steps 7-10&lt;/strong&gt; catch the rest.&lt;/p&gt;
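&lt;p&gt;Steps 4-10 can even run as an automated first pass over a single log entry. A minimal sketch, assuming the log fields named above — each predicate asks "is something wrong at this step?", and the first hit tells you where to start digging:&lt;/p&gt;

```python
# Checklist steps 4-10 as ordered predicates over one log entry.
# Field names match the log schema used throughout this guide.
CHECKS = [
    ("embedding job failed", lambda e: bool(e.get("embedding_job_failed"))),
    ("retrieval returned nothing", lambda e: e.get("result_count") == 0),
    ("context truncated", lambda e: bool(e.get("context_truncated"))),
    ("LLM call failed", lambda e: e.get("status_code") != 200),
    ("response cut off", lambda e: e.get("finish_reason") == "length"),
    ("explicit error", lambda e: bool(e.get("error_message"))),
]

def first_failure(entry):
    for label, failed in CHECKS:
        if failed(entry):
            return label
    return "no obvious failure: compare against a known-good request"
```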




&lt;h2&gt;
  
  
  Common Patterns Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Check These Fields&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wrong answers for everyone&lt;/td&gt;
&lt;td&gt;Embedding model mismatch or bad re-index&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;embedding_model&lt;/code&gt;, &lt;code&gt;similarity_score&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wrong answers for one user&lt;/td&gt;
&lt;td&gt;Missing docs in their collection&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;collection_name&lt;/code&gt;, &lt;code&gt;result_count&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incomplete answers&lt;/td&gt;
&lt;td&gt;Response truncation&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;finish_reason&lt;/code&gt;, &lt;code&gt;max_tokens&lt;/code&gt;, &lt;code&gt;context_truncated&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inconsistent answers&lt;/td&gt;
&lt;td&gt;Temperature too high or cache issues&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;cache_hit&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow responses&lt;/td&gt;
&lt;td&gt;LLM overload or too much context&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;latency_ms&lt;/code&gt;, &lt;code&gt;prompt_tokens&lt;/code&gt;, &lt;code&gt;retry_count&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No response at all&lt;/td&gt;
&lt;td&gt;API failure or rate limiting&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;status_code&lt;/code&gt;, &lt;code&gt;error_message&lt;/code&gt;, &lt;code&gt;embedding_job_failed&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucinated answers&lt;/td&gt;
&lt;td&gt;No relevant docs retrieved&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;result_count&lt;/code&gt;, &lt;code&gt;similarity_score&lt;/code&gt;, &lt;code&gt;system_prompt&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outdated answers&lt;/td&gt;
&lt;td&gt;Stale cache or stale index&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cache_hit&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;embedding_job_failed&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
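&lt;p&gt;If you triage often, this table is worth keeping as a lookup in your tooling. A minimal sketch — keys are the symptom phrases above, normalized to lowercase:&lt;/p&gt;

```python
# The pattern table above as a symptom-to-fields lookup.
SYMPTOM_FIELDS = {
    "wrong answers for everyone": ["embedding_model", "similarity_score"],
    "wrong answers for one user": ["collection_name", "result_count", "user_id"],
    "incomplete answers": ["finish_reason", "max_tokens", "context_truncated"],
    "inconsistent answers": ["temperature", "cache_hit"],
    "slow responses": ["latency_ms", "prompt_tokens", "retry_count"],
    "no response at all": ["status_code", "error_message", "embedding_job_failed"],
    "hallucinated answers": ["result_count", "similarity_score", "system_prompt"],
    "outdated answers": ["cache_hit", "timestamp", "embedding_job_failed"],
}

def fields_to_check(symptom):
    # Unknown symptom: fall back to step 1 of the checklist.
    return SYMPTOM_FIELDS.get(symptom.lower().strip(), ["request_id"])
```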




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Follow the pipeline. Query → Embedding → Retrieval → Context → LLM → Response. Six stages, 25 fields, one direction.&lt;/p&gt;

&lt;p&gt;Start at the beginning. Follow the data. The logs will tell you where it broke.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>devops</category>
    </item>
    <item>
      <title>Which AI Model Should You Actually Use? A Simple Guide for 2026</title>
      <dc:creator>Sindhu Murthy</dc:creator>
      <pubDate>Mon, 16 Feb 2026 21:24:35 +0000</pubDate>
      <link>https://dev.to/sindhu_murthy_628835a359d/which-ai-model-should-you-actually-use-a-simple-guide-for-2026-31d4</link>
      <guid>https://dev.to/sindhu_murthy_628835a359d/which-ai-model-should-you-actually-use-a-simple-guide-for-2026-31d4</guid>
      <description>&lt;h1&gt;Which AI Model Should You Actually Use? A Simple Guide for 2026&lt;/h1&gt;

&lt;p&gt;Everyone's building with AI now, but nobody tells you which model to pick. There are dozens of options and the wrong choice either wastes money or gives bad results.&lt;/p&gt;

&lt;p&gt;Here's the simple version: match the model to the job.&lt;/p&gt;

&lt;h2&gt;Part 1: Everyday Projects (Solo Developers, Startups, Side Projects)&lt;/h2&gt;

&lt;p&gt;You're building something yourself or with a small team. Budget matters. Speed matters.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;What You're Building&lt;/th&gt;
&lt;th&gt;Best Model&lt;/th&gt;
&lt;th&gt;Why This One&lt;/th&gt;
&lt;th&gt;Cost/Month&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Chatbot for your website&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Answers customer FAQs from your docs&lt;/td&gt;
&lt;td&gt;GPT-4o-mini (OpenAI)&lt;/td&gt;
&lt;td&gt;Cheap, fast, handles Q&amp;amp;A perfectly&lt;/td&gt;
&lt;td&gt;$1-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Code assistant&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Reviews pull requests, writes boilerplate&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.5 (Anthropic)&lt;/td&gt;
&lt;td&gt;Great at code, follows instructions precisely&lt;/td&gt;
&lt;td&gt;$5-20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Meeting summaries&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Transcripts → action items&lt;/td&gt;
&lt;td&gt;GPT-4o-mini (OpenAI)&lt;/td&gt;
&lt;td&gt;Summarization is simple. Fractions of a cent per summary.&lt;/td&gt;
&lt;td&gt;$1-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Image generation&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Marketing visuals, product mockups&lt;/td&gt;
&lt;td&gt;DALL-E 3 or Midjourney&lt;/td&gt;
&lt;td&gt;DALL-E for API integration. Midjourney for artistic control.&lt;/td&gt;
&lt;td&gt;$10-30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Voice transcription&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Audio recordings → text&lt;/td&gt;
&lt;td&gt;Whisper (OpenAI, local)&lt;/td&gt;
&lt;td&gt;Runs on your machine, no API costs, surprisingly accurate&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;The rule for everyday projects:&lt;/b&gt; Start with the cheapest model. Only upgrade if the quality isn't good enough. You'll be surprised how often the cheap option works fine.&lt;/p&gt;

&lt;h2&gt;Part 2: Enterprise Customers (Production Systems, Thousands of Users)&lt;/h2&gt;

&lt;p&gt;You're building for a company. Reliability matters. Compliance matters. The wrong answer costs real money.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;What They Need&lt;/th&gt;
&lt;th&gt;Best Model&lt;/th&gt;
&lt;th&gt;Why This One&lt;/th&gt;
&lt;th&gt;Key Consideration&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Internal knowledge search&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Employees search docs, get AI answers&lt;/td&gt;
&lt;td&gt;GPT-4o-mini + text-embedding-3-small&lt;/td&gt;
&lt;td&gt;Mini is cost-effective at scale&lt;/td&gt;
&lt;td&gt;Set relevance thresholds — wrong answer is worse than no answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Legal contract review&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;AI reads contracts, flags risks&lt;/td&gt;
&lt;td&gt;Claude Opus or GPT-4o&lt;/td&gt;
&lt;td&gt;Legal requires precision and nuance&lt;/td&gt;
&lt;td&gt;Must have human review loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Support automation&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;AI handles tier-1 tickets&lt;/td&gt;
&lt;td&gt;GPT-4o with fine-tuning&lt;/td&gt;
&lt;td&gt;Matches company tone, follows escalation rules&lt;/td&gt;
&lt;td&gt;Route to human if confidence is low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Fraud detection&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Flag suspicious transactions&lt;/td&gt;
&lt;td&gt;Custom ML model (not LLM)&lt;/td&gt;
&lt;td&gt;Classification problem, not a language problem&lt;/td&gt;
&lt;td&gt;Traditional ML is faster, cheaper, more accurate here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Multi-language portal&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Support in 20+ languages&lt;/td&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;Best multilingual performance&lt;/td&gt;
&lt;td&gt;Test thoroughly in each target language&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;The rule for enterprise:&lt;/b&gt; Reliability beats cost. A $0.01 answer that's wrong costs more than a $0.05 answer that's right — because wrong answers become support tickets, lost customers, and legal risk.&lt;/p&gt;

&lt;h2&gt;Why Smart Enterprises Don't Use One Model — They Use Several&lt;/h2&gt;

&lt;p&gt;Most companies start by picking one model for everything. That's a mistake. The companies that control AI costs best use &lt;b&gt;different models for different tasks in the same product&lt;/b&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Task in the Pipeline&lt;/th&gt;
&lt;th&gt;Model Used&lt;/th&gt;
&lt;th&gt;Why Not One Model for All&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classify incoming ticket&lt;/td&gt;
&lt;td&gt;GPT-4o-mini ($0.15/1M tokens)&lt;/td&gt;
&lt;td&gt;Classification is simple — cheap model gets it right 95% of the time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search knowledge base&lt;/td&gt;
&lt;td&gt;text-embedding-3-small ($0.02/1M tokens)&lt;/td&gt;
&lt;td&gt;One-time cost per document. Cheapest good embeddings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generate customer response&lt;/td&gt;
&lt;td&gt;GPT-4o ($2.50/1M tokens)&lt;/td&gt;
&lt;td&gt;Customer sees this. Quality matters here.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarize for internal log&lt;/td&gt;
&lt;td&gt;GPT-4o-mini ($0.15/1M tokens)&lt;/td&gt;
&lt;td&gt;Internal only. Doesn't need to be perfect.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flag compliance risk&lt;/td&gt;
&lt;td&gt;Claude Opus ($15/1M tokens)&lt;/td&gt;
&lt;td&gt;Legal requires the most careful model.&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;One customer support ticket, five different models.&lt;/b&gt; Each matched to the task complexity.&lt;/p&gt;
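&lt;p&gt;In code, the whole pattern is a small routing table. A minimal sketch — model names mirror the pipeline table above (the compliance entry is illustrative, not an exact API model ID), and the safe default is the cheap model:&lt;/p&gt;

```python
# Per-task model routing: pick the model by task, not one model for the product.
ROUTES = {
    "classify": "gpt-4o-mini",
    "embed": "text-embedding-3-small",
    "respond": "gpt-4o",
    "summarize": "gpt-4o-mini",
    "compliance": "claude-opus",  # illustrative name for the premium tier
}

def route(task):
    return ROUTES.get(task, "gpt-4o-mini")  # unknown tasks default to cheap
```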

&lt;h3&gt;The Cost Difference Is Massive&lt;/h3&gt;

&lt;p&gt;Take a company handling &lt;b&gt;10,000 support tickets per month&lt;/b&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Single model (GPT-4o for everything)&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Every step uses the same premium model&lt;/td&gt;
&lt;td&gt;~$800-1,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Multi-model (right model per task)&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Cheap models for simple steps, premium only where it matters&lt;/td&gt;
&lt;td&gt;~$150-250&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Same quality where the customer sees it. 70-80% cheaper overall.&lt;/b&gt;&lt;/p&gt;

&lt;h3&gt;How It Works in Practice&lt;/h3&gt;

&lt;p&gt;
GPT-4o-mini classifies the ticket → cost: $0.0001&lt;br&gt;
Embedding model searches docs → cost: $0.00005&lt;br&gt;
GPT-4o writes the response → cost: $0.008&lt;br&gt;
GPT-4o-mini summarizes for internal log → cost: $0.0002&lt;br&gt;
&lt;br&gt;
&lt;b&gt;Total per ticket: ~$0.009&lt;/b&gt;&lt;br&gt;
&lt;b&gt;vs. GPT-4o for all steps: ~$0.04&lt;/b&gt;&lt;br&gt;
&lt;b&gt;At 10,000 tickets/month: $90 vs $400&lt;/b&gt;
&lt;/p&gt;
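&lt;p&gt;The same arithmetic, written out so you can plug in your own volumes. Step costs are the rough per-ticket figures above, not quoted prices:&lt;/p&gt;

```python
# Rough per-ticket step costs from the breakdown above (not quoted prices).
STEP_COSTS = {"classify": 0.0001, "search": 0.00005, "respond": 0.008, "summarize": 0.0002}

def monthly_cost(tickets_per_month, per_ticket_costs):
    return tickets_per_month * sum(per_ticket_costs.values())
```

At 10,000 tickets this comes to roughly $84/month — in line with the ~$90 estimate above — against roughly $400 for the single-model pipeline.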

&lt;h3&gt;The TAM's Role Here&lt;/h3&gt;

&lt;p&gt;As a TAM, this is one of the highest-value conversations you can have with a customer:&lt;/p&gt;

&lt;p&gt;"I noticed you're using GPT-4o for ticket classification. That's a simple task — switching to mini for just that step would cut your classification costs by 95% with no quality drop. Want me to help you set that up?"&lt;/p&gt;

&lt;p&gt;That's not support. That's &lt;b&gt;strategic partnership&lt;/b&gt;. That's what gets TAMs promoted.&lt;/p&gt;

&lt;h2&gt;Quick Decision Flowchart&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;If Your Task Is...&lt;/th&gt;
&lt;th&gt;Use This Model&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text/language + accuracy is critical (legal, medical, finance)&lt;/td&gt;
&lt;td&gt;GPT-4o or Claude Opus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text/language + accuracy isn't life-or-death&lt;/td&gt;
&lt;td&gt;GPT-4o-mini or Claude Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation or review&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.5 or GPT-4o&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math, logic, or reasoning&lt;/td&gt;
&lt;td&gt;o3 or o3-mini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image generation&lt;/td&gt;
&lt;td&gt;DALL-E 3 or Midjourney&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio/speech transcription&lt;/td&gt;
&lt;td&gt;Whisper (free, runs locally)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured data (numbers, transactions, logs)&lt;/td&gt;
&lt;td&gt;Traditional ML — XGBoost, scikit-learn (not an LLM)&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;The Biggest Mistake I See&lt;/h2&gt;

&lt;p&gt;People use GPT-4o for everything. It's like using a Ferrari to get groceries. It works, but you're burning money for no reason.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Match the model to the task.&lt;/b&gt; Simple task → cheap model. Critical task → premium model. Not a language task → don't use an LLM at all.&lt;/p&gt;

&lt;h2&gt;The Models at a Glance&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Fast, cheap, good enough&lt;/td&gt;
&lt;td&gt;$&lt;/td&gt;
&lt;td&gt;Chatbots, summaries, simple Q&amp;amp;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Smart, reliable, multilingual&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;td&gt;Production apps needing quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.5&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Great at code, follows instructions&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;td&gt;Code generation, technical writing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Most capable, careful reasoning&lt;/td&gt;
&lt;td&gt;$$$&lt;/td&gt;
&lt;td&gt;Legal, compliance, complex analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o3-mini&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Step-by-step reasoning&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;td&gt;Math, logic, structured problems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Whisper&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Speech-to-text&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Transcription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DALL-E 3&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Image generation&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;td&gt;Marketing, design, prototyping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost / scikit-learn&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Structured data prediction&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Fraud, forecasting, classification&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
