<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sindhu Murthy</title>
    <description>The latest articles on DEV Community by Sindhu Murthy (@sindhu_murthy_628835a359d).</description>
    <link>https://dev.to/sindhu_murthy_628835a359d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3776486%2Fe42aa5dc-cb50-4005-9be8-65f14f5cc258.png</url>
      <title>DEV Community: Sindhu Murthy</title>
      <link>https://dev.to/sindhu_murthy_628835a359d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sindhu_murthy_628835a359d"/>
    <language>en</language>
    <item>
      <title>Billing &amp; Account Issues: A Support Engineer's Runbook</title>
      <dc:creator>Sindhu Murthy</dc:creator>
      <pubDate>Wed, 18 Feb 2026 18:29:57 +0000</pubDate>
      <link>https://dev.to/sindhu_murthy_628835a359d/billing-account-issues-a-support-engineers-runbook-118a</link>
      <guid>https://dev.to/sindhu_murthy_628835a359d/billing-account-issues-a-support-engineers-runbook-118a</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; This runbook is a practical reference for support engineers and anyone preparing for a support engineering role with AI API providers. It covers the 6 most common billing incident types — how to diagnose them, how to fix them, and what to communicate to customers. Patterns here apply across providers including OpenAI, Anthropic, Google, Cohere, and others.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  ⚡ Quick Reference
&lt;/h2&gt;

&lt;p&gt;Match the customer's symptom to the incident type, then jump to that section.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Customer Says&lt;/th&gt;
&lt;th&gt;Jump To&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;🚫 "My API calls suddenly stopped working"&lt;/td&gt;
&lt;td&gt;Incident 1 — Payment Failure / Credit Exhaustion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;😱 "My bill is way higher than I expected"&lt;/td&gt;
&lt;td&gt;Incident 2 — Unexpected High Bill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;📍 "I hit my limit and the API stopped"&lt;/td&gt;
&lt;td&gt;Incident 3 — Spending Limit Reached&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;⏳ "My free credits ran out"&lt;/td&gt;
&lt;td&gt;Incident 4 — Free Tier / Trial Expiry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;💸 "I want a refund for accidental charges"&lt;/td&gt;
&lt;td&gt;Incident 5 — Refund Request&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;🔒 "My account has been suspended / locked"&lt;/td&gt;
&lt;td&gt;Incident 6 — Account Suspension&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;🔵 &lt;strong&gt;Before anything else:&lt;/strong&gt; Always check the provider's status page first (e.g. status.openai.com, status.anthropic.com). If there is an active incident, that is your answer — inform the customer and monitor. Do not proceed further until you have ruled out a provider-side outage.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Incident 1 — Payment Failure / Credit Exhaustion&lt;/li&gt;
&lt;li&gt;Incident 2 — Unexpected High Bill&lt;/li&gt;
&lt;li&gt;Incident 3 — Spending Limit Reached Without Warning&lt;/li&gt;
&lt;li&gt;Incident 4 — Free Tier / Trial Credit Expiry&lt;/li&gt;
&lt;li&gt;Incident 5 — Refund Request for Accidental Usage&lt;/li&gt;
&lt;li&gt;Incident 6 — Account Suspension&lt;/li&gt;
&lt;li&gt;Master Decision Tree&lt;/li&gt;
&lt;li&gt;Support Engineer Troubleshooting Checklist&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Incident 1 — Payment Failure / Credit Exhaustion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Error: 402 Payment Required — API access stops immediately&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"My API calls were working fine and then suddenly stopped."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I'm getting 402 errors on every request."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Nothing in my code changed."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What actually happened
&lt;/h3&gt;

&lt;p&gt;AI API providers stop access &lt;strong&gt;immediately and without a grace period&lt;/strong&gt; when a payment fails or a prepaid credit balance hits $0. Unlike a SaaS subscription that might give you days to fix a payment issue, the API cuts off the moment the billing system flags a failure.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;How to Confirm It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Card expired or declined&lt;/td&gt;
&lt;td&gt;Provider dashboard → Billing → red banner or failed payment status&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prepaid credit balance at $0&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → credit balance shows $0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-recharge enabled but card declined&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → payment history → failed recharge entry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Invoice overdue (enterprise accounts)&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → Invoices → unpaid invoice&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Diagnosis Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;402 Payment Required
  │
  ├── Provider Dashboard → Billing section
  │     │
  │     ├── Red banner or "Payment failed" message?
  │     │     └── Card expired / declined
  │     │           FIX: Ask customer to update card.
  │     │               Dashboard → Billing → Payment methods → Update
  │     │               API resumes within ~5 minutes of payment clearing.
  │     │
  │     ├── Credit balance shows $0?
  │     │     ├── Auto-recharge OFF
  │     │     │     FIX: Add credits manually + enable auto-recharge.
  │     │     │
  │     │     └── Auto-recharge ON but balance still $0
  │     │           → The recharge itself failed (card issue).
  │     │             FIX: Same as expired/declined card above.
  │     │
  │     └── Invoice overdue? (enterprise customers)
  │           FIX: Route to finance team for payment processing.
  │
  └── Dashboard looks fine — balance &amp;gt; $0, card valid?
        → Rare sync delay. Wait 10 minutes.
          Still failing? Escalate with account ID + timestamps.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to fix it
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;th&gt;Where&lt;/th&gt;
&lt;th&gt;Time to Resolution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Update payment card&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → Payment methods&lt;/td&gt;
&lt;td&gt;~5 min after payment clears&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Add prepaid credits&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → Add credits&lt;/td&gt;
&lt;td&gt;~2–5 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enable auto-recharge&lt;/td&gt;
&lt;td&gt;Dashboard → Billing → Auto-recharge settings&lt;/td&gt;
&lt;td&gt;Prevents future incidents&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What to tell the customer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your payment method needs to be updated. Go to your provider dashboard under Billing → Payment methods, update your card, and API access should resume within a few minutes. I'd also recommend enabling auto-recharge so your balance never hits zero unexpectedly."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;💚 &lt;strong&gt;Post-resolution:&lt;/strong&gt; Always recommend enabling auto-recharge with a top-up threshold set to at least 2× the customer's average daily spend. This single setting prevents the majority of these tickets.&lt;/p&gt;
&lt;/blockquote&gt;
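&lt;p&gt;The status codes in this runbook map cleanly to incident sections, which makes a first-response triage helper straightforward. This is a minimal sketch (the mapping and wording are assumptions, and real provider SDKs raise their own typed exceptions rather than bare status codes):&lt;/p&gt;

```python
# Map billing-related HTTP status codes to runbook guidance.
# Hypothetical helper -- real provider SDKs expose typed errors instead.
RUNBOOK_GUIDANCE = {
    402: "Payment failure or credit exhaustion. "
         "Check Dashboard -> Billing for a failed payment or a $0 balance.",
    429: "Rate or spend limit. Could be a hard spending cap -- "
         "check Billing -> Spending limits before assuming rate limiting.",
    401: "Auth failure. If every key fails, suspect account-level suspension.",
}

def diagnose(status_code: int) -> str:
    """Return first-response guidance for a billing-related status code."""
    return RUNBOOK_GUIDANCE.get(
        status_code,
        "Not a known billing code -- check the provider status page first.",
    )
```

&lt;p&gt;For example, &lt;code&gt;diagnose(402)&lt;/code&gt; returns the payment-failure guidance, pointing the agent straight at this incident's checklist.&lt;/p&gt;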




&lt;h2&gt;
  
  
  Incident 2 — Unexpected High Bill
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Customer's invoice is significantly higher than expected&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"My bill last month was $40. This month it's $800. Nothing changed."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I think I'm being charged incorrectly."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"We only have 200 users — how is this possible?"&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What actually happened
&lt;/h3&gt;

&lt;p&gt;Something in the customer's usage changed — even if they don't know what. In practice, 95% of high-bill tickets trace back to one of five root causes. Your job is to identify which one using the Usage dashboard.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;What It Looks Like in Usage Dashboard&lt;/th&gt;
&lt;th&gt;How Common&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Runaway loop&lt;/strong&gt; — bug calling API thousands of times&lt;/td&gt;
&lt;td&gt;One day with a massive spike, thousands of requests in minutes&lt;/td&gt;
&lt;td&gt;Very common&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Model swap&lt;/strong&gt; — switched to a more expensive model&lt;/td&gt;
&lt;td&gt;Usage shifts to a pricier model mid-month&lt;/td&gt;
&lt;td&gt;Very common&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Context bloat&lt;/strong&gt; — sending full documents instead of chunks&lt;/td&gt;
&lt;td&gt;High token count per request, not high request count&lt;/td&gt;
&lt;td&gt;Common&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Retry storm&lt;/strong&gt; — failed requests retrying without backoff&lt;/td&gt;
&lt;td&gt;Clusters of identical requests at the same timestamps&lt;/td&gt;
&lt;td&gt;Common&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Dev key in production&lt;/strong&gt; — test environment hitting real API&lt;/td&gt;
&lt;td&gt;Usage spikes during business hours or CI/CD run times&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Planned vs Unplanned Model Changes — Know the Difference
&lt;/h3&gt;

&lt;p&gt;Using multiple models intentionally for different tasks is one of the &lt;strong&gt;best cost strategies in AI engineering&lt;/strong&gt; — not a problem. The issue is when a model change happens accidentally: a developer swaps a model name in one place without checking the pricing impact, and the bill spikes before anyone notices.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;✅ Intentional Multi-Model Routing&lt;/th&gt;
&lt;th&gt;❌ Accidental Model Swap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deliberately using cheap models for simple tasks, expensive ones for complex tasks&lt;/td&gt;
&lt;td&gt;Someone changes a model name in code without checking pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Planned?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes — documented in architecture&lt;/td&gt;
&lt;td&gt;No — discovered on the invoice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Is it a problem?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No — this is best practice&lt;/td&gt;
&lt;td&gt;Yes — surprise bill with no warning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Classification → economy model; complex reasoning → premium model&lt;/td&gt;
&lt;td&gt;gpt-4o-mini quietly changed to gpt-4o in a config file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Smart Multi-Model Routing — Recommended Approach
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task Type&lt;/th&gt;
&lt;th&gt;Recommended Model Tier&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Classification, routing, tagging, simple Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;Economy (e.g. gpt-4o-mini, claude-haiku, gemini-flash)&lt;/td&gt;
&lt;td&gt;Doesn't need deep reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customer-facing chat, summarisation&lt;/td&gt;
&lt;td&gt;Standard (e.g. gpt-4o, claude-sonnet, gemini-pro)&lt;/td&gt;
&lt;td&gt;Good quality-to-cost balance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex analysis, code, legal/financial reasoning&lt;/td&gt;
&lt;td&gt;Premium (e.g. o1, claude-opus, gemini-ultra)&lt;/td&gt;
&lt;td&gt;Worth the cost when accuracy matters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
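&lt;p&gt;The routing table can be pinned down in code so model names live in one reviewed place instead of scattered string literals. A sketch, assuming illustrative model names and task labels (swap in your provider's current identifiers):&lt;/p&gt;

```python
# Centralise model choices so a swap is a reviewed, one-line change.
# Model names below are illustrative -- check your provider's pricing page.
MODEL_TIERS = {
    "economy": "gpt-4o-mini",   # classification, routing, tagging
    "standard": "gpt-4o",       # customer-facing chat, summarisation
    "premium": "o1",            # complex analysis, code, legal reasoning
}

TASK_TO_TIER = {
    "classify": "economy",
    "chat": "standard",
    "summarise": "standard",
    "analyse": "premium",
}

def pick_model(task: str) -> str:
    """Route a task to a model tier; unknown tasks fall back to economy,
    so the failure mode is a cheap answer, not an expensive one."""
    tier = TASK_TO_TIER.get(task, "economy")
    return MODEL_TIERS[tier]
```

&lt;p&gt;Because every call goes through &lt;code&gt;pick_model&lt;/code&gt;, an accidental model swap becomes a visible diff in one file rather than a surprise on the invoice.&lt;/p&gt;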

&lt;h3&gt;
  
  
  The Pricing Gap That Catches People Off Guard
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model Tier&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;th&gt;Approx. Cost per 1M input tokens&lt;/th&gt;
&lt;th&gt;Relative Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Economy / Lightweight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;gpt-4o-mini, claude-haiku, gemini-flash&lt;/td&gt;
&lt;td&gt;~$0.10–0.20&lt;/td&gt;
&lt;td&gt;🟢 Cheapest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;gpt-4o, claude-sonnet, gemini-pro&lt;/td&gt;
&lt;td&gt;~$2.50–3.00&lt;/td&gt;
&lt;td&gt;🟠 ~15–20× more&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Premium / Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;o1, claude-opus, gemini-ultra&lt;/td&gt;
&lt;td&gt;~$15.00+&lt;/td&gt;
&lt;td&gt;🔴 ~100× more&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Always direct customers to their provider's current pricing page&lt;/strong&gt; — these numbers change as models evolve. Use the table above for illustration only.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  How Context Bloat Compounds Cost
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Same number of requests — very different cost:

  Request with 1K tokens:
  └── Cost on a standard model: ~$0.0025

  Request with 10K tokens (full document sent):
  └── Cost on a standard model: ~$0.025  ← 10× more expensive

  500 such requests/day × 30 days:
  ├── 1K tokens:  ~$37.50/month
  └── 10K tokens: ~$375.00/month  ← same traffic, 10× the bill

  FIX: Send only relevant chunks. Use retrieval (RAG).
       Summarize long docs with a cheap model before
       passing to an expensive one.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
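&lt;p&gt;The arithmetic in that block generalises to a one-line estimator. A sketch; the $2.50 per million input tokens rate is illustrative, so always plug in the provider's current price:&lt;/p&gt;

```python
def monthly_input_cost(tokens_per_request: int,
                       requests_per_day: int,
                       usd_per_million_tokens: float,
                       days: int = 30) -> float:
    """Estimate monthly input-token spend for steady traffic."""
    total_tokens = tokens_per_request * requests_per_day * days
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 500 requests/day on a ~$2.50/1M standard model:
lean = monthly_input_cost(1_000, 500, 2.50)      # $37.50/month
bloated = monthly_input_cost(10_000, 500, 2.50)  # $375.00/month
```

&lt;p&gt;Running the customer's own token counts through this during a ticket makes the "same traffic, 10× the bill" effect concrete.&lt;/p&gt;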



&lt;h3&gt;
  
  
  Diagnosis Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer reports high bill
  │
  ├── Dashboard → Usage → set date range to billing period
  │     │
  │     ├── Single-day spike visible?
  │     │     → Likely runaway loop or retry storm.
  │     │       Are requests clustered by timestamp?
  │     │       Clustered      → retry storm (no exponential backoff)
  │     │       Spread but massive volume → runaway loop (code bug)
  │     │
  │     ├── Usage shifted to a more expensive model mid-month?
  │     │     → Model swap.
  │     │       Ask: "Did anyone on your team change the model name recently?"
  │     │
  │     ├── High token count per request?
  │     │     → Context bloat.
  │     │       Ask: "Are you sending full documents or just relevant sections?"
  │     │
  │     └── Usage spread evenly but higher overall?
  │           → Traffic grew OR dev key hitting production API.
  │             Ask: "Do you use the same API key in dev and production?"
  │
  └── Usage dashboard total matches the invoice?
        YES → Usage is legitimate. Explain pricing, suggest optimizations.
        NO  → Escalate with account ID, date range, and the discrepancy figures.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Post-Resolution Recommendations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Root Cause Found&lt;/th&gt;
&lt;th&gt;Recommend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runaway loop&lt;/td&gt;
&lt;td&gt;Set a monthly hard spend limit. Add request-level logging.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model swap&lt;/td&gt;
&lt;td&gt;Lock model names to constants or environment variables. Review pricing on every model change.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context bloat&lt;/td&gt;
&lt;td&gt;Use retrieval-augmented generation (RAG). Send relevant chunks only.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry storm&lt;/td&gt;
&lt;td&gt;Implement exponential backoff with jitter. Cap total retries per request.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dev key in production&lt;/td&gt;
&lt;td&gt;Separate API keys per environment. Set lower spend limits on dev keys.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
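&lt;p&gt;For the retry-storm fix recommended above, exponential backoff with jitter and a capped retry budget looks roughly like this (a sketch, assuming the request callable raises on transient failures):&lt;/p&gt;

```python
import random
import time

def call_with_backoff(make_request, max_retries: int = 5,
                      base_delay: float = 1.0, max_delay: float = 60.0):
    """Retry a callable with capped exponential backoff plus full jitter.

    Prevents retry storms: total attempts are bounded and delays grow,
    so a failing endpoint is not hammered with identical requests.
    """
    for attempt in range(max_retries + 1):
        try:
            return make_request()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted -- surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```

&lt;p&gt;The jitter matters: without it, every client that failed at the same moment retries at the same moment, producing exactly the clustered-timestamp pattern described in the diagnosis flow.&lt;/p&gt;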




&lt;h2&gt;
  
  
  Incident 3 — Spending Limit Reached Without Warning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;API stops mid-month — customer didn't realise a hard limit was set&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"The API just stopped working. I have money in my account."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I'm getting errors even though my balance is positive."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"It was fine yesterday — nothing changed."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What actually happened
&lt;/h3&gt;

&lt;p&gt;Most AI providers allow users to set a &lt;strong&gt;monthly spending cap (hard limit)&lt;/strong&gt;. When this cap is reached, all API calls fail — even with a valid payment method and positive credit balance. This is a customer-configured safety feature, not a bug. The confusion usually happens because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The limit was set a long time ago and forgotten&lt;/li&gt;
&lt;li&gt;Usage grew beyond the original projection&lt;/li&gt;
&lt;li&gt;A spike consumed the monthly budget faster than expected&lt;/li&gt;
&lt;li&gt;The customer confused the &lt;strong&gt;soft limit&lt;/strong&gt; (notification only) with the &lt;strong&gt;hard limit&lt;/strong&gt; (cutoff)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Soft Limit vs Hard Limit — The Critical Difference
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Soft Limit&lt;/th&gt;
&lt;th&gt;Hard Limit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it does&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sends an email/alert notification when reached&lt;/td&gt;
&lt;td&gt;Stops all API calls immediately when reached&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Does it cut off the API?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No — API keeps working&lt;/td&gt;
&lt;td&gt;Yes — API stops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error seen when hit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No error — just a notification&lt;/td&gt;
&lt;td&gt;429 or billing-related error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best use&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Early warning at 70–80% of budget&lt;/td&gt;
&lt;td&gt;Circuit breaker at 100% of budget&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How Limits Should Be Configured
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly budget: $500
  │
  ├── Soft limit: $375  (75%)
  │     → Notification sent: "You've used 75% of your budget"
  │     → API still works
  │     → Time to review: Is this expected? Should the limit be raised?
  │
  └── Hard limit: $500  (100%)
        → All API calls stop
        → Protects against runaway costs above the budget

  ┌──────────────────────────────────────────────────────┐
  │  $0          $375 (soft)          $500 (hard)         │
  │  ├──────────────┼───────────────────┤                 │
  │  │  SAFE ZONE   │   WARNING ZONE    │   API OFFLINE   │
  └──────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
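&lt;p&gt;The same zones can be enforced client-side, so the application warns at the soft limit instead of discovering the hard limit through failed requests. A sketch using the hypothetical $375/$500 figures from the diagram:&lt;/p&gt;

```python
def budget_zone(spend: float, soft: float = 375.0, hard: float = 500.0) -> str:
    """Classify current spend against soft (warn) and hard (cutoff) limits."""
    if spend >= hard:
        return "offline"  # provider-side hard limit: all API calls fail
    if spend >= soft:
        return "warning"  # notify, review usage, decide whether to raise limit
    return "safe"
```

&lt;p&gt;Here &lt;code&gt;budget_zone(380.0)&lt;/code&gt; returns "warning": the window where the customer should review usage before the cutoff, exactly as the diagram describes.&lt;/p&gt;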



&lt;h3&gt;
  
  
  How to Fix It
&lt;/h3&gt;

&lt;p&gt;Go to the provider's &lt;strong&gt;Billing → Spending limits&lt;/strong&gt; settings and either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raise the hard limit to a higher value (takes effect immediately)&lt;/li&gt;
&lt;li&gt;Wait for the monthly reset (usually the 1st of the calendar month)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Before raising the limit:&lt;/strong&gt; Check the Usage dashboard to confirm whether the spend was expected. If it's from a bug or spike, raising the limit without fixing the root cause just defers the problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What to Tell the Customer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your account has a monthly spending cap set, and you've reached it — that's why the API stopped. This is a safety feature you configured, not a bug. You can raise it in your billing settings. Before you do, I'd recommend checking your usage dashboard to confirm the spending was expected."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Incident 4 — Free Tier / Trial Credit Expiry
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Credits ran out or expired — customer didn't expect it&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"I just created my account and the API is already not working."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I thought I had free credits — why am I getting errors?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"It worked last week. Now I'm getting 402 errors."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What actually happened
&lt;/h3&gt;

&lt;p&gt;New accounts on most AI providers receive a free credit grant. Those credits can disappear in two ways: they get fully consumed, or they expire (free credits usually carry a time limit). Once they're gone, behaviour changes in ways that aren't always obvious.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Error Seen&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Free credits fully consumed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;402&lt;/code&gt; on all requests&lt;/td&gt;
&lt;td&gt;Add a payment method in Billing settings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free credits expired (time limit hit)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;402&lt;/code&gt; even if credits appeared available&lt;/td&gt;
&lt;td&gt;Credits are gone — add a payment method&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On free tier with very low rate limits&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;429&lt;/code&gt; even at low request volume&lt;/td&gt;
&lt;td&gt;Add payment method to move to paid tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Upgraded to paid but limits feel unchanged&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;429&lt;/code&gt; at low volume&lt;/td&gt;
&lt;td&gt;Tier upgrades can take time to propagate — check current tier in dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Free Tier vs Paid Tier — Why It Feels Broken
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Who&lt;/th&gt;
&lt;th&gt;Rate Limits&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New accounts, no payment method&lt;/td&gt;
&lt;td&gt;Very restrictive (e.g. 3 RPM on premium models)&lt;/td&gt;
&lt;td&gt;Fine for testing; not suitable for real applications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Paid Tier 1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Payment method added + minimum spend reached&lt;/td&gt;
&lt;td&gt;Significantly higher&lt;/td&gt;
&lt;td&gt;Most developers land here first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Paid Tier 2+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Based on cumulative spend history&lt;/td&gt;
&lt;td&gt;Progressively higher&lt;/td&gt;
&lt;td&gt;Limits increase automatically as spend grows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Common confusion:&lt;/strong&gt; A customer adds a payment method but still hits very low rate limits. Most providers require a payment method AND a minimum spend AND a minimum account age — all three conditions must be met before a tier upgrade is applied.&lt;/p&gt;
&lt;/blockquote&gt;
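&lt;p&gt;That three-condition rule is easy to sanity-check before escalating a "my limits didn't increase" ticket. A sketch; the spend and age thresholds here are placeholders, since actual criteria vary by provider:&lt;/p&gt;

```python
def tier_upgrade_ready(has_payment_method: bool,
                       lifetime_spend_usd: float,
                       account_age_days: int,
                       min_spend: float = 5.0,
                       min_age_days: int = 7) -> bool:
    """All three conditions must hold before most providers apply a tier upgrade.

    Thresholds are placeholders -- check the provider's rate-limit docs.
    """
    return (has_payment_method
            and lifetime_spend_usd >= min_spend
            and account_age_days >= min_age_days)
```

&lt;p&gt;Walking a customer through each condition in turn usually surfaces the one they are missing (most often account age).&lt;/p&gt;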

&lt;h3&gt;
  
  
  What to Tell the Customer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Your free credits have been used up or have expired. To continue, add a payment method in your billing settings. Once you meet the provider's tier criteria — typically a minimum spend and account age — you'll automatically move to a higher rate limit tier."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Incident 5 — Refund Request for Accidental Usage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Customer was charged for usage they say was unintentional&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"I had a bug that made thousands of API calls — can I get a refund?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"My account was compromised and someone used my API key."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I forgot to turn off my dev environment."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Refund Policy Reality — Set Expectations Early
&lt;/h3&gt;

&lt;p&gt;Most AI providers have a &lt;strong&gt;no-refund policy for API usage&lt;/strong&gt; because the compute was actually consumed. There is no automatic refund process. That said, some situations may qualify for a &lt;strong&gt;goodwill credit&lt;/strong&gt;. Being honest with customers before they escalate saves everyone time.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Realistic Outcome&lt;/th&gt;
&lt;th&gt;What Helps the Customer's Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bug caused a clear runaway loop&lt;/td&gt;
&lt;td&gt;🟠 Possible goodwill credit&lt;/td&gt;
&lt;td&gt;Application logs, timestamps, request IDs, evidence it was unintentional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Account compromised / key stolen&lt;/td&gt;
&lt;td&gt;🟢 Usually resolved in customer's favour&lt;/td&gt;
&lt;td&gt;Report immediately. Show usage inconsistent with normal activity (IPs, models, times).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Provider outage caused excessive retries&lt;/td&gt;
&lt;td&gt;🟢 Usually credited&lt;/td&gt;
&lt;td&gt;Reference the outage from the provider's status page with matching timestamps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Didn't realise a model was expensive&lt;/td&gt;
&lt;td&gt;🔴 Very unlikely&lt;/td&gt;
&lt;td&gt;Pricing is publicly listed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Forgot to cancel" / dev env left running&lt;/td&gt;
&lt;td&gt;🔴 Unlikely&lt;/td&gt;
&lt;td&gt;This is what spend limits are for&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How to Handle the Ticket
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer submits refund request
  │
  ├── Evidence of account compromise?
  │     → YES: Flag as security incident.
  │             Ask customer to rotate API key immediately.
  │             Collect: unusual IPs, models used, timestamps.
  │             Escalate to security / trust &amp;amp; safety team.
  │
  ├── Matching provider outage at that time?
  │     → YES: Cross-reference with provider's status page.
  │             If confirmed, credit is likely appropriate. Escalate to billing team.
  │
  ├── Clear code bug with log evidence?
  │     → Collect: timestamps, request IDs, total requests vs. normal baseline.
  │       Escalate to billing team with evidence.
  │       Do NOT promise a refund — only the billing team can approve.
  │
  └── No clear evidence / "I just forgot"?
        → Empathise but set expectations honestly.
          Recommend: hard spend limit + auto-recharge threshold.
          Offer to help configure it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Information to Collect Before Escalating
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Info Needed&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Account / Org ID&lt;/td&gt;
&lt;td&gt;Identifies the account for the billing team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Date range of charges in question&lt;/td&gt;
&lt;td&gt;Narrows the investigation window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request IDs if available&lt;/td&gt;
&lt;td&gt;Allows billing team to trace exact usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Description of what went wrong (customer's words)&lt;/td&gt;
&lt;td&gt;Establishes intent and context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supporting logs or screenshots&lt;/td&gt;
&lt;td&gt;Evidence for goodwill consideration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
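&lt;p&gt;The checklist above can be enforced as a simple completeness gate before a ticket reaches the billing team. A hypothetical helper (the field names are assumptions, not a real ticketing API):&lt;/p&gt;

```python
# Hypothetical field names mirroring the checklist above.
REQUIRED_FIELDS = ("account_id", "date_range", "description")
OPTIONAL_FIELDS = ("request_ids", "logs_or_screenshots")

def escalation_payload(ticket: dict) -> dict:
    """Validate that an escalation carries the required evidence.

    Raises ValueError listing anything missing, so incomplete tickets
    never reach the billing team.
    """
    missing = [f for f in REQUIRED_FIELDS if not ticket.get(f)]
    if missing:
        raise ValueError(f"Cannot escalate -- missing: {', '.join(missing)}")
    return {f: ticket.get(f) for f in REQUIRED_FIELDS + OPTIONAL_FIELDS}
```

&lt;p&gt;A gate like this keeps the round trips down: the billing team gets everything it needs on the first pass, and the customer isn't asked for the same details twice.&lt;/p&gt;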

&lt;blockquote&gt;
&lt;p&gt;🔴 &lt;strong&gt;Never promise a refund.&lt;/strong&gt; Only the billing team can approve credits. Promising what you can't deliver creates a worse outcome than being upfront from the start.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What to Tell the Customer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I understand this is frustrating. The general policy is that API usage is non-refundable since the compute was consumed, but I'll escalate this to our billing team with the details you've shared. They'll review it and follow up. In the meantime, I'd recommend setting a monthly spend limit so this can't happen again — I can walk you through that now if you'd like."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Incident 6 — Account Suspension
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Account locked due to policy violation or fraud flag&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the customer says
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"My account was suddenly disabled. I didn't do anything wrong."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I'm getting 401 errors on a key that worked yesterday."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"I got an email saying my account violated usage policies but I don't understand why."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Accounts Get Suspended
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Suspension Type&lt;/th&gt;
&lt;th&gt;Common Triggers&lt;/th&gt;
&lt;th&gt;Who Handles It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automated — Policy violation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Usage patterns matching prohibited use cases, abuse detection&lt;/td&gt;
&lt;td&gt;Trust &amp;amp; Safety team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Automated — Fraud flag&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Suspicious payment method, unusual signup signals, sanctioned region&lt;/td&gt;
&lt;td&gt;Trust &amp;amp; Safety / Finance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manual — Policy violation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Reported abuse, investigation-triggered review&lt;/td&gt;
&lt;td&gt;Trust &amp;amp; Safety team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Manual — Outstanding balance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Invoice not paid after repeated reminders&lt;/td&gt;
&lt;td&gt;Finance / Billing team&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Diagnosis Flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer reports account suspended / 401 on all keys
  │
  ├── Can the customer log into the provider dashboard?
  │     │
  │     ├── Login WORKS but API fails
  │     │     → NOT an account suspension.
  │     │       This is a key-level issue.
  │     │       → Treat as Incident 1 or investigate API key directly.
  │     │
  │     └── Login FAILS
  │           → Account-level suspension confirmed. Continue below.
  │
  ├── Did the customer receive a suspension email?
  │     ├── YES — policy violation notice
  │     │     → Route to Trust &amp;amp; Safety.
  │     │       Do NOT reinstate at support level.
  │     │       Do NOT share what triggered the automated system.
  │     │
  │     ├── YES — payment / fraud notice
  │     │     → Outstanding invoice? Route to Finance.
  │     │       Fraud flag?          Route to Trust &amp;amp; Safety.
  │     │
  │     └── NO email received
  │           → Check internally if account is flagged.
  │             Could also be a key issue rather than true suspension.
  │
  └── Customer wants to appeal?
        → Direct to provider's official support/appeal process.
          Do NOT bypass or pre-approve reinstatement at support level.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What You Can and Cannot Do
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Support Engineer CAN&lt;/th&gt;
&lt;th&gt;Support Engineer CANNOT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Policy suspension&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Confirm suspension, route to T&amp;amp;S, explain appeal process&lt;/td&gt;
&lt;td&gt;Reinstate the account, share what triggered the suspension&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fraud flag&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Confirm status, collect info, route to correct team&lt;/td&gt;
&lt;td&gt;Lift the fraud flag, process reinstatement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Outstanding invoice&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Confirm invoice exists, direct to payment, route to Finance&lt;/td&gt;
&lt;td&gt;Waive the amount, manually reinstate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;🔴 &lt;strong&gt;Do not reinstate suspended accounts at the support level.&lt;/strong&gt; All reinstatements for policy or fraud-related suspensions must go through Trust &amp;amp; Safety. Bypassing this process creates liability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What to Tell the Customer
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I can see your account has been suspended. I've escalated this to the appropriate team for review. You can also submit a formal appeal through the provider's support portal — include your account ID and a description of your use case. The team will review and respond. I'm not able to share details of what triggered the review, but the appeals team will have full context."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Master Decision Tree
&lt;/h2&gt;

&lt;p&gt;Start here for every billing or account ticket. The error code is the most reliable entry point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Billing or account ticket received
  │
  ├── STEP 1: Check the provider's status page
  │     Active incident? → Inform customer, monitor, close when resolved.
  │     No incident?     → Continue.
  │
  ├── STEP 2: What error is the customer seeing?
  │     │
  │     ├── 402 Payment Required
  │     │     ├── Balance $0 or card failed?    → Incident 1 (Payment Failure)
  │     │     └── Hard spending limit reached?  → Incident 3 (Spending Limit)
  │     │
  │     ├── 401 Unauthorized
  │     │     ├── Account suspended?            → Incident 6 (Account Suspension)
  │     │     └── Key issue (no suspension)?    → API key troubleshooting
  │     │
  │     ├── 403 Forbidden
  │     │     └── Free tier / model access?     → Incident 4 (Free Tier Expiry)
  │     │
  │     ├── No specific error / vague report
  │     │     ├── "Bill too high"               → Incident 2 (Unexpected High Bill)
  │     │     ├── "Want a refund"               → Incident 5 (Refund Request)
  │     │     └── "Account locked"              → Incident 6 (Account Suspension)
  │     │
  │     └── 429 Too Many Requests
  │           → NOT a billing issue.
  │             See the Rate Limits Runbook.
  │
  └── STEP 3: After resolution
        → Send post-resolution recommendation (see each incident section above)
        → Log case notes: incident type, root cause, fix applied
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
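&lt;p&gt;The routing above can be sketched as a small triage function. This is illustrative only (the incident labels are this runbook's own, and real triage also checks the status page and billing dashboard first):&lt;/p&gt;

```python
def triage(status_code=None, symptom=""):
    """Route a billing/account ticket to an incident type, following the decision tree."""
    s = symptom.lower()
    if status_code == 402:
        # Hard spending limit vs. payment failure
        return "Incident 3 (Spending Limit)" if "limit" in s else "Incident 1 (Payment Failure)"
    if status_code == 401:
        return "Incident 6 (Account Suspension)" if "suspend" in s else "API key troubleshooting"
    if status_code == 403:
        return "Incident 4 (Free Tier Expiry)"
    if status_code == 429:
        return "Rate Limits Runbook (not billing)"
    # No specific error: classify by the customer's wording
    if "bill" in s:
        return "Incident 2 (Unexpected High Bill)"
    if "refund" in s:
        return "Incident 5 (Refund Request)"
    if "locked" in s or "suspend" in s:
        return "Incident 6 (Account Suspension)"
    return "Needs more info: ask for the exact error code"
```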






&lt;h2&gt;
  
  
  ✅ Support Engineer Troubleshooting Checklist
&lt;/h2&gt;

&lt;p&gt;Work through this top to bottom for every billing or account ticket.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔍 Step 1 — Initial Triage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Check the &lt;strong&gt;provider's status page&lt;/strong&gt; for active incidents — stop here if one exists&lt;/li&gt;
&lt;li&gt;[ ] Get the &lt;strong&gt;exact HTTP status code&lt;/strong&gt; from the customer's logs (402, 401, 403, 429)&lt;/li&gt;
&lt;li&gt;[ ] Get the &lt;strong&gt;exact error message&lt;/strong&gt; from the response body (e.g. "insufficient_quota", "invalid_api_key")&lt;/li&gt;
&lt;li&gt;[ ] Confirm the &lt;strong&gt;Account / Org ID&lt;/strong&gt; (found in provider dashboard → Settings → Organization)&lt;/li&gt;
&lt;li&gt;[ ] Get &lt;strong&gt;timestamp of last successful request&lt;/strong&gt; and &lt;strong&gt;first failed request&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  💳 Step 2 — Billing Dashboard Check
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Check &lt;strong&gt;payment method status&lt;/strong&gt; — any red banners or declined payments?&lt;/li&gt;
&lt;li&gt;[ ] Check &lt;strong&gt;credit balance&lt;/strong&gt; — is it $0? Is auto-recharge enabled?&lt;/li&gt;
&lt;li&gt;[ ] Check &lt;strong&gt;spending limits&lt;/strong&gt; — has the hard limit been reached this month?&lt;/li&gt;
&lt;li&gt;[ ] Check &lt;strong&gt;account tier&lt;/strong&gt; — Free / Paid Tier 1 / Higher? Does it match what the customer expects?&lt;/li&gt;
&lt;li&gt;[ ] Check for &lt;strong&gt;outstanding invoices&lt;/strong&gt; (enterprise / invoice-billed accounts)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  📊 Step 3 — Usage Investigation &lt;em&gt;(for high-bill tickets)&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Open &lt;strong&gt;Usage dashboard&lt;/strong&gt; for the billing period in question&lt;/li&gt;
&lt;li&gt;[ ] Look for a &lt;strong&gt;single-day spike&lt;/strong&gt; — note the date&lt;/li&gt;
&lt;li&gt;[ ] Filter by &lt;strong&gt;model&lt;/strong&gt; — did usage shift to a more expensive model mid-month?&lt;/li&gt;
&lt;li&gt;[ ] Check &lt;strong&gt;tokens per request&lt;/strong&gt; — high count = context bloat&lt;/li&gt;
&lt;li&gt;[ ] Confirm &lt;strong&gt;usage dashboard total matches invoice total&lt;/strong&gt; — discrepancy? Escalate with both figures&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🔒 Step 4 — Account Status Check &lt;em&gt;(for 401 / suspension tickets)&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Can the customer &lt;strong&gt;log into the provider dashboard&lt;/strong&gt;? Login works but API fails = key issue, not suspension&lt;/li&gt;
&lt;li&gt;[ ] Did the customer receive a &lt;strong&gt;suspension email&lt;/strong&gt;? Policy violation? Fraud flag? Outstanding balance?&lt;/li&gt;
&lt;li&gt;[ ] Verify the &lt;strong&gt;API key is organization-level&lt;/strong&gt;, not a personal key from a departed team member&lt;/li&gt;
&lt;li&gt;[ ] For suspension: &lt;strong&gt;route to Trust &amp;amp; Safety&lt;/strong&gt; — do NOT reinstate at support level&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  📋 Step 5 — Resolution &amp;amp; Close-out
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Confirm &lt;strong&gt;API is working again&lt;/strong&gt; before closing the ticket&lt;/li&gt;
&lt;li&gt;[ ] Send the appropriate &lt;strong&gt;post-resolution recommendation&lt;/strong&gt; based on root cause&lt;/li&gt;
&lt;li&gt;[ ] Add &lt;strong&gt;case notes&lt;/strong&gt;: incident type, root cause, fix applied, recommendation given&lt;/li&gt;
&lt;li&gt;[ ] If escalated: confirm &lt;strong&gt;escalation was received&lt;/strong&gt; with a follow-up timeline set for the customer&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚠️ Always — Safety &amp;amp; Escalation Rules
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Never ask for a full API key&lt;/strong&gt; — if the customer sends one, tell them to rotate it immediately&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Never promise a refund&lt;/strong&gt; — only the billing team can approve credits&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Never reinstate a suspended account&lt;/strong&gt; at the support level — all reinstatements go through Trust &amp;amp; Safety&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;A general troubleshooting reference for support engineers working with AI API providers. Patterns apply across providers — OpenAI, Anthropic, Google, Cohere, and others follow similar billing models.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>billing</category>
      <category>support</category>
    </item>
    <item>
      <title>API Rate Limits &amp; Throttling: What's Actually Happening and How to Fix It</title>
      <dc:creator>Sindhu Murthy</dc:creator>
      <pubDate>Tue, 17 Feb 2026 19:56:07 +0000</pubDate>
      <link>https://dev.to/sindhu_murthy_628835a359d/api-rate-limits-throttling-whats-actually-happening-and-how-to-fix-it-4gk5</link>
      <guid>https://dev.to/sindhu_murthy_628835a359d/api-rate-limits-throttling-whats-actually-happening-and-how-to-fix-it-4gk5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Rate limiting is the #1 reason AI API calls fail in production. It's not a bug — it's the provider protecting their infrastructure. This guide explains what's happening, how to read the signals, and how to stop it from breaking your app.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;Your app has been running fine for weeks. Then on a Monday morning, users start seeing errors. Not everyone — just some. The errors come and go. Sometimes the same question works on the second try.&lt;/p&gt;

&lt;p&gt;Your logs are full of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP 429 — Too Many Requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're being rate limited. And if you handle it wrong, you'll make it worse.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Rate Limiting?
&lt;/h2&gt;

&lt;p&gt;Think of a highway on-ramp with a traffic light. When too many cars try to merge at once, the light turns red and lets them through one at a time. Nobody's banned from the highway — they just have to wait their turn.&lt;/p&gt;

&lt;p&gt;AI providers (OpenAI, Anthropic, Google) work the same way. When too many requests come in, they start telling some customers: &lt;strong&gt;"Slow down."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's a rate limit. It's not an error in your code. It's the provider saying: "I can handle your request, just not right now."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum number of requests allowed in a time window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throttling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The provider actively slowing down or rejecting your requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429 status code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The HTTP response that means "too many requests"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quota&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your total allocation (per minute, per day, or per month)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Three Types of Rate Limits
&lt;/h2&gt;

&lt;p&gt;Most people think there's one rate limit. There are actually three, and they trigger independently.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What It Limits&lt;/th&gt;
&lt;th&gt;Example Limit&lt;/th&gt;
&lt;th&gt;How You Hit It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requests per minute (RPM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Number of API calls&lt;/td&gt;
&lt;td&gt;60 RPM&lt;/td&gt;
&lt;td&gt;Sending too many questions, even short ones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tokens per minute (TPM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total tokens processed&lt;/td&gt;
&lt;td&gt;90,000 TPM&lt;/td&gt;
&lt;td&gt;Sending fewer requests, but each one is huge (long documents, big prompts)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tokens per day (TPD)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Daily token budget&lt;/td&gt;
&lt;td&gt;1,000,000 TPD&lt;/td&gt;
&lt;td&gt;Sustained high usage over hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; You can hit TPM while staying under RPM. A single request with a 50,000-token document eats more than half your minute's budget. You only sent one request — but you're already throttled. Always check your provider's current documentation for exact limits — they change frequently and vary by tier.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How to Read a 429 Error
&lt;/h2&gt;

&lt;p&gt;When you get rate limited, the provider doesn't just say "no." They tell you &lt;strong&gt;when to try again&lt;/strong&gt;. Most people ignore this information.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Response Headers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP/1.1 429 Too Many Requests
retry-after: 2
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 12s
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 0
x-ratelimit-reset-tokens: 28s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Header&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retry-after&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seconds to wait before trying again. &lt;strong&gt;Use this number.&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-limit-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your RPM cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-remaining-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many requests you have left this window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-reset-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When your request limit resets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-limit-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your TPM cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-remaining-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many tokens you have left this window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-reset-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When your token limit resets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You get a 429. The &lt;code&gt;retry-after&lt;/code&gt; header says &lt;code&gt;2&lt;/code&gt;. That means: wait 2 seconds and try again. Not 0 seconds. Not 30 seconds. Exactly 2. The provider is literally telling you the answer.&lt;/p&gt;
&lt;/blockquote&gt;
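&lt;p&gt;A minimal sketch of pulling these values out of a response's headers. It assumes the header names shown above, which vary by provider, so check your provider's current documentation:&lt;/p&gt;

```python
def parse_rate_limit_headers(headers):
    """Extract retry guidance from a 429 response (header names vary by provider)."""
    return {
        "retry_after_s": float(headers.get("retry-after", 1)),
        "remaining_requests": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "remaining_tokens": int(headers.get("x-ratelimit-remaining-tokens", 0)),
    }

# Using the headers from the response above:
info = parse_rate_limit_headers({
    "retry-after": "2",
    "x-ratelimit-remaining-requests": "0",
    "x-ratelimit-remaining-tokens": "0",
})
```

&lt;p&gt;&lt;code&gt;info["retry_after_s"]&lt;/code&gt; is the number to feed into your retry logic.&lt;/p&gt;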




&lt;h2&gt;
  
  
  Status Codes: Which Errors to Retry
&lt;/h2&gt;

&lt;p&gt;Not every error is a rate limit. Here's the simple rule:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Retry?&lt;/th&gt;
&lt;th&gt;What to Do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Too Many Requests&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wait and retry with backoff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Server Error&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Once&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Try once more, then check the provider's status page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;503&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service Unavailable&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Provider is overloaded — wait and retry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;400&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bad Request&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your request is malformed — fix your code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;401&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unauthorized&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your API key is invalid or expired — fix it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;403&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Forbidden&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your key doesn't have permission for this model or action&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The key rule:&lt;/strong&gt; Only retry on 429, 500, and 503. Everything else means something is wrong on your end — retrying won't help.&lt;/p&gt;
&lt;/blockquote&gt;
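&lt;p&gt;The rule fits in a few lines. A sketch, assuming the status codes in the table above:&lt;/p&gt;

```python
RETRYABLE = {429, 500, 503}  # wait-and-retry; everything else means fix your request

def should_retry(status_code):
    """Apply the table's rule: only retry when the provider might succeed later."""
    return status_code in RETRYABLE
```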




&lt;h2&gt;
  
  
  The Retry Problem (And Why Most Teams Make It Worse)
&lt;/h2&gt;

&lt;p&gt;Here's what happens when teams don't handle rate limits properly:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Retry Storm
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request fails (429)
  → Code immediately retries
    → Also fails (429) — still in the same window
      → Code retries again
        → Also fails
          → 3 users are now each retrying 5 times
            → 15 requests where there were 3
              → Rate limit is now 5x worse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is called a &lt;strong&gt;retry storm&lt;/strong&gt;. Your retry logic is creating more traffic, which causes more 429s, which causes more retries. It's a death spiral.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Retry Approach&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No retry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User sees an error&lt;/td&gt;
&lt;td&gt;Bad UX, but no damage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Immediate retry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same request hits the same limit&lt;/td&gt;
&lt;td&gt;Retry storm — makes it worse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Fixed delay&lt;/strong&gt; (wait 1s every time)&lt;/td&gt;
&lt;td&gt;All retries fire at the same time&lt;/td&gt;
&lt;td&gt;Thundering herd — same problem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exponential backoff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wait 1s, 2s, 4s, 8s&lt;/td&gt;
&lt;td&gt;Spreads load, gives limits time to reset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exponential backoff + jitter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same as above + random 0-1s added&lt;/td&gt;
&lt;td&gt;Prevents synchronized retries across users&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Right Way: Exponential Backoff with Jitter
&lt;/h2&gt;

&lt;p&gt;Instead of retrying immediately (which makes things worse), wait a little longer each time:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;First retry:&lt;/strong&gt; wait ~1 second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Second retry:&lt;/strong&gt; wait ~2 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Third retry:&lt;/strong&gt; wait ~4 seconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep doubling&lt;/strong&gt; up to a max of 5 retries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Still failing?&lt;/strong&gt; Stop and show the user a helpful error&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Add a small random delay ("jitter") to each wait so that multiple users don't all retry at the exact same moment.&lt;/p&gt;

&lt;p&gt;That's it. Double the wait each time, add a pinch of randomness, and give up after 5 tries.&lt;/p&gt;
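&lt;p&gt;Those five steps can be sketched in Python. Here &lt;code&gt;request_fn&lt;/code&gt; is a hypothetical stand-in for your real API call:&lt;/p&gt;

```python
import random
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0):
    """Retry on 429/500/503 with exponential backoff plus jitter.

    request_fn is any zero-argument callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = request_fn()
        if status not in (429, 500, 503):
            return status, body  # success or a non-retryable error
        if attempt == max_retries:
            break  # give up and surface the error to the caller
        # 1s, 2s, 4s, 8s... plus up to base_delay of random jitter
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        time.sleep(delay)
    return status, body
```

&lt;p&gt;If the response includes a &lt;code&gt;retry-after&lt;/code&gt; header, prefer that value over the computed delay.&lt;/p&gt;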




&lt;h2&gt;
  
  
  Preventing Rate Limits Before They Happen
&lt;/h2&gt;

&lt;p&gt;Three strategies, in order of impact:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Request Queuing
&lt;/h3&gt;

&lt;p&gt;Without a queue, every user hits the API directly. With a queue, your app controls the flow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITHOUT QUEUE:
  User A ──→ API
  User B ──→ API     →  100 simultaneous calls  →  429s
  User C ──→ API
  ...
  User Z ──→ API

WITH QUEUE:
  User A ──┐
  User B ──┤
  User C ──┼──→ Queue ──→ 10 requests/sec ──→ API  →  No 429s
  ...      │
  User Z ──┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users A and B get instant responses. User Z waits a few seconds. Nobody gets an error. The queue absorbs the traffic spike and releases it at a rate the API can handle.&lt;/p&gt;
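&lt;p&gt;A minimal in-process sketch of this idea using Python's standard library. A production system would more likely use a shared queue (Redis, SQS) and a proper worker pool, but the principle is the same: one consumer paces the calls.&lt;/p&gt;

```python
import queue
import threading
import time

class PacedQueue:
    """Release queued jobs at a fixed rate so bursts never hit the API all at once."""

    def __init__(self, per_second):
        self.interval = 1.0 / per_second
        self.jobs = queue.Queue()      # FIFO: users are served in arrival order
        self.results = []
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, fn):
        self.jobs.put(fn)

    def _drain(self):
        while True:
            fn = self.jobs.get()
            self.results.append(fn())  # one call...
            time.sleep(self.interval)  # ...then wait before releasing the next
```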

&lt;h3&gt;
  
  
  2. Caching
&lt;/h3&gt;

&lt;p&gt;If 200 users ask "How do I reset my password?" in one day — why call the API 200 times?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exact match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same question → cached answer&lt;/td&gt;
&lt;td&gt;FAQs, common queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic cache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Similar questions → cached answer&lt;/td&gt;
&lt;td&gt;Support bots, knowledge bases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTL-based&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cache expires after X minutes&lt;/td&gt;
&lt;td&gt;Data that changes periodically&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; 200 identical questions per day. Without cache: 200 API calls. With cache: 1 API call + 199 cache hits. Rate limit usage drops by 99.5%.&lt;/p&gt;
&lt;/blockquote&gt;
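&lt;p&gt;A minimal exact-match cache with a TTL. This is a sketch; &lt;code&gt;api_call&lt;/code&gt; stands in for your real provider call:&lt;/p&gt;

```python
import time

class TTLCache:
    """Exact-match cache: identical questions reuse one answer until the TTL expires."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self.store = {}  # question -> (answer, expiry_time)

    def get_or_call(self, question, api_call):
        hit = self.store.get(question)
        if hit and hit[1] > time.time():
            return hit[0]                # cache hit: no API call, no rate-limit cost
        answer = api_call(question)      # cache miss: one real call
        self.store[question] = (answer, time.time() + self.ttl)
        return answer
```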

&lt;h3&gt;
  
  
  3. Smaller Prompts
&lt;/h3&gt;

&lt;p&gt;TPM limits are about total tokens. A 10,000-token request eats 100x more budget than a 100-token request.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Token Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Send only relevant chunks, not full documents&lt;/td&gt;
&lt;td&gt;30-60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shorter system prompts&lt;/td&gt;
&lt;td&gt;10-20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarize long docs with a cheap model first&lt;/td&gt;
&lt;td&gt;50-70%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Monitoring: What to Watch
&lt;/h2&gt;

&lt;p&gt;Don't wait for users to report 429s. Watch these numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Warning&lt;/th&gt;
&lt;th&gt;Critical&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RPM usage %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of limit&lt;/td&gt;
&lt;td&gt;90% of limit&lt;/td&gt;
&lt;td&gt;Enable queuing or caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TPM usage %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of limit&lt;/td&gt;
&lt;td&gt;90% of limit&lt;/td&gt;
&lt;td&gt;Optimize prompt sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429 count/hour&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;10+ per hour&lt;/td&gt;
&lt;td&gt;Check for retry storms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retry rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5% of requests&lt;/td&gt;
&lt;td&gt;15% of requests&lt;/td&gt;
&lt;td&gt;Backoff isn't aggressive enough&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P95 response time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 seconds&lt;/td&gt;
&lt;td&gt;15 seconds&lt;/td&gt;
&lt;td&gt;Rate limit delays hitting UX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daily token spend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of TPD&lt;/td&gt;
&lt;td&gt;90% of TPD&lt;/td&gt;
&lt;td&gt;Will run out of daily quota&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
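&lt;p&gt;The warning/critical thresholds above reduce to a simple check you can run against the &lt;code&gt;x-ratelimit-*&lt;/code&gt; values from each response. A sketch:&lt;/p&gt;

```python
def usage_alert(used, limit):
    """Map utilization onto the table's thresholds: 70% warning, 90% critical."""
    pct = used / limit * 100
    if pct >= 90:
        return "critical"
    if pct >= 70:
        return "warning"
    return None  # healthy: no alert
```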




&lt;h2&gt;
  
  
  Enterprise: The Noisy Neighbor Problem
&lt;/h2&gt;

&lt;p&gt;One enterprise customer runs a batch job — 500 requests in a minute. Your shared API key gets rate limited. Now &lt;strong&gt;every&lt;/strong&gt; customer is affected.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One customer blocks everyone&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Per-tenant rate limiting&lt;/strong&gt; — your app enforces limits per customer before hitting the API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time chat delayed by batch jobs&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Priority queues&lt;/strong&gt; — chat requests go before batch jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared key runs out of quota&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Separate API keys&lt;/strong&gt; — different keys for different customers or use cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unpredictable usage spikes&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Batch vs. real-time separation&lt;/strong&gt; — batch jobs use a different key with lower priority&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
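&lt;p&gt;Per-tenant rate limiting is typically a token bucket keyed by tenant, enforced in your app before the request reaches the shared API key. A minimal sketch (the rates and burst sizes are made-up examples):&lt;/p&gt;

```python
import time
from collections import defaultdict

class TenantLimiter:
    """Per-tenant token bucket: a noisy tenant drains its own bucket, not the shared key."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = defaultdict(
            lambda: {"tokens": float(burst), "last": time.monotonic()}
        )

    def allow(self, tenant_id):
        b = self.buckets[tenant_id]
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at the burst size
        b["tokens"] = min(self.burst, b["tokens"] + (now - b["last"]) * self.rate)
        b["last"] = now
        if b["tokens"] >= 1:
            b["tokens"] -= 1
            return True
        return False  # this tenant must wait; other tenants are unaffected
```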




&lt;h2&gt;
  
  
  Troubleshooting Checklist
&lt;/h2&gt;

&lt;p&gt;When 429s start showing up, work through this in order:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What to Check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;x-ratelimit-remaining-requests&lt;/code&gt; and &lt;code&gt;x-ratelimit-remaining-tokens&lt;/code&gt; — which limit did you hit?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Is it RPM or TPM? Too many requests or too many tokens per request?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Check for retry storms — is your retry count multiplying the problem?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;retry-after&lt;/code&gt; header — are you waiting the recommended time?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Check if one user or tenant is consuming disproportionate quota&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Check prompt sizes — did someone add a huge system prompt or send large documents?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Check for duplicate requests — is the frontend sending the same request multiple times?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Check your tier — did you recently exceed a billing threshold that changes your limits?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Check provider status page — is the provider having capacity issues?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Check time of day — peak hours (US business hours) have tighter effective limits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Common Patterns Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;429s for everyone at once&lt;/td&gt;
&lt;td&gt;Shared rate limit exhausted&lt;/td&gt;
&lt;td&gt;Per-tenant limits or request queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s for one customer only&lt;/td&gt;
&lt;td&gt;That customer is sending too much&lt;/td&gt;
&lt;td&gt;Per-customer throttling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s only during peak hours&lt;/td&gt;
&lt;td&gt;Hitting RPM at high traffic times&lt;/td&gt;
&lt;td&gt;Queue + cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s after deploying new feature&lt;/td&gt;
&lt;td&gt;New feature sends more or larger requests&lt;/td&gt;
&lt;td&gt;Audit token usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s that get worse over time&lt;/td&gt;
&lt;td&gt;Retry storm&lt;/td&gt;
&lt;td&gt;Exponential backoff + jitter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s on token limit but low RPM&lt;/td&gt;
&lt;td&gt;Sending very large prompts&lt;/td&gt;
&lt;td&gt;Reduce context and prompt size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermittent 429s, no pattern&lt;/td&gt;
&lt;td&gt;Hovering near the limit&lt;/td&gt;
&lt;td&gt;Add 20% buffer below your limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s after a billing change&lt;/td&gt;
&lt;td&gt;Tier downgrade reduced limits&lt;/td&gt;
&lt;td&gt;Check provider dashboard for current tier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Rate limits aren't bugs. They're a feature of every AI API. The difference between a junior and a senior engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Junior:&lt;/strong&gt; "The API is broken, it keeps returning errors."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Senior:&lt;/strong&gt; "We're hitting our TPM limit during peak hours. I'm adding a request queue with exponential backoff and caching frequent queries. That should keep us under 70% utilization."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Know your limits. Monitor your usage. Retry smart, not fast. And when in doubt, check the headers — the answer is usually right there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>API Rate Limits &amp; Throttling: What's Actually Happening and How to Fix It</title>
      <dc:creator>Sindhu Murthy</dc:creator>
      <pubDate>Tue, 17 Feb 2026 06:10:40 +0000</pubDate>
      <link>https://dev.to/sindhu_murthy_628835a359d/api-rate-limits-throttling-whats-actually-happening-and-how-to-fix-it-2lc3</link>
      <guid>https://dev.to/sindhu_murthy_628835a359d/api-rate-limits-throttling-whats-actually-happening-and-how-to-fix-it-2lc3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Rate limiting is the #1 reason AI API calls fail in production. It's not a bug — it's the provider protecting their infrastructure. This guide explains what's happening, how to read the signals, and how to stop it from breaking your app.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;Your app has been running fine for weeks. Then on a Monday morning, users start seeing errors. Not everyone — just some. The errors come and go. Sometimes the same question works on the second try.&lt;/p&gt;

&lt;p&gt;Your logs are full of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP 429 — Too Many Requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're being rate limited. And if you handle it wrong, you'll make it worse.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Rate Limiting?
&lt;/h2&gt;

&lt;p&gt;Imagine a restaurant with 10 tables. You can't seat 50 people at once — you ask some to wait.&lt;/p&gt;

&lt;p&gt;AI providers (OpenAI, Anthropic, Google) do the same thing. Their servers have capacity limits. When too many requests come in, they start telling some customers: &lt;strong&gt;"Slow down."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's a rate limit. It's not an error in your code. It's the provider saying: "I can handle your request, just not right now."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum number of requests allowed in a time window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Throttling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The provider actively slowing down or rejecting your requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429 status code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The HTTP response that means "too many requests"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Quota&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Your total allocation (per minute, per day, or per month)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Three Types of Rate Limits
&lt;/h2&gt;

&lt;p&gt;Most people think there's one rate limit. There are actually three, and they trigger independently.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What It Limits&lt;/th&gt;
&lt;th&gt;Example Limit&lt;/th&gt;
&lt;th&gt;How You Hit It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Requests per minute (RPM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Number of API calls&lt;/td&gt;
&lt;td&gt;60 RPM&lt;/td&gt;
&lt;td&gt;Sending too many questions, even short ones&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tokens per minute (TPM)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Total tokens processed&lt;/td&gt;
&lt;td&gt;90,000 TPM&lt;/td&gt;
&lt;td&gt;Sending fewer requests, but each one is huge (long documents, big prompts)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tokens per day (TPD)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Daily token budget&lt;/td&gt;
&lt;td&gt;1,000,000 TPD&lt;/td&gt;
&lt;td&gt;Sustained high usage over hours&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; You can hit TPM while staying under RPM. A single request with a 50,000-token document eats more than half your minute's budget. You only sent one request — but you're already throttled. Always check your provider's current documentation for exact limits — they change frequently and vary by tier.&lt;/p&gt;
&lt;/blockquote&gt;
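The arithmetic above is worth making concrete. A minimal sketch, assuming a 90,000 TPM limit (the function name and figures are illustrative, not any provider's API):

```python
def tpm_utilization(tokens_this_minute: int, tpm_limit: int) -> float:
    """Fraction of the per-minute token budget already consumed."""
    return tokens_this_minute / tpm_limit

# One 50,000-token request against a 90,000 TPM limit:
used = tpm_utilization(50_000, 90_000)
print(f"{used:.0%} of the minute's budget gone after a single request")  # → 56%
```

One request, and more than half the window is spent before any other traffic arrives.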




&lt;h2&gt;
  
  
  How to Read a 429 Error
&lt;/h2&gt;

&lt;p&gt;When you get rate limited, the provider doesn't just say "no." They tell you &lt;strong&gt;when to try again&lt;/strong&gt;. Most people ignore this information.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Response Headers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP/1.1 429 Too Many Requests
retry-after: 2
x-ratelimit-limit-requests: 60
x-ratelimit-remaining-requests: 0
x-ratelimit-reset-requests: 12s
x-ratelimit-limit-tokens: 90000
x-ratelimit-remaining-tokens: 0
x-ratelimit-reset-tokens: 28s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Header&lt;/th&gt;
&lt;th&gt;What It Tells You&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retry-after&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seconds to wait before trying again. &lt;strong&gt;Use this number.&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-limit-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your RPM cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-remaining-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many requests you have left this window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-reset-requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When your request limit resets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-limit-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Your TPM cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-remaining-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;How many tokens you have left this window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;x-ratelimit-reset-tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When your token limit resets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; You get a 429. The &lt;code&gt;retry-after&lt;/code&gt; header says &lt;code&gt;2&lt;/code&gt;. That means: wait 2 seconds and try again. Not 0 seconds. Not 30 seconds. Exactly 2. The provider is literally telling you the answer.&lt;/p&gt;
&lt;/blockquote&gt;
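Honoring these headers takes only a few lines. A sketch, assuming the headers arrive as a plain dict of lowercase strings and the reset values use the simple `"12s"` form shown above (the header names are real; the helper itself is illustrative):

```python
def seconds_to_wait(headers: dict) -> float:
    """Prefer the explicit retry-after hint; fall back to the reset timers."""
    if "retry-after" in headers:
        return float(headers["retry-after"])
    # Reset headers look like "12s"; wait for the longer of the two windows.
    resets = []
    for name in ("x-ratelimit-reset-requests", "x-ratelimit-reset-tokens"):
        value = headers.get(name)
        if value and value.endswith("s"):
            resets.append(float(value[:-1]))
    return max(resets, default=1.0)

headers = {
    "retry-after": "2",
    "x-ratelimit-reset-requests": "12s",
    "x-ratelimit-reset-tokens": "28s",
}
print(seconds_to_wait(headers))  # → 2.0
```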




&lt;h2&gt;
  
  
  Status Codes: What Each One Means and When to Retry
&lt;/h2&gt;

&lt;p&gt;Not every error is a rate limit. Different status codes mean different things — and some should never be retried.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retryable Errors
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Retry?&lt;/th&gt;
&lt;th&gt;Real-World Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Too Many Requests&lt;/td&gt;
&lt;td&gt;Yes, with backoff&lt;/td&gt;
&lt;td&gt;Your app sends 80 requests in a minute. Your limit is 60 RPM. Requests 61-80 all come back as 429. Wait for the window to reset.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;503&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service Unavailable&lt;/td&gt;
&lt;td&gt;Yes, with backoff&lt;/td&gt;
&lt;td&gt;It's 2 PM EST on a Tuesday. OpenAI's GPT-4o is overloaded because every company in the US is using it. Your request gets a 503. Try again in a few seconds — or switch to a less busy model.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Internal Server Error&lt;/td&gt;
&lt;td&gt;Maybe once&lt;/td&gt;
&lt;td&gt;You send a perfectly valid request. The provider's server crashes mid-response. You get a 500 back. Try once more — if it fails again, it's their problem, not yours. Check the status page.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Non-Retryable Errors
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Retry?&lt;/th&gt;
&lt;th&gt;Real-World Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;400&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bad Request&lt;/td&gt;
&lt;td&gt;No — fix your code&lt;/td&gt;
&lt;td&gt;You set &lt;code&gt;temperature&lt;/code&gt; to &lt;code&gt;2.5&lt;/code&gt; but the max allowed is &lt;code&gt;2.0&lt;/code&gt;. Or you send &lt;code&gt;max_tokens: -1&lt;/code&gt;. Or your JSON body is malformed. The API can't understand what you're asking for. Retrying the same bad request will get the same error every time.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;401&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unauthorized&lt;/td&gt;
&lt;td&gt;No — fix your key&lt;/td&gt;
&lt;td&gt;Your API key is &lt;code&gt;sk-abc123...&lt;/code&gt; but it expired last week. Or someone rotated the key and didn't update the environment variable. Or you're sending the key in the wrong header. No amount of retrying will make an invalid key valid.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;403&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Forbidden&lt;/td&gt;
&lt;td&gt;No — fix permissions&lt;/td&gt;
&lt;td&gt;Your API key is valid, but it only has access to GPT-4o-mini. You're trying to call GPT-4o. Or your organization has a policy that blocks certain models. The key works — it just doesn't have permission for what you're asking.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The key rule:&lt;/strong&gt; Only retry on 429 and 503. A 400 means your request is broken. A 401 means your key is wrong. A 403 means you don't have permission. Waiting and retrying won't fix any of those.&lt;/p&gt;
&lt;/blockquote&gt;
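The key rule can be encoded as a small guard before any retry loop. An illustrative sketch; the single-retry allowance for 500 follows the "maybe once" guidance above:

```python
def should_retry(status: int, attempt: int) -> bool:
    """429/503: retryable. 500: retry a single time. Everything else: fix the request."""
    if status in (429, 503):
        return True
    if status == 500:
        return attempt == 1   # one cautious retry, then check the status page
    return False              # 400/401/403: retrying cannot help

print(should_retry(429, 3))  # → True
print(should_retry(401, 1))  # → False
```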




&lt;h2&gt;
  
  
  The Retry Problem (And Why Most Teams Make It Worse)
&lt;/h2&gt;

&lt;p&gt;Here's what happens when teams don't handle rate limits properly:&lt;/p&gt;

&lt;h3&gt;
  
  
  The Retry Storm
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request fails (429)
  → Code immediately retries
    → Also fails (429) — still in the same window
      → Code retries again
        → Also fails
          → 3 users are now each retrying 5 times
            → 15 requests where there were 3
              → Rate limit is now 5x worse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is called a &lt;strong&gt;retry storm&lt;/strong&gt;. Your retry logic is creating more traffic, which causes more 429s, which causes more retries. It's a death spiral.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Retry Approach&lt;/th&gt;
&lt;th&gt;What Happens&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No retry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User sees an error&lt;/td&gt;
&lt;td&gt;Bad UX, but no damage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Immediate retry&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same request hits the same limit&lt;/td&gt;
&lt;td&gt;Retry storm — makes it worse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Fixed delay&lt;/strong&gt; (wait 1s every time)&lt;/td&gt;
&lt;td&gt;All retries fire at the same time&lt;/td&gt;
&lt;td&gt;Thundering herd — same problem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exponential backoff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wait 1s, 2s, 4s, 8s&lt;/td&gt;
&lt;td&gt;Spreads load, gives limits time to reset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exponential backoff + jitter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same as above + random 0-1s added&lt;/td&gt;
&lt;td&gt;Prevents synchronized retries across users&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Right Way: Exponential Backoff with Jitter
&lt;/h2&gt;

&lt;p&gt;This is the industry standard. Every provider recommends it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Retry #&lt;/th&gt;
&lt;th&gt;Base Wait&lt;/th&gt;
&lt;th&gt;With Jitter (random 0-1s)&lt;/th&gt;
&lt;th&gt;Total Wait From First Request&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1st&lt;/td&gt;
&lt;td&gt;1 second&lt;/td&gt;
&lt;td&gt;1.0 - 2.0s&lt;/td&gt;
&lt;td&gt;~1.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2nd&lt;/td&gt;
&lt;td&gt;2 seconds&lt;/td&gt;
&lt;td&gt;2.0 - 3.0s&lt;/td&gt;
&lt;td&gt;~4s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3rd&lt;/td&gt;
&lt;td&gt;4 seconds&lt;/td&gt;
&lt;td&gt;4.0 - 5.0s&lt;/td&gt;
&lt;td&gt;~8.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4th&lt;/td&gt;
&lt;td&gt;8 seconds&lt;/td&gt;
&lt;td&gt;8.0 - 9.0s&lt;/td&gt;
&lt;td&gt;~17s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5th&lt;/td&gt;
&lt;td&gt;16 seconds&lt;/td&gt;
&lt;td&gt;16.0 - 17.0s&lt;/td&gt;
&lt;td&gt;~34s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Give up&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Show user a helpful error&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Logic in Plain English
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;attempt = 1
max_retries = 5

while attempt &amp;lt;= max_retries:
    response = call_api()

    if response.status == 200:
        return response        # Success — done

    if response.status == 429:
        wait = (2 ^ attempt) + random(0, 1)    # Exponential + jitter
        sleep(wait)
        attempt += 1

    else:
        raise error            # Not a rate limit — don't retry

show_user("Service is busy, please try again in a minute")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
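The same logic as runnable Python. This is a sketch under stated assumptions: `call_api` is a stand-in returning `(status, body)` that you would replace with your real client, and `sleep` is injectable so the loop can be exercised without actually waiting:

```python
import random
import time

def call_with_backoff(call_api, max_retries=5, base=1.0, sleep=time.sleep):
    """Retry 429/503 with exponential backoff plus 0-1s of jitter."""
    for attempt in range(max_retries):
        status, body = call_api()
        if status == 200:
            return body
        if status in (429, 503):
            # Base waits of 1s, 2s, 4s, 8s, 16s; jitter de-synchronizes clients
            sleep(base * (2 ** attempt) + random.uniform(0, 1))
        else:
            raise RuntimeError(f"non-retryable status {status}")
    raise RuntimeError("still rate limited after retries; surface a friendly error")

# Fake API: fails twice with 429, then succeeds.
responses = iter([(429, None), (429, None), (200, "answer")])
waits = []
result = call_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)  # "answer", plus the two recorded backoff waits
```

Injecting `sleep` also makes the retry behavior unit-testable: record the waits instead of sleeping and assert they grow.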






&lt;h2&gt;
  
  
  Preventing Rate Limits Before They Happen
&lt;/h2&gt;

&lt;p&gt;Three strategies, in order of impact:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Request Queuing
&lt;/h3&gt;

&lt;p&gt;Without a queue, every user hits the API directly. With a queue, your app controls the flow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITHOUT QUEUE:
  User A ──→ API
  User B ──→ API     →  100 simultaneous calls  →  429s
  User C ──→ API
  ...
  User Z ──→ API

WITH QUEUE:
  User A ──┐
  User B ──┤
  User C ──┼──→ Queue ──→ 10 requests/sec ──→ API  →  No 429s
  ...      │
  User Z ──┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users A and B get instant responses. User Z waits a few seconds. Nobody gets an error. The queue absorbs the traffic spike and releases it at a rate the API can handle.&lt;/p&gt;
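One common way to implement that "release at a steady rate" behavior is a token bucket. A deterministic sketch with the clock passed in as a parameter (the class name and the 10 req/s figure are illustrative):

```python
class TokenBucket:
    """Allows bursts up to `capacity`, then refills at `rate` tokens per second."""
    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller should queue the request, not drop it

bucket = TokenBucket(rate=10, capacity=10)
burst = [bucket.allow(0.0) for _ in range(12)]
print(burst.count(True))   # → 10 (2 requests must wait in the queue)
```

Requests that get `False` go into the queue and are retried as tokens refill, instead of hitting the API and collecting 429s.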

&lt;h3&gt;
  
  
  2. Caching
&lt;/h3&gt;

&lt;p&gt;If 200 users ask "How do I reset my password?" in one day — why call the API 200 times?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Exact match&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same question → cached answer&lt;/td&gt;
&lt;td&gt;FAQs, common queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic cache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Similar questions → cached answer&lt;/td&gt;
&lt;td&gt;Support bots, knowledge bases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TTL-based&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cache expires after X minutes&lt;/td&gt;
&lt;td&gt;Data that changes periodically&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; 200 identical questions per day. Without cache: 200 API calls. With cache: 1 API call + 199 cache hits. Rate limit usage drops by 99.5%.&lt;/p&gt;
&lt;/blockquote&gt;
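An exact-match cache with a TTL is only a few lines. A sketch only: `ask_llm` is a placeholder for the real API call, and the clock is injectable so expiry can be tested:

```python
import time

class TTLCache:
    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.store = {}   # key -> (value, expiry)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if self.clock() >= expiry:
            del self.store[key]   # stale: evict and treat as a miss
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, self.clock() + self.ttl)

calls = 0
def ask_llm(question):            # placeholder for the real API call
    global calls
    calls += 1
    return f"answer to: {question}"

cache = TTLCache(ttl=300)
def answer(question):
    cached = cache.get(question)
    if cached is not None:
        return cached
    result = ask_llm(question)
    cache.set(question, result)
    return result

for _ in range(200):
    answer("How do I reset my password?")
print(calls)  # → 1 (one API call, 199 cache hits)
```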

&lt;h3&gt;
  
  
  3. Smaller Prompts
&lt;/h3&gt;

&lt;p&gt;TPM limits are about total tokens. A 10,000-token request eats 100x more budget than a 100-token request.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Token Savings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Send only relevant chunks, not full documents&lt;/td&gt;
&lt;td&gt;30-60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shorter system prompts&lt;/td&gt;
&lt;td&gt;10-20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarize long docs with a cheap model first&lt;/td&gt;
&lt;td&gt;50-70%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Monitoring: What to Watch
&lt;/h2&gt;

&lt;p&gt;Don't wait for users to report 429s. Watch these numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Warning&lt;/th&gt;
&lt;th&gt;Critical&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RPM usage %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of limit&lt;/td&gt;
&lt;td&gt;90% of limit&lt;/td&gt;
&lt;td&gt;Enable queuing or caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TPM usage %&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of limit&lt;/td&gt;
&lt;td&gt;90% of limit&lt;/td&gt;
&lt;td&gt;Optimize prompt sizes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429 count/hour&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;10+ per hour&lt;/td&gt;
&lt;td&gt;Check for retry storms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retry rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5% of requests&lt;/td&gt;
&lt;td&gt;15% of requests&lt;/td&gt;
&lt;td&gt;Backoff isn't aggressive enough&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;P95 response time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5 seconds&lt;/td&gt;
&lt;td&gt;15 seconds&lt;/td&gt;
&lt;td&gt;Rate limit delays hitting UX&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daily token spend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70% of TPD&lt;/td&gt;
&lt;td&gt;90% of TPD&lt;/td&gt;
&lt;td&gt;Will run out of daily quota&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
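The warning and critical thresholds in the table translate directly into an alerting check. An illustrative sketch (the 70%/90% values come from the table; the function name is made up):

```python
def usage_status(used: float, limit: float) -> str:
    """Classify utilization against the 70% warning / 90% critical thresholds."""
    pct = used / limit
    if pct >= 0.90:
        return "critical"
    if pct >= 0.70:
        return "warning"
    return "ok"

print(usage_status(55, 60))          # RPM at ~92% → "critical"
print(usage_status(45_000, 90_000))  # TPM at 50%  → "ok"
```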




&lt;h2&gt;
  
  
  Enterprise: The Noisy Neighbor Problem
&lt;/h2&gt;

&lt;p&gt;One enterprise customer runs a batch job — 500 requests in a minute. Your shared API key gets rate limited. Now &lt;strong&gt;every&lt;/strong&gt; customer is affected.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One customer blocks everyone&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Per-tenant rate limiting&lt;/strong&gt; — your app enforces limits per customer before hitting the API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time chat delayed by batch jobs&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Priority queues&lt;/strong&gt; — chat requests go before batch jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shared key runs out of quota&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Separate API keys&lt;/strong&gt; — different keys for different customers or use cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unpredictable usage spikes&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Batch vs. real-time separation&lt;/strong&gt; — batch jobs use a different key with lower priority&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
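Per-tenant limiting means your app tracks a window per customer and rejects before the shared key is exhausted. A deterministic sketch with injected timestamps (the class name and the 100 RPM figure are illustrative):

```python
from collections import defaultdict, deque

class PerTenantLimiter:
    """Sliding one-minute window of request timestamps, kept per tenant."""
    def __init__(self, rpm_per_tenant: int):
        self.rpm = rpm_per_tenant
        self.windows = defaultdict(deque)

    def allow(self, tenant: str, now: float) -> bool:
        window = self.windows[tenant]
        while window and now - window[0] >= 60.0:
            window.popleft()          # drop timestamps older than a minute
        if len(window) >= self.rpm:
            return False              # this tenant is throttled; others unaffected
        window.append(now)
        return True

limiter = PerTenantLimiter(rpm_per_tenant=100)
batch = [limiter.allow("big-customer", t * 0.1) for t in range(500)]
print(batch.count(True))                      # → 100 (big-customer is capped)
print(limiter.allow("small-customer", 0.0))   # → True (unaffected by the batch job)
```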




&lt;h2&gt;
  
  
  Troubleshooting Checklist
&lt;/h2&gt;

&lt;p&gt;When 429s start showing up, work through this in order:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What to Check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;x-ratelimit-remaining-requests&lt;/code&gt; and &lt;code&gt;x-ratelimit-remaining-tokens&lt;/code&gt; — which limit did you hit?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Is it RPM or TPM? Too many requests or too many tokens per request?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Check for retry storms — is your retry count multiplying the problem?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;retry-after&lt;/code&gt; header — are you waiting the recommended time?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Check if one user or tenant is consuming disproportionate quota&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Check prompt sizes — did someone add a huge system prompt or send large documents?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Check for duplicate requests — is the frontend sending the same request multiple times?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Check your tier — did you recently exceed a billing threshold that changes your limits?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Check provider status page — is the provider having capacity issues?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Check time of day — peak hours (US business hours) have tighter effective limits&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Common Patterns Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;429s for everyone at once&lt;/td&gt;
&lt;td&gt;Shared rate limit exhausted&lt;/td&gt;
&lt;td&gt;Per-tenant limits or request queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s for one customer only&lt;/td&gt;
&lt;td&gt;That customer is sending too much&lt;/td&gt;
&lt;td&gt;Per-customer throttling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s only during peak hours&lt;/td&gt;
&lt;td&gt;Hitting RPM at high traffic times&lt;/td&gt;
&lt;td&gt;Queue + cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s after deploying new feature&lt;/td&gt;
&lt;td&gt;New feature sends more or larger requests&lt;/td&gt;
&lt;td&gt;Audit token usage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s that get worse over time&lt;/td&gt;
&lt;td&gt;Retry storm&lt;/td&gt;
&lt;td&gt;Exponential backoff + jitter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s on token limit but low RPM&lt;/td&gt;
&lt;td&gt;Sending very large prompts&lt;/td&gt;
&lt;td&gt;Reduce context and prompt size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermittent 429s, no pattern&lt;/td&gt;
&lt;td&gt;Hovering near the limit&lt;/td&gt;
&lt;td&gt;Add 20% buffer below your limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429s after a billing change&lt;/td&gt;
&lt;td&gt;Tier downgrade reduced limits&lt;/td&gt;
&lt;td&gt;Check provider dashboard for current tier&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Rate limits aren't bugs. They're a feature of every AI API. The difference between a junior and senior engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Junior:&lt;/strong&gt; "The API is broken, it keeps returning errors."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Senior:&lt;/strong&gt; "We're hitting our TPM limit during peak hours. I'm adding a request queue with exponential backoff and caching frequent queries. That should keep us under 70% utilization."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Know your limits. Monitor your usage. Retry smart, not fast. And when in doubt, check the headers — the answer is usually right there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>devops</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to Troubleshoot RAG in Production: A Field Guide</title>
      <dc:creator>Sindhu Murthy</dc:creator>
      <pubDate>Mon, 16 Feb 2026 23:28:36 +0000</pubDate>
      <link>https://dev.to/sindhu_murthy_628835a359d/how-to-troubleshoot-rag-in-production-a-field-guide-6nb</link>
      <guid>https://dev.to/sindhu_murthy_628835a359d/how-to-troubleshoot-rag-in-production-a-field-guide-6nb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; RAG isn't one system — it's a pipeline with 6 stages. When something breaks, follow the data from start to finish. This guide shows you exactly which log fields to check at each stage and what they mean.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Scenario
&lt;/h2&gt;

&lt;p&gt;A customer messages you at 2 PM on a Tuesday:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The AI is giving wrong answers."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's it. No logs. No screenshots. Just vibes.&lt;/p&gt;

&lt;p&gt;You have 25 fields scattered across 6 pipeline stages, and somewhere in there is the answer. This guide tells you where to look.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline at a Glance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Query → Embedding → Retrieval → Context Assembly → LLM Call → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mistake most people make: they jump straight to the LLM. "Must be a model problem." It usually isn't. &lt;strong&gt;70% of RAG failures happen before the LLM is ever called&lt;/strong&gt; — in retrieval and context assembly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 1: The Query Comes In
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;request_id&lt;/code&gt; · &lt;code&gt;user_id&lt;/code&gt; · &lt;code&gt;timestamp&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Always start with &lt;code&gt;request_id&lt;/code&gt;.&lt;/strong&gt; This is your case number. Every other log field is useless without it because you can't tell which retrieval, which LLM call, which response belongs to this specific complaint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then check &lt;code&gt;user_id&lt;/code&gt;.&lt;/strong&gt; One user affected = their data or permissions. Hundreds of users at the same time = infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then check &lt;code&gt;timestamp&lt;/code&gt;.&lt;/strong&gt; Correlate with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recent deployments — did someone push a change?&lt;/li&gt;
&lt;li&gt;Known outages — is the LLM provider having issues?&lt;/li&gt;
&lt;li&gt;Batch jobs — did an embedding re-index just run?&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Customer says answers broke "recently." You check timestamps — every bad answer started at 3:47 AM, exactly when a cron job re-indexed the knowledge base with a new embedding model. Mystery solved in 30 seconds.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Stage 2: The Embedding Step
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;embedding_model&lt;/code&gt; · &lt;code&gt;embedding_latency_ms&lt;/code&gt; · &lt;code&gt;embedding_job_failed&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The user's question gets converted into a vector (a list of numbers) so it can be compared against your document vectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The silent killer:&lt;/strong&gt; If this step uses a &lt;strong&gt;different model&lt;/strong&gt; than what was used to index the documents, the vectors live in different mathematical spaces. It's like searching a Spanish library with a French dictionary. Nothing errors out — the results are just irrelevant.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;What to Look For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;embedding_model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Does it match the model used during indexing? If not, every search result is garbage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;embedding_latency_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Normal: 10-50ms. Above 2000ms: embedding service is struggling.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;embedding_job_failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If &lt;code&gt;true&lt;/code&gt;, the query never got embedded. The LLM is answering with zero context — it's guessing.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Search quality drops overnight. No deployments, no config changes. The team upgraded from &lt;code&gt;text-embedding-ada-002&lt;/code&gt; to &lt;code&gt;text-embedding-3-small&lt;/code&gt; for new queries, but stored document vectors are still from the old model. Fix: re-index all documents with the new model.&lt;/p&gt;
&lt;/blockquote&gt;
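A cheap guard against this silent mismatch: record the model used at index time and refuse to search with a different one. A sketch with hypothetical names:

```python
def check_embedding_config(query_model: str, index_model: str) -> None:
    """Fail loudly instead of silently returning irrelevant results."""
    if query_model != index_model:
        raise ValueError(
            f"query embedded with {query_model!r} but index was built with "
            f"{index_model!r}; re-index the documents or pin the query model"
        )

check_embedding_config("text-embedding-3-small", "text-embedding-3-small")  # fine
# check_embedding_config("text-embedding-3-small", "text-embedding-ada-002")
# would raise, turning a silent quality drop into a visible error
```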




&lt;h2&gt;
  
  
  Stage 3: The Retrieval Step
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;collection_name&lt;/code&gt; · &lt;code&gt;top_k&lt;/code&gt; · &lt;code&gt;chunk_size&lt;/code&gt; · &lt;code&gt;chunk_overlap&lt;/code&gt; · &lt;code&gt;retrieved_docs&lt;/code&gt; · &lt;code&gt;result_count&lt;/code&gt; · &lt;code&gt;similarity_score&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;This is where most RAG failures actually happen.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Check &lt;code&gt;result_count&lt;/code&gt; first:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Knowledge base is empty, collection doesn't exist, or query is totally unrelated. Check &lt;code&gt;collection_name&lt;/code&gt; — staging vs. production mix-ups are more common than you'd think.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1-3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Might be fine. Might mean your knowledge base is too small or chunks are too large.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;50+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You're flooding the LLM with noise. Lower &lt;code&gt;top_k&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Then check &lt;code&gt;similarity_score&lt;/code&gt;:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Above 0.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Strong matches. Retrieval is working.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0.3 - 0.7&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mediocre. Docs are somewhat related but might not answer the question.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Below 0.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval is grabbing garbage. The system would give better answers with no context at all.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
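&lt;p&gt;Both checks are easy to automate as a first-pass triage. A minimal sketch — the band boundaries (1/4/50 results, 0.3/0.7 similarity) are this runbook's rules of thumb, not provider constants, so tune them against your own logs:&lt;/p&gt;

```python
import bisect

# Rules-of-thumb bands from the two tables above (assumed thresholds, not
# provider constants). bisect picks the band each value falls into.
COUNT_BANDS = ["empty or wrong collection", "possibly too few", "ok", "flooding the LLM"]
SCORE_BANDS = ["garbage retrieval", "mediocre matches", "strong matches"]

def triage_retrieval(result_count, top_similarity):
    """Band one retrieval log entry into the diagnostic buckets above."""
    count_verdict = COUNT_BANDS[bisect.bisect([1, 4, 50], result_count)]
    score_verdict = SCORE_BANDS[bisect.bisect([0.3, 0.7], top_similarity)]
    return count_verdict, score_verdict
```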

&lt;h3&gt;
  
  
  Then check chunking:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Chunks too large&lt;/strong&gt; (2000+ tokens)&lt;/td&gt;
&lt;td&gt;Similarity score looks decent but the answer is diluted with irrelevant content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Chunks too small&lt;/strong&gt; (50-100 tokens)&lt;/td&gt;
&lt;td&gt;Important context is split across chunks that don't get retrieved together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;No overlap&lt;/strong&gt; (overlap = 0)&lt;/td&gt;
&lt;td&gt;Sentences at chunk boundaries get cut in half. Critical info lost.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Customer asks "What's our refund policy?" and gets an answer about shipping timelines. The top retrieved doc is a 3000-token chunk titled "Order Processing" that mentions refunds in one sentence buried in paragraph 8. Fix: reduce chunk size to 500 tokens so the refund policy lives in its own chunk.&lt;/p&gt;
&lt;/blockquote&gt;
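&lt;p&gt;To make the overlap failure mode concrete, here's a minimal sliding-window chunker. It splits on words rather than tokens for simplicity — real pipelines use a tokenizer — but the overlap arithmetic is identical:&lt;/p&gt;

```python
# Minimal sliding-window chunker (word-based stand-in for token chunking).
# Assumes overlap is smaller than chunk_size.
def chunk_words(words, chunk_size, overlap):
    step = chunk_size - overlap  # with overlap = 0, boundary words never repeat
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]
```

With &lt;code&gt;overlap = 0&lt;/code&gt;, a sentence straddling a chunk boundary is split and neither half embeds well — exactly the "critical info lost" symptom in the table.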




&lt;h2&gt;
  
  
  Stage 4: Context Assembly
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;prompt_tokens&lt;/code&gt; · &lt;code&gt;total_tokens&lt;/code&gt; · &lt;code&gt;context_truncated&lt;/code&gt; · &lt;code&gt;system_prompt&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where retrieved documents get packed into a prompt and sent to the LLM. The main failure: &lt;strong&gt;stuffing more context than the model can handle.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;What to Look For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prompt_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Approaching the model's context window limit? (GPT-4o: 128K, Claude Sonnet: 200K)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;context_truncated&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;If &lt;code&gt;true&lt;/code&gt;, the LLM is working with incomplete information. It's like summarizing a book using only chapters 1-7 out of 20.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;system_prompt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Did someone change it? "Answer only from provided context" vs. "Be helpful" = very different behavior. The first says "I don't know." The second hallucinates.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Simple questions are correct, complex ones are wrong. Simple questions use 800 tokens, complex ones use 45,000. &lt;code&gt;context_truncated&lt;/code&gt; is &lt;code&gt;true&lt;/code&gt; for every complex query. Fix: set a max context budget and prioritize higher-scoring docs.&lt;/p&gt;
&lt;/blockquote&gt;
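&lt;p&gt;The "max context budget" fix fits in a few lines. A sketch, assuming retrieved docs arrive as &lt;code&gt;(score, token_count, text)&lt;/code&gt; tuples — a simplification of whatever your retriever actually returns:&lt;/p&gt;

```python
import operator

# Pack highest-scoring docs first until the token budget is spent, instead
# of letting the provider silently truncate the prompt.
def pack_context(docs, budget_tokens):
    packed, used = [], 0
    for score, tokens, text in sorted(docs, reverse=True):  # best score first
        if operator.le(used + tokens, budget_tokens):  # doc fits in the budget
            packed.append(text)
            used += tokens
    return packed, used
```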




&lt;h2&gt;
  
  
  Stage 5: The LLM Call
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;model&lt;/code&gt; · &lt;code&gt;temperature&lt;/code&gt; · &lt;code&gt;max_tokens&lt;/code&gt; · &lt;code&gt;api_version&lt;/code&gt; · &lt;code&gt;status_code&lt;/code&gt; · &lt;code&gt;retry_count&lt;/code&gt; · &lt;code&gt;latency_ms&lt;/code&gt; · &lt;code&gt;cache_hit&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Check &lt;code&gt;status_code&lt;/code&gt; first:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Code&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;200&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Success. Problem is elsewhere.&lt;/td&gt;
&lt;td&gt;Move on.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;429&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Rate limited.&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;retry_count&lt;/code&gt; — a high count means a retry storm is making things worse.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Provider's problem.&lt;/td&gt;
&lt;td&gt;Retry or failover.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;503&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model overloaded.&lt;/td&gt;
&lt;td&gt;Common during peak hours. Wait or switch models.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
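&lt;p&gt;The standard defense against 429/500/503 is capped exponential backoff with jitter. A minimal sketch — &lt;code&gt;send_request&lt;/code&gt; is a stand-in for your actual API call, and the hard cap on retries is what prevents the retry storm described above:&lt;/p&gt;

```python
import random
import time

RETRYABLE = {429, 500, 503}  # rate limit, provider error, overloaded

def call_with_backoff(send_request, max_retries=4, sleep=time.sleep):
    """send_request is your API call; assumed to return (status_code, body)."""
    for attempt in range(max_retries):
        status, body = send_request()
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"non-retryable status {status}")
        sleep(min(2 ** attempt + random.random(), 30))  # 1s, 2s, 4s... capped
    raise RuntimeError("retries exhausted; escalate or fail over")
```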

&lt;h3&gt;
  
  
  Then check configuration:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;What to Look For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Is it the model you expect? Config drift is real — someone changes an env var and production silently downgrades.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;temperature&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;For RAG, should be 0.0-0.3. At 1.0, the model is improvising instead of sticking to context.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;latency_ms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Normal: 1-5 seconds. 15-30 seconds: model is overloaded or generating very long responses.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cache_hit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Answers seem outdated? A cache layer might be serving stale responses.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Customer reports "inconsistent" answers — same question, different answers each time. You check &lt;code&gt;temperature&lt;/code&gt;: it's set to 0.8. Every request is a roll of the dice. Fix: set to 0.1 for factual RAG.&lt;/p&gt;
&lt;/blockquote&gt;
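&lt;p&gt;Config drift is cheap to detect from the same log fields. A minimal sketch — the expected values here are illustrative for a factual RAG workload, not a recommendation for every deployment:&lt;/p&gt;

```python
# What production is supposed to be running (illustrative values).
EXPECTED = {"model": "gpt-4o", "temperature": 0.1}

def config_drift(log_entry):
    """Return {field: (actual, expected)} for every drifted field."""
    return {key: (log_entry.get(key), want)
            for key, want in EXPECTED.items()
            if log_entry.get(key) != want}
```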




&lt;h2&gt;
  
  
  Stage 6: The Response
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Fields:&lt;/strong&gt; &lt;code&gt;completion_tokens&lt;/code&gt; · &lt;code&gt;finish_reason&lt;/code&gt; · &lt;code&gt;error_message&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Check &lt;code&gt;finish_reason&lt;/code&gt;:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;stop&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model finished naturally. This is good.&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;length&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hit &lt;code&gt;max_tokens&lt;/code&gt; limit. Answer cut off mid-sentence.&lt;/td&gt;
&lt;td&gt;Increase &lt;code&gt;max_tokens&lt;/code&gt; or add "Be concise" to system prompt.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;content_filter&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blocked by safety filters. User sees an error for a legitimate question.&lt;/td&gt;
&lt;td&gt;Adjust content filter settings.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
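&lt;p&gt;A small lookup makes this table actionable in code. The values follow the OpenAI-style response schema; other providers name these differently — Anthropic, for example, reports a &lt;code&gt;stop_reason&lt;/code&gt; of &lt;code&gt;max_tokens&lt;/code&gt; instead of &lt;code&gt;length&lt;/code&gt;:&lt;/p&gt;

```python
# finish_reason values follow the OpenAI-style schema; fixes are the ones
# from the table above.
FIXES = {
    "stop": None,  # completed naturally: nothing to do
    "length": "raise max_tokens or ask the model to be concise",
    "content_filter": "review safety-filter settings for false positives",
}

def diagnose_finish(finish_reason):
    return FIXES.get(finish_reason, "unknown finish_reason: check provider docs")
```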

&lt;h3&gt;
  
  
  Check &lt;code&gt;completion_tokens&lt;/code&gt;:
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Likely Issue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Very low (10-20 tokens)&lt;/td&gt;
&lt;td&gt;Model defaulting to "I don't know" — retrieval probably returned nothing useful&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Very high (4000+ tokens)&lt;/td&gt;
&lt;td&gt;Model is rambling — tighten the system prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;And always check &lt;code&gt;error_message&lt;/code&gt;.&lt;/strong&gt; Sometimes the answer is literally written in the error. Read it before you start investigating.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Users report the AI "cuts off mid-sentence." &lt;code&gt;finish_reason&lt;/code&gt; = &lt;code&gt;length&lt;/code&gt; on every affected request. &lt;code&gt;max_tokens&lt;/code&gt; is set to 256 — not enough for detailed technical answers. Fix: increase to 1024.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The 10-Step Checklist
&lt;/h2&gt;

&lt;p&gt;When a ticket comes in, work through this in order:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What to Check&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Get the &lt;code&gt;request_id&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;timestamp&lt;/code&gt; — correlate with deployments/outages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;user_id&lt;/code&gt; — one user or many?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;embedding_job_failed&lt;/code&gt; — did embedding work?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;result_count&lt;/code&gt; + &lt;code&gt;similarity_score&lt;/code&gt; — did retrieval return good docs?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;context_truncated&lt;/code&gt; — did the full context reach the LLM?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;status_code&lt;/code&gt; — did the LLM call succeed?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;model&lt;/code&gt; + &lt;code&gt;temperature&lt;/code&gt; — is the LLM configured correctly?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;finish_reason&lt;/code&gt; — did the response complete?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Check &lt;code&gt;error_message&lt;/code&gt; — does it just tell you?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Steps 1-3&lt;/strong&gt; scope the problem. &lt;strong&gt;Steps 4-6&lt;/strong&gt; catch 70% of issues. &lt;strong&gt;Steps 7-10&lt;/strong&gt; catch the rest.&lt;/p&gt;
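&lt;p&gt;Steps 4-10 can even run as an automated first pass over a single log entry. A minimal sketch, assuming the log fields named above — each predicate asks "is something wrong at this step?", and the first hit tells you where to start digging:&lt;/p&gt;

```python
# Checklist steps 4-10 as ordered predicates over one log entry.
# Field names match the log schema used throughout this guide.
CHECKS = [
    ("embedding job failed", lambda e: bool(e.get("embedding_job_failed"))),
    ("retrieval returned nothing", lambda e: e.get("result_count") == 0),
    ("context truncated", lambda e: bool(e.get("context_truncated"))),
    ("LLM call failed", lambda e: e.get("status_code") != 200),
    ("response cut off", lambda e: e.get("finish_reason") == "length"),
    ("explicit error", lambda e: bool(e.get("error_message"))),
]

def first_failure(entry):
    for label, failed in CHECKS:
        if failed(entry):
            return label
    return "no obvious failure: compare against a known-good request"
```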




&lt;h2&gt;
  
  
  Common Patterns Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Check These Fields&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Wrong answers for everyone&lt;/td&gt;
&lt;td&gt;Embedding model mismatch or bad re-index&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;embedding_model&lt;/code&gt;, &lt;code&gt;similarity_score&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wrong answers for one user&lt;/td&gt;
&lt;td&gt;Missing docs in their collection&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;collection_name&lt;/code&gt;, &lt;code&gt;result_count&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incomplete answers&lt;/td&gt;
&lt;td&gt;Response truncation&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;finish_reason&lt;/code&gt;, &lt;code&gt;max_tokens&lt;/code&gt;, &lt;code&gt;context_truncated&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inconsistent answers&lt;/td&gt;
&lt;td&gt;Temperature too high or cache issues&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;temperature&lt;/code&gt;, &lt;code&gt;cache_hit&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Slow responses&lt;/td&gt;
&lt;td&gt;LLM overload or too much context&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;latency_ms&lt;/code&gt;, &lt;code&gt;prompt_tokens&lt;/code&gt;, &lt;code&gt;retry_count&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No response at all&lt;/td&gt;
&lt;td&gt;API failure or rate limiting&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;status_code&lt;/code&gt;, &lt;code&gt;error_message&lt;/code&gt;, &lt;code&gt;embedding_job_failed&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hallucinated answers&lt;/td&gt;
&lt;td&gt;No relevant docs retrieved&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;result_count&lt;/code&gt;, &lt;code&gt;similarity_score&lt;/code&gt;, &lt;code&gt;system_prompt&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outdated answers&lt;/td&gt;
&lt;td&gt;Stale cache or stale index&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cache_hit&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;embedding_job_failed&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
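&lt;p&gt;If you triage often, this table is worth keeping as a lookup in your tooling. A minimal sketch — keys are the symptom phrases above, normalized to lowercase:&lt;/p&gt;

```python
# The pattern table above as a symptom-to-fields lookup.
SYMPTOM_FIELDS = {
    "wrong answers for everyone": ["embedding_model", "similarity_score"],
    "wrong answers for one user": ["collection_name", "result_count", "user_id"],
    "incomplete answers": ["finish_reason", "max_tokens", "context_truncated"],
    "inconsistent answers": ["temperature", "cache_hit"],
    "slow responses": ["latency_ms", "prompt_tokens", "retry_count"],
    "no response at all": ["status_code", "error_message", "embedding_job_failed"],
    "hallucinated answers": ["result_count", "similarity_score", "system_prompt"],
    "outdated answers": ["cache_hit", "timestamp", "embedding_job_failed"],
}

def fields_to_check(symptom):
    # Unknown symptom: fall back to step 1 of the checklist.
    return SYMPTOM_FIELDS.get(symptom.lower().strip(), ["request_id"])
```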




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Follow the pipeline. Query → Embedding → Retrieval → Context → LLM → Response. Six stages, 25 fields, one direction.&lt;/p&gt;

&lt;p&gt;Start at the beginning. Follow the data. The logs will tell you where it broke.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>devops</category>
    </item>
    <item>
      <title>Which AI Model Should You Actually Use? A Simple Guide for 2026</title>
      <dc:creator>Sindhu Murthy</dc:creator>
      <pubDate>Mon, 16 Feb 2026 21:24:35 +0000</pubDate>
      <link>https://dev.to/sindhu_murthy_628835a359d/which-ai-model-should-you-actually-use-a-simple-guide-for-2026-31d4</link>
      <guid>https://dev.to/sindhu_murthy_628835a359d/which-ai-model-should-you-actually-use-a-simple-guide-for-2026-31d4</guid>
      <description>&lt;h1&gt;Which AI Model Should You Actually Use? A Simple Guide for 2026&lt;/h1&gt;

&lt;p&gt;Everyone's building with AI now, but nobody tells you which model to pick. There are dozens of options and the wrong choice either wastes money or gives bad results.&lt;/p&gt;

&lt;p&gt;Here's the simple version: match the model to the job.&lt;/p&gt;

&lt;h2&gt;Part 1: Everyday Projects (Solo Developers, Startups, Side Projects)&lt;/h2&gt;

&lt;p&gt;You're building something yourself or with a small team. Budget matters. Speed matters.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;What You're Building&lt;/th&gt;
&lt;th&gt;Best Model&lt;/th&gt;
&lt;th&gt;Why This One&lt;/th&gt;
&lt;th&gt;Cost/Month&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Chatbot for your website&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Answers customer FAQs from your docs&lt;/td&gt;
&lt;td&gt;GPT-4o-mini (OpenAI)&lt;/td&gt;
&lt;td&gt;Cheap, fast, handles Q&amp;amp;A perfectly&lt;/td&gt;
&lt;td&gt;$1-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Code assistant&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Reviews pull requests, writes boilerplate&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.5 (Anthropic)&lt;/td&gt;
&lt;td&gt;Great at code, follows instructions precisely&lt;/td&gt;
&lt;td&gt;$5-20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Meeting summaries&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Transcripts → action items&lt;/td&gt;
&lt;td&gt;GPT-4o-mini (OpenAI)&lt;/td&gt;
&lt;td&gt;Summarization is simple. Fractions of a cent per summary.&lt;/td&gt;
&lt;td&gt;$1-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Image generation&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Marketing visuals, product mockups&lt;/td&gt;
&lt;td&gt;DALL-E 3 or Midjourney&lt;/td&gt;
&lt;td&gt;DALL-E for API integration. Midjourney for artistic control.&lt;/td&gt;
&lt;td&gt;$10-30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Voice transcription&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Audio recordings → text&lt;/td&gt;
&lt;td&gt;Whisper (OpenAI, local)&lt;/td&gt;
&lt;td&gt;Runs on your machine, no API costs, surprisingly accurate&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;The rule for everyday projects:&lt;/b&gt; Start with the cheapest model. Only upgrade if the quality isn't good enough. You'll be surprised how often the cheap option works fine.&lt;/p&gt;

&lt;h2&gt;Part 2: Enterprise Customers (Production Systems, Thousands of Users)&lt;/h2&gt;

&lt;p&gt;You're building for a company. Reliability matters. Compliance matters. The wrong answer costs real money.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;What They Need&lt;/th&gt;
&lt;th&gt;Best Model&lt;/th&gt;
&lt;th&gt;Why This One&lt;/th&gt;
&lt;th&gt;Key Consideration&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Internal knowledge search&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Employees search docs, get AI answers&lt;/td&gt;
&lt;td&gt;GPT-4o-mini + text-embedding-3-small&lt;/td&gt;
&lt;td&gt;Mini is cost-effective at scale&lt;/td&gt;
&lt;td&gt;Set relevance thresholds — wrong answer is worse than no answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Legal contract review&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;AI reads contracts, flags risks&lt;/td&gt;
&lt;td&gt;Claude Opus or GPT-4o&lt;/td&gt;
&lt;td&gt;Legal requires precision and nuance&lt;/td&gt;
&lt;td&gt;Must have human review loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Support automation&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;AI handles tier-1 tickets&lt;/td&gt;
&lt;td&gt;GPT-4o with fine-tuning&lt;/td&gt;
&lt;td&gt;Matches company tone, follows escalation rules&lt;/td&gt;
&lt;td&gt;Route to human if confidence is low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Fraud detection&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Flag suspicious transactions&lt;/td&gt;
&lt;td&gt;Custom ML model (not LLM)&lt;/td&gt;
&lt;td&gt;Classification problem, not a language problem&lt;/td&gt;
&lt;td&gt;Traditional ML is faster, cheaper, more accurate here&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Multi-language portal&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Support in 20+ languages&lt;/td&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;Best multilingual performance&lt;/td&gt;
&lt;td&gt;Test thoroughly in each target language&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;The rule for enterprise:&lt;/b&gt; Reliability beats cost. A $0.01 answer that's wrong costs more than a $0.05 answer that's right — because wrong answers become support tickets, lost customers, and legal risk.&lt;/p&gt;

&lt;h2&gt;Why Smart Enterprises Don't Use One Model — They Use Several&lt;/h2&gt;

&lt;p&gt;Most companies start by picking one model for everything. That's a mistake. The companies that control AI costs best use &lt;b&gt;different models for different tasks in the same product&lt;/b&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Task in the Pipeline&lt;/th&gt;
&lt;th&gt;Model Used&lt;/th&gt;
&lt;th&gt;Why Not One Model for All&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classify incoming ticket&lt;/td&gt;
&lt;td&gt;GPT-4o-mini ($0.15/1M tokens)&lt;/td&gt;
&lt;td&gt;Classification is simple — cheap model gets it right 95% of the time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search knowledge base&lt;/td&gt;
&lt;td&gt;text-embedding-3-small ($0.02/1M tokens)&lt;/td&gt;
&lt;td&gt;One-time cost per document. Cheapest good embeddings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generate customer response&lt;/td&gt;
&lt;td&gt;GPT-4o ($2.50/1M tokens)&lt;/td&gt;
&lt;td&gt;Customer sees this. Quality matters here.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarize for internal log&lt;/td&gt;
&lt;td&gt;GPT-4o-mini ($0.15/1M tokens)&lt;/td&gt;
&lt;td&gt;Internal only. Doesn't need to be perfect.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flag compliance risk&lt;/td&gt;
&lt;td&gt;Claude Opus ($15/1M tokens)&lt;/td&gt;
&lt;td&gt;Legal requires the most careful model.&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;One customer support ticket, five different models.&lt;/b&gt; Each matched to the task complexity.&lt;/p&gt;
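&lt;p&gt;In code, the whole pattern is a small routing table. A minimal sketch — model names mirror the pipeline table above (the compliance entry is illustrative, not an exact API model ID), and the safe default is the cheap model:&lt;/p&gt;

```python
# Per-task model routing: pick the model by task, not one model for the product.
ROUTES = {
    "classify": "gpt-4o-mini",
    "embed": "text-embedding-3-small",
    "respond": "gpt-4o",
    "summarize": "gpt-4o-mini",
    "compliance": "claude-opus",  # illustrative name for the premium tier
}

def route(task):
    return ROUTES.get(task, "gpt-4o-mini")  # unknown tasks default to cheap
```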

&lt;h3&gt;The Cost Difference Is Massive&lt;/h3&gt;

&lt;p&gt;Take a company handling &lt;b&gt;10,000 support tickets per month&lt;/b&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;How It Works&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Single model (GPT-4o for everything)&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Every step uses the same premium model&lt;/td&gt;
&lt;td&gt;~$800-1,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Multi-model (right model per task)&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;Cheap models for simple steps, premium only where it matters&lt;/td&gt;
&lt;td&gt;~$150-250&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;b&gt;Same quality where the customer sees it. 70-80% cheaper overall.&lt;/b&gt;&lt;/p&gt;

&lt;h3&gt;How It Works in Practice&lt;/h3&gt;

&lt;p&gt;
GPT-4o-mini classifies the ticket → cost: $0.0001&lt;br&gt;
Embedding model searches docs → cost: $0.00005&lt;br&gt;
GPT-4o writes the response → cost: $0.008&lt;br&gt;
GPT-4o-mini summarizes for internal log → cost: $0.0002&lt;br&gt;
&lt;br&gt;
&lt;b&gt;Total per ticket: ~$0.009&lt;/b&gt;&lt;br&gt;
&lt;b&gt;vs. GPT-4o for all steps: ~$0.04&lt;/b&gt;&lt;br&gt;
&lt;b&gt;At 10,000 tickets/month: $90 vs $400&lt;/b&gt;
&lt;/p&gt;
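&lt;p&gt;The same arithmetic, written out so you can plug in your own volumes. Step costs are the rough per-ticket figures above, not quoted prices:&lt;/p&gt;

```python
# Rough per-ticket step costs from the breakdown above (not quoted prices).
STEP_COSTS = {"classify": 0.0001, "search": 0.00005, "respond": 0.008, "summarize": 0.0002}

def monthly_cost(tickets_per_month, per_ticket_costs):
    return tickets_per_month * sum(per_ticket_costs.values())
```

At 10,000 tickets this comes to roughly $84/month — in line with the ~$90 estimate above — against roughly $400 for the single-model pipeline.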

&lt;h3&gt;The TAM's Role Here&lt;/h3&gt;

&lt;p&gt;As a TAM, this is one of the highest-value conversations you can have with a customer:&lt;/p&gt;

&lt;p&gt;"I noticed you're using GPT-4o for ticket classification. That's a simple task — switching to mini for just that step would cut your classification costs by 95% with no quality drop. Want me to help you set that up?"&lt;/p&gt;

&lt;p&gt;That's not support. That's &lt;b&gt;strategic partnership&lt;/b&gt;. That's what gets TAMs promoted.&lt;/p&gt;

&lt;h2&gt;Quick Decision Flowchart&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;If Your Task Is...&lt;/th&gt;
&lt;th&gt;Use This Model&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text/language + accuracy is critical (legal, medical, finance)&lt;/td&gt;
&lt;td&gt;GPT-4o or Claude Opus&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Text/language + accuracy isn't life-or-death&lt;/td&gt;
&lt;td&gt;GPT-4o-mini or Claude Sonnet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation or review&lt;/td&gt;
&lt;td&gt;Claude Sonnet 4.5 or GPT-4o&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Math, logic, or reasoning&lt;/td&gt;
&lt;td&gt;o3 or o3-mini&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image generation&lt;/td&gt;
&lt;td&gt;DALL-E 3 or Midjourney&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audio/speech transcription&lt;/td&gt;
&lt;td&gt;Whisper (free, runs locally)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Structured data (numbers, transactions, logs)&lt;/td&gt;
&lt;td&gt;Traditional ML — XGBoost, scikit-learn (not an LLM)&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;The Biggest Mistake I See&lt;/h2&gt;

&lt;p&gt;People use GPT-4o for everything. It's like using a Ferrari to get groceries. It works, but you're burning money for no reason.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Match the model to the task.&lt;/b&gt; Simple task → cheap model. Critical task → premium model. Not a language task → don't use an LLM at all.&lt;/p&gt;

&lt;h2&gt;The Models at a Glance&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Strength&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Fast, cheap, good enough&lt;/td&gt;
&lt;td&gt;$&lt;/td&gt;
&lt;td&gt;Chatbots, summaries, simple Q&amp;amp;A&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Smart, reliable, multilingual&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;td&gt;Production apps needing quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Sonnet 4.5&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Great at code, follows instructions&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;td&gt;Code generation, technical writing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus&lt;/td&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Most capable, careful reasoning&lt;/td&gt;
&lt;td&gt;$$$&lt;/td&gt;
&lt;td&gt;Legal, compliance, complex analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;o3-mini&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Step-by-step reasoning&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;td&gt;Math, logic, structured problems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Whisper&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Speech-to-text&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Transcription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DALL-E 3&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;Image generation&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;td&gt;Marketing, design, prototyping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;XGBoost / scikit-learn&lt;/td&gt;
&lt;td&gt;Open source&lt;/td&gt;
&lt;td&gt;Structured data prediction&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Fraud, forecasting, classification&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
