<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TokenHub</title>
    <description>The latest articles on DEV Community by TokenHub (@tokenhub_dev).</description>
    <link>https://dev.to/tokenhub_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898502%2F6f13da76-8606-4490-b57e-067e230f3c22.png</url>
      <title>DEV Community: TokenHub</title>
      <link>https://dev.to/tokenhub_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tokenhub_dev"/>
    <language>en</language>
    <item>
      <title>Build the eval set before you swap the model.</title>
      <dc:creator>TokenHub</dc:creator>
      <pubDate>Tue, 12 May 2026 09:46:54 +0000</pubDate>
      <link>https://dev.to/tokenhub_dev/build-the-eval-set-before-you-swap-the-model-1dlc</link>
      <guid>https://dev.to/tokenhub_dev/build-the-eval-set-before-you-swap-the-model-1dlc</guid>
      <description>&lt;p&gt;The pattern I keep seeing on teams chasing AI cost reductions: someone swaps a workload from GPT-4o to DeepSeek-V3, eyeballs a handful of outputs, calls it good, ships it. The cost graph drops the next day. Three weeks later a customer surfaces a regression — the cheaper model hallucinates a date format 6% more often, breaks the downstream invoice generator, and the rollback erases most of the savings plus a week of engineering time.&lt;/p&gt;

&lt;p&gt;The fix isn't "don't swap models." Swapping is usually the right move — the gap between DeepSeek-V3 at $0.07/$0.28 per million tokens and GPT-4o at $2.50/$10 is too much money to leave on the table when the workload tolerates the cheaper model.&lt;/p&gt;

&lt;p&gt;The fix is: &lt;strong&gt;build the eval set before the swap, not after.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What a useful eval set looks like
&lt;/h2&gt;

&lt;p&gt;You don't need fancy infrastructure for this. Five steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Pull 100-300 real prompts from production logs
&lt;/h3&gt;

&lt;p&gt;Cover the long tail of inputs, not just the happy path. Include the weird ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The customer ticket in Spanish when your system was designed for English&lt;/li&gt;
&lt;li&gt;The PR diff with binary files mixed in&lt;/li&gt;
&lt;li&gt;The malformed JSON the user pasted instead of describing the issue in words&lt;/li&gt;
&lt;li&gt;The prompt that hit a 30-second timeout last Tuesday&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the inputs where models actually differ. A 50-prompt happy-path eval will tell you both models are 99% accurate, and you'll learn nothing.&lt;/p&gt;
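
&lt;p&gt;Assembling the set is a few lines. A minimal sketch, assuming your request logs are JSONL with a &lt;code&gt;prompt&lt;/code&gt; field (the path and field names are placeholders for whatever your logging emits):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import random

# Hypothetical log format: one JSON object per line with a "prompt" field
with open("prod_requests.jsonl") as f:
    logged = [json.loads(line) for line in f]

random.seed(42)  # reproducible sample
sample = random.sample(logged, k=min(300, len(logged)))

eval_set = {f"prompt-{i}": entry["prompt"] for i, entry in enumerate(sample)}

with open("eval_set.json", "w") as f:
    json.dump(eval_set, f, indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Random sampling is the floor, not the goal. Stratify by language, input length, or error status if you want the long tail guaranteed a seat.&lt;/p&gt;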

&lt;h3&gt;
  
  
  2. Get the current model's outputs on those prompts
&lt;/h3&gt;

&lt;p&gt;Save them with a timestamp. This is your baseline. Don't skip this — you'll need it for the comparison and you can't reconstruct it later if the model gets deprecated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-gateway/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;baseline_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eval_set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;baseline_outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;baseline_gpt4o.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_outputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Get the candidate model's outputs on the same prompts
&lt;/h3&gt;

&lt;p&gt;Same code, different model name. If your application is wired through an OpenAI-compatible gateway, this is one config change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;candidate_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;eval_set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# the only change
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;candidate_outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cost of running 300 prompts through DeepSeek-V3 is roughly $0.20. Don't optimize this step.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Compare programmatically where you can, human-review the rest
&lt;/h3&gt;

&lt;p&gt;For structured outputs (JSON, tool calls, field extraction), programmatic comparison covers most ground (a sketch follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schema validity: does the output parse?&lt;/li&gt;
&lt;li&gt;Field match: do the extracted fields match the baseline?&lt;/li&gt;
&lt;li&gt;Edit distance for short strings&lt;/li&gt;
&lt;/ul&gt;
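
&lt;p&gt;Here's a minimal comparison pass, assuming the &lt;code&gt;baseline_outputs&lt;/code&gt; and &lt;code&gt;candidate_outputs&lt;/code&gt; dicts from steps 2 and 3, with &lt;code&gt;SequenceMatcher&lt;/code&gt; similarity standing in for edit distance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from difflib import SequenceMatcher

# baseline_outputs / candidate_outputs come from steps 2 and 3;
# reload the baseline from disk if you're running this later:
# baseline_outputs = json.load(open("baseline_gpt4o.json"))

results = {}
for prompt_id, base_out in baseline_outputs.items():
    cand_out = candidate_outputs[prompt_id]
    try:
        # Structured output: does it parse, and do the fields match?
        same = json.loads(base_out) == json.loads(cand_out)
        results[prompt_id] = "match" if same else "field_diff"
    except (json.JSONDecodeError, TypeError):
        # Free-form or unparseable: cheap string similarity as triage
        ratio = SequenceMatcher(None, base_out, cand_out).ratio()
        results[prompt_id] = "similar" if ratio &amp;gt; 0.8 else "review"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
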

&lt;p&gt;For free-form outputs (summaries, explanations, agent responses), human review is the bottleneck. Three minutes per prompt × 300 prompts = 15 hours, which sounds bad but is a one-time cost for a decision that affects every production call going forward.&lt;/p&gt;

&lt;p&gt;Use an LLM-as-judge to triage: have a stronger model (Claude 3.5, GPT-4o) rate each candidate output against the baseline as &lt;code&gt;better / equivalent / worse / different-but-acceptable&lt;/code&gt;. Then human-review only the &lt;code&gt;worse&lt;/code&gt; and &lt;code&gt;different&lt;/code&gt; buckets. That cuts human time by ~70% in my experience.&lt;/p&gt;
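
&lt;p&gt;A sketch of the judge loop through the same gateway (the rubric wording is illustrative; tune it to your workload):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;JUDGE_PROMPT = """You are comparing two model outputs for the same task.

Baseline:
{baseline}

Candidate:
{candidate}

Reply with exactly one word: better, equivalent, worse, or different-but-acceptable."""

verdicts = {}
for prompt_id, base_out in baseline_outputs.items():
    resp = client.chat.completions.create(
        model="claude-3-5-sonnet",  # a model stronger than either side
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            baseline=base_out, candidate=candidate_outputs[prompt_id])}],
    )
    verdicts[prompt_id] = resp.choices[0].message.content.strip().lower()

needs_human = [p for p, v in verdicts.items()
               if v in ("worse", "different-but-acceptable")]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
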

&lt;h3&gt;
  
  
  5. Set a threshold before you ship
&lt;/h3&gt;

&lt;p&gt;"Candidate model has to match baseline on at least 95% of evals to ship" is a reasonable default. The exact number depends on the workload:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Safety-critical (legal, medical, financial): 99%+&lt;/li&gt;
&lt;li&gt;User-facing high-stakes (customer-facing summaries): 97%+&lt;/li&gt;
&lt;li&gt;Internal tooling (Slack summaries, dev tools): 92%+&lt;/li&gt;
&lt;li&gt;Background tasks (data cleanup, tagging): 85%+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick the threshold &lt;em&gt;before&lt;/em&gt; you see the numbers. Picking after is how you talk yourself into shipping a model that's slightly worse on the dimension you care about.&lt;/p&gt;
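
&lt;p&gt;The gate itself is a few lines. A sketch, assuming the &lt;code&gt;verdicts&lt;/code&gt; dict from the judge step above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;THRESHOLD = 0.95  # decided before looking at the numbers

acceptable = ("better", "equivalent", "different-but-acceptable")
passed = sum(1 for v in verdicts.values() if v in acceptable)
rate = passed / len(verdicts)

print(f"{passed}/{len(verdicts)} acceptable ({rate:.1%})")
if rate &amp;gt;= THRESHOLD:
    print("Ship the candidate model.")
else:
    print("Hold the swap and review the failing prompts.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
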

&lt;h2&gt;
  
  
  The architectural prerequisite
&lt;/h2&gt;

&lt;p&gt;This whole loop only works cheaply if swapping the model for the eval is a config change, not an integration project. Wire your application code through the OpenAI Python SDK with a configurable &lt;code&gt;base_url&lt;/code&gt; and let a gateway handle the provider-specific bits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;th-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-gateway/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Same client, different model per call
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I use &lt;a href="https://jiatoken.com" rel="noopener noreferrer"&gt;TokenHub&lt;/a&gt; for the gateway — 40+ models behind one API key, route per call. &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; self-hosted gets you the same shape if you'd rather run it yourself.&lt;/p&gt;

&lt;p&gt;Without that wiring, every eval is a custom integration project, which is why most teams don't run evals.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Pull 100-300 real prompts from logs (include weird ones)&lt;/li&gt;
&lt;li&gt;Run baseline model, save outputs&lt;/li&gt;
&lt;li&gt;Run candidate model, save outputs&lt;/li&gt;
&lt;li&gt;Compare (programmatic for structured, LLM-judge + human for free-form)&lt;/li&gt;
&lt;li&gt;Threshold before shipping&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The whole exercise takes a day. It saves you the rollback story.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>deepseek</category>
      <category>testing</category>
    </item>
    <item>
      <title>Stop tuning one model. Route per workload.</title>
      <dc:creator>TokenHub</dc:creator>
      <pubDate>Thu, 07 May 2026 13:49:58 +0000</pubDate>
      <link>https://dev.to/tokenhub_dev/stop-tuning-one-model-route-per-workload-2aah</link>
      <guid>https://dev.to/tokenhub_dev/stop-tuning-one-model-route-per-workload-2aah</guid>
      <description>&lt;p&gt;"What's the best model?" used to be a meaningful question. Today it has the wrong shape.&lt;/p&gt;

&lt;p&gt;The useful version is: best for which workload?&lt;/p&gt;

&lt;p&gt;In the pipeline I migrated last month, three workloads ended up on three different models. None of those would have been my answer if you'd asked "what's the best model overall." Each one is the right answer to a more specific question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Field extraction wants reliability
&lt;/h2&gt;

&lt;p&gt;We pull structured fields from customer support tickets and route them into a workflow tool. Accuracy matters more than anything because a wrong field silently corrupts a downstream system. Speed barely matters; cost matters somewhat.&lt;/p&gt;

&lt;p&gt;Landed on &lt;strong&gt;GPT-4o-mini&lt;/strong&gt;. Boring, reliable, structured-output friendly. Cheap enough that we don't think about it. Trying to use a smarter model here would be a category error — the failure mode isn't "the model isn't smart enough," it's "the model occasionally hallucinates a field name that breaks the schema validator." Smaller, more boring models hallucinate less.&lt;/p&gt;
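
&lt;p&gt;For this kind of workload, constraining the output shape buys more than raw model capability. A sketch using the Chat Completions structured-output parameter (the schema and &lt;code&gt;ticket_text&lt;/code&gt; are illustrative, &lt;code&gt;client&lt;/code&gt; is the gateway client shown further down, and you should verify your gateway forwards &lt;code&gt;response_format&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# ticket_text: the raw support ticket (hypothetical variable)
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": ticket_text}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ticket_fields",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "customer_id": {"type": "string"},
                    "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                    "category": {"type": "string"},
                },
                "required": ["customer_id", "priority", "category"],
                "additionalProperties": False,
            },
        },
    },
)
fields = json.loads(resp.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a strict schema, "the model hallucinated a field name" becomes a failure you catch at the boundary instead of a silent corruption downstream.&lt;/p&gt;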

&lt;h2&gt;
  
  
  Long-context summarization wants context and price
&lt;/h2&gt;

&lt;p&gt;We summarize PR diffs that range from 5K to 80K tokens, posted to Slack. Quality matters, but the difference between 91% and 94% accuracy on "summarize this diff" is invisible to the human reading the Slack message; the difference between $0.49/day and $17.50/day shows up on the next finance review.&lt;/p&gt;

&lt;p&gt;Landed on &lt;strong&gt;DeepSeek-V3&lt;/strong&gt;. The accuracy hit is real but small; the cost reduction is large; the context window covers what we need. The fact that DeepSeek speaks the OpenAI wire format meant the migration was a configuration change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool-use agents want strong instruction following
&lt;/h2&gt;

&lt;p&gt;We have an agent that calls 4-6 tools in sequence to draft and send a follow-up email. Each retry is expensive (in latency, in user trust, in tokens spent debugging). The model that gets it right on the first try the most often is the one we want, almost regardless of per-call cost.&lt;/p&gt;

&lt;p&gt;Landed on &lt;strong&gt;Claude 3.5&lt;/strong&gt;. The per-call price is higher; the retries-per-task ratio is lower; the net cost is similar to GPT-4o and the reliability ceiling is meaningfully higher.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architectural prerequisite
&lt;/h2&gt;

&lt;p&gt;This only works if your application code doesn't care which model serves a given call. We wired everything through the OpenAI Python SDK with a configurable &lt;code&gt;base_url&lt;/code&gt;, then put a gateway in front that routes per request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;th-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://your-gateway/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Same client, different model per call
&lt;/span&gt;&lt;span class="n"&gt;extraction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;agent_step&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-3-5-sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Switching a workload to a different model is now a one-line config change in production.&lt;/p&gt;
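
&lt;p&gt;In practice the routing lives in one config-shaped place rather than scattered through call sites. A minimal sketch (the workload names are illustrative; &lt;code&gt;client&lt;/code&gt; is the one constructed above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# One place to edit when a workload moves to a different model
WORKLOAD_MODELS = {
    "extraction": "gpt-4o-mini",
    "summarization": "deepseek-chat",
    "agent": "claude-3-5-sonnet",
}

def complete(workload: str, messages: list) -&amp;gt; str:
    resp = client.chat.completions.create(
        model=WORKLOAD_MODELS[workload],
        messages=messages,
    )
    return resp.choices[0].message.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
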

&lt;p&gt;Without that, picking three models means writing three integrations, debugging three sets of edge cases, and watching one of them rot every time a provider changes their API. With it, the marginal cost of trying a new model on a workload is near zero, which is also the marginal cost of moving off a model that stops being the right answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more in 2026
&lt;/h2&gt;

&lt;p&gt;The market keeps shipping new models at new price points. In the last couple of years alone: DeepSeek cut V3 prices, Anthropic shipped a cheaper Haiku, OpenAI introduced 4o-mini, and Mistral refreshed its open-weight lineup. Each of those releases changed which model was the right answer for at least one workload in our system.&lt;/p&gt;

&lt;p&gt;If switching costs a sprint, you ignore most of those announcements. If it costs a config change, you re-evaluate routinely and your costs trend down while the market keeps shipping.&lt;/p&gt;

&lt;p&gt;The shape of the question stays the same: which model is the best answer to this specific workload right now? You want code that can give a different answer next week without a sprint of refactoring.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Don't pick the best model. Pick the best model &lt;em&gt;per workload&lt;/em&gt;, and keep picking — the answer changes every few weeks. The architectural prerequisite is wiring everything through one OpenAI-compatible API surface so swapping is free.&lt;/p&gt;

&lt;p&gt;I used &lt;a href="https://jiatoken.com" rel="noopener noreferrer"&gt;TokenHub&lt;/a&gt; for the gateway — 40+ models behind one API key, route per call. &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; self-hosted is the same architectural pattern if you'd rather run it yourself. The point, again, is the pattern more than the vendor.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>deepseek</category>
      <category>programming</category>
    </item>
    <item>
      <title>Swap OpenAI for DeepSeek without rewriting a single line of code</title>
      <dc:creator>TokenHub</dc:creator>
      <pubDate>Mon, 27 Apr 2026 03:55:51 +0000</pubDate>
      <link>https://dev.to/tokenhub_dev/swap-openai-for-deepseek-without-rewriting-a-single-line-of-code-4lm3</link>
      <guid>https://dev.to/tokenhub_dev/swap-openai-for-deepseek-without-rewriting-a-single-line-of-code-4lm3</guid>
      <description>&lt;p&gt;Last month I added Claude to a project that was already using GPT-4o. Two SDKs, two error formats, two retry strategies. By the time I finished I had wrapped both in my own abstraction — a tiny LLM gateway, badly written, that I now had to maintain.&lt;/p&gt;

&lt;p&gt;Then I noticed something I should have noticed earlier: most of the new providers expose an &lt;strong&gt;OpenAI-compatible&lt;/strong&gt; endpoint. DeepSeek, Mistral, Together, Fireworks — they all speak the same wire format. You don't need a new SDK. You need a new &lt;code&gt;base_url&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This post is the 5-minute version of that realization, with the tradeoffs I learned the hard way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "before" code
&lt;/h2&gt;

&lt;p&gt;Standard OpenAI Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this PR diff...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The "after" code
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;th-...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://jiatoken.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# gateway
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Same call, different model
&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                &lt;span class="c1"&gt;# DeepSeek-V3
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize this PR diff...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it: a new key, a new &lt;code&gt;base_url&lt;/code&gt;, and a different model string. The rest of your code — streaming handlers, tool calls, retry logic — keeps working because the response shape is identical.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this works
&lt;/h2&gt;

&lt;p&gt;The OpenAI Python SDK is just a typed HTTP client. It POSTs JSON to &lt;code&gt;{base_url}/chat/completions&lt;/code&gt;. Anything that responds with the same JSON shape is, from the SDK's point of view, OpenAI.&lt;/p&gt;
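
&lt;p&gt;You can see this by skipping the SDK entirely. A sketch with plain &lt;code&gt;requests&lt;/code&gt;, sending the same JSON the SDK builds (the gateway URL and key are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

resp = requests.post(
    "https://your-gateway/v1/chat/completions",
    headers={"Authorization": "Bearer th-..."},
    json={
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": "Say hi"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
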

&lt;p&gt;Providers vary in how natively they support this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek&lt;/strong&gt; ships its own OpenAI-compatible endpoint at &lt;code&gt;api.deepseek.com/v1&lt;/code&gt;. You can point the SDK there directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anthropic&lt;/strong&gt; does &lt;strong&gt;not&lt;/strong&gt; — Claude has its own message format. You need a translator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt; has both: a native API and a Vertex-side OpenAI shim.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A multi-model gateway (LiteLLM, OpenRouter, TokenHub, your own) collapses these into one endpoint. One key, one base_url, every model behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually save
&lt;/h2&gt;

&lt;p&gt;For the workload I just migrated (~3M input tokens / 1M output per day, mostly summarization):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input $/1M&lt;/th&gt;
&lt;th&gt;Output $/1M&lt;/th&gt;
&lt;th&gt;Daily cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;td&gt;10.00&lt;/td&gt;
&lt;td&gt;$17.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;15.00&lt;/td&gt;
&lt;td&gt;$24.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3&lt;/td&gt;
&lt;td&gt;0.07&lt;/td&gt;
&lt;td&gt;0.28&lt;/td&gt;
&lt;td&gt;$0.49&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;DeepSeek isn't a drop-in &lt;em&gt;quality&lt;/em&gt; replacement for everything — GPT-4o still wins on instruction following in my evals — but for the 80% of calls that are "summarize this", "extract these fields", "rewrite in tone X", it's fine and ~35× cheaper.&lt;/p&gt;

&lt;h2&gt;
  
  
  The annoying parts
&lt;/h2&gt;

&lt;p&gt;A few things don't carry over cleanly through OpenAI compatibility:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tool calling JSON shape.&lt;/strong&gt; Most providers match it now, but older OSS models return tool calls inside the content string. Always test with your actual prompts before flipping production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision.&lt;/strong&gt; OpenAI uses &lt;code&gt;image_url&lt;/code&gt; parts; some providers want base64. A gateway should normalize this for you — verify before you assume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming with usage stats.&lt;/strong&gt; OpenAI added &lt;code&gt;stream_options={"include_usage": True}&lt;/code&gt; to get token counts on the final SSE chunk. Not every backend forwards this; a quick check is sketched after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits.&lt;/strong&gt; You're now subject to the gateway's RPM, which may be lower than direct provider limits.&lt;/li&gt;
&lt;/ol&gt;
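
&lt;p&gt;Here's the check I run for point 3 before trusting a backend's streamed usage numbers (a sketch; &lt;code&gt;client&lt;/code&gt; is the gateway-pointed client from earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;stream = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Say hi"}],
    stream=True,
    stream_options={"include_usage": True},
)

usage = None
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:  # present only on the final chunk, if forwarded at all
        usage = chunk.usage

print()
print(usage or "No usage stats forwarded; count tokens yourself.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
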

&lt;h2&gt;
  
  
  When NOT to use a gateway
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;You only ever call one provider. Direct SDK is one less moving part.&lt;/li&gt;
&lt;li&gt;You need provider-specific features (Anthropic's prompt caching, OpenAI's Realtime API, Gemini's long context). Gateways usually lag behind native features by weeks.&lt;/li&gt;
&lt;li&gt;You're in a regulated environment that requires data plane control. Most gateways are SaaS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For everything else — especially side projects and prototypes where the model you "want" changes every two weeks — a gateway pays for itself in saved switching cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;  client = OpenAI(
      api_key="...",
&lt;span class="gi"&gt;+     base_url="https://your-gateway/v1",
&lt;/span&gt;  )
  client.chat.completions.create(
&lt;span class="gd"&gt;-     model="gpt-4o",
&lt;/span&gt;&lt;span class="gi"&gt;+     model="deepseek-chat",
&lt;/span&gt;      ...
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to skip running your own LiteLLM, &lt;a href="https://jiatoken.com" rel="noopener noreferrer"&gt;TokenHub&lt;/a&gt; hosts a pre-configured gateway with 40+ models behind one key. Otherwise, &lt;a href="https://github.com/BerriAI/litellm" rel="noopener noreferrer"&gt;LiteLLM&lt;/a&gt; self-hosted is the standard answer.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>I Built an OpenAI-Compatible Gateway to 40+ AI Models (DeepSeek, MiniMax, Claude)</title>
      <dc:creator>TokenHub</dc:creator>
      <pubDate>Sun, 26 Apr 2026 08:31:18 +0000</pubDate>
      <link>https://dev.to/tokenhub_dev/i-built-an-openai-compatible-gateway-to-40-ai-models-deepseek-minimax-claude-2ifk</link>
      <guid>https://dev.to/tokenhub_dev/i-built-an-openai-compatible-gateway-to-40-ai-models-deepseek-minimax-claude-2ifk</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I was paying for 5+ different AI subscriptions: OpenAI, Anthropic, Google, etc. Each with separate API keys, billing dashboards, and SDK quirks.&lt;/p&gt;

&lt;p&gt;When DeepSeek-V3 dropped at ~$0.28 per million output tokens (vs GPT-4o at $10), I wanted to switch — but swapping SDKs across multiple projects was too much friction.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;TokenHub&lt;/strong&gt; — an OpenAI-compatible gateway that routes to 40+ AI models with a single API key.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;It's a drop-in replacement for the OpenAI SDK. Just change &lt;code&gt;base_url&lt;/code&gt; and &lt;code&gt;api_key&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-tokenhub-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://jiatoken.com/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use any of 40+ models — DeepSeek, MiniMax, Claude, GPT, Gemini, Llama, etc.
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain async/await in Python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The same code works with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gpt-4o&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;claude-sonnet-4-6&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemini-2.5-pro&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deepseek-v3&lt;/code&gt; / &lt;code&gt;deepseek-r1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;minimax-text-01&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;llama-3.3-70b&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;...and more&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real Pricing Comparison
&lt;/h2&gt;

&lt;p&gt;Per million tokens (input / output):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;$2.50&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o mini&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;$0.15&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek-V3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TokenHub&lt;/td&gt;
&lt;td&gt;$0.07&lt;/td&gt;
&lt;td&gt;$0.28&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek-R1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TokenHub&lt;/td&gt;
&lt;td&gt;$0.14&lt;/td&gt;
&lt;td&gt;$0.55&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MiniMax-Text-01&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;TokenHub&lt;/td&gt;
&lt;td&gt;$0.10&lt;/td&gt;
&lt;td&gt;$0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For high-volume workloads (RAG, agents, batch summarization), DeepSeek-V3 is &lt;strong&gt;~35x cheaper&lt;/strong&gt; than GPT-4o for output tokens.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Which Model
&lt;/h2&gt;

&lt;p&gt;A quick mental model from my own usage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cheap &amp;amp; good enough&lt;/strong&gt; → DeepSeek-V3 (most general tasks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt; → DeepSeek-R1 (CoT-style tasks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long context&lt;/strong&gt; → MiniMax-Text-01 (200K+ tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontier capability&lt;/strong&gt; → GPT-4o or Claude (still worth it for hard problems)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt; → Claude Sonnet 4.6 or DeepSeek-V3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The win is being able to A/B test across models without rewriting code.&lt;/p&gt;
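
&lt;p&gt;The A/B loop stays correspondingly small. A sketch (the model ids are whatever your gateway exposes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;models = ["gpt-4o", "deepseek-v3", "minimax-text-01"]
question = "Explain async/await in Python"

for model in models:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    print(f"--- {model} ---")
    print(resp.choices[0].message.content[:200])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
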

&lt;h2&gt;
  
  
  What Was Hard About the Routing Logic
&lt;/h2&gt;

&lt;p&gt;(Note: TokenHub itself is hosted, but the pattern is replicable if you want to build your own.)&lt;/p&gt;

&lt;p&gt;The hardest part wasn't the proxy — it was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Normalizing function-calling formats&lt;/strong&gt; across providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handling streaming differences&lt;/strong&gt; (SSE format quirks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token counting&lt;/strong&gt; for accurate billing before the request goes out (estimation sketch after this list)&lt;/li&gt;
&lt;/ol&gt;
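
&lt;p&gt;For the token-counting piece, a rough estimator with &lt;code&gt;tiktoken&lt;/code&gt; is where I'd start (counts are tokenizer-specific, so for non-OpenAI models treat this as an estimate, not billing truth):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import tiktoken

def estimate_input_tokens(messages, model="gpt-4o"):
    # Fall back to cl100k_base when the model's tokenizer is unknown
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")
    return sum(len(enc.encode(m["content"])) for m in messages)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
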

&lt;p&gt;If you're building something similar, the OpenAI spec is the de facto standard. Most providers either match it or have OpenAI-compatible endpoints already.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;If you're tired of juggling AI subscriptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;👉 &lt;a href="https://jiatoken.com" rel="noopener noreferrer"&gt;https://jiatoken.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Free credits to start&lt;/li&gt;
&lt;li&gt;Pay-as-you-go, no monthly commitment&lt;/li&gt;
&lt;li&gt;Compatible with OpenAI SDK out of the box&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd love feedback — especially on which models you'd want added, or pricing pain points.&lt;/p&gt;

&lt;p&gt;What's your current setup? Are you using a single provider or juggling multiple?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
