Lolo

Posted on Jun 27

I Got Tired of Rewriting AI API Wrappers, So I Built a Gateway

#ai #api #llm #showdev

Simplified credit billing for side projects

Every side project starts the same way.

- Generate an OpenAI key.
- Add it to .env.
- Write a wrapper.
- Realize I also need Claude.
- Create another account.
- Another API key.
- Another billing dashboard.

Before the project even starts, I've already configured three different services.

At some point I thought: why not just make this a proper API and host it publicly?

That's how Apiarium started.

Why Not LiteLLM or OpenRouter?

They're great projects. But I wanted something more opinionated:

- Credit-based billing so users always know what they're spending
- One normalized API across text, images, TTS, and transcription
- A self-hosted backend I fully control
- Simple pricing without per-model complexity

My target isn't teams running production AI infrastructure. It's developers building side projects who want one API key and a predictable bill.

How It Works

Client
   │
   ▼
Apiarium
   ├── /llm         → OpenAI, Anthropic (more coming)
   ├── /image       → gpt-image-1 (more coming)
   ├── /tts         → OpenAI TTS, ElevenLabs (soon)
   └── /transcribe  → Whisper (more coming)

More providers are on the way.

Adding a new provider doesn't change the API contract.
Same endpoints, same auth, more options.

From the client's perspective, every provider looks exactly the same:

# GPT-4o-mini
curl -X POST https://api.apiarium.dev/llm \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello"}]}'

# Switch to Claude — same endpoint, same auth
curl -X POST https://api.apiarium.dev/llm \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-haiku","messages":[{"role":"user","content":"Hello"}]}'
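The same pair of calls can be sketched in Python. This is a minimal illustration built from the curl examples above: the endpoint, headers, and payload come from them, while the helper name `build_llm_request` is made up for this sketch. The response schema isn't shown in this post, so the example builds the requests without sending or parsing anything.

```python
import json
import urllib.request

API_URL = "https://api.apiarium.dev/llm"  # endpoint from the curl examples


def build_llm_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /llm endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Switching providers only changes the model string.
gpt_request = build_llm_request("YOUR_KEY", "gpt-4o-mini", "Hello")
claude_request = build_llm_request("YOUR_KEY", "claude-haiku", "Hello")

# To actually send one: urllib.request.urlopen(gpt_request)
```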

The Technical Decisions

Credits instead of per-model pricing. Pricing by token across four providers is confusing. I landed on credits: text generation costs 1–20 credits depending on the model, images cost 100, and TTS costs 10 per 1,000 characters. One number, always visible.
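To make the arithmetic concrete, here is a rough sketch of such a credit table in Python. Only the image and TTS rates and the 1–20 range come from the numbers above. The per-model text values, the per-request billing unit, and the round-up rule for TTS are assumptions made for illustration.

```python
import math

# Hypothetical per-request text costs; only the 1-20 range is given above.
TEXT_CREDITS = {"gpt-4o-mini": 1, "claude-haiku": 2}
IMAGE_CREDITS = 100       # per image
TTS_CREDITS_PER_1K = 10   # per 1,000 characters


def text_cost(model: str) -> int:
    """Credits for one text generation request (assumed to be a flat per-request cost)."""
    return TEXT_CREDITS[model]


def image_cost(count: int = 1) -> int:
    """Credits for generating `count` images."""
    return IMAGE_CREDITS * count


def tts_cost(text: str) -> int:
    """Credits for TTS. Assumption: every started block of 1,000 characters is billed."""
    return math.ceil(len(text) / 1000) * TTS_CREDITS_PER_1K
```

Under these assumptions, a 2,500-character TTS request is billed as three blocks, so 30 credits.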

Normalized error format. OpenAI and Anthropic return completely different error structures. Every error from Apiarium looks the same regardless of which provider caused it:

{
  "error": "Rate limit exceeded. Try again in 30 seconds.",
  "code": "rate_limit_exceeded",
  "retry_after": 30
}
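A minimal sketch of what that normalization step might look like, assuming the provider's error body has already been parsed from JSON. The two input shapes follow the providers' documented error envelopes. The function name, the `CODE_MAP` table, and every target code other than `rate_limit_exceeded` are hypothetical.

```python
from typing import Any, Optional

# Hypothetical mapping from provider error identifiers to gateway codes.
CODE_MAP = {
    "rate_limit_exceeded": "rate_limit_exceeded",    # OpenAI error code
    "rate_limit_error": "rate_limit_exceeded",       # Anthropic error type
    "invalid_api_key": "authentication_failed",      # OpenAI error code
    "authentication_error": "authentication_failed", # Anthropic error type
}


def normalize_error(provider: str, body: dict[str, Any],
                    retry_after: Optional[int] = None) -> dict[str, Any]:
    """Map a provider error body to the {error, code, retry_after} shape."""
    err = body.get("error", {})
    message = err.get("message", "Unknown provider error.")
    if provider == "openai":
        # OpenAI: {"error": {"message", "type", "param", "code"}}
        raw_code = err.get("code") or err.get("type")
    elif provider == "anthropic":
        # Anthropic: {"type": "error", "error": {"type", "message"}}
        raw_code = err.get("type")
    else:
        raw_code = None
    normalized: dict[str, Any] = {
        "error": message,
        "code": CODE_MAP.get(raw_code, "provider_error"),
    }
    if retry_after is not None:
        normalized["retry_after"] = retry_after
    return normalized
```

The client then only ever branches on `code`, whichever provider produced the failure.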

Provider abstraction. Every provider adapter returns the same internal response format before it's sent back to the client. That means adding a new provider is mostly implementing one adapter instead of changing the whole API.
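As an illustration of the adapter idea, here is a small Python sketch. The `LLMResult` type and adapter class names are hypothetical, not Apiarium's actual internals. The field paths reflect the public response shapes of OpenAI's Chat Completions and Anthropic's Messages APIs.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class LLMResult:
    """Hypothetical internal response format shared by all adapters."""
    provider: str
    model: str
    text: str


class ProviderAdapter(Protocol):
    def parse(self, raw: dict[str, Any]) -> LLMResult: ...


class OpenAIAdapter:
    def parse(self, raw: dict[str, Any]) -> LLMResult:
        # Chat Completions puts the text at choices[0].message.content.
        text = raw["choices"][0]["message"]["content"]
        return LLMResult("openai", raw["model"], text)


class AnthropicAdapter:
    def parse(self, raw: dict[str, Any]) -> LLMResult:
        # The Messages API returns a list of content blocks; join the text ones.
        text = "".join(b["text"] for b in raw["content"] if b.get("type") == "text")
        return LLMResult("anthropic", raw["model"], text)


# Adding a provider means adding one entry here plus one adapter class.
ADAPTERS: dict[str, ProviderAdapter] = {
    "openai": OpenAIAdapter(),
    "anthropic": AnthropicAdapter(),
}
```

Everything downstream of `parse` (billing, logging, the client response) only sees `LLMResult`, which is what keeps the API contract stable.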

What I'd Do Differently

If I started again, I'd build the provider abstraction first instead of adding providers one by one. Every new model taught me another edge case around streaming, token accounting, or error handling. Designing for those differences upfront would've saved me time.

Where It Is Now

Launched a few days ago. Still early, but the infrastructure is solid and every endpoint works.

I'm mostly interested in whether this solves a real problem for other developers. If you've hit the same setup tax or think I'm solving the wrong problem entirely, I'd genuinely like to hear it.

If even one developer stops copy-pasting another ai-utils.js file because of this, I'll call it a success.

apiarium.dev · Docs

Top comments (22)

UnitBuilds • Jun 29

IMO, developers need a gateway, but for testing rather than production. Simple routing, so they can wire it up and have a 'try them out' mode. E.g. if a user has a task such as OCR on a document, they want to know which model does the best job, which is fastest, and which is cheapest. 1 gateway, 1 endpoint, boom: you have stats for all 3 available Claude models and all half a million from Google and OpenAI (or a select few, depending on the criteria; e.g. for doc processing you wouldn't use Opus, you'd use Haiku). Add transparent token usage (input and output) and, if possible, a cacheable detector (e.g. for MCPs), so the user can calculate their costs (or you can use a webhook to fetch the latest pricing for each model). That essentially turns it into an LLM benchmark suite that lets developers test whatever they need without having to juggle authentication, API endpoints, and formatting.

That's the kind of tool anyone would happily pay a 10% premium for, because it saves them hours of reconfiguring. Honestly, most would just stick with it for production at that point, because it allows for simple fallbacks, and as long as you keep it updated, they won't have to update their end.

Lolo • Jul 1

The benchmarking use case is one I hadn't fully considered, but it makes total sense: same task, multiple models, comparable stats in one place. Right now credits abstract away token counts, but exposing raw token usage + latency per request alongside credits would make the 'which model is actually cheapest for my use case' question answerable without leaving the gateway. Adding that to the roadmap.

UnitBuilds • Jul 1

Good move. The thing is, people don't buy single-serving software or pay a subscription for one-trick ponies. The real value of a gateway is compatibility and ease of use, and benchmarking would make it an actual cost-saver, not just a time-saver. That's what enterprise clients like to hear: a tool that saves them money.

Lolo • Jul 1

That's exactly where this is heading. The next step isn't just benchmarking manually, it's smart routing: if you don't specify a model, the gateway picks the right one based on the task type. Short simple request → cheap fast model. Complex reasoning → stronger model. The user stops thinking about models entirely and just sends requests. That's the real value of the abstraction layer.
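A rough sketch of that routing idea in Python. Everything here is a placeholder for illustration rather than a description of how the gateway does or will do it: the keyword list, the length threshold, and the `strong-model` tier name are all made up.

```python
from typing import Optional

CHEAP_MODEL = "gpt-4o-mini"    # cheap, fast tier
STRONG_MODEL = "strong-model"  # placeholder name for a stronger tier

# Hypothetical signals that a prompt needs more reasoning.
REASONING_HINTS = ("prove", "step by step", "analyze", "debug", "compare")
SHORT_PROMPT_CHARS = 400


def pick_model(prompt: str, requested: Optional[str] = None) -> str:
    """Return the caller's model if given, else choose a tier heuristically."""
    if requested:
        return requested
    lowered = prompt.lower()
    needs_reasoning = any(hint in lowered for hint in REASONING_HINTS)
    if needs_reasoning or len(prompt) > SHORT_PROMPT_CHARS:
        return STRONG_MODEL
    return CHEAP_MODEL
```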

UnitBuilds • Jul 1

Exactly, and when model cards are updated or models get deprecated, auto-migrate tasks to the newer version.

Mike Czerwinski • Jun 28

Solid writeup, and the most honest line is admitting you built the provider abstraction after adding providers rather than at the second one. That is exactly where the seam should go.

The part I would push on is the credit model. Absorbing margin manually every time a provider reprices means your billing layer is auditing its own cost basis by hand. LiteLLM and OpenRouter dodge this by staying price-transparent. So the real question: when OpenAI or Anthropic shift pricing mid-month, what is your actual lag between their change and your credit remap, and who eats the gap in that window?

Not rhetorical. That number is the whole product.

Lolo • Jun 28

Fair push. The lag is manual right now: realistically hours to a day, depending on when I catch it. In that window I eat the gap; that's a conscious call. The bet is that provider price changes are infrequent enough that the simplicity of fixed credits is worth it for the target user (side project devs who want predictability, not transparency). If that bet turns out wrong, the fix is dynamic credit costs per model. The abstraction layer already supports it; I just haven't needed it yet.

Mike Czerwinski • Jun 29

Eating the gap works while the gap is symmetric. The catch is that provider prices don't move symmetrically: they jump on capacity crunches and drift down slowly if at all, and crunches tend to land exactly when everyone's usage spikes. So the gap you absorb runs against you, and it does so right when your volume peaks.

Which means the real cost of fixed credits isn't the average lag, it's the correlation between when prices move and when your users are burning them. The simplicity is a genuine product call. I'd just watch that correlation rather than the mean: if 'hours to a day' keeps overlapping with demand spikes, the dynamic-cost fallback stops being optional.

Lolo • Jul 1

The correlation point is the one I'm actually watching now. The mean lag is manageable; the problem is when a capacity crunch and a usage spike land on the same day. That's when fixed credits hurt most. No clean solution yet beyond monitoring both and keeping the dynamic-cost fallback close. It's the right thing to track.

Mike Czerwinski • Jul 2

Correlation is the harder signal to trigger cleanly. If the fallback fires from the "both happened at once" event, not from each input separately, you keep the switching bounded to the case you actually care about.

Kartik N V J K • Jun 29

The normalized contract across text, image, TTS and transcribe is the part I would lean on hardest, since swapping providers behind one schema is where most of my glue code used to die. The thing I would add early is per-provider fallback and retry inside the gateway, because the moment one provider 500s or rate-limits, a single endpoint becomes a single point of failure. Are you planning to expose token usage and latency per request, or keep the surface purely credit-based?

Lolo • Jul 1

The single point of failure concern is valid and something I think about. Right now errors are normalized and returned consistently, but per-provider fallback and retry inside the gateway isn't implemented yet, that's the honest answer. On token usage: currently credit-based only, but exposing raw tokens + latency per request is on the roadmap, especially for the use case UnitBuilds describes above.
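For reference, per-provider retry with fallback can be sketched roughly like this. It's a hypothetical outline, not something the gateway does today: `ProviderError`, `call_with_fallback`, and the backoff numbers are all made up for the example, and each provider is modeled as a plain callable.

```python
import time
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error raised by a provider call."""

    def __init__(self, message: str, retryable: bool = True) -> None:
        super().__init__(message)
        self.retryable = retryable


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    attempts_per_provider: int = 2,
    backoff_seconds: float = 0.5,
) -> str:
    """Try each provider in order, retrying retryable errors with backoff."""
    last_error: Optional[ProviderError] = None
    for provider in providers:
        for attempt in range(attempts_per_provider):
            try:
                return provider(prompt)
            except ProviderError as exc:
                last_error = exc
                if not exc.retryable:
                    break  # skip remaining attempts and move to the next provider
                if attempt < attempts_per_provider - 1:
                    time.sleep(backoff_seconds * 2 ** attempt)
    # Every provider failed (or none were configured).
    raise last_error or RuntimeError("no providers configured")
```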

Nazar Boyko • Jun 27

Quick question on the credit model, since that's where I'd be nervous running it. When OpenAI or Anthropic shifts pricing, or you add a pricier model, how do you stop a fixed "1 to 20 credits" from quietly going underwater on the models that got more expensive? Is it a manual remap per model, or is there enough margin baked in that a provider price bump doesn't eat your spread? Either way the normalized error shape is the piece I'd happily borrow, that's the thing most gateways get wrong.

Lolo • Jun 27

Great question. It's a manual remap per model when a provider shifts pricing significantly. That's a risk I'm consciously taking on: if a model gets more expensive, that's my problem to absorb or adjust, not the user's. The whole point of the abstraction layer is that pricing volatility stops at my end.
And yeah, the normalized error shape was the first thing I built. I was tired of writing provider-specific error handling in every project.

Thank you very much for your question.

FuturMix • Jun 29

The credit-based billing decision resonates — per-token pricing across providers is genuinely hard to reason about when OpenAI bills input/output/cached separately and Anthropic adds cache-write vs cache-read tiers on top. One number the user can watch is the right call for the side-project audience.

The edge case that'll bite next, from having lived in this exact problem: token accounting drift between providers. Each provider counts tokens with its own tokenizer, so "1 token" of GPT-4o-mini and "1 token" of claude-haiku aren't the same unit of work. If your credit conversion is a flat per-model multiplier you're fine, but the moment you normalize usage (e.g. surfacing token counts back to the client in one shape) the numbers won't reconcile with each provider's own dashboard, and users will open issues about it. Worth deciding early whether you expose raw provider usage or your own normalized credits — mixing both is where the confusion lives.

The "build the provider abstraction first" lesson is the universal one here. Streaming deltas, error envelopes, and finish_reason semantics differ enough between OpenAI and Anthropic that retrofitting an adapter layer after two providers always means rewriting the first one.

Disclosure: I work on FuturMix (futurmix.ai), an OpenAI-compatible gateway in the same space — so this is a fellow-traveler note, not a pitch. The normalized-error and single-endpoint-across-providers pattern you landed on is the right shape. Curious how you're handling streaming normalization specifically, since that's where most of our edge cases ended up.

Lolo • Jun 29

The token discrepancy point is honest feedback. I didn't overthink it early on; the goal was to abstract that complexity away from the user entirely. Credits are a fixed multiplier per model, so we don't expose raw token counts at all. If a model ends up significantly more expensive to serve, I adjust the credit cost for that model.

On streaming: currently not implemented, responses are returned in full. It's on the roadmap and I expect that's exactly where the edge cases will show up. How did you end up handling finish_reason normalization on your end?