Lei Ye

Posted on • Originally published at lei-ye.dev

Building Maester — Enable Multi-provider LLM APIs



We Locked Ourselves Into GCP

Most infrastructure mistakes don’t start as mistakes. They start as reasonable decisions. This one started with a discount.


It Worked Beautifully

In the beginning, the decision felt obvious. We had a large GCP startup credit, so our entire stack ran there.

Compute.
Storage.
Data pipelines.
Model training.
...
Everything.

And honestly, it worked beautifully! Monitoring was already integrated.
Identity management was built in. IAM policies were easy to manage.
Even LDAP integration was already available.

One of my teammates said something that sounded perfectly reasonable:

“Don’t reinvent the wheel.”

And he was right. Why build infrastructure when the cloud already solved it?

We were a small team. Most of our compute was tied to token usage, so costs looked predictable. Everything felt lightweight.

So we did what most startups do. We committed.


Where Did the Cost Come From?

"Don't spend like a billionaire with the company's money !"

What? We were all so much confused with the bill complaints at a Monday morning standup meeting months later. The bill arrived and nobody could clearly explain it. It was the cloud bill. And it started eating into margins.

Where did the cost come from?
Storage?
Network egress?
Pipelines?
Inference traffic?

Someone suggested hiring a cloud optimization engineer. Another suggested redesigning the entire data pipeline.

But we were still a startup. Every time we opened the roadmap we saw something else staring at us:

  • Customer requests.
  • Feature releases.
  • Revenue milestones.

Infrastructure work always lost that fight. So the bills kept climbing. We weren't bankrupt. But we were trapped.


We Split the Stack

Eventually we did something radical. We split the stack. The architecture finally looked like this:

Azure → Identity / Compliance
AWS → Applications / Storage
GCP → Data Pipelines / Training

And the cost?

Still expensive. But predictable. Even without our original startup discount, the system became easier to control.

Vendor lock-in is invisible when things work. It becomes obvious only when you try to leave.


We Are Not Going to Lock into OpenAI

So when we started building the AI APIs, I began seeing the same pattern again.

It was just:

response = openai.responses.create(...)

And honestly, that works.

But I kept remembering the GCP moment. The moment when switching vendors became impossible. We were about to repeat the same mistake.

Except this time the vendor was not a cloud. It was a model provider. So I made a decision. We are not going to lock into OpenAI.


Approach 1 — Let the client choose the model

The simplest idea was letting the client select the model.

POST /generate
{
  "model": "gpt-4.1-mini"
}

This allowed switching between providers.

  • OpenAI
  • Anthropic
  • Others later

Technically it worked. But users quickly complained.

“I don't want to choose the model. I just want the best answer.”

The user is always right. They just wanted results.


Approach 2 — Introduce a Model Gateway

So we moved the decision out of the client. Instead of clients choosing providers, we introduced a Model Gateway.

Application
     ↓
Model Gateway
     ↓
Provider Router
     ↓
Provider Adapter
(OpenAI / Anthropic / others)

This gateway would manage:

  • provider routing
  • fallback logic
  • cost tracking
  • observability
  • evaluation

The application now simply asks for a response. And the infrastructure decides how to produce it.


The Real Code

The implementation lives inside a small reference project I’ve been building called Maester.

The goal of the project is not to build a full AI platform, but to demonstrate a reliable AI API architecture.

The gateway sits inside the system like this:

apps/
   api/
      routes/
         reliable_completion.py

packages/
   model_gateway/
      base.py
      provider_openai.py
      provider_anthropic.py
      router.py
      client.py

The Provider Contract

The first step was defining a provider interface. This follows the Adapter Pattern, allowing different model vendors to conform to a shared interface.

from typing import Protocol

class ModelProvider(Protocol):

    def supports(self, model: str) -> bool:
        ...

    def generate(self, request: GenerationRequest) -> GenerationResponse:
        ...

Each provider adapter simply implements this contract.

For example:

OpenAIProvider
AnthropicProvider
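To make the contract concrete, here is a hedged sketch of what an adapter like OpenAIProvider might look like. The dataclass fields mirror the normalized shape described in this post; the `chat.completions.create` call is one shape the official `openai` Python SDK exposes, and the model list is illustrative. The client is injected so the adapter stays testable without network access:

```python
from dataclasses import dataclass, field

# Hypothetical request/response types mirroring the contract in this post.
@dataclass
class GenerationRequest:
    model: str
    prompt: str
    max_tokens: int

@dataclass
class GenerationResponse:
    provider: str
    model: str
    content: str
    usage: dict = field(default_factory=dict)

class OpenAIProvider:
    """Adapter: wraps an OpenAI-style client behind the shared provider contract."""

    SUPPORTED = ("gpt-4.1", "gpt-4.1-mini")  # illustrative model list

    def __init__(self, client):
        # An injected SDK client (e.g. openai.OpenAI()); injection keeps
        # the adapter testable without network access or API keys.
        self.client = client

    def supports(self, model: str) -> bool:
        return model in self.SUPPORTED

    def generate(self, request: GenerationRequest) -> GenerationResponse:
        resp = self.client.chat.completions.create(
            model=request.model,
            messages=[{"role": "user", "content": request.prompt}],
            max_tokens=request.max_tokens,
        )
        # Normalize the vendor-specific response into the shared shape.
        return GenerationResponse(
            provider="openai",
            model=request.model,
            content=resp.choices[0].message.content,
            usage={"total_tokens": resp.usage.total_tokens},
        )
```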

Both produce the same normalized response.

GenerationResponse
 ├─ provider
 ├─ model
 ├─ content
 └─ usage

This means the rest of the system never deals with vendor-specific formats.


The Router

Next comes the router. The router decides which provider handles a request.

class ModelRouter:

    def __init__(self, providers: list[ModelProvider], fallback_provider: ModelProvider):
        self.providers = providers
        self.fallback_provider = fallback_provider

    def route(self, model: str) -> ModelProvider:
        for provider in self.providers:
            if provider.supports(model):
                return provider

        return self.fallback_provider

In production systems this layer can later evolve into:

  • cost-aware routing
  • latency-aware routing
  • capability routing
  • traffic shaping

But the interface stays the same.
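For instance, a cost-aware policy can be swapped in behind the same `route()` interface. A hedged sketch, assuming each adapter exposes a hypothetical `cost_per_1k_tokens(model)` method (not part of the contract above):

```python
class CostAwareRouter:
    """Same route() interface as ModelRouter, different strategy:
    among providers that support the model, pick the cheapest."""

    def __init__(self, providers, fallback_provider):
        self.providers = providers
        self.fallback_provider = fallback_provider

    def route(self, model: str):
        candidates = [p for p in self.providers if p.supports(model)]
        if not candidates:
            return self.fallback_provider
        # cost_per_1k_tokens is an assumed method on each adapter.
        return min(candidates, key=lambda p: p.cost_per_1k_tokens(model))
```

Because the interface is unchanged, the gateway can swap `ModelRouter` for `CostAwareRouter` without touching any application code.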


The Gateway Client

Finally the application talks to the gateway through a simple client.

class ModelGateway:

    def __init__(self, router: ModelRouter):
        self.router = router

    def generate(self, model: str, prompt: str, max_tokens: int) -> GenerationResponse:
        request = GenerationRequest(
            model=model,
            prompt=prompt,
            max_tokens=max_tokens,
        )

        provider = self.router.route(model)

        return provider.generate(request)

The API layer doesn't know which provider was selected. It just receives a normalized response.
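Wired together, the pieces compose like this. A hedged sketch with stub adapters standing in for real SDK-backed ones (the model names and the dict-shaped response are illustrative):

```python
from dataclasses import dataclass

@dataclass
class GenerationRequest:
    model: str
    prompt: str
    max_tokens: int

class EchoProvider:
    """Stub adapter standing in for a real SDK-backed provider."""
    def __init__(self, name, models):
        self.name = name
        self.models = models
    def supports(self, model):
        return model in self.models
    def generate(self, request):
        return {"provider": self.name, "model": request.model,
                "content": f"[{self.name}] {request.prompt}"}

class ModelRouter:
    def __init__(self, providers, fallback_provider):
        self.providers = providers
        self.fallback_provider = fallback_provider
    def route(self, model):
        for provider in self.providers:
            if provider.supports(model):
                return provider
        return self.fallback_provider

class ModelGateway:
    def __init__(self, router):
        self.router = router
    def generate(self, model, prompt, max_tokens):
        request = GenerationRequest(model=model, prompt=prompt, max_tokens=max_tokens)
        return self.router.route(model).generate(request)

# Wiring: two stub providers, one of them doubling as the fallback.
openai_stub = EchoProvider("openai", {"gpt-4.1-mini"})
anthropic_stub = EchoProvider("anthropic", {"claude-haiku"})
gateway = ModelGateway(ModelRouter([openai_stub, anthropic_stub],
                                   fallback_provider=openai_stub))
```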


The API Layer

The FastAPI route becomes extremely simple.

model_response = model_gateway.generate(
    model=requested_model,
    prompt=payload.prompt,
    max_tokens=payload.max_tokens,
)

After generation, the system runs the reliability pipeline:

  1. Cost metering
  2. Evaluation
  3. Structured logging

Example log:

model_routed
requested_model: gpt-4.1-mini
selected_provider: openai
fallback_used: false

This gives operators visibility without leaking provider logic into application code.
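One way to produce that `model_routed` event is to keep record construction separate from emission. A sketch using the stdlib `json` and `logging` modules, with field names taken from the example log above:

```python
import json
import logging

logger = logging.getLogger("model_gateway")

def routed_event(requested_model: str, selected_provider: str,
                 fallback_used: bool) -> dict:
    # One flat record per routing decision: easy to filter and aggregate.
    return {
        "event": "model_routed",
        "requested_model": requested_model,
        "selected_provider": selected_provider,
        "fallback_used": fallback_used,
    }

def log_model_routed(requested_model: str, selected_provider: str,
                     fallback_used: bool) -> None:
    # Emit the record as a single JSON line.
    logger.info(json.dumps(routed_event(requested_model,
                                        selected_provider,
                                        fallback_used)))
```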


Why This Architecture Matters

This design combines several classic software engineering principles:

  • Dependency Inversion: application code depends on abstractions, not providers.
  • Adapter Pattern: each vendor SDK is wrapped behind a provider adapter.
  • Strategy Pattern: routing policies are interchangeable strategies.
  • Separation of Concerns: the API layer handles orchestration; the gateway handles provider logic.

What This Enables Later

Once this boundary exists, the system becomes far easier to evolve.

For example:

  • multi-provider fallback
  • provider benchmarking
  • cost-aware routing
  • latency optimization
  • evaluation-based routing

All of those changes can happen inside the gateway. The application API never changes. That is the real value of the design.
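As one concrete illustration, multi-provider fallback can live entirely inside the gateway. A hedged sketch (adapters here take raw arguments for brevity, and real code would catch vendor-specific errors rather than bare Exception):

```python
class FallbackGateway:
    """Try providers in preference order; callers keep the same generate() API."""

    def __init__(self, providers):
        self.providers = providers  # ordered by preference

    def generate(self, model: str, prompt: str, max_tokens: int):
        last_error = None
        for provider in self.providers:
            if not provider.supports(model):
                continue
            try:
                return provider.generate(model, prompt, max_tokens)
            except Exception as err:  # sketch: narrow this to vendor errors
                last_error = err
        raise RuntimeError("all providers failed") from last_error
```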


The Lesson

Vendor lock-in rarely feels dangerous at the beginning. Everything works. Costs look reasonable. The roadmap is full of features.

Then one day something changes. Prices rise. Performance shifts. A better provider appears. And suddenly the architecture makes switching painful.

The lesson I learned from our cloud migration was simple: Always design one layer where you can change your mind later.

For our AI systems, that layer became the Model Gateway.

The application talks to the gateway.
The gateway talks to providers.
And the providers can change.

Because eventually they always do.


Note: This article was originally published on my engineering blog, where I document the design of Maester, an AI SaaS infrastructure system built in public.
Original post: Building Maester — Enable Multi-provider LLM APIs.
