Most teams do not start with a full AI platform.
They start with a problem.
Maybe one team wants to proxy OpenAI traffic through an internal service. Maybe another wants to route small prompts to a cheaper model and longer prompts to a stronger one. Maybe the platform team wants one place to add policy, fallback, logging, rate limits, or tenant-specific rules.
That is usually the moment when a gateway becomes more valuable than another direct SDK call.
The challenge is that once you insert a gateway between the application and the model provider, you also create a new layer that can become opaque. A request gets routed somewhere, a model gets selected, a response comes back, and later nobody remembers why that route was chosen.
That is the motivation behind llm-gateway-template, an open-source Node.js starter that shows how to build an OpenAI-compatible gateway with model routing and Tokvera trace visibility.
Why an LLM gateway is useful
An LLM gateway gives platform teams a control point.
Instead of letting every application talk to providers directly, the gateway becomes the place where you can standardize request handling and enforce common decisions.
That usually includes things like:
- routing requests to different models
- applying policy before a provider call happens
- centralizing observability and audit metadata
- adding tenant-level behavior without changing every client app
- introducing fallback logic without touching each product surface
This is especially useful when the application team wants a familiar API contract but the platform team wants more control underneath.
What this starter repo does
llm-gateway-template is intentionally small, but it captures the workflow shape that matters.
For each incoming OpenAI-style request, the service:
- accepts a /v1/chat/completions payload
- decides whether to keep the requested model or auto-route it
- forwards the request to a downstream provider or mock responder
- returns an OpenAI-compatible completion response
- includes Tokvera metadata for the route and trace
That makes the repo useful for teams that want to prototype gateway behavior without having to build a large internal platform first.
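The per-request flow above can be sketched as a single handler. This is a minimal illustration in the mock-mode path, not the repo's actual code; the function and field names here are assumptions chosen to mirror the response shape shown later in this post.

```javascript
// Sketch of the per-request flow: accept an OpenAI-style payload,
// pick a model, answer from a mock responder, and attach gateway metadata.
// Names are illustrative, not the repo's actual implementation.
function handleChatCompletion(payload) {
  // Routing step (simplified): explicit models pass through, "auto" picks a default.
  const model = payload.model === "auto" ? "gpt-4o-mini" : payload.model;

  // Mock responder stands in for the downstream provider call.
  const completionText = `Mock gateway response from ${model}`;

  // OpenAI-compatible response, plus gateway metadata under a "tokvera" key.
  return {
    id: "chatcmpl_mock_123",
    object: "chat.completion",
    model,
    choices: [
      {
        index: 0,
        message: { role: "assistant", content: completionText },
        finish_reason: "stop",
      },
    ],
    tokvera: {
      request: {
        requestedModel: payload.model,
        messageCount: payload.messages.length,
        mockMode: true,
      },
    },
  };
}
```

Because the handler stays OpenAI-compatible, existing clients can point at it without code changes.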
The API shape stays familiar
One of the best choices in this starter is that it keeps the interface simple.
Clients can call it using a familiar OpenAI-style payload:
curl -X POST http://localhost:3100/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [
{ "role": "system", "content": "You are a concise assistant." },
{ "role": "user", "content": "Summarize the importance of model routing in two bullet points." }
]
}'
That matters because it lowers adoption friction.
You can introduce gateway logic without forcing every internal caller to learn a completely new contract.
How routing works in the starter
The default routing logic is simple on purpose.
If the caller specifies an explicit model, the gateway passes that through unchanged.
If the caller uses model: "auto", the gateway estimates prompt size and chooses either a small model or a larger one.
In the current implementation:
- explicit models become passthrough requests
- short prompts route to the smaller model
- longer prompts route to the larger model
- the response carries the route reason and selected model
That is enough to demonstrate the control plane behavior that most teams care about first.
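The routing rules above can be expressed in a few lines. In this sketch, the character cutoff and the roughly-four-characters-per-token heuristic are assumptions for illustration; the repo's actual thresholds may differ. The returned fields mirror the routing metadata shown in the response example later in this post.

```javascript
// Sketch of the routing decision: explicit models pass through,
// "auto" is routed by estimated prompt size. Threshold and token
// heuristic are assumptions, not the repo's exact values.
const SMALL_MODEL = "gpt-4o-mini";
const LARGE_MODEL = "gpt-4o";
const SMALL_PROMPT_CHAR_LIMIT = 400; // assumed cutoff

function routeRequest(payload) {
  // Explicit models become passthrough requests.
  if (payload.model && payload.model !== "auto") {
    return {
      selectedModel: payload.model,
      routeReason: "explicit_model",
      sizeClass: "passthrough",
    };
  }

  // Estimate prompt size from total message characters (~4 chars per token).
  const totalCharacters = payload.messages.reduce(
    (sum, m) => sum + m.content.length,
    0
  );
  const estimatedPromptTokens = Math.ceil(totalCharacters / 4);
  const isSmall = totalCharacters <= SMALL_PROMPT_CHAR_LIMIT;

  return {
    selectedModel: isSmall ? SMALL_MODEL : LARGE_MODEL,
    routeReason: isSmall ? "short_prompt_default" : "long_prompt_upgrade",
    sizeClass: isSmall ? "small" : "large",
    totalCharacters,
    estimatedPromptTokens,
  };
}
```

Keeping the route reason in the return value is what makes the decision auditable later.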
Why visibility matters at the gateway layer
A gateway is not only an HTTP proxy.
It is a decision engine.
Once the gateway starts selecting models, estimating prompt size, or applying policy, it becomes one of the most important places to observe.
Without visibility, teams run into questions like:
- Why did this request choose the large model?
- Did the client override the route, or did the gateway decide?
- Was the request expensive because of the prompt, the chosen model, or both?
- Did the provider fail, or did routing logic choose the wrong path?
If your only evidence is the final completion response, debugging turns into guesswork.
That is why tracing the gateway itself matters just as much as tracing the downstream model call.
How Tokvera fits into the flow
The starter uses Tokvera to trace both the gateway root and the downstream model execution.
The architecture is simple:
OpenAI-style request
-> route_request
-> downstream_provider_call
-> completion response + Tokvera metadata
That structure gives you a coherent trace instead of isolated model events.
You can inspect the routing step, see the selected model, review route reasoning, and keep the downstream provider call attached to the same workflow lineage.
That is much more useful than observing only the final provider response in isolation.
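The trace shape above can be illustrated with a tiny in-memory tracer. To be clear, this is not the Tokvera SDK; it only demonstrates how the routing step and the downstream provider call can share one workflow lineage via parent-child spans.

```javascript
// NOT the Tokvera SDK: a toy in-memory tracer showing how the routing
// span and the downstream provider span stay linked in one lineage.
const spans = [];

async function withSpan(name, parentId, fn) {
  const span = { id: `span_${spans.length + 1}`, name, parentId };
  spans.push(span);
  return fn(span.id);
}

async function gateway(payload) {
  // Root span: the routing decision.
  return withSpan("route_request", null, async (rootId) => {
    const model = payload.model === "auto" ? "gpt-4o-mini" : payload.model;
    // Child span: the downstream call stays attached to the routing step.
    return withSpan("downstream_provider_call", rootId, async () => {
      return { model, content: "mock completion" };
    });
  });
}
```

With this structure, inspecting any provider call leads back to the routing decision that caused it.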
What the response gives you
The gateway returns a familiar completion response and includes a tokvera object with routing and request metadata.
Example shape:
{
"id": "chatcmpl_mock_123",
"object": "chat.completion",
"model": "gpt-4o-mini",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Mock gateway response from gpt-4o-mini: ..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 30,
"completion_tokens": 18,
"total_tokens": 48
},
"tokvera": {
"traceId": "trc_123",
"runId": "run_123",
"routing": {
"routeReason": "short_prompt_default",
"sizeClass": "small",
"selectedModel": "gpt-4o-mini",
"totalCharacters": 124,
"estimatedPromptTokens": 31
},
"request": {
"requestedModel": "auto",
"messageCount": 2,
"mockMode": true,
"provider": "mock"
}
}
}
That extra metadata is what makes the gateway operationally useful.
It lets platform teams answer not just what the model said, but how the request moved through the routing system.
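On the caller side, that metadata is easy to consume, for example for logging or cost attribution. A small helper along these lines (the function name is hypothetical; the field names follow the response shape above) would summarize how a request was routed:

```javascript
// Summarize a gateway response's routing metadata for logs.
// Field names follow the "tokvera" object in the response shape above.
function summarizeRoute(response) {
  const { routing, request } = response.tokvera;
  return `${request.requestedModel} -> ${routing.selectedModel} ` +
    `(${routing.routeReason}, ~${routing.estimatedPromptTokens} prompt tokens)`;
}
```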
Running it locally
Like the support-router starter, this project defaults to mock mode.
That makes it easy to evaluate and demo without needing live provider traffic on day one.
npm install
cp .env.example .env
npm run dev
By default, the service runs on http://localhost:3100.
To use live requests, set MOCK_MODE=false and provide:
OPENAI_API_KEY
TOKVERA_API_KEY
You can also configure:
OPENAI_MODEL_SMALL
OPENAI_MODEL_LARGE
GATEWAY_TENANT_ID
TOKVERA_INGEST_URL
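Putting those together, a .env for live mode might look like this. The variable names come from the settings above; every value here is a placeholder, and the repo's .env.example is the authoritative reference:

```
MOCK_MODE=false
OPENAI_API_KEY=sk-your-key-here
TOKVERA_API_KEY=your-tokvera-key-here
OPENAI_MODEL_SMALL=gpt-4o-mini
OPENAI_MODEL_LARGE=gpt-4o
GATEWAY_TENANT_ID=your-tenant-id
TOKVERA_INGEST_URL=https://your-ingest-endpoint
```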
That makes the starter good for both local demos and real integration experiments.
What to customize next
The repo is deliberately minimal, which makes it a good foundation for platform-specific extensions.
The next useful upgrades would be:
- add provider fallback chains
- add latency-aware or cost-aware routing
- add tenant-specific policies and budgets
- add rate limiting and request logging
- add payload redaction or prompt policy checks
- add Anthropic or Gemini as downstream providers
Those are the kinds of features that turn a starter into a real internal AI gateway.
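As one concrete example, the provider fallback chain from that list can start as a small wrapper. This is a sketch under assumed interfaces (a provider here is just a name plus an async call), not a prescription for how the repo should implement it:

```javascript
// Sketch of a provider fallback chain: try providers in order,
// first success wins, and failures are collected for the trace.
// The provider interface ({ name, call }) is an assumption.
async function callWithFallback(providers, request) {
  const errors = [];
  for (const provider of providers) {
    try {
      return { provider: provider.name, result: await provider.call(request) };
    } catch (err) {
      // Record the failure and move on to the next provider.
      errors.push({ provider: provider.name, error: err.message });
    }
  }
  throw new Error(`all providers failed: ${JSON.stringify(errors)}`);
}
```

Recording which providers failed, and why, keeps fallback behavior as inspectable as the routing decision itself.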
Why this repo is commercially useful
A lot of AI infrastructure work happens before a team is ready for a full orchestration platform.
They still need a place to enforce routing rules, centralize cost control, and inspect why requests were handled the way they were.
That is exactly where an OpenAI-compatible gateway becomes valuable.
And that is why llm-gateway-template is a strong reference repo.
It shows how to preserve a familiar client interface while making gateway behavior observable, inspectable, and extensible.
Related links
- Repo: https://github.com/Tokvera/llm-gateway-template
- Existing app tracing guide: https://tokvera.org/docs/integrations/existing-app
- Get started: https://tokvera.org/docs/get-started