Most teams do not start with a full AI platform.
They start with a problem.
Maybe one team wants to proxy OpenAI traffic through an internal service. Maybe another wants to route small prompts to a cheaper model and longer prompts to a stronger one. Maybe the platform team wants one place to add policy, fallback, logging, rate limits, or tenant-specific rules.
That is usually the moment when a gateway becomes more valuable than another direct SDK call.
The challenge is that once you insert a gateway between the application and the model provider, you also create a new layer that can become opaque. A request gets routed somewhere, a model gets selected, a response comes back, and later nobody remembers why that route was chosen.
That is the motivation behind llm-gateway-template, an open-source Node.js starter that shows how to build an OpenAI-compatible gateway with model routing and Tokvera trace visibility.
Why an LLM gateway is useful
An LLM gateway gives platform teams a control point.
Instead of letting every application talk to providers directly, the gateway becomes the place where you can standardize request handling and enforce common decisions.
That usually includes things like:
- routing requests to different models
- applying policy before a provider call happens
- centralizing observability and audit metadata
- adding tenant-level behavior without changing every client app
- introducing fallback logic without touching each product surface
This is especially useful when the application team wants a familiar API contract but the platform team wants more control underneath.
What this starter repo does
llm-gateway-template is intentionally small, but it captures the workflow shape that matters.
For each incoming OpenAI-style request, the service:
- accepts a /v1/chat/completions payload
- decides whether to keep the requested model or auto-route it
- forwards the request to a downstream provider or mock responder
- returns an OpenAI-compatible completion response
- includes Tokvera metadata for the route and trace
That makes the repo useful for teams that want to prototype gateway behavior without having to build a large internal platform first.
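The per-request flow above can be sketched as a single handler. This is a minimal illustration in the mock-mode path, not the repo's actual code; the function and field names here are assumptions chosen to mirror the response shape shown later in this post.

```javascript
// Sketch of the per-request flow: accept an OpenAI-style payload,
// pick a model, answer from a mock responder, and attach gateway metadata.
// Names are illustrative, not the repo's actual implementation.
function handleChatCompletion(payload) {
  // Routing step (simplified): explicit models pass through, "auto" picks a default.
  const model = payload.model === "auto" ? "gpt-4o-mini" : payload.model;

  // Mock responder stands in for the downstream provider call.
  const completionText = `Mock gateway response from ${model}`;

  // OpenAI-compatible response, plus gateway metadata under a "tokvera" key.
  return {
    id: "chatcmpl_mock_123",
    object: "chat.completion",
    model,
    choices: [
      {
        index: 0,
        message: { role: "assistant", content: completionText },
        finish_reason: "stop",
      },
    ],
    tokvera: {
      request: {
        requestedModel: payload.model,
        messageCount: payload.messages.length,
        mockMode: true,
      },
    },
  };
}
```

Because the handler stays OpenAI-compatible, existing clients can point at it without code changes.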
The API shape stays familiar
One of the best choices in this starter is that it keeps the interface simple.
Clients can call it using a familiar OpenAI-style payload:
curl -X POST http://localhost:3100/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [
{ "role": "system", "content": "You are a concise assistant." },
{ "role": "user", "content": "Summarize the importance of model routing in two bullet points." }
]
}'
That matters because it lowers adoption friction.
You can introduce gateway logic without forcing every internal caller to learn a completely new contract.
How routing works in the starter
The default routing logic is simple on purpose.
If the caller specifies an explicit model, the gateway passes that through unchanged.
If the caller uses model: "auto", the gateway estimates prompt size and chooses either a small model or a larger one.
In the current implementation:
- explicit models become passthrough requests
- short prompts route to the smaller model
- longer prompts route to the larger model
- the response carries the route reason and selected model
That is enough to demonstrate the control plane behavior that most teams care about first.
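The routing rules above can be expressed in a few lines. In this sketch, the character cutoff and the roughly-four-characters-per-token heuristic are assumptions for illustration; the repo's actual thresholds may differ. The returned fields mirror the routing metadata shown in the response example later in this post.

```javascript
// Sketch of the routing decision: explicit models pass through,
// "auto" is routed by estimated prompt size. Threshold and token
// heuristic are assumptions, not the repo's exact values.
const SMALL_MODEL = "gpt-4o-mini";
const LARGE_MODEL = "gpt-4o";
const SMALL_PROMPT_CHAR_LIMIT = 400; // assumed cutoff

function routeRequest(payload) {
  // Explicit models become passthrough requests.
  if (payload.model && payload.model !== "auto") {
    return {
      selectedModel: payload.model,
      routeReason: "explicit_model",
      sizeClass: "passthrough",
    };
  }

  // Estimate prompt size from total message characters (~4 chars per token).
  const totalCharacters = payload.messages.reduce(
    (sum, m) => sum + m.content.length,
    0
  );
  const estimatedPromptTokens = Math.ceil(totalCharacters / 4);
  const isSmall = totalCharacters <= SMALL_PROMPT_CHAR_LIMIT;

  return {
    selectedModel: isSmall ? SMALL_MODEL : LARGE_MODEL,
    routeReason: isSmall ? "short_prompt_default" : "long_prompt_upgrade",
    sizeClass: isSmall ? "small" : "large",
    totalCharacters,
    estimatedPromptTokens,
  };
}
```

Keeping the route reason in the return value is what makes the decision auditable later.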
Why visibility matters at the gateway layer
A gateway is not only an HTTP proxy.
It is a decision engine.
Once the gateway starts selecting models, estimating prompt size, or applying policy, it becomes one of the most important places to observe.
Without visibility, teams run into questions like:
- Why did this request choose the large model?
- Did the client override the route, or did the gateway decide?
- Was the request expensive because of the prompt, the chosen model, or both?
- Did the provider fail, or did routing logic choose the wrong path?
If your only evidence is the final completion response, debugging turns into guesswork.
That is why tracing the gateway itself matters just as much as tracing the downstream model call.
How Tokvera fits into the flow
The starter uses Tokvera to trace both the gateway root and the downstream model execution.
The architecture is simple:
OpenAI-style request
-> route_request
-> downstream_provider_call
-> completion response + Tokvera metadata
That structure gives you a coherent trace instead of isolated model events.
You can inspect the routing step, see the selected model, review route reasoning, and keep the downstream provider call attached to the same workflow lineage.
That is much more useful than observing only the final provider response in isolation.
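The trace shape above can be illustrated with a tiny in-memory tracer. To be clear, this is not the Tokvera SDK; it only demonstrates how the routing step and the downstream provider call can share one workflow lineage via parent-child spans.

```javascript
// NOT the Tokvera SDK: a toy in-memory tracer showing how the routing
// span and the downstream provider span stay linked in one lineage.
const spans = [];

async function withSpan(name, parentId, fn) {
  const span = { id: `span_${spans.length + 1}`, name, parentId };
  spans.push(span);
  return fn(span.id);
}

async function gateway(payload) {
  // Root span: the routing decision.
  return withSpan("route_request", null, async (rootId) => {
    const model = payload.model === "auto" ? "gpt-4o-mini" : payload.model;
    // Child span: the downstream call stays attached to the routing step.
    return withSpan("downstream_provider_call", rootId, async () => {
      return { model, content: "mock completion" };
    });
  });
}
```

With this structure, inspecting any provider call leads back to the routing decision that caused it.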
What the response gives you
The gateway returns a familiar completion response and includes a tokvera object with routing and request metadata.
Example shape:
{
"id": "chatcmpl_mock_123",
"object": "chat.completion",
"model": "gpt-4o-mini",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Mock gateway response from gpt-4o-mini: ..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 30,
"completion_tokens": 18,
"total_tokens": 48
},
"tokvera": {
"traceId": "trc_123",
"runId": "run_123",
"routing": {
"routeReason": "short_prompt_default",
"sizeClass": "small",
"selectedModel": "gpt-4o-mini",
"totalCharacters": 124,
"estimatedPromptTokens": 31
},
"request": {
"requestedModel": "auto",
"messageCount": 2,
"mockMode": true,
"provider": "mock"
}
}
}
That extra metadata is what makes the gateway operationally useful.
It lets platform teams answer not just what the model said, but how the request moved through the routing system.
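On the caller side, that metadata is easy to consume, for example for logging or cost attribution. A small helper along these lines (the function name is hypothetical; the field names follow the response shape above) would summarize how a request was routed:

```javascript
// Summarize a gateway response's routing metadata for logs.
// Field names follow the "tokvera" object in the response shape above.
function summarizeRoute(response) {
  const { routing, request } = response.tokvera;
  return `${request.requestedModel} -> ${routing.selectedModel} ` +
    `(${routing.routeReason}, ~${routing.estimatedPromptTokens} prompt tokens)`;
}
```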
Running it locally
Like the support-router starter, this project defaults to mock mode.
That makes it easy to evaluate and demo without needing live provider traffic on day one.
npm install
cp .env.example .env
npm run dev
By default, the service runs on http://localhost:3100.
To use live requests, set MOCK_MODE=false and provide:
OPENAI_API_KEY
TOKVERA_API_KEY
You can also configure:
OPENAI_MODEL_SMALL
OPENAI_MODEL_LARGE
GATEWAY_TENANT_ID
TOKVERA_INGEST_URL
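Putting those together, a .env for live mode might look like this. The variable names come from the settings above; every value here is a placeholder, and the repo's .env.example is the authoritative reference:

```
MOCK_MODE=false
OPENAI_API_KEY=sk-your-key-here
TOKVERA_API_KEY=your-tokvera-key-here
OPENAI_MODEL_SMALL=gpt-4o-mini
OPENAI_MODEL_LARGE=gpt-4o
GATEWAY_TENANT_ID=your-tenant-id
TOKVERA_INGEST_URL=https://your-ingest-endpoint
```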
That makes the starter good for both local demos and real integration experiments.
What to customize next
The repo is deliberately minimal, which makes it a good foundation for platform-specific extensions.
The next useful upgrades would be:
- add provider fallback chains
- add latency-aware or cost-aware routing
- add tenant-specific policies and budgets
- add rate limiting and request logging
- add payload redaction or prompt policy checks
- add Anthropic or Gemini as downstream providers
Those are the kinds of features that turn a starter into a real internal AI gateway.
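As one concrete example, the provider fallback chain from that list can start as a small wrapper. This is a sketch under assumed interfaces (a provider here is just a name plus an async call), not a prescription for how the repo should implement it:

```javascript
// Sketch of a provider fallback chain: try providers in order,
// first success wins, and failures are collected for the trace.
// The provider interface ({ name, call }) is an assumption.
async function callWithFallback(providers, request) {
  const errors = [];
  for (const provider of providers) {
    try {
      return { provider: provider.name, result: await provider.call(request) };
    } catch (err) {
      // Record the failure and move on to the next provider.
      errors.push({ provider: provider.name, error: err.message });
    }
  }
  throw new Error(`all providers failed: ${JSON.stringify(errors)}`);
}
```

Recording which providers failed, and why, keeps fallback behavior as inspectable as the routing decision itself.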
Why this repo is commercially useful
A lot of AI infrastructure work happens before a team is ready for a full orchestration platform.
They still need a place to enforce routing rules, centralize cost control, and inspect why requests were handled the way they were.
That is exactly where an OpenAI-compatible gateway becomes valuable.
And that is why llm-gateway-template is a strong reference repo.
It shows how to preserve a familiar client interface while making gateway behavior observable, inspectable, and extensible.
Related links
- Repo: https://github.com/Tokvera/llm-gateway-template
- Existing app tracing guide: https://tokvera.org/docs/integrations/existing-app
- Get started: https://tokvera.org/docs/get-started