How to Route to 100+ AI Models with a Single API Endpoint

#ai #api #opensource #tutorial

The Problem: API Key Fragmentation Is Real

If you're building AI applications in 2026, you know the pain: 6 different API keys, 6 different billing dashboards, 6 different SDKs. Every time a new model drops, you spend hours integrating it.

I found a solution that changed my workflow: New API — an open-source AI API gateway that routes to 100+ models through a single OpenAI-compatible endpoint.

What Is New API?

New API is an open-source (AGPLv3) gateway that sits between your application and AI model providers. Think of it as a universal translator for AI APIs.

Key Features

Single Endpoint: One OpenAI-compatible API routes to GPT-4o, Claude, Gemini, DeepSeek, Qwen, Llama — and any custom model
Zero Markup: The managed version (aipossword.cn) charges $0 on top of model pricing
Self-Hostable: Docker, 2 minutes. Full control.
Auto Failover: If a model goes down, requests auto-route to the next best option
Team Ready: RBAC, per-member keys, usage quotas
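The gateway performs failover on the server side, but the idea is easy to see in a client-side sketch. The helper below is hypothetical and not part of New API: it assumes a client object shaped like the OpenAI Python SDK and simply tries each model in order until one succeeds.

```python
from typing import Any, Sequence


def complete_with_fallback(client: Any, models: Sequence[str], messages: list) -> Any:
    """Try each model in order and return the first successful response.

    `client` is assumed to expose chat.completions.create(model=..., messages=...),
    as the OpenAI Python SDK does. This is an illustration of the failover idea,
    not the gateway's own routing logic.
    """
    last_error = None
    for model in models:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except Exception as exc:  # deliberately broad for the sketch; narrow it in real code
            last_error = exc
    # Every model in the chain failed, so surface the last underlying error.
    raise RuntimeError(f"all models failed: {list(models)}") from last_error
```

With gateway-side failover enabled, this loop should be unnecessary. It is mainly useful as a mental model, or as a safety net when calling providers directly.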

Quick Start (30 Seconds)

# Your existing OpenAI code — just change the base URL and model
curl https://api.aipossword.cn/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"claude-sonnet-4-20250514","messages":[{"role":"user","content":"Hello"}]}'

Switching Models: One Line of Code

Because every model sits behind the same endpoint, comparing GPT-4o, Claude, and DeepSeek only requires changing the model string:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_KEY",
    base_url="https://api.aipossword.cn/v1"
)

# Try GPT-4o
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role":"user","content":"Hello"}]
)
print(response.choices[0].message.content)

# Now try Claude: same code, different model
response = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    messages=[{"role":"user","content":"Hello"}]
)
print(response.choices[0].message.content)
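If you compare models often, the repeated calls above can be folded into a loop. This is a minimal sketch, and `compare_models` is a hypothetical helper name; it assumes the same OpenAI-style client as the snippet above and that every model ID you pass is enabled on your gateway.

```python
from typing import Any, Dict, Iterable


def compare_models(client: Any, models: Iterable[str], prompt: str) -> Dict[str, str]:
    """Send the same prompt to each model and collect the replies by model name."""
    replies: Dict[str, str] = {}
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # Keep only the text of the first choice for a side-by-side comparison.
        replies[model] = response.choices[0].message.content
    return replies
```

For example, `compare_models(client, ["gpt-4o", "claude-sonnet-4-20250514"], "Hello")` returns a dict with one reply per model.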

Real-World Use Cases

Cost Optimization: Route simple queries to cheap models (Qwen at $0.10/1M tokens) and complex ones to frontier models
Multi-Provider Redundancy: Set up fallback chains — if OpenAI is down, auto-switch to Claude
Team Billing: One invoice, per-member usage tracking, no more expense report nightmares
Local + Cloud Hybrid: Route to your local Ollama instance for dev, fall back to cloud for production
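The cost-optimization case can be sketched as a small routing function. Everything here is an assumption for illustration: the length-based heuristic, the 500-character threshold, and the function names are hypothetical, and real routing would likely use a better signal than prompt length.

```python
def pick_model(prompt: str, cheap_model: str, frontier_model: str,
               max_cheap_chars: int = 500) -> str:
    """Hypothetical heuristic: short prompts go to the cheap model, long ones to the frontier model."""
    return cheap_model if len(prompt) <= max_cheap_chars else frontier_model


def estimate_cost_usd(tokens: int, price_per_million_usd: float) -> float:
    """Cost of `tokens` tokens at a given price per one million tokens."""
    return tokens / 1_000_000 * price_per_million_usd
```

The chosen model ID is then passed as `model=` to the same `client.chat.completions.create` call. At the $0.10 per 1M tokens quoted above, `estimate_cost_usd(2_000_000, 0.10)` comes to about $0.20.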

Self-Hosted vs Managed

Feature	Self-Hosted	Managed (aipossword.cn)
Setup	Docker, 2 min	Instant
Models	Bring your keys	Pre-configured
Billing	DIY	USD, Stripe
Cost	Server costs	Model price + $0
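Since both options expose the same OpenAI-compatible API, moving between them should come down to the base URL and key. A minimal sketch, assuming a hypothetical `GATEWAY_BASE_URL` environment variable and falling back to the managed endpoint used earlier in this post:

```python
import os
from typing import Mapping, Optional

# Managed endpoint taken from the examples in this post.
MANAGED_BASE_URL = "https://api.aipossword.cn/v1"


def resolve_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the gateway base URL, preferring an override from the environment.

    GATEWAY_BASE_URL is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    override = env.get("GATEWAY_BASE_URL")
    if override:
        # Drop a trailing slash so the result has a consistent form.
        return override.rstrip("/")
    return MANAGED_BASE_URL
```

For a self-hosted instance you would set `GATEWAY_BASE_URL` to wherever your container listens, for example `http://localhost:3000/v1` if you publish it on port 3000 (an assumption about your setup), and pass `resolve_base_url()` as `base_url` when constructing the client.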

Why I Recommend It

I've been using New API in production for a few weeks. The auto-failover has saved me twice when providers went down. The zero-markup pricing means I'm not paying extra for convenience — I pay exactly what the model costs.

The open-source nature (AGPLv3) gives me confidence. I can audit the code, self-host if I want, and never worry about vendor lock-in.

Get Started

Self-host: docker run -d --name new-api -p 3000:3000 -v "$(pwd)/data:/data" calciumion/new-api:latest
Managed: aipossword.cn — $5 free credits
GitHub: github.com/QuantumNous/new-api (37k+ stars)

One endpoint. Every model. Zero friction.

Have you tried API gateways for AI models? What's your setup? Let me know in the comments!

Top comments (1)

FastAnchor_io • Jun 17

This architecture solves a massive integration headache, though it brings a set of hidden observability challenges we’ve covered in earlier threads.
Exposing 100+ heterogeneous LLMs under one unified endpoint drastically simplifies client integration, but it amplifies cost and monitoring blind spots. Without strict per-model routing tagging, baseline tracking breaks entirely — config tweaks to routing weights, fallback logic or priority tiers will silently shift monthly spend, and most dashboards won’t trigger automatic recalibration for those changes.
Another critical pain point: blast-radius alert tiering becomes far harder to enforce at scale. Every team will argue their target model workload deserves an exception to standardized severity rules, gradually eroding your entire alert framework. Also, paired versioning between meta-evaluators and individual models becomes unwieldy when managing this many variants; unaligned evaluator drift will corrupt all your unified quality signals.
Curious how you handle two core pieces:
1. Automated baseline resets whenever routing configs are updated.
2. Mandatory evidence review for teams requesting custom alert exceptions for specific models.