DEV Community

Ye Allen

How to Control AI API Costs with Model Tiers and an OpenAI-Compatible Gateway

When an AI feature moves from a prototype to real users, API cost usually becomes one of the first scaling problems.

The mistake I see most often is simple: every request goes to the same default model.

That works during testing, but it becomes expensive when the product starts handling chat messages, summaries, RAG answers, classification jobs, and background tasks at the same time.

A better pattern is to separate model choice by product value.

1. Keep the OpenAI SDK shape stable

If your app already uses the OpenAI SDK, do not spread provider-specific logic across the codebase. Keep the client small and configurable:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["VECTOR_ENGINE_API_KEY"],
    base_url=os.getenv("VECTOR_ENGINE_BASE_URL", "https://www.vectronode.com/v1"),
)

The important part is that the base URL, API key, and model name live in configuration instead of product logic.
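The model names can live in configuration for the same reason. A minimal sketch; the environment variable names and defaults below are assumptions on my part, not part of the original setup:

```python
import os

# Hypothetical env vars; the default model names are only examples.
STRONG_MODEL = os.getenv("STRONG_MODEL", "gpt-4o-mini")
CHEAP_MODEL = os.getenv("CHEAP_MODEL", "deepseek-chat")
```

Swapping a model then becomes a deployment change instead of a code change.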

2. Split tasks into model tiers

Not every request needs the same model.

Use stronger models for:

  • paid-user workflows
  • complex reasoning
  • customer-facing answers
  • coding and analysis tasks where quality matters

Use lower-cost models for:

  • drafts
  • short summaries
  • classification
  • routing
  • internal checks
  • free-tier usage

This is where an OpenAI-compatible gateway is useful. You can test GPT, Claude, Gemini, DeepSeek, Qwen, and other models behind one API format instead of wiring every provider separately.
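One way to encode the split above is a plain tier-to-model mapping with a safe fallback. A sketch, where the specific model names are illustrative rather than recommendations:

```python
# Example tier-to-model mapping; use whatever names the gateway exposes.
MODEL_TIERS = {
    "strong": "gpt-4o-mini",   # paid workflows, reasoning, customer-facing
    "cheap": "deepseek-chat",  # drafts, summaries, classification, routing
}

def model_for(tier: str) -> str:
    # Unknown tier labels fall back to the cheap tier, so a typo
    # never silently routes traffic to an expensive model.
    return MODEL_TIERS.get(tier, MODEL_TIERS["cheap"])
```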

3. Route by feature and user tier

A simple router can prevent accidental overuse of expensive models:

def choose_model(user_tier: str, feature: str) -> str:
    if user_tier == "free":
        return "deepseek-chat"

    if feature in {"classification", "draft", "summary"}:
        return "deepseek-chat"

    return "gpt-4o-mini"

This is not a perfect router. It is a starting point. The goal is to make model selection explicit and measurable.
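At the call site, the router keeps model literals out of product code. A sketch, assuming an OpenAI-style client like the one from section 1; the complete helper is hypothetical, and the router is repeated so the snippet is self-contained:

```python
def choose_model(user_tier: str, feature: str) -> str:
    # Same router as above, repeated here for self-containment.
    if user_tier == "free" or feature in {"classification", "draft", "summary"}:
        return "deepseek-chat"
    return "gpt-4o-mini"

def complete(client, user_tier: str, feature: str, prompt: str):
    # The single place where model selection happens for chat calls.
    return client.chat.completions.create(
        model=choose_model(user_tier, feature),
        messages=[{"role": "user", "content": prompt}],
    )
```

Because selection is centralized, you can later log it, A/B test it, or add per-feature overrides without touching every call site.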

4. Set token limits per feature

A background summarizer, a chat reply, and an agent planning step should not share one token limit.

FEATURE_TOKEN_LIMITS = {
    "support_summary": 300,
    "chat_reply": 800,
    "agent_plan": 500,
    "rag_answer": 900,
}

Start conservative. Raise limits only when product quality actually improves.
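The limits can then be applied through a small lookup with a conservative fallback for unknown features. A sketch; the default value of 300 is my assumption, and the table is repeated so the snippet stands alone:

```python
FEATURE_TOKEN_LIMITS = {
    "support_summary": 300,
    "chat_reply": 800,
    "agent_plan": 500,
    "rag_answer": 900,
}

def max_tokens_for(feature: str, default: int = 300) -> int:
    # Unregistered features get the conservative default rather than no cap,
    # so a new feature cannot accidentally run uncapped.
    return FEATURE_TOKEN_LIMITS.get(feature, default)
```

The returned value would be passed as max_tokens on the completion request.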

5. Track the cost signals early

Before traffic grows, log enough metadata to understand spend:

  • feature name
  • user tier
  • model name
  • latency
  • success or error status
  • prompt and completion token counts

You do not need to store full private prompts to understand cost behavior.
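A minimal logging sketch covering those fields. The usage argument mirrors the usage object on an OpenAI-style response, which exposes prompt_tokens and completion_tokens; where the record is shipped (stdout here) is up to you:

```python
import json
import time

def log_llm_call(feature, user_tier, model, started_at, status, usage):
    # usage is the response.usage object from an OpenAI-style API.
    record = {
        "feature": feature,
        "user_tier": user_tier,
        "model": model,
        "latency_ms": round((time.monotonic() - started_at) * 1000),
        "status": status,
        "prompt_tokens": getattr(usage, "prompt_tokens", None),
        "completion_tokens": getattr(usage, "completion_tokens", None),
    }
    # No prompt text is stored; cost analysis only needs the metadata.
    print(json.dumps(record))
    return record
```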

6. Test before scaling

Before choosing one default model, run the same prompt set across multiple options:

  • GPT for general reasoning
  • Claude for long-form writing and analysis
  • Gemini for multimodal or Google ecosystem workflows
  • DeepSeek for cost-sensitive reasoning and coding
  • Qwen or other Chinese LLMs for Chinese-language products

The best production model is usually not the most expensive one. It is the cheapest model that reliably meets the quality bar for that feature.
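A simple way to run that comparison is to loop the same prompt set over each candidate model through the gateway and collect the answers side by side. A sketch; run_prompt is a placeholder for your actual API call, and scoring against the quality bar is left to you:

```python
def compare_models(models, prompts, run_prompt):
    # run_prompt(model, prompt) should return the model's answer text;
    # in practice it would call the gateway's chat completions endpoint.
    results = {}
    for model in models:
        results[model] = [run_prompt(model, p) for p in prompts]
    return results
```

Once the cheapest passing model per feature is known, it can be wired into the router from section 3.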

Practical takeaway

If you are building an AI product, treat model choice as product infrastructure, not a hard-coded string.

An OpenAI-compatible API gateway such as VectorNode AI can make this easier because the SDK shape stays familiar while the model strategy can evolve over time.

I also keep a small GitHub quickstart here:

https://github.com/yeallen441-del/vectorengine-quickstart
