DEV Community: ModelHub Dev

How I Built 18 AI Employees in One Telegram Bot (Architecture Deep Dive)

ModelHub Dev — Tue, 30 Jun 2026 10:25:21 +0000

A few months ago, I found myself spending way too much time switching between tools — checking shipping rates, responding to customer inquiries, monitoring inventory, posting on social media. Each task had its own app, its own login, its own notification system.

So I did what any developer would do: I built a bot army.

The result is ModelHub — a single Telegram bot that gives businesses access to 18 different AI "employees," each specialized in a different role. Here's how it works under the hood.

The Problem: Too Many SaaS Tools, Not Enough Integration

Most small businesses I work with have the same problem:

They use 5-10 different SaaS tools
Employees spend hours context-switching
Automation tools are either too expensive or too brittle
Custom development is out of budget

I wanted one interface. One place where a business owner (or a team lead, or a support manager) could type a message and get the right AI for whatever they needed.

The Multi-Bot Architecture

The system is built around a simple concept: one hub, many workers.

User → Telegram → Hub Bot → Worker Bot (specialized role)
                        ↕
                   User Selection

Hub Bot (The Manager)

The main bot you interact with is the dispatcher. When you send it a message:

It presents available AI roles as inline buttons
You pick which "employee" you need
It spawns a direct conversation with that worker bot
The worker handles your request end-to-end
When done, you're returned to the hub

This keeps everything clean. Each worker bot has its own context window, its own prompt, its own conversation history. They don't interfere with each other.

Worker Bots (The Employees)

Each worker bot is an individual Telegram bot with a different personality and skill set. Currently I've built 18, including:

Bot	Specialty
Trade Clerk	Shipping rates, customs docs, trade compliance
E-commerce Agent	Product listings, inventory, order management
Customer Service Agent	Handle returns, complaints, FAQs
Content Writer	Blog posts, social media copy, product descriptions
Data Analyst	Spreadsheet analysis, sales trends, reporting
HR Assistant	Scheduling, employee queries, onboarding
Tech Support	Debugging, setup guides, technical docs
Marketing Bot	Ad copy, campaign ideas, A/B test suggestions
Translator	Multi-language translation with context awareness
Researcher	Web research, competitor analysis, market intel
And 8 more for specific niche workflows

Each one shares a common codebase but has a unique system prompt, behavior rules, and tool access.

Technical Stack

The whole thing runs on surprisingly modest infrastructure:

- Language: Python 3.11
- Framework: python-telegram-bot + Pyrogram
- API Server: Flask + gunicorn
- Database: SQLite (with WAL mode for concurrency)
- Deployment: $6/month Contabo VPS (Germany)
- LLM: GPT-4o-mini / Claude 3 Haiku (role-dependent)

Why Python + Flask + python-telegram-bot?

I've been building Telegram bots for years, and this combo is the sweet spot for reliability vs complexity:

python-telegram-bot handles webhook registration, message routing, and inline keyboards beautifully
Pyrogram handles the MTProto layer for the worker bots (they use userbot-style interaction where needed)
Flask with gunicorn keeps the webhook server lightweight — no FastAPI overhead when you don't need async for every request
SQLite with WAL mode handles concurrent reads without a dedicated database server

The Core Insight: One Codebase, 18 Personalities

The biggest engineering decision was keeping a single codebase for all 18 bots.

Instead of 18 separate repos (which would be a nightmare to maintain), every bot loads from the same code. The difference is in the configuration:

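# Simplified worker config
# (the SYSTEM_PROMPT_* constants are assumed to be defined elsewhere in the shared codebase)
import os

BOTS = {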
    "trade_clerk": {
        "token": os.getenv("BOT_TOKEN_TRADE"),
        "system_prompt": SYSTEM_PROMPT_TRADE_CLERK,
        "tools": ["shipping_api", "customs_api", "currency_api"],
        "temperature": 0.2,
        "max_context": 16000
    },
    "ecommerce": {
        "token": os.getenv("BOT_TOKEN_ECOMMERCE"),
        "system_prompt": SYSTEM_PROMPT_ECOMMERCE,
        "tools": ["shopify_api", "inventory_api", "order_api"],
        "temperature": 0.3,
        "max_context": 16000
    },
    # ... 16 more
}

Each worker bot process is spawned as a separate thread with its own webhook. The shared code means I can push a bug fix once and it applies to all 18 bots. New features go through the same pipeline.
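
To make the shared-codebase idea concrete, here is a minimal sketch of how one handler might turn a bot's config entry into an LLM request. The `build_chat_request` helper, the placeholder prompts, the `content_writer` values, and the `model` field are assumptions for illustration, not the actual ModelHub code.

```python
# Hypothetical sketch: one shared function, many bot configs.
# The keys mirror the simplified BOTS dict; prompt text and model names are placeholders.
BOTS = {
    "trade_clerk": {
        "system_prompt": "You are a trade clerk. Answer shipping and customs questions.",
        "temperature": 0.2,
        "model": "gpt-4o-mini",
    },
    "content_writer": {
        "system_prompt": "You are a content writer. Draft clear marketing copy.",
        "temperature": 0.8,
        "model": "gpt-4o-mini",
    },
}


def build_chat_request(bot_name, history, user_text):
    """Build chat-completion keyword arguments for the given bot."""
    cfg = BOTS[bot_name]
    messages = [{"role": "system", "content": cfg["system_prompt"]}]
    messages.extend(history)  # prior turns for this bot only
    messages.append({"role": "user", "content": user_text})
    return {
        "model": cfg["model"],
        "temperature": cfg["temperature"],
        "messages": messages,
    }


request = build_chat_request("trade_clerk", [], "What documents do I need to ship to the EU?")
print(request["temperature"], len(request["messages"]))
```

Because every bot goes through the same function, a fix to message assembly lands in all roles at once; only the config entry differs.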

The "personality" comes purely from prompt engineering. There's no fine-tuning. Just carefully crafted system prompts that define:

The bot's persona and tone
Its knowledge boundaries
Which APIs it can call
Escalation rules ("If you can't handle this, tell the user I'll forward this to a human")

Prompt Engineering at Scale

The hardest part wasn't the code — it was the prompts. Each bot needs to stay in character while being useful.

Key tricks I learned:

Role-lock early — Put the persona definition in the first 200 tokens so the model anchors on it
Tool definitions over examples — Instead of showing 50 examples of "how to respond," define what tools it has and let the model figure out the rest
Hard constraints in the post-amble — After the main system prompt, add a "RULES" section in ALL CAPS for things it must never do
Context budget per role — Trade clerk needs different token limits than content writer
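
The first three tricks can be sketched as a small prompt builder: persona first, tool definitions next, hard rules last in capitals. The function name, section labels, and sample text are hypothetical; the real prompts are not shown in this post.

```python
def build_system_prompt(persona, tools, rules):
    """Assemble a system prompt: persona first, tools next, hard rules last."""
    lines = [persona.strip(), "", "TOOLS YOU CAN CALL:"]
    lines += [f"- {name}: {description}" for name, description in tools.items()]
    lines += ["", "RULES:"]
    lines += [f"- {rule.upper()}" for rule in rules]  # post-amble in ALL CAPS
    return "\n".join(lines)


prompt = build_system_prompt(
    persona="You are the Trade Clerk, a precise and polite logistics assistant.",
    tools={
        "shipping_api": "look up shipping rates",
        "customs_api": "check customs requirements",
    },
    rules=["Never invent shipping rates", "Escalate to a human if you cannot help"],
)
print(prompt)
```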

How Pricing Works

The service runs on a freemium model:

Free trial: 300 messages, no credit card
Single role: $12.99/month (rent one AI employee)
Three roles: $29.99/month (pick any three)
Full access: $99/year (all 18 roles)

The pricing was a deliberate choice. At $12.99/role, it's cheaper than a single SaaS subscription for most of these tasks. And most businesses only need 2-3 roles regularly.

Infrastructure Reality Check

I'm running this on a $6/month Contabo VPS in Germany. That's it.

Handle 18 concurrent webhook listeners on this tiny box? Yes. Each bot is a Flask app behind gunicorn, and the total memory usage is about 480MB for all 18 (roughly 25MB per bot process for the webhook handler).

The real magic is that the LLM calls don't happen on this server — they go out to OpenAI/Anthropic APIs. So the VPS is just handling routing, prompt construction, and response formatting. Each request takes about 150-300ms of local processing, with the rest being LLM inference time.

What I'd Do Differently Next Time

If I were building this again:

Use a message queue — Currently all bots register webhooks independently. A shared queue would simplify deployment
Add persistent memory — Individual bot conversations don't share context. Sometimes I wish the trade clerk knew what the e-commerce bot just told the user
Database migration — SQLite is fine for MVP, but I'd move to PostgreSQL for anything beyond personal use
Containerize sooner — The VPS is manageable, but Docker would make scaling to new instances instant

Try It Yourself

If you want to check it out, the free trial is 300 messages — no signup, just start a conversation. The hub bot introduces you and you pick what roles you need.

The bot: @modelhub_bot

Or if you're a developer and want to build something similar, the architecture is straightforward: one Flask app per bot, all sharing a codebase, differentiated by prompt and tool config. The rest is just scaling the same pattern.

Built with Python, Flask, gunicorn, python-telegram-bot, running on a $6 Contabo VPS. AI models provided by OpenAI and Anthropic.

I Built a Telegram Bot That Acts as Your AI Employee (Here's the Architecture)

ModelHub Dev — Tue, 30 Jun 2026 08:06:43 +0000

Running a small business means you're always short on time. Customer inquiries pile up, orders need tracking, quotes need sending — and you're just one person. What if you could hire an employee who works 24/7, never takes sick leave, and costs less than a cup of coffee per day?

That's exactly what I built. An AI Employee — deployable through Telegram — that handles customer service, order management, and sales support for small businesses. Here's how it works under the hood.

The Problem

Small business owners wear too many hats. You're the CEO, the sales team, the customer support rep, and the warehouse manager, all at once. Every minute spent answering "Do you have this in stock?" is a minute not spent growing your business.

Existing solutions are either too expensive (hiring a full-time employee), too impersonal (basic chatbots), or too complex (building your own AI system). There's a gap for something that's:

Affordable — under /month
Personal — understands your specific business
Easy to deploy — no coding required
Always available — responds in seconds, any time

That gap is where AI Employees come in.

The Architecture

The system runs on a relatively simple stack. Python/Flask handles HTTP webhooks from Telegram, NGINX sits in front as a reverse proxy, and multiprocessing lets us run multiple employee roles in parallel.

┌─────────────┐     ┌──────────┐     ┌───────────┐     ┌──────────────┐
│  Telegram   │────▶│  NGINX   │────▶│   Flask   │────▶│    Python    │
│  Bot API    │     │ (Proxy)  │     │ Webhooks  │     │ Multiprocess │
└─────────────┘     └──────────┘     └───────────┘     └──────┬───────┘
                                                              │
                                                              ▼
                                                       ┌──────────────┐
                                                       │ ModelHub API │
                                                       │  (DeepSeek)  │
                                                       └──────────────┘

NGINX → Flask: All Telegram webhooks hit the server on port 443. NGINX terminates SSL and proxies to the Flask app running on a local port. This is standard practice — NGINX handles the TLS overhead while Flask focuses on request processing.

Flask Webhook Handler: Telegram sends a POST request every time a user messages the bot. Flask receives it, checks the secret token that Telegram includes in the X-Telegram-Bot-Api-Secret-Token header, and dispatches to the right handler based on which bot received the message.

Multiprocessing: Each employee role runs in its own process. This means if one employee is processing a long request, others keep responding instantly. The main Flask process acts as a router.
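
A minimal sketch of that layout, assuming one queue per role: the Flask process only enqueues, and each role's process drains its own queue. The function names and the `log_message` stand-in are hypothetical, not the production code.

```python
import multiprocessing as mp


def worker_loop(role, inbox, handle):
    """Run inside one process per role: handle messages until a None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            break
        chat_id, text = item
        handle(role, chat_id, text)


def route_message(inboxes, role, chat_id, text):
    """Called from the Flask process: hand the message to the role's queue and return."""
    if role not in inboxes:
        return False
    inboxes[role].put((chat_id, text))
    return True


def log_message(role, chat_id, text):
    # Stand-in for the real handler that would call the LLM and reply
    print(f"[{role}] chat {chat_id}: {text}")


def start_workers(roles, handle=log_message):
    """Start one process per role and return (inboxes, processes)."""
    inboxes = {role: mp.Queue() for role in roles}
    processes = [
        mp.Process(target=worker_loop, args=(role, inboxes[role], handle), daemon=True)
        for role in roles
    ]
    for process in processes:
        process.start()
    return inboxes, processes
```

Call `start_workers(...)` under an `if __name__ == "__main__":` guard so it also works on platforms that spawn rather than fork. A slow request then only delays the queue of its own role.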

ModelHub API: All AI responses go through ModelHub API, which provides access to DeepSeek models. This gives us high-quality reasoning and context understanding at a fraction of the cost of larger models.

Here's a simplified version of the webhook handler:

```python
import hmac
import os

from flask import Flask, request, jsonify

app = Flask(__name__)

# Each bot has its own token and role configuration
BOTS = {
    'ecommerce': {'token': os.environ.get('ECOMMERCE_TOKEN'), 'role': 'ecommerce_assistant'},
    'clerk': {'token': os.environ.get('CLERK_TOKEN'), 'role': 'foreign_trade_clerk'},
    'support': {'token': os.environ.get('SUPPORT_TOKEN'), 'role': 'customer_service'},
}

# The secret_token registered via setWebhook. Telegram sends it back verbatim
# in a header; it is not an HMAC of the request body. The WEBHOOK_SECRET
# variable name is an assumption for this simplified version.
WEBHOOK_SECRET = os.environ.get('WEBHOOK_SECRET', '')


@app.route('/webhook/<bot_name>', methods=['POST'])
def webhook(bot_name):
    if bot_name not in BOTS:
        return jsonify({'error': 'unknown bot'}), 404

    bot = BOTS[bot_name]
    data = request.get_json(silent=True) or {}

    # Validate the secret token header with a constant-time comparison
    received = request.headers.get('X-Telegram-Bot-Api-Secret-Token', '')
    if not WEBHOOK_SECRET or not hmac.compare_digest(received.encode(), WEBHOOK_SECRET.encode()):
        return jsonify({'error': 'invalid secret token'}), 403

    # Extract the fields the worker needs
    message = data.get('message', {})
    chat_id = message.get('chat', {}).get('id')
    text = message.get('text', '')

    # Queue for the appropriate worker process (process_message is not shown here)
    process_message(bot['role'], chat_id, text)

    return jsonify({'ok': True}), 200
```

How Telegram Webhooks Work

Instead of polling Telegram's API for new messages, we use webhooks. When a user sends a message to the bot, Telegram immediately POSTs the message data to our server URL. This is far more efficient than polling — responses are near-instant and we only process requests when there's actual activity.

Setting up a webhook is a single API call:

POST https://api.telegram.org/bot<TOKEN>/setWebhook
{"url": "https://your-server.com/webhook/bot_name", "secret_token": "your_webhook_secret"}

The secret_token is critical: Telegram sends it back in the X-Telegram-Bot-Api-Secret-Token header of every webhook request, which lets us verify that the request genuinely came from Telegram. Without it, anyone could send fake messages to your bot.
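
As a sketch, the same call can be built with nothing but the standard library. The helper name and the example values are placeholders; note that Telegram restricts secret_token to 1-256 characters from A-Z, a-z, 0-9, underscore, and hyphen, so the bot token itself cannot be reused as the secret.

```python
import json
import urllib.request


def build_set_webhook_request(bot_token, webhook_url, secret_token):
    """Build (but do not send) the setWebhook request for one bot."""
    payload = {"url": webhook_url, "secret_token": secret_token}
    return urllib.request.Request(
        url=f"https://api.telegram.org/bot{bot_token}/setWebhook",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_set_webhook_request(
    bot_token="123456:EXAMPLE-TOKEN",  # placeholder, not a real token
    webhook_url="https://your-server.com/webhook/ecommerce",
    secret_token="a-long-random-string",
)
print(req.full_url)
# To actually register the webhook: urllib.request.urlopen(req)
```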

The Multi-Bot Pattern

One interesting challenge: how do you offer multiple employee roles without asking users to install a dozen different bots?

The solution is a hub bot pattern. A single main bot (@ai_staff_xiaochen_bot) serves as the entry point. Users see a menu of available employees — Ecommerce Assistant, Foreign Trade Clerk, Customer Service, Sales Assistant — and pick the one they need. Behind the scenes, each role is a separate Telegram bot with its own webhook, its own conversation history, and its own system prompt defining its personality and capabilities.

This pattern means:

Users only install one bot instead of managing multiple
Each role is completely isolated — context doesn't leak between conversations
Roles can be added independently without affecting existing ones
Each role can have its own pricing — pay only for what you use

The hub bot uses inline keyboard markup to create a clean selection interface:

```python
from telegram import InlineKeyboardButton, InlineKeyboardMarkup

def show_role_selection(bot, chat_id):
    # `bot` is the telegram.Bot instance for the hub bot
    keyboard = [
        [InlineKeyboardButton("🛒 Ecommerce Assistant", callback_data='role_ecommerce')],
        [InlineKeyboardButton("📋 Foreign Trade Clerk", callback_data='role_clerk')],
        [InlineKeyboardButton("🎧 Customer Service", callback_data='role_support')],
        [InlineKeyboardButton("💼 Sales Assistant", callback_data='role_sales')],
    ]
    reply_markup = InlineKeyboardMarkup(keyboard)
    # Send the menu to the user (synchronous call as in python-telegram-bot v13;
    # in v20+ send_message is a coroutine and must be awaited)
    bot.send_message(
        chat_id=chat_id,
        text="Choose your AI Employee:",
        reply_markup=reply_markup,
    )
```

Free Trial to Paid

Every new user gets 300 free messages — enough to try out an employee for a week or two. After that, it's .9/month per employee. The pricing is intentional: cheap enough that any small business can afford it, but not free (free users have no incentive to treat the system well).

The payment flow is handled through Telegram itself. When a user hits the message limit, the bot sends an invoice via Telegram's payment API. Once paid, the counter resets and the user continues with their AI employee.
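
The counter behind that flow could look like the sketch below. The table name, function names, and the choice of SQLite are assumptions; sending the invoice itself is left out.

```python
import sqlite3

FREE_MESSAGES = 300  # trial size from the post


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS usage (chat_id INTEGER PRIMARY KEY, used INTEGER NOT NULL)"
    )
    return conn


def record_message(conn, chat_id, limit=FREE_MESSAGES):
    """Count one message; return True if within quota, False if an invoice is due."""
    row = conn.execute("SELECT used FROM usage WHERE chat_id = ?", (chat_id,)).fetchone()
    used = row[0] if row else 0
    if used >= limit:
        return False
    conn.execute("INSERT OR IGNORE INTO usage (chat_id, used) VALUES (?, 0)", (chat_id,))
    conn.execute("UPDATE usage SET used = used + 1 WHERE chat_id = ?", (chat_id,))
    conn.commit()
    return True


def reset_after_payment(conn, chat_id):
    """Reset the counter once a payment has been confirmed."""
    conn.execute("UPDATE usage SET used = 0 WHERE chat_id = ?", (chat_id,))
    conn.commit()
```

When `record_message` returns False, the bot would send the invoice instead of forwarding the message to the AI employee.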

Lessons Learned

1. Prompt engineering is 80% of the work. The AI is only as good as its instructions. Writing a good system prompt for a foreign trade clerk requires real domain knowledge — what to ask about shipping terms, how to handle pricing inquiries, when to escalate to a human. This isn't something you can skip.

2. Multiprocessing matters more than you think. At first we used threading, but long-running requests would block everything. Moving to multiprocessing was a night-and-day difference in responsiveness.

3. Users treat AI employees like real people. They say please and thank you. They get frustrated when the bot misunderstands. They name their bots. This is both heartwarming and a reminder to get the experience right.

4. The secret token validation is non-negotiable. We had a competitor try to reverse-engineer our bot by flooding it with fake webhook payloads. The token check blocked every single one. It takes two lines of code and saves you a world of pain.

5. Keep a human fallback. The AI handles 90% of queries on its own, but there's always an "escalate to human" command. Users appreciate knowing a real person can step in if needed.

What's Next

The AI Employee platform launched recently and is already handling thousands of messages daily. Small businesses are using it to automate order tracking, answer product questions, send quotes, and manage customer relationships — all through a simple Telegram interface.

If you run a small business or know someone who does, give it a try: @ai_staff_xiaochen_bot. Pick an employee, send a few messages, and see what happens. The first 300 are on the house.

For developers interested in the technical side, the entire system runs on standard tools with open-source libraries. Flask, python-telegram-bot, and a reliable API backend are all you need to build something similar. The real magic is in the prompts.

DeepSeek V4 Flash vs OpenAI GPT-4o: A Cost Analysis for AI Developers

ModelHub Dev — Wed, 10 Jun 2026 16:48:59 +0000

The Real Cost of AI APIs

After building AI applications for years, one question keeps coming up: how much does it actually cost?

Here's the honest breakdown.

OpenAI Pricing

GPT-4o mini: $0.15/M input, $0.60/M output
GPT-4o: $2.50/M input, $10.00/M output
Heavy production use: $5,000-7,500/month

DeepSeek V4 Flash Pricing (via ModelHub)

Input: $0.14/M tokens
Output: $0.28/M tokens
Heavy production use: ~$1,000/month

The Gap is 86%

Same API format. Same quality for most tasks. Different price.

Switching takes 5 minutes — just change your base URL and model name.

Plans

Backpack: $15/month — 60M tokens, 1 API key
Launch: $65/month — 280M tokens, 20 API keys

Try It

→ https://modelhub-api.com
Launch promo code: PHLAUNCH50

ModelHub API — Access DeepSeek V4 Flash and other top Chinese AI models at near-cost pricing.

DeepSeek V4 Flash API: 86% Cheaper Than OpenAI, Same OpenAI-Compatible Format

ModelHub Dev — Wed, 10 Jun 2026 15:54:31 +0000

The Problem

Your app runs on OpenAI. It works. You're shipping features. But then the invoice comes.

A personal project doing ~100M tokens/month: $900/month on GPT-5.5.
A mid-size production app doing 500M tokens/month: $4,500/month.

That's not a scaling cost. That's a second salary.

The Surprising Solution

DeepSeek V4 Flash, China's top-ranked open-weight model, costs $0.15 per million input tokens via a globally accessible API. Same tier as GPT-5.5 on independent benchmarks (coding, math, data analysis). But roughly 43x cheaper on a typical 60/40 input/output workload.

And you can switch with exactly two lines of code:

# Before — paying $900/mo
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After — paying $15/mo
client = OpenAI(
    api_key="sk-...",
    base_url="https://modelhub-api.com/v1"  # ← only change
)

Everything below this line stays identical. Same SDK. Same parameters. Same response format.

Why This Works

The OpenAI SDK has become the de facto standard for LLM APIs. Any model provider that wants developers to use them builds a compatible endpoint. DeepSeek, Qwen, GLM-4 — they all speak the same protocol.

What changes is the backend: different architecture (Mixture-of-Experts with 671B total params but only 37B active per token), different training strategy (reinforcement learning at scale), and different cost structure (Chinese compute is ~60% cheaper than US hyperscaler pricing).

Real Cost Comparison

Here's what a typical developer workload looks like (100M tokens/month, 60/40 input/output split):

Provider	Model	Input $/M	Output $/M	Monthly	vs GPT-5.5
GPT-5.5	Flagship	$5.00	$15.00	$900	—
DeepSeek V4 (Official)	Raw	$0.07	$0.14	$9.80	92x cheaper
ModelHub	V4 Flash	$0.15	$0.30	$21.00	43x cheaper
GPT-4o mini	Budget	$0.15	$0.60	$33.00	27x cheaper
Claude Sonnet 4	Premium	$3.00	$15.00	$780.00	1.2x cheaper

At 500M tokens/month (a growing production app):

GPT-5.5: $4,500/month
ModelHub: $105/month

The gap isn't 10%. It's 40x.
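
The monthly figures follow from a simple blended-price formula. This sketch (the function name is mine) assumes the 60/40 input/output split stated above and volumes given in millions of tokens; it should reproduce the $900 GPT-5.5 and $21 ModelHub rows.

```python
def monthly_cost(total_tokens_m, input_price, output_price, input_share=0.6):
    """Monthly cost in dollars for a volume given in millions of tokens."""
    input_tokens_m = total_tokens_m * input_share
    output_tokens_m = total_tokens_m - input_tokens_m
    return round(input_tokens_m * input_price + output_tokens_m * output_price, 2)


gpt = monthly_cost(100, 5.00, 15.00)       # GPT-5.5 row
modelhub = monthly_cost(100, 0.15, 0.30)   # ModelHub V4 Flash row
print(gpt, modelhub, round(gpt / modelhub))
```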

What About Quality?

This is the obvious question. Here's the real answer:

For technical tasks (coding, math, data analysis, classification), DeepSeek V4 Flash is competitive with GPT-5.5, trailing it by 2 to 8 points on the benchmarks below, at roughly 1/43 the cost.

Independent benchmarks (MMLU-Pro, HumanEval, MATH-500, LiveCodeBench):

Benchmark	GPT-5.5	DeepSeek V4 Flash	DeepSeek R1
MMLU-Pro	78.1%	75.9%	84.0%
HumanEval (pass@1)	90.2%	82.6%	92.4%
MATH-500	76.4%	74.3%	97.3%
LiveCodeBench	71.4%	65.2%	80.3%

The nuance: GPT-5.5 is still better at creative writing, nuanced instruction following, and multi-modal tasks. But for 80% of production AI use cases — RAG, classification, code generation, data extraction — DeepSeek is more than good enough. And much cheaper.

The Migration (Real Engineering, Not Marketing)

I migrated my production pipeline three months ago. Here's exactly what broke and what didn't:

Zero issues:

Chat completions API — identical
Streaming — works exactly like OpenAI's SSE
JSON mode — same parameter, same behavior
Function calling — solid, just adjust the model name

Minor tweaks needed:

System prompt placement: DeepSeek is slightly more sensitive to instruction ordering
Temperature: default 0.3 vs OpenAI's 0.7 (produces more reliable outputs)
Retry logic: occasional timeouts on burst traffic (add 3 retries with exponential backoff)

Total engineering time: ~4 hours for a production pipeline processing 5M documents/month.
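
The retry tweak can be sketched as a small wrapper. Here `TimeoutError` stands in for whatever timeout exception your client raises, and the delays and jitter are illustrative choices rather than the values used in the pipeline.

```python
import random
import time


def call_with_retries(fn, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on TimeoutError retry up to `retries` times with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries:
                raise  # out of retries: surface the last timeout
            # 0.5s, 1s, 2s, ... plus a little jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(delay)
```

Usage would be something like `call_with_retries(lambda: client.chat.completions.create(...))`, with the `except` clause adjusted to the SDK's own timeout exception.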

The Hidden Cost Nobody Talks About

Beyond API tokens, there's the switching cost. Most developers know they're overpaying but stay because migrating seems painful.

It's not. The OpenAI SDK was designed as a standard. Every compatible provider speaks it. The hardest part is generating a new API key.

# Smart routing: use the right model for the right task
def smart_complete(prompt, task_type="general"):
    model_map = {
        "simple": "deepseek-v4-flash",  # $0.15/M
        "code": "deepseek-v4-flash",    # $0.15/M
        "reasoning": "deepseek-r1",     # $0.55/M — best reasoning model
        "creative": "gpt-5.5",          # $5.00/M — only when needed
        "classification": "qwen-3",     # $0.10/M
    }
    model = model_map.get(task_type, "deepseek-v4-flash")
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )

With a routing layer like this, I'm spending $80/month on what used to be $1,200/month. Same quality for users. 93% less cost.

Try It

ModelHub — One API key, 45+ AI models (DeepSeek V4 Flash, DeepSeek R1, Qwen, GLM-4, GPT-4o, Claude 4, Gemini 2.5 Pro, and more), global payment, no Chinese phone number required.

Free $5 credit to start, no credit card needed. Change two lines. Save 95%.

Built with ❤️ by a developer who was tired of overpaying for AI inference.

I Switched from OpenAI to Chinese AI Models. My API Bill Went from to .

ModelHub Dev — Sun, 07 Jun 2026 10:51:31 +0000

A few months ago, my monthly API bill hit . I was using GPT-4o and Claude for a SaaS app. Summarization, classification, extraction. Nothing crazy. Then I discovered Chinese AI models. DeepSeek V4. GLM-4. Qwen3. Same quality tier. Different planet pricing.

DeepSeek V4: .34/1M tokens vs GPT-5.5 at /1M. 43x cheaper. And from my testing, reasoning quality is within 2-3% on coding tasks.

The problem: Accessing Chinese models as a US dev is a nightmare. Chinese phone number. WeChat Pay. Separate API keys for each provider. Most devs just give up.

That is why I use ModelHub API (https://modelhub-api.com). One API key, 45+ Chinese models. OpenAI SDK compatible -- just change the base URL. Global credit cards work. No Chinese phone.

My bill went from to ~. Same volume.

Give it a try with free credit: https://modelhub-api.com
Launching June 11 on Product Hunt.

Building a Multi-Provider AI Gateway: Rate Limiting, Format Normalization, and Cost Optimization

ModelHub Dev — Sun, 07 Jun 2026 04:08:30 +0000

When you build a product that needs to serve multiple AI models from different providers, you quickly run into a wall: every provider has a different API.

Some use SSE streaming. Some don't. Some count tokens by characters. Some by sub-words. Rate limits? Completely different formats.

Here's how I built a gateway that handles all of them under one interface.

The Problem

You want to offer: DeepSeek, Qwen, GLM-4, Kimi, and more — all through one API key. Each provider has:

Different auth methods
Different content types (JSON vs plain text vs multipart)
Different error formats
Different streaming formats (SSE vs chunked vs WebSocket)
Different token counting

A naive approach would be spaghetti code with if/else chains. Not sustainable.

Architecture: Three Layers

Client → Gateway (rate limiter + auth) → Router (model selection) → Provider Adapter (format normalization)

Layer 1: Auth & Rate Limiting

All requests start with API key validation. Simple Redis check: GET api_key:{key}. If found, extract user_id and plan.

Rate limiting is per-user, per-plan, per-model. Three tiers:

Free tier: 10 RPM, 100K TPM
Standard: 60 RPM, 1M TPM
Pro: 300 RPM, 10M TPM

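Implementation is a fixed-window counter in Redis, with one bucket per minute:

import time
import redis

r = redis.Redis()  # connection settings are an assumption; adjust for your setup

def check_rate_limit(user_id, model, rpm_limit):
    # The current minute is part of the key, so each minute gets a fresh counter
    key = f"ratelimit:{user_id}:{model}:{int(time.time() // 60)}"
    count = r.incr(key)
    r.expire(key, 120)  # 2 min TTL so stale buckets expire on their own
    return count <= rpm_limit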
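
A per-minute counter can let through up to twice the limit across a minute boundary. If that matters, a sliding-window log is one alternative; the sketch below keeps it in memory for clarity, and the class and key format are hypothetical rather than part of the gateway.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory sliding-window log: at most `limit` requests per `window` seconds."""

    def __init__(self, limit, window=60.0):
        self.limit = limit
        self.window = window
        self.events = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        timestamps = self.events[key]
        # Drop timestamps that have fallen out of the window
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False
        timestamps.append(now)
        return True
```

The trade-off is memory: this stores one timestamp per accepted request, whereas the counter stores one integer per bucket.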

Layer 2: Router

Each provider registers itself with supported models:

ROUTING_TABLE = {
    "deepseek-v4-flash": "deepseek",
    "deepseek-r1": "deepseek",
    "qwen-3": "alibaba",
    "glm-4": "zhipu",
    "doubao": "byteplus",
    "kimi": "moonshot",
}

The router takes model from the request body and maps it to the correct provider adapter. No if/else — just a dict lookup.

Layer 3: Provider Adapters

This is where the magic happens. Each adapter normalizes:

Input format: Convert OpenAI-style messages to provider-native format
Output format: Convert provider response back to OpenAI-compatible
Streaming: Normalize SSE data: chunks to a unified event format
Error codes: Map provider errors to OpenAI-style errors (401, 429, 500)

Example adapter for DeepSeek:

class DeepSeekAdapter(BaseAdapter):
    def to_provider(self, payload):
        return payload  # DeepSeek already uses OpenAI format

    def to_openai(self, response_json):
        # DeepSeek returns OpenAI-compatible response
        return response_json

    def stream_chunks(self, raw_lines):
        for line in raw_lines:
            if line.startswith("data: "):
                yield line[6:]  # Strip SSE prefix

For providers that don't use OpenAI format (like Kimi or GLM-4), the adapter does a complete transformation:

class KimiAdapter(BaseAdapter):
    def to_provider(self, payload):
        # Kimi uses a different message format
        return {
            "model": "kimi",
            "messages": [{"role": m["role"], "content": m["content"]}
                         for m in payload["messages"]],
            "temperature": payload.get("temperature", 0.7),
        }
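
The adapters above inherit from a BaseAdapter that the post does not show. One plausible shape for it, together with the router lookup and error mapping, is sketched here; every name other than ROUTING_TABLE is an assumption.

```python
from abc import ABC, abstractmethod


class BaseAdapter(ABC):
    """Hypothetical adapter interface: OpenAI-style in, OpenAI-style out."""

    @abstractmethod
    def to_provider(self, payload):
        """Convert an OpenAI-style request body to the provider's format."""

    @abstractmethod
    def to_openai(self, response_json):
        """Convert the provider's response to an OpenAI-style body."""

    def map_error(self, status_code):
        """Collapse provider status codes into the small set clients expect."""
        if status_code in (401, 403):
            return 401
        if status_code == 429:
            return 429
        return 500


class PassthroughAdapter(BaseAdapter):
    """For providers that already speak the OpenAI format."""

    def to_provider(self, payload):
        return payload

    def to_openai(self, response_json):
        return response_json


ROUTING_TABLE = {"deepseek-v4-flash": "deepseek", "deepseek-r1": "deepseek"}
ADAPTERS = {"deepseek": PassthroughAdapter()}


def adapter_for(model):
    """Resolve a model name to its adapter with two dict lookups."""
    provider = ROUTING_TABLE.get(model)
    if provider is None:
        raise KeyError(f"unknown model: {model}")
    return ADAPTERS[provider]
```

Adding a provider then means one new subclass plus entries in the two dicts.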

Cost Optimization

The real value is intelligent routing. With multiple providers, you can:

Fallback on error: If DeepSeek returns 503, try Qwen
Latency-based routing: Route to the fastest provider right now
Cost-based routing: Use the cheapest model that meets quality requirements

Implementing fallback:

async def chat_completion(request):
    providers = priority_list(request.model)
    last_error = None
    for provider in providers:
        try:
            return await provider.complete(request)
        except ProviderOverloaded as exc:
            # Remember why this provider failed, then try the next one
            last_error = exc
            continue
    raise ServiceUnavailable("All providers overloaded") from last_error
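
To try the fallback idea end to end, here is a self-contained sketch with fake providers. The exception classes and FakeProvider are stand-ins I made up for illustration; the real gateway's types are not shown in the post.

```python
import asyncio


class ProviderOverloaded(Exception):
    pass


class ServiceUnavailable(Exception):
    pass


class FakeProvider:
    """Stand-in for a real provider client."""

    def __init__(self, name, healthy):
        self.name = name
        self.healthy = healthy

    async def complete(self, prompt):
        if not self.healthy:
            raise ProviderOverloaded(self.name)
        return f"{self.name}: reply to {prompt!r}"


async def complete_with_fallback(providers, prompt):
    """Try providers in priority order; move on when one is overloaded."""
    last_error = None
    for provider in providers:
        try:
            return await provider.complete(prompt)
        except ProviderOverloaded as exc:
            last_error = exc
    raise ServiceUnavailable("all providers overloaded") from last_error


providers = [FakeProvider("deepseek", healthy=False), FakeProvider("qwen", healthy=True)]
print(asyncio.run(complete_with_fallback(providers, "Hello")))
```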

Token Counting

The hardest part. Each provider counts tokens differently. Our approach:

Default to tiktoken (OpenAI's tokenizer) for OpenAI-compatible models
Provider-reported token counts from response headers
Estimated fallback: len(text) / 4 for English-heavy content, with a higher estimate for Chinese-heavy content (Chinese chars are ~2 tokens in most tokenizers)

We store user usage as the count reported by the provider, not our estimate. This avoids disputes.
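
A sketch of that preference order: take the provider's number when it exists, estimate only when it does not. It assumes an OpenAI-style `usage` object in the response body, and the heuristic follows the figures above (about 4 characters per token for other text, about 2 tokens per Chinese character); the function names are hypothetical.

```python
def estimate_tokens(text):
    """Rough fallback: ~4 characters per token, ~2 tokens per CJK character."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return cjk * 2 + (other + 3) // 4  # round the non-CJK part up


def billable_tokens(response_json, prompt_text, completion_text):
    """Prefer provider-reported usage; estimate only when it is missing."""
    usage = response_json.get("usage") or {}
    if "total_tokens" in usage:
        return usage["total_tokens"], "provider"
    estimate = estimate_tokens(prompt_text) + estimate_tokens(completion_text)
    return estimate, "estimate"
```

Returning the source alongside the count makes it easy to flag estimated rows in the usage log.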

Results

With this architecture:

Adding a new provider takes ~100 lines of code (adapter + routing entry)
99.9% uptime across 45 models
Average response time: 380ms (slightly higher than single-provider due to routing)

The full gateway serves ~100M tokens per day with 6 worker processes. No special hardware needed.

Key Takeaways

Provider adapters are the critical abstraction — invest in a clean interface
Rate limiting must be per-model as well as per-user, so heavy traffic on one model doesn't block a user's access to the others
Fallback chain is free reliability — one provider goes down, another takes over
Unified error handling matters more than you think — your SDK users will thank you

Built with ❤️ and Python async. Data from production serving 45+ Chinese AI models globally.

How to Compare AI API Costs Across Providers: CLI Tool Walkthrough

ModelHub Dev — Sun, 07 Jun 2026 03:34:32 +0000

API costs are the second-biggest expense (after compute) for AI startups. Here is a free CLI tool to compare them.

The Problem

You need to pick an AI API provider. Prices vary wildly:

DeepSeek V4 Flash: $0.15/M tokens
GPT-5.5: $5.00/M tokens
Claude Haiku 3.5: $0.80/M tokens

But comparing blends of input/output costs across models at your specific volume is tedious.

The Solution: ai-model-cost

A zero-config CLI tool. No install step is needed; npx runs it in one command:

npx ai-model-cost --tokens=100M --compare

Output shows every major model sorted by price with monthly cost at your volume.

Quick Examples

# Compare all models at 100M tokens
npx ai-model-cost -t=100M -c

# Check a specific model
npx ai-model-cost --model=gpt-5.5 --tokens=500M

# List all available models
npx ai-model-cost --list

The Results at 100M Tokens/Month

Model	Monthly Cost	vs GPT-5.5
DeepSeek V4 Flash (Official)	$9	91x cheaper
DeepSeek V4 Flash (ModelHub)	$18	45x cheaper
Gemini 2.0 Flash	$22	36x cheaper
GPT-5.5	$800	baseline

Bottom Line

Before signing up for a $2,000/month API bill, run this tool. It takes 5 seconds.

https://github.com/AdamXiao-eolab/ai-model-cost

The Chinese AI Models You Should Know About in 2026

ModelHub Dev — Sun, 07 Jun 2026 03:23:10 +0000

The Chinese AI ecosystem has matured quietly while the world was watching OpenAI. Here are the models you should know about in 2026.

DeepSeek R1

Arena ELO: 91
Best at: Advanced reasoning, math, coding
Cost: $0.55/M tokens
The standout. Competitive with GPT-5.5 at 1/10 the price.

DeepSeek V4 Flash

Arena ELO: 89
Best at: General purpose, high throughput
Cost: $0.15/M tokens
The workhorse. 33x cheaper than GPT-5.5.

Qwen 3 (Alibaba)

Best at: Multilingual, coding
Cost: $0.10/M tokens
Ridiculously cheap. Good quality for budget tasks.

GLM-4 (Zhipu AI)

Best at: Balanced performance
Cost: $0.20/M tokens
Consistent and reliable.

Kimi (Moonshot)

Best at: Long context, document analysis
Context: 128K tokens
Excellent for RAG pipelines.

The Problem

Accessing these models from outside China requires a Chinese phone number, WeChat, and Alipay.

The Solution

ModelHub (https://modelhub-api.com/) provides global access to all these models. One API key. International payment. OpenAI-compatible SDK.

Pricing starts at $15/month for 60M tokens. $5 free credit to test.

I Spent $2,000 on GPT-5.5 Last Month. Now I Pay $75.

ModelHub Dev — Sun, 07 Jun 2026 03:22:32 +0000

Here is the math that changed my business.

The Bill

I run a high-volume RAG pipeline. 100M tokens per month for embedding and generation.

With GPT-5.5:

Input: 100M tokens x $5.00/M = $500
Output: ~100M tokens x $20.00/M = $2,000
Total: $2,500/month

With DeepSeek V4 Flash (via ModelHub):

Input: 100M tokens x $0.15/M = $15
Output: ~100M tokens x $0.60/M = $60
Total: $75/month

Monthly savings: $2,425.

But Is the Quality Good Enough?

DeepSeek V4 Flash scores 89 on the Arena leaderboard. GPT-5.5 scores 92. For text generation, summarization, and chatbots? The difference is negligible.

The Catch

You need to access Chinese AI models from outside China. That means:

Chinese phone number
WeChat account
Alipay

Or you use ModelHub - one API gateway with international payment.

What I Switched

Changed 2 lines of code in my config file. Took longer to write this post than to switch.

# Before
client = openai.OpenAI(api_key=openai_key, base_url="https://api.openai.com/v1")

# After
client = openai.OpenAI(api_key=modelhub_key, base_url="https://modelhub-api.com/v1")

Bottom Line

If you are processing more than 10M tokens/month, switching to Chinese models saves real money.

Get $5 free credit at https://modelhub-api.com/

One API Key for 45 Chinese AI Models: No Chinese Phone Number Required

ModelHub Dev — Sun, 07 Jun 2026 03:22:31 +0000

Before I found ModelHub, accessing Chinese AI models was painful.

You needed a Chinese phone number. Then WeChat. Then Alipay. Then someone to help you navigate the Chinese app ecosystem.

The Problem

I wanted to try DeepSeek R1. It was scoring 91 on the Arena leaderboard and cost a fraction of GPT-5.5. But I couldn't even sign up.

The Solution

I built ModelHub. One API gateway for 45+ Chinese AI models. OpenAI-compatible SDK. International payment.

What You Get

DeepSeek V4 Flash: $0.15/M tokens (vs GPT-5.5 at $5.00)
DeepSeek R1: $0.55/M tokens - scored 91 on Arena
Qwen 3: $0.10/M tokens
GLM-4: $0.20/M tokens
Kimi: 128K context window

Pricing

Starter: $15/mo (60M tokens)
Pro: $65/mo (280M tokens)
$5 free credit to test, no credit card

How It Works

import openai

client = openai.OpenAI(
    api_key="mh-sk-...",
    base_url="https://modelhub-api.com/v1"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)

Same SDK, different endpoint. That's it.

Try it at https://modelhub-api.com/

DeepSeek V4 Flash vs GPT-5.5: A Cost Comparison Every Developer Should See

ModelHub Dev — Sun, 07 Jun 2026 03:14:30 +0000

If you are still paying GPT-5.5 prices, you are overpaying by 33x.

The Price Gap

Provider	Model	Price per 1M tokens
OpenAI	GPT-5.5	$5.00
ModelHub	DeepSeek V4 Flash	$0.15

33x cheaper for input, 25x cheaper for output.

Real-World Cost Example

Processing 100M tokens per month for a RAG pipeline:

With GPT-5.5: $2,000 per month
With DeepSeek V4 Flash (via ModelHub): $75 per month

Annual savings: $23,100

Performance

Despite being 33x cheaper, DeepSeek V4 Flash is competitive:

Arena ELO: 89 (vs GPT-5.5 at 92)
Coding: Excellent (Python, JavaScript)
Context: 128K tokens
Best for: chatbots, code gen, translation

How to Switch

Change 2 lines of code. That is all.

curl https://modelhub-api.com/v1/chat/completions \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-v4-flash", "messages": [{"role": "user", "content": "Hello!"}]}'

Get $5 free credit at https://modelhub-api.com/ - no credit card required.

Why Every Developer Needs a Second API Provider (and Why Chinese AI Models Are the Smart Choice)

ModelHub Dev — Sun, 07 Jun 2026 03:13:51 +0000

Get $5 free credit at ModelHub. No credit card required.