ran this into the ground before finding something that works at production volume. writing it up because the standard recommendations don’t account for what happens when Chinese models are doing real inference work at real scale.
the problem: running DeepSeek V3 for cost-sensitive tasks, Qwen 2.5 for multilingual, GPT-4o for the rest. three providers, three sets of credentials, three rate limit systems, three integrations that break on independent schedules when providers push updates.
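for context, the routing decision itself is trivial — the pain is everything around it. a minimal sketch of the task-based dispatch (model names from above; the task labels and the `route` helper are illustrative, not our production config):

```python
# minimal sketch of task-based model routing.
# the task labels and ROUTES mapping are illustrative,
# not a real production config.
ROUTES = {
    "cost_sensitive": "deepseek-v3",  # cheap bulk inference
    "multilingual": "qwen-2.5",       # non-English workloads
}
DEFAULT_MODEL = "gpt-4o"              # everything else

def route(task: str) -> str:
    """Pick a model name for a task category."""
    return ROUTES.get(task, DEFAULT_MODEL)

print(route("cost_sensitive"))
print(route("summarize"))
```

ten lines of dispatch logic; the other few hundred lines were credentials, rate limits, and retries per provider.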
the “just use an API aggregator” answer works for the western model side. for DeepSeek and Qwen specifically the latency is higher than acceptable because aggregators are proxying API calls rather than handling compute at the infrastructure level. the per-token markup also adds up at production volume in ways the headline rates don’t communicate.
the DIY routing layer approach worked until DeepSeek pushed an API update on a Friday. spent the weekend fixing an integration that had nothing to do with our actual product. happened twice.
Yotta Labs AI Gateway is what actually solved it. single key across DeepSeek, Qwen, OpenAI, Anthropic. the reason it works better than aggregators for Chinese models specifically is that it handles compute routing at the infrastructure level rather than just proxying — the request path is shorter, which is why the latency is lower. billing is compute-based not per-token markup, which at our DeepSeek volume is meaningfully cheaper. fallback built in.
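the built-in fallback is the piece worth sketching. with one key and one endpoint, failover reduces to an ordered list of models tried through the same call path. a hedged sketch of that pattern — the `send` transport stands in for the gateway client, and the model ids here are placeholders, not guaranteed gateway identifiers:

```python
# hedged sketch: single-gateway fallback across providers.
# `send` abstracts the gateway call; model ids are placeholders.
from typing import Callable

MODEL_ORDER = ["deepseek-v3", "qwen-2.5", "gpt-4o"]  # preferred first

def complete_with_fallback(prompt: str,
                           send: Callable[[str, str], str],
                           models: list[str] = MODEL_ORDER) -> str:
    """Try each model in order through one gateway; return first success."""
    last_err = None
    for model in models:
        try:
            return send(model, prompt)
        except Exception as err:  # provider outage, rate limit, etc.
            last_err = err
    raise RuntimeError("all models failed") from last_err

# usage with a fake transport simulating a DeepSeek outage:
def fake_send(model: str, prompt: str) -> str:
    if model == "deepseek-v3":
        raise TimeoutError("upstream outage")
    return f"{model}: ok"

print(complete_with_fallback("hello", fake_send))  # → qwen-2.5: ok
```

when the gateway handles this server-side you don't write the loop at all, but this is the behavior you get: a provider outage degrades to the next model instead of a pager alert.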
four months in production. the Friday API update incident has not recurred. the multi-key overhead is gone.