Why I added a 6th LLM to my orchestrator (and why Grok with K is not Groq with Q)

TL;DR: I built geo-orchestrator, an open-source multi-LLM pipeline in Python that routes tasks across 6 providers. Yesterday's real production run: 10 tasks, 5 waves, $0.1967 total cost, 5/6 providers used, zero failures.

Repo (MIT): https://github.com/alexandrebrt14-sys/geo-orchestrator

The premise

No single frontier model wins by itself anymore. Not Claude Opus 4.7. Not GPT-4o. Not Gemini 2.5 Pro. Not Grok 4.3. What wins is orchestration between them.

I didn't reach this thesis by opinion. I reached it by running 1,189 calls with unified tracking. The spreadsheet doesn't lie: concentrating 96.7% of cost in Claude Opus gave me high quality at a corporate-subscription bill. Distributing calls across 6 providers with complexity-based routing gave me equivalent or better quality at 1/30th the price.

Why 6 providers (and why Grok is not Groq)

The most common confusion in vendor calls this month:

Groq Inc (with Q): an LPU chip company. It serves open-source models like Llama 4 Scout and gpt-oss-120b at sub-second latency. Bills $0.11/$0.34 per 1M tokens. API at api.groq.com/openai/v1.

xAI Grok (with K): Elon Musk's lab. It owns the proprietary models grok-4.3 and grok-4.20, with live X/Twitter search via search_parameters (nobody else has this). Bills $1.25/$2.50 per 1M tokens. API at api.x.ai/v1.

Two completely different companies. Two different roles in the stack. The catalog gives each one a long, explicit label so the two can't be swapped by accident at runtime.
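To make the Groq/Grok split concrete, here is a minimal sketch of what a catalog with long labels could look like. The endpoints and prices are the ones quoted above; the `CatalogEntry` type, the `resolve` helper, the dictionary keys, and the `https://` scheme are assumptions for illustration, not the repo's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    label: str                       # long, unambiguous label for logs and errors
    base_url: str                    # API base URL (https scheme assumed)
    usd_per_1m: tuple[float, float]  # the two per-1M-token prices quoted above


# Hypothetical catalog: keys and labels are illustrative.
CATALOG = {
    "groq": CatalogEntry(
        "Groq Inc (with Q, LPU inference)",
        "https://api.groq.com/openai/v1",
        (0.11, 0.34),
    ),
    "xai": CatalogEntry(
        "xAI Grok (with K, proprietary models)",
        "https://api.x.ai/v1",
        (1.25, 2.50),
    ),
}


def resolve(key: str) -> CatalogEntry:
    """Look up a provider key and fail loudly on near-miss names like 'grok'."""
    try:
        return CATALOG[key]
    except KeyError:
        known = ", ".join(f"{k} = {v.label}" for k, v in CATALOG.items())
        raise ValueError(f"unknown provider {key!r}; known: {known}") from None
```

With something like this, a typo such as `resolve("grok")` raises immediately with both long labels in the message, instead of silently sending the request to the wrong vendor.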

The architecture, in 3 layers

Layer 1 — Provider enum with 6 entries

from enum import Enum


class Provider(str, Enum):
    """The 6 providers the router can dispatch to."""
    ANTHROPIC = "anthropic"
    OPENAI = "openai"
    GOOGLE = "google"
    PERPLEXITY = "perplexity"
    GROQ = "groq"  # Groq Inc (with Q)
    XAI = "xai"  # added 2026-05-17: grok-4.3 with live X search

Layer 2 — Tier routing by complexity

Tasks classified as complexity 1–2 go to the economy tier (Haiku, Llama Scout, sonar). Complexity 3 goes to balanced (Sonnet, GPT-4o). Complexity 4–5 wakes up premium (Opus, Pro, sonar-deep-research). Before tier routing, I ran everything on Opus. Today Opus handles 8% of calls and audited quality is the same.
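The mapping from complexity to tier is simple enough to sketch in a few lines. The thresholds and model names come from the paragraph above; the `Tier` enum, `TIER_MODELS`, and `tier_for` are hypothetical names, and the real router may weigh more than a single integer.

```python
from enum import Enum


class Tier(str, Enum):
    ECONOMY = "economy"
    BALANCED = "balanced"
    PREMIUM = "premium"


# Short model names as listed in the text; exact API identifiers are not shown there.
TIER_MODELS = {
    Tier.ECONOMY: ["Haiku", "Llama Scout", "sonar"],
    Tier.BALANCED: ["Sonnet", "GPT-4o"],
    Tier.PREMIUM: ["Opus", "Pro", "sonar-deep-research"],
}


def tier_for(complexity: int) -> Tier:
    """Map a 1-5 complexity score to a tier: 1-2 economy, 3 balanced, 4-5 premium."""
    if not 1 <= complexity <= 5:
        raise ValueError(f"complexity must be between 1 and 5, got {complexity}")
    if complexity <= 2:
        return Tier.ECONOMY
    if complexity == 3:
        return Tier.BALANCED
    return Tier.PREMIUM
```

A task scored 4 would then be offered the candidates in `TIER_MODELS[tier_for(4)]`, and nothing below that score touches Opus.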

Layer 3 — Provider concentration cap at 80%

If any vendor crosses 80% of session calls, the router rebalances to the next viable provider. Inspired by Mixture of Agents (Wang 2024, arXiv:2406.04692) and DAAO (arXiv:2509.11079).

This isn't just FinOps. It's epistemic diversity: 6 models failing at different points give me more robust consensus than 1 model failing consistently.
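The 80% cap from Layer 3 can be sketched as a share check before each dispatch. This is a simplified illustration, not the repo's implementation: `pick_provider`, the `warmup` parameter (which skips the check while the session has too few calls for shares to mean anything), and the least-used fallback are all assumptions.

```python
from collections import Counter

CAP = 0.80  # concentration cap from the text


def pick_provider(
    preferred: list[str],
    calls: Counter,
    cap: float = CAP,
    warmup: int = 10,
) -> str:
    """Return the first preferred provider whose share of session calls is within the cap.

    `preferred` is ordered best-first; `calls` counts calls per provider so far.
    """
    total = sum(calls.values())
    if total < warmup:
        # Too few calls for shares to be meaningful; take the top choice.
        return preferred[0]
    for name in preferred:
        if calls[name] / total <= cap:
            return name
    # Every candidate is over the cap (e.g. a single-item list): use the least used.
    return min(preferred, key=lambda name: calls[name])
```

For example, with 9 of 10 session calls on one vendor, the next call goes to the second entry in `preferred`; at exactly 8 of 10 the top choice is still allowed.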

What I learned in 14 months

Inference cost dropped 50x in 2 years. Whoever still pays $5 per execution is paying for routing inefficiency, not intelligence. The right question is not "which LLM is best"; it's "what's the optimal sequence of LLMs for this specific demand."

Provider diversity is the new redundancy. In April, Anthropic hit 102% of its daily budget mid-run. The fallback chain redirected all Claude tasks to Sonnet → Groq with zero visible failures. A single provider is a single point of failure.
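A fallback chain like the one that absorbed the April budget overrun can be sketched as an ordered list of callables. Everything named here is hypothetical (`BudgetExceeded`, `run_with_fallback`, and the two stub functions standing in for real API clients); the actual chain in the repo may catch more error types.

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Hypothetical error raised when a provider's daily budget is spent."""


def run_with_fallback(
    prompt: str,
    chain: list[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, result) from the first success."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except BudgetExceeded as exc:
            # Record the failure and move on to the next provider in the chain.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers in the chain failed: " + "; ".join(errors))


# Stand-ins for real clients, only to show the control flow.
def anthropic_stub(prompt: str) -> str:
    raise BudgetExceeded("daily budget at 102%")


def groq_stub(prompt: str) -> str:
    return f"[groq] {prompt}"


CHAIN = [("anthropic", anthropic_stub), ("groq", groq_stub)]
```

Calling `run_with_fallback("summarize this", CHAIN)` skips the over-budget stub and returns the Groq stub's answer, which is the "zero visible failures" behavior from the caller's point of view.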

The literature is cheaper than Stack Overflow. Three papers influenced my last 6 months the most: Mixture of Agents (Wang 2024, arXiv:2406.04692), RouteLLM (Ong 2024, arXiv:2406.18665), and DAAO (2025, arXiv:2509.11079). All open on arXiv, all with code.