DEV Community

hhhfs9s7y9-code
hhhfs9s7y9-code

Posted on

KIMI + Agnes: A Real-World Test of Cross-Provider Agent Chain Correctover

A few days ago I had an idea: what if one LLM could orchestrate other LLMs as agents — not just calling them, but verifying that each agent's output was actually correct before passing it to the next?

I work on NeuralBridge (an open-source self-healing SDK for LLM pipelines), so I decided to build it and test it with two real providers: KIMI (Moonshot) and Agnes AI.

The Core Problem: Failover ≠ Correctover

Most API gateways and LLM routers stop at "HTTP 200" — they retry or switch providers, but they never check if the output is actually correct.

# What everyone else does:
try:
    result = call_llm(prompt)
    return result  # HTTP 200 = success? 🚩
except Exception:
    result = call_llm_fallback(prompt)
    return result  # Still not verified!
Enter fullscreen mode Exit fullscreen mode

This is dangerous. A failover from gpt-4o to gpt-4o-mini might silently drop 3 critical fields. A KIMI response that returns "200 OK" might still be missing key entities.

Correctover is the idea that switching providers isn't enough — you must verify semantic equivalence after every switch.

The Architecture

We built a simple DAG-based chain executor with three key capabilities:

  1. DAG orchestration — define multi-step workflows where nodes depend on each other
  2. Per-node semantic validation — every LLM output is checked against a Contract before passing to the next node
  3. Cross-provider Correctover — if validation fails, automatically retry with a different provider
from neuralbridge import SelfHealingEngine, ProviderConfig, Contract
from neuralbridge.chain import ChainBuilder

engine = SelfHealingEngine(providers=[])
engine.add_provider(ProviderConfig(
    name="moonshot",
    base_url="https://api.moonshot.cn/v1",
    api_key="...",
    models=["moonshot-v1-8k", "moonshot-v1-32k"],
))
engine.add_provider(ProviderConfig(
    name="agnes",
    base_url="https://apihub.agnes-ai.com/v1",
    api_key="...",
    models=["agnes-2.0-flash"],
))

chain = (
    ChainBuilder(engine)
    .node(name="planner",
        system="You are a senior architect.",
        prompt="Design a plan for: {task}",
        contract=Contract(required_entities=["架构", "模块"]),
        model="moonshot-v1-32k",
        timeout=120)
    .node(name="coder",
        system="You are a Python developer.",
        prompt="Implement: {planner}",
        contract=Contract(
            required_entities=["import ", "def "],
            forbidden_patterns=["我不能", "sorry"]),
        model="agnes-2.0-flash",
        depends_on=["planner"],
        timeout=180)
    .build()
)

result = chain.run(
    task="A CSV to JSON converter with validation"
)
Enter fullscreen mode Exit fullscreen mode

The Real Test: KIMI + Agnes

Scene 1: Normal Chain (Planner → Coder)

KIMI plans the architecture, Agnes writes the code:

Node Provider Time Contract
planner moonshot-v1-32k 17.8s ✅ Architecture + Modules
coder agnes-2.0-flash 10.8s ✅ import + def (runnable code)

Total: 28.5s. The planner's design output was used as context for the coder, and the coder actually implemented the design (not random boilerplate).

Scene 2: Correctover in Action

This is where it gets interesting. In a separate test, the deep_analysis node was supposed to output analysis with "优点" (pros) and "缺点" (cons):

deep_analysis(agnes-2.0-flash) → Contract failed (missing "优点"/"缺点")
    ↻ Correctover triggered!
    ↻ Automatically switched to moonshot-v1-32k
    → ✅ Validation passed
Enter fullscreen mode Exit fullscreen mode

This is Correctover working in production: The first provider returned text, but it didn't satisfy the semantic contract. The engine automatically retried with a different provider, and the second attempt passed validation.

What We Learned

1. LLM Reliability is Real

In our test, Agnes AI responses took 18–233 seconds. Without proper timeouts (default 8s in most SDKs!), every call would fail. We had to set timeout=120 and total_timeout=300 for realistic workloads.

2. Semantic Validation Catches Silent Failures

The deep_analysis case above is exactly the kind of failure that traditional gateways miss:

  • HTTP status: 200 ✅ (most gateways stop here)
  • Content: returned text ✅ (LLM didn't crash)
  • Semantic: missing required entities ❌ (only Correctover catches this)

3. SDK Mode > Proxy Mode

Traditional proxy:  Your App → Gateway → KIMI → 429 → Gateway also 429
SDK (NeuralBridge): Your App(embedded) → KIMI → 429 → backoff → ✅
                                              → continuous fail → circuit break → switch provider → ✅
Enter fullscreen mode Exit fullscreen mode

No extra hop, no data through third party, no infrastructure to maintain.

The Bigger Picture

The 2026 AI market is exploding ($7.6B+ for agentic AI, 40-50% CAGR), but 88% of enterprise AI projects never reach production (IDC/Lenovo). The bottleneck isn't capability — it's reliability.

Even academia agrees: a May 2026 arXiv paper (2606.01416) showed that verifier-guided self-healing reduces silent failures to 0.0%, compared to 5.5%+ for retry-only approaches.

We're open-sourcing the chain module as part of NeuralBridge SDK v5.x. The core engine is Apache 2.0 — you can use the self-healing, circuit breakers, and Correctover validation today.

Try it:

pip install neuralbridge
Enter fullscreen mode Exit fullscreen mode
from neuralbridge import SelfHealingEngine, Contract
from neuralbridge.chain import ChainBuilder
Enter fullscreen mode Exit fullscreen mode

NeuralBridge is an open-source (Apache 2.0) self-healing SDK for LLM pipelines. Correctover — semantic validation after failover — is our core differentiator from every other LLM gateway and router.

Top comments (0)