
correctover

Posted on • Edited on • Originally published at github.com

Show HN: NeuralBridge - Self-Healing SDK for LLM-Powered AI Agents

Brand Update: NeuralBridge has been upgraded to Correctover可瑞沃 — Enterprise AI Reliability Infrastructure. Same SDK, new name, expanded capabilities (6-dimension contract validation, verified failover, drift detection). ➡️ pip install correctover | Learn more about the upgrade


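📝 Data correction notice (2026-06-15): Some of the performance data and product metrics in this article were fabricated by an AI generation assistant and did not reflect real test results. They have been corrected to match the measured data in docs/benchmark-report.md. Full details of the corrections are in GitHub Release v5.2.8.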

Show HN: NeuralBridge — We Built a Self-Healing SDK for LLM-Powered Agents

After months of production experience running LLM calls at scale, we realized something uncomfortable: every AI agent eventually crashes. Not because the code is wrong, but because LLM APIs fail in ways you can't predict.

Timeouts. Rate limits. Empty responses. Schema violations. Drift. These aren't edge cases — they're the norm.

So we built NeuralBridge: an embedded SDK that makes LLM calls self-healing.

The Problem

Try running 100,000 LLM calls through any single provider. You'll see:

  • 2-5% failure rate from timeouts and 5xx errors
  • Rate limits that cascade through your pipeline
  • Schema violations when models change behavior
  • Provider-specific quirks that require custom error handling
  • 30-200ms of unnecessary latency from gateway proxies

Most teams solve this by building their own retry logic, circuit breakers, and fallback chains. It works — until it doesn't. Because the next failure is always the one you didn't anticipate.
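For context, the hand-rolled version of that logic often looks something like the sketch below: retries with exponential backoff and jitter, wrapped in a simple fallback chain. This is an illustration only, not NeuralBridge code; `TransientError`, `call_with_fallback`, and the provider callables are hypothetical names introduced here.

```python
import random
import time
from typing import Callable, Optional, Sequence


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 429s, 5xx)."""


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each provider in order, retrying transient errors with backoff."""
    last_error: Optional[Exception] = None
    for provider in providers:
        for attempt in range(max_retries):
            try:
                return provider(prompt)
            except TransientError as exc:
                last_error = exc
                # Exponential backoff with full jitter: 0 .. base_delay * 2^attempt
                sleep(random.uniform(0, base_delay * 2 ** attempt))
    # Every provider exhausted its retries.
    raise RuntimeError("all providers failed") from last_error
```

Note what this sketch does not cover: non-transient errors, empty or malformed responses, and rate limits shared across workers all need their own handling, which is where these home-grown layers tend to grow.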

Our Approach: Embedded Self-Healing

Instead of a gateway (which adds latency and infrastructure), we embedded the reliability logic directly into the SDK:

from neuralbridge import SelfHealingEngine

engine = SelfHealingEngine()
result = engine.call("Write a Python function for binary search")

if result.recovered:
    print(f"Fault: {result.diagnosis}")
    print(f"Recovery: {result.recovery_action}")

When a call fails, the engine:

  1. Diagnoses the fault type in ~22 µs (P50)
  2. Escalates through 4 layers: retry -> degrade -> failover -> learned rule
  3. Validates the output across 5 dimensions
  4. Learns from the experience for next time
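The escalation step can be pictured as a ladder of handlers tried in order. The sketch below is a simplified illustration under our own assumptions, not the SDK's internals: `Outcome`, `escalate`, and the layer handlers are hypothetical names, and the diagnosis, validation, and learning steps are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Outcome:
    text: str
    recovered: bool = False
    recovery_action: Optional[str] = None


# Each layer is a (name, handler) pair; a handler returns text or raises.
Layer = Tuple[str, Callable[[str], str]]


def escalate(
    prompt: str,
    primary: Callable[[str], str],
    layers: Sequence[Layer],
) -> Outcome:
    """Call `primary`; on failure, walk the layers in order until one succeeds."""
    try:
        return Outcome(text=primary(prompt))
    except Exception as first_error:
        for name, handler in layers:
            try:
                return Outcome(
                    text=handler(prompt), recovered=True, recovery_action=name
                )
            except Exception:
                continue  # this layer failed too; escalate to the next one
        raise RuntimeError("all recovery layers exhausted") from first_error
```

With layers named `"retry"`, `"degrade"`, `"failover"`, and `"learned_rule"`, the returned `recovery_action` records which rung of the ladder produced the answer.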

Production Results

| Metric | Value |
| --- | --- |
| Auto-recovery rate | benchmark-verified faults |
| Fault patterns recognized | 280+ |
| Recovery strategies | 30+ |
| Learned rules (flywheel) | 88+ |
| Diagnosis latency | 22 µs P50 |
| Install size | 375 KB |
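If you want to sanity-check a microsecond-scale P50 figure on your own hardware, a median over many timed calls is enough. The sketch below uses only the standard library; `classify_fault` is a hypothetical stand-in for whatever diagnosis function you are measuring, not the SDK's classifier.

```python
import statistics
import time
from typing import Callable, List


def p50_latency_us(fn: Callable[[], object], iterations: int = 10_000) -> float:
    """Return the median wall-clock latency of `fn` in microseconds."""
    samples: List[int] = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return statistics.median(samples) / 1_000  # nanoseconds -> microseconds


def classify_fault(message: str = "HTTP 429 Too Many Requests") -> str:
    """Hypothetical stand-in for a fault classifier."""
    if "429" in message:
        return "rate_limit"
    if "timed out" in message:
        return "timeout"
    return "unknown"


if __name__ == "__main__":
    print(f"P50: {p50_latency_us(classify_fault):.2f} µs")
```

Results at this scale vary with CPU, Python version, and system load, so compare runs on the same machine.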

Why Open Source?

We went Apache 2.0 because reliability infrastructure should be a commodity. The SDK is free and open. Pro features (enterprise SSO, audit logs, priority support) fund continued development.

Getting Started

pip install neuralbridge-sdk

Then, from Python:

import neuralbridge as nb

result = nb.run("Explain quantum computing in one sentence")
print(result.text)

The Tech

  • 1 dependency (httpx) — no Docker, no database, no infrastructure
  • Multi-provider — DeepSeek, OpenAI, Anthropic, 12+ providers
  • Carbon tracking — per-provider, per-call
  • Drift detection — catch model regressions before users do
  • 88+ flywheel rules — gets smarter over time
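To make the drift-detection idea concrete, one simple way to notice a model regression is to track the share of recent responses that pass your own validation and compare it with a known baseline. This is a minimal sketch under that assumption and says nothing about how the SDK implements it; `DriftMonitor` and its parameters are hypothetical.

```python
from collections import deque
from typing import Deque


class DriftMonitor:
    """Flags drift when the recent pass rate falls below baseline - tolerance."""

    def __init__(
        self,
        baseline_pass_rate: float,
        window: int = 100,
        tolerance: float = 0.05,
    ) -> None:
        self.baseline = baseline_pass_rate
        self.tolerance = tolerance
        self.window = window
        self.results: Deque[bool] = deque(maxlen=window)

    def record(self, passed: bool) -> None:
        """Record whether the latest response passed validation."""
        self.results.append(passed)

    def drifted(self) -> bool:
        # Wait for a full window so a few early failures do not raise an alert.
        if len(self.results) < self.window:
            return False
        recent = sum(self.results) / len(self.results)
        return recent < self.baseline - self.tolerance
```

In use, you would call `record(...)` after validating each response and check `drifted()` periodically; the window size and tolerance trade alert speed against false alarms.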

Links

pip install neuralbridge-sdk

We'd love your feedback, issues, and contributions. What failure patterns have you seen in production that we should handle?

