DEV Community

hhhfs9s7y9-code
hhhfs9s7y9-code

Posted on • Originally published at github.com

Show HN: NeuralBridge - Self-Healing SDK for LLM-Powered AI Agents

Show HN: NeuralBridge — We Built a Self-Healing SDK for LLM-Powered Agents

After months of production experience running LLM calls at scale, we realized something uncomfortable: every AI agent eventually crashes. Not because the code is wrong, but because LLM APIs fail in ways you can't predict.

Timeouts. Rate limits. Empty responses. Schema violations. Drift. These aren't edge cases — they're the norm.

So we built NeuralBridge: an embedded SDK that makes LLM calls self-healing.

The Problem

Try running 100,000 LLM calls through any single provider. You'll see:

  • 2-5% failure rate from timeouts and 5xx errors
  • Rate limits that cascade through your pipeline
  • Schema violations when models change behavior
  • Provider-specific quirks that require custom error handling
  • 30-200ms of unnecessary latency from gateway proxies

Most teams solve this by building their own retry logic, circuit breakers, and fallback chains. It works — until it doesn't. Because the next failure is always the one you didn't anticipate.

Our Approach: Embedded Self-Healing

Instead of a gateway (which adds latency and infrastructure), we embedded the reliability logic directly into the SDK:

from neuralbridge import SelfHealingEngine

engine = SelfHealingEngine()
result = engine.call("Write a Python function for binary search")

if result.recovered:
    print(f"Fault: {result.diagnosis}")
    print(f"Recovery: {result.recovery_action}")
Enter fullscreen mode Exit fullscreen mode

When a call fails, the engine:

  1. Diagnoses the fault type in ~19us (P50)
  2. Escalates through 4 layers: retry -> degrade -> failover -> learned rule
  3. Validates the output across 5 dimensions
  4. Learns from the experience for next time

Production Results

Metric Value
Auto-recovery rate 84.1% of faults
Fault patterns recognized 280+
Recovery strategies 30+
Learned rules (flywheel) 88+
Diagnosis latency 19us P50
Install size 375 KB

Why Open Source?

We went Apache 2.0 because reliability infrastructure should be a commodity. The SDK is free and open. Pro features (enterprise SSO, audit logs, priority support) fund continued development.

Getting Started

pip install neuralbridge-sdk
Enter fullscreen mode Exit fullscreen mode
import neuralbridge as nb

result = nb.run("Explain quantum computing in one sentence")
print(result.text)
Enter fullscreen mode Exit fullscreen mode

The Tech

  • 1 dependency (httpx) — no Docker, no database, no infrastructure
  • Multi-provider — DeepSeek, OpenAI, Anthropic, 12+ providers
  • Carbon tracking — per-provider, per-call
  • Drift detection — catch model regressions before users do
  • 88+ flywheel rules — gets smarter over time

Links

pip install neuralbridge-sdk
Enter fullscreen mode Exit fullscreen mode

We'd love your feedback, issues, and contributions. What failure patterns have you seen in production that we should handle?


Top comments (0)