DEV Community: carsonroell-debug

I replaced $149/mo API error monitoring with $0.001 per fix — and agents pay only when it works

carsonroell-debug — Wed, 15 Apr 2026 12:02:51 +0000

How I built an outcome-based API proxy using x402 micropayments where AI agents only pay when errors are successfully diagnosed.

tags: ai, webdev, opensource, blockchain
canonical_url: https://selfheal.dev

The problem with API error handling for AI agents

Every AI agent builder hits the same wall: your agent calls an API, gets a 422, and crashes. The error message says "Unprocessable Entity" — not helpful for an autonomous system running at 3 AM.

The existing solutions are all subscription-based: $149-$349/month for error monitoring that tells you something went wrong. Your agent still crashes. You still get paged.

I wanted something different: an API proxy that fixes the error and only charges when the fix actually works.

What I built

SelfHeal is an API proxy that sits between your agent and any third-party API. Here's what happens:

When the API succeeds (2xx): Free pass-through. Zero cost. Under 200ms overhead.

When the API fails (4xx/5xx):

1. SelfHeal returns HTTP 402 with an x402 payment spec.
2. Your agent signs a payment authorization for $0.001-$0.005 USDC (gasless, on Base).
3. An LLM analyzes the exact error together with your request payload.
4. SelfHeal returns structured JSON: what went wrong, how to fix it, and whether to retry.
5. Payment settles only if the analysis succeeds.

That last point is key: a failed analysis costs nothing, so you pay only for results.

The x402 protocol

x402 is a payment protocol built by Coinbase for machine-to-machine micropayments. It works like this:

1. Server returns HTTP 402 (Payment Required) with a payment spec
2. Client signs a USDC transfer authorization (gasless via EIP-3009)
3. Client retries with payment proof in the X-PAYMENT header
4. Server verifies, delivers the resource, settles payment

For SelfHeal, "the resource" is an LLM-powered error diagnosis. The settlement only happens if the diagnosis is useful.
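
The four steps above can be simulated without a network or a wallet. This is only a sketch of the handshake shape: the payment spec fields (`asset`, `network`, `amount`) are illustrative rather than the real x402 schema, and `sign_authorization` is a stub standing in for an actual EIP-3009 signature.

```python
import base64
import json

PRICE_USDC = "0.001"  # assumed price; the post quotes $0.001-$0.005


def server(headers: dict) -> tuple[int, dict]:
    """Toy server: demand payment first, then deliver the resource."""
    proof = headers.get("X-PAYMENT")
    if proof is None:
        # Step 1: no payment attached, so answer 402 with a payment spec.
        return 402, {"asset": "USDC", "network": "base", "amount": PRICE_USDC}
    # Step 4: verify the proof, deliver the resource, then settle.
    payment = json.loads(base64.b64decode(proof))
    if payment.get("amount") != PRICE_USDC:
        return 402, {"error": "invalid payment"}
    diagnosis = {"error_category": "validation", "is_retriable": True}
    return 200, {"settled": True, "error_analysis": diagnosis}


def sign_authorization(spec: dict) -> str:
    """Stub for step 2. A real client would sign with a wallet (EIP-3009);
    this placeholder only base64-encodes the requested amount."""
    raw = json.dumps({"amount": spec["amount"]}).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def call_with_payment() -> tuple[int, dict]:
    status, body = server({})
    if status == 402:
        proof = sign_authorization(body)             # step 2
        status, body = server({"X-PAYMENT": proof})  # step 3
    return status, body


status, body = call_with_payment()
```

In this toy version settlement is unconditional once the proof verifies. The outcome-based variant described above would settle only after the diagnosis has been produced successfully.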

Show me the code

Without SelfHeal:

const res = await fetch("https://api.crm.com/contacts", {
  method: "POST",
  body: JSON.stringify({ name: "John Doe" })
});
// 422 Unprocessable Entity
// Agent crashes. You get paged.

With SelfHeal:

const res = await fetch("https://selfheal.dev/api/x402/proxy", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://api.crm.com/contacts",
    method: "POST",
    body: JSON.stringify({ name: "John Doe" })
  })
});

if (res.status === 402) {
  // Pay $0.001 USDC, get fix instructions
} else {
  // Success — free pass-through
}

No API keys. No signup. No monthly bill.
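
For Python agents, the same proxy call can be sketched with the standard library. The endpoint and envelope shape are copied from the JavaScript example above; `build_proxy_request` is a hypothetical helper, not part of any SDK.

```python
import json
import urllib.request

PROXY_URL = "https://selfheal.dev/api/x402/proxy"  # endpoint from the example above


def build_proxy_request(url: str, method: str, payload: dict) -> urllib.request.Request:
    """Wrap an upstream call in the proxy envelope shown above."""
    envelope = {
        "url": url,
        "method": method,
        # The example above sends the upstream body as a JSON string.
        "body": json.dumps(payload),
    }
    return urllib.request.Request(
        PROXY_URL,
        data=json.dumps(envelope).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_proxy_request("https://api.crm.com/contacts", "POST", {"name": "John Doe"})
# urllib.request.urlopen(req) would send it. urlopen raises HTTPError for a
# 402, so a real caller would catch that exception and inspect err.code.
```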

What the heal response looks like

{
  "healed": true,
  "settled": true,
  "transaction": "0xbf023f9c...",
  "error_analysis": {
    "error_category": "validation",
    "human_readable_explanation": "The request body is missing the required 'email' field.",
    "actionable_fix_for_agent": "Add the 'email' field to the request body.",
    "is_retriable": true,
    "suggested_payload_diff": {
      "add": { "email": "string (valid email address)" }
    }
  },
  "meta": {
    "cost_usdc": 0.001,
    "latency_ms": 3038
  }
}

Your agent gets structured instructions it can act on programmatically. Not a generic error page.
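
As a rough illustration of acting on that response programmatically, the sketch below reads the fields from the sample above and decides whether a retry is worth attempting. `plan_retry` is a hypothetical helper. Note that the sample diff describes field types rather than values, so the agent still has to supply real data.

```python
def plan_retry(payload: dict, heal: dict) -> dict:
    """Decide what to do with a heal response shaped like the sample above."""
    analysis = heal.get("error_analysis", {})
    hint = analysis.get("actionable_fix_for_agent")
    if not heal.get("healed") or not analysis.get("is_retriable"):
        return {"retry": False, "missing_fields": [], "hint": hint}
    suggested = analysis.get("suggested_payload_diff", {}).get("add", {})
    # Fields the diagnosis wants added that the original payload lacks.
    missing = [name for name in suggested if name not in payload]
    return {"retry": True, "missing_fields": missing, "hint": hint}


heal = {
    "healed": True,
    "error_analysis": {
        "actionable_fix_for_agent": "Add the 'email' field to the request body.",
        "is_retriable": True,
        "suggested_payload_diff": {"add": {"email": "string (valid email address)"}},
    },
}
plan = plan_retry({"name": "John Doe"}, heal)
```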

Pricing comparison

Feature                          SelfHeal     Sentry      TraceRoot.AI  Struct AI
Pricing model                    Pay per fix  $26-80/mo   $0-200/mo     $20-200/mo
Auto-fix                         Yes          No          Human review  Human review
AI agents                        Native       Bolted on   Partial       Partial
Signup                           None         Required    Required      Required
Cost when the API call succeeds  $0           Same price  Same price    Same price
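
To put the per-fix price next to a flat fee, here is a small break-even calculation. It uses only figures quoted in this post ($26/mo from the table, $149/mo from the title, $0.001-$0.005 per fix) and ignores everything else a subscription might include, so treat it as arithmetic rather than a full cost comparison.

```python
from fractions import Fraction


def breakeven_fixes(monthly_fee_usd: str, price_per_fix_usd: str) -> int:
    """Paid fixes per month at which pay-per-fix costs as much as a flat fee."""
    # Fractions avoid floating-point rounding on values like 0.001.
    return int(Fraction(monthly_fee_usd) / Fraction(price_per_fix_usd))


print(breakeven_fixes("26", "0.001"))   # 26000
print(breakeven_fixes("26", "0.005"))   # 5200
print(breakeven_fixes("149", "0.001"))  # 149000
```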

SDKs

Both SDKs handle the x402 payment flow automatically:

npm install graceful-fail    # Node.js/TypeScript
pip install graceful-fail    # Python

LangChain and CrewAI integrations included.

What's next

Mainnet launch (currently on Base Sepolia testnet)
More LLM providers for error analysis
Agent framework integrations beyond LangChain/CrewAI

The code is open source: github.com/carsonlabs/graceful-fail

Try it: selfheal.dev

Built by Freedom Engineers. We build tools for AI agents.

I Built an MCP Server That Finds Broken Links. Here's How AI Agents Use It

carsonroell-debug — Wed, 01 Apr 2026 14:42:23 +0000

I published an MCP server that lets Claude, Cursor, and any MCP-compatible AI agent scan websites for broken links, estimate revenue impact, and suggest fixes. Here's why I built it and how it works.

Every day, my outreach engine scans affiliate blogs and finds the same thing: sites with 100+ pages almost always have 20-30 broken affiliate links that nobody knows about. Each one is a leak in the revenue bucket.

The auditing part was automated. But acting on the results still required me to look at the report, figure out what to fix, and make the changes.

That's exactly what AI agents are good at.

The Model Context Protocol lets AI agents call tools through a standard interface. Instead of giving an agent a screenshot or a CSV and saying "figure this out," you give it structured tools that return structured data.

So I wrapped LinkRescue's scanning engine as an MCP server. Now any agent can:

Scan a site → get back every broken link with status codes, page locations, and revenue estimates
Get fix suggestions → receive prioritized remediation steps with code snippets
Set up monitoring → schedule recurring checks
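
The post does not show the report format, so the following is only a guess at what one structured result might look like and how an agent could rank it. Every field name and value here is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BrokenLink:
    url: str                        # the dead link
    status_code: int                # HTTP status returned for it
    page: str                       # page on the scanned site that contains it
    est_monthly_revenue_usd: float  # assumed revenue estimate field


def prioritize(links: list[BrokenLink]) -> list[BrokenLink]:
    """Order broken links so the most expensive leak comes first."""
    return sorted(links, key=lambda link: link.est_monthly_revenue_usd, reverse=True)


# Placeholder data, not real scan output.
report = [
    BrokenLink("https://example.com/a", 404, "/reviews/one", 12.0),
    BrokenLink("https://example.com/b", 410, "/reviews/two", 40.0),
]
top = prioritize(report)[0]
```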

The core tool. Give it a URL, it crawls the site and returns a structured report.

An agent in Cursor could take this output and directly edit the source files. An agent in Claude could draft the update instructions for a team.

Set up scheduled monitoring and verify connectivity. Simple but necessary for agent workflows that run unattended.

The server falls back to realistic simulated data when no backend API is connected, so you can test immediately without any setup.

Why MCP > API for This

A REST API returns data. An MCP server returns data that agents understand natively. The difference:

REST API: Agent calls endpoint → parses JSON → figures out what to do → makes another call
MCP server: Agent discovers tools → reads descriptions → chains them together automatically

With MCP, I've watched Claude scan a site, get the broken links report, pipe it directly to get_fix_suggestions, and present a prioritized fix plan, all without any prompting about the workflow. The tool descriptions guide the agent.

The MCP server is open source. The full SaaS at linkrescue.io has additional features (dashboard, email alerts, historical tracking, team management).

I'm curious what other "scan and fix" workflows would benefit from MCP servers. If you've built something similar, I'd love to see it.

Links:

GitHub: carsonroell-debug/linkrescue-mcp
PyPI: linkrescue-mcp
Full product: linkrescue.io

Why AI Agents Fail Silently — And How to Fix It

carsonroell-debug — Sat, 28 Mar 2026 16:13:15 +0000

I spent three hours debugging an agent pipeline last week that wasn't broken.

No errors. No exceptions. The logs looked fine. The agent ran, did its thing, and returned a response. The response just happened to be completely wrong. Nothing in my stack told me that.

That's the problem. And if you're building anything with LLMs right now, you've almost certainly hit it already.

The failure mode nobody talks about

When a traditional API call breaks, you know. You get a 500, a timeout, an exception. Your monitoring catches it. Your retry logic kicks in. The system is designed around the assumption that failures are loud.

LLMs don't work that way.

An LLM can receive your prompt, process it fully, and return a confident, well-formatted, completely hallucinated response. No error code. No signal that anything went wrong. Just bad output delivered with the same confidence as good output.

Now chain a few of those together in an agent pipeline — where the output of one step becomes the input of the next — and you have a system that can fail catastrophically while every individual component reports green.

This isn't a hypothetical. It's the default behavior of every agent framework I've worked with.

Three ways agents fail silently

Empty or malformed output

You ask the agent to extract structured data. It returns an empty object, or it returns JSON that's one closing bracket short of valid. Your code tries to parse it, gets None or throws a quiet exception that gets swallowed somewhere upstream, and the pipeline just... continues. With nothing.

This one is insidious because it often only happens on edge cases — unusual inputs, long contexts, prompts that are slightly ambiguous. Your happy path tests pass. Production fails on the 7% of inputs you didn't think to test.

Hallucinated success

The agent was supposed to do something — call an API, write a file, complete a task — and it responds as if it did. "Done! I've updated the record." It didn't update anything. There was no tool call. It just said the words.

This is especially common when you're using a model that's been fine-tuned to be helpful and agreeable. It wants to give you what you asked for. If it can't do the actual thing, sometimes it just reports that it did.

Cascading failures

This is the expensive one. Step 1 produces subtly wrong output. Step 2 runs on that output and produces plausible-looking but also wrong output. By step 4 or 5, you're so far from correct that there's no recovering — and you have no idea where the chain broke.

The frustrating part is that each individual step looks fine in isolation. The bug isn't in any one component. It's in the gaps between them.

Why the standard tooling doesn't catch this

Most error handling is built around the assumption that errors are exceptional. You wrap things in try/catch, you set up Sentry, you watch your logs for stack traces.

But an LLM returning bad output isn't an exception. It's a valid response. Your infrastructure has no concept of "the agent succeeded technically but failed semantically." That distinction doesn't exist in HTTP status codes or Python exception hierarchies.

So you end up with monitoring that tells you your system is healthy while your users are getting garbage.

The fix isn't more logging. It's a different mental model: assume the output is wrong until you can verify it isn't.

What good error handling looks like in an agent pipeline

A few things that actually work:

Validate outputs structurally. If you're expecting JSON, parse it immediately and fail loudly if it's not valid. If you're expecting a specific schema, validate against it. Don't let bad structure propagate downstream.
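
A minimal sketch of structural validation using only the standard library. The schema in `REQUIRED_FIELDS` is hypothetical; in a real pipeline you might reach for a schema library instead.

```python
import json

REQUIRED_FIELDS = {"name": str, "email": str}  # hypothetical schema


class MalformedOutput(ValueError):
    """Raised when model output does not match the expected structure."""


def parse_structured(raw: str) -> dict:
    # Parse immediately so invalid JSON never travels downstream.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise MalformedOutput(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise MalformedOutput("expected a JSON object")
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if field_name not in data:
            raise MalformedOutput(f"missing field: {field_name}")
        if not isinstance(data[field_name], expected_type):
            raise MalformedOutput(f"wrong type for field: {field_name}")
    return data
```

An empty object and JSON that is one bracket short both raise here, instead of silently becoming an empty result further down the pipeline.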

Validate outputs semantically. This is harder, but often doable. If the agent is supposed to return a URL, check that it's a real URL. If it's supposed to return a number in a certain range, check that. Simple assertions catch a surprising percentage of failures.
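
Two such assertions, sketched with the standard library. `looks_like_url` only checks that the value is a well-formed http(s) URL; confirming that it actually resolves would need a network request on top of this.

```python
from urllib.parse import urlparse


def looks_like_url(value: str) -> bool:
    """True for a well-formed http(s) URL with a host."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def in_range(value, low: float, high: float) -> bool:
    """True for a real number within [low, high]; booleans are rejected."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return low <= value <= high
```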

Capture the full context when something fails. Not just the error — the prompt, the model, the parameters, the raw response, the timestamp. You need all of it to reproduce and debug the failure later.
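
One way to keep all of that together is a single record that serializes to JSON for your logs. The class and field names below are illustrative, not taken from any library.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class FailureRecord:
    prompt: str
    model: str
    parameters: dict
    raw_response: str
    error: str
    # Recorded in UTC at creation time.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


record = FailureRecord(
    prompt="Extract structured data from: Jane, jane@example.com",
    model="example-model",  # placeholder model name
    parameters={"temperature": 0},
    raw_response='{"name": "Jane"',
    error="not valid JSON",
)
```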

Retry with context. When something fails, don't just retry the same call. Pass the failure back to the model: "Your previous response had this problem. Try again." A lot of LLMs will self-correct when you tell them specifically what was wrong.

Here's what this looks like in practice with graceful-fail:

from graceful_fail import agent_call, RetryConfig

# `llm` and `parse_json` are assumed to be defined elsewhere in your project.
retry_config = RetryConfig(
    max_attempts=3,
    capture_context=True,
    self_heal=True,  # passes failure context back on retry
)

@agent_call(retry=retry_config)
def extract_data(text: str) -> dict:
    response = llm.complete(f"Extract structured data from: {text}")
    return parse_json(response)  # raises if malformed, which triggers a retry

When parse_json fails, graceful-fail captures the full context — the input, the raw LLM response, the exception — and on the next attempt, it appends that context to the prompt so the model knows what it did wrong. In practice this recovers about 70% of failures that would otherwise silently propagate.
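
To make the mechanism concrete, here is a library-free sketch of the same retry-with-context idea. It is not graceful-fail's implementation: `call_with_feedback` and `fake_llm` are hypothetical, and the feedback wording is just one plausible choice.

```python
import json
from typing import Callable


def call_with_feedback(llm: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Retry a model call, telling the model what was wrong with its last answer."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = llm(current_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            # Append the failed response and the parse error to the next prompt.
            current_prompt = (
                f"{prompt}\n\nYour previous response was:\n{raw}\n"
                f"It failed to parse as JSON ({exc.msg}). Return valid JSON only."
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


def fake_llm(prompt: str) -> str:
    """Stand-in model: returns truncated JSON until it sees feedback."""
    if "previous response" in prompt:
        return '{"name": "Jane"}'
    return '{"name": "Jane"'


result = call_with_feedback(fake_llm, "Extract structured data from: Jane")
```

The stand-in model fails on the first attempt and succeeds on the second, which is the recovery path the decorator above is meant to automate.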

The bigger picture

Here's what I keep coming back to: agents are about to get a lot more autonomous. The things we're building today are doing relatively contained tasks — summarizing, extracting, classifying. What's coming is agents that take real actions: sending emails, writing code, making purchases, managing infrastructure.

In that world, silent failures aren't just annoying. They're dangerous.

The time to build proper error handling into your agent pipelines is now, while the tasks are still low-stakes enough that bad output is embarrassing rather than catastrophic. The patterns are the same whether you're building a document parser or an autonomous business process. Get the foundations right early.

I built selfheal.dev because I got burned by this problem enough times that I finally sat down and built the thing I wished existed. It's open source — the package is graceful-fail on PyPI and npm. If you're building agent pipelines and you're not thinking about this yet, I hope this saves you the three hours I lost to a pipeline that wasn't broken.