Coding agents write code. When that code fails, who debugs it? Right now, that's still you. The agent writes, you interpret error logs, you prompt the agent to fix. The debugging loop stays open.
The fix is giving agents access to their own telemetry. An agent that can query its own traces can verify its work, diagnose failures, and iterate without waiting for a human to read logs.
This tutorial shows you how to build a self-healing agent that runs tests, queries its own production traces via MCP, identifies root causes, and fixes bugs without human intervention.
By the end, you'll have a working demo where:
- A buggy Text-to-SQL application fails its test suite
- An agent queries its own traces from Okahu Cloud via MCP
- The agent identifies and fixes bugs based on trace analysis
- All tests pass, without a human reading logs or prompting fixes
What we'll cover:
- What is Monocle and why auto-instrumentation matters
- What is Okahu MCP and how agents consume telemetry
- Setting up the self-healing demo environment
- Running the agent and watching it fix itself
- Key takeaways for building autonomous debugging loops
Let's start by understanding the two core technologies that make this possible.
What is Monocle?
In traditional software, you read the source code to understand what the application does. The logic is deterministic and inspectable. In agent-driven applications, the code is scaffolding. The actual decision-making happens inside the model at runtime.
An agent that can't access traces is an agent working without documentation. It will guess where failures occur and propose changes based on incomplete information.
Monocle helps developers and platform engineers building or managing generative AI apps monitor them in production by making it easy to instrument their code, capturing traces that are compatible with the open-source, cloud-native observability ecosystem. It automatically captures traces from LLM SDKs like OpenAI, LangChain, and LlamaIndex without any manual span creation.
Agents won't manually add telemetry to their generated code. If instrumentation requires effort, it won't happen. Monocle solves this by auto-instrumenting supported SDKs the moment they're used.
Here's what it takes to enable automatic trace capture:
from monocle_apptrace import setup_monocle_telemetry
# One line to enable automatic trace capture
setup_monocle_telemetry(workflow_name="text_to_sql_analyst")
That's it. From this point forward, every OpenAI SDK call, every database query, every tool invocation is captured as a trace span.
These traces include:
- LLM calls: Inputs, outputs, model name, token usage
- App traces: OpenTelemetry-compatible spans that can be exported from the application
- Tool invocations: Agent framework tool calls and responses
- Errors and latency: Exception details and timing data
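To make that concrete, here is roughly the shape of data an inference span carries. This is an illustrative sketch only, not Monocle's exact schema; field names will differ in real exports:

```json
{
  "span_name": "openai.chat.completions",
  "workflow_name": "text_to_sql_analyst",
  "attributes": {
    "model": "gpt-4o",
    "input": "Generate SQL for: total orders per user",
    "output": "SELECT user_id, COUNT(*) FROM orders GROUP BY user_id",
    "prompt_tokens": 142,
    "completion_tokens": 23
  },
  "status": "OK",
  "latency_ms": 820
}
```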
When something goes wrong, the trace contains everything an agent needs to diagnose the problem. The question is: how does the agent access that trace data?
Zero-config instrumentation using Monocle

It's also worth noting that Monocle does not depend on any existing OpenTelemetry instrumentation in your application.
What is Okahu MCP?
Okahu is a cloud observability platform built for AI applications. It ingests traces from Monocle, stores them, and provides dashboards for visualization.
Dashboards are designed for human eyes. They're full of charts, graphs, and interactive trace viewers. An agent can't click through a dashboard. The data exists, but agents can't access it.
**MCP (Model Context Protocol)** is a standard for exposing data sources to AI agents. Okahu MCP turns the observability platform into a programmable API that agents can query directly. Instead of a human clicking through a dashboard, an agent calls /okahu:get_latest_traces and parses the JSON response.
Observability platforms must evolve from human dashboards to programmatic interfaces that agents can consume. MCP is how that happens.
To connect an agent to Okahu MCP, add this configuration:
{
"mcp": {
"okahu": {
"type": "remote",
"url": "https://mcp.okahu.ai/mcp",
"headers": {
"x-api-key": "your-okahu-api-key"
},
"enabled": true
}
}
}
Now the agent can query its own traces. It sees the same data a human would see in the dashboard, but in a format it can parse, reason about, and act on. Make sure you authenticate with the server by running this command:
opencode mcp auth okahu
This generates a URL and redirects you to authenticate with Okahu Cloud. After a successful authentication, you'll see a screen like the one below:
With Monocle capturing traces and Okahu MCP exposing them, we have the infrastructure for self-healing.
Prerequisites
Before starting, make sure you have the following:
- Python 3.10+ installed on your system
- OpenAI API key for LLM calls (the demo uses GPT-4o)
- Okahu API key for trace storage (sign up at okahu.ai)
- OpenCode or a similar coding agent with MCP support
Once you have these ready, let's clone the demo repository and set up the environment.
Step 1: Clone and Set Up the Demo
Start by cloning the demo repository and creating a virtual environment:
git clone https://github.com/Arindam200/awesome-ai-apps
cd mcp_ai_agents/telemetry-mcp-okahu
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install monocle_apptrace monocle_test_tools openai fastapi pytest python-dotenv
Next, create a .env file with your API keys:
# .env
OPENAI_API_KEY="your-openai-key"
OPENAI_MODEL="gpt-4o"
OKAHU_API_KEY="your-okahu-key"
MONOCLE_EXPORTER="okahu"
The MONOCLE_EXPORTER=okahu setting tells Monocle to send traces to Okahu Cloud instead of logging them locally. This is important because we're suppressing all local logs to force the agent to use MCP for debugging.
Finally, initialize the sample database:
python setup_db.py
This creates a sales.db SQLite database with users and orders tables. The Text-to-SQL application will generate queries against this database.
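The demo ships its own `setup_db.py`, but as a rough sketch of what such a script does, here is a minimal version. The column layout matches the assertions in the test suite later in this tutorial; the actual script and its seed data may differ:

```python
import sqlite3

def init_db(path: str = "sales.db") -> None:
    """Create the users and orders tables and seed a few rows (idempotent)."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        DROP TABLE IF EXISTS users;
        DROP TABLE IF EXISTS orders;
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT, email TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER,
                             amount REAL, order_date TEXT);
    """)
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
        (1, "alice", "alice@example.com"),
        (2, "bob", "bob@example.com"),
        (3, "carol", "carol@example.com"),
    ])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
        (1, 1, 250.0, "2024-01-15"),
        (2, 2, 75.5, "2024-02-03"),
        (3, 3, 120.0, "2024-03-20"),
    ])
    conn.commit()
    conn.close()

if __name__ == "__main__":
    init_db()
```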
Now let's look at what makes this demo interesting: the intentionally buggy application.
Step 2: Understand the Buggy Application
The demo includes a pre-built analyst.py with three intentional bugs. Each bug produces a specific trace signature that the agent must find and interpret:
| Bug | What's Wrong | What the Trace Shows |
|---|---|---|
| 1 | Uses `client.completions.create()` instead of `client.chat.completions.create()` | API error: "NotFoundError: /v1/completions not found for gpt-4o" |
| 2 | Accesses `.text` instead of `.message.content` on the response | `AttributeError` in trace span |
| 3 | Schema prompt says `customers`/`products` but DB has `users`/`orders` | SQL execution error: "no such table: customers" |
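Bug #2 is easy to reproduce in isolation: a chat-completions response exposes the generated text at `choices[0].message.content`, and there is no `.text` attribute, so the buggy access raises exactly the `AttributeError` the trace records. A stand-in object (not a real SDK response) illustrates the failure mode without calling the API:

```python
from types import SimpleNamespace

# Stand-in for a chat-completions response object (not the real SDK class).
choice = SimpleNamespace(message=SimpleNamespace(content="SELECT * FROM users"))
response = SimpleNamespace(choices=[choice])

# Bug #2: the legacy completions API exposed `.text`; chat responses do not.
try:
    sql = response.choices[0].text
except AttributeError as exc:
    print(f"AttributeError: {exc}")  # the kind of error recorded in the trace span

# The fix: read `.message.content` instead.
sql = response.choices[0].message.content
print(sql)
```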
Why pre-built bugs instead of letting the agent write code from scratch? Two reasons:
- Reproducibility: Every demo run starts with the same failures
- No guessing: The agent must read traces to understand what's wrong
You can reset to the buggy state anytime with:
python reset_demo.py
This script overwrites analyst.py with the buggy version, ready for the agent to fix.
With the buggy application in place, let's run the tests and see it fail.
Step 3: Run Tests and See Failures
The test suite uses Monocle Test Tools to validate not just outputs, but traces. We're testing that the right LLM calls were made, not just that the code ran.
Here is what the test_analyst.py file looks like and what it tests:
@MonocleValidator().monocle_testcase(sql_generation_test_cases)
def test_generate_sql_with_monocle(my_test_case: TestCase):
"""
Test SQL generation using Monocle validator.
This validates:
- OpenAI inference spans are generated correctly
- SQL output matches expected patterns (similarity check)
If this fails, check Okahu MCP traces for:
- Missing inference spans (wrong API method used)
- Incorrect SQL generation (schema mismatch)
"""
MonocleValidator().test_workflow(generate_sql, my_test_case)
# Direct database tests (no monocle validation needed - these test the DB itself)
def test_execute_query_users_table():
"""
Test direct SQL execution on the actual database.
This verifies the users table exists and has data.
"""
sql = "SELECT * FROM users LIMIT 3"
results = execute_query(sql)
assert len(results) == 3, "Should return 3 users"
# Verify column structure: (user_id, username, email)
assert len(results[0]) == 3, "Each user row should have 3 columns"
def test_execute_query_orders_table():
"""
Test direct SQL execution on orders table.
This verifies the orders table exists and has data.
"""
sql = "SELECT * FROM orders WHERE amount > 100 LIMIT 5"
results = execute_query(sql)
assert len(results) > 0, "Should find orders over $100"
# Verify column structure: (order_id, user_id, amount, order_date)
assert len(results[0]) == 4, "Each order row should have 4 columns"
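The tests above call an `execute_query` helper from `analyst.py`. The demo's actual implementation may differ, but a minimal version is just a thin wrapper around `sqlite3`:

```python
import sqlite3

def execute_query(sql: str, db_path: str = "sales.db") -> list[tuple]:
    """Execute a SQL query against the demo database and return all rows."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()
```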
Run the tests:
pytest test_analyst.py -v
You'll see output like this:
The direct database tests pass (the database itself is fine), but the LLM-powered SQL generation fails.
Here's what makes this demo realistic: there are no useful local logs. We've suppressed all local logging:
import logging
logging.disable(logging.CRITICAL)
The agent cannot read local logs to debug. It must query Okahu MCP to understand what went wrong. This forces the self-healing loop through observability infrastructure, exactly how it would work in production.
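You can verify the suppression yourself: once `logging.disable(logging.CRITICAL)` is in effect, even error-level records never reach any handler:

```python
import logging

captured = []

class ListHandler(logging.Handler):
    def emit(self, record):
        captured.append(record.getMessage())

logging.getLogger().addHandler(ListHandler())
logging.error("before suppression")   # reaches the handler

logging.disable(logging.CRITICAL)     # what the demo does
logging.error("after suppression")    # silently dropped

print(captured)  # only "before suppression" was captured
```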
The trace shows exactly what went wrong. The agent queries this via MCP
Step 4: Configure the Self-Healing Agent
The demo uses an OpenCode agent mode called @analyst_v3. This is a custom agent configuration stored in .opencode/agents/analyst_v3.md with specific rules for self-healing behavior.
Tool access alone isn't enough. The agent needs structured rules that encode how an experienced developer would debug.
Here's a summary of the core rules:
**Self-Healing Rules:**
1. Run tests first, observe failures
2. Wait 5 seconds for trace ingestion
3. Query Okahu MCP: `/okahu:get_latest_traces` with workflow_name='text_to_sql_analyst_v3'
4. Fix bugs based on trace analysis only
5. Archive old code to `versions/analyst_vN.py` before each fix
6. Record the trace ID used to diagnose each fix
7. Repeat until all tests pass
**Critical Constraint:**
NO GUESSING. If no traces are available, STOP and report failure.
Do not attempt to fix code without trace evidence.
The "no guessing" rule is the key constraint. Without it, the agent might make up fixes based on general knowledge. By requiring trace evidence, every fix must be grounded in telemetry. The agent can cite its sources.
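The rules translate into a simple control loop. The sketch below captures only the control flow; `run_tests`, `get_traces`, and `apply_fix` are hypothetical stand-ins for the real pytest run, the Okahu MCP query, and the code edit:

```python
import time

MAX_ITERATIONS = 5
INGESTION_DELAY_S = 5

def self_heal(run_tests, get_traces, apply_fix, sleep=time.sleep) -> bool:
    """Run the fix loop until tests pass, traces run out, or we give up."""
    for _ in range(MAX_ITERATIONS):
        if run_tests():
            return True                  # all tests pass: done
        sleep(INGESTION_DELAY_S)         # wait for trace ingestion
        traces = get_traces()
        if not traces:
            return False                 # "no trace, no fix": stop, don't guess
        apply_fix(traces)                # ground every edit in trace evidence
    return run_tests()
```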
The agent also has a list of approved packages it can install if missing:
monocle_apptrace, monocle_test_tools, openai, fastapi, pytest, python-dotenv
If it encounters a ModuleNotFoundError, it can install from this list, but nothing else.
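That allowlist check might look like the sketch below. The pip-name-to-import-name mapping is simplified here (for example, `python-dotenv` actually imports as `dotenv`), and `ensure_package` is a hypothetical helper, not part of the demo:

```python
import importlib.util
import subprocess
import sys

APPROVED = {"monocle_apptrace", "monocle_test_tools", "openai",
            "fastapi", "pytest", "python-dotenv"}

# Simplified pip-name -> import-name mapping; real mappings vary.
IMPORT_NAMES = {"python-dotenv": "dotenv"}

def ensure_package(pip_name: str, install=None) -> bool:
    """Install pip_name only if it is approved and missing; refuse anything else."""
    if pip_name not in APPROVED:
        return False  # off-list packages are never installed
    module = IMPORT_NAMES.get(pip_name, pip_name)
    if importlib.util.find_spec(module) is not None:
        return True   # already importable, nothing to do
    install = install or (lambda name: subprocess.check_call(
        [sys.executable, "-m", "pip", "install", name]))
    install(pip_name)
    return True
```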
With the agent configured, let's trigger the self-healing loop.
Step 5: Run the Self-Healing Loop
With traces accessible via MCP, the agent can close the loop itself. It runs, fails, queries traces, diagnoses, fixes, and repeats.
Trigger the agent with this prompt:
@analyst_v3 Fix the buggy Text-to-SQL codebase:
The `analyst.py`, `test_analyst.py`, and `main.py` files already exist but have bugs.
1. Run Tests: Execute `pytest test_analyst.py -v` to see failures.
2. Analyze Traces: Wait 5s, then query Okahu MCP with workflow_name='text_to_sql_analyst_v3'.
3. Fix Loop:
- Archive current `analyst.py` to `versions/analyst_vN.py`
- Fix the bug based on trace analysis
- Record the trace ID used to diagnose each fix
- Run tests again
- Repeat until all tests pass
4. Final Report: Output a summary table of all issues fixed with their associated trace IDs.
Rules: No debug files. Debug only via Okahu MCP traces. Always call the MCP tool to get the logs from traces, do not use the local logs in the terminal
Here's what happens, completely autonomously:
- Agent runs tests → 3 failures detected
- Agent waits 5 seconds → allows traces to be ingested into Okahu
- Agent queries Okahu MCP → retrieves trace data for the failed runs
- Agent reads the first trace → sees "NotFoundError: /v1/completions not found for gpt-4o"
- Agent archives `analyst.py` → saves to `versions/analyst_v1.py`
- Agent fixes Bug #1 → changes `completions.create()` to `chat.completions.create()`
- Agent runs tests → 2 failures remain
- Agent queries MCP again → sees `AttributeError` for `.text` access
- Agent fixes Bug #2 → changes `.text` to `.message.content`
- Agent runs tests → 1 failure remains
- Agent queries MCP → sees "no such table: customers"
- Agent fixes Bug #3 → updates the schema in the prompt from `customers`/`products` to `users`/`orders`
- Agent runs tests → all 5 tests pass
Agent runs, fails, queries traces, fixes, repeats until all tests pass.
The entire loop runs without human intervention. The agent reads, reasons, and repairs using only its own telemetry.
Step 6: The Fix Summary
The agent produces a traceable summary. Every fix links to the trace that prompted it:
| Issue | Description | Trace ID |
|---|---|---|
| 1 | Wrong OpenAI API (completions → chat.completions) | trace_a1b2c3d4... |
| 2 | Wrong response attribute (.text → .message.content) | trace_e5f6g7h8... |
| 3 | Schema mismatch (customers/products → users/orders) | trace_i9j0k1l2... |
You can take any trace ID, look it up in Okahu, and see exactly what error led to that fix. The human's role shifts from doing the debugging to auditing the system that did the debugging.
The archived versions in versions/ show the progression:
- `analyst_v1.py`: original buggy version
- `analyst_v2.py`: after fixing the API method
- `analyst_v3.py`: after fixing the response attribute
- `analyst.py`: final working version
The human wasn't in the debugging loop. The human set up the environment, triggered the agent, and reviewed the summary. The debugging itself was autonomous.
OpenCode agent running tests, using Monocle traces on Okahu via MCP to debug and fix the buggy app
Key Takeaways
Here's what we've shown and why it matters:
1. Auto-instrumentation is essential for self-healing
Agents won't manually add telemetry to their generated code. If you want agents to debug their own work, instrumentation must be automatic. Monocle captures traces from supported SDKs with a single line of setup.
2. MCP enables agent-native observability
Dashboards are designed for humans. MCP turns observability platforms into programmable APIs that agents can query. Okahu MCP exposes the same trace data a human would see, but in a format agents can parse and act on.
3. "No Trace, No Fix" prevents hallucinated fixes
Without grounding in telemetry, agents might guess at fixes based on general knowledge. The "no trace, no fix" rule ensures every repair is evidence-based. If the agent can't find trace data, it stops and reports the issue rather than guessing.
4. Trace-based testing validates the full pipeline
Monocle Test Tools doesn't just check that code runs. It validates that the right traces exist. If the LLM wasn't called correctly, the inference span won't exist, and the test fails. This catches issues that output-only testing would miss.
5. Human role shifts from debugging to monitoring
The human sets up the environment, triggers the agent, and reviews the fix summary. The iterative debugging loop (run, fail, diagnose, fix, repeat) is fully autonomous. This is a fundamental shift in how we build and maintain AI applications.
The Evolution: Who Consumes Telemetry?
This demo represents a real shift in how software gets built:
| Phase | Who Writes Code | Who Reads Telemetry | Human Role |
|---|---|---|---|
| Software 1.0 | Human | Human | Everything |
| Software 2.0 | Coding Agent | Human | Prompts + interprets errors |
| Autonomous | Coding Agent | Coding Agent | Monitors only |
In Software 1.0, humans wrote code, read dashboards, and fixed bugs. In Software 2.0, agents write code, but humans still interpret errors and prompt fixes. The agent is the "hands," but the human remains the "brain."
In the autonomous phase, which is what this demo shows, the agent writes, runs, debugs, and fixes. The human monitors the process and reviews outcomes. The agent consumes its own telemetry.
The lesson from every team that has pushed agents toward autonomy is the same: the critical work is never writing code. It's designing the environment, encoding constraints, structuring knowledge, and building feedback loops. As human involvement decreases, the systems that steer and verify agent behavior must become more robust, not less.
This requires a shift in how we build observability platforms. Dashboards aren't enough. Platforms must expose MCP interfaces that agents can query programmatically. The data is the same; the consumer is different.
Conclusion
Self-healing agents are possible when three conditions are met:
- Instrumentation is automatic: Monocle captures traces without manual span creation
- Telemetry is programmatically accessible: Okahu MCP exposes traces as an API
- Testing validates traces, not just outputs: Monocle Test Tools ensures the right calls were made
The result is a coding agent that can debug its own code by querying production telemetry. It runs tests, reads traces, identifies root causes, and applies fixes, closing the loop without human intervention.
The organizations that invest early in giving agents access to telemetry, embedding evaluation into workflows, and exposing programmatic observability interfaces will scale agent-driven development effectively. Those that don't will find their agents operating blind, producing code that looks correct but behaves unpredictably.
The question is no longer "who writes the code?" It's "who consumes the telemetry?" When agents can do both, the loop closes, and we've entered a new phase of software engineering.
Resources
- Monocle: `pip install monocle_apptrace` | GitHub
- Monocle Test Tools: `pip install monocle_test_tools` | Trace-based validation framework
- Okahu Cloud: okahu.ai | AI observability platform with MCP support
- Demo Repository: GitHub | Full source code for this tutorial
- OpenCode: Install here
This article demonstrates a proof-of-concept for autonomous agent debugging. The patterns shown here (auto-instrumentation, MCP-based telemetry access, and trace-driven testing) are the building blocks for self-healing AI systems.