Building a Bitcoin SRE Agent: Incident Response Meets AI

#mcp #bitcoin #kubernetes #ai

What if your Bitcoin node had an AI SRE on call 24/7?

Not a chatbot that answers "what is Bitcoin?" -- an actual site reliability engineer that detects fee spikes at 3am, investigates mempool floods, diagnoses root causes, and tells you exactly what to do. One that follows the same incident response protocol your best on-call engineer follows, except it never sleeps and never forgets a runbook.

That's what I built for the MCP & AI Agents Hackathon. Here's how.

The Problem with Bitcoin Monitoring Today

Running Bitcoin infrastructure is operationally demanding. A node operator deals with:

Fee volatility: Fees can jump 10x in minutes during inscription mints or exchange consolidations. Send a transaction at the wrong time and you overpay by $50. Wait too long and your payment doesn't confirm for hours.
Mempool floods: 200,000+ unconfirmed transactions pile up, memory usage spikes, and your node starts evicting low-fee transactions. You need to know if this is transient or sustained.
Mining anomalies: A major pool goes offline, hashrate drops 15%, block times stretch to 25 minutes. Is this a temporary outage or something to worry about?
Node health: Your node falls behind, loses peers, or runs low on disk. By the time Grafana pages you, you've already missed blocks.

The standard response is a wall of dashboards, static alert thresholds, and a runbook wiki page that was last updated six months ago. The operator wakes up, SSHs in, runs commands, cross-references block explorers, and makes a judgment call. This process is slow, error-prone, and doesn't scale.

The Solution: An AI Agent That Follows SRE Protocol

The Bitcoin SRE Agent replaces the runbook with an AI agent that follows a structured four-phase incident response protocol:

DETECT -- Compare current state against baseline thresholds. Is this fee rate 3x the 7-day average? Are there more than 100K unconfirmed transactions? Has it been 20+ minutes since the last block?

INVESTIGATE -- Gather context. If fees spiked, check the mempool composition. Are inscription transactions dominating? Is block production normal? What does the next block template look like?

DIAGNOSE -- Correlate signals. Fee spike + mempool flood + inscription surge = Ordinals mint event. Fee spike + normal mempool = estimation overshoot. Slow blocks + hashrate drop = mining pool outage.

RECOMMEND -- Produce actionable guidance. "Wait 2 hours, fees should normalize. Use 15 sat/vB if time-sensitive. No escalation needed." Or: "CRITICAL -- fees above 500 sat/vB sustained for 3 blocks. Alert the team."
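To make the DETECT and DIAGNOSE phases concrete, here is a minimal sketch of the thresholds and correlation patterns described above, written as plain Python. It is an illustration only, not the project's code: in the agent these rules live in the system prompt, and the `Snapshot` class, its field names, and the signal labels are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One reading of network state (all field names are illustrative)."""
    fee_rate: float             # current fee estimate, sat/vB
    fee_baseline_7d: float      # 7-day average fee, sat/vB
    mempool_txs: int            # unconfirmed transaction count
    minutes_since_block: float  # time since the last block was found
    inscription_surge: bool     # inscription txs dominating the mempool
    hashrate_drop: bool         # hashrate noticeably below recent levels


def detect(s: Snapshot) -> set[str]:
    """DETECT: compare current state against the thresholds named in the text."""
    signals = set()
    if s.fee_rate >= 3 * s.fee_baseline_7d:
        signals.add("fee_spike")
    if s.mempool_txs > 100_000:
        signals.add("mempool_flood")
    if s.minutes_since_block >= 20:
        signals.add("slow_blocks")
    return signals


def diagnose(signals: set[str], s: Snapshot) -> str:
    """DIAGNOSE: correlate signals using the three patterns from the text."""
    if {"fee_spike", "mempool_flood"} <= signals and s.inscription_surge:
        return "Ordinals mint event"
    if "fee_spike" in signals and "mempool_flood" not in signals:
        return "estimation overshoot"
    if "slow_blocks" in signals and s.hashrate_drop:
        return "mining pool outage"
    return "no known pattern matched" if signals else "no anomaly detected"


# Fees at 45 sat/vB against a 12 sat/vB baseline, 180K unconfirmed txs,
# and inscriptions dominating the mempool.
snap = Snapshot(45.0, 12.0, 180_000, 8, True, False)
print(diagnose(detect(snap), snap))  # Ordinals mint event
```

Hard-coded rules like these are brittle compared to a prompt that can weigh context, which is part of why the agent keeps them in natural language; the sketch just shows the shape of the reasoning.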

Architecture: Three Sponsor Projects Working Together

The agent uses all three hackathon sponsor projects, each in its natural role:

agentregistry (discover bitcoin-mcp)
    |
    v
agentgateway (rate limit + OTEL traces)
    |
    v
kagent ToolServer (bitcoin-mcp, 49 tools)
    |
    v
kagent Agent CRD (Bitcoin SRE Agent)
    |
    v
Incident Response Loop

kagent provides the orchestration layer. The ToolServer CRD wraps bitcoin-mcp -- my MCP server with 49 tools for querying the Bitcoin network. The Agent CRD defines the SRE agent itself, including a 500+ word system prompt that encodes the full incident response protocol, detection thresholds, diagnostic patterns, and safety rules.

agentgateway sits in front of bitcoin-mcp, enforcing rate limits (60 req/min to prevent runaway investigation loops) and exporting OpenTelemetry traces to Jaeger. After an incident, you can replay every tool call the agent made -- complete forensics.
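To see why a 60 req/min cap stops a runaway investigation loop, here is a small sliding-window limiter in Python. This is a conceptual sketch only: it does not reflect how agentgateway implements or configures rate limiting, and the `SlidingWindowLimiter` class is a hypothetical name for this example. The clock is injectable so the behavior can be shown without waiting a real minute.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `window`-second span."""

    def __init__(self, limit: int = 60, window: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.calls = deque()  # timestamps of accepted calls, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False


# Simulate an agent stuck in a loop, firing 100 tool calls at once.
fake_now = [0.0]
limiter = SlidingWindowLimiter(limit=60, window=60.0, clock=lambda: fake_now[0])
accepted = sum(limiter.allow() for _ in range(100))
print(accepted)  # 60
```

The remaining 40 calls are rejected until the window rolls forward, so a confused agent cannot hammer the node or a public API indefinitely.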

agentregistry publishes bitcoin-mcp so other agents and operators can discover it. The registry entry describes all 49 tools across 9 categories, making it composable into other agent workflows.

The Demo: A Fee Spike at 3am

Here's what happens when you run the agent:

=== Bitcoin SRE Agent -- Incident Response Demo ===

[1/4] NODE HEALTH CHECK
  Block Height:    889,241
  Sync Status:     GREEN -- SYNCED (height 889,241)
  Network:         main

[2/4] FEE ENVIRONMENT
  Assessment:      YELLOW -- Elevated (28.0 sat/vB)
  Recommendation:  Non-urgent txs should wait. Use 17 sat/vB if time-sensitive.

[3/4] MEMPOOL ANALYSIS
  Assessment:      YELLOW -- Congested (67,234 txs, 245 MB)
  Recommendation:  Estimated clearing time: ~3 hours at current hashrate.

[4/4] SITUATION REPORT
  Overall Status:  YELLOW
  Diagnosis:       Fee spike correlated with mempool congestion.
                   Likely organic demand surge or inscription event.
  Recommendation:  Non-urgent transactions should wait.
                   Monitor for resolution over next 1-3 hours.

The agent called three bitcoin-mcp tools (get_blockchain_info, get_fee_estimates, get_mempool_info), analyzed the results against its built-in thresholds, correlated the signals, and produced a structured situation report. No dashboards, no manual investigation, no runbook lookups.
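The overall status in the report is consistent with a simple "worst check wins" roll-up. Here is a minimal sketch of that idea; the roll-up rule itself is an assumption inferred from the demo output, and the `Status` enum (including the `RED` level, which the demo output does not show) and the check names are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    """Severity levels ordered so that a higher value means worse."""
    GREEN = 0
    YELLOW = 1
    RED = 2  # assumed level for critical conditions; not shown in the demo


def overall_status(checks: dict[str, Status]) -> Status:
    """The report is only as healthy as its worst individual check."""
    return max(checks.values())


# The three assessments from the demo run above.
checks = {
    "node_health": Status.GREEN,
    "fee_environment": Status.YELLOW,
    "mempool": Status.YELLOW,
}
print(overall_status(checks).name)  # YELLOW
```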

The System Prompt: Encoding Operator Knowledge

The most interesting part of this project isn't the code -- it's the system prompt in the kagent Agent CRD. It encodes years of operator knowledge into structured rules:

Fee interpretation (all in sat/vB): Normal is 1-10. Elevated is 10-50. High is 50-200. Crisis is 200+.
Diagnostic patterns: Fee spike + mempool flood + inscriptions = mint event. Fee spike + normal mempool = estimation overshoot.
Safety rules: Never broadcast a transaction without explicit approval. Never make financial recommendations. Always caveat uncertainty.
Escalation criteria: Alert humans when peers drop below 2, node falls 100+ blocks behind, or fees exceed 500 sat/vB for 3+ blocks.
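For comparison, here is what the fee tiers and escalation criteria above would look like as code. This is a sketch for illustration, not part of the project: the function names are hypothetical, and because the tiers share their boundary values, the choice to place a boundary value (10, 50, 200) in the higher tier is an assumption.

```python
def fee_tier(sat_per_vb: float) -> str:
    """Map a fee rate to the tiers listed in the prompt.

    Boundary values are assigned to the higher tier (an assumption).
    """
    if sat_per_vb >= 200:
        return "Crisis"
    if sat_per_vb >= 50:
        return "High"
    if sat_per_vb >= 10:
        return "Elevated"
    return "Normal"


def should_escalate(peers: int, blocks_behind: int, recent_fees: list[float]) -> bool:
    """Return True when any escalation criterion from the prompt is met.

    `recent_fees` holds one fee reading (sat/vB) per recent block, oldest first.
    """
    fees_sustained = len(recent_fees) >= 3 and all(f > 500 for f in recent_fees[-3:])
    return peers < 2 or blocks_behind >= 100 or fees_sustained


print(fee_tier(28.0))                                 # Elevated
print(should_escalate(8, 0, [20.0, 25.0, 28.0]))      # False
print(should_escalate(8, 0, [510.0, 620.0, 700.0]))   # True
```

The 28.0 sat/vB reading from the demo lands in the Elevated tier, matching the YELLOW assessment in the report.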

This is the key insight: the agent's intelligence comes from the prompt, not the code. The code just calls tools and formats output. The prompt makes it an SRE.

Why This Pattern Matters Beyond Bitcoin

The architecture -- domain MCP server + kagent orchestration + agentgateway security -- isn't Bitcoin-specific. It's a template for any infrastructure SRE agent:

Database SRE: Wrap pg_stat_statements as MCP tools. Detect slow queries, recommend index changes.
Kubernetes SRE: Wrap kubectl as MCP tools. Detect crash loops, recommend resource adjustments.
Network SRE: Wrap SNMP/NetFlow as MCP tools. Detect anomalous traffic, recommend firewall rules.

The pattern is always the same: domain tools (MCP server) + structured reasoning (agent prompt) + guardrails (gateway) + discoverability (registry). kagent makes the agent Kubernetes-native. agentgateway makes it production-safe. agentregistry makes it composable.

Try It Yourself

# Install bitcoin-mcp, then run the demo from a checkout of the repository
pip install bitcoin-mcp
python demo.py

No Bitcoin node required -- bitcoin-mcp falls back to the free Satoshi API automatically. The demo connects, runs a health check, analyzes fees and mempool, and produces a situation report in about 10 seconds.

For the full stack with agentgateway and Jaeger tracing:

docker compose up -d
python demo.py --gateway
# Open http://localhost:16686 to see traces in Jaeger

All code, CRDs, and configs are in the submissions/agents/ directory.

What's Next

This is a proof of concept, but the path to production is clear:

Scheduled execution: Run the health check on a cron (every 15 minutes via kagent). Alert on severity changes.
Historical baselines: Store past readings and compare against rolling averages instead of static thresholds.
Multi-node: Monitor a fleet of Bitcoin nodes, not just one. The ToolServer CRD supports multiple instances.
Automated remediation: For safe operations (restart a stuck node, add peers), let the agent act without human approval.
A2A composition: Other agents can delegate to the Bitcoin SRE agent via kagent's A2A protocol. A general infrastructure agent could ask "is Bitcoin healthy?" and get a structured response.
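The historical-baselines item could start as something like the sketch below: keep a bounded window of past readings and flag a value that reaches a multiple of the rolling mean. This is one possible approach under stated assumptions, not an implemented feature. The `RollingBaseline` class is hypothetical, the 3x multiplier is borrowed from the DETECT threshold, and 672 readings corresponds to 7 days of samples at one every 15 minutes.

```python
from collections import deque
from statistics import mean


class RollingBaseline:
    """Flag readings that reach `multiplier` times the rolling mean."""

    def __init__(self, max_readings: int, multiplier: float = 3.0):
        # deque(maxlen=...) discards the oldest reading once the window is full.
        self.readings = deque(maxlen=max_readings)
        self.multiplier = multiplier

    def is_anomalous(self, value: float) -> bool:
        """Compare `value` to the mean of earlier readings, then record it."""
        anomalous = bool(self.readings) and value >= self.multiplier * mean(self.readings)
        self.readings.append(value)
        return anomalous


# 7 days of 15-minute samples = 7 * 24 * 4 = 672 readings.
baseline = RollingBaseline(max_readings=672)
for fee in [8, 10, 9, 11, 10]:
    baseline.is_anomalous(fee)
print(baseline.is_anomalous(45))  # True
```

Note that this sketch records anomalous readings too, so a sustained spike gradually raises the baseline. Whether that is desirable depends on whether the goal is to flag changes or to flag absolute severity.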

The future of infrastructure operations isn't more dashboards. It's agents that understand your systems, follow your runbooks, and wake you up only when they need to.

Built for the MCP & AI Agents Hackathon -- Building Cool Agents category.

Tools: bitcoin-mcp | kagent | agentgateway | agentregistry