Kirill Polishchuk

Posted on • Originally published at kirponik.hashnode.dev

Why Your SRE Agents Need a Graph

Traditional automation relies on Directed Acyclic Graphs (DAGs)—linear pipelines that execute steps A, then B, then C. Tools like GitHub Actions and Jenkins excel at this. They're perfect for deterministic workflows like building Docker images or running test suites.

But infrastructure failures aren't linear. When your database chokes at 3 AM, the recovery process is iterative: you observe metrics, form a hypothesis, test it, and when it fails—you backtrack and try another angle. Even after finding the root cause, you often need to pause and ask for human approval before executing a potentially destructive remediation.

A linear pipeline can't do any of this. If step 2 fails, the pipeline dies. It can't loop back to gather more data. It can't pause mid-execution to wait for human input.

This is why AI agents need graph-based orchestration. Not the rigid DAGs of CI/CD pipelines, but cyclic, stateful graphs that support iteration, maintain context across cycles, and can pause for human approval at critical moments.

Here's how I built a production-ready autonomous SRE system using LangGraph for the orchestration and PydanticAI for the agent intelligence.

Why AI Agents Need a Graph

Traditional automation uses linear pipelines. But AI agents are different—they think, they iterate, they sometimes need to ask for help. Without a graph-based orchestrator, you're left with brittle scripts that can't adapt.

The Three Superpowers of Graph-Based Agent Orchestration:

1. Cyclic Routing: Think → Act → Reflect → Repeat

Real debugging is iterative. An agent makes a hypothesis, tests it, and either succeeds or loops back with new information. A graph with cyclic routing lets your agents iterate naturally.

2. Stateful Memory: Building Context Across Iterations

Each agent cycle builds on the last. The graph maintains state—observations, metrics, hypotheses—so agents don't start from scratch every time. This persistent context is crucial for complex debugging scenarios where the root cause only becomes apparent after several failed attempts.

3. Human-in-the-Loop with Interrupts: Safety at Critical Moments

The most powerful feature: graphs can pause execution and wait for human input. When your agent wants to kill a query on the production database, it should ask first. With the right orchestrator, this is elegant—the graph simply pauses, sends a Slack message with Approve/Reject buttons, and waits. When the human clicks, the graph resumes exactly where it left off.

Without a graph orchestrator, implementing human approval would require complex state polling or external workflow engines. With the right tool, it's a natural part of the workflow.

The Problem with DAGs in Infrastructure

Imagine a traditional automation script trying to debug a database:

  1. Trigger: High CPU alert.
  2. Action: Fetch slow query log.
  3. Action: If slow query found, kill it.

What happens if the slow query log is empty because the issue is actually an InnoDB lock wait? The script fails, throws an exception, and wakes you up anyway.

What if the script identifies a fix but needs human approval before executing it? A DAG can't pause mid-execution—it just runs to completion or fails.

We need a system that can say: "Hmm, the slow log didn't give me the answer. Let me loop back, look at the disk I/O metrics, and form a new hypothesis. And before I execute any remediation, let me ask a human to confirm."

Stateful, Multi-Agent Orchestration with Human Approval

To solve this, I built a system with three core components:

The State Machine (LangGraph)

The recovery process is treated as a state machine rather than a pipeline. The graph maintains the incident's global memory:

import operator
from typing import Annotated, List, Optional, TypedDict

class IncidentState(TypedDict):
    """Global memory that persists across agent cycles."""
    incident_id: str
    status: str  # e.g. "diagnosing", "ready_for_remediation", "escalated"
    # Observations accumulate with each cycle (using operator.add)
    observations: Annotated[List[str], operator.add]
    metrics: dict
    hypothesis: Optional[str]
    remediation_sql: Optional[str]
    cycle_count: Annotated[int, operator.add]
    # Approval workflow state
    approval_status: Optional[str]  # pending, approved, rejected
    approver: Optional[str]

As agents loop through diagnostic cycles, they append their findings here, so each new hypothesis is informed by everything that has already been ruled out.
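The `Annotated[..., operator.add]` reducers are what make this accumulation automatic: when a node returns a partial update, LangGraph merges it into the shared state with the declared reducer instead of overwriting. A plain-Python sketch of that merge:

```python
import operator

# LangGraph merges each node's partial return into the shared state using the
# reducer declared in Annotated[...]. For observations that reducer is
# operator.add, so findings accumulate instead of being overwritten.
state = {"observations": ["CPU at 95%"], "cycle_count": 1}
node_update = {"observations": ["slow query log is empty"], "cycle_count": 1}

merged = {
    "observations": operator.add(state["observations"], node_update["observations"]),
    "cycle_count": operator.add(state["cycle_count"], node_update["cycle_count"]),
}
# merged["observations"] == ["CPU at 95%", "slow query log is empty"]
# merged["cycle_count"] == 2
```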

The Brain (PydanticAI Agents)

Specialized AI agents handle different aspects of the investigation:

  • Metrics Agent: Fetches and interprets Prometheus data
  • Analyzer: Forms hypotheses from the metrics
  • Researcher: Validates hypotheses by querying the database

These agents return strictly typed JSON outputs, so the graph router knows exactly what to do with their results—loop back for more diagnosis, proceed to remediation, or escalate to a human.
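In the real system these are Pydantic models that PydanticAI validates for you; the sketch below uses only the standard library to show the shape of the contract the router depends on (field names are illustrative):

```python
import json
from dataclasses import dataclass
from typing import Optional

# Illustrative shape of the researcher agent's structured output. In the real
# system this would be a Pydantic model enforced by PydanticAI's typed outputs.
@dataclass
class ResearchResult:
    status: str                     # "ready_for_remediation" or "needs_more_data"
    hypothesis: str
    remediation_sql: Optional[str]  # present only when a fix was identified

def parse_agent_output(raw: str) -> ResearchResult:
    data = json.loads(raw)
    return ResearchResult(
        status=data["status"],
        hypothesis=data["hypothesis"],
        remediation_sql=data.get("remediation_sql"),
    )

result = parse_agent_output(
    '{"status": "ready_for_remediation", '
    '"hypothesis": "long-running ALTER holding a metadata lock", '
    '"remediation_sql": "KILL 4242;"}'
)
```

Because the output is structured rather than free text, the router can branch on `result.status` without any prompt parsing.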

The Safety Layer (MCP + Human Approval)

Two critical safety mechanisms:

MCP (Model Context Protocol): Acts as a secure gateway between the AI and your database. The AI never sees credentials—MCP holds them locally and only exposes read-only diagnostic tools. It's like a USB-C port for AI data access.

Human-in-the-Loop: Before any destructive action, the graph pauses and sends an interactive Slack message. The workflow literally cannot proceed until a human clicks "Approve" or "Reject."

Dynamic Configuration

A key insight: Grafana alerts already contain everything we need. The MySQL instance IP, whether it's a replica, the cluster name—all of it is in the alert labels. We extract this dynamically, eliminating the need for static configuration files or AWS credential management. Each alert creates its own isolated connection context.
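As a sketch of that extraction (label names here are illustrative; use whatever your alert rules actually attach):

```python
# Pull connection context straight out of a Grafana alert webhook payload.
# The label names ("instance", "cluster", "role") are illustrative assumptions,
# not a fixed Grafana schema.
def connection_context(alert: dict) -> dict:
    labels = alert["labels"]
    return {
        "host": labels["instance"],                   # MySQL instance IP
        "cluster": labels.get("cluster", "unknown"),
        "is_replica": labels.get("role") == "replica",
    }

ctx = connection_context({
    "labels": {"instance": "10.0.1.42", "cluster": "orders-db", "role": "replica"}
})
# ctx == {"host": "10.0.1.42", "cluster": "orders-db", "is_replica": True}
```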

The Orchestration Flow

Here's how the pieces fit together:

Cyclic Routing in Practice

The router is simple but powerful. Here's how it works in code:

def router(state: IncidentState) -> str:
    """Route the workflow based on current state."""
    # Guardrail: Prevent infinite loops and runaway costs
    if state["cycle_count"] > 5:
        return "escalate"

    # Success: Found the root cause
    if state["status"] == "ready_for_remediation":
        return "request_approval"

    # Iteration needed: Loop back for more diagnosis
    return "diagnose"

The router decides:

  • If the researcher finds the root cause: Proceed to remediation
  • If the hypothesis is wrong: Loop back to the diagnose node with updated context
  • If we've tried 5 times without success: Escalate to a human
  • If we're ready to remediate: Pause and wait for human approval

This cycle continues until success, human intervention, or the safety limit is reached.
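That decision loop is just a router consulted after each node finishes. A dependency-free simulation of the control flow (LangGraph's `StateGraph` provides this wiring in the real system; the `diagnose` node here is a trivial stand-in):

```python
# Dependency-free simulation of the cyclic control flow that LangGraph's
# StateGraph provides in the real system. The diagnose node is a stand-in.
def diagnose(state: dict) -> dict:
    state["cycle_count"] += 1
    # Pretend the third diagnostic cycle finds the root cause.
    if state["cycle_count"] >= 3:
        state["status"] = "ready_for_remediation"
    return state

def router(state: dict) -> str:
    if state["cycle_count"] > 5:
        return "escalate"
    if state["status"] == "ready_for_remediation":
        return "request_approval"
    return "diagnose"

def run(state: dict) -> dict:
    node = "diagnose"
    while node == "diagnose":        # loop back as long as the router says so
        state = diagnose(state)
        node = router(state)
    state["next_node"] = node        # "request_approval" or "escalate"
    return state

final = run({"cycle_count": 0, "status": "diagnosing"})
# final["cycle_count"] == 3; final["next_node"] == "request_approval"
```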

The Interrupt Pattern

When the graph reaches the approval point, it doesn't poll or busy-wait. It interrupts—pausing execution entirely and saving its state:

from langgraph.types import interrupt

def approval_wait_logic(state: IncidentState):
    """Pause the graph and wait for human approval via interrupt."""
    # Send a Slack message with Approve/Reject buttons
    send_slack_approval_request(state["incident_id"], state["hypothesis"])

    # Interrupt: pause execution entirely and wait for external input
    decision = interrupt({"request": "awaiting_approval", "incident_id": state["incident_id"]})

    # The graph resumes here when a human clicks a button
    if decision["approved"]:
        return {"status": "execute_remediation"}
    else:
        return {"status": "escalate"}

When a human clicks "Approve" in Slack, the webhook resumes the graph with the decision:

# In the Slack webhook handler
from langgraph.types import Command

async def handle_approval(incident_id: str, approved: bool, user_name: str):
    decision = {"approved": approved, "approver": user_name}

    # Resume the graph with the human's decision
    async for event in graph.astream(
        Command(resume=decision),
        config={"configurable": {"thread_id": incident_id}},
    ):
        pass  # The graph continues from the interrupt point

Execution continues exactly where it left off. This is far more elegant than polling loops or external state machines.

Production Safety

Unlike standard scripts, agentic systems are billed per "thinking cycle." An LLM stuck in a loop trying to debug a phantom network issue will happily burn through tokens until your OpenAI bill looks like a phone number.

The cycle limit (capped at 5) ensures that if the AI is truly stumped, it gracefully escalates to a human SRE rather than looping infinitely.

Other critical guardrails:

  • Read-only database access: MCP only exposes diagnostic queries
  • Mandatory human approval: No destructive actions without explicit sign-off
  • Approval timeouts: Auto-escalate if humans don't respond in time
  • Structured logging: Full observability with correlation IDs for every incident

Why This Architecture Wins

Without a graph orchestrator:

  • Scripts fail on first error
  • No way to iterate or backtrack
  • No built-in human approval mechanism
  • State is lost between steps
  • Can't implement "ask a human" mid-workflow

With LangGraph:

  • Agents iterate naturally: hypothesis → test → refine
  • State persists across cycles
  • Interrupts enable human-in-the-loop safety
  • Clear routing logic based on typed agent outputs
  • Built-in cycle limits prevent runaway costs

By moving away from linear DAGs and utilizing cyclic graphs with governed tool access (MCP) and human-in-the-loop interrupts, we finally have an infrastructure recovery system that behaves like a real engineer: it investigates, it fails, it adapts, it asks for help when needed, and it tries again.

The Bottom Line

Linear pipelines work for deterministic processes. But infrastructure failures are messy, non-linear, and often require human judgment. AI agents need an orchestrator that matches this reality—one that supports iteration, maintains context, and can pause for human input.

Graph-based orchestration isn't just a nice-to-have for AI agents. It's the difference between a brittle script that wakes you up at 3 AM and an autonomous system that either fixes the issue or escalates with full context.


📚 Resources

Want to build this yourself? Here is the reading list I used to put this architecture together:

  • 🔗 LangGraph Documentation: The framework for building stateful, multi-actor applications with interrupts. Read the docs
  • 🔗 LangGraph Interrupt Pattern: How to pause graphs for human input. Read the docs
  • 🔗 PydanticAI: The typed, robust agent framework by the creators of Pydantic. Read the docs
  • 🔗 Model Context Protocol (MCP): The open standard for securely connecting AI to data sources. Official Specification
