
Krishna shakula


IRAS: Building an Autonomous AI Agent for Incident Response

Incident response is broken. When alerts fire at 3 AM, on-call engineers wake up to handle routine triage, root cause analysis, and remediation planning. That work takes time and attention, but for routine incidents it rarely needs human judgment. IRAS addresses this with an autonomous AI agent that automates the analysis workflow while keeping humans in control of every decision.

The Problem

Most on-call incidents follow a predictable pattern:

  1. Alert fires
  2. Engineer wakes up, triages the alert
  3. Engineer investigates root cause
  4. Engineer creates a remediation plan
  5. Engineer executes the plan
  6. Engineer writes a post-mortem

For routine incidents (disk full, memory leak, failed job retry), steps 1-4 don't require human judgment. They require pattern matching and analysis—exactly what AI is good at. Yet engineers still get paged.

The Solution: IRAS

IRAS is an autonomous AI agent built on production-grade technology:

  • FastAPI for HTTP endpoints
  • LangGraph for multi-step agentic workflows
  • Pydantic AI for structured outputs and validation
  • Claude (Anthropic) for reasoning and analysis
  • Python for the entire stack

How It Works

When an alert fires, IRAS executes a fully autonomous workflow:

Alert → Triage → RCA → Remediation Plan → Post-Mortem → Human Approval

Each step is handled by Claude with structured outputs validated by Pydantic. The entire workflow is orchestrated by LangGraph as a state machine with approval gates.
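To make "structured outputs validated by Pydantic" concrete, here is a minimal sketch of what validating a triage result could look like. The `TriageResult` fields and the `parse_triage` helper are assumptions for illustration, not the schema IRAS actually ships, and the raw JSON strings stand in for model output.

```python
import json
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class TriageResult(BaseModel):
    """Hypothetical triage schema; the real IRAS fields may differ."""

    severity: Literal["low", "medium", "high", "critical"]
    category: str
    confidence: float = Field(ge=0.0, le=1.0)
    summary: str


def parse_triage(raw_json: str) -> TriageResult:
    """Validate raw model output; raises ValidationError on bad data."""
    return TriageResult(**json.loads(raw_json))


# Well-formed output passes validation and becomes a typed object.
good = parse_triage(
    '{"severity": "high", "category": "disk_full", '
    '"confidence": 0.92, "summary": "Root volume at 98% on db-01"}'
)
print(good.severity, good.confidence)

# Malformed output is rejected before it can reach the next workflow step.
rejected_fields = []
try:
    parse_triage(
        '{"severity": "urgent", "category": "disk_full", '
        '"confidence": 1.4, "summary": "bad output"}'
    )
except ValidationError as exc:
    # Collect the names of the fields that failed validation.
    rejected_fields = sorted({str(err["loc"][0]) for err in exc.errors()})
    print("rejected:", rejected_fields)
```

The point of the pattern is that a step either hands a well-typed object to the next step or fails loudly, rather than passing free-form text down the workflow.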

Key metric: under 2 minutes from alert to a proposed remediation plan.

Human-in-the-Loop Control

IRAS doesn't execute remediation automatically. Every step requires human approval:

  1. Triage approval: Confirm the incident classification
  2. RCA approval: Confirm the root cause analysis
  3. Remediation approval: Approve the remediation plan before execution
  4. Post-mortem approval: Review the generated post-mortem

AI does the heavy lifting. Humans stay in control.
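The four approvals above can be thought of as an ordered set of gates. The sketch below is a plain-Python illustration of that idea, not code from the IRAS repository; the gate identifiers and both helper functions are assumed names.

```python
from typing import Dict, Optional

# Gate names mirror the four approvals listed above; the identifiers are illustrative.
APPROVAL_GATES = ("triage", "rca", "remediation", "post_mortem")


def next_pending_gate(approvals: Dict[str, bool]) -> Optional[str]:
    """Return the first gate not yet approved, or None when all are approved."""
    for gate in APPROVAL_GATES:
        if not approvals.get(gate, False):
            return gate
    return None


def may_execute_remediation(approvals: Dict[str, bool]) -> bool:
    """Remediation may run only once triage, RCA, and the plan are all approved."""
    return all(approvals.get(g, False) for g in ("triage", "rca", "remediation"))


approvals = {"triage": True, "rca": True}
pending = next_pending_gate(approvals)            # the remediation gate is next
blocked = may_execute_remediation(approvals)      # plan not yet approved
approvals["remediation"] = True
allowed = may_execute_remediation(approvals)      # all three gates approved
print(pending, blocked, allowed)
```

Treating a missing entry as "not approved" means the default is always to wait for a human.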

Production-Grade Reliability

IRAS isn't a prototype. It's built for production:

  • 99% test coverage with 292 passing tests
  • Zero external test dependencies: fallback mock clients are included, so you can develop and test locally without external services
  • Integrated observability: logging, plus PagerDuty and Slack support
  • Docker-ready: run locally or in production

Testing Strategy

The test suite includes:

  • Unit tests for each workflow step
  • Integration tests for the full incident response workflow
  • Mock PagerDuty and Slack clients for isolated testing
  • No external service dependencies
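As a rough illustration of the mock-client approach, the sketch below tests notification logic against an in-memory stand-in instead of the real Slack API. `Notifier`, `MockSlackClient`, and `notify_triage` are hypothetical names chosen for this example; the actual IRAS mock clients may be shaped differently.

```python
from typing import List, Protocol, Tuple


class Notifier(Protocol):
    """Minimal interface the code under test depends on."""

    def send(self, channel: str, text: str) -> None: ...


class MockSlackClient:
    """In-memory stand-in that records messages instead of calling Slack."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, channel: str, text: str) -> None:
        self.sent.append((channel, text))


def notify_triage(notifier: Notifier, incident_id: str, severity: str) -> None:
    """Code under test: it only knows about the Notifier interface."""
    notifier.send(
        "#incidents",
        f"[{severity.upper()}] {incident_id} triaged, awaiting approval",
    )


def test_notify_triage_records_message() -> None:
    mock = MockSlackClient()
    notify_triage(mock, "INC-42", "high")
    assert mock.sent == [("#incidents", "[HIGH] INC-42 triaged, awaiting approval")]


# pytest would discover the test by name; calling it directly also works.
test_notify_triage_records_message()
```

Because the workflow code depends on the interface rather than on a concrete client, the same test runs with no network access and no credentials.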

Run the full test suite locally:

pytest --cov=iras --cov-report=html

Getting Started

IRAS only requires an Anthropic API key:

git clone https://github.com/krishnashakula/IRAS
cd IRAS
export ANTHROPIC_API_KEY=your_key_here
docker-compose up

The mock clients are enabled by default, so you can test the full workflow without PagerDuty or Slack.

Architecture

IRAS is structured as a LangGraph state machine:

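from typing import Dict, TypedDict

from langgraph.graph import StateGraph

# Alert, TriageResult, RCAResult, RemediationPlan, and PostMortem are assumed
# to be the project's Pydantic models, and the *_node functions its node
# handlers; both are defined elsewhere and omitted here.

# Define incident state shared by every node
class IncidentState(TypedDict):
    alert: Alert
    triage: TriageResult
    rca: RCAResult
    remediation_plan: RemediationPlan
    post_mortem: PostMortem
    approvals: Dict[str, bool]

# Build workflow: one node per analysis step
graph = StateGraph(IncidentState)
graph.add_node("triage", triage_node)
graph.add_node("rca", rca_node)
graph.add_node("remediation", remediation_node)
graph.add_node("post_mortem", post_mortem_node)

# Add approval gates. A gate must be registered as a node before edges can
# reference it; approval_triage_node is an assumed handler name.
graph.add_node("approval_triage", approval_triage_node)
graph.add_edge("triage", "approval_triage")
graph.add_edge("approval_triage", "rca")
# ... more edges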

Each node uses Claude for analysis and Pydantic AI for structured outputs.
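An approval gate needs a branch: continue if the human approved, otherwise wait. LangGraph expresses branches with `add_conditional_edges`, which takes a source node and a function that returns the name of the next node. The sketch below shows one such routing function in isolation so it runs without LangGraph installed. `GateState` and the `await_triage_approval` node name are assumptions for illustration, not names from the IRAS codebase.

```python
from typing import Dict, TypedDict


class GateState(TypedDict, total=False):
    """Only the slice of the incident state that the router needs."""

    approvals: Dict[str, bool]


def route_after_triage_approval(state: GateState) -> str:
    """Return the next node name: proceed to RCA only if triage was approved."""
    if state.get("approvals", {}).get("triage", False):
        return "rca"
    # Hypothetical node that keeps the incident parked until a human responds.
    return "await_triage_approval"


approved_route = route_after_triage_approval({"approvals": {"triage": True}})
pending_route = route_after_triage_approval({"approvals": {}})
print(approved_route, pending_route)
```

In a real graph this function would be attached with something like `graph.add_conditional_edges("approval_triage", route_after_triage_approval)`. LangGraph also offers interrupt-based pausing for human input; the article does not say which mechanism IRAS uses.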

Real-World Impact

Reduces MTTR

Mean Time To Resolution drops because the analysis phase shrinks: routine incidents get analyzed in about 2 minutes instead of 30.

Eliminates Routine Wake-Ups

On-call engineers stop getting paged for incidents that don't require human judgment. Only serious incidents or approval decisions wake them up.

Maintains Human Control

Every action requires human approval. AI is a tool, not a replacement.

Comprehensive Post-Mortems

Automatic post-mortem generation means every incident gets documented, even routine ones.

Integration

IRAS integrates with:

  • PagerDuty: Fetch alerts, update incident status
  • Slack: Send notifications, get approvals
  • Mock clients: Test without external services
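For the Slack side, one simple way to send an approval request is an incoming webhook, which accepts a JSON body with a `text` field. The sketch below covers only the notification half and assumes a webhook-based setup. IRAS may well use a different Slack mechanism, and `build_approval_message` and `post_to_slack` are illustrative names.

```python
import json
import urllib.request
from typing import Dict


def build_approval_message(incident_id: str, stage: str, summary: str) -> Dict[str, str]:
    """Build the JSON body for a Slack incoming webhook (a single text field)."""
    return {"text": f"Approval needed for {incident_id} ({stage})\n{summary}"}


def post_to_slack(webhook_url: str, payload: Dict[str, str]) -> int:
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Building the payload needs no network access, so it is easy to unit test.
message = build_approval_message("INC-42", "remediation", "Plan: rotate logs on db-01")
print(json.dumps(message, indent=2))
```

Keeping payload construction separate from the HTTP call means the message format can be tested with no Slack workspace at all, which fits the mock-client approach described above.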

Why This Matters

For routine incidents, incident response is close to a solved problem: the analysis is predictable, the remediation is known, and the only variable is human approval. IRAS automates the predictable parts and keeps humans in control of the decisions.

For on-call engineers, this means:

  • Fewer 3 AM wake-ups
  • Faster incident resolution
  • Better post-mortems
  • More time for strategic work

Open Source

IRAS is open source and production-ready. Check it out: https://github.com/krishnashakula/IRAS

Built with Python, FastAPI, LangGraph, Pydantic AI, and Claude. 99% test coverage. Zero external test dependencies. Only requires an Anthropic API key.

Start automating your incident response today.
