Krishna shakula

Posted on May 8

IRAS: Building an Autonomous AI Agent for Incident Response

#ai #agents #devops #automation

Incident response is broken. When alerts fire at 3 AM, on-call engineers wake up to handle routine triage, root cause analysis, and remediation planning: work that mostly takes time and attention rather than human judgment. IRAS addresses this by automating the analysis steps of the incident response workflow with an AI agent, while keeping humans in control of every decision.

The Problem

Most on-call incidents follow a predictable pattern:

1. Alert fires
2. Engineer wakes up, triages the alert
3. Engineer investigates root cause
4. Engineer creates a remediation plan
5. Engineer executes the plan
6. Engineer writes a post-mortem

For routine incidents (disk full, memory leak, failed job retry), the first four steps don't require human judgment. They require pattern matching and analysis, which is exactly what AI is good at. Yet engineers still get paged.

The Solution: IRAS

IRAS is an autonomous AI agent built on production-grade technology:

FastAPI for HTTP endpoints
LangGraph for multi-step agentic workflows
Pydantic AI for structured outputs and validation
Claude (Anthropic) for reasoning and analysis
Python for the entire stack

How It Works

When an alert fires, IRAS runs the analysis workflow on its own and pauses for human approval after each stage:

Alert → Triage → RCA → Remediation Plan → Post-Mortem (each stage followed by a human approval gate)

Each step is handled by Claude with structured outputs validated by Pydantic. The entire workflow is orchestrated by LangGraph as a state machine with approval gates.
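To make "structured outputs" concrete, here is a minimal sketch of validating a triage result parsed from a model's JSON reply. It uses a standard-library dataclass as a stand-in for a Pydantic model so it runs without dependencies. The field names (`severity`, `category`, `summary`), the allowed severity values, and the `parse_triage` helper are assumptions for illustration, not taken from the IRAS codebase.

```python
import json
from dataclasses import dataclass

# Assumed vocabulary; the real project may use different levels.
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass(frozen=True)
class TriageResult:
    """Hypothetical triage output; field names are illustrative."""
    severity: str
    category: str
    summary: str

    def __post_init__(self) -> None:
        # Reject values outside the expected vocabulary, as a schema would.
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.summary.strip():
            raise ValueError("summary must not be empty")


def parse_triage(raw_reply: str) -> TriageResult:
    """Parse a model's JSON reply and validate it before the workflow continues."""
    data = json.loads(raw_reply)
    return TriageResult(
        severity=data["severity"],
        category=data["category"],
        summary=data["summary"],
    )


reply = '{"severity": "high", "category": "disk_full", "summary": "Root volume at 98%"}'
triage = parse_triage(reply)
print(triage.severity, triage.category)
```

The point of validating at this boundary is that a malformed or out-of-vocabulary reply fails immediately, before it reaches a later stage or a human approver.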

Key metric: under 2 minutes from alert to a proposed remediation plan.

Human-in-the-Loop Control

IRAS doesn't execute remediation automatically. Every step requires human approval:

Triage approval: Confirm the incident classification
RCA approval: Confirm the root cause analysis
Remediation approval: Approve the remediation plan before execution
Post-mortem approval: Review the generated post-mortem
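The gating behavior described above can be sketched in plain Python. This is an illustration of the idea only, not the project's LangGraph implementation: the step names mirror the list above, and the `approve` callback is a hypothetical stand-in for a human decision (for example, a Slack button).

```python
from typing import Callable, List

STEPS = ["triage", "rca", "remediation", "post_mortem"]


def run_with_gates(approve: Callable[[str], bool]) -> List[str]:
    """Run steps in order, stopping at the first one a human rejects."""
    completed: List[str] = []
    for step in STEPS:
        # The AI-generated output for `step` would be produced here.
        if not approve(step):
            break  # halt: nothing downstream runs without sign-off
        completed.append(step)
    return completed


# A reviewer who accepts triage and RCA but rejects the remediation plan.
decisions = {"triage": True, "rca": True, "remediation": False, "post_mortem": True}
print(run_with_gates(lambda step: decisions[step]))  # prints ['triage', 'rca']
```

Because a rejection stops the loop, a rejected remediation plan never reaches execution, and no post-mortem is generated for a plan that was not approved.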

AI does the heavy lifting. Humans stay in control.

Production-Grade Reliability

IRAS isn't a prototype. It's built for production:

99% test coverage with 292 passing tests
Zero external test dependencies: mock clients are included for local development and testing
Integrated observability: logging, PagerDuty, and Slack support
Docker-ready: run locally or in production

Testing Strategy

The test suite includes:

Unit tests for each workflow step
Integration tests for the full incident response workflow
Mock PagerDuty and Slack clients for isolated testing
No external service dependencies
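As a rough illustration of how a mock client keeps tests isolated, the sketch below records messages in memory instead of calling Slack. The `Notifier` protocol, `MockSlackClient`, `announce_triage`, and the `#incidents` channel are all hypothetical names chosen for this example; the mock clients that ship with IRAS may look different.

```python
from typing import List, Protocol, Tuple


class Notifier(Protocol):
    """Minimal interface the code under test depends on (assumed)."""

    def send(self, channel: str, text: str) -> None: ...


class MockSlackClient:
    """In-memory stand-in that records messages instead of calling Slack."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, channel: str, text: str) -> None:
        self.sent.append((channel, text))


def announce_triage(notifier: Notifier, incident_id: str, severity: str) -> None:
    """Code under test: it only knows about the Notifier interface."""
    notifier.send("#incidents", f"[{severity}] {incident_id}: triage ready for approval")


def test_announce_triage_records_message() -> None:
    slack = MockSlackClient()
    announce_triage(slack, "INC-1", "high")
    assert slack.sent == [("#incidents", "[high] INC-1: triage ready for approval")]


test_announce_triage_records_message()
```

Since the workflow code depends on the interface rather than on a concrete Slack client, the same test runs offline and in CI without credentials.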

Run the full test suite locally:

pytest --cov=iras --cov-report=html

Getting Started

IRAS only requires an Anthropic API key:

git clone https://github.com/krishnashakula/IRAS
cd IRAS
export ANTHROPIC_API_KEY=your_key_here
docker-compose up

The mock clients are enabled by default, so you can test the full workflow without PagerDuty or Slack.

Architecture

IRAS is structured as a LangGraph state machine:

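from typing import Dict, TypedDict

from langgraph.graph import START, StateGraph

# Alert, TriageResult, RCAResult, RemediationPlan, and PostMortem are the
# project's Pydantic models; the *_node functions are defined elsewhere.

# Define incident state shared by every node
class IncidentState(TypedDict):
    alert: Alert
    triage: TriageResult
    rca: RCAResult
    remediation_plan: RemediationPlan
    post_mortem: PostMortem
    approvals: Dict[str, bool]  # step name -> approved?

# Build workflow
graph = StateGraph(IncidentState)
graph.add_node("triage", triage_node)
# Gate node must be registered before it is wired up; the function name
# approval_triage_node is assumed here.
graph.add_node("approval_triage", approval_triage_node)
graph.add_node("rca", rca_node)
graph.add_node("remediation", remediation_node)
graph.add_node("post_mortem", post_mortem_node)

# Add approval gates
graph.add_edge(START, "triage")  # entry point
graph.add_edge("triage", "approval_triage")
graph.add_edge("approval_triage", "rca")
# ... more edges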

Each node uses Claude for analysis and Pydantic AI for structured outputs.
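A LangGraph node is a function that receives the current state and returns only the keys it wants to update. The sketch below shows that shape for a triage node without requiring LangGraph or an API key. The `make_triage_node` factory, the `ask_model` callable, the prompt wording, and the simplified state are assumptions for illustration; a real node would call Claude and validate the reply against a proper schema.

```python
import json
from typing import Callable, Dict, TypedDict


class IncidentState(TypedDict, total=False):
    # Simplified state for this sketch; plain dicts replace the real models.
    alert: Dict[str, str]
    triage: Dict[str, str]


def make_triage_node(ask_model: Callable[[str], str]) -> Callable[[IncidentState], dict]:
    """Build a node that asks a model to classify the alert.

    `ask_model` is a hypothetical callable (prompt in, JSON text out)
    standing in for a real Claude call.
    """

    def triage_node(state: IncidentState) -> dict:
        prompt = f"Classify this alert as JSON with 'severity': {state['alert']['message']}"
        result = json.loads(ask_model(prompt))
        if "severity" not in result:
            raise ValueError("model reply is missing 'severity'")
        # Return only the keys this node updates; the graph merges them into state.
        return {"triage": result}

    return triage_node


def fake_model(prompt: str) -> str:
    # Canned reply so the sketch runs offline.
    return '{"severity": "high"}'


node = make_triage_node(fake_model)
update = node({"alert": {"message": "disk usage at 98% on db-1"}})
print(update)  # prints {'triage': {'severity': 'high'}}
```

Injecting the model call as a parameter is one way to let the same node run against a real client in production and a canned reply in tests.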

Real-World Impact

Reduces MTTR

The analysis phase shrinks, which lowers Mean Time To Resolution (MTTR): routine incidents get analyzed in 2 minutes instead of 30.

Eliminates Routine Wake-Ups

On-call engineers stop getting paged for incidents that don't require human judgment. Only serious incidents or approval decisions wake them up.

Maintains Human Control

Every action requires human approval. AI is a tool, not a replacement.

Comprehensive Post-Mortems

Automatic post-mortem generation means every incident gets documented, even routine ones.

Integration

IRAS integrates with:

PagerDuty: Fetch alerts, update incident status
Slack: Send notifications, get approvals
Mock clients: Test without external services

Why This Matters

Incident response is a solved problem for routine incidents. The analysis is predictable. The remediation is known. The only variable is human approval. IRAS automates the predictable parts and keeps humans in control of the decisions.

For on-call engineers, this means:

Fewer 3 AM wake-ups
Faster incident resolution
Better post-mortems
More time for strategic work

Open Source

IRAS is open source and production-ready. Check it out: https://github.com/krishnashakula/IRAS

Built with Python, FastAPI, LangGraph, Pydantic AI, and Claude. 99% test coverage. Zero external test dependencies. Only requires an Anthropic API key.

Start automating your incident response today.
