DEV Community

Krishna shakula
Krishna shakula

Posted on

IRAS: Building a Production-Grade Autonomous Incident Response Agent

IRAS: Building a Production-Grade Autonomous Incident Response Agent

Incident response at 3 AM is brutal. Your on-call engineer is woken up, scrambles to understand what's broken, manually triages the issue, performs root cause analysis, and then—if they're lucky—can finally propose a fix. This process typically takes 30+ minutes and burns out your team.

We built IRAS to automate this entire workflow. When an alert fires, IRAS triages the incident, performs RCA, generates a remediation plan, and drafts a post-mortem—all within 2 minutes. Your engineer reviews and approves the fix. That's it.

The Problem

Incident response is repetitive and exhausting:

  1. Alert fires → on-call engineer wakes up
  2. Manual triage → what's the severity? what's affected?
  3. Root cause analysis → why did this happen?
  4. Remediation planning → what's the fix?
  5. Post-mortem → document what happened and why
  6. Execution → apply the fix
  7. Follow-up → prevent recurrence

Steps 2-5 are highly repetitive and can be automated. IRAS handles all of them.

The Solution: IRAS

IRAS is an autonomous AI agent built on Claude, LangGraph, and FastAPI. It follows a deterministic workflow:

Alert → Triage → RCA → Remediation → Post-mortem → Human Approval → Execution
Enter fullscreen mode Exit fullscreen mode

Key Features

1. Fully Autonomous with Human Approval Gates

  • The agent makes decisions at each step (triage severity, identify root cause, propose fix)
  • Human approval is required before any remediation is executed
  • Safety-first design: no auto-remediation without review

2. Sub-2-Minute End-to-End Handling

  • Alert ingestion to remediation proposal in <120 seconds
  • Reduces on-call burden significantly
  • Enables faster incident resolution

3. Production-Grade Reliability

  • 99% test coverage with 292 passing tests
  • Comprehensive logging and observability
  • Deterministic workflow with structured outputs

4. Zero External Service Dependencies

  • Mock clients for Slack and PagerDuty included
  • No vendor lock-in
  • Runs entirely on your infrastructure

5. Automatic Post-Mortem Generation

  • Generates incident narratives automatically
  • Includes root cause, impact, and remediation details
  • Reduces post-incident documentation burden

Architecture

Tech Stack

  • FastAPI: REST API for alert ingestion and workflow orchestration
  • LangGraph: Multi-step agentic workflow with state management
  • Pydantic AI: Type-safe agent definitions and structured outputs
  • Claude: Core reasoning engine for triage, RCA, and remediation
  • Pytest: Comprehensive test suite with 99% coverage

Workflow Design

The agent follows a multi-step workflow:

  1. Alert Ingestion: Receives alert from monitoring system (Prometheus, DataDog, etc.)
  2. Incident Triage: Analyzes alert to determine severity, affected services, and impact
  3. Root Cause Analysis: Investigates logs, metrics, and system state to identify root cause
  4. Remediation Planning: Generates a step-by-step fix based on the root cause
  5. Post-Mortem Generation: Drafts incident narrative with timeline and learnings
  6. Human Approval: On-call engineer reviews and approves the proposed fix
  7. Execution: Applies the remediation (if approved)

Each step uses Claude with structured outputs (Pydantic) to ensure reliability and parseability.

Human-in-the-Loop Safety

No auto-remediation happens without human approval. The workflow is designed to:

  • Provide clear, actionable recommendations
  • Enable quick review and approval
  • Maintain human control and oversight
  • Reduce on-call burden without sacrificing safety

Testing and Reliability

IRAS includes 292 passing tests with 99% code coverage. Testing covers:

  • Unit tests: Individual agent steps (triage, RCA, remediation)
  • Integration tests: Full workflow end-to-end
  • Mock clients: Slack and PagerDuty mocked for testing without external dependencies
  • Edge cases: Handling of incomplete data, ambiguous root causes, etc.

The test suite ensures the agent behaves predictably and reliably in production.

Getting Started

Prerequisites

  • Python 3.11+
  • Docker (optional, for containerized deployment)
  • Anthropic API key (for Claude access)

Quick Start

# Clone the repo
git clone https://github.com/krishnashakula/IRAS.git
cd IRAS

# Install dependencies
pip install -r requirements.txt

# Set your Anthropic API key
export ANTHROPIC_API_KEY="your-key-here"

# Run the agent
python -m iras.main
Enter fullscreen mode Exit fullscreen mode

That's it. No complex setup, no vendor lock-in.

Docker Deployment

docker build -t iras .
docker run -e ANTHROPIC_API_KEY="your-key-here" iras
Enter fullscreen mode Exit fullscreen mode

Real-World Impact

In simulated production scenarios, IRAS:

  • Reduces on-call burden by 80%+: Eliminates manual triage and RCA
  • Accelerates incident resolution: Sub-2-minute response time
  • Improves post-mortem quality: Automatic, comprehensive incident narratives
  • Maintains safety: Human approval gates ensure control

Design Decisions

Why LangGraph?

LangGraph provides deterministic, multi-step workflows with state management. Unlike simple prompt chains, LangGraph enables:

  • Clear decision points and branching logic
  • State persistence across steps
  • Easy debugging and observability
  • Integration with human approval gates

Why Pydantic AI?

Structured outputs are critical for reliability. Pydantic AI ensures:

  • Type-safe agent definitions
  • Guaranteed parseability of agent responses
  • Validation at each step
  • Easy integration with downstream systems

Why Mock Clients?

Zero external dependencies means:

  • No Slack/PagerDuty API rate limits during testing
  • Deterministic test behavior
  • Faster test execution
  • Easier local development

Limitations and Future Work

Current Limitations:

  • Requires well-structured alert data (severity, service, description)
  • RCA quality depends on available logs and metrics
  • Remediation proposals are suggestions, not guaranteed fixes

Future Enhancements:

  • Multi-model support (GPT-4, Gemini, etc.)
  • Custom remediation playbooks
  • Integration with more monitoring systems
  • Feedback loops to improve RCA accuracy

Contributing

IRAS is open-source and welcomes contributions. Areas for improvement:

  • Additional test coverage
  • Performance optimizations
  • New integrations (monitoring systems, incident management platforms)
  • Documentation and examples

See the GitHub repo for contribution guidelines.

Conclusion

Incident response doesn't have to be painful. IRAS automates the repetitive parts while keeping humans in control. With 99% test coverage, zero external dependencies, and a production-grade stack, it's ready for real-world use.

If you're tired of 3 AM incident response, give IRAS a try. Your on-call engineer will thank you.

Get started: https://github.com/krishnashakula/IRAS


Have feedback or ideas? Open an issue or PR on GitHub. Let's make incident response less painful for everyone.

Top comments (0)