IRAS: Building a Production-Grade Autonomous Incident Response Agent
Incident response at 3 AM is brutal. Your on-call engineer is woken up, scrambles to understand what's broken, manually triages the issue, performs root cause analysis, and then—if they're lucky—can finally propose a fix. This process typically takes 30+ minutes and burns out your team.
We built IRAS to automate this entire workflow. When an alert fires, IRAS triages the incident, performs RCA, generates a remediation plan, and drafts a post-mortem—all within 2 minutes. Your engineer reviews and approves the fix. That's it.
The Problem
Incident response is repetitive and exhausting:
1. Alert fires → on-call engineer wakes up
2. Manual triage → what's the severity? what's affected?
3. Root cause analysis → why did this happen?
4. Remediation planning → what's the fix?
5. Post-mortem → document what happened and why
6. Execution → apply the fix
7. Follow-up → prevent recurrence
Steps 2-5 are highly repetitive and can be automated. IRAS handles all of them.
The Solution: IRAS
IRAS is an autonomous AI agent built on Claude, LangGraph, Pydantic AI, and FastAPI. It follows a deterministic workflow, meaning the steps always run in the same order:
Alert → Triage → RCA → Remediation → Post-mortem → Human Approval → Execution
Key Features
1. Autonomous Analysis with Human Approval Gates
- The agent makes decisions at each step (triage severity, identify root cause, propose fix)
- Human approval is required before any remediation is executed
- Safety-first design: no auto-remediation without review
2. Sub-2-Minute End-to-End Handling
- Alert ingestion to remediation proposal in <120 seconds
- Reduces on-call burden significantly
- Enables faster incident resolution
3. Production-Grade Reliability
- 99% test coverage with 292 passing tests
- Comprehensive logging and observability
- Deterministic workflow with structured outputs
4. Minimal External Service Dependencies
- Mock clients for Slack and PagerDuty included
- No lock-in to a chat or paging vendor
- Apart from calls to the Anthropic API (for Claude), runs entirely on your infrastructure
5. Automatic Post-Mortem Generation
- Generates incident narratives automatically
- Includes root cause, impact, and remediation details
- Reduces post-incident documentation burden
Architecture
Tech Stack
- FastAPI: REST API for alert ingestion and workflow orchestration
- LangGraph: Multi-step agentic workflow with state management
- Pydantic AI: Type-safe agent definitions and structured outputs
- Claude: Core reasoning engine for triage, RCA, and remediation
- Pytest: Comprehensive test suite with 99% coverage
Workflow Design
The agent follows a multi-step workflow:
- Alert Ingestion: Receives alert from monitoring system (Prometheus, DataDog, etc.)
- Incident Triage: Analyzes alert to determine severity, affected services, and impact
- Root Cause Analysis: Investigates logs, metrics, and system state to identify root cause
- Remediation Planning: Generates a step-by-step fix based on the root cause
- Post-Mortem Generation: Drafts incident narrative with timeline and learnings
- Human Approval: On-call engineer reviews and approves the proposed fix
- Execution: Applies the remediation (if approved)
Each step uses Claude with structured outputs (Pydantic) to ensure reliability and parseability.
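The step order above can be sketched in plain Python. This is a minimal illustration, not the IRAS code: the `IncidentState` fields and the step functions are hypothetical, each step is a stub where IRAS would call Claude, and a plain loop stands in for the LangGraph graph.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IncidentState:
    """Hypothetical shared state handed from one step to the next."""
    alert: dict
    severity: Optional[str] = None
    root_cause: Optional[str] = None
    plan: list = field(default_factory=list)
    post_mortem: Optional[str] = None
    trace: list = field(default_factory=list)  # names of completed steps


def triage(state):
    # Stub: a real step would ask the model; here we reuse the alert's severity.
    state.severity = state.alert.get("severity", "unknown")
    return state


def rca(state):
    # Stub: a real step would inspect logs and metrics.
    service = state.alert.get("service", "unknown service")
    state.root_cause = f"suspected fault in {service}"
    return state


def plan_remediation(state):
    # Stub: a fixed plan stands in for a model-generated one.
    state.plan = ["roll back the latest deploy", "verify health checks"]
    return state


def draft_post_mortem(state):
    steps = "; ".join(state.plan)
    state.post_mortem = (
        f"Severity {state.severity}: {state.root_cause}. Proposed fix: {steps}."
    )
    return state


# The fixed order is what makes the workflow deterministic in shape.
STEPS = [triage, rca, plan_remediation, draft_post_mortem]


def run_pipeline(alert):
    """Run every step in order and record which steps completed."""
    state = IncidentState(alert=alert)
    for step in STEPS:
        state = step(state)
        state.trace.append(step.__name__)
    return state
```

Calling `run_pipeline({"severity": "critical", "service": "checkout"})` returns a state whose `trace` lists the four steps in order. Human approval and execution are left out here because they depend on an external decision.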
Human-in-the-Loop Safety
No auto-remediation happens without human approval. The workflow is designed to:
- Provide clear, actionable recommendations
- Enable quick review and approval
- Maintain human control and oversight
- Reduce on-call burden without sacrificing safety
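One way to enforce the "no approval, no execution" rule is to make the executor itself refuse to run without an explicit approval record. The sketch below assumes hypothetical names (`Approval`, `ApprovalRequired`, `execute_plan`) and stubs out the actual remediation; it is not taken from the IRAS codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Approval:
    """Hypothetical record of a human decision on a proposed fix."""
    approver: str
    approved: bool


class ApprovalRequired(Exception):
    """Raised when execution is attempted without an explicit approval."""


def execute_plan(plan: list[str], approval: Optional[Approval]) -> list[str]:
    # Both a missing decision and an explicit rejection block execution.
    if approval is None or not approval.approved:
        raise ApprovalRequired("remediation was not approved by a human")
    # Stub: return a log line per step instead of touching real systems.
    return [f"executed: {step} (approved by {approval.approver})" for step in plan]
```

Putting the check inside the executor, rather than in the caller, means a bug elsewhere in the workflow cannot skip the gate by accident.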
Testing and Reliability
IRAS includes 292 passing tests with 99% code coverage. Testing covers:
- Unit tests: Individual agent steps (triage, RCA, remediation)
- Integration tests: Full workflow end-to-end
- Mock clients: Slack and PagerDuty mocked for testing without external dependencies
- Edge cases: Handling of incomplete data, ambiguous root causes, etc.
The test suite gives us confidence that the workflow behaves predictably, including when inputs are incomplete or ambiguous.
Getting Started
Prerequisites
- Python 3.11+
- Docker (optional, for containerized deployment)
- Anthropic API key (for Claude access)
Quick Start
# Clone the repo
git clone https://github.com/krishnashakula/IRAS.git
cd IRAS
# Install dependencies
pip install -r requirements.txt
# Set your Anthropic API key
export ANTHROPIC_API_KEY="your-key-here"
# Run the agent
python -m iras.main
That's it. No complex setup, no vendor lock-in.
Docker Deployment
docker build -t iras .
docker run -e ANTHROPIC_API_KEY="your-key-here" iras
Real-World Impact
In simulated production scenarios, IRAS:
- Reduces on-call burden by 80%+: Eliminates manual triage and RCA
- Accelerates incident resolution: Sub-2-minute response time
- Improves post-mortem quality: Automatic, comprehensive incident narratives
- Maintains safety: Human approval gates ensure control
Design Decisions
Why LangGraph?
LangGraph provides deterministic, multi-step workflows with state management. Unlike simple prompt chains, LangGraph enables:
- Clear decision points and branching logic
- State persistence across steps
- Easy debugging and observability
- Integration with human approval gates
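To make "decision points and branching logic" concrete, here is a small graph runner in plain Python. It is a stand-in for LangGraph, not its API, and the escalation branch for ambiguous root causes is a hypothetical example rather than documented IRAS behavior.

```python
def rca_node(state):
    # Stub: accept a root cause only when exactly one candidate remains.
    candidates = state["candidate_causes"]
    state["root_cause"] = candidates[0] if len(candidates) == 1 else None
    return state


def remediation_node(state):
    state["plan"] = [f"mitigate: {state['root_cause']}"]
    return state


def escalate_node(state):
    # Hypothetical branch: hand ambiguous incidents straight to a human.
    state["escalated"] = True
    return state


def route_after_rca(state):
    """Decision point: choose the next node from the current state."""
    return "remediation" if state["root_cause"] else "escalate"


NODES = {"rca": rca_node, "remediation": remediation_node, "escalate": escalate_node}
# A node without a router is terminal.
ROUTERS = {"rca": route_after_rca}


def run(state, start="rca"):
    """Walk the graph, carrying the same state dict across every node."""
    state.setdefault("visited", [])
    current = start
    while current is not None:
        state = NODES[current](state)
        state["visited"].append(current)
        router = ROUTERS.get(current)
        current = router(state) if router else None
    return state
```

With a single candidate cause the run visits `rca` then `remediation`; with two candidates it visits `rca` then `escalate` and never produces a plan. The `visited` list is the kind of trace that makes such a workflow easy to debug.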
Why Pydantic AI?
Structured outputs are critical for reliability. Pydantic AI ensures:
- Type-safe agent definitions
- Agent responses that either parse into the expected schema or fail validation explicitly
- Validation at each step
- Easy integration with downstream systems
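The idea behind validated structured output can be shown with the standard library alone. The sketch below uses a dataclass in place of a Pydantic model, and the `TriageResult` fields and allowed severities are assumptions made for illustration, not the IRAS schema.

```python
import json
from dataclasses import dataclass

# Assumed severity vocabulary; the real schema may differ.
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass(frozen=True)
class TriageResult:
    severity: str
    affected_services: tuple
    summary: str

    def __post_init__(self):
        # Validation runs on construction, so an invalid result cannot exist.
        if self.severity not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.affected_services:
            raise ValueError("affected_services must not be empty")


def parse_triage(raw: str) -> TriageResult:
    """Parse a model reply as JSON; raise ValueError on any schema problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    try:
        services = data["affected_services"]
        if not isinstance(services, list):
            raise ValueError("affected_services must be a list")
        return TriageResult(
            severity=data["severity"],
            affected_services=tuple(services),
            summary=data["summary"],
        )
    except KeyError as exc:
        raise ValueError(f"missing field: {exc}") from exc
```

Downstream steps then receive either a well-formed `TriageResult` or an exception they can log or retry on, never a half-parsed reply.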
Why Mock Clients?
Mocking Slack and PagerDuty means:
- No Slack/PagerDuty API rate limits during testing
- Deterministic test behavior
- Faster test execution
- Easier local development
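A mock client in this style can be as small as an object that records what it was asked to send. The `Notifier` protocol, `MockSlackClient` class, and `announce_incident` helper below are hypothetical names chosen for this sketch; the mocks shipped with IRAS may look different.

```python
from typing import Protocol


class Notifier(Protocol):
    """Anything that can deliver a message to a channel."""
    def send(self, channel: str, text: str) -> None: ...


class MockSlackClient:
    """In-memory stand-in that records messages instead of calling Slack."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send(self, channel: str, text: str) -> None:
        self.sent.append((channel, text))


def announce_incident(notifier: Notifier, service: str, severity: str) -> None:
    # The caller depends only on the Notifier shape, so a real client
    # and the mock are interchangeable.
    notifier.send("#incidents", f"[{severity.upper()}] incident on {service}")
```

A test can then assert on `mock.sent` directly, with no network access and no rate limits to worry about.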
Limitations and Future Work
Current Limitations:
- Requires well-structured alert data (severity, service, description)
- RCA quality depends on available logs and metrics
- Remediation proposals are suggestions, not guaranteed fixes
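Given the first limitation, it can help to reject incomplete alerts before they reach the agent. This sketch checks only the three fields named above (severity, service, description); the function name and the decision to treat blank strings as missing are assumptions.

```python
REQUIRED_FIELDS = ("severity", "service", "description")


def missing_alert_fields(alert: dict) -> list[str]:
    """Return the required fields that are absent or blank, in a fixed order."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(alert.get(name) or "").strip()
    ]
```

An empty result means the alert carries the minimum the workflow expects; anything else can be logged or sent back to the monitoring system with the list of missing fields.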
Future Enhancements:
- Multi-model support (GPT-4, Gemini, etc.)
- Custom remediation playbooks
- Integration with more monitoring systems
- Feedback loops to improve RCA accuracy
Contributing
IRAS is open-source and welcomes contributions. Areas for improvement:
- Additional test coverage
- Performance optimizations
- New integrations (monitoring systems, incident management platforms)
- Documentation and examples
See the GitHub repo for contribution guidelines.
Conclusion
Incident response doesn't have to be painful. IRAS automates the repetitive parts while keeping humans in control. With 99% test coverage, mock clients for Slack and PagerDuty, and a production-grade stack, it's ready to try against real alerts.
If you're tired of 3 AM incident response, give IRAS a try. Your on-call engineer will thank you.
Get started: https://github.com/krishnashakula/IRAS
Have feedback or ideas? Open an issue or PR on GitHub. Let's make incident response less painful for everyone.