Posted on Feb 8

Silent Triage: The Zero-Chat War Room Protocol (Algolia Agent Studio Challenge)

#algoliachallenge #ai #devops #react

"Downtime costs an average of $9,000 per minute. In a 3 AM production crisis, you don't need a conversation. You need a deterministic decision."

💡 The Inspiration: Beyond the Chatbot

Every SRE knows alert fatigue. When a P0 incident hits, every second counts. Traditional AI assistants are too conversational: they waste time on greetings and "How can I help you today?" prompts.

I built Silent Triage to implement the Zero-Chat Protocol: a high-speed, RAG-driven interface that turns raw, messy logs into actionable fixes without the small talk.

🛡️ What is Silent Triage?

Silent Triage is a specialized Incident Response agent that acts as a digital first responder.

✅ Analyzes messy logs: Paste raw stack traces or alerts directly.

✅ Classifies Severity: Automatically assigns a level from P0 (Critical) to P3 (Minor).

✅ Retrieves Ground Truth: Uses Algolia to search for past incident resolutions.

✅ Prescribes Action: Delivers a structured JSON-based remediation plan.

🎥 Watch the 90-second Demo

See how the agent identifies a Database Connection Timeout and suggests a fix based on historical data.

🧩 Technical Architecture: The "Ground Truth" Engine

The system is built on a high-performance RAG (Retrieval-Augmented Generation) stack using Algolia Agent Studio.

1️⃣ The Memory: Algolia Search Index

I populated an Algolia index named incident_history with structured data from past production failures. This ensures the agent's logic is grounded in reality, not hallucinations.

The Knowledge Base (incident_history.json):

[
  {
    "objectID": "inc-001",
    "title": "Database Connection Timeout - Production",
    "description": "Error: SequelizeConnectionError: connect ETIMEDOUT. The database cluster is not responding to heartbeat checks.",
    "severity": "P0",
    "cause": "Database connection pool exhausted due to unclosed connections.",
    "action": "Restart the primary database node and increase the max_connections limit in RDS.",
    "tags": ["database", "timeout", "critical", "backend"]
  },
  {
    "objectID": "inc-002",
    "title": "Slow Page Loads - Frontend Assets",
    "description": "Users reporting 5+ seconds to load the dashboard. Static assets (JS/CSS) are taking too long to download.",
    "severity": "P2",
    "cause": "CDN cache invalidation failed after the last deploy.",
    "action": "Purge CloudFront cache and verify S3 bucket permissions.",
    "tags": ["frontend", "performance", "cdn"]
  },
  {
    "objectID": "inc-003",
    "title": "Failed User Registration - API 500",
    "description": "POST /api/v1/register returning 500 Internal Server Error. Log: null pointer exception at UserService.java:45.",
    "severity": "P1",
    "cause": "Missing validation for null email addresses in the legacy registration flow.",
    "action": "Rollback to the previous stable build (v1.2.4) and add null-check in the UserService.",
    "tags": ["api", "java", "500-error", "auth"]
  },
  {
    "objectID": "inc-004",
    "title": "Broken Images in Product Catalog",
    "description": "Images not rendering in the mobile app. Getting 403 Forbidden when fetching from the media server.",
    "severity": "P3",
    "cause": "Expired SSL certificate on the media subdomain.",
    "action": "Renew the Let's Encrypt certificate via Certbot.",
    "tags": ["images", "ssl", "minor"]
  }
]
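Because the agent's answers are only as good as these records, it can help to check each one before it is uploaded. The following is a minimal sketch of such a check, assuming the field names shown in the JSON above; the `IncidentRecord` type and the `validateIncident` and `isIncidentRecord` helpers are hypothetical and not taken from the project's source.

```typescript
// Hypothetical types and helpers; only the field names come from the records above.
type Severity = "P0" | "P1" | "P2" | "P3";

interface IncidentRecord {
  objectID: string;
  title: string;
  description: string;
  severity: Severity;
  cause: string;
  action: string;
  tags: string[];
}

const SEVERITIES: readonly string[] = ["P0", "P1", "P2", "P3"];
const TEXT_FIELDS = ["objectID", "title", "description", "cause", "action"] as const;

// Returns a list of problems; an empty list means the record looks safe to index.
function validateIncident(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) return ["record is not an object"];
  const rec = raw as Record<string, unknown>;
  const problems: string[] = [];
  for (const field of TEXT_FIELDS) {
    const value = rec[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`${field} must be a non-empty string`);
    }
  }
  const severity = rec["severity"];
  if (typeof severity !== "string" || SEVERITIES.indexOf(severity) === -1) {
    problems.push("severity must be one of P0, P1, P2, P3");
  }
  const tags = rec["tags"];
  if (!Array.isArray(tags) || !tags.every((t) => typeof t === "string")) {
    problems.push("tags must be an array of strings");
  }
  return problems;
}

// Type guard built on the validator, for use before an upload step.
function isIncidentRecord(raw: unknown): raw is IncidentRecord {
  return validateIncident(raw).length === 0;
}
```

Running this over the file before indexing would catch a typo such as a severity of "P4" or a `tags` field that is a single string instead of an array.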

2️⃣ The Brain: Algolia Agent Studio

I used Agent Studio to orchestrate the intelligence layer. By connecting the incident_history index as a Search Tool, the agent "researches" historical data before formulating a response.

Agent Configuration & System Prompt:

Role: Professional SRE & DevOps Incident Triage Expert.
Objective: Analyze the user's error/incident report, SEARCH the 'incident_history' index for context, and provide a structured JSON decision.

Context: You have access to a database of past incidents via the Algolia Search Tool. Use it to find similar patterns.

Instructions:
1. Analyze the user's input.
2. Search the index for similar past issues.
3. Classify severity and recommend actions based on search results.
4. Output ONLY valid JSON. Do not use Markdown formatting.

🛠️ The "Silent" Protocol: Structured Output

To maintain the "Zero-Chat" standard, I engineered a strict JSON schema. This allows the frontend to render the solution in a tactical HUD (Heads-Up Display) immediately.

Response Schema:

{
  "severity": "P0" | "P1" | "P2" | "P3",
  "probable_cause": "Brief technical explanation (max 1 sentence)",
  "recommended_action": "Concrete steps to fix or mitigate",
  "related_incident_ids": ["List of objectIDs found"],
  "confidence_score": Number (0-100),
  "language": "es" | "en"
}
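On the frontend, a schema like this is only useful if the reply is checked before it reaches the HUD. Here is a minimal sketch of a parser for it, assuming the agent's reply arrives as a raw string; `TriageDecision`, `stripFence`, and `parseTriageDecision` are hypothetical names, and tolerating a stray Markdown fence is my own defensive assumption rather than documented behavior.

```typescript
// Hypothetical parser; the field names follow the response schema, the rest is assumed.
type Severity = "P0" | "P1" | "P2" | "P3";

interface TriageDecision {
  severity: Severity;
  probable_cause: string;
  recommended_action: string;
  related_incident_ids: string[];
  confidence_score: number;
  language: "es" | "en";
}

const FENCE = "`".repeat(3);

// Removes a surrounding Markdown code fence in case the model adds one anyway.
function stripFence(reply: string): string {
  let text = reply.trim();
  if (text.startsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  if (text.endsWith(FENCE)) text = text.slice(0, -FENCE.length);
  return text.trim();
}

function isSeverity(v: unknown): v is Severity {
  return v === "P0" || v === "P1" || v === "P2" || v === "P3";
}

function isLanguage(v: unknown): v is "es" | "en" {
  return v === "es" || v === "en";
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((item) => typeof item === "string");
}

// Parses the raw reply and throws if it does not conform to the schema.
function parseTriageDecision(rawReply: string): TriageDecision {
  const data: unknown = JSON.parse(stripFence(rawReply));
  if (typeof data !== "object" || data === null) throw new Error("reply is not a JSON object");
  const d = data as Record<string, unknown>;
  const severity = d["severity"];
  const cause = d["probable_cause"];
  const action = d["recommended_action"];
  const related = d["related_incident_ids"];
  const score = d["confidence_score"];
  const language = d["language"];
  if (!isSeverity(severity)) throw new Error("invalid severity");
  if (typeof cause !== "string") throw new Error("invalid probable_cause");
  if (typeof action !== "string") throw new Error("invalid recommended_action");
  if (!isStringArray(related)) throw new Error("invalid related_incident_ids");
  if (typeof score !== "number" || score < 0 || score > 100) throw new Error("invalid confidence_score");
  if (!isLanguage(language)) throw new Error("invalid language");
  return {
    severity,
    probable_cause: cause,
    recommended_action: action,
    related_incident_ids: related,
    confidence_score: score,
    language,
  };
}
```

Failing loudly here means a malformed reply surfaces as an explicit error state instead of a half-rendered panel.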

🏗️ Frontend Architecture: React + Custom Hooks

The UI is built with React following a clean architecture pattern:

src/
├── components/     # Reusable UI components
├── hooks/          # Custom React hooks for state management
├── services/       # API integration layer (Algolia Agent Studio)
└── utils/          # Helper functions and parsers

Key technical decisions:

✅ Custom hooks for agent communication

✅ Service layer abstraction for API calls

✅ Component-based architecture for maintainability

✅ Dark theme optimized for high-stress environments
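To illustrate the "custom hooks for agent communication" decision, here is a minimal sketch of the state machine such a hook might wrap. It is an assumption about the design, not code from the repository: `TriageState`, `TriageAction`, and `triageReducer` are hypothetical names, and the decision payload is left as `unknown` on purpose.

```typescript
// Hypothetical state machine for a triage hook; names are illustrative only.
type TriageState =
  | { status: "idle" }
  | { status: "loading"; input: string }
  | { status: "success"; input: string; decision: unknown }
  | { status: "error"; input: string; message: string };

type TriageAction =
  | { type: "submit"; input: string }
  | { type: "resolve"; decision: unknown }
  | { type: "reject"; message: string }
  | { type: "reset" };

const initialState: TriageState = { status: "idle" };

function triageReducer(state: TriageState, action: TriageAction): TriageState {
  switch (action.type) {
    case "submit":
      // A new submission always replaces whatever was on screen.
      return { status: "loading", input: action.input };
    case "resolve":
      // Ignore responses that arrive when no request is in flight.
      return state.status === "loading"
        ? { status: "success", input: state.input, decision: action.decision }
        : state;
    case "reject":
      return state.status === "loading"
        ? { status: "error", input: state.input, message: action.message }
        : state;
    case "reset":
      return initialState;
  }
}
```

Inside a hook, this reducer could be passed to React's `useReducer(triageReducer, initialState)`, with the service layer dispatching `resolve` or `reject` once the agent call settles. Keeping the transitions in a pure function also makes them testable without rendering a component.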

🚀 Why This Wins: Reliability Over Hallucination

Most AI agents "guess" when they encounter an error. Silent Triage is different. By grounding the model with Algolia's Search Tool, the agent retrieves actual historical context.

✅ Tactical HUD: A React-based interface designed for dark "War Room" environments.

✅ Telemetry Extraction: Automatically identifies IPs and endpoints from raw text.

✅ Confidence Score: Transparent scoring based on how well the input matches historical records.
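As an illustration of the telemetry extraction idea, here is a minimal sketch that pulls IPv4 addresses and HTTP endpoints out of pasted text with regular expressions. The patterns and the `extractTelemetry` function are my own assumptions; the project may well extract these differently, for example inside the agent itself.

```typescript
// Hypothetical extractor; the patterns are illustrative, not the project's own.
interface Telemetry {
  ips: string[];
  endpoints: string[];
}

// Matches dotted-quad candidates; octet ranges are checked separately below.
const IPV4_CANDIDATE = /\b(?:\d{1,3}\.){3}\d{1,3}\b/g;
// Matches an HTTP method followed by a path, e.g. "POST /api/v1/register".
const ENDPOINT = /\b(GET|POST|PUT|PATCH|DELETE)\s+(\/[^\s"']*)/;

function isValidIpv4(candidate: string): boolean {
  return candidate.split(".").every((octet) => Number(octet) <= 255);
}

// Removes duplicates while keeping first-seen order.
function unique(values: string[]): string[] {
  return values.filter((value, index) => values.indexOf(value) === index);
}

function extractTelemetry(rawLog: string): Telemetry {
  const ips = (rawLog.match(IPV4_CANDIDATE) ?? []).filter(isValidIpv4);
  const endpoints: string[] = [];
  // Build a fresh global regex per call so lastIndex always starts at 0.
  const pattern = new RegExp(ENDPOINT.source, "g");
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(rawLog)) !== null) {
    endpoints.push(`${match[1]} ${match[2]}`);
  }
  return { ips: unique(ips), endpoints: unique(endpoints) };
}
```

This is deliberately simple: it does not handle IPv6, and a path followed directly by punctuation (such as a trailing period) keeps that character.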

📋 Post-Incident Automation

Beyond triage, Silent Triage automates the post-mortem workflow:

One-Click PDF Report Generation

After analyzing an incident, the system generates a professional PDF Audit Report using the jsPDF library (the jspdf package):

✅ Incident Summary: Severity, timestamp, and confidence score

✅ Root Cause Analysis: Grounded in historical data

✅ Recommended Actions: Step-by-step remediation plan

✅ Related Incidents: References to past similar cases

This eliminates the manual copy-paste process that wastes critical minutes during P0 events.
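The report sections listed above can be sketched as a pure layout step that runs before any PDF call. The following is a minimal illustration under my own assumptions: `ReportInput`, `ReportLine`, and `layoutReport` are hypothetical names, and the spacing values are arbitrary.

```typescript
// Hypothetical layout helper; field names and spacing are illustrative only.
interface ReportInput {
  title: string;
  severity: string;
  timestamp: string; // supplied by the caller, e.g. an ISO 8601 string
  confidenceScore: number;
  probableCause: string;
  recommendedAction: string;
  relatedIncidentIds: string[];
}

interface ReportLine {
  text: string;
  y: number; // vertical position in the PDF's units
}

// Turns a triage result into positioned lines, one per report field.
function layoutReport(input: ReportInput, top = 20, lineHeight = 8): ReportLine[] {
  const texts = [
    `Incident Audit Report: ${input.title}`,
    `Severity: ${input.severity}`,
    `Timestamp: ${input.timestamp}`,
    `Confidence: ${input.confidenceScore}%`,
    `Root cause: ${input.probableCause}`,
    `Recommended action: ${input.recommendedAction}`,
    `Related incidents: ${input.relatedIncidentIds.join(", ") || "none"}`,
  ];
  return texts.map((text, index) => ({ text, y: top + index * lineHeight }));
}
```

Each line could then be drawn with jsPDF's `doc.text(line.text, x, line.y)` and the file written with `doc.save(...)`. Long values would still need wrapping, which this sketch leaves out.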

Jira-Ready Format

The analysis can be exported in Jira/Textile syntax, allowing instant ticket creation:

h2. [P0] Database Connection Timeout - Production

*Probable Cause:* Database connection pool exhausted due to unclosed connections.

*Recommended Action:* 
# Restart the primary database node
# Increase max_connections limit in RDS
# Review connection pooling configuration

*Related Incidents:* inc-001
*Confidence:* 92%

This bridges the gap between AI-powered triage and enterprise ticketing systems.
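The export shown above can be sketched as a small formatter. This is a minimal illustration, assuming the recommended action has already been split into individual steps; `JiraTicketInput` and `toJiraMarkup` are hypothetical names rather than the project's actual code.

```typescript
// Hypothetical formatter; produces Jira wiki markup like the sample above.
interface JiraTicketInput {
  severity: string;
  title: string;
  probableCause: string;
  steps: string[]; // one entry per remediation step
  relatedIncidentIds: string[];
  confidenceScore: number;
}

function toJiraMarkup(ticket: JiraTicketInput): string {
  const stepLines = ticket.steps.map((step) => `# ${step}`); // "#" starts a numbered item
  const lines = [
    `h2. [${ticket.severity}] ${ticket.title}`,
    "",
    `*Probable Cause:* ${ticket.probableCause}`,
    "",
    "*Recommended Action:*",
  ].concat(stepLines, [
    "",
    `*Related Incidents:* ${ticket.relatedIncidentIds.join(", ")}`,
    `*Confidence:* ${ticket.confidenceScore}%`,
  ]);
  return lines.join("\n");
}
```

Keeping the formatter separate from the UI means the same triage result can feed both the HUD and the ticket export.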

🔗 Project Links

🌐 Live Demo: silent-triage-hackathon.vercel.app

💻 Source Code: GitHub Repository

👤 Developed by Sherman95 for the Algolia Agent Studio Challenge.
