When production blinks, engineers reach for logs, metrics, and runbooks—not another generic chatbot. A useful incident copilot classifies impact, surfaces the right playbook, and drafts customer-safe language without pretending it replaced your paging system.
The core idea: separate triage facts from comms tone. Build three small agents and one supervisor against fixtures first, then swap in real data sources.
What we are building
- Severity triage: structured tool output (component, customer impact, suggested severity).
- Runbook retrieval: keyword search over on-call playbooks (fixtures → Confluence / Notion later).
- Customer update drafting: short status text with cautious language.
- Supervisor routing: createSupervisor chooses triage vs runbook vs comms across rounds.
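To make "structured tool output" concrete, here is a minimal sketch of a typed triage result with a runtime guard. The names `TriageResult` and `isTriageResult` are hypothetical and not part of HazelJS; the fields mirror the three listed above.

```typescript
// Hypothetical types for illustration; not a HazelJS API.
type Severity = 'P1' | 'P2' | 'P3' | 'P4';

interface TriageResult {
  component: string;
  customerImpact: string;
  suggestedSeverity: Severity;
}

const SEVERITIES: readonly Severity[] = ['P1', 'P2', 'P3', 'P4'];

// Tool output arrives as untyped JSON, so check its shape before the
// comms side is allowed to read it.
function isTriageResult(value: unknown): value is TriageResult {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.component === 'string' &&
    typeof v.customerImpact === 'string' &&
    SEVERITIES.some((s) => s === v.suggestedSeverity)
  );
}
```

Keeping the facts in a checked shape like this means the drafting step can quote severity and component without re-deriving them from free text.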
Add IncidentController plus three @Agent classes to providers, import the file before AgentModule, and expose POST /incident/triage (single-agent fast path) and POST /incident/supervisor (multi-worker router).
Runtime prerequisites: same AI-native baseline—Postgres for template boot, npm run db:push, OPENAI_API_KEY, npm run dev.
Fixture runbooks
// Fixture data; tags are lowercase because the search tool lowercases queries.
const RUNBOOKS = [
  {
    id: 'rb-payment-latency',
    title: 'Payment API latency',
    steps: [
      'Check p95 on checkout-service vs payments-api.',
      'Verify recent deploys; consider rollback if error budget burned.',
      'Page payments on-call if DB connection saturation > 80%.',
    ],
    tags: ['payments', 'latency'],
  },
  {
    id: 'rb-data-pipeline-stall',
    title: 'Batch pipeline stalled',
    steps: [
      'Confirm Kafka consumer lag for topic ingestion-raw.',
      'Inspect dead-letter queue depth.',
      'Scale workers if CPU < 40% and lag rising.',
    ],
    tags: ['data', 'kafka'],
  },
];
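Fixtures drift as people add entries, so a small sanity check can help before the agents rely on them. This is a sketch under assumptions: `Runbook` and `validateRunbooks` are hypothetical names, and the lowercase-tag rule exists only because the search tool in this article lowercases the query and compares tags as written.

```typescript
interface Runbook {
  id: string;
  title: string;
  steps: string[];
  tags: string[];
}

// Returns human-readable problems; an empty array means the fixtures look sane.
function validateRunbooks(runbooks: Runbook[]): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const rb of runbooks) {
    if (seen.has(rb.id)) problems.push(`duplicate id: ${rb.id}`);
    seen.add(rb.id);
    if (rb.steps.length === 0) problems.push(`no steps: ${rb.id}`);
    for (const tag of rb.tags) {
      // An uppercase tag could never match a lowercased query.
      if (tag !== tag.toLowerCase()) problems.push(`tag not lowercase: ${rb.id}/${tag}`);
    }
  }
  return problems;
}
```

Running it once at startup (or in a unit test) is usually enough for a fixture-sized list.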
Agents
Triage agent
@Agent({
  name: 'IncidentTriageAgent',
  description: 'Classifies incidents and suggests immediate checks',
  systemPrompt:
    'You are an SRE triage bot. Always call classifyIncident first, then suggestChecks. Never downplay customer impact.',
  maxSteps: 8,
  temperature: 0,
})
@Service()
export class IncidentTriageAgent {
  @Tool({
    description: 'Classify incident text into component, user impact, and suggested severity P1-P4',
    parameters: [
      { name: 'summary', type: 'string', description: 'Free-text incident summary', required: true },
    ],
  })
  async classifyIncident(input: { summary: string }) {
    // Fixture heuristic: a keyword match on the summary, nothing more.
    const text = input.summary.toLowerCase();
    const payment = text.includes('payment') || text.includes('checkout');
    return {
      component: payment ? 'payments-api' : 'unknown',
      customerImpact: payment ? 'checkout_degraded' : 'investigating',
      suggestedSeverity: payment ? 'P2' : 'P3',
    };
  }

  @Tool({
    description: 'Return two immediate shell-friendly checks (placeholders for real runbooks)',
    parameters: [
      { name: 'component', type: 'string', description: 'System component from classifyIncident', required: true },
    ],
  })
  async suggestChecks(input: { component: string }) {
    // Placeholder commands built from the component name.
    return {
      checks: [
        `kubectl logs deploy/${input.component} --tail=200`,
        `open dashboard/${input.component}/golden-signals`,
      ],
    };
  }
}
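The two-keyword check in classifyIncident grows awkward once you add components. One possible next step is a rule table, sketched below as a plain function so it can be unit tested without the agent runtime. The payments row mirrors the fixture above; the data-pipeline row, its `data_delayed` impact label, and the names `RULES`, `FALLBACK`, and `classify` are invented for illustration.

```typescript
type Severity = 'P1' | 'P2' | 'P3' | 'P4';

interface Classification {
  component: string;
  customerImpact: string;
  suggestedSeverity: Severity;
}

interface Rule {
  keywords: string[];
  result: Classification;
}

// Hypothetical rule table; only the first row comes from the article's fixture.
const RULES: Rule[] = [
  {
    keywords: ['payment', 'checkout'],
    result: { component: 'payments-api', customerImpact: 'checkout_degraded', suggestedSeverity: 'P2' },
  },
  {
    keywords: ['kafka', 'pipeline'],
    result: { component: 'data-pipeline', customerImpact: 'data_delayed', suggestedSeverity: 'P3' },
  },
];

const FALLBACK: Classification = {
  component: 'unknown',
  customerImpact: 'investigating',
  suggestedSeverity: 'P3',
};

// First matching rule wins, so order the table from most to least severe.
function classify(summary: string): Classification {
  const text = summary.toLowerCase();
  const rule = RULES.find((r) => r.keywords.some((k) => text.includes(k)));
  return rule ? rule.result : FALLBACK;
}
```

The tool method can then delegate to `classify(input.summary)`, which keeps the decorator wiring thin and the heuristics testable.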
Runbook search agent
@Agent({
  name: 'RunbookSearchAgent',
  description: 'Finds runbook steps for a symptom',
  systemPrompt: 'You retrieve runbooks. Call searchRunbooks before advising operational steps.',
  maxSteps: 6,
  temperature: 0,
})
@Service()
export class RunbookSearchAgent {
  @Tool({
    description: 'Search on-call runbooks by keyword or tag',
    parameters: [
      { name: 'query', type: 'string', description: 'Symptom or component', required: true },
    ],
  })
  async searchRunbooks(input: { query: string }) {
    const q = input.query.toLowerCase();
    // A runbook matches when the whole query is a substring of its title or
    // of a step, or when the query contains one of its tags.
    const hits = RUNBOOKS.filter(
      (r) =>
        r.title.toLowerCase().includes(q) ||
        r.tags.some((t) => q.includes(t)) ||
        r.steps.some((s) => s.toLowerCase().includes(q))
    );
    return { hits };
  }
}
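The fixture search treats the whole query as one substring, so a multi-word query such as "consumer lag high" only hits when it happens to contain a tag, and an empty query matches every runbook. A token-overlap variant is one way to loosen that. This is a sketch, not the article's implementation: `tokenize`, `scoreRunbook`, and `searchByTokens` are hypothetical helpers, and exact token matching means "payment" will not match "payments".

```typescript
interface Runbook {
  id: string;
  title: string;
  steps: string[];
  tags: string[];
}

// Lowercase word tokens (hyphens kept), dropping words of one or two characters.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9-]+/).filter((t) => t.length > 2);
}

// Score = number of distinct query tokens present in title, tags, or steps.
function scoreRunbook(rb: Runbook, queryTokens: string[]): number {
  const haystack = new Set(tokenize([rb.title, ...rb.tags, ...rb.steps].join(' ')));
  return new Set(queryTokens.filter((t) => haystack.has(t))).size;
}

// Returns matching runbooks, best score first; an empty query returns nothing.
function searchByTokens(runbooks: Runbook[], query: string): Runbook[] {
  const tokens = tokenize(query);
  return runbooks
    .map((rb) => ({ rb, score: scoreRunbook(rb, tokens) }))
    .filter((x) => x.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((x) => x.rb);
}
```

For a real knowledge base you would likely hand this job to the KB's own search and keep only the result shape.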
Customer comms agent
@Agent({
  name: 'CustomerCommsAgent',
  description: 'Drafts cautious external status updates',
  systemPrompt:
    'You draft short status page updates: what we know, what we do not, next update ETA. Avoid blame and speculation.',
  maxSteps: 6,
  temperature: 0.3,
})
@Service()
export class CustomerCommsAgent {
  @Tool({
    description: 'Validate tone flags (blocklist) before returning draft text',
    parameters: [
      { name: 'draft', type: 'string', description: 'Proposed customer-facing text', required: true },
    ],
  })
  async lintPublicDraft(input: { draft: string }) {
    // Flags absolute claims; \b keeps the match to whole words and phrases.
    const blocked = /\b(guaranteed|root cause identified|never)\b/i.test(input.draft);
    return { blocked, reason: blocked ? 'Remove absolute claims' : 'ok' };
  }
}
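A boolean `blocked` tells the model something is wrong but not what. A variant that reports which phrases matched may help it revise the draft in one pass. The sketch below is an assumption-laden alternative: `BLOCKED_PHRASES`, `escapeRegExp`, and `lintDraft` are hypothetical names, and the phrase list simply mirrors the regex above.

```typescript
// Mirrors the blocklist in lintPublicDraft; extend to suit your comms policy.
const BLOCKED_PHRASES = ['guaranteed', 'root cause identified', 'never'];

interface LintResult {
  blocked: boolean;
  matches: string[];
}

// Escape regex metacharacters so phrases are matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Case-insensitive whole-phrase match; returns every phrase found in the draft.
function lintDraft(draft: string, phrases: string[] = BLOCKED_PHRASES): LintResult {
  const matches = phrases.filter((p) =>
    new RegExp(`\\b${escapeRegExp(p)}\\b`, 'i').test(draft)
  );
  return { blocked: matches.length > 0, matches };
}
```

Returning `matches` from the tool gives the agent a concrete list of wording to remove rather than a generic "remove absolute claims".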
HTTP controller (supervisor)
@Controller('incident')
export class IncidentController {
  constructor(private readonly agentService: AgentService) {}

  /** Multi-worker router: the supervisor picks a worker each round. */
  @Post('supervisor')
  async supervisor(@Body() body: { message: string }) {
    const runtime = this.agentService.getRuntime();
    const supervisor = runtime.createSupervisor({
      name: 'incident-supervisor',
      workers: ['IncidentTriageAgent', 'RunbookSearchAgent', 'CustomerCommsAgent'],
      systemPrompt:
        'Route to IncidentTriageAgent for classification and immediate checks, RunbookSearchAgent for playbook retrieval, CustomerCommsAgent for external drafts. Use multiple rounds if needed.',
      maxRounds: 8,
    });
    const result = await supervisor.run(body.message);
    return {
      response: result.response,
      // Per-round routing metadata, flattened for clients and dashboards.
      rounds: result.rounds.map((r) => ({
        round: r.round,
        worker: r.decision.worker ?? 'supervisor',
        thought: r.decision.thought ?? null,
      })),
    };
  }

  /** Single-agent fast path for triage-only */
  @Post('triage')
  async triageOnly(@Body() body: { message: string }) {
    const result = await this.agentService.execute('IncidentTriageAgent', body.message);
    return { response: result.response, executionId: result.executionId };
  }
}
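The `rounds` array returned by the supervisor endpoint is easy to post-process on the client. As a sketch, the helpers below count rounds per worker and flag runs where a customer draft was produced before any triage round. The `RoundSummary` shape is copied from the response above; `countByWorker` and `commsBeforeTriage` are hypothetical helpers, not HazelJS functions.

```typescript
// Shape of one entry in the `rounds` array returned by POST /incident/supervisor.
interface RoundSummary {
  round: number;
  worker: string;
  thought: string | null;
}

// How many rounds each worker handled, e.g. for a dashboard counter.
function countByWorker(rounds: RoundSummary[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of rounds) {
    counts[r.worker] = (counts[r.worker] ?? 0) + 1;
  }
  return counts;
}

// True when a comms round ran before any triage round (or triage never ran).
function commsBeforeTriage(rounds: RoundSummary[]): boolean {
  const firstTriage = rounds.findIndex((r) => r.worker === 'IncidentTriageAgent');
  const firstComms = rounds.findIndex((r) => r.worker === 'CustomerCommsAgent');
  return firstComms !== -1 && (firstTriage === -1 || firstComms < firstTriage);
}
```

A check like `commsBeforeTriage` is a cheap way to alert on drafts that were written without classified facts behind them.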
Running & testing
npx @hazeljs/cli g app incident-copilot-demo --template=ai-native
cd incident-copilot-demo
cp .env.example .env
docker compose up -d postgres && npm run db:push
npm run dev
1) Triage-only — triage.json:
{ "message": "Checkout is slow; customers see 8s latency on payment confirm in EU." }
curl.exe -s -X POST http://localhost:3000/incident/triage -H "Content-Type: application/json" -d "@triage.json"
2) Supervisor — incident.json:
{
"message": "Kafka lag spiked on ingestion-raw; pipeline stalled. Draft a cautious customer update and cite runbook steps."
}
curl.exe -s -X POST http://localhost:3000/incident/supervisor -H "Content-Type: application/json" -d "@incident.json"
3) Inspector — http://localhost:3000/__hazel for route and module introspection.
Pair with GuardrailsModule before exposing CustomerCommsAgent outside a VPN.
What to change before production
- Wire searchRunbooks to your real KB; keep version ids on every retrieved step.
- Add approval for public drafts (requiresApproval: true on publish tools).
- Attach execution ids to PagerDuty / Jira incidents for audit.
- Add rate limits on /incident/* to avoid abuse during noisy false alarms.
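For the rate-limit item, a fixed-window counter is about the simplest thing that works. The sketch below is framework-agnostic and in-memory only, so it assumes a single process; `FixedWindowLimiter` is a hypothetical class, and a multi-instance deployment would need a shared store instead.

```typescript
// Minimal in-memory fixed-window limiter, keyed by caller (IP, API key, ...).
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the call is allowed. `now` is injectable to simplify tests.
  allow(key: string, now: number = Date.now()): boolean {
    const w = this.windows.get(key);
    if (!w || now - w.start >= this.windowMs) {
      // First call for this key, or the previous window has expired.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (w.count >= this.limit) return false;
    w.count += 1;
    return true;
  }
}
```

In a handler for /incident/* you would call `allow(callerKey)` first and return HTTP 429 when it comes back false.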
HazelJS gives you native multi-agent routing without a second orchestration product: specialists stay small, createSupervisor exposes round metadata for dashboards, and the same module stack scales to RAG and Flow-backed playbooks.