Shehriyar Malik

Posted on Jun 16

Runbooks in Minutes: An On-Call Incident Copilot with HazelJS

#ai #automation #javascript #sre

When production blinks, engineers reach for logs, metrics, and runbooks—not another generic chatbot. A useful incident copilot classifies impact, surfaces the right playbook, and drafts customer-safe language without pretending it replaced your paging system.

The core idea: separate triage facts from comms tone. Use three small specialist agents and one supervisor, start with fixtures, then swap in real data sources.

What we are building

Severity triage — structured tool output (component, customer impact, suggested severity).
Runbook retrieval — keyword search over on-call playbooks (fixtures → Confluence / Notion later).
Customer update drafting — short status text with cautious language.
Supervisor routing — createSupervisor chooses triage vs runbook vs comms across rounds.
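
As a rough sketch of what "structured tool output" could look like, the triage result can be given an explicit type and a runtime guard. The names TriageResult, Severity, and isTriageResult are hypothetical and not part of HazelJS; only the three field names come from the classifyIncident tool in this post.

```typescript
// Hypothetical types: only the three field names come from the classifyIncident tool.
type Severity = 'P1' | 'P2' | 'P3' | 'P4';

interface TriageResult {
  component: string;
  customerImpact: string;
  suggestedSeverity: Severity;
}

const SEVERITIES: readonly Severity[] = ['P1', 'P2', 'P3', 'P4'];

// Runtime guard: check the shape before a dashboard or pager integration trusts it.
function isTriageResult(value: unknown): value is TriageResult {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.component === 'string' &&
    typeof v.customerImpact === 'string' &&
    SEVERITIES.some((s) => s === v.suggestedSeverity)
  );
}
```

A guard like this is one way to keep free-form model text from leaking into fields that downstream tooling treats as facts.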

Add IncidentController and the three @Agent classes to providers, import the file before AgentModule, and expose POST /incident/triage (single-agent fast path) and POST /incident/supervisor (multi-worker router).

Runtime prerequisites are the ai-native template baseline: Postgres for template boot, npm run db:push, OPENAI_API_KEY, and npm run dev.

Fixture runbooks

const RUNBOOKS = [
  {
    id: 'rb-payment-latency',
    title: 'Payment API latency',
    steps: [
      'Check p95 on checkout-service vs payments-api.',
      'Verify recent deploys; consider rollback if error budget burned.',
      'Page payments on-call if DB connection saturation > 80%.',
    ],
    tags: ['payments', 'latency'],
  },
  {
    id: 'rb-data-pipeline-stall',
    title: 'Batch pipeline stalled',
    steps: [
      'Confirm Kafka consumer lag for topic ingestion-raw.',
      'Inspect dead-letter queue depth.',
      'Scale workers if CPU < 40% and lag rising.',
    ],
    tags: ['data', 'kafka'],
  },
];

Agents

Triage agent

@Agent({
  name: 'IncidentTriageAgent',
  description: 'Classifies incidents and suggests immediate checks',
  systemPrompt:
    'You are an SRE triage bot. Always call classifyIncident first, then suggestChecks. Never downplay customer impact.',
  maxSteps: 8,
  temperature: 0,
})
@Service()
export class IncidentTriageAgent {
  @Tool({
    description: 'Classify incident text into component, user impact, and suggested severity P1-P4',
    parameters: [
      { name: 'summary', type: 'string', description: 'Free-text incident summary', required: true },
    ],
  })
  async classifyIncident(input: { summary: string }) {
    const text = input.summary.toLowerCase();
    const payment = text.includes('payment') || text.includes('checkout');
    return {
      component: payment ? 'payments-api' : 'unknown',
      customerImpact: payment ? 'checkout_degraded' : 'investigating',
      suggestedSeverity: payment ? 'P2' : 'P3',
    };
  }

  @Tool({
    description: 'Return two immediate shell-friendly checks (placeholders for real runbooks)',
    parameters: [
      { name: 'component', type: 'string', description: 'System component from classifyIncident', required: true },
    ],
  })
  async suggestChecks(input: { component: string }) {
    return {
      checks: [
        `kubectl logs deploy/${input.component} --tail=200`,
        `open dashboard/${input.component}/golden-signals`,
      ],
    };
  }
}
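
To see what the keyword classifier returns, the body of classifyIncident can be copied into a plain function and run against the two sample messages used later in this post. This is a standalone sketch with no decorators or HazelJS imports; it takes the summary string directly rather than an input object.

```typescript
// Plain-function copy of the classifyIncident body, runnable without an agent runtime.
function classifyIncident(summary: string) {
  const text = summary.toLowerCase();
  const payment = text.includes('payment') || text.includes('checkout');
  return {
    component: payment ? 'payments-api' : 'unknown',
    customerImpact: payment ? 'checkout_degraded' : 'investigating',
    suggestedSeverity: payment ? 'P2' : 'P3',
  };
}

const checkout = classifyIncident(
  'Checkout is slow; customers see 8s latency on payment confirm in EU.',
);
const kafka = classifyIncident('Kafka lag spiked on ingestion-raw; pipeline stalled.');

console.log(checkout.component, checkout.suggestedSeverity); // payments-api P2
console.log(kafka.component, kafka.suggestedSeverity); // unknown P3
```

The second call shows the fixture's limit: anything that does not mention payments or checkout falls through to unknown / P3, so the real classifier needs more components before it is trusted for paging decisions.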

Runbook search agent

@Agent({
  name: 'RunbookSearchAgent',
  description: 'Finds runbook steps for a symptom',
  systemPrompt: 'You retrieve runbooks. Call searchRunbooks before advising operational steps.',
  maxSteps: 6,
  temperature: 0,
})
@Service()
export class RunbookSearchAgent {
  @Tool({
    description: 'Search on-call runbooks by keyword or tag',
    parameters: [
      { name: 'query', type: 'string', description: 'Symptom or component', required: true },
    ],
  })
  async searchRunbooks(input: { query: string }) {
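    const q = input.query.toLowerCase();
    const hits = RUNBOOKS.filter(
      (r) =>
        // Title and steps match only if they contain the whole query string.
        r.title.toLowerCase().includes(q) ||
        // Tags match the other way round: the query must contain the tag.
        r.tags.some((t) => q.includes(t)) ||
        r.steps.some((s) => s.toLowerCase().includes(q))
    );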
    return { hits };
  }
}
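
The matching rule in searchRunbooks is easier to reason about when run outside the agent. The sketch below copies the predicate against trimmed fixtures and adds searchByToken, a hypothetical word-by-word variant that is not in the original code, to show one way multi-word queries could be handled.

```typescript
interface Runbook { id: string; title: string; steps: string[]; tags: string[]; }

// Trimmed copies of the fixtures from this post (one step each).
const RUNBOOKS: Runbook[] = [
  { id: 'rb-payment-latency', title: 'Payment API latency',
    steps: ['Check p95 on checkout-service vs payments-api.'], tags: ['payments', 'latency'] },
  { id: 'rb-data-pipeline-stall', title: 'Batch pipeline stalled',
    steps: ['Confirm Kafka consumer lag for topic ingestion-raw.'], tags: ['data', 'kafka'] },
];

// Same predicate as the searchRunbooks tool: the query is matched as one string.
function searchWholeQuery(query: string): Runbook[] {
  const q = query.toLowerCase();
  return RUNBOOKS.filter(
    (r) =>
      r.title.toLowerCase().includes(q) ||
      r.tags.some((t) => q.includes(t)) ||
      r.steps.some((s) => s.toLowerCase().includes(q)),
  );
}

// Hypothetical variant: split the query into words and match any word of 3+ characters.
function searchByToken(query: string): Runbook[] {
  const tokens = query.toLowerCase().split(/[^a-z0-9-]+/).filter((t) => t.length >= 3);
  return RUNBOOKS.filter((r) => {
    const haystack = [r.title, ...r.tags, ...r.steps].join(' ').toLowerCase();
    return tokens.some((t) => haystack.includes(t));
  });
}

const wholeKafka = searchWholeQuery('kafka lag').map((r) => r.id);
// ['rb-data-pipeline-stall'], matched through the 'kafka' tag
const wholeCheckout = searchWholeQuery('checkout timeout').map((r) => r.id);
// [], because no title or step contains the full phrase and no tag appears in it
const tokenCheckout = searchByToken('checkout timeout').map((r) => r.id);
// ['rb-payment-latency'], matched through 'checkout' in a step
```

Token matching trades precision for recall, so it may return noisier hits on a large knowledge base; a real KB search would likely replace both functions.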

Customer comms agent

@Agent({
  name: 'CustomerCommsAgent',
  description: 'Drafts cautious external status updates',
  systemPrompt:
    'You draft short status page updates: what we know, what we do not, next update ETA. Avoid blame and speculation.',
  maxSteps: 6,
  temperature: 0.3,
})
@Service()
export class CustomerCommsAgent {
  @Tool({
    description: 'Validate tone flags (blocklist) before returning draft text',
    parameters: [
      { name: 'draft', type: 'string', description: 'Proposed customer-facing text', required: true },
    ],
  })
  async lintPublicDraft(input: { draft: string }) {
    const blocked = /\b(guaranteed|root cause identified|never)\b/i.test(input.draft);
    return { blocked, reason: blocked ? 'Remove absolute claims' : 'ok' };
  }
}
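
Because the blocklist is a single regular expression, it can be exercised on its own before any agent is involved. The following sketch lifts the pattern from lintPublicDraft into a plain function that takes the draft string directly; the sample drafts are made up for illustration.

```typescript
// Same pattern as lintPublicDraft, extracted so it can be unit-tested without a runtime.
const ABSOLUTE_CLAIMS = /\b(guaranteed|root cause identified|never)\b/i;

function lintPublicDraft(draft: string) {
  const blocked = ABSOLUTE_CLAIMS.test(draft);
  return { blocked, reason: blocked ? 'Remove absolute claims' : 'ok' };
}

const cautious = lintPublicDraft(
  'We are investigating elevated checkout latency. Next update in 30 minutes.',
);
// cautious.blocked === false

const absolute = lintPublicDraft('Root cause identified; this will never happen again.');
// absolute.blocked === true (case-insensitive match)

const nevertheless = lintPublicDraft('Nevertheless, we continue to monitor.');
// nevertheless.blocked === false: \b stops "never" from matching inside a longer word
```

Note that the pattern only catches the exact listed phrases; a draft saying "we guarantee" would pass, so the list probably needs extending before it gates real status updates.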

HTTP controller (supervisor)

@Controller('incident')
export class IncidentController {
  constructor(private readonly agentService: AgentService) {}

  @Post('supervisor')
  async supervisor(@Body() body: { message: string }) {
    const runtime = this.agentService.getRuntime();
    const supervisor = runtime.createSupervisor({
      name: 'incident-supervisor',
      workers: ['IncidentTriageAgent', 'RunbookSearchAgent', 'CustomerCommsAgent'],
      systemPrompt:
        'Route to IncidentTriageAgent for classification and immediate checks, RunbookSearchAgent for playbook retrieval, CustomerCommsAgent for external drafts. Use multiple rounds if needed.',
      maxRounds: 8,
    });
    const result = await supervisor.run(body.message);
    return {
      response: result.response,
      rounds: result.rounds.map((r) => ({
        round: r.round,
        worker: r.decision.worker ?? 'supervisor',
        thought: r.decision.thought ?? null,
      })),
    };
  }

  /** Single-agent fast path for triage-only */
  @Post('triage')
  async triageOnly(@Body() body: { message: string }) {
    const result = await this.agentService.execute('IncidentTriageAgent', body.message);
    return { response: result.response, executionId: result.executionId };
  }
}
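
The rounds array returned by the supervisor endpoint lends itself to a small aggregation for dashboards. The sketch below assumes a round shape inferred only from the fields the controller reads (round, decision.worker, decision.thought); SupervisorRound and countRoundsByWorker are hypothetical names, and the sample rounds are hand-written rather than real output.

```typescript
// Hypothetical shape, inferred only from the fields the controller above reads.
interface SupervisorRound {
  round: number;
  decision: { worker?: string; thought?: string };
}

// Count how many rounds each worker handled; rounds without a worker
// are attributed to the supervisor, mirroring the controller's fallback.
function countRoundsByWorker(rounds: SupervisorRound[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of rounds) {
    const worker = r.decision.worker ?? 'supervisor';
    counts[worker] = (counts[worker] ?? 0) + 1;
  }
  return counts;
}

// Hand-written sample input for illustration only.
const sampleRounds: SupervisorRound[] = [
  { round: 1, decision: { worker: 'IncidentTriageAgent', thought: 'classify first' } },
  { round: 2, decision: { worker: 'RunbookSearchAgent' } },
  { round: 3, decision: { worker: 'RunbookSearchAgent' } },
  { round: 4, decision: {} },
];

const roundCounts = countRoundsByWorker(sampleRounds);
// { IncidentTriageAgent: 1, RunbookSearchAgent: 2, supervisor: 1 }
```

A per-worker count like this can make it visible when the supervisor loops on one worker, which is usually a sign the routing prompt needs tightening.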

Running & testing

npx @hazeljs/cli g app incident-copilot-demo --template=ai-native
cd incident-copilot-demo
cp .env.example .env
docker compose up -d postgres && npm run db:push
npm run dev

1) Triage-only — triage.json:

{ "message": "Checkout is slow; customers see 8s latency on payment confirm in EU." }

curl.exe -s -X POST http://localhost:3000/incident/triage -H "Content-Type: application/json" -d "@triage.json"

2) Supervisor — incident.json:

{
  "message": "Kafka lag spiked on ingestion-raw; pipeline stalled. Draft a cautious customer update and cite runbook steps."
}

curl.exe -s -X POST http://localhost:3000/incident/supervisor -H "Content-Type: application/json" -d "@incident.json"

3) Inspector — http://localhost:3000/__hazel for route and module introspection.

Pair with GuardrailsModule before exposing CustomerCommsAgent outside a VPN.

What to change before production

Wire searchRunbooks to your real KB; keep version ids on every retrieved step.
Add approval for public drafts (requiresApproval: true on publish tools).
Attach execution ids to PagerDuty / Jira incidents for audit.
Add rate limits on /incident/* to avoid abuse during noisy false alarms.
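
For the rate-limit item above, a minimal in-memory fixed-window limiter might look like the sketch below. It is framework-agnostic and every name in it (createRateLimiter, WindowState, the key scheme) is hypothetical; HazelJS may offer its own mechanism, and a multi-instance deployment would need shared storage instead of a local object.

```typescript
// Fixed-window limiter keyed by caller. All names here are hypothetical.
interface WindowState { windowStart: number; count: number; }

function createRateLimiter(limit: number, windowMs: number, now: () => number = Date.now) {
  const windows: Record<string, WindowState> = {};
  return function allow(key: string): boolean {
    const t = now();
    const state = windows[key];
    // Start a fresh window on first use or once the previous window has elapsed.
    if (!state || t - state.windowStart >= windowMs) {
      windows[key] = { windowStart: t, count: 1 };
      return true;
    }
    if (state.count >= limit) return false;
    state.count += 1;
    return true;
  };
}

// Usage with an injected clock so the behaviour is deterministic.
let clock = 0;
const allowIncidentCall = createRateLimiter(2, 60000, () => clock);
const first = allowIncidentCall('team-a'); // true
const second = allowIncidentCall('team-a'); // true
const third = allowIncidentCall('team-a'); // false: limit of 2 reached in this window
clock = 60000;
const afterReset = allowIncidentCall('team-a'); // true: a new window has started
```

Calling a function like this at the top of each /incident/* handler, keyed by API token or team, is one way to keep a noisy alert storm from turning into a burst of LLM calls.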

HazelJS gives you native multi-agent routing without a second orchestration product: specialists stay small, createSupervisor exposes round metadata for dashboards, and the same module stack scales to RAG and Flow-backed playbooks.

DEV Community