Sreenu Sasubilli

On-Call Incident Triage Panel

This is a submission for the Algolia Agent Studio Challenge: Consumer-Facing Non-Conversational Experiences

What I Built

I built an On-Call Triage Intelligence Panel for SRE and DevOps teams.

Instead of a chatbot, this system proactively surfaces the most relevant operational patterns, likely causes, and first-check actions when an engineer is diagnosing an incident. The goal is to reduce cognitive load and MTTR during high-stress on-call situations — without requiring back-and-forth conversation.

Engineers already work inside dashboards, runbooks, and incident tools. This experience enhances that existing workflow by injecting AI-driven retrieval intelligence directly into incident triage, rather than asking users to “chat” with a system.


Demo

Live Index (Algolia Search Explorer):

https://dashboard.algolia.com/apps/BF4Z56HB7R/explorer/browse/oncall_triage_kb

This demo uses Algolia’s Search Explorer to simulate how the On-Call Triage Panel would operate when embedded inside an SRE workflow (alerting tools, observability dashboards, or internal runbooks).

Example scenarios you can try:

  • High latency incident
    • Query: orders-api p99 latency
    • Filters:
      • service = orders-api
      • env = prod

  • Network-related incident
    • Query: packet loss
    • Filters:
      • service = orders-api
      • env = staging

  • Database error spike
    • Query: db errors
    • Filters:
      • service = payments-api
      • env = prod

You can also explore by filtering on:

  • service (payments-api, search-api, orders-api)
  • env (prod, staging, any)
  • severity (high, medium, low)
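The scenarios above combine a free-text query with facet filters. A rough sketch of how an embedded panel might build that search request body follows. The helper name `build_search_params` is hypothetical, and the sketch assumes `service`, `env`, and `severity` are declared as facets on the index. Treating `env = any` records as matching every environment is also an assumption about how the data is modeled.

```python
import json


def build_search_params(query, service=None, env=None, severity=None, hits=5):
    """Build a JSON-serializable search body with an Algolia-style filter string."""
    clauses = []
    if service:
        clauses.append(f'service:"{service}"')
    if env:
        # Assumption: records tagged env=any apply to every environment.
        clauses.append(f'(env:"{env}" OR env:"any")')
    if severity:
        clauses.append(f'severity:"{severity}"')
    params = {"query": query, "hitsPerPage": hits}
    if clauses:
        params["filters"] = " AND ".join(clauses)
    return params


# High latency scenario from the demo list.
print(json.dumps(build_search_params("orders-api p99 latency", "orders-api", "prod")))
```

A body like this could then be sent to the index's search endpoint with a search-only API key; the request itself is left out here so the sketch runs without credentials.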

What happens:
Each search instantly surfaces the most relevant triage patterns, showing:

  • Why the pattern was matched (contextual explainability)
  • Likely root causes based on historical incidents
  • A copy-ready “first checks” checklist for immediate on-call action

This demonstrates proactive, non-conversational assistance — intelligence is injected directly into the workflow without requiring chat or back-and-forth interaction.
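To make the three outputs above concrete, here is a minimal sketch of how a single search hit might be turned into a plain-text card. The function name `render_triage_card` and the hit fields (`title`, `explanation`, `likely_causes`, `first_checks`) are assumptions for illustration, not the actual attribute names in the index.

```python
def render_triage_card(hit: dict) -> str:
    """Format one search hit as a plain-text card with a copy-ready checklist."""
    lines = [
        f"Pattern: {hit.get('title', hit.get('objectID', 'unknown'))}",
        f"Why matched: {hit.get('explanation', 'n/a')}",
        "Likely causes:",
    ]
    # One bullet per likely cause (hypothetical field name).
    lines += [f"  - {cause}" for cause in hit.get("likely_causes", [])]
    lines.append("First checks:")
    # Checkbox prefix so the list can be pasted into an incident ticket.
    lines += [f"  [ ] {step}" for step in hit.get("first_checks", [])]
    return "\n".join(lines)
```

Plain text is used here so the checklist can be pasted into a ticket or chat channel without reformatting.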

Mock Screenshot (UI Concept):

Source: https://github.com/sasubillis/oncall_triage_mock/blob/main/index.html

On-call triage panel mock


How I Used Algolia Agent Studio

Algolia Agent Studio was used to power the retrieval intelligence layer, not a conversational UI.

Indexed Data

I indexed ~100 realistic SRE knowledge records, including:

  • Incident patterns
  • Historical incidents
  • Symptoms
  • Services and environments
  • Severity levels
  • Likely causes
  • First-check remediation steps
  • Confidence and explanation metadata

Each record is structured to represent operational decision artifacts, not free-form text.
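A minimal sketch of what one such record might look like, written as a Python dict. The field names and values are illustrative assumptions, not the actual schema of the oncall_triage_kb index; only `objectID` is a standard Algolia attribute.

```python
# Hypothetical triage record; field names and values are illustrative only.
REQUIRED_FIELDS = {
    "objectID", "type", "service", "env", "severity",
    "symptoms", "likely_causes", "first_checks", "confidence", "explanation",
}

example_record = {
    "objectID": "pattern-latency-001",
    "type": "incident_pattern",
    "service": "orders-api",
    "env": "prod",
    "severity": "high",
    "symptoms": ["p99 latency above normal", "request queue growing"],
    "likely_causes": ["connection pool exhaustion", "slow downstream dependency"],
    "first_checks": [
        "Check connection pool saturation",
        "Compare latency of downstream calls against baseline",
    ],
    "confidence": 0.8,
    "explanation": "Matched on service, environment, and latency symptoms.",
}


def missing_fields(record: dict) -> set:
    """Return the required fields that a record does not define."""
    return REQUIRED_FIELDS - set(record)
```

Checking records against a fixed field set before indexing is one way to keep them structured rather than free-form.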

Retrieval Strategy

Instead of prompting an LLM, the system relies on:

  • AI-powered semantic relevance
  • Attribute weighting
  • Typo tolerance
  • Contextual ranking across multiple signals

For example, a query like:

packet loss payments api

automatically retrieves:

  • Relevant historical incidents
  • Matching triage patterns
  • Environment-appropriate remediation steps

No conversation is required — the intelligence is embedded in retrieval itself.
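Signals such as attribute weighting and typo tolerance are typically expressed as index settings rather than application code. The sketch below uses Algolia's documented setting names (`searchableAttributes`, `attributesForFaceting`, `customRanking`, `typoTolerance`); the attribute names inside them are assumptions about the record schema, not the settings actually applied to this index.

```python
# Hypothetical settings; attribute names are assumed, setting keys are Algolia's.
index_settings = {
    # Order expresses priority: matches in earlier attributes rank higher.
    "searchableAttributes": [
        "symptoms", "title", "service", "likely_causes", "first_checks",
    ],
    # Facets the panel can filter on.
    "attributesForFaceting": ["service", "env", "severity"],
    # Tie-breaker applied after textual relevance.
    "customRanking": ["desc(confidence)"],
    "typoTolerance": True,
}


def searchable_priority(settings: dict, attribute: str) -> int:
    """Return the 0-based priority of an attribute, or -1 if it is not searchable."""
    attrs = settings.get("searchableAttributes", [])
    return attrs.index(attribute) if attribute in attrs else -1
```

Listing `service` as searchable is what would let a query such as packet loss payments api match on the service name without an explicit filter.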

Targeted Prompting (Non-Conversational)

Agent Studio is used to:

  • Explain why a pattern is shown
  • Rank patterns by confidence and operational relevance
  • Surface the most actionable next steps first

This is agentic behavior without dialogue.
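As a rough illustration of ranking by confidence and operational relevance, the sketch below orders hits by severity first and confidence second. This is a simplified stand-in written for this post, not how Agent Studio ranks results; the `severity` and `confidence` field names and the `order_patterns` helper are assumptions.

```python
SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}


def order_patterns(hits):
    """Sort hits so high-severity, high-confidence patterns come first."""
    return sorted(
        hits,
        key=lambda h: (
            # Unknown or missing severities sort after all known ones.
            SEVERITY_RANK.get(h.get("severity"), len(SEVERITY_RANK)),
            # Negate so higher confidence sorts earlier within a severity.
            -float(h.get("confidence", 0.0)),
        ),
    )
```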


Why Fast Retrieval Matters

In on-call scenarios, every second matters.

Sub-50ms retrieval allows triage guidance to appear instantly during active incidents.

Algolia’s fast, contextual retrieval enables:

  • Sub-second access to operational knowledge
  • Reduced time spent searching runbooks
  • Fewer context switches during incidents
  • Faster identification of known failure patterns

Instead of engineers remembering where knowledge lives, the system remembers for them.

This is especially critical in high-severity incidents where cognitive overload is common.


Why This Fits the Challenge

This project is intentionally non-conversational.

There is no chat interface and no back-and-forth prompting. The AI value comes from:

  • Learning-based relevance ranking
  • Pattern recognition across historical incidents
  • Proactive surfacing of the most useful information at the right moment

It demonstrates how AI-powered retrieval can quietly enhance real workflows — exactly what the Consumer-Facing Non-Conversational Experiences category is about.


Closing Thoughts

Many AI demos focus on talking to users.

This project focuses on helping users think less during critical moments.

By embedding AI directly into operational workflows through Algolia’s retrieval engine, this approach shows how intelligent systems can assist users without ever asking a question back.
