Investigation Reports: When Monitors Get Smarter

#ai #observability #devops #logging

Authored by Marco Aquilanti

When a monitor fires, there's a familiar sequence of checks required to find the root cause. The engineers who set up the monitor usually know these steps by heart — they know the dependencies, the error codes, what to check and where. But for the on-call responder, these steps aren't always obvious. Historically, the solution was to force engineering teams to document the checks in a playbook and hope the responder would read it under pressure.

Today, we can offload these checks to an LLM, shifting the responder's role from gathering evidence to reviewing a diagnosis, which can significantly reduce MTTR (mean time to resolution).

Our new Investigation Reports feature does exactly this: an LLM completes the investigation and delivers a detailed report before the human even acknowledges the alert.

Investigation Reports builds on BrontoScope, our first AI-powered investigation feature — and on the positive customer feedback it generated.

BrontoScope vs. Investigation Reports

Both BrontoScope and Investigation Reports perform automated investigations and provide reports, but they work differently.

BrontoScope starts with a user request to investigate a specific error event in the logs. The investigation follows a defined workflow aimed at establishing when and where an error is occurring. The LLM guides the process and summarizes findings synchronously — the user is waiting for a response and gets it in seconds.

Investigation Reports is triggered by a system event (a monitor firing), with no user waiting for a synchronous response. This gives the LLM more time — not seconds but minutes — to query data and analyze results. Investigating an alert is also a more generic task than BrontoScope's focused error investigation, making it harder to define a single fixed workflow that succeeds in every scenario.

For these reasons, Investigation Reports lets the LLM operate more freely — giving it tools and context rather than a coded workflow.

Tools Are Easy, Context Is Hard

The tools side is straightforward: the LLM can call Bronto's APIs to perform lightning-fast log search, query key-value dictionaries, check monitor history, retrieve precomputed metrics, and more.
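As a rough illustration of what "giving the LLM tools" can look like, the sketch below registers two stub functions under names a model could request and routes a tool call to them. The tool names, argument shapes, and return values are hypothetical placeholders chosen for this example; they are not Bronto's actual API.

```python
from typing import Any, Callable, Dict

# Hypothetical tool stubs: names and return shapes are illustrative only,
# not Bronto's real API. A real tool would call a backend service here.
def search_logs(dataset: str, query: str) -> Dict[str, Any]:
    return {"tool": "search_logs", "dataset": dataset, "query": query, "events": []}

def monitor_history(monitor_id: str) -> Dict[str, Any]:
    return {"tool": "monitor_history", "monitor_id": monitor_id, "firings": []}

# Registry mapping the name the model asks for to the function that runs it.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {
    "search_logs": search_logs,
    "monitor_history": monitor_history,
}

def dispatch(tool_call: Dict[str, Any]) -> Dict[str, Any]:
    """Run one tool call of the form {"name": ..., "arguments": {...}}."""
    name = tool_call["name"]
    if name not in TOOLS:
        # Return the error as data so the model can correct its next request.
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name](**tool_call.get("arguments", {}))
```

In a setup like this, the model decides which tool to call and in what order, and the surrounding code only executes the request and feeds the result back.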

Context is the harder problem.

LLMs make good logical decisions when provided with relevant, well-explained context. But the context window is limited, and several studies have found that answer quality tends to degrade and hallucinations become more likely as the context grows longer. This effect is known as "context rot" (see research from Chroma and this arXiv paper).

For an effective investigation, the LLM needs more than just the monitor that fired. It needs historical context and an understanding of the monitored system. But dumping thousands of tokens of documentation into the prompt backfires: it raises the risk of hallucination and degrades report accuracy.
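One simple way to picture the trade-off is a context budget: rank the available material by importance and include only what fits. The sketch below is an assumption made for illustration, not a description of how Investigation Reports selects context. The `Snippet` type, the priority scheme, and the word-count token estimate are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Snippet:
    label: str
    text: str
    priority: int  # lower number = more important

def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())

def select_context(snippets: List[Snippet], budget: int) -> List[Snippet]:
    """Keep the most important snippets that fit within the token budget."""
    chosen: List[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.priority):
        cost = estimate_tokens(snippet.text)
        if used + cost <= budget:
            chosen.append(snippet)
            used += cost
    return chosen
```

With a small budget, the monitor definition and recent history would make the cut while a long documentation dump would be left out.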

User-Defined Investigation Context

The precise knowledge needed for a good investigation is hard for an LLM to infer autonomously — but it can be provided by the engineer who owns the monitor.

In a dedicated "Investigation Prompt" text area, the user instructs the LLM on what to check and what to do when the monitor fires. Free-form text makes the feature highly flexible, effectively letting users define an ad-hoc workflow for each specific use case.

Engineers and SREs commonly include:

Dependencies of the affected service
Related log datasets and how to correlate/query them
Relevant keys and metrics to check
What to include in the report — affected components, customers, or users
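The four ingredients above can be turned into plain instructions. The sketch below shows one way a team might generate the text to paste into the Investigation Prompt field; the function name, the wording of the instructions, and the example service and dataset names in the usage note are hypothetical, and the field itself accepts any free-form text.

```python
from typing import List

def build_investigation_prompt(
    dependencies: List[str],
    datasets: List[str],
    keys: List[str],
    report_fields: List[str],
) -> str:
    """Assemble free-form investigation instructions from four lists."""
    lines = [
        "When this monitor fires:",
        "1. Check the health of these dependencies: " + ", ".join(dependencies) + ".",
        "2. Search these log datasets for related errors: " + ", ".join(datasets) + ".",
        "3. Inspect these keys and metrics: " + ", ".join(keys) + ".",
        "4. In the report, list: " + ", ".join(report_fields) + ".",
    ]
    return "\n".join(lines)
```

For example, calling it with dependencies `["payments-db"]`, datasets `["booking-api"]`, keys `["status_code"]`, and report fields `["affected customers"]` produces a five-line prompt that tells the LLM where to look and what to report.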

The screenshot below shows an example investigation prompt telling the LLM to check datasets in a collection named "booking system":

And here's the Investigation Report generated when that monitor fired — the LLM followed the instructions, ran multiple queries, and produced a report with a potential root cause, diagnosis, and timeline:

Investigation Reports Beyond Incident Response

Bronto's own customer support and sales teams found an unexpected use case. They set up monitors to be notified when new organizations are created or contracts are updated in the system — keeping the team up to date on new sign-ups and customer onboarding.

Investigation Reports automates the task of fetching context about each event. Details like contract type, retention plan, company size, and location are queried across multiple logs and assembled into a report that arrives within a minute of the monitor notification. This lets the team quickly identify relevant events among routine ones.
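To make the "queried across multiple logs and assembled" step concrete, here is a deterministic stand-in for what the LLM does in this use case. It is only a sketch: the event shape, the field names, and the report layout are assumptions made for this example, not the format Investigation Reports produces.

```python
from typing import Dict, Iterable, List

# Hypothetical field names for the details mentioned above.
REPORT_FIELDS = ["contract_type", "retention_plan", "company_size", "location"]

def assemble_report(org: str, events: Iterable[Dict[str, str]]) -> str:
    """Merge fields found across several log events into one short summary."""
    merged: Dict[str, str] = {}
    for event in events:
        if event.get("org") != org:
            continue  # ignore events that belong to other organizations
        for field in REPORT_FIELDS:
            if field in event and field not in merged:
                merged[field] = event[field]  # first value seen wins
    lines: List[str] = [f"New organization: {org}"]
    for field in REPORT_FIELDS:
        lines.append(f"- {field}: {merged.get(field, 'unknown')}")
    return "\n".join(lines)
```

Fields that never appear in the logs are reported as `unknown` rather than guessed, which mirrors the goal of keeping the report grounded in queried data.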

Below is the investigation prompt used by the customer support team, and an example of the automatically generated report:

Investigation Reports is a great illustration of what LLMs are genuinely good at: taking a well-framed task with relevant context and producing a structured, actionable summary faster than any human could. Every monitor notification now comes with relevant information to speed up resolution.

We'll be building further on this capability in the coming months — using AI alongside Bronto's logging platform to help teams reduce toil, resolve issues faster, and extract more value from their data.
