DEV Community

Manas Sharma

I Built a Dashboard in 30 Seconds with AI

The Problem

It's 2 AM. An alert fires. Cart service is throwing errors. You've got five minutes before someone escalates.

The runbook says: "Check the dashboard. Look at the logs." But which dashboard? What query? You're half-asleep, the alert description tells you nothing useful, and now you're supposed to write SQL from scratch while someone in Slack asks "any update?"

Most of us have been there. And most runbooks were written by someone who never had to use them under pressure.

What if you could just type: "cart is throwing errors. find the root cause." and get a real answer?

That's what I tested with the new AI Assistant in OpenObserve. Here's what happened.


It's Not Anomaly Detection. It's Something Simpler.

Most AI + observability discussions jump straight to anomaly detection or ML-powered forecasting. Those are interesting. But the thing that's actually changing how I work right now is simpler: an assistant embedded in the platform that lets me ask questions in plain English and get answers from my own production data.

No SQL. No PromQL. Just describe what you want.

I ran four real scenarios against live data from an otel-demo microservices app and a Kubernetes cluster. Here's how each one went.


1. The Dashboard Request That Normally Kills Your Afternoon

Someone from the business team asks for a dashboard. They don't know SQL. They don't know PromQL. They just want to see what's happening with nginx — request rate, how fast it's responding, how many errors.

Normally this kills thirty minutes: finding the right log stream, writing queries, dragging panels, tweaking units.

Instead, I typed:

```
create a dashboard for my nginx logs showing request rate, latency percentiles, and 4xx vs 5xx errors.
```

Thirty seconds later I had a production-ready dashboard. It picked the right log stream. It listed the relevant fields. It wrote the SQL queries. It chose appropriate visualizations — line chart for request rate, heatmap for latency distribution, stacked bar for status codes. These were real queries against actual data. Not a template.

Here's what stuck with me: the person who asked for this could have done it themselves. They don't need to know what a PromQL query looks like. They just describe what they want to see.
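To make the "real queries, not a template" claim concrete: the 4xx-vs-5xx panel boils down to a bucketed aggregation. The stream and field names here (`nginx_access`, `status`) are my assumptions, not OpenObserve's actual schema, and I'm running the query against an in-memory SQLite table just so the shape is verifiable.

```python
import sqlite3

# Toy stand-in for an nginx access-log stream; table and field names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nginx_access (ts INTEGER, status INTEGER)")
conn.executemany(
    "INSERT INTO nginx_access VALUES (?, ?)",
    [(0, 200), (0, 404), (60, 500), (60, 502), (120, 403)],
)

# The kind of 4xx-vs-5xx query an assistant might generate for a stacked-bar panel.
rows = conn.execute("""
    SELECT CASE
             WHEN status BETWEEN 400 AND 499 THEN '4xx'
             WHEN status BETWEEN 500 AND 599 THEN '5xx'
           END AS class,
           COUNT(*) AS requests
    FROM nginx_access
    WHERE status >= 400
    GROUP BY class
    ORDER BY class
""").fetchall()
print(rows)  # [('4xx', 2), ('5xx', 2)]
```

Writing this by hand at 2 PM is easy; the point is that the business-team requester never had to.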


2. Same Thing, Different Domain: Infrastructure

Application logs worked. But what about infrastructure?

```
build a K8s host metrics dashboard showing CPU, memory, disk per node.
```

Completely different data source — Kubernetes metrics, not nginx logs. Same experience. The assistant figured out where the data lived, what metrics to pull, and how to visualize them.

What impressed me was the panel design: usage per node plus a cumulative view across the cluster, with separate tabs for CPU, memory, and disk. It understood that "CPU per node" implies a time series grouped by host, not a single aggregate gauge. That's the kind of design decision a human SRE makes after looking at the data — and the assistant just did it.

The assistant had enough context about the infrastructure to know what clusters were running and what hosts were connected. I didn't explain my setup. It already knew.
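That "time series grouped by host" decision can be sketched in a few lines. The sample shape `(host, unix_ts, cpu_percent)` is my assumption for illustration, not the platform's metric schema:

```python
from collections import defaultdict

# Assumed sample shape: (host, unix_ts, cpu_percent).
samples = [
    ("node-a", 0, 35.0), ("node-b", 0, 60.0),
    ("node-a", 60, 40.0), ("node-b", 60, 62.0),
]

# "CPU per node": one ordered series per host, not a single aggregate gauge.
per_node = defaultdict(list)
for host, ts, cpu in sorted(samples, key=lambda s: s[1]):
    per_node[host].append((ts, cpu))

# Cluster-wide view for the cumulative panel: mean across nodes per timestamp.
by_ts = defaultdict(list)
for _host, ts, cpu in samples:
    by_ts[ts].append(cpu)
cluster_avg = {ts: sum(v) / len(v) for ts, v in by_ts.items()}

print(dict(per_node))
print(cluster_avg)  # {0: 47.5, 60: 51.0}
```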


3. Proactive: Don't Wait Until Something Breaks

Dashboards are great, but nobody wants to stare at them all day. I wanted to see if I could use the assistant proactively — scan everything, find problems before they escalate.

```
what's the health of the otel-demo right now? if anything is red, create an alert.
```

This isn't asking for one dashboard or one service. It's saying: scan all services, tell me how we're doing, and if something looks off, lock in an alert so I'm covered.

It checked error rates and latencies across every service. Found the ones running green, identified the ones that weren't. And for anything red — it created an alert. Right there. No configuration. No navigating to the alerts page.

This is the kind of thing most teams only set up after an incident, during the postmortem, when someone says "we should have caught this earlier." One sentence and you're covered before the page goes off.


4. Something's Actually Broken: Root Cause Analysis

Now the real test. The cart service in the otel-demo app is throwing errors. Not a synthetic scenario — a real incident.

```
otel-demo app cart is throwing errors. find the root cause.
```

What happened next is worth breaking down step by step:

  1. It searched across both logs and traces — not one or the other, both at once
  2. It looked for errors in the last six hours and found none
  3. It automatically widened the search window — I didn't tell it to do that
  4. It identified the pattern: cart service failing on database writes under load
  5. It showed me the exact traces, the error distribution over time, and the specific downstream call that was failing

Every step was visible. I could expand any tool call, see the exact query it ran, and verify the result. It's not a black box. It shows its work — and if I disagreed with where it was going, I could redirect it.
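Step 3 (automatically widening the search window) is simple to sketch. `search_errors` here is a hypothetical stand-in for whatever log/trace query the assistant actually runs; the window sizes are mine, not OpenObserve's:

```python
def widening_search(search_errors, windows_hours=(6, 24, 72)):
    """Retry the same query over progressively larger lookback windows,
    returning the first window that produced hits."""
    for hours in windows_hours:
        hits = search_errors(hours)
        if hits:
            return hours, hits
    return None, []

# Hypothetical query: errors only appear once we look back further than 6 hours.
def fake_search(hours):
    return ["cart: db write timeout"] if hours >= 24 else []

print(widening_search(fake_search))  # (24, ['cart: db write timeout'])
```

The useful part in the real tool is that this retry loop is visible as discrete tool calls, so you can see exactly which window finally surfaced the errors.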

Once I had the root cause, I stayed in the same conversation:

```
alert me if cart error rate crosses 10 errors in 5 minutes.
```

Same context. Same conversation. Investigation to prevention in two sentences.
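The rule itself reduces to a sliding-window count. This sketch assumes per-error timestamps and a trailing 5-minute window, which is not necessarily how OpenObserve evaluates alerts internally:

```python
def should_alert(error_ts, now, threshold=10, window_s=300):
    """Fire if more than `threshold` errors landed in the trailing window."""
    recent = [t for t in error_ts if now - window_s <= t <= now]
    return len(recent) > threshold

# One error every 20 seconds: 15 errors in the last 5 minutes.
errors = list(range(0, 300, 20))
print(should_alert(errors, now=300))  # True
```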

That last part is what I keep coming back to. The assistant doesn't just help you find problems — it helps you lock in the fix so you don't get paged for the same thing at 3 AM next week.


Beyond the UI: Take It to Your IDE

Here's the part that changes the workflow entirely. You don't have to be inside the OpenObserve UI to get this.

OpenObserve exposes all of this through an MCP server. Connect your AI coding assistant (Claude Code, Cursor, whatever you use) directly to your production observability data. One command:

```shell
claude mcp add o2 https://api.openobserve.ai/api/default/mcp \
  -t http \
  --header "Authorization: Basic <YOUR_TOKEN>"
```

That's it. Under five minutes. Now your IDE can query production logs, metrics, and traces. Debug a deploy from your terminal. Pull up a trace without leaving your editor. Check error rates during a code review.

The assistant follows you wherever you work — not just inside the observability platform.


What This Actually Changes

There's been a lot of noise about AI in observability. Most of it falls into two camps:

  • Anomaly detection — useful in theory, unpredictable in practice, hard to trust
  • AI replaces on-call — not happening, and most engineers don't want it to

The thing that's working right now is neither of those. It's reducing the friction between "something is wrong" and "here's what I know."

Not replacing your judgment. Not replacing your experience. Just removing the parts of incident response that feel like operating a query builder with one eye open at 2 AM.

From "I need to see what's happening" to "I know what happened and we're covered next time" — in one conversation.



Have you tried connecting AI assistants to your observability stack? What's working? What's still painful? Drop a comment — I'm genuinely curious what others are seeing.
