Over the last two years I’ve watched observability tools quietly turn into something else. They stopped being “more dashboards” and started behaving like a junior SRE who never sleeps: one that spots anomalies, suggests root causes and even drafts post-mortems.
As a CTO working with multiple SaaS and data platforms at Pynest, I see the same pattern: once environments become cloud-native and globally distributed, humans alone cannot keep up. That’s where AI-driven observability — Dynatrace, New Relic, Grafana and similar platforms — stops being “nice to have” and becomes basic survival.
The New Relic Observability Forecast, for example, reports that adoption of AI technologies is already the top driver for observability initiatives, and organizations that use AI-driven observability see materially higher business value and ROI.
Why We Moved To AI-Driven Observability
Before we brought AI into the stack, our setup looked like many others:
- APM in one tool
- Logs in another
- Custom business dashboards in something home-grown
During a major incident, we would open six tabs and start manual correlation. MTTR was often measured in hours, not minutes.
We decided to experiment with AI-enabled observability for three reasons:
- Too many moving parts. Microservices, serverless, multiple clouds — no one person could hold the dependency graph in their head.
- Alert fatigue. People either ignored alerts or tuned them so aggressively that real problems slipped through.
- Expensive war rooms. Every serious outage meant half the senior team on a call at 2 a.m.
Platforms like Dynatrace now ship AI engines such as Davis® AI that continuously analyze dependencies and telemetry to detect anomalies and pinpoint root causes, aiming to move customers from reactive to preventive operations. That matched exactly what we needed.
How AI-Powered Observability Works In Practice
In our case, the AI layer sits on top of metrics, logs, traces and events coming from OpenTelemetry and vendor agents. Day to day, it changes work in a few concrete ways:
- Automated anomaly detection. Instead of static thresholds, the system learns “normal” for each service and flags deviations; a minimal sketch of that idea follows this list.
- Intelligent alerting. We get one incident with a causal graph rather than 200 near-identical alerts.
- Root-cause hints. The platform suggests where the regression likely started — a specific deployment, database, or external dependency.
- AIOps hooks. For a handful of well-understood scenarios, incidents trigger runbooks or rollback pipelines automatically.
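To make the first point concrete, here is a minimal sketch of the rolling-baseline idea behind “learning normal”, in plain Python rather than any vendor’s engine. The service name, window size and z-score threshold are illustrative assumptions, not tuned values from our setup.

```python
# Hypothetical sketch: flag a sample as anomalous when it deviates from a
# rolling baseline by more than `z_threshold` standard deviations.
from collections import deque
from dataclasses import dataclass
import statistics


@dataclass
class Anomaly:
    service: str
    value: float
    zscore: float


class RollingBaselineDetector:
    def __init__(self, service: str, window: int = 60, z_threshold: float = 3.0):
        self.service = service
        self.samples = deque(maxlen=window)  # the last N samples define "normal"
        self.z_threshold = z_threshold

    def observe(self, value: float) -> Anomaly | None:
        anomaly = None
        if len(self.samples) >= 30:          # wait for a minimal baseline first
            mean = statistics.fmean(self.samples)
            stdev = statistics.pstdev(self.samples) or 1e-9
            z = (value - mean) / stdev
            if abs(z) > self.z_threshold:
                anomaly = Anomaly(self.service, value, z)
        self.samples.append(value)
        return anomaly


# Usage: feed per-minute p95 latency samples (ms) for one service.
detector = RollingBaselineDetector("checkout-api")
for latency_ms in [120, 118, 125, 122, 119] * 10 + [480]:
    hit = detector.observe(latency_ms)
    if hit:
        print(f"{hit.service}: {hit.value} ms looks anomalous (z={hit.zscore:.1f})")
```

Commercial engines obviously go much further (seasonality, multivariate baselines, dependency graphs), but the core shift is the same: the baseline comes from each service’s own recent history, not from a hand-picked threshold.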
On a real client project (a payment platform with strict SLAs), that combination cut the time to identify the root cause by roughly 40% and reduced noisy alerts by more than half. The important part is not that “AI solved everything”, but that humans now spend more time deciding what to do, and less time just figuring out what is happening.
Observability leaders have been pushing in this direction for years. As Charity Majors, cofounder and CTO at Honeycomb, likes to emphasize, real observability is about answering new, unanticipated questions about your system — not just staring at three fixed “pillars”. AI simply helps teams explore those questions faster when the data volume is beyond human scale.
Similarly, in a recent Grafana Labs article, Ben Sully, Senior Software Engineer, describes how their AI assistant helps teams resolve incidents faster, reduce alert fatigue and guide investigations step by step, instead of leaving engineers alone with a wall of charts.
Has AI Lived Up To The Hype?
Short answer: yes, but only if you treat it as augmentation, not autopilot.
Where AI delivers:
- Noise reduction. Correlated, de-duplicated alerts are a huge win for SRE teams.
- Faster incident triage. Suggested root causes are often “good enough” to start remediation immediately.
- Better cloud cost conversations. When you can quantify how a noisy service hurts both reliability and spend, it’s much easier to prioritize fixes.
Where it still struggles:
- Garbage in, garbage out. If telemetry is poor or incomplete, AI will confidently point you in the wrong direction.
- Business context. The system doesn’t know that “checkout in Germany is more critical than an internal admin panel”, unless you encode that explicitly.
- Trust. Engineers must understand why a suggestion was made, not just see a probability score.
In that sense, AI-powered observability is forcing us to do the unglamorous homework: better instrumentation, cleaner data models, clear priorities.
The Organizational Change No One Talks About
Most articles focus on tools. In reality, AI observability changes teams first.
What we had to adjust at Pynest and with our clients:
- SRE as product managers. SRE and platform teams now spend more time designing signals and workflows for the AI, not just maintaining dashboards.
- On-call as “AI pair programming”. On-call engineers learn to interrogate the AI — “show me similar incidents”, “what changed in the last 15 minutes?” — instead of clicking charts by hand.
- New skills. We look for people who are comfortable with both telemetry data and business impact, because they are the ones who can tune the models meaningfully.
There are good external examples too. Air France-KLM’s IT leadership publicly highlighted how Dynatrace’s AI and prediction capabilities give them early warnings, reduce operational impact and support more sustainable operations. That is exactly the kind of story CIOs want to tell to their boards.
Lessons Learned, Pitfalls And What We’d Do Differently
If I were starting again, I’d structure an AI-observability journey around three phases:
1. Stabilize the basics.
- Standardize telemetry (OpenTelemetry where possible); a minimal instrumentation sketch follows this list.
- Define a small set of truly critical user journeys and SLOs.
- Turn off 60–70% of legacy alerts before adding anything new.
2. Add AI in narrow, high-value paths.
- Start with anomaly detection around those critical journeys.
- Use AI-driven RCA suggestions, but always confirm with raw data.
- Let the platform draft post-mortems, with humans editing them.
3. Automate carefully.
- Automate only well-understood fixes (safe rollbacks, cache flushes).
- Keep a human approval step for anything that can touch customer data or money.
- Track metrics like MTTR, number of incidents per month, and cloud spend, so you can prove value.
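To make phase one less abstract, here is roughly what “standardize telemetry” looks like at the code level: a minimal tracing setup with the OpenTelemetry Python SDK. The service name, collector endpoint and span attributes are placeholders for your own environment, and the snippet assumes the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages are installed.

```python
# Minimal OpenTelemetry tracing setup; names and endpoint are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# One provider per process, tagged with a service name so every backend
# (commercial or open source) can group telemetry by service.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-api"}))
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def charge_card(order_id: str, amount_cents: int) -> None:
    # Wrap a critical step of a critical journey in a span and attach the
    # attributes an AI layer (or a human) will need for correlation later.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        ...  # call the payment provider here
```

Once every critical journey emits spans with consistent names and attributes like these, both the open-source stack and the commercial AI layer have something trustworthy to reason about.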
The biggest pitfall I see is treating AI observability as “yet another dashboard project”. It isn’t. It is a long-term change in how teams see and operate systems.
How We Approach This At Pynest
At Pynest, we rarely “sell tools”. Instead, we help clients redesign how they run critical workloads.
On one engagement with a European fintech, we combined:
- A commercial AI-enabled observability platform for deep automatic analysis
- An open-source stack (Prometheus, Loki, Tempo, Grafana) for flexibility
- Custom SLO and error-budget logic tied to business KPIs (the basic arithmetic is sketched below)
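The SLO and error-budget logic on that engagement is client-specific, but the arithmetic underneath is simple enough to show. Below is a hedged sketch; the 99.9% availability target, the 30-day window and the request counts are purely illustrative numbers, not figures from the project.

```python
# Sketch of basic error-budget math for an availability SLO.
# All numbers below are illustrative placeholders.
from dataclasses import dataclass


@dataclass
class SloStatus:
    budget_fraction: float   # allowed fraction of failed requests (1 - SLO target)
    budget_consumed: float   # share of the window's budget already spent
    burn_rate: float         # >1.0 means the budget runs out before the window ends


def availability_slo_status(total_requests: int,
                            failed_requests: int,
                            slo_target: float = 0.999,
                            window_days: int = 30,
                            elapsed_days: float = 7.0) -> SloStatus:
    budget_fraction = 1.0 - slo_target                   # e.g. 0.1% of requests may fail
    error_rate = failed_requests / max(total_requests, 1)
    # Burn rate: observed error rate relative to the allowed error rate.
    burn_rate = error_rate / budget_fraction
    # Budget consumed so far, assuming traffic stays roughly uniform over the window.
    budget_consumed = burn_rate * (elapsed_days / window_days)
    return SloStatus(budget_fraction, budget_consumed, burn_rate)


# One week into a 30-day window: 12M requests, 18,000 failures, 99.9% target.
status = availability_slo_status(total_requests=12_000_000, failed_requests=18_000)
print(f"budget consumed: {status.budget_consumed:.0%}, burn rate: {status.burn_rate:.1f}x")
# -> budget consumed: 35%, burn rate: 1.5x
```

A burn rate above 1.0 is what should page someone; the budget-consumed figure is what product teams and leadership use when trading off new features against reliability work.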
The result wasn’t just fewer alerts. It was a new operating model: product teams owning their SLOs, SREs curating signals and AI workflows, and leadership finally seeing a single, trusted picture of reliability and cost.
That, in my view, is where AI-powered observability really pays off. Not when it magically “finds bugs with machine learning”, but when it becomes the shared language between engineering, operations and the business.
For CIOs looking at 2026, I’d summarize it this way:
Don’t start with “Which tool should we buy?”. Start with “Which decisions do we want this system to help us make faster and with more confidence?”
If you can answer that clearly, AI-driven observability has a good chance to deliver on its promises — and maybe even let your SRE team sleep a bit more.