Root Cause Analysis Across Every Signal, On One Screen

#sre #devops #monitoring #kubernetes

Automated root cause analysis that reads logs, metrics, and traces together and cites every claim to its source, so you cut MTTR instead of hopping tabs.

It's 2:11am. Checkout is throwing 500s and compose-post-service is screaming, seven times its normal error volume, top of every dashboard you own. So you start there. Twenty-five minutes in, someone asks the question that ends the incident: "is the auth service even up?" It isn't. user-service let a TLS cert expire and logged almost nothing on the way down. The loudest service was the victim. The quiet one was the cause.

That gap, between the service with the most errors and the service that actually broke, is where most of your mean-time-to-resolution goes. Real root cause analysis isn't "find the noisiest service." It's reading logs, metrics, and traces together to find who everyone else is pointing at, then proving it. Epok does that automatically and cites every claim back to the exact log line, span, or metric, so you cut MTTR instead of assembling the answer across six tabs at 2am.

Why the loudest service is rarely the root cause

We ran this as a controlled experiment because the pattern is so consistent. Twenty services, steady traffic, mature baselines. Then we injected a cascade.

user-service starts failing TLS handshakes. It barely logs: a cert error is a few lines, then silence. Downstream, compose-post-service can't authenticate anyone, so it throws on every request. Thousands of errors. The loud victim. home-timeline-service falls over too, also pointing back upstream.

Sort by error count and you investigate compose-post-service first. It has 7x the logs. It's wrong. Error volume measures distance from the failure, not proximity to it: the services furthest downstream shout the most, because every request they handle now fails. Pick by loudness and the cascade hands you a decoy.

The causal information is sitting right there in the text. When service A can't reach service B, A's error says so:

compose-post-service: "User authentication failed: user-service returned 500"
compose-post-service: "Cannot validate user token: connection to user-service timed out"
home-timeline-service: "Timeline generation failed: user-service unavailable"

Different services, all naming the same upstream. user-service is referenced in errors from three other services. compose-post-service is referenced by zero. The thing everyone points at is the cause. A human gets there eventually, usually after the 25-minute detour through the victim. The machine should get there in thirty seconds, because counting "who points at whom" is exactly the bookkeeping software is good at and tired engineers are not.
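To make that bookkeeping concrete, here is a minimal Python sketch of the who-points-at-whom count, run on the three lines quoted above. It is an illustration under stated assumptions, not Epok's implementation: the SERVICES inventory and the function names are hypothetical, and it uses lookarounds rather than a plain \b so that a hyphenated name only matches as a whole token.

```python
import re
from collections import defaultdict

# Hypothetical inventory of known service names; in practice this might
# come from service discovery or from the set of log sources.
SERVICES = ["user-service", "compose-post-service", "home-timeline-service"]

# (emitting service, error message) pairs copied from the excerpt above.
ERRORS = [
    ("compose-post-service", "User authentication failed: user-service returned 500"),
    ("compose-post-service", "Cannot validate user token: connection to user-service timed out"),
    ("home-timeline-service", "Timeline generation failed: user-service unavailable"),
]

def name_pattern(name: str) -> re.Pattern:
    # Service names contain hyphens, so a bare \b would let "user-service"
    # match inside "super-user-service". These lookarounds treat letters,
    # digits, underscores, and hyphens as part of a name.
    return re.compile(rf"(?<![\w-]){re.escape(name)}(?![\w-])")

PATTERNS = {name: name_pattern(name) for name in SERVICES}

def count_references(errors):
    """Map each service to the set of other services whose errors name it."""
    referenced_by = defaultdict(set)
    for emitter, message in errors:
        for name, pattern in PATTERNS.items():
            # A service naming itself is not a cross-service reference.
            if name != emitter and pattern.search(message):
                referenced_by[name].add(emitter)
    return referenced_by

refs = count_references(ERRORS)
for name in SERVICES:
    print(name, len(refs[name]))
```

The excerpt contains errors from two emitting services, so this prints 2 for user-service and 0 for the other two. Counting distinct emitters rather than raw lines is the design choice that matters: a single loud victim repeating the same complaint a thousand times still counts once.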

What automated RCA actually has to do

No single signal hands you the answer; the evidence is scattered on purpose. The error text names the culprit, but only if you read across services instead of within one. The traces, where you have them, draw the dependency edge from victim to culprit directly. The metrics catch the inflection earliest: user-service's handshake-failure rate bends before the downstream flood starts, which tells you the order of events.
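The "which signal moved first" check can be sketched as a threshold test against each series' own baseline. The numbers below are invented for illustration (they are not from our run), and the mean-plus-three-sigma rule is an assumed stand-in for whatever baseline a real detector maintains.

```python
from statistics import mean, stdev

def first_inflection(series, baseline_points=5, k=3.0):
    """Index of the first sample above baseline mean + k * stdev, or None."""
    baseline = series[:baseline_points]
    # Guard against a perfectly flat baseline, where stdev would be zero.
    threshold = mean(baseline) + k * max(stdev(baseline), 1e-9)
    for i in range(baseline_points, len(series)):
        if series[i] > threshold:
            return i
    return None

# Invented per-minute samples; the first five points act as the baseline.
SERIES = {
    "user-service handshake failures/min": [0, 1, 0, 1, 0, 0, 9, 12, 11, 12],
    "compose-post-service errors/min": [2, 3, 2, 3, 2, 2, 3, 40, 300, 900],
}

inflections = {name: first_inflection(values) for name, values in SERIES.items()}

# Order the signals by when they bent; series that never bent are dropped.
order = sorted(
    (name for name, idx in inflections.items() if idx is not None),
    key=inflections.get,
)
for name in order:
    print(inflections[name], name)
```

With these made-up series, the handshake-failure rate crosses its threshold at minute 6 and the downstream error flood at minute 7. The downstream peak is far taller, but the ordering only looks at when each series bent, not how far.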

Each is a partial view, and the work that eats your night is correlation: logs in one tab, APM in another, a dashboard in a third, your eyes lining up timestamps across all of them. Worse than tedious, it's biased. Whichever signal you open first anchors you. Start in the log-volume chart and you anchor on the loud victim.

Epok reads the signals together and ranks the probable cause from all of them at once. It pulls service names out of error text with word-boundary matching, counts cross-service references, checks which signal moved first, weighs the error category (auth, connection, and timeout errors point upstream), and folds in trace dependency edges and severity. In our run, user-service ranked first despite having one-seventh as many error logs as the victim, because the ranking weighs who-points-at-whom and what-moved-first, not raw volume. None of those steps is clever on its own. Doing all of them in the seconds after an alert, automatically, on every incident, is the product.
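As a rough sketch of how signals like these could be folded into one ranking, consider a weighted score per candidate service. Everything here is an assumption made for illustration: the field names, the weights, and the sample values are invented, severity is left out for brevity, and this is not Epok's actual scoring.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    service: str
    referenced_by: int           # distinct other services whose errors name it
    moved_first_rank: int        # 0 = earliest inflection among candidates
    upstream_category_hits: int  # auth/connection/timeout errors naming it
    inbound_trace_edges: int     # failing callers seen in traces
    error_count: int             # raw volume, deliberately not scored

def score(c: Candidate) -> float:
    # Illustrative weights only. Note that error_count never appears.
    return (
        3.0 * c.referenced_by
        + 2.0 / (1 + c.moved_first_rank)
        + 1.0 * c.upstream_category_hits
        + 1.5 * c.inbound_trace_edges
    )

# Invented values shaped like the cascade described above.
candidates = [
    Candidate("compose-post-service", 0, 1, 0, 0, error_count=700),
    Candidate("home-timeline-service", 0, 2, 0, 0, error_count=300),
    Candidate("user-service", 3, 0, 3, 2, error_count=100),
]

ranked = sorted(candidates, key=score, reverse=True)
for c in ranked:
    print(f"{score(c):6.2f}  {c.service}  (errors: {c.error_count})")
```

With these inputs user-service scores 17.0 and the loud victim scores 1.0, even though the victim has seven times the error count. Keeping raw volume out of the score entirely is one simple way to make sure loudness cannot outvote the evidence.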

Cited root cause beats a confident summary

Plenty of tools now generate an incident narrative. The question that matters is whether you can check it. A summary that says "likely caused by user-service" and shows nothing is a coin flip you're asked to trust at 2am. We've been on the receiving end of enough wrong-but-confident AI summaries to refuse to ship one.

Every line of Epok's root cause draft links to its evidence: the error string and which service emitted it, the span where the latency went, the metric and timestamp where it bent. Click "user-service is referenced by three other services" and you land on those three log lines. The verdict is falsifiable in one click. It also lets Epok be honest about uncertainty: when the evidence is thin, it says so and ranks lower instead of laundering a guess into a headline. A confident wrong answer costs you more than no answer.
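One way to picture a cited claim is as a data structure where the statement cannot exist apart from its evidence. The sketch below is a hypothetical shape, not Epok's data model: the Evidence and Claim types, the ref locator format, and the support thresholds are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Evidence:
    kind: str      # "log", "span", or "metric"
    service: str   # which service emitted it
    ref: str       # hypothetical locator, e.g. a log line ID
    excerpt: str

@dataclass
class Claim:
    text: str
    evidence: list[Evidence] = field(default_factory=list)

    def support(self) -> str:
        # Assumed thresholds: say so plainly when the evidence is thin.
        if not self.evidence:
            return "unsupported"
        if len(self.evidence) < 2:
            return "thin"
        return "supported"

def render(claim: Claim) -> str:
    lines = [f"{claim.text} [{claim.support()}]"]
    for i, e in enumerate(claim.evidence, start=1):
        lines.append(f"  [{i}] {e.kind} from {e.service} ({e.ref}): {e.excerpt}")
    return "\n".join(lines)

claim = Claim(
    "user-service is named in downstream errors",
    [
        Evidence("log", "compose-post-service", "log:example-1",
                 "User authentication failed: user-service returned 500"),
        Evidence("log", "home-timeline-service", "log:example-2",
                 "Timeline generation failed: user-service unavailable"),
    ],
)
print(render(claim))
print(render(Claim("user-service deployed a bad config")))
```

The second claim has no evidence attached, so it renders as unsupported instead of reading like a verdict. That is the property worth copying from any design like this: the confidence label is derived from the citations, not written by hand.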

Where each signal earns its keep

This cascade was error-shaped, so the text carried most of the load. That won't always be true. Error text wins when the culprit is un-instrumented: databases, caches, queues, cron jobs, and network gear rarely carry a trace agent but show up by name in downstream errors, and a cert expiry or DNS change produces connection errors naming the failing endpoint. Traces win for latency degradation, where nothing errors but time is vanishing in the call chain. Metrics win for the early inflection and slow drift no single log line reveals. Same ranking engine, different evidence. The point of multi-signal RCA isn't that one signal beats another. It's that Epok uses whichever is present instead of failing because your culprit skipped the one your tool depends on. That's also why "silent" failures are so dangerous, and we go deep on those in the incidents that hide between alerts.
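For the latency case, where nothing errors but time is vanishing in the call chain, a common way to read a trace is by self time: each span's duration minus the time spent in its children. The sketch below assumes children run sequentially without overlapping, and the spans, durations, and the user-db name are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float

def self_times(spans):
    """Per-span self time: own duration minus the sum of direct children."""
    child_total = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.duration_ms
    # Clamp at zero in case overlapping children exceed the parent's duration.
    return {s.span_id: max(s.duration_ms - child_total[s.span_id], 0.0) for s in spans}

# Invented trace: a slow request with no errors anywhere in it.
TRACE = [
    Span("a", None, "compose-post-service", 1200.0),
    Span("b", "a", "user-service", 1150.0),
    Span("c", "b", "user-db", 40.0),  # hypothetical datastore span
]

times = self_times(TRACE)
slowest = max(TRACE, key=lambda s: times[s.span_id])
print(slowest.service, times[slowest.span_id])
```

The root span has the longest total duration, but nearly all of it is spent waiting on its child. By self time, user-service owns 1110 ms of the 1200 ms, which is the trace-shaped version of the same lesson: the span at the top of the list is not necessarily where the time went.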

The honest comparison with APM

Datadog's Watchdog and New Relic's Decisions are genuinely good at correlating alerts into one incident, and when APM is instrumented across the call chain, both can surface the origin. We won't pretend otherwise. The catch is where that answer lives and how much you assemble. The "which service" verdict sits in the trace product, so you leave the alert, open APM, read the call graph, and correlate it back yourself. When the culprit is un-instrumented, the trace-based origin signal goes quiet and you're grepping logs in a separate tab.

The assembly is the real cost: not the license but the twenty-five minutes of tab-hopping per incident, multiplied across every page and a tired on-call rotation. Epok collapses that to one screen (detection, ranked cause, what changed, and blast radius) with the evidence one click deep. The tool finds it and proves it instead of handing you a search bar. We make that argument in full in detection-first observability.

What it costs

Most platforms meter the signals separately: a per-GB log rate, a per-host or per-instrumented-service APM charge, and a cardinality penalty on top. The bill grows with how many signals you turn on. For ~1 TB/month of logs across 20 services, the per-host APM line item alone typically runs into the high hundreds of dollars per month on the major platforms, on top of ingestion. (Competitor pricing reflects public pricing pages as of Q1 2026. Vendor pricing changes; verify current rates.)

Epok is one flat price with the signals included, not metered. The 14-day trial opens every feature on roughly 1 TB of volume. Team is $199/month (1 TB), Growth is $599/month (4 TB), Custom starts at $5,000/month; overage is $0.20/GB. There's no free-forever tier; verify current pricing at getepok.dev/pricing. Detection and cross-signal root cause aren't a separate SKU, they're the product. The point isn't "skip APM to save money." It's that your signals land in one tool that reads them together and shows its work, instead of three meters and three tabs you stitch by hand.
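As a worked example of the arithmetic, here is a small sketch that estimates a monthly bill from the plan prices quoted above. It rests on assumptions the pricing page should confirm: that 1 TB means 1,000 GB, that overage is billed linearly per GB above the included volume, and that it applies the same way on both plans.

```python
GB_PER_TB = 1000        # assumption: decimal terabytes
OVERAGE_PER_GB = 0.20   # dollars, from the plan description above

# plan name -> (monthly price in dollars, included volume in TB)
PLANS = {"Team": (199, 1), "Growth": (599, 4)}

def monthly_cost(plan: str, usage_tb: float) -> float:
    """Estimated bill, assuming linear overage above the included volume."""
    base, included_tb = PLANS[plan]
    overage_gb = max(usage_tb - included_tb, 0) * GB_PER_TB
    return base + overage_gb * OVERAGE_PER_GB

for usage_tb in (1, 1.5, 3, 4):
    team = monthly_cost("Team", usage_tb)
    growth = monthly_cost("Growth", usage_tb)
    print(f"{usage_tb} TB: Team ${team:.2f}, Growth ${growth:.2f}")
```

Under those assumptions, 1.5 TB on Team comes to $199 plus 500 GB of overage at $0.20, or $299, and the two plans cost the same at 3 TB ($599). Above that volume Growth is the cheaper plan.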

What the experiment showed, and what it didn't

One synthetic, controlled run. Not a published benchmark, not an accuracy claim about your stack. With that caveat: the cascade collapsed into a focused set of alerts instead of one page per failing service, and the ranked cause pointed at user-service, the quiet culprit, with the top "what changed" entries all referencing it. Your results will vary with your services, your instrumentation, and how your errors are worded.

We built Epok because the information needed to find root cause is almost always already in your telemetry. Collecting it was never the hard part. Reading it together, fast, at 2am, and showing the work so you can trust the answer: that's the part everyone else left to you.

Point your logs, metrics, traces, and RUM at the 14-day trial, break something on purpose, and watch it name the culprit before you've opened a second tab.

FAQ

What is automated root cause analysis?

Automated root cause analysis is when a tool, not an engineer, identifies the service or change that caused an incident and shows the evidence. Epok does it by reading logs, metrics, and traces together: extracting service names from error text, counting which services reference which, checking which signal moved first, and ranking the probable cause with every claim cited to a specific log line, span, or metric.

Why isn't the service with the most errors usually the root cause?

Because error volume measures distance from the failure, not proximity to it. In a cascade, downstream services fail on every request and produce far more errors than the upstream culprit, which may log only a handful of lines before going quiet. Ranking by error count points you at the loud victim. Ranking by who-references-whom and what-moved-first points you at the actual cause.

How does multi-signal RCA reduce MTTR?

It removes the manual correlation step. Instead of opening logs, APM, and dashboards in separate tabs and lining up timestamps by eye, you get a ranked, cited root cause on one screen seconds after the alert. The minutes normally spent assembling that answer, and the bias of anchoring on whichever tab you opened first, are where the MTTR savings come from.

Do I need traces or APM for Epok to find the root cause?

No. Epok uses whatever signal is present. When a service is un-instrumented, common for databases, caches, queues, and network infrastructure, it still appears by name in downstream error text, and Epok reads that. Where traces exist, the dependency edge corroborates the conclusion. Where metrics exist, an early inflection helps order the events.

Can I verify Epok's root cause conclusion?

Yes. Every claim in the root cause draft links to its source: the exact log line, span, or metric. A statement like "user-service is referenced in errors from three other services" links directly to those three lines, so you can confirm the verdict in one click instead of trusting a summary.

Try Epok free. First alerts in minutes.
No credit card. Every detector included, root cause on every incident. Full baseline coverage at 7 days.

https://getepok.dev/blog/multi-signal-root-cause