Jay Saadana

for Steadwing

Posted on Jun 8

Your Logs Have the Answer. You Just Can't Find It Fast Enough.

#ai #sre #kubernetes #devops

Three weeks ago, one of the teams we work with had a checkout outage. The root cause, a malformed database query introduced in a deploy 40 minutes earlier, was sitting in their CloudWatch logs the entire time. Timestamped. Stack-traced. Perfectly clear.

They found it 22 minutes after the alert fired.

Not because they weren't looking. Because they were looking in Elasticsearch first. Their checkout service logs to CloudWatch, but the API gateway that routes to checkout logs to Elasticsearch. The engineer on call didn't remember which was which. So they spent 8 minutes searching Elasticsearch, found nothing relevant, switched to CloudWatch, spent another 6 minutes getting the query syntax right, then another 8 minutes narrowing the time window to find the specific error.

Twenty-two minutes. The log line had been sitting there since minute one.

This isn't a story about a bad engineer or bad tooling. It's a story about what happens when incident data is scattered across platforms that don't talk to each other.

Key Takeaways

The root cause of your last incident was probably in the logs within minutes of the alert firing. Your engineer found it 20 minutes later because they were searching the wrong platform first.
Nobody decides to run three logging platforms. It happens over two years because different teams pick different tools, and by the time you notice, checkout logs to CloudWatch and payments logs to Elasticsearch and nobody has a map.
Log search during an incident is nothing like normal debugging. You're guessing at queries, in a syntax you use once a month, looking for something you can't describe yet, while Slack is asking for a status update.
Steadwing searches all six supported logging platforms in parallel (CloudWatch, Elasticsearch, Loki, GCP Logging, Mezmo, and Scalyr), scoped by alert timestamps, recent deploys, and metric anomalies. The 13–22 minute manual hunt drops to about 30 seconds.
You don't need to migrate to one logging platform. That project takes a year and most teams never finish it. You just need your existing platforms to be searchable as one system when something breaks.

The Logging Landscape Nobody Planned

Here's how it typically happens. Your first few services log to CloudWatch because you're on AWS and it was the default. Then the data team sets up Elasticsearch because they need full-text search on application events. Someone on the platform team introduces Loki because it's lightweight and works well with their Grafana setup. A couple of services that run on GCP use GCP Cloud Logging.

Nobody sat in a room and decided to run four logging platforms. It happened incrementally over two years, and by the time anyone noticed, each platform had different services, different retention policies, different query languages, and different people who knew how to use them.

Dash0's 2025 analysis describes this perfectly: "when logs are spread across disconnected tools, investigations slow down and critical signals get buried in noise." But the standard advice, to consolidate onto one platform, is a multi-quarter migration that most teams never finish. And it doesn't solve the problem for the incidents happening right now.

The practical reality for most engineering teams is that logs will continue to live in multiple places. The question isn't how to fix that. It's how to make it not matter during a P0.

What Log Investigation Actually Looks Like at 2 AM

Let's walk through what happens when an engineer gets paged for a service returning errors.

The first problem is figuring out where to look: Which service is affected? Which platform does that service log to? If it's a cascading failure across multiple services, the logs might be in two or three different platforms. The engineer either knows this from memory or they don't. If they don't, they're checking the wiki, which may or may not be accurate.
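To make the "nobody has a map" problem concrete, here is a minimal sketch of what such a map could look like if it lived in code instead of a wiki. Everything in it is hypothetical: the dictionary, the function name, and the service-to-platform assignments, which are borrowed from the examples in this post.

```python
# Hypothetical service-to-platform map. The assignments mirror the
# examples in this post and are not a real inventory.
LOG_PLATFORM_BY_SERVICE = {
    "checkout": "cloudwatch",
    "api-gateway": "elasticsearch",
    "payments": "elasticsearch",
}


def platforms_to_search(affected_services):
    """Return (platforms, unmapped).

    platforms: sorted list of platforms that hold logs for the services.
    unmapped: services nobody has recorded a platform for.
    """
    platforms = set()
    unmapped = []
    for service in affected_services:
        platform = LOG_PLATFORM_BY_SERVICE.get(service)
        if platform is None:
            unmapped.append(service)
        else:
            platforms.add(platform)
    return sorted(platforms), unmapped


platforms, unmapped = platforms_to_search(["checkout", "api-gateway", "search"])
```

For a cascading failure that touches checkout and the API gateway, the lookup returns both platforms at once. The unmapped list is the useful part: it shows exactly where the map has gaps before the 2 AM page does.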

The second problem is the query itself: CloudWatch Logs Insights, LogQL, Elasticsearch's query DSL, and GCP's logging query language each have their own syntax. The engineer is writing queries in a language they might use once a month, typo-checking field names, waiting for results, getting nothing, adjusting the time window, trying again. Middleware's research puts it bluntly: "only the engineer who built the logging setup actually knows how to query it."
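To see how different those four syntaxes are, here is roughly the same search ("find lines mentioning timeout") written for each one. This is a sketch under assumptions: the `app` label, the `message` field, and the `k8s_container` resource type depend entirely on how your logs are ingested, and the search term is illustrative.

```python
import json

# Illustrative search term. Label and field names below are assumptions
# about how the logs were ingested, not fixed platform conventions.
TERM = "timeout"

QUERIES = {
    # CloudWatch Logs Insights: commands chained with pipes
    "cloudwatch_logs_insights": (
        "fields @timestamp, @message"
        f" | filter @message like /{TERM}/"
        " | sort @timestamp desc"
        " | limit 20"
    ),
    # LogQL (Grafana Loki): a label selector followed by a line filter
    "logql": f'{{app="checkout"}} |= "{TERM}"',
    # Elasticsearch query DSL: a JSON request body
    "elasticsearch_dsl": json.dumps({"query": {"match": {"message": TERM}}}),
    # GCP Logging query language: field comparisons, ":" is a substring match
    "gcp_logging": f'resource.type="k8s_container" AND textPayload:"{TERM}"',
}

for platform, query in QUERIES.items():
    print(f"{platform}: {query}")
```

Same intent, four grammars. Pipes in one, braces and line filters in another, nested JSON in a third. That is the once-a-month syntax tax the engineer pays while the incident clock runs.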

The third problem is time ranges: The alert fired at 2:47 PM but the actual problem might have started at 2:30. Or 2:00. The engineer picks a window and hopes. Too narrow and they miss the cause. Too wide and they're scrolling through thousands of irrelevant lines trying to spot the one that matters.
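One way to take the guesswork out of the window is to anchor it to deploy events rather than to the alert alone. The sketch below is a simple heuristic, not a description of any particular product: the function name, the one-hour lookback, and the five-minute padding are all assumptions chosen for illustration, and so is the date.

```python
from datetime import datetime, timedelta


def search_window(alert_time, deploy_times,
                  lookback=timedelta(hours=1),
                  padding=timedelta(minutes=5)):
    """Pick a log search window around an alert.

    The window starts just before the most recent deploy inside the
    lookback period, or at the start of the lookback period if no
    deploy is found. It ends shortly after the alert.
    """
    earliest = alert_time - lookback
    recent = [t for t in deploy_times if earliest <= t <= alert_time]
    start = max(recent) - padding if recent else earliest
    return start, alert_time + padding


# Illustrative timestamps: alert at 2:47 PM, deploy 40 minutes earlier.
alert = datetime(2025, 6, 1, 14, 47)
deploys = [datetime(2025, 6, 1, 9, 15), datetime(2025, 6, 1, 14, 7)]
start, end = search_window(alert, deploys)
```

With these inputs the window runs from 2:02 PM to 2:52 PM. Anchoring on the most recent deploy keeps the window tight. If several deploys landed inside the lookback, starting from the earliest of them is the more conservative choice.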

The fourth problem, and the one nobody talks about, is that log search without context is basically guessing: The engineer is typing "timeout" or "500 error" or "connection refused" into a search bar, hoping something relevant comes back. But the most useful log search happens when you already know what you're looking for. During an incident, you don't. That's the whole point: you're using logs to figure out what happened. Without knowing which deploy changed what, which metric spiked when, and which alert correlates with which service, the search is unfocused.

This is why log investigation takes 13–22 minutes during a typical incident. Not because the tools are slow, but because the human has to navigate platform fragmentation, query syntax, time window ambiguity, and lack of context simultaneously. Under pressure. While Slack is asking for updates.

The Hidden Cost: Duplicated Effort

There's one more layer that makes this worse.

During a multi-engineer incident, two or three people often search logs independently. Engineer A opens CloudWatch. Engineer B opens CloudWatch. They're running similar queries with slightly different parameters. Neither knows the other is looking.

When someone finally finds the relevant log line, they paste it in Slack. The other engineers have already spent 5–10 minutes on redundant searches. Multiply that across the team and you've burned 15–20 minutes of collective engineering time on work that needed to happen once.

This isn't a coordination failure. It's a tooling gap. If the log search happened once, automatically, with results delivered to everyone, the duplication disappears entirely.

What Parallel Search With Context Looks Like

Steadwing connects to six logging platforms: AWS CloudWatch, GCP Cloud Logging, Elasticsearch, Mezmo, Scalyr, and Grafana Loki.

When an investigation triggers, it doesn't search them one at a time. It queries all connected platforms simultaneously using the alert timestamp from PagerDuty, the recent deploy data from GitHub, and the metric anomalies from Datadog to scope the search precisely.

The engineer doesn't pick a platform. They don't write a query. They don't guess at a time range. The relevant log lines show up in the RCA with timestamps, context, and links back to the source platform, correlated with deploy data, metric changes, error tracking from Sentry, and infrastructure events from Kubernetes.
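The fan-out idea itself is simple to sketch. The code below is not Steadwing's implementation. It is a generic illustration of querying several backends concurrently with Python's asyncio, where `search_platform` is a hypothetical stand-in that returns canned results instead of calling a real logging API, and the timestamps are made up.

```python
import asyncio
from datetime import datetime, timedelta


async def search_platform(name, start, end, services):
    """Hypothetical stand-in for a real platform client (canned results)."""
    await asyncio.sleep(0.01)  # simulates network latency
    canned = {
        "cloudwatch": [{"service": "checkout", "message": "malformed query"}],
    }
    hits = canned.get(name, [])
    # Tag each hit with its source platform so it can be linked back.
    return [dict(hit, platform=name) for hit in hits if hit["service"] in services]


async def search_all(platforms, start, end, services):
    """Query every platform concurrently and merge the hits."""
    tasks = [search_platform(p, start, end, services) for p in platforms]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    merged, failed = [], []
    for platform, result in zip(platforms, results):
        if isinstance(result, Exception):
            failed.append(platform)  # one broken backend should not sink the search
        else:
            merged.extend(result)
    return merged, failed


alert = datetime(2025, 6, 1, 14, 47)  # illustrative alert time
hits, failed = asyncio.run(
    search_all(
        ["cloudwatch", "elasticsearch", "loki"],
        alert - timedelta(minutes=45),
        alert,
        {"checkout"},
    )
)
```

Because the searches run concurrently, total latency is roughly that of the slowest backend rather than the sum of all of them. Collecting failures separately means one unreachable platform still leaves the engineer with results from the others.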

The 22-minute log hunt from the story at the top of this post? The log line was in CloudWatch at minute one. With parallel search and deploy context, Steadwing would have surfaced it in under 30 seconds, already correlated with the deploy that caused it and the fix needed to resolve it.

For Engineering Leaders

The instinct when log investigation is slow is to consolidate platforms. One tool, one query language, one place to search. It makes sense in theory.

In practice, platform consolidation is a 6–12 month project that touches every team's logging pipeline. Most organizations start it and never finish. And it doesn't help with the incidents happening between now and whenever the migration is done.

The alternative: leave your logs where they are and make them searchable as one system during incidents. Steadwing connects to the platforms you already run, queries them in parallel, and delivers the results as part of a complete RCA, alongside metrics, deploys, alerts, and infrastructure data.

No migration. No agents. No code changes. Your logs stay where they are. They just become findable when it matters.
Start free at steadwing.com

Frequently Asked Questions

How does Steadwing search logs across multiple platforms?

When an investigation triggers, Steadwing queries all connected logging platforms in parallel. It uses context from the alert (PagerDuty), recent deploys (GitHub/GitLab), and metric anomalies (Datadog) to automatically scope the search: the right services, the right time window, the right error patterns. Results come back correlated with everything else in the RCA.

Do we need to change our logging setup?

No. Steadwing reads from your logging platforms as they are. Your logs stay in CloudWatch, Elasticsearch, Loki, or wherever they live. No changes to your ingestion pipeline, retention policies, or log format.

What if different services log to different platforms?

That's exactly the problem Steadwing is built for. It doesn't matter if checkout logs to CloudWatch and payments logs to Elasticsearch. When an incident involves both, Steadwing searches both simultaneously and correlates the results.

Which logging platforms are supported?

AWS CloudWatch, GCP Cloud Logging, Elasticsearch, Mezmo (formerly LogDNA), Scalyr, and Grafana Loki. Full details at docs.steadwing.com/integrations.

Top comments (5)

Bap • Jun 9

Great piece, kudos Jay

Mudassir Khan • Jun 9

the 'which was which' detail is the real problem statement. not the query syntax, not the retention policy. the cognitive overhead of knowing which platform to check first is what kills the MTTR.

we hit the same thing adding AI agent observability. an agent runtime, an LLM API call, a vector retrieval step, and a downstream tool call each log to different places. when something fails mid chain, you can't grep for a trace id across four platforms fast enough to answer 'did the model hallucinate or did the retrieval break?'

the 'searchable as one system' framing is right. the migration argument loses every time someone gives it, but the unified query interface argument usually wins. does Dash0 handle semantic search or just syntax level query federation?

Meenal • Jun 12

Good read—this really highlights how MTTR is often lost in context switching, not lack of data. Logs are usually there early, but scattered tools and query friction slow everything down. The real gap is unified context, not just better search.

Ajay Devineni • Jun 12

This title hits exactly right. The problem isn't log volume; it's retrieval latency under pressure.
The checkout story here is familiar. We've had nearly identical incidents where the log line existed from minute one, but the engineer on call was navigating platform memory, not the actual failure.
What makes this worse in AI agent environments specifically: the failure chain isn't linear anymore. An agent runtime, an LLM inference call, a retrieval step, and a downstream tool call each emit signals to different sinks. When something fails mid-chain, you're not asking "what errored"; you're asking "did the model produce bad output, or did the retrieval surface bad context, or did the tool call fail silently?" Those are three different platforms, three different query languages, and three different time offsets on the same causal chain.
I've been working on semantic SLIs for AI agent systems (metrics like Decision Quality Rate and Hallucination Error Rate) specifically because traditional log-based RCA breaks down when the "error" isn't an exception but a semantically wrong output. The log line exists. It just doesn't look like a failure to any existing alert rule.
A few things that consistently move the needle in production:
Enforced log schema at the service boundary: free-text logs are archaeology. Consistent field names (service, trace_id, error_code, causal_step) are searchable in 30 seconds instead of 22 minutes.
Alert-to-log semantic alignment: if your alert fires on a condition that isn't directly queryable in your log index with the same field names, you have a latent MTTR tax. Every incident.
Deploy-correlated search windows: the question isn't "what errored at 2:47 PM." It's "what changed in the 40 minutes before 2:47 PM." Anchoring log search to deploy events cuts the time window problem almost entirely.
The logs always had the answer. The discipline is designing them so that answer is findable at 2am by someone who didn't write the service — and increasingly, by an AI agent that needs to explain why it made the decision it made.
— Ajay Devineni | Sr. SRE/DevOps Engineer | AWS Community Builder

🔗 agentsre (semantic SLIs for AI systems) · github.com/Ajay150313

Yash Mishra • Jun 15

exciting