Nex Tools

Posted on May 18

Claude Code for Log Analysis: How I Stopped Drowning in Stack Traces

#claudecode #observability #devops #productivity

The first time a production incident hit at 2 AM, I spent two hours scrolling through logs before I found the line that mattered. It was a single timestamp buried inside 800,000 entries from the same hour. The bug had been throwing the same exception in a hot loop, drowning out the rare error that actually caused the outage. By the time I found it, the customer impact window had stretched past anything we wanted to be telling the postmortem audience.

That night taught me something. Log analysis is not a search problem. It is a triage problem. The interesting signal is almost never the most frequent line. It is the rare line that happens once or twice and then never again. Humans are terrible at finding rare signals in high-volume noise, especially at 2 AM. Grep is even worse, because grep finds matches but does not rank them.

This is where Claude Code rewired how I do log analysis. The workflow I built turns a wall of unstructured text into a ranked list of things worth investigating, and it does it in seconds rather than hours. I run it every time something goes wrong in production, and it has compressed my median time to root cause from somewhere around an hour to somewhere around five minutes. Here is how the workflow works.

Why Traditional Log Analysis Falls Apart at Scale

The standard tools for log analysis were built for a world where logs were small. Grep, awk, and tail work fine when you have a few thousand lines and a clear idea of what you are looking for. They fall apart at modern volumes for two reasons.

The first reason is that you do not know what you are looking for. You know that something went wrong, but the error message that caught your attention might not be the actual cause. It might be a downstream symptom. The cause is buried somewhere earlier in the timeline, inside a log line that looked unremarkable when it was written.

The second reason is that the signal-to-noise ratio is brutal. A production service running at moderate load produces tens of thousands of log lines per minute. The vast majority of those lines are routine. The interesting ones are needles in a haystack of needles. Even when you find a candidate, you cannot easily tell whether it is rare or common without running another query.

The bottleneck in log analysis is never the speed of the search. It is the speed of pattern recognition across high-volume text. The tools we have are good at search and bad at pattern recognition, which is exactly the wrong tradeoff for the job.

The teams I have worked with that have invested heavily in observability platforms still have this problem. The platforms make ingestion easier but they do not solve the pattern recognition problem. They let you slice the data faster but they still require you to know what slice to ask for. When the incident is novel, you do not know.

If you want context on why I treat observability as a first-class engineering concern, the workflow I described in Claude Code for Observability Stacks lays out the broader system this log analysis workflow plugs into.

The Frequency Skill

The first skill in the workflow handles frequency analysis. Given a log file and a time window, the skill produces a ranked list of every distinct log pattern and how often it occurred.

The interesting part is the normalization. The skill recognizes that two log lines with different timestamps, different request IDs, and different user IDs but the same underlying message template are the same pattern. It strips out the variable parts and groups the lines by template. The output is a list of templates, each with a count, an example line, and a sample of the variable values.

The ranking flips the normal log ordering on its head. The most common patterns are at the bottom, not the top. The rare patterns, the ones that occurred once or twice in the window, float to the top. The list becomes a tour of every unusual thing that happened during the incident window, ranked by how rare it was.

The first time I ran this on a real incident, the cause jumped out from position three on the list. It was a single log line from a connection pool exhaustion event, twelve seconds before the cascade of customer-facing errors started. The line had been there the whole time, but it was invisible inside the 800,000 line haystack.

The Correlation Skill

Once the frequency skill has surfaced rare patterns, the correlation skill ties them together. The skill takes a set of candidate patterns and looks for temporal relationships between them.

The relationships it finds are useful for debugging. If pattern A always appears within five seconds of pattern B, that is worth knowing. If pattern A spikes shortly before pattern C starts firing, that is worth knowing. If pattern D appears only when pattern E has not appeared for several minutes, that is also worth knowing.

The skill also looks at cross-service correlations. When the log stream includes multiple services, the skill can ask whether a pattern in service X correlates with a pattern in service Y. The cross-service view often reveals causes that are invisible inside any single service.

The output is a small graph of temporal relationships. Each edge is annotated with the lag time and the strength of the correlation. The graph is far smaller than the original log volume and is much easier to reason about.

The Hypothesis Skill

The third skill builds hypotheses about what went wrong. Given the frequency analysis and the correlation graph, the skill produces a ranked list of possible root causes.

Each hypothesis comes with the evidence that supports it. The evidence is specific log lines, specific timestamps, specific correlation strengths. The hypothesis is not a guess. It is a falsifiable claim grounded in the actual log data, which means I can validate or reject it quickly.

The ranking is based on how well the evidence supports the hypothesis. A hypothesis that is consistent with every observed pattern is ranked higher than one that only explains part of the data. A hypothesis that contradicts a known pattern is ranked lower.

I treat the top hypothesis as the starting point of my investigation rather than the answer. Sometimes the top hypothesis is correct and I move on. Sometimes it is wrong but the evidence reveals a different cause that the skill missed. Either way, the starting point is much better than scrolling logs from the beginning.

If you want to see how this hypothesis-driven approach extends to incident response broadly, Claude Code for Incident Response covers the full workflow. The log analysis skills described here are the substrate that makes the incident response workflow possible.

The Diff Skill

Long-running incidents have a special challenge. The logs from before the incident and the logs from during the incident look almost identical at the line level, but the distributions are different. The diff skill quantifies the difference.

The skill takes two log windows. One is a baseline, usually a similar period from before the incident. The other is the incident window itself. The skill compares the frequency distributions of every log pattern in the two windows and surfaces the ones that changed the most.

The patterns that increased dramatically in the incident window are usually symptoms. The patterns that decreased are sometimes the most interesting. A drop in the rate of successful operations or healthy heartbeats often points more directly at the cause than the new errors do.

The diff also surfaces patterns that are entirely new. Lines that appear in the incident window but do not appear at all in the baseline are particularly interesting. They are the things the system was not doing in normal operation, which means they are very likely related to the incident.

How the Workflow Runs in Practice

When an incident fires, my first move is to grab the log window. I usually take fifteen minutes before the first customer-visible error and fifteen minutes after. The window is small enough to process quickly and large enough to capture context.

I pass the window to the frequency skill. The output is a ranked list of patterns that takes about ten seconds to skim. The rare patterns at the top are my first candidates. I tag the ones that look interesting.

The correlation skill takes the tagged patterns and produces a small graph. The graph usually reveals the rough order of events. I can see which pattern came first, which came second, which seemed to trigger which.

The hypothesis skill takes everything and gives me a starting point. Maybe two or three candidate root causes, each with supporting evidence. I validate the top hypothesis by checking the evidence directly and either confirming it or ruling it out.

If the incident is long-running, I bring in the diff skill. The baseline comparison is particularly useful when the incident is a gradual degradation rather than a sudden break. The patterns whose rates have drifted reveal the degradation in a way that simple error scanning cannot.

The whole workflow takes ten to fifteen minutes for a typical incident. The actual fix often takes longer than the analysis, which is the opposite of how my workflow used to be balanced.

What This Workflow Did to My Practice

The most measurable change is the median time to root cause. Before this workflow, I would estimate it at sixty to ninety minutes for a typical incident. Today it is closer to five to ten minutes for the same class of incident. The reduction is dramatic enough that I no longer dread getting paged.

The less measurable change is more important. I now expect to understand what happened, not just to make it stop. Pre-workflow, I would frequently hit a state where the system was healthy again but I did not really know why or what had caused the original problem. The temptation was to declare victory and move on. Post-workflow, I almost always have a hypothesis grounded in evidence by the time the system stabilizes. The postmortems are shorter and more accurate because the root cause is already documented.

The third change is in how I think about logs themselves. I used to view logging as a debugging output, something to write more of when I was stuck. Now I view it as a structured signal source that needs to be analyzable at scale. The way I write log statements has changed. I include more context, I use more consistent templates, I avoid baking variable values into the static parts of the message. The downstream tooling works better because the upstream emission is more disciplined.

If you want to take this further, my full set of practical workflows is in the Claude Code Practical Workflows series on DEV.to. The series covers everything from incident response to refactoring to migrations.

FAQ

Does this work for unstructured logs?

Yes. The frequency skill normalizes log lines into templates even when the lines are unstructured. Structured logs are easier to work with, but the workflow does not require them.

What about logs that are too large to process in a single pass?

The skills can sample. For very large windows, the frequency analysis runs on a sample first and then drills down on the interesting patterns at full resolution. The sampling is much faster and usually surfaces the same candidates as the full pass.

Does this replace observability platforms?

No. The observability platform is where the logs live. The skills consume what the platform provides. They complement the platform by adding pattern recognition that the platform itself does not offer.

How do I get started?

Start with the frequency skill. Pick a recent incident, grab the log window, run the analysis, and see what shows up. The first time you find a needle the platform missed, you will know whether the workflow is worth investing in further.

The log analysis workflow is one piece of a larger pattern. The pattern is using Claude Code to add pattern recognition layers on top of tools that were designed for human-scale data and now have to work at machine-scale volumes. Every layer makes a different part of the job tractable. Log analysis was where the payoff was clearest for me, and it is where I would recommend starting if you want to try this on your own systems.