Claude Code for Observability Stacks: How I Stopped Flying Blind in Production

The first real outage I had to debug without proper observability took fourteen hours to resolve. The system was throwing 500s intermittently. Logs showed nothing useful. Metrics showed the error rate climbing but no signal about why. Traces did not exist. I spent the entire day adding log lines, redeploying, watching, and adding more log lines until I finally cornered the root cause.

The fix took eight minutes once I understood what was happening. The other thirteen hours and fifty-two minutes were spent building the observability that should have already been in place.

After that incident, I made a rule. Every service has to have observability built in from day one. Not as a future improvement. Not as something that gets added when there is time. Built in from the first commit. I have kept that rule for years now, and Claude Code is the thing that made it cheap enough to actually follow.

Here is the workflow I use to build and maintain observability across every service I run.


What Good Observability Actually Means

Good observability is the ability to answer questions about your system without having to deploy new code to answer them. When something breaks, you should be able to look at what is already being collected and figure out the answer. When you cannot answer the question with existing data, that is a gap in your observability and the next outage will be the one that exposes it.

The three pillars are logs, metrics, and traces. Each one answers different questions. Logs tell you what happened. Metrics tell you how often it happened and how it changed over time. Traces tell you how a single request moved through the system and where its time went.

Deploying all three pillars does not by itself give you observability. You have observability when you can answer the question you actually need to answer at three in the morning when something is on fire and you have ten minutes to find the root cause. The three pillars are necessary but not sufficient.

The hard part of observability is not picking the tools. The hard part is making sure the right data is being collected in the right shape, with the right labels, at the right cardinality. This is where most observability stacks fail. The tools are deployed but the data is wrong, and at the moment of crisis the answer is not in the system.

The Claude Code workflow targets the data collection problem directly. The skills produce instrumentation that is consistent across services, that captures the right shape of information, and that gets maintained as the services evolve.


The Instrumentation Skill

The instrumentation skill takes a service and adds the structured logs, metrics, and traces that the service needs to be debuggable.

The skill starts by reading the service code and identifying the natural instrumentation points. Public API endpoints get request and response logging with latency metrics and distributed trace spans. Database queries get duration metrics and trace spans with the query type as a label. External API calls get the same treatment, plus retry counts and circuit breaker state. Background jobs get start, complete, and failure events with duration histograms.

The skill applies the same patterns across every service. The endpoint logs have the same fields everywhere. The trace spans have consistent naming. The metrics have a shared set of labels. The consistency is what lets dashboards and alerts work across services without per-service configuration.

The skill avoids over-instrumentation. Adding a log line to every function in a codebase produces enormous volumes of low-value data that drowns out the signals you actually need. The skill instruments at the boundary points where data crosses subsystem lines. Inside a subsystem, only the points that have historically been useful for debugging get instrumented.

The skill also handles the cardinality problem. Metrics with high-cardinality labels explode in storage cost and query latency. The skill identifies the fields that are inherently high-cardinality (request IDs, user IDs) and routes them into logs and traces rather than metric labels, and it keeps metrics on the labels that stay low-cardinality (endpoint paths, error categories).
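To make the boundary pattern concrete, here is a minimal sketch using the OpenTelemetry Python API. The endpoint, metric, and field names are illustrative rather than the skill's actual output, and the data-access call is a stand-in.

```python
# Sketch of boundary instrumentation (OpenTelemetry Python API assumed).
# Names are illustrative; fetch_orders is a stand-in for real business logic.
import json
import logging
import time

from opentelemetry import metrics, trace

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)
logger = logging.getLogger("orders-service")

# Low-cardinality labels only on the metric: endpoint path and status.
request_latency = meter.create_histogram(
    "http.server.duration", unit="ms", description="request latency"
)

def fetch_orders():
    return []  # stand-in for the real data access layer

def handle_get_orders(request_id: str):
    start = time.monotonic()
    status = "200"
    # High-cardinality identifiers go onto the span and the log line,
    # never onto metric labels.
    with tracer.start_as_current_span("GET /orders") as span:
        span.set_attribute("request.id", request_id)
        try:
            orders = fetch_orders()
        except Exception as exc:
            status = "500"
            span.record_exception(exc)
            raise
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            request_latency.record(
                elapsed_ms, {"endpoint": "/orders", "status": status}
            )
            logger.info(json.dumps({
                "event": "http_request",
                "request_id": request_id,
                "endpoint": "/orders",
                "status": status,
                "duration_ms": round(elapsed_ms, 1),
            }))
    return orders
```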


The Schema Skill

Logs without a schema are unsearchable. Logs with a schema are queryable like a database. The schema skill makes sure every log line follows the schema for the service.

The schema starts simple. Every log line has a timestamp, a level, a service name, a request ID if one is available, and a message. Every log line is JSON, not free text. The fields beyond these are specific to the event being logged.

The skill captures the per-event fields as it instruments. A login event logs the user ID, the auth method, and whether the attempt succeeded. A database query event logs the query type, the table, the duration, and the row count. The fields are explicit and consistent across the codebase.
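A minimal sketch of a logger that enforces the base schema and carries the event-specific fields. The service name and field names here are illustrative, not the skill's real output.

```python
# Sketch of a structured logger that enforces the base schema.
import json
import logging
import sys
from datetime import datetime, timezone

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("auth-service")

def log_event(message: str, level: int = logging.INFO,
              request_id: str | None = None, **fields):
    """Every line carries the same base fields; extras are event-specific."""
    line = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": logging.getLevelName(level),
        "service": "auth-service",
        "request_id": request_id,
        "message": message,
        **fields,
    }
    log.log(level, json.dumps(line))

# A login event with its event-specific fields.
log_event("user_login", request_id="req-123",
          user_id="u-42", auth_method="password", success=True)

# A database query event with its fields.
log_event("db_query", request_id="req-123",
          query_type="select", table="sessions", duration_ms=4.2, row_count=1)
```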

The skill produces a schema document that lists every event type, the fields it carries, and what each field means. The document gets checked into the repository and updated whenever a new event type is added. The document is what makes the logs usable by anyone other than the person who wrote them.

The schema also drives the log aggregation pipeline. The aggregator parses the JSON, extracts the fields, and indexes them so they can be queried. Querying indexed fields is much faster than scanning unstructured text. A query that takes thirty seconds against unstructured logs takes a few hundred milliseconds against indexed structured logs.


The Trace Sampling Skill

Distributed tracing is valuable but expensive. Tracing every request consumes too much storage and slows down query performance. Sampling solves the cost problem but introduces a bias problem. Random sampling misses the rare failures that you actually want to investigate.

The trace sampling skill implements smart sampling that catches the interesting traces while keeping the cost manageable. Every trace gets a sampling decision based on its characteristics.

Traces from healthy successful requests get sampled at a low rate, maybe one in a hundred. The aggregate behavior is captured but the storage cost is low. Traces from slow requests, where latency exceeds a threshold, get sampled at a higher rate, maybe one in ten. Traces from failed requests, where the response is an error, get sampled at one hundred percent. Every failure is captured.
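As a sketch, the head-sampling decision reduces to a small function. The latency threshold is an assumption, and the rates are the illustrative numbers above.

```python
# Sketch of the rate-tiered sampling decision.
import random

SLOW_THRESHOLD_MS = 500  # assumed latency threshold

def should_sample(status_code: int, duration_ms: float) -> bool:
    if status_code >= 500:
        return True                    # failures: keep every trace
    if duration_ms > SLOW_THRESHOLD_MS:
        return random.random() < 0.10  # slow requests: one in ten
    return random.random() < 0.01      # healthy requests: one in a hundred
```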

The skill also implements tail sampling for the cases where the sampling decision has to wait until the trace is complete. A request that started normal but ended in an error needs to be sampled, but the decision can only be made after the error happens. The tail sampler buffers traces in memory and makes the decision when the trace ends.
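For the tail-sampling path, here is a rough sketch of the buffering logic, assuming spans arrive as plain dicts and the pipeline provides some export callable for kept traces.

```python
# Sketch of a tail sampler: buffer spans per trace, decide when the trace ends.
from collections import defaultdict

class TailSampler:
    def __init__(self, export):
        self.export = export             # callable that ships a kept trace
        self.buffer = defaultdict(list)  # trace_id -> list of span dicts

    def on_span(self, trace_id: str, span: dict):
        self.buffer[trace_id].append(span)

    def on_trace_end(self, trace_id: str):
        spans = self.buffer.pop(trace_id, [])
        # A trace that started normal but ended in an error is kept in full.
        # Healthy traces would fall back to the rate-based head decision.
        if any(span.get("error") for span in spans):
            self.export(spans)
```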

The result is trace storage dominated by failures and outliers, which is exactly what you want for debugging. Successful requests are represented but not dominant. The cost stays manageable and the data stays useful.


The Alert Generation Skill

Alerts are the part of observability that most teams get wrong. Either there are too many alerts and people stop responding, or there are too few alerts and outages go undetected for hours. The alert generation skill produces alerts that are actionable, specific, and rare.

The skill starts from the service-level objectives. Each service has objectives for availability, latency, and correctness. The alerts measure deviation from those objectives. When the error budget is being consumed at a rate that would exhaust it before the period ends, an alert fires.
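To make the burn-rate idea concrete, here is a rough sketch of the check, assuming a 99.9% availability objective and a one-hour window. The threshold of 14 is a common fast-burn choice, roughly the rate that would exhaust a 30-day budget in about two days; it is not something the skill prescribes.

```python
# Sketch of an error-budget burn-rate check. In practice the alert rule
# would usually live in the monitoring system, not application code.
def burn_rate(errors: int, requests: int, slo_target: float = 0.999) -> float:
    """How fast the error budget is being consumed relative to an even spend."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target            # allowed error fraction
    observed_error_rate = errors / requests
    return observed_error_rate / error_budget  # 1.0 == exactly on budget

def should_alert(errors_last_hour: int, requests_last_hour: int) -> bool:
    # Fire when the last hour burns budget ~14x faster than sustainable.
    return burn_rate(errors_last_hour, requests_last_hour) > 14.0
```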

The skill avoids the common trap of alerting on raw metrics. An alert that fires when CPU usage exceeds 90% is mostly noise. CPU usage at 90% is not a problem if requests are still being served quickly. The alert should fire on the user-visible effect, not the internal cause.

The skill also avoids per-instance alerts. When a single instance is unhealthy, the load balancer should route around it and the system should self-heal. The alert should fire when the system as a whole cannot self-heal, which is when the redundancy is exhausted.

Each alert produced by the skill includes the runbook link that explains what the alert means, what to check, and how to mitigate. The runbook is generated alongside the alert and updated whenever the alert is modified. The alert without the runbook is useless. The combination is actionable.


The Dashboard Generation Skill

Dashboards are the interface that lets a human understand a system at a glance. Dashboards that have everything on them are unusable. Dashboards that have only one metric are not informative enough. The right level of detail is hard to find.

The dashboard generation skill produces dashboards for each service following a consistent template. The template has four panels at the top showing the four golden signals: latency, traffic, errors, and saturation. Below those, there are panels for the specific behavior of the service.

The skill picks the specific panels based on what the service does. An API service gets per-endpoint latency and error breakdowns. A background worker gets queue depth and processing latency. A database client gets per-query duration and connection pool saturation. The specifics are different but the layout is consistent.
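As a sketch of what the dashboard-as-code output might look like, here is the template expressed as plain data. The panel titles and PromQL-style queries are illustrative placeholders, and the real output would target whatever dashboard tool your stack uses.

```python
# Sketch of a dashboard template as data checked into the repository.
GOLDEN_SIGNALS = [
    {"title": "Latency p95",
     "query": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"},
    {"title": "Traffic (req/s)",
     "query": "rate(http_requests_total[5m])"},
    {"title": "Errors (5xx rate)",
     "query": 'rate(http_requests_total{status=~"5.."}[5m])'},
    {"title": "Saturation",
     "query": "process_open_fds / process_max_fds"},
]

def service_dashboard(service: str, specific_panels: list[dict]) -> dict:
    """Same layout everywhere: golden signals on top, service-specific panels below."""
    return {
        "title": f"{service} overview",
        "rows": [
            {"name": "Golden signals", "panels": GOLDEN_SIGNALS},
            {"name": f"{service} specifics", "panels": specific_panels},
        ],
    }

# An API service gets per-endpoint breakdowns; a worker would get queue panels instead.
api_dashboard = service_dashboard("orders-api", [
    {"title": "Latency by endpoint",
     "query": "sum by (endpoint) (rate(http_request_duration_seconds_sum[5m]))"},
    {"title": "Errors by endpoint",
     "query": 'sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))'},
])
```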

The skill also produces composite dashboards that show multiple services together. When a user-facing feature spans several services, the dashboard for that feature shows the relevant panels from each service on one page. The composite dashboards are what get used during incident response, when you need to see across the whole call chain at once.

The dashboards get committed to the repository as code rather than configured in the UI. The code is reviewable, versioned, and reproducible. When a dashboard changes, the change goes through the same review process as any other code change.


The Correlation Skill

The hardest part of debugging in production is correlating signals across the three pillars. The metric shows the error rate climbing. The logs show a stream of errors. The traces show specific failed requests. Connecting these requires a shared identifier that flows through all three.

The correlation skill ensures that every request gets a request ID that propagates everywhere. The ID is created at the edge of the system, included in the logs, attached to the trace, and used as a label on the relevant metrics. With the ID in place, you can pivot between the three pillars by querying for the same ID.

The skill also adds correlation for asynchronous flows. A background job triggered by a request gets the request ID propagated through the job queue. A retry of a failed operation gets the original request ID so the full retry chain can be traced. A user session ID gets attached to every request in the session so you can see the full user journey.
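A minimal sketch of the propagation mechanics in Python, using contextvars. The header name and helper functions are illustrative; a real service would wire this into its framework's middleware.

```python
# Sketch of request ID propagation through a request and into async work.
import contextvars
import uuid

request_id_var = contextvars.ContextVar("request_id", default=None)

def ensure_request_id(headers: dict) -> str:
    """Created at the edge if the caller did not send one, then reused everywhere."""
    rid = headers.get("X-Request-ID") or str(uuid.uuid4())
    request_id_var.set(rid)
    return rid

def enqueue_background_job(queue: list, payload: dict):
    # Async work carries the originating request ID through the job queue.
    queue.append({**payload, "request_id": request_id_var.get()})

def log_fields(**fields) -> dict:
    # Log lines pick the ID up from context so callers never pass it by hand.
    return {"request_id": request_id_var.get(), **fields}
```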

The correlation is what makes the observability data composable. Without it, each pillar is an island. With it, the pillars become a connected graph that you can navigate based on the question you are asking.


The Cost Control Skill

Observability data is expensive. Storage costs scale with the volume of data. Query costs scale with the cardinality of the labels. A naively instrumented service can produce so much data that the observability bill exceeds the compute bill.

The cost control skill keeps the observability spend in check. The skill watches the ingestion volume per service and alerts when a service starts producing significantly more data than its peers. The alert prompts a review of whether the additional data is valuable or whether it represents an instrumentation mistake.

The skill also implements log level controls per service. Production runs at INFO level by default, which captures the events that matter without the volume of DEBUG. When a service is being actively debugged, the level can be raised to DEBUG for a short window and then dropped back. The temporary verbose period gives you the data you need without paying for it all the time.
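A rough sketch of the temporary DEBUG window, assuming the standard library logger. A real setup might drive this from a config flag or the logging backend's own dynamic-level controls instead of a timer.

```python
# Sketch of a temporary DEBUG window that falls back to INFO automatically,
# so the verbose period cannot be forgotten and left on.
import logging
import threading

def debug_window(logger_name: str, minutes: int = 30):
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    timer = threading.Timer(minutes * 60, logger.setLevel, args=[logging.INFO])
    timer.daemon = True
    timer.start()
```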

The skill manages retention as well. High-cardinality data like traces gets retained for a short window, maybe seven days, because the value of a trace drops quickly after the incident is resolved. Lower-cardinality data like metrics gets retained for a longer window, maybe a year, because long-term trend analysis is valuable. The retention policies match the value of the data.

The cost control skill turns observability from an open-ended expense into a managed one. The spend has a budget. The budget gets allocated across services based on their criticality. The skill makes sure the allocation is being respected.


How the Skills Compose

The skills compose into an observability practice. The instrumentation skill adds the data collection. The schema skill makes the data queryable. The trace sampling skill keeps the volume manageable. The alert generation skill turns the data into actionable signals. The dashboard generation skill turns the data into visual summaries. The correlation skill makes the data connectable. The cost control skill keeps the bill predictable.

A new service comes into the system with all of this from day one. The instrumentation skill runs on the initial codebase. The schema document is generated. The trace sampling is configured. The alerts are generated. The dashboards are created. The service ships with observability built in.

When the service evolves, the skills evolve with it. New endpoints get instrumented as they are added. New event types get added to the schema. New alerts and dashboards appear as the surface area grows. The observability stays current without anyone scheduling observability work.


What This Costs

Building the skills took a few weeks. The instrumentation skill was the largest single piece because it has to understand many different code patterns. The schema and dashboard generation skills came next. The alert generation skill required the most tuning because alert quality is hard to get right.

Once the skills are in place, the cost of adding observability to a new service is close to zero. The skill runs, the output is reviewed and merged, and the service is observable. Compared to the days or weeks of manual work this would have taken, the savings are enormous.

The bigger benefit is the consistency. Every service follows the same patterns. Every dashboard has the same shape. Every alert links to a runbook. When something breaks, the cognitive load of finding the right data is low because the data is always in the same place.


What the Skills Do Not Do

The skills do not pick your observability vendor. Whether you use an open source stack, a commercial platform, or something in between, the skills produce instrumentation that follows the OpenTelemetry standard. The downstream pipeline is yours to configure.

The skills also do not replace the human judgment in incident response. They give you the data, but the data does not interpret itself. When something is breaking, a human has to look at the dashboards, read the logs, and decide what to do. The skills make this easier but do not automate it.

The skills also do not write your service-level objectives. The objectives are a product and business decision. The skill takes the objectives as input and produces alerts and dashboards that measure against them, but the objectives themselves come from you.


Setting Up Your Own Stack

Start with structured logging. Get every service producing JSON logs with a consistent schema. This is the foundation that everything else builds on. Without it, the other pillars cannot connect to the logs.

Add request ID correlation next. Make sure every log line in a request flow carries the same ID. Once the IDs are in place, you can connect logs from different services that participated in the same request.

Add metrics third. Start with the four golden signals per service. Add custom metrics as you discover the need. Resist the temptation to add metrics for everything just because you can.
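As a starting point, the four golden signals can be covered with a handful of metrics. This sketch uses the prometheus_client library; the metric and label names are illustrative.

```python
# Sketch of the four golden signals with prometheus_client.
from prometheus_client import Counter, Gauge, Histogram

REQUESTS = Counter("http_requests_total", "Traffic; errors come from the status label",
                   ["endpoint", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Latency",
                    ["endpoint"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Saturation proxy")

def observe_request(endpoint: str, status: str, duration_s: float):
    REQUESTS.labels(endpoint=endpoint, status=status).inc()
    LATENCY.labels(endpoint=endpoint).observe(duration_s)
```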

Add traces fourth. Traces are the most expensive pillar and the one with the highest setup cost, so it makes sense to add them after the cheaper pillars are working. The smart sampling skill keeps the cost manageable.

Add alerts and dashboards last. These depend on the data being clean and the schema being stable. Premature alerting produces noise. Premature dashboards become abandoned.


The Bigger Picture

Observability is the kind of work that pays off enormously but feels invisible when it is working. When the system is healthy, you do not think about your observability stack. When something breaks, the stack is either there to help you or it is not. The investment in observability is paid back in minutes saved during outages, but the savings compound across every outage for the life of the system.

The pattern in this workflow is the same pattern I keep using. The repetitive parts of observability work get automated. The judgment parts remain human. The result is a practice that scales without scaling headcount, and a production environment where outages get resolved in minutes instead of hours.

If you have services in production without proper observability, the answer is not to wait for a quieter quarter. Build the workflow. Add observability to one service. Use the workflow to add it to the next service. After a few services, the workflow is mature and adding observability to the rest is fast.


The first concrete step is structured logging. Every log line as JSON, every log line with a request ID, every service following the same schema. Once that is in place, the rest of the stack starts to make sense. Without it, every additional pillar is harder than it needs to be.

Pick one service. Add structured logging. Verify the logs are queryable. Then move to the next service. The compounding starts immediately.


FAQ

Which observability platform should I use? The skills produce OpenTelemetry-compatible output, which works with most platforms. Pick the platform based on your team's familiarity and your budget.

How much should I spend on observability? A reasonable starting point is between five and ten percent of the compute spend for the service. If you are spending much less, you probably do not have enough observability. If you are spending much more, you probably have too much.

What about security and privacy? Logs and traces can capture sensitive data. The skills include a redaction step that removes known sensitive fields before the data is shipped to the observability platform. Configure the redaction rules for your context.
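A minimal sketch of what a redaction pass could look like; the field list is illustrative and would be configured for your context, and nested structures would need a recursive version.

```python
# Sketch of a redaction pass applied before log lines leave the service.
SENSITIVE_FIELDS = {"password", "authorization", "ssn", "credit_card"}

def redact(event: dict) -> dict:
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_FIELDS else value
        for key, value in event.items()
    }
```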

What about local development? The skills produce the same instrumentation locally as in production. Local logs and traces go to a local collector. This way you can debug observability issues without needing to deploy.

What is the biggest mistake to avoid? Treating observability as something to add later. The data you cannot collect during the incident is data you cannot have. Build it in from the start.


If you found this useful, follow for more posts about practical Claude Code workflows. I write about how I run a multi-product business with AI agents handling most of the operational work.
