DEV Community: Nex Tools

Claude Code for Error Budgets: How I Stopped Arguing About Reliability and Started Measuring It

Nex Tools — Mon, 18 May 2026 11:07:35 +0000

For the first three years of running production systems I had the same fight with the same people about the same thing. The product team wanted to ship faster. The on-call team wanted to ship slower. Both sides had data. Neither side could prove the other was wrong. The arguments would end in a compromise that nobody felt good about, and the next incident would restart the cycle.

The fix was not better arguments. The fix was an error budget. Once we had a budget, the question stopped being "should we ship this risky change" and started being "do we have budget left to spend." That is a much smaller question with a much clearer answer, and it changes the entire conversation between the people who build features and the people who keep them running.

Setting up an error budget program sounded simple in the SRE book. In practice it took me eighteen months and three failed attempts before I had something that actually worked. The thing that finally made it work was using Claude Code to handle the parts of the program that humans were too inconsistent to handle on their own. Here is the workflow I built and what it taught me about reliability as an engineering discipline rather than a debate topic.

Why Error Budgets Are Hard to Run in Practice

The theory of error budgets is straightforward. You pick a service level objective, usually expressed as a percentage of successful requests over some window. The difference between 100% success and your objective is your budget. When you have budget left, you can ship risky things. When you have burned through your budget, you slow down and focus on reliability work until the budget recovers.
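The arithmetic behind that paragraph is small enough to sketch. The function names and the sample numbers below are illustrative assumptions, not part of the workflow itself.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of failed requests the objective tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent (negative once overspent)."""
    return 1.0 - failed_requests / error_budget(slo, total_requests)


# Hypothetical numbers: a 99.9% objective over 1,000,000 requests
# tolerates roughly 1,000 failures.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
print(round(allowed), round(remaining, 2))
```

A remaining value above zero means there is still budget to spend on risky changes; a value at or below zero means the budget is exhausted.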

The theory is clean. The practice is messy in ways the theory does not warn you about.

The first mess is measurement. Picking the right success criteria turns out to be much harder than it sounds. A request that returned 200 but took 12 seconds is not really a success. A request that returned 500 because the user sent malformed input is not really a failure. A request that succeeded for the user but failed in a way that corrupted internal state is the worst possible outcome and looks like a success in your metrics. Every team that runs an error budget hits these edge cases, and most teams either ignore them or argue about them forever.

The second mess is enforcement. Error budgets only work if the organization actually changes behavior when the budget is exhausted. In practice, the moment the budget runs out is also the moment somebody important wants to ship something important, and the budget gets waived. After this happens three or four times, the budget becomes a number on a dashboard that everyone ignores. The credibility cost of ignoring the budget once is much higher than people realize.

The third mess is cadence. A budget that resets monthly behaves very differently from a budget that resets quarterly, and both behave very differently from a sliding window. Each cadence has different failure modes. The wrong cadence for your traffic patterns can make the budget either too punishing or too lax, and neither extreme produces the cultural changes you wanted.

An error budget is not just a number. It is a contract between the people who build features and the people who run them, and like any contract, the value comes from how rigorously it is enforced rather than how cleverly it is written.

The teams I have seen succeed with error budgets are not the teams with the most sophisticated objectives. They are the teams that wired the budget into their actual release process so that the contract enforced itself. The teams that failed are the ones that left enforcement up to human judgment, because human judgment under pressure always finds a reason to ship.

The Objective Skill

The first skill in my error budget workflow handles objective definition. Given a service and its traffic patterns, the skill proposes a service level objective and the supporting indicators that feed into it.

The skill does not just pick a percentage. It looks at historical traffic, current failure rates, and customer impact patterns to recommend an objective that is achievable but meaningful. An objective of 99.99% sounds impressive but is usually meaningless for a service whose current state is 99.5%. An objective of 99% is useless for a service that already runs at 99.95%. The right objective is one that requires real work to maintain but does not require fantasy.

The skill also defines the success criteria precisely. It does not just say "successful requests." It specifies what success means for this particular service, including the edge cases. A successful request might require a 2xx status code, a response time under a specific threshold, and the absence of any internal error logs correlated with that request ID. The precision is important because vagueness is where the arguments start.
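As a rough illustration of what a precise success definition can look like, here is a minimal sketch of the three criteria named above. The record shape, field names, and the 500 ms default threshold are assumptions made for the example; a real objective document would supply its own.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    status: int            # HTTP status code returned to the caller
    latency_ms: float      # end-to-end response time
    internal_errors: int   # error-level log lines sharing this request ID


def is_good(req: RequestRecord, latency_threshold_ms: float = 500.0) -> bool:
    """True only when all three criteria hold for the request."""
    return (
        200 <= req.status < 300
        and req.latency_ms < latency_threshold_ms
        and req.internal_errors == 0
    )
```

Under this definition a 200 that took 12 seconds counts against the budget, and so does a 200 that left an internal error behind it.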

The output is a structured objective document that can be reviewed, debated, and eventually signed off by both the engineering team and the product team. The document is the foundation of the budget program. Without precise definitions, every subsequent decision becomes a fight about what the words mean.

The Burn Rate Skill

The second skill tracks burn rate in real time. The burn rate is the speed at which the budget is being consumed relative to the pace that would spend it exactly over the full window.

A budget that burns at exactly the expected rate is not interesting. A budget that burns at three times the expected rate is a warning. A budget that burns at ten times the expected rate is an active fire. The skill watches the burn rate continuously and surfaces deviations from expected behavior.

The interesting design choice is the smoothing. A naive burn rate calculation produces wild swings every time a single bad request comes in. A heavily smoothed calculation hides real problems for hours. The skill uses multiple time horizons in parallel. A one-hour view catches active fires. A six-hour view catches sustained degradations. A 24-hour view catches gradual drift that nobody would notice in shorter windows.
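One way to picture the parallel horizons is the sketch below. It assumes a common definition of burn rate, the observed error rate in a window divided by the error rate the objective allows, and a hypothetical event shape of (timestamp, is_good). None of these names come from the skill itself.

```python
from typing import Dict, List, Tuple

# Each event is (timestamp_seconds, is_good); a hypothetical shape for this sketch.
Event = Tuple[float, bool]

WINDOWS_SECONDS = {"1h": 3600, "6h": 6 * 3600, "24h": 24 * 3600}


def burn_rate(events: List[Event], now: float, window: float, slo: float) -> float:
    """Observed error rate in the window divided by the rate the SLO allows."""
    recent = [good for ts, good in events if now - window <= ts <= now]
    if not recent:
        return 0.0
    error_rate = recent.count(False) / len(recent)
    return error_rate / (1.0 - slo)


def burn_rates(events: List[Event], now: float, slo: float) -> Dict[str, float]:
    """Burn rate over every horizon, computed from the same event stream."""
    return {
        name: burn_rate(events, now, seconds, slo)
        for name, seconds in WINDOWS_SECONDS.items()
    }
```

A value near 1 means the budget is being spent at the expected pace. A short window that reads far above the longer ones points at a recent, sharp change rather than slow drift.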

When the burn rate crosses a threshold, the skill alerts. The alert is not just a number. It includes the context needed to act: which endpoint is contributing most to the burn, which deploy correlated with the change in burn rate, and whether the burn is concentrated in a single tenant or distributed across the user base. The context turns the alert from a notification into a starting point for investigation.

The investigation often connects directly to a log analysis pass. If you have set up the workflow I described in Claude Code for Log Analysis, the burn rate alert can hand off straight into pattern detection, which compresses the time from "budget is burning" to "we know why" by a meaningful margin.

The Policy Skill

The third skill enforces the budget policy in the release pipeline. The skill sits between the deploy command and the actual deploy and checks whether the current budget state permits the release.

The policy is configurable per service. A typical configuration might say that any deploy is permitted while the budget is above 50%, that only low-risk deploys are permitted between 25% and 50%, and that no non-critical deploys are permitted below 25%. The thresholds and the risk classifications are defined in the objective document so that the policy is mechanical rather than negotiable.
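The typical configuration described above can be sketched as a small gate function. The risk labels and the handling of the exact 50% and 25% boundaries are assumptions for the example; the real thresholds would come from the objective document.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    NORMAL = "normal"
    CRITICAL = "critical"


def deploy_permitted(budget_remaining: float, risk: Risk) -> bool:
    """Apply the 50% / 25% policy to one deploy request.

    budget_remaining is the unspent fraction of the budget (0.0 to 1.0).
    """
    if risk is Risk.CRITICAL:
        return True                  # critical deploys are never blocked
    if budget_remaining > 0.50:
        return True                  # healthy budget: anything ships
    if budget_remaining >= 0.25:
        return risk is Risk.LOW      # middle band: low-risk only
    return False                     # below 25%: non-critical deploys wait
```

Because the function takes only the budget state and the risk class, the same inputs always produce the same answer, which is what makes the policy mechanical.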

The mechanical enforcement is the entire point. When the budget gets low and the policy blocks a deploy, the response is not a debate about whether to override the policy. The response is a question about whether to spend the remaining budget on this specific change. If the answer is yes, the change ships and the budget burns further. If the answer is no, the change waits. Either way, the budget stays meaningful.

The skill also produces an audit trail. Every deploy that was permitted under the policy is logged with the budget state at the time. Every deploy that was blocked is logged with the reason. The audit trail makes it possible to look back at a quarter and see exactly how the budget was spent and whether the spending decisions were the right ones in retrospect.

The Postmortem Skill

The fourth skill connects error budget consumption to postmortem actions. After every significant budget burn, the skill produces a draft postmortem that documents what happened, how much budget was consumed, and what changes would prevent the burn from recurring.

The draft is not a finished postmortem. It is a structured starting point. The skill fills in the data sections automatically, including the timeline, the affected metrics, and the related deploys. The human writes the analysis sections, which are the parts that actually require judgment. The split between mechanical sections and judgment sections cuts the time to produce a postmortem roughly in half without reducing its quality.

The postmortem also includes a budget impact summary. The summary expresses the incident in terms of how much of the quarterly budget it consumed and how that affects the remaining release capacity for the quarter. The budget framing makes the postmortem read differently. Instead of saying "this incident lasted 47 minutes," the postmortem says "this incident consumed 18% of the quarterly budget." The second framing leads to different priorities about prevention.
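The budget framing is a short calculation. The sketch below assumes a time-based budget, a 90-day quarter, an incident that was a full outage, and a hypothetical 99.8% objective; the objective behind the figures in the example above is not stated, so these inputs are illustrative only.

```python
def incident_budget_share(outage_minutes: float, slo: float, window_days: int = 90) -> float:
    """Fraction of the window's downtime budget consumed by one full outage."""
    window_minutes = window_days * 24 * 60
    budget_minutes = (1.0 - slo) * window_minutes
    return outage_minutes / budget_minutes


# Hypothetical inputs: 47 minutes of full outage against a 99.8% objective.
share = incident_budget_share(47, 0.998)
print(f"{share:.1%} of the quarterly budget")
```

A partial outage would consume less, so a request-based calculation is the better fit when only some traffic was affected.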

For incident response context that pairs naturally with this postmortem workflow, the system I described in Claude Code for Incident Response handles the live response side and feeds directly into the postmortem skill once the incident is closed.

How the Workflow Runs in Practice

The workflow runs continuously rather than on demand. The burn rate skill is always watching. The policy skill is always sitting in front of the deploy pipeline. The postmortem skill triggers automatically when a burn crosses a threshold.

When the burn rate alerts, my first move is to check the context the alert provides. The endpoint, the correlated deploy, the affected user segment. Most of the time the context points directly at the cause. If the alert correlates with a recent deploy, the deploy is probably the cause and the response is a rollback. If the alert correlates with a tenant spike, the cause is probably load and the response is a scaling decision.

When the policy skill blocks a deploy, the response is a conversation rather than a fight. The conversation is about whether the change is important enough to spend remaining budget on, given that the budget cannot be replenished mid-quarter. Sometimes the answer is yes and the team takes ownership of the increased risk. Sometimes the answer is no and the change moves to the next quarter. Either answer is fine because both are deliberate.

When the postmortem skill produces a draft, my first move is to fill in the human analysis sections. The data is already there. The narrative, the root cause, the action items are the parts that need judgment. The structured starting point makes it easier to focus on the parts that matter rather than getting bogged down in data assembly.

The objective skill runs once per service when the service is onboarded and then again at quarterly reviews. The review cycle keeps the objectives calibrated to actual traffic and actual customer expectations. A service whose traffic has grown 10x in a year usually needs a tighter objective than the one it was launched with.

What This Workflow Did to My Practice

The most visible change is that the arguments stopped. The product team and the engineering team no longer fight about whether to ship a particular change. They look at the budget, they look at the change, and they make a decision. The decision is not always the one I would have preferred, but the decision-making process is much faster and much less politically expensive than the old arguments were.

The second change is that incidents feel different. An incident that would previously have produced a vague sense of "things were bad for a while" now produces a precise statement about how much budget was consumed and what that means for the rest of the quarter. The precision makes prevention work easier to justify, because the cost of the incident is no longer abstract.

The third change is in how the team thinks about reliability investments. Before the budget program, reliability work was something to argue for during planning. After the budget program, reliability work happens whenever the budget gets tight, because the alternative is freezing deploys. The forcing function is mechanical rather than political, which means the work actually happens.

The fourth change, which I did not expect, is in how features get scoped. The product team has started asking about expected error budget impact early in the design process. A feature that requires a risky new dependency now gets weighed against the budget cost of integrating it, not just the engineering cost of building it. The conversation about scope is informed by reliability data instead of opinions.

For the broader set of workflows that connect to this one, my Claude Code Practical Workflows series on DEV.to covers everything from observability through incident response, refactoring, migrations, and security. The error budget workflow ties many of them together because the budget is the unifying measurement that tells you whether the rest of the practice is working.

FAQ

What if the team does not want to commit to a service level objective?

That resistance is usually about fear of being held to an unrealistic number rather than disagreement with the concept. The objective skill helps because it grounds the objective in actual traffic and current failure rates, which makes the number defensible. Once the team sees that the proposed objective is achievable, the resistance usually fades.

How do I handle services with very low traffic?

Low-traffic services have noisy budget calculations because a single bad request consumes a much larger percentage of the budget. The skill handles this by using longer windows for low-traffic services and by combining related services into a single budget where appropriate. A budget calculated over 1,000 requests per quarter is meaningful in a way that a budget calculated over 50 is not.

What happens when the budget is exhausted halfway through the quarter?

The policy skill blocks non-critical deploys for the rest of the quarter. The team uses the time to do the reliability work that the burn rate revealed. This is the intended behavior. If exhausting the budget produces no behavioral change, the budget program is not working and the program needs to be reconsidered, not the budget.

Can I run multiple budgets per service?

Yes. A single service might have a budget for availability and a separate budget for latency. The skills handle multiple budgets per service and produce aggregated views for cases where the budgets need to be reasoned about together. Most teams start with a single availability budget and add additional budgets only when the first one is operating well.

How do I get product team buy-in?

The biggest unlock is reframing the budget as a permission slip rather than a restriction. The budget is what gives the product team the right to ship risky changes. Without the budget, every risky change has to be argued individually. With the budget, the team can ship anything that fits within the available capacity. Most product teams respond well to this framing once they see that the budget enables faster shipping when reliability is healthy.

The error budget workflow is the piece of my SRE practice that I would recommend to any team that has ongoing tension between product and engineering about reliability. The tension is real and it does not resolve itself through arguments. It resolves through a contract that enforces itself mechanically, and the workflow I described is how I made that contract operational. The investment is significant. The payoff is that reliability stops being a source of conflict and becomes a source of shared planning, which is the version of the relationship that healthy teams have.

Claude Code for Log Analysis: How I Stopped Drowning in Stack Traces

Nex Tools — Mon, 18 May 2026 10:54:44 +0000

The first time a production incident hit at 2 AM, I spent two hours scrolling through logs before I found the line that mattered. It was a single line buried inside 800,000 entries from the same hour. The bug had been throwing the same exception in a hot loop, drowning out the rare error that actually caused the outage. By the time I found it, the customer impact window had stretched past anything we wanted to be telling the postmortem audience.

That night taught me something. Log analysis is not a search problem. It is a triage problem. The interesting signal is almost never the most frequent line. It is the rare line that happens once or twice and then never again. Humans are terrible at finding rare signals in high-volume noise, especially at 2 AM. Grep is even worse, because grep finds matches but does not rank them.

This is where Claude Code rewired how I do log analysis. The workflow I built turns a wall of unstructured text into a ranked list of things worth investigating, and it does it in seconds rather than hours. I run it every time something goes wrong in production, and it has compressed my median time to root cause from somewhere around an hour to somewhere around five minutes. Here is how the workflow works.

Why Traditional Log Analysis Falls Apart at Scale

The standard tools for log analysis were built for a world where logs were small. Grep, awk, and tail work fine when you have a few thousand lines and a clear idea of what you are looking for. They fall apart at modern volumes for two reasons.

The first reason is that you do not know what you are looking for. You know that something went wrong, but the error message that caught your attention might not be the actual cause. It might be a downstream symptom. The cause is buried somewhere earlier in the timeline, inside a log line that looked unremarkable when it was written.

The second reason is that the signal-to-noise ratio is brutal. A production service running at moderate load produces tens of thousands of log lines per minute. The vast majority of those lines are routine. The interesting ones are needles in a haystack of needles. Even when you find a candidate, you cannot easily tell whether it is rare or common without running another query.

The bottleneck in log analysis is never the speed of the search. It is the speed of pattern recognition across high-volume text. The tools we have are good at search and bad at pattern recognition, which is exactly the wrong tradeoff for the job.

The teams I have worked with that have invested heavily in observability platforms still have this problem. The platforms make ingestion easier but they do not solve the pattern recognition problem. They let you slice the data faster but they still require you to know what slice to ask for. When the incident is novel, you do not know.

If you want context on why I treat observability as a first-class engineering concern, the workflow I described in Claude Code for Observability Stacks lays out the broader system this log analysis workflow plugs into.

The Frequency Skill

The first skill in the workflow handles frequency analysis. Given a log file and a time window, the skill produces a ranked list of every distinct log pattern and how often it occurred.

The interesting part is the normalization. The skill recognizes that two log lines with different timestamps, different request IDs, and different user IDs but the same underlying message template are the same pattern. It strips out the variable parts and groups the lines by template. The output is a list of templates, each with a count, an example line, and a sample of the variable values.

The ranking flips the normal log ordering on its head. The most common patterns are at the bottom, not the top. The rare patterns, the ones that occurred once or twice in the window, float to the top. The list becomes a tour of every unusual thing that happened during the incident window, ranked by how rare it was.
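The normalization and the rare-first ranking can be sketched in a few lines. The masking rules below are hypothetical and deliberately crude (timestamps, UUIDs, bare integers); real logs would need rules tuned to their own format.

```python
import re
from collections import Counter
from typing import List, Tuple

# Hypothetical masking rules, applied in order so timestamps and UUIDs
# are replaced before the generic number rule sees their digits.
_MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<TS>"),
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<UUID>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace the variable parts of a log line with placeholder tokens."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def rare_first(lines: List[str]) -> List[Tuple[str, int]]:
    """Group lines by template and sort so the rarest templates come first."""
    counts = Counter(to_template(line) for line in lines)
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))
```

The sort key is the count in ascending order, which is the inversion described above: routine templates sink to the bottom and one-off lines rise to the top.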

The first time I ran this on a real incident, the cause jumped out from position three on the list. It was a single log line from a connection pool exhaustion event, twelve seconds before the cascade of customer-facing errors started. The line had been there the whole time, but it was invisible inside the 800,000-line haystack.

The Correlation Skill

Once the frequency skill has surfaced rare patterns, the correlation skill ties them together. The skill takes a set of candidate patterns and looks for temporal relationships between them.

The relationships it finds are useful for debugging. If pattern A always appears within five seconds of pattern B, that is worth knowing. If pattern A spikes shortly before pattern C starts firing, that is worth knowing. If pattern D appears only when pattern E has not appeared for several minutes, that is also worth knowing.

The skill also looks at cross-service correlations. When the log stream includes multiple services, the skill can ask whether a pattern in service X correlates with a pattern in service Y. The cross-service view often reveals causes that are invisible inside any single service.

The output is a small graph of temporal relationships. Each edge is annotated with the lag time and the strength of the correlation. The graph is far smaller than the original log volume and is much easier to reason about.
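A single edge of such a graph could be computed roughly as follows. This is a simple co-occurrence measure rather than a statistical correlation, and the function name, the 5-second default, and the definition of strength are all assumptions for the sketch.

```python
from bisect import bisect_right
from statistics import median
from typing import List, Optional, Tuple


def lag_edge(
    a_times: List[float], b_times: List[float], max_lag: float = 5.0
) -> Tuple[float, Optional[float]]:
    """Describe the edge 'A precedes B' as (strength, median_lag_seconds).

    strength is the share of B occurrences that have an A occurrence
    at most max_lag seconds earlier.
    """
    if not b_times:
        return 0.0, None
    a_sorted = sorted(a_times)
    lags = []
    for b in b_times:
        i = bisect_right(a_sorted, b)  # number of A occurrences at or before b
        if i > 0 and b - a_sorted[i - 1] <= max_lag:
            lags.append(b - a_sorted[i - 1])
    strength = len(lags) / len(b_times)
    return strength, (median(lags) if lags else None)
```

Running this over every ordered pair of tagged patterns and keeping only the strong edges gives a graph small enough to read by eye.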

The Hypothesis Skill

The third skill builds hypotheses about what went wrong. Given the frequency analysis and the correlation graph, the skill produces a ranked list of possible root causes.

Each hypothesis comes with the evidence that supports it. The evidence is specific log lines, specific timestamps, specific correlation strengths. The hypothesis is not a guess. It is a falsifiable claim grounded in the actual log data, which means I can validate or reject it quickly.

The ranking is based on how well the evidence supports the hypothesis. A hypothesis that is consistent with every observed pattern is ranked higher than one that only explains part of the data. A hypothesis that contradicts a known pattern is ranked lower.
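A toy version of that ranking rule is sketched below. The scoring formula (share of observed patterns explained, minus one point per contradicted pattern) is an assumption chosen to illustrate the ordering described above, not the skill's actual method.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Hypothesis:
    name: str
    explains: Set[str]                                    # patterns this cause would produce
    contradicts: Set[str] = field(default_factory=set)    # patterns it rules out


def rank(hypotheses: List[Hypothesis], observed: Set[str]) -> List[Hypothesis]:
    """Order hypotheses by coverage of observed patterns, penalizing contradictions."""
    def score(h: Hypothesis) -> float:
        coverage = len(h.explains & observed) / len(observed) if observed else 0.0
        penalty = len(h.contradicts & observed)
        return coverage - penalty

    return sorted(hypotheses, key=score, reverse=True)
```

A hypothesis that covers every observed pattern scores 1.0, a partial explanation scores lower, and any contradiction pushes the score below zero.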

I treat the top hypothesis as the starting point of my investigation rather than the answer. Sometimes the top hypothesis is correct and I move on. Sometimes it is wrong but the evidence reveals a different cause that the skill missed. Either way, the starting point is much better than scrolling logs from the beginning.

If you want to see how this hypothesis-driven approach extends to incident response broadly, Claude Code for Incident Response covers the full workflow. The log analysis skills described here are the substrate that makes the incident response workflow possible.

The Diff Skill

Long-running incidents have a special challenge. The logs from before the incident and the logs from during the incident look almost identical at the line level, but the distributions are different. The diff skill quantifies the difference.

The skill takes two log windows. One is a baseline, usually a similar period from before the incident. The other is the incident window itself. The skill compares the frequency distributions of every log pattern in the two windows and surfaces the ones that changed the most.

The patterns that increased dramatically in the incident window are usually symptoms. The patterns that decreased are sometimes the most interesting. A drop in the rate of successful operations or healthy heartbeats often points more directly at the cause than the new errors do.

The diff also surfaces patterns that are entirely new. Lines that appear in the incident window but do not appear at all in the baseline are particularly interesting. They are the things the system was not doing in normal operation, which means they are very likely related to the incident.
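The comparison can be sketched with two template counters, one per window. The function name and the choice to compare each template's share of its window (rather than raw counts) are assumptions for the example.

```python
from collections import Counter
from typing import List, Tuple


def diff_windows(
    baseline: Counter, incident: Counter
) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Return (new_patterns, changes) for two template-count windows.

    changes holds (template, share_delta) for every baseline template,
    largest absolute shift first; a negative delta means the pattern dropped.
    """
    base_total = sum(baseline.values()) or 1
    inc_total = sum(incident.values()) or 1
    new_patterns = sorted(t for t in incident if t not in baseline)
    changes = []
    for template, count in baseline.items():
        base_share = count / base_total
        inc_share = incident.get(template, 0) / inc_total
        changes.append((template, inc_share - base_share))
    changes.sort(key=lambda item: abs(item[1]), reverse=True)
    return new_patterns, changes
```

Comparing shares instead of raw counts keeps the result meaningful when the two windows differ in length or overall volume.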

How the Workflow Runs in Practice

When an incident fires, my first move is to grab the log window. I usually take fifteen minutes before the first customer-visible error and fifteen minutes after. The window is small enough to process quickly and large enough to capture context.

I pass the window to the frequency skill. The output is a ranked list of patterns that takes about ten seconds to skim. The rare patterns at the top are my first candidates. I tag the ones that look interesting.

The correlation skill takes the tagged patterns and produces a small graph. The graph usually reveals the rough order of events. I can see which pattern came first, which came second, which seemed to trigger which.

The hypothesis skill takes everything and gives me a starting point. Maybe two or three candidate root causes, each with supporting evidence. I validate the top hypothesis by checking the evidence directly and either confirming it or ruling it out.

If the incident is long-running, I bring in the diff skill. The baseline comparison is particularly useful when the incident is a gradual degradation rather than a sudden break. The patterns whose rates have drifted reveal the degradation in a way that simple error scanning cannot.

The whole workflow takes ten to fifteen minutes for a typical incident. The actual fix often takes longer than the analysis, which is the opposite of how my workflow used to be balanced.

What This Workflow Did to My Practice

The most measurable change is the median time to root cause. Before this workflow, I would estimate it at sixty to ninety minutes for a typical incident. Today it is closer to five to ten minutes for the same class of incident. The reduction is dramatic enough that I no longer dread getting paged.

The less measurable change is more important. I now expect to understand what happened, not just to make it stop. Pre-workflow, I would frequently hit a state where the system was healthy again but I did not really know why or what had caused the original problem. The temptation was to declare victory and move on. Post-workflow, I almost always have a hypothesis grounded in evidence by the time the system stabilizes. The postmortems are shorter and more accurate because the root cause is already documented.

The third change is in how I think about logs themselves. I used to view logging as a debugging output, something to write more of when I was stuck. Now I view it as a structured signal source that needs to be analyzable at scale. The way I write log statements has changed. I include more context, I use more consistent templates, I avoid baking variable values into the static parts of the message. The downstream tooling works better because the upstream emission is more disciplined.

If you want to take this further, my full set of practical workflows is in the Claude Code Practical Workflows series on DEV.to. The series covers everything from incident response to refactoring to migrations.

FAQ

Does this work for unstructured logs?

Yes. The frequency skill normalizes log lines into templates even when the lines are unstructured. Structured logs are easier to work with, but the workflow does not require them.

What about logs that are too large to process in a single pass?

The skills can sample. For very large windows, the frequency analysis runs on a sample first and then drills down on the interesting patterns at full resolution. The sampling is much faster and usually surfaces the same candidates as the full pass.

Does this replace observability platforms?

No. The observability platform is where the logs live. The skills consume what the platform provides. They complement the platform by adding pattern recognition that the platform itself does not offer.

How do I get started?

Start with the frequency skill. Pick a recent incident, grab the log window, run the analysis, and see what shows up. The first time you find a needle the platform missed, you will know whether the workflow is worth investing in further.

The log analysis workflow is one piece of a larger pattern. The pattern is using Claude Code to add pattern recognition layers on top of tools that were designed for human-scale data and now have to work at machine-scale volumes. Every layer makes a different part of the job tractable. Log analysis was where the payoff was clearest for me, and it is where I would recommend starting if you want to try this on your own systems.

Claude Code for Feature Flags: How I Ship Risky Changes Without Losing Sleep

Nex Tools — Wed, 13 May 2026 11:43:28 +0000

The riskiest deployment I have ever done was a payment processor migration. The old processor was being deprecated. The new one had better rates but a completely different API. The migration touched the most sensitive code path in the business. A bug in the new path would silently lose revenue or charge customers incorrectly. There was no acceptable amount of downtime.

I shipped that migration on a Tuesday at three in the afternoon. No, that is not a typo. I shipped it in the middle of a normal Tuesday because feature flags let me. The new path was already in production behind a flag set to zero percent of traffic. I ran a script that increased the percentage gradually: one percent, then ten, then fifty, then one hundred. At each step, I watched the metrics. If anything looked wrong, I would have rolled back to zero with a single command. Nothing looked wrong. The migration completed in about ninety minutes and nobody on the team even knew it was happening except me.

That is what feature flags do when they work right. They turn a scary deployment into a routine one. They let you separate the act of shipping code from the act of activating it. They give you the ability to react to problems in seconds instead of minutes or hours. But they only work right when the infrastructure around them is solid. Building that infrastructure is what Claude Code helps with.

Here is the workflow I use to run feature flags across multiple products.

What Feature Flags Are Actually For

Feature flags get pitched as a way to do A/B testing. That is the marketing pitch and it sells products. The actual reason to have feature flags in your codebase is more boring and more important. They let you decouple deployment from release.

When the deployment ships, the new code is in production but the new behavior is not active. When you decide the behavior should be active, you flip the flag and the behavior turns on without redeploying. When you discover a problem, you flip the flag back and the behavior turns off without redeploying.
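In code, the decoupling is nothing more than a branch on a value that lives outside the deployed artifact. The in-memory dictionary and the flag name below are hypothetical stand-ins for whatever flag service a real system would read from.

```python
# Hypothetical in-memory flag store; a real system would read from a flag service.
flags = {"new_price_format": False}


def price_label(cents: int) -> str:
    """Render a price, choosing the behavior by flag state at call time."""
    if flags.get("new_price_format", False):
        return f"${cents / 100:.2f}"   # new behavior, shipped but inactive by default
    return f"{cents} cents"            # existing behavior


print(price_label(1999))               # old path: flag is off
flags["new_price_format"] = True       # release: no redeploy involved
print(price_label(1999))               # new path
flags["new_price_format"] = False      # rollback: flip the flag back
```

Both code paths are deployed the whole time. Only the flag value changes between the three calls.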

The value of feature flags is not in the experimentation they enable. The value is in the asymmetry they create between deployment risk and release risk. A bad deployment forces a rollback. A bad release flips a flag. The difference is the difference between a Sunday night incident and a Tuesday afternoon decision.

The challenge is that feature flags themselves become a source of complexity. Code that is gated behind flags is harder to reason about. Tests have to cover both branches. The flag state has to be consistent across requests. The flag has to be cheap to evaluate. The flag has to be possible to clean up when the experiment is done.

Most teams I have worked with started with a simple flag system and then watched it grow into something unmanageable. Hundreds of flags. Stale flags from features shipped years ago. Flags that nobody remembers what they do. Flag configurations that disagree across environments. The complexity of the flag system eventually exceeds the complexity it was meant to manage.

The Claude Code workflow tackles this by automating the lifecycle of flags from creation through cleanup, and by enforcing the patterns that keep the system manageable.

The Flag Creation Skill

The flag creation skill takes a feature request and produces the flag-gated implementation. The skill handles the boilerplate that makes flags consistent and the discipline that prevents bad patterns.

The skill creates the flag definition in the central flag registry. The registry has a single source of truth for every flag, including its name, its description, its expected lifetime, its owner, and its allowed values. The flag does not exist if it is not in the registry. New flags require an explicit registration step that captures the metadata.
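To make the registry idea concrete, here is a minimal sketch in Python. The names `FlagDefinition`, `FlagRegistry`, and `UnregisteredFlagError` are hypothetical and chosen only for illustration. A real registry would likely live in a config file or a service rather than in process memory.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagDefinition:
    name: str
    description: str
    owner: str
    expires_on: date        # expected end of the flag's lifetime
    allowed_values: tuple   # e.g. (False, True)


class UnregisteredFlagError(KeyError):
    """Raised when code asks about a flag the registry does not know."""


class FlagRegistry:
    def __init__(self) -> None:
        self._flags = {}

    def register(self, definition: FlagDefinition) -> None:
        # Refuse silent overwrites so each name has exactly one definition.
        if definition.name in self._flags:
            raise ValueError(f"flag already registered: {definition.name}")
        self._flags[definition.name] = definition

    def get(self, name: str) -> FlagDefinition:
        try:
            return self._flags[name]
        except KeyError:
            raise UnregisteredFlagError(name) from None


registry = FlagRegistry()
registry.register(FlagDefinition(
    name="new_checkout_ui",
    description="Serve the redesigned checkout page",
    owner="payments-team",
    expires_on=date(2026, 9, 1),
    allowed_values=(False, True),
))
```

Looking up a name that was never registered raises instead of returning a default, which is one way to enforce the rule that a flag does not exist unless it is in the registry.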

The skill instruments the new code with the flag check at the right level. The check is at the boundary where the new behavior diverges from the old. Putting the check too deep means you have to thread the flag value through many layers. Putting the check too shallow means the entire request path has to be duplicated. The right place is the smallest scope that contains all the diverging logic.

The skill writes both branches of the code. The new branch is the new behavior. The old branch is the existing behavior, preserved unchanged. Both branches have tests. The tests run for both branches in CI, ensuring that turning the flag on or off does not break the build.

The skill also writes the migration plan. The plan documents how the flag will be rolled out: starting percentage, ramp schedule, success criteria, and rollback criteria. The plan goes into the PR description and gets reviewed alongside the code. Without a plan, the flag has no path to being fully on or fully off.

The Rollout Skill

Once a flag exists, it has to be rolled out. The rollout skill handles the progressive activation that turns a flag from zero percent to one hundred percent.

The skill operates on a rollout schedule. The schedule has stages, each with a target percentage and a duration. Stage one might be one percent for one hour. Stage two might be ten percent for one day. Stage three might be fifty percent for one day. Stage four is one hundred percent.

Between stages, the skill checks the health metrics for the rollout. The metrics include the error rate for the new behavior, the latency, the conversion rate if it is a user-facing change, and any custom metrics specified in the rollout plan. If the metrics are within the acceptable range, the rollout proceeds to the next stage. If they are outside the range, the rollout halts and a human is notified.

The skill never advances past a stage on a schedule alone. The schedule sets the maximum pace, but the metrics set the actual pace. A rollout that looks fine on the schedule but has degrading metrics will stop. A rollout that has clean metrics but is ahead of schedule will not skip the wait, because the wait is what lets long-tail issues surface.
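The gating rule can be sketched as a small decision function. This is an illustration under assumptions, not the actual skill: the `Stage` type, the `next_action` name, and the single error-rate threshold are hypothetical, and a real rollout would check several metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    percent: int
    min_hours: float  # minimum time to hold at this percentage


# Mirrors the example schedule: 1% for an hour, 10% and 50% for a day each.
SCHEDULE = [Stage(1, 1), Stage(10, 24), Stage(50, 24), Stage(100, 0)]


def next_action(stage_index: int, hours_in_stage: float, error_rate: float,
                max_error_rate: float = 0.01) -> str:
    """Return 'halt', 'wait', 'advance', or 'done' for the current stage."""
    if error_rate > max_error_rate:
        return "halt"     # degraded metrics always stop the rollout
    if stage_index == len(SCHEDULE) - 1:
        return "done"     # already at the final stage
    if hours_in_stage < SCHEDULE[stage_index].min_hours:
        return "wait"     # clean metrics never skip the wait
    return "advance"
```

The metrics check comes first on purpose, so a rollout that has satisfied its schedule still halts when the numbers look bad.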

The skill also handles segment-based rollouts. Sometimes you want to roll out to specific users first: internal staff, then beta testers, then a geographic region, then everyone. The skill expresses these segments in the rollout plan and applies them in sequence. The segment-based rollout catches issues that percentage-based rollouts miss, because a one percent rollout might still miss specific cohorts where the bug manifests.

The Evaluation Skill

The flag has to be evaluated at runtime, and the evaluation has to be fast and consistent. The evaluation skill produces the runtime code that checks whether a flag is on for a given context.

The skill produces an evaluation function that takes a context (user ID, request attributes, environment) and returns the flag value. The function is deterministic. Given the same flag state and the same context, it always returns the same value. This consistency is important for testing and debugging. If you cannot reproduce a flag evaluation, you cannot debug the resulting behavior.
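One common way to get deterministic percentage rollouts is to hash the flag name together with the user ID into a stable bucket. The sketch below assumes that approach. The function names are hypothetical and the original workflow may bucket differently.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_enabled(flag_name: str, user_id: str, rollout_percent: float) -> bool:
    """True when the user's bucket falls inside the rollout percentage."""
    return bucket(flag_name, user_id) < rollout_percent * 100
```

Because the bucket depends only on the flag name and the user ID, the same user gets the same answer on every request, and a user who is enabled at ten percent stays enabled when the rollout grows to fifty. Including the flag name in the hash keeps different flags from always selecting the same users.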

The skill caches flag values aggressively. Flag definitions change rarely. The flag values for a given context can be computed once per request and reused everywhere. The skill produces an in-request cache that avoids redundant evaluations. For longer-lived contexts like user sessions, there is also a session-level cache.
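A per-request cache can be as small as a dictionary wrapped around the evaluation function. The sketch below uses hypothetical names (`RequestFlags`, `slow_evaluate`) and assumes one instance is created per request and discarded afterward.

```python
from typing import Callable, Dict


class RequestFlags:
    """Memoizes flag values for the lifetime of one request."""

    def __init__(self, evaluate: Callable[[str], bool]) -> None:
        self._evaluate = evaluate
        self._cache: Dict[str, bool] = {}

    def get(self, flag_name: str) -> bool:
        if flag_name not in self._cache:
            self._cache[flag_name] = self._evaluate(flag_name)
        return self._cache[flag_name]


calls = []


def slow_evaluate(flag_name: str) -> bool:
    calls.append(flag_name)  # record each real evaluation
    return flag_name == "new_checkout_ui"


flags = RequestFlags(slow_evaluate)
flags.get("new_checkout_ui")
flags.get("new_checkout_ui")  # served from the cache, no second evaluation
```

A side benefit of scoping the cache to the request is that a flag cannot change value halfway through handling it.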

The skill also handles flag dependencies. Sometimes one flag depends on another. The dependent flag should not be evaluated if the parent flag is off. The skill produces evaluation code that respects the dependency graph and avoids spurious evaluations.

The evaluation has to be cheap. Every request might evaluate dozens of flags. If each evaluation takes a millisecond, the request latency adds up quickly. The skill produces evaluation code that completes in microseconds for the cached case and in milliseconds even for the cold case. The performance budget for flag evaluation is much tighter than most teams realize.

The Observability Skill

Flags need observability for the same reason any production code needs observability. You have to know what the flag is doing. The observability skill adds the instrumentation that makes flags debuggable.

Every flag evaluation produces a log line. The line includes the flag name, the context attributes that mattered, the value returned, and the reason the value was chosen. The reason is important. A flag returning true might be returning true because the user is in the rollout cohort, or because the user is in a force-on list, or because the flag is fully on. The reason tells you which path was taken.
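Here is a sketch of an evaluation that returns its reason alongside the value. The reason strings, the field names, and the 0 to 99 `user_bucket` input are assumptions made for the example.

```python
import json
from typing import Set, Tuple


def evaluate_with_reason(user_id: str, user_bucket: int, rollout_percent: int,
                         force_on: Set[str]) -> Tuple[bool, str]:
    """Return the flag value together with the rule that produced it."""
    if user_id in force_on:
        return True, "force_on_list"
    if rollout_percent >= 100:
        return True, "fully_on"
    if user_bucket < rollout_percent:
        return True, "rollout_cohort"
    return False, "outside_cohort"


def flag_log_line(flag_name: str, user_id: str, value: bool, reason: str) -> str:
    """Serialize one evaluation as a structured log line."""
    return json.dumps({
        "event": "flag_evaluated",
        "flag": flag_name,
        "user_id": user_id,
        "value": value,
        "reason": reason,
    })


value, reason = evaluate_with_reason("u-42", user_bucket=7, rollout_percent=10,
                                     force_on=set())
print(flag_log_line("new_checkout_ui", "u-42", value, reason))
```

Three of the four paths return true, and without the reason field they would be indistinguishable in the logs.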

Every flag value gets a metric. The metric tracks the distribution of values for the flag over time. You can see at a glance whether a flag is at one percent, ten percent, or one hundred percent. You can see when a flag was changed, by looking for the inflection point in the metric. You can see whether the rollout is consistent with what the configuration says, by comparing the metric to the configured percentage.

The skill also produces trace spans that show which flags were evaluated during a request and what values they returned. The trace span is what lets you debug behavior that depends on flag values. When a user reports an issue, you can pull up their request trace and see exactly which flag values applied to their session.

The observability is what makes the flag system trustworthy. Without it, flag-related bugs are hard to diagnose because you cannot tell what the flag system was doing. With it, the flag system is transparent and bugs are quick to find.

The Cleanup Skill

The biggest source of flag debt is flags that should have been removed but were not. The cleanup skill is what keeps the flag system from growing unbounded.

The skill watches the flag registry for flags that are eligible for cleanup. A flag is eligible if it has been at one hundred percent or zero percent for a sufficient period, with no recent changes. The threshold is configurable, typically a few weeks for fully rolled out flags and a few days for fully rolled back ones.
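The eligibility rule fits in a few lines. The thresholds below (three weeks and three days) are placeholder values picked to match "a few weeks" and "a few days", and the function name is hypothetical.

```python
from datetime import datetime, timedelta

FULLY_ON_QUIET_PERIOD = timedelta(weeks=3)   # assumed value for "a few weeks"
FULLY_OFF_QUIET_PERIOD = timedelta(days=3)   # assumed value for "a few days"


def eligible_for_cleanup(percent: int, last_changed: datetime,
                         now: datetime) -> bool:
    """A flag qualifies only when fully on or fully off and unchanged long enough."""
    quiet_for = now - last_changed
    if percent == 100:
        return quiet_for >= FULLY_ON_QUIET_PERIOD
    if percent == 0:
        return quiet_for >= FULLY_OFF_QUIET_PERIOD
    return False  # partially rolled out flags are never cleanup candidates
```

A flag stuck at a partial percentage never qualifies here, which is why stalled rollouts need their own report.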

When a flag is eligible, the skill produces a cleanup PR. The PR removes the flag check from the code, keeping only the active branch. The removed branch is the dead branch, the one not in use. The flag definition is also removed from the registry. The tests are updated to remove the branch coverage that no longer applies.

The cleanup PR goes through normal review. A human looks at it, confirms the removed branch is actually dead, and merges. The merge is what completes the flag lifecycle. The flag existed for as long as it needed to. Now it does not exist and the codebase is simpler.

The skill also surfaces flags that have not had any activity for a long time. These are zombie flags, where the rollout stalled and was never completed. The zombie flag should either be rolled out the rest of the way or rolled back. The skill produces a report that surfaces the zombies and asks for a decision.

The Coordination Skill

Multiple flags interact. A flag for a new payment processor might depend on a flag for the new checkout UI, which might depend on a flag for the new auth flow. The coordination skill manages these interactions.

The skill maintains a dependency graph of flags. The graph encodes which flags depend on which. When you change one flag, the skill checks whether the change is consistent with the dependent flags. Turning on a flag that depends on another flag that is off produces an error.
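A dependency check like this can be sketched as a walk over the graph. The flag names come from the example above. The `DEPENDS_ON` mapping and the `check_can_enable` function are hypothetical.

```python
from typing import Dict, List, Set

# Each flag maps to the flags it directly depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "new_payment_processor": ["new_checkout_ui"],
    "new_checkout_ui": ["new_auth_flow"],
    "new_auth_flow": [],
}


def check_can_enable(flag: str, enabled: Set[str]) -> None:
    """Raise ValueError if any direct or transitive dependency of flag is off."""
    stack = list(DEPENDS_ON.get(flag, []))
    seen: Set[str] = set()
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        if parent not in enabled:
            raise ValueError(f"cannot enable {flag}: {parent} is off")
        stack.extend(DEPENDS_ON.get(parent, []))
```

The walk covers transitive dependencies, so turning on the payment processor flag fails when the auth flow flag is off even if the checkout flag is on.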

The skill also handles experiment conflicts. Two experiments running at the same time might target overlapping segments. The skill detects the overlap and warns. Sometimes the overlap is fine because the experiments are independent. Sometimes the overlap is a problem because the experiments interfere with each other.

The skill produces a calendar of flag activity. Each flag rollout shows up on the calendar with its start date and end date. The calendar helps the team see when the system has many things in flight versus when it is quiet. Scheduling a risky rollout for a quiet period reduces the chance of conflicting changes.

The Testing Skill

Flag-gated code has to be tested for both branches. The testing skill ensures this happens automatically.

The skill produces a test matrix for each flag. The matrix enumerates the relevant flag values: off, on, partial. For each value, the existing test suite runs against that flag state. If any test fails for any flag value, the build fails. The matrix ensures that turning the flag on or off does not break the system.
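In miniature, the matrix is a loop that runs the same check under every flag state. Everything in this sketch is hypothetical, including the `shipping_cost` function standing in for flag-gated code. In a real suite a test framework's parametrization would play the role of `run_matrix`.

```python
from typing import Callable, Dict

FLAG_STATES = {"off": 0, "partial": 50, "on": 100}


def run_matrix(check: Callable[[int], None]) -> Dict[str, str]:
    """Run one check under each flag state and record pass or fail per state."""
    results = {}
    for label, percent in FLAG_STATES.items():
        try:
            check(percent)
            results[label] = "pass"
        except AssertionError as exc:
            results[label] = f"fail: {exc}"
    return results


def build_passes(results: Dict[str, str]) -> bool:
    """The build is green only if every flag state passed."""
    return all(outcome == "pass" for outcome in results.values())


def shipping_cost(order_total: float, rollout_percent: int) -> float:
    """Toy code under test: the new branch gives free shipping over 50."""
    if rollout_percent > 0 and order_total > 50:
        return 0.0
    return 5.0


def check_shipping_never_negative(rollout_percent: int) -> None:
    assert shipping_cost(80.0, rollout_percent) >= 0.0


results = run_matrix(check_shipping_never_negative)
```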

The skill also produces flag-specific transition tests. These tests target the boundary between the two branches and verify that the boundary works correctly. They catch bugs where the flag check is in the wrong place or where the two branches diverge in unexpected ways.

For more complex flags, the skill produces integration tests that exercise the full flow through both branches. The integration tests are slow but catch the kinds of bugs that unit tests miss. Integration tests run on a schedule rather than on every commit, so the slowness does not block the team.

The skill also handles snapshot testing for user-visible changes. When a flag is going to change a user interface, the skill produces snapshots for both states. The snapshots get reviewed before the flag rolls out, ensuring that the visual change is intended.

How the Skills Compose

The skills compose into a flag lifecycle. A new feature comes in. The creation skill produces the flag-gated implementation. The rollout skill activates the flag gradually. The evaluation skill makes the flag fast at runtime. The observability skill makes the flag visible. The coordination skill keeps the interactions clean. The testing skill catches the branch bugs. The cleanup skill removes the flag when it is done.

The team interacts with the skills through normal PR flow. The creation PR adds the flag and the gated code. The rollout PR sets the schedule. The cleanup PR removes the completed flag. Each PR is small and reviewable. The skills handle the boilerplate but the humans make the decisions.

The result is a flag system that scales to many flags without becoming a source of pain. Flags get created, rolled out, observed, and cleaned up on a regular cadence. The codebase stays manageable. The risk of changes goes down.

What This Costs

The skills took about a month to build. The evaluation skill was the most complex piece because the performance requirements are tight and the consistency requirements are strict. The cleanup skill required the most ongoing tuning to avoid false positives that would propose removing flags that were still in use.

The benefit is in the rate at which the team can ship. Before the flag system, every risky change required a deployment that activated the change immediately. The deployment had to be timed carefully and watched closely. With the flag system, the deployment is decoupled from the activation. Risky changes ship at any time and activate when the conditions are right.

The benefit also shows up in the rate of recovery from problems. A bug that gets discovered after a flag-gated rollout takes seconds to mitigate. The flag flips off and the bug stops happening. The fix can be deployed at a normal pace because the production impact is already neutralized.

What the Skills Do Not Do

The skills do not pick your flag platform. Whether you use a SaaS flag service, an open source flag system, or a homegrown one, the skills produce code that fits the platform's evaluation interface. The platform itself is your choice.

The skills also do not write your feature code. They handle the flag gating and the lifecycle, but the feature behavior is yours to design and implement. The flag is the wrapping around the feature, not the feature itself.

The skills also do not decide which features should be flag-gated. Not every change needs a flag. Small changes, low-risk changes, and changes that have no rollback path do not benefit from flagging. The decision of when to use flags is a judgment call that the skills support but do not make.

Setting Up Your Own Flag System

Start with the registry. The registry is the foundation. Every flag exists in the registry, with its metadata. Without the registry, the flag system is not manageable at scale.

Add the evaluation library next. The library is what application code uses to check flags. The library should be small, fast, and well-tested. It is the most performance-sensitive part of the system.

Add observability third. Get flag evaluations into your logs and metrics. This is what makes the flag system debuggable when it does not behave as expected.

Add the rollout tooling after that. The progressive rollout is what turns flags from a binary on-off switch into a graduated mechanism for managing risk.

Add cleanup last. The cleanup matters but it can wait until you have a few flags in flight and need to start removing them.

The Bigger Picture

Feature flags are one of those tools that pay off enormously when they work and create chaos when they do not. The difference is the infrastructure around them. A flag system without a registry is a mess of strings scattered through the codebase. A flag system without observability is a black box that nobody trusts. A flag system without cleanup is a graveyard of dead branches that accumulate over years.

The pattern in this workflow is the pattern I keep applying. The repetitive parts of flag management get automated. The judgment parts stay human. The infrastructure is what lets the team use flags confidently. Without the infrastructure, flags become risky. With it, flags become routine.

If you have a codebase that needs more sophisticated release management than your current deployment pipeline supports, the answer is probably not to invest in faster rollbacks. The answer is to invest in flags. The flags are what let you take the deployment pressure off, ship more often, and recover faster when things go wrong.

The first concrete step is the registry. Create a single source of truth for flags. Make it impossible to use a flag that is not registered. Once the registry exists, every other piece of the system has a foundation to build on. Without it, the flags drift and the system becomes ungovernable.

Build the registry. Add one flag. Watch it through its full lifecycle. Then add the next one. The discipline compounds quickly.

FAQ

Should every change go behind a flag? No. Small changes and changes that you can revert with a quick deploy do not benefit from the overhead. Use flags for changes where the risk of a bad release is high.

How long should a flag live? As short as possible. A flag that is fully rolled out should be cleaned up within weeks. A flag that is fully rolled back should be cleaned up within days. Flags that live for years are usually a sign of a stalled rollout.

What about flags for permissions? Permission flags are different from rollout flags. Permission flags live forever and are part of the application's permanent logic. The same registry can hold both, but they should be tagged differently and treated differently by the cleanup skill.

What about server-side versus client-side flags? The same patterns apply but the implementation differs. Server-side flags evaluate on every request. Client-side flags evaluate once per session. The evaluation skill handles both modes with the same interface.

What is the biggest mistake to avoid? Adding flags without a plan to remove them. Every flag should have an expected end state. Without a plan, the flag accumulates and the system gets messy.

Claude Code for Observability Stacks: How I Stopped Flying Blind in Production

Nex Tools — Wed, 13 May 2026 11:37:38 +0000

The first real outage I had to debug without proper observability took fourteen hours to resolve. The system was throwing 500s intermittently. Logs showed nothing useful. Metrics showed the error rate climbing but no signal about why. Traces did not exist. I spent the entire day adding log lines, redeploying, watching, and adding more log lines until I finally cornered the root cause.

The fix took eight minutes once I understood what was happening. The other thirteen hours and fifty-two minutes were spent building the observability that should have already been in place.

After that incident, I made a rule. Every service has to have observability built in from day one. Not as a future improvement. Not as something that gets added when there is time. Built in from the first commit. I have kept that rule for years now, and Claude Code is the thing that made it cheap enough to actually follow.

Here is the workflow I use to build and maintain observability across every service I run.

What Good Observability Actually Means

Good observability is the ability to answer questions about your system without having to deploy new code to answer them. When something breaks, you should be able to look at what is already being collected and figure out the answer. When you cannot answer the question with existing data, that is a gap in your observability and the next outage will be the one that exposes it.

The three pillars are logs, metrics, and traces. Each one answers different questions. Logs tell you what happened. Metrics tell you how often it happened and how it changed over time. Traces tell you how a single request moved through the rest of the system and where it spent its time.

You do not have observability when you have all three pillars deployed. You have observability when you can answer the question you actually need to answer at three in the morning when something is on fire and you have ten minutes to find the root cause. The three pillars are necessary but not sufficient.

The hard part of observability is not picking the tools. The hard part is making sure the right data is being collected in the right shape, with the right labels, at the right cardinality. This is where most observability stacks fail. The tools are deployed but the data is wrong, and at the moment of crisis the answer is not in the system.

The Claude Code workflow targets the data collection problem directly. The skills produce instrumentation that is consistent across services, that captures the right shape of information, and that gets maintained as the services evolve.

The Instrumentation Skill

The instrumentation skill takes a service and adds the structured logs, metrics, and traces that the service needs to be debuggable.

The skill starts by reading the service code and identifying the natural instrumentation points. Public API endpoints get request and response logging with latency metrics and distributed trace spans. Database queries get duration metrics and trace spans with the query type as a label. External API calls get the same treatment, plus retry counts and circuit breaker state. Background jobs get start, complete, and failure events with duration histograms.

The skill applies the same patterns across every service. The endpoint logs have the same fields everywhere. The trace spans have consistent naming. The metrics have a shared set of labels. The consistency is what lets dashboards and alerts work across services without per-service configuration.

The skill avoids over-instrumentation. Adding a log line to every function in a codebase produces enormous volumes of low-value data that drowns out the signals you actually need. The skill instruments at the boundary points where data crosses subsystem lines. Inside a subsystem, only the points that have historically been useful for debugging get instrumented.

The skill also handles the cardinality problem. Metrics with high-cardinality labels explode in storage cost and query latency. The skill identifies labels that should be high-cardinality (request IDs, user IDs) and ensures they go into logs and traces rather than metrics. It identifies labels that should be low-cardinality (endpoint paths, error categories) and uses those for metrics.

The Schema Skill

Logs without a schema are unsearchable. Logs with a schema are queryable like a database. The schema skill makes sure every log line follows the schema for the service.

The schema starts simple. Every log line has a timestamp, a level, a service name, a request ID if one is available, and a message. Every log line is JSON, not free text. The fields beyond these are specific to the event being logged.

The skill captures the per-event fields as it instruments. A login event logs the user ID, the auth method, and whether the attempt succeeded. A database query event logs the query type, the table, the duration, and the row count. The fields are explicit and consistent across the codebase.
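With Python's standard `logging` module, a schema like this can be enforced in a formatter. The sketch below is one possible shape, not the schema skill's actual output. The `JsonFormatter` class and the convention of passing per-event data under a `fields` key in `extra` are assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Renders every record as one JSON object with the shared base fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # request_id is optional: include it only when the caller supplied one.
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            entry["request_id"] = request_id
        # Per-event fields arrive through logging's `extra` mechanism.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="auth"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("login attempt", extra={
    "request_id": "req-123",
    "fields": {"user_id": "u-42", "auth_method": "password", "success": True},
})
```

Keeping the base fields in the formatter means no call site can forget them, and the event-specific fields stay explicit at the point where the event is logged.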

The skill produces a schema document that lists every event type, the fields it carries, and what each field means. The document gets checked into the repository and updated whenever a new event type is added. The document is what makes the logs usable by anyone other than the person who wrote them.

The schema also drives the log aggregation pipeline. The aggregator parses the JSON, extracts the fields, and indexes them so they can be queried. The indexing is much faster than searching through unstructured text. A query that takes thirty seconds against unstructured logs takes a few hundred milliseconds against indexed structured logs.

The Trace Sampling Skill

Distributed tracing is valuable but expensive. Tracing every request consumes too much storage and slows down query performance. Sampling solves the cost problem but introduces a bias problem. Random sampling misses the rare failures that you actually want to investigate.

The trace sampling skill implements smart sampling that catches the interesting traces while keeping the cost manageable. Every trace gets a sampling decision based on its characteristics.

Traces from healthy successful requests get sampled at a low rate, maybe one in a hundred. The aggregate behavior is captured but the storage cost is low. Traces from slow requests, where latency exceeds a threshold, get sampled at a higher rate, maybe one in ten. Traces from failed requests, where the response is an error, get sampled at one hundred percent. Every failure is captured.
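Those three tiers translate directly into a rate function. In this sketch the one-second slow threshold and the choice to treat only 5xx responses as failures are assumptions, and both function names are hypothetical.

```python
import random
from typing import Callable


def sample_rate(status_code: int, latency_ms: float,
                slow_threshold_ms: float = 1000.0) -> float:
    """Pick the keep probability from the outcome of the request."""
    if status_code >= 500:
        return 1.0    # every failure is kept
    if latency_ms > slow_threshold_ms:
        return 0.1    # slow requests: one in ten
    return 0.01       # healthy requests: one in a hundred


def should_keep(status_code: int, latency_ms: float,
                rng: Callable[[], float] = random.random) -> bool:
    """Draw once and keep the trace if the draw falls under the rate."""
    return rng() < sample_rate(status_code, latency_ms)
```

Since `random.random()` returns values below 1.0, a rate of 1.0 keeps every failed request. The `rng` parameter exists so the decision can be tested with a fixed draw.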

The skill also implements tail sampling for the cases where the sampling decision has to wait until the trace is complete. A request that started normal but ended in an error needs to be sampled, but the decision can only be made after the error happens. The tail sampler buffers traces in memory and makes the decision when the trace ends.
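A stripped-down tail sampler might look like the following. The `TailSampler` class and the `error` key on each span are hypothetical, and the sketch leaves out things a production sampler needs, such as a memory bound and a timeout for traces that never finish.

```python
from collections import defaultdict
from typing import Dict, List


class TailSampler:
    """Buffers spans per trace and decides only once the trace has ended."""

    def __init__(self) -> None:
        self._buffers: Dict[str, List[dict]] = defaultdict(list)

    def add_span(self, trace_id: str, span: dict) -> None:
        self._buffers[trace_id].append(span)

    def finish(self, trace_id: str) -> List[dict]:
        """Return the spans to export: all of them if any span failed, else none."""
        spans = self._buffers.pop(trace_id, [])
        if any(span.get("error") for span in spans):
            return spans
        return []


sampler = TailSampler()
sampler.add_span("t1", {"name": "GET /checkout", "error": False})
sampler.add_span("t1", {"name": "charge_card", "error": True})
kept = sampler.finish("t1")  # both spans, because the trace ended in an error
```

This version drops healthy traces entirely to keep the example short. In practice the healthy path would fall back to the low-rate probabilistic decision instead.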

The result is a trace storage that is dominated by failures and outliers, which is exactly what you want for debugging. The successful requests are represented but not dominant. The cost stays manageable and the data stays useful.

The Alert Generation Skill

Alerts are the part of observability that most teams get wrong. Either there are too many alerts and people stop responding, or there are too few alerts and outages go undetected for hours. The alert generation skill produces alerts that are actionable, specific, and rare.

The skill starts from the service-level objectives. Each service has objectives for availability, latency, and correctness. The alerts measure deviation from those objectives. When the error budget is being consumed at a rate that would exhaust it before the period ends, an alert fires.
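The usual way to express this is a burn rate: the observed error ratio divided by the error budget, where the budget is one minus the objective. A burn rate above 1.0, if sustained, exhausts the budget before the window ends. The sketch below shows only the arithmetic. The function names and the single threshold are assumptions, and real alerting typically evaluates more than one time window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% objective
    return error_ratio / budget


def should_alert(error_ratio: float, slo_target: float,
                 threshold: float = 1.0) -> bool:
    """Fire when the current pace would exhaust the budget early."""
    return burn_rate(error_ratio, slo_target) > threshold
```

For a 99.9% objective, an error ratio of 0.2% is a burn rate of about 2, meaning the budget would be gone roughly halfway through the window.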

The skill avoids the common trap of alerting on raw metrics. An alert that fires when CPU usage exceeds 90% is mostly noise. CPU usage at 90% is not a problem if the requests are still being served fast. The alert should fire on the user-visible effect, not the internal cause.

The skill also avoids per-instance alerts. When a single instance is unhealthy, the load balancer should route around it and the system should self-heal. The alert should fire when the system as a whole cannot self-heal, which is when the redundancy is exhausted.

Each alert produced by the skill includes the runbook link that explains what the alert means, what to check, and how to mitigate. The runbook is generated alongside the alert and updated whenever the alert is modified. The alert without the runbook is useless. The combination is actionable.

The Dashboard Generation Skill

Dashboards are the interface that lets a human understand a system at a glance. Dashboards that have everything on them are unusable. Dashboards that have only one metric are not informative enough. The right level of detail is hard to find.

The dashboard generation skill produces dashboards for each service following a consistent template. The template has four panels at the top showing the four golden signals: latency, traffic, errors, and saturation. Below those, there are panels for the specific behavior of the service.

The skill picks the specific panels based on what the service does. An API service gets per-endpoint latency and error breakdowns. A background worker gets queue depth and processing latency. A database client gets per-query duration and connection pool saturation. The specifics are different but the layout is consistent.

The skill also produces composite dashboards that show multiple services together. When a user-facing feature spans several services, the dashboard for that feature shows the relevant panels from each service on one page. The composite dashboards are what get used during incident response, when you need to see across the whole call chain at once.

The dashboards get committed to the repository as code rather than configured in the UI. The code is reviewable, versioned, and reproducible. When a dashboard changes, the change goes through the same review process as any other code change.

The Correlation Skill

The hardest part of debugging in production is correlating signals across the three pillars. The metric shows the error rate climbing. The logs show a stream of errors. The traces show specific failed requests. Connecting these requires a shared identifier that flows through all three.

The correlation skill ensures that every request gets a request ID that propagates everywhere. The ID is created at the edge of the system, included in the logs, and attached to the trace. Because request IDs are high-cardinality, they are not used as metric labels; the relevant metrics link to the ID through sampled exemplars instead. With the ID in place, you can pivot between the three pillars by querying for the same ID.

The skill also adds correlation for asynchronous flows. A background job triggered by a request gets the request ID propagated through the job queue. A retry of a failed operation gets the original request ID so the full retry chain can be traced. A user session ID gets attached to every request in the session so you can see the full user journey.
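In Python, `contextvars` is one way to carry the ID through a request without threading it through every function signature. The sketch below uses a plain list as a stand-in for the job queue. The function names and the job envelope shape are hypothetical.

```python
import contextvars
import uuid
from typing import Optional

request_id_var: contextvars.ContextVar = contextvars.ContextVar(
    "request_id", default=None)


def start_request(incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's ID when present, otherwise mint one at the edge."""
    request_id = incoming_id or uuid.uuid4().hex
    request_id_var.set(request_id)
    return request_id


def enqueue_job(queue: list, payload: dict) -> None:
    """Stamp the current request ID onto the job envelope."""
    queue.append({"request_id": request_id_var.get(), "payload": payload})


def run_job(job: dict) -> str:
    """Restore the originating request ID before doing the work."""
    request_id_var.set(job["request_id"])
    return request_id_var.get()


queue: list = []
start_request("req-abc")
enqueue_job(queue, {"task": "send_receipt"})
```

When the worker later calls `run_job(queue[0])`, the restored ID is `req-abc`, so anything the job logs can be joined back to the original request.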

The correlation is what makes the observability data composable. Without it, each pillar is an island. With it, the pillars become a connected graph that you can navigate based on the question you are asking.

The Cost Control Skill

Observability data is expensive. Storage costs scale with the volume of data. Query costs scale with the cardinality of the labels. A naively instrumented service can produce so much data that the observability bill exceeds the compute bill.

The cost control skill keeps the observability spend in check. The skill watches the ingestion volume per service and alerts when a service starts producing significantly more data than its peers. The alert prompts a review of whether the additional data is valuable or whether it represents an instrumentation mistake.

The skill also implements log level controls per service. Production runs at INFO level by default, which captures the events that matter without the volume of DEBUG. When a service is being actively debugged, the level can be switched to DEBUG for a short window and then returned to INFO. The temporary verbose period gives you the data you need without paying for it all the time.

The skill manages retention as well. High-cardinality data like traces gets retained for a short window, maybe seven days, because the value of a trace drops quickly after the incident is resolved. Lower-cardinality data like metrics gets retained for a longer window, maybe a year, because long-term trend analysis is valuable. The retention policies match the value of the data.

The cost control skill turns observability from an open-ended expense into a managed one. The spend has a budget. The budget gets allocated across services based on their criticality. The skill makes sure the allocation is being respected.

How the Skills Compose

The skills compose into an observability practice. The instrumentation skill adds the data collection. The schema skill makes the data queryable. The trace sampling skill keeps the volume manageable. The alert generation skill turns the data into actionable signals. The dashboard generation skill turns the data into visual summaries. The correlation skill makes the data connectable. The cost control skill keeps the bill predictable.

A new service comes into the system with all of this from day one. The instrumentation skill runs on the initial codebase. The schema document is generated. The trace sampling is configured. The alerts are generated. The dashboards are created. The service ships with observability built in.

When the service evolves, the skills evolve with it. New endpoints get instrumented as they are added. New event types get added to the schema. New alerts and dashboards appear as the surface area grows. The observability stays current without anyone scheduling observability work.

What This Costs

Building the skills took a few weeks. The instrumentation skill was the largest single piece because it has to understand many different code patterns. The schema and dashboard generation skills came next. The alert generation skill required the most tuning because alert quality is hard to get right.

Once the skills are in place, the cost of adding observability to a new service is close to zero. The skill runs, the output is reviewed and merged, and the service is observable. Compared to the days or weeks of manual work this would have taken, the savings are enormous.

The bigger benefit is the consistency. Every service follows the same patterns. Every dashboard has the same shape. Every alert links to a runbook. When something breaks, the cognitive load of finding the right data is low because the data is always in the same place.

What the Skills Do Not Do

The skills do not pick your observability vendor. Whether you use an open source stack, a commercial platform, or something in between, the skills produce instrumentation that fits the OpenTelemetry standard. The downstream pipeline is yours to configure.

The skills also do not replace the human judgment in incident response. They give you the data, but the data does not interpret itself. When something is breaking, a human has to look at the dashboards, read the logs, and decide what to do. The skills make this easier but do not automate it.

The skills also do not write your service-level objectives. The objectives are a product and business decision. The skill takes the objectives as input and produces alerts and dashboards that measure against them, but the objectives themselves come from you.

Setting Up Your Own Stack

Start with structured logging. Get every service producing JSON logs with a consistent schema. This is the foundation that everything else builds on. Without it, the other pillars cannot connect to the logs.

Add request ID correlation next. Make sure every log line in a request flow carries the same ID. Once the IDs are in place, you can connect logs from different services that participated in the same request.
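
As a rough illustration of the first two steps, here is a minimal sketch of a logger that emits one JSON object per line and stamps every line with the same request ID. The field names (timestamp, level, service, requestId, message, fields) and the createRequestLogger helper are assumptions made for this example, not the schema the skills generate.

```typescript
// Hypothetical log schema: the field names are assumptions for illustration.
interface LogEntry {
  timestamp: string;
  level: "debug" | "info" | "warn" | "error";
  service: string;
  requestId: string;
  message: string;
  fields: Record<string, unknown>;
}

type Level = LogEntry["level"];

// Returns a logger bound to one service and one request, so every line
// emitted while handling that request carries the same requestId.
function createRequestLogger(
  service: string,
  requestId: string,
  write: (line: string) => void,
  now: () => Date = () => new Date(),
) {
  const log = (level: Level, message: string, fields: Record<string, unknown> = {}): void => {
    const entry: LogEntry = {
      timestamp: now().toISOString(),
      level,
      service,
      requestId,
      message,
      fields,
    };
    write(JSON.stringify(entry)); // one JSON object per line
  };
  return {
    info: (message: string, fields?: Record<string, unknown>) => log("info", message, fields),
    error: (message: string, fields?: Record<string, unknown>) => log("error", message, fields),
  };
}

// Usage: lines are collected in memory here; a real service would write to stdout.
const lines: string[] = [];
const logger = createRequestLogger("checkout", "req-123", (line) => lines.push(line));
logger.info("order received", { items: 3 });
logger.error("payment declined", { code: "card_expired" });
```

Because the request ID is bound once when the logger is created, individual call sites cannot forget to include it, which is the property that makes cross-service correlation work.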

Add metrics third. Start with the four golden signals per service: latency, traffic, errors, and saturation. Add custom metrics as you discover the need. Resist the temptation to add metrics for everything just because you can.

Add traces fourth. Traces are the most expensive pillar and the one with the highest setup cost, so it makes sense to add them after the cheaper pillars are working. The smart sampling skill keeps the cost manageable.

Add alerts and dashboards last. These depend on the data being clean and the schema being stable. Premature alerting produces noise. Premature dashboards become abandoned.

The Bigger Picture

Observability is the kind of work that pays off enormously but feels invisible when it is working. When the system is healthy, you do not think about your observability stack. When something breaks, the stack is either there to help you or it is not. The investment in observability is paid back in minutes saved during outages, and the savings compound across every outage for the life of the system.

The pattern in this workflow is the same pattern I keep using. The repetitive parts of observability work get automated. The judgment parts remain human. The result is a practice that scales without scaling headcount, and a production environment where outages get resolved in minutes instead of hours.

If you have services in production without proper observability, the answer is not to wait for a quieter quarter. Build the workflow. Add observability to one service. Use the workflow to add it to the next service. After a few services, the workflow is mature and adding observability to the rest is fast.

The first concrete step is structured logging. Every log line as JSON, every log line with a request ID, every service following the same schema. Once that is in place, the rest of the stack starts to make sense. Without it, every additional pillar is harder than it needs to be.

Pick one service. Add structured logging. Verify the logs are queryable. Then move to the next service. The compounding starts immediately.

FAQ

Which observability platform should I use? The skills produce OpenTelemetry-compatible output, which works with most platforms. Pick the platform based on your team's familiarity and your budget.

How much should I spend on observability? A reasonable starting point is between five and ten percent of the compute spend for the service. If you are spending much less, you probably do not have enough observability. If you are spending much more, you probably have too much.

What about security and privacy? Logs and traces can capture sensitive data. The skills include a redaction step that removes known sensitive fields before the data is shipped to the observability platform. Configure the redaction rules for your context.
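
To make the redaction idea concrete, here is a minimal sketch of a recursive redaction pass over a log payload. The list of sensitive keys and the "[REDACTED]" marker are assumptions for this example; the actual rules would come from your own configuration.

```typescript
// Hypothetical rule set: these field names are examples, not the skill's list.
const SENSITIVE_KEYS = new Set(["password", "authorization", "email", "card_number"]);

// Returns a deep copy of `value` with sensitive fields replaced by a marker.
// Keys are compared case-insensitively.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(value)) {
      const inner = (value as Record<string, unknown>)[key];
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}

// Usage: redact before the payload leaves the process.
const safePayload = redact({
  user: { email: "a@example.com", id: 1 },
  Authorization: "Bearer abc",
  tags: ["checkout"],
});
```

Matching on key names only catches known fields. Sensitive values embedded in free-text messages need a separate pattern-based pass, which this sketch does not attempt.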

What about local development? The skills produce the same instrumentation locally as in production. Local logs and traces go to a local collector. This way you can debug observability issues without needing to deploy.

What is the biggest mistake to avoid? Treating observability as something to add later. The data you cannot collect during the incident is data you cannot have. Build it in from the start.

If you found this useful, follow for more posts about practical Claude Code workflows. I write about how I run a multi-product business with AI agents handling most of the operational work.

Claude Code for TypeScript Migrations: How I Converted a 200,000-Line JavaScript Codebase Without Stopping Shipping

Nex Tools — Mon, 11 May 2026 13:18:59 +0000

Originally published on Hashnode. Cross-posted for the DEV.to community.

The first time I tried to migrate a large JavaScript codebase to TypeScript, I made the classic mistake. I planned a six-week migration project, kicked off with a team meeting, started at the top of the directory tree, and got about 4% through before the project stalled. Other work kept coming in. The migration sat in a long-running branch that diverged from main every day. Three months later, I gave up and merged the partial work back with a lot of any types and a lot of regrets.

The second time I tried, I had learned the lesson. Big-bang migrations do not work in production codebases that have to keep shipping. The migration has to be incremental, and it has to be done in a way that lets the rest of the team keep working without coordination overhead. The tools that exist for this kind of migration are good but require enormous amounts of human time to apply correctly across a large codebase.

This is where Claude Code changed the math for me. I built a migration workflow that turned what would have been a six-month project into a six-week project, and the codebase kept shipping the entire time. Today, the migration is done, the code is fully typed, and the team is faster than they were before. Here is how the workflow works.

Why Most TypeScript Migrations Fail

The reason most TypeScript migrations fail is that they are framed as a project. Projects have start dates and end dates and dedicated resources. Production codebases do not. They have a continuous stream of feature work that cannot stop, and they have a team that cannot pause for weeks to focus on a migration.

When the migration is a project, it competes with feature work for time. Feature work always wins because feature work has external pressure. The migration loses, falls behind schedule, and eventually gets canceled or quietly abandoned.

The migrations I have seen succeed are the ones framed as a continuous activity rather than a project. The work happens alongside feature work. Each commit takes a small bite out of the migration. The bites accumulate. After a few months, the migration is done without anyone having scheduled a migration sprint.

The migration succeeds when it becomes invisible. When the work is so cheap that every PR can include a slice of it without anyone noticing, the migration moves forward at the speed of the regular development cadence. The migration that demands focus is the migration that gets deprioritized.

The challenge is making the work cheap enough. TypeScript migration is not naturally cheap. Adding types to existing code requires understanding the code, the runtime behavior, the patterns of use, and the edge cases. Doing this well across thousands of files takes a long time. Doing this badly produces a codebase full of any types that buys none of the benefits of TypeScript.

The Claude Code workflow makes the work cheap by automating the parts that can be automated and focusing human attention on the parts that cannot.

The Foundation Skill

Before you can migrate any code, you have to set up the foundation. The foundation skill handles the project configuration that makes incremental migration possible.

The skill configures the TypeScript compiler to accept both .ts and .js files. It enables allowJs so existing JavaScript files keep working. It enables checkJs so JSDoc types in JavaScript files get checked. It sets strict mode for new TypeScript files but allows untyped JavaScript files to coexist.
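
A minimal sketch of those compiler settings, written as a typed object that can be serialized into tsconfig.json. The option names (allowJs, checkJs, strict, noEmit) are real TypeScript compiler options; the include path and the choice of noEmit are assumptions for this example.

```typescript
// A sketch of the settings described above. The interface only covers the
// options used here; it is not the full tsconfig schema.
interface MigrationTsConfig {
  compilerOptions: {
    allowJs: boolean;
    checkJs: boolean;
    strict: boolean;
    noEmit: boolean;
  };
  include: string[];
}

const migrationTsConfig: MigrationTsConfig = {
  compilerOptions: {
    allowJs: true, // .js files are part of the program alongside .ts files
    checkJs: true, // type-check .js files, including their JSDoc annotations
    strict: true, // enable the strict family of checks
    noEmit: true, // assumption: the bundler emits output; tsc only type-checks
  },
  include: ["src/**/*"], // hypothetical source directory
};

// Serialize to the text that would be written to tsconfig.json.
const tsconfigJson: string = JSON.stringify(migrationTsConfig, null, 2);
```

One caveat: with checkJs and strict both on, legacy JavaScript files are likely to report errors. How the skill reconciles that is not spelled out here; one option is to put a `// @ts-nocheck` comment at the top of files that are not ready yet and remove it as each file is migrated.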

The skill also sets up the build pipeline. The build needs to compile a mix of .ts and .js files. The test runner needs to handle both. The bundler needs to handle both. The CI needs to type-check the TypeScript files and lint everything together. Each of these has small configuration changes that the skill handles in a single pass.

The foundation skill produces a codebase where you can rename a .js file to a .ts file and it still works. That is the precondition for incremental migration. Without it, every renamed file becomes a blocker that breaks the build for everyone.

The Inventory Skill

Once the foundation is in place, the inventory skill maps out what needs to be migrated. The map is the basis for prioritization.

The skill produces an inventory of every JavaScript file in the codebase. Each entry includes the file path, the size in lines, the number of exports, the number of importers, the cyclomatic complexity, and a migration difficulty estimate. The difficulty estimate is based on signals like dynamic property access, runtime type checking, eval usage, and the presence of patterns that are hard to express in TypeScript.

The inventory also includes a dependency graph. For each file, the skill lists which files depend on it and which files it depends on. The graph is what drives the migration order. Files with no dependencies on other JavaScript files are leaves. Leaves can be migrated independently. Files with many JavaScript dependencies are roots. Roots have to wait until the dependencies are migrated.

The output is a prioritized list of migration targets. At the top are leaf files with low difficulty and high importance, ranked by impact per hour of work. At the bottom are complex roots that depend on many other things being migrated first.
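
The leaves-first ordering can be sketched as a simple topological pass over the dependency graph. This is an illustration under assumptions, not the inventory skill itself: the graph shape and the migrationOrder function are hypothetical, and the sketch ignores difficulty and importance scores.

```typescript
// Maps each JavaScript file to the JavaScript files it imports.
type DependencyGraph = Record<string, string[]>;

// Returns files in an order where every file appears after the files it
// depends on (leaves first). Files caught in an import cycle are returned
// separately, because no leaves-first order exists for them.
function migrationOrder(graph: DependencyGraph): { order: string[]; cyclic: string[] } {
  const known = new Set(Object.keys(graph));
  const remaining = new Set(Object.keys(graph));
  const done = new Set<string>();
  const order: string[] = [];
  let progressed = true;
  while (progressed) {
    progressed = false;
    // Sort for deterministic output among files that are ready at the same time.
    for (const file of Array.from(remaining).sort()) {
      const deps = graph[file] ?? [];
      // Dependencies outside the graph (for example, packages) do not block.
      const ready = deps.every((dep) => done.has(dep) || !known.has(dep));
      if (ready) {
        order.push(file);
        done.add(file);
        remaining.delete(file);
        progressed = true;
      }
    }
  }
  return { order, cyclic: Array.from(remaining).sort() };
}

const plan = migrationOrder({
  "a.js": ["b.js"],
  "b.js": [],
  "c.js": ["a.js", "b.js"],
});
// plan.order is ["b.js", "a.js", "c.js"] and plan.cyclic is []
```

Reporting cycles separately matters in practice: files that import each other have to be migrated together in one PR, so they need different handling from the rest of the list.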

The inventory becomes the migration plan. Instead of asking "what should I migrate next?" I look at the next entry on the list. The list itself is updated as files get migrated, so the next entry is always the right next entry.

The Conversion Skill

The conversion skill handles the actual migration of a single file. The skill takes a JavaScript file and produces a TypeScript file with types added.

The skill starts by reading the file and understanding its structure. It identifies all the exports, the function signatures, the class definitions, the constants, and the patterns of use. It then queries the importers of the file to see how the exports are actually used. The usage tells it what the types should be.

For a function that takes a string and returns a number, the skill can infer the types from the function body if the body is simple enough. For a function that takes an object with various properties, the skill looks at how callers construct the object and what properties they pass.

For exports that are used in multiple places with conflicting types, the skill produces a union type or a generic. The decision depends on the pattern. If the function is genuinely polymorphic across usages, the skill uses a generic. If the function has a few specific usage patterns, the skill uses a union.
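
To illustrate the distinction, here is a small sketch of the two outcomes. Both functions are hypothetical examples rather than output from the skill.

```typescript
// Generic: the function is polymorphic, and the result type follows the
// type of whatever the caller passes in.
function withDefault<T>(value: T | undefined, fallback: T): T {
  return value === undefined ? fallback : value;
}

// Union: callers use a small, fixed set of shapes, so the parameter lists them.
type Timeout = number | string; // milliseconds, or a string such as "5s" or "250ms"

function toMilliseconds(timeout: Timeout): number {
  if (typeof timeout === "number") return timeout;
  const match = /^(\d+)(ms|s)$/.exec(timeout);
  if (match === null) throw new Error(`Unrecognized timeout: ${timeout}`);
  return Number(match[1]) * (match[2] === "s" ? 1000 : 1);
}

const retries: number = withDefault(undefined, 3); // T is inferred as number
const label: string = withDefault("primary", "default"); // T is inferred as string
const shortWait: number = toMilliseconds("250ms");
const longWait: number = toMilliseconds("5s");
```

The generic keeps the relationship between input and output types, which a union would lose. The union documents exactly which shapes callers are allowed to pass, which a generic would not.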

The skill avoids any whenever possible. When the type is genuinely unknown, it uses unknown instead, which forces the caller to narrow the type before using the value. When the type is partially known, it uses the most specific type it can derive.
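
A short sketch of why unknown is the safer fallback: the compiler refuses to let the value be used until it has been narrowed. The parseConfig, hasPort, and readPort names are hypothetical and exist only for this example.

```typescript
// The return type is `unknown` because the shape of parsed JSON cannot be
// known at compile time.
function parseConfig(raw: string): unknown {
  return JSON.parse(raw);
}

// Type guard that narrows `unknown` to the shape the caller needs.
function hasPort(value: unknown): value is { port: number } {
  return (
    typeof value === "object" &&
    value !== null &&
    "port" in value &&
    typeof (value as { port: unknown }).port === "number"
  );
}

function readPort(raw: string, fallback: number): number {
  const config = parseConfig(raw);
  // Writing `config.port` here would not compile: 'config' is of type 'unknown'.
  return hasPort(config) ? config.port : fallback;
}
```

Had parseConfig returned any, the direct property access would compile and a malformed config would only fail at runtime.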

The output is a TypeScript file with types that match the actual usage. The file is not perfect. Edge cases that the skill could not figure out are flagged for review. But the bulk of the work is done.

The Validation Skill

After the conversion skill produces a TypeScript file, the validation skill checks the result. The check has three parts.

The first part is the compile check. The TypeScript compiler runs on the file and reports any type errors. The skill reads the errors and decides whether they are real or whether they reflect places where the inferred types were wrong. The skill can often fix the inferred types automatically if the error is clear.

The second part is the test check. The test suite runs to make sure the migrated file still behaves correctly. If a test fails, the skill correlates the failure with the migration. Most test failures after a migration are caused by overly strict types that rejected runtime patterns the original code allowed. The skill identifies these and proposes a fix.

The third part is the usage check. The skill looks at every importer of the migrated file and verifies that the new types work for them. If an importer was passing an argument that does not match the new type, the skill identifies the mismatch. The mismatch might be a bug in the importer, in which case it should be fixed. Or it might be a sign that the migrated type is too narrow, in which case the type needs to be widened.

The validation skill catches the cases where the migration would have broken something downstream. Without it, a migration that compiles locally can introduce errors that only surface when other files try to use the migrated module. Catching these at migration time is much faster than catching them later.

The Pattern Library

Most JavaScript codebases have repeated patterns. The same idiom for error handling. The same shape of options object. The same approach to async iteration. Once you have migrated one instance of a pattern, the rest of the instances can be migrated mechanically.

The pattern library skill identifies repeated patterns in the codebase and learns how to migrate them. The library starts empty. As migrations happen, the skill notices when a similar pattern appears and asks whether to apply the same migration approach. After a few applications, the pattern is captured and applied automatically.

The pattern library is what makes the migration accelerate over time. The first hundred files are slow because everything is novel. The next hundred files are faster because most of the patterns are already captured. The last few thousand files are fast because almost everything is a known pattern.

The library also handles the codebase-specific idioms. Every codebase has weird things. Custom hooks. Custom decorators. Custom inheritance patterns. The library captures these and applies them consistently across the migration, which means the migrated codebase has consistent type patterns instead of one-off solutions in every file.

The Coordination Skill

The migration happens alongside feature work. Feature work changes files. Migration changes files. When the migration touches a file someone else is also touching, there is potential for conflict.

The coordination skill prevents this. The skill watches the open pull requests across the team and reserves files that are being actively worked on. The migration never touches a file that has an open PR against it. When the PR merges, the file becomes available for migration. When the migration is in progress, the team is notified to avoid that file.
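
The reservation check itself is simple set logic. Here is a minimal sketch, assuming the list of open pull requests and their changed files has already been fetched from your code host; the OpenPullRequest shape and the function name are hypothetical.

```typescript
interface OpenPullRequest {
  id: number;
  changedFiles: string[];
}

// Splits migration candidates into files that are free to migrate and files
// reserved because an open pull request touches them.
function partitionByAvailability(
  candidates: string[],
  openPullRequests: OpenPullRequest[],
): { available: string[]; reserved: string[] } {
  const busy = new Set<string>();
  for (const pr of openPullRequests) {
    for (const file of pr.changedFiles) busy.add(file);
  }
  return {
    available: candidates.filter((file) => !busy.has(file)),
    reserved: candidates.filter((file) => busy.has(file)),
  };
}

const split = partitionByAvailability(
  ["src/cart.js", "src/price.js", "src/tax.js"],
  [{ id: 101, changedFiles: ["src/price.js", "README.md"] }],
);
// split.available is ["src/cart.js", "src/tax.js"]; split.reserved is ["src/price.js"]
```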

The coordination is what keeps the migration from creating friction for the team. Without it, the migration would be a source of constant merge conflicts. With it, the migration moves through the parts of the codebase that are quiet at any given moment, and the team rarely notices.

The skill also batches the migration into small PRs. Each PR migrates a handful of related files. The small PR size makes the migration changes easy to review and reduces the chance of conflicts. The team reviews the migration PRs the same way they review any other PR, just with the awareness that the changes are mostly mechanical.

The Strictness Ramp

TypeScript has many levels of strictness. Starting with full strict mode in a freshly migrated codebase is too much, because the migration produces types that are correct but not always the strictest possible. The strictness ramp skill increases strictness gradually as the migration matures.

The first level of strictness allows implicit any but checks everything else. The second level disallows implicit any but allows explicit any. The third level disallows any entirely and requires unknown instead. The fourth level enables exhaustive switch checking. The fifth level enables strict null checks. The sixth level enables strict function types. The seventh level is full strict mode.

The skill tracks where the codebase is on each level and surfaces opportunities to advance. When 90% of the files pass a stricter level, the skill suggests turning that level on globally and fixing the remaining 10%. The ramp lets the codebase get to full strict mode in stages instead of trying to satisfy all of strict mode at once.
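
The 90% rule can be sketched as a small filter over per-level pass counts. The LevelStatus shape is an assumption for this example. The level names in the usage are real compiler flags, but how the pass counts get collected is left out.

```typescript
// For one strictness level, how many files pass when it is enabled.
interface LevelStatus {
  level: string; // for example a compiler flag name such as "noImplicitAny"
  passingFiles: number;
  totalFiles: number;
}

// Returns the levels whose pass rate meets the threshold (90% by default).
function levelsReadyToEnable(statuses: LevelStatus[], threshold = 0.9): string[] {
  return statuses
    .filter((s) => s.totalFiles > 0 && s.passingFiles / s.totalFiles >= threshold)
    .map((s) => s.level);
}

const readyLevels = levelsReadyToEnable([
  { level: "noImplicitAny", passingFiles: 950, totalFiles: 1000 },
  { level: "strictNullChecks", passingFiles: 700, totalFiles: 1000 },
  { level: "strictFunctionTypes", passingFiles: 900, totalFiles: 1000 },
]);
// readyLevels is ["noImplicitAny", "strictFunctionTypes"]
```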

The strictness ramp also includes per-file overrides. A file that is not yet at the target level has its level set explicitly, and the override is removed when the file reaches the target. This way the codebase can have a mix of strictness levels temporarily while everything converges.

The Regression Detection Skill

A successful migration is one that does not introduce regressions. The regression detection skill watches for cases where the migration changed runtime behavior accidentally.

The skill has three modes of detection. The first mode is type-driven, looking for places where the new types narrowed the behavior compared to what the JavaScript allowed. If the original code accepted a number or a string and the new type only accepts a number, the skill flags it for review.

The second mode is test-driven, watching for tests that started passing or failing after the migration. A test that started failing is an obvious regression. A test that started passing is sometimes a sign that the migration fixed a latent bug, but more often a sign that the test is checking something the migration changed.

The third mode is production-driven, watching for runtime errors that appear after deployment of the migration. The skill correlates production errors with the files that were migrated and surfaces likely regressions. This catches the cases where the type system allowed something the runtime did not, or where the migration introduced a subtle behavior change that only manifests in production.

The regression detection is what makes the migration safe. Without it, a migration that looks good in development can introduce production issues that take weeks to find. With it, regressions are caught quickly and rolled back before they accumulate.

How the Skills Compose

The skills compose into a migration cadence. Each day, the inventory skill identifies the next handful of files to migrate. The coordination skill confirms they are available. The conversion skill produces TypeScript versions. The validation skill checks the result. The regression detection skill watches for issues.

I review the produced PRs, approve them, and merge them. The team reviews them as part of their normal workflow. The pattern library skill captures any new patterns. The strictness ramp skill tracks progress and surfaces opportunities to advance.

The total time I spend on the migration is about an hour per day. That is enough time to review and merge five to ten file migrations. The team spends almost no extra time, because the PRs are small and mechanical.

Over a few weeks, the migration completes. The codebase moves to TypeScript without anyone scheduling a migration sprint, without anyone feeling like the migration was disruptive, and without any production regressions caused by the work.

What This Costs

The skills took a few weeks to build, mostly because the conversion skill needed a lot of tuning to produce good types instead of any types. The pattern library skill needs a few months of usage to accumulate the patterns specific to your codebase.

The benefit is in the rate of migration. Before this workflow, a TypeScript migration on a 200,000-line codebase would have been a six-month project requiring dedicated headcount. With the workflow, it was a six-week migration that ran alongside normal feature work and required about an hour of my time per day.

The benefit also shows up in the quality of the migration. Codebases that get migrated in a hurry end up with any types scattered throughout, because the team did not have time to do the work properly. Codebases that get migrated with this workflow end up with proper types from the start, because the conversion skill defaults to specific types and only falls back when forced.

What the Skills Do Not Do

The skills do not replace architectural decisions. When the migration reveals a design that does not work in TypeScript, the skills tell you but do not redesign. Some patterns that work in JavaScript do not have clean TypeScript equivalents and require code restructuring. The restructuring is yours to do.

The skills also do not write tests. They check that existing tests still pass and surface places where new tests would be valuable, but they do not write the tests themselves. The test writing is yours.

The skills also do not enforce style decisions. Whether to use interfaces or type aliases, whether to prefer const assertions or explicit types, whether to use enums or string unions: these are style choices the skills are agnostic about. You configure them based on your team's preferences.

Setting Up Your Own Workflow

Start with the foundation skill. Without the foundation, nothing else works. Get the project to a state where renaming a .js file to .ts does not break the build. This is the minimum viable starting point.

Add the inventory skill next. You need to know what you have before you can plan the migration. The inventory tells you whether the migration will take a week or a year.

Add the conversion skill after that. Migrate ten files manually first to see what good output looks like, then build the conversion skill to produce similar output. The first version of the conversion skill will be rough. Tune it on real files until the output is consistently good.

The validation, pattern library, coordination, strictness ramp, and regression detection skills can come later. They are valuable additions but the migration can start without them. Build them as you discover the need.

The Bigger Picture

The pattern in this migration workflow is the pattern I keep seeing in every successful application of Claude Code to a large engineering problem. The work has repetitive parts and judgment parts. The repetitive parts can be automated. The judgment parts cannot. The automation is what makes the work tractable. Without the automation, the work is too expensive to do well. With the automation, the work becomes routine.

TypeScript migration is a particularly good example because the scale is so visible. A 200,000-line codebase is intimidating. Most of the lines do not require any human judgment to migrate, but the few that do require careful thought. Automating the routine 95% lets the human focus on the 5% that needs them.

If you have a JavaScript codebase that you have been meaning to migrate but have been putting off because the project feels too large, the answer is probably not to wait for a quieter quarter. The answer is to build a workflow that makes the migration cheap enough to run continuously. The migration completes eventually, without disrupting anything else, and the codebase ends up in a better place than it would have if you had tried to do the migration as a project.

If you have been reading along, the first concrete step is to configure your build to accept both .ts and .js files. Once that is working, every subsequent step gets easier. The migration becomes a series of small commits instead of a giant lift. The compounding effect of small commits is how big migrations actually get done.

Build the foundation. Run the inventory. Start migrating leaves. The rest follows.

FAQ

How long did the migration actually take? Six weeks of calendar time, about an hour per day of my time, plus normal review time from the team for the migration PRs.

What language version did you migrate to? The most recent TypeScript at the time. The skills do not care which TypeScript version. They produce types compatible with the target version.

What about React components? React components migrate well because the types are mostly mechanical. The skills handle JSX correctly and produce typed props and state.

What about node_modules dependencies that lack types? Most popular dependencies already have types available on DefinitelyTyped, installable as @types packages. For the rest, the skills produce ambient declaration files.

What is the biggest mistake to avoid? Trying to migrate the most complex parts first. Leaves before roots. Easy before hard. The accumulation of small wins gives you the momentum to tackle the hard parts.


Claude Code for Dependency Management: How I Stopped Being Afraid of npm Update

Nex Tools — Mon, 11 May 2026 13:13:18 +0000

Originally published on Hashnode. Cross-posted for the DEV.to community.

Every developer I know has a story about dependency hell. Mine was a Friday afternoon in 2024 when I ran npm update on a project I had inherited, and the entire test suite turned red. Not a few tests. Every single test. The diff showed 400 packages changed across the lockfile, and I had no idea which of those changes had broken what. I spent the rest of the weekend bisecting the upgrade manually, package by package, until I found the breaking change buried four levels deep in a transitive dependency of a transitive dependency.

That experience changed how I think about dependency management. The default workflow most teams use is to ignore dependencies until something forces an upgrade, and then panic. The panic upgrade is when security patches pile up and someone finally runs npm audit fix --force at 11 PM the night before an audit. The panic upgrade is also when most production incidents happen, because the gap between the version that worked and the version you are jumping to is measured in months and breaks accumulate silently.

I built a Claude Code workflow that turned dependency management from a periodic crisis into a routine background activity. The workflow is not glamorous. It does not involve any clever AI tricks. What it does is make the work of staying current on dependencies cheap enough that I actually do it, instead of letting it pile up until it explodes.

Here is how the workflow works and why it has saved me hundreds of hours.

Why Dependency Management Goes Wrong

Dependency management is hard, but not because individual upgrades are hard. Most upgrades are easy. It is hard because the work is distributed across so many small decisions that no human can keep them all in their head, and the cost of getting any one of them wrong is non-zero.

Every dependency in your project has a release cycle. Most have patch releases monthly, minor releases quarterly, major releases yearly. If you have 50 direct dependencies and 500 transitive dependencies, you are looking at thousands of version changes per year flowing into your project from the outside. Each one of them is a potential surprise.

The way most teams handle this is to ignore the firehose and react to specific events. Security alerts force upgrades. A new feature in a library forces an upgrade. A bug that blocks shipping forces an upgrade. Between those events, dependencies drift further and further out of date, and the cost of catching up grows.

Dependency management is not a project. It is a habit. The workflow that makes the habit cheap is the workflow that gets followed. The workflow that demands a half-day of focus is the workflow that gets skipped.

I needed a workflow that was cheap. Cheap enough that I would actually run it every week. Cheap enough that I would not skip it when I was busy. The Claude Code skills I built are the result of optimizing for cost-to-run, not cost-to-build.

The Audit Skill

The first skill in the workflow is an audit skill. It runs every Monday morning and produces a report on the current state of dependencies across all my projects.

The report has four sections. The first section lists outdated dependencies, sorted by how far behind they are. A package three patch versions behind is a low priority. A package five major versions behind is a flashing red light. The skill annotates each entry with the release date of the current version and the release date of the latest version, so the gap is obvious at a glance.
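
Sorting by how far behind a package is comes down to comparing version components. Here is a minimal sketch that assumes plain MAJOR.MINOR.PATCH versions and ignores pre-release suffixes; the function names are hypothetical.

```typescript
type Gap = "major" | "minor" | "patch" | "current";

// Parses "MAJOR.MINOR.PATCH". Pre-release and build suffixes are ignored.
function parseVersion(version: string): [number, number, number] {
  const match = /^(\d+)\.(\d+)\.(\d+)/.exec(version);
  if (match === null) throw new Error(`Not a semantic version: ${version}`);
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Reports the most significant component in which `current` trails `latest`,
// and by how many steps in that component.
function versionGap(current: string, latest: string): { gap: Gap; distance: number } {
  const [cMajor, cMinor, cPatch] = parseVersion(current);
  const [lMajor, lMinor, lPatch] = parseVersion(latest);
  if (lMajor > cMajor) return { gap: "major", distance: lMajor - cMajor };
  if (lMajor === cMajor && lMinor > cMinor) return { gap: "minor", distance: lMinor - cMinor };
  if (lMajor === cMajor && lMinor === cMinor && lPatch > cPatch) {
    return { gap: "patch", distance: lPatch - cPatch };
  }
  return { gap: "current", distance: 0 };
}

const gapRank: Record<Gap, number> = { major: 3, minor: 2, patch: 1, current: 0 };

// Returns a new array with the packages furthest behind first.
function sortByStaleness<T extends { current: string; latest: string }>(packages: T[]): T[] {
  const score = (p: T): number => {
    const g = versionGap(p.current, p.latest);
    return gapRank[g.gap] * 1000000 + g.distance;
  };
  return packages.slice().sort((a, b) => score(b) - score(a));
}
```

With this ranking, a package that is one major version behind always sorts above one that is many patch versions behind, which matches the priorities described above.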

The second section lists security advisories. The skill queries the security advisory database for every dependency and surfaces anything with a known vulnerability. The advisories include the severity, the affected version range, and the patched version. I see exactly what I need to upgrade and how urgent it is.

The third section lists deprecation warnings. Many packages get deprecated silently. The package still works, but the maintainer has marked it as no longer supported. The audit skill catches these before they become problems.

The fourth section lists dependencies with significant changes. Significant means breaking changes have been released, or the maintainer has been replaced, or the package has been transferred to a new owner. These are the changes that often get missed because they do not show up as version bumps.

The audit skill takes 90 seconds to run across all my projects. It produces a one-page markdown report that I can read in two minutes. The report is what drives the rest of the week's dependency work.

The Categorization Skill

Not all upgrades are equal. The categorization skill takes the audit output and assigns each entry to a category that determines how it gets handled.

The first category is critical. Critical means a security vulnerability with a high severity score, an active exploit in the wild, or a package my code depends on at runtime for something user-facing. Critical upgrades happen the day they are identified, regardless of what else is on the schedule.

The second category is high. High means a security vulnerability with medium severity, a deprecated package that needs replacement before it stops working, or a major version of a key dependency that will be needed for an upcoming feature. High upgrades happen within the week.

The third category is medium. Medium means a major version bump of a non-critical dependency, a deprecation warning that does not have an immediate impact, or accumulated minor version updates of dependencies I want to keep current. Medium upgrades happen monthly.

The fourth category is low. Low means patch versions that have not introduced any changes I care about. Low upgrades happen quarterly, batched together so the upgrade work is amortized.

The categorization is what makes the workflow tractable. Instead of treating every dependency as needing immediate attention, I have a triage system that focuses my time where it matters. The skill does the categorization based on rules I tuned over a few months. The rules are not fancy. They look at vulnerability severity, version distance, package criticality, and a few signals about the package itself.
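
A rule set along these lines can be sketched as an ordered chain of checks. The AuditEntry fields and the exact rule order are assumptions made for illustration; the real rules were tuned over months and are not reproduced here.

```typescript
type Category = "critical" | "high" | "medium" | "low";

// Hypothetical audit record. The fields are assumptions for this sketch.
interface AuditEntry {
  name: string;
  severity: "none" | "low" | "medium" | "high"; // worst known advisory
  activelyExploited: boolean;
  deprecated: boolean;
  gap: "major" | "minor" | "patch";
  keyDependency: boolean; // packages the product relies on directly
}

// Rules are checked from most to least urgent; the first match wins.
function categorize(entry: AuditEntry): Category {
  if (entry.severity === "high" || entry.activelyExploited) return "critical";
  if (entry.severity === "medium") return "high";
  if (entry.keyDependency && (entry.deprecated || entry.gap === "major")) return "high";
  if (entry.gap === "major" || entry.gap === "minor" || entry.deprecated) return "medium";
  return "low";
}

// How quickly each category gets handled, as described above.
const cadence: Record<Category, string> = {
  critical: "same day",
  high: "within the week",
  medium: "monthly",
  low: "quarterly",
};
```

Keeping the rules as a plain ordered chain makes them easy to tune: changing a priority means moving or editing one line.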

The Upgrade Skill

The upgrade skill is where the work happens. For each upgrade I need to perform, the skill produces an upgrade plan. The plan includes the specific commands to run, the changes that will be applied, the tests that need to pass, and the rollback procedure if anything goes wrong.

The most useful part of the upgrade plan is the changelog summary. The skill reads the release notes for every version between my current version and the target version, summarizes the breaking changes, and flags anything that might affect my code. If I am jumping from version 3.2 to version 4.5, the summary tells me what changed in 3.3, 3.4, 3.5, 4.0, 4.1, 4.2, 4.3, 4.4, and 4.5. The major version is highlighted because that is where breaking changes live.
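
Selecting which releases to summarize is a range filter over the published versions. A minimal sketch, assuming MAJOR.MINOR version strings like the ones in the example above and a list of published versions obtained elsewhere; the function names are hypothetical.

```typescript
interface ReleaseToReview {
  version: string;
  startsNewMajor: boolean; // true for the first release of a new major version
}

function parseMajorMinor(version: string): [number, number] {
  const match = /^(\d+)\.(\d+)/.exec(version);
  if (match === null) throw new Error(`Unrecognized version: ${version}`);
  return [Number(match[1]), Number(match[2])];
}

function compareVersions(a: string, b: string): number {
  const [aMajor, aMinor] = parseMajorMinor(a);
  const [bMajor, bMinor] = parseMajorMinor(b);
  return aMajor !== bMajor ? aMajor - bMajor : aMinor - bMinor;
}

// Returns the releases after `current` up to and including `target`, in
// ascending order, marking the ones that cross a major version boundary.
function releasesToReview(published: string[], current: string, target: string): ReleaseToReview[] {
  let previousMajor = parseMajorMinor(current)[0];
  return published
    .filter((v) => compareVersions(v, current) > 0 && compareVersions(v, target) <= 0)
    .sort(compareVersions)
    .map((version) => {
      const major = parseMajorMinor(version)[0];
      const startsNewMajor = major > previousMajor;
      previousMajor = major;
      return { version, startsNewMajor };
    });
}

const toReview = releasesToReview(
  ["3.1", "3.2", "3.3", "3.4", "3.5", "4.0", "4.1", "4.2", "4.3", "4.4", "4.5", "4.6"],
  "3.2",
  "4.5",
);
// toReview covers 3.3 through 4.5 (nine releases); only 4.0 has startsNewMajor set.
```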

The summary is not just a copy of the release notes. The skill reads my code, identifies how I use the package, and tells me which of the changes are likely to affect me. If the changelog says a function I do not use was removed, the summary deprioritizes that. If the changelog says a function I use heavily had its signature changed, the summary flags it prominently.

The flagging is the difference between a 30-minute upgrade and a 3-hour upgrade. Without the flagging, I would have to read every release note and check every change against my code by hand. With the flagging, I read a short summary and know exactly where to focus.
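
The first step of a changelog summary, working out which releases need reading and where the major boundary sits, can be sketched as below. The function names and the plain dotted-number version format are assumptions for illustration. Real version strings with pre-release tags would need a proper parser.

```python
def parse(version: str) -> tuple:
    """Turn "4.10.2" into (4, 10, 2) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def versions_to_review(released, current, target):
    """List every release after `current` up to and including `target`.

    Returns (version, crosses_major) pairs in ascending order, where
    crosses_major is True for the first release of a new major version.
    """
    cur, tgt = parse(current), parse(target)
    in_range = sorted(v for v in map(parse, released) if cur < v <= tgt)
    plan = []
    previous_major = cur[0]
    for v in in_range:
        plan.append((".".join(str(n) for n in v), v[0] > previous_major))
        previous_major = v[0]
    return plan


released = ["3.1", "3.2", "3.3", "3.4", "3.5", "4.0",
            "4.1", "4.2", "4.3", "4.4", "4.5", "4.6"]
for version, crosses_major in versions_to_review(released, "3.2", "4.5"):
    print(version, "<- major boundary" if crosses_major else "")
```

With the 3.2 to 4.5 example from above, this lists nine releases and marks only 4.0 as the major boundary.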

The Test Skill

After every upgrade, the test skill runs. Running the test suite is obvious. What the test skill adds is the intelligence about what to do when something fails.

When a test fails after an upgrade, the test skill correlates the failure with the upgrade. It looks at the test that broke, compares it to the changes in the upgraded package, and tells me whether the failure is likely caused by the upgrade or whether it is unrelated. Most of the time it is the upgrade. Sometimes the test was already flaky and the upgrade just happened to be the moment it failed. Knowing which is which saves me from a wild goose chase.

When the failure is caused by the upgrade, the test skill produces a hypothesis about what changed. The hypothesis is based on the changelog summary and the actual error. If the changelog says a function signature changed and the test fails with a type error on that function, the hypothesis is clear. If the changelog says a default behavior changed and the test fails with an assertion that depends on the default, the hypothesis is also clear.

The hypothesis is not always right. When it is wrong, I have to debug manually. But when it is right, the upgrade fix is a one-line change instead of an hour of digging.

The Rollback Skill

Some upgrades fail. Either the tests fail in ways I cannot quickly fix, or the upgrade introduces runtime behavior that breaks something not covered by tests. When that happens, I need to roll back fast.

The rollback skill maintains a snapshot of every upgrade. The snapshot includes the previous lockfile, the previous package versions, and the state of any related configuration. Rolling back is a single command that restores the snapshot. Total time to roll back is under 30 seconds.
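
A snapshot of this kind can be as simple as copying the tracked files aside before the upgrade. The sketch below assumes an npm-style project. The file names in `TRACKED` and both function names are illustrative, and any related configuration files would have to be added to the list.

```python
import shutil
from pathlib import Path

# Assumption: an npm-style project. Add config files that belong to the upgrade.
TRACKED = ("package.json", "package-lock.json")


def take_snapshot(project: Path, snapshot: Path, tracked=TRACKED) -> list:
    """Copy the tracked files that exist in `project` into `snapshot`."""
    snapshot.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in tracked:
        source = project / name
        if source.exists():
            shutil.copy2(source, snapshot / name)  # copy2 also keeps timestamps
            saved.append(name)
    return saved


def restore_snapshot(project: Path, snapshot: Path) -> list:
    """Copy every file in `snapshot` back over the project's copy."""
    restored = []
    for source in sorted(snapshot.iterdir()):
        shutil.copy2(source, project / source.name)
        restored.append(source.name)
    return restored
```

Restoring the files is only half of a rollback. The installed packages still have to be rebuilt from the restored lockfile, for example with `npm ci` in an npm project.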

The rollback is not the end. The rollback skill also produces an analysis of why the upgrade failed and what would need to be true for the upgrade to succeed. Sometimes the answer is a small code change. Sometimes the answer is to wait for a patch release that fixes the issue. Sometimes the answer is to switch to a different package because the current path is no longer viable.

The analysis is what prevents the rollback from being a permanent retreat. Without the analysis, a failed upgrade often turns into a permanent skip. The dependency stays at the old version forever, and the gap grows. With the analysis, I have a concrete plan for when and how to try again.

The Cross-Project Skill

Most of my projects share some dependencies. When a critical update lands on a shared dependency, I need to apply it across multiple projects. The cross-project skill handles this.

The skill identifies all projects that depend on a given package, plans the order of upgrades based on which projects are most critical, and executes the upgrades in parallel where it can. The output is a single report that tells me the status of the upgrade across all projects.

The cross-project view also helps me identify which packages are good candidates for centralization. If five of my projects depend on the same internal utility package, I know I should be tracking that package carefully and consider whether the utility should live in a shared library instead.

The cross-project skill catches the case where a dependency has different versions in different projects. Version drift across projects is a subtle problem. The same bug behaves differently in different projects because they are using different versions of a shared library. The skill flags drift and proposes a unification plan.
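
Drift detection reduces to grouping projects by the version they pin for each package. A small sketch, assuming the pinned versions have already been read out of each project's lockfile into a dictionary (the input shape and the `find_drift` name are hypothetical):

```python
from collections import defaultdict


def find_drift(projects: dict) -> dict:
    """Map each package pinned at more than one version to {version: [projects]}.

    `projects` maps a project name to its {package: pinned_version} dictionary.
    """
    versions = defaultdict(dict)
    for project, deps in projects.items():
        for package, version in deps.items():
            versions[package].setdefault(version, []).append(project)
    # Keep only packages that appear at two or more distinct versions.
    return {pkg: by_version for pkg, by_version in versions.items()
            if len(by_version) > 1}


projects = {
    "storefront": {"lodash": "4.17.21", "axios": "1.6.0"},
    "admin":      {"lodash": "4.17.15", "axios": "1.6.0"},
    "worker":     {"lodash": "4.17.21"},
}
print(find_drift(projects))
```

Here only `lodash` is reported, with `admin` as the project that is behind, which is the starting point for a unification plan.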

The Transitive Dependency Skill

Direct dependencies are visible. Transitive dependencies are not. Most of the packages in your node_modules are not packages you chose. They are packages your packages chose, recursively. When something goes wrong with a transitive dependency, the path from cause to effect is long.

The transitive dependency skill maps out the dependency tree and identifies hotspots. A hotspot is a transitive dependency that many of your direct dependencies depend on, which means a problem with that transitive dependency affects many things at once. The skill ranks the hotspots and tracks them like first-class dependencies, even though I never directly added them.

The skill also identifies transitive dependencies that have known issues. If a transitive dependency has a security advisory, the skill traces it back to the direct dependencies that pulled it in. I get a clear picture of what I would need to change at the direct level to fix the issue at the transitive level.
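
Both ideas, ranking hotspots and tracing an advisory back to the direct dependencies that pulled it in, fall out of one graph walk. The sketch below assumes the dependency tree is available as a plain adjacency dictionary. The function names and the `min_parents` threshold are illustrative.

```python
def reachable(graph: dict, start: str) -> set:
    """Every package reachable from `start`, excluding `start` unless a cycle leads back."""
    seen = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:  # the seen-set also protects against cycles
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def pulled_in_by(graph: dict, direct: list) -> dict:
    """Map each transitive package to the set of direct dependencies that reach it."""
    owners = {}
    for dep in direct:
        for package in reachable(graph, dep):
            if package not in direct:
                owners.setdefault(package, set()).add(dep)
    return owners


def hotspots(graph: dict, direct: list, min_parents: int = 2) -> list:
    """Transitive packages shared by at least `min_parents` direct deps, most shared first."""
    owners = pulled_in_by(graph, direct)
    ranked = sorted(owners.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(pkg, sorted(parents)) for pkg, parents in ranked
            if len(parents) >= min_parents]
```

For an advisory on a single package, `pulled_in_by(graph, direct)[package]` answers the question of which direct dependencies would need to change.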

This skill is the one that prevented my next dependency hell. I have caught two security issues in transitive dependencies that I would not have noticed otherwise. Both were patched within hours of detection because I knew exactly which direct dependency to upgrade.

The Lockfile Hygiene Skill

Lockfiles are easy to get wrong. They get committed the wrong way, they get out of sync with the package manifest, and they introduce changes that are not actually changes you made. The lockfile hygiene skill keeps the lockfile sane.

The skill detects unexpected lockfile changes. If a commit changes the lockfile without changing the package manifest, the skill flags it for review. Most of the time the change is legitimate, but sometimes it is a sign that someone ran the package manager in a way that updated something they did not mean to update.
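
The check itself is small once you have the list of files a commit touched, which `git diff --name-only` can provide. A sketch, assuming JavaScript-style lockfiles sitting next to their manifest (the `LOCKFILES` table and function name are illustrative):

```python
from pathlib import PurePosixPath

# Assumption: each lockfile lives in the same directory as its manifest.
LOCKFILES = {
    "package-lock.json": "package.json",
    "yarn.lock": "package.json",
    "pnpm-lock.yaml": "package.json",
}


def unexpected_lockfile_changes(changed_files: list) -> list:
    """Return lockfiles that changed in a commit while their manifest did not."""
    changed = {PurePosixPath(path) for path in changed_files}
    flagged = []
    for path in sorted(changed):
        manifest = LOCKFILES.get(path.name)
        # Compare within the same directory so monorepo packages do not mask each other.
        if manifest and path.with_name(manifest) not in changed:
            flagged.append(str(path))
    return flagged


print(unexpected_lockfile_changes(["apps/a/yarn.lock", "apps/b/package.json"]))
```

A flagged file is a prompt for review, not proof of a mistake, since a deliberate refresh of transitive versions looks identical.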

The skill also detects diverged lockfiles. When two branches each modify the lockfile, the merge can resolve in ways that lose updates. The skill catches this by comparing the resolved lockfile to what it should be and flagging discrepancies.

The hygiene skill is the least exciting part of the workflow, but it is the part that prevents the silent bugs. Lockfile drift is one of those problems that produces incidents months later when nobody can figure out why the same build produces different results.

How the Skills Compose

The skills compose into a weekly rhythm. Monday morning, the audit skill runs and produces the report. I spend 10 minutes reading the report and deciding which upgrades to do this week. The categorization skill has already prioritized them, so the decision is mostly which medium-priority items to include alongside the critical and high.

Throughout the week, the upgrade skill produces plans for each upgrade. I review the plan, run the upgrade, and watch the test skill validate the result. If the tests pass, I commit. If they fail, the test skill diagnoses, and I either fix or roll back. The rollback skill makes rollback safe.

The cross-project skill kicks in for shared dependencies. The transitive dependency skill kicks in when something interesting shows up in the dependency tree. The lockfile hygiene skill runs continuously in the background.

The total time I spend on dependency management is about 90 minutes per week, spread across the week. Before this workflow, dependency management was a quarterly all-hands fire drill that consumed two days and produced incidents in the following week. Now it is a routine activity that produces no surprises.

What This Costs

The skills took about a week to build. Most of the time was spent tuning the categorization rules and the changelog summary heuristics. The skills do not require any special infrastructure. They run against the same package manager output that any developer already has.

The benefit is in the rhythm. Once you have a workflow that costs 90 minutes per week, dependencies stop being a thing you are afraid of. You upgrade things as they become available. You catch problems when they are small. You never end up six months behind on a critical dependency because the upgrade work is too daunting to start.

The benefit also shows up in production. The number of production incidents I trace back to a dependency upgrade has dropped to roughly zero. The upgrades I do are small and safe, because they are spread out and tested individually. The upgrades I used to do were large and risky, because they bundled months of changes into a single chaotic push.

What the Skills Do Not Do

The skills do not replace judgment. They produce reports, plans, and hypotheses. I am still the one who decides what to upgrade, when, and how. The skills make the decisions faster and better informed, but the decisions are still mine.

The skills also do not handle every edge case. When a dependency has been abandoned and needs replacement, the skill tells me but does not pick the replacement. When a major upgrade requires architectural changes to my code, the skill identifies the changes but does not write them. The hard parts are still hard.

What the skills do is make the easy parts trivial. The cumulative effect of trivializing the easy parts is that I have time and energy for the hard parts when they come up.

Setting Up Your Own Workflow

Start with the audit skill. It is the cheapest to build and produces the most value per hour of effort. You will get a weekly report that tells you the state of your dependencies. That alone changes how you think about them.

Add the upgrade skill next. The upgrade plans cut the time for individual upgrades by half. You will feel the difference within a week.

Add the test skill after that. The diagnosis when something breaks is where you save the most time per incident. Without it, a failed upgrade can eat hours. With it, most failures are resolved in minutes.

Build the rollback skill once you have done a few upgrades. You need the snapshots in place before you need to roll back, because trying to capture state in a panic is not reliable.

The other skills are useful but optional. The cross-project skill matters if you have multiple projects. The transitive dependency skill matters if you have a deep tree. The lockfile hygiene skill matters if you have multiple committers.

The Bigger Picture

The pattern in this workflow is the same as in every other Claude Code workflow that has worked for me. Repetitive work gets automated. Judgment-heavy work stays with the human. The automation makes the repetitive work cheap enough that it actually happens, instead of being skipped and accumulating into a crisis.

Dependency management is the canonical example. The work is repetitive. There is a lot of it. Each individual piece is small. The accumulated weight is what breaks teams. Automating the repetitive parts and triaging by judgment is the right shape of the solution.

If you have a project that has not had its dependencies looked at in six months or more, you have technical debt that is compounding silently. The way to stop the bleeding is to build a workflow that makes the maintenance cheap. The way to make it cheap is to automate the boring parts so you can focus the human time on the parts that need a human.

If you have been reading along and recognizing your own situation, the first step is to run an audit on one project. Pick the project with the most direct dependencies. See what the audit tells you. Once you see the report, you will know whether you have a manageable situation or a five-alarm fire. Either way, you are better off knowing than not knowing.

Build the audit skill. Run it weekly. Decide what to do based on the report. The rest of the workflow grows from there.

FAQ

How long does it take to build the audit skill? A few hours for a basic version. A day if you want it polished. The polished version pays for itself in the first week.

Does this work for languages other than JavaScript? Yes. The patterns translate to any ecosystem with a package manager. The audit query is different for Python or Rust or Go, but the workflow is the same.

What about monorepos? Monorepos make the cross-project skill more important and the audit skill more interesting because the report has to handle multiple packages. The basic structure is the same.

How do I get my team to adopt this? Run the audit yourself for a few weeks. Bring the reports to standups. The team will see the value when they see the reports identify real issues before they become incidents.

What is the biggest mistake to avoid? Trying to upgrade everything at once when you start. Build the workflow first. Use it to triage. Upgrade in order of priority. Resist the urge to do a giant catch-up upgrade.

If you found this useful, follow for more posts about practical Claude Code workflows. I write about how I run a multi-product business with AI agents handling most of the operational work.

Claude Code for Incident Response: How I Cut My Mean Time to Recovery in Half

Nex Tools — Sun, 10 May 2026 10:49:40 +0000

It is 3 AM. PagerDuty is screaming. Production is down. You are half-awake, half-dressed, and trying to figure out which of the 47 dashboards in your monitoring system is showing the actual problem versus a downstream symptom of the actual problem. Your team is asking what they can do to help. Customers are tweeting. The status page is still green because nobody has had time to update it.

If you have been on call for any length of time, you have lived this scene. The first 15 minutes of an incident are chaos, not because the people responding are incompetent, but because the cognitive load of an incident is much higher than the cognitive load of normal work, and humans degrade under that load in predictable ways.

I started using Claude Code during incidents because I noticed that the same patterns repeat every time. Run these queries. Check these logs. Look at these dashboards. Update the status page. Notify the right stakeholders. The patterns are predictable enough that they could be partially automated. So I automated them. The result is that my median time to recovery has dropped from 38 minutes to 17 minutes, and the incidents themselves feel less like trauma and more like a process.

Here is the workflow.

What Goes Wrong in the First 15 Minutes

The first 15 minutes of an incident are where most of the damage happens, and they are also where most of the recovery time gets wasted. The time is wasted not because the responder does not know what to do, but because the responder is operating at 30% of their normal cognitive capacity and has to do everything from scratch.

The things a responder needs to do in the first 15 minutes are mostly the same across incidents. They need to confirm the incident is real. They need to identify the affected service or services. They need to find the most likely cause. They need to notify stakeholders. They need to update the status page. They need to start a timeline. They need to coordinate with anyone else who has been paged. They need to begin investigation while keeping the rest of the team informed.

In a calm moment, this list is manageable. At 3 AM with PagerDuty screaming and adrenaline running, this list is overwhelming. The responder ends up doing some of these things and forgetting others, and the incident drags on while small mistakes compound.

The first 15 minutes of an incident is the highest-leverage time you have, and it is also the time when you are least equipped to use it well. Anything you can pre-load into automation is time you do not have to spend thinking when thinking is hardest.

Claude Code is a way to pre-load that automation. The patterns that repeat every incident can be encoded as skills. The skills run when an incident is declared, gather the information that is always needed, and present it in a format the responder can read in 30 seconds. That changes the first 15 minutes from chaos to a checklist.

The Triage Skill

The first skill I wrote is a triage skill. It runs when I declare an incident and gathers the information I always need to confirm whether the incident is real and what is affected.

The skill checks the status of every critical service by running its health check, queries the error rate for each service for the last 15 minutes and compares it to the rolling baseline, looks at the latency percentiles for each service and flags anything that has degraded, queries the deployment log for any deploys in the last hour, and checks for any infrastructure events from the cloud provider that might be relevant.

The output is a one-page summary that tells me which services are degraded, by how much, and what changed recently that might explain the degradation. The summary takes about 90 seconds to generate. It replaces about 10 minutes of manual dashboard navigation that I would otherwise do half-asleep.
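
The comparison against a rolling baseline is the core of a summary like this. The sketch below assumes the metrics have already been fetched into simple records. The `ServiceStats` fields, the thresholds, and the `triage` function are all hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    """Hypothetical per-service numbers for the last 15 minutes and the baseline."""
    name: str
    healthy: bool
    error_rate: float           # fraction of failed requests, 0.0 to 1.0
    baseline_error_rate: float
    p99_ms: float
    baseline_p99_ms: float


def triage(stats, error_ratio=2.0, latency_ratio=1.5, min_error_rate=0.01):
    """Return (service, reasons) for every service that looks degraded.

    The thresholds are assumptions: errors must at least double and exceed 1%,
    and p99 latency must be at least 1.5x the baseline.
    """
    findings = []
    for s in stats:
        reasons = []
        if not s.healthy:
            reasons.append("health check failing")
        if (s.error_rate >= min_error_rate
                and s.error_rate > s.baseline_error_rate * error_ratio):
            reasons.append(
                f"error rate {s.error_rate:.1%} vs baseline {s.baseline_error_rate:.1%}")
        if s.p99_ms > s.baseline_p99_ms * latency_ratio:
            reasons.append(
                f"p99 latency {s.p99_ms:.0f} ms vs baseline {s.baseline_p99_ms:.0f} ms")
        if reasons:
            findings.append((s.name, reasons))
    return findings
```

The `min_error_rate` floor keeps a service that went from 0.01% to 0.03% errors from being reported as tripled and therefore degraded.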

The triage skill has caught two incidents that I would have misdiagnosed without it. In one case, the alerting service paged me about a database problem. The triage skill showed that the database was fine and the actual issue was a load balancer misconfiguration that was causing the alert. In another case, the page was about a single endpoint, but the triage skill showed that the underlying issue was affecting three other services that had not paged yet. Knowing this earlier let me get ahead of the cascading failures.

The Deploy Correlation Skill

The second skill correlates the incident with recent deploys. About 60% of production incidents are caused by a recent deploy, but identifying which deploy and which change is harder than it sounds, especially in environments where multiple services deploy independently.

The deploy correlation skill queries the deployment log for the last 24 hours across all services, identifies which deploys overlap with the incident timeline, retrieves the changes included in each candidate deploy, and ranks the candidates by how likely each change is to be related to the symptoms.

The ranking uses heuristics like whether the change touches the affected service, whether it changes any code paths in the failing endpoints, whether it modifies dependencies or configuration, and whether the deploy completed shortly before the incident started. The ranking is not always right, but it is right often enough to give me a strong starting point for investigation.
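
The four heuristics above translate naturally into an additive score. In this sketch the weights, the 30-minute window, the `CONFIG_SUFFIXES` list, and the `Deploy` record are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumption: these suffixes identify dependency or configuration changes.
CONFIG_SUFFIXES = ("package.json", "package-lock.json", ".yaml", ".yml", ".env")


@dataclass
class Deploy:
    id: str
    service: str
    finished_at: datetime
    changed_paths: list = field(default_factory=list)


def score(deploy, affected_services, failing_paths, incident_start):
    """Additive suspicion score for one deploy. Weights are illustrative."""
    points = 0
    if deploy.service in affected_services:
        points += 3  # touches an affected service
    if any(path.startswith(prefix)
           for path in deploy.changed_paths for prefix in failing_paths):
        points += 3  # changes code under a failing endpoint
    if any(path.endswith(CONFIG_SUFFIXES) for path in deploy.changed_paths):
        points += 1  # modifies dependencies or configuration
    lead = incident_start - deploy.finished_at
    if timedelta(0) <= lead <= timedelta(minutes=30):
        points += 2  # finished shortly before the incident started
    return points


def rank(deploys, affected_services, failing_paths, incident_start):
    """Most suspicious deploy first."""
    return sorted(
        deploys,
        key=lambda d: score(d, affected_services, failing_paths, incident_start),
        reverse=True,
    )
```

An additive score is easy to explain at 3 AM: each point in the total corresponds to one reason the deploy is on the list.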

When the deploy correlation skill identifies a likely culprit, the next question is whether to roll back. Rolling back is a high-stakes decision because it can introduce new problems and it costs time. The skill produces a rollback plan with the specific commands to run, the expected downtime, and the rollback risk assessment. I make the call, but I make it with all the relevant information in front of me, not in my head.

The Communication Skill

The third skill handles communication. Communication during an incident is critical and almost always done badly. Stakeholders need to know what is happening. Customers need to know what is happening. The status page needs to reflect reality. Internal channels need updates. The on-call engineer needs to coordinate with anyone else who is involved.

The communication skill drafts the messages. It produces a status page update appropriate for customers, a Slack message for internal channels, an email for the executive notification list if the severity warrants it, and a customer support brief for the support team to use when responding to inquiries.

Each message is drafted from a template and filled in with the specific incident details. The templates are tuned to communicate the right amount of information for each audience. Customers get plain language about what is affected and when we expect it to be resolved. Internal channels get more detail, including what has been ruled out and what is being investigated. Executives get a brief that matches the format they expect.
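
Template filling of this kind needs nothing more than the standard library. The sketch below uses Python's `string.Template`. The wording of the two templates and the incident field names are made up for illustration.

```python
from string import Template

# Hypothetical templates: one per audience, each exposing a different level of detail.
TEMPLATES = {
    "status_page": Template(
        "We are investigating an issue affecting $service. "
        "We expect it to be resolved by $eta."
    ),
    "internal": Template(
        "[$severity] $service degraded since $started. "
        "Ruled out: $ruled_out. Investigating: $investigating."
    ),
}


def draft_messages(incident: dict) -> dict:
    """Fill every template from one incident record.

    Template.substitute raises KeyError on a missing field, so an incomplete
    draft fails loudly instead of going out with a literal "$eta" in it.
    """
    return {audience: template.substitute(incident)
            for audience, template in TEMPLATES.items()}


drafts = draft_messages({
    "service": "checkout",
    "severity": "SEV2",
    "started": "03:00 UTC",
    "eta": "04:00 UTC",
    "ruled_out": "database",
    "investigating": "load balancer configuration",
})
print(drafts["status_page"])
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here, because a half-filled customer message is worse than an error during drafting.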

The skill produces drafts. I review and send. The review takes 30 seconds per message, compared to several minutes of writing each message from scratch while my brain is still booting up.

The Timeline Skill

The fourth skill maintains the timeline. Every incident needs a timeline that captures what happened, when, and what was done about it. The timeline is what feeds the post-mortem, and a post-mortem with a sparse timeline is a post-mortem that misses lessons.

Capturing the timeline in real time is hard. The responder is busy responding. They make notes in Slack or in their head and intend to write up the timeline later, except later they have forgotten the details and the timeline ends up incomplete or wrong.

The timeline skill captures events automatically. It watches the incident channel and pulls out timestamped events. It watches the alert system and captures every alert fire and resolution. It watches the deploy log and captures every deploy and rollback. It produces a structured timeline that I can edit and annotate during the incident or after.
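
Combining the three feeds into one ordered timeline is a merge by timestamp. A sketch, assuming each source has already been reduced to `(timestamp, kind, text)` tuples (that shape and the `build_timeline` name are assumptions):

```python
import heapq
from datetime import datetime


def build_timeline(*sources):
    """Merge event lists into one chronological list of formatted lines.

    Each source is a list of (timestamp, kind, text) tuples. Sources are
    sorted individually first, then merged by timestamp.
    """
    by_time = lambda event: event[0]
    merged = heapq.merge(*(sorted(source, key=by_time) for source in sources),
                         key=by_time)
    return [f"{ts:%H:%M} [{kind}] {text}" for ts, kind, text in merged]


chat = [(datetime(2026, 5, 1, 3, 4), "chat", "rolling back d1")]
alerts = [
    (datetime(2026, 5, 1, 3, 0), "alert", "checkout 5xx fired"),
    (datetime(2026, 5, 1, 3, 9), "alert", "checkout 5xx resolved"),
]
deploys = [(datetime(2026, 5, 1, 2, 50), "deploy", "d1 finished")]

for line in build_timeline(chat, alerts, deploys):
    print(line)
```

Keeping the source label on every line matters for the post-mortem, because it shows whether an event was observed by a machine or reported by a person.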

The result is a timeline that is comprehensive without me having to do the bookkeeping. When I sit down to write the post-mortem the next day, the timeline is already there. I just need to add the narrative.

The Hypothesis Skill

The fifth skill is the one that does the most work during an incident. It is a hypothesis skill that takes the symptoms, the recent changes, and the system architecture and proposes hypotheses about what might be wrong.

The skill reads the symptom description, looks at the recent changes from the deploy correlation skill, queries the relevant logs and metrics, and produces a ranked list of hypotheses. Each hypothesis includes what it would predict about the symptoms, what evidence would confirm or refute it, and the next investigation step.

The hypothesis skill is the part of the workflow that feels most like working with a senior engineer who happens to have read every line of the codebase recently. It is not always right. The hypotheses are sometimes wrong, and the ranking is sometimes off. But it produces useful starting points faster than I can think of them on my own, and during an incident the time savings is the entire point.

The skill handles the cognitive load that I cannot reliably handle at 3 AM. It generates the hypotheses I should be considering. It identifies the evidence I should be looking for. It tells me which dashboard would confirm or refute each hypothesis. I do the actual investigation, but the framing is provided.

The Coordination Skill

The sixth skill handles coordination when multiple people are involved. Big incidents pull in multiple responders. Each responder needs to know what the others are doing. Without coordination, two people end up investigating the same thing while a third thing goes uninvestigated.

The coordination skill maintains a live document that lists who is on the incident, what each person is investigating, what has been ruled out, and what is still open. The document updates from the incident channel automatically. The responders can see at a glance who is doing what.

The skill also enforces handoff protocol. When the primary responder needs to step away, the skill produces a handoff document that captures everything the next responder needs to know to take over. The handoff document includes the current hypotheses, the evidence collected, the actions taken, and the open questions. The handoff that used to take 10 minutes of conversation now takes 2 minutes of reading.

The Post-Mortem Skill

The seventh skill writes the post-mortem. The post-mortem is the deliverable that comes out of the incident, and writing it is most teams' weakest link. Post-mortems are tedious to write, they are painful to read, and they often skip the lessons that would actually prevent the next incident.

The post-mortem skill produces a draft. It uses the timeline from the timeline skill, the hypotheses from the hypothesis skill, the actions taken from the coordination skill, and the resolution from the responders. It structures the draft using the post-mortem template the team has agreed on, with sections for what happened, what went well, what went badly, what we are going to change, and what we are not going to change.

The draft is rarely the final post-mortem. It is missing the deeper analysis that requires actual reflection on what went wrong and why. But it captures all the facts, the timeline, and the obvious lessons, so the work I have to do is the reflection rather than the bookkeeping. The post-mortem that used to take three hours now takes one hour, and the one hour is the hour where the actual learning happens.

How the Skills Compose

The skills are designed to compose during an incident. When I declare an incident, the triage skill runs immediately and gives me the lay of the land. The deploy correlation skill runs in parallel and identifies likely culprits. The communication skill produces draft messages while I am reviewing the triage output. The timeline skill starts capturing events.

As the incident progresses, the hypothesis skill generates investigation directions. The coordination skill tracks who is doing what. After resolution, the post-mortem skill drafts the writeup.

The composition is what matters. Any single skill helps a little. All the skills together transform the experience of being on call. The cognitive load drops. The mistakes drop. The mean time to recovery drops. The job becomes sustainable rather than corrosive.

What This Costs

I built the skills over about two weeks of evenings, mostly while my brain was still warm from a recent on-call rotation that had reminded me how miserable incident response can be. The initial versions were rough. I have tuned them based on what worked and what did not over the last several incidents.

The maintenance cost is low. The skills change when the system changes, but most of the patterns are stable. New runbooks get added when new failure modes are encountered. The cost of maintenance is much lower than the cost of working without the skills.

The benefit is real. My mean time to recovery has dropped by about half. The communication during incidents is consistently better. The post-mortems are more thorough because the timeline is captured automatically. On-call no longer feels like the worst week of the rotation. It feels like work that is hard but doable.

What the Skills Do Not Replace

I want to be clear about the limits. The skills do not replace the judgment of a competent on-call engineer. They produce drafts, hypotheses, and summaries. The engineer decides what to do with them.

When the skills are wrong, the engineer needs to recognize that and override them. When the situation is novel, the skills will not have a useful pattern to apply, and the engineer has to fall back on first principles. When the impact assessment is wrong, the engineer has to correct it.

The skills make the routine parts of incident response faster. They do not make the hard parts easier. The hard parts are still hard, and they still require human judgment. What the skills do is free up the cognitive bandwidth that would otherwise be spent on routine work, so the human can apply judgment where it matters.

Setting Up Your Own Workflow

If you want to build something similar for your team, the place to start is to look at the last five incidents and identify the patterns. What did the responder do every time? What information did they need to gather? What messages did they need to send? Those are the patterns that can be encoded.

Pick the one that takes the most time and automate it first. The triage skill is usually a good starting point because it is the highest leverage. Once that is working, add the next one. Build the workflow incrementally. Do not try to build everything at once, because you will not know which parts you actually need until you have used the early skills in a real incident.

The most important property of the workflow is that it actually runs during incidents. A skill that exists but does not run during the chaos of an actual incident is worthless. The way to make the skills run is to integrate them into the incident response runbook so that running them is the first step rather than an optional step. When PagerDuty fires, the responder runs the triage skill before doing anything else. That is the muscle memory you want to build.

The Bigger Picture

There is a pattern that runs through this whole approach, and it is the same pattern that runs through every other workflow I have built with Claude Code. The pattern is that high-stakes work tends to have repeatable parts and judgment-heavy parts. The repeatable parts can be automated. The judgment-heavy parts cannot. Most of the value of automation comes from removing the cognitive cost of the repeatable parts so that the human can focus on the judgment.

Incident response is high-stakes work with a lot of repeatable parts. The triage, the communication, the timeline, the post-mortem are all repeatable. The hypothesis generation has a repeatable scaffold. The coordination has a repeatable protocol. Automating these parts is not a replacement for the engineer. It is a way to make the engineer more effective when it matters most.

If you are on call for a system that you care about, the cost of building this workflow is much smaller than the cost of one bad incident. The math is overwhelming. The only thing stopping you is the time it takes to start, and the way to deal with that is to start with one skill and grow from there.

If you have read this far, you are probably someone who has been on the receiving end of a bad incident response and is looking for a way to do it better. The way is to stop trying to handle the chaos with raw human cognition and start offloading the mechanical parts to automation. Claude Code is one tool for doing this. There are others. The point is that the workflow is the answer, not the tool.

Build the workflow. Run the workflow. Improve the workflow. The next 3 AM page will go better than the last one.

FAQ

How long does it take to build this workflow? The initial set of skills takes about two weeks of evenings. You can get started with just the triage skill in a day.

Does this work for small teams? Yes. Small teams benefit even more, because they cannot afford the time cost of bad incident response.

What about incidents that are not in the patterns? The skills handle the routine 80%. The novel 20% still requires human judgment. The skills free up cognitive bandwidth so the human can focus on the novel parts.

How do I get my team to actually use this? Make running the skills the first step in the incident runbook. Update the runbook so that the very first thing the responder does is invoke the triage skill. Build the muscle memory.

What is the biggest mistake to avoid? Trying to automate the judgment parts. The skills should produce drafts, hypotheses, and summaries. The engineer decides what to do with them. Do not build skills that try to make decisions on behalf of the responder.

Claude Code for Security Audits: How I Catch Vulnerabilities Before They Cost Me

Nex Tools — Sun, 10 May 2026 10:43:50 +0000

Three years ago a junior engineer on a team I was advising committed an environment file to a public GitHub repository. The file contained an AWS access key with admin permissions on a production account. The key was harvested by an automated scanner within four minutes of the commit. By the time the team noticed, four hours later, an attacker had spun up 200 EC2 instances to mine cryptocurrency. The bill for those four hours was $14,000.

The team had a security checklist. The checklist included a line that said "do not commit secrets to git." The line had been on the checklist for two years. It had been read by every engineer on the team. None of that mattered, because security checklists do not run themselves, and the moment of committing a file is exactly the moment when nobody has the bandwidth to consult a checklist.

I started using Claude Code for security audits because I wanted the checklist to run itself. Not as a replacement for human review, but as the first pass that catches the obvious mistakes before they reach a human reviewer or, worse, production. Here is the workflow that has caught real vulnerabilities in real codebases.

Why Security Audits Get Skipped

Most teams have security checklists. Most teams do not run them consistently. The reason is not that engineers do not care about security. The reason is that security audits feel like a tax that gets paid out of the same time budget as shipping features, and the visible reward for shipping a feature is much higher than the visible reward for catching a vulnerability that would not have been exploited for another six months.

This math is wrong, but it feels right in the moment. The cost of a missed vulnerability is theoretical and deferred. The cost of pausing to audit is concrete and immediate. So the audit gets skipped, and the vulnerability accumulates, and six months later somebody pays the deferred cost in cash and reputation.

The second reason security audits get skipped is that they are tedious. A real audit means reading every line of new code with a paranoid mindset. It means thinking about what an attacker could do with each input, each query, each file path. It means imagining failure modes that have not happened yet. This is exhausting work, and humans are bad at sustaining it for long stretches.

A security audit is the highest-leverage hour you can spend on a codebase, and it is also the hour engineers are least motivated to spend, because the work is invisible when it succeeds and only visible when it fails.

Claude Code does not get tired. Claude Code does not get bored. Claude Code can read every line of a diff with the same paranoid mindset on the hundredth file as on the first. That is exactly the kind of work where automation pays off.

The Pre-Commit Audit Skill

The first skill I built is a pre-commit audit. It runs on the staged diff before I commit and flags anything that looks like a security risk. The skill has a list of patterns it looks for and a list of file types it pays extra attention to.

The patterns it looks for include hardcoded credentials of any kind, calls to dangerous functions like eval and exec with user input, SQL queries built by string concatenation, file paths constructed from user input without validation, deserialization of untrusted data, and authentication checks that are missing, bypassable, or applied inconsistently.

The file types it pays extra attention to include environment files, configuration files, anything that looks like it might contain credentials, anything in an authentication or authorization module, and anything that handles user uploads.

When the skill flags something, it explains what the risk is and what the fix would look like. It does not block the commit. It just tells me what it found, and I decide whether to address the issue or proceed. Most of the time the issue is real and worth fixing. Sometimes the issue is a false positive, and I commit anyway. The skill is calibrated to err on the side of flagging too much rather than too little, because a false positive costs me 30 seconds and a missed vulnerability could cost me $14,000.
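To make the pattern list concrete, here is a minimal sketch of the mechanical part of such a check: a scan over the added lines of a unified diff. It is not the skill itself, which is a prompt rather than a script. The `RISK_PATTERNS` table and the `scan_staged_diff` name are illustrative assumptions, and the regexes are deliberately loose so that they over-flag rather than under-flag.

```python
import re

# Hypothetical pattern table; a real audit would carry a longer, tuned list.
RISK_PATTERNS = {
    "hardcoded credential": re.compile(
        r"(?i)(api[_-]?key|secret|password|token)\w*\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
    "AWS access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "eval/exec call": re.compile(r"\b(eval|exec)\s*\("),
    "SQL built by concatenation": re.compile(
        r"(?i)[\"'][^\"']*\b(select|insert|update|delete)\b[^\"']*[\"']\s*(\+|%)"
    ),
}


def scan_staged_diff(diff_text: str) -> list[tuple[str, int, str]]:
    """Return (file, line_number, risk) for each added line that matches a pattern."""
    findings = []
    current_file = "<unknown>"
    new_line_no = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # Header naming the new version of the file, e.g. "+++ b/app/db.py"
            current_file = line[4:].removeprefix("b/")
        elif line.startswith("@@"):
            # Hunk header, e.g. "@@ -12,3 +40,5 @@"; 40 is the first new line number
            match = re.search(r"\+(\d+)", line)
            new_line_no = int(match.group(1)) if match else 0
        elif line.startswith("+"):
            for risk, pattern in RISK_PATTERNS.items():
                if pattern.search(line[1:]):
                    findings.append((current_file, new_line_no, risk))
            new_line_no += 1
        elif not line.startswith("-"):
            new_line_no += 1  # context line: present in the new file, not added
    return findings
```

The function only reports findings and returns them to the caller. That mirrors the non-blocking behavior described above: the decision to fix or proceed stays with the person committing.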

The skill caught a hardcoded API key in a test file last month. The test file was meant to use a mocked credential, but somebody had pasted a real key into the test while debugging and forgotten to remove it. The commit would have gone to a public repository. The skill flagged it before I pushed, and I cleaned it up.

The Dependency Audit Skill

The second skill audits dependencies. Modern applications include hundreds or thousands of transitive dependencies, and any of them could be compromised. The dependency audit skill cross-references my package manifest against published vulnerability databases and flags packages with known issues.

This is not a novel idea. Tools like npm audit and pip-audit do something similar. What the Claude Code version adds is context. When npm audit tells me there is a high-severity vulnerability in a transitive dependency, I have to figure out whether the vulnerable code path is actually reachable from my code, whether the fix requires a major version bump that will break things, and whether the risk is actually material to my application or just theoretical.

Claude Code reads the vulnerability description, looks at how the dependency is used in my code, and gives me an honest assessment. Sometimes the answer is "this is a real risk, fix it now." Sometimes the answer is "this vulnerability requires the attacker to control the input to a function you do not call, so it is not exploitable in your application." Sometimes the answer is "this vulnerability is real and exploitable, but the fix requires upgrading three other packages first, so you should plan a separate sprint."

The contextual assessment is the part that matters. A list of vulnerabilities is overwhelming. A prioritized list of vulnerabilities with reasoning attached is actionable.
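The difference between a raw list and a prioritized one can be sketched as a small data structure. This is an illustration under assumptions, not the output format of any real tool: the `Finding` fields, the ranking rule (reachable issues first, then severity), and the action labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    package: str
    severity: str          # "low", "moderate", "high", or "critical"
    reachable: bool        # is the vulnerable code path called from our code?
    fix_is_breaking: bool  # does the fix require upgrades that may break things?
    reasoning: str         # the explanation attached to the assessment


SEVERITY_RANK = {"critical": 0, "high": 1, "moderate": 2, "low": 3}


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Reachable issues first, then by severity, then by name for stable output."""
    return sorted(
        findings,
        key=lambda f: (not f.reachable, SEVERITY_RANK[f.severity], f.package),
    )


def recommended_action(finding: Finding) -> str:
    if not finding.reachable:
        return "defer: not exploitable as used"
    if finding.fix_is_breaking:
        return "plan a separate upgrade"
    return "fix now"
```

Note that an unreachable critical issue sorts below a reachable moderate one. That ordering is a design choice, and a stricter team might rank by severity first.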

The Authentication Flow Skill

The third skill audits authentication and authorization flows. This is the highest-stakes area of most applications and the area where mistakes are most likely to happen, because authentication code looks similar across applications and engineers tend to copy patterns from previous projects without checking whether the patterns still apply.

The authentication audit skill looks at every endpoint and asks: who is allowed to call this endpoint, and how is that enforced? It traces the authentication middleware, looks at the authorization checks, and verifies that the checks are present, correct, and not bypassable.

Common issues the skill catches include endpoints that are missing authorization checks entirely, endpoints where the authorization check uses the wrong identifier, endpoints where the authorization check happens after a side effect has already occurred, endpoints where the authorization logic is correct in one place but wrong in another, and endpoints where the authorization can be bypassed by malformed input.
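One of these issues, the authorization check that runs after the side effect, is easy to miss in review because every required line is present. The sketch below shows the two orderings side by side. The in-memory `store`, the `owner_id` field, and both function names are hypothetical and stand in for a real handler and database.

```python
class Forbidden(Exception):
    """Raised when the caller is not allowed to act on the resource."""


def delete_document_unsafe(store, audit_log, current_user_id, doc_id):
    doc = store[doc_id]
    del store[doc_id]                        # side effect happens first
    audit_log.append(f"deleted {doc_id}")
    if doc["owner_id"] != current_user_id:   # the check arrives too late
        raise Forbidden(doc_id)


def delete_document_safe(store, audit_log, current_user_id, doc_id):
    doc = store[doc_id]
    if doc["owner_id"] != current_user_id:   # check before anything changes
        raise Forbidden(doc_id)
    del store[doc_id]
    audit_log.append(f"deleted {doc_id}")
```

Both versions raise `Forbidden` for the wrong user, so a test that only asserts on the exception passes for either. Only the safe version leaves the document in place.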

I run this skill against every authentication-related PR. In two of the last twelve such PRs, it caught an issue that would otherwise have shipped to production. Both issues were the result of an engineer copying a pattern from a different endpoint without realizing that the new endpoint had different authorization requirements. Both would have been hard to catch in code review because the code looked correct.

The Secrets Scan Skill

The fourth skill scans the entire repository for secrets. This is more aggressive than the pre-commit audit, which only looks at the staged diff. The secrets scan looks at every file, every commit in the history, and every branch.

The skill looks for high-entropy strings that match known credential patterns, environment files that have been committed even if they are now gitignored, hardcoded passwords in test data, API keys in documentation examples, and credentials embedded in deployment scripts.

When the skill finds something in git history, the fix is more involved than just removing the file. The credential needs to be rotated, because anyone who cloned the repository while the credential was visible could still extract it. Then the history needs to be cleaned up, which requires a force push and coordination with everyone who has the repository checked out.

The skill produces a report with the findings sorted by severity and a runbook for each finding that explains what to do. The runbook includes the rotation procedure, the history cleanup procedure, and a list of stakeholders to notify. This is the kind of detail that a generic secrets scanner does not include, and it is the part that turns a finding into a fix.

The Input Validation Skill

The fifth skill audits input validation across the application. The skill identifies every place where the application accepts external input and verifies that the input is being validated before it is used.

External input includes HTTP request parameters, file uploads, environment variables, configuration files loaded at runtime, message queue payloads, and data read from third-party APIs. Each of these is a place where untrusted data enters the system, and each needs to be validated before it is used in a sensitive operation.

The skill looks for input that flows into database queries, file system operations, command execution, deserialization, template rendering, and HTTP requests to other services. For each flow, the skill verifies that the input has been validated against an explicit schema and rejected if it does not match.

The most common issue the skill catches is input that is validated in one path and not in another. An engineer adds a new endpoint that calls an existing function. The existing function assumes its input has already been validated, because the original caller validated it. The new endpoint does not validate, because the engineer assumed the function would handle it. The result is an injection vulnerability that could not have been caught by reading either function in isolation.
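One way to close that gap is to make the shared function validate against an explicit schema itself, so that it no longer depends on every caller remembering to. The sketch below assumes a hypothetical `users` table and a hypothetical username rule. It uses the standard library `sqlite3` module with a parameterized query as a second layer of defense.

```python
import re
import sqlite3

# Hypothetical schema: lowercase letters, digits, underscore, 3 to 32 characters.
USERNAME_SCHEMA = re.compile(r"[a-z0-9_]{3,32}")


class InvalidInput(ValueError):
    """Raised when external input does not match the expected schema."""


def validate_username(raw: object) -> str:
    if not isinstance(raw, str) or not USERNAME_SCHEMA.fullmatch(raw):
        raise InvalidInput("username does not match schema")
    return raw


def find_user(conn: sqlite3.Connection, username: str):
    # Validate here instead of trusting that every caller already did.
    username = validate_username(username)
    # Parameterized query: the value is never spliced into the SQL text.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchone()
```

With this shape, a new endpoint that forgets to validate still gets a rejection from `find_user` instead of silently passing raw input to the database.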

The Configuration Audit Skill

The sixth skill audits configuration. Configuration is where security defaults turn into security disasters, because configuration changes do not go through the same review as code changes and the people who make them often do not understand the implications.

The configuration audit skill looks at infrastructure as code, deployment manifests, environment configuration, feature flag definitions, and any file that controls how the application behaves at runtime. It checks for common misconfigurations like overly permissive IAM policies, public S3 buckets that should be private, security groups that allow access from anywhere, debug mode enabled in production, default credentials that have not been changed, and encryption disabled where it should be enabled.
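As an illustration of one check in that list, the sketch below flags overly permissive statements in an IAM policy document. It relies on the standard policy JSON shape (`Statement`, `Effect`, `Action`, `Resource`), where `Statement`, `Action`, and `Resource` may each be a single value or a list. The function name and the rule it applies (wildcard action combined with wildcard resource) are assumptions; a real audit would cover more cases, such as `NotAction` and conditions.

```python
def _as_list(value):
    """IAM policy fields may be a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def overly_permissive_statements(policy: dict) -> list[dict]:
    """Return Allow statements that pair a wildcard action with Resource '*'."""
    flagged = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        if wildcard_action and "*" in resources:
            flagged.append(statement)
    return flagged
```

A statement scoped to a single bucket or a single action passes this check even if it is still broader than needed, so the result is a floor, not a full review.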

The skill is calibrated for the specific cloud provider and infrastructure stack I use, so it understands the difference between a configuration that is correct for development and one that would be a disaster in production. When it flags something, it tells me whether the issue is hypothetical or material, and what the fix looks like.

How the Skills Compose

The skills are designed to compose. I run the pre-commit audit on every commit. I run the dependency audit weekly. I run the authentication flow audit on every PR that touches auth-related code. I run the secrets scan monthly across the full history. I run the input validation audit on any PR that adds new endpoints. I run the configuration audit before any deployment to production.

This composition is the part that matters. A single audit run catches the issues that are present at one moment. A continuous audit pipeline catches issues as they are introduced, before they accumulate into a backlog that nobody has time to address.

The pipeline has a meta-rule attached. If any audit flags something at high severity, the relevant deployment is blocked until the issue is addressed or explicitly waived. The waiver requires a written explanation of why the issue is acceptable, which goes into a record that gets reviewed periodically. This means that when an issue is waived, it is waived deliberately, not by accident.
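The meta-rule can be sketched as a small gate function. The `AuditFinding` and `Waiver` types and their field names are hypothetical, and the sketch assumes severities are plain strings. The one rule it encodes is the one stated above: a high-severity finding blocks unless a waiver with a written explanation exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditFinding:
    finding_id: str
    severity: str  # e.g. "low", "medium", "high"


@dataclass(frozen=True)
class Waiver:
    finding_id: str
    explanation: str  # written reason the issue is acceptable


def deployment_allowed(findings, waivers):
    """Return (allowed, blocking_findings) for a proposed deployment."""
    # A waiver with an empty explanation does not count as a waiver.
    waived = {w.finding_id for w in waivers if w.explanation.strip()}
    blocking = [
        f for f in findings
        if f.severity == "high" and f.finding_id not in waived
    ]
    return (not blocking, blocking)
```

Rejecting blank explanations is what keeps a waiver deliberate: the gate cannot be cleared by creating an empty record.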

What the Skills Do Not Catch

I want to be honest about the limits. The skills catch the kind of issue that has a known pattern and shows up in a recognizable shape. They do not catch novel vulnerabilities, business logic flaws, or issues that require deep understanding of the application's threat model.

Examples of what the skills miss include race conditions in business logic that allow value extraction, authorization checks that are technically correct but enforce the wrong policy, side channels that leak information through timing or error messages, and chained vulnerabilities where each individual issue is low severity but the combination is high severity.

For these classes of issue, you still need human review. What the skills do is reduce the volume of low-hanging issues so that human review can focus on the hard problems. If a human reviewer spends 80% of their time catching the security equivalent of missing semicolons, they have 20% left for the issues that actually require their judgment. Flip that ratio, and the audit becomes valuable.

Setting Up the Skills

If you want to build something similar, the structure is straightforward. Each skill is a markdown file that describes what to look for, what to flag, and how to format the report. The skill reads the relevant inputs, looks for the patterns, and produces a report.

The skills are stored alongside the codebase and version-controlled. When the codebase changes in a way that affects the security model, the skills change too. When a new attack surface is added, a new skill is added. When an existing skill produces too many false positives, it is tuned. The skills are living documents, not a one-time setup.

The most important thing is to run the skills consistently. A skill that runs on every commit catches issues. A skill that runs once a quarter catches a backlog. The whole point of automation is to remove the human decision about whether to run the audit, and that only works if the audit runs every time.

What This Workflow Costs

The skills took about a day to write initially. Tuning them took another two days spread over the first month, as I saw which patterns produced false positives and which patterns missed real issues. Maintenance takes about an hour a month.

The time saved is harder to measure, because the value of catching a vulnerability is the cost of the breach that did not happen, and you cannot measure something that did not happen. What I can measure is that I no longer skip security audits, because the cost of running them is now measured in seconds rather than hours. The audits have caught real issues that would have shipped to production. The math always favored auditing overwhelmingly. The difference is that now it actually plays out in practice.

The Bigger Pattern

There is a bigger pattern here that goes beyond security audits. The pattern is that any kind of work that is high-stakes and tedious tends to get skipped, and the skipping accumulates costs that show up later. Code review skipped because it is tedious leads to bugs. Documentation skipped because it is tedious leads to onboarding pain. Security audits skipped because they are tedious lead to breaches.

The pattern for fixing this is the same in each case. Find the part of the work that is mechanical and automate it. Use the time saved to do the part that requires human judgment. Refuse to skip the work entirely, because the math is overwhelming if you account for the deferred costs.

Claude Code is a tool for executing this pattern. It is not a replacement for engineering judgment. It is a way to make sure the tedious 80% of the work gets done so that the engineering judgment can be applied to the 20% that needs it.

If you want to apply this pattern to your own codebase, the place to start is to pick one audit skill and run it. Pick the one that matches your biggest current risk. If you have ever committed a secret, start with the secrets scan. If your authentication is complex, start with the auth flow audit. If your dependency tree is deep, start with the dependency audit. Run it once. See what it finds. Fix what is real. Then schedule it to run continuously.

The first audit will probably find issues that have been sitting in your codebase for months. That is normal. The second audit will find fewer. By the third or fourth iteration, the audit becomes a regular checkpoint rather than a fire drill, and that is when the workflow starts paying back the time you put into it.

FAQ

How do I get started? Pick one audit skill that matches your biggest risk. Write a markdown file that describes what to look for and what to flag. Run it on your codebase. Tune it based on the results.

Do I still need professional security testing? Yes. The audit skills catch the patterns that are easy to encode. They do not catch novel vulnerabilities or business logic issues. Use them as the first line of defense, not the only line.

What about false positives? False positives are a cost. The way to reduce them is to tune the patterns, narrow the scope, and add suppression rules for known-safe cases. Aim for high precision over high recall on issues that block deployment.

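How often should I run the audits? Pre-commit audits should run on every commit. Dependency audits weekly. Secrets scans monthly. Authentication flow audits on every PR that touches auth-related code. Input validation audits on any PR that adds new endpoints. Configuration audits before every production deployment.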

Will this work for my language and framework? The pattern works for any language. The specific patterns depend on the language and framework. Customize the skills for your stack.


Claude Code for Documentation Generation: How I Stopped Shipping Code Nobody Could Read

Nex Tools — Fri, 08 May 2026 09:06:52 +0000

Originally published on Hashnode. Cross-posted for the DEV.to community.

Two years ago I inherited a project from an engineer who had left the company. The codebase was clean. The test coverage was reasonable. The architecture was defensible. The documentation was a single README that said "TODO: write docs." There were 200 commits, three deployment environments, a set of cron jobs, and a database schema with 47 tables. None of it was documented.

I spent six weeks figuring out how the system worked before I felt comfortable making changes. Six weeks. The original engineer had probably written the whole thing in three months. I lost nearly half of his entire build time to the absence of a document he could have written in an afternoon.

That experience changed how I think about documentation. Documentation is not a nice-to-have that you write when you have time. Documentation is a force multiplier for everyone who comes after you, and the math for whether it is worth writing is almost always overwhelming. The reason most teams ship without documentation is not that the math is bad. It is that writing documentation is tedious, and the people who would benefit from it are not in the room when the decision is made.

Claude Code changed this for me. Documentation that used to take an afternoon now takes 15 minutes. Documentation that I would have skipped because the cost was too high now gets written because the cost is trivial. Here is the workflow.

Why Documentation Goes Unwritten

Most engineers do not skip documentation because they think it is unimportant. They skip it because the cost feels disproportionate to the benefit at the moment they would have to write it. You just shipped a feature. You are tired. The next feature is already lined up. The documentation is for some hypothetical future engineer who probably will not need it. You skip it.

Six months later, you are that engineer. You stare at the code you wrote and try to remember why a particular decision was made. You cannot. You spend an hour reverse engineering your own thinking. The cost was real. It was just deferred.

The second reason documentation goes unwritten is that the kind of documentation engineers can write quickly is the kind of documentation that nobody reads. Inline comments are easy and largely useless. JSDoc blocks that restate the function signature are easy and largely useless. The documentation that actually helps people is the documentation that captures intent, context, and tradeoffs. That kind of documentation is hard to write because it requires you to step out of implementation mode and think about what someone else would need to know.

The third reason documentation goes unwritten is that there is no obvious place to put it. Should it go in the code as comments? In a docs folder as markdown? In a wiki? In a knowledge base? Each option has tradeoffs and most teams pick one and then regret it later. The friction of figuring out where the documentation belongs is enough to make people skip writing it.

The cost of documentation feels high in the moment of writing it and low when reading it. The cost of missing documentation feels low in the moment of skipping it and high every time someone has to reverse engineer the missing context.

Claude Code does not change the math on whether documentation is worth writing. The math was always overwhelming. Claude Code just makes the writing fast enough that the in-the-moment cost stops being a barrier.

The Module Documentation Skill

When I finish a module, I run the module documentation skill. The skill takes the module source code and produces a markdown document with the following sections.

The first section is what this module does, written in two to four sentences. Not what each function does. What the module as a whole accomplishes. This is the section that future engineers read first to decide whether this module is the one they need to be looking at.

The second section is the public interface. What can callers do with this module? What are the inputs and outputs? What are the error conditions? This section is what Claude Code generates well from code, because the public interface is mostly mechanical.

The third section is the design choices. Why was this module structured this way? What alternatives were considered? What tradeoffs were made? This section is the one that requires actual thought, and it is the one Claude Code does not generate automatically. I write this section as a prompt for Claude Code to fill in based on context I provide. Sometimes I dictate a paragraph and ask Claude Code to clean it up. Sometimes I ask Claude Code to read the code and propose what the design choices probably were, which I then correct.

The fourth section is the gotchas. What surprised me about this module? What is non-obvious? What edge cases caused bugs that I had to fix? This section is the most valuable for future maintenance and the easiest to forget to write, because the gotchas seem obvious to me right after I have just dealt with them.

The fifth section is the change history. Major versions, the reasons for them, and links to the PRs. This is what tells future engineers whether the current behavior is the original intent or a deliberate departure from it.

The skill produces a draft of all five sections. I review the draft, fix the parts Claude Code got wrong, fill in the parts Claude Code could not infer, and commit the file alongside the module. The whole process takes 15 minutes for a module that took me a day to write. Fifteen minutes of documentation for a day of code is a ratio I can sustain.

The README Skill

Every repository should have a README that someone unfamiliar with the project can read in five minutes and walk away with a working mental model. Most repositories do not have this README. They have either a stub README that says "this is the [project name] repository" or a sprawling README that tries to be comprehensive and ends up being unreadable.

The README skill takes the repository structure, the package configuration, the recent commit history, and any existing documentation, and produces a draft README with these sections.

A one-paragraph description of what the project is and who it is for. The audience matters more than the description. A README that does not tell me whether I am the intended audience is a README I will skim and forget.

A quick start guide that walks through the most common setup path. Not every possible setup path. The one that 80 percent of new contributors will use. The other paths can have their own dedicated documentation pages.

A high-level architecture overview. Three to five sentences about the major components and how they fit together. This is the section that helps somebody figure out where to look when they want to make a change.

A pointer to the deeper documentation. The README is a starting point, not a comprehensive guide. It should make it easy to find the deeper material when the reader needs it.

A contribution guide. How are issues tracked? What is the PR process? What conventions does the team follow? This section is what makes the difference between a repo that strangers can contribute to and a repo where strangers bounce off without contributing.

The skill produces a complete first draft. I edit it, sometimes substantially, and commit. The README that used to take a half day to write now takes 30 minutes including my edits. More importantly, the README actually exists, which is a meaningful improvement over the previous baseline.

The Claude Code memory files workflow is what makes the README skill produce useful output instead of generic boilerplate. Claude Code reading the project context once and remembering it across documentation tasks is what changes the output quality.

The Architecture Decision Record Skill

Some decisions deserve a permanent written record. Not every decision. The decisions where future engineers might wonder "why did we do it this way" and where the answer is non-obvious. Architecture Decision Records (ADRs) are the standard format for this kind of documentation, and they are profoundly underutilized.

The ADR skill takes a brief description of a decision, the context that led to it, the alternatives considered, and the tradeoffs accepted, and produces a properly formatted ADR. Each ADR has a number, a title, a status, a date, the context, the decision, the consequences, and the alternatives.

The reason ADRs are underutilized is that the format feels heavyweight relative to the value of any individual decision. Engineers think "this decision is not big enough to deserve an ADR" and so the ADR does not get written. Six months later the decision turns out to have been bigger than they thought, and now there is no record.

The skill changes this calculus. Writing an ADR no longer takes 30 minutes. It takes five. The threshold for "big enough to deserve an ADR" can drop accordingly. I now write ADRs for decisions I would have left undocumented two years ago, and the ADRs are paying off in conversations where I can point to the document instead of trying to reconstruct the reasoning.

The format I use:

```markdown
# ADR 042: Use cursor pagination for the orders API

Status: Accepted
Date: 2026-04-15

## Context
The orders API returns lists of orders to mobile clients. Order
volume is high enough that offset pagination causes issues at
high page numbers (slow queries, inconsistent results across
pages when new orders are inserted).

## Decision
Use opaque cursor-based pagination. Cursors are base64-encoded
JSON containing the last-seen order id and timestamp.

## Consequences
- Clients cannot jump to arbitrary pages, only navigate forward
- Cursors are stable across data changes
- Cursor format is not part of the public contract and may change
- Migration from offset pagination requires a deprecation window

## Alternatives considered
- Offset pagination: rejected due to performance and consistency
- Keyset pagination with exposed keys: rejected due to leaking
  internal id format to clients
- Time-based pagination: rejected because orders within the same
  millisecond can collide
```

This format is short enough that writing it does not feel like a chore. It is structured enough that future readers can find the parts they care about quickly. The skill produces drafts in this format from a brief verbal description of the decision.
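For readers unfamiliar with the decision recorded in the example ADR, an opaque cursor of the kind it describes can be sketched in a few lines. The field names `id` and `ts` and the URL-safe base64 variant are assumptions; the ADR only says the cursor is base64-encoded JSON holding the last-seen order id and timestamp.

```python
import base64
import json


def encode_cursor(last_order_id: int, last_timestamp: str) -> str:
    """Pack the last-seen order id and timestamp into an opaque string."""
    payload = json.dumps(
        {"id": last_order_id, "ts": last_timestamp}, separators=(",", ":")
    )
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_cursor(cursor: str) -> tuple[int, str]:
    """Recover (order id, timestamp) from a cursor produced by encode_cursor."""
    data = json.loads(base64.urlsafe_b64decode(cursor.encode("ascii")))
    return data["id"], data["ts"]
```

Encoding hides the structure from casual inspection but is not a security boundary. A server using this shape would still need to validate decoded values before using them in a query.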

The API Documentation Skill

API documentation is its own discipline. Module documentation tells you how a piece of code works internally. API documentation tells you how to call a piece of code from outside it. The two have different audiences and different requirements.

I covered API documentation in detail in my Claude Code for API design article. The short version is that API documentation should be generated from specifications, not from code, and the specifications should be written before the code. Claude Code makes both halves of that workflow practical.

The relevant skill for this article is the one that takes existing code that does not have specifications and reverse-engineers documentation from it. This is what you do when you inherit an undocumented API and need to bootstrap documentation without rewriting everything from scratch.

The skill reads the route handlers, the request validation, the response shapes, and the tests, and produces a draft specification document for each endpoint. The draft is incomplete because the code does not always tell you the full story. Authentication requirements might be enforced by middleware that is not visible in the route handler. Idempotency behavior might be implicit in the database constraints. Error responses might depend on conditions the code only handles indirectly.

I review the drafts and fill in the gaps. The drafts get me 70 percent of the way there. Closing the last 30 percent is the part that requires my judgment. But starting from a 70 percent draft is dramatically faster than starting from nothing.

The Tutorial Skill

Reference documentation tells you what is possible. Tutorials tell you how to actually do something useful. Most projects have reference documentation and no tutorials, which is why most projects have a steep onboarding curve.

The tutorial skill takes a goal ("connect this service to a Postgres database with TLS," "set up authentication with custom JWT claims," "deploy this service behind a load balancer") and produces a step-by-step tutorial with code examples, explanations, and troubleshooting tips for the common failure modes.

The tutorials are not autogenerated content with empty filler. They are actual narratives that walk a reader from a starting state to a completed setup, with the reasoning visible at each step. The skill produces these narratives by reading the code, the existing documentation, and the issue tracker (where troubleshooting tips often live as resolved tickets).

I edit the tutorials before publishing. Sometimes I add screenshots. Sometimes I correct steps that Claude Code got slightly wrong because the documentation was outdated. But the structure is sound and the content is mostly correct, which is what matters. Tutorials I would not have written because the cost was too high now exist because the cost is trivial.

If you are starting a new project and want documentation built into the workflow from day one, the CLAUDE.md context file pattern is how you make Claude Code understand your project well enough to produce documentation that does not feel generic.

The Inline Comment Skill

Inline comments are a paradox. Most inline comments are noise. Comments that restate what the code already says are worse than no comments because they take up space and rot when the code changes. But the inline comments that explain non-obvious decisions are gold. The trick is writing the second kind without writing the first.

The inline comment skill reads code and proposes inline comments only for the lines where context is genuinely missing. Hidden constraints. Subtle invariants. Workarounds for specific bugs. Behaviors that would surprise a reader. Things that an engineer reading the code six months from now would wonder about.

The skill is conservative by design. If it is not sure that a comment adds value, it does not propose one. The proposed comments are short, factual, and focused on the why rather than the what.
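The distinction between the two kinds of comment is easiest to see on a single line of code. In the sketch below, the function and the reason given for the cap are hypothetical; the point is the contrast between a comment that restates the code and one that records why the code is the way it is.

```python
def retry_delays(attempts: int) -> list[float]:
    """Exponential backoff delays in seconds, starting at 0.5."""
    delays = []
    for attempt in range(attempts):
        # Noise (would be rejected): "double the delay on each attempt".
        # The expression below already says that.
        #
        # Useful (would be kept): cap at 30 seconds because the upstream
        # gateway in this hypothetical setup drops idle connections after
        # that long, so waiting longer only delays the failure.
        delays.append(min(0.5 * 2 ** attempt, 30.0))
    return delays
```

The useful comment survives a refactor of the arithmetic, because it describes a constraint outside the code. The noisy one goes stale the moment the formula changes.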

I review the proposals and accept the ones that make sense. Usually I accept three or four out of every ten proposed. The rest I either reject (the comment was redundant) or modify (the comment had the wrong emphasis). The result is that the code has comments where comments are useful and is comment-free where comments would be noise.

This is the kind of detail work that I would never have time to do manually but that meaningfully improves the readability of code I revisit months later.

The Changelog Skill

Changelogs are documentation that nobody writes and everybody wants. Users want to know what changed in the version they just upgraded to. Maintainers want to remember why they made certain changes when they look back at the version history. Both groups are usually disappointed.

The changelog skill takes the commit history between two release tags and produces a human-readable changelog with sections for new features, improvements, bug fixes, breaking changes, and deprecations. The classification is based on the commit messages and, when those are inadequate, the actual code changes.

The skill is not magic. It cannot tell you which changes are exciting and which are boring. But it can produce a complete first draft that captures the structural changes accurately. I edit the draft to add commentary, group related changes, and highlight the things users actually care about. The whole process takes 20 minutes per release. Without the skill, it would take two hours, which is why I used to skip it.
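To make the commit-message half of that classification concrete, here is a minimal sketch. It assumes the commits follow Conventional Commits prefixes (`feat:`, `fix:`, a `!` for breaking changes), which the workflow above does not require. The section names come from the text; the prefix mapping, the `deprecate` type, and the function names are illustrative. Commits that do not parse land in an "Unclassified" bucket, which is where reading the actual diff would take over.

```python
import re
from collections import defaultdict

# Assumed mapping from commit types to the changelog sections named above.
# "deprecate" is not a standard Conventional Commits type; it is a team convention here.
SECTION_BY_TYPE = {
    "feat": "New features",
    "perf": "Improvements",
    "refactor": "Improvements",
    "fix": "Bug fixes",
    "deprecate": "Deprecations",
}

# type, optional (scope), optional "!" marking a breaking change, then the summary
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<bang>!)?:\s*(?P<summary>.+)$")


def classify(subject: str, body: str = "") -> tuple[str, str]:
    """Return (section, summary) for one commit."""
    match = HEADER.match(subject.strip())
    if match is None:
        # The message is inadequate; a fuller tool would inspect the diff here.
        return "Unclassified", subject.strip()
    if match["bang"] or "BREAKING CHANGE:" in body:
        return "Breaking changes", match["summary"]
    return SECTION_BY_TYPE.get(match["type"], "Unclassified"), match["summary"]


def build_changelog(commits: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (subject, body) pairs into changelog sections."""
    sections = defaultdict(list)
    for subject, body in commits:
        section, summary = classify(subject, body)
        sections[section].append(summary)
    return dict(sections)
```

The input pairs would come from the commit history between the two release tags. Anything in "Unclassified" is a prompt for a human or a model to look at the change itself.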

The Cost of This Workflow

The total time investment to set up the module documentation, README, ADR, API documentation, tutorial, inline comment, and changelog skills was about two days. Most of that was iterating on the prompts to produce output I trusted. The ongoing cost is essentially zero. The skills run as part of my normal development flow.

The benefit is that the projects I work on now have documentation. Not perfect documentation. Not comprehensive documentation. But the kind of documentation that makes a difference for the next engineer who has to work on the codebase. The README explains what the project is. The module documentation explains how the modules work. The ADRs capture the major decisions. The tutorials cover the common workflows. The changelog tracks the releases. The inline comments illuminate the non-obvious lines.

The six weeks of context recovery I went through on the project I inherited two years ago would not happen with this workflow. The original engineer would have run the skills as part of finishing the project, the documentation would have been comprehensive enough for me to onboard in days rather than weeks, and the company would have gotten back the five weeks it instead paid me to read code.

The Bottom Line

Documentation is a leverage activity that most engineers skip because the in-the-moment cost feels too high. The cost was always lower than the benefit. Claude Code makes the cost actually low, which removes the last excuse for skipping it.

If you have ever inherited an undocumented codebase, you know how much time gets lost to the absence of context. The engineers who came before you were not lazy or careless. They were busy and the documentation was the thing they could safely skip. Claude Code removes "safely skip" as an option by making documentation cheap enough that there is no longer a reason to skip it.

If this resonates and you want to build a documentation pipeline into your team's workflow, the Claude Code skills guide shows how to package these workflows so that every engineer on the team gets the leverage automatically. The hardest part of documentation is making it routine. Skills make it routine.

The codebases I am proudest of are the ones future engineers will actually be able to read. Claude Code is what makes that possible.

Claude Code for API Design: How I Stopped Shipping Endpoints I Regret Six Months Later

Nex Tools — Fri, 08 May 2026 09:00:57 +0000

Originally published on Hashnode. Cross-posted for the DEV.to community.

The first public API I designed had 47 endpoints. Eight months later, 31 of them were either deprecated, broken, or quietly ignored by the only client that ever consumed them. Two were so badly named that we shipped a v2 just to rename them. One returned a different shape depending on which day of the week you called it, because of a bug nobody had caught in code review. The whole thing was a monument to what happens when an engineer designs an API by writing endpoints in the order they get requested.

That was 2021. Since then I have shipped four more public APIs and a handful of internal ones. Three of them I actually like. The other one is fine. None of them have the embarrassing mid-stream redesigns that the first one had. The difference is not that I got smarter. The difference is that I stopped designing APIs by typing route handlers and started designing them by having a conversation with Claude Code about what the API is for.

This is a workflow article, not a theory article. I am going to walk you through how I use Claude Code to design APIs from a blank slate, how I review existing APIs for design problems before they ship, and how I evolve APIs without breaking the clients that depend on them. The patterns are language and framework agnostic. I have used them for REST, GraphQL, and gRPC services. The tooling matters less than the discipline.

Why API Design Goes Wrong

Most APIs that age badly share a common failure mode. The team designs the API by following the immediate request pattern. The first client wants a way to fetch users, so we add GET /users. The second client wants a way to fetch a user by id, so we add GET /users/:id. The third client wants to filter users by status, so we add GET /users?status=active. Six months later we have 40 endpoints, three different ways to filter, two different pagination strategies, and an inconsistent response envelope.

The problem is not that any individual decision was wrong. Each endpoint, viewed in isolation, made sense at the time it was added. The problem is that nobody designed the API as a whole. The API emerged from accumulated requests, and emergent designs almost always have rough edges.

The second failure mode is designing for the present implementation instead of the future contract. The team exposes the database schema directly because that is what the implementation looks like today. Six months later the database schema changes and now the API has to either change too (breaking clients) or include a translation layer that nobody has time to maintain. Either way, somebody is unhappy.

The third failure mode is the optimistic naming problem. The team names the endpoint after what it does today, not what it represents conceptually. POST /sendWelcomeEmail becomes a problem the moment the product team decides welcome emails should sometimes be SMS messages. Now you have an endpoint named after a transport when the actual concern is the welcome flow.

APIs are contracts. Contracts are about what they promise, not how they are currently fulfilled.

I have made all three of these mistakes more than once. The reason I have stopped making them is that Claude Code now flags them before they ship.

The Design Conversation Skill

Before I write any route handlers, I have a design conversation with Claude Code about what the API is for. This is not a casual chat. It is a structured process that produces a markdown document I commit to the repo before the first endpoint exists.

The conversation has five sections.

The first section is the actor inventory. Who calls this API, and what are they trying to accomplish? Not what features they need, but what jobs they are trying to do. A mobile app trying to render a user profile is doing a different job than a backend service trying to validate a webhook signature, even if both involve the same user record. Most APIs are easier to design when you know the jobs first.

The second section is the resource inventory. What are the nouns this API talks about? Not the database tables. The conceptual nouns. Sometimes a database has six tables but the API only has two resources because the other four are implementation details. Sometimes the database has one table but the API has three resources because what looks like one entity to the storage layer is three different concepts to the consumer.

The third section is the operation inventory. For each resource, what operations does the API support? Create, read, update, delete are the obvious ones, but most APIs need more. Listing with filters. Bulk operations. State transitions that are not just updates. Search. Subscriptions. Each one needs to be explicit so that nothing surprises us later.

The fourth section is the consistency rules. How does this API handle pagination? Errors? Idempotency? Versioning? Authentication? These are the cross-cutting concerns that, if left ad-hoc, end up inconsistent across endpoints. Decide them once and document them.

The fifth section is the explicit non-goals. What is this API not for? Which use cases are out of scope? This section saves more arguments than any other section in the document. Six months from now when somebody asks "can we add a search endpoint to this API," the non-goals section gives a principled answer.

The skill takes my rough description of what I am building and produces a draft of all five sections. I review it, push back on the parts that feel wrong, and iterate until I have a document I would be willing to defend in a design review.

The Endpoint Specification Skill

Once the design document is settled, I move to specifying individual endpoints. The endpoint specification skill takes a resource and an operation and produces a complete specification including the URL path, the HTTP method, the request shape, the response shape, the error responses, the idempotency behavior, and the authentication requirements.

This is where most of the bugs in API design get caught. Writing a specification forces you to confront edge cases that are easy to ignore when you are typing route handlers. What happens if the request body is malformed? What happens if the resource does not exist? What happens if the user is authenticated but lacks permission? What happens if a required field is empty versus missing?

The specification format I use looks like this:

POST /api/v1/orders
Auth: Bearer token (scope: orders:write)
Idempotency: Idempotency-Key header (UUID, 24h retention)

Request:
  {
    customer_id: string (required)
    line_items: [{
      product_id: string (required)
      quantity: integer (required, min: 1, max: 999)
      unit_price_cents: integer (optional, defaults to product price)
    }] (required, min: 1, max: 50)
    shipping_address: AddressObject (required)
    notes: string (optional, max: 500)
  }

Response 201:
  Order resource (full shape)
  Location: /api/v1/orders/{id}

Response 400:
  ValidationError with field-level details

Response 409:
  IdempotencyConflict if same key seen with different body

Response 422:
  BusinessLogicError (e.g., product out of stock)

The skill produces these specifications for every endpoint in the API. I commit them to a specs/ directory. They become the contract that the implementation has to match and that the tests verify.
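A specification at this level of detail is precise enough to check mechanically. As a rough illustration, the request rules for `POST /api/v1/orders` might translate into a hand-rolled validator like the one below. The function names and the error dictionary shape are assumptions: the spec only says a 400 carries a ValidationError with field-level details. `AddressObject` is not defined in the spec, so the sketch only checks that it is an object. It also rejects an empty `customer_id`, which is one possible answer to the empty-versus-missing question rather than something the spec settles.

```python
def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_order_request(body: dict) -> list[dict]:
    """Return field-level errors; an empty list means the body is valid."""
    errors = []

    def fail(field: str, message: str) -> None:
        errors.append({"field": field, "message": message})

    # Assumption: an empty customer_id is rejected the same way as a missing one.
    if not isinstance(body.get("customer_id"), str) or not body["customer_id"]:
        fail("customer_id", "required non-empty string")

    items = body.get("line_items")
    if not isinstance(items, list) or not 1 <= len(items) <= 50:
        fail("line_items", "required list with 1 to 50 entries")
        items = []  # skip per-item checks when the list itself is invalid
    for index, item in enumerate(items):
        prefix = f"line_items[{index}]"
        if not isinstance(item, dict):
            fail(prefix, "must be an object")
            continue
        if not isinstance(item.get("product_id"), str):
            fail(f"{prefix}.product_id", "required string")
        quantity = item.get("quantity")
        if not _is_int(quantity) or not 1 <= quantity <= 999:
            fail(f"{prefix}.quantity", "required integer from 1 to 999")
        price = item.get("unit_price_cents")
        if price is not None and not _is_int(price):
            fail(f"{prefix}.unit_price_cents", "optional integer")

    # AddressObject is not defined in the spec, so only its presence is checked.
    if not isinstance(body.get("shipping_address"), dict):
        fail("shipping_address", "required object")

    notes = body.get("notes")
    if notes is not None and (not isinstance(notes, str) or len(notes) > 500):
        fail("notes", "optional string, at most 500 characters")

    return errors
```

A non-empty result would map to the 400 response. The 409 and 422 cases depend on stored state and business rules, so they sit outside a pure shape check like this one.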

The discipline of writing specifications first sounds bureaucratic. It is not. It is faster than typing route handlers and discovering the design problems through bugs. The specifications take maybe an hour each. The bugs they prevent take days each.

Want the playbook for setting up Claude Code skills like the API design conversation skill? It is in the Claude Code skills guide. Start with one skill and add more as you find friction in your workflow.

The Consistency Audit Skill

Designing endpoints in isolation is how inconsistencies creep in. The third endpoint uses created_at and the fourth uses createdAt and the fifth uses creationDate. The first list endpoint uses cursor pagination and the second uses offset pagination. The first error response includes a code field and the second does not. Each individual decision is fine. The collection is a mess.

The consistency audit skill takes the full set of endpoint specifications and produces a report of inconsistencies. Naming conventions, pagination strategies, error formats, authentication patterns, response envelopes. Anything that varies across endpoints when it should not.

I run this skill at three points in the API lifecycle. First, after the initial design pass, before any code is written. The earliest fixes are the cheapest. Second, after every batch of new endpoints is added. The skill catches drift from the original conventions. Third, before any major release. The skill provides a final sanity check.

The audit report is brutal in a useful way. Last month it told me I had three different pagination strategies in an API I thought was internally consistent. I had been pattern matching against whichever endpoint I was looking at most recently. The skill noticed what my tired eyes had missed.
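The naming part of such an audit is simple enough to sketch by hand. The example below assumes each endpoint spec has already been reduced to a list of field names (a hypothetical input format) and reports when snake_case and camelCase both appear across the API. It is a narrow illustration, not the audit skill: it would flag `created_at` against `createdAt`, but it cannot tell that `creationDate` is a synonym for either, and it says nothing about pagination or error formats.

```python
import re
from collections import defaultdict

SNAKE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")
CAMEL = re.compile(r"^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$")


def naming_style(field: str) -> str:
    """Classify one field name by its casing convention."""
    if SNAKE.match(field):
        return "snake_case"
    if CAMEL.match(field):
        return "camelCase"
    # Names like "id" fit either convention, so they never count as evidence.
    return "single-word"


def audit_field_names(specs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each naming style to the endpoints that use it.

    Returns an empty dict when the API uses at most one style.
    """
    endpoints_by_style = defaultdict(set)
    for endpoint, fields in specs.items():
        for field in fields:
            style = naming_style(field)
            if style != "single-word":
                endpoints_by_style[style].add(endpoint)
    if len(endpoints_by_style) < 2:
        return {}
    return {style: sorted(eps) for style, eps in endpoints_by_style.items()}
```

The same shape (collect a property per endpoint, report when more than one value appears) extends to pagination parameters or error envelope keys.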

The Versioning Strategy Skill

Versioning is where most APIs go to die. The team picks a strategy that sounds reasonable, ships v1, and discovers six months later that the strategy does not work for the actual changes the API needs. By then there are clients depending on v1 and changing the strategy means breaking them.

The versioning strategy skill takes the API design document and produces a versioning plan that answers four questions. How do additive changes get versioned? How do breaking changes get versioned? How long does each version stay supported? What is the deprecation process?

The reason this matters is that different versioning strategies suit different APIs. URL path versioning (/api/v1/, /api/v2/) is simple but creates massive code duplication if you maintain multiple major versions. Header versioning is more flexible but harder for clients to discover. Date-based versioning works well for SaaS APIs where clients pin to a specific release. Each strategy has tradeoffs and the right choice depends on the API.

The skill walks through the tradeoffs for the specific API and recommends a strategy with reasoning. I have never accepted the first recommendation without modification, but the recommendation is always close enough to argue with productively.

The Client SDK Generation Skill

Most APIs are easier to use through an SDK than through raw HTTP calls. The problem is that maintaining SDKs in five languages is more work than most teams can sustain. So the SDK either does not exist, or exists in only one language, or exists in five languages but only two are kept up to date.

The client SDK generation skill takes the endpoint specifications and produces SDK code for whatever languages I need. TypeScript, Python, Ruby, Go, Java. The generated SDKs include type definitions, error handling, retries, idempotency key generation, and pagination helpers. They are not as polished as a hand-written SDK by an expert in that language, but they are 80 percent of the way there and they stay in sync with the API automatically.

The trick is that the SDK is generated from the same specifications that the tests check the implementation against. If the implementation drifts from the spec, the tests fail. If the spec is updated, the SDK regenerates. The whole pipeline is connected.

This is the kind of thing that used to require a dedicated developer experience team. Now it requires a half day of skill setup and an evening of polish per language.

The same spec-driven workflow applies to internal team APIs too. If you are building a service mesh of internal APIs, the Claude Code memory files approach gives every service a shared context that makes cross-service design conversations dramatically easier.

The Breaking Change Detector

The single most expensive class of API mistake is the accidental breaking change. You think you are making a backwards-compatible change. You are not. A client breaks in production. You roll back. The team loses a day. Trust in the deployment process drops.

The breaking change detector takes the current API specification and the proposed API specification and produces a report of every change, classified as additive, breaking, or ambiguous. Adding a new optional field is additive. Removing a field is breaking. Changing a field from optional to required is breaking. Changing a field from required to optional is technically additive but might break clients that expect the field to always be present.

The ambiguous category is the interesting one. There are changes that are technically backwards-compatible but practically break some clients. Changing the order of fields in a list response. Changing the precision of a float. Changing the timezone of a timestamp. The detector flags these explicitly so I can make a deliberate decision rather than discovering the breakage in production.

I run the detector on every PR that touches the specifications. It is wired into CI. If the PR introduces a breaking change without a corresponding version bump, CI fails. The discipline is enforced by tooling rather than memory.
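The classification rules above are mechanical enough to sketch. The example assumes each spec has been flattened into a map of field name to `{"type": ..., "required": ...}`, which is a hypothetical format chosen for illustration. Two rules go beyond what the text states and are assumptions: a new required field and a changed type are both treated as breaking. The sketch also ignores the difference between request and response fields, and the ordering, precision, and timezone cases, which need more than a field map to detect.

```python
def classify_changes(current: dict, proposed: dict) -> list[tuple[str, str, str]]:
    """Compare two field maps of the form {name: {"type": str, "required": bool}}.

    Returns (field, category, reason) tuples sorted by field name.
    """
    changes = []
    for name in sorted(current.keys() | proposed.keys()):
        old, new = current.get(name), proposed.get(name)
        if old is None:
            if new["required"]:
                # Assumption: not stated in the text, treated as breaking here.
                changes.append((name, "breaking", "new required field"))
            else:
                changes.append((name, "additive", "new optional field"))
        elif new is None:
            changes.append((name, "breaking", "field removed"))
        elif old["type"] != new["type"]:
            # Assumption: any type change is treated as breaking.
            changes.append((name, "breaking", "type changed"))
        elif not old["required"] and new["required"]:
            changes.append((name, "breaking", "optional became required"))
        elif old["required"] and not new["required"]:
            changes.append((name, "ambiguous", "required became optional"))
    return changes


def ci_should_fail(changes: list[tuple[str, str, str]], version_bumped: bool) -> bool:
    """Fail the build when a breaking change ships without a version bump."""
    return not version_bumped and any(category == "breaking" for _, category, _ in changes)
```

Ambiguous changes do not fail the build in this sketch. They are surfaced in the report so that someone makes the call deliberately.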

The Documentation Generation Skill

API documentation is the work that nobody has time for and that everybody complains about when it is missing. Most teams ship documentation that is either nonexistent, out of date, or autogenerated from code in a way that makes it technically complete but practically unusable.

The documentation generation skill takes the endpoint specifications and produces documentation that is more useful than autogenerated reference material. It includes example requests and responses. It includes common workflows that span multiple endpoints. It includes troubleshooting guides for common error conditions. It includes migration guides between versions.

The trick is that the documentation is generated from the same specifications that everything else reads, so it never drifts from the actual API behavior. If the spec changes, the documentation regenerates. If there is a bug in the documentation, fixing it usually means fixing the spec, which means fixing the implementation, which means the bug gets fixed everywhere at once.

This is one of the workflows where the leverage from Claude Code is most obvious. Documentation that used to take a week to write and that nobody trusted now takes an afternoon to generate and reflects reality.

What This Workflow Has Cost Me

Setting up the design conversation, endpoint specification, consistency audit, versioning strategy, SDK generation, breaking change detector, and documentation generation skills took me about three days. Most of that was iterating on the prompts to produce output I trusted. The skills themselves are short. The expertise is in knowing what good API design looks like, which is not something Claude Code can give me but is something Claude Code can amplify.

The ongoing cost is essentially zero. I run the skills as part of my normal development flow. They produce artifacts I would have wanted to produce anyway. The friction of using them is lower than the friction of skipping them.

The benefit is that I have not shipped a regrettable API since I started using this workflow. The APIs I ship are more consistent, better documented, and more pleasant to consume. The clients that integrate with them complain less. The teams that maintain them complain less. The on-call burden from API issues is lower.

There is a version of this article that is about specific tools. This is not that article. The tools I use are not magic. The discipline is.

The Bottom Line

API design is a leverage activity. The decisions you make in the first week of an API live with you for years. The cost of getting them right early is dramatically lower than the cost of getting them wrong and discovering the mistake six months later when there are clients depending on the wrong shape.

Claude Code does not turn me into a better API designer. It turns me into the API designer I would be if I had infinite patience for writing specifications, running consistency audits, and producing migration guides. Most engineers know what good API design looks like. Most engineers do not have time to do all the work that good API design requires. Claude Code closes that gap.

If you are about to design a new API, do not start by typing route handlers. Start by writing the design document. Then write the endpoint specifications. Then audit them for consistency. Then implement. The implementation will be faster because you will not be redesigning while you type, and the API will be better because you will have thought it through before you committed to it.

If this workflow resonates, the team workflows guide shows how to scale these patterns across an engineering team. The hardest part is not the tooling. The hardest part is the cultural shift from typing first to thinking first.

The APIs I am proudest of were the ones I designed slowest. Claude Code makes slow design fast.

Claude Code for Refactoring Legacy Code: How I Modernize Codebases I Did Not Write

Nex Tools — Wed, 06 May 2026 14:34:18 +0000

The first legacy refactor I owned was a 14,000 line PHP file that handled checkout for an ecommerce site I had been hired to maintain. The original author had left two years before. The variable names were a mix of Hungarian notation, abbreviations, and what I can only describe as personal grudges. There were comments that referenced bugs that no longer existed and tickets in a tracker that had been decommissioned. I asked the team where to start. They said "do not." That was the strategy.

That advice was correct in 2018. In 2026 it is not. Refactoring legacy code used to be the kind of project that took a senior engineer three months and ended in a failed migration. Now the same work takes weeks, and the failure rate is dramatically lower. Claude Code is the difference. Not because it does the refactor for me, but because it does the research that used to make refactoring legacy code intractable.

Here is the workflow I use to modernize codebases I did not write, on a deadline I cannot move, with a team that does not have time to help.

Why Legacy Refactoring Fails

Most legacy refactor projects fail in one of three ways.

The first failure mode is scope creep. The team starts with "let us clean up the checkout module" and ends with "let us rewrite everything." Three months in, nothing ships. Six months in, the project gets cancelled. The original code is still in production.

The second failure mode is regression. The team makes changes that look right but break behavior that was load-bearing. The bug surfaces in production three weeks later when a customer hits an edge case nobody knew existed. The team rolls back. Trust in refactoring drops. Future refactors get harder to justify.

The third failure mode is abandonment. The team starts the refactor, hits an unexpected complication, and pauses. The pause becomes permanent. Six months later there is half-refactored code in production, harder to maintain than the original because now it has two patterns instead of one.

All three failure modes have the same root cause. The team did not understand the code well enough before changing it. Legacy code looks simple from the outside and is not. Every line in a legacy codebase is there for a reason. Some of those reasons are good. Some are obsolete. Some are workarounds for bugs in dependencies that have since been fixed. You cannot tell which is which without reading the code carefully, and reading 14,000 lines of legacy PHP carefully is a job nobody wants.

Legacy code is not bad code. It is code that survived. Survival is information. Most refactors fail because they discard the information.

The Archaeology Skill

Every legacy refactor starts with archaeology. The archaeology skill takes a file or a module and produces a markdown report with the following sections:

Surface summary - what the code appears to do
Hidden behavior - subtle behaviors that are not obvious from naming
Dependencies - what it depends on, what depends on it
Historical context - patterns that suggest old constraints
Risk hotspots - code that handles edge cases or error paths

The historical context section is the one that surprised me most when I started using this workflow. Claude Code can often infer why code was written a certain way by reading the code carefully and recognizing patterns. A weird retry loop with a 30 second backoff is probably a workaround for a flaky upstream service. A function that handles three different argument shapes was probably called from three different places at different times. A comment that says "TODO: fix this" with no date is probably five years old and the author is no longer at the company.

The archaeology report becomes the briefing document for everything that follows. I commit it to a docs folder and reference it in every PR description. New team members read the archaeology before they touch the code. The cost of a careful read once is dramatically lower than the cost of repeated shallow reads forever.

The Behavior Spec Skill

Once I understand what the code is, the next step is documenting what the code does. The behavior spec skill takes a module and produces a markdown specification of its observable behavior, including all the edge cases and error paths.

This is not the same as documentation the original author would have written. It is a reverse-engineered specification based on reading the code as it actually is. The spec includes things the original author may not have intended but that are now load-bearing because callers depend on them.

A typical behavior spec for a 500 line module is about 80 lines and reads like a contract:

"Function processOrder(order) returns one of: {success: true, id: string}, {success: false, error: 'inventory'}, {success: false, error: 'payment'}, {success: false, error: 'unknown', code: number}. The unknown case occurs when the upstream service returns a 5xx response or times out. The code field is the HTTP status code when the response is 5xx, or the value -1 when the call timed out."

That last sentence is the kind of detail that lives only in the code until someone writes it down. Once it is written down, refactoring becomes safe because the new code can be checked against the spec. Without the spec, you are checking against your assumption of what the code does, and your assumption is usually wrong in at least one place.

If you want the actual archaeology and behavior spec skills I use, the setup is documented at nextools.hashnode.dev. Adapt them to your stack and start treating legacy code as a research problem.

The Test Generation Pattern

Behavior specs are useful but they are not enforceable. Tests are. The test generation pattern takes a behavior spec and produces a test suite that exercises every documented behavior.

The pattern has three steps:

Generate tests from the behavior spec
Run tests against the existing code and confirm they all pass
Use the tests as the safety net for the refactor

Step 2 is the critical one. If the generated tests do not all pass against the existing code, then either the behavior spec is wrong or the tests are wrong. Both are common on the first iteration. I usually run two or three iterations of "fix the spec, regenerate the tests, run them" before everything passes. Once everything passes, I have a test suite that locks down the existing behavior. From there, refactoring is safe in a way it never was before.
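For the `processOrder` contract quoted earlier, generated tests might look roughly like the sketch below. Everything except the return shapes is an assumption: the `process_order` stand-in exists only so the example runs, the injected `call_upstream` callable and the `UpstreamTimeout` exception are hypothetical, and so is the idea that the upstream names the business failure in a `reason` field. In a real refactor the tests would import the legacy function unchanged.

```python
import unittest


class UpstreamTimeout(Exception):
    """Stand-in for whatever the real client raises on a timeout."""


def process_order(order, call_upstream):
    """Stand-in for the legacy function, written to match the quoted spec."""
    try:
        status, payload = call_upstream(order)
    except UpstreamTimeout:
        return {"success": False, "error": "unknown", "code": -1}
    if status >= 500:
        return {"success": False, "error": "unknown", "code": status}
    if status == 200:
        return {"success": True, "id": payload["id"]}
    # Assumed: the upstream reports "inventory" or "payment" in a reason field.
    return {"success": False, "error": payload["reason"]}


class ProcessOrderCharacterization(unittest.TestCase):
    """One test per behavior documented in the spec."""

    def test_success_returns_id(self):
        result = process_order({}, lambda order: (200, {"id": "ord_1"}))
        self.assertEqual(result, {"success": True, "id": "ord_1"})

    def test_inventory_failure(self):
        result = process_order({}, lambda order: (409, {"reason": "inventory"}))
        self.assertEqual(result, {"success": False, "error": "inventory"})

    def test_upstream_5xx_reports_status_code(self):
        result = process_order({}, lambda order: (503, {}))
        self.assertEqual(result, {"success": False, "error": "unknown", "code": 503})

    def test_timeout_reports_minus_one(self):
        def timed_out(order):
            raise UpstreamTimeout()

        result = process_order({}, timed_out)
        self.assertEqual(result, {"success": False, "error": "unknown", "code": -1})
```

Run it with `python -m unittest`. A failure against the untouched legacy code means the spec or the test is wrong, which is exactly the signal step 2 is looking for.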

The test generation pattern produces tests that are usually about 70% as good as tests an engineer would have written by hand, but at about 5% of the cost. The 30% gap is mostly in test naming and organization. The actual coverage is comparable. For legacy refactors, this trade is overwhelmingly worth it.

The Incremental Refactor Loop

Refactors fail when they are big bang releases. They succeed when they are incremental. The incremental refactor loop is the workflow that keeps a refactor incremental even when it spans weeks.

The loop has four steps that repeat:

Pick the smallest unit that can be refactored independently
Refactor that unit, keeping the behavior spec satisfied
Run the test suite, ship the change
Move to the next unit

The "smallest unit" definition is what makes this work. In a 14,000 line file, the smallest unit might be a single function. In a 100,000 line module, the smallest unit might be a single file. In a 5,000 file system, the smallest unit might be a single module. The skill is recognizing what counts as small enough to ship in one PR without losing the team's attention.

I usually aim for PRs that are 200 to 400 lines, take 2 to 4 hours to review, and ship within 24 hours of opening. Anything larger gets split. Anything smaller usually means the refactor is not making meaningful progress.

Refactoring at the speed of small PRs is dramatically faster than refactoring at the speed of big PRs, because small PRs do not stall in review and do not introduce conflicts.

The Strangler Fig Pattern

For systems that cannot be refactored in place, the strangler fig pattern is the standard approach. You build new code alongside the old code, route traffic gradually from old to new, and delete the old code once everything has migrated.

Claude Code makes the strangler fig pattern dramatically easier because it can keep the new code and the old code in sync as the migration progresses. The strangler skill takes a behavior spec and a target architecture and produces a migration plan with:

The new code structure
The routing logic that decides old or new per request
The metrics that confirm equivalent behavior
The rollback plan if metrics drift

I have used the strangler skill three times in the last year. Each migration took weeks instead of months, and none of them caused production incidents. The metric-driven routing is the part that makes the difference. You can roll new code out to 1% of traffic, watch the metrics for a day, and either expand or roll back. Without that gradient, you are deploying changes in a single step and hoping.
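The routing and rollback pieces can be sketched in a few lines. This is an illustration under assumptions, not the plan the skill generates: it hashes a stable key (for example a customer ID) so the same caller always lands on the same implementation, it compares a single error-rate metric, and it doubles the rollout on each healthy check. The function names, the tolerance, and the doubling schedule are all hypothetical.

```python
import hashlib


def routes_to_new(request_key: str, rollout_percent: float) -> bool:
    """Deterministically send a stable share of keys to the new code path."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < rollout_percent * 100


def handle(request_key, payload, old_impl, new_impl, rollout_percent):
    """Dispatch one request to the old or new implementation."""
    impl = new_impl if routes_to_new(request_key, rollout_percent) else old_impl
    return impl(payload)


def next_rollout_percent(current, old_error_rate, new_error_rate, tolerance=0.001):
    """Roll back to 0 if the new path is measurably worse, otherwise expand."""
    if new_error_rate > old_error_rate + tolerance:
        return 0.0
    # Assumed schedule: start at 1%, then double until everything is migrated.
    return min(100.0, (current * 2) if current else 1.0)
```

Because the bucket for a key never changes, raising the percentage only adds callers to the new path. Nobody flips back and forth between implementations as the rollout expands.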

The Deletion Pattern

The most underrated refactoring move is deletion. Most legacy codebases have dead code, unused features, and abandoned experiments that nobody has the courage to remove. The deletion pattern uses Claude Code to identify safely deletable code with high confidence.

The pattern works in three layers:

Static analysis to find code that is never called
Runtime analysis using production logs to find code that is called less than once a month
Behavioral analysis to find code that is called but whose results are never used

The third layer is the surprising one. Most teams find dead code through static analysis. Behavioral analysis catches code that is technically reachable but functionally inert. A function whose return value is always discarded. A logging call that writes to a destination nobody reads. A side effect that nobody depends on.

Deletion typically removes 10 to 30% of a legacy codebase without changing any observable behavior. That is 10 to 30% less code to maintain, less code to refactor, less code to test, less code to onboard new engineers into. The compound benefit is large.

What I Got Wrong

Three lessons from the first refactor I did with this workflow.

The first lesson is that I tried to refactor too much at once. The archaeology report told me everything that was wrong. I wanted to fix all of it. I started a refactor that spanned eight modules and stalled at module four. The remaining four modules sat in a half-refactored state for two months before I split them into separate projects and shipped them one at a time. Lesson learned: scope is the most important variable in refactoring. Pick small. Always smaller than you think.

The second lesson is that I underestimated how much old code is load-bearing in subtle ways. I deleted a function that the deletion pattern flagged as dead. It was not dead. It was called from a cron job that ran once a quarter to generate a report nobody asked for but that the CFO read every quarter. Three weeks after I deleted it, the CFO asked where the report was. Lesson learned: cross-check deletion candidates with cross-functional stakeholders before deleting. The static and runtime analysis cannot see organizational dependencies.

The third lesson is that I trusted the behavior spec too much without testing it manually. The spec said the function returned a list of orders sorted by date. The function returned a list of orders sorted by date in some cases and unsorted in others depending on a flag I had missed. The tests passed because the test cases happened to match the sorted case. Production users hit the unsorted case immediately. Lesson learned: spec coverage and test coverage are not the same. Spend time exercising the spec manually before locking it down.

FAQ

How big does a legacy codebase need to be before this workflow is worth it?

Around 5,000 lines. Below that you can read it in an afternoon and refactor it without much process. Above that the cost of misunderstanding starts to dominate.

What about codebases in unusual languages?

Claude Code handles most languages well. The archaeology skill is language-agnostic. The test generation pattern depends on having a usable test framework, which is a bigger constraint than the language itself.

Do I need to convince my team to use this workflow?

Not initially. Run the workflow on your own pieces, ship clean refactors, and let the results speak. Teams that see refactors landing without incidents adopt the workflow on their own. Teams that have been burned by refactor projects need evidence before process.

What about codebases with no tests at all?

That is the standard case for legacy code. The test generation pattern is designed for this. You generate tests as part of the refactor process, not before it. The test suite grows alongside the modernization, not as a prerequisite.

The Bigger Picture

Legacy code is a permanent feature of software engineering. Every codebase becomes legacy code eventually. Every engineer eventually inherits code from someone who is no longer around. The question is not whether to deal with legacy code. The question is whether you can deal with it efficiently.

For most of the history of software, dealing with legacy code efficiently was a senior engineer skill. The people who could do it had built up years of pattern recognition, archaeological intuition, and sheer patience. Junior engineers were told to avoid legacy code because the failure modes were so expensive.

Claude Code is changing this. The archaeology skill produces the briefing document a senior engineer would have produced after a careful read. The behavior spec skill captures the contracts a senior engineer would have inferred. The test generation pattern produces the safety net a senior engineer would have demanded before starting the refactor. The result is that refactoring legacy code becomes accessible to anyone who is willing to follow the workflow.

The teams that adopt this workflow first will modernize their codebases faster than the teams that do not. Faster modernization compounds. Modern codebases attract better engineers, ship features faster, and have fewer incidents. The gap between teams that can refactor and teams that cannot is going to widen sharply over the next few years.

If you want to see the exact archaeology, behavior spec, test generation, and incremental refactor skills I use, my full Claude Code legacy refactor setup is documented at nextools.hashnode.dev. Steal them, adapt them, and start shipping refactors that used to be impossible.

The cost of legacy refactoring is collapsing. The codebases that survive the next decade will be the ones that get refactored continuously, in small increments, with the help of AI that does the research that used to be too expensive. Start with the archaeology. The rest follows.

Claude Code Performance Optimization Patterns: How I Cut Production Latency 60% Without Touching Most of My Code

Nex Tools — Wed, 06 May 2026 14:28:58 +0000

The first time I tried to optimize a production service, I read three blog posts about caching, added Redis to one endpoint, and broke production for 40 minutes. The endpoint was faster. Everything else was slower because the connection pool was now exhausted. My CTO at the time had a phrase for this kind of work: "performance theater." You make the metric you are watching go down, and three other metrics you are not watching go up.

That experience taught me something that took me years to articulate. Performance optimization is a research problem, not a coding problem. The coding part is easy. The hard part is figuring out which 5% of the code accounts for 80% of the latency, and which 2% of changes will not regress something else.

I run an ecommerce stack now where the slowest endpoint went from 2.4 seconds to 940 milliseconds in three weeks, a reduction of about 61%. I touched maybe 200 lines of code total. The other 19,800 lines stayed exactly the same. Claude Code did the research that made the targeted changes possible. Here is the workflow.

Why Most Performance Work Goes Sideways

The default approach to performance work is roughly this: notice something is slow, guess at the cause, change the suspected code, deploy, see if it helped. If the metric moves, declare victory. If not, repeat with a new guess.

This works in toy systems where there is one obvious bottleneck. In real systems it is dangerous. Real systems have layers, and the slow layer is rarely the one you suspect first. You can spend a week optimizing a database query when the actual bottleneck is a JSON serializer in the middleware. You can add caching to the wrong endpoint and exhaust a connection pool that the right endpoint depended on.

The pattern that works is the opposite. You measure first, you hypothesize second, you change third, and you measure again. The measurement step is the one most engineers skip because it feels like overhead. It is not overhead. It is the only thing standing between you and performance theater.

Optimizing without measurement is gambling. Optimizing with measurement is engineering. Most engineers are gamblers who think they are engineers.

The Profile Skill

Every performance investigation starts with a profile. The profile skill takes a service, an endpoint, and a load profile, and produces a flame graph plus a markdown summary of the top 20 hottest call paths.

The skill does three things that I used to do by hand:

Spins up a profiler appropriate for the runtime
Runs a representative load against the target endpoint
Aggregates the results into a flame graph and a ranked list of hot paths

The output looks like a normal profile, but the markdown summary is the part that matters. It says things like "37% of latency is in JSON.stringify called from serializeOrder, called 14 times per request, called from the response middleware." That sentence is a complete diagnosis. You know what is slow, where it is called from, and how often. Most of the work of optimization is producing sentences like that.
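To make the shape of that summary concrete, here is a minimal sketch for a Python runtime using the standard library's `cProfile` and `pstats`. It is not the profile skill itself. `serialize_order` and `handle_request` are hypothetical stand-ins for a real handler, and the "load" is a single in-process call rather than representative traffic.

```python
import cProfile
import json
import pstats


def serialize_order(order: dict) -> str:
    # Hypothetical hot function, standing in for a real serializer.
    return json.dumps(order)


def handle_request(orders: list[dict]) -> list[str]:
    # Hypothetical handler that serializes every order in the response.
    return [serialize_order(order) for order in orders]


def rank_hot_paths(func, *args, top: int = 20) -> list[dict]:
    """Profile one call of `func` and rank functions by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    profile = pstats.Stats(profiler).get_stats_profile()  # Python 3.9+
    total = profile.total_tt or 1e-9  # guard against a zero-length run
    rows = [
        {
            "function": name,
            "calls": fp.ncalls,          # pstats reports this as a string
            "cumulative_s": fp.cumtime,
            "share": fp.cumtime / total,
        }
        for name, fp in profile.func_profiles.items()
    ]
    rows.sort(key=lambda row: row["cumulative_s"], reverse=True)
    return rows[:top]


def summarize(rows: list[dict]) -> list[str]:
    """Turn ranked rows into one diagnostic sentence per function."""
    return [
        f"{row['share']:.0%} of profiled time is in {row['function']}, "
        f"called {row['calls']} times per request"
        for row in rows
    ]


rows = rank_hot_paths(handle_request, [{"id": i} for i in range(14)])
for line in summarize(rows)[:5]:
    print(line)
```

The percentages from a run this short are noisy. The call counts are exact, and they are often the more useful half of the sentence.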

The first time I ran the profile skill on my checkout endpoint, the top hot path was something I would never have suspected. It was a logging library serializing a deep object graph for every log line. I thought logging was free. It was 22% of my checkout latency.

If you want the actual profile skill files I use, the setup is documented at nextools.hashnode.dev along with the rest of my Claude Code workflow. Adapt them to your runtime and start measuring.

The Hypothesis Skill

A profile tells you where time is being spent. It does not tell you what to do about it. The hypothesis skill takes a profile and produces a ranked list of optimization hypotheses with cost and risk estimates.

A typical output:

"Hypothesis 1: Replace JSON.stringify with a precompiled serializer. Cost: 4 hours. Risk: low. Expected gain: 15-25% latency reduction on checkout endpoint. Affected packages: 1."

"Hypothesis 2: Cache the order detail object for 30 seconds. Cost: 6 hours. Risk: medium (cache invalidation). Expected gain: 30-40% latency reduction on checkout endpoint. Affected packages: 3."

"Hypothesis 3: Move logging serialization off the hot path. Cost: 2 hours. Risk: low. Expected gain: 18-22% latency reduction on every endpoint. Affected packages: all."

Three hypotheses, three sets of tradeoffs. The hypothesis skill does not pick. It presents. I pick. The pattern that emerged is that the cheapest, lowest-risk hypothesis usually wins, because the highest-risk hypotheses tend to introduce regressions that eat the gains.
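One way to see why the cheap, low-risk option tends to win is to put the three hypotheses above into a rough score. The sketch below is an assumption-heavy illustration: the scoring rule (midpoint of the expected gain per hour of work, discounted by risk) and the `RISK_WEIGHT` values are invented for this example and are not part of the hypothesis skill. The numbers come from the sample output above.

```python
from dataclasses import dataclass

# Assumed discount factors; tune these to your own tolerance for regressions.
RISK_WEIGHT = {"low": 1.0, "medium": 0.5, "high": 0.25}


@dataclass(frozen=True)
class Hypothesis:
    name: str
    cost_hours: float
    risk: str
    gain_low_pct: float
    gain_high_pct: float

    def score(self) -> float:
        """Expected gain (midpoint, in percent) per hour, discounted by risk."""
        midpoint = (self.gain_low_pct + self.gain_high_pct) / 2
        return midpoint / self.cost_hours * RISK_WEIGHT[self.risk]


hypotheses = [
    Hypothesis("precompiled serializer", 4, "low", 15, 25),
    Hypothesis("cache order detail for 30 seconds", 6, "medium", 30, 40),
    Hypothesis("move logging serialization off the hot path", 2, "low", 18, 22),
]

ranked = sorted(hypotheses, key=lambda h: h.score(), reverse=True)
for h in ranked:
    print(f"{h.score():5.2f}  {h.name}")
```

A score like this orders the list. It does not replace the judgment call, and it ignores that the logging change applies to every endpoint rather than one.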

Hypothesis 3 in that example is what I shipped first. Two hours of work for an 18% gain across every endpoint, with low risk. That is the trade nobody offers you in performance theater because nobody profiled enough to find it.

The Diff and Verify Loop

Once I have picked a hypothesis, the diff and verify loop takes over. The pattern is the same every time:

Implement the change
Re-run the profile
Compare before and after
Decide ship or revert

The compare step is critical. The profile skill stores baselines. The diff and verify loop produces a side-by-side comparison showing whether the targeted hot path improved, whether other hot paths regressed, and whether the overall endpoint latency moved in the expected direction.

About one out of five changes I make introduces a regression somewhere I did not expect. The diff and verify loop catches them before they ship. That is the whole point. You are not optimizing one number. You are optimizing one number while not regressing the other 50 numbers that nobody is watching.

A change that improves what you measured and regresses three things you did not measure is a net loss. The discipline of measuring everything is what separates real optimization from theater.
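The compare step can be sketched as a function over two profiles. This is a minimal illustration that assumes each profile has already been reduced to a mapping of hot path to milliseconds. The path names, the 5% tolerance, and the ship rule are all assumptions made for the example.

```python
def compare_profiles(
    baseline: dict[str, float],
    candidate: dict[str, float],
    target: str,
    tolerance: float = 0.05,
) -> dict:
    """Compare per-path latency (ms) before and after a change."""
    regressions = {}
    for path, before in baseline.items():
        after = candidate.get(path, 0.0)
        # A path regresses if it slowed down by more than the tolerance.
        if before > 0 and (after - before) / before > tolerance:
            regressions[path] = (before, after)
    for path, after in candidate.items():
        # A hot path that did not exist before is treated as a regression.
        if path not in baseline:
            regressions[path] = (0.0, after)

    target_improved = candidate.get(target, 0.0) < baseline[target]
    total_improved = sum(candidate.values()) < sum(baseline.values())
    return {
        "target_improved": target_improved,
        "total_improved": total_improved,
        "regressions": regressions,
        "ship": target_improved and total_improved and not regressions,
    }


baseline = {"serializeOrder": 350.0, "db.loadOrder": 400.0, "logger.write": 210.0}
risky = {"serializeOrder": 120.0, "db.loadOrder": 405.0, "logger.write": 300.0}
clean = {"serializeOrder": 120.0, "db.loadOrder": 405.0, "logger.write": 205.0}

print(compare_profiles(baseline, risky, target="serializeOrder"))
print(compare_profiles(baseline, clean, target="serializeOrder"))
```

In the `risky` case the target improved and the total went down, yet the change is still flagged because `logger.write` got slower. That is the situation a single-number comparison would miss.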

The Database Pattern

Database optimization deserves its own pattern because it is the source of more performance bugs than any other layer in most stacks. The database skill is a stack of small skills working together:

Slow query collector - pulls the top 50 slow queries from the database log
Query analyzer - explains each query with the actual execution plan
Index recommender - suggests indexes that would help
Risk checker - flags index changes that would make writes slower

The fourth skill is the one most teams skip. They read about an index that would speed up a read query and add it without checking what it does to writes. The risk checker catches this. It also catches the opposite, indexes that exist but are never used and are silently slowing down every write.

The output of the database skill stack is usually a small set of changes. Drop two unused indexes, add one composite index, rewrite one query to avoid a full scan. Total impact on read latency: 40-60% improvement. Total impact on write latency: usually neutral or slightly improved because the unused indexes are gone.
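The query analyzer and the unused-index half of the risk checker can be illustrated with SQLite, which ships with Python. SQLite is only a stand-in here, since the article does not say which database is involved, and the `orders` schema, index names, and one-query workload are hypothetical. The sketch asks the planner how it would run a query before and after adding an index, then lists indexes that no query in the workload uses.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list[str]:
    """Return the human-readable steps of SQLite's plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # fourth column holds the description


def unused_indexes(conn, table: str, workload: list[tuple[str, tuple]]) -> list[str]:
    """Indexes on `table` that no query in the workload's plans mentions."""
    names = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
        (table,),
    )]
    plans = [step for sql, params in workload
             for step in query_plan(conn, sql, params)]
    return sorted(n for n in names if not any(n in step for step in plans))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, status, total) VALUES (?, ?, ?)",
    [(i % 50, "paid", 10.0 * i) for i in range(1000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = ?"
before = query_plan(conn, QUERY, (7,))   # full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
conn.execute("CREATE INDEX idx_orders_status ON orders (status)")
after = query_plan(conn, QUERY, (7,))    # index search

print(before, after)
print(unused_indexes(conn, "orders", [(QUERY, (7,))]))
```

This only shows the read side. Every index, used or not, must be updated on each write, so the unused ones are the candidates to drop first. A real audit would use the production query log as the workload rather than a single query.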

If you have not audited your indexes in the last six months, that is probably the highest-leverage performance work you could do this quarter. It does not take a senior database engineer. It takes Claude Code, the database log, and a couple of hours.

The N Plus One Detector

The N+1 query problem is the second most common performance bug in web applications, after missing indexes. It happens when code loads a list of records and then loads related records one at a time, producing one query per record. A page that should run two queries runs 200.

The N+1 detector skill takes an endpoint and a request trace, and flags every place where the request issues queries inside a loop. The output is a list of code locations with the suspected N+1 pattern, along with a suggested fix using batch loading or eager loading.
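The trace half of that idea is simple enough to sketch. The example below assumes the request trace is available as a list of SQL strings, which is an assumption about the trace format. It replaces literals with placeholders and counts how often each query shape repeats. Unlike the detector described above, it does not map findings back to code locations, and the threshold of 5 is arbitrary.

```python
import re
from collections import Counter

# String literals and standalone numbers are replaced with "?" so that
# queries differing only in their parameters share one shape.
_LITERAL = re.compile(r"'[^']*'|\b\d+\b")


def query_shape(sql: str) -> str:
    """Normalize a query so parameter values do not matter."""
    return re.sub(r"\s+", " ", _LITERAL.sub("?", sql)).strip().lower()


def find_n_plus_one(trace: list[str], threshold: int = 5) -> list[tuple[str, int]]:
    """Query shapes that repeat at least `threshold` times in one request."""
    counts = Counter(query_shape(sql) for sql in trace)
    return [(shape, n) for shape, n in counts.most_common() if n >= threshold]


# Hypothetical trace: one query for the list, then one per record.
trace = ["SELECT * FROM orders WHERE customer_id = 42"] + [
    f"SELECT * FROM order_items WHERE order_id = {i}" for i in range(1, 201)
]

suspects = find_n_plus_one(trace)
print(suspects)
```

The usual fix for a shape like this is a single `WHERE order_id IN (...)` query or the ORM's eager-loading option, so the page goes back to two queries.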

I run the N+1 detector on every new endpoint before it ships. About one in three endpoints has at least one N+1 issue when first written, even by experienced engineers. Catching them in development costs minutes. Catching them in production costs days, because by then they are usually entangled with other code that depends on the lazy loading behavior.

The Cache Pattern

Caching is the optimization most likely to backfire. A bad cache makes performance worse and harder to reason about. A good cache makes performance dramatically better at the cost of some additional complexity. The difference between a bad cache and a good cache is mostly invalidation.

The cache pattern I use with Claude Code starts with three questions:

What is the read-to-write ratio for this data?
What is the staleness tolerance?
What invalidates the cache and how?

If the read-to-write ratio is below 5 to 1, caching is probably not worth it. If the staleness tolerance is zero, caching is dangerous. If the invalidation strategy is not clear before I start, I do not start.
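Those three rules translate directly into a small checklist. The sketch below is an illustration, not the cache design skill. The `CacheCandidate` fields and the two example candidates are hypothetical, while the 5 to 1 threshold and the zero-staleness rule are taken from the paragraph above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheCandidate:
    name: str
    reads_per_hour: float
    writes_per_hour: float
    staleness_tolerance_s: float
    invalidation_strategy: Optional[str]  # None means "not decided yet"


def review(candidate: CacheCandidate) -> list[str]:
    """Return the concerns that should block this cache; empty means proceed."""
    concerns = []

    if candidate.writes_per_hour == 0:
        ratio = float("inf")
    else:
        ratio = candidate.reads_per_hour / candidate.writes_per_hour
    if ratio < 5:
        concerns.append(
            f"read-to-write ratio is {ratio:.1f} to 1, below 5 to 1"
        )

    if candidate.staleness_tolerance_s <= 0:
        concerns.append("staleness tolerance is zero")

    if not candidate.invalidation_strategy:
        concerns.append("no invalidation strategy defined")

    return concerns


order_detail = CacheCandidate("order detail", 9000, 300, 30, "evict on order update")
inventory = CacheCandidate("inventory count", 1200, 600, 0, None)

print(review(order_detail))
print(review(inventory))
```

The order detail cache passes all three checks. The inventory cache fails all three, which is the kind of result that ends a design review before any code is written.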

The cache design skill walks through these questions for any proposed cache and produces a design document with the tradeoffs. I review the document before writing any code. About 30% of the caches I considered did not survive this review, which means 30% of the performance work I almost did was performance theater I avoided.

What I Optimize First

When I take over a system or start a new performance project, the order of operations is roughly:

Profile the slowest endpoint
Audit the database indexes
Look for N+1 patterns
Check the logging hot path
Audit the response serialization
Consider caching last, only after the above are clean

The order is deliberate. Caching is the optimization with the highest variance in outcomes. Doing it last, after everything else is clean, means the cache is solving a real problem, not masking a different problem. Doing it first, as most teams do, means the cache often hides issues that get worse over time and surface during incidents.

If you start with profiling and database, you usually do not need much else. The most expensive optimizations rarely justify themselves once the cheap optimizations are done.

What I Got Wrong

Three lessons from the first month I worked this way.

The first lesson is that I trusted the hypothesis skill output too literally early on. The skill produced a hypothesis that sounded great. I shipped it. It regressed an unrelated endpoint. The hypothesis was technically correct but the skill did not have visibility into the dependency I broke. Now I treat hypotheses as starting points for human judgment, not as decisions.

The second lesson is that I ignored variance. A single profile run is noisy. If you compare before and after based on one run each, you might be reading noise. The diff and verify loop now runs each profile five times and reports the median plus the spread. Reading the spread tells me whether a 10% improvement is real or just within the noise floor.
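A minimal sketch of that check, with invented latency numbers: summarize five runs as median plus spread, then accept an improvement only if the change in median is larger than the noisier of the two spreads. The decision rule is an assumption for illustration. A stricter setup might use more runs and a proper statistical test.

```python
import statistics


def summarize_runs(latencies_ms: list[float]) -> dict:
    """Median and spread (max minus min) of repeated profile runs."""
    return {
        "median": statistics.median(latencies_ms),
        "spread": max(latencies_ms) - min(latencies_ms),
    }


def improvement_is_real(before_runs: list[float], after_runs: list[float]) -> bool:
    """True if the median gain exceeds the larger run-to-run spread."""
    before = summarize_runs(before_runs)
    after = summarize_runs(after_runs)
    gain = before["median"] - after["median"]
    noise_floor = max(before["spread"], after["spread"])
    return gain > noise_floor


before_runs = [940, 960, 950, 1010, 930]   # median 950, spread 80
clear_win = [860, 900, 850, 870, 840]      # median 860, spread 60
in_the_noise = [900, 930, 880, 960, 870]   # median 900, spread 90

print(improvement_is_real(before_runs, clear_win))
print(improvement_is_real(before_runs, in_the_noise))
```

The second case shows a 50 ms drop in the median, which looks like a 5% win on paper but sits inside a 90 ms spread.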

The third lesson is that I underestimated how much performance work is about removing code rather than adding it. The fastest code is no code. Most of the wins on my checkout endpoint came from deleting things, not adding things. Removing a serialization layer that was never used. Removing a middleware that was logging everything. Removing a caching layer that was making things worse. The profile told me what to delete. Without the profile, I never would have known.

FAQ

How long does it take to set up the profile skill?

For a Node.js or Python service, about an afternoon. For a JVM service, a day. For a polyglot service, longer because each runtime needs its own profiler integration. The setup cost pays back the first time the skill catches a hot path you did not suspect.

Do I need a load testing setup?

Yes, but it does not need to be elaborate. A simple script that hits the endpoint at a representative rate is enough. The point is not stress testing. The point is producing realistic profiles.

What about cold start performance?

Different problem, different skills. Cold start optimization is mostly about reducing initialization work and lazy loading dependencies. The profile skill works for cold starts but the analysis pattern is different.

How do I prioritize when everything is slow?

Start with the endpoint that costs the most aggregate latency, which is endpoint latency multiplied by request rate. The slowest endpoint is not always the most impactful one. A 200 ms endpoint called a million times a day matters more than a 10 second endpoint called twice a day.
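The arithmetic behind that comparison, as a short sketch. The endpoint names are hypothetical. The latencies and request rates are the two from the example above.

```python
def aggregate_latency_s(latency_ms: float, requests_per_day: int) -> float:
    """Total seconds per day that users spend waiting on an endpoint."""
    return latency_ms * requests_per_day / 1000


endpoints = {
    "/api/product": (200, 1_000_000),   # 200 ms, a million calls a day
    "/api/export": (10_000, 2),         # 10 seconds, two calls a day
}

ranked = sorted(
    ((name, aggregate_latency_s(ms, rate)) for name, (ms, rate) in endpoints.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, seconds in ranked:
    print(f"{name}: {seconds:,.0f} s/day")
```

The fast endpoint accounts for 200,000 seconds of waiting per day and the slow one for 20, a difference of four orders of magnitude.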

The Bigger Picture

Performance optimization used to be a senior engineer specialty. Junior engineers were warned away from it because the failure modes were so expensive. The people who could do it well had spent years building the intuition for which guesses would pay off and which would backfire.

Claude Code is collapsing that gap. The profile skill produces the same data the senior engineer would have demanded. The hypothesis skill produces the same options the senior engineer would have generated. The diff and verify loop catches the regressions that used to require code review from someone who had been burned by them before.

The result is that performance work becomes accessible to anyone who is willing to follow the workflow. You do not need years of intuition. You need the discipline to measure first, hypothesize second, change third, and measure again. The skills enforce the discipline. The discipline produces the results.

If you want to see the exact profile skill, hypothesis skill, and diff and verify loop I use, my full Claude Code performance setup is documented at nextools.hashnode.dev. Take what is useful, leave what is not, and ship faster code than you are shipping today.

The cost of performance work is collapsing. The systems that act on this first will be measurably faster than the ones that do not. Start with the profile. Everything else follows.