Originally published at nextools.hashnode.dev

Claude Code Performance Optimization Patterns: How I Cut Production Latency 60% Without Touching Most of My Code

The first time I tried to optimize a production service, I read three blog posts about caching, added Redis to one endpoint, and broke production for 40 minutes. The endpoint was faster. Everything else was slower because the connection pool was now exhausted. My CTO at the time had a phrase for this kind of work: "performance theater." You make the metric you watched go down, and three other metrics you did not watch go up.

That experience taught me something that took me years to articulate. Performance optimization is a research problem, not a coding problem. The coding part is easy. The hard part is figuring out which 5% of the code accounts for 80% of the latency, and which 2% of changes will not regress something else.

I run an ecommerce stack now where the slowest endpoint went from 2.4 seconds to 940 milliseconds in three weeks. I touched maybe 200 lines of code total. The other 19,800 lines stayed exactly the same. Claude Code did the research that made the targeted changes possible. Here is the workflow.


Why Most Performance Work Goes Sideways

The default approach to performance work is roughly this: notice something is slow, guess at the cause, change the suspected code, deploy, see if it helped. If the metric moves, declare victory. If not, repeat with a new guess.

This works in toy systems where there is one obvious bottleneck. In real systems it is dangerous. Real systems have layers, and the slow layer is rarely the one you suspect first. You can spend a week optimizing a database query when the actual bottleneck is a JSON serializer in the middleware. You can add caching to the wrong endpoint and exhaust a connection pool that the right endpoint depended on.

The pattern that works is the opposite. You measure first, you hypothesize second, you change third, and you measure again. The measurement step is the one most engineers skip because it feels like overhead. It is not overhead. It is the only thing standing between you and performance theater.

Optimizing without measurement is gambling. Optimizing with measurement is engineering. Most engineers are gamblers who think they are engineers.


The Profile Skill

Every performance investigation starts with a profile. The profile skill takes a service, an endpoint, and a load profile, and produces a flame graph plus a markdown summary of the top 20 hottest call paths.

The skill does three things that I used to do by hand:

  1. Spins up a profiler appropriate for the runtime
  2. Runs a representative load against the target endpoint
  3. Aggregates the results into a flame graph and a ranked list of hot paths

The output looks like a normal profile, but the markdown summary is the part that matters. It says things like "37% of latency is in JSON.stringify called from serializeOrder, called 14 times per request, called from the response middleware." That sentence is a complete diagnosis. You know what is slow, where it is called from, and how often. Most of the work of optimization is producing sentences like that.
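
If you want to see roughly what the skill automates, here is a minimal sketch of step 1 for a Node.js service, using the built-in inspector protocol. The SIGUSR2 trigger and the file naming are illustrative conventions, not part of the skill itself; the resulting .cpuprofile opens as a flame graph in Chrome DevTools or speedscope.

```typescript
// profiler.ts - import this module in the service entrypoint. Send the
// process SIGUSR2 once to start profiling and again to stop and write
// a .cpuprofile file. (POSIX only; SIGUSR2 is not available on Windows.)
import { Session } from "node:inspector";
import { writeFileSync } from "node:fs";

const session = new Session();
session.connect();

// Promisified wrapper around the inspector protocol's callback API.
const post = <T = unknown>(method: string): Promise<T> =>
  new Promise((resolve, reject) =>
    session.post(method, (err, result) =>
      err ? reject(err) : resolve(result as T)
    )
  );

let profiling = false;

process.on("SIGUSR2", async () => {
  if (!profiling) {
    await post("Profiler.enable");
    await post("Profiler.start");
    profiling = true;
    console.log("CPU profiler started");
  } else {
    const { profile } = await post<{ profile: object }>("Profiler.stop");
    writeFileSync(`cpu-${Date.now()}.cpuprofile`, JSON.stringify(profile));
    profiling = false;
    console.log("CPU profile written");
  }
});
```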

The first time I ran the profile skill on my checkout endpoint, the top hot path was something I would never have suspected. It was a logging library serializing a deep object graph for every log line. I thought logging was free. It was 22% of my checkout latency.

If you want the actual profile skill files I use, the setup is documented at nextools.hashnode.dev along with the rest of my Claude Code workflow. Adapt them to your runtime and start measuring.


The Hypothesis Skill

A profile tells you where time is being spent. It does not tell you what to do about it. The hypothesis skill takes a profile and produces a ranked list of optimization hypotheses with cost and risk estimates.

A typical output:

"Hypothesis 1: Replace JSON.stringify with a precompiled serializer. Cost: 4 hours. Risk: low. Expected gain: 15-25% latency reduction on checkout endpoint. Affected packages: 1."

"Hypothesis 2: Cache the order detail object for 30 seconds. Cost: 6 hours. Risk: medium (cache invalidation). Expected gain: 30-40% latency reduction on checkout endpoint. Affected packages: 3."

"Hypothesis 3: Move logging serialization off the hot path. Cost: 2 hours. Risk: low. Expected gain: 18-22% latency reduction on every endpoint. Affected packages: all."

Three hypotheses, three sets of tradeoffs. The hypothesis skill does not pick. It presents. I pick. The pattern that emerged is that the cheapest, lowest-risk hypothesis usually wins, because the highest-risk hypotheses tend to introduce regressions that eat the gains.

Hypothesis 3 in that example is what I shipped first. Two hours of work for an 18% gain across every endpoint, with low risk. That is the trade nobody offers you in performance theater because nobody profiled enough to find it.


The Diff and Verify Loop

Once I have picked a hypothesis, the diff and verify loop takes over. The pattern is the same every time:

  1. Implement the change
  2. Re-run the profile
  3. Compare before and after
  4. Decide ship or revert

The compare step is critical. The profile skill stores baselines. The diff and verify loop produces a side-by-side comparison showing whether the targeted hot path improved, whether other hot paths regressed, and whether the overall endpoint latency moved in the expected direction.
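
The comparison itself is simple once each run is summarized. A minimal sketch, assuming every profile is reduced to a map from hot-path name to percentage of total latency; the two-point regression threshold is an arbitrary default, not a number from the skill:

```typescript
// compare-profiles.ts - flag hot paths that moved more than the threshold
// in either direction between a stored baseline and a candidate run.
type ProfileSummary = Record<string, number>; // hot path -> % of total latency

interface Delta {
  path: string;
  before: number;
  after: number;
  change: number; // percentage points, negative is an improvement
}

function compareProfiles(
  baseline: ProfileSummary,
  candidate: ProfileSummary,
  threshold = 2
): { improved: Delta[]; regressed: Delta[] } {
  const improved: Delta[] = [];
  const regressed: Delta[] = [];
  const paths = new Set([...Object.keys(baseline), ...Object.keys(candidate)]);

  for (const path of paths) {
    const before = baseline[path] ?? 0;
    const after = candidate[path] ?? 0;
    const change = after - before;
    if (change <= -threshold) improved.push({ path, before, after, change });
    else if (change >= threshold) regressed.push({ path, before, after, change });
  }

  // Biggest movers first, so the report leads with what matters.
  improved.sort((a, b) => a.change - b.change);
  regressed.sort((a, b) => b.change - a.change);
  return { improved, regressed };
}
```

A non-empty `regressed` list is the signal to stop and investigate before shipping.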

About one out of five changes I make introduces a regression somewhere I did not expect. The diff and verify loop catches them before they ship. That is the whole point. You are not optimizing one number. You are optimizing one number while not regressing the other 50 numbers that nobody is watching.

A change that improves what you measured and regresses three things you did not measure is a net loss. The discipline of measuring everything is what separates real optimization from theater.


The Database Pattern

Database optimization deserves its own pattern because it is the source of more performance bugs than any other layer in most stacks. The database skill is a stack of small skills working together:

  1. Slow query collector - pulls the top 50 slow queries from the database log
  2. Query analyzer - explains each query with the actual execution plan
  3. Index recommender - suggests indexes that would help
  4. Risk checker - flags index changes that would make writes slower

The fourth skill is the one most teams skip. They read about an index that would speed up a read query and add it without checking what it does to writes. The risk checker catches this. It also catches the opposite: indexes that exist but are never used, silently slowing down every write.

The output of the database skill stack is usually a small set of changes. Drop two unused indexes, add one composite index, rewrite one query to avoid a full scan. Total impact on read latency: 40-60% improvement. Total impact on write latency: usually neutral or slightly improved because the unused indexes are gone.
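
To make the fourth skill concrete: assuming Postgres, the never-used half of the risk check is a single catalog query. A minimal sketch with the `pg` client; connection details come from the standard PG* environment variables:

```typescript
// unused-indexes.ts - list indexes with zero scans since stats were last
// reset. Unique and primary-key indexes are excluded because they enforce
// constraints even when reads never touch them.
import { Client } from "pg";

async function findUnusedIndexes(): Promise<void> {
  const client = new Client();
  await client.connect();

  const { rows } = await client.query(`
    SELECT s.relname      AS table_name,
           s.indexrelname AS index_name,
           pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
    FROM pg_stat_user_indexes s
    JOIN pg_index i ON i.indexrelid = s.indexrelid
    WHERE s.idx_scan = 0
      AND NOT i.indisunique
    ORDER BY pg_relation_size(s.indexrelid) DESC
  `);

  for (const row of rows) {
    console.log(
      `${row.table_name}.${row.index_name} (${row.index_size}) ` +
        `is never read but taxes every write`
    );
  }
  await client.end();
}

findUnusedIndexes().catch(console.error);
```

One caveat worth a comment in any real version: idx_scan counts reset with the statistics, so check how long the stats have been accumulating before dropping anything.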

If you have not audited your indexes in the last six months, that is probably the highest-leverage performance work you could do this quarter. It does not take a senior database engineer. It takes Claude Code, the database log, and a couple of hours.


The N Plus One Detector

The N+1 query problem is the second most common performance bug in web applications, after missing indexes. It happens when code loads a list of records and then loads related records one at a time, producing one query per record. A page that should run two queries runs 200.

The N+1 detector skill takes an endpoint and a request trace, and flags every place where the request issues queries inside a loop. The output is a list of code locations with the suspected N+1 pattern, along with a suggested fix using batch loading or eager loading.
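
To see the shape of the fix, here is a before-and-after sketch, again assuming Postgres via `pg`; the orders and customers schema is illustrative, not from my stack:

```typescript
// n-plus-one.ts - the bug and the batched fix side by side.
import { Client } from "pg";

// N+1: one query for the list, then one query per row inside the loop.
// 200 orders means 201 round trips to the database.
async function withCustomersSlow(client: Client, orderIds: number[]) {
  const { rows: orders } = await client.query(
    "SELECT id, customer_id FROM orders WHERE id = ANY($1)",
    [orderIds]
  );
  const result = [];
  for (const order of orders) {
    const { rows } = await client.query(
      "SELECT * FROM customers WHERE id = $1",
      [order.customer_id]
    );
    result.push({ ...order, customer: rows[0] });
  }
  return result;
}

// Batched: collect the foreign keys, load all related rows in one query,
// and join in memory. Two round trips regardless of list size.
async function withCustomersBatched(client: Client, orderIds: number[]) {
  const { rows: orders } = await client.query(
    "SELECT id, customer_id FROM orders WHERE id = ANY($1)",
    [orderIds]
  );
  const customerIds = [...new Set(orders.map((o) => o.customer_id))];
  const { rows: customers } = await client.query(
    "SELECT * FROM customers WHERE id = ANY($1)",
    [customerIds]
  );
  const byId = new Map(customers.map((c) => [c.id, c]));
  return orders.map((o) => ({ ...o, customer: byId.get(o.customer_id) }));
}
```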

I run the N+1 detector on every new endpoint before it ships. About one in three endpoints has at least one N+1 issue when first written, even by experienced engineers. Catching them in development costs minutes. Catching them in production costs days, because by then they are usually entangled with other code that depends on the lazy loading behavior.


The Cache Pattern

Caching is the optimization most likely to backfire. A bad cache makes performance worse and harder to reason about. A good cache makes performance dramatically better at the cost of some additional complexity. The difference between a bad cache and a good cache is mostly invalidation.

The cache pattern I use with Claude Code starts with three questions:

  1. What is the read-to-write ratio for this data?
  2. What is the staleness tolerance?
  3. What invalidates the cache and how?

If the read-to-write ratio is below 5 to 1, caching is probably not worth it. If the staleness tolerance is zero, caching is dangerous. If the invalidation strategy is not clear before I start, I do not start.
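
For illustration, here is the minimal in-process shape a surviving design takes, using the 30-second staleness tolerance from Hypothesis 2 above; the class and the names are illustrative, not the skill's output:

```typescript
// ttl-cache.ts - the three questions, encoded: the TTL is the staleness
// tolerance, and invalidate() is the explicit invalidation path that the
// write side must call.
interface Entry<V> {
  value: V;
  expiresAt: number;
}

class TtlCache<V> {
  private entries = new Map<string, Entry<V>>();

  constructor(private ttlMs: number) {}

  async getOrLoad(key: string, load: () => Promise<V>): Promise<V> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    const value = await load();
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }

  // If this never gets called from the write path, the design review
  // should have killed the cache.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}

// Usage: 30-second staleness tolerance on the order detail read path.
const orderCache = new TtlCache<object>(30_000);
// await orderCache.getOrLoad(`order:${id}`, () => loadOrder(id));
```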

The cache design skill walks through these questions for any proposed cache and produces a design document with the tradeoffs. I review the document before writing any code. About 30% of the caches I considered did not survive this review, which means 30% of the performance work I almost did was performance theater I avoided.


What I Optimize First

When I take over a system or start a new performance project, the order of operations is roughly:

  1. Profile the slowest endpoint
  2. Audit the database indexes
  3. Look for N+1 patterns
  4. Check the logging hot path
  5. Audit the response serialization
  6. Consider caching last, only after the above are clean

The order is deliberate. Caching is the optimization with the highest variance in outcomes. Doing it last, after everything else is clean, means the cache is solving a real problem, not masking a different problem. Doing it first, as most teams do, means the cache often hides issues that get worse over time and surface during incidents.

If you start with profiling and database, you usually do not need much else. The most expensive optimizations rarely justify themselves once the cheap optimizations are done.


What I Got Wrong

Three lessons from the first month I worked this way.

The first lesson is that I trusted the hypothesis skill output too literally early on. The skill produced a hypothesis that sounded great. I shipped it. It regressed an unrelated endpoint. The hypothesis was technically correct but the skill did not have visibility into the dependency I broke. Now I treat hypotheses as starting points for human judgment, not as decisions.

The second lesson is that I ignored variance. A single profile run is noisy. If you compare before and after based on one run each, you might be reading noise. The diff and verify loop now runs each profile five times and reports the median plus the spread. Reading the spread tells me whether a 10% improvement is real or just within the noise floor.
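
The computation is trivial; the discipline is running it every time. A sketch with made-up latency numbers:

```typescript
// spread.ts - median plus spread over repeated profile runs. Only believe
// a delta that is clearly larger than the spread of the runs themselves.
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function summarize(latenciesMs: number[]) {
  return {
    median: median(latenciesMs),
    spread: Math.max(...latenciesMs) - Math.min(...latenciesMs),
  };
}

const before = summarize([941, 980, 955, 948, 1012]); // median 955, spread 71
const after = summarize([869, 902, 885, 876, 915]);   // median 885, spread 46
// A 70 ms median improvement with a 71 ms spread is still plausible noise;
// run it again before declaring victory.
```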

The third lesson is that I underestimated how much performance work is about removing code rather than adding it. The fastest code is no code. Most of the wins on my checkout endpoint came from deleting things, not adding things. Removing a serialization layer that was never used. Removing a middleware that was logging everything. Removing a caching layer that was making things worse. The profile told me what to delete. Without the profile, I never would have known.


FAQ

How long does it take to set up the profile skill?

For a Node.js or Python service, about an afternoon. For a JVM service, a day. For a polyglot service, longer because each runtime needs its own profiler integration. The setup cost pays back the first time the skill catches a hot path you did not suspect.

Do I need a load testing setup?

Yes, but it does not need to be elaborate. A simple script that hits the endpoint at a representative rate is enough. The point is not stress testing. The point is producing realistic profiles.
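
Something in the spirit of this sketch is enough; the URL and rate are placeholders:

```typescript
// load.ts - hit one endpoint at a fixed, representative rate.
const url = "http://localhost:3000/checkout"; // placeholder endpoint
const requestsPerSecond = 20;
const durationSeconds = 60;

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

async function run(): Promise<void> {
  const intervalMs = 1000 / requestsPerSecond;
  const stop = Date.now() + durationSeconds * 1000;
  while (Date.now() < stop) {
    // Fire-and-forget so a slow response does not drop the request rate.
    fetch(url).catch(() => {});
    await sleep(intervalMs);
  }
}

run();
```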

What about cold start performance?

Different problem, different skills. Cold start optimization is mostly about reducing initialization work and lazy loading dependencies. The profile skill works for cold starts but the analysis pattern is different.

How do I prioritize when everything is slow?

Start with the endpoint that costs the most aggregate latency, which is endpoint latency multiplied by request rate. The slowest endpoint is not always the most impactful one. A 200 ms endpoint called a million times a day adds up to about 55 hours of aggregate latency; a 10 second endpoint called twice a day adds up to 20 seconds.


The Bigger Picture

Performance optimization used to be a senior engineer specialty. Junior engineers were warned away from it because the failure modes were so expensive. The people who could do it well had spent years building the intuition for which guesses would pay off and which would backfire.

Claude Code is collapsing that gap. The profile skill produces the same data the senior engineer would have demanded. The hypothesis skill produces the same options the senior engineer would have generated. The diff and verify loop catches the regressions that used to require code review from someone who had been burned by them before.

The result is that performance work becomes accessible to anyone who is willing to follow the workflow. You do not need years of intuition. You need the discipline to measure first, hypothesize second, change third, and measure again. The skills enforce the discipline. The discipline produces the results.

If you want to see the exact profile skill, hypothesis skill, and diff and verify loop I use, my full Claude Code performance setup is documented at nextools.hashnode.dev. Take what is useful, leave what is not, and ship faster code than you are shipping today.

The cost of performance work is collapsing. The systems that act on this first will be measurably faster than the ones that do not. Start with the profile. Everything else follows.
