Part 1 of 2 — This piece covers deployed applications and services. Part 2 covers library projects, which come with a different set of constraints.
Where Most People Go Wrong
You join a project and someone says, "We also have performance issues." Where do you look first? Someone files a ticket: "The system feels slow." Someone else asks whether you've looked at database connection pool settings, tweaked thread counts, adjusted timeout configs, or optimized queries. Sound familiar?
Two weeks later, latency has dropped by 5%. Everyone claps. But the system still feels slow.
Figure 1: "If you torture the data long enough, it will confess to anything." — Ronald Coase
What happened? We walked in with a theory and found evidence to support it. The DB query was slightly inefficient. We also increased thread counts just in case, and maybe bumped some resources. Fixing these did help a little. But the real bottleneck was a missing cache on the most-used workflow, a fix that would have taken two days.
This is the trap. And it's remarkably easy to fall into — even for experienced engineers.
The goal of this article is to give you a systematic approach so that you follow data, not intuition. It's not a playbook, more of a starting point: performance issues are almost always unique. And no system is perfect. You can always find small issues in any system, but fixing them may not yield the results you want.
Before You Start: These Are The 5 Things You Absolutely Need
You cannot do a meaningful performance investigation without these. If any are missing, get them first — otherwise you're guessing in the dark.
1. Access to the codebase
You need to be able to trace execution paths, not just read dashboards. Dashboards tell you that something is slow. The code tells you why.
2. A monitoring system
Even basic metrics — request latency, error rate, CPU usage — are non-negotiable for a deployed service. Without them, you're navigating blind. (For libraries, this is different; we cover that in Part 2.)
If it's not in place, as is the case in some systems, create one. You need concrete proof of what your changes achieved. It may turn out you've made things worse. A monitoring system is the mirror that tells you the truth about your changes.
3. Understanding of the codebase, or access to a subject matter expert (SME)
This is the one people underestimate most. You cannot optimize code or fix a system you don't understand. If it's not your codebase, find the person who knows it and treat them as a key collaborator.
Hot tip: If possible, use AI agents to analyze your codebase and generate a comprehensive design of each flow, even if you know the code or have an SME at hand. (Use AI as a starting point, but trust your own tracing more. Also, make sure your organization is comfortable with an AI agent analyzing its codebase.)
Figure 3: You can't fix a system you don't understand
4. Knowledge of the most-used workflows
Not every feature gets equal traffic, and fixing a performance issue in a rarely used workflow may not be worthwhile right now. A bug in the login flow matters more than a bug in the settings page. Your monitoring system will usually tell you this directly — look at request frequency, not just latency.
5. Defined performance targets
"Fast" is not a target. "P99 latency under 200ms for search requests under normal load" is a target. Without a specific number, you can't declare victory and you can't prioritize.
If targets aren't defined, work with someone to arrive at an achievable number. You can't run 10 DB queries and hit a 10ms latency. This number is your true north, guiding you toward the end goal.
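The "10 queries, 10ms" point is just arithmetic, and it's worth doing explicitly before agreeing to a target. A sketch of a per-layer latency budget (all numbers here are illustrative, not measured):

```python
# Hypothetical per-layer latency budget for a 200ms P99 target.
# Every figure below is an assumption for illustration.
target_ms = 200

budget = {
    "load_balancer": 2,
    "app_logic": 30,
    "db_queries": 10 * 15,   # 10 sequential queries at ~15ms each
    "serialization": 10,
}

total = sum(budget.values())
print(total, "ms vs target", target_ms)  # 192 ms: barely fits; one more query blows it
```

If the budget doesn't add up on paper, no amount of tuning will make it add up in production; the target or the design has to change.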
Think of these as your entry conditions. Once you have them, several other things become discoverable through investigation rather than needing to be handed to you upfront:
- Infrastructure topology — visible from deployment configs, the cloud console, or a conversation with DevOps. How many instances are deployed? What resources does each pod or DB have? One pod with 2GB RAM and 2 CPU cores will not perform the same as two pods with 1GB RAM and 1 core each.
- Dependency performance map — which DBs, caches, queues, and external APIs does this service call, and what are their typical latencies? You can usually get this from code and configuration files, but if it's documented, even better.
- Data characteristics — volume, growth rate, and shape of the data flowing through the system. Processing 100KB messages is different from processing 10GB messages. A configuration that works for 10,000 requests per hour may be completely useless at 10 million messages per hour.
- A reproducible test scenario — more on this below
Figure 4: System Performance Framework
The First Thing You Build: A Reproducible Scenario
Before touching a single line of code or configuration, build a controlled test that demonstrates the performance problem.
This sounds obvious. Most people skip it.
Here's why it matters: without a reproducible scenario, you can't verify that anything you did actually helped. You might deploy a fix, check production metrics an hour later, and see latency improved. But was that your fix? Or lower traffic? Or a cache that warmed up? You don't know.
The scenario is your measuring stick. It's the equivalent of a failing test in TDD — you're not done until it passes, and you can't call it passing if you can't run it.
A good scenario answers:
- What operation are we measuring? (e.g., GET /patients?name=smith)
- Under what load? (e.g., 50 concurrent users)
- With what data? (e.g., 1 million patient records in the DB)
- What does "passing" look like? (e.g., P95 < 150ms)
The Investigation Process
Step 1 — Measure First, Theorize Later
Pull up your monitoring and answer these questions with data:
- Which endpoints or operations are slow? If multiple operations miss their SLA, pick the one with the highest delta between SLA and actual performance. (Look at latency percentiles, not averages.)
- Is it constant or spiky? Spiky usually points to GC pauses, lock contention, or cache misses; constant usually points to an algorithmic or query problem. That distinction helps you focus on the real issue. (Spiky latency can also be caused by network jitter or cold caches, but let's ignore that for now.)
- Is it correlated with load? If latency is fine at 10 req/s but degrades at 100 req/s, you most likely have a concurrency or resource saturation problem.
- When did it start? A sudden change usually means a deployment, a data volume threshold being crossed, or a configuration change. It could also be a change in a 3rd-party service or an upgrade to a newer library version.
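The constant-vs-spiky question above can even be answered numerically rather than by eyeballing a graph. Here's a rough heuristic sketch using the coefficient of variation; the 0.5 threshold is my own assumption, not a standard.

```python
import statistics

def classify(latencies_ms, threshold=0.5):
    """Crude spiky/constant classifier: high relative spread means spiky."""
    mean = statistics.mean(latencies_ms)
    cv = statistics.pstdev(latencies_ms) / mean  # coefficient of variation
    return "spiky" if cv > threshold else "constant"

print(classify([100, 102, 98, 101, 99]))             # constant
print(classify([50, 48, 52, 900, 51, 47, 950, 49]))  # spiky
```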
Figure 5: *Do not form a hypothesis yet. Just collect facts.*
Step 2 — Identify the Hot Path
Not everything in the system is equally important. Find the operations that are:
- Called frequently
- Slow (high latency)
- High impact to the user
The sweet spot where all three overlap is where you focus. A rarely called admin endpoint that takes 2 seconds is less important than a core API that takes 300ms and is called 500 times per second.
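One simple way to make that trade-off concrete is to rank operations by total time cost: calls per second times latency, a rough proxy for user impact. The endpoints and numbers below are made up for illustration.

```python
# Illustrative prioritization sketch; names and numbers are invented.
endpoints = [
    {"name": "GET /admin/report", "calls_per_sec": 0.1, "p95_ms": 2000},
    {"name": "GET /search",       "calls_per_sec": 500, "p95_ms": 300},
    {"name": "GET /settings",     "calls_per_sec": 2,   "p95_ms": 400},
]

# Total time cost per second = frequency x latency.
ranked = sorted(endpoints,
                key=lambda e: e["calls_per_sec"] * e["p95_ms"],
                reverse=True)
print(ranked[0]["name"])  # GET /search dominates despite lower per-call latency
```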
Figure 6: AI generated this messy image. Still learning how to give good prompt to generate relevant image. (This line is not generated by AI :stuck_out_tongue)
Step 3 — Trace the Request End to End
For the hot path you identified, trace a single request through every layer:
Client → Load Balancer → App Server → [Business Logic] → Database/Cache/External API → Response
Figure 7: Usual path of a single request
At each layer, ask: how much time does this layer contribute? Is it acceptable?
Distributed tracing tools (Jaeger, Zipkin, Datadog APM) show you this as a flame graph or waterfall. If you don't have these, your logs might tell you; if even that isn't possible, add logs to capture these details. And again, don't assume "my business logic isn't consuming time, it can only be the DB or a 3rd-party API."
What you're looking for is where time is actually spent, not where you assume it's spent.
A common finding: 80% of latency is in one DB query. Another common finding: 30% is in serialization you'd never have guessed. Another: a slow 3rd party API call sitting in the middle of what should be a fast operation.
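When you have no APM at all, a hand-rolled span timer gets you a surprisingly long way. A minimal sketch, where the layer names and sleep times are purely illustrative:

```python
import time
from contextlib import contextmanager

timings = {}  # layer name -> milliseconds spent

@contextmanager
def span(layer):
    """Record how long the wrapped block took, like a one-process trace span."""
    start = time.perf_counter()
    yield
    timings[layer] = (time.perf_counter() - start) * 1000

def handle_request():
    with span("business_logic"):
        time.sleep(0.005)
    with span("db_query"):
        time.sleep(0.040)  # the surprise is usually here
    with span("serialization"):
        time.sleep(0.010)

handle_request()
total = sum(timings.values())
for layer, ms in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{layer}: {ms:.1f}ms ({ms / total:.0%})")
```

Printed as a sorted breakdown, this is the same information a waterfall view gives you, just without the tooling.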
Figure 8: Time breakdown across layers
Once your trace tells you WHICH layer is slow, you need to look at the 'shape' of that slowness to categorize it.
Step 4 — Categorize the Bottleneck
Once you've found where time is spent, categorize it. Each category needs a very different solution.
CPU-bound
The service is doing heavy computation.
Symptoms: High CPU utilization, scales linearly with load.
Example: Running validation or transformation on every request without caching the result where possible.
I/O-bound
Time is spent waiting on DB, network, or disk.
Symptoms: CPU is low but latency is high, thread pool exhaustion under load.
Example: An N+1 query — fetching a list of 100 items, then making 100 individual DB calls for related data.
Memory / GC pressure
Lots of object allocation causing garbage collection pauses.
Symptoms: Latency spikes rather than constant slowness, heap usage that grows and drops periodically.
Example: Creating large intermediate collections in a loop that runs thousands of times per request.
Concurrency / contention
Threads waiting on each other.
Symptoms: High thread count, low CPU, latency that gets much worse under concurrent load.
Example: A shared resource protected by a synchronized block that every request needs to acquire.
Data volume
Queries or algorithms that worked at 10k records fall apart at 10M.
Symptoms: Gradual degradation over time, correlated with data growth.
Example: A missing index, a full table scan, or an in-memory sort of a result set that used to be small.
These are just some common categories, not an exhaustive list.
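The N+1 pattern from the I/O-bound category is worth seeing concretely, since it's so common. A self-contained sketch using an in-memory SQLite database; the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE items (order_id INTEGER, sku TEXT);
""")
conn.executemany("INSERT INTO orders VALUES (?)", [(i,) for i in range(100)])
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"sku-{i}") for i in range(100)])

# N+1: one query for the list, then one per row -> 101 round trips.
order_ids = [r[0] for r in conn.execute("SELECT id FROM orders")]
slow = [conn.execute("SELECT sku FROM items WHERE order_id = ?", (oid,)).fetchall()
        for oid in order_ids]

# Fix: a single joined query -> 2 round trips total.
fast = conn.execute(
    "SELECT o.id, i.sku FROM orders o JOIN items i ON i.order_id = o.id"
).fetchall()
```

In-process SQLite hides the cost; over a network, each of those 100 extra round trips adds real latency, which is exactly why the trace shows time "in the DB" when the real problem is in the access pattern.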
Figure 9: Categorize the issue
Step 5 — Validate Before You Fix
Before writing a single line of fix code, validate your hypothesis:
- Can you reproduce the slow behavior in your reproducible scenario?
- Can you explain why this specific thing is causing the slowness?
- Does the data support it? (e.g., slow query logs, profiler output, GC logs)
If you can answer yes to all three, you've found the root cause. Now fix it.
If not, go back to Step 3 and keep tracing. At this stage you may end up finding multiple issues, not a single root cause. Use your judgment to pick your battles; your primary focus is the root cause.
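Validation (and later, verifying the fix) boils down to comparing percentiles from your reproducible scenario before and after the change. A sketch with made-up latency samples:

```python
import statistics

def p95(samples):
    """95th-percentile latency from a list of samples (ms)."""
    return statistics.quantiles(samples, n=100)[94]

# Illustrative samples from two scenario runs, not real measurements.
baseline = [180, 190, 400, 210, 185, 950, 200, 195, 188, 192] * 10
after_fix = [120, 130, 140, 125, 135, 150, 128, 132, 138, 129] * 10

improvement = (p95(baseline) - p95(after_fix)) / p95(baseline)
print(f"P95: {p95(baseline)}ms -> {p95(after_fix)}ms ({improvement:.0%} better)")
```

Comparing averages here would understate the win; the outliers are exactly what your users feel.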
Figure 10: Validate *before* you fix
The Prejudice Problem
There is one trap I see very commonly (and I fell into it myself when I was less experienced).
If you start a performance investigation already believing you know the answer — "it's the DB", "it's the thread pool", "it's the network", "it's the 3rd-party API" — you will almost always find evidence to support that belief. No system is perfect. If you look hard enough at any layer, you'll find something to improve. And improving it will most likely help a little.
But "a little" is not the same as fixing the root cause. And chasing the wrong thing costs weeks of effort while users continue to experience slowness.
The discipline is to stay in data-collection mode until the data points clearly at something. Your hypothesis should be the last thing that forms, not the first.
Figure 11: Tackling low-hanging fruit may not be the best route to performance gains
A Note on Performance Targets
One thing that kills performance investigations: nobody defined what "good" looks like. You fix something, latency improves, but no one knows whether it's enough.
Before you start, establish numbers. Some useful ones (look these terms up if you're not sure):
- P50 / P95 / P99 latency — average hides outliers; percentiles don't
- Throughput at peak load — requests per second the system must handle
- Error rate under load — a system that's fast but drops 2% of requests isn't performing well
- Resource utilization ceiling — at what CPU/memory level does performance degrade?
These become your success criteria. The reproducible scenario you built at the start should test against these.
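The first bullet above, "average hides outliers, percentiles don't", is easy to demonstrate. A tiny sketch with invented numbers: one pathological request out of a hundred barely moves the mean but dominates the P99.

```python
import statistics

# 99 fast requests and 1 pathological one (illustrative numbers).
latencies = [100] * 99 + [5000]

mean = statistics.mean(latencies)  # looks almost fine
q = statistics.quantiles(latencies, n=100)
p50, p99 = q[49], q[98]            # median vs 99th percentile
print(mean, p50, p99)
```

The mean here sits near 149ms and the median at 100ms, while the P99 is close to 5 seconds; a target expressed only as an average would let this system pass.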
Summary
Performance investigation done right is less glamorous than people expect. It's mostly measurement, tracing, and resisting the urge to jump to a solution.
The process, stripped down:
1. Establish the 5 prerequisites before starting
2. Build a reproducible scenario first
3. Measure — let data tell you where time is spent
4. Identify the hot path
5. Trace end to end across layers
6. Categorize the bottleneck type
7. Validate your hypothesis before fixing
8. Verify the fix using your reproducible scenario
The mindset that makes this work: follow the data, not your gut. Your intuition about where the problem is might be right. But until the data confirms it, it's just a theory.
Part 2 covers the same topic for library projects — where you don't have a deployment, monitoring is your responsibility to build, and "production" is someone else's process.
