
Faisal Dilawar

Investigating Performance Issues In A Library project

│ Part 2 of 2 — This piece covers library projects. Part 1 covers deployed applications and services, which come with a different set of constraints.

The Fundamental Difference

In Part 1, we talked about investigating performance in a deployed system — one where we control the runtime and monitoring, and can trace requests end to end.

Libraries are different beasts altogether. We ship code. Someone else runs it.

We don't control the thread pool size, the hardware, or how many times our function gets called. We don't have dashboards. Usually we don't have logs. And the person filing the bug report often says something along the lines of "your library is slow" — with no reproducible scenario, no profiler output, and no context about how they're using it.

This is the library performance problem. And it requires a different mindset.

Don't Try to Fix It...

Before we go any further, I want to put this out there: if you don't own the library code, have no access to an SME, and on top of that have no production data like logs and monitoring, don't attempt to fix the performance issue. You will most likely fail to find and fix the root cause.
If you are in a pressure situation where you have to stop the bleeding without the above tools, this article won't help you. Say a prayer, start debugging blindly, and hope you find a band-aid to stop the immediate bleeding.
Throughout this article I will mention a few conditions where it's better to stop and ask for more details.

Where Most People Go Wrong

Just like part 1, the instinct is to open the codebase and start looking for "obviously slow" things. Maybe there's an allocation in a hot loop. Maybe a regex is being compiled on every call. You find something, fix it, release a patch, and close the issue (you missed saying a prayer in this case).

Two weeks later, the user says it's still slow.

What happened? You probably optimized a piece of code that wasn't the bottleneck in their specific usage pattern. Your benchmark showed improvements. Their workload did not.

The trap is the same as Part 1 — you acted on intuition instead of data. But in a library, the data is harder to get,
which makes the trap easier to fall into.

The Prejudice Problem (Library Edition)

The same trap from Part 1 applies here, but with an extra layer: you're tempted to assume the problem is in the client's
code, not yours.

"They must be calling it wrong." "They're not reusing the object." "Their environment is misconfigured."

Sometimes that's true. But start with the assumption that the problem is real and in your library. Prove otherwise with
data.

Before You Start: The 5 Things You Need (Library Edition)

These are different from Part 1. Some overlap, but the constraints change what's actually achievable.

  1. A clear problem statement from the reporter. "Your library is slow" is not actionable. You need: Which API? What input size? What does slow mean — latency, throughput, memory? Push back until you have specifics. A good problem statement is the foundation of everything that follows.
    If a clear problem statement is not available, don't proceed.

  2. A reproducible scenario you control
    Unlike Part 1, you probably can't look into someone else's production environment. You need to build the scenario yourself — a
    benchmark or test that demonstrates the reported problem under controlled conditions. If you can't reproduce it, you can't
    fix it and you can't verify the fix. This beats asking users to test your changes for you and report back whether they worked.
    It's not a strict blocker, but it is vital for having confidence in your fix without resorting to gut feeling.

  3. Understanding of your own library's design
    This sounds obvious. It isn't. Libraries accumulate complexity, and the person investigating may not be the original author.
    Know the hot paths — the APIs that get called most frequently, the ones that process large inputs, the ones that are called in loops. These are your candidates.
    Here an SME can be really helpful.

  4. Knowledge of common usage patterns
    You don't control how clients use your library, but you can study it. If possible, look at your documentation examples, your issue
    tracker, your GitHub discussions. How do people actually call your APIs? What input sizes are typical? What do they call
    in loops? This shapes where you look.
    This usually reduces your debug time.

  5. Defined performance targets
    Same as Part 1 — "fast" is not a target. Define what acceptable looks like: throughput at a given input size, memory
    allocation per operation, latency at P99. Without this, you can't declare success.
    This will be your goal post.

Once you have these, several other things become discoverable:

  • Typical input characteristics — size, shape, edge cases. A library that handles 1KB payloads efficiently may fall apart at 100MB.
  • Call frequency patterns — is your API called once at startup or thousands of times per second in a hot loop (A heavily executed block of code that repeats rapidly, where even tiny inefficiencies multiply into significant performance bottlenecks.)? The answer changes what matters. Like Part 1, we don't worry too much about the one call at startup for performance issues.
  • Runtime environment assumptions — JVM version, GC settings, available memory. You can't control these, but you can document what you've tested against and what you assume. It also helps if you document known issues with some runtime environments.

The First Thing You Build: A Reproducible Benchmark

Before touching any code, build a benchmark that demonstrates the problem like we discussed in pre-requisites.

This is your equivalent of the reproducible scenario from Part 1 — but in a library context, it's entirely your
responsibility to construct. The reporter won't hand it to you.

A good benchmark answers:

  • Which API are we measuring? (e.g. Parser.parse(input))
  • With what input? (e.g. a 10MB JSON document)
  • Under what call pattern? (e.g. called 1,000 times in a loop)
  • What does passing look like? (e.g. throughput > 500 ops/sec)

Use a proper benchmarking tool — JMH for Java, timeit/pytest-benchmark for Python.

Hot Tip: Warm up the runtime before measuring. JIT compilers, class loaders and caches all affect early measurements. You would be surprised how skewed your benchmark will be otherwise.
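As a rough sketch of what such a benchmark can look like in Python — the `parse` function here is a stand-in for the library API under investigation, and `json.loads` merely simulates its work:

```python
import json
import time

def parse(payload: str) -> dict:
    # Stand-in for the library API under test.
    return json.loads(payload)

def benchmark(fn, payload, warmup=1_000, iterations=10_000):
    # Warm up first: caches, lazily built tables, and (on JIT
    # runtimes) freshly compiled code all skew early measurements.
    for _ in range(warmup):
        fn(payload)
    start = time.perf_counter()
    for _ in range(iterations):
        fn(payload)
    elapsed = time.perf_counter() - start
    return iterations / elapsed  # throughput in ops/sec

payload = json.dumps({"items": list(range(1_000))})
ops_per_sec = benchmark(parse, payload)
print(f"{ops_per_sec:,.0f} ops/sec")
```

A hand-rolled loop like this is fine for a first reproduction; a proper tool (JMH, pytest-benchmark) will handle warmup, outlier rejection, and statistics far more rigorously.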

The Investigation Process

Step 1 — Again, Reproduce First, Theorize Later

Run your benchmark. Confirm the problem exists under controlled conditions.

If you can't reproduce it, you have three options:

  • Go back to the reporter and get more detail about their environment and usage pattern
  • Expand your benchmark to cover more scenarios until you find the one that triggers it
  • Don't proceed with optimization.

Do not skip this step. Do not start reading code looking for problems until you have a benchmark that shows the problem.
Otherwise you're optimizing in the dark.

Step 2 — Profile, Don't Guess

Once you can reproduce the problem, profile it. Don't read the code — profile it.

Attach a profiler to your benchmark run and look at where time is actually spent. e.g. JFR (Java Flight Recorder) for Java/Kotlin or py-spy, cProfile for Python.
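To illustrate with Python's built-in `cProfile` (the `hot_function` and `entry_point` names are invented for this sketch):

```python
import cProfile
import io
import pstats

def hot_function(n):
    # Deliberately does enough work to dominate the profile.
    return sum(i * i for i in range(n))

def entry_point():
    return [hot_function(10_000) for _ in range(50)]

profiler = cProfile.Profile()
profiler.enable()
entry_point()
profiler.disable()

# Sort by cumulative time: the functions at the top are the hot path.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```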

What you're looking for is a flame graph (A visual representation of a call stack where the width of each block shows exactly how much CPU time a function and its children consumed) or call tree that shows you which functions consume the most time. The thing you thought was slow may not be. The thing you never suspected could be.

Figure 1: Flame Graph

Step 3 — Identify the Hot Path in Your Library

From the profiler output, identify which internal functions are on the critical path. These are the ones worth optimizing.

Ask:

  • Is the time in your code, or in a dependency you're calling?
  • Is it CPU time (computation) or wall time (waiting on I/O, locks, or allocations)?
  • Is it one slow call, or many fast calls that add up?

The last one is very common in libraries. A single call to your API might look fine. But if the client calls it
10,000 times per second, a 50-microsecond allocation per call becomes 500ms of GC pressure (The performance penalty caused by the Garbage Collector frequently pausing the application to clean up a high volume of rapidly created, short-lived objects.) per second.
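The arithmetic behind that claim is worth checking explicitly:

```python
per_call_seconds = 50e-6      # 50 microseconds of overhead per call
calls_per_second = 10_000     # the client's call rate, not yours
busy_fraction = per_call_seconds * calls_per_second
print(f"{busy_fraction * 1000:.0f} ms of every second")  # 500 ms
```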

Step 4 — Categorize the Bottleneck

Same categories as Part 1, but with library-specific nuances:

  • CPU-bound: Heavy computation per call. Common in parsing, serialization, cryptography, compression. Look for algorithmic improvements — better data structures, avoiding redundant work, caching computed results.
  • Allocation / GC pressure: Creating too many short-lived objects. This is the most common library performance problem. The client pays the GC cost, not you. Look for object pooling, reusable buffers, or returning primitives instead of boxed types.
  • I/O-bound: Less common in pure libraries, but relevant if your library wraps file, network, or database access. Look at whether you're doing unnecessary I/O or whether async patterns would help.
  • Concurrency / thread safety overhead: If your library uses locks to be thread-safe, those locks may be contention points under concurrent load. Look at whether the locking granularity is appropriate, or whether lock-free structures are viable.
  • Initialization cost amortization (Paying a heavy, one-time execution cost upfront—like building a lookup table or parsing a configuration—so that all subsequent calls process much faster.): Some libraries do expensive work at construction time (loading configs, compiling regexes, building lookup tables). If clients are constructing your objects in a loop instead of reusing them, the fix might be documentation, not code — or making the expensive object clearly reusable.
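A minimal illustration of that last category, with a hypothetical `Codec` class standing in for any object whose construction is expensive:

```python
import time

class Codec:
    def __init__(self):
        # Expensive one-time setup: build a large lookup table.
        self.table = {i: i * i for i in range(50_000)}

    def encode(self, x):
        return self.table[x]

def construct_per_call(values):
    # Anti-pattern: pays the table-building cost on every iteration.
    return [Codec().encode(v) for v in values]

def construct_once(values):
    codec = Codec()  # pay the cost once, amortize it over all calls
    return [codec.encode(v) for v in values]

values = list(range(100))

start = time.perf_counter()
construct_per_call(values)
per_call = time.perf_counter() - start

start = time.perf_counter()
construct_once(values)
once = time.perf_counter() - start

print(f"construct per call: {per_call:.3f}s, construct once: {once:.3f}s")
```

If your profiler shows constructor time dominating, the fix may be a documentation note ("reuse this object") rather than a code change.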

Step 5 — Validate Before You Fix

Same discipline as Part 1. Before writing a fix:

  • Can your benchmark reproduce the problem consistently?
  • Can you explain why this specific thing is causing the slowness?
  • Does the profiler output support it?

If yes to all three — fix it. If not, keep profiling.

One extra check for libraries: make sure the fix doesn't break correctness. Performance optimizations in libraries could
involve caching, mutability, or reduced copying — all of which can introduce subtle bugs. Your fix needs to pass the full
test suite, not just the benchmark.
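As a tiny illustration of how a caching "optimization" can silently break correctness — the `tokenize` function here is hypothetical:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def tokenize(text):
    # Performance "fix": cache the result of an expensive operation.
    return text.split()

first = tokenize("a b c")
first.append("MUTATED")       # a client mutates the returned list...
second = tokenize("a b c")    # ...and the cache hands the mutated
print(second)                 # list to every later caller:
                              # ['a', 'b', 'c', 'MUTATED']
```

The usual fix is to return an immutable value (a tuple) or a defensive copy. A test suite that catches this class of bug is worth more than the benchmark win.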

Step 6 — Verify and Document

Run your benchmark again after the fix. Measure the delta. Does it match your expectation?

Then document it:

  • What was the problem?
  • What was the fix?
  • What input sizes and call patterns does the improvement apply to?
  • Are there any trade-offs? (e.g., higher memory usage for better throughput)

This matters because library users need to understand when they'll see the benefit. A fix that helps at 10MB inputs may
not matter at 1KB inputs. Be honest and realistic about the scope.

Getting Closer to Production Visibility (Optional, But Powerful)

One of the hardest parts of library performance work is that you're investigating blind. The client has the production
environment. You have a benchmark. There's a gap between those two things, and that gap is where a lot of investigations
stall.

There are a few ways to close it.

  • Build optional diagnostic logging into your library.
  • Most logging frameworks support a concept of named loggers at configurable levels. If your library uses one (like SLF4J in Java), clients can enable debug-level output from your library without changing your code. Use this. Log things that matter for performance: input sizes, time spent in expensive operations, cache hit/miss rates, retry counts. Keep it off by default. But make it easy to turn on.
  • When a client reports a performance issue, your first ask can be: "Can you enable debug logging for our library and share the output?" That single step can replace hours of guessing.
  • Expose timing hooks or callbacks. Some libraries go further and expose explicit instrumentation hooks — callbacks or interfaces that clients can implement to receive timing data. This lets clients pipe your library's internal timings directly into their existing monitoring system — the same dashboards they use for everything else. You get visibility into their production environment without needing access to it. They get metrics without having to instrument your code themselves. Something like:
library.setMetricsListener(event -> {
    myMonitoringSystem.record(event.operationName(), event.durationMs());
});
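The same two ideas look like this in Python, using the standard `logging` module — the logger name `mylib` and the `parse` function are placeholders:

```python
import logging

# Library side: a named logger that is silent by default.
# NullHandler keeps the library quiet until the client opts in.
logger = logging.getLogger("mylib")
logger.addHandler(logging.NullHandler())

def parse(payload: bytes) -> list:
    # Log the things that matter for performance diagnosis.
    logger.debug("parse called: input_size=%d bytes", len(payload))
    tokens = payload.split(b",")
    logger.debug("parse produced %d tokens", len(tokens))
    return tokens

# Client side: one line of configuration turns the diagnostics on,
# with no change to the library's code.
logging.basicConfig(level=logging.DEBUG)
tokens = parse(b"a,b,c")
```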

Provide a built-in diagnostic mode (optional but useful).

A step beyond logging: a mode that, when enabled, collects and reports a structured summary of what the library did —
operations performed, time spent, allocations made, retries triggered. Think of it as a flight recorder. The client runs
their workload with diagnostic mode on, exports the report, and sends it to you.

This is more work to build, but for libraries where performance is a core concern, it's worth it. It's the closest thing
you'll get to having your own monitoring in someone else's production.
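One possible shape for such a diagnostic mode, sketched in Python (all names here are invented for illustration):

```python
import time
from collections import defaultdict

class DiagnosticMode:
    """Collects a structured summary of library activity when enabled."""

    def __init__(self):
        self.enabled = False
        self.counts = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def record(self, operation, seconds):
        if self.enabled:
            self.counts[operation] += 1
            self.total_seconds[operation] += seconds

    def report(self):
        # The structured summary a client exports and sends back to you.
        return {
            op: {"calls": self.counts[op],
                 "total_ms": round(self.total_seconds[op] * 1000, 3)}
            for op in self.counts
        }

diag = DiagnosticMode()
diag.enabled = True

def timed_operation(name):
    start = time.perf_counter()
    sum(range(100_000))  # stand-in for real library work
    diag.record(name, time.perf_counter() - start)

for _ in range(3):
    timed_operation("parse")
print(diag.report())
```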

The key principle: you can't add monitoring to a client's production environment, but you can make your library observable
enough that the client can do it for you. Design for observability from the start — it's much harder to retrofit.

The Unique Challenge: You Can't See Their Production

Figure 2: Production environment is a black box for library projects

The hardest part of library performance work is that you're always working with incomplete information. The reporter's
production environment is a black box.

A few things that help:

  • Ask for a heap dump or profiler output from their side. Even a rough flame graph from their environment is worth more than your best guess.
  • Provide a diagnostic mode or logging hooks. This is especially valuable for intermittent issues you can't reproduce.
  • Test against a range of environments. Different JVM versions, GC algorithms, and OS schedulers behave differently.
  • Be explicit about your performance contract. Document what you've benchmarked, under what conditions, and what the expected characteristics are.

Summary

Library performance investigation is harder than service performance investigation because you don't own the runtime. But
the discipline is the same: follow the data, not your gut.

The process:

  1. Get a clear problem statement — which API, what input, what "slow" means
  2. Build a reproducible benchmark before touching any code
  3. Profile the benchmark — don't read code looking for problems
  4. Identify the hot path from profiler output
  5. Categorize the bottleneck type
  6. Validate your hypothesis before fixing
  7. Verify the fix with the benchmark
  8. Document the improvement, its scope, and any trade-offs

The mindset shift from Part 1: you can't observe production, so your benchmark and profiler are your only sources of truth. Invest in making them accurate.

Part 1 covers the same topic for deployed services — where you have monitoring, distributed tracing, and control over the runtime.
