(Originally posted on Medium, Photo by Matheus Bertelli: https://www.pexels.com/photo/close-up-of-cooling-system-inside-gaming-pc-34552800/)
In this article, I share how I precisely identified CPU bottlenecks in my code using everything I’ve learned so far, plus a new tool: Pyroscope.
Introduction
As a brief review, Expense Tracker is my personal solution for recording my expenses, and I took the opportunity to learn about unit testing, continuous integration, and observability along the way. It exists as a standalone Android app, but this series is about the client-server version built with Spring Boot. After finishing the app’s basic features, I decided to find out how it would perform in the real world, which led me to performance testing. The two types of performance tests I’m concerned with are load and stress tests; there are other types, like soak and chaos tests, but I chose to focus on the basics. Load tests anticipate the resources needed under standard load conditions, while stress tests find out how much pain a system can endure before it caves in entirely, and whether it fails gracefully.
TL;DR
- The Bottleneck: Identified a critical N+1 query issue causing 80s+ latencies and 4,000+ spans in a single trace.
- The Fix: Reduced latency by 97% using database JOINs and in-memory hash map caching.
- Thread Pool Tuning: Optimized JWT generation by sizing a dedicated thread pool to the v-core count to minimize context switching.
- The Reality Check: Concluded that BCrypt slowness is a non-negotiable security feature, choosing to rate-limit the app rather than compromise hashing strength.
- Final Verdict: Stabilized the system at 10 RPS with a 100-thread cap for a 2 v-core staging environment.
Preparation
My initial objective was simple:
What are the limits of my system?
To find that out, I made sure to observe the Grafana dashboards I built when I was learning about metrics, including:
- JVM metrics
- MySQL metrics
- Request, Error, Duration (RED) dashboard
I also followed this article as a guideline and decided to come up with a basic test script based on a typical user flow:
Test Script
- Log in
POST /auth/login
- Enter the draft screen
POST /transactions/draft, GET /transactions/draft, GET /categories
- Change the account from the default
GET /accounts, PUT /transactions/draft
- Attach the relevant documents
POST /transactions/draft/documents
- Modify the items
PUT /transactions/draft/items/{uuid}
- Save the draft and view the main page
POST /transactions, GET /transactions
- View statistics according to certain criteria
POST /statistics/... (5 different endpoints)
- View the filtered transactions themselves
POST /statistics/transactions
Testing
My tool of choice is Apache JMeter, and for all my tests, I used the Constant Throughput Timer, which essentially dictates my Requests per Second, or RPS, so long as the server can keep up. I was iteratively building my JMeter script as I slowly discovered my system limits each run. Some key points include:
My MySQL never showed any real signs of strain: memory, row locks, and slow queries all stayed consistently healthy.

I never encountered any memory issues for my JVM. The graphs all pretty much looked like this:

Earlier on in my tests, I also noticed that my CPU kept peaking at certain points, but I didn’t think much of it. I later learned why that was important.
I noticed and changed a number of other things over time, but I’ll leave them out for brevity.
Developing Acceptance Criteria
During my tests and research, and after being comfortable with gauges and counters, I discovered the usefulness of the third metric type in Prometheus: histograms. They help to answer the question:
What latency did 90% of my users experience?
The graph above is capped at 10 seconds because those are the default bucket boundaries prescribed by the OpenTelemetry semantic conventions. Histograms work with buckets, so precise detail can be hidden: it’s possible that a majority of requests take significantly longer than 10 seconds, or that a large portion of request durations sit towards the tail end of a bucket boundary. Estimation errors are expected because of this, and one can choose bucket boundaries to fit their own use case. I stuck with the defaults because 10 seconds already sounds long enough for a bad user experience.
I initially focused on averages, but I later realised why histograms are worth considering: quantiles are more helpful in my case because outliers can skew an average.
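For reference, the p95 panels on those dashboards boil down to a single PromQL pattern: a `histogram_quantile` over the per-bucket rate. The metric name below assumes Micrometer’s `http_server_requests` histogram as exported by Spring Boot; under pure OpenTelemetry naming it will differ, so treat this as a sketch:

```promql
histogram_quantile(
  0.95,
  sum by (le, uri) (rate(http_server_requests_seconds_bucket[5m]))
)
```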
With this new information, I decided to rework my dashboards and make use of Grafana’s thresholds:
- Response time p95 latency:
- 0 to 3 seconds: OKAY/green;
- 3 to 7 seconds: FAIR/yellow;
- 7 to 10 seconds: BAD/red
- Memory and CPU: 90%
- Storage: 70%
- Server error rate (HTTP 5xx): 10%
- Client error rate (HTTP 4xx): 30%
These numbers come from what I’ve decided to be an acceptable user experience. With the thresholds established, my JVM dashboard looks like this:
And my RED dashboard looks like this:
At that point, my objective changed:
Optimize the bottlenecks
There were three offenders:
- Upload Document
- Get transactions
- Login
The document endpoint was mostly fixed by changing my hardware from an HDD to an SSD, but the others were a bit more involved.
Optimizations
GET /transactions
Problem
- I was making unnecessary round-trip queries to the database, severely increasing the latency
Solutions
- Use a JOIN query to fetch all the needed data at once
- For nested entities, also use hash maps to cache them and make “in-memory” joins.
- Improvement: my worst-case trace went from 1 minute 21 seconds down to 2.2 seconds (97% improvement)
Details
My test script essentially acted like a mass insertion script, so I wanted to see how the app would handle the roughly 777 transactions I had inserted by that point, but after logging in to the test user account and opening the screen, it crashed. After looking at Tempo, I found out why:
My endpoint took 1 minute and 20 seconds, but the app crashed due to a 10-second timeout, which is OkHttp’s default. It took so much time because I made an excessive number of round trips to the database to fetch nested entities; that’s why this trace has almost 4,000 spans. On Markus Winand’s website for database indexing, this is called the N+1 query problem: 1 is the initial query to fetch the data, and N is the follow-up queries, one per row fetched. The solution is a join query. Once I made the change, the duration dropped by 97% to 2.2 seconds.
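The shape of the fix can be sketched in plain Java. The entity and field names below are simplified stand-ins for my real schema; the point is that one joined result set plus a HashMap grouping replaces N follow-up queries for nested items:

```java
import java.util.*;

public class InMemoryJoin {
    // A flattened row as it would come back from a single JOIN query:
    // transaction id, transaction name, item description.
    record Row(long txId, String txName, String item) {}

    // Group the joined rows by transaction id so nested items are
    // assembled in memory instead of via one extra query per transaction.
    static Map<Long, List<String>> itemsByTransaction(List<Row> rows) {
        Map<Long, List<String>> byTx = new HashMap<>();
        for (Row r : rows) {
            byTx.computeIfAbsent(r.txId(), id -> new ArrayList<>()).add(r.item());
        }
        return byTx;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row(1, "Groceries", "Milk"),
            new Row(1, "Groceries", "Bread"),
            new Row(2, "Transport", "Bus fare"));
        // One pass over one result set replaces N follow-up queries.
        System.out.println(itemsByTransaction(rows));
    }
}
```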
POST /auth/login
Problems
- JWT generation was stressing the CPU
- Password hashing was also stressing the CPU
Solutions
- Limit JWT generation to a CPU-sized thread pool
- No more software optimizations to be made. Get better hardware
- Token generation was improved by 64% from 1 minute 8 seconds to 24 seconds in my worst case, but password hashing still took the bulk of the time
Details
When I checked my traces — which, unlike histogram buckets, record the exact duration of individual requests — I saw that the time could get as bad as 37 seconds, so I had a good amount of trouble with this one. I thought that maybe private key and JWT generation were computationally expensive and simply took that much time, and I initially didn’t know how to verify that. After adding timer logs and searching with Loki, I saw something interesting:
There’s a point where it suddenly goes from taking over 13 seconds to less than 1 second to generate that JWT. Quite confusing. I didn’t know how to diagnose the problem when the output was so divided.
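For reference, those timer logs came from nothing fancier than a wrap-and-measure pattern like the sketch below; the token method here is a trivial stand-in for my real `createUserToken`, which does the actual JWT signing:

```java
public class TokenTiming {
    // Stand-in for the real createUserToken; the real method signs a JWT.
    static String createUserToken(String user) {
        return "jwt-for-" + user;
    }

    public static void main(String[] args) {
        // Wrap the call and log the elapsed time so the line can be searched in Loki.
        long start = System.nanoTime();
        String token = createUserToken("test-user");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("createUserToken took " + elapsedMs + " ms, token=" + token);
    }
}
```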
I eventually discovered profiling as a potential tool to help, and I chose Grafana Pyroscope as my tool. I initially tried VisualVM, but setting each one up required roughly the same amount of effort, so Pyroscope won out since it’s part of the Grafana ecosystem I’m already invested in.
With Pyroscope set up, I searched for my JWT method, createUserToken, and found that it had consumed over 1 minute of CPU time out of the 23 minutes the newly profiled test ran for. After looking into it and asking Gemini, I discovered that context switching was my main issue: CPU-bound tasks shouldn’t be spread across many threads, as the overhead is high; they should get only as many threads as there are cores. Spring’s default worker thread count for Tomcat is 200. That pool size isn’t a problem for I/O-bound work, which in this case means serving requests, but by letting each worker thread compute the JWT itself, I was slowing everything down.
I further confirmed this suspicion by returning to my CPU graphs
Host (the VM had 4 out of 8 cores):
Once I looked at the timings and compared them with my RPS for the login endpoint, I saw that they were closely aligned, which proved to me that the endpoint’s volume accounted for the increased CPU load.
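A minimal sketch of the resulting fix, with a CPU-heavy loop standing in for the real JWT signing work: a single executor sized to the core count, which all request threads share for the CPU-bound step instead of each computing the token themselves.

```java
import java.util.*;
import java.util.concurrent.*;

public class CpuBoundPool {
    // One pool sized to the machine's cores, used for CPU-bound work only;
    // I/O-bound request threads can stay at Tomcat's larger default.
    static final ExecutorService CPU_POOL =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // Stand-in for the real createUserToken: some CPU-heavy computation.
    static String createUserToken(String user) {
        long h = 17;
        for (int i = 0; i < 1_000_000; i++) h = h * 31 + user.hashCode();
        return "jwt-" + Long.toHexString(h);
    }

    public static void main(String[] args) throws Exception {
        // Many "request threads" submit work to the small pool and wait on
        // a Future, rather than all contending for the CPU at once.
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            String user = "user-" + i;
            results.add(CPU_POOL.submit(() -> createUserToken(user)));
        }
        for (Future<String> f : results) System.out.println(f.get());
        CPU_POOL.shutdown();
    }
}
```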
The improvement was made by delegating the task to a thread pool sized with Runtime.getRuntime().availableProcessors() (link) and verified with JUnit tests. After deploying the change, I saw that the optimization worked. Here are the before and after screenshots:
Before:
After:
An improvement from 1 minute 8 seconds to 24 seconds. However, the endpoint still had a bad p95, so I kept digging, sorted the Pyroscope table by Self instead of Total, and found the worst offender:
Password hashing with BCrypt
What I tried:
- Creating a custom subclass of the PasswordEncoder to delegate hashing to a thread pool, hoping for the same speed-up as with JWT generation. After 5 consecutive JUnit runs, my best time was roughly 26 seconds against a target of 10.
- Rate limiting the endpoint with Bucket4j. I started trading latency for throughput and couldn’t really find a good middle ground.
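For context, the idea Bucket4j implements can be sketched in a few lines of plain Java. This is an illustration of a token bucket, not Bucket4j’s actual API: each request consumes a token, and tokens refill at a fixed rate, so bursts beyond the budget get rejected instead of piling up latency.

```java
public class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefill = System.nanoTime();
    }

    // Returns true if the request may proceed, false if it should be
    // rejected (e.g. with HTTP 429) instead of queueing up.
    synchronized boolean tryConsume() {
        long now = System.nanoTime();
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(10, 10); // burst of 10, refills at 10/s
        int allowed = 0;
        for (int i = 0; i < 15; i++) if (bucket.tryConsume()) allowed++;
        System.out.println("allowed " + allowed + " of 15 immediate requests");
    }
}
```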
After searching around for other ways to optimize my code and not finding anything, I decided to stop for 3 main reasons:
- The slowness of password hashing is a security feature. I shouldn’t work against that
- My hardware has reached its limit
- A massive simultaneous login isn’t a regular user pattern
The Verdict
I had tried various other conditions, including:
- Increasing the RPS of the load test
- Increasing the Tomcat thread count
- Varying the user count in my JMeter script
- Using my staging VM to determine my final resources
The VM I chose has 2 v-cores and 2 gigabytes of RAM, and factoring that in, my final decisions include:
- Rate limit the app to 10 requests per second
- The Tomcat thread count should be capped at 100
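In Spring Boot configuration terms, the thread cap is a single property (the rate limit itself lives in the Bucket4j filter rather than in a property):

```properties
# application.properties: cap Tomcat's worker pool for the 2 v-core staging VM
server.tomcat.threads.max=100
```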
Conclusion
The project was a fun learning experience overall, and now I have a good idea of what it’s like to architect an app from end to end with performance, automation, and code stability in mind. It was especially nice learning the hard limits and what I could and could not improve.
Thank you for your time