My setup ain’t much. I have a laptop running Arch and a desktop running Debian. I’m worth a grand total of 32 gigs of RAM, 24 CPU cores, and six feet of Cat 8 Ethernet cable.
There’s a gnarly little question gnawing at my nucleus accumbens. How many requests per second can my $700 setup handle? What about reads per second, or writes?!
Assuming a Java web application and relational database, can it handle, say, 10,000 of each?
Probably not! In fact, it’s a ridiculous suggestion. I mean, what am I– crazy? Naive? Blissfully unaware of the economic state of consumer hardware? Well, I’m going to try it anyway!
Fun fact: 10k req/sec is about four times what Stack Overflow was doing back in 2016 with bare-metal, enterprise-level hardware.
Fun fact 2: One of my homelab’s fans doesn’t work. Thought you might find that mildly amusing.
The architecture is simple. Spring PetClinic is a sample MVC app that uses PostgreSQL for storage. Here is what it looks like deployed. I’ll be using Grafana k6 for stress testing.
I’ll be targeting the following endpoints:
GET / – requests per second.
GET /owners/{ownerID} – triggers a read to Postgres.
POST /owners/new – triggers a write to Postgres.
Given that the JVM is notorious for having a cold-start problem¹, we’ll be including a ramp-up period for each stress test. We’ll go 1% max RPS -> 10% max RPS -> 50% max RPS -> 100% max RPS, spending 30 seconds ramping up to each stage and holding the max stage for 60 seconds.
For example, at 1k rps:
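That ramp schedule, in k6 terms, looks something like the sketch below. This uses k6’s ramping-arrival-rate executor since we’re targeting a fixed request rate rather than a fixed VU count; the scenario name, pre-allocated VU count, and service URL are my assumptions, not the exact script from these runs:

```javascript
import http from 'k6/http';

export const options = {
  scenarios: {
    ramp_to_1k: {
      executor: 'ramping-arrival-rate',
      startRate: 10,        // 1% of 1k rps
      timeUnit: '1s',
      preAllocatedVUs: 50,
      maxVUs: 500,          // the VU ceiling mentioned later in this post
      stages: [
        { target: 10,   duration: '30s' }, // hold 1%
        { target: 100,  duration: '30s' }, // ramp to 10%
        { target: 500,  duration: '30s' }, // ramp to 50%
        { target: 1000, duration: '30s' }, // ramp to 100%
        { target: 1000, duration: '60s' }, // hold max for 60s
      ],
    },
  },
};

export default function () {
  http.get('http://petclinic.local/'); // hypothetical URL; swapped for the read/write endpoints per test
}
```

The reads and writes tests would swap the request in the default function; everything else stays the same.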
I decided to containerize everything from the get-go and deploy it on Minikube, but I’ll eventually move to bare metal or k3s, since those are more resource-efficient. Also, I’ll start with a single node and add a second machine later (probably in a continuation post). It’s a journey, after all.
Anyways, did you catch all that? Here, look at some Excalidraw:
I’ll be writing down findings, headaches, and optimizations I make along the way.
Let’s see, am I forgetting anything before we start?
- [x] State goal
- [x] Describe application and infrastructure
- [x] Scatter droll remarks across the article and mention that I use Arch Linux at least twice.
Oh yeah! Our only SLO is 0% request drop rate under the target load. In other words, a single queued request not handled by the end of a run yields a big fat failure.
1 X per second
As you might have guessed, this was pretty much smooth sailing. I’m going to show the 9 experiments back to back for this one, since the throughput is so low.
The first three tests are req/sec, then reads, then writes.
That latency spike at the start was the JVM warming up. Two more (smaller) latency spikes follow when we switch to reads and writes. They’re different code paths, after all!
Here are the Postgres metrics:
Now, let’s turn up the heat.
1000 X per second
Requests
All systems are nominal. Take a look at those latency scores, though! Each time we started the load test, you’d see a hit to performance, then a massive, sharp improvement. They don’t call it HotSpot for nuthin’.
Next, let’s do some reads.
All systems nominal part 2.
Honestly, I didn’t think I’d make it this far with the default Minikube limits (2 cores and 2GB of memory). 1k requests per second is a little over 86 million requests per day. Pretty grande numero, compadre.
Writes:
It didn’t work! Of the three experiments, only the second managed to write the 60k records into Postgres.
So, why did runs 1 and 3 fail? Two metrics immediately stand out.
First, take a look at the thread state (chart 1 row 2 column 3). For both of the poo-poo runs, the number of threads in the “Timed waiting” state hit 200.
Second, the average memory usage (chart 2 row 1 column 2) for the database dashboard tells a similar story, peaking at around the same time our number of timed waiting threads hit 200.
We know that Spring defaults to Tomcat as its web server. Out of the box, Tomcat caps the number of worker threads at 200, each request gets its own worker thread, and worker threads are wrappers around OS threads. We’re hitting that limit around the same time everything breaks.
Hypothesis #1: We can solve this problem by increasing Tomcat’s thread limit.
If I’m being perfectly honest, increasing the number of worker threads smells a little funky. Kinda feels like slapping a band-aid on a gash that needs stitches. Or, it might be closer to slapping a band-aid on a radiation burn. More operating system threads means worrying about the cost of thread context switches, which means higher CPU utilization, which introduces latency, which, as we’ve discussed before, makes puppies cry². You wouldn’t want to make puppies cry, would you?
There is a more modern take on thread pool exhaustion. Virtual threads are a Java 21+ feature. Here, read the friendly manual to learn more.
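To make the idea concrete, here’s a minimal sketch of virtual threads (requires Java 21+). The class name and task counts are mine; the point is that each blocking task gets its own cheap virtual thread instead of pinning one of a small pool of OS threads:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadsDemo {
    // Run n blocking tasks, one virtual thread each, and return how many finished.
    static int runBlockingTasks(int n) {
        AtomicInteger completed = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < n; i++) {
                executor.submit(() -> {
                    // Simulate blocking I/O (e.g., waiting on a Postgres connection);
                    // the virtual thread unmounts instead of hogging an OS thread.
                    try { Thread.sleep(5); } catch (InterruptedException ignored) { }
                    completed.incrementAndGet();
                });
            }
        } // close() blocks until every submitted task has finished
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println(runBlockingTasks(10_000) + " tasks finished");
    }
}
```

In Spring Boot 3.2+, flipping Tomcat over to virtual threads is a single property: `spring.threads.virtual.enabled=true`.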
Hypothesis #2: We can solve this problem by introducing virtual threads.
Another reason may be the nature of the work itself. What if our threads are presently caught up in some sort of I/O? Say a request comes in. One of our threads might be like “i got this bro”. That thread then goes to the connection pool for Postgres and opens a connection. Nine more threads do just this, exhausting our default connection pool of 10.
Requests are still coming in, though. Any thread without access to a Postgres connection goes into the TIMED_WAITING state. So even if we increase the number of threads or switch to virtual threads, it wouldn’t help as much as, say, increasing the size of the connection pool. I could be wrong, though; let’s throw some scientific method at it and see what happens!
Hypothesis #3: We can solve this problem by increasing the HikariCP connection pool size.
Now, say none of these work. In that case, the problem is most likely CPU throttling: Postgres is trying to commit a thousand records per second on two CPU cores, the poor little guy. I’d prefer to delay throwing more compute at the problem for as long as possible. But if we simply can’t handle the load, then we’ll brute-force our way through and leave the more time-consuming optimizations for later.
Hypothesis #4: We can solve this problem by increasing the number of CPUs (cores).
Pause here and take a crack at guessing what happens when we:
(1) increase the maximum number of Tomcat worker threads 200 -> 400
(2) switch to virtual threads
(3) increase the HikariCP connection pool size 10 -> 50
(4) increase the number of cores 2 -> 12
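For reference, here’s roughly what those four changes look like in config. The property names are standard Spring Boot ones, but treat this as a sketch of the knobs rather than my exact files:

```properties
# application.properties
server.tomcat.threads.max=400                 # (1) Tomcat worker threads, 200 -> 400
spring.threads.virtual.enabled=true           # (2) virtual threads (Spring Boot 3.2+, Java 21+)
spring.datasource.hikari.maximum-pool-size=50 # (3) HikariCP connection pool, 10 -> 50
```

Change (4) happens at the cluster level, e.g. `minikube start --cpus=12`.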
Did you take a guess? Please take a guess. Please. Come on, man. Think of the kids.
Hypothesis #1 result: It didn’t work! Oddly enough, we ended up using the same number of threads. I did a little digging, and it turns out the thread state metric captures Tomcat workers plus JVM-internal threads. In other words: either we weren’t reaching our limit on the Tomcat side at all, Tomcat decided it didn’t need to spawn more worker threads, or something else. I humbly regret to inform you that I’m leaning towards something else. Figuring out exactly why would require further observability/instrumentation (with my luck, it will turn out to be something quite obvious), and knowing the cause (probably) wouldn’t get us a performance boost, so I’m moving on!
Hypothesis #2 result:

Well, we fixed the thread pool exhaustion issue, but no cigar: we’re still not hitting those 60k writes. Interestingly enough, our p99 latencies are looking pretty scrumptious compared to the non-virtual threads. Seems virtual threads may be part of a late-game meta?
Hypothesis #3 result: Great news everyone! It didn’t work!
I think this may have actually been our worst performance. The mechanics of why that happened are actually pretty interesting, albeit beyond the scope of this article. But here, if you want to learn more, click me! Or me! Or even me!
TL;DR: you should set the number of connections within a pool to
(cpu cores * 2) + effective spindle count
If you make it any bigger, you’re increasing the thread count to your detriment. In other words, context switches are gonna getcha. Sometimes, less is more!
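Plugging this box’s numbers into that heuristic (a sketch; the “effective spindle count” for a single SSD is usually taken as 1):

```java
public class PoolSize {
    // PostgreSQL wiki heuristic: connections = (cores * 2) + effective spindle count
    static int recommendedPoolSize(int cpuCores, int effectiveSpindles) {
        return cpuCores * 2 + effectiveSpindles;
    }

    public static void main(String[] args) {
        // With the 12 cores from hypothesis #4 and one SSD:
        System.out.println(recommendedPoolSize(12, 1)); // prints 25, well under the 50 we tried
    }
}
```

So even after the core bump, a pool of 50 was roughly double what the formula suggests.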
Hypothesis #4 result: Okay, this one actually worked!


It seems 1k writes per second is too much for Minikube’s default 2 CPU cores after all. Whooda thunk?
And as a final sanity check:
Perfect!
Okay, let’s do 2.5k now. We’ll keep 12 cores and up the memory to 12GiB. We’ll also stick to the default thread settings for now. I want to see how far that’ll take us.
2500 X per second
Requests
Utter failure.
Notice anything different about these charts compared to our initial results for 1k writes per second? First, we’re capping out at around 2,200 requests per second; that is, we never actually hit our goal. Second, our thread states are spiky, whereas during the 1k write runs we plateaued at our limit of 200 threads in the timed-waiting state. Third, our latency distributions are suspiciously consistent. Not really something you’d expect from a system crumbling under an unbearably girthy throughput.
Another thing you might notice from our thread states is the new spike of threads in the “Runnable” state.
I give up.
I don’t think we’re going to hit 2500 RPS today, but I want to leave things off with a few observations and next steps.
There are a couple of questions whose answers would greatly help going into part 2. I ran 3 more tests for each category and here are the numbers (worst case):
Requests
What is our p(99) at a manageable load (1000 RPS)?
-> 15.26ms
What is the max latency at a manageable load (1000 RPS)?
-> 191.84ms
At what throughput does the request queue start growing (i.e., the arrival rate > queue service rate)?
-> 2083 requests per second
Reads
What is our p(99) at a manageable load (1000 Reads per second)?
-> 10.58ms
What is the max latency at a manageable load (1000 Reads per second)?
-> 27ms
At what throughput does the request queue start growing (i.e., the arrival rate > queue service rate)?
-> 1818 reads per second
Writes
What is our p(99) at a manageable load (1000 writes per second)?
-> 10.25ms
What is the max latency at a manageable load (1000 writes per second)?
-> 209ms
At what throughput does the request queue start growing (i.e., the arrival rate > queue service rate)?
-> 2041 writes per second
On growing queues
We know the request queue is growing by the number of virtual users (VUs) that dynamically spin up during our load tests. Virtual users are an abstraction provided by Grafana k6 that simulates requests from real users. I used Little’s law to set the initial number of VUs; k6 then dynamically increases the number of VUs up to some maximum (which I set to 500). My guess is it uses Little’s law or something similar. In other words: more latency -> more VUs dynamically allocated.

Under the hood, virtual users are goroutines. Goroutines are similar to Java’s virtual threads in their purpose: concurrency through abstraction. The cost of too many goroutines (and therefore VUs) is memory.

All this to say, I’m monitoring the k6 logs for spikes in the number of dynamically allocated VUs as a symptom that the number of incoming requests is greater than what the Java application can handle. For example, within a matter of 5 seconds, the number of VUs shoots from 3 to our maximum of 500. Across three runs, the experiment below began to queue around the 1950 iterations-per-second mark.
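Little’s law says concurrency L = arrival rate λ × time in system W. A toy calculation with this post’s numbers (my sketch, not k6’s actual internals):

```java
public class LittlesLaw {
    // L = lambda * W: average number of requests in flight, which is
    // roughly the number of VUs k6 needs to sustain the arrival rate.
    static double concurrency(double arrivalRatePerSec, double latencySeconds) {
        return arrivalRatePerSec * latencySeconds;
    }

    public static void main(String[] args) {
        // Healthy system: 1000 rps at a ~15 ms p99 needs only ~15 VUs in flight.
        System.out.println(concurrency(1000, 0.015));
        // Queueing system: if latency balloons to 500 ms, the same load
        // needs 500 VUs in flight, which is exactly our k6 cap.
        System.out.println(concurrency(1000, 0.5));
    }
}
```

That’s why a VU spike is such a clean symptom: VU count is proportional to latency at a fixed arrival rate, so a jump from 3 to 500 VUs means latency (and therefore the queue) exploded.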
Moving forward
I think I’ve run out of low hanging fruit. I’m planning on making the following changes for part two:
- Increase to two nodes. I’ll run pgbench against a containerized Postgres instance on both of my machines; whichever SSD performs better in terms of transactions per second (TPS) will host Postgres.
- Switch off minikube to a lighter distribution. Namely, k3s, which is made for IoT, edge devices, and sad, penurious homelabs :(
- Come in with some application-level data on where the bottlenecks are at runtime. I’ll be using async-profiler.
- Generally drink more water and get more sleep
- Go back to virtual threads.
Until next time!
¹ At runtime, the JVM’s JIT compiler turns bytecode into “optimized” native machine code. Optimized code is good because it uses fewer CPU cycles to perform certain instructions, meaning we have more CPU cycles left over to do more work. However, if we bombard the REST API with an onslaught of requests before this optimization takes place, CPU utilization skyrockets, leading to CPU throttling -> higher latency -> unhappy users -> crying baby puppies. ↩
² See footnote one right above me. What, you don’t read footnotes? Do you think you’re better than me or something? ↩