buildbasekit

I’m crash testing FiloraFS-Lite under load (p95, pressure, failures)

I started running crash tests on FiloraFS-Lite to see how it actually behaves under pressure.

Not benchmarks. Not ideal conditions.
Real stress.

The focus is simple:

  • where it starts breaking
  • how early signals show up (p95, pressure)
  • what fails first under sustained load

What I’m testing right now

  • increasing RPM until the system shows pressure
  • tracking how latency (p95) degrades
  • observing write pressure under continuous load (rough harness sketch below)
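
To make the setup concrete, here's a minimal sketch of the ramp loop, assuming a plain HTTP endpoint. The URL, payload, and RPM schedule are placeholders, not FiloraFS-Lite's actual API or my real harness (which runs requests concurrently rather than one at a time).

```python
# Minimal, sequential sketch of the ramp loop described above.
# ENDPOINT, the payload, and RPM_STEPS are placeholders -- the real harness
# runs concurrently and targets FiloraFS-Lite's actual upload path.
import time
import urllib.request

ENDPOINT = "http://localhost:8080/upload"   # hypothetical target
RPM_STEPS = [500, 1000, 1500, 3000, 6000]   # ramp schedule (illustrative)
STEP_DURATION_S = 60                        # hold each level for one minute

def timed_request() -> float:
    """Fire one request and return its latency in milliseconds."""
    start = time.perf_counter()
    try:
        urllib.request.urlopen(ENDPOINT, data=b"x" * 1024, timeout=10).read()
    except Exception:
        pass  # errors are tracked separately in the real harness
    return (time.perf_counter() - start) * 1000

for rpm in RPM_STEPS:
    interval = 60.0 / rpm                   # target gap between requests
    latencies = []
    deadline = time.monotonic() + STEP_DURATION_S
    while time.monotonic() < deadline:
        latencies.append(timed_request())
        time.sleep(interval)
    latencies.sort()
    p95 = latencies[max(0, int(len(latencies) * 0.95) - 1)]
    print(f"{rpm} RPM -> p95 {p95:.0f} ms over {len(latencies)} requests")
```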

I already have logs from initial runs.

But instead of rushing to conclusions, I’m validating the signals properly before sharing anything.

No assumptions. Just observed behavior.


Why this matters

Small-scale tests looked fine.

But real systems don’t fail in clean scenarios.

They fail when:

  • load spikes
  • resources get constrained
  • edge cases stack together

That’s the environment I’m trying to simulate.


What I’ll share next

Once the analysis is done, I’ll publish:

  • what broke first
  • early warning signals
  • what actually mattered vs noise
  • what needs to change

If you’ve done similar crash or stress testing:

What signal usually shows up first for you under load?

Top comments (2)

buildbasekit

Ran a 10-minute load test before pushing further.

Interesting part: p95 latency started drifting earlier than expected, even though the overall system looked stable. Write pressure also kept building quietly in the background.

Now moving into full crash tests to see where it actually breaks and whether these signals consistently show up beforehand.
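
For context, this is roughly how I'm checking for that drift: compare each window's p95 against the first window's baseline. The input file and its format here (one latency value in ms per line) are stand-ins for illustration, not FiloraFS-Lite's actual log output.

```python
# Rolling-window p95 drift check. The input file and its format are
# placeholders (one latency-in-ms value per line), used only to illustrate
# the comparison against the first window's baseline.
from pathlib import Path

WINDOW = 200          # samples per window
DRIFT_FACTOR = 1.25   # flag windows whose p95 exceeds baseline by 25%

def p95(values: list[float]) -> float:
    ordered = sorted(values)
    return ordered[max(0, int(len(ordered) * 0.95) - 1)]

samples = [float(x) for x in Path("latencies_ms.log").read_text().split()]
windows = [samples[i:i + WINDOW] for i in range(0, len(samples), WINDOW)]
baseline = p95(windows[0])

for n, w in enumerate(windows):
    current = p95(w)
    flag = "  <-- drifting" if current > baseline * DRIFT_FACTOR else ""
    print(f"window {n}: p95 {current:.0f} ms{flag}")
```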

buildbasekit

Update from crash test runs: clear pattern emerging now.

p95 stays stable (~250–300ms) up to ~1500 RPM, then starts degrading rapidly as load increases. By ~6000+ RPM, latency crosses 1s and spikes toward ~1.8s, even though there are still zero errors.

The bottleneck is showing up in disk I/O (the upload path), while other APIs remain stable. So the system doesn’t “fail” first; it just slows down heavily under write pressure.

This is interesting because the failure signal is purely latency-driven, not resource exhaustion or errors.
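
If it helps anyone automate the same check, here's a tiny sketch that flags the tipping point purely on latency, even at zero errors. The RPM and p95 numbers below are rough illustrations of the figures above, not the final results, and the 1s budget is an assumed SLO.

```python
# Classify each load step: latency-only degradation vs. hard failure.
# The numbers are illustrative approximations of the runs described above,
# not the published results; the 1s budget is an assumed SLO.
LATENCY_BUDGET_MS = 1000

# (rpm, p95_ms, error_rate) per load step
steps = [
    (1500, 280, 0.0),
    (3000, 600, 0.0),
    (6000, 1100, 0.0),
    (8000, 1800, 0.0),
]

for rpm, p95_ms, error_rate in steps:
    if error_rate > 0:
        status = "hard failure"
    elif p95_ms > LATENCY_BUDGET_MS:
        status = "degraded (latency-only)"   # zero errors, but the SLO is gone
    else:
        status = "ok"
    print(f"{rpm:>5} RPM  p95={p95_ms:>4} ms  errors={error_rate:.1%}  -> {status}")
```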

I’ll publish a detailed breakdown soon with full analysis, graphs, and what actually caused the tipping point.