Lessons from a successful optimization mission — by a non-expert
Some Context
The API is central to a migration program. Problem: critical latencies are blocking its release to production.
The symptoms: MongoDB CPU at 150%, connections jumping from 400 to 1500, high error rate, requests exceeding 30 seconds.
Here's the approach I followed and what I learned from it.
How to Tackle This Kind of Problem?
Before diving in, I gathered as much information as possible: existing audits, interviews, monitoring dashboards, bug history. I complemented this with a code review to assess complexity, spot anti-patterns and identify quick wins.
🤖 AI Accelerator: The agent scanned the codebase, identified anti-patterns and documented the structure. It helped me build a macro view of the system.
Using the Cynefin framework, I identified that we were in the Complex domain: multiple interconnected factors, impossible to predict the impact of an optimization without trying it.
Hence the approach: Probe → Sense → Respond — experiment, observe, adjust.
An Iterative Strategy: Probe → Sense → Respond
No tooling, no data. No data, no strategy. I used a load testing tool and several monitoring solutions to collect as much information as possible.
Smoke Test
Probe: Each test uses a simple and identical profile: gradual ramp-up to 500 virtual users, 15-minute plateau, then ramp-down.
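The post doesn't say which load-testing tool was used; as an illustration, here is roughly what that profile looks like in a k6-style script (written as TypeScript to keep a single language across the sketches in this post). The endpoint and the ramp durations other than the 15-minute plateau are assumptions:

```typescript
// Illustrative k6-style load profile; endpoint and ramp durations are assumptions.
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 500 },  // gradual ramp-up to 500 virtual users
    { duration: '15m', target: 500 }, // 15-minute plateau
    { duration: '5m', target: 0 },    // ramp-down
  ],
};

export default function () {
  // The real scenario exercises the API's read and write endpoints;
  // a single GET stands in for it here.
  http.get('https://api.example.internal/v1/resources'); // hypothetical endpoint
  sleep(1);
}
```

k6 scripts are plain JavaScript, and the snippet above is valid as such; only the profile shape matters here.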
Sense: Too many 504 errors, unstable MongoDB — unusable results. Root causes: the scenario reused the same identifiers (write conflicts) and the code contained queries without indexes.
Result: Unusable (too many errors to measure)
Respond:
- Created indexes on critical fields (see the sketch below)
- Improved test scenario (randomization, realistic distribution)
- Created a dedicated test environment
🤖 AI Accelerator: The agent identified the fields to index and generated the test scripts.
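To make the index fix concrete, here is a minimal sketch using the Node.js MongoDB driver. The collection and field names are hypothetical stand-ins; the actual critical fields came from the query analysis, and the identifier randomization lives in the test script itself:

```typescript
// Minimal index-creation sketch (Node.js MongoDB driver).
// Collection and field names are hypothetical stand-ins.
import { MongoClient } from 'mongodb';

async function createCriticalIndexes(): Promise<void> {
  const client = new MongoClient(process.env.MONGO_URL ?? 'mongodb://localhost:27017');
  await client.connect();
  try {
    const orders = client.db('migration').collection('orders');

    // Compound index matching the hottest filter + sort pattern.
    await orders.createIndex({ customerId: 1, createdAt: -1 });

    // Single-field index for a frequent lookup that was doing collection scans.
    await orders.createIndex({ externalRef: 1 });
  } finally {
    await client.close();
  }
}

createCriticalIndexes().catch(console.error);
```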
Establishing a Baseline
Probe: Load test with the corrected scenario on the dedicated environment.
Sense: First usable results. High latency, but reliable measurements.
Result (baseline): 101 req/s, 2.56% errors, P99 9.9s, average 3.15s
Respond:
- Adjusted Kubernetes autoscaling (HPA)
- Increased replica count
Testing Horizontal Scaling ⚠️
Probe: New load test.
Sense: Pods saturate before autoscaling kicks in.
Result: 82 req/s, 4.17% errors, P99 12.3s, average 4.09s ⚠️
⚠️ Regression! More replicas = more concurrent MongoDB connections = more contention. We're just moving the problem around.
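One way to see the mechanism: each pod runs its own driver connection pool, so the total number of MongoDB connections grows linearly with the replica count. A hedged sketch with illustrative numbers (the pool size and pod counts are assumptions, not measurements from the mission):

```typescript
// Each API pod opens its own connection pool to MongoDB.
import { MongoClient } from 'mongodb';

const MAX_POOL_SIZE = 100; // the Node.js driver's default maxPoolSize is 100

export const client = new MongoClient(process.env.MONGO_URL ?? 'mongodb://localhost:27017', {
  maxPoolSize: MAX_POOL_SIZE,
});

// Back-of-the-envelope arithmetic (illustrative):
//   4 pods x 100 connections = up to   400 connections
//  15 pods x 100 connections = up to 1,500 connections
// Scaling out multiplies connections (and contention) without adding any
// database capacity, which is why throughput dropped instead of rising.
// Capping maxPoolSize per pod is one common lever to bound the fleet-wide total.
```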
Respond:
- Increased CPU and RAM resources per pod
Testing Vertical Scaling
Probe: New load test.
Sense: Observability reveals Out Of Memory errors and frequent restarts.
Result: 162 req/s, 0.68% errors, P99 5.86s, average 1.58s
✅ Throughput +60%, errors -73%. But latency remains high — the problem is now in the code.
Respond:
- Optimized MongoDB queries (removed expensive aggregations; sketched below)
- Refactored critical algorithms
- Optimized write parameters
🤖 AI Accelerator: The agent generated regression tests and assisted with code refactoring.
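To illustrate the kind of change involved (not the mission's actual code): replacing a broad aggregation with a targeted, indexed query plus a projection, and relaxing write parameters for bulk inserts. The collection, fields, and the chosen write concern are assumptions:

```typescript
// Hedged sketch of the query and write optimizations; names and values are
// hypothetical stand-ins for the real code.
import { MongoClient } from 'mongodb';

async function run(): Promise<void> {
  const client = new MongoClient(process.env.MONGO_URL ?? 'mongodb://localhost:27017');
  await client.connect();
  const orders = client.db('migration').collection('orders');

  // Before (assumed shape): an aggregation that groups over many documents
  // on every request.
  const totalsBefore = await orders.aggregate([
    { $match: { customerId: 'c-42' } },
    { $group: { _id: '$status', total: { $sum: '$amount' } } },
  ]).toArray();

  // After: a narrow, indexed find with a projection; the cheap grouping is
  // done in application code (or precomputed), keeping MongoDB's work small.
  const docs = await orders
    .find({ customerId: 'c-42' }, { projection: { status: 1, amount: 1, _id: 0 } })
    .toArray();

  // "Optimized write parameters" could mean, for instance, unordered bulk
  // writes with an explicit write concern (assumption, not the post's code).
  await orders.bulkWrite(
    docs.map((d) => ({ insertOne: { document: { ...d, archived: true } } })),
    { ordered: false, writeConcern: { w: 1 } },
  );

  console.log(totalsBefore.length, docs.length);
  await client.close();
}

run().catch(console.error);
```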
Testing Application Optimizations
Probe: New load test.
Sense: The optimizations are paying off.
Final result ✅: 200 req/s, ~0% errors, P99 0.67s, average 233ms
✅ Throughput doubled compared to baseline, P99 dropped from 9.9s to 0.67s.
Exit — Knowing When to Stop
200 req/s, P99 under one second, near-zero errors: release constraints are met.
In Cynefin terms, this is the transition from the Complex domain to the Complicated domain: we've learned enough, future optimizations become predictable. Without production data, going further would be premature optimization.
Conclusion
This mission shows that a Probe → Sense → Respond approach delivers fast gains on a complex system. AI agents didn't replace expertise — it's still humans who observe and decide — but they accelerated each cycle. The faster you iterate, the faster you learn.
Key Takeaways
Measure before and after each change. Intuition isn't enough, and an optimization can make things worse.
Get your test scenario right. An unrealistic scenario produces misleading results.
Navigate across layers. Infra, code, and database form a system. Being able to read Kubernetes metrics, profile MongoDB, and refactor code in the same day is a decisive advantage.
Get the right tools. Without load testing, observability, and profiling tools, optimization is blind.
Use AI agents to accelerate. They don't replace expertise, but they amplify execution velocity.
Define an exit criterion. SLOs define when it's "good enough". Without them, you risk over-optimizing.
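One way to make the exit criterion executable (an illustration, not necessarily the mission's setup): encode the SLOs as k6 thresholds so the load test itself fails when a target is missed. The metric names below are standard k6 built-ins; the values mirror the release constraints above:

```typescript
// SLOs expressed as k6 thresholds; the test run fails if any is breached.
export const options = {
  thresholds: {
    http_req_duration: ['p(99)<1000'], // P99 latency under one second (ms)
    http_req_failed: ['rate<0.01'],    // error rate below 1%
    http_reqs: ['rate>200'],           // sustained throughput above 200 req/s
  },
};
```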
Bonus: a mature Platform Engineering setup remains a prerequisite for these gains to materialize.