I Fired My Entire Node.js Stack — Rust Rebuilt It in 3 Weeks (The Ugly Truth)
Our API was drowning under 850ms P99 latencies. I rewrote everything in Rust expecting miracles. Got 8ms response times and three months of hell I didn’t budget for.
Rust promised performance and delivered — but the transition cost included rewriting every assumption about how backend systems should work, from memory management to error handling.
Our billing API hit 200k requests per minute during peak hours. Node.js was handling it fine — until it wasn’t.
P99 latencies spiked to 850ms. Event loop blockages cascading through the system. Memory usage climbing 40% week-over-week despite zero traffic growth. I profiled everything. Optimized database queries. Threw more instances at it.
The core problem: garbage collection pauses during high-throughput operations were killing us.
I made the call: migrate to Rust. Three weeks, I told my team. We’ll rewrite the critical path, keep everything else in Node.
That timeline was… optimistic.
Why I Actually Cared (Beyond the Benchmarks)
We were bleeding $12k monthly on extra EC2 instances just to handle GC pauses. Our SLA promised 100ms P99s. We were hitting that maybe 92% of the time. Each breach risked contract penalties.
But the real issue: I couldn’t predict when the next spike would hit. Traffic patterns looked normal. Then suddenly — boom — event loop stalls for 300ms. Users see timeout errors. Support tickets flood in.
Node’s non-deterministic performance made capacity planning impossible. I was over-provisioning by 60% just to have headroom for unexpected GC pauses.
That’s not engineering. That’s gambling with infrastructure costs.
The Misconception That Broke First
I assumed Rust would be “Node.js with better performance.”
It’s not. It’s a completely different mental model.
Node lets you write async code that looks synchronous. Rust makes you explicitly handle every possible error state, every lifetime, every ownership transfer. You can’t just await somePromise() and move on. You have to prove to the compiler that your async future is Send, that the data will live long enough, that concurrent access is safe.
I spent the first week fighting the borrow checker on code that would’ve taken 20 minutes in TypeScript.
The compiler was right every time. That didn’t make it less frustrating.
What Rust Actually Gave Us (The Numbers)
After the full migration:
- P99 latency: 850ms → 8ms (yes, really)
- P50 latency: 45ms → 2ms
- Memory usage: 4GB per instance → 180MB
- Instance count: 32 nodes → 4 nodes
- Monthly compute cost: $12k → $900
The performance gains were absurd. Not 2x. Not 10x. In some endpoints, 100x faster.
But here’s what the benchmarks don’t tell you.
The Development Velocity Cliff
Our team shipped features in Node.js in days. Quick prototype, test in dev, deploy.
Rust development timeline for the same features: 3–4x longer.
Not because Rust is slow to write. Because it forces you to handle edge cases upfront that Node.js lets you ignore until production. Every possible error state needs explicit handling. Every data structure needs defined lifetimes. Every async operation needs careful consideration of Send/Sync bounds.
// Node.js version (works until it doesn't)
async function processPayment(userId, amount) {
  const user = await db.getUser(userId);
  const result = await stripe.charge(user.cardToken, amount);
  await db.updateBalance(userId, result.amount);
  return result;
}
// Rust version (verbose but bulletproof)
async fn process_payment(
    pool: &PgPool,
    stripe: &StripeClient,
    user_id: Uuid,
    amount: Decimal,
) -> Result<ChargeResult, PaymentError> {
    // `?` on the sqlx calls assumes PaymentError implements From<sqlx::Error>.
    let user = sqlx::query_as::<_, User>(
        "SELECT card_token FROM users WHERE id = $1"
    )
    .bind(user_id)
    .fetch_optional(pool)
    .await?
    // A missing row becomes an explicit error instead of a null dereference.
    .ok_or(PaymentError::UserNotFound)?;

    let result = stripe
        .charge(&user.card_token, amount)
        .await
        .map_err(PaymentError::StripeError)?;

    sqlx::query("UPDATE users SET balance = balance + $1 WHERE id = $2")
        .bind(result.amount)
        .bind(user_id)
        .execute(pool)
        .await?;

    Ok(result)
}
The Rust version is 3x longer. But it explicitly handles missing users, database failures, and Stripe errors. What it does not do is roll anything back: the charge and the balance update are separate steps, so a failed UPDATE after a successful charge still needs a transaction or compensating logic. The Node version? It throws on any of those, and you won’t find out until you hit them in prod.
This matters because: Node optimizes for speed of writing code. Rust optimizes for correctness. You pay upfront in dev time to avoid paying in production incidents.
The Moment I Realized We’d Miscalculated
Four weeks in, we had the core API working. Fast as hell. Rock solid. Ready to ship.
Then I looked at our monitoring stack. All JavaScript. The admin dashboard? React + Node backend. The analytics pipeline? Node consumers reading from Kafka. Our internal tools? All TypeScript.
We’d rewritten 20% of the codebase and created a Frankenstein system where Rust services talked to Node services through JSON APIs, losing half the performance gains to serialization overhead.
The real migration timeline wasn’t three weeks. It was six months to rebuild everything that touched the critical path.
I didn’t budget for that.
What Nobody Tells You About Async Rust
The async ecosystem is fragmented. Tokio vs async-std. Different HTTP clients (reqwest, hyper). Different database drivers. Not all libraries support async. Some block the executor.
We chose Tokio because it had the most mature ecosystem. Then discovered our chosen Postgres library (Diesel, an ORM rather than a driver) didn’t support async well. Switched to sqlx. Had to rewrite every database call.
Found an auth library we liked. Wasn’t Send-safe. Couldn’t use it in async handlers. Built our own.
The Node ecosystem has one event loop, one async model. Rust has… opinions. Many opinions. And you’ll accidentally mix them wrong and spend hours debugging why your futures deadlock.
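The Send problem with the auth library can be reproduced in a few lines. The following is a minimal sketch using only the standard library, not our actual handler code: Session, handle, require_send, and the tiny block_on executor are hypothetical names made up for illustration. The require_send function applies the same Send bound that a multi-threaded runtime places on spawned tasks.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Compile-time gate: accepts only futures that may move between threads,
/// the same kind of bound a multi-threaded executor puts on spawned tasks.
fn require_send<F: Future + Send>(fut: F) -> F {
    fut
}

/// Hypothetical session type standing in for an auth library's state.
struct Session {
    user: String,
}

/// A trivial await point.
async fn yield_point() {}

/// `Arc<Session>` is `Send`, so holding it across an `.await` keeps the
/// whole future `Send`. Swapping `Arc` for `Rc` here makes
/// `require_send(handle(..))` fail to compile.
async fn handle(session: Arc<Session>) -> usize {
    yield_point().await;
    session.user.len()
}

/// Waker that does nothing; enough for futures that never actually park.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, only so the sketch runs without a runtime.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut: Pin<Box<F>> = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Calling block_on(require_send(handle(Arc::new(Session { user: "alice".to_string() })))) compiles and returns the length of the user name. A library that stores Rc or RefCell state internally fails at the require_send step, which is the wall we hit.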
Compile Times Hit Different in Compiled Languages
Speaking of constraints, deployment in Rust is actually easier than Node. No dependency hell. No node_modules bloat. You ship a single binary. 18MB. That’s it.
But iteration speed? Compile times killed us. Small change in a core module? 90 seconds to recompile everything. In Node, it’s instant.
We set up incremental compilation, cargo workspaces, separated into smaller crates. Got it down to 30 seconds for typical changes. Still 30x slower than Node’s hot reload.
Actually, most people don’t realize this affects how you write code. In Node, I’d experiment freely — try something, see if it works, iterate. In Rust, I’d think harder before compiling because each test cycle cost me a minute.
That cognitive shift changed our development culture. More planning, less “let’s just try it.”
When Rust Actually Saves Money (And When It Doesn’t)
The infrastructure savings are real. We cut compute costs by 92%. But:
Hidden costs:
- Senior Rust developers: 30–40% higher salaries than Node devs
- Training existing team: 3 months to productivity
- Slower feature velocity: -60% for first 6 months
- Tooling gaps: had to build our own admin tools, no equivalent to NestJS or Express ecosystem richness
ROI calculation (12 months):
- Saved: $130k in compute
- Cost: $180k in additional dev time (2 senior Rust hires, training, slower shipping)
- Net first year: -$50k
Year two projections look better. Once the team is trained and core infrastructure stabilized, the compute savings compound while dev costs normalize.
But if you’re a startup iterating rapidly? The velocity hit might kill you before you see ROI.
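The first-year arithmetic above can be written as a small calculation. This is a rough model, not the spreadsheet we used: it takes the unrounded monthly compute figures from the numbers section ($12k before, $900 after) and treats the $180k as a one-time cost that does not recur, which is an assumption on top of "dev costs normalize". The function names are made up for illustration.

```rust
/// Net result of the migration after `months`, in whole dollars.
/// Positive means cumulative compute savings exceed the extra dev cost.
fn net_after(months: i64, monthly_before: i64, monthly_after: i64, extra_dev_cost: i64) -> i64 {
    (monthly_before - monthly_after) * months - extra_dev_cost
}

/// First month in which cumulative savings cover the extra dev cost.
/// Returns None when the new stack is not cheaper to run.
/// Assumes `extra_dev_cost` is paid once and does not recur.
fn break_even_month(monthly_before: i64, monthly_after: i64, extra_dev_cost: i64) -> Option<i64> {
    let monthly_saving = monthly_before - monthly_after;
    if monthly_saving <= 0 {
        return None;
    }
    // Integer ceiling division: round partial months up.
    Some((extra_dev_cost + monthly_saving - 1) / monthly_saving)
}
```

With net_after(12, 12_000, 900, 180_000) the first year comes out at -46,800, which matches the -$50k above once savings are rounded to $130k. Under the one-time-cost assumption, break_even_month(12_000, 900, 180_000) gives month 17. If the extra dev cost keeps recurring, the break-even moves out accordingly.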
The Debugging Moment That Changed Everything
Six weeks post-migration, we saw weird latency spikes. Not Node-level bad, but 8ms endpoints suddenly hitting 45ms randomly.
Profiled everything. Database was fine. Network fine. CPU usage low.
Then I checked: memory allocations.
We were using .clone() everywhere because fighting the borrow checker was hard. Each clone copies data. We were cloning entire request payloads, user objects, session data—sometimes 5-6 times per request.
Rust’s performance advantage comes from zero-copy operations. We’d turned it into a copying machine because we didn’t understand ownership patterns.
// What we were doing (bad)
fn process_request(data: RequestData) -> Response {
    let validated = validate_data(data.clone());
    let enriched = enrich_data(data.clone());
    let processed = process_data(data.clone());
    build_response(validated, enriched, processed)
}

// What we should've done (good)
fn process_request(data: RequestData) -> Response {
    let validated = validate_data(&data);
    let enriched = enrich_data(&data);
    let processed = process_data(&data);
    build_response(validated, enriched, processed)
}
Switching to references instead of clones cut latency by 70%. One character change (&) per call site, plus the matching change to each function’s parameter type.
This matters because: Rust gives you the tools for zero-copy performance. But the default instinct from garbage-collected languages is to copy everything. You have to unlearn that reflex.
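For a version of the borrowing pattern that compiles on its own, here is a minimal sketch. The RequestData fields, the Response shape, and the bodies of the three stages are invented for illustration. The real ones were larger, but the ownership structure is the same: every stage borrows the request and returns only the small value it derives.

```rust
/// Hypothetical request type; the field names are illustrative.
struct RequestData {
    user_id: u64,
    payload: Vec<u8>,
}

/// Hypothetical response assembled from the three stage results.
struct Response {
    valid: bool,
    label: String,
    non_zero_bytes: usize,
}

/// Borrows the request; nothing is copied.
fn validate_data(data: &RequestData) -> bool {
    !data.payload.is_empty()
}

/// Allocates only the small derived string, not a copy of the payload.
fn enrich_data(data: &RequestData) -> String {
    format!("user-{}", data.user_id)
}

/// Reads the payload in place through the shared reference.
fn process_data(data: &RequestData) -> usize {
    data.payload.iter().filter(|b| **b != 0).count()
}

/// Takes a reference too, so the caller keeps ownership of the request.
fn process_request(data: &RequestData) -> Response {
    Response {
        valid: validate_data(data),
        label: enrich_data(data),
        non_zero_bytes: process_data(data),
    }
}
```

The payload buffer is allocated once by the caller and never duplicated, however many stages read it. Taking &RequestData in process_request is a design choice in this sketch: it lets the caller reuse the request afterwards, for logging or retries, without a clone.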
The One Gotcha That Cost Four Hours
Error handling in Rust uses Result<T, E>. Seems simple. Then you try to return errors from different libraries.
Database returns sqlx::Error. HTTP client returns reqwest::Error. Your business logic returns custom errors. You can't just put ? after every call, because they're different types and ? only works when the error converts (via From) into your function's error type.
Solutions:
- Box all errors: Result<T, Box<dyn std::error::Error>> (slow, type-erased)
- Use thiserror crate to define error enum (better, more boilerplate)
- Use anyhow for quick prototyping (loses type safety)
I chose approach 1 initially. Didn’t realize boxing errors allocates heap memory for every error. Under load, error paths became slow paths.
Refactored to approach 2. Defined proper error enums. Error handling got fast again.
Nobody warns you: Rust’s type system is so strict that even error handling has performance implications.
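Approach 2 looks roughly like the sketch below. It is hand-written with only the standard library so it compiles without dependencies. The thiserror derive generates the Display and From impls for you from attributes. The two wrapped error types (Utf8Error and ParseIntError) are stand-ins for sqlx::Error and reqwest::Error, and parse_amount and the variant names are hypothetical.

```rust
use std::fmt;
use std::num::ParseIntError;
use std::str::Utf8Error;

/// One enum covering every failure this code path can produce.
/// No boxing: the error is a plain value.
#[derive(Debug)]
enum PaymentError {
    BadEncoding(Utf8Error),   // stand-in for a database error
    BadAmount(ParseIntError), // stand-in for an HTTP client error
    AmountTooLarge(u64),      // business-logic error
}

impl fmt::Display for PaymentError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PaymentError::BadEncoding(e) => write!(f, "invalid encoding: {e}"),
            PaymentError::BadAmount(e) => write!(f, "invalid amount: {e}"),
            PaymentError::AmountTooLarge(c) => write!(f, "amount too large: {c} cents"),
        }
    }
}

impl std::error::Error for PaymentError {}

// These From impls are what let `?` convert library errors automatically.
impl From<Utf8Error> for PaymentError {
    fn from(e: Utf8Error) -> Self {
        PaymentError::BadEncoding(e)
    }
}

impl From<ParseIntError> for PaymentError {
    fn from(e: ParseIntError) -> Self {
        PaymentError::BadAmount(e)
    }
}

/// Parses an amount in cents from raw bytes, mixing three error sources.
fn parse_amount(raw: &[u8]) -> Result<u64, PaymentError> {
    let text = std::str::from_utf8(raw)?; // Utf8Error -> PaymentError
    let cents: u64 = text.trim().parse()?; // ParseIntError -> PaymentError
    if cents > 1_000_000 {
        return Err(PaymentError::AmountTooLarge(cents));
    }
    Ok(cents)
}
```

Callers can match on the variant to decide what to do (retry, return a 4xx, page someone), which is the type safety the boxed version gives up.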
What The Benchmarks Hide
Every Rust vs Node comparison shows throughput graphs. Requests per second. Latency percentiles.
Nobody shows:
- Time to implement auth middleware: Node (2 hours) vs Rust (2 days)
- Time to add a new endpoint: Node (30 min) vs Rust (3 hours)
- Time to debug a production issue: Node (logs + REPL) vs Rust (recompile with debug symbols, attach debugger)
- Time to onboard a new developer: Node (1 week) vs Rust (2 months)
The raw performance wins are real. The productivity costs are also real.
If your bottleneck is compute, Rust wins. If your bottleneck is engineering time, Node might still win.
When I Wasn’t Sure Until…
Three months post-migration, during a traffic surge (Black Friday equivalent), I watched our monitoring.
Old Node stack would’ve needed 60+ instances, cost us $800 that day in extra capacity, probably still hit some SLA breaches.
Rust stack: 4 instances. CPU never exceeded 40%. Latencies stayed flat at 8ms. Cost: $75.
That’s when I knew the pain was worth it.
Not because Rust is always better. Because for our specific problem — high-throughput API serving under unpredictable load — the performance characteristics were exactly what we needed.
The Use Cases Where This Makes Sense
Migrate to Rust when:
- Compute costs exceed developer costs
- You have predictable, stable product requirements
- Performance directly impacts business metrics (SLAs, user retention)
- Your team can absorb 3–6 months of reduced velocity
- You’re hitting physical limits of Node’s event loop
Stay in Node when:
- You’re pre-product-market fit and need to iterate fast
- Your bottleneck is database or network, not CPU
- Your team is small (<5 engineers)
- You’re mostly doing CRUD operations
- Compute costs are <$5k/month
The decision isn’t technical. It’s economic.
Try This Today
Profile your Node app’s event loop under realistic load. Not synthetic benchmarks — actual production traffic patterns.
const { performance } = require('perf_hooks');

setInterval(() => {
  const start = performance.now();
  setImmediate(() => {
    const lag = performance.now() - start;
    if (lag > 10) console.warn(`Event loop lag: ${lag}ms`);
  });
}, 1000);
Run that for a week. If you’re seeing consistent lag >50ms during normal operation, you might have a GC problem. If lag is <10ms, your bottleneck is elsewhere — probably database or external APIs.
Don’t migrate to Rust because it’s trendy. Migrate because you’ve measured that your specific bottleneck is CPU-bound async operations with GC overhead.
Most apps don’t need Rust. Ours did. The migration delivered exactly what we needed: predictable, low-latency performance at 1/10th the infrastructure cost.
But we paid for it in development time, team training, ecosystem limitations, and six months of slower feature delivery.
That’s the ugly truth about Rust migrations. The performance gains are real. The costs are also real. Run your own numbers before making the call.
And if you do migrate? Budget triple the time you think it’ll take. You’ll need it.