Introduction
It all started one evening... I was tinkering with rewriting the front-end of a marketplace from React to Preact, using Brotli compression and native CSS, just to test out some extreme optimizations. In my quest for maximum performance and speed, I decided to experiment with porting the back-end to Rust, including compressing the database into Redis—but that's a story for another time. Anyway, these experiments led me to the idea of building an SSR engine on Rust, and benchmarks showed me hitting 95,000+ RPS on an M4 chip. That's pretty decent in itself; I'll dive into the details below.
Architecture of Rusty-SSR
Rust gives you more flexibility in managing threads and memory. At the core of Rusty-SSR is a pool of V8 isolates, thread pinning to cores, and a multi-tier caching system.
- V8 Isolates Pool for Multithreading. Instead of separate OS processes, we use lightweight V8 isolates within a single Rust process, one per thread.
let pool = V8Pool::new(V8PoolConfig {
    num_threads: num_cpus::get(), // use all cores
    queue_capacity: 512,          // bounded queue for backpressure
    ..Default::default()
});
This avoids blocking: if one isolate is busy, others keep handling requests.
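The dispatch model behind this can be sketched with the standard library alone. The names below (spawn_pool, Job) are my own, and a closure stands in for the V8 isolate each worker would own; the bounded channel is what produces the backpressure described above.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// A render job: the URL to render plus a channel to send the HTML back on.
type Job = (String, mpsc::Sender<String>);

/// Spawn `num_threads` workers sharing one bounded job queue.
/// When the queue is full, senders block (or fail fast with `try_send`),
/// which is exactly the backpressure behaviour described above.
fn spawn_pool(num_threads: usize, queue_capacity: usize) -> mpsc::SyncSender<Job> {
    let (tx, rx) = mpsc::sync_channel::<Job>(queue_capacity);
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..num_threads {
        let rx = Arc::clone(&rx);
        thread::spawn(move || loop {
            // The guard is dropped at the end of this statement, so other
            // workers can pull jobs while this one is busy rendering.
            let job = rx.lock().unwrap().recv();
            match job {
                Ok((url, reply)) => {
                    // A real worker would run its V8 isolate here.
                    let _ = reply.send(format!("<html>rendered {url}</html>"));
                }
                Err(_) => break, // pool dropped, shut the worker down
            }
        });
    }
    tx
}

fn main() {
    let pool = spawn_pool(4, 512);
    let (reply_tx, reply_rx) = mpsc::channel();
    pool.send(("/home".to_string(), reply_tx)).unwrap();
    let html = reply_rx.recv().unwrap();
    assert!(html.contains("/home"));
    println!("ok");
}
```

If one worker is stuck on a slow render, the other workers keep draining the queue, which is the non-blocking property the article describes.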
- Thread Pinning to Cores. Context switches can kill performance. To minimize them, each thread is pinned to a specific CPU core.
if let Some(core_id) = cores.get(idx) {
    if core_affinity::set_for_current(*core_id) {
        tracing::debug!("Worker {} pinned to core {:?}", id, core_id.id);
    }
}
This keeps the processor cache (L1/L2) hot. In the cloud, results may vary, so profiling is recommended.
- Multi-Tier Caching. Caching reduces how often pages have to be re-rendered at all. Instead of a simple HashMap behind a lock, it's a two-level setup:
Hot Cache (L1): Thread-local for instant access without synchronization.
Cold Cache (L2): DashMap for shared access across threads.
Cache size is set in elements (pages), TTL in seconds (e.g., cache_ttl_secs(300)). Metrics are available via engine.cache_metrics() (hit-rate, hot/cold hits, etc.).
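The lookup path for the two tiers can be sketched like this. The function names are hypothetical, and a Mutex-wrapped HashMap stands in for DashMap so the example stays std-only: check the thread-local L1 first, fall back to the shared L2, and promote the entry on a hit.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

thread_local! {
    // L1 (hot): per-thread, so the fast path takes no locks at all.
    static HOT: RefCell<HashMap<String, Arc<str>>> = RefCell::new(HashMap::new());
}

// L2 (cold): shared across threads (DashMap in the real engine).
fn cold() -> &'static Mutex<HashMap<String, Arc<str>>> {
    static COLD: OnceLock<Mutex<HashMap<String, Arc<str>>>> = OnceLock::new();
    COLD.get_or_init(|| Mutex::new(HashMap::new()))
}

fn cache_get(url: &str) -> Option<Arc<str>> {
    // 1. Thread-local hot cache: instant, no synchronization.
    if let Some(html) = HOT.with(|h| h.borrow().get(url).cloned()) {
        return Some(html);
    }
    // 2. Shared cold cache: on a hit, promote the entry into L1
    //    so the next lookup on this thread is lock-free.
    let html = cold().lock().unwrap().get(url).cloned()?;
    HOT.with(|h| h.borrow_mut().insert(url.to_string(), Arc::clone(&html)));
    Some(html)
}

fn cache_put(url: &str, html: &str) {
    cold().lock().unwrap().insert(url.to_string(), Arc::from(html));
}

fn main() {
    cache_put("/a", "<html>a</html>");
    assert!(cache_get("/a").is_some()); // L2 hit, promoted into L1
    assert!(cache_get("/a").is_some()); // now served from L1
    assert!(cache_get("/missing").is_none());
    println!("ok");
}
```

The trade-off of a thread-local L1 is duplication (each thread may hold its own copy of a hot page), which is why entries are shared as `Arc<str>` rather than owned strings.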
Data Prefetching
To speed things up further, SSE prefetch instructions pull data into the CPU cache ahead of time, like warming up your coffee in advance so you don't wait.
fn prefetch_data(data: &str) {
    // core::arch::x86_64 only exists on x86_64, so gate on that target alone;
    // on other architectures this function compiles to a no-op.
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        // The prefetch hint is a const generic parameter in current Rust.
        _mm_prefetch::<_MM_HINT_T0>(data.as_ptr() as *const i8);
    }
}
Internal Structure of Hot Cache
The Hot Cache is split into an ultra-hot array (8 elements for super-fast access) and a HashMap (128 elements). Entries are promoted via LRU.
#[repr(align(64))] // align to a cache line
pub struct HotCache {
    ultra_hot: [Option<Entry>; 8],   // 8 slots for super-fast access
    hot_map: HashMap<Arc<str>, Entry>, // up to 128 entries
    // ...
}
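The promotion path might look roughly like the following. This is a sketch under my own assumptions (the entry types are guesses, and a simple rotating slot replaces the full LRU policy for brevity): a hit in the map moves the entry into the fixed 8-slot array, where subsequent lookups are a short linear scan.

```rust
use std::collections::HashMap;
use std::sync::Arc;

#[repr(align(64))] // align to a cache line to keep the hot slots together
struct HotCache {
    ultra_hot: [Option<(String, Arc<str>)>; 8], // checked first, linear scan
    hot_map: HashMap<String, Arc<str>>,         // larger second level
    next_slot: usize, // rotating replacement slot (stand-in for real LRU)
}

impl HotCache {
    fn new() -> Self {
        Self {
            ultra_hot: std::array::from_fn(|_| None),
            hot_map: HashMap::new(),
            next_slot: 0,
        }
    }

    fn get(&mut self, url: &str) -> Option<Arc<str>> {
        // 1. Eight slots fit in a couple of cache lines: scanning them
        //    is cheaper than a HashMap probe for the hottest pages.
        for slot in self.ultra_hot.iter().flatten() {
            if slot.0 == url {
                return Some(Arc::clone(&slot.1));
            }
        }
        // 2. Fall back to the map and promote the entry on a hit.
        let html = self.hot_map.get(url).cloned()?;
        self.ultra_hot[self.next_slot] = Some((url.to_string(), Arc::clone(&html)));
        self.next_slot = (self.next_slot + 1) % 8;
        Some(html)
    }

    fn insert(&mut self, url: &str, html: Arc<str>) {
        self.hot_map.insert(url.to_string(), html);
    }
}

fn main() {
    let mut cache = HotCache::new();
    cache.insert("/a", Arc::from("<html>a</html>"));
    assert!(cache.get("/a").is_some()); // map hit, promoted into ultra_hot
    assert!(cache.get("/a").is_some()); // now served from the 8-slot array
    println!("ok");
}
```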
Zero-Copy with Arc
Rendered HTML is stored as Arc<str> to avoid copying it between threads.
let html: Arc<str> = Arc::from(rendered_html.as_str());
cache.insert(url, Arc::clone(&html)); // only the Arc is cloned, not the HTML
This saves memory for large pages.
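A quick way to see the zero-copy property is to check that the clone points at the very same allocation (the helper name `cache_clone` below is just for illustration):

```rust
use std::sync::Arc;

// Clone only the pointer and refcount; the HTML bytes are never copied.
fn cache_clone(html: &Arc<str>) -> Arc<str> {
    Arc::clone(html)
}

fn main() {
    let rendered_html = String::from("<html><body>hello</body></html>");
    // One copy of the bytes into the Arc allocation, done once at render time.
    let html: Arc<str> = Arc::from(rendered_html.as_str());
    let cached = cache_clone(&html); // bumps the refcount, copies no bytes
    assert!(Arc::ptr_eq(&html, &cached)); // same allocation
    assert_eq!(Arc::strong_count(&html), 2);
    println!("ok");
}
```

So a 200 KB page held by ten worker threads still occupies 200 KB, plus a refcount.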
DashMap Optimization
The Cold Cache uses DashMap with 128 shards to reduce lock contention across threads. Testing showed a +19% throughput boost over the default 16 shards. Here's a breakdown of the results:
- 16 shards (default): 51M elem/s (baseline)
- 32 shards: 57M elem/s (+12%)
- 64 shards: 59M elem/s (+16%)
- 128 shards: 60.6M elem/s (+19%)
- 256 shards: 60.3M elem/s (+18%)
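Sharding works by hashing the key to pick one of N independently locked maps, so two threads touching different shards never contend on the same lock. A minimal std-only sketch of the idea (DashMap does this internally, with reader-writer locks rather than plain mutexes):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

struct ShardedMap {
    shards: Vec<Mutex<HashMap<String, String>>>,
}

impl ShardedMap {
    fn new(num_shards: usize) -> Self {
        Self {
            shards: (0..num_shards).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Hash the key once to choose a shard; only that shard's lock is taken,
    // so operations on different shards proceed fully in parallel.
    fn shard_index(&self, key: &str) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    fn insert(&self, key: &str, value: &str) {
        let idx = self.shard_index(key);
        self.shards[idx].lock().unwrap().insert(key.to_string(), value.to_string());
    }

    fn get(&self, key: &str) -> Option<String> {
        let idx = self.shard_index(key);
        self.shards[idx].lock().unwrap().get(key).cloned()
    }
}

fn main() {
    let map = ShardedMap::new(128);
    map.insert("/page", "<html>cached</html>");
    assert_eq!(map.get("/page").as_deref(), Some("<html>cached</html>"));
    assert!(map.get("/other").is_none());
    println!("ok");
}
```

The diminishing returns past 128 shards in the table above make sense: once shards outnumber threads by a wide margin, collisions on a lock are already rare and extra shards just cost memory.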
Reliability and Production Readiness
- A queue timeout (request_timeout) prevents deadlocks.
- Error handling for bundle loading.
- Full cache clearing, including thread-local caches.
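The queue timeout can be sketched with `recv_timeout` (the function name `await_render` is hypothetical): instead of blocking forever on a stuck worker, the caller gets an error it can turn into a 503 response.

```rust
use std::sync::mpsc;
use std::time::Duration;

// Wait for a rendered page, but never longer than `request_timeout`.
fn await_render(
    rx: &mpsc::Receiver<String>,
    request_timeout: Duration,
) -> Result<String, String> {
    rx.recv_timeout(request_timeout)
        .map_err(|_| "render timed out; return 503 instead of hanging".to_string())
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Fast path: the worker replies in time.
    tx.send("<html>ok</html>".to_string()).unwrap();
    assert!(await_render(&rx, Duration::from_millis(50)).is_ok());
    // Slow path: nobody replies, so we fail fast instead of deadlocking.
    assert!(await_render(&rx, Duration::from_millis(50)).is_err());
    println!("ok");
}
```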
Benchmarks
Tests ran on an Apple M4 (10 cores) using wrk --latency -t10 -c400/1000 -d30s on loopback, with the demo HTML from the repo and a warmed cache. Key metrics:
- Throughput: 95,363 req/s
- Latency p50: 0.46 ms (median)
- Latency p99: 4.60 ms (under load)
I'm currently using this setup for my portfolio at https://portfolio-production-b677.up.railway.app/. It's still rough around the edges and mostly desktop-oriented, but it doubles as a real-world benchmark: despite heavy content like animations and Three.js, it loads almost instantly. The portfolio runs on the cheapest Redis plan.
In real-world scenarios, performance depends on network, databases, and browsers. But even modest improvements can cut infrastructure costs, which is good for the environment at least :)
Conclusion
Rust provides tools for building efficient web servers. This is my experience, which might be useful to others. The code is open under MIT. If you try it out, share your thoughts in the comments—I'd love to hear feedback.
Links
GitHub Repository https://github.com/babasha/Rusty-SSR