How RAG Applications in Rust Achieve Sub-Second Response Times

A Technical Deep Dive into Building Production-Grade AI Systems

When I set out to build Tagnovate, an AI-powered NFC hospitality solution, I faced a challenge that plagues many AI startups: how do you deliver intelligent, context-aware responses fast enough that users don't notice the AI is thinking? The answer, I discovered, lies in an unlikely pairing: Retrieval Augmented Generation (RAG) and Rust.

Most RAG implementations today are built in Python, leveraging frameworks like LangChain or LlamaIndex. These work brilliantly for prototypes and many production systems. But when you're processing 10,000+ daily transactions for enterprise clients like Hilton Garden Inn, every millisecond counts. Here's how we achieved sub-100ms response times by rebuilding our RAG pipeline in Rust.

The Problem with Python-Based RAG

Python's Global Interpreter Lock (GIL) creates a fundamental bottleneck for concurrent AI workloads, and unpredictable garbage-collection pauses pile latency on top. When a hotel guest taps an NFC card expecting instant information about room service, pool hours, or local recommendations, they shouldn't wait 2-3 seconds while the runtime decides it's time for cleanup.

Our initial Python implementation averaged 800-1200ms per query. Acceptable for a demo, unacceptable for production. The breakdown looked something like this:

Component                | Latency
-------------------------|-----------
Embedding generation     | 300-400ms
Vector similarity search | 200-300ms
LLM inference            | 300-500ms
Python overhead          | 100-200ms

That last row, pure Python inefficiency, was the catalyst for change.

Why Rust Changes Everything

Rust offers zero-cost abstractions, no garbage collection pauses, and fearless concurrency. For AI workloads, this translates to predictable, consistent latency. No sudden 50ms spikes because the runtime decided to clean up memory.

The Candle framework from Hugging Face became our foundation. Unlike PyTorch, Candle runs inference natively, with no Python interpreter in the loop. Combined with the ONNX Runtime bindings for Rust, we could execute our Gemma-3 embedding model directly on CPU, using SIMD-optimised kernels without the per-call interpreter overhead a Python-hosted stack carries.
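
To make that concrete, here is a minimal sketch of loading an embedding model on CPU with Candle, following the pattern from the candle-transformers examples. The BERT module stands in for whichever embedding model you export, and the file paths are placeholders, not our actual deployment layout.

use candle_core::{DType, Device};
use candle_nn::VarBuilder;
use candle_transformers::models::bert::{BertModel, Config};

fn load_embedding_model() -> anyhow::Result<(BertModel, Device)> {
    // Everything stays on CPU; Candle dispatches to SIMD-enabled kernels where available.
    let device = Device::Cpu;

    // Placeholder paths: point these at the exported config and safetensors weights.
    let config: Config = serde_json::from_str(&std::fs::read_to_string("model/config.json")?)?;
    let vb = unsafe {
        VarBuilder::from_mmaped_safetensors(&["model/model.safetensors"], DType::F32, &device)?
    };

    let model = BertModel::load(vb, &config)?;
    Ok((model, device))
}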

Architecture That Delivers

Our production architecture for Tagnovate and MenuGo follows a three-stage pipeline:

Stage 1: Embedding Generation (15-25ms)

We use a quantised Gemma-3 model running through Candle. The key optimisation here is batching: when multiple queries arrive within a 10ms window, we batch them together. Rust's async runtime (Tokio) makes this trivial to implement without callback hell.

// Simplified batching logic: `model`, `tokenize`, and the types are the
// placeholder handles from the surrounding text, not real library items.
async fn process_embeddings(model: &EmbeddingModel, queries: Vec<String>) -> Vec<Embedding> {
    // Tokenise every query up front so the model sees one padded batch.
    let batch = queries.iter()
        .map(|q| tokenize(q))
        .collect::<Vec<_>>();

    // A single batched forward pass amortises per-call overhead across the window.
    model.forward_batch(&batch).await
}

Stage 2: Vector Search (8-15ms)

PostgreSQL with pgvector handles our similarity search. The critical insight: pre-computing IVF indexes during document ingestion pays dividends at query time. We maintain separate indexes for different content types (menus, FAQs, local recommendations), allowing us to scope searches and reduce the candidate set.

-- Scoped similarity search
SELECT content, embedding <=> $1 as distance
FROM documents
WHERE content_type = 'menu'
ORDER BY embedding <=> $1
LIMIT 5;
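
The ingestion-time index setup looks roughly like the sketch below. It assumes sqlx for database access and a documents table shaped like the query above; the index names, the lists parameter, and the content types are illustrative, not our production values.

use sqlx::PgPool;

// One partial IVFFlat index per content type keeps each scoped search over a
// small candidate set. IVFFlat clusters are derived from rows already in the
// table, so this runs after bulk ingestion rather than on an empty table.
async fn create_scoped_indexes(pool: &PgPool) -> Result<(), sqlx::Error> {
    for content_type in ["menu", "faq", "local_recommendation"] {
        let ddl = format!(
            "CREATE INDEX IF NOT EXISTS idx_documents_{ct}_embedding
             ON documents USING ivfflat (embedding vector_cosine_ops)
             WITH (lists = 100)
             WHERE content_type = '{ct}'",
            ct = content_type
        );
        sqlx::query(&ddl).execute(pool).await?;
    }
    Ok(())
}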

Stage 3: Response Generation (40-60ms)

Here's where most systems bottleneck. We use a distilled LLaMA model optimised for hospitality queries. The model runs on-device for common queries (cached responses) and falls back to API calls only for novel questions. This hybrid approach means 70% of queries never leave our infrastructure.
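
The routing decision itself is simple. The sketch below is an illustration rather than our production code: the cache structure, the query key, and the call_llm_api helper are assumed stand-ins.

use std::collections::HashMap;

// Simplified hybrid router: common queries are answered from a local cache of
// pre-generated responses; everything else falls back to the external LLM API.
struct ResponseCache {
    // Canonical query key -> pre-generated answer.
    answers: HashMap<String, String>,
}

async fn answer(cache: &ResponseCache, query_key: &str, prompt: String) -> anyhow::Result<String> {
    // Cache hit: the response never leaves our infrastructure.
    if let Some(cached) = cache.answers.get(query_key) {
        return Ok(cached.clone());
    }

    // Cache miss: only novel questions pay for the remote call.
    call_llm_api(prompt).await
}

async fn call_llm_api(_prompt: String) -> anyhow::Result<String> {
    // Stand-in for the external inference request.
    todo!("send the prompt to the hosted LLM endpoint")
}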

Real-World Results

After migrating from Python to Rust, our metrics transformed:

Metric       | Python   | Rust        | Improvement
-------------|----------|-------------|---------------
P50 latency  | 850ms    | 65ms        | 13x faster
P99 latency  | 2.1s     | 180ms       | 11x faster
Memory usage | Baseline | -60%        | 60% reduction
CPU variance | High     | Predictable | Stable

For our hospitality clients, this translates to guests who actually use the AI features. When responses feel instant, engagement increases. When there's a noticeable delay, guests revert to calling the front desk—defeating the purpose of the technology.

The Code: Key Patterns

Here's a simplified version of our embedding service:

use anyhow::Result;
use candle_core::{Device, Tensor};
use candle_transformers::models::bert::BertModel;
use tokio::sync::{mpsc, oneshot};
use tokio::time::{Duration, Instant};

// A single in-flight request: the text to embed plus a one-shot channel
// that carries the result back to the caller.
struct EmbeddingRequest {
    text: String,
    response: oneshot::Sender<Result<Vec<f32>>>,
}

pub struct EmbeddingService {
    model: BertModel,
    device: Device,
    batch_sender: mpsc::Sender<EmbeddingRequest>,
}

impl EmbeddingService {
    pub async fn embed(&self, text: &str) -> Result<Vec<f32>> {
        let (response_tx, response_rx) = oneshot::channel();

        self.batch_sender
            .send(EmbeddingRequest {
                text: text.to_string(),
                response: response_tx,
            })
            .await
            .map_err(|_| anyhow::anyhow!("embedding batch worker has shut down"))?;

        // The worker replies through the one-shot channel once this
        // request's batch has gone through the model.
        response_rx.await?
    }

    async fn batch_processor(
        mut receiver: mpsc::Receiver<EmbeddingRequest>,
        model: &BertModel,
    ) {
        let mut batch = Vec::new();
        let mut deadline = Instant::now() + Duration::from_millis(10);

        loop {
            tokio::select! {
                Some(request) = receiver.recv() => {
                    batch.push(request);
                    // Flush early if the batch fills before the window closes.
                    if batch.len() >= 32 {
                        process_batch(&mut batch, model).await;
                        deadline = Instant::now() + Duration::from_millis(10);
                    }
                }
                _ = tokio::time::sleep_until(deadline) => {
                    // The 10ms window elapsed: flush whatever has accumulated.
                    if !batch.is_empty() {
                        process_batch(&mut batch, model).await;
                    }
                    deadline = Instant::now() + Duration::from_millis(10);
                }
            }
        }
    }
}

// `process_batch` (elided) tokenises the batch, runs a single forward pass on
// the model, and sends each embedding back through its request's oneshot channel.

Lessons for Other Builders

Should you rewrite your RAG system in Rust? Not necessarily. The Python ecosystem is mature, well-documented, and perfectly adequate for many use cases. But if you're hitting scale where latency matters—where every 100ms of delay costs you users—Rust deserves serious consideration.

The learning curve is real. Rust's ownership model will challenge you. But the payoff is production systems that perform consistently, scale predictably, and don't wake you up at 3am because garbage collection decided to take a holiday.

At Tagnovate, we're now processing over 10,000 daily AI interactions with 99.9% uptime. The technology works. The question is whether you're ready to invest in building it right.


Key Takeaways

  1. Python's GIL is a real bottleneck for concurrent AI workloads at scale
  2. Rust's Candle framework provides PyTorch-like ergonomics without Python overhead
  3. Batching requests dramatically improves throughput for embedding generation
  4. Hybrid architectures (on-device + API fallback) reduce latency for common queries
  5. The investment pays off when reliability and performance are product requirements

Want to see Rust-powered AI in action? Check out Tagnovate for hospitality AI or MenuGo for digital menu solutions.


About the Author

Mayuresh Shitole is the Founder and CTO of AmbiCube Pvt Ltd, where he builds AI-powered solutions for the hospitality industry. Winner of the TigerData Agentic Postgres Challenge, he holds an MSc in Computer Engineering from the University of Essex. Connect on LinkedIn or follow his technical writing on DEV.to.

