DEV Community

Parikalp Bhardwaj

Building a RAG System in Rust with Qdrant, Rig, and gRPC 🦀

With Qdrant, Rig, Tonic, and a healthy obsession with what's actually happening underneath.

Why I Built This

A few weeks ago I came across Our First Production-Ready RAG Dev Journey in Pure Rust by the rust-dd team, and something clicked.

Reading it, I realized two things:

  1. Rust is the right language for building AI systems: fast, safe, and built for the kind of infrastructure work AI actually needs.
  2. I'd been wanting to build something like this myself for months and kept finding excuses not to start.

So I started. This post is what came out of it. 🦀


Most RAG tutorials hand you a framework, four function calls, and a working demo. You upload some documents, vectors get generated somewhere, an LLM answers your questions, and you walk away with a chatbot but no real understanding of what just happened.

This post takes the opposite approach. We'll build a small but complete RAG system in Rust, using Qdrant for vector search, Rig as the AI framework, Tonic for the gRPC API, and a terminal chat client, and explain why every piece exists. By the end you'll understand:

  • 🧠 Why embeddings work and how semantic retrieval actually functions
  • 🗄️ Why vector databases exist and what problem ANN indexing solves
  • ⚡ Why async runtimes matter for retrieval pipelines
  • 🦀 Why Rust is becoming genuinely interesting for AI infrastructure
  • 🔌 How gRPC fits into modern AI service architectures

The full source is at github.com/Parikalp-Bhardwaj/qrag-rust. Clone it and read along.


🤔 Why RAG exists

LLMs are powerful, but they don't actually know anything about your data. They generate from patterns learned during training, which means:

  • 🚫 Hallucinations when asked about specifics they never saw
  • 📅 Outdated information: anything after the cutoff is invisible
  • 🔒 No access to private knowledge: your docs, your codebase, your wiki
  • 📏 Limited context windows: you can't paste everything in

A model trained in 2023 can't tell you what your internal API does or what changed in last week's release. That's not a model problem; it's a retrieval problem.

RAG fixes it by inserting a retrieval step before generation:

query
  → retrieve relevant context from your data
    → inject context into the prompt
      → generate a grounded response

This shifts the engineering focus. The LLM becomes one component among several, and the quality of the system depends on chunking strategy, embedding quality, retrieval accuracy, indexing, and latency. That's the territory most tutorials skip. We're going to live there.


🧩 The building blocks

Four pieces do most of the work in this project.

🔎 Qdrant: the vector database

In a normal database you search by exact values:

SELECT * FROM documents WHERE title = 'Rust';

That works for keywords. It doesn't work for meaning. Consider:

Query: "How does Rust prevent race conditions?"

Relevant doc: "Rust provides memory safety and fearless concurrency through ownership and borrowing."

Apart from the word "Rust", there is almost no word overlap, so a SQL LIKE on "race conditions" won't find this. But a human reading both knows they're talking about the same thing.

Qdrant stores documents as vectors: numerical fingerprints of meaning. Two semantically similar texts produce two vectors that sit close together in high-dimensional space. Search becomes "find the vectors nearest to my query vector." A Qdrant point looks like:

Point {
  id: "doc-1",
  vector: [0.12, -0.44, 0.91, ...],   // 1536 dimensions
  payload: {
    "file_path": "rust-notes.md",
    "chunk_index": 3,
    "text": "Rust ownership prevents memory bugs..."
  }
}

Why a dedicated database? Naively, retrieval means comparing the query vector against every stored vector, which is O(n) per query. Fine for 100 documents, a disaster for 10 million. Qdrant uses Approximate Nearest Neighbor (ANN) indexing (HNSW under the hood) to find near matches without scanning everything. You trade a sliver of accuracy for a dramatic speedup. That's the entire reason vector databases exist as a category.
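To make the O(n) baseline concrete, here is a minimal std-only sketch of the exhaustive search that an ANN index is designed to avoid. The function names and the toy 3-dimensional vectors are made up for illustration; this is not how Qdrant works internally, just the naive approach it replaces.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must share a dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Exhaustive search: score every stored vector, sort, keep the best `k`.
/// Every query touches all `stored.len()` vectors, hence O(n) per query.
fn brute_force_top_k(query: &[f32], stored: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = stored
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_similarity(query, v)))
        .collect();
    // Highest similarity first; total_cmp gives a total order on f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}

fn main() {
    // Toy "embeddings"; real ones have 1536 dimensions.
    let stored: Vec<Vec<f32>> = vec![
        vec![1.0, 0.0, 0.0],
        vec![0.9, 0.1, 0.0],
        vec![0.0, 1.0, 0.0],
    ];
    let query: [f32; 3] = [1.0, 0.0, 0.0];
    for (idx, score) in brute_force_top_k(&query, &stored, 2) {
        println!("vector {idx}: similarity {score:.3}");
    }
}
```

With three vectors this is instant. The cost only shows up when `stored` holds millions of entries, which is where an index like HNSW earns its keep.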

🦀 Rig: the AI application layer

Rig is a Rust framework for LLM apps. It handles the boring, provider-specific parts: embedding APIs, completion APIs, vector store glue, agent construction. Without Rig we'd be hand-writing HTTP clients for OpenRouter.

⚠️ Honest caveat: Rig reduces boilerplate, but it doesn't think for you. Chunking strategy, retrieval ranking, and prompt construction are still your problem, and they're what actually determine answer quality.

🚀 gRPC + Tonic: the service layer

gRPC is a high-performance RPC framework. Instead of POST /chat with JSON, you define services in Protocol Buffers and get strongly-typed clients and servers in any supported language. Tonic is the Rust implementation.

Why gRPC over REST? Two reasons:

  1. Typed contracts. The .proto file is the single source of truth. Client and server code are both generated from it, so the two sides are far less likely to drift apart.
  2. Backend-to-backend fit. Real AI infrastructure looks like frontend → API gateway → RAG service → vector DB → LLM provider. gRPC is built for the internal hops.

⚡ Tokio: the async runtime

Every network call here is async: Qdrant queries, OpenRouter embeddings, LLM completions. Tokio lets thousands of them run concurrently on a small thread pool without us managing threads by hand. It's the quiet substrate underneath everything else.


🏗️ Architecture at a glance

The system has two phases that share one RagEngine.

Indexing (runs once, or whenever docs change):

./docs → load → chunk → embed (OpenRouter) → store (Qdrant)

Query (runs every time someone asks):

question → embed → search Qdrant → top-k chunks → prompt → LLM → answer

Both sides use the same embedding model; that's what makes the geometry work. You can't embed documents with one model and queries with another and expect distances to mean anything.


📂 Project layout

qrag-rust/
├── Cargo.toml               # crate metadata + dependencies
├── build.rs                 # compiles .proto → Rust at build time
├── docker-compose.yaml      # Qdrant container
├── .env.example             # template for secrets
├── proto/
│   └── rag.proto            # gRPC service definition
├── docs/                    # your knowledge base lives here
│   ├── grpc.md
│   ├── rust.md
│   └── tokio.md
└── src/
    ├── main.rs              # boots the gRPC server
    ├── config.rs            # env-var configuration
    ├── document_loader.rs   # read .md, .txt, .pdf
    ├── chunker.rs           # split into ~120-word chunks
    ├── qdrant_store.rs      # embeddings + vector storage
    ├── llm.rs               # prompt + completion
    ├── rag.rs               # orchestration
    ├── grpc_service.rs      # tonic handlers
    └── bin/
        └── chat.rs          # terminal chat client

Each file has one job. That's deliberate: RAG systems get complicated quickly, and clean seams are the only defense.


🛠️ Prerequisites

Rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
rustc --version

System packages

A surprising number of Rust crates compile native C/C++ underneath. On Ubuntu:

sudo apt update
sudo apt install -y \
    build-essential pkg-config libssl-dev \
    clang cmake protobuf-compiler poppler-utils
What each package is for:

Package             Why you need it
build-essential     GCC/G++ for native compilation
pkg-config          Locates system libraries
libssl-dev          OpenSSL headers, needed by crates that use OpenSSL for HTTPS
clang               LLVM toolchain (bindgen, ML runtime crates)
cmake               Used by several native ML libraries
protobuf-compiler   The protoc binary that the build uses to compile .proto files
poppler-utils       Provides pdftotext, which we shell out to for PDFs

Verify protoc:

protoc --version

Docker + Qdrant

sudo apt install -y docker.io docker-compose-plugin
sudo systemctl enable --now docker
docker compose up -d   # uses the docker-compose.yaml in the repo
Qdrant listens on two ports:

Port   What it serves
6333   REST API + dashboard at http://localhost:6333/dashboard
6334   gRPC API, which is what our Rust client talks to

OpenRouter key

We use OpenRouter so the same key serves both the embedding model and the completion model. Create a key at openrouter.ai/keys, then add it to the .env file described in the next section.


🔐 Setting up .env and Qdrant

Before you can run anything, two things need to be in place: a .env file with your API key, and a running Qdrant container.

1. Create the .env file

In the project root, create a file named .env (note the leading dot; it's a hidden file):

# Required: your OpenRouter API key
OPENROUTER_API_KEY=sk-or-v1-paste-your-key-here

# Optional: defaults shown
SERVER_ADDR=127.0.0.1
PORT=50051
QDRANT_URL=http://127.0.0.1:6334
QDRANT_COLLECTION=question

Get a key at openrouter.ai/keys. Free tier is enough to test.

What each variable does:

Variable             Default                 What it controls
OPENROUTER_API_KEY   (required)              Authenticates both embedding and LLM calls
SERVER_ADDR          127.0.0.1               Host the gRPC server binds to (use 0.0.0.0 for Docker)
PORT                 50051                   gRPC port your clients connect to
QDRANT_URL           http://127.0.0.1:6334   Where Qdrant's gRPC endpoint lives
QDRANT_COLLECTION    question                Name of the collection that stores your vectors

⚠️ Never commit .env to git. Add it to .gitignore:

echo ".env" >> .gitignore

If you've already committed it once, rotate the key; git history doesn't forget. Ship a .env.example with empty values instead so collaborators know what to set.

2. Start Qdrant with Docker

The repo includes a docker-compose.yaml:

services:
  qdrant:
    image: qdrant/qdrant:latest
    container_name: coderag-qdrant
    ports:
      - "6333:6333"   # REST API + dashboard
      - "6334:6334"   # gRPC API (this is what the Rust app uses)
    volumes:
      - qdrant_data:/qdrant/storage

volumes:
  qdrant_data:

Start it:

docker compose up -d

The -d flag runs it in the background. Verify it's up:

docker ps
# you should see coderag-qdrant in the list

curl http://localhost:6333/healthz
# should return: healthz check passed

Open the dashboard in a browser at http://localhost:6333/dashboard.


📖 Walking through every file

This is the part most tutorials skip. We're going file by file, top to bottom, explaining what each one does and why it's structured the way it is.

Cargo.toml: the manifest

[package]
name = "qrag-rust"
version = "0.1.0"
edition = "2024"

[[bin]]
name = "qrag-rust"
path = "src/main.rs"

[[bin]]
name = "chat"
path = "src/bin/chat.rs"

[dependencies]
anyhow = "1"
rig = "0.37.0"
tokio = { version = "1", features = ["full"] }
tonic = "0.12"
prost = "0.13"
rig-qdrant = "0.2"
qdrant-client = "1"
dotenvy = "0.15"
uuid = { version = "1", features = ["v4"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tracing = "0.1"
tracing-subscriber = "0.3"

[build-dependencies]
tonic-build = "0.12"

The two [[bin]] entries are the part worth a look. Cargo would auto-discover both src/main.rs and src/bin/chat.rs on its own, so these entries aren't strictly required; declaring them explicitly just makes the binary names and paths obvious. One is the server, one is the chat client. Both end up in target/release/ after cargo build --release.

tonic-build is a build-dependency, not a runtime dependency. It only runs at compile time to turn .proto files into Rust code.

build.rs: code generation at compile time

fn main() -> Result<(), Box<dyn std::error::Error>> {
    tonic_build::compile_protos("proto/rag.proto")?;
    Ok(())
}

A few lines, big effect. Cargo runs this script before compiling the crate (and re-runs it when its inputs change). It reads proto/rag.proto, generates Rust structs and traits for every message and service, and writes them to Cargo's OUT_DIR. Later, tonic::include_proto!("rag") in main.rs pulls that generated code into our crate.

This is why you don't need to check generated proto code into git: the build script regenerates it.

proto/rag.proto: the service contract

syntax = "proto3";

package rag;

service RagService {
  rpc AskQuestion(AskQuestionRequest) returns (AskQuestionResponse);
  rpc Reindex(ReindexRequest) returns (ReindexResponse);
}

message AskQuestionRequest { string question = 1; }
message AskQuestionResponse {
  string answer = 1;
  repeated Source sources = 2;
}
message Source {
  string file_path = 1;
  uint32 chunks_indexed = 2;  // despite the name, carries the chunk's index within its file
  string preview = 3;
  float score = 4;
}
message ReindexRequest {}
message ReindexResponse {
  uint64 chunks_indexed = 1;
  string message = 2;
}

This file is the single source of truth for the API. Two RPCs: Reindex rebuilds the vector index from ./docs, and AskQuestion is the actual query endpoint. The response carries the answer plus the source chunks that produced it; that's the difference between a RAG system and a black box. Always return your sources so users can verify the grounding.

The numbers (= 1, = 2) are field tags. They're how protobuf identifies fields on the wire. Once assigned, never change them; that's an instant breaking change for every existing client.
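To see why the tag matters more than the field name, here is a small hand-rolled sketch of how a single proto3 string field is laid out on the wire. In the real project prost does this for you; the helper names below are hypothetical and only cover the one case needed for AskQuestionRequest.

```rust
/// Protobuf wire type for length-delimited fields (strings, bytes, nested messages).
const WIRE_TYPE_LEN: u64 = 2;

/// Append `value` as a base-128 varint: 7 bits per byte,
/// with the high bit set when more bytes follow.
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Encode one string field: key varint, length varint, then the UTF-8 bytes.
/// The key packs the field tag and the wire type as `(tag << 3) | wire_type`.
fn encode_string_field(tag: u32, text: &str, out: &mut Vec<u8>) {
    encode_varint((u64::from(tag) << 3) | WIRE_TYPE_LEN, out);
    encode_varint(text.len() as u64, out);
    out.extend_from_slice(text.as_bytes());
}

fn main() {
    // AskQuestionRequest { question: "hi" }, where `question` has tag 1.
    let mut bytes = Vec::new();
    encode_string_field(1, "hi", &mut bytes);
    println!("{bytes:02x?}");
}
```

The encoded message is four bytes: the key 0x0A (tag 1, wire type 2), the length 0x02, then "hi". The name `question` never appears, which is why renaming a field is safe but renumbering it is not.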

src/config.rs: environment-driven config

use std::env;

#[derive(Debug, Clone)]
pub struct Config {
    pub addr: String,
    pub port: u16,
    pub qdrant_url: String,
    pub qdrant_collection: String,
}

impl Config {
    pub fn from_env() -> Self {
        Self {
            addr: env::var("SERVER_ADDR")
                .unwrap_or_else(|_| "127.0.0.1".to_string()),
            port: env::var("PORT")
                .unwrap_or_else(|_| "50051".to_string())
                .parse::<u16>()
                .expect("PORT must be a valid number"),
            qdrant_url: env::var("QDRANT_URL")
                .unwrap_or_else(|_| "http://127.0.0.1:6334".to_string()),
            qdrant_collection: env::var("QDRANT_COLLECTION")
                .unwrap_or_else(|_| "question".to_string()),
        }
    }

    pub fn server_addr(&self) -> String {
        format!("{}:{}", self.addr, self.port)
    }
}

Plain config struct, populated from environment variables with sensible defaults. Each setting is documented by its name and its default. SERVER_ADDR and PORT configure where the gRPC server binds. QDRANT_URL points to the Qdrant gRPC port (note: 6334, not the REST 6333). QDRANT_COLLECTION names the collection.

Defaults work out of the box for local dev. Production overrides come from .env or the deployment environment.
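One thing to be aware of: the PORT parsing above panics via expect on a bad value. If you'd rather surface a readable error, a variant along these lines is one option. This is a sketch, not the repo's code; `env_or` and `parse_port` are hypothetical helper names.

```rust
use std::env;

/// Read an environment variable, falling back to `default` when it is unset.
fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

/// Parse a port string, reporting the offending value instead of panicking.
fn parse_port(raw: &str) -> Result<u16, String> {
    raw.trim()
        .parse::<u16>()
        .map_err(|err| format!("PORT must be 0-65535, got {raw:?}: {err}"))
}

fn main() {
    let addr = env_or("SERVER_ADDR", "127.0.0.1");
    match parse_port(&env_or("PORT", "50051")) {
        Ok(port) => println!("would bind to {addr}:{port}"),
        Err(msg) => eprintln!("config error: {msg}"),
    }
}
```

For a small local tool the panic is arguably fine. The Result version pays off once the config is loaded somewhere a crash is more expensive than a log line.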

src/document_loader.rs: reading files from disk

use anyhow::{Context, Result};
use std::fs;
use std::path::{Path, PathBuf};
use std::process::Command;

#[derive(Debug, Clone)]
pub struct Document {
    pub file_path: String,
    pub content: String,
}

pub fn loader_documents_from_dir<P: AsRef<Path>>(dir: P) -> Result<Vec<Document>> {
    let mut documents = Vec::new();
    let entries = fs::read_dir(&dir)
        .with_context(|| format!("Failed to read directory: {:?}", dir.as_ref()))?;

    for entry in entries {
        let entry = entry?;
        let path: PathBuf = entry.path();
        if !path.is_file() { continue; }
        let Some(extension) = path.extension() else { continue; };
        let extension = extension.to_string_lossy().to_lowercase();

        let content = match extension.as_str() {
            "md" | "txt" => fs::read_to_string(&path)
                .with_context(|| format!("Failed to read text file: {:?}", path))?,
            "pdf" => extract_pdf_text(&path)
                .with_context(|| format!("Failed to extract text from PDF: {:?}", path))?,
            _ => continue,
        };

        documents.push(Document {
            file_path: path.to_string_lossy().to_string(),
            content,
        });
    }
    Ok(documents)
}

fn extract_pdf_text(path: &Path) -> Result<String> {
    let output = Command::new("pdftotext")
        .arg(path).arg("-")
        .output()
        .with_context(|| format!("Failed to run pdftotext for {:?}", path))?;

    if !output.status.success() {
        anyhow::bail!("pdftotext failed for {:?}: {}",
            path, String::from_utf8_lossy(&output.stderr));
    }
    Ok(String::from_utf8_lossy(&output.stdout).to_string())
}

This is the on-ramp. The loader reads the top level of ./docs (fs::read_dir does not recurse into subdirectories), skips directories and unknown extensions, reads each supported file, and returns a Vec<Document> of (path, content) pairs.

Two design choices are worth flagging. First, <P: AsRef<Path>> makes the function generic over anything that can be viewed as a Path: &str, String, PathBuf, &Path. Callers can pass whatever they have without converting. Second, PDF extraction shells out to pdftotext (from poppler-utils) rather than pulling in a heavyweight PDF crate. Not glamorous, but it's been battle-tested for years and keeps the dependency tree small.
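If the AsRef<Path> bound is new to you, this tiny standalone example shows what it buys at the call site. `lowercase_extension` is a hypothetical helper that mirrors the extension handling in the loader; it is not part of the repo.

```rust
use std::path::{Path, PathBuf};

/// Lowercased extension of a path, if it has one.
/// Generic over anything that can be viewed as a `Path`.
fn lowercase_extension<P: AsRef<Path>>(path: P) -> Option<String> {
    path.as_ref()
        .extension()
        .map(|ext| ext.to_string_lossy().to_lowercase())
}

fn main() {
    // &str, String, and PathBuf all work without conversion at the call site.
    let from_str = lowercase_extension("docs/rust.MD");
    let from_string = lowercase_extension(String::from("docs/notes.txt"));
    let from_pathbuf = lowercase_extension(PathBuf::from("docs/paper.pdf"));
    println!("{from_str:?} {from_string:?} {from_pathbuf:?}");
}
```

Lowercasing before the match is what lets `rust.MD` and `rust.md` take the same branch.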

src/chunker.rs: splitting documents into pieces

use crate::document_loader::Document;

#[derive(Debug, Clone)]
pub struct DocumentChunk {
    pub id: String,
    pub file_path: String,
    pub chunk_index: usize,
    pub text: String,
}

pub fn chunk_retrieve(documents: Vec<Document>, max_word: usize) -> Vec<DocumentChunk> {
    let mut chunks = Vec::new();
    for docs in documents {
        let words: Vec<&str> = docs.content.split_whitespace().collect();
        for (chunk_idx, word_chunk) in words.chunks(max_word).enumerate() {
            let text = word_chunk.join(" ");
            if text.trim().is_empty() { continue; }
            chunks.push(DocumentChunk {
                id: uuid::Uuid::new_v4().to_string(),
                file_path: docs.file_path.clone(),
                chunk_index: chunk_idx,
                text,
            });
        }
    }
    chunks
}

Chunk size is the single most underrated decision in RAG. Too large and retrieval becomes vague (the chunk has the answer plus a lot of noise); too small and meaning fragments across boundaries. We use a simple word-count chunker at 120 words per chunk, a reasonable starting point for technical docs.

Each chunk gets a fresh UUID (Qdrant uses it as the point ID), the source file path, and a chunk_index so we can show users where the answer came from. split_whitespace() is a small but important choice: it handles tabs, newlines, and runs of spaces correctly, where split(' ') would not.
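The difference between the two split methods is easy to see on a string with mixed whitespace. The two wrapper functions here exist only so the comparison is side by side.

```rust
/// Splits on single space characters only.
fn naive_words(text: &str) -> Vec<&str> {
    text.split(' ').collect()
}

/// Splits on any run of Unicode whitespace and never yields empty strings.
fn whitespace_words(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    // Double space, a tab, and a newline: typical of text pulled from files.
    let text = "Rust  ownership\tprevents\nmemory bugs";
    println!("split(' ')         -> {:?}", naive_words(text));
    println!("split_whitespace() -> {:?}", whitespace_words(text));
}
```

With split(' ') the doubled space produces an empty "word" and the tab and newline stay glued inside one token, so word counts (and therefore chunk boundaries) come out wrong.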

Production systems often do smarter things: sentence-boundary chunking, semantic chunking (split where meaning shifts), or sliding-window chunks with overlap so context isn't lost at boundaries. All worth experimenting with once your baseline works.
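As a starting point for the sliding-window idea, here is a minimal sketch of a word chunker with overlap. It is not in the repo; `chunk_with_overlap` and its parameters are hypothetical, and it returns plain strings rather than DocumentChunk values to stay self-contained.

```rust
/// Split `text` into chunks of at most `max_words` words, where each chunk
/// repeats the last `overlap` words of the previous one.
fn chunk_with_overlap(text: &str, max_words: usize, overlap: usize) -> Vec<String> {
    assert!(max_words > 0, "max_words must be positive");
    assert!(overlap < max_words, "overlap must be smaller than max_words");

    let words: Vec<&str> = text.split_whitespace().collect();
    let step = max_words - overlap; // how far the window advances each time
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < words.len() {
        let end = (start + max_words).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the window reached the end of the document
        }
        start += step;
    }
    chunks
}

fn main() {
    let text = "one two three four five six seven eight nine ten";
    for (i, chunk) in chunk_with_overlap(text, 4, 1).iter().enumerate() {
        println!("chunk {i}: {chunk}");
    }
}
```

With a window of 4 and an overlap of 1, "four" and "seven" each appear in two chunks, so a sentence that straddles a boundary is still seen whole by at least one of them. The trade-off is more chunks to embed and store.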

src/qdrant_store.rs: the heart of retrieval

This is the longest file in the project. It does three things: create or reset the Qdrant collection, embed chunks and upsert them, and search by question. Let's break it into pieces.

The struct and embedding model:

const EMBEDDING_MODEL: &str = "openai/text-embedding-3-small";

#[derive(Clone)]
pub struct QdrantStore {
    client: Qdrant,
    openrouter_client: OpenRouterAiClient,
    collection_name: String,
}

#[derive(Embed, Clone)]
struct ChunkEmbedding {
    id: String,
    file_path: String,
    chunk_index: usize,
    #[embed]
    text: String,
}

QdrantStore owns two clients: one to Qdrant, one to OpenRouter for embeddings. ChunkEmbedding is a Rig pattern: the #[derive(Embed)] macro plus the #[embed] field attribute tells Rig "embed the text field; everything else is metadata that rides along."

Constructing the store:

pub fn new(qdrant_url: &str, collection_name: &str) -> Result<Self, anyhow::Error> {
    dotenvy::dotenv().ok();

    let client = Qdrant::from_url(qdrant_url)
        .build()
        .context("Failed to create Qdrant client")?;

    let openrouter_client = OpenRouterAiClient::from_env()?;

    Ok(Self {
        client,
        openrouter_client,
        collection_name: collection_name.to_string(),
    })
}

The constructor wires up both clients in one place. Qdrant::from_url(...) builds the gRPC client pointed at http://127.0.0.1:6334 (or wherever QDRANT_URL says). OpenRouterAiClient::from_env() reads OPENROUTER_API_KEY straight from the environment. That's why dotenvy::dotenv().ok() runs first, so values from your .env file are loaded before Rig looks for them.

Both clients are cheap to clone (they wrap connection pools internally), which is why we can hand QdrantStore around freely via #[derive(Clone)] later.

Why pass qdrant_url and collection_name as &str instead of taking a Config struct? Loose coupling. QdrantStore doesn't need to know about the rest of the app's configuration.

Collection creation:

pub async fn ensure_collection(&self) -> Result<()> {
    let exists = self.client.collection_exists(&self.collection_name).await?;
    if exists { return Ok(()); }

    self.client.create_collection(
        CreateCollectionBuilder::new(&self.collection_name)
            .vectors_config(VectorParamsBuilder::new(1536, Distance::Cosine))
    ).await?;
    Ok(())
}

1536 matches the output dimension of openai/text-embedding-3-small. Get this wrong and every upsert fails with a dimension mismatch, a mistake everyone makes once. Cosine distance is the standard choice for normalized text embeddings.
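If you want that mistake to fail early with a clear message rather than at upsert time, a guard like the following is one option. It's a sketch under assumptions: `check_dimension` is a hypothetical helper, not something the repo or the Qdrant client provides.

```rust
/// Dimension of the collection, matching the 1536 used in `ensure_collection`.
const EXPECTED_DIM: usize = 1536;

/// Check an embedding's length before sending it to the vector store.
fn check_dimension(vector: &[f32], expected: usize) -> Result<(), String> {
    if vector.len() == expected {
        Ok(())
    } else {
        Err(format!(
            "embedding has {} dimensions, collection expects {}",
            vector.len(),
            expected
        ))
    }
}

fn main() {
    let good = vec![0.0_f32; EXPECTED_DIM];
    let bad = vec![0.0_f32; 768]; // e.g. output of a different embedding model
    println!("{:?}", check_dimension(&good, EXPECTED_DIM));
    println!("{:?}", check_dimension(&bad, EXPECTED_DIM));
}
```

Keeping the dimension in one constant next to EMBEDDING_MODEL also makes it harder to change the model and forget the collection.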

reset_collection (omitted for space; see the repo) deletes and recreates the collection. It's what Reindex calls to start fresh.

Embedding and upserting chunks:

pub async fn upsert_chunks(&self, chunks: &[DocumentChunk]) -> Result<usize> {
    if chunks.is_empty() { return Ok(0); }

    let embedding_model = self.openrouter_client.embedding_model(EMBEDDING_MODEL);
    let mut builder = EmbeddingsBuilder::new(embedding_model);

    for chunk in chunks {
        builder = builder.document(ChunkEmbedding {
            id: chunk.id.clone(),
            file_path: chunk.file_path.clone(),
            chunk_index: chunk.chunk_index,
            text: chunk.text.clone(),
        })?;
    }

    let embedded_doc = builder.build().await?;

    let points: Vec<PointStruct> = embedded_doc.into_iter().map(|(doc, embeddings)| {
        let vector: Vec<f32> = embeddings.first().vec
            .iter().map(|&v| v as f32).collect();

        PointStruct::new(
            doc.id.clone(),
            vector,
            Payload::try_from(json!({
                "file_path": doc.file_path,
                "chunk_index": doc.chunk_index,
                "text": doc.text,
            })).expect("valid Qdrant payload"),
        )
    }).collect();

    self.client.upsert_points(
        UpsertPointsBuilder::new(&self.collection_name, points)
    ).await?;

    Ok(chunks.len())
}

The flow: build up a batch of ChunkEmbeddings, hand them to Rig's EmbeddingsBuilder, get back (doc, embeddings) pairs. We cast f64 → f32 because Qdrant expects 32-bit floats. The payload carries the original text and metadata; that's what search will later return to us so we can hand it to the LLM.

Searching:

pub async fn search(&self, question: &str, top_k: usize) -> Result<Vec<RetrievedChunk>> {
    let embedding_model = self.openrouter_client.embedding_model(EMBEDDING_MODEL);

    let query = QueryPointsBuilder::new(&self.collection_name)
        .with_payload(true).limit(top_k as u64).build();

    let vector_store = QdrantVectorStore::new(
        self.client.clone(), embedding_model, query);

    let request = VectorSearchRequest::builder()
        .query(question).samples(top_k as u64).build();

    let results = vector_store.top_n::<serde_json::Value>(request).await?;

    let mut chunk = Vec::new();
    for (score, _id, payload) in results {
        let file_path = payload.get("file_path")
            .and_then(|v| v.as_str()).unwrap_or_default().to_string();
        let chunk_index = payload.get("chunk_index")
            .and_then(|v| v.as_u64()).unwrap_or_default() as usize;
        let text = payload.get("text")
            .and_then(|v| v.as_str()).unwrap_or_default().to_string();

        chunk.push(RetrievedChunk {
            file_path, chunk_index, text, score: score as f32
        });
    }
    Ok(chunk)
}

Rig handles embedding the question internally; that's what vector_store.top_n(...) does under the hood. Results come back as (score, id, payload) tuples. We destructure them and pull file_path, chunk_index, and text out of the JSON payload with safe defaults so a malformed point can't crash the whole query.

src/llm.rs: building the prompt and calling the model

#[derive(Clone)]
pub struct LlmService {
    client: openrouter::Client,
}

impl LlmService {
    pub fn new() -> Result<Self> {
        let client = openrouter::Client::from_env()
            .context("Failed to create Rig OpenRouter client from OPENROUTER_API_KEY")?;
        Ok(Self { client })
    }

    pub async fn answer_question(
        &self,
        question: &str,
        chunks: &[RetrievedChunk],
    ) -> Result<String> {
        if chunks.is_empty() {
            return Ok(format!("could not find relevant content for: {}", question));
        }

        let context = build_context(chunks);

        let prompt = format!(r#"
            You are a helpful Rust AI assistant.
            Answer the question using only the provided document context.

            Rules:
            - Be clear and concise.
            - If the context is not enough, say so.
            - Mention the source file when useful.
            - Do not invent facts outside the context.

            Context:
            {}

            Question:
            {}

            Answer:
            "#, context, question);

        let agent = self.client
            .agent("openai/gpt-4o-mini")
            .preamble("You answer questions using retrieved chunks as grounded context.")
            .build();

        let response = agent.prompt(prompt).await
            .context("Failed to generate answer with Rig agent")?;
        Ok(response)
    }
}

fn build_context(chunks: &[RetrievedChunk]) -> String {
    let mut context = String::new();
    for chunk in chunks {
        context.push_str(&format!(
            "\nSource: {}\nChunk: {}\nScore: {:.4}\nText: {}\n",
            chunk.file_path, chunk.chunk_index, chunk.score, chunk.text
        ));
    }
    context
}

This is where retrieval finally meets generation. build_context formats each retrieved chunk into a labeled block so the model can tell them apart and cite them. The main prompt drops that block in, asks the question, and adds explicit grounding rules.

The "Do not invent facts outside the context" line matters more than people think. Without it, models cheerfully fill gaps with training-data plausibilities, which is exactly the hallucination problem RAG is supposed to fix. The "If the context is not enough, say so" rule is equally important: it tells the model that "I don't know" is an acceptable answer.

If retrieval returns nothing (empty chunks), we short-circuit and never call the LLM. That saves tokens and produces a clearer "could not find relevant content" message.

src/rag.rs: orchestration

#[derive(Clone)]
pub struct RagEngine {
    file_dir: String,
    qdrant_store: QdrantStore,
    llm: LlmService,
}

#[derive(Debug)]
pub struct RagAnswer {
    pub answer: String,
    pub sources: Vec<RetrievedChunk>,
}

impl RagEngine {
    pub fn new(file_dir: impl Into<String>, qdrant_store: QdrantStore, llm: LlmService) -> Self {
        Self { file_dir: file_dir.into(), qdrant_store, llm }
    }

    pub async fn initialize(&self) -> Result<()> {
        self.qdrant_store.ensure_collection().await
    }

    pub async fn reindex_docs(&self) -> Result<usize> {
        let load = loader_documents_from_dir(&self.file_dir)?;
        let chunks = chunk_retrieve(load, 120);
        self.qdrant_store.reset_collection().await?;
        let indexed = self.qdrant_store.upsert_chunks(&chunks).await?;
        Ok(indexed)
    }

    pub async fn ask_question(&self, question: String) -> Result<RagAnswer> {
        let chunk = self.qdrant_store.search(&question, 3).await?;
        let answer = self.llm.answer_question(&question, &chunk).await?;
        Ok(RagAnswer { answer, sources: chunk })
    }
}

RagEngine is the conductor. It doesn't know how PDFs are parsed, how chunks become vectors, or how the LLM is called; it just sequences the pieces. That's the entire point: when you want to swap Qdrant for Weaviate or OpenRouter for a local model, the change stays mostly inside one file.

top_k = 3 is conservative. Bigger k means more context and higher cost; smaller means crisper but riskier. Tune it for your corpus.

#[derive(Clone)] is what makes the engine cheap to hand to the gRPC handler. The expensive things inside (Qdrant client, OpenRouter client) are themselves cheaply cloneable; they're essentially Arc-wrapped connection pools.
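If you want the swap-ability to be enforced by the compiler rather than by convention, one option is to put a trait at the seam. The sketch below is hypothetical and simplified (synchronous, strings only): the repo's RagEngine holds a concrete QdrantStore, and `Retriever`, `KeywordRetriever`, and `Engine` are names invented for this example.

```rust
/// Hypothetical retrieval interface; the project itself calls QdrantStore directly.
trait Retriever {
    fn search(&self, question: &str, top_k: usize) -> Vec<String>;
}

/// Toy in-memory retriever that ranks chunks by how many of the
/// question's words they contain (lowercased substring match).
struct KeywordRetriever {
    chunks: Vec<String>,
}

impl Retriever for KeywordRetriever {
    fn search(&self, question: &str, top_k: usize) -> Vec<String> {
        let question = question.to_lowercase();
        let wanted: Vec<&str> = question.split_whitespace().collect();
        let mut scored: Vec<(usize, &String)> = self
            .chunks
            .iter()
            .map(|chunk| {
                let lower = chunk.to_lowercase();
                let hits = wanted.iter().filter(|w| lower.contains(**w)).count();
                (hits, chunk)
            })
            .filter(|(hits, _)| *hits > 0)
            .collect();
        scored.sort_by(|a, b| b.0.cmp(&a.0)); // most matching words first
        scored.into_iter().take(top_k).map(|(_, c)| c.clone()).collect()
    }
}

/// The orchestrator depends only on the trait, not on a concrete store.
struct Engine<R: Retriever> {
    retriever: R,
}

impl<R: Retriever> Engine<R> {
    fn context_for(&self, question: &str) -> String {
        self.retriever.search(question, 3).join("\n")
    }
}

fn main() {
    let engine = Engine {
        retriever: KeywordRetriever {
            chunks: vec![
                "Tokio is an async runtime".to_string(),
                "Qdrant stores vectors".to_string(),
            ],
        },
    };
    println!("{}", engine.context_for("what is tokio"));
}
```

A side benefit is testability: an in-memory retriever like this lets you exercise the orchestration logic without a running Qdrant or an API key.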

src/grpc_service.rs: the network surface

#[derive(Clone)]
pub struct RagGrpcService {
    engine: RagEngine,
}
// Constructor called from main.rs; not shown in the original excerpt,
// so this plain wrapper is an assumption about its shape.
impl RagGrpcService {
    pub fn new(engine: RagEngine) -> Self {
        Self { engine }
    }
}

#[tonic::async_trait]
impl RagService for RagGrpcService {
    async fn reindex(&self, _req: Request<ReindexRequest>)
        -> Result<Response<ReindexResponse>, Status>
    {
        let engine = self.engine.clone();
        let handle = tokio::spawn(async move {
            engine.reindex_docs().await
                .map_err(|err| Status::internal(err.to_string()))
        });
        let chunks_indexed = handle.await
            .map_err(|err| Status::internal(err.to_string()))??;

        Ok(Response::new(ReindexResponse {
            chunks_indexed: chunks_indexed as u64,
            message: format!("Indexed {} chunks into Qdrant", chunks_indexed),
        }))
    }

    async fn ask_question(&self, req: Request<AskQuestionRequest>)
        -> Result<Response<AskQuestionResponse>, Status>
    {
        let question = req.into_inner().question;
        if question.trim().is_empty() {
            return Err(Status::invalid_argument("question cannot be empty"));
        }

        let rag_answer = self.engine.ask_question(question).await
            .map_err(|err| Status::internal(err.to_string()))?;

        let sources = rag_answer.sources.into_iter().map(|c| Source {
            file_path: c.file_path,
            chunks_indexed: c.chunk_index as u32,
            preview: c.text.chars().take(180).collect(),
            score: c.score,
        }).collect();

        Ok(Response::new(AskQuestionResponse {
            answer: rag_answer.answer,
            sources,
        }))
    }
}

Tonic handlers are thin: they unpack the request, validate, call the engine, pack the response. Two details are worth pointing out.

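reindex uses tokio::spawn because indexing can take a while (every chunk has to be embedded through OpenRouter). Spawning moves the work onto its own task, so it can keep running even if the client disconnects and the request future is dropped, and a panic inside it surfaces as a JoinError instead of unwinding through the handler. The handler still awaits the join handle and propagates errors, so the caller waits for completion. For a production system you'd probably make this fire-and-forget with a separate status endpoint, but at this scale awaiting the result is fine.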

The preview: c.text.chars().take(180).collect() truncates the chunk to 180 characters before sending it over the wire. Full chunks could be hundreds of words; the preview is enough to let the user verify the source without bloating the response.
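It is worth seeing why the handler counts characters instead of slicing bytes. Slicing a &str at an arbitrary byte offset panics when the offset lands inside a multi-byte UTF-8 character, which is easy to hit with PDF-extracted text. A small sketch of both approaches, with hypothetical helper names (preview, preview_bytes) that are not in the repo:

```rust
/// Returns at most `max_chars` characters of `text`.
/// Works on `char`s, so it can never split a multi-byte character.
pub fn preview(text: &str, max_chars: usize) -> String {
    text.chars().take(max_chars).collect()
}

/// Byte-budget alternative: returns the longest prefix of `text` that is
/// at most `max_bytes` long and still ends on a character boundary.
pub fn preview_bytes(text: &str, max_bytes: usize) -> &str {
    if text.len() <= max_bytes {
        return text;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}
```

The character-based version allocates a new String but is the simpler choice for a display preview. The byte-based version borrows from the input and is the one to reach for when the limit is a wire-size budget.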

src/main.rs: boot

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    dotenvy::dotenv().ok();

    tracing_subscriber::fmt()
        .with_target(false).compact().init();

    let config = Config::from_env();
    let addr = config.server_addr();

    let qdrant_store = QdrantStore::new(&config.qdrant_url, &config.qdrant_collection)?;
    let llm = LlmService::new()?;
    let engine = RagEngine::new("./docs", qdrant_store, llm);
    engine.initialize().await?;

    let grpc_service = RagGrpcService::new(engine);
    info!("CodeRAG-rs gRPC server running on {}", addr);

    Server::builder()
        .add_service(RagServiceServer::new(grpc_service))
        .serve(addr.parse()?)
        .await?;

    Ok(())
}

Linear and easy to follow: load .env, set up logging, build config, construct the three services in dependency order (QdrantStore → LlmService → RagEngine), call initialize() to make sure the Qdrant collection exists, then start the tonic server. The #[tokio::main] attribute turns this into an async runtime entry point.
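One detail in the boot code that is easy to trip over: addr.parse() has to produce a socket address, so the configured value must be a bare ip:port pair. The sketch below assumes config.server_addr() returns a string like the 127.0.0.1:50051 shown in the server log; parse_listen_addr is a hypothetical helper, not part of the repo.

```rust
use std::net::{AddrParseError, SocketAddr};

/// Parses a listen address the way `addr.parse()` does when the target
/// type is `SocketAddr`: an IP literal and a port, nothing else.
pub fn parse_listen_addr(raw: &str) -> Result<SocketAddr, AddrParseError> {
    raw.trim().parse::<SocketAddr>()
}
```

A URL with a scheme (http://127.0.0.1:50051) or a hostname (localhost:50051) fails to parse here, because SocketAddr does no DNS resolution. The client side is the opposite: it connects to a URI, so it does need the http:// prefix.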

src/bin/chat.rs: the terminal client

#[tokio::main]
async fn main() -> Result<()> {
    let addr = std::env::var("RAG_SERVER")
        .unwrap_or_else(|_| "http://127.0.0.1:50051".to_string());

    let mut client = RagServiceClient::connect(addr.clone()).await
        .with_context(|| format!("could not reach RAG server at {addr}"))?;

    println!("Type a question and press Enter. Commands: /reindex, /quit");

    let stdin = io::stdin();
    let mut stdout = io::stdout();
    let mut input = stdin.lock();
    let mut line = String::new();

    loop {
        print!("you > ");
        stdout.flush().ok();
        line.clear();
        if input.read_line(&mut line)? == 0 { break; }   // Ctrl-D

        let q = line.trim();
        if q.is_empty() { continue; }
        if q == "/quit" || q == "/exit" { break; }
        if q == "/reindex" {
            let r = client.reindex(ReindexRequest {}).await?.into_inner();
            println!("indexed {} chunks - {}\n", r.chunks_indexed, r.message);
            continue;
        }

        let r = client.ask_question(AskQuestionRequest {
            question: q.to_string()
        }).await?.into_inner();

        println!("\nbot > {}\n", r.answer);
        for s in r.sources {
            let preview: String = s.preview.chars().take(80).collect();
            println!("  - {} (score {:.3})\n    {}…", s.file_path, s.score, preview);
        }
    }
    Ok(())
}

This is a separate binary that connects to the running server over gRPC. The same tonic::include_proto!("rag") macro that gave the server its types gives the client a strongly-typed RagServiceClient. The loop reads from stdin, dispatches commands (/reindex, /quit), and otherwise sends the input as a question.
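If the command list grows, the chain of if statements in the loop can be pulled into a small parser that is testable without a running server. A minimal sketch, assuming the same commands as above; Command and parse_command are hypothetical names, not taken from the repo:

```rust
/// Everything one line of user input can mean to the chat client.
#[derive(Debug, PartialEq)]
pub enum Command {
    /// Blank line: prompt again.
    Empty,
    /// `/quit` or `/exit`: leave the loop.
    Quit,
    /// `/reindex`: call the Reindex RPC.
    Reindex,
    /// Anything else is sent to AskQuestion as-is.
    Ask(String),
}

/// Classifies one line read from stdin (trailing newline included).
pub fn parse_command(line: &str) -> Command {
    match line.trim() {
        "" => Command::Empty,
        "/quit" | "/exit" => Command::Quit,
        "/reindex" => Command::Reindex,
        question => Command::Ask(question.to_string()),
    }
}
```

The loop body then becomes a single match on parse_command(&line), and adding a command means adding one enum variant and one match arm.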

Two binaries from one codebase. Production-friendly: you can deploy the server in Docker and ship the chat client to developers' laptops, and they share types automatically.


▢️ Running it end-to-end

Build everything

git clone https://github.com/Parikalp-Bhardwaj/qrag-rust
cd qrag-rust
cargo build

The first build is slow (tonic, qdrant-client, and rig pull a lot of crates). Subsequent builds are quick.

Start Qdrant and the server

docker compose up -d
cargo run --bin qrag-rust

You should see:

INFO CodeRAG-rs gRPC server running on 127.0.0.1:50051

Index your docs (via grpcurl)

grpcurl -plaintext -d '{}' \
  -import-path proto -proto rag.proto \
  127.0.0.1:50051 rag.RagService/Reindex

Expected response:

{
  "chunksIndexed": "386",
  "message": "Indexed 386 chunk into qdrant"
}

Ask a question

grpcurl -plaintext -d '{"question":"What is tokio?"}' \
  -import-path proto -proto rag.proto \
  127.0.0.1:50051 rag.RagService/AskQuestion

Expected response:

{
  "answer": "Tokio is an asynchronous runtime for Rust that enables programs to run many async tasks concurrently. It is useful for background jobs, concurrent I/O, network servers, and worker tasks. The `tokio::spawn` function is utilized to start a new asynchronous task that runs independently within the Tokio runtime. (Source: ./docs/tokio.md)",
  "sources": [
    {
      "filePath": "./docs/tokio.md",
      "preview": "# Tokio Runtime Tokio is an asynchronous runtime for Rust. It allows programs to run many async tasks concurrently. tokio::spawn is used to start a new asynchronous task. The spawn",
      "score": 0.5813062
    },
    {
      "filePath": "./docs/Rust-for-Network-Programming-and-Automation.pdf",
      "chunksIndexed": 230,
      "preview": "resources like sockets and orchestrating the event loop that drives the application. It provides an API for registering interest in I/O events (e.g., data arriving on a socket or a",
      "score": 0.51516765
    },
    {
      "filePath": "./docs/Rust-for-Network-Programming-and-Automation.pdf",
      "chunksIndexed": 233,
      "preview": "With the help of Tokio's abstractions and asynchronous I/O, you can create scalable network apps that can manage any number of connections at once. Hyper: High-Level HTTP Library H",
      "score": 0.48589033
    }
  ]
}

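You'll get an answer plus the source chunks that produced it. Two quirks of the JSON are worth knowing. In the Reindex response, chunksIndexed is quoted because the proto3 JSON mapping encodes 64-bit integers as strings. In the AskQuestion response, the first source has no chunksIndexed field, presumably because its value is 0: grpcurl omits fields that hold their default value unless you pass -emit-defaults.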

Or use the chat CLI

In a second terminal:

cargo run --bin chat
you > what is TCP

bot > TCP, or Transmission Control Protocol, is a core protocol within the TCP/IP suite designed to facilitate reliable data transmission over networks. It ensures that data packets are transmitted accurately and in the correct order by using a process known as the three-way handshake for establishing connections. TCP also incorporates mechanisms for error detection, correction, and retransmission, ensuring data integrity and reliability even in complex network environments. This makes TCP essential for applications that require robust communication.

(Source: ./docs/Rust-for-Network-Programming-and-Automation.pdf)

  sources:
    - ./docs/Rust-for-Network-Programming-and-Automation.pdf  (score 0.546)
      which is then encapsulated in an IP packet, and finally in a link layer frame. T…
    - ./docs/Rust-for-Network-Programming-and-Automation.pdf  (score 0.540)
      requirements. Its robustness ensures reliable data transmission even in complex …
    - ./docs/Rust-for-Network-Programming-and-Automation.pdf  (score 0.526)
      as OSPF (Open Shortest Path First) and BGP (Border Gateway Protocol), help route…

Commands: /reindex to rebuild, /quit to exit.


⚠️ Gotchas worth knowing

A handful of papercuts that are cheaper to read about than to debug:

  • Vector dimension mismatch. If you initialize the collection at the wrong dimension, every upsert fails. Drop the collection and let it recreate; docker compose down -v does this by removing the compose volumes, which also deletes everything else Qdrant has stored. text-embedding-3-small is 1536-dim.
  • Don't commit .env. If you already did, rotate the key, because git history is forever.
  • OpenRouter for embeddings. OpenRouter is primarily a completion router but does pass through OpenAI's embedding models. If embeddings start erroring, point the embedding client directly at OpenAI and keep OpenRouter for completion.
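The dimension mismatch is the easiest of these to catch early. A cheap guard before each upsert turns a confusing Qdrant error into a clear local one. This is a sketch under assumptions: check_dim, DimMismatch, and EXPECTED_DIM are hypothetical names, and 1536 is the text-embedding-3-small size quoted above.

```rust
use std::fmt;

/// Dimension of text-embedding-3-small, as configured for the collection.
pub const EXPECTED_DIM: usize = 1536;

/// Error returned when an embedding does not match the collection size.
#[derive(Debug, PartialEq)]
pub struct DimMismatch {
    pub expected: usize,
    pub actual: usize,
}

impl fmt::Display for DimMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "embedding has {} dimensions, collection expects {}",
            self.actual, self.expected
        )
    }
}

impl std::error::Error for DimMismatch {}

/// Verifies the vector length before it is sent to the vector store.
pub fn check_dim(embedding: &[f32], expected: usize) -> Result<(), DimMismatch> {
    if embedding.len() == expected {
        Ok(())
    } else {
        Err(DimMismatch { expected, actual: embedding.len() })
    }
}
```

Calling check_dim(&vector, EXPECTED_DIM)? once per embedding is enough. The same check also catches the case where the embedding model was swapped but the collection was not recreated.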

🔭 Where to go from here

The system you just built is the simplest thing that works. Things to try next, in rough order of payoff:

  1. Better chunking. Sentence-aware splits, sliding-window overlap, or semantic chunking. Probably the single biggest quality lever.
  2. Reranking. After Qdrant returns top-20, use a cross-encoder to rerank down to top-3. Higher precision at the cost of latency.
  3. Hybrid retrieval. Combine vector search with BM25 keyword search. Catches exact-match queries that pure vector search misses.
  4. Streaming responses. gRPC supports server streaming, so tokens can land in the client as the LLM produces them.
  5. Per-tenant collections. One Qdrant collection per user or per workspace.
  6. Observability. Trace every step: embedding latency, search latency, LLM latency. RAG systems get slow in surprising places.
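As a starting point for the first item, here is a minimal sliding-window chunker. It is a sketch rather than the repo's chunking code: it splits on whitespace, measures the window in words rather than tokens, and sliding_window_chunks is a hypothetical name.

```rust
/// Splits `text` into chunks of up to `window` words, where each chunk
/// repeats the last `overlap` words of the previous one.
///
/// Panics if `window` is 0 or `overlap >= window`, since the window
/// would then never advance.
pub fn sliding_window_chunks(text: &str, window: usize, overlap: usize) -> Vec<String> {
    assert!(window > 0, "window must be positive");
    assert!(overlap < window, "overlap must be smaller than window");

    let words: Vec<&str> = text.split_whitespace().collect();
    let step = window - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < words.len() {
        let end = (start + window).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            // The last window reached the end of the text.
            break;
        }
        start += step;
    }
    chunks
}
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice. Sentence-aware or semantic splitting would replace the split_whitespace step.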

If you want a structured course on RAG fundamentals in Rust (chunking strategies, retrieval evaluation, embedding selection), check out Foundations of RAG Systems with Rust on CodeSignal. Pair it with this post: course for the why, this build for the how.


💭 Closing thoughts

Modern AI engineering isn't really about models anymore. The models are a commodity you call over HTTP. The interesting work is everywhere else: splitting documents intelligently, indexing vectors at scale, routing requests with low latency, streaming results without blocking, keeping memory and concurrency under control.

That's systems engineering. And it's exactly the territory Rust is built for. Strong types, async without garbage collection, predictable performance, and a tooling ecosystem (tonic, tokio, rig, qdrant-client) that's quietly catching up to anything Python has for AI infrastructure.

You don't need Rust to build RAG. You need Rust when RAG turns into a real system someone depends on.

Build the thing. Read the source. Break it. That's where the understanding comes from. 🦀


🙌 Thanks for reading

If you made it this far: seriously, thank you. 🙏

I'd love to hear what you're building, what didn't work for you, or what you'd do differently. Feel free to drop a comment, open an issue on the repo, or reach out on LinkedIn. I'm always happy to talk Rust, AI, RAG, or anything in between.


