TL;DR: Build a powerful semantic search system in Rust using Rig and LanceDB. We'll guide you step-by-step through creating, storing, and searching vector embeddings efficiently with hands-on examples. Perfect for building RAG systems, semantic search engines, and more.
Introduction
Semantic search is transforming the way we find and understand information. Unlike traditional keyword search, it captures the intent behind your queries, offering a more nuanced retrieval process. However, building these systems can feel daunting, often involving complex embeddings, vector databases, and similarity search algorithms.
That's where LanceDB comes in.
Why LanceDB?
LanceDB is an open-source vector database tailored for AI applications and vector search. It provides:
- Embedded Database: Works directly in your application without needing external servers.
- High Performance: Leverages Arrow format for efficient data storage and retrieval.
- Scalable: Handles terabyte-scale datasets efficiently.
- Vector Indexing: Supports both exact and approximate nearest neighbor searches out of the box.
Combined with Rig's embedding and LLM capabilities, you can create a powerful, efficient semantic search solution with minimal code.
Let's dive in!
You can find the full source code for this project in our GitHub repo.
Prerequisites
Before we begin, make sure you have:
- Rust installed (rust-lang.org)
- An OpenAI API key (platform.openai.com)
- Basic familiarity with Rust and asynchronous programming
Project Setup
To start, create a new Rust project:
cargo new vector_search
cd vector_search
Update your `Cargo.toml` to add the necessary dependencies:
[dependencies]
rig-core = "0.4.0"
rig-lancedb = "0.1.1"
lancedb = "0.10.0"
tokio = { version = "1.40.0", features = ["full"] }
anyhow = "1.0.89"
futures = "0.3.30"
serde = { version = "1.0.210", features = ["derive"] }
serde_json = "1.0.128"
arrow-array = "52.2.0"
Here’s a quick overview of each dependency:
- `rig-core` and `rig-lancedb`: The core libraries for embedding generation and vector search.
- `lancedb`: The embedded vector database.
- `tokio`: Asynchronous runtime support.
- `arrow-array`: To work with Arrow's columnar format, which LanceDB uses internally.
- Others for error handling, serialization, and futures support.
Now, create a `.env` file to store your OpenAI API key:
echo "OPENAI_API_KEY=your_key_here" > .env
Building the Search System
We’ll break this into manageable steps. First, let’s create a utility function to handle data conversion between Rig's embeddings and LanceDB's format.
Create `src/utils.rs`:
use std::sync::Arc;
use arrow_array::{
types::Float64Type, ArrayRef, FixedSizeListArray,
RecordBatch, StringArray
};
use lancedb::arrow::arrow_schema::{DataType, Field, Fields, Schema};
use rig::embeddings::DocumentEmbeddings;
// Define the schema for our LanceDB table
pub fn schema(dims: usize) -> Schema {
Schema::new(Fields::from(vec![
Field::new("id", DataType::Utf8, false),
Field::new("content", DataType::Utf8, false),
Field::new(
"embedding",
DataType::FixedSizeList(
Arc::new(Field::new("item", DataType::Float64, true)),
dims as i32,
),
false,
),
]))
}
This schema function defines the structure of our table:
- `id`: A unique identifier for each document.
- `content`: The text content of the document.
- `embedding`: The vector representation of the content.
- The `dims` parameter: the size of the embedding vectors (e.g., 1536 for OpenAI's `ada-002` model).
Next, add the conversion function to convert `DocumentEmbeddings` into a `RecordBatch` for LanceDB:
pub fn as_record_batch(
records: Vec<DocumentEmbeddings>,
dims: usize,
) -> Result<RecordBatch, lancedb::arrow::arrow_schema::ArrowError> {
let id = StringArray::from_iter_values(
records
.iter()
.flat_map(|record| (0..record.embeddings.len())
.map(|i| format!("{}-{i}", record.id)))
.collect::<Vec<_>>(),
);
let content = StringArray::from_iter_values(
records
.iter()
.flat_map(|record| {
record
.embeddings
.iter()
.map(|embedding| embedding.document.clone())
})
.collect::<Vec<_>>(),
);
let embedding = FixedSizeListArray::from_iter_primitive::<Float64Type, _, _>(
records
.into_iter()
.flat_map(|record| {
record
.embeddings
.into_iter()
.map(|embedding| embedding.vec.into_iter().map(Some).collect::<Vec<_>>())
.map(Some)
.collect::<Vec<_>>()
})
.collect::<Vec<_>>(),
dims as i32,
);
RecordBatch::try_from_iter(vec![
("id", Arc::new(id) as ArrayRef),
("content", Arc::new(content) as ArrayRef),
("embedding", Arc::new(embedding) as ArrayRef),
])
}
This function is crucial as it converts our Rust data structures into Arrow's columnar format, which LanceDB uses internally:
- Creates string arrays for IDs and content.
- Converts embeddings into fixed-size lists.
- Assembles everything into a `RecordBatch`.
With our utility functions ready, let's build the main search functionality in `src/main.rs`. We'll implement this step-by-step, explaining each part along the way.
Setting Up Dependencies
First, let’s import the required libraries:
use anyhow::Result;
use arrow_array::RecordBatchIterator;
use lancedb::{index::vector::IvfPqIndexBuilder, DistanceType};
use rig::{
embeddings::{DocumentEmbeddings, EmbeddingModel, EmbeddingsBuilder},
providers::openai::{Client, TEXT_EMBEDDING_ADA_002},
vector_store::VectorStoreIndex,
};
use rig_lancedb::{LanceDbVectorStore, SearchParams};
use serde::Deserialize;
use std::{env, sync::Arc};
mod utils;
use utils::{as_record_batch, schema};
These imports bring in:
- Rig’s embedding and vector storage tools.
- LanceDB’s database capabilities.
- Arrow data structures for efficient processing.
- Utilities for serialization, error handling, and async programming.
Defining Data Structures
We’ll create a simple struct to represent our search results:
#[derive(Debug, Deserialize)]
struct SearchResult {
content: String,
}
This struct maps to database records, representing the content we want to retrieve.
Generating Embeddings
Generating document embeddings is a core part of our system. Let’s implement this function:
async fn create_embeddings(client: &Client) -> Result<Vec<DocumentEmbeddings>> {
let model = client.embedding_model(TEXT_EMBEDDING_ADA_002);
// Set up dummy data to meet the 256 row requirement for IVF-PQ indexing
let dummy_doc = "Let there be light".to_string();
let dummy_docs = vec![dummy_doc; 256];
// Generate embeddings for the data
let embeddings = EmbeddingsBuilder::new(model)
// First add our real documents
.simple_document(
"doc1",
"Rust provides zero-cost abstractions and memory safety without garbage collection.",
)
.simple_document(
"doc2",
"Python emphasizes code readability with significant whitespace.",
)
// Add dummy documents to meet minimum requirement using enumerate to generate unique IDs
.simple_documents(
dummy_docs
.into_iter()
.enumerate()
.map(|(i, doc)| (format!("doc{}", i + 3), doc))
.collect(),
)
.build()
.await?;
Ok(embeddings)
}
This function handles:
- Initializing the OpenAI embedding model.
- Creating embeddings for our real documents.
- Adding dummy data to meet LanceDB’s indexing requirements.
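In a real application you would typically load documents from disk rather than hardcoding them. As a rough sketch (the `documents.json` file name and the `RawDoc` shape are illustrative assumptions, not part of this tutorial's repo), the same builder can accept documents read from a JSON file; keep in mind the IVF-PQ index still needs at least 256 rows in total:
use serde::Deserialize;

#[derive(Deserialize)]
struct RawDoc {
    id: String,
    content: String,
}

// Sketch: read documents from a JSON file and hand them to EmbeddingsBuilder.
// The file name and RawDoc shape are assumptions for illustration only.
async fn create_embeddings_from_file(client: &Client) -> Result<Vec<DocumentEmbeddings>> {
    let model = client.embedding_model(TEXT_EMBEDDING_ADA_002);
    let raw = std::fs::read_to_string("documents.json")?;
    let docs: Vec<RawDoc> = serde_json::from_str(&raw)?;

    let embeddings = EmbeddingsBuilder::new(model)
        .simple_documents(docs.into_iter().map(|d| (d.id, d.content)).collect())
        .build()
        .await?;

    Ok(embeddings)
}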
Configuring the Vector Store
Now, let’s set up LanceDB and configure it with appropriate indexing and search parameters:
async fn setup_vector_store<M: EmbeddingModel>(
embeddings: Vec<DocumentEmbeddings>,
model: M,
) -> Result<LanceDbVectorStore<M>> {
// Initialize LanceDB
let db = lancedb::connect("data/lancedb-store").execute().await?;
// Drop the existing table if it exists - important for development
if db
.table_names()
.execute()
.await?
.contains(&"documents".to_string())
{
db.drop_table("documents").await?;
}
// Create table with embeddings
let record_batch = as_record_batch(embeddings, model.ndims())?;
let table = db
.create_table(
"documents",
RecordBatchIterator::new(vec![Ok(record_batch)], Arc::new(schema(model.ndims()))),
)
.execute()
.await?;
// Create an optimized vector index using IVF-PQ
table
.create_index(
&["embedding"],
lancedb::index::Index::IvfPq(
IvfPqIndexBuilder::default().distance_type(DistanceType::Cosine),
),
)
.execute()
.await?;
// Configure search parameters
let search_params = SearchParams::default().distance_type(DistanceType::Cosine);
// Create and return vector store
Ok(LanceDbVectorStore::new(table, model, "id", search_params).await?)
}
This setup function:
- Connects to the LanceDB database.
- Manages table creation and deletion.
- Sets up vector indexing for efficient similarity search.
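If you want a quick sanity check that the table was populated, you can count its rows before returning the store. This optional addition assumes LanceDB's `count_rows` method (which takes an optional filter) is available on the table handle:
// Optional sanity check inside setup_vector_store, after creating the index.
// count_rows(None) counts all rows; pass Some("...") to count a filtered subset.
let row_count = table.count_rows(None).await?;
println!("Table 'documents' contains {} rows", row_count);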
Putting It All Together
Finally, the main function orchestrates the entire process:
#[tokio::main]
async fn main() -> Result<()> {
// Initialize OpenAI client
let openai_api_key = env::var("OPENAI_API_KEY").expect("OPENAI_API_KEY not set");
let openai_client = Client::new(&openai_api_key);
let model = openai_client.embedding_model(TEXT_EMBEDDING_ADA_002);
// Create embeddings (includes both real and dummy documents)
let embeddings = create_embeddings(&openai_client).await?;
println!("Created embeddings for {} documents", embeddings.len());
// Set up vector store
let store = setup_vector_store(embeddings, model).await?;
println!("Vector store initialized successfully");
// Perform a semantic search
let query = "Tell me about safe programming languages";
let results = store.top_n::<SearchResult>(query, 2).await?;
println!("\nSearch Results for: {}\n", query);
for (score, id, result) in results {
println!(
"Score: {:.4}\nID: {}\nContent: {}\n",
score, id, result.content
);
}
Ok(())
}
Understanding Vector Search Methods
Vector search systems need to balance accuracy and performance, especially as datasets grow. LanceDB provides two approaches to handle this: Exact Nearest Neighbor (ENN) and Approximate Nearest Neighbor (ANN) searches.
ENN vs ANN
- Exact Nearest Neighbor (ENN):
  - Searches exhaustively across all vectors.
  - Guarantees finding the true nearest neighbors.
  - Works well for small datasets.
  - No minimum data requirement.
  - Slower, but more accurate.
- Approximate Nearest Neighbor (ANN):
  - Uses indexing (like IVF-PQ) to speed up searches.
  - Returns approximate results.
  - Suited for larger datasets.
  - Faster, but slightly less accurate.
Choosing the Right Approach
Use ENN when:
- Dataset is small (< 1,000 vectors).
- Exact matches are crucial.
- Performance isn’t a major concern.
Use ANN when:
- Dataset is larger.
- You can tolerate minor approximations.
- Fast search speed is needed.
In this tutorial, we use ANN for scalability; for smaller datasets, ENN may be the better fit.
Tip: Start with ENN during development. Transition to ANN as your data and performance needs grow. Check out the ENN example.
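For reference, here is roughly what an ENN variant of our setup looks like. It mirrors setup_vector_store but skips the create_index call (and therefore no longer needs the 256 dummy rows), so LanceDB falls back to an exhaustive, exact scan. Treat this as a sketch of the idea rather than a drop-in replacement:
// Sketch: ENN setup, identical to setup_vector_store but without the IVF-PQ index.
// Without an index, LanceDB scans every vector: exact results, but slower on large tables.
async fn setup_vector_store_enn<M: EmbeddingModel>(
    embeddings: Vec<DocumentEmbeddings>,
    model: M,
) -> Result<LanceDbVectorStore<M>> {
    let db = lancedb::connect("data/lancedb-store").execute().await?;

    let record_batch = as_record_batch(embeddings, model.ndims())?;
    let table = db
        .create_table(
            "documents",
            RecordBatchIterator::new(vec![Ok(record_batch)], Arc::new(schema(model.ndims()))),
        )
        .execute()
        .await?;

    // No create_index call: searches run as exhaustive (exact) nearest-neighbor scans.
    let search_params = SearchParams::default().distance_type(DistanceType::Cosine);
    Ok(LanceDbVectorStore::new(table, model, "id", search_params).await?)
}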
Running the System
To run the project:
cargo run
Expected output:
Created embeddings for 258 documents
Vector store initialized successfully
Search Results for: Tell me about safe programming languages
Score: 0.3982
ID: doc2-0
Content: Python emphasizes code readability with significant whitespace.
Score: 0.4369
ID: doc1-0
Content: Rust provides zero-cost abstractions and memory safety without garbage collection.
Next Steps
If you’re ready to build more with Rig, here are some practical examples:
1. Build a RAG System
Want to give your LLM access to custom knowledge? Check out our tutorial on Building a RAG System with Rig in Under 100 Lines of Code.
2. Create an AI Agent
Ready to build more interactive AI applications? See our Agent Example.
3. Join the Community
Stay Connected
I’m always excited to hear from developers! If you’re interested in Rust, LLMs, or building intelligent assistants, join our Discord. Let’s build something amazing together!
And don’t forget: Build something with Rig, share your feedback, and get a chance to win $100.
Ad Astra,
Tachi
Co-Founder @ Playgrounds Analytics
This tutorial is part of our "Build with Rig" series. Follow our Website for more.