Salim Laimeche

Building Scalable Embedding Services with Rust and gRPC

Let's go! We're going to build some cool stuff.
In this article, I'll show you how to build a microservice whose job is to turn markdown into vectors and return them.

Ready, steady, GO !!!

Understanding the .proto File for gRPC-Based Embedding Service

When building a gRPC service, the .proto file serves as the core contract that defines the API structure. It specifies the service methods, their input and output message types, and how data is serialized. Here’s the .proto file for our embedding service:

syntax = "proto3";
package chunk_embed;

service ChunkEmbed {
    rpc ChunkEmbedMessage(ChunkEmbedRequest) returns (ChunkEmbedResponse);
    rpc EmbedMarkdown(EmbedMarkdownRequest) returns (EmbedMarkdownResponse);
}

message ChunkEmbedRequest {
    repeated string chunks = 1;
}

message ChunkEmbedResponse {
    repeated float embeddings = 1;
}

message EmbedMarkdownRequest {
    string markdown = 1;
}

message EmbedMarkdownResponse {
    repeated EmbeddingFromMarkdown embeddings = 1;
}

message EmbeddingFromMarkdown  {
    repeated float embedding =  1;
    bytes chunk = 2;
}

Key Components of the .proto File:

  1. Service Definition:

    • The ChunkEmbed service defines two remote procedure calls (RPCs):
      • ChunkEmbedMessage: Accepts a list of text chunks (chunks) and returns their embeddings as a flat array of floats.
      • EmbedMarkdown: Accepts a markdown string, processes it into chunks, and returns embeddings for each chunk.
  2. Message Types:

    • ChunkEmbedRequest:
      • Input for ChunkEmbedMessage. It contains a repeated string field named chunks, representing a list of text segments.
    • ChunkEmbedResponse:
      • Output for ChunkEmbedMessage. It contains a repeated float field named embeddings, representing a flat array of embeddings.
    • EmbedMarkdownRequest:
      • Input for EmbedMarkdown. It contains a single string field named markdown, representing markdown content to process.
    • EmbedMarkdownResponse:
      • Output for EmbedMarkdown. It contains a list of EmbeddingFromMarkdown objects.
    • EmbeddingFromMarkdown:
      • Holds both the embedding (a repeated float) and the corresponding markdown chunk (bytes chunk).
  3. Serialization and Data Flow:

    • Protobuf serialization ensures lightweight and efficient data transfer over the network.
    • Messages like ChunkEmbedRequest and EmbedMarkdownRequest specify how the client should format requests, while ChunkEmbedResponse and EmbedMarkdownResponse detail the server’s expected outputs.

This .proto file is the foundation for implementing the service in Rust or any gRPC-supported language. It ensures consistent communication between the client and server, making it crucial for building scalable, interoperable systems.

Rust build.rs for Compiling Protobuf

The build.rs file is a build script that runs before compiling the main Rust code. It is essential for generating the gRPC types and ensuring that the .proto file definitions are translated into Rust code that the application can use. Here is the provided build.rs file:

use std::error::Error;
use std::{env, path::PathBuf};

fn main() -> Result<(), Box<dyn Error>> {
    let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());

    tonic_build::configure()
        .file_descriptor_set_path(out_dir.join("chunk_embed_descriptor.bin"))
        .compile_protos(&["proto/chunk_embed.proto"], &["proto"])?;

    tonic_build::compile_protos("proto/chunk_embed.proto")?;
    Ok(())
}

Breakdown of the Code:

  1. Setting the Output Directory:

   let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());

  • The OUT_DIR environment variable is used to determine where the generated files should be placed. This directory is automatically managed by Cargo for build artifacts.

  2. File Descriptor Set:

   tonic_build::configure()
       .file_descriptor_set_path(out_dir.join("chunk_embed_descriptor.bin"))
       .compile_protos(&["proto/chunk_embed.proto"], &["proto"])?;

  • The file_descriptor_set_path function generates a .bin file containing the serialized descriptors for the Protobuf definitions. This is useful for reflection or dynamic use of the service (e.g., in tests or external tools).

  3. Compiling the Protos:

   tonic_build::compile_protos("proto/chunk_embed.proto")?;

  • The standalone compile_protos function is the simplest way to convert a .proto file into Rust modules. It takes the path to the .proto file and uses its parent directory (here, proto) as the import path. Note that the configured builder above already compiles the same file, so in practice only one of these two calls is needed.

  4. Error Handling:
    • The function returns a Result type to ensure proper error handling during the build process. Any failure in generating the Protobuf definitions will prevent the application from compiling.

Why Is This Important?

  • Automation: build.rs ensures that any updates to the .proto file are automatically reflected in the generated Rust code during compilation. This avoids manual regeneration and reduces the chance of mismatch between the service contract and implementation.
  • Compatibility: By compiling .proto files at build time, you ensure that your gRPC service can handle cross-language communication while adhering to the service definition.
  • Descriptor Generation: The .bin file can be used for additional features like gRPC reflection, debugging, or dynamic client generation.

How to Use

  1. Place the .proto file in the proto directory (relative to your project root).
  2. Ensure build.rs is in the root of your project.
  3. Add the following to your Cargo.toml (a recent tonic-build release is needed for the builder's compile_protos method used above):

   [build-dependencies]
   tonic-build = "0.12"

  4. Run cargo build to compile the .proto file and generate the gRPC types.

Output

After running the build process:

  • The generated code will be located in OUT_DIR (e.g., target/debug/build/...).
  • You can access the Rust module for chunk_embed using:
  pub mod chunk_embed {
      tonic::include_proto!("chunk_embed");
  }

This makes the gRPC types and services directly available in your Rust code!
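
The main.rs shown later references a CHUNK_EMBED_DESCRIPTOR constant for gRPC reflection. One way to define it, as a minimal sketch assuming the chunk_embed_descriptor.bin file name configured in build.rs above, is to embed the serialized descriptor set right next to the generated types:

pub mod chunk_embed {
    // Generated message and service types from chunk_embed.proto
    tonic::include_proto!("chunk_embed");

    // Serialized file descriptor set written by build.rs; consumed by the
    // reflection service when the server is assembled.
    pub const CHUNK_EMBED_DESCRIPTOR: &[u8] =
        tonic::include_file_descriptor_set!("chunk_embed_descriptor");
}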

Building a Markdown Chunking Client Using HuggingFace Models in Rust

This implementation provides a complete Rust client for chunking markdown documents into logical sections using HuggingFace's inference API. It employs an exponential backoff strategy for retries, ensuring robust handling of potential API failures. Let's break down the functionality.


Key Features of the Code:

  1. Markdown Chunking with HuggingFace:

    • The service takes a markdown document, sends it to a HuggingFace language model, and receives logical "chunks" separated by the delimiter ---.
  2. Retry Logic:

    • A retry mechanism is implemented with exponential backoff to handle transient errors from the API.
  3. Async Design:

    • Leveraging tokio for async requests and testing.

Code Walkthrough

Structs for Requests and Responses

use serde::{Deserialize, Serialize};

#[derive(Serialize)]
struct ChunkRequest {
    model: String,
    messages: Vec<Message>,
    temperature: f32,
}

#[derive(Serialize, Deserialize, Debug)]
struct Message {
    role: String,
    content: String,
}

#[derive(Deserialize, Debug)]
struct ChunkResponse {
    choices: Vec<Choice>,
}

#[derive(Deserialize, Debug)]
struct Choice {
    message: Message,
}
  • ChunkRequest: Represents the JSON payload sent to the HuggingFace API.
  • Message: Mimics a conversational exchange (system, user, and assistant roles).
  • ChunkResponse and Choice: Represent the structure of the API response.
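
If you want to double-check the wire format, a small sketch like the following (assuming serde_json is available, e.g. as a dev-dependency) prints the payload that .json(&request) will send to the chat-completions endpoint:

#[test]
fn inspect_chunk_request_json() {
    // Uses the ChunkRequest and Message structs defined above.
    let request = ChunkRequest {
        model: "meta-llama/Llama-3.3-70B-Instruct".to_string(),
        messages: vec![Message {
            role: "user".to_string(),
            content: "Please chunk this markdown...".to_string(),
        }],
        temperature: 0.7,
    };

    // Prints roughly: {"model":"meta-llama/Llama-3.3-70B-Instruct","messages":[{"role":"user","content":"..."}],"temperature":0.7}
    println!("{}", serde_json::to_string(&request).unwrap());
}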

The HuggingFaceClient Implementation

use anyhow::{anyhow, Result};
use reqwest::Client;

pub struct HuggingFaceClient;

impl HuggingFaceClient {
    pub fn new() -> Self {
        HuggingFaceClient
    }

    pub async fn chunk_markdown(&self, client: &Client, api_key: &str, markdown: &str) -> Result<Vec<String>> {
        retry_with_backoff(|| async {
            let request = ChunkRequest {
                model: "meta-llama/Llama-3.3-70B-Instruct".to_string(),
                messages: vec![
                    Message {
                        role: "system".to_string(),
                        content: "You are a helpful assistant that chunks markdown content into logical sections. Each chunk should be a complete thought or section.".to_string(),
                    },
                    Message {
                        role: "user".to_string(),
                        content: format!("Please chunk the following markdown into logical sections. Separate each section with '---':\n\n{}", markdown),
                    },
                ],
                temperature: 0.7,
            };

            let response = client
                .post("https://api-inference.huggingface.co/models/meta-llama/Llama-3.3-70B-Instruct/v1/chat/completions")
                .header("Authorization", format!("Bearer {}", api_key))
                .json(&request)
                .send()
                .await?;

            if !response.status().is_success() {
                let error_text = response.text().await?;
                return Err(anyhow!("API request failed: {}", error_text));
            }

            let chunk_response: ChunkResponse = response.json().await?;

            let content = chunk_response.choices
                .first()
                .ok_or_else(|| anyhow!("No choices in response"))?
                .message
                .content
                .clone();

            let chunks: Vec<String> = content
                .split("---")
                .map(|s| s.trim().to_string())
                .filter(|s| !s.is_empty())
                .collect();

            if chunks.is_empty() {
                Ok(vec![content])
            } else {
                Ok(chunks)
            }
        }).await
    }
}
Highlights:
  1. chunk_markdown Function:

    • Sends a markdown document to HuggingFace's inference API.
    • Parses the response to extract logical chunks delimited by ---.
  2. API Request:

    • The Client from reqwest sends the POST request to HuggingFace's API with the necessary headers and JSON payload.
  3. Chunk Parsing:

    • The response content is split by the delimiter --- into logical sections.
  4. Resilience:

    • The function retries the request up to 3 times with exponential backoff on failure.

Retry Logic with Exponential Backoff

use std::time::Duration;
use tokio::time::sleep;

async fn retry_with_backoff<F, Fut, T>(mut f: F) -> Result<T, anyhow::Error>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, anyhow::Error>>,
{
    let mut attempts = 0;
    let max_attempts = 3;
    let mut delay = Duration::from_millis(100);

    loop {
        match f().await {
            Ok(result) => return Ok(result),
            Err(e) if attempts < max_attempts => {
                println!("Attempt {} failed: {}. Retrying...", attempts + 1, e);
                attempts += 1;
                sleep(delay).await;
                delay *= 2; // Exponential backoff
            }
            Err(e) => return Err(e),
        }
    }
}
  • Implements a simple retry mechanism with exponential backoff for resilience.
  • If the request still fails after three retries (four attempts in total), the error is returned to the caller.

Testing the Client

The #[tokio::test] macro is used to write an async test for the chunk_markdown function.

#[tokio::test]
async fn test_chunk() -> Result<(), Box<dyn std::error::Error>> {
    dotenv().ok();
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(30))
        .build()?;
    let api_key = std::env::var("HF_API_KEY").expect("HF_API_KEY must be set");
    // can't give you markdown in a markdown article ^^
    let markdown = "your_markdown_here";
    println!("Markdown: {}", markdown);
    let hf_client = HuggingFaceClient::new();
    let chunks = hf_client.chunk_markdown(&client, &api_key, markdown).await?;

    println!("Number of chunks: {}", chunks.len());
    for (i, chunk) in chunks.iter().enumerate() {
        println!("Chunk {}: {}", i + 1, chunk);
    }

    assert!(!chunks.is_empty(), "Expected at least one chunk");
    Ok(())
}
Highlights:
  1. Environment Variable:

    • HF_API_KEY is loaded from the .env file for security.
  2. Markdown Example:

    • A markdown document is tested to ensure chunking is functional.
  3. Assertions:

    • The test asserts that at least one chunk is returned.

Implementing the gRPC Embedding Service in Rust

This ChunkEmbedService is the core gRPC service that handles two primary functions:

  1. Generating embeddings for a list of text chunks.
  2. Chunking markdown content into logical sections and embedding each section.

Let's break down its key components and functionality.


Service Structure

The ChunkEmbedService leverages two main dependencies:

  • FastEmbed: An embedding model that provides high-performance embeddings for text data.
  • HuggingFaceClient: A client for interacting with the HuggingFace API to chunk markdown into logical sections.

Key Features

1. Service Struct Definition

The service is encapsulated in the ChunkEmbedService struct, which contains a reference to the FastEmbed model:

pub struct ChunkEmbedService {
    embedder: Arc<FastEmbed>,
}

impl ChunkEmbedService {
    pub fn new(embedder: Arc<FastEmbed>) -> Self {
        ChunkEmbedService { embedder }
    }
}

The Arc<FastEmbed> ensures thread-safe sharing of the embedding model across multiple requests.


2. gRPC Method: chunk_embed_message

This method accepts a list of text chunks and generates embeddings for each chunk. The embeddings are then flattened and returned as a single array.

#[tonic::async_trait]
impl ChunkEmbed for ChunkEmbedService {
    async fn chunk_embed_message(
        &self,
        request: Request<ChunkEmbedRequest>,
    ) -> Result<Response<ChunkEmbedResponse>, Status> {
        let chunks = request.into_inner().chunks;

        // Generate embeddings with the model
        let embeddings_result: Result<Vec<Vec<f64>>, Status> = self
            .embedder
            .embed_documents(&chunks)
            .await
            .map_err(|e| Status::internal(format!("Embedding error: {}", e)));

        // Handle errors
        let embeddings = embeddings_result?;

        // Flatten and convert embeddings to f32
        let embeddings: Vec<f32> = embeddings
            .into_iter()
            .flatten()
            .map(|e| e as f32)
            .collect();

        // Construct the response
        let response = ChunkEmbedResponse { embeddings };

        Ok(Response::new(response))
    }
}

Highlights:

  • The embed_documents method is called asynchronously to process the text chunks.
  • Any errors are handled and mapped to a gRPC Status::internal error.
  • The result is flattened and converted to f32 to comply with the ChunkEmbedResponse schema.

3. gRPC Method: embed_markdown

This method processes markdown content by:

  1. Chunking it into logical sections using the HuggingFaceClient.
  2. Generating embeddings for each chunk.
async fn embed_markdown(
    &self,
    request: Request<EmbedMarkdownRequest>,
) -> Result<Response<EmbedMarkdownResponse>, Status> {
    dotenv().ok();

    // Build the HTTP client with a timeout; map failures to gRPC errors instead of panicking
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(30))
        .build()
        .map_err(|e| Status::internal(format!("Failed to build HTTP client: {}", e)))?;

    let markdown = request.into_inner().markdown;
    let api_key = std::env::var("HF_API_KEY")
        .map_err(|_| Status::internal("HF_API_KEY must be set"))?;

    // Chunk the markdown via the HuggingFace client, propagating errors as gRPC statuses
    let hf_client = HuggingFaceClient::new();
    let chunks = hf_client
        .chunk_markdown(&client, &api_key, &markdown)
        .await
        .map_err(|e| Status::internal(format!("Chunking error: {}", e)))?;

    // Generate embeddings with the model
    let embeddings_result: Result<Vec<Vec<f64>>, Status> = self
        .embedder
        .embed_documents(&chunks)
        .await
        .map_err(|e| Status::internal(format!("Embedding error: {}", e)));

    // Create response with embeddings and original chunks
    let result: Vec<EmbeddingFromMarkdown> = match embeddings_result {
        Ok(embeddings) => embeddings
            .iter()
            .enumerate()
            .map(|(index, embedding)| {
                EmbeddingFromMarkdown {
                    embedding: embedding.iter().map(|&e| e as f32).collect(),
                    chunk: chunks
                        .get(index)
                        .expect("Chunk not found for index")
                        .as_bytes()
                        .to_vec(),
                }
            })
            .collect(),
        Err(e) => return Err(Status::internal(format!("Embedding error: {}", e))),
    };

    Ok(Response::new(EmbedMarkdownResponse { embeddings: result }))
}

Highlights:

  • The HuggingFaceClient is used to chunk the markdown into logical sections.
  • Each chunk is embedded using the FastEmbed model.
  • The response contains both the embeddings and the original chunks, making it useful for downstream applications.

Resilient API Design

The service is designed with resilience in mind:

  • Error Handling: Any errors during embedding or chunking are caught and returned as gRPC Status errors.
  • Timeouts: The reqwest::Client is configured with a timeout to prevent long waits on external APIs.
  • Dependency Decoupling: The use of Arc<FastEmbed> and the modular HuggingFaceClient ensures that the service remains flexible and scalable.

Example Workflow

  1. Chunk Embedding:

    • The client sends a ChunkEmbedRequest containing a list of text chunks.
    • The service responds with flattened embeddings as a ChunkEmbedResponse.
  2. Markdown Embedding:

    • The client sends an EmbedMarkdownRequest containing markdown content.
    • The service responds with EmbedMarkdownResponse, which includes embeddings and the corresponding markdown chunks.
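
To make this workflow concrete, here is a minimal client sketch. It reuses the generated chunk_embed module from earlier and relies on the client type that tonic generates for the ChunkEmbed service (ChunkEmbedClient in the chunk_embed_client module); the server address is an assumption:

// Generated client-side types (same include as on the server)
pub mod chunk_embed {
    tonic::include_proto!("chunk_embed");
}

use chunk_embed::chunk_embed_client::ChunkEmbedClient;
use chunk_embed::{ChunkEmbedRequest, EmbedMarkdownRequest};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut client = ChunkEmbedClient::connect("http://localhost:50051").await?;

    // 1. Chunk embedding: send pre-chunked text, get back a flat array of floats
    let response = client
        .chunk_embed_message(ChunkEmbedRequest {
            chunks: vec!["Rust is fast.".into(), "gRPC is efficient.".into()],
        })
        .await?;
    println!("embedding values returned: {}", response.into_inner().embeddings.len());

    // 2. Markdown embedding: send raw markdown, get back (embedding, chunk) pairs
    let response = client
        .embed_markdown(EmbedMarkdownRequest {
            markdown: "# Title\n\nSome content.".into(),
        })
        .await?;
    for e in response.into_inner().embeddings {
        println!("chunk of {} bytes -> {} dimensions", e.chunk.len(), e.embedding.len());
    }

    Ok(())
}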

Conclusion: Building a Robust gRPC Server with main.rs

The main.rs file serves as the entry point for our gRPC server, tying together all the components we've built to create a production-ready embedding service. This file encapsulates the core logic for initializing the server, configuring services, and handling requests efficiently. Let’s break down the key takeaways and best practices demonstrated in this implementation.


Key Components of main.rs

  1. Service Initialization The ChunkEmbedService is initialized with a thread-safe Arc<FastEmbed> instance, ensuring that the embedding model can be shared across multiple requests without duplication. This design promotes resource efficiency and scalability.

   let fast_embed = Arc::new(FastEmbed::try_new()?);
   let embed_service = ChunkEmbedService::new(fast_embed);

  2. gRPC Reflection Setup The inclusion of tonic_reflection enables service discovery and runtime inspection, which is invaluable for debugging and maintaining the server in production environments.

   let descriptor = tonic_reflection::server::Builder::configure()
       .register_encoded_file_descriptor_set(CHUNK_EMBED_DESCRIPTOR)
       .build_v1()?;

  3. Server Configuration The server is configured to listen on all network interfaces (0.0.0.0) at port 50051, the conventional default port used in most gRPC examples. This setup ensures the server is accessible from external clients while maintaining simplicity.

   let addr = "0.0.0.0:50051".parse()?;

  4. Asynchronous Runtime The use of #[tokio::main] ensures that the server runs asynchronously, enabling it to handle multiple concurrent requests efficiently. This is critical for maintaining high performance under load.

   #[tokio::main]
   async fn main() -> Result<(), Box<dyn std::error::Error>> {

  5. Error Handling The Result type and Box<dyn std::error::Error> provide robust error handling, ensuring that any issues during server initialization or operation are gracefully managed and logged.
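
Putting these pieces together, a complete main.rs could look like the sketch below. The module layout and the FastEmbed import are assumptions; adjust them to wherever those items live in your project:

use std::sync::Arc;

use tonic::transport::Server;

// Generated proto code plus the descriptor constant (see the build.rs section above).
pub mod chunk_embed {
    tonic::include_proto!("chunk_embed");

    pub const CHUNK_EMBED_DESCRIPTOR: &[u8] =
        tonic::include_file_descriptor_set!("chunk_embed_descriptor");
}

mod service; // the ChunkEmbedService implementation shown earlier

use chunk_embed::chunk_embed_server::ChunkEmbedServer;
use chunk_embed::CHUNK_EMBED_DESCRIPTOR;
use service::ChunkEmbedService;
// Assumption: FastEmbed is re-exported from the service module; point this
// import at wherever the embedding wrapper actually lives.
use service::FastEmbed;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Shared, thread-safe embedding model
    let fast_embed = Arc::new(FastEmbed::try_new()?);
    let embed_service = ChunkEmbedService::new(fast_embed);

    // Reflection service built from the descriptor generated by build.rs
    let descriptor = tonic_reflection::server::Builder::configure()
        .register_encoded_file_descriptor_set(CHUNK_EMBED_DESCRIPTOR)
        .build_v1()?;

    let addr = "0.0.0.0:50051".parse()?;
    println!("ChunkEmbed gRPC server listening on {}", addr);

    Server::builder()
        .add_service(descriptor)
        .add_service(ChunkEmbedServer::new(embed_service))
        .serve(addr)
        .await?;

    Ok(())
}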

Best Practices Demonstrated

  1. Thread Safety with Arc

    By wrapping the FastEmbed instance in an Arc, the server ensures that the embedding model is shared safely across threads, preventing data races and improving performance.

  2. Modular Design

    The separation of concerns is evident in the use of the service module, which encapsulates the gRPC service implementation. This modular approach makes the codebase easier to maintain and extend.

  3. Production-Ready Configuration

    The server is designed with production use in mind, featuring:

    • gRPC reflection for service discovery.
    • Asynchronous request handling for scalability.
    • Clear logging for monitoring and debugging.
  4. Error Propagation

    The use of ? for error propagation ensures that any failures during server setup or operation are immediately handled, preventing silent failures and improving reliability.

(Screenshot: calling the service from the Kreya gRPC client.)


Next Steps for Production Deployment

While the current implementation provides a solid foundation, there are additional steps you can take to prepare the server for a production environment:

  1. Add Logging and Metrics

    Integrate a logging framework like tracing or log to capture detailed logs. Additionally, consider adding metrics collection using tools like Prometheus to monitor server performance.

  2. Implement Security Measures

    • Enable TLS for secure communication.
    • Add authentication and authorization mechanisms to restrict access to the service.
  3. Configure Health Checks

    Implement gRPC health checks to allow load balancers and orchestration tools (e.g., Kubernetes) to monitor the server’s status (see the sketch after this list).

  4. Optimize Resource Usage

    • Set connection limits and timeouts to prevent resource exhaustion.
    • Use connection pooling for downstream services like HuggingFace API.
  5. Automate Deployment

    Use containerization (e.g., Docker) and orchestration tools (e.g., Kubernetes) to streamline deployment and scaling.
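
For the health-check step above, tonic's companion crate tonic-health can serve the standard grpc.health.v1 protocol alongside the embedding service. A minimal sketch, assuming tonic-health is added as a dependency and reusing the types from earlier (module paths are assumptions):

use tonic::transport::Server;

use crate::chunk_embed::chunk_embed_server::ChunkEmbedServer;
use crate::service::ChunkEmbedService;

// Sketch: serve the standard gRPC health protocol next to the embedding service.
async fn serve_with_health_check(
    embed_service: ChunkEmbedService,
    addr: std::net::SocketAddr,
) -> Result<(), Box<dyn std::error::Error>> {
    // Health service provided by the tonic-health crate
    let (mut health_reporter, health_service) = tonic_health::server::health_reporter();

    // Report the embedding service as SERVING so Kubernetes probes see a healthy status
    health_reporter
        .set_serving::<ChunkEmbedServer<ChunkEmbedService>>()
        .await;

    Server::builder()
        .add_service(health_service)
        .add_service(ChunkEmbedServer::new(embed_service))
        .serve(addr)
        .await?;

    Ok(())
}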


Final Thoughts

The main.rs file is the backbone of our gRPC server, bringing together all the components we’ve built into a cohesive and efficient system. By following best practices in error handling, resource management, and modular design, this implementation provides a robust foundation for building scalable and maintainable gRPC services in Rust.

As you continue to develop and deploy this server, remember to:

  • Monitor performance and resource usage.
  • Keep dependencies up to date.
  • Document API changes and server configurations.
  • Regularly test and validate the system under load.

With these considerations in mind, you’re well-equipped to build and deploy high-performance gRPC services that meet the demands of modern applications.

I hope you enjoy reading this article as much as I enjoyed coding and writing it. I'm always happy to discuss coding, Rust, and AI; you can find me on LinkedIn and Twitter:

See ya !
