Let's go! We're going to build some cool stuff.
In this article, I will show you how to build a microservice whose job is to create vectors (embeddings) from markdown and return them.
Ready, steady, GO!!!
Understanding the .proto File for a gRPC-Based Embedding Service
When building a gRPC service, the .proto file serves as the core contract that defines the API structure. It specifies the service methods, their input and output message types, and how data is serialized. Here’s the .proto file for our embedding service:
syntax = "proto3";
package chunk_embed;
service ChunkEmbed {
rpc ChunkEmbedMessage(ChunkEmbedRequest) returns (ChunkEmbedResponse);
rpc EmbedMarkdown(EmbedMarkdownRequest) returns (EmbedMarkdownResponse);
}
message ChunkEmbedRequest {
repeated string chunks = 1;
}
message ChunkEmbedResponse {
repeated float embeddings = 1;
}
message EmbedMarkdownRequest {
string markdown = 1;
}
message EmbedMarkdownResponse {
repeated EmbeddingFromMarkdown embeddings = 1;
}
message EmbeddingFromMarkdown {
repeated float embedding = 1;
bytes chunk = 2;
}
Key Components of the .proto File:
- Service Definition:
  - The ChunkEmbed service defines two remote procedure calls (RPCs):
    - ChunkEmbedMessage: Accepts a list of text chunks (chunks) and returns their embeddings as a flat array of floats.
    - EmbedMarkdown: Accepts a markdown string, processes it into chunks, and returns embeddings for each chunk.
- Message Types:
  - ChunkEmbedRequest: Input for ChunkEmbedMessage. It contains a repeated string field named chunks, representing a list of text segments.
  - ChunkEmbedResponse: Output for ChunkEmbedMessage. It contains a repeated float field named embeddings, representing a flat array of embeddings.
  - EmbedMarkdownRequest: Input for EmbedMarkdown. It contains a single string field named markdown, representing markdown content to process.
  - EmbedMarkdownResponse: Output for EmbedMarkdown. It contains a list of EmbeddingFromMarkdown objects.
  - EmbeddingFromMarkdown: Holds both the embedding (a repeated float) and the corresponding markdown chunk (bytes chunk).
- Serialization and Data Flow:
  - Protobuf serialization ensures lightweight and efficient data transfer over the network.
  - Messages like ChunkEmbedRequest and EmbedMarkdownRequest specify how the client should format requests, while ChunkEmbedResponse and EmbedMarkdownResponse detail the server’s expected outputs.
This .proto file is the foundation for implementing the service in Rust or any gRPC-supported language. It ensures consistent communication between the client and server, making it crucial for building scalable, interoperable systems.
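To make that mapping concrete, here is roughly what tonic-build / prost generates for these messages (a simplified sketch; the real generated code carries prost attributes and fully qualified types):
// Simplified view of the generated Rust structs (derive attributes omitted)
pub struct ChunkEmbedRequest {
    pub chunks: Vec<String>,
}
pub struct ChunkEmbedResponse {
    pub embeddings: Vec<f32>,
}
pub struct EmbedMarkdownRequest {
    pub markdown: String,
}
pub struct EmbedMarkdownResponse {
    pub embeddings: Vec<EmbeddingFromMarkdown>,
}
pub struct EmbeddingFromMarkdown {
    pub embedding: Vec<f32>,
    pub chunk: Vec<u8>,
}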
Rust build.rs for Compiling Protobuf
The build.rs file is a build script that runs before compiling the main Rust code. It is essential for generating the gRPC types and ensuring that the .proto file definitions are translated into Rust code that the application can use. Here is the build.rs file:
use std::error::Error;
use std::{env, path::PathBuf};
fn main() -> Result<(), Box<dyn Error>> {
// OUT_DIR is the Cargo-managed directory for build artifacts
let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
// Compile the .proto file and also emit a serialized file descriptor set,
// which the server later uses for gRPC reflection
tonic_build::configure()
.file_descriptor_set_path(out_dir.join("chunk_embed_descriptor.bin"))
.compile_protos(&["proto/chunk_embed.proto"], &["proto"])?;
// Plain compilation of the same .proto (already covered by the call above, but harmless)
tonic_build::compile_protos("proto/chunk_embed.proto")?;
Ok(())
}
Breakdown of the Code:
- Setting the Output Directory:
let out_dir = PathBuf::from(env::var("OUT_DIR").unwrap());
The OUT_DIR environment variable is used to determine where the generated files should be placed. This directory is automatically managed by Cargo for build artifacts.
- File Descriptor Set:
tonic_build::configure()
.file_descriptor_set_path(out_dir.join("chunk_embed_descriptor.bin"))
.compile_protos(&["proto/chunk_embed.proto"], &["proto"])?;
The file_descriptor_set_path function generates a .bin file containing the serialized descriptors for the Protobuf definitions. This is useful for reflection or dynamic use of the service (e.g., in tests or external tools).
- Compiling the Protos:
tonic_build::compile_protos("proto/chunk_embed.proto")?;
compile_protos is the core function of tonic-build that converts the .proto file into Rust modules. The builder call above takes:
  - The .proto file(s) to compile.
  - The path(s) to look for imports (in this case, the proto directory).
- Error Handling:
The function returns a Result type to ensure proper error handling during the build process. Any failure in generating the Protobuf definitions will prevent the application from compiling.
Why Is This Important?
- Automation: build.rs ensures that any updates to the .proto file are automatically reflected in the generated Rust code during compilation. This avoids manual regeneration and reduces the chance of a mismatch between the service contract and the implementation.
- Compatibility: By compiling .proto files at build time, you ensure that your gRPC service can handle cross-language communication while adhering to the service definition.
- Descriptor Generation: The .bin file can be used for additional features like gRPC reflection, debugging, or dynamic client generation.
How to Use
- Place the .proto file in the proto directory (relative to your project root).
- Ensure build.rs is in the root of your project.
- Add the following to your Cargo.toml:
[build-dependencies]
tonic-build = "0.9"
- Run cargo build to compile the .proto file and generate the gRPC types.
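The generated code and the server also need the runtime crates in [dependencies]; a minimal sketch (version numbers are indicative; keep tonic, tonic-build, tonic-reflection, and prost on matching releases):
[dependencies]
tonic = "0.9"
prost = "0.11"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
tonic-reflection = "0.9" # used later for the reflection service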
Output
After running the build process:
- The generated code will be located in OUT_DIR (e.g., target/debug/build/...).
- You can access the Rust module for chunk_embed using:
pub mod chunk_embed {
tonic::include_proto!("chunk_embed");
}
This makes the gRPC types and services directly available in your Rust code!
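One detail worth noting here: the reflection setup shown at the end of this article references a CHUNK_EMBED_DESCRIPTOR constant. A minimal sketch of how it could be defined next to the generated module, assuming the descriptor name chosen in build.rs ("chunk_embed_descriptor.bin"):
pub mod chunk_embed {
    tonic::include_proto!("chunk_embed");
}

// Serialized file descriptor set emitted by build.rs, used to register gRPC reflection.
pub const CHUNK_EMBED_DESCRIPTOR: &[u8] =
    tonic::include_file_descriptor_set!("chunk_embed_descriptor");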
Building a Markdown Chunking Client Using HuggingFace Models in Rust
This implementation provides a complete Rust client for chunking markdown documents into logical sections using HuggingFace's inference API. It employs an exponential backoff strategy for retries, ensuring robust handling of potential API failures. Let's break down the functionality.
Key Features of the Code:
- Markdown Chunking with HuggingFace:
  - The service takes a markdown document, sends it to a HuggingFace language model, and receives logical "chunks" separated by the delimiter ---.
- Retry Logic:
  - A retry mechanism is implemented with exponential backoff to handle transient errors from the API.
- Async Design:
  - Leveraging tokio for async requests and testing.
Code Walkthrough
Structs for Requests and Responses
// Imports used by the client and retry helper
use anyhow::{anyhow, Result};
use reqwest::Client;
use serde::{Deserialize, Serialize};
use std::time::Duration;
use tokio::time::sleep;
#[derive(Serialize)]
struct ChunkRequest {
model: String,
messages: Vec<Message>,
temperature: f32,
}
#[derive(Serialize, Deserialize, Debug)]
struct Message {
role: String,
content: String,
}
#[derive(Deserialize, Debug)]
struct ChunkResponse {
choices: Vec<Choice>,
}
#[derive(Deserialize, Debug)]
struct Choice {
message: Message,
}
- ChunkRequest: Represents the JSON payload sent to the HuggingFace API.
- Message: Mimics a conversational exchange (system, user, and assistant roles).
- ChunkResponse and Choice: Represent the structure of the API response.
The HuggingFaceClient Implementation
pub struct HuggingFaceClient;
impl HuggingFaceClient {
pub fn new() -> Self {
HuggingFaceClient
}
pub async fn chunk_markdown(&self, client: &Client, api_key: &str, markdown: &str) -> Result<Vec<String>> {
retry_with_backoff(|| async {
let request = ChunkRequest {
model: "meta-llama/Llama-3.3-70B-Instruct".to_string(),
messages: vec![
Message {
role: "system".to_string(),
content: "You are a helpful assistant that chunks markdown content into logical sections. Each chunk should be a complete thought or section.".to_string(),
},
Message {
role: "user".to_string(),
content: format!("Please chunk the following markdown into logical sections. Separate each section with '---':\n\n{}", markdown),
},
],
temperature: 0.7,
};
let response = client
.post("https://api-inference.huggingface.co/models/meta-llama/Llama-3.3-70B-Instruct/v1/chat/completions")
.header("Authorization", format!("Bearer {}", api_key))
.json(&request)
.send()
.await?;
if !response.status().is_success() {
let error_text = response.text().await?;
return Err(anyhow!("API request failed: {}", error_text));
}
let chunk_response: ChunkResponse = response.json().await?;
let content = chunk_response.choices
.first()
.ok_or_else(|| anyhow!("No choices in response"))?
.message
.content
.clone();
let chunks: Vec<String> = content
.split("---")
.map(|s| s.trim().to_string())
.filter(|s| !s.is_empty())
.collect();
if chunks.is_empty() {
Ok(vec![content])
} else {
Ok(chunks)
}
}).await
}
}
Highlights:
- chunk_markdown Function:
  - Sends a markdown document to HuggingFace's inference API.
  - Parses the response to extract logical chunks delimited by ---.
- API Request:
  - The Client from reqwest sends the POST request to HuggingFace's API with the necessary headers and JSON payload.
- Chunk Parsing:
  - The response content is split by the delimiter --- into logical sections.
- Resilience:
  - The function retries the request up to 3 times with exponential backoff on failure.
Retry Logic with Exponential Backoff
async fn retry_with_backoff<F, Fut, T>(mut f: F) -> Result<T, anyhow::Error>
where
F: FnMut() -> Fut,
Fut: std::future::Future<Output = Result<T, anyhow::Error>>,
{
let mut attempts = 0;
let max_attempts = 3;
let mut delay = Duration::from_millis(100);
loop {
match f().await {
Ok(result) => return Ok(result),
Err(e) if attempts < max_attempts => {
println!("Attempt {} failed: {}. Retrying...", attempts + 1, e);
attempts += 1;
sleep(delay).await;
delay *= 2; // Exponential backoff
}
Err(e) => return Err(e),
}
}
}
- Implements a simple retry mechanism with exponential backoff for resilience.
- If the request still fails after the retries are exhausted, the error is returned to the caller.
Testing the Client
The #[tokio::test] macro is used to write an async test for the chunk_markdown function.
#[tokio::test]
async fn test_chunk() -> Result<(), Box<dyn std::error::Error>> {
dotenv().ok();
let client = reqwest::Client::builder()
.timeout(Duration::from_secs(30))
.build()?;
let api_key = std::env::var("HF_API_KEY").expect("HF_API_KEY must be set");
// can't give you markdown in a markdown article ^^
let markdown = "your_markdown_here";
println!("Markdown: {}", markdown);
let hf_client = HuggingFaceClient::new();
let chunks = hf_client.chunk_markdown(&client, &api_key, markdown).await?;
println!("Number of chunks: {}", chunks.len());
for (i, chunk) in chunks.iter().enumerate() {
println!("Chunk {}: {}", i + 1, chunk);
}
assert!(!chunks.is_empty(), "Expected at least one chunk");
Ok(())
}
Highlights:
- Environment Variable:
  - HF_API_KEY is loaded from the .env file for security.
- Markdown Example:
  - A markdown document is tested to ensure chunking is functional.
- Assertions:
  - The test asserts that at least one chunk is returned.
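If you don't have a .env file yet, it only needs the API key at the project root (the value below is a placeholder, not a real token):
HF_API_KEY=hf_your_token_here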
Implementing the gRPC Embedding Service in Rust
This ChunkEmbedService is the core gRPC service that handles two primary functions:
- Generating embeddings for a list of text chunks.
- Chunking markdown content into logical sections and embedding each section.
Let's break down its key components and functionality.
Service Structure
The ChunkEmbedService leverages two main dependencies:
- FastEmbed: An embedding model that provides high-performance embeddings for text data.
- HuggingFaceClient: A client for interacting with the HuggingFace API to chunk markdown into logical sections.
Key Features
1. Service Struct Definition
The service is encapsulated in the ChunkEmbedService struct, which contains a reference to the FastEmbed model:
pub struct ChunkEmbedService {
embedder: Arc<FastEmbed>,
}
impl ChunkEmbedService {
pub fn new(embedder: Arc<FastEmbed>) -> Self {
ChunkEmbedService { embedder }
}
}
The Arc<FastEmbed> ensures thread-safe sharing of the embedding model across multiple requests.
2. gRPC Method: chunk_embed_message
This method accepts a list of text chunks and generates embeddings for each chunk. The embeddings are then flattened and returned as a single array.
#[tonic::async_trait]
impl ChunkEmbed for ChunkEmbedService {
async fn chunk_embed_message(
&self,
request: Request<ChunkEmbedRequest>,
) -> Result<Response<ChunkEmbedResponse>, Status> {
let chunks = request.into_inner().chunks;
// Generate embeddings with the model
let embeddings_result: Result<Vec<Vec<f64>>, Status> = self
.embedder
.embed_documents(&chunks)
.await
.map_err(|e| Status::internal(format!("Embedding error: {}", e)));
// Handle errors
let embeddings = embeddings_result?;
// Flatten and convert embeddings to f32
let embeddings: Vec<f32> = embeddings
.into_iter()
.flatten()
.map(|e| e as f32)
.collect();
// Construct the response
let response = ChunkEmbedResponse { embeddings };
Ok(Response::new(response))
}
}
Highlights:
- The embed_documents method is called asynchronously to process the text chunks.
- Any errors are handled and mapped to a gRPC Status::internal error.
- The result is flattened and converted to f32 to comply with the ChunkEmbedResponse schema.
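One thing to keep in mind on the client side: because ChunkEmbedResponse returns a single flat array, the caller needs to know the embedding dimension to recover per-chunk vectors. A small sketch (the dimension is an assumption; use whatever your embedding model actually produces):
/// Split the flattened embeddings returned by ChunkEmbedMessage back into per-chunk vectors.
fn split_embeddings(flat: &[f32], dim: usize) -> Vec<Vec<f32>> {
    flat.chunks(dim).map(|c| c.to_vec()).collect()
}

// Example: with a 384-dimensional model and 3 chunks, `flat` holds 3 * 384 floats.
// let per_chunk = split_embeddings(&response.embeddings, 384);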
3. gRPC Method: embed_markdown
This method processes markdown content by:
- Chunking it into logical sections using the HuggingFaceClient.
- Generating embeddings for each chunk.
async fn embed_markdown(
&self,
request: Request<EmbedMarkdownRequest>,
) -> Result<Response<EmbedMarkdownResponse>, Status> {
dotenv().ok();
// Build the HTTP client; map failures to gRPC status errors instead of panicking
let client = reqwest::Client::builder()
.timeout(Duration::from_secs(30))
.build()
.map_err(|e| Status::internal(format!("HTTP client error: {}", e)))?;
let markdown = request.into_inner().markdown;
let api_key = std::env::var("HF_API_KEY")
.map_err(|_| Status::internal("HF_API_KEY must be set"))?;
let hf_client = HuggingFaceClient::new();
// Chunk the markdown via HuggingFace; chunking failures also become gRPC status errors
let chunks = hf_client
.chunk_markdown(&client, &api_key, &markdown)
.await
.map_err(|e| Status::internal(format!("Chunking error: {}", e)))?;
// Generate embeddings with the model
let embeddings_result: Result<Vec<Vec<f64>>, Status> = self
.embedder
.embed_documents(&chunks)
.await
.map_err(|e| Status::internal(format!("Embedding error: {}", e)));
// Create response with embeddings and original chunks
let result: Vec<EmbeddingFromMarkdown> = match embeddings_result {
Ok(embeddings) => embeddings
.iter()
.enumerate()
.map(|(index, embedding)| {
EmbeddingFromMarkdown {
embedding: embedding.iter().map(|&e| e as f32).collect(),
chunk: chunks
.get(index)
.expect("Chunk not found for index")
.as_bytes()
.to_vec(),
}
})
.collect(),
Err(e) => return Err(Status::internal(format!("Embedding error: {}", e))),
};
Ok(Response::new(EmbedMarkdownResponse { embeddings: result }))
}
Highlights:
- The HuggingFaceClient is used to chunk the markdown into logical sections.
- Each chunk is embedded using the FastEmbed model.
- The response contains both the embeddings and the original chunks, making it useful for downstream applications.
Resilient API Design
The service is designed with resilience in mind:
- Error Handling: Any errors during embedding or chunking are caught and returned as gRPC Status errors.
- Timeouts: The reqwest::Client is configured with a timeout to prevent long waits on external APIs.
- Dependency Decoupling: The use of Arc<FastEmbed> and the modular HuggingFaceClient ensures that the service remains flexible and scalable.
Example Workflow
- Chunk Embedding:
  - The client sends a ChunkEmbedRequest containing a list of text chunks.
  - The service responds with flattened embeddings as a ChunkEmbedResponse.
- Markdown Embedding:
  - The client sends an EmbedMarkdownRequest containing markdown content.
  - The service responds with an EmbedMarkdownResponse, which includes embeddings and the corresponding markdown chunks.
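To make this workflow concrete, here is a minimal client sketch using the tonic-generated client (the module layout and the local address are assumptions; adapt them to your project):
pub mod chunk_embed {
    tonic::include_proto!("chunk_embed");
}

use chunk_embed::chunk_embed_client::ChunkEmbedClient;
use chunk_embed::{ChunkEmbedRequest, EmbedMarkdownRequest};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to the locally running server from the next section
    let mut client = ChunkEmbedClient::connect("http://127.0.0.1:50051").await?;

    // Chunk embedding: send pre-chunked text, receive a flat array of floats
    let response = client
        .chunk_embed_message(ChunkEmbedRequest {
            chunks: vec!["Hello world".to_string(), "gRPC in Rust".to_string()],
        })
        .await?;
    println!("received {} floats", response.into_inner().embeddings.len());

    // Markdown embedding: send raw markdown, receive one embedding per chunk
    let response = client
        .embed_markdown(EmbedMarkdownRequest {
            markdown: "# Title\n\nSome content.".to_string(),
        })
        .await?;
    for e in response.into_inner().embeddings {
        println!("chunk: {} bytes, embedding: {} dims", e.chunk.len(), e.embedding.len());
    }
    Ok(())
}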
Conclusion: Building a Robust gRPC Server with main.rs
The main.rs file serves as the entry point for our gRPC server, tying together all the components we've built to create a production-ready embedding service. This file encapsulates the core logic for initializing the server, configuring services, and handling requests efficiently. Let’s break down the key takeaways and best practices demonstrated in this implementation.
Key Components of main.rs
- Service Initialization
The ChunkEmbedService is initialized with a thread-safe Arc<FastEmbed> instance, ensuring that the embedding model can be shared across multiple requests without duplication. This design promotes resource efficiency and scalability.
let fast_embed = Arc::new(FastEmbed::try_new()?);
let embed_service = ChunkEmbedService::new(fast_embed);
- gRPC Reflection Setup
The inclusion of tonic_reflection enables service discovery and runtime inspection, which is invaluable for debugging and maintaining the server in production environments.
let descriptor = tonic_reflection::server::Builder::configure()
.register_encoded_file_descriptor_set(CHUNK_EMBED_DESCRIPTOR)
.build_v1()?;
- Server Configuration
The server is configured to listen on all network interfaces (0.0.0.0) on port 50051, the standard port for gRPC services. This setup ensures the server is accessible from external clients while maintaining simplicity.
let addr = "0.0.0.0:50051".parse()?;
- Asynchronous Runtime
The use of #[tokio::main] ensures that the server runs asynchronously, enabling it to handle multiple concurrent requests efficiently. This is critical for maintaining high performance under load.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
- Error Handling
The Result type and Box<dyn std::error::Error> provide robust error handling, ensuring that any issues during server initialization or operation are gracefully managed and logged.
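Putting those pieces together, a minimal main.rs could look like the sketch below. The module names (service, chunk_embed) and the FastEmbed import are assumptions based on the snippets above; adapt them to your own layout:
// Sketch of main.rs. Assumes the ChunkEmbedService from the previous section lives in service.rs,
// and that FastEmbed is brought in from whichever embedding crate your project uses.
mod service;

use std::sync::Arc;

use service::ChunkEmbedService;
use tonic::transport::Server;
// use <your embedding crate>::FastEmbed; // exact import path depends on the crate providing FastEmbed

pub mod chunk_embed {
    tonic::include_proto!("chunk_embed");
}

// Descriptor set produced by build.rs, used for gRPC reflection.
pub const CHUNK_EMBED_DESCRIPTOR: &[u8] =
    tonic::include_file_descriptor_set!("chunk_embed_descriptor");

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Shared, thread-safe embedding model
    let fast_embed = Arc::new(FastEmbed::try_new()?);
    let embed_service = ChunkEmbedService::new(fast_embed);

    // Reflection service built from the descriptor set
    let descriptor = tonic_reflection::server::Builder::configure()
        .register_encoded_file_descriptor_set(CHUNK_EMBED_DESCRIPTOR)
        .build_v1()?;

    // Listen on all interfaces on the standard gRPC port
    let addr = "0.0.0.0:50051".parse()?;
    println!("ChunkEmbed server listening on {}", addr);

    Server::builder()
        .add_service(chunk_embed::chunk_embed_server::ChunkEmbedServer::new(embed_service))
        .add_service(descriptor)
        .serve(addr)
        .await?;

    Ok(())
}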
Best Practices Demonstrated
- Thread Safety with Arc
By wrapping the FastEmbed instance in an Arc, the server ensures that the embedding model is shared safely across threads, preventing data races and improving performance.
- Modular Design
The separation of concerns is evident in the use of the service module, which encapsulates the gRPC service implementation. This modular approach makes the codebase easier to maintain and extend.
- Production-Ready Configuration
The server is designed with production use in mind, featuring:
  - gRPC reflection for service discovery.
  - Asynchronous request handling for scalability.
  - Clear logging for monitoring and debugging.
- Error Propagation
The use of ? for error propagation ensures that any failures during server setup or operation are immediately handled, preventing silent failures and improving reliability.
Next Steps for Production Deployment
While the current implementation provides a solid foundation, there are additional steps you can take to prepare the server for a production environment:
- Add Logging and Metrics
Integrate a logging framework like tracing or log to capture detailed logs. Additionally, consider adding metrics collection using tools like Prometheus to monitor server performance.
- Implement Security Measures
  - Enable TLS for secure communication.
  - Add authentication and authorization mechanisms to restrict access to the service.
- Configure Health Checks
Implement gRPC health checks to allow load balancers and orchestration tools (e.g., Kubernetes) to monitor the server’s status (a minimal sketch follows this list).
- Optimize Resource Usage
  - Set connection limits and timeouts to prevent resource exhaustion.
  - Use connection pooling for downstream services like the HuggingFace API.
- Automate Deployment
Use containerization (e.g., Docker) and orchestration tools (e.g., Kubernetes) to streamline deployment and scaling.
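For the health-check point above, tonic has a companion crate, tonic-health, that implements the standard gRPC health protocol. A minimal sketch of wiring it in (adding tonic-health to Cargo.toml is assumed; this is not part of the original project):
// Inside main(), after building `embed_service` and `addr` as shown earlier
// (requires the `tonic-health` crate):
let (mut health_reporter, health_service) = tonic_health::server::health_reporter();

// Report the embedding service as SERVING so load balancers and Kubernetes probes see it as healthy.
health_reporter
    .set_serving::<chunk_embed::chunk_embed_server::ChunkEmbedServer<ChunkEmbedService>>()
    .await;

Server::builder()
    .add_service(health_service)
    .add_service(chunk_embed::chunk_embed_server::ChunkEmbedServer::new(embed_service))
    .serve(addr)
    .await?;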
Final Thoughts
The main.rs file is the backbone of our gRPC server, bringing together all the components we’ve built into a cohesive and efficient system. By following best practices in error handling, resource management, and modular design, this implementation provides a robust foundation for building scalable and maintainable gRPC services in Rust.
As you continue to develop and deploy this server, remember to:
- Monitor performance and resource usage.
- Keep dependencies up to date.
- Document API changes and server configurations.
- Regularly test and validate the system under load.
With these considerations in mind, you’re well-equipped to build and deploy high-performance gRPC services that meet the demands of modern applications.
I hope you had as much fun reading this article as I had coding and writing it. I am always happy to discuss coding, Rust, and AI; you can find me on LinkedIn and Twitter:
See ya !