DEV Community

Cemsina Güzel

How I Built a Native BGE-Small Embedding Engine That Beats Generic Runtimes in My Benchmarks

When building retrieval systems, semantic search engines, and RAG applications, generating embeddings is often the first step in the pipeline.

Most developers use one of the following approaches:

  • PyTorch models through SentenceTransformers
  • Hugging Face Transformers
  • ONNX Runtime based solutions such as FastEmbed

These are excellent tools, but they all share one characteristic:

They run the model through a general-purpose runtime designed to execute many different architectures.

I started wondering:

What if we stop trying to support every possible model and instead optimize for a single model?

The Idea

Instead of executing a generic computation graph, I built a specialized implementation of BAAI/bge-small-en-v1.5.

The implementation directly performs the model's forward pass in native C.
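To make the distinction concrete, here is a toy sketch in Python. It is not FastTextEmbed code and not the BGE architecture: the tiny "model", the `OPS` table, and the function names `run_generic` and `run_specialized` are all made up for illustration. The generic version walks a graph and dispatches each operation by name, while the specialized version hard-codes the same computation as straight-line code.

```python
import math

# Toy parameters for a made-up elementwise "model" (illustration only).
W = [0.5, -1.0, 2.0]
B = [0.1, 0.2, -0.3]

# --- Generic path: a tiny graph interpreter with per-op dispatch ---
def op_mul(x, p):
    return [a * b for a, b in zip(x, p)]

def op_add(x, p):
    return [a + b for a, b in zip(x, p)]

def op_relu(x, _):
    return [max(0.0, a) for a in x]

def op_l2norm(x, _):
    n = math.sqrt(sum(a * a for a in x)) or 1.0
    return [a / n for a in x]

OPS = {"mul": op_mul, "add": op_add, "relu": op_relu, "l2norm": op_l2norm}
GRAPH = [("mul", W), ("add", B), ("relu", None), ("l2norm", None)]

def run_generic(x):
    # Every step looks up an op and allocates an intermediate list.
    for name, param in GRAPH:
        x = OPS[name](x, param)
    return x

# --- Specialized path: the same math, fused and hard-coded ---
def run_specialized(x):
    h = [max(0.0, a * w + b) for a, w, b in zip(x, W, B)]
    n = math.sqrt(sum(v * v for v in h)) or 1.0
    return [v / n for v in h]
```

Both functions return the same vector for the same input. The specialized one simply has no dispatch table, no graph to traverse, and fewer intermediate buffers, which is the kind of overhead a single-model implementation can remove.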

The goal was not flexibility.

The goal was:

  • lower latency
  • lower memory usage
  • simpler deployment
  • zero runtime dependencies

This project eventually became FastTextEmbed.

Why BGE-Small?

BGE-small is one of the most widely used embedding models for:

  • RAG systems
  • semantic search
  • retrieval pipelines
  • vector databases

It provides a strong quality-to-size ratio and is often sufficient for production retrieval workloads.

Because of its popularity, it was a good candidate for specialization.

Design Goals

I intentionally avoided:

  • PyTorch
  • ONNX Runtime
  • generic graph execution

The engine focuses on one task:

Generate BGE-small embeddings as efficiently as possible.

Current bindings include:

  • Python
  • Node.js
  • Go
  • Rust
  • C

Benchmark Results

I benchmarked FastTextEmbed against:

  • FastEmbed
  • SentenceTransformers
  • Transformers
  • Optimum

I ran the benchmarks on Apple Silicon, ARM Ubuntu, and AMD EPYC systems.

In my tests, FastTextEmbed achieved:

  • the highest throughput
  • the lowest p50 latency
  • the lowest peak memory usage

The complete benchmark results are available in the repository.
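For readers who want to run a similar comparison, here is a minimal sketch of how p50 latency and throughput can be measured. It is not the harness from the repository: the `benchmark` function, its `warmup` and `runs` parameters, and the `embed` callable are assumptions made for illustration. Peak memory is not covered here.

```python
import statistics
import time

def benchmark(embed, texts, warmup=3, runs=20):
    """Time repeated calls to embed(texts).

    `embed` is any callable that takes a list of strings and returns
    one vector per string (hypothetical interface for this sketch).
    """
    # Warm-up calls are excluded so one-time setup cost does not skew results.
    for _ in range(warmup):
        embed(texts)

    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        embed(texts)
        latencies.append(time.perf_counter() - start)

    total = max(sum(latencies), 1e-12)  # guard against a zero-length timing
    return {
        "p50_ms": statistics.median(latencies) * 1000.0,
        "throughput_texts_per_s": len(texts) * runs / total,
    }
```

To compare libraries, wrap each one in a function with the same signature and pass the same `texts` to every run, ideally on an otherwise idle machine.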

Quality Verification

Performance means nothing if the vectors differ from what the reference model produces.

To verify correctness, I compared outputs against ONNX Runtime.

The resulting embeddings achieved cosine similarity greater than 0.9998 in my tests.

This indicates that the specialized implementation closely reproduces the reference outputs.
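A check of this kind can be sketched in a few lines of Python. This is an illustration under assumptions, not the verification script from the repository: `cosine_similarity` and `min_cosine` are hypothetical helper names, and the two lists of vectors are assumed to have been produced beforehand for the same input texts, one by the specialized engine and one by the reference runtime.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def min_cosine(candidates, references):
    """Worst-case similarity across paired embeddings of the same texts."""
    if len(candidates) != len(references):
        raise ValueError("need one reference vector per candidate vector")
    return min(cosine_similarity(c, r) for c, r in zip(candidates, references))
```

Reporting the minimum rather than the mean is a deliberate choice in this sketch: a single badly mismatched text would be hidden by an average but shows up immediately in the worst case.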

Tradeoffs

Specialization always has tradeoffs.

FastTextEmbed does not attempt to become another embedding framework.

It currently focuses on:

  • BAAI/bge-small-en-v1.5
  • retrieval workloads
  • deployment simplicity

If you need hundreds of interchangeable models, existing frameworks remain a better choice.

If you need one popular model running with minimal overhead, specialization becomes interesting.

What I Learned

The biggest lesson from this project is that generic infrastructure has a cost.

Frameworks provide flexibility, portability, and model support.

But if your workload revolves around a single model, there can be significant performance gains from specializing the implementation.

Repository

GitHub:

https://github.com/cemsina/fasttextembed

I'd love feedback on the implementation approach, benchmarks, and potential production use cases.
