Cemsina Güzel

Posted on Jun 3

How I Built a Native BGE-Small Embedding Engine That Beats Generic Runtimes in My Benchmarks

#c #machinelearning #nlp #performance

When building retrieval systems, semantic search engines, and RAG applications, generating embeddings is often the first step in the pipeline.

Most developers use one of the following approaches:

PyTorch models through SentenceTransformers
Hugging Face Transformers
ONNX Runtime based solutions such as FastEmbed

These are excellent tools, but they share one characteristic:

They run the model through a general-purpose runtime built to execute many different architectures.

I started wondering:

What if we stop trying to support every possible model and instead optimize for a single model?

The Idea

Instead of executing a generic computation graph, I built a specialized implementation of BAAI/bge-small-en-v1.5.

The implementation directly performs the model's forward pass in native C.
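
To make the contrast concrete, here is a toy sketch in Python. It is not FastTextEmbed's actual C code. A generic executor walks a list of op descriptors and looks each op up at run time, while the specialized version hard-codes the same sequence for one fixed shape. The tiny layer (projection, bias, L2 normalization), the weights, and every function name here are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy weights: a 2x3 projection and a bias, not real model weights.
W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.5]]
B = [0.1, -0.2]


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Generic path: the "model" is data, and every op is dispatched by name.
OPS = {
    "matvec": lambda x, w: matvec(w, x),
    "add": lambda x, b: add(x, b),
    "l2_normalize": lambda x, _unused: l2_normalize(x),
}
GRAPH = [("matvec", W), ("add", B), ("l2_normalize", None)]


def run_graph(graph, x):
    for op_name, const in graph:
        x = OPS[op_name](x, const)
    return x


# Specialized path: op order and shapes (3 inputs, 2 outputs) are fixed
# when the code is written, so there is no graph to walk and no lookup.
def forward_specialized(x):
    y0 = W[0][0] * x[0] + W[0][1] * x[1] + W[0][2] * x[2] + B[0]
    y1 = W[1][0] * x[0] + W[1][1] * x[1] + W[1][2] * x[2] + B[1]
    norm = math.sqrt(y0 * y0 + y1 * y1)
    return [y0 / norm, y1 / norm]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    print(run_graph(GRAPH, x))
    print(forward_specialized(x))
```

In Python the two paths cost about the same, so the point is structural. In a compiled language, fixing the shapes and op order ahead of time can remove dispatch overhead and gives the compiler a chance to unroll and vectorize the loops.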

The goal was not flexibility.

The goal was:

lower latency
lower memory usage
simpler deployment
zero runtime dependencies

This project eventually became FastTextEmbed.

Why BGE-Small?

BGE-small is one of the most widely used embedding models for:

RAG systems
semantic search
retrieval pipelines
vector databases

It provides a strong quality-to-size ratio and is often sufficient for production retrieval workloads.

Because of its popularity, it was a good candidate for specialization.

Design Goals

I intentionally avoided:

PyTorch
ONNX Runtime
generic graph execution

The engine focuses on one task:

Generate BGE-small embeddings as efficiently as possible.

Current bindings include:

Python
Node.js
Go
Rust
C

Benchmark Results

I benchmarked FastTextEmbed against:

FastEmbed
SentenceTransformers
Transformers
Optimum

I ran the benchmarks on Apple Silicon, ARM Ubuntu, and AMD EPYC systems.

In my tests, FastTextEmbed achieved:

the highest throughput
the lowest p50 latency
the lowest peak memory usage

The complete benchmark results are available in the repository.
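
For readers who want to reproduce this kind of measurement, here is a minimal sketch of how the three metrics can be collected. It is not the benchmark harness from the repository. The embed_stub function is a hypothetical stand-in for a real embedding call, and the benchmark helper and its result keys are names I made up for this example.

```python
import statistics
import time
import tracemalloc


def embed_stub(text):
    """Hypothetical stand-in for a real embedding call; returns a dummy vector."""
    return [float(len(text))] * 384  # 384 is the bge-small output dimension


def benchmark(embed, texts, warmup=3):
    # Warm-up calls so one-time setup cost does not skew the timings.
    for text in texts[:warmup]:
        embed(text)

    # Timing pass: per-call latency plus total wall-clock time.
    latencies = []
    start = time.perf_counter()
    for text in texts:
        t0 = time.perf_counter()
        embed(text)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    # Memory pass, kept separate because tracing slows execution down.
    # tracemalloc only sees Python-level allocations, not native heap use.
    tracemalloc.start()
    for text in texts:
        embed(text)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "p50_latency_ms": statistics.median(latencies) * 1000.0,
        "throughput_per_s": len(texts) / elapsed,
        "peak_python_bytes": peak_bytes,
    }


if __name__ == "__main__":
    texts = ["example sentence number %d" % i for i in range(200)]
    print(benchmark(embed_stub, texts))
```

For a native engine, peak memory is better read from the process itself (for example, resident set size), since tracemalloc cannot see allocations made inside C code.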

Quality Verification

Performance means nothing if the vectors differ from what the reference model produces.

To verify correctness, I compared FastTextEmbed's outputs against embeddings produced by ONNX Runtime.

The resulting embeddings achieved cosine similarity greater than 0.9998 in my tests.

This indicates that the specialized implementation closely reproduces the reference outputs.
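
A check like this can be sketched in a few lines of Python. The vectors below are tiny made-up stand-ins for real 384-dimensional embeddings, and the helper names are hypothetical. The post does not say whether the reported figure is a minimum or an average, so this sketch uses the minimum over all pairs as the stricter check.

```python
import math


def cosine_similarity(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def min_similarity(candidates, references):
    """Worst-case cosine similarity over paired vectors."""
    return min(cosine_similarity(c, r) for c, r in zip(candidates, references))


# Hypothetical tiny vectors standing in for real embeddings.
reference = [[0.10, 0.20, 0.30], [0.50, -0.40, 0.10]]
candidate = [[0.1001, 0.1999, 0.3001], [0.5002, -0.3999, 0.1001]]

THRESHOLD = 0.9998  # the figure reported above
worst = min_similarity(candidate, reference)
print("worst-case similarity above threshold:", worst > THRESHOLD)
```

Small deviations from 1.0 are expected between two correct implementations, because floating-point operations performed in a different order round differently.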

Tradeoffs

Specialization always has tradeoffs.

FastTextEmbed does not attempt to become another embedding framework.

It currently focuses on:

BAAI/bge-small-en-v1.5
retrieval workloads
deployment simplicity

If you need hundreds of interchangeable models, existing frameworks remain a better choice.

If you need one popular model running with minimal overhead, specialization becomes interesting.

What I Learned

The biggest lesson from this project is that generic infrastructure has a cost.

Frameworks provide flexibility, portability, and model support.

But if your workload revolves around a single model, there can be significant performance gains from specializing the implementation.

Repository

GitHub:

https://github.com/cemsina/fasttextembed

I'd love feedback on the implementation approach, benchmarks, and potential production use cases.
