When building retrieval systems, semantic search engines, and RAG applications, generating embeddings is often the first step in the pipeline.
Most developers use one of the following approaches:
- PyTorch models through SentenceTransformers
- Hugging Face Transformers
- ONNX Runtime based solutions such as FastEmbed
These are excellent tools, but they all share one characteristic:
They run the model on a general-purpose runtime designed to support many different architectures.
I started wondering:
What if we stop trying to support every possible model and instead optimize for a single model?
The Idea
Instead of executing a generic computation graph, I built a specialized implementation of BAAI/bge-small-en-v1.5.
The implementation directly performs the model's forward pass in native C.
The goal was not flexibility.
The goal was:
- lower latency
- lower memory usage
- simpler deployment
- zero runtime dependencies
This project eventually became FastTextEmbed.
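To make the contrast between generic graph execution and a specialized forward pass concrete, here is a toy sketch in Python. It is not FastTextEmbed code and does not reflect how the C engine is written; the op names, `run_graph`, and `run_specialized` are all made up for illustration. The generic path looks up each op by name and allocates a temporary per node, while the specialized path hardcodes the same three steps in one function.

```python
import math

# Toy illustration only: hypothetical ops, not the real model or engine.

def _l2_normalize(x, _param):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

# Generic path: ops are looked up by name at run time.
OPS = {
    "add": lambda x, p: [a + b for a, b in zip(x, p)],
    "scale": lambda x, p: [a * p for a in x],
    "l2_normalize": _l2_normalize,
}

def run_graph(graph, x):
    """Interpret a list of (op_name, param) nodes one at a time."""
    for op_name, param in graph:
        x = OPS[op_name](x, param)  # dispatch plus one temporary list per node
    return x

# Specialized path: the same three steps, hardcoded and fused.
def run_specialized(x, bias, scale):
    """Apply bias, scale, and L2 normalization with no dispatch step."""
    y = [(a + b) * scale for a, b in zip(x, bias)]
    norm = math.sqrt(sum(v * v for v in y))
    return [v / norm for v in y]

bias = [0.5, -0.5, 1.0]
graph = [("add", bias), ("scale", 2.0), ("l2_normalize", None)]
x = [1.0, 2.0, 3.0]

generic = run_graph(graph, x)
special = run_specialized(x, bias, 2.0)
print(generic)
print(special)
```

Both paths compute the same vector. The specialized version simply knows the sequence of operations ahead of time, which is the kind of knowledge a single-model engine can exploit.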
Why BGE-Small?
BGE-small is one of the most widely used embedding models for:
- RAG systems
- semantic search
- retrieval pipelines
- vector databases
It provides a strong quality-to-size ratio and is often sufficient for production retrieval workloads.
Because of its popularity, it was a good candidate for specialization.
Design Goals
I intentionally avoided:
- PyTorch
- ONNX Runtime
- generic graph execution
The engine focuses on one task:
Generate BGE-small embeddings as efficiently as possible.
Current bindings include:
- Python
- Node.js
- Go
- Rust
- C
Benchmark Results
I benchmarked FastTextEmbed against:
- FastEmbed
- SentenceTransformers
- Transformers
- Optimum
I ran the comparison on Apple Silicon, ARM Ubuntu, and AMD EPYC systems.
In my tests, FastTextEmbed achieved:
- the highest throughput
- the lowest p50 latency
- the lowest peak memory usage
The complete benchmark results are available in the repository.
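For readers who want to run a similar comparison, here is a minimal sketch of how p50 latency and throughput can be measured. It is not the benchmark script from the repository. `benchmark` and `fake_embed` are hypothetical names, and the placeholder embedder only returns dummy 384-dimensional vectors (the output size of bge-small) so the example runs on its own.

```python
import statistics
import time

def benchmark(embed, texts, warmup=3, runs=20):
    """Time `embed(texts)` and report median (p50) latency and throughput."""
    for _ in range(warmup):
        embed(texts)  # warm caches before measuring
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        embed(texts)
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    throughput = len(texts) * runs / total if total > 0 else float("inf")
    return {
        "p50_ms": statistics.median(latencies) * 1000.0,
        "texts_per_sec": throughput,
    }

# Placeholder embedder (assumption): swap in the call from a real binding.
def fake_embed(texts):
    return [[float(len(t))] * 384 for t in texts]

stats = benchmark(fake_embed, ["hello world", "semantic search"])
print(stats)
```

Peak memory needs a separate, OS-level measurement such as resident set size, because Python's `tracemalloc` only tracks allocations made through Python's own allocator and would miss memory used by a native engine.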
Quality Verification
Performance means nothing if the vectors differ from what the reference model produces.
To verify correctness, I compared FastTextEmbed's outputs against the same model running in ONNX Runtime.
The resulting embeddings achieved cosine similarity greater than 0.9998 in my tests.
This indicates that the specialized implementation closely reproduces the reference outputs.
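The check described above can be sketched as follows. This is an illustrative version, not the verification code from the repository: `cosine_similarity` and `min_similarity` are hypothetical helpers, and the short vectors are made-up stand-ins for real embeddings from the two engines.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def min_similarity(candidate_vectors, reference_vectors):
    """Worst-case similarity across paired embeddings of the same texts."""
    return min(
        cosine_similarity(c, r)
        for c, r in zip(candidate_vectors, reference_vectors)
    )

# Made-up vectors standing in for reference and specialized-engine outputs.
reference = [[0.10, 0.20, 0.30], [0.50, -0.40, 0.10]]
candidate = [[0.1001, 0.1999, 0.3002], [0.4998, -0.4001, 0.1001]]

worst = min_similarity(candidate, reference)
print(f"minimum cosine similarity: {worst:.6f}")
```

Reporting the minimum rather than the mean is a deliberate choice here: a single badly reproduced input would be hidden by an average but shows up immediately in the worst case.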
Tradeoffs
Specialization always has tradeoffs.
FastTextEmbed does not attempt to become another embedding framework.
It currently focuses on:
- BAAI/bge-small-en-v1.5
- retrieval workloads
- deployment simplicity
If you need hundreds of interchangeable models, existing frameworks remain a better choice.
If you need one popular model running with minimal overhead, specialization becomes interesting.
What I Learned
The biggest lesson from this project is that generic infrastructure has a cost.
Frameworks provide flexibility, portability, and model support.
But if your workload revolves around a single model, there can be significant performance gains from specializing the implementation.
Repository
GitHub:
https://github.com/cemsina/fasttextembed
I'd love feedback on the implementation approach, benchmarks, and potential production use cases.