<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cemsina Güzel</title>
    <description>The latest articles on DEV Community by Cemsina Güzel (@cemsina).</description>
    <link>https://dev.to/cemsina</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3966672%2F4bdfa417-2a14-4f58-92a0-b663f36dbd84.jpg</url>
      <title>DEV Community: Cemsina Güzel</title>
      <link>https://dev.to/cemsina</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cemsina"/>
    <language>en</language>
    <item>
      <title>How I Built a Native BGE-Small Embedding Engine That Beats Generic Runtimes in My Benchmarks</title>
      <dc:creator>Cemsina Güzel</dc:creator>
      <pubDate>Wed, 03 Jun 2026 13:49:03 +0000</pubDate>
      <link>https://dev.to/cemsina/how-i-built-a-native-bge-small-embedding-engine-that-beats-generic-runtimes-in-my-benchmarks-1jdl</link>
      <guid>https://dev.to/cemsina/how-i-built-a-native-bge-small-embedding-engine-that-beats-generic-runtimes-in-my-benchmarks-1jdl</guid>
      <description>&lt;p&gt;When building retrieval systems, semantic search engines, and RAG applications, generating embeddings is often the first step in the pipeline.&lt;/p&gt;

&lt;p&gt;Most developers use one of the following approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PyTorch models through SentenceTransformers&lt;/li&gt;
&lt;li&gt;Hugging Face Transformers&lt;/li&gt;
&lt;li&gt;ONNX Runtime based solutions such as FastEmbed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are excellent tools, but they all share one characteristic:&lt;/p&gt;

&lt;p&gt;They run the model through a general-purpose runtime.&lt;/p&gt;

&lt;p&gt;I started wondering:&lt;/p&gt;

&lt;p&gt;What if we stopped trying to support every possible model and optimized for a single one instead?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;Instead of executing a generic computation graph, I built a specialized implementation of BAAI/bge-small-en-v1.5.&lt;/p&gt;

&lt;p&gt;The implementation directly performs the model's forward pass in native C.&lt;/p&gt;

&lt;p&gt;The goal was not flexibility.&lt;/p&gt;

&lt;p&gt;The goal was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lower latency&lt;/li&gt;
&lt;li&gt;lower memory usage&lt;/li&gt;
&lt;li&gt;simpler deployment&lt;/li&gt;
&lt;li&gt;zero runtime dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This project eventually became FastTextEmbed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why BGE-Small?
&lt;/h2&gt;

&lt;p&gt;BGE-small is one of the most widely used embedding models for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RAG systems&lt;/li&gt;
&lt;li&gt;semantic search&lt;/li&gt;
&lt;li&gt;retrieval pipelines&lt;/li&gt;
&lt;li&gt;vector databases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It provides a strong quality-to-size ratio and is often sufficient for production retrieval workloads.&lt;/p&gt;

&lt;p&gt;Because of its popularity, it was a good candidate for specialization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Goals
&lt;/h2&gt;

&lt;p&gt;I intentionally avoided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PyTorch&lt;/li&gt;
&lt;li&gt;ONNX Runtime&lt;/li&gt;
&lt;li&gt;generic graph execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engine focuses on one task:&lt;/p&gt;

&lt;p&gt;Generate BGE-small embeddings as efficiently as possible.&lt;/p&gt;

&lt;p&gt;Current bindings include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;Go&lt;/li&gt;
&lt;li&gt;Rust&lt;/li&gt;
&lt;li&gt;C&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;p&gt;I benchmarked FastTextEmbed against:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FastEmbed&lt;/li&gt;
&lt;li&gt;SentenceTransformers&lt;/li&gt;
&lt;li&gt;Transformers&lt;/li&gt;
&lt;li&gt;Optimum&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran the benchmarks on Apple Silicon, ARM Ubuntu, and AMD EPYC systems.&lt;/p&gt;

&lt;p&gt;In my tests, FastTextEmbed achieved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the highest throughput&lt;/li&gt;
&lt;li&gt;the lowest p50 latency&lt;/li&gt;
&lt;li&gt;the lowest peak memory usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The complete benchmark results are available in the repository.&lt;/p&gt;
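
&lt;p&gt;For readers who want to reproduce this kind of measurement, here is a minimal sketch of how p50 latency and throughput can be computed for any embedding function. It is not the harness used for the results above: &lt;code&gt;benchmark&lt;/code&gt;, &lt;code&gt;embed_fn&lt;/code&gt;, and the &lt;code&gt;warmup&lt;/code&gt; parameter are hypothetical names chosen for illustration, and peak memory is not measured here.&lt;/p&gt;

```python
import statistics
import time


def benchmark(embed_fn, texts, warmup=3):
    """Time embed_fn on each text; return p50 latency (ms) and throughput (texts/s).

    embed_fn is a placeholder for whatever single-text embedding call is under test.
    """
    if not texts:
        raise ValueError("texts must not be empty")

    # Warm-up calls are excluded from the timings so one-time setup cost
    # (model loading, cache warming) does not skew the results.
    for text in texts[:warmup]:
        embed_fn(text)

    latencies = []
    for text in texts:
        start = time.perf_counter()
        embed_fn(text)
        latencies.append(time.perf_counter() - start)

    total = sum(latencies)
    return {
        # The median of the per-call latencies is the p50.
        "p50_ms": statistics.median(latencies) * 1000.0,
        "throughput_per_s": len(texts) / total if total else float("inf"),
        "samples": len(latencies),
    }
```

&lt;p&gt;Running the same function over the same texts for each library keeps the comparison like-for-like. Batch size and thread count are assumptions to pin down before comparing numbers.&lt;/p&gt;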

&lt;h2&gt;
  
  
  Quality Verification
&lt;/h2&gt;

&lt;p&gt;Performance means nothing if the vectors differ from the reference model's output.&lt;/p&gt;

&lt;p&gt;To verify correctness, I compared the engine's outputs against embeddings produced by ONNX Runtime.&lt;/p&gt;

&lt;p&gt;The resulting embeddings achieved cosine similarity greater than 0.9998 in my tests.&lt;/p&gt;

&lt;p&gt;This indicates that the specialized implementation closely reproduces the reference outputs.&lt;/p&gt;
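
&lt;p&gt;The comparison described above can be sketched as follows. This is an illustrative check, not the project's actual test code: the function names are hypothetical, and the two lists of vectors are assumed to come from the specialized engine and the ONNX Runtime reference for the same texts, in the same order.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def worst_case_similarity(candidate_vectors, reference_vectors):
    """Return the lowest per-text similarity across paired embeddings.

    Reporting the minimum rather than the mean makes a single badly
    reproduced embedding visible instead of averaging it away.
    """
    if len(candidate_vectors) != len(reference_vectors):
        raise ValueError("both lists must contain one vector per text")
    return min(
        cosine_similarity(c, r)
        for c, r in zip(candidate_vectors, reference_vectors)
    )
```

&lt;p&gt;A regression test could then compare the value returned by &lt;code&gt;worst_case_similarity&lt;/code&gt; against a threshold such as the 0.9998 reported above.&lt;/p&gt;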

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;Specialization always has tradeoffs.&lt;/p&gt;

&lt;p&gt;FastTextEmbed is not trying to be another embedding framework.&lt;/p&gt;

&lt;p&gt;It currently focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BAAI/bge-small-en-v1.5&lt;/li&gt;
&lt;li&gt;retrieval workloads&lt;/li&gt;
&lt;li&gt;deployment simplicity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need hundreds of interchangeable models, existing frameworks remain a better choice.&lt;/p&gt;

&lt;p&gt;If you need one popular model running with minimal overhead, specialization becomes interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;The biggest lesson from this project is that generic infrastructure has a cost.&lt;/p&gt;

&lt;p&gt;Frameworks provide flexibility, portability, and model support.&lt;/p&gt;

&lt;p&gt;But if your workload revolves around a single model, there can be significant performance gains from specializing the implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;GitHub:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/cemsina/fasttextembed" rel="noopener noreferrer"&gt;https://github.com/cemsina/fasttextembed&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd love feedback on the implementation approach, benchmarks, and potential production use cases.&lt;/p&gt;

</description>
      <category>c</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
