<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Martin</title>
    <description>The latest articles on DEV Community by Martin (@martin_saas).</description>
    <link>https://dev.to/martin_saas</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3692382%2Fcdad79a3-7abd-4dfb-8095-4e2c91391aab.png</url>
      <title>DEV Community: Martin</title>
      <link>https://dev.to/martin_saas</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/martin_saas"/>
    <language>en</language>
    <item>
      <title>I Built a Multilingual Vector Search Engine in Go for $0 (without OpenAI)</title>
      <dc:creator>Martin</dc:creator>
      <pubDate>Sun, 04 Jan 2026 09:43:01 +0000</pubDate>
      <link>https://dev.to/martin_saas/i-built-a-multilingual-vector-search-engine-in-go-for-0-without-openai-2lhj</link>
      <guid>https://dev.to/martin_saas/i-built-a-multilingual-vector-search-engine-in-go-for-0-without-openai-2lhj</guid>
      <description>&lt;p&gt;The standard advice for building semantic search in 2025 is boring: "Just send it to OpenAI."&lt;/p&gt;

&lt;p&gt;You sign up, you get an API key, you send your customer’s private data to &lt;code&gt;text-embedding-3-small&lt;/code&gt;, and you pay a monthly bill. It works, but it feels like cheating. It also adds network latency and a dependency I can't control.&lt;/p&gt;

&lt;p&gt;I am a solo dev building a platform that records and analyzes phone calls. I wanted my users to be able to search their call history not just for keywords like "billing," but for concepts like &lt;em&gt;"customer is frustrated about the price."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;My backend is written in &lt;strong&gt;Go&lt;/strong&gt;. The machine learning ecosystem lives in &lt;strong&gt;Python&lt;/strong&gt;. I have a Kubernetes cluster running on bare metal. I didn't want to pay OpenAI rent for something my CPU could do for free.&lt;/p&gt;

&lt;p&gt;Here is how I bridged the gap, optimized the build with &lt;code&gt;uv&lt;/code&gt;, and built a cross-lingual search engine using nothing but Redis and a 1GB Docker container.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Two Language" Problem
&lt;/h2&gt;

&lt;p&gt;I write Go. I like Go. It’s fast, typed, and deploys as a single binary. But Go’s machine learning story is sparse. There are bindings for PyTorch and TensorFlow, but they are heavy, hard to compile, and painful to manage in production.&lt;/p&gt;

&lt;p&gt;Python is where the models are.&lt;/p&gt;

&lt;p&gt;So I needed a way to get strings &lt;em&gt;out&lt;/em&gt; of Go, into a Python process to run the math, and get a vector (a list of 384 floats) back.&lt;/p&gt;

&lt;p&gt;I could have used gRPC. I could have used a REST API. I could have used NATS (which I use for everything else). But for a high-throughput, low-latency, strictly internal loop, I chose the dumbest, fastest thing available: &lt;strong&gt;Redis Lists.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Poor Man's IPC"
&lt;/h3&gt;

&lt;p&gt;The architecture is dead simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Go&lt;/strong&gt; pushes a JSON object to a Redis List (&lt;code&gt;RPUSH&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Go&lt;/strong&gt; sits and waits (&lt;code&gt;BLPOP&lt;/code&gt;) on a specific response key.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Python&lt;/strong&gt; acts as a worker, pops the item, runs the model, and pushes the result back.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is the Go side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Generate a unique ID for this request so we can find the answer later&lt;/span&gt;
&lt;span class="n"&gt;requestID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;TaskItem&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;RequestID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;requestID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Sentence&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;       &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Push to the "To-Do" list&lt;/span&gt;
&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RPush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"embeddings:sentence_requests"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;taskJSON&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c"&gt;// Wait for the specific answer key (blocking pop)&lt;/span&gt;
&lt;span class="c"&gt;// It's like a function call, but over TCP&lt;/span&gt;
&lt;span class="n"&gt;resultQueueKey&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="s"&gt;"embeddings:results:"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;requestID&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BLPop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Second&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resultQueueKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It turns Redis into a synchronous function call interface. It’s surprisingly robust.&lt;/p&gt;
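The Python half of that loop is symmetric. Here is a minimal sketch of it; `embed()` is a stand-in for the real model call, the JSON field names are assumptions, and `r` is a redis-py client:

```python
# worker.py, sketched: the Python half of the Redis request/response loop.
import json

REQUEST_QUEUE = "embeddings:sentence_requests"
RESULT_PREFIX = "embeddings:results:"
RESULT_TTL_SECONDS = 30

def embed(sentence):
    # Stand-in for the real model call (e.g. SentenceTransformer.encode).
    return [0.0] * 384

def handle(raw_task):
    # Pure part of the worker: decode the task, embed, encode the reply.
    task = json.loads(raw_task)
    reply = {"request_id": task["request_id"], "vector": embed(task["sentence"])}
    return RESULT_PREFIX + task["request_id"], json.dumps(reply)

def serve_forever(r):
    while True:
        _, raw = r.blpop(REQUEST_QUEUE)    # wait for Go's RPUSH
        key, reply = handle(raw)
        r.rpush(key, reply)                # answers the BLPOP on the Go side
        r.expire(key, RESULT_TTL_SECONDS)  # abandoned replies expire on their own

# serve_forever(redis.Redis(decode_responses=True))
```

The TTL on the result key matters: if the Go side times out and walks away, the reply would otherwise sit in Redis forever.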

&lt;h2&gt;
  
  
  The Python Worker (Optimized with &lt;code&gt;uv&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Python package management is usually a nightmare. I decided to use &lt;code&gt;uv&lt;/code&gt;, the new Rust-based package manager. It is absurdly fast.&lt;/p&gt;

&lt;p&gt;My Dockerfile uses a multi-stage build to keep the final image clean. I am not downloading the model at runtime (which is flaky and slow). I baked the model files directly into the image so the container can start without internet access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Dockerfile&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.13-slim-bookworm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=ghcr.io/astral-sh/uv:0.4.9 /uv /bin/uv&lt;/span&gt;

&lt;span class="c"&gt;# "uv sync" is the new "pip install"&lt;/span&gt;
&lt;span class="c"&gt;# --frozen ensures we stick to the lockfile exactly&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;uv &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--frozen&lt;/span&gt; &lt;span class="nt"&gt;--no-install-project&lt;/span&gt; &lt;span class="nt"&gt;--no-dev&lt;/span&gt;

&lt;span class="c"&gt;# Copy the local model files so we don't download them on startup&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; models /app/models&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
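For reference, the kind of `pyproject.toml` that `uv sync --frozen` resolves against might look like this (illustrative only; the post doesn't show the actual manifest):

```toml
# Illustrative manifest; uv.lock is generated from this with `uv lock`.
[project]
name = "embedding-worker"
version = "0.1.0"
requires-python = ">=3.13"
dependencies = [
    "redis>=5.0",
    "sentence-transformers>=3.0",
]
```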



&lt;h3&gt;
  
  
  The "Batching" Trick via Lua
&lt;/h3&gt;

&lt;p&gt;The naive way to write the Python worker is to loop &lt;code&gt;BLPOP&lt;/code&gt;, process one sentence, and repeat. But transformer embedding models love batches: processing 8 sentences at once is much faster than processing 1 sentence 8 times.&lt;/p&gt;

&lt;p&gt;Redis &lt;code&gt;BLPOP&lt;/code&gt; only grabs one item. So I wrote a Lua script to grab a "buffer" of items atomically if they exist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- fetch_batch.lua&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'LRANGE'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'LTRIM'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;worker.py&lt;/code&gt;, I try to grab a batch. If the queue is empty, I block. If the queue has items, I use the Lua script to drain up to 8 items at once. This keeps the CPU fed constantly without hammering Redis.&lt;/p&gt;
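That wake-then-drain loop looks roughly like this (a sketch, with illustrative names; `r` is a redis-py client and `fetch_batch` is the Lua script above, registered once at startup):

```python
# Batched fetch: BLPOP one task to wake up, then atomically drain
# whatever else is already queued so the model sees full batches.
import json

BATCH = 8
QUEUE = "embeddings:sentence_requests"

# Same script as fetch_batch.lua above.
DRAIN_LUA = """
local items = redis.call('LRANGE', KEYS[1], 0, ARGV[1] - 1)
if #items > 0 then
    redis.call('LTRIM', KEYS[1], #items, -1)
end
return items
"""

def parse_batch(raws):
    return [json.loads(raw) for raw in raws]

def next_batch(r, fetch_batch):
    # fetch_batch = r.register_script(DRAIN_LUA), done once at startup.
    _, first = r.blpop(QUEUE)                           # block until work exists
    rest = fetch_batch(keys=[QUEUE], args=[BATCH - 1])  # drain up to 7 more
    return parse_batch([first] + list(rest))
```

Blocking on the first item and draining the rest means the worker idles cheaply when the queue is empty and batches aggressively when it's busy.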

&lt;h2&gt;
  
  
  The Model: &lt;code&gt;multilingual-e5-small&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;I chose &lt;code&gt;intfloat/multilingual-e5-small&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Dimensions:&lt;/strong&gt; 384.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Disk:&lt;/strong&gt; ~500MB.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Speed:&lt;/strong&gt; fast enough to run on a standard CPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The "multilingual" part is where the magic happens. The model maps text to a vector space based on &lt;em&gt;meaning&lt;/em&gt;, not language.&lt;/p&gt;

&lt;p&gt;I tested this with a phone call recorded in &lt;strong&gt;English&lt;/strong&gt;. The customer was asking for a refund. I went to my search bar and typed in &lt;strong&gt;Spanish&lt;/strong&gt;: &lt;em&gt;"Quiero mi dinero"&lt;/em&gt; (I want my money).&lt;/p&gt;

&lt;p&gt;Qdrant (my vector database) returned the English call as the #1 result with a score of &lt;code&gt;0.88&lt;/code&gt;. I wrote zero translation code. The math just works.&lt;/p&gt;
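One practical detail with the e5 family: per the model card, it is trained with input prefixes, so stored documents should be embedded as `"passage: ..."` and searches as `"query: ..."`. Skipping the prefix quietly degrades retrieval quality. The helpers below are mine, not the article's:

```python
# e5 models expect prefixed inputs; these helpers make that explicit.

def as_passage(text):
    return "passage: " + text

def as_query(text):
    return "query: " + text

# With sentence-transformers, usage would look roughly like:
#   model = SentenceTransformer("intfloat/multilingual-e5-small")
#   doc = model.encode(as_passage(call_summary), normalize_embeddings=True)
#   q = model.encode(as_query("Quiero mi dinero"), normalize_embeddings=True)
#   score = float(doc @ q)  # cosine similarity on normalized vectors
```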

&lt;h2&gt;
  
  
  Doing Data Science in Go
&lt;/h2&gt;

&lt;p&gt;I wanted a dashboard visualization showing "Topic Clusters"—bubbles representing what people are talking about (Shipping, Billing, Support).&lt;/p&gt;

&lt;p&gt;Usually, you'd send the data back to Python to run &lt;code&gt;scikit-learn&lt;/code&gt;. But I already had the vectors in Go memory from Qdrant. Why serialize them again?&lt;/p&gt;

&lt;p&gt;I decided to write &lt;strong&gt;K-Means Clustering&lt;/strong&gt; in Go. It turns out, K-Means is just a few &lt;code&gt;for&lt;/code&gt; loops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// pkg/clustering/kmeans.go&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;KMeans&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="p"&gt;[][]&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxIterations&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Cluster&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// 1. Pick random centroids&lt;/span&gt;
    &lt;span class="c"&gt;// 2. Assign every point to closest centroid (Cosine Distance)&lt;/span&gt;
    &lt;span class="c"&gt;// 3. Move centroid to the average of its points&lt;/span&gt;
    &lt;span class="c"&gt;// 4. Repeat until converged&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Is it as optimized as scikit-learn? No. Does it cluster 1,000 call summaries in under 5 milliseconds? Yes.&lt;/p&gt;
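Those four commented steps really are the whole algorithm. Here is a runnable sketch of the same procedure (in plain Python for brevity; the production version is the Go code above). On unit-normalized vectors, maximizing the dot product is equivalent to minimizing cosine distance:

```python
# Minimal K-Means mirroring the four steps in the Go outline.
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kmeans(vectors, k, max_iterations=20):
    # 1. Pick random centroids from the data.
    centroids = random.sample(vectors, k)
    assignments = [0] * len(vectors)
    for _ in range(max_iterations):
        # 2. Assign every point to the closest centroid
        #    (highest dot product on normalized vectors).
        new_assignments = [
            max(range(k), key=lambda c, v=v: dot(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # 4. Converged: no point changed cluster.
        assignments = new_assignments
        # 3. Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                dim = len(members[0])
                centroids[c] = [
                    sum(v[i] for v in members) / len(members) for i in range(dim)
                ]
    return assignments
```

A real implementation would re-normalize the moved centroids and handle empty clusters more carefully, but this is the skeleton.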

&lt;p&gt;I added a "safe threshold" estimator to dynamically guess how aggressive the clustering should be based on the data variance. Now my dashboard generates dynamic topic bubbles on the fly, entirely within the API binary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;This entire pipeline—embedding generation, vector storage, and clustering—runs on the same Hetzner server as my database.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;OpenAI Cost:&lt;/strong&gt; $0.00.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Network Latency:&lt;/strong&gt; &amp;lt; 2ms (localhost).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Privacy:&lt;/strong&gt; 100% Local.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are often told that AI features require massive infrastructure and expensive APIs. Sometimes, all you need is a Python script, a Redis list, and the confidence to ignore the hype.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I wrote this up with slightly better syntax highlighting on my engineering blog for &lt;a href="https://audiotext.live/blog/posts/multilingual-vector-search-go-python-redis/" rel="noopener noreferrer"&gt;AudioText Live&lt;/a&gt;, where I'm documenting the process of building this stack.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>go</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
