Martin

Posted on • Originally published at audiotext.live

I Built a Multilingual Vector Search Engine in Go for $0 (without OpenAI)

The standard advice for building semantic search in 2025 is boring: "Just send it to OpenAI."

You sign up, you get an API key, you send your customer’s private data to text-embedding-3-small, and you pay a monthly bill. It works, but it feels like cheating. It also adds network latency and a dependency I can't control.

I am a solo dev building a platform that records and analyzes phone calls. I wanted my users to be able to search their call history not just for keywords like "billing," but for concepts like "customer is frustrated about the price."

My backend is written in Go. The machine learning ecosystem lives in Python. I have a Kubernetes cluster running on bare metal. I didn't want to pay OpenAI rent for something my CPU could do for free.

Here is how I bridged the gap, optimized the build with uv, and built a cross-lingual search engine using nothing but Redis and a 1GB Docker container.

The "Two Language" Problem

I write Go. I like Go. It’s fast, typed, and deploys as a single binary. But Go’s machine learning story is sparse. There are bindings for PyTorch and TensorFlow, but they are heavy, hard to compile, and painful to manage in production.

Python is where the models are.

So I needed a way to get strings out of Go, into a Python process to run the math, and get a vector (a list of 384 floats) back.

I could have used gRPC. I could have used a REST API. I could have used NATS (which I use for everything else). But for a high-throughput, low-latency, strictly internal loop, I chose the dumbest, fastest thing available: Redis Lists.

The "Poor Man's IPC"

The architecture is dead simple:

  1. Go pushes a JSON object to a Redis List (RPUSH).
  2. Go sits and waits (BLPOP) on a specific response key.
  3. Python acts as a worker, pops the item, runs the model, and pushes the result back.

Here is the Go side:

// Generate a unique ID for this request so we can find the answer later
requestID := uuid.New().String()
task := TaskItem{
    RequestID: requestID,
    Sentence:  sentence,
}

// Serialize the task; this JSON is what the Python worker will decode
taskJSON, err := json.Marshal(task)

// Push to the "To-Do" list
err = c.redis.RPush(ctx, "embeddings:sentence_requests", taskJSON).Err()

// Wait for the specific answer key (blocking pop)
// It's like a function call, but over TCP
resultQueueKey := "embeddings:results:" + requestID
result, err := c.redis.BLPop(ctx, 1*time.Second, resultQueueKey).Result()

It turns Redis into a synchronous function call interface. It’s surprisingly robust.
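
For reference, the payloads are tiny. Here is a minimal sketch of what the two shapes might look like; the exact struct names, field names, and JSON tags in my codebase differ, so treat these as illustrations:

// Illustrative shapes only, not the exact definitions from the codebase.
type TaskItem struct {
    RequestID string `json:"request_id"`
    Sentence  string `json:"sentence"`
}

// What the Python worker pushes onto "embeddings:results:<request_id>".
type EmbeddingResult struct {
    RequestID string    `json:"request_id"`
    Vector    []float32 `json:"vector"` // 384 floats for multilingual-e5-small
}

On the Go side, BLPop returns a two-element slice (key, value), so result[1] is the JSON you unmarshal into the result struct.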

The Python Worker (Optimized with uv)

Python package management is usually a nightmare. I decided to use uv, the new Rust-based package manager. It is absurdly fast.

My Dockerfile uses a multi-stage build to keep the final image clean. Instead of downloading the model at runtime (which is flaky and slow), I bake the model files directly into the image so the container can start without internet access.

# Dockerfile
FROM python:3.13-slim-bookworm AS builder
COPY --from=ghcr.io/astral-sh/uv:0.4.9 /uv /bin/uv

WORKDIR /app

# uv needs the project metadata and lockfile before it can sync
COPY pyproject.toml uv.lock ./

# "uv sync" is the new "pip install"
# --frozen ensures we stick to the lockfile exactly
RUN uv sync --frozen --no-install-project --no-dev

# Copy the local model files so we don't download them on startup
COPY models /app/models

The "Batching" Trick via Lua

The naive way to write the Python worker is to loop BLPOP, process one sentence, and repeat. But transformer-based embedding models love batches: processing 8 sentences at once is much faster than processing 1 sentence 8 times.

Redis BLPOP only grabs one item. So I wrote a Lua script to grab a "buffer" of items atomically if they exist.

-- fetch_batch.lua
local items = redis.call('LRANGE', KEYS[1], 0, ARGV[1] - 1)
if #items > 0 then
    redis.call('LTRIM', KEYS[1], #items, -1)
end
return items

In worker.py, I try to grab a batch. If the queue is empty, I block. If the queue has items, I use the Lua script to drain up to 8 items at once. This keeps the CPU fed constantly without hammering Redis.
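
The worker itself is Python, but the drain pattern is language-agnostic. Here is a rough sketch of the same logic in Go with go-redis, purely to illustrate it in this post's main language; the names fetchBatch and nextBatch are mine, not lifted from worker.py:

// Sketch of the worker's drain logic, written in Go for illustration.
// Assumes: import "context" and "github.com/redis/go-redis/v9".
var fetchBatch = redis.NewScript(`
local items = redis.call('LRANGE', KEYS[1], 0, ARGV[1] - 1)
if #items > 0 then
    redis.call('LTRIM', KEYS[1], #items, -1)
end
return items
`)

func nextBatch(ctx context.Context, rdb *redis.Client, queue string, size int) ([]string, error) {
    // Try to drain up to `size` items atomically in one shot.
    batch, err := fetchBatch.Run(ctx, rdb, []string{queue}, size).StringSlice()
    if err != nil {
        return nil, err
    }
    if len(batch) > 0 {
        return batch, nil
    }

    // Queue was empty: block until a single item arrives (timeout 0 = wait forever).
    first, err := rdb.BLPop(ctx, 0, queue).Result()
    if err != nil {
        return nil, err
    }
    return []string{first[1]}, nil // BLPop returns [key, value]
}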

The Model: multilingual-e5-small

I chose intfloat/multilingual-e5-small.

  • Embeddings: 384 dimensions.
  • Disk footprint: ~500MB.
  • Speed: fast enough to run on a standard CPU.

The "multilingual" part is where the magic happens. The model maps text to a vector space based on meaning, not language.

I tested this with a phone call recorded in English. The customer was asking for a refund. I went to my search bar and typed in Spanish: "Quiero mi dinero" (I want my money).

Qdrant (my vector database) returned the English call as the #1 result with a score of 0.88. I wrote zero translation code. The math just works.

Doing Data Science in Go

I wanted a dashboard visualization showing "Topic Clusters"—bubbles representing what people are talking about (Shipping, Billing, Support).

Usually, you'd send the data back to Python to run scikit-learn. But I already had the vectors in Go memory from Qdrant. Why serialize them again?

I decided to write K-Means clustering in Go. It turns out K-Means is just a few for loops.

// pkg/clustering/kmeans.go
func KMeans(vectors [][]float64, k int, maxIterations int) []Cluster {
    // 1. Pick random centroids
    // 2. Assign every point to closest centroid (Cosine Distance)
    // 3. Move centroid to the average of its points
    // 4. Repeat until converged
}

Is it as optimized as SciPy? No. Does it cluster 1,000 call summaries in under 5 milliseconds? Yes.
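
To make "a few for loops" concrete, the heart of step 2 is a distance function plus an argmax. Here is a minimal sketch of that assignment step using cosine similarity (illustrative names, not the actual pkg/clustering code):

// Assumes: import "math".

// cosineSimilarity returns the cosine of the angle between two vectors.
func cosineSimilarity(a, b []float64) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if normA == 0 || normB == 0 {
        return 0
    }
    return dot / (math.Sqrt(normA) * math.Sqrt(normB))
}

// assign gives each vector the index of its most similar centroid (step 2).
func assign(vectors, centroids [][]float64) []int {
    labels := make([]int, len(vectors))
    for i, v := range vectors {
        best, bestSim := 0, -1.0
        for j, c := range centroids {
            if sim := cosineSimilarity(v, c); sim > bestSim {
                best, bestSim = j, sim
            }
        }
        labels[i] = best
    }
    return labels
}

Maximizing cosine similarity is equivalent to minimizing cosine distance, which is what the pseudocode above refers to.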

I added a "safe threshold" estimator to dynamically guess how aggressive the clustering should be based on the data variance. Now my dashboard generates dynamic topic bubbles on the fly, entirely within the API binary.

The Result

This entire pipeline—embedding generation, vector storage, and clustering—runs on the same Hetzner server as my database.

  • OpenAI Cost: $0.00.
  • Network Latency: < 2ms (localhost).
  • Privacy: 100% Local.

We are often told that AI features require massive infrastructure and expensive APIs. Sometimes, all you need is a Python script, a Redis list, and the confidence to ignore the hype.


I wrote this up with slightly better syntax highlighting on my engineering blog for AudioText Live, where I'm documenting the process of building this stack.
