I kept running into the same annoyance whenever I needed embeddings. The retrieval part of a RAG pipeline was hard enough already. Generating vectors should've been the easy part.
But every time I needed to embed something, the workflow looked like this:
- Open a notebook or write a throwaway script
- Import the SDK, set up the client
- Figure out the right model name (was it text-embedding-004 or text-embedding-3-small?)
- Write the call, handle the response format
- Copy the vector out of the output
For text that's annoying. For images or audio it's worse. Different SDKs, different input formats, different response shapes.
I kept thinking: I can curl an API in seconds. I can jq a JSON response without writing a script. Why can't I just embed something from the terminal?
The tool I wanted
Something like httpie but for embeddings. Type a command, get a vector back.
vemb text "hello world"
# {"model": "gemini-embedding-2-preview", "dimensions": 3072, "values": [0.0123, -0.0456, ...]}
vemb text "hello world" --compact
# [0.0123, -0.0456, 0.0789, ...]
Embed an image:
vemb image photo.jpg
Embed a PDF:
vemb pdf report.pdf
Compare two files:
vemb similar photo1.jpg photo2.jpg
# 0.8734
No notebooks, no scripts, no boilerplate. Just the vector.
Why Gemini Embedding 2
I looked at OpenAI's embedding models first. Their embeddings endpoint is text-only. If you want to embed images, you're stitching together separate models and separate vector spaces. No clean way to compare text against images with a single embedding call.
Google released Gemini Embedding 2 (public preview, March 2026). One model that handles text, images, audio, video, and PDFs natively. Same vector space for everything. You can embed a photo and a text description and compare them directly with cosine similarity.
That's what made the CLI possible. One model, one API, all input types.
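Comparing a photo's embedding against a text description's boils down to cosine similarity between two vectors in that shared space. A minimal pure-Python sketch (the vectors below are toy stand-ins, not real API output):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of an image and a caption.
image_vec = [0.1, 0.3, 0.5]
text_vec = [0.2, 0.25, 0.55]
print(round(cosine_similarity(image_vec, text_vec), 4))
```

Because both inputs land in the same vector space, this one function is all `vemb similar` needs regardless of modality.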
Building it
The whole thing is ~400 lines of Python. Two files: embed.py (core logic) and cli.py (Click commands).
The interesting parts:
Auto-detection: vemb embed guesses the file type from the extension. JPEGs, PNGs, MP3s, WAVs, MP4s, PDFs all work with the same command.
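Extension-based detection needs nothing more than a suffix lookup. A sketch of how that mapping might work (the map and function name are my illustration, not vemb's actual code):

```python
from pathlib import Path

# Hypothetical suffix-to-modality map; the real tool may cover more types.
MODALITIES = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".mp3": "audio", ".wav": "audio",
    ".mp4": "video",
    ".pdf": "pdf",
}

def detect_modality(path: str) -> str:
    """Guess the input modality from the file extension; default to text."""
    return MODALITIES.get(Path(path).suffix.lower(), "text")

print(detect_modality("photo.JPG"))  # image
print(detect_modality("notes.txt"))  # text
```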
Batch mode: vemb embed *.jpg --jsonl embeds every file and outputs one JSON object per line.
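JSONL is just one JSON object per line, which makes batch output trivially pipeable into jq or a vector-store loader. A hedged sketch of that output step (`embed_file` here is a fake stand-in for the real API call):

```python
import json

def embed_file(path: str) -> list[float]:
    # Stand-in for the real embedding call; returns a fake vector.
    return [0.0, 0.1, 0.2]

def to_jsonl(paths: list[str]) -> str:
    """Emit one JSON object per input file, one per line."""
    lines = []
    for p in paths:
        record = {"file": p, "values": embed_file(p)}
        lines.append(json.dumps(record))
    return "\n".join(lines)

print(to_jsonl(["a.jpg", "b.jpg"]))
```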
Directory search: vemb search ./photos/ "dark moody sunset" embeds the query, embeds every file in the directory (with caching), and ranks by cosine similarity.
# search a folder of images by text description
vemb search ./photos/ "dark moody sunset" --top 5
0.8234 ./photos/sunset-beach.png
0.7891 ./photos/evening-skyline.png
0.7654 ./photos/golden-hour.png
0.6123 ./photos/cloudy-morning.png
0.5987 ./photos/overcast-street.png
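The ranking step is the same cosine-similarity math applied across a directory. A minimal sketch of the score-and-sort logic, with fake vectors standing in for cached file embeddings (function names are mine, not vemb's):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def rank(query_vec: list[float], file_vecs: dict[str, list[float]], top: int = 5):
    """Score every file against the query and return the top matches."""
    scored = [(cosine(query_vec, v), name) for name, v in file_vecs.items()]
    return sorted(scored, reverse=True)[:top]

# Fake vectors standing in for embedded files in a directory.
files = {
    "sunset-beach.png": [0.9, 0.1],
    "cloudy-morning.png": [0.2, 0.8],
}
for score, name in rank([1.0, 0.0], files):
    print(f"{score:.4f} {name}")
```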
Example use cases
Retrieval experiments: embed a few chunks, check similarity scores, tune the chunking. No notebook needed.
Image search: I keep a folder of reference mockups. vemb search ./mockups/ "login screen" finds the right ones instantly.
Checking if two files are semantically close: vemb similar draft-v1.pdf draft-v2.pdf tells me how much the content actually changed between versions.
Cross-modal search: embed a text query against a folder of images and get ranked results. One model, one vector space means text and images are directly comparable.
Try it
pipx install vemb
export GEMINI_API_KEY=your_key # free at aistudio.google.com/apikey
vemb text "hello world"
Source and docs: github.com/yuvrajangadsingh/vemb
The API key is free and the model has a free tier. If you're building anything with embeddings and you're tired of opening notebooks for a one-line operation, try it.