DEV Community

david

How I Added AI Image Search to a Marketplace Bot (And Why It Changed Everything)

My users were frustrated. Not in a "this feature doesn't work" way — in an "I literally don't know what words to use" way.

A buyer would join the marketplace bot, open the search, and just... stare. They knew exactly what they wanted. They'd seen it at a friend's place, or spotted it in a photo. But they couldn't find the right keywords. "Decorative thing"? "Round wooden thingy"? Every search returned garbage, or nothing.

That's the moment I decided to build image search. And it turned out to be one of the best decisions I made for the project.

The Problem With Text Search in Marketplaces

Text search is great when users know the vocabulary. But in a secondhand or small-seller marketplace, that's rarely the case.

Sellers list things the way they think about them. Buyers search the way they think about them. Those two vocabularies almost never match perfectly. A seller lists "mid-century credenza." A buyer searches "brown dresser with legs." Even with fuzzy matching and morphological analysis (which I'd already implemented), there's a fundamental gap.

Photos don't have this problem. A photo of a thing is the thing. No translation required.

What I Actually Built

The core idea is simple: when a buyer sends a photo of something they're looking for, the bot finds visually similar products from existing listings.

Here's the user flow:

  1. Buyer taps "Search by photo" in the bot
  2. Sends any image — a photo they took, a screenshot, something from Pinterest
  3. Bot processes it, searches the catalog, and returns the most visually similar listings
  4. Buyer taps through results, finds what they want, contacts the seller

That's it. No keywords. No categories to navigate. Just "I want something that looks like this."

The Tech Stack (Without Getting Too Deep)

I'm using SigLIP — a vision-language model from Google — to convert images into vector embeddings. Think of an embedding as a list of 1,152 numbers that mathematically describes what an image looks like. Images of similar things have similar numbers.
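"Similar things have similar numbers" is measured with cosine similarity. Here's a minimal sketch with made-up 3-dimensional vectors (real SigLIP embeddings have 1,152 dimensions, but the math is identical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the vectors, divided by the product of their lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — illustrative values, not real model output
wooden_bowl  = [0.9, 0.1, 0.3]
wooden_plate = [0.8, 0.2, 0.4]
neon_sign    = [0.1, 0.9, 0.7]

print(cosine_similarity(wooden_bowl, wooden_plate))  # ~0.98, very similar
print(cosine_similarity(wooden_bowl, neon_sign))     # ~0.36, not similar
```

A score of 1.0 means "pointing in exactly the same direction"; unrelated images land much lower. Everything downstream (search, thresholds, ranking) is built on this one number.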

When a seller uploads a product photo, the bot:

  1. Runs the image through SigLIP's vision encoder
  2. Gets back a 1152-dimensional vector
  3. Stores it in Qdrant (a vector database built for this exact use case)
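Step 3 can be illustrated without Qdrant. One useful trick: if you normalize vectors to unit length at index time, cosine similarity at query time reduces to a plain dot product. A toy in-memory version (names here are hypothetical; the real bot stores into Qdrant):

```python
import math

# Toy in-memory index: product_id -> unit-length embedding
index: dict[int, list[float]] = {}

def normalize(vec: list[float]) -> list[float]:
    # Scale to unit length so cosine similarity becomes a cheap dot product
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def index_product(product_id: int, embedding: list[float]) -> None:
    index[product_id] = normalize(embedding)

# Toy 3-d vector for illustration; real embeddings are 1152-d
index_product(42, [0.9, 0.1, 0.3])
```

Qdrant handles this (and approximate-nearest-neighbor indexing on top) for you, but the mental model is exactly this: an id-to-vector store.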

When a buyer searches by photo:

  1. Same process — their photo becomes a vector
  2. Qdrant finds the closest stored vectors using cosine similarity
  3. Returns the matching products, ranked by visual similarity

The whole thing runs in under a second on modest hardware (I'm on a $9/month VPS with 2 CPUs and ~6GB RAM, for context).

# Simplified version of what happens during search
async def search_by_image(photo_bytes: bytes) -> list[Product]:
    # Convert photo to embedding
    embedding = await image_service.encode_image(photo_bytes)

    # Find similar products in vector DB
    results = await qdrant_client.search(
        collection_name="products",
        query_vector=embedding,
        limit=10,
        score_threshold=0.75  # Only return reasonably similar results
    )

    return [await get_product(r.id) for r in results]
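Under the hood, Qdrant's job boils down to "compare the query vector against every stored vector, drop weak matches, keep the best ones" (plus clever indexing so it doesn't literally scan everything). A brute-force sketch of that core step, with toy vectors:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_search(query, catalog, limit=10, score_threshold=0.75):
    # Score every stored vector, filter by threshold, best matches first
    scored = [(pid, cosine(query, vec)) for pid, vec in catalog.items()]
    scored = [(pid, s) for pid, s in scored if s >= score_threshold]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]

# Toy catalog — illustrative vectors, not real embeddings
catalog = {
    "oak_chair":  [0.9, 0.2, 0.1],
    "pine_chair": [0.8, 0.3, 0.2],
    "neon_lamp":  [0.1, 0.9, 0.8],
}
print(brute_force_search([0.85, 0.25, 0.15], catalog))
# Both chairs clear the 0.75 threshold; the lamp doesn't
```

This linear scan is O(n) per query. Qdrant's HNSW index is what keeps the real thing fast as the catalog grows.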

The "Why Does This Actually Work?" Part

SigLIP was trained on billions of image-text pairs. It learned to understand visual concepts at a semantic level — not just pixel patterns. So when you send a photo of a red ceramic vase, it doesn't just find other red ceramic vases by color matching. It finds things that are conceptually vases, even if they're different colors, shapes, or photographed differently.

This matters a lot in marketplace context. Sellers photograph products in wildly different conditions — different lighting, backgrounds, angles. Classic image similarity algorithms would fail here. Semantic embeddings handle it gracefully.

I also use SigLIP's text encoder to power the regular text search and NSFW filtering — same model, different encoder. That's why you can search with text like "vintage lamp" and get reasonable results even if sellers didn't use those exact words.
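Because SigLIP trains its image and text encoders into the same embedding space with a pairwise sigmoid loss (hence the name), a text query and a product photo can be compared directly. A toy sketch of that scoring, with made-up embeddings and made-up scale/bias values (the real model learns its own):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def match_score(text_emb, image_emb, scale=10.0, bias=-5.0):
    # SigLIP-style score: sigmoid of a scaled, shifted dot product
    # of unit vectors. scale and bias here are illustrative only.
    dot = sum(t * i for t, i in zip(text_emb, image_emb))
    return sigmoid(scale * dot + bias)

# Toy 2-d unit vectors standing in for real 1152-d embeddings
vintage_lamp_text = [0.6, 0.8]      # "vintage lamp" as a text embedding
lamp_photo        = [0.55, 0.835]   # an image embedding pointing nearby
chair_photo       = [0.3, -0.95]    # an image embedding pointing elsewhere

print(match_score(vintage_lamp_text, lamp_photo))   # near 1: a match
print(match_score(vintage_lamp_text, chair_photo))  # near 0: not a match
```

The sigmoid squashes each text-image pair into an independent 0-to-1 score, which is also what makes the same machinery reusable for yes/no checks like NSFW filtering.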

The ONNX Optimization That Made It Viable

Running SigLIP in production on a budget VPS isn't trivial. The original PyTorch models are huge and slow.

I converted both the vision and text encoders to ONNX format, then applied INT8 dynamic quantization using onnxruntime:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="vision_model.onnx",
    model_output="vision_int8.onnx",
    weight_type=QuantType.QUInt8,
    op_types_to_quantize=["MatMul", "Gemm"]
)
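What quantization actually does: store each float weight as an 8-bit integer plus a scale and zero point, then reconstruct an approximation at inference time. A minimal per-tensor sketch (onnxruntime uses a more refined scheme than this, but the arithmetic is the same idea):

```python
def quantize_uint8(weights: list[float]) -> tuple[list[int], float, int]:
    # Map the float range [min, max] onto the 256 available uint8 levels
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0
    zero_point = round(-lo / scale)
    q = [round(w / scale) + zero_point for w in weights]
    return q, scale, zero_point

def dequantize(q: list[int], scale: float, zero_point: int) -> list[float]:
    # Reconstruct approximate floats from the stored integers
    return [(v - zero_point) * scale for v in q]

# Toy weight tensor — illustrative values
weights = [-0.51, 0.0, 0.013, 0.27, 0.49]
q, scale, zp = quantize_uint8(weights)
restored = dequantize(q, scale, zp)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(max_err)  # bounded by half a quantization step (scale / 2)
```

Each weight shrinks from 4 bytes to 1, which is where the 3-4x size reductions come from; the rounding error is what the FP32-vs-INT8 benchmark below is measuring.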

Results:

  • Vision model: 355MB → 97MB (3.7x smaller)
  • Text model: 421MB → 176MB (2.4x smaller)
  • Combined RAM savings: ~500MB
  • Speed: 2x faster inference

I benchmarked the quantized models against FP32 on 30 real product photos. NSFW detection agreement was 100%. Category classification agreement was 93% (2 edge cases where "cookware" was classified as "furniture" — acceptable). The quality loss is negligible; the resource savings are massive.

What Changed After Launching This

The honest answer: engagement patterns shifted noticeably.

Before image search, most users who couldn't find what they wanted just left. The bounce rate on the search flow was high. After launch, those same users had a fallback that actually worked.

More interestingly, image search surfaced unexpected results that users liked. Someone uploads a photo of a specific chair style and discovers a similar piece they'd never have thought to search for. That serendipity is hard to manufacture with keyword search.

Sellers noticed too. Products that previously had no text description beyond "nice chair" started getting found because the visual index doesn't care about descriptions.

The Stuff That Was Harder Than Expected

Thumbnail generation. When displaying search results, I needed consistent thumbnails. Sellers upload wildly different aspect ratios. Getting crops that looked good without cutting off the actual product required more iteration than I expected.
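The core of thumbnail cropping is computing a centered square box from an arbitrary aspect ratio; my real pipeline does more than this, but the box arithmetic is the starting point. A sketch (the returned tuple matches the (left, top, right, bottom) convention Pillow's crop uses):

```python
def center_square_crop(width: int, height: int) -> tuple[int, int, int, int]:
    # Largest centered square that fits inside a width x height image
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)

print(center_square_crop(1200, 800))  # (200, 0, 1000, 800) — landscape
print(center_square_crop(800, 1200))  # (0, 200, 800, 1000) — portrait
```

The hard part, as noted above, isn't this math — it's that a centered crop can amputate an off-center product, which is what took the iteration.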

Score thresholds. Setting the right similarity threshold is more art than science. Too strict, and you return zero results even when good matches exist. Too loose, and you show obviously irrelevant products. I ended up at 0.75 cosine similarity after testing with real data, but it took a while to get there.
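The trade-off is easy to see with numbers. Given one query's similarity scores against the catalog (toy values below), each threshold choice keeps a very different result set:

```python
# Toy similarity scores for a single query against the catalog
scores = [0.92, 0.81, 0.77, 0.74, 0.62, 0.41]

for threshold in (0.60, 0.75, 0.90):
    kept = [s for s in scores if s >= threshold]
    print(threshold, len(kept))
# 0.6 keeps 5 results, 0.75 keeps 3, 0.9 keeps only 1
```

Note the 0.74 score sitting just under my 0.75 cutoff: borderline matches like that are exactly why tuning the threshold against real data took a while.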

Cold start. Image search is useless with an empty product catalog. The feature became genuinely useful only after a few hundred indexed products. Building the feature before having that scale required some faith.

NSFW filtering. You can't let users upload anything and have it shown to others unfiltered. I repurpose the same SigLIP text encoder to score images against safe/unsafe text prompts before indexing. It's not perfect, but it catches obvious cases without a separate moderation model.
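That repurposing works as a zero-shot comparison: embed a handful of safe and unsafe text prompts once, then compare each incoming image embedding against both sets. A toy sketch with made-up 2-d embeddings (the prompt sets, margin, and decision rule are all illustrative, not my exact production values):

```python
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Toy prompt embeddings; real ones come from SigLIP's text encoder
safe_prompts   = [[0.9, 0.1], [0.8, 0.3]]
unsafe_prompts = [[0.1, 0.95]]

def is_safe(image_emb: list[float], margin: float = 0.05) -> bool:
    best_safe = max(cosine(image_emb, p) for p in safe_prompts)
    best_unsafe = max(cosine(image_emb, p) for p in unsafe_prompts)
    # Reject only when the unsafe prompts clearly win
    return best_unsafe < best_safe + margin

print(is_safe([0.85, 0.2]))  # True: closer to the safe prompts
print(is_safe([0.05, 0.9]))  # False: closer to the unsafe prompts
```

The appeal is that it costs nothing extra: the image embedding already exists for search, and the prompt embeddings are computed once at startup.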

The Stack, For the Curious

  • Bot framework: aiogram 3 (Python async Telegram bot framework)
  • Database: PostgreSQL + SQLAlchemy 2
  • Vector search: Qdrant
  • ML model: SigLIP Base (Google), converted to ONNX + INT8 quantized
  • Text search: Elasticsearch with morphological analysis
  • Caching: Redis
  • Infra: Single VPS, Docker Compose

The entire thing — bot, databases, ML inference, search — runs on that $9/month server. Cost efficiency matters when you're building something that hasn't proven revenue yet.

Try It Now

If you want to see this in action, the bot is live at @k4pi_bot.

k4pi is an AI-powered marketplace bot for buying and selling on Telegram. You can list products, browse listings, and — yes — search by photo. Send it a picture of something you're looking for and watch it find visually similar items from real listings.

Open @k4pi_bot on Telegram

It's free to use. If you're building something similar or just curious how the pieces fit together, feel free to reach out. The architecture decisions here weren't obvious at the start, and I'm happy to share what I learned.


Building in public. The mistakes were educational.

Top comments (1)

Haskell Thurber

Really interesting approach with image-based search! The visual similarity matching is a much better UX than keyword-only filtering.

I'm curious about your bot's message handling throughput — do you batch the image processing or handle each request synchronously? I ran into similar architectural decisions building a Telegram Mini App for anonymous messaging and found that queue-based processing with fallback responses made the experience feel faster even when the backend was under load.

What's your latency like for the image similarity results?